Staging Models¶
Auto-generated dbt model templates for raw data.
Overview¶
Staging models are the first transformation layer in Dango. They provide a starting template that references your raw data, which you can then customize with cleaning logic.
What's auto-generated:
- A `SELECT *` template referencing the raw source table
- Source configuration YAML files
- Basic schema tests (`unique`, `not_null` for ID columns)
What you add (as needed):
- Column selection
- Deduplication logic
- Type casting
- Column renaming
Key Points:
- Auto-generated by `dango generate` (also runs during `dango sync`)
- Materialized as tables (required for Metabase compatibility)
- One staging model per raw table
- Follow dbt naming conventions (`stg_<source>_<table>`)
Template Design
Staging templates are intentionally minimal. This gives you full control over cleaning logic while still providing a working starting point.
Quick Start¶
Generate Staging Models¶
Staging templates are auto-generated when you sync data:
# Load raw data AND generate staging templates
dango sync --source stripe_payments
# Run transformations
dango run
Automatic generation
You don't need to run dango generate separately - dango sync handles it. Only use dango generate if you want to regenerate templates without syncing new data.
This creates:
dbt/models/staging/
├── stg_stripe_charges.sql
├── stg_stripe_customers.sql
├── stg_stripe_subscriptions.sql
├── _stg_stripe__sources.yml
└── _stg_stripe__schema.yml
Generated File Structure¶
SQL Models¶
Each raw table gets a corresponding staging template:
-- dbt/models/staging/stg_stripe_charges.sql
{{ config(
materialized='table',
schema='staging'
) }}
SELECT * FROM {{ source('stripe', 'charges') }}
This minimal template is ready to run immediately. You can customize it by adding column selection, type casting, or deduplication logic as needed (see Customizing Staging Models below).
Sources Configuration¶
_stg_<source>__sources.yml defines raw table references:
version: 2
sources:
- name: stripe
description: Stripe payment data loaded via dlt
schema: raw_stripe
loaded_at_field: _dlt_extracted_at
tables:
- name: charges
description: Payment charges from Stripe API
- name: customers
description: Customer records from Stripe API
- name: subscriptions
description: Subscription records from Stripe API
Schema Documentation¶
_stg_<source>__schema.yml documents staging models with auto-generated tests:
version: 2
models:
- name: stg_stripe_charges
description: Staging model for stripe charges
columns:
- name: id
description: Primary key
tests:
- unique
- not_null
Auto-generated tests
Dango automatically adds unique and not_null tests for columns that appear to be primary keys (columns named id, uuid, or ending in _id/_key). Add additional tests manually as needed.
How Generation Works¶
1. Scan Raw Layer¶
During dango sync (or dango generate), Dango inspects DuckDB for raw tables:
-- Finds all tables in raw schemas
SELECT
table_schema,
table_name
FROM information_schema.tables
WHERE table_schema LIKE 'raw%'
2. Analyze Schema¶
For each table, examines:
- Column names and types
- Primary key patterns (columns named `id`, `uuid`, or ending in `_id`/`_key`)
- dlt metadata columns (`_dlt_load_id`, `_dlt_extracted_at`)
3. Generate Files¶
Creates:
- SQL template: `SELECT *` from the source table
- Sources YAML: reference to the raw schema/table
- Schema YAML: basic tests for primary key columns
Materialization Strategy¶
Tables (Default)¶
Staging models use materialized='table' for Metabase compatibility:
Why tables instead of views?
Metabase requires tables for reliable schema discovery. While views work in some cases, tables provide consistent behavior across all Metabase features.
For very large tables, consider incremental materialization:
{{ config(
materialized='incremental',
unique_key='id',
schema='staging'
) }}
SELECT * FROM {{ source('stripe', 'charges') }}
{% if is_incremental() %}
WHERE _dlt_extracted_at > (SELECT MAX(_dlt_extracted_at) FROM {{ this }})
{% endif %}
Customizing Staging Models¶
The auto-generated template is a starting point. Common customizations include:
Adding Deduplication¶
If your source data may contain duplicates (e.g., from multiple API pulls), add deduplication logic:
-- dbt/models/staging/stg_stripe_customers.sql
{{ config(materialized='table', schema='staging') }}
WITH deduplicated AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY id
ORDER BY _dlt_extracted_at DESC
) as _rn
FROM {{ source('stripe', 'customers') }}
)
SELECT * FROM deduplicated
WHERE _rn = 1
This keeps only the most recently extracted version of each record.
Selecting Specific Columns¶
Instead of SELECT *, explicitly list columns you need:
{{ config(materialized='table', schema='staging') }}
SELECT
id,
email,
name,
created,
_dlt_load_id,
_dlt_extracted_at
FROM {{ source('stripe', 'customers') }}
Type Casting¶
Cast columns to appropriate types:
{{ config(materialized='table', schema='staging') }}
SELECT
id,
CAST(amount AS DECIMAL(10,2)) / 100.0 as amount_dollars,
CAST(created AS TIMESTAMP) as created_at,
status
FROM {{ source('stripe', 'charges') }}
No Deduplication Needed¶
For append-only event logs, the template works as-is:
-- Events are unique, no deduplication needed
{{ config(materialized='table', schema='staging') }}
SELECT * FROM {{ source('analytics', 'events') }}
Column Naming Conventions¶
When customizing staging models, follow dbt best practices:
Recommended Patterns¶
SELECT
id,
customer as customer_id, -- Add _id suffix for foreign keys
amount_cents / 100.0 as amount, -- Convert to useful units
created as created_at, -- Standardize timestamp names
is_deleted -- Keep the is_ boolean prefix
FROM {{ source('stripe', 'charges') }}
Reserved Names¶
Quote SQL keywords and rename them:
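A minimal sketch, assuming a hypothetical `shop.orders` source whose columns collide with SQL keywords (`order`, `timestamp`):

```sql
{{ config(materialized='table', schema='staging') }}

SELECT
    id,
    "order" as order_number,    -- quoted keyword, renamed to a safe name
    "timestamp" as occurred_at  -- avoids the reserved word downstream
FROM {{ source('shop', 'orders') }}
```

Renaming at the staging layer means downstream models never need to quote these columns.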
Standardized Timestamps¶
SELECT
created as created_at,
updated as updated_at,
deleted as deleted_at
FROM {{ source('crm', 'contacts') }}
Working with Generated Models¶
Customizing vs. Creating Intermediate Models¶
You have two options for adding logic on top of raw data:
Option 1: Customize staging directly (recommended for data cleaning)
Edit the staging model to add column selection, type casting, or deduplication:
-- dbt/models/staging/stg_stripe_charges.sql
{{ config(materialized='table', schema='staging') }}
WITH deduplicated AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY _dlt_extracted_at DESC) as _rn
FROM {{ source('stripe', 'charges') }}
)
SELECT
id,
customer as customer_id,
amount / 100.0 as amount_usd,
status,
created as created_at
FROM deduplicated
WHERE _rn = 1
Option 2: Create intermediate models (recommended for business logic)
Keep staging simple, add business logic in intermediate:
-- dbt/models/intermediate/int_stripe_charges_enriched.sql
{{ config(materialized='table', schema='intermediate') }}
SELECT
c.*,
CASE
WHEN c.amount >= 100000 THEN 'enterprise'
WHEN c.amount >= 10000 THEN 'professional'
ELSE 'starter'
END as customer_tier,
DATE_TRUNC('month', c.created_at) as month
FROM {{ ref('stg_stripe_charges') }} c
WHERE c.status = 'succeeded'
Regeneration overwrites staging
Running dango generate or dango sync will overwrite staging models. If you've customized staging, either:
- Back up your changes before regenerating
- Or use intermediate models for complex logic
Regeneration Workflow¶
When Staging is Regenerated¶
Staging templates are regenerated:
- Automatically: during `dango sync` (every sync regenerates staging)
- Manually: when you run `dango generate`
Handling Customized Staging Models¶
If you've customized staging models:
# 1. Check for uncommitted changes
git status
# 2. Backup if needed
cp dbt/models/staging/stg_stripe_charges.sql dbt/models/staging/stg_stripe_charges.sql.backup
# 3. Sync (will regenerate staging)
dango sync --source stripe_payments
# 4. Restore customizations
# Either manually restore from backup or use git checkout for individual files
# 5. Run transformations
dango run
# 6. Test
dbt test --profiles-dir dbt --project-dir dbt --select staging.*
Avoid regeneration conflicts
If you have significant staging customizations, consider putting business logic in intermediate models instead. Intermediate models are NOT overwritten by regeneration.
Testing Staging Models¶
Auto-generated Tests¶
Schema YAML includes basic tests for ID columns:
models:
- name: stg_stripe_charges
columns:
- name: id
tests:
- unique # No duplicate IDs
- not_null # ID always present
Run tests:
# Test all staging models
dbt test --profiles-dir dbt --project-dir dbt --select staging.*
# Test specific model
dbt test --profiles-dir dbt --project-dir dbt --select stg_stripe_charges
Add Custom Tests¶
Extend _stg_<source>__schema.yml:
models:
- name: stg_stripe_charges
columns:
- name: id
tests:
- unique
- not_null
- name: amount
tests:
- not_null
- dbt_utils.expression_is_true:
expression: ">= 0" # Amount never negative
- name: currency
tests:
- accepted_values:
values: ['usd', 'eur', 'gbp']
Custom tests may be overwritten
Tests in _stg_<source>__schema.yml are regenerated during sync. For tests you want to preserve, create a separate schema.yml file.
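A sketch of such a file; the filename and contents are illustrative, the key point being that it is separate from the regenerated `_stg_<source>__schema.yml`:

```yaml
# dbt/models/staging/custom_tests.yml (example name; not touched by sync)
version: 2

models:
  - name: stg_stripe_charges
    columns:
      - name: amount
        tests:
          - not_null
```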
Advanced Patterns¶
Multi-Source Staging¶
When same entity comes from multiple sources, create a combined model in intermediate:
-- dbt/models/intermediate/int_all_customers.sql
{{ config(materialized='table', schema='intermediate') }}
SELECT
'stripe' as source,
id as customer_id,
email,
created_at
FROM {{ ref('stg_stripe_customers') }}
UNION ALL
SELECT
'hubspot' as source,
id as customer_id,
email,
created_at
FROM {{ ref('stg_hubspot_contacts') }}
Incremental Staging¶
For very large tables:
{{ config(
materialized='incremental',
unique_key='id',
schema='staging',
on_schema_change='append_new_columns'
) }}
SELECT
id,
event_type,
timestamp,
_dlt_extracted_at
FROM {{ source('analytics', 'events') }}
{% if is_incremental() %}
WHERE _dlt_extracted_at > (SELECT MAX(_dlt_extracted_at) FROM {{ this }})
{% endif %}
Troubleshooting¶
Generation Fails¶
Problem: dango generate errors or skips tables
Solutions:
- Check that raw data exists in DuckDB
- Verify the source configuration
- Check directory permissions on dbt/models/staging/
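To check that raw data exists, you can query DuckDB's catalog directly (the same query Dango runs during generation):

```sql
-- Lists every table in the raw schemas; if a source's tables are missing,
-- run dango sync for that source before generating staging models
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema LIKE 'raw%';
```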
Duplicate Records in Staging¶
Problem: Same record appears multiple times
Cause: Raw data contains duplicates (common with API sources that return the same record across multiple syncs)
Solution: Add deduplication logic to your staging model:
{{ config(materialized='table', schema='staging') }}
WITH deduplicated AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY id
ORDER BY _dlt_extracted_at DESC
) as _rn
FROM {{ source('stripe', 'charges') }}
)
SELECT * FROM deduplicated WHERE _rn = 1
Performance Issues¶
Problem: Staging models are slow to run
Solutions:
- Use incremental materialization for large tables (see Incremental Staging above)
- Limit historical data at the source
- Select only the columns you need instead of `SELECT *`
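If the source connector cannot limit history itself, one option is to filter old rows in staging (a sketch; the 90-day window and column list are assumptions):

```sql
{{ config(materialized='table', schema='staging') }}

SELECT
    id,
    event_type,
    _dlt_extracted_at
FROM {{ source('analytics', 'events') }}
-- Keep only recent history; widen the window if reports need more
WHERE _dlt_extracted_at >= CURRENT_DATE - INTERVAL 90 DAY
```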
Best Practices¶
1. Keep Staging Simple¶
Staging should only focus on data cleaning:
- Select columns (optional)
- Rename columns (optional)
- Cast types (optional)
- Deduplicate (if needed)
Don't add business logic in staging:
-- Bad - business logic in staging
SELECT
id,
CASE WHEN amount > 1000 THEN 'high_value' ELSE 'standard' END as tier
FROM {{ source('stripe', 'charges') }}
-- Good - staging stays simple
SELECT
id,
amount
FROM {{ source('stripe', 'charges') }}
-- Put business logic in intermediate or marts
2. Preserve dlt Metadata¶
Always keep dlt columns:
SELECT
-- Business columns
id,
email,
created,
-- dlt metadata (important for debugging)
_dlt_load_id,
_dlt_extracted_at
FROM {{ source('stripe', 'customers') }}
3. Document Column Mappings¶
When renaming:
models:
- name: stg_stripe_charges
columns:
- name: customer_id
description: "Stripe customer ID (renamed from 'customer')"
4. Test Primary Keys¶
Always test uniqueness and presence on primary key columns.
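In `_stg_<source>__schema.yml`, mirroring the auto-generated tests shown earlier:

```yaml
models:
  - name: stg_stripe_charges
    columns:
      - name: id
        tests:
          - unique
          - not_null
```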
5. Version Control Schema Files¶
Commit the generated YAML files so schema changes are visible in git history and code review.
Comparison: Staging vs. Intermediate vs. Marts¶
| Aspect | Staging | Intermediate | Marts |
|---|---|---|---|
| Purpose | Clean raw data | Reusable logic | Final metrics |
| Generated | Yes (auto template) | No (dango model add) | No (dango model add) |
| Materialization | Table | Table | Table |
| Complexity | Simple | Medium | Complex |
| Business Logic | None | Some | Full |
| Dependencies | Raw tables | Staging | Intermediate |
Reference: Generated SQL Template¶
Dango generates this template for staging models:
{{ config(
materialized='table',
schema='staging'
) }}
SELECT * FROM {{ source('<source_name>', '<table_name>') }}
This minimal template is intentionally simple. You customize it by adding:
- Column selection
- Deduplication logic
- Type casting
- Column renaming
See Customizing Staging Models for examples.
Next Steps¶
- Custom Models - Build marts and intermediate layers
- Testing - Comprehensive data quality testing
- dbt Basics - Learn dbt fundamentals
- Data Layers - Understand the full architecture