
Staging Models

Auto-generated dbt model templates for raw data.


Overview

Staging models are the first transformation layer in Dango. They provide a starting template that references your raw data, which you can then customize with cleaning logic.

What's auto-generated:

  • A SELECT * template referencing the raw source table
  • Source configuration YAML files
  • Basic schema tests (unique, not_null for ID columns)

What you add (as needed):

  • Column selection
  • Deduplication logic
  • Type casting
  • Column renaming

Key Points:

  • Auto-generated during dango sync (or on demand with dango generate)
  • Materialized as tables (required for Metabase compatibility)
  • One staging model per raw table
  • Follow dbt naming conventions (stg_<source>_<table>)

Template Design

Staging templates are intentionally minimal. This gives you full control over cleaning logic while still providing a working starting point.


Quick Start

Generate Staging Models

Staging templates are auto-generated when you sync data:

# Load raw data AND generate staging templates
dango sync --source stripe_payments

# Run transformations
dango run

Automatic generation

You don't need to run dango generate separately - dango sync handles it. Only use dango generate if you want to regenerate templates without syncing new data.

This creates:

dbt/models/staging/
├── stg_stripe_charges.sql
├── stg_stripe_customers.sql
├── stg_stripe_subscriptions.sql
├── _stg_stripe__sources.yml
└── _stg_stripe__schema.yml

Generated File Structure

SQL Models

Each raw table gets a corresponding staging template:

-- dbt/models/staging/stg_stripe_charges.sql
{{ config(
    materialized='table',
    schema='staging'
) }}

SELECT * FROM {{ source('stripe', 'charges') }}

This minimal template is ready to run immediately. You can customize it by adding column selection, type casting, or deduplication logic as needed (see Customizing Staging Models below).

Sources Configuration

_stg_<source>__sources.yml defines raw table references:

version: 2

sources:
  - name: stripe
    description: Stripe payment data loaded via dlt
    schema: raw_stripe
    loaded_at_field: _dlt_extracted_at

    tables:
      - name: charges
        description: Payment charges from Stripe API

      - name: customers
        description: Customer records from Stripe API

      - name: subscriptions
        description: Subscription records from Stripe API

Schema Documentation

_stg_<source>__schema.yml documents staging models with auto-generated tests:

version: 2

models:
  - name: stg_stripe_charges
    description: Staging model for stripe charges
    columns:
      - name: id
        description: Primary key
        tests:
          - unique
          - not_null

Auto-generated tests

Dango automatically adds unique and not_null tests for columns that appear to be primary keys (columns named id, uuid, or ending in _id/_key). Add additional tests manually as needed.


How Generation Works

1. Scan Raw Layer

During dango sync (or dango generate), Dango inspects DuckDB for raw tables:

-- Finds all tables in raw schemas
SELECT
    table_schema,
    table_name
FROM information_schema.tables
WHERE table_schema LIKE 'raw%'

2. Analyze Schema

For each table, Dango examines:

  • Column names and types
  • Primary key patterns (columns named id, uuid, or ending in _id/_key)
  • dlt metadata columns (_dlt_load_id, _dlt_extracted_at)

3. Generate Files

Creates:

  • SQL template: SELECT * from source table
  • Sources YAML: Reference to raw schema/table
  • Schema YAML: Basic tests for primary key columns
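Conceptually, producing the SQL file is just rendering a fixed template with the source and table names. A minimal sketch (the helper name and template string are illustrative, not Dango internals):

```python
# Illustrative sketch: render the minimal staging template for one raw table.
STAGING_TEMPLATE = """{{{{ config(
    materialized='table',
    schema='staging'
) }}}}

SELECT * FROM {{{{ source('{source}', '{table}') }}}}
"""

def render_staging_model(source: str, table: str) -> str:
    """Fill the dbt/Jinja template with a source and table name."""
    return STAGING_TEMPLATE.format(source=source, table=table)
```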

Materialization Strategy

Tables (Default)

Staging models use materialized='table' for Metabase compatibility:

{{ config(materialized='table', schema='staging') }}

Why tables instead of views?

Metabase requires tables for reliable schema discovery. While views work in some cases, tables provide consistent behavior across all Metabase features.

For very large tables, consider incremental materialization:

{{ config(
    materialized='incremental',
    unique_key='id',
    schema='staging'
) }}

SELECT * FROM {{ source('stripe', 'charges') }}

{% if is_incremental() %}
WHERE _dlt_extracted_at > (SELECT MAX(_dlt_extracted_at) FROM {{ this }})
{% endif %}
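The is_incremental() filter above is a high-watermark pattern: each run processes only rows extracted after the current maximum already in the table. In Python terms (illustrative only):

```python
# Illustrative high-watermark filter, equivalent to the WHERE clause above.
def filter_new_rows(rows, current_max_extracted_at):
    """Keep only rows whose _dlt_extracted_at is past the stored watermark."""
    return [r for r in rows if r["_dlt_extracted_at"] > current_max_extracted_at]
```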

Customizing Staging Models

The auto-generated template is a starting point. Common customizations include:

Adding Deduplication

If your source data may contain duplicates (e.g., from multiple API pulls), add deduplication logic:

-- dbt/models/staging/stg_stripe_customers.sql
{{ config(materialized='table', schema='staging') }}

WITH deduplicated AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY id
            ORDER BY _dlt_extracted_at DESC
        ) as _rn
    FROM {{ source('stripe', 'customers') }}
)

SELECT * FROM deduplicated
WHERE _rn = 1

This keeps only the most recently extracted version of each record.
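For intuition, here is the keep-latest semantics of the ROW_NUMBER() pattern sketched in Python (illustrative only, not how dbt executes it):

```python
# Illustrative equivalent of ROW_NUMBER() ... WHERE _rn = 1:
# keep the most recently extracted row for each id.
def dedupe_latest(rows):
    latest = {}
    for row in rows:
        rid = row["id"]
        if rid not in latest or row["_dlt_extracted_at"] > latest[rid]["_dlt_extracted_at"]:
            latest[rid] = row
    return list(latest.values())
```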

Selecting Specific Columns

Instead of SELECT *, explicitly list columns you need:

{{ config(materialized='table', schema='staging') }}

SELECT
    id,
    email,
    name,
    created,
    _dlt_load_id,
    _dlt_extracted_at
FROM {{ source('stripe', 'customers') }}

Type Casting

Cast columns to appropriate types:

{{ config(materialized='table', schema='staging') }}

SELECT
    id,
    CAST(amount AS DECIMAL(10,2)) / 100.0 as amount_dollars,
    CAST(created AS TIMESTAMP) as created_at,
    status
FROM {{ source('stripe', 'charges') }}
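Stripe reports amounts in the currency's smallest unit (cents for USD), hence the divide-by-100 above. The same conversion in Python, using Decimal to avoid float rounding (illustrative helper, not part of Dango):

```python
from decimal import Decimal

def cents_to_dollars(amount_cents: int) -> Decimal:
    """Convert an integer cent amount to a two-decimal dollar amount."""
    return (Decimal(amount_cents) / 100).quantize(Decimal("0.01"))
```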

No Deduplication Needed

For append-only event logs, the template works as-is:

-- Events are unique, no deduplication needed
{{ config(materialized='table', schema='staging') }}

SELECT * FROM {{ source('analytics', 'events') }}

Column Naming Conventions

When customizing staging models, follow dbt best practices:

SELECT
    id,
    customer as customer_id,           -- Add _id suffix for foreign keys
    amount_cents / 100.0 as amount,    -- Convert to useful units
    created as created_at,             -- Standardize timestamp names
    is_deleted                         -- Booleans keep their is_ prefix as-is
FROM {{ source('stripe', 'charges') }}

Reserved Names

Quote SQL keywords and rename them:

-- Bad: reserved keyword kept as an ambiguous name
"order",

-- Good: quoted and renamed
"order" as order_id,

Standardized Timestamps

SELECT
    created as created_at,
    updated as updated_at,
    deleted as deleted_at
FROM {{ source('crm', 'contacts') }}

Working with Generated Models

Customizing vs. Creating Intermediate Models

You have two options for adding logic on top of raw data:

Option 1: Customize staging directly (recommended for data cleaning)

Edit the staging model to add column selection, type casting, or deduplication:

-- dbt/models/staging/stg_stripe_charges.sql
{{ config(materialized='table', schema='staging') }}

WITH deduplicated AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY _dlt_extracted_at DESC) as _rn
    FROM {{ source('stripe', 'charges') }}
)

SELECT
    id,
    customer as customer_id,
    amount / 100.0 as amount_usd,
    status,
    created as created_at
FROM deduplicated
WHERE _rn = 1

Option 2: Create intermediate models (recommended for business logic)

Keep staging simple, add business logic in intermediate:

-- dbt/models/intermediate/int_stripe_charges_enriched.sql
{{ config(materialized='table', schema='intermediate') }}

SELECT
    c.*,
    CASE
        WHEN c.amount >= 100000 THEN 'enterprise'
        WHEN c.amount >= 10000 THEN 'professional'
        ELSE 'starter'
    END as customer_tier,
    DATE_TRUNC('month', c.created_at) as month
FROM {{ ref('stg_stripe_charges') }} c
WHERE c.status = 'succeeded'

Regeneration overwrites staging

Running dango generate or dango sync will overwrite staging models. If you've customized staging, either:

  • Back up your changes before regenerating
  • Or use intermediate models for complex logic

Regeneration Workflow

When Staging is Regenerated

Staging templates are regenerated:

  • Automatically: During dango sync (every sync regenerates staging)
  • Manually: When you run dango generate

Handling Customized Staging Models

If you've customized staging models:

# 1. Check for uncommitted changes
git status

# 2. Backup if needed
cp dbt/models/staging/stg_stripe_charges.sql dbt/models/staging/stg_stripe_charges.sql.backup

# 3. Sync (will regenerate staging)
dango sync --source stripe_payments

# 4. Restore customizations
# Either manually restore from backup or use git checkout for individual files

# 5. Run transformations
dango run

# 6. Test
dbt test --profiles-dir dbt --project-dir dbt --select staging.*

Avoid regeneration conflicts

If you have significant staging customizations, consider putting business logic in intermediate models instead. Intermediate models are NOT overwritten by regeneration.


Testing Staging Models

Auto-generated Tests

Schema YAML includes basic tests for ID columns:

models:
  - name: stg_stripe_charges
    columns:
      - name: id
        tests:
          - unique        # No duplicate IDs
          - not_null      # ID always present

Run tests:

# Test all staging models
dbt test --profiles-dir dbt --project-dir dbt --select staging.*

# Test specific model
dbt test --profiles-dir dbt --project-dir dbt --select stg_stripe_charges

Add Custom Tests

Extend _stg_<source>__schema.yml:

models:
  - name: stg_stripe_charges
    columns:
      - name: id
        tests:
          - unique
          - not_null

      - name: amount
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"  # Amount never negative

      - name: currency
        tests:
          - accepted_values:
              values: ['usd', 'eur', 'gbp']

Custom tests may be overwritten

Tests in _stg_<source>__schema.yml are regenerated during sync. For tests you want to preserve, create a separate schema.yml file.
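One option that regeneration does not touch is a singular dbt test: a SQL file in the project's tests/ directory that fails if it returns any rows. The path and test below are illustrative:

```sql
-- dbt/tests/assert_stg_stripe_charges_currency_valid.sql (illustrative path)
-- Singular dbt test: fails if any charge has an unexpected currency.
SELECT *
FROM {{ ref('stg_stripe_charges') }}
WHERE currency NOT IN ('usd', 'eur', 'gbp')
```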


Advanced Patterns

Multi-Source Staging

When the same entity comes from multiple sources, create a combined model in the intermediate layer:

-- dbt/models/intermediate/int_all_customers.sql
{{ config(materialized='table', schema='intermediate') }}

SELECT
    'stripe' as source,
    id as customer_id,
    email,
    created_at
FROM {{ ref('stg_stripe_customers') }}

UNION ALL

SELECT
    'hubspot' as source,
    id as customer_id,
    email,
    created_at
FROM {{ ref('stg_hubspot_contacts') }}

Incremental Staging

For very large tables:

{{ config(
    materialized='incremental',
    unique_key='id',
    schema='staging',
    on_schema_change='append_new_columns'
) }}

SELECT
    id,
    event_type,
    timestamp,
    _dlt_extracted_at
FROM {{ source('analytics', 'events') }}

{% if is_incremental() %}
WHERE _dlt_extracted_at > (SELECT MAX(_dlt_extracted_at) FROM {{ this }})
{% endif %}

Troubleshooting

Generation Fails

Problem: dango generate errors or skips tables

Solutions:

  1. Check raw data exists:

    duckdb data/warehouse.duckdb "SHOW TABLES;"
    

  2. Verify source configuration:

    dango source list
    

  3. Check directory permissions:

    ls -la dbt/models/staging/
    

Duplicate Records in Staging

Problem: Same record appears multiple times

Cause: Raw data contains duplicates (common with API sources that return the same record across multiple syncs)

Solution: Add deduplication logic to your staging model:

{{ config(materialized='table', schema='staging') }}

WITH deduplicated AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY id
            ORDER BY _dlt_extracted_at DESC
        ) as _rn
    FROM {{ source('stripe', 'charges') }}
)

SELECT * FROM deduplicated WHERE _rn = 1

Performance Issues

Problem: Staging models slow to run

Solutions:

  1. Use incremental for large tables:

    {{ config(materialized='incremental', unique_key='id', schema='staging') }}
    

  2. Limit historical data at source:

    sources:
      - name: stripe_payments
        stripe:
          start_date: 2024-01-01  # Limit historical data
    

  3. Select only needed columns:

    SELECT id, customer, amount, status, created
    FROM {{ source('stripe', 'charges') }}
    -- Instead of SELECT *
    


Best Practices

1. Keep Staging Simple

Staging should focus only on data cleaning:

  • Select columns (optional)
  • Rename columns (optional)
  • Cast types (optional)
  • Deduplicate (if needed)

Don't add business logic in staging:

-- Bad - business logic in staging
SELECT
    id,
    CASE WHEN amount > 1000 THEN 'high_value' ELSE 'standard' END as tier
FROM {{ source('stripe', 'charges') }}

-- Good - staging stays simple
SELECT
    id,
    amount
FROM {{ source('stripe', 'charges') }}

-- Put business logic in intermediate or marts

2. Preserve dlt Metadata

Always keep dlt columns:

SELECT
    -- Business columns
    id,
    email,
    created,

    -- dlt metadata (important for debugging)
    _dlt_load_id,
    _dlt_extracted_at
FROM {{ source('stripe', 'customers') }}

3. Document Column Mappings

When renaming:

models:
  - name: stg_stripe_charges
    columns:
      - name: customer_id
        description: "Stripe customer ID (renamed from 'customer')"

4. Test Primary Keys

Always test uniqueness:

columns:
  - name: id
    tests:
      - unique
      - not_null

5. Version Control Schema Files

Commit generated YAML to track schema changes:

git add dbt/models/staging/_stg_*__schema.yml
git commit -m "Update staging schema for Stripe"

Comparison: Staging vs. Intermediate vs. Marts

| Aspect          | Staging             | Intermediate         | Marts                |
|-----------------|---------------------|----------------------|----------------------|
| Purpose         | Clean raw data      | Reusable logic       | Final metrics        |
| Generated       | Yes (auto template) | No (dango model add) | No (dango model add) |
| Materialization | Table               | Table                | Table                |
| Complexity      | Simple              | Medium               | Complex              |
| Business Logic  | None                | Some                 | Full                 |
| Dependencies    | Raw tables          | Staging              | Intermediate         |

Reference: Generated SQL Template

Dango generates this template for staging models:

{{ config(
    materialized='table',
    schema='staging'
) }}

SELECT * FROM {{ source('<source_name>', '<table_name>') }}

This minimal template is intentionally simple. You customize it by adding:

  • Column selection
  • Deduplication logic
  • Type casting
  • Column renaming

See Customizing Staging Models for examples.


Next Steps