
Staging Models

Auto-generated dbt model templates for raw data.


Overview

Staging models are the first transformation layer in Dango. They provide a starting template that references your raw data, which you can then customize with cleaning logic.

What's auto-generated:

  • A SELECT * template referencing the raw source table
  • Source configuration YAML files
  • Basic schema tests (unique, not_null for ID columns)

What you add (as needed):

  • Column selection
  • Deduplication logic
  • Type casting
  • Column renaming

Key Points:

  • Auto-generated during dango sync (or on demand with dango generate)
  • Materialized as tables (required for Metabase compatibility)
  • One staging model per raw table
  • Follow dbt naming conventions (stg_<source>_<table>)

Template Design

Staging templates are intentionally minimal. This gives you full control over cleaning logic while still providing a working starting point.


Quick Start

Generate Staging Models

Staging templates are auto-generated when you sync data:

# Load raw data AND generate staging templates
dango sync --source stripe_payments

# Run transformations
dango run

Automatic generation

You don't need to run dango generate separately - dango sync handles it. Only use dango generate if you want to regenerate templates without syncing new data.

This creates:

dbt/models/staging/
├── stg_stripe_charges.sql
├── stg_stripe_customers.sql
├── stg_stripe_subscriptions.sql
├── _stg_stripe__sources.yml
└── _stg_stripe__schema.yml

Generated File Structure

SQL Models

Each raw table gets a corresponding staging template:

-- dbt/models/staging/stg_stripe_charges.sql
{{ config(
    materialized='table',
    schema='staging'
) }}

SELECT * FROM {{ source('stripe', 'charges') }}

This minimal template is ready to run immediately. You can customize it by adding column selection, type casting, or deduplication logic as needed (see Customizing Staging Models below).

Sources Configuration

_stg_<source>__sources.yml defines raw table references:

version: 2

sources:
  - name: stripe
    description: Stripe payment data loaded via dlt
    schema: raw_stripe
    loaded_at_field: _dlt_extracted_at

    tables:
      - name: charges
        description: Payment charges from Stripe API

      - name: customers
        description: Customer records from Stripe API

      - name: subscriptions
        description: Subscription records from Stripe API

Schema Documentation

_stg_<source>__schema.yml documents staging models with auto-generated tests:

version: 2

models:
  - name: stg_stripe_charges
    description: Staging model for stripe charges
    columns:
      - name: id
        description: Primary key
        tests:
          - unique
          - not_null

Auto-generated tests

Dango automatically adds unique and not_null tests for columns that appear to be primary keys (columns named id, uuid, or ending in _id/_key). Add additional tests manually as needed.


How Generation Works

1. Scan Raw Layer

During dango sync (or dango generate), Dango inspects DuckDB for raw tables:

-- Finds all tables in raw schemas
SELECT
    table_schema,
    table_name
FROM information_schema.tables
WHERE table_schema LIKE 'raw%'

2. Analyze Schema

For each table, Dango examines:

  • Column names and types
  • Primary key patterns (columns named id, uuid, or ending in _id/_key)
  • dlt metadata columns (_dlt_load_id, _dlt_extracted_at)

3. Generate Files

Creates:

  • SQL template: SELECT * from source table
  • Sources YAML: Reference to raw schema/table
  • Schema YAML: Basic tests for primary key columns
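Conceptually, producing the SQL file is just rendering a fixed template with the source and table names. A minimal sketch (the helper name and template string are illustrative, not Dango internals):

```python
# Illustrative sketch: render the minimal staging template for one raw table.
STAGING_TEMPLATE = """{{{{ config(
    materialized='table',
    schema='staging'
) }}}}

SELECT * FROM {{{{ source('{source}', '{table}') }}}}
"""

def render_staging_model(source: str, table: str) -> str:
    """Fill the dbt/Jinja template with a source and table name."""
    return STAGING_TEMPLATE.format(source=source, table=table)
```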

Materialization Strategy

Tables (Default)

Staging models use materialized='table' for Metabase compatibility:

{{ config(materialized='table', schema='staging') }}

Why tables instead of views?

Metabase requires tables for reliable schema discovery. While views work in some cases, tables provide consistent behavior across all Metabase features.

For very large tables, consider incremental materialization:

{{ config(
    materialized='incremental',
    unique_key='id',
    schema='staging'
) }}

SELECT * FROM {{ source('stripe', 'charges') }}

{% if is_incremental() %}
WHERE _dlt_extracted_at > (SELECT MAX(_dlt_extracted_at) FROM {{ this }})
{% endif %}
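The is_incremental() filter above is a high-watermark pattern: each run processes only rows extracted after the current maximum already in the table. In Python terms (illustrative only):

```python
# Illustrative high-watermark filter, equivalent to the WHERE clause above.
def filter_new_rows(rows, current_max_extracted_at):
    """Keep only rows whose _dlt_extracted_at is past the stored watermark."""
    return [r for r in rows if r["_dlt_extracted_at"] > current_max_extracted_at]
```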

Customizing Staging Models

The auto-generated template is a starting point. Common customizations include:

Adding Deduplication

If your source data may contain duplicates (e.g., from multiple API pulls), add deduplication logic:

-- dbt/models/staging/stg_stripe_customers.sql
{{ config(materialized='table', schema='staging') }}

WITH deduplicated AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY id
            ORDER BY _dlt_extracted_at DESC
        ) as _rn
    FROM {{ source('stripe', 'customers') }}
)

SELECT * FROM deduplicated
WHERE _rn = 1

This keeps only the most recently extracted version of each record.
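For intuition, here is the keep-latest semantics of the ROW_NUMBER() pattern sketched in Python (illustrative only, not how dbt executes it):

```python
# Illustrative equivalent of ROW_NUMBER() ... WHERE _rn = 1:
# keep the most recently extracted row for each id.
def dedupe_latest(rows):
    latest = {}
    for row in rows:
        rid = row["id"]
        if rid not in latest or row["_dlt_extracted_at"] > latest[rid]["_dlt_extracted_at"]:
            latest[rid] = row
    return list(latest.values())
```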

Selecting Specific Columns

Instead of SELECT *, explicitly list columns you need:

{{ config(materialized='table', schema='staging') }}

SELECT
    id,
    email,
    name,
    created,
    _dlt_load_id,
    _dlt_extracted_at
FROM {{ source('stripe', 'customers') }}

Type Casting

Cast columns to appropriate types:

{{ config(materialized='table', schema='staging') }}

SELECT
    id,
    CAST(amount AS DECIMAL(10,2)) / 100.0 as amount_dollars,
    CAST(created AS TIMESTAMP) as created_at,
    status
FROM {{ source('stripe', 'charges') }}
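Stripe reports amounts in the currency's smallest unit (cents for USD), hence the divide-by-100 above. The same conversion in Python, using Decimal to avoid float rounding (illustrative helper, not part of Dango):

```python
from decimal import Decimal

def cents_to_dollars(amount_cents: int) -> Decimal:
    """Convert an integer cent amount to a two-decimal dollar amount."""
    return (Decimal(amount_cents) / 100).quantize(Decimal("0.01"))
```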

No Deduplication Needed

For append-only event logs, the template works as-is:

-- Events are unique, no deduplication needed
{{ config(materialized='table', schema='staging') }}

SELECT * FROM {{ source('analytics', 'events') }}

Column Naming Conventions

When customizing staging models, follow dbt best practices:

SELECT
    id,
    customer as customer_id,           -- Add _id suffix for foreign keys
    amount_cents / 100.0 as amount,    -- Convert to useful units
    created as created_at,             -- Standardize timestamp names
    is_deleted                         -- Booleans keep their is_ prefix as-is
FROM {{ source('stripe', 'charges') }}

Reserved Names

Quote SQL keywords and rename them:

-- Bad: reserved keyword kept as an ambiguous name
"order",

-- Good: quoted and renamed
"order" as order_id,

Standardized Timestamps

SELECT
    created as created_at,
    updated as updated_at,
    deleted as deleted_at
FROM {{ source('crm', 'contacts') }}

Working with Generated Models

Customizing vs. Creating Intermediate Models

You have two options for adding logic on top of raw data:

Option 1: Customize staging directly (recommended for data cleaning)

Edit the staging model to add column selection, type casting, or deduplication:

-- dbt/models/staging/stg_stripe_charges.sql
{{ config(materialized='table', schema='staging') }}

WITH deduplicated AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY _dlt_extracted_at DESC) as _rn
    FROM {{ source('stripe', 'charges') }}
)

SELECT
    id,
    customer as customer_id,
    amount / 100.0 as amount_usd,
    status,
    created as created_at
FROM deduplicated
WHERE _rn = 1

Option 2: Create intermediate models (recommended for business logic)

Keep staging simple, add business logic in intermediate:

-- dbt/models/intermediate/int_stripe_charges_enriched.sql
{{ config(materialized='table', schema='intermediate') }}

SELECT
    c.*,
    CASE
        WHEN c.amount >= 100000 THEN 'enterprise'
        WHEN c.amount >= 10000 THEN 'professional'
        ELSE 'starter'
    END as customer_tier,
    DATE_TRUNC('month', c.created_at) as month
FROM {{ ref('stg_stripe_charges') }} c
WHERE c.status = 'succeeded'

Regeneration overwrites staging

Running dango generate or dango sync will overwrite staging models. If you've customized staging, either:

  • Back up your changes before regenerating
  • Or use intermediate models for complex logic

Regeneration Workflow

When Staging is Regenerated

Staging templates are regenerated:

  • Automatically: During dango sync (every sync regenerates staging)
  • Manually: When you run dango generate

Handling Customized Staging Models

If you've customized staging models:

# 1. Check for uncommitted changes
git status

# 2. Backup if needed
cp dbt/models/staging/stg_stripe_charges.sql dbt/models/staging/stg_stripe_charges.sql.backup

# 3. Sync (will regenerate staging)
dango sync --source stripe_payments

# 4. Restore customizations
# Either manually restore from backup or use git checkout for individual files

# 5. Run transformations
dango run

# 6. Test
dbt test --profiles-dir dbt --project-dir dbt --select staging.*

Avoid regeneration conflicts

If you have significant staging customizations, consider putting business logic in intermediate models instead. Intermediate models are NOT overwritten by regeneration.


Testing Staging Models

Auto-generated Tests

Schema YAML includes basic tests for ID columns:

models:
  - name: stg_stripe_charges
    columns:
      - name: id
        tests:
          - unique        # No duplicate IDs
          - not_null      # ID always present

Run tests:

# Test all staging models
dbt test --profiles-dir dbt --project-dir dbt --select staging.*

# Test specific model
dbt test --profiles-dir dbt --project-dir dbt --select stg_stripe_charges

Add Custom Tests

Extend _stg_<source>__schema.yml:

models:
  - name: stg_stripe_charges
    columns:
      - name: id
        tests:
          - unique
          - not_null

      - name: amount
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"  # Amount never negative

      - name: currency
        tests:
          - accepted_values:
              values: ['usd', 'eur', 'gbp']

Custom tests may be overwritten

Tests in _stg_<source>__schema.yml are regenerated during sync. For tests you want to preserve, create a separate schema.yml file.
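One option that regeneration does not touch is a singular dbt test: a SQL file in the project's tests/ directory that fails if it returns any rows. The path and test below are illustrative:

```sql
-- dbt/tests/assert_stg_stripe_charges_currency_valid.sql (illustrative path)
-- Singular dbt test: fails if any charge has an unexpected currency.
SELECT *
FROM {{ ref('stg_stripe_charges') }}
WHERE currency NOT IN ('usd', 'eur', 'gbp')
```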


Advanced Patterns

Multi-Source Staging

When the same entity comes from multiple sources, create a combined model in the intermediate layer:

-- dbt/models/intermediate/int_all_customers.sql
{{ config(materialized='table', schema='intermediate') }}

SELECT
    'stripe' as source,
    id as customer_id,
    email,
    created_at
FROM {{ ref('stg_stripe_customers') }}

UNION ALL

SELECT
    'hubspot' as source,
    id as customer_id,
    email,
    created_at
FROM {{ ref('stg_hubspot_contacts') }}

Incremental Staging

For very large tables:

{{ config(
    materialized='incremental',
    unique_key='id',
    schema='staging',
    on_schema_change='append_new_columns'
) }}

SELECT
    id,
    event_type,
    timestamp,
    _dlt_extracted_at
FROM {{ source('analytics', 'events') }}

{% if is_incremental() %}
WHERE _dlt_extracted_at > (SELECT MAX(_dlt_extracted_at) FROM {{ this }})
{% endif %}

Troubleshooting

Generation Fails

Problem: dango generate errors or skips tables

Solutions:

  1. Check raw data exists:

    duckdb data/warehouse.duckdb "SHOW TABLES;"
    

  2. Verify source configuration:

    dango source list
    

  3. Check directory permissions:

    ls -la dbt/models/staging/
    

Duplicate Records in Staging

Problem: Same record appears multiple times

Cause: Raw data contains duplicates (common with API sources that return the same record across multiple syncs)

Solution: Add deduplication logic to your staging model:

{{ config(materialized='table', schema='staging') }}

WITH deduplicated AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY id
            ORDER BY _dlt_extracted_at DESC
        ) as _rn
    FROM {{ source('stripe', 'charges') }}
)

SELECT * FROM deduplicated WHERE _rn = 1

Performance Issues

Problem: Staging models slow to run

Solutions:

  1. Use incremental for large tables:

    {{ config(materialized='incremental', unique_key='id', schema='staging') }}
    

  2. Limit historical data at source:

    sources:
      - name: stripe_payments
        stripe:
          start_date: 2024-01-01  # Limit historical data
    

  3. Select only needed columns:

    SELECT id, customer, amount, status, created
    FROM {{ source('stripe', 'charges') }}
    -- Instead of SELECT *
    


Best Practices

1. Keep Staging Simple

Staging should focus only on data cleaning:

  • Select columns (optional)
  • Rename columns (optional)
  • Cast types (optional)
  • Deduplicate (if needed)

Don't add business logic in staging:

-- Bad - business logic in staging
SELECT
    id,
    CASE WHEN amount > 1000 THEN 'high_value' ELSE 'standard' END as tier
FROM {{ source('stripe', 'charges') }}

-- Good - staging stays simple
SELECT
    id,
    amount
FROM {{ source('stripe', 'charges') }}

-- Put business logic in intermediate or marts

2. Preserve dlt Metadata

Always keep dlt columns:

SELECT
    -- Business columns
    id,
    email,
    created,

    -- dlt metadata (important for debugging)
    _dlt_load_id,
    _dlt_extracted_at
FROM {{ source('stripe', 'customers') }}

3. Document Column Mappings

When renaming:

models:
  - name: stg_stripe_charges
    columns:
      - name: customer_id
        description: "Stripe customer ID (renamed from 'customer')"

4. Test Primary Keys

Always test uniqueness:

columns:
  - name: id
    tests:
      - unique
      - not_null

5. Version Control Schema Files

Commit generated YAML to track schema changes:

git add dbt/models/staging/_stg_*__schema.yml
git commit -m "Update staging schema for Stripe"

Comparison: Staging vs. Intermediate vs. Marts

| Aspect          | Staging             | Intermediate         | Marts                |
|-----------------|---------------------|----------------------|----------------------|
| Purpose         | Clean raw data      | Reusable logic       | Final metrics        |
| Generated       | Yes (auto template) | No (dango model add) | No (dango model add) |
| Materialization | Table               | Table                | Table                |
| Complexity      | Simple              | Medium               | Complex              |
| Business Logic  | None                | Some                 | Full                 |
| Dependencies    | Raw tables          | Staging              | Intermediate         |

Reference: Generated SQL Template

Dango generates this template for staging models:

{{ config(
    materialized='table',
    schema='staging'
) }}

SELECT * FROM {{ source('<source_name>', '<table_name>') }}

This minimal template is intentionally simple. You customize it by adding:

  • Column selection
  • Deduplication logic
  • Type casting
  • Column renaming

See Customizing Staging Models for examples.


Next Steps