dbt Basics¶
Learn how dbt works within Dango for SQL-based transformations.
Overview¶
dbt (data build tool) transforms raw data in your warehouse using SQL. In Dango, dbt runs on top of DuckDB to create clean, modeled data for analytics.
What dbt does:
- Transforms raw data using SQL SELECT statements
- Manages table dependencies (lineage)
- Tests data quality
- Documents your data models
- Versions transformations with Git
What Dango automates:
- Auto-generates staging models from raw tables
- Configures dbt project for DuckDB
- Integrates with the `dango run` command
- Generates dbt documentation
Quick Start¶
1. Load Raw Data¶
Raw data is loaded into DuckDB:
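Extraction happens through Dango's sync command; a typical invocation (assuming your sources, such as the Stripe example used below, are already configured) looks like:

```shell
dango sync
```

This runs the configured dlt pipelines and writes raw tables into `data/warehouse.duckdb`.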
2. Staging Models (Auto-Generated)¶
dango sync automatically generates staging model templates.
This creates:
dbt/models/staging/
├── stg_stripe_charges.sql
├── stg_stripe_customers.sql
├── stg_stripe_subscriptions.sql
├── _stg_stripe__sources.yml
└── _stg_stripe__schema.yml
3. Run Transformations¶
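Transformations run with a single command:

```shell
dango run
```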
This executes dbt and creates the corresponding staging tables (for example, `staging.stg_stripe_charges`) in DuckDB.
4. Create Custom Models¶
Use `dango model add` to scaffold a custom model, or manually create a file in `dbt/models/marts/`:
-- dbt/models/marts/customer_lifetime_value.sql
{{ config(materialized='table') }}
WITH charges AS (
SELECT
customer_id,
SUM(amount / 100.0) as total_spent,
COUNT(*) as order_count
FROM {{ ref('stg_stripe_charges') }}
WHERE status = 'succeeded'
GROUP BY customer_id
),
customers AS (
SELECT
id,
email,
created
FROM {{ ref('stg_stripe_customers') }}
)
SELECT
c.id,
c.email,
c.created,
COALESCE(ch.total_spent, 0) as lifetime_value,
COALESCE(ch.order_count, 0) as total_orders
FROM customers c
LEFT JOIN charges ch ON c.id = ch.customer_id
Run `dango run` again to build the new model.
Now query in Metabase:
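For example, a native SQL question in Metabase (or the DuckDB CLI) can select from the new mart:

```sql
SELECT email, lifetime_value, total_orders
FROM marts.customer_lifetime_value
ORDER BY lifetime_value DESC
LIMIT 10;
```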
dbt Project Structure¶
Dango creates a standard dbt project:
dbt/
├── dbt_project.yml # Project configuration
├── profiles.yml # DuckDB connection (auto-configured)
├── models/
│ ├── staging/ # Auto-generated from raw tables
│ │ ├── stg_<source>_<table>.sql
│ │ ├── _stg_<source>__sources.yml
│ │ └── _stg_<source>__schema.yml
│ ├── intermediate/ # Reusable business logic (your code)
│ └── marts/ # Final tables for BI (your code)
├── tests/ # Custom data tests
├── macros/ # Reusable SQL snippets
└── snapshots/ # Slowly changing dimensions
Model Types¶
Staging Models (Auto-generated)¶
Purpose: Template for cleaning raw data
Location: dbt/models/staging/
Auto-generated by: dango sync
Example (auto-generated template):
-- dbt/models/staging/stg_stripe_charges.sql
{{ config(materialized='table', schema='staging') }}
SELECT * FROM {{ source('stripe', 'charges') }}
When to modify: Customize to add column selection, deduplication, or type casting as needed.
Intermediate Models (Your Code)¶
Purpose: Reusable business logic, joins, aggregations
Location: dbt/models/intermediate/
Created by: dango model add (select "intermediate")
Example:
-- dbt/models/intermediate/int_customer_orders.sql
{{ config(materialized='table', schema='intermediate') }}
SELECT
c.id as customer_id,
c.email,
o.id as order_id,
o.amount,
o.created as order_date
FROM {{ ref('stg_stripe_customers') }} c
LEFT JOIN {{ ref('stg_stripe_charges') }} o
ON c.id = o.customer_id
WHERE o.status = 'succeeded'
Marts Models (Your Code)¶
Purpose: Final tables for business intelligence
Location: dbt/models/marts/
Created by: dango model add (select "marts")
Example:
-- dbt/models/marts/customer_metrics.sql
{{ config(materialized='table') }}
SELECT
customer_id,
email,
COUNT(*) as total_orders,
SUM(amount) as total_revenue,
AVG(amount) as avg_order_value,
MIN(order_date) as first_order_date,
MAX(order_date) as last_order_date
FROM {{ ref('int_customer_orders') }}
GROUP BY customer_id, email
Materializations¶
Control how dbt models are stored in DuckDB:
table (Default in Dango)¶
- Stores results as physical table
- Fast queries, uses storage
- Required for Metabase compatibility
- All Dango models use tables by default
Why tables?
Metabase requires tables for reliable schema discovery. Views can work in some cases but may have inconsistent behavior.
incremental¶
{{ config(
materialized='incremental',
unique_key='id'
) }}
SELECT * FROM {{ source('stripe', 'charges') }}
{% if is_incremental() %}
WHERE created > (SELECT MAX(created) FROM {{ this }})
{% endif %}
- Only processes new/changed rows
- Most efficient for large datasets
- Requires `unique_key` for deduplication
ephemeral¶
- Not materialized (CTE in dependent models)
- Reduces clutter, no intermediate tables
- Best for: Small helper logic
dbt Jinja & Macros¶
dbt uses Jinja templating for dynamic SQL:
ref() - Reference Other Models¶
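A minimal example, referencing the auto-generated staging model from earlier:

```sql
SELECT *
FROM {{ ref('stg_stripe_charges') }}
WHERE status = 'succeeded'
```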
Benefits:
- Establishes dependency order
- Enables lineage tracking
- Works across environments
source() - Reference Raw Tables¶
Sources are defined in _sources.yml:
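A sketch of what `dango sync` generates (table names taken from the Stripe example above; the exact schema name depends on your dlt pipeline configuration):

```yaml
# dbt/models/staging/_stg_stripe__sources.yml
version: 2
sources:
  - name: stripe
    schema: stripe  # assumption: schema name matches the source name
    tables:
      - name: charges
      - name: customers
      - name: subscriptions
```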
config() - Model Configuration¶
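Common options, using values that appear elsewhere in this project:

```sql
{{ config(
    materialized='table',
    schema='intermediate'
) }}
```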
Custom Macros¶
Create reusable SQL in dbt/macros/:
-- dbt/macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
({{ column_name }} / 100.0)
{% endmacro %}
Use in models:
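For example, converting Stripe's integer cents to dollars:

```sql
SELECT
    customer_id,
    {{ cents_to_dollars('amount') }} as amount_usd
FROM {{ ref('stg_stripe_charges') }}
```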
Running dbt in Dango¶
dango run¶
Run all dbt models:
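```shell
dango run
```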
This executes the following steps:
1. Reads dbt_project.yml
2. Runs models in dependency order (staging → intermediate → marts)
3. Updates tables in DuckDB
4. Generates run logs
Selective Runs¶
Run specific models:
# Run one model
dbt run --select customer_metrics
# Run model and downstream dependencies
dbt run --select customer_metrics+
# Run model and upstream dependencies
dbt run --select +customer_metrics
# Run all marts
dbt run --select marts.*
Note: Use dbt directly for advanced options. Dango's dango run is equivalent to dbt run --profiles-dir dbt --project-dir dbt.
Testing Data Quality¶
Schema Tests (YAML)¶
Define tests in schema.yml:
# dbt/models/marts/schema.yml
version: 2
models:
- name: customer_metrics
description: Customer lifetime value and order metrics
columns:
- name: customer_id
description: Stripe customer ID
tests:
- unique
- not_null
- name: email
description: Customer email address
tests:
- not_null
- name: total_revenue
description: Total revenue from this customer
tests:
- not_null
Run tests:
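```shell
dbt test --profiles-dir dbt --project-dir dbt
```

The `--profiles-dir`/`--project-dir` flags mirror those shown for `dbt run` above.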
Custom Tests¶
Create SQL-based tests in dbt/tests/:
-- dbt/tests/assert_positive_revenue.sql
SELECT *
FROM {{ ref('customer_metrics') }}
WHERE total_revenue < 0
Test fails if query returns any rows.
Documentation¶
Auto-generated Docs¶
dbt documentation is available via the web UI.
The documentation includes:
- Model lineage graphs
- Column-level documentation
- SQL code for each model
- Test results
You can also generate and serve docs manually:
dbt docs generate --profiles-dir dbt --project-dir dbt
dbt docs serve --profiles-dir dbt --project-dir dbt --port 8081
Adding Descriptions¶
# dbt/models/marts/schema.yml
models:
- name: customer_metrics
description: |
Customer-level metrics calculated from Stripe charges.
Updated daily via dango sync && dango run.
columns:
- name: customer_id
description: Unique Stripe customer ID (pk)
- name: total_revenue
description: Sum of all successful charges in USD
Common Patterns¶
Deduplication¶
Add to staging models if needed:
{{ config(materialized='table', schema='staging') }}
WITH deduped AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY id
ORDER BY _dlt_extracted_at DESC
) as rn
FROM {{ source('stripe', 'customers') }}
)
SELECT * FROM deduped
WHERE rn = 1
Date Spine¶
-- dbt/models/intermediate/int_date_spine.sql
{{ config(materialized='table') }}
{{ dbt_utils.date_spine(
datepart="day",
start_date="cast('2024-01-01' as date)",
end_date="cast(current_date() as date)"
) }}
Union Multiple Sources¶
{{ config(materialized='table', schema='intermediate') }}
SELECT
'source_a' as source,
*
FROM {{ ref('stg_source_a_users') }}
UNION ALL
SELECT
'source_b' as source,
*
FROM {{ ref('stg_source_b_users') }}
Debugging¶
View Compiled SQL¶
Check dbt/target/compiled/ for rendered SQL.
Run with Debug Logging¶
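dbt's global `--debug` flag prints detailed logs, including compiled SQL and timing:

```shell
dbt --debug run --profiles-dir dbt --project-dir dbt
```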
Test Individual Model¶
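Build just one model with `--select`, using the same selector syntax shown earlier:

```shell
dbt run --select customer_metrics --profiles-dir dbt --project-dir dbt
```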
Query DuckDB Directly¶
duckdb data/warehouse.duckdb
D SELECT * FROM marts.customer_metrics LIMIT 5;
D DESCRIBE marts.customer_metrics;
Best Practices¶
1. Keep Staging Simple¶
Staging models should only:
- Select columns
- Rename columns
- Cast data types
- Deduplicate
Business logic belongs in intermediate/marts.
2. Use CTEs for Readability¶
-- Good
WITH active_customers AS (
SELECT * FROM {{ ref('stg_customers') }}
WHERE status = 'active'
),
recent_orders AS (
SELECT * FROM {{ ref('stg_orders') }}
WHERE created > CURRENT_DATE - INTERVAL 30 DAY
)
SELECT ... FROM active_customers JOIN recent_orders ...
-- Avoid
SELECT ... FROM {{ ref('stg_customers') }} JOIN {{ ref('stg_orders') }} ...
WHERE ... AND ...
3. Use Incremental for Large Tables¶
For tables with millions of rows, use the incremental materialization (see the Materializations section above) so each run processes only new or changed rows.
4. Document Business Logic¶
Add descriptions to complex calculations:
- name: customer_lifetime_value
description: |
Calculated as sum of all successful charges.
Excludes refunds and failed payments.
5. Test Critical Assumptions¶
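For example, a sketch of schema tests guarding the assumptions the marts rely on (model and column names from the examples above):

```yaml
models:
  - name: customer_metrics
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: total_revenue
        tests:
          - not_null
```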
Next Steps¶
- Staging Models - Deep dive into auto-generated staging
- Custom Models - Build marts and intermediate models
- Testing - Comprehensive data quality testing
- dbt Documentation - Official dbt docs