dbt Basics

Learn how dbt works within Dango for SQL-based transformations.


Overview

dbt (data build tool) transforms raw data in your warehouse using SQL. In Dango, dbt runs on top of DuckDB to create clean, modeled data for analytics.

What dbt does:

  • Transforms raw data using SQL SELECT statements
  • Manages table dependencies (lineage)
  • Tests data quality
  • Documents your data models
  • Versions transformations with Git

What Dango automates:

  • Auto-generates staging models from raw tables
  • Configures dbt project for DuckDB
  • Runs transformations via the dango run command
  • Generates dbt documentation

Quick Start

1. Load Raw Data

# Add and sync a data source
dango sync --source stripe_payments

Raw data is loaded into DuckDB:

raw_stripe.charges
raw_stripe.customers
raw_stripe.subscriptions
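
To confirm the load, you can open the warehouse file directly with the DuckDB CLI (the data/warehouse.duckdb path matches the one used in the Debugging section below):

duckdb data/warehouse.duckdb

D SELECT COUNT(*) FROM raw_stripe.charges;
D SELECT * FROM raw_stripe.customers LIMIT 5;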

2. Staging Models (Auto-Generated)

dango sync automatically generates staging model templates.

This creates:

dbt/models/staging/
├── stg_stripe_charges.sql
├── stg_stripe_customers.sql
├── stg_stripe_subscriptions.sql
├── _stg_stripe__sources.yml
└── _stg_stripe__schema.yml

3. Run Transformations

# Run all dbt models
dango run

This executes dbt and creates:

staging.stg_stripe_charges
staging.stg_stripe_customers
staging.stg_stripe_subscriptions

4. Create Custom Models

Use dango model add to create custom models, or manually create files in dbt/models/marts/:

dango model add
# Select "marts", enter name, add description

Or manually create:

-- dbt/models/marts/customer_lifetime_value.sql
{{ config(materialized='table') }}

WITH charges AS (
    SELECT
        customer_id,
        SUM(amount / 100.0) as total_spent,
        COUNT(*) as order_count
    FROM {{ ref('stg_stripe_charges') }}
    WHERE status = 'succeeded'
    GROUP BY customer_id
),

customers AS (
    SELECT
        id,
        email,
        created
    FROM {{ ref('stg_stripe_customers') }}
)

SELECT
    c.id,
    c.email,
    c.created,
    COALESCE(ch.total_spent, 0) as lifetime_value,
    COALESCE(ch.order_count, 0) as total_orders
FROM customers c
LEFT JOIN charges ch ON c.id = ch.customer_id

Run again:

dango run

Now query in Metabase:

SELECT * FROM marts.customer_lifetime_value
ORDER BY lifetime_value DESC
LIMIT 10


dbt Project Structure

Dango creates a standard dbt project:

dbt/
├── dbt_project.yml           # Project configuration
├── profiles.yml              # DuckDB connection (auto-configured)
├── models/
│   ├── staging/              # Auto-generated from raw tables
│   │   ├── stg_<source>_<table>.sql
│   │   ├── _stg_<source>__sources.yml
│   │   └── _stg_<source>__schema.yml
│   ├── intermediate/         # Reusable business logic (your code)
│   └── marts/                # Final tables for BI (your code)
├── tests/                    # Custom data tests
├── macros/                   # Reusable SQL snippets
└── snapshots/                # Slowly changing dimensions
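
Dango writes profiles.yml for you; as a rough sketch, a dbt-duckdb profile only needs an adapter type and a database path (the profile name dango below is an assumption; it must match the profile: key in dbt_project.yml):

# dbt/profiles.yml (illustrative sketch; auto-configured by Dango)
dango:
  target: dev
  outputs:
    dev:
      type: duckdb                  # dbt-duckdb adapter
      path: data/warehouse.duckdb   # same warehouse file queried in the Debugging section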

Model Types

Staging Models (Auto-generated)

Purpose: Template for cleaning raw data

Location: dbt/models/staging/

Auto-generated by: dango sync

Example (auto-generated template):

-- dbt/models/staging/stg_stripe_charges.sql
{{ config(materialized='table', schema='staging') }}

SELECT * FROM {{ source('stripe', 'charges') }}

When to modify: Add column selection, renaming, type casting, or deduplication as needed; see the sketch below.
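
For example, a customized staging model might select only the columns you need and cast types explicitly (the column list below is illustrative; check the raw table's actual schema):

-- dbt/models/staging/stg_stripe_charges.sql (customized; columns are illustrative)
{{ config(materialized='table', schema='staging') }}

SELECT
    id,
    customer_id,
    CAST(amount AS BIGINT) as amount,
    LOWER(status) as status,
    CAST(created AS TIMESTAMP) as created
FROM {{ source('stripe', 'charges') }}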

Intermediate Models (Your Code)

Purpose: Reusable business logic, joins, aggregations

Location: dbt/models/intermediate/

Created by: dango model add (select "intermediate")

Example:

-- dbt/models/intermediate/int_customer_orders.sql
{{ config(materialized='table', schema='intermediate') }}

SELECT
    c.id as customer_id,
    c.email,
    o.id as order_id,
    o.amount,
    o.created as order_date
FROM {{ ref('stg_stripe_customers') }} c
JOIN {{ ref('stg_stripe_charges') }} o
    ON c.id = o.customer_id
WHERE o.status = 'succeeded'

Marts Models (Your Code)

Purpose: Final tables for business intelligence

Location: dbt/models/marts/

Created by: dango model add (select "marts")

Example:

-- dbt/models/marts/customer_metrics.sql
{{ config(materialized='table') }}

SELECT
    customer_id,
    email,
    COUNT(*) as total_orders,
    SUM(amount) as total_revenue,
    AVG(amount) as avg_order_value,
    MIN(order_date) as first_order_date,
    MAX(order_date) as last_order_date
FROM {{ ref('int_customer_orders') }}
GROUP BY customer_id, email


Materializations

Control how dbt models are stored in DuckDB:

table (Default in Dango)

{{ config(materialized='table') }}
  • Stores results as physical table
  • Fast queries, uses storage
  • Required for Metabase compatibility
  • All Dango models use tables by default

Why tables?

Metabase requires tables for reliable schema discovery. Views can work in some cases but may have inconsistent behavior.

incremental

{{ config(
    materialized='incremental',
    unique_key='id'
) }}

SELECT * FROM {{ source('stripe', 'charges') }}
{% if is_incremental() %}
WHERE created > (SELECT MAX(created) FROM {{ this }})
{% endif %}
  • Only processes new/changed rows
  • Most efficient for large datasets
  • Requires unique_key for deduplication

ephemeral

{{ config(materialized='ephemeral') }}
  • Not materialized (CTE in dependent models)
  • Reduces clutter, no intermediate tables
  • Best for: Small helper logic
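
As a sketch, an ephemeral helper model (the name int_succeeded_charges is hypothetical) compiles into a CTE inside whichever model references it, so no table or view is created for it:

-- dbt/models/intermediate/int_succeeded_charges.sql (hypothetical helper)
{{ config(materialized='ephemeral') }}

SELECT *
FROM {{ ref('stg_stripe_charges') }}
WHERE status = 'succeeded'

Any downstream model that calls {{ ref('int_succeeded_charges') }} gets this query inlined at compile time.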

dbt Jinja & Macros

dbt uses Jinja templating for dynamic SQL:

ref() - Reference Other Models

SELECT * FROM {{ ref('stg_stripe_customers') }}

Benefits:

  • Establishes dependency order
  • Enables lineage tracking
  • Works across environments

source() - Reference Raw Tables

SELECT * FROM {{ source('stripe', 'charges') }}

Sources are defined in _sources.yml:

sources:
  - name: stripe
    schema: raw_stripe
    tables:
      - name: charges

config() - Model Configuration

{{ config(
    materialized='table',
    tags=['daily', 'finance']
) }}
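
Tags double as selectors, so you can run just a tagged subset with dbt directly (same profiles/project flags as elsewhere on this page):

# Run only models tagged 'daily'
dbt run --select tag:daily --profiles-dir dbt --project-dir dbt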

Custom Macros

Create reusable SQL in dbt/macros/:

-- dbt/macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100.0)
{% endmacro %}

Use in models:

SELECT
    id,
    {{ cents_to_dollars('amount') }} as amount_usd
FROM {{ ref('stg_stripe_charges') }}


Running dbt in Dango

dango run

Run all dbt models:

dango run

This command:

  1. Reads dbt_project.yml
  2. Runs models in dependency order (staging → intermediate → marts)
  3. Updates tables in DuckDB
  4. Generates run logs

Selective Runs

Run specific models:

# Run one model
dbt run --select customer_metrics

# Run model and downstream dependencies
dbt run --select customer_metrics+

# Run model and upstream dependencies
dbt run --select +customer_metrics

# Run all marts
dbt run --select marts.*

Note: Use dbt directly for advanced options. Dango's dango run is equivalent to dbt run --profiles-dir dbt --project-dir dbt.


Testing Data Quality

Schema Tests (YAML)

Define tests in schema.yml:

# dbt/models/marts/schema.yml
version: 2

models:
  - name: customer_metrics
    description: Customer lifetime value and order metrics
    columns:
      - name: customer_id
        description: Stripe customer ID
        tests:
          - unique
          - not_null

      - name: email
        description: Customer email address
        tests:
          - not_null

      - name: total_revenue
        description: Total revenue from this customer
        tests:
          - not_null

Run tests:

dbt test --profiles-dir dbt --project-dir dbt

Custom Tests

Create SQL-based tests in dbt/tests/:

-- dbt/tests/assert_positive_revenue.sql
SELECT *
FROM {{ ref('customer_metrics') }}
WHERE total_revenue < 0

The test fails if the query returns any rows.
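
You can also run this one test by name, since singular tests are selectable just like models:

dbt test --select assert_positive_revenue --profiles-dir dbt --project-dir dbt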


Documentation

Auto-generated Docs

dbt documentation is available via the web UI:

dango start
# Open http://localhost:8800/dbt-docs

The documentation includes:

  • Model lineage graphs
  • Column-level documentation
  • SQL code for each model
  • Test results

You can also generate and serve docs manually:

dbt docs generate --profiles-dir dbt --project-dir dbt
dbt docs serve --profiles-dir dbt --project-dir dbt --port 8081

Adding Descriptions

# dbt/models/marts/schema.yml
models:
  - name: customer_metrics
    description: |
      Customer-level metrics calculated from Stripe charges.
      Updated daily via dango sync && dango run.

    columns:
      - name: customer_id
        description: Unique Stripe customer ID (pk)

      - name: total_revenue
        description: Sum of all successful charges in USD

Common Patterns

Deduplication

Add to staging models if needed:

{{ config(materialized='table', schema='staging') }}

WITH deduped AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY id
            ORDER BY _dlt_extracted_at DESC
        ) as rn
    FROM {{ source('stripe', 'customers') }}
)

SELECT * FROM deduped
WHERE rn = 1

Date Spine

-- dbt/models/intermediate/int_date_spine.sql
{{ config(materialized='table') }}

{{ dbt_utils.date_spine(
    datepart="day",
    start_date="cast('2024-01-01' as date)",
    end_date="cast(current_date() as date)"
) }}
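
Note that date_spine comes from the dbt_utils package, which must be declared and installed before this model will compile (the version range below is just an example):

# dbt/packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]

Then install it with dbt deps --project-dir dbt.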

Union Multiple Sources

{{ config(materialized='table', schema='intermediate') }}

SELECT
    'source_a' as source,
    *
FROM {{ ref('stg_source_a_users') }}

UNION ALL

SELECT
    'source_b' as source,
    *
FROM {{ ref('stg_source_b_users') }}

Debugging

View Compiled SQL

dbt compile --profiles-dir dbt --project-dir dbt --select customer_metrics

Check dbt/target/compiled/ for rendered SQL.

Run with Debug Logging

dbt run --profiles-dir dbt --project-dir dbt --debug

Test Individual Model

dbt run --profiles-dir dbt --project-dir dbt --select customer_metrics

Query DuckDB Directly

duckdb data/warehouse.duckdb

D SELECT * FROM marts.customer_metrics LIMIT 5;
D DESCRIBE marts.customer_metrics;

Best Practices

1. Keep Staging Simple

Staging models should only:

  • Select columns
  • Rename columns
  • Cast data types
  • Deduplicate

Business logic belongs in intermediate/marts.

2. Use CTEs for Readability

-- Good
WITH active_customers AS (
    SELECT * FROM {{ ref('stg_customers') }}
    WHERE status = 'active'
),

recent_orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
    WHERE created > CURRENT_DATE - INTERVAL 30 DAY
)

SELECT ... FROM active_customers JOIN recent_orders ...

-- Avoid
SELECT ... FROM {{ ref('stg_customers') }} JOIN {{ ref('stg_orders') }} ...
WHERE ... AND ...

3. Use Incremental for Large Tables

For tables with millions of rows:

{{ config(
    materialized='incremental',
    unique_key='id'
) }}

4. Document Business Logic

Add descriptions to complex calculations:

- name: customer_lifetime_value
  description: |
    Calculated as sum of all successful charges.
    Excludes refunds and failed payments.

5. Test Critical Assumptions

tests:
  - unique
  - not_null
  - accepted_values:
      values: ['succeeded', 'failed', 'pending']

Next Steps