dbt Basics¶
Learn how dbt works within Dango for SQL-based transformations.
Overview¶
dbt (data build tool) transforms raw data in your warehouse using SQL. In Dango, dbt runs on top of DuckDB to create clean, modeled data for analytics.
What dbt does:
- Transforms raw data using SQL SELECT statements
- Manages table dependencies (lineage)
- Tests data quality
- Documents your data models
- Versions transformations with Git
What Dango automates:
- Auto-generates staging models from raw tables
- Configures dbt project for DuckDB
- Integrates with the `dango run` command
- Generates dbt documentation
Quick Start¶
1. Load Raw Data¶
Raw data is loaded into DuckDB:
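Extraction happens through Dango's sync command; a typical invocation (assuming your sources, such as the Stripe example used below, are already configured) looks like:

```shell
dango sync
```

This runs the configured dlt pipelines and writes raw tables into `data/warehouse.duckdb`.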
2. Staging Models (Auto-Generated)¶
dango sync automatically generates staging model templates.
This creates:
dbt/models/staging/
├── stg_stripe_charges.sql
├── stg_stripe_customers.sql
├── stg_stripe_subscriptions.sql
├── _stg_stripe__sources.yml
└── _stg_stripe__schema.yml
3. Run Transformations¶
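Transformations run with a single command:

```shell
dango run
```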
This executes dbt and creates the corresponding staging tables (for example, `staging.stg_stripe_charges`) in DuckDB.
4. Create Custom Models¶
Use `dango model add` to scaffold a custom model, or manually create a file in `dbt/models/marts/`:
-- dbt/models/marts/customer_lifetime_value.sql
{{ config(materialized='table') }}
WITH charges AS (
SELECT
customer_id,
SUM(amount / 100.0) as total_spent,
COUNT(*) as order_count
FROM {{ ref('stg_stripe_charges') }}
WHERE status = 'succeeded'
GROUP BY customer_id
),
customers AS (
SELECT
id,
email,
created
FROM {{ ref('stg_stripe_customers') }}
)
SELECT
c.id,
c.email,
c.created,
COALESCE(ch.total_spent, 0) as lifetime_value,
COALESCE(ch.order_count, 0) as total_orders
FROM customers c
LEFT JOIN charges ch ON c.id = ch.customer_id
Run `dango run` again to build the new model.
Now query in Metabase:
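For example, a native SQL question in Metabase (or the DuckDB CLI) can select from the new mart:

```sql
SELECT email, lifetime_value, total_orders
FROM marts.customer_lifetime_value
ORDER BY lifetime_value DESC
LIMIT 10;
```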
dbt Project Structure¶
Dango creates a standard dbt project:
dbt/
├── dbt_project.yml # Project configuration
├── profiles.yml # DuckDB connection (auto-configured)
├── models/
│ ├── staging/ # Auto-generated from raw tables
│ │ ├── stg_<source>_<table>.sql
│ │ ├── _stg_<source>__sources.yml
│ │ └── _stg_<source>__schema.yml
│ ├── intermediate/ # Reusable business logic (your code)
│ └── marts/ # Final tables for BI (your code)
├── tests/ # Custom data tests
├── macros/ # Reusable SQL snippets
└── snapshots/ # Slowly changing dimensions
Model Types¶
Staging Models (Auto-generated)¶
Purpose: Template for cleaning raw data
Location: dbt/models/staging/
Auto-generated by: dango sync
Example (auto-generated template):
-- dbt/models/staging/stg_stripe_charges.sql
{{ config(materialized='table', schema='staging') }}
SELECT * FROM {{ source('stripe', 'charges') }}
When to modify: Customize to add column selection, deduplication, or type casting as needed.
Intermediate Models (Your Code)¶
Purpose: Reusable business logic, joins, aggregations
Location: dbt/models/intermediate/
Created by: dango model add (select "intermediate")
Example:
-- dbt/models/intermediate/int_customer_orders.sql
{{ config(materialized='table', schema='intermediate') }}
SELECT
c.id as customer_id,
c.email,
o.id as order_id,
o.amount,
o.created as order_date
FROM {{ ref('stg_stripe_customers') }} c
LEFT JOIN {{ ref('stg_stripe_charges') }} o
ON c.id = o.customer_id
WHERE o.status = 'succeeded'
Marts Models (Your Code)¶
Purpose: Final tables for business intelligence
Location: dbt/models/marts/
Created by: dango model add (select "marts")
Example:
-- dbt/models/marts/customer_metrics.sql
{{ config(materialized='table') }}
SELECT
customer_id,
email,
COUNT(*) as total_orders,
SUM(amount) as total_revenue,
AVG(amount) as avg_order_value,
MIN(order_date) as first_order_date,
MAX(order_date) as last_order_date
FROM {{ ref('int_customer_orders') }}
GROUP BY customer_id, email
Materializations¶
Control how dbt models are stored in DuckDB:
table (Default in Dango)¶
- Stores results as physical table
- Fast queries, uses storage
- Required for Metabase compatibility
- All Dango models use tables by default
Why tables?
Metabase requires tables for reliable schema discovery. Views can work in some cases but may have inconsistent behavior.
incremental¶
{{ config(
materialized='incremental',
unique_key='id'
) }}
SELECT * FROM {{ source('stripe', 'charges') }}
{% if is_incremental() %}
WHERE created > (SELECT MAX(created) FROM {{ this }})
{% endif %}
- Only processes new/changed rows
- Most efficient for large datasets
- Requires `unique_key` for deduplication
ephemeral¶
- Not materialized (CTE in dependent models)
- Reduces clutter, no intermediate tables
- Best for: Small helper logic
dbt Jinja & Macros¶
dbt uses Jinja templating for dynamic SQL:
ref() - Reference Other Models¶
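A minimal example, referencing the auto-generated staging model from earlier:

```sql
SELECT *
FROM {{ ref('stg_stripe_charges') }}
WHERE status = 'succeeded'
```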
Benefits:
- Establishes dependency order
- Enables lineage tracking
- Works across environments
source() - Reference Raw Tables¶
Sources are defined in _sources.yml:
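A sketch of what `dango sync` generates (table names taken from the Stripe example above; the exact schema name depends on your dlt pipeline configuration):

```yaml
# dbt/models/staging/_stg_stripe__sources.yml
version: 2
sources:
  - name: stripe
    schema: stripe  # assumption: schema name matches the source name
    tables:
      - name: charges
      - name: customers
      - name: subscriptions
```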
config() - Model Configuration¶
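Common options, using values that appear elsewhere in this project:

```sql
{{ config(
    materialized='table',
    schema='intermediate'
) }}
```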
Custom Macros¶
Create reusable SQL in dbt/macros/:
-- dbt/macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
({{ column_name }} / 100.0)
{% endmacro %}
Use in models:
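For example, converting Stripe's integer cents to dollars:

```sql
SELECT
    customer_id,
    {{ cents_to_dollars('amount') }} as amount_usd
FROM {{ ref('stg_stripe_charges') }}
```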
Running dbt in Dango¶
dango run¶
Run all dbt models:
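```shell
dango run
```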
This executes the following steps:
1. Reads dbt_project.yml
2. Runs models in dependency order (staging → intermediate → marts)
3. Updates tables in DuckDB
4. Generates run logs
Selective Runs¶
Run specific models:
# Run one model
dbt run --select customer_metrics
# Run model and downstream dependencies
dbt run --select customer_metrics+
# Run model and upstream dependencies
dbt run --select +customer_metrics
# Run all marts
dbt run --select marts.*
Note: Use dbt directly for advanced options. Dango's dango run is equivalent to dbt run --profiles-dir dbt --project-dir dbt.
Testing Data Quality¶
Schema Tests (YAML)¶
Define tests in schema.yml:
# dbt/models/marts/schema.yml
version: 2
models:
- name: customer_metrics
description: Customer lifetime value and order metrics
columns:
- name: customer_id
description: Stripe customer ID
tests:
- unique
- not_null
- name: email
description: Customer email address
tests:
- not_null
- name: total_revenue
description: Total revenue from this customer
tests:
- not_null
Run tests:
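```shell
dbt test --profiles-dir dbt --project-dir dbt
```

The `--profiles-dir`/`--project-dir` flags mirror those shown for `dbt run` above.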
Custom Tests¶
Create SQL-based tests in dbt/tests/:
-- dbt/tests/assert_positive_revenue.sql
SELECT *
FROM {{ ref('customer_metrics') }}
WHERE total_revenue < 0
Test fails if query returns any rows.
Documentation¶
Auto-generated Docs¶
dbt documentation is available via the web UI.
The documentation includes:
- Model lineage graphs
- Column-level documentation
- SQL code for each model
- Test results
You can also generate and serve docs manually:
dbt docs generate --profiles-dir dbt --project-dir dbt
dbt docs serve --profiles-dir dbt --project-dir dbt --port 8081
Adding Descriptions¶
# dbt/models/marts/schema.yml
models:
- name: customer_metrics
description: |
Customer-level metrics calculated from Stripe charges.
Updated daily via dango sync && dango run.
columns:
- name: customer_id
description: Unique Stripe customer ID (pk)
- name: total_revenue
description: Sum of all successful charges in USD
Common Patterns¶
Deduplication¶
Add to staging models if needed:
{{ config(materialized='table', schema='staging') }}
WITH deduped AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY id
ORDER BY _dlt_extracted_at DESC
) as rn
FROM {{ source('stripe', 'customers') }}
)
SELECT * FROM deduped
WHERE rn = 1
Date Spine¶
-- dbt/models/intermediate/int_date_spine.sql
{{ config(materialized='table') }}
{{ dbt_utils.date_spine(
datepart="day",
start_date="cast('2024-01-01' as date)",
end_date="cast(current_date() as date)"
) }}
Union Multiple Sources¶
{{ config(materialized='table', schema='intermediate') }}
SELECT
'source_a' as source,
*
FROM {{ ref('stg_source_a_users') }}
UNION ALL
SELECT
'source_b' as source,
*
FROM {{ ref('stg_source_b_users') }}
Debugging¶
View Compiled SQL¶
Check dbt/target/compiled/ for rendered SQL.
Run with Debug Logging¶
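dbt's global `--debug` flag prints detailed logs, including compiled SQL and timing:

```shell
dbt --debug run --profiles-dir dbt --project-dir dbt
```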
Test Individual Model¶
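Build just one model with `--select`, using the same selector syntax shown earlier:

```shell
dbt run --select customer_metrics --profiles-dir dbt --project-dir dbt
```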
Query DuckDB Directly¶
duckdb data/warehouse.duckdb
D SELECT * FROM marts.customer_metrics LIMIT 5;
D DESCRIBE marts.customer_metrics;
Best Practices¶
1. Keep Staging Simple¶
Staging models should only:
- Select columns
- Rename columns
- Cast data types
- Deduplicate
Business logic belongs in intermediate/marts.
2. Use CTEs for Readability¶
-- Good
WITH active_customers AS (
SELECT * FROM {{ ref('stg_customers') }}
WHERE status = 'active'
),
recent_orders AS (
SELECT * FROM {{ ref('stg_orders') }}
WHERE created > CURRENT_DATE - INTERVAL 30 DAY
)
SELECT ... FROM active_customers JOIN recent_orders ...
-- Avoid
SELECT ... FROM {{ ref('stg_customers') }} JOIN {{ ref('stg_orders') }} ...
WHERE ... AND ...
3. Use Incremental for Large Tables¶
For tables with millions of rows, use the incremental materialization (see the Materializations section above) so each run processes only new or changed rows.
4. Document Business Logic¶
Add descriptions to complex calculations:
- name: customer_lifetime_value
description: |
Calculated as sum of all successful charges.
Excludes refunds and failed payments.
5. Test Critical Assumptions¶
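For example, a sketch of schema tests guarding the assumptions the marts rely on (model and column names from the examples above):

```yaml
models:
  - name: customer_metrics
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: total_revenue
        tests:
          - not_null
```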
Next Steps¶
- Staging Models - Deep dive into auto-generated staging
- Custom Models - Build marts and intermediate models
- Testing - Comprehensive data quality testing
- dbt Documentation - Official dbt docs