
Performance Optimization

Strategies for optimizing Dango performance with large datasets.


Overview

As your data grows, you may need to optimize:

  • Sync performance (data loading)
  • Query performance (DuckDB)
  • dbt model execution
  • Memory usage


DuckDB Optimization

Understanding DuckDB

DuckDB is optimized for analytical workloads:

  • Columnar storage for fast aggregations
  • Automatic parallelization
  • In-memory processing with disk spillover

Memory Configuration

-- Check current memory limit
SELECT current_setting('memory_limit');

-- Increase memory for large datasets (run in DuckDB CLI)
SET memory_limit = '8GB';
SET threads = 4;
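
The same settings can also be applied through the DuckDB Python API when you open the warehouse file directly. A minimal sketch, assuming the default data/warehouse.duckdb path (temp_directory controls where DuckDB spills when memory runs out):

# configure_duckdb.py -- sketch; assumes the default warehouse path
import duckdb

con = duckdb.connect("data/warehouse.duckdb")
con.execute("SET memory_limit = '8GB'")         # cap DuckDB memory usage
con.execute("SET threads = 4")                  # number of parallel workers
con.execute("SET temp_directory = 'data/tmp'")  # spill location for large operations
print(con.execute("SELECT current_setting('memory_limit')").fetchone())
con.close()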

Configure in dbt:

# dbt/profiles.yml
my_project:
  outputs:
    dev:
      type: duckdb
      path: ../data/warehouse.duckdb
      settings:
        memory_limit: '8GB'
        threads: 4

Query Optimization

Use EXPLAIN ANALYZE to see where a query spends its time:

EXPLAIN ANALYZE
SELECT date_trunc('day', order_date), SUM(amount)
FROM raw_orders.orders
GROUP BY 1;

Common optimizations:

-- BAD: Full table scan for filter
SELECT * FROM orders WHERE YEAR(order_date) = 2024;

-- GOOD: Range filter
SELECT * FROM orders
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';

-- BAD: SELECT *
SELECT * FROM orders JOIN customers USING (customer_id);

-- GOOD: Select only needed columns
SELECT o.order_id, o.amount, c.name
FROM orders o JOIN customers c USING (customer_id);

Large Dataset Handling

Sync Strategies

Incremental Syncs:

# .dango/sources.yml
sources:
  - name: large_source
    type: stripe
    stripe:
      # Only sync recent data after initial load
      start_date: "2024-01-01"

Source Timeouts:

# For slow sources, increase timeout
sources:
  - name: slow_api
    type: rest_api
    rest_api:
      timeout: 600  # 10 minutes

Parallel Syncs:

# Sync multiple sources in parallel (background)
dango sync --source source1 &
dango sync --source source2 &
wait
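
The same pattern works from Python if you prefer a process pool over shell job control. A sketch, assuming the dango CLI is on your PATH; source1 and source2 are placeholders:

# parallel_sync.py -- sketch; source names are placeholders
import subprocess
from concurrent.futures import ThreadPoolExecutor

SOURCES = ["source1", "source2"]

def sync(source):
    # Each sync runs as its own dango process
    return subprocess.run(["dango", "sync", "--source", source]).returncode

with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
    exit_codes = list(pool.map(sync, SOURCES))

print("exit codes:", exit_codes)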

Data Volume Guidelines

Data Size        Recommendation
< 100 MB         Default settings work well
100 MB - 1 GB    Consider incremental syncs
1 GB - 10 GB     Use materialized tables, indexes
> 10 GB          Consider partitioning, external tables
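
To see which tier applies to you, check the warehouse file size and the largest tables. A minimal sketch, assuming the default data/warehouse.duckdb path:

# check_data_volume.py -- sketch; assumes the default warehouse path
import os
import duckdb

path = "data/warehouse.duckdb"
print(f"warehouse file: {os.path.getsize(path) / 1024**2:.1f} MB")

con = duckdb.connect(path, read_only=True)
largest = con.execute("""
    SELECT table_name, estimated_size
    FROM duckdb_tables()
    ORDER BY estimated_size DESC
    LIMIT 5
""").fetchall()
for name, rows in largest:
    print(f"{name}: ~{rows} rows (estimated)")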

dbt Performance

Materialization Strategy

Dango materializes models as tables by default to ensure Metabase compatibility:

Type         Use When                      Performance
table        Default for all models        Best query performance, reliable Metabase support
incremental  Very large, append-only data  Fast to build, fast to query

Why Tables Only

Dango auto-generates staging models as tables (not views) for reliable Metabase compatibility. Views are cheaper to build because they defer computation to query time, but they can cause query issues in Metabase.

-- Staging: tables (auto-generated by Dango)
{{ config(materialized='table') }}

-- Intermediate: tables
{{ config(materialized='table') }}

-- Marts: tables (query performance matters)
{{ config(materialized='table') }}

-- Large fact tables: incremental
{{ config(
    materialized='incremental',
    unique_key='id',
    incremental_strategy='merge'
) }}

Incremental Models

For large fact tables:

-- dbt/models/marts/fct_events.sql
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

SELECT
    event_id,
    user_id,
    event_type,
    created_at
FROM {{ ref('stg_events') }}

{% if is_incremental() %}
    WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}

Run Specific Models

# Don't rebuild everything
cd dbt

# Run only models changed since a previous run
# (state:modified needs --state pointing at that run's artifacts)
dbt run --select state:modified+ --state path/to/previous/target

# Run specific model and downstream
dbt run --select my_model+

Parallel Execution

# Run models in parallel (overrides the threads setting in profiles.yml)
dbt run --threads 4

Sync Optimization

Reduce Sync Scope

Select specific endpoints:

sources:
  - name: stripe
    type: stripe
    stripe:
      endpoints:
        - charges      # Only what you need
        - customers
        # Skip: products, plans, etc.

Date range limits:

sources:
  - name: google_analytics
    type: google_analytics
    google_analytics:
      start_date: "2024-01-01"  # Don't fetch all history

Full Refresh Scheduling

# Regular syncs: incremental
dango sync

# Weekly: full refresh for data quality
dango sync --full-refresh  # Run on weekends
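
One way to automate the weekly full refresh is a small wrapper that your scheduler runs daily. A sketch, assuming the dango CLI is on your PATH:

# scheduled_sync.py -- sketch; adjust the full-refresh day to your workload
import subprocess
from datetime import date

cmd = ["dango", "sync"]
if date.today().weekday() == 5:  # Saturday: weekly full refresh
    cmd.append("--full-refresh")

subprocess.run(cmd, check=True)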

Memory Management

Monitor Memory Usage

# Check DuckDB memory
duckdb data/warehouse.duckdb "SELECT * FROM duckdb_memory()"

# Check system memory during a sync (top -p on Linux, top -pid on macOS)
top -pid $(pgrep -f dango)

Reduce Memory Pressure

Close unused connections:

# Stop services when not needed
dango stop

Process in batches:

# Custom source with batching (fetch_page is a placeholder for your API client)
import dlt

@dlt.resource
def large_api():
    for page in range(100):
        batch = fetch_page(page, size=1000)
        yield batch  # Yield one page at a time, not the whole dataset
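
To test such a resource in isolation (outside Dango), dlt can load it straight into DuckDB. A sketch that continues from the resource above; the pipeline and dataset names are hypothetical:

# Continues from large_api above
pipeline = dlt.pipeline(
    pipeline_name="large_api_test",   # hypothetical name
    destination="duckdb",
    dataset_name="raw_large_api",     # hypothetical dataset
)
info = pipeline.run(large_api())  # batches are extracted incrementally, keeping memory bounded
print(info)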


Database Maintenance

Vacuum and Analyze

-- Run periodically in DuckDB CLI
VACUUM;
ANALYZE;

Check Table Sizes

-- Find large tables
SELECT
    table_name,
    estimated_size
FROM duckdb_tables()
ORDER BY estimated_size DESC;

Remove Unused Data

# Clean up old dlt state
dango db clean

# This removes:
# - DuckDB database
# - dlt pipeline state
# Fresh sync required after

Monitoring Performance

Sync Duration

# Time a sync
time dango sync --source my_source

# Check sync logs
cat .dango/logs/sync_*.log | grep duration
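
For repeatable measurements you can also time a sync from Python. A minimal sketch; my_source is a placeholder source name:

# time_sync.py -- sketch; my_source is a placeholder
import subprocess
import time

start = time.perf_counter()
subprocess.run(["dango", "sync", "--source", "my_source"], check=True)
print(f"sync took {time.perf_counter() - start:.1f}s")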

Query Performance

-- Enable profiling with JSON output
PRAGMA enable_profiling = 'json';
PRAGMA profile_output = 'query_profile.json';

-- Run your query
SELECT ...;

-- Check profile (query_profile.json is written once the query finishes)
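
The JSON profile can then be inspected outside DuckDB. A minimal sketch that just pretty-prints it:

# inspect_profile.py -- sketch; assumes profiling wrote query_profile.json
import json

with open("query_profile.json") as f:
    profile = json.load(f)

# Pretty-print the profile; drill into operator timings from here
print(json.dumps(profile, indent=2))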

dbt Timing

# dbt shows timing by default
dbt run --select my_model

# Detailed timing
dbt run --select my_model --log-level debug
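
Per-model timings are also written to dbt's run_results.json artifact. A sketch that lists the slowest models, run from the dbt/ directory after a dbt run:

# model_timings.py -- sketch; reads dbt's run artifact
import json

with open("target/run_results.json") as f:
    results = json.load(f)["results"]

# Slowest models first
for r in sorted(results, key=lambda r: r["execution_time"], reverse=True)[:10]:
    print(f"{r['unique_id']}: {r['execution_time']:.1f}s")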

Performance Checklist

Before Scaling

  • Use incremental syncs where possible
  • Materialize slow models as tables
  • Limit sync scope to needed endpoints
  • Set appropriate date ranges

For Large Datasets

  • Increase DuckDB memory limit
  • Use incremental dbt models
  • Run dbt with multiple threads
  • Schedule full refreshes off-peak

Troubleshooting Slow Queries

  1. Check with EXPLAIN ANALYZE
  2. Check join keys and filter selectivity (DuckDB uses hash joins; explicit indexes rarely help)
  3. Review model materializations
  4. Check for unnecessary SELECT *

Next Steps