CSV Files¶
Upload and sync CSV files into your data warehouse.
Overview¶
CSV sources in Dango provide a simple way to load flat files into DuckDB with automatic schema detection and file watching capabilities.
Key Features:
- Automatic schema detection
- File watcher with auto-sync on changes
- Manual and scheduled sync
- Support for multiple delimiters
- Header row handling
How CSV Loading Works¶
Core behavior: All files matching your file_pattern in the directory are combined (UNION) into a single table. This happens on every sync.
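Conceptually, each sync rebuilds the table from every file that matches the glob, much like this DuckDB query (a sketch using the sales_data example from later sections, not Dango's exact internals):

-- Read and combine all matching files in one pass (illustrative)
SELECT * FROM read_csv_auto('data/uploads/sales_data/*.csv');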
This simple design supports two common workflows:
Workflow 1: Accumulate Data Over Time¶
Add new files to the directory and all rows combine automatically.
data/uploads/sales/
├── sales_2024_01.csv # 1,000 rows
├── sales_2024_02.csv # 1,200 rows
└── sales_2024_03.csv # 1,100 rows
# ─────────────
# Table: 3,300 rows (all combined)
Use cases: Monthly exports, daily transaction logs, regional data files
Workflow 2: Replace with Latest Data¶
Keep only one file in the directory. Replace it when you have new data.
To update: delete the old file, copy in the new one, sync.
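For example (paths are illustrative):

# Swap in the new export, then sync
rm data/uploads/products/catalog.csv
cp latest_export.csv data/uploads/products/catalog.csv
dango sync --source products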
Use cases: Product catalogs, price lists, reference data, point-in-time snapshots
Choose your workflow
- Growing data? → Add files, let them accumulate
- Reference data? → Replace the single file each time
Quick Start¶
Via Wizard (Recommended)¶
dango source add
# Select "CSV Files"
# Enter source name (e.g., "sales_data")
# Confirm directory (default: data/uploads/sales_data)
# Confirm file pattern (default: *.csv)
The wizard creates the directory and configuration for you.
Via Web UI¶
- Start the platform: `dango start`
- Open the Web UI at http://localhost:8800
- Click "Add Source" → "CSV"
- Follow the prompts
- Upload files via the Web UI
Via Configuration File¶
Edit .dango/sources.yml:
version: '1.0'
sources:
  - name: sales_data
    type: csv
    enabled: true
    description: Monthly sales transactions
    csv:
      directory: data/uploads/sales_data
      file_pattern: "*.csv"
Then copy files and sync:
# Create directory if needed
mkdir -p data/uploads/sales_data
# Copy your CSV files
cp my_sales.csv data/uploads/sales_data/
# Sync
dango sync --source sales_data
Configuration Options¶
Required Parameters¶
| Parameter | Description | Example |
|---|---|---|
| `name` | Unique identifier for this source | `sales_data` |
| `type` | Must be `csv` | `csv` |
| `csv.directory` | Directory containing CSV files | `data/uploads/sales_data` |
| `csv.file_pattern` | Glob pattern for files to load | `*.csv`, `orders_*.csv` |
Optional Parameters¶
| Parameter | Default | Description |
|---|---|---|
| `enabled` | `true` | Whether this source is active |
| `description` | `""` | Human-readable description |
| `csv.notes` | `null` | Notes on how to refresh this data |
Auto-detected settings
The following are NOT configurable - they are auto-detected by DuckDB:
- Delimiter - Auto-detected (comma, tab, pipe, etc.)
- Header - Assumed true (first row = column names)
- Encoding - Assumed UTF-8
- Data types - Inferred from the first ~1000 rows
Complete Example¶
sources.yml¶
version: '1.0'
sources:
  # Sales data - multiple CSV files in one directory
  - name: sales_data
    type: csv
    enabled: true
    description: Monthly sales transactions
    csv:
      directory: data/uploads/sales_data
      file_pattern: "*.csv"
      notes: "Export from POS system monthly"

  # Customer data - specific file pattern
  - name: customers
    type: csv
    enabled: true
    description: Customer master data
    csv:
      directory: data/uploads/customers
      file_pattern: "customer_*.csv"

  # Product catalog
  - name: products
    type: csv
    enabled: true
    description: Product SKU catalog
    csv:
      directory: data/uploads/products
      file_pattern: "*.csv"
File Structure¶
Each CSV source has its own directory:
my-dango-project/
├── .dango/
│ └── sources.yml
├── data/
│ └── uploads/
│ ├── sales_data/ # Source: sales_data
│ │ ├── sales_2024_01.csv
│ │ ├── sales_2024_02.csv
│ │ └── sales_2024_03.csv
│ ├── customers/ # Source: customers
│ │ └── customer_list.csv
│ └── products/ # Source: products
│ └── catalog.csv
└── dbt/
All files matching the file_pattern in a source's directory are combined into one table.
File Watcher (Auto-Sync)¶
When auto-sync is enabled in your project configuration, Dango monitors CSV directories and automatically triggers sync when files change.
Enable Auto-Sync¶
In .dango/project.yml:
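A minimal sketch (the `platform.auto_sync` key matches the setting referenced under Troubleshooting below):

platform:
  auto_sync: true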
Note
Auto-sync is a platform configuration setting, not a CLI flag or per-source setting.
How It Works¶
- Start the platform: `dango start`
- The file watcher monitors all CSV source directories
- When files are added or modified, a sync triggers after the debounce period
- Data is loaded into DuckDB
- Staging models are regenerated
Use Cases¶
- Live data feeds: CSV files updated by external scripts
- Development: Edit CSV files and see results in Metabase
- Manual exports: Drop in new CSV files and have them auto-load
Data Loading Behavior¶
Schema Detection¶
Dango uses DuckDB's automatic CSV parsing:
- Column names: Taken from the header row (the first row is always treated as headers)
- Data types: Inferred from the first ~1000 rows
- Null handling: Empty values are treated as NULL
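To inspect the detected schema, you can describe the loaded table in DuckDB (the table name follows the pattern shown under Target Schema below):

DESCRIBE raw_sales_data.sales_data;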
Write Disposition¶
CSV sources use replace disposition by default:
- Full table refresh on each sync
- Previous data is dropped
- Suitable for master data files (customers, products, etc.)
For append-only behavior (logs, events), use a custom dlt source instead.
Target Schema¶
Data is loaded into a source-specific schema named `raw_<source_name>`:
Example:
-- source name: sales_data
-- target table: raw_sales_data.sales_data
SELECT * FROM raw_sales_data.sales_data LIMIT 10;
Note
All files matching the file_pattern are combined into a single table named after the source.
Schema Changes¶
CSV schema (column names and types) is fixed on first load:
- First sync: DuckDB analyzes file headers and infers types from the first ~1000 rows
- Subsequent syncs: Schema must match the original
If your CSV schema changes (columns added/removed/renamed):
- Sync will fail with a schema mismatch error
- To fix: Remove and re-add the source
# Remove old source
dango source remove sales_data
# Re-add with same name (schema will be re-detected)
dango source add
# Select "CSV Files", use same name
Warning
Schema changes require re-creating the source. Plan your CSV structure before initial load.
Common Patterns¶
Accumulating Monthly Data (Workflow 1)¶
Keep adding files each month - all data combines automatically:
- name: monthly_sales
  type: csv
  enabled: true
  description: Monthly sales exports - all months combined
  csv:
    directory: data/uploads/monthly_sales
    file_pattern: "*.csv"
    notes: "Add new monthly export file, keep previous months"
data/uploads/monthly_sales/
├── sales_2024_01.csv # January data
├── sales_2024_02.csv # February data
├── sales_2024_03.csv # March data
└── sales_2024_04.csv # April data (just added)
Workflow: Each month, export your data and copy the new file into the directory:
# Add new month's file (don't delete old ones)
cp sales_2024_05.csv data/uploads/monthly_sales/
# Sync - table now contains Jan through May
dango sync --source monthly_sales
The table raw_monthly_sales.monthly_sales contains all rows from all months.
Reference Data Replacement (Workflow 2)¶
Keep only the latest version - replace the file each time:
- name: product_catalog
  type: csv
  enabled: true
  description: Current product catalog
  csv:
    directory: data/uploads/product_catalog
    file_pattern: "*.csv"
    notes: "Replace with latest export from inventory system"
Workflow: Replace the file when you have updated data:
# Remove old file, add new one
rm data/uploads/product_catalog/products.csv
cp new_products_export.csv data/uploads/product_catalog/products.csv
# Sync - table reflects only the new file
dango sync --source product_catalog
Regional/Category Files (Combined)¶
Multiple files representing different segments, all combined:
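For example (region file names are illustrative):

- name: regional_sales
  type: csv
  enabled: true
  description: Sales by region - all regions combined
  csv:
    directory: data/uploads/regional_sales
    file_pattern: "*.csv"

data/uploads/regional_sales/
├── sales_north.csv
├── sales_south.csv
├── sales_east.csv
└── sales_west.csv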
All four files are loaded into raw_regional_sales.regional_sales. Add a new region by adding a file.
Separate Tables per Category¶
If you need separate tables instead of combined, use separate sources:
- name: sales_north
  type: csv
  csv:
    directory: data/uploads/sales_north
    file_pattern: "*.csv"

- name: sales_south
  type: csv
  csv:
    directory: data/uploads/sales_south
    file_pattern: "*.csv"
Then combine in dbt if needed (see Transformations section).
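A minimal dbt sketch (the source() names are assumptions; adjust them to however your project registers these sources):

-- models/staging/stg_sales_all_regions.sql (hypothetical model)
select *, 'north' as region from {{ source('sales_north', 'sales_north') }}
union all
select *, 'south' as region from {{ source('sales_south', 'sales_south') }}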
Excel Files¶
Dango does not currently support Excel files (.xlsx) directly.
Workaround: Export to CSV from Excel:
- Open your Excel file
- File → Save As → CSV (Comma delimited)
- Save to your source's directory
Future support
Native Excel support may be added in a future version.
Troubleshooting¶
Schema Detection Errors¶
Problem: Incorrect data types inferred
Solution: DuckDB infers types from the first ~1000 rows. If later rows contain values of a different type, the inferred schema can be wrong. You may need to:
- Clean the CSV file
- Create a dbt staging model with explicit casting (see the sketch below)
- Use a custom dlt source for complex parsing
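A minimal casting sketch (column names and the source() reference are hypothetical):

-- models/staging/stg_sales_data.sql (hypothetical dbt model)
select
    cast(order_id   as integer)        as order_id,
    cast(amount     as decimal(10, 2)) as amount,
    cast(order_date as date)           as order_date
from {{ source('sales_data', 'sales_data') }}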
Encoding Issues¶
Problem: Special characters display incorrectly
Solution: Files are assumed to be UTF-8 encoded (see Auto-detected settings above). To fix:
- Convert your CSV file to UTF-8 before uploading, using a text editor or a command-line tool such as iconv (see the example below)
- Move the converted file to your source directory
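For example, with iconv (file names are illustrative; adjust -f to your file's actual encoding):

# Convert a Latin-1 encoded export to UTF-8
iconv -f LATIN1 -t UTF-8 products_latin1.csv > products.csv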
Directory Not Found¶
Problem: FileNotFoundError or no files synced
Solution: Directory paths are relative to project root (where .dango/ is). Verify:
# Check directory exists
ls data/uploads/sales_data/
# Check files match pattern
ls data/uploads/sales_data/*.csv
Ensure your sources.yml directory matches the actual location:
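csv:
  directory: data/uploads/sales_data   # must match the path you checked above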
File Watcher Not Triggering¶
Problem: File changes but sync doesn't run
Solution: Check that:
- `auto_sync: true` is set in `.dango/project.yml` under `platform:`
- The platform is running (`dango start`)
- The directory path is correct and accessible
- Files match the `file_pattern` glob
Note: Auto-sync has a debounce period (default 10 minutes) to avoid rapid repeated syncs.
Best Practices¶
1. Use Consistent Delimiters¶
Stick to standard formats:
- CSV: Comma-delimited (.csv)
- TSV: Tab-delimited (.tsv)
- PSV: Pipe-delimited (.psv)
2. Always Include Headers¶
Makes data self-documenting and easier to work with in dbt/Metabase.
3. Validate Data Before Uploading¶
Check for:
- Consistent column counts
- Proper escaping of quotes
- No binary data in text fields
4. Use Relative Paths¶
Keeps configuration portable across environments:
# Good - relative to project root
csv:
  directory: data/uploads/sales_data

# Avoid - absolute paths break portability
csv:
  directory: /Users/john/Desktop/sales_data
5. Version Your CSV Files¶
For important data, use Git LFS or date-based naming:
data/
├── customers-2024-12-01.csv
├── customers-2024-12-08.csv
└── customers.csv -> symlink to latest
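To roll the symlink forward when a new export arrives (file names are illustrative):

# Point the stable name at the newest dated export
ln -sf customers-2024-12-08.csv data/customers.csv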
Comparison: CSV vs. Other Sources¶
| Feature | CSV | Built-in dlt | Custom dlt |
|---|---|---|---|
| Setup complexity | Lowest | Low | Medium |
| Real-time data | No (file-based) | Yes (API) | Yes (API) |
| Schema evolution | Manual | Automatic | Automatic |
| Incremental loading | No | Yes | Yes |
| Best for | Static data, exports | SaaS APIs | Custom APIs |
Next Steps¶
- OAuth Sources - Connect to cloud services
- Custom Sources - Build your own integrations
- Transformations - Clean and model your CSV data
- Dashboards - Visualize CSV data in Metabase