Skip to content

datafold/Trino-dbt

Repository files navigation

Trino Agriscience DBT Demo

A dbt project demonstrating agricultural data analytics using Trino with 1 million+ rows of synthetic agriscience data.

Project Overview

This project contains:

  • 1 million rows of crop yield transaction data
  • 5 source tables (crop_yields, farms, crops, seasons, regions)
  • 5 staging models for data preparation
  • 3 advanced marts models with complex transformation logic:
    • Aggregated yield analytics
    • Farm performance metrics with YoY trends
    • Crop rotation and sustainability analysis

Project Structure

.
├── dbt_project.yml          # Main dbt project configuration
├── profiles.yml             # Trino connection profile (reference)
├── models/
│   ├── staging/             # Staging models (views)
│   │   ├── sources.yml      # Source table definitions
│   │   ├── stg_crop_yields.sql
│   │   ├── stg_farms.sql
│   │   ├── stg_crops.sql
│   │   ├── stg_seasons.sql
│   │   └── stg_regions.sql
│   └── marts/               # Final analytical models (tables)
│       ├── schema.yml       # Model documentation and tests
│       ├── yield_summary.sql
│       ├── farm_performance_metrics.sql
│       └── crop_rotation_analysis.sql
├── seeds/                   # CSV seed files (if any)
├── tests/                   # Custom data tests
├── macros/                  # Custom SQL macros
└── README.md               # This file

Data Model

Source Tables (in dbt schema)

  1. crop_yields (1,000,000 rows) - Transactional data

    • yield_id, farm_id, crop_id, season_id, region_id
    • harvest_date, acres_planted, actual_yield_per_acre, total_yield
    • quality_grade, moisture_content, field_number
  2. farms (500 rows) - Farm information

    • farm_id, farm_name, region_id, total_acres, organic_certified
  3. crops (15 rows) - Crop types

    • crop_id, crop_name, crop_category, avg_yield_per_acre
  4. seasons (15 rows) - Planting/harvest seasons (2020-2024)

    • season_id, year, season_name, start_date, end_date
  5. regions (50 rows) - Agricultural regions across 10 Midwest states

    • region_id, region_name, state, avg_rainfall_inches, avg_temp_f

Output Models (Marts)

  1. yield_summary - Aggregated yield analytics with:

    • Regional breakdowns (state, region)
    • Crop analysis (name, category)
    • Temporal grouping (year, season)
    • Farm type comparison (organic vs conventional)
    • Key metrics: total acres, average yields, production totals
  2. farm_performance_metrics - Comprehensive farm performance analysis featuring:

    • Year-over-year growth rates (production & yield)
    • 3-year rolling averages for trend analysis
    • Regional rankings and percentile scores
    • Performance classification (Top Performer, Above Average, etc.)
    • Efficiency metrics vs regional benchmarks
    • Organic vs conventional peer comparisons
    • Land utilization and crop diversity metrics
  3. crop_rotation_analysis - Advanced sustainability analytics including:

    • Shannon and Simpson diversity indices
    • Crop rotation pattern analysis (monoculture vs rotation)
    • Sequential crop sequencing by field over time
    • Rotation effectiveness scoring (1-5 scale)
    • Yield impact analysis (rotation vs monoculture)
    • Sustainability score (0-100) based on rotation practices
    • Rotation strategy classification
    • Soil health indicators derived from planting patterns

Setup Instructions

1. Install dbt-trino

pip install dbt-trino

2. Configure Connection

Copy profiles.yml to ~/.dbt/profiles.yml and update with your Trino connection details:

trino_demo:
  target: dev
  outputs:
    dev:
      type: trino
      method: none
      user: your_username
      host: your_trino_host
      port: 8080
      database: your_catalog
      schema: dbt
      threads: 4
      http_scheme: http

Or set environment variables:

export TRINO_USER=your_username
export TRINO_HOST=your_trino_host
export TRINO_PORT=8080
export TRINO_CATALOG=your_catalog

3. Verify Connection

cd /data/leo/demo
dbt debug

4. Run the Project

# Compile models to check for errors
dbt compile

# Run all models
dbt run

# Run specific model
dbt run --select yield_summary

# Run tests
dbt test

# Generate documentation
dbt docs generate
dbt docs serve

Development Workflow

Run Staging Models Only

dbt run --select staging.*

Run Marts Models Only

dbt run --select marts.*

Run with Full Refresh

dbt run --full-refresh

Test Specific Model

dbt test --select yield_summary

Run and Test

dbt build

Data Characteristics

  • Time Range: 2020-2024 (5 years, 15 seasons)
  • Geographic Coverage: 50 regions across 10 Midwest states
  • Crop Types: 15 varieties (Grains, Legumes, Oilseeds, Fiber, Vegetables, Forage, Roots)
  • Farm Types: Organic certified and conventional farms
  • Quality Grades: A through E
  • Field Distribution: 1-50 fields per record
  • Moisture Content: 8-20%

Example Queries

After running dbt run, you can query the output:

Yield Summary Queries

-- Top producing regions by crop
SELECT
    region_name,
    crop_name,
    total_yield
FROM dbt_marts.yield_summary
ORDER BY total_yield DESC
LIMIT 10;

-- Organic vs conventional comparison
SELECT
    organic_certified,
    AVG(avg_yield_per_acre) as avg_yield,
    SUM(total_acres_planted) as total_acres
FROM dbt_marts.yield_summary
GROUP BY organic_certified;

-- Seasonal trends
SELECT
    year,
    season_name,
    SUM(total_yield) as total_production
FROM dbt_marts.yield_summary
GROUP BY year, season_name
ORDER BY year, season_name;

Farm Performance Queries

-- Top performing farms by efficiency
SELECT
    farm_name,
    region_name,
    year,
    avg_yield_per_acre,
    yield_percentile,
    performance_tier
FROM dbt_marts.farm_performance_metrics
WHERE year = 2024
ORDER BY yield_percentile DESC
LIMIT 10;

-- Year-over-year growth analysis
SELECT
    farm_name,
    year,
    total_production,
    yoy_production_growth_pct,
    rolling_3yr_avg_production
FROM dbt_marts.farm_performance_metrics
WHERE farm_id = 1
ORDER BY year;

-- Organic vs conventional performance comparison
SELECT
    organic_certified,
    AVG(avg_yield_per_acre) as avg_efficiency,
    AVG(yield_vs_regional_avg_pct) as avg_vs_regional_pct,
    COUNT(DISTINCT farm_id) as farm_count
FROM dbt_marts.farm_performance_metrics
WHERE year = 2024
GROUP BY organic_certified;

Crop Rotation & Sustainability Queries

-- Sustainability leaders
SELECT
    farm_name,
    region_name,
    year,
    sustainability_score,
    diversity_classification,
    rotation_strategy,
    shannon_diversity_index
FROM dbt_marts.crop_rotation_analysis
WHERE year = 2024
ORDER BY sustainability_score DESC
LIMIT 10;

-- Rotation yield benefit analysis
SELECT
    rotation_strategy,
    AVG(avg_yield_monoculture) as avg_monoculture_yield,
    AVG(avg_yield_rotated) as avg_rotation_yield,
    AVG(rotation_yield_benefit_pct) as avg_benefit_pct,
    COUNT(*) as farm_count
FROM dbt_marts.crop_rotation_analysis
WHERE year = 2024
GROUP BY rotation_strategy;

-- Diversity and performance correlation
SELECT
    diversity_classification,
    AVG(sustainability_score) as avg_sustainability,
    AVG(unique_crops) as avg_crop_count,
    AVG(rotation_pct) as avg_rotation_pct
FROM dbt_marts.crop_rotation_analysis
GROUP BY diversity_classification
ORDER BY avg_sustainability DESC;

Troubleshooting

Connection Issues

  • Verify Trino is running and accessible
  • Check credentials in profiles.yml
  • Ensure catalog and schema exist

Compilation Errors

  • Run dbt compile to see detailed error messages
  • Check that all source tables exist in the dbt schema
  • Verify Trino SQL syntax compatibility

Performance

  • Consider adding indexes on foreign key columns
  • Adjust thread count in profiles.yml
  • Use dbt run --select for specific models

Advanced Features Included

  • ✅ Sophisticated window functions (rolling averages, year-over-year comparisons, rankings)
  • ✅ Complex transformation logic (diversity indices, rotation patterns)
  • ✅ Comprehensive data quality tests (38 tests across sources and models)
  • ✅ Performance tier classifications and percentile rankings
  • ✅ Sustainability scoring algorithms

Potential Enhancements

  • Add incremental models for near real-time updates
  • Implement custom macros for common agriscience calculations
  • Create materialized views for frequently accessed aggregations
  • Add snapshot models for slowly changing dimensions
  • Build dashboard-ready models optimized for specific BI tools

Support

For dbt documentation: https://docs.getdbt.com/ For Trino SQL reference: https://trino.io/docs/current/


Project: Trino dbt Demo 0.1 Data Context: Agriscience Source Type: Trino Models: 8 (5 staging + 3 marts) Tests: 38 Source Rows: 1,000,000+

About

Trino dbt project for agriscience data analytics with 1M+ rows of synthetic data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published