An end-to-end data engineering platform that processes 9.5M+ NYC taxi records through batch (Airflow) and streaming (Kafka) pipelines, with a Medallion Architecture, automated data quality checks, a star schema warehouse, Z-score anomaly detection, and an interactive Streamlit dashboard.
| Resource | Link |
|---|---|
| 📊 Analytics Dashboard | taxipulse-srujankothuri.streamlit.app |
| 💻 GitHub Repository | github.com/srujankothuri/TaxiPulse |
NYC TLC Data ──┬── Batch Path (Airflow) ──┐
               │                          ├── MinIO (Bronze) ── Quality Checks
               └── Stream Path (Kafka) ───┘                           │
                                                                      ▼
                                                         PostgreSQL (Silver → Gold)
                                                                      │
                                                           ┌──────────┼──────────┐
                                                           ▼          ▼          ▼
                                                      Star Schema  Anomaly   Streamlit
                                                       Warehouse  Detection  Dashboard
                                                                  + Alerts
- Dual Ingestion: Batch (Airflow) + Real-time streaming (Kafka) pipelines
- Medallion Architecture: Bronze → Silver → Gold data layers
- Automated Data Quality: 18 validation checks with quarantine system (98.6% pass rate)
- Star Schema Warehouse: Fact table + 5 dimensions + pre-computed aggregations
- Anomaly Detection: Z-score based fare/volume spike detection (3,623 anomalies found)
- Slack Alerting: Real-time notifications for critical anomalies
- Interactive Dashboard: 4-page Streamlit app with 15+ charts
- Fully Containerized: 8 Docker services, one `docker-compose up` to run everything
- Tested: 38 pytest tests with GitHub Actions CI/CD
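
The quality-check-plus-quarantine idea from the feature list can be illustrated with a minimal sketch. The three checks, the field names (`fare_amount`, `trip_distance`, `pickup_ts`, `dropoff_ts`), and the `validate` helper are assumptions made for illustration; the project's actual 18 checks are not listed in this README and may work differently.

```python
from typing import Callable

# Hypothetical expectations: each one maps a record to True (pass) or False (fail).
Check = Callable[[dict], bool]

CHECKS: dict[str, Check] = {
    "fare_non_negative": lambda r: r["fare_amount"] >= 0,
    "distance_positive": lambda r: r["trip_distance"] > 0,
    "dropoff_after_pickup": lambda r: r["dropoff_ts"] > r["pickup_ts"],
}


def validate(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into clean rows and quarantined rows.

    A quarantined row keeps its original fields plus the names of the
    checks it failed, so it can be inspected later.
    """
    clean, quarantined = [], []
    for record in records:
        failed = [name for name, check in CHECKS.items() if not check(record)]
        if failed:
            quarantined.append({**record, "failed_checks": failed})
        else:
            clean.append(record)
    return clean, quarantined


# Two made-up records: the second one fails two of the three checks.
records = [
    {"fare_amount": 12.5, "trip_distance": 2.1, "pickup_ts": 100, "dropoff_ts": 700},
    {"fare_amount": -3.0, "trip_distance": 0.0, "pickup_ts": 100, "dropoff_ts": 400},
]
clean, quarantined = validate(records)
pass_rate = len(clean) / len(records)
```

Keeping the failed check names on each quarantined row is one possible design; it makes it easy to count failures per check without re-running validation.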
| Metric | Value |
|---|---|
| Total records processed | 9,554,778 |
| Quality pass rate | 98.6% |
| Clean Silver records | 9,417,374 |
| Quarantined records | 137,403 |
| Anomalies detected | 3,623 (1,478 critical) |
| Star schema dimensions | 5 |
| Hourly aggregations | 240,716 |
| Docker services | 8 |
| Pytest tests | 38 |
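
For readers unfamiliar with Z-score detection, here is a minimal sketch of the technique using only the standard library. The threshold of 3.0, the sample values, and the `zscore_anomalies` helper are assumptions for illustration; the project itself is described as using scipy and numpy, and its thresholds and severity rules are not stated in this README.

```python
from statistics import mean, pstdev


def zscore_anomalies(values: list[float], threshold: float = 3.0) -> list[tuple[int, float, float]]:
    """Return (index, value, z) for every value whose |z| exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing can be flagged
    flagged = []
    for index, value in enumerate(values):
        z = (value - mu) / sigma
        if abs(z) > threshold:
            flagged.append((index, value, z))
    return flagged


# Made-up hourly average fares with one obvious spike at the end.
hourly_avg_fare = [
    11.8, 12.4, 12.1, 13.0, 12.6, 11.9, 12.2, 12.8, 12.5, 12.0,
    12.3, 12.7, 11.7, 12.9, 12.4, 12.1, 12.6, 12.2, 12.5, 48.0,
]
anomalies = zscore_anomalies(hourly_avg_fare)
flagged_indexes = [index for index, _, _ in anomalies]
```

Note that with very small samples a single outlier inflates the standard deviation enough that its own Z-score can never reach 3, so a sketch like this only makes sense over a reasonably long window.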
| Component | Technology |
|---|---|
| Orchestration | Apache Airflow |
| Streaming | Apache Kafka |
| Object Storage | MinIO (S3-compatible) |
| Data Warehouse | PostgreSQL |
| Data Quality | Custom Python Engine (18 checks) |
| Anomaly Detection | Python (scipy, numpy; Z-score) |
| Alerting | Slack Webhooks |
| Containerization | Docker + Docker Compose |
| Visualization | Streamlit + Plotly |
| Testing | pytest + GitHub Actions CI/CD |
| Language | Python 3.11+ |
TaxiPulse/
├── airflow/                # Airflow DAGs (batch 7 tasks + streaming 2 tasks)
│   └── dags/
├── ingestion/              # Data ingestion (batch + Kafka streaming)
│   ├── batch/
│   └── streaming/
├── transformations/        # Bronze → Silver → Gold transformations
│   ├── bronze/
│   ├── silver/
│   └── gold/
├── quality/                # Data quality engine (18 expectations)
│   └── expectations/
├── anomaly_detection/      # Z-score anomaly detection + Slack alerting
├── streamlit_app/          # 4-page monitoring dashboard
│   ├── pages/
│   └── data/               # Exported CSVs for cloud deployment
├── scripts/                # Pipeline runner scripts
├── tests/                  # 38 pytest tests (4 modules)
├── docker/                 # Dockerfile for Airflow
├── docs/                   # Documentation and screenshots
├── .github/workflows/      # GitHub Actions CI/CD
├── docker-compose.yml      # 8-service Docker infrastructure
├── Makefile                # Convenience commands
└── README.md
- Docker Desktop (4GB+ RAM)
- Python 3.11+
# 1. Clone
git clone https://github.com/srujankothuri/TaxiPulse.git
cd TaxiPulse
# 2. Create virtual environment
python -m venv venv
source venv/bin/activate # Mac/Linux
# 3. Install dependencies
pip install -r requirements.txt
# 4. Configure
cp .env.example .env
# Edit .env: set MINIO_ENDPOINT=localhost:9000, POSTGRES_HOST=localhost, KAFKA_BOOTSTRAP_SERVERS=localhost:29092
# 5. Start infrastructure
docker-compose up -d
# 6. Run complete pipeline (~50 min)
make pipeline
# 7. Load zone names
python scripts/load_zone_names.py
# 8. Launch dashboard
python -m streamlit run streamlit_app/app.py

| Service | URL | Credentials |
|---|---|---|
| Streamlit Dashboard | http://localhost:8501 | none |
| Airflow UI | http://localhost:8080 | admin / admin |
| MinIO Console | http://localhost:9001 | taxipulse / taxipulse123 |
| PostgreSQL | localhost:5432 | taxipulse / taxipulse123 |
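
The `.env` values from the configuration step (`MINIO_ENDPOINT`, `POSTGRES_HOST`, `KAFKA_BOOTSTRAP_SERVERS`) could be read in Python along the following lines. This is only a sketch: the `Settings` class, `load_settings`, and `split_host_port` are hypothetical names, and the repository may load its configuration differently. The defaults mirror the local values given in the setup steps.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Connection settings taken from environment variables (hypothetical helper)."""
    minio_endpoint: str
    postgres_host: str
    kafka_bootstrap_servers: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from `env` (defaults to os.environ), falling back to local defaults."""
    env = os.environ if env is None else env
    return Settings(
        minio_endpoint=env.get("MINIO_ENDPOINT", "localhost:9000"),
        postgres_host=env.get("POSTGRES_HOST", "localhost"),
        kafka_bootstrap_servers=env.get("KAFKA_BOOTSTRAP_SERVERS", "localhost:29092"),
    )


def split_host_port(endpoint: str) -> tuple[str, int]:
    """Split 'host:port' into its two parts."""
    host, _, port = endpoint.rpartition(":")
    return host, int(port)


# Passing an empty mapping shows the local defaults regardless of the real environment.
settings = load_settings({})
kafka_host, kafka_port = split_host_port(settings.kafka_bootstrap_servers)
```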
make help # Show all commands
make up # Start Docker services
make down # Stop Docker services
make pipeline # Run full batch pipeline
make streaming # Run Kafka streaming demo
make test # Run 38 tests
make dashboard # Launch Streamlit

                     ┌──────────────┐
                     │ dim_datetime │
                     └──────┬───────┘
                            │
┌──────────────────┐  ┌─────┴────┐  ┌──────────────────┐
│ dim_pickup_loc   ├──┤fact_trips├──┤ dim_dropoff_loc  │
└──────────────────┘  └──┬────┬──┘  └──────────────────┘
                         │    │
               ┌─────────┘    └─────────┐
               ▼                        ▼
       ┌────────────────┐      ┌──────────────────┐
       │  dim_payment   │      │  dim_rate_code   │
       └────────────────┘      └──────────────────┘
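
To make the star schema concrete, the sketch below builds the fact table and five dimensions in an in-memory SQLite database and runs one hourly aggregation. SQLite stands in for PostgreSQL here so the example is self-contained. Only the table names come from the diagram; every column name, key, and sample row is an assumption, not the project's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Table names follow the diagram; all columns are hypothetical.
conn.executescript("""
CREATE TABLE dim_datetime    (datetime_key INTEGER PRIMARY KEY, pickup_hour INTEGER);
CREATE TABLE dim_pickup_loc  (pickup_loc_key INTEGER PRIMARY KEY, zone TEXT);
CREATE TABLE dim_dropoff_loc (dropoff_loc_key INTEGER PRIMARY KEY, zone TEXT);
CREATE TABLE dim_payment     (payment_key INTEGER PRIMARY KEY, payment_type TEXT);
CREATE TABLE dim_rate_code   (rate_code_key INTEGER PRIMARY KEY, rate_code TEXT);

CREATE TABLE fact_trips (
    trip_id         INTEGER PRIMARY KEY,
    datetime_key    INTEGER REFERENCES dim_datetime(datetime_key),
    pickup_loc_key  INTEGER REFERENCES dim_pickup_loc(pickup_loc_key),
    dropoff_loc_key INTEGER REFERENCES dim_dropoff_loc(dropoff_loc_key),
    payment_key     INTEGER REFERENCES dim_payment(payment_key),
    rate_code_key   INTEGER REFERENCES dim_rate_code(rate_code_key),
    fare_amount     REAL
);

-- Made-up sample rows.
INSERT INTO dim_datetime    VALUES (1, 8), (2, 17);
INSERT INTO dim_pickup_loc  VALUES (1, 'Zone A');
INSERT INTO dim_dropoff_loc VALUES (1, 'Zone B');
INSERT INTO dim_payment     VALUES (1, 'card');
INSERT INTO dim_rate_code   VALUES (1, 'standard');
INSERT INTO fact_trips VALUES
    (1, 1, 1, 1, 1, 1, 10.0),
    (2, 1, 1, 1, 1, 1, 14.0),
    (3, 2, 1, 1, 1, 1, 30.0);
""")

# Hourly aggregation: join the fact table to one dimension and group.
hourly = conn.execute("""
    SELECT d.pickup_hour, COUNT(*) AS trips, AVG(f.fare_amount) AS avg_fare
    FROM fact_trips AS f
    JOIN dim_datetime AS d ON d.datetime_key = f.datetime_key
    GROUP BY d.pickup_hour
    ORDER BY d.pickup_hour
""").fetchall()
```

Queries against the other dimensions follow the same shape: join `fact_trips` to the dimension on its key and group by the attribute of interest.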
Batch: Download → MinIO → Bronze → Validate → Silver → Gold → Anomaly Detection
Streaming: Kafka Producer → Kafka Topic → Consumer → Validate → Silver
Both paths feed into the same Silver layer. Gold layer processes all data regardless of source.
This project is licensed under the MIT License; see the LICENSE file for details.
Venkata Srujan Kothuri
- GitHub: @srujankothuri
- LinkedIn: Connect with me











