👆 click the picture to see the presentation video!
This project aims to build an enterprise-grade offline data warehouse solution based on e-commerce platform order data. By leveraging Docker containers to simulate a big data platform, it achieves a complete workflow from ETL processing to data warehouse modeling, OLAP analysis, and data visualization.
The core value of this project lies in its implementation of enterprise-grade data warehouse modeling, integrating e-commerce order data with relevant business themes through standardized dimension modeling and fact table design, ensuring data accuracy, consistency, and traceability. Meanwhile, the deployment of a big data cluster via Docker containers simplifies environment management and operational costs, offering a flexible deployment model for distributed batch processing powered by Spark. Additionally, the project incorporates CI/CD automation, enabling rapid iterations while maintaining the stability and reliability of the data pipeline. Storage and computation are also highly optimized to maximize hardware resource utilization.
To monitor and manage the system effectively, a Grafana-based cluster monitoring system has been implemented, providing real-time insights into cluster health metrics and assisting in performance tuning and capacity planning. Finally, by integrating business intelligence (BI) and visualization solutions, the project transforms complex data warehouse analytics into intuitive dashboards and reports, allowing business teams to make data-driven decisions more efficiently.
By combining these critical features—including:
| ✅ Core Feature | 🔥 Core Highlights | 📦 Deliverables |
|---|---|---|
| 1. Data Warehouse Modeling and Documentation | - Full dimensional modeling process (Star Schema / Snowflake Schema) - Standardized development norms (ODS/DWD/DWM/DWS/DWT/ADS six-layer modeling) - Business Matrix: defining & managing dimensions & fact tables |
- Data warehouse design document (Markdown) - Hive SQL modeling code - DWH Dimensional Modelling Architecture Diagram |
| 2. A Self-Built Distributed Big Data Platform | - Fully containerized deployment with Docker for quick replication - High-availability environment: Hadoop + Hive + Spark + Zookeeper + ClickHouse |
- Docker images (Open sourced on GitHub Container Registry) - docker-compose.yml (one-click cluster startup) - Infra Configuration Files for Cluster: Hadoop, ZooKeeper, Hive, MySql, Spark, Prometheus&Grafana, Airflow - Container Internal Scripts: Hadoop, ZooKeeper, Hive, MySql, Spark, Prometheus&Grafana, Airflow - Common Used Snippets for Cluster: Hadoop, ZooKeeper, Hive, MySql, Spark, Prometheus&Grafana, Airflow |
| 3. Distributed Batch Processing | - ETL processing using PySpark - data ETL job: OLTP to DWH && DWH to OLAP - Data Warehouse internal processing: ODS → DWD → DIM/DWM → DWS → ADS - batch processing job scheduler using Airflow |
PySpark and Spark SQL Code - Code - Data Pipeline (OLTP -> DWH, DWH -> OLAP) - Code - Batch Processing (DWH Internal Transform) - Code - Scheduling based on Airflow (DAGs) |
| 4. CI/CD Automation | - Automated data platform cluster launching and stop | - GitHub Actions workflow pipeline .yaml - CI/CD code and documentation - Sample log screenshots |
| 5. Storage & Computation Optimization | - SQL optimization (dynamic partitioning, indexing, storage partitioning) - Spark tuning: Salting, Skew Join Hint, Broadcast Join, reduceByKey vs. groupByKey - Hive tuning: Z-Order sorting (boost ClickHouse queries), Parquet + Snappy compression |
- Pre & post optimization performance comparison - Spark optimization code - SQL execution plan screenshots |
| 6. DevOps - Monitoring and Alerting | - Prometheus + Grafana for performance monitoring Hadoop Cluster / MySQL - AlertManager for alerting and email receiving |
- Code - Monitoring Services Configuaration Files: Prometheus, Grafana, AlertManager - Code - Monitoring Services Start&Stop Scripts: Prometheus, Grafana, AlertManager - Code - Container Metrics Exporter Start&Stop Scripts: my-start-node-exporter.sh & my-stop-node-exporter.sh - Key Screenshots |
| 7. Business Intelligence & Visualization | - PowerBI dashboards for data analysis - Real business-driven visualizations - Providing actionable business insights |
- PowerBI visualization screenshots - PowerBI .pbix file - Key business metric explanations (BI Insights) |
this project delivers a professional, robust, and highly efficient solution for enterprises dealing with large-scale data processing and analytics.
This project demonstrates my ability to build a data warehouse from the ground up following enterprise-grade standards. I independently designed and documented a complete SOP for data warehouse development, covering every critical step in the modeling roadmap. From initial business data research to final model delivery, I established a standardized methodology that ensures clarity, scalability, and maintainability. The SOP includes detailed best practices on data warehouse layering, table naming conventions, field naming rules, and lifecycle management for warehouse tables. For more information, please refer to the documentation below.
🔗 Click to Show DWH Dimensional Modelling Documents and Code
Data Warehouse Development Specification
- Data Warehouse Layering Specification
- Table Naming Conventions
- Data Warehouse Column Naming Conventions
- Data Table Lifecycle Management Specification
- DWH Modelling Architecture Diagram
🔨 Code - Hive DDL(for Data Warehouse All Layers including ods, dwd, dwm, dws, dwt, dim (Operational Data Storage, DW detail, DW middle, DW summary, DW theme, DW Dimension, Analytical Data Storage-CK)
Figure 1: DWH Dimensional Modelling SOP
Figure 2: DWH Dimensional Modelling Methodology Diagram
Figure 3: DWH Dimensional Modelling Architecture
This distributed data platform was built entirely from scratch by myself. Starting with a base Ubuntu 20.04 docker image, I manually installed and configured each component step by step, ultimately creating a fully functional three-node Hadoop cluster with distributed storage and computing capabilities. The platform is fully containerized, featuring a highly available HDFS and YARN architecture. It supports Hive for data warehousing, Spark for distributed computing, Airflow for workflow orchestration, and Prometheus + Grafana for performance monitoring. A MySQL container manages metadata for both Hive and Airflow and is also monitored by Prometheus. An Oracle container simulates the backend of a business system and serves as a data source for the data warehouse. All container images are open-sourced and published to 🔨 GitHub Container Registry, making it easy for anyone to deploy the same platform locally.
Figure 1: All Containers Window
Figure 2: Data Platform Architecture
This project implements a robust distributed batch processing architecture using PySpark for computation and Apache Airflow for orchestration. The batch layer focuses on high-throughput, scalable ETL workflows and integrates seamlessly with the overall data warehouse design. The core functionalities are structured as follows:
A PySpark-based incremental extraction process is used to ingest new records from the Oracle OLTP database into the data warehouse. Additionally, downstream scripts handle the transformation and export of analytical and result-layer datasets from the data warehouse into external OLAP systems, enabling fast access by BI tools (e.g., Power BI, Tableau).
Multi-stage transformations are implemented using Spark SQL within PySpark jobs to process data across warehouse layers, such as: ODS (Operational Data Store) → DWD (Data Warehouse Detail) and DWD → DIM (Dimension Tables). These transformations ensure structured, cleaned, and query-optimized data for analytical use cases.
The entire batch workflow is automated via Apache Airflow, with DAGs scheduled to run nightly at 2:00 AM. The scheduler coordinates the extraction, transformation, and loading tasks, handles dependencies, and ensures timely creation of new partitions and ingestion of the latest data into the warehouse.
Figure 1: ETL Data Pipeline
Figure 2: Airflow Web UI
- GitHub Actions Code
- Key Screenshots
Figure 1: Data platform launching and stop automation
Figure 2: Sample Log Screenshot I
Figure 3: Sample Log Screenshot II
🔨 Code - Monitoring Services Configuaration Files: Prometheus, Grafana, AlertManager
🔨 Code - Monitoring Services Start&Stop Scripts: Prometheus, Grafana, AlertManager
Figure 1: Prometheus
Figure 2: Grafana-Hadoop-Cluster-instance-hadoop-master
Figure 3: Grafana-MySQLD
🔗 Link - PowerBI Public Access(Expirable)
Use Microsoft PowerBI connect to the Clickhouse and extract the analytical data storage layer

Figure 1: PowerBI Dashboard Demo
This project sets up a high-availability big data platform, including the following components:
| Components | Features | Version |
|---|---|---|
| Apache Hadoop | Big Data Distributed Framework | 3.2.4 |
| Apache Zookeeper | High Availability | 3.8.4 |
| Apache Spark | Distributed Computing | 3.3.0 |
| Apache Hive | Data Warehousing | 3.1.3 |
| Apache Airflow | Workflow Scheduling | 2.7.2 |
| MySQL | Metastore | 8.0.39 |
| Oracle Database | Workflow Scheduling | 19.0.0 |
| Azure Cloud ClickHouse | OLAP Analysis | 24.12 |
| Microsoft PowerBI | BI Dashboard | latest |
| Prometheus | Monitoring | 2.52.0 |
| Grafana | Monitoring GUI | 10.3.1 |
| Docker | Containerization | 28.0.1 |
/bigdata-datawarehouse-project
│── /.github/workflows # CI/CD automation workflows via GitHub Actions
│── /docs # docs (all business and technologies documents about this project)
│── /src
│── /data_pipeline # ETL flow: OLTP2DWH & DWH2OLAP
│── /warehouse_modeling # DWH modelling(Hive SQL etc.)
│── /batch_processing # Data Batch processing (PySpark + SparkSQL)
│── /scheduler # Task Scheduler(Airflow DAGs)
│── /infra # infrastructure deployment(Docker, configuration files)
│── /snippets # common used commands and snippets
│── /scripts # container internal shell scripts
│── /bi # PowerBI Dashboard pbix file
│── /README # Source Code Use Instruction Markdown Files
│── README.md # Navigation of Source Code Use Instruction
│── main_data_pipeline.py # **main entry point for the data pipeline module
│── main_batch_processing.py # **main entry point for the batch processing module
│── /tests # all small features unit testing snippets (DWH modelling, data pipeline, dags etc.)
│── README.md # Introduction about project
│── docker-compose-bigdata.yml # Docker Compose to launch the docker cluster
│── .env # `public the .env on purpose` for docker-compose file use
│── .gitignore # Git ignore some directory not to be committed to the remote repo
│── .gitattributes # Git repository attributes config
│── LICENSE # COPYRIGHT for this project
│── mysql-metadata-restore.sh # container operational level scripts: restore mysql container metadata
│── mysql-metastore-dump.sh # container operational level scripts: dump mysql container metadata
│── push-to-ghcr.sh # container operational level scripts: push the images to GitHub Container Registry
│── start-data-clients.sh # container operational level scripts: start hive, spark etc
│── start-hadoop-cluster.sh # container operational level scripts: start hadoop HA cluster
│── start-other-services.sh # container operational level scripts: start airflow, prometheus, grafana etc
│── stop-data-clients.sh # container operational level scripts: stop hive, spark etc
│── stop-hadoop-cluster.sh # container operational level scripts: stop hadoop HA cluster
│── stop-other-services.sh # container operational level scripts: stop airflow, prometheus, grafana etc-
Data Warehouse Development Specification
-
SQL Development Specification
-
Troubleshooting: NodeManager Disk Space Issue Preventing YARN Registration
-
Troubleshooting: Error Handling Log: YARN Web UI Log Loading Failures
-
Troubleshooting: Spark Task - Issue with Writing ODS Layer Avro Data to HDFS but Hive Cannot Read
-
Troubleshooting: JSON-like Dictionary Representation in Python Script Causes Execution Failure
-
Troubleshooting: Resolving "Unrecognized column type: DATE_TYPE" Issue on ods_orders_ipd Table
-
Troubleshooting Hive Unable to Read Parquet Files Written by Spark SQL
-
Troubleshooting NameNode Startup Failure in Hadoop HA Environment
-
Preventing SIGHUP from Killing Background Processes When Using Docker
exec -it
This project is licensed under the MIT License - see the LICENSE file for details.
Created and maintained by Smars-Bin-Hu.


