This project provides a complete big data development environment using Docker Compose. It includes:
- Hadoop (HDFS & YARN) for distributed storage and processing
- Hive for SQL-like querying on large datasets
- Pig for data flow scripts and transformations
- Hue as a web-based GUI to interact with Hadoop, Hive, and Pig
- Jupyter Notebook for running data analysis and machine learning pipelines in Python
It is ideal for data scientists, engineers, and students looking to explore big data technologies in a local and containerized setup.
Hadoop is the foundation of the ecosystem, providing distributed storage and processing capabilities. HDFS (Hadoop Distributed File System) enables scalable, fault-tolerant data storage across nodes, while YARN schedules and manages the cluster resources used to process that data.
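As a quick illustration, here is a minimal sketch of talking to HDFS from Python over WebHDFS. It assumes the `hdfs` PyPI package is installed and that the NameNode's WebHDFS endpoint is exposed on `localhost:9870` (the same port as the HDFS UI listed below); the user name and paths are made up for illustration, and actual reads/writes also require the DataNode hostnames to be resolvable from where the script runs.

```python
# Minimal sketch: interacting with HDFS over WebHDFS from Python.
# Assumes the `hdfs` PyPI package is installed and the NameNode's WebHDFS
# endpoint is reachable on localhost:9870; user and paths are illustrative.
from hdfs import InsecureClient

client = InsecureClient("http://localhost:9870", user="hdfs")

# Create a directory and upload a local CSV file into HDFS.
client.makedirs("/data/raw")
client.upload("/data/raw/sales.csv", "sales.csv", overwrite=True)

# List the directory and stream part of the file back.
print(client.list("/data/raw"))
with client.read("/data/raw/sales.csv", encoding="utf-8") as reader:
    print(reader.read()[:200])
```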
Hive enables querying and managing large datasets stored in HDFS using a SQL-like language (HiveQL). It relies on a Metastore, backed by PostgreSQL in this setup, for schema and metadata storage.
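For example, a hedged sketch of running HiveQL from outside the containers with PyHive, assuming HiveServer2 is exposed on `localhost:10000` (see the access points below) and requires no authentication; the `sales` table and its columns are invented for illustration.

```python
# Minimal sketch: running HiveQL against HiveServer2 from Python with PyHive.
# Assumes HiveServer2 is exposed on localhost:10000 with no authentication;
# table and column names are illustrative.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="hive")
cur = conn.cursor()

# Define an external table over a CSV directory in HDFS, then query it.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales (
        id INT, product STRING, amount DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/raw'
""")
cur.execute("SELECT product, SUM(amount) FROM sales GROUP BY product")
for product, total in cur.fetchall():
    print(product, total)
```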
Pig is a high-level platform for creating MapReduce programs using a scripting language called Pig Latin. It’s particularly useful for data transformations and ETL tasks.
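The sketch below shows what a small Pig Latin ETL script might look like and how it could be launched from Python. It assumes the `pig` command is available on the PATH wherever it runs (which depends on the container image); the paths and schema are illustrative.

```python
# Minimal sketch: generating and running a small Pig Latin ETL script from Python.
# Assumes the `pig` CLI is on PATH in the container where this runs (image-dependent);
# input/output paths and the schema are illustrative.
import subprocess

PIG_SCRIPT = """
raw      = LOAD '/data/raw/sales.csv' USING PigStorage(',')
           AS (id:int, product:chararray, amount:double);
filtered = FILTER raw BY amount > 100.0;
by_prod  = GROUP filtered BY product;
totals   = FOREACH by_prod GENERATE group AS product, SUM(filtered.amount) AS total;
STORE totals INTO '/data/out/sales_totals' USING PigStorage(',');
"""

with open("sales_totals.pig", "w") as f:
    f.write(PIG_SCRIPT)

# Run against the cluster (use "-x local" to test without Hadoop).
subprocess.run(["pig", "-x", "mapreduce", "-f", "sales_totals.pig"], check=True)
```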
Hue is an open-source analytics workbench for querying and visualizing data. It integrates with Hive and Pig, providing a user-friendly GUI for writing queries and managing data.
Jupyter provides an interactive Python environment where users can access Hadoop and Hive using libraries like PyHive and interact with data using familiar tools such as pandas and matplotlib.
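A minimal notebook-cell sketch follows, assuming PyHive, pandas, and matplotlib are installed in the Jupyter image and that HiveServer2 is reachable from the notebook container under its Compose service name (`hive-server` below is a placeholder; adjust it to the actual service name). The `sales` table is the illustrative one from the Hive sketch above.

```python
# Minimal notebook-cell sketch: pulling Hive results into pandas and plotting.
# Assumes PyHive, pandas, and matplotlib are available in the Jupyter container;
# "hive-server" is a placeholder for the actual Compose service name.
import pandas as pd
import matplotlib.pyplot as plt
from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, username="hive")

# Read a HiveQL result set straight into a DataFrame.
df = pd.read_sql("SELECT product, SUM(amount) AS total FROM sales GROUP BY product", conn)
df.columns = ["product", "total"]  # normalize names; Hive may return table-qualified column names

df.plot(kind="bar", x="product", y="total", legend=False)
plt.ylabel("total amount")
plt.show()
```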
Key features:
- Full local simulation of a big data ecosystem
- Interactive Jupyter notebooks for development and experimentation
- Web-based access to Hive, HDFS, and Pig via Hue
- Persistent volumes to retain data between container restarts
- Easily extensible to include Spark, Airflow, Superset, or other components
Typical use cases:
- Learning and practicing big data tools
- Prototyping data pipelines in a simulated Hadoop cluster
- Building and testing data science models that rely on distributed data
- Teaching Hadoop ecosystem components in classroom or workshop settings
Basic usage:
- Start the containers using Docker Compose (a minimal command sketch follows this list).
- Access HDFS via the web UI to upload or view files.
- Use Hue to write and run Hive or Pig queries.
- Launch Jupyter to run Python-based analytics.
- Store all notebooks in the `notebooks` folder.
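For the first step, a minimal sketch (kept in Python like the other examples) that assumes `docker-compose` is installed and is run from the directory containing the compose file:

```python
# Minimal sketch: start the stack and list the running services.
# Assumes docker-compose is installed and the compose file is in the current directory.
import subprocess

subprocess.run(["docker-compose", "up", "-d"], check=True)  # start all containers in the background
subprocess.run(["docker-compose", "ps"], check=True)        # show container status
```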
Service access points:
- Hue GUI: available at `http://localhost:8888`
- Jupyter Notebook: available at `http://localhost:8889`
- HDFS UI (NameNode): available at `http://localhost:9870`
- HiveServer2 (JDBC): port `10000` (for external connections)
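As a quick sanity check that the stack is up, a small sketch (assuming the `requests` package is available on the host) that probes the web UIs listed above:

```python
# Minimal sketch: verify the web UIs respond after `docker-compose up`.
# Assumes the `requests` package is installed; ports match the access points above.
import requests

services = {
    "Hue": "http://localhost:8888",
    "Jupyter": "http://localhost:8889",
    "HDFS UI": "http://localhost:9870",
}

for name, url in services.items():
    try:
        status = requests.get(url, timeout=5).status_code
        print(f"{name:<8} {url} -> HTTP {status}")
    except requests.RequestException:
        print(f"{name:<8} {url} -> not reachable")
```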
Requirements:
- Docker
- Docker Compose
- At least 6 GB of RAM available for Docker
Possible future improvements:
- Add Spark integration for distributed computing
- Connect Superset for data visualization
- Incorporate Airflow for workflow orchestration
- Enable secure access via HTTPS and authentication layers