This project provides a complete big data development environment using Docker Compose. It includes:
- Hadoop (HDFS & YARN) for distributed storage and processing
- Hive for SQL-like querying on large datasets
- Pig for data flow scripts and transformations
- Hue as a web-based GUI to interact with Hadoop, Hive, and Pig
- Jupyter Notebook for running data analysis and machine learning pipelines in Python
It is ideal for data scientists, engineers, and students looking to explore big data technologies in a local and containerized setup.
Hadoop is the foundation of the ecosystem, providing distributed storage and processing capabilities. HDFS (Hadoop Distributed File System) enables scalable, fault-tolerant data storage across nodes, while YARN schedules and manages the cluster resources used to process that data.
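As a quick illustration, here is a minimal sketch of talking to HDFS from Python over WebHDFS. It assumes the `hdfs` PyPI package is installed and that the NameNode's WebHDFS endpoint is exposed on `localhost:9870` (the same port as the HDFS UI listed below); the user name and paths are made up for illustration, and actual reads/writes also require the DataNode hostnames to be resolvable from where the script runs.

```python
# Minimal sketch: interacting with HDFS over WebHDFS from Python.
# Assumes the `hdfs` PyPI package is installed and the NameNode's WebHDFS
# endpoint is reachable on localhost:9870; user and paths are illustrative.
from hdfs import InsecureClient

client = InsecureClient("http://localhost:9870", user="hdfs")

# Create a directory and upload a local CSV file into HDFS.
client.makedirs("/data/raw")
client.upload("/data/raw/sales.csv", "sales.csv", overwrite=True)

# List the directory and stream part of the file back.
print(client.list("/data/raw"))
with client.read("/data/raw/sales.csv", encoding="utf-8") as reader:
    print(reader.read()[:200])
```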
Hive enables querying and managing large datasets stored in HDFS using a SQL-like language (HiveQL). It relies on a Metastore, backed by PostgreSQL in this setup, for schema and metadata storage.
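For example, a hedged sketch of running HiveQL from outside the containers with PyHive, assuming HiveServer2 is exposed on `localhost:10000` (see the access points below) and requires no authentication; the `sales` table and its columns are invented for illustration.

```python
# Minimal sketch: running HiveQL against HiveServer2 from Python with PyHive.
# Assumes HiveServer2 is exposed on localhost:10000 with no authentication;
# table and column names are illustrative.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="hive")
cur = conn.cursor()

# Define an external table over a CSV directory in HDFS, then query it.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales (
        id INT, product STRING, amount DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/raw'
""")
cur.execute("SELECT product, SUM(amount) FROM sales GROUP BY product")
for product, total in cur.fetchall():
    print(product, total)
```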
Pig is a high-level platform for creating MapReduce programs using a scripting language called Pig Latin. It’s particularly useful for data transformations and ETL tasks.
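The sketch below shows what a small Pig Latin ETL script might look like and how it could be launched from Python. It assumes the `pig` command is available on the PATH wherever it runs (which depends on the container image); the paths and schema are illustrative.

```python
# Minimal sketch: generating and running a small Pig Latin ETL script from Python.
# Assumes the `pig` CLI is on PATH in the container where this runs (image-dependent);
# input/output paths and the schema are illustrative.
import subprocess

PIG_SCRIPT = """
raw      = LOAD '/data/raw/sales.csv' USING PigStorage(',')
           AS (id:int, product:chararray, amount:double);
filtered = FILTER raw BY amount > 100.0;
by_prod  = GROUP filtered BY product;
totals   = FOREACH by_prod GENERATE group AS product, SUM(filtered.amount) AS total;
STORE totals INTO '/data/out/sales_totals' USING PigStorage(',');
"""

with open("sales_totals.pig", "w") as f:
    f.write(PIG_SCRIPT)

# Run against the cluster (use "-x local" to test without Hadoop).
subprocess.run(["pig", "-x", "mapreduce", "-f", "sales_totals.pig"], check=True)
```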
Hue is an open-source analytics workbench for querying and visualizing data. It integrates with Hive and Pig, providing a user-friendly GUI for writing queries and managing data.
Jupyter provides an interactive Python environment where users can access Hadoop and Hive using libraries like PyHive and interact with data using familiar tools such as pandas and matplotlib.
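A minimal notebook-cell sketch follows, assuming PyHive, pandas, and matplotlib are installed in the Jupyter image and that HiveServer2 is reachable from the notebook container under its Compose service name (`hive-server` below is a placeholder; adjust it to the actual service name). The `sales` table is the illustrative one from the Hive sketch above.

```python
# Minimal notebook-cell sketch: pulling Hive results into pandas and plotting.
# Assumes PyHive, pandas, and matplotlib are available in the Jupyter container;
# "hive-server" is a placeholder for the actual Compose service name.
import pandas as pd
import matplotlib.pyplot as plt
from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, username="hive")

# Read a HiveQL result set straight into a DataFrame.
df = pd.read_sql("SELECT product, SUM(amount) AS total FROM sales GROUP BY product", conn)
df.columns = ["product", "total"]  # normalize names; Hive may return table-qualified column names

df.plot(kind="bar", x="product", y="total", legend=False)
plt.ylabel("total amount")
plt.show()
```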
Key features:
- Full local simulation of a big data ecosystem
- Interactive Jupyter notebooks for development and experimentation
- Web-based access to Hive, HDFS, and Pig via Hue
- Persistent volumes to retain data between container restarts
- Easily extensible to include Spark, Airflow, Superset, or other components
Typical use cases:
- Learning and practicing big data tools
- Prototyping data pipelines in a simulated Hadoop cluster
- Building and testing data science models that rely on distributed data
- Teaching Hadoop ecosystem components in classroom or workshop settings
Basic usage:
- Start the containers using Docker Compose (a minimal command sketch follows this list).
- Access HDFS via the web UI to upload or view files.
- Use Hue to write and run Hive or Pig queries.
- Launch Jupyter to run Python-based analytics.
- Store all notebooks in the `notebooks` folder.
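For the first step, a minimal sketch (kept in Python like the other examples) that assumes `docker-compose` is installed and is run from the directory containing the compose file:

```python
# Minimal sketch: start the stack and list the running services.
# Assumes docker-compose is installed and the compose file is in the current directory.
import subprocess

subprocess.run(["docker-compose", "up", "-d"], check=True)  # start all containers in the background
subprocess.run(["docker-compose", "ps"], check=True)        # show container status
```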
Service access points:
- Hue GUI: available at `http://localhost:8888`
- Jupyter Notebook: available at `http://localhost:8889`
- HDFS UI (NameNode): available at `http://localhost:9870`
- HiveServer2 (JDBC): port `10000` (for external connections)
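As a quick sanity check that the stack is up, a small sketch (assuming the `requests` package is available on the host) that probes the web UIs listed above:

```python
# Minimal sketch: verify the web UIs respond after `docker-compose up`.
# Assumes the `requests` package is installed; ports match the access points above.
import requests

services = {
    "Hue": "http://localhost:8888",
    "Jupyter": "http://localhost:8889",
    "HDFS UI": "http://localhost:9870",
}

for name, url in services.items():
    try:
        status = requests.get(url, timeout=5).status_code
        print(f"{name:<8} {url} -> HTTP {status}")
    except requests.RequestException:
        print(f"{name:<8} {url} -> not reachable")
```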
Requirements:
- Docker
- Docker Compose
- At least 6 GB of RAM available for Docker
Possible future improvements:
- Add Spark integration for distributed computing
- Connect Superset for data visualization
- Incorporate Airflow for workflow orchestration
- Enable secure access via HTTPS and authentication layers