This ETL Pipeline project is designed to extract, transform, and load data using Apache Airflow for orchestration. The project follows best practices in software engineering, including a well-defined structure, linting, security measures, and testing.
jp-airflow-test
├── dags
│ └── etl_workflow.py # Airflow DAG for orchestrating ETL tasks
├── docs
│ └── index.md # Project documentation
├── etl
│ ├── extract.py # Module for data extraction
│ ├── transform.py # Module for data transformation
│ ├── load.py # Module for loading data
│ └── __init__.py # Package initialization
├── tests
│ ├── unit
│ │ ├── test_extract.py # Unit tests for data extraction
│ │ ├── test_transform.py # Unit tests for data transformation
│ │ └── test_load.py # Unit tests for data loading
│ └── e2e
│ └── test_etl_workflow.py # End-to-end tests for the ETL workflow
├── scripts
│ └── run_etl.py # Script to run the ETL process manually
├── Makefile # Makefile for project commands
├── README.md # Repository documentation
├── CODE_OF_CONDUCT.md # Organisation documentation
├── CODEOWNERS # Defines code owner(s)
└── CONTRIBUTING.md # Repository documentation
This project uses Poetry for Python package dependency management, installing requirements, and provisioning virtual environments. If Poetry is not already installed, run:
pip install poetry
Clone the Repository
git clone <repository-url>
cd jp-airflow-test
Install Dependencies
make install-dependencies
Set Up Virtual Environment
make activate-virtual-env
Run Tests
- Unit tests:
make test-unit
- End-to-end tests:
make test-e2e
- All tests:
make test
Linting and Security Checks
make lint
make security
Run the ETL process manually using the provided script
make run
Run Airflow locally:
make get-airflow-docker-compose-yaml
make local-airflow-start
make local-airflow-stop
Note:
- If you change your ETL code, restart Airflow to reload the DAG.
- For troubleshooting, check the Airflow UI for import errors or task logs.
The ETL pipeline consists of three main components:
- Extract: The extract_data function retrieves data from the source.
- Transform: The transform_data function processes the extracted data.
- Load: The load_data function loads the transformed data into the target destination.
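The module contents are not reproduced here, but a minimal sketch of the three stages (hypothetical signatures and sample data — the real modules in etl/ may use different sources, schemas, and destinations) could look like:

```python
# Hypothetical sketches of the three ETL stages; the actual etl/ modules
# may read from real sources and write to real destinations.

def extract_data():
    """Retrieve raw records from the source (a hard-coded sample here)."""
    return [{"id": 1, "value": " 10 "}, {"id": 2, "value": " 25 "}]

def transform_data(records):
    """Clean the extracted records: strip whitespace, cast values to int."""
    return [{"id": r["id"], "value": int(r["value"].strip())} for r in records]

def load_data(records, target):
    """Append the transformed records to the target (a list here) and
    report how many were loaded."""
    target.extend(records)
    return len(records)
```

Chained together, these mirror the order the DAG enforces: extract, then transform, then load.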
The Airflow DAG defined in dags/etl_workflow.py orchestrates these tasks, ensuring they run in the correct order.
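As a rough sketch of what such a DAG can look like (assuming Airflow 2.x and PythonOperator-based tasks — the actual dags/etl_workflow.py may be structured differently):

```python
# Hypothetical sketch of dags/etl_workflow.py; the real DAG may differ.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from etl.extract import extract_data
from etl.load import load_data
from etl.transform import transform_data

with DAG(
    dag_id="etl_workflow",
    start_date=datetime(2024, 1, 1),
    schedule=None,   # trigger manually rather than on a cron schedule
    catchup=False,   # do not backfill past runs
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_data)
    transform = PythonOperator(task_id="transform", python_callable=transform_data)
    load = PythonOperator(task_id="load", python_callable=load_data)

    # Enforce the extract -> transform -> load ordering
    extract >> transform >> load
```

The `>>` operator is what guarantees the tasks run in the correct order; Airflow will not start `transform` until `extract` succeeds.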
- Code Quality: The project uses Flake8 for linting and pre-commit hooks to enforce code quality checks.
- Testing: Comprehensive unit and end-to-end tests are included to ensure the reliability of the ETL process.
- Documentation: Clear documentation is provided to facilitate understanding and usage of the project.
This project is licensed under the MIT License. See the LICENSE file for more details.