jp-airflow-test

This ETL Pipeline project is designed to extract, transform, and load data using Apache Airflow for orchestration. The project follows best practices in software engineering, including a well-defined structure, linting, security measures, and testing.

Project Structure

jp-airflow-test
├── dags
│   └── etl_workflow.py          # Airflow DAG for orchestrating ETL tasks
├── docs
│   └── index.md                 # Project documentation
├── etl
│   ├── extract.py               # Module for data extraction
│   ├── transform.py             # Module for data transformation
│   ├── load.py                  # Module for loading data
│   └── __init__.py              # Package initialization
├── tests
│   ├── unit
│   │   ├── test_extract.py      # Unit tests for data extraction
│   │   ├── test_transform.py    # Unit tests for data transformation
│   │   └── test_load.py         # Unit tests for data loading
│   └── e2e
│       └── test_etl_workflow.py # End-to-end tests for the ETL workflow
├── scripts
│   └── run_etl.py               # Script to run the ETL process manually
├── Makefile                     # Makefile for project commands
├── README.md                    # Repository documentation
├── CODE_OF_CONDUCT.md           # Organisation documentation
├── CODEOWNERS                   # Defines code owner(s)
└── CONTRIBUTING.md              # Repository documentation

Makefile

This project makes use of a Makefile as a simple way to organise common commands/tasks.

Poetry

This project makes use of Poetry for Python package dependency management, installing requirements, and provisioning virtual environments. If Poetry is not already installed, run:

pip install poetry

Setup Instructions

  1. Clone the Repository

    git clone <repository-url>
    cd jp-airflow-test
  2. Install Dependencies

    make install-dependencies

Setup Development Environment

  1. Set Up Virtual Environment

    make activate-virtual-env
  2. Run Tests

    • Unit tests:
      make test-unit
    • End-to-end tests:
      make test-e2e
    • All tests:
      make test
  3. Linting and Security Checks

    make lint
    make security
  4. Run the ETL Process manually using the provided script

    make run

Run Airflow Locally with Docker Compose

1. Get the docker-compose.yaml

NB: An Airflow 3.0.6 docker-compose.yaml is already included, so this step is optional.

make get-airflow-docker-compose-yaml

2. Start Local Airflow (UI on port 8080)

make local-airflow-start

3. Stop Local Airflow

make local-airflow-stop

Note:

  • If you change your ETL code, restart Airflow to reload the DAG.
  • For troubleshooting, check the Airflow UI for import errors or task logs.

Usage

The ETL pipeline consists of three main components:

  • Extract: The extract_data function retrieves data from the source.
  • Transform: The transform_data function processes the extracted data.
  • Load: The load_data function loads the transformed data into the target destination.
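As an illustration only (the actual implementations live in etl/extract.py, etl/transform.py, and etl/load.py, and their signatures may differ), the three stages might look like:

```python
# Hypothetical sketch of the three ETL stages; the real implementations
# live in the etl/ package and may differ.

def extract_data():
    """Retrieve raw records from the source (here: a hard-coded sample)."""
    return [{"id": 1, "value": "10"}, {"id": 2, "value": "20"}]

def transform_data(records):
    """Process the extracted data, e.g. casting string values to integers."""
    return [{**r, "value": int(r["value"])} for r in records]

def load_data(records, target):
    """Load the transformed records into the target destination (a list here)."""
    target.extend(records)
    return len(records)

target = []
loaded = load_data(transform_data(extract_data()), target)
print(f"loaded {loaded} records")
```

Keeping each stage a plain function like this makes the pipeline easy to unit-test in isolation before wiring it into Airflow.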

The Airflow DAG defined in dags/etl_workflow.py orchestrates these tasks, ensuring they run in the correct order.
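A minimal sketch of what such a DAG might look like with the Airflow 3 TaskFlow API (the DAG id, task names, and schedule below are illustrative; the real definition is in dags/etl_workflow.py):

```python
# Illustrative TaskFlow-style DAG sketch; the real DAG lives in
# dags/etl_workflow.py and may differ in task names and scheduling.
from airflow.sdk import dag, task  # Airflow 3.x TaskFlow decorators


@dag(schedule=None, catchup=False)
def etl_workflow():
    @task
    def extract():
        # The real task would delegate to etl.extract.extract_data.
        return [{"id": 1, "value": "10"}]

    @task
    def transform(records):
        return [{**r, "value": int(r["value"])} for r in records]

    @task
    def load(records):
        print(f"loaded {len(records)} records")

    # Dependencies are inferred from the data flow: extract -> transform -> load.
    load(transform(extract()))


etl_workflow()
```

With TaskFlow, the extract -> transform -> load ordering comes from the function-call chain rather than explicit `>>` operators.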

Best Practices

  • Code Quality: The project uses Flake8 for linting and pre-commit hooks to enforce code quality checks.
  • Testing: Comprehensive unit and end-to-end tests are included to ensure the reliability of the ETL process.
  • Documentation: Clear documentation is provided to facilitate understanding and usage of the project.
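As a sketch of the unit-test style (the real tests are under tests/unit/ and exercise the actual etl package; the transform behaviour assumed here is illustrative):

```python
# Hypothetical pytest-style unit test; the real tests live in tests/unit/
# and import the actual etl.transform.transform_data implementation.

def transform_data(records):
    # Stand-in for etl.transform.transform_data, assumed here to cast
    # string values to integers.
    return [{**r, "value": int(r["value"])} for r in records]


def test_transform_data_casts_values():
    records = [{"id": 1, "value": "10"}]
    assert transform_data(records) == [{"id": 1, "value": 10}]


def test_transform_data_empty_input():
    assert transform_data([]) == []
```

Running `make test-unit` would pick up tests like these via pytest's discovery of `test_*` functions.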

License

This project is licensed under the MIT License. See the LICENSE file for more details.
