Sammyjoseph999/climate-toolkit

Climate Data Toolkit

A unified toolkit for retrieving climate data from global datasets such as CHIRPS, AgERA5, TerraClimate, IMERG, TAMSAT, CHIRTS, ERA5, NEX-GDDP, NASA POWER, CMIP6 and SoilGrids.

API Dataset Badges

CHIRPS AgERA5 TerraClimate IMERG TAMSAT CHIRTS ERA5 NEX-GDDP NASA POWER CMIP6 SoilGrids


About

The Climate Toolkit offers a unified, programmatic interface to:

  • Retrieve climate data from CHIRPS, AgERA5, TerraClimate, IMERG, TAMSAT, CHIRTS, ERA5, NEX-GDDP, NASA POWER, CMIP6 and SoilGrids
  • Compute rainfall statistics, anomalies, and hazard indicators
  • Compare climate trends over historical and seasonal periods

Project Structure

climate_toolkit/
├── calculate_hazards/       # Hazard metrics like SPI
├── climate_statistics/      # Stats and anomalies
├── compare_periods/         # Compare historic trends
├── fetch_data/              # Modular data downloaders
└── season_analysis/         # Onset/cessation detection

Getting Started

  1. Clone the repository

    git clone https://github.com/Sammyjoseph999/climate-toolkit.git
    cd climate-toolkit
  2. Set up a virtual environment

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Create and configure your .env

    cp .env.example .env

Performance: Cold-Cache vs Warm-Cache

GEE-backed sources (ERA5, AgERA5, CHIRPS, CHIRTS, TerraClimate, IMERG, NEX-GDDP) fetch data over the network from Google Earth Engine. Runtime differs significantly depending on whether the GEE tile cache is warm:

| Scenario | Typical runtime |
|---|---|
| Cold-cache fetch (first run, 1 site, 1 year) | 20 – 80 seconds |
| Warm-cache rerun (same request shortly after) | < 1 second |
| Multi-source historical path (CHIRPS + AgERA5) | 60 – 120 seconds cold |

What this means in practice:

  • The first run after starting a session or switching locations will always be slower — this is normal, not a hang or failure.
  • Repeat runs against the same date range and location are much faster once GEE tiles are cached.
  • NEX-GDDP fetches over long date ranges (30 years) are chunked into ~13-year windows internally; expect proportionally longer cold-cache times.
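
The windowing mentioned in the last bullet can be pictured with a short sketch. It is an illustration only, not the toolkit's actual code: the function name `split_into_windows` is hypothetical, and the fixed 13-year window is an assumption based on the "~13-year" figure above.

```python
from datetime import date


def split_into_windows(start_year: int, end_year: int, window_years: int = 13):
    """Split an inclusive year range into consecutive windows of at most window_years."""
    if end_year < start_year:
        raise ValueError("end_year must not be before start_year")
    windows = []
    year = start_year
    while year <= end_year:
        # The last window may be shorter than window_years.
        last = min(year + window_years - 1, end_year)
        windows.append((date(year, 1, 1), date(last, 12, 31)))
        year = last + 1
    return windows


# A 30-year request becomes three separate fetches under these assumptions.
for window_start, window_end in split_into_windows(1991, 2020):
    print(window_start.isoformat(), "to", window_end.isoformat())
```

Each window is a separate cold-cache fetch, which is why long ranges take proportionally longer on a first run.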

To reduce cold-cache impact:

  • Use a project-local cache directory (e.g. outputs/cache/) and reuse downloaded CSVs where possible rather than re-fetching.
  • Keep GEE authenticated (earthengine authenticate) and GCP_PROJECT_ID set in .env to avoid auth overhead on every call.
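
One possible way to reuse downloaded CSVs is a small file-based cache keyed by the request parameters. This is a sketch under stated assumptions: `cache_path`, `fetch_with_cache` and the file-naming scheme are hypothetical and not part of the toolkit; only the `outputs/cache/` location comes from the suggestion above.

```python
import hashlib
from pathlib import Path
from typing import Callable

CACHE_DIR = Path("outputs/cache")  # project-local cache directory suggested above


def cache_path(source: str, lat: float, lon: float, start: str, end: str,
               cache_dir: Path = CACHE_DIR) -> Path:
    """Build a deterministic CSV path for one request (hypothetical naming scheme)."""
    key = f"{source}|{lat:.4f}|{lon:.4f}|{start}|{end}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return cache_dir / f"{source.lower()}_{digest}.csv"


def fetch_with_cache(path: Path, fetch: Callable[[], str]) -> str:
    """Return the cached CSV text if present; otherwise call fetch() and store it."""
    if path.exists():
        return path.read_text(encoding="utf-8")
    csv_text = fetch()  # the slow call (e.g. a cold-cache download) happens only here
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(csv_text, encoding="utf-8")
    return csv_text
```

A caller would pass its own download function as `fetch`; a second call with the same source, location and date range then reads the CSV from disk instead of going back to the network.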

How to Use

To download climate data from the terminal, run:

python climate_toolkit/fetch_data/source_data/source_data.py

This runs the download process configured by the parameters defined in the SourceData class within the script.


Development

Setting Up

  • All configuration values (e.g., API keys) are managed via .env using python-dotenv.
  • Modular dataset handlers are found in fetch_data/, each with DownloadData classes.
  • Common utilities like enums and settings are stored in fetch_data/sources/utils/.
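
As a rough sketch of the .env-based configuration, the snippet below loads settings with python-dotenv and fails early when a required value is missing. The helper names `get_required_setting` and `load_settings` are hypothetical; `GCP_PROJECT_ID` is the variable mentioned in the performance notes, and the toolkit's own settings code may look different.

```python
import os

try:
    # python-dotenv: load_dotenv() copies key=value pairs from .env into os.environ
    from dotenv import load_dotenv
except ImportError:  # keeps this sketch importable when python-dotenv is not installed
    load_dotenv = None


def get_required_setting(name: str, environ=os.environ) -> str:
    """Return a required setting, or raise a clear error if it is missing or empty."""
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required setting {name!r}; add it to your .env file")
    return value


def load_settings() -> dict:
    """Load .env (if python-dotenv is available) and collect the required settings."""
    if load_dotenv is not None:
        load_dotenv()
    return {"gcp_project_id": get_required_setting("GCP_PROJECT_ID")}
```

Failing at startup with the variable name in the message tends to be easier to debug than an authentication error raised later during a fetch.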

Solution Architecture

Technology Stack

Climate Data Workflow

Climate data processing workflow diagram showing the flow from data sources through processing and analysis to end consumers.

  • The core engine of the Climate Toolkit will reuse existing scripts, APIs, and code, with preference for lazy-execution engines.
  • Interoperability between R, Python, and other languages will be ensured via OpenAPI-compliant interfaces.
  • No permanent storage is provisioned initially—caching strategies will be considered for efficiency.
  • A notebook environment will support non-technical users in exploring climate data.
  • Technical users will have access to source code and APIs through GitHub.
  • The Solution Design & Architecture is a living document and will evolve over time.
  • Timestamps will follow the ISO 8601 format and be recorded in UTC.
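
The timestamp rule in the last bullet can be illustrated with the standard library alone. The function names are hypothetical, and rejecting naive datetimes is a design assumption of this sketch rather than a documented toolkit behaviour.

```python
from datetime import datetime, timedelta, timezone


def to_utc_iso8601(moment: datetime) -> str:
    """Convert a timezone-aware datetime to UTC and format it as ISO 8601."""
    if moment.tzinfo is None:
        raise ValueError("naive datetime: attach a timezone before converting")
    return moment.astimezone(timezone.utc).isoformat(timespec="seconds")


def parse_iso8601(text: str) -> datetime:
    """Parse an ISO 8601 string with an explicit offset such as +00:00."""
    return datetime.fromisoformat(text)


local_zone = timezone(timedelta(hours=3))  # UTC+03:00, used only as an example offset
stamp = to_utc_iso8601(datetime(2024, 3, 1, 12, 30, tzinfo=local_zone))
print(stamp)  # 2024-03-01T09:30:00+00:00
```

Recording the offset explicitly keeps timestamps from different sources comparable regardless of where the toolkit runs.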

Application Modules

Below are the core modules that form the foundation of the application. Each module addresses a specific category of user stories and is designed with future scalability in mind—allowing for independent microservice deployment as the application evolves.

| SN | Title | Type | Description |
|---|---|---|---|
| 1.a | fetch_data | Module | Fetches data from a climate database and returns an enriched, analysis-ready dataset. |
| 1.b | source_data | Function | Retrieves raw data from a climate database in its native format. |
| 1.c | transform_data | Function | Standardizes external source data to align with the toolkit’s internal data dictionary. |
| 1.d | preprocess_data | Function | Prepares raw source data into an analysis-ready format (e.g., downscaling, bias correction). This step excludes enrichment like climate statistics, which is handled by dedicated services. |
| 2 | climate_statistics | Module | Generates climate statistics from pre-processed datasets. |
| 3 | calculate_hazards | Module | Retrieves crop hazard indices for specific locations. |
| 4 | compare_datasets | Module | Compares datasets from various climate sources to help users select preferred datasets. |
| 5 | compare_periods | Module | Allows comparison of climate statistics between two time periods. |
| 6 | season_analysis | Module | Estimates crop growing seasons in a specific location and returns relevant climate indicators. |

Module Interaction Diagram

Interaction diagram showing how modules depend on and communicate with each other.

The diagram illustrates how the different modules interact within the Climate Toolkit. The numbering on the bottom right of each module indicates the suggested implementation order.

At the center is the fetch_data module, which orchestrates the retrieval and preprocessing of climate data from various sources. It ensures the data is transformed and standardized before being made available for further analysis.

This centralized workflow enables reuse across climate analysis operations like season_analysis, climate_statistics, and compare_periods, ensuring consistency in results and reducing duplication of effort.
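
The orchestration role of fetch_data can be sketched as a simple composition of the three functions from the module table. Everything below is a toy illustration under assumptions: the signatures, the sample records, the column mapping, and the use of "drop rows with missing values" as a placeholder for real preprocessing (such as downscaling or bias correction) are all hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]

# Assumed mapping from a source's column names to internal data-dictionary names.
COLUMN_MAP = {"time": "date", "precip": "precipitation_mm"}


def source_data() -> List[Record]:
    """Stand-in for step 1.b: return raw records in the source's native format."""
    return [
        {"time": "2020-01-01", "precip": 1.5},
        {"time": "2020-01-02", "precip": None},
    ]


def transform_data(raw: List[Record]) -> List[Record]:
    """Stand-in for step 1.c: rename columns to the internal data dictionary."""
    return [{COLUMN_MAP.get(key, key): value for key, value in row.items()} for row in raw]


def preprocess_data(rows: List[Record]) -> List[Record]:
    """Stand-in for step 1.d: here it only drops rows with missing values."""
    return [row for row in rows if all(value is not None for value in row.values())]


def fetch_data(source: Callable[[], List[Record]] = source_data) -> List[Record]:
    """Step 1.a: run the source, transform, and preprocess steps in order."""
    return preprocess_data(transform_data(source()))


print(fetch_data())  # [{'date': '2020-01-01', 'precipitation_mm': 1.5}]
```

Because downstream modules only see the output of `fetch_data`, they can share one standardized dataset shape regardless of which source produced it.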

The compare_datasets module is reserved for future implementation. Its placement in the diagram demonstrates its anticipated integration with existing components, providing the ability to assess and select preferred data sources.

API Statuses & Response Format

The application uses the following API statuses.

| Status Code | Status | Message |
|---|---|---|
| 20X | REQUEST_SUCCESSFUL | "Your request was received and data processed successfully" |
| 40X | REQUEST_UNSUCCESSFUL | "Your request was received but there was an issue with processing the data" |
| 50X | SERVICE_UNREACHABLE | "Your request was not received by the server" |

The API response format below lists the mandatory fields, so that every service consumed in this toolkit returns a standardised response. The key-value pairs inside data depend on the return values of the application logic:

  • status_code: integer
  • status: string
  • message: string
  • data: json

{
  "status_code": 200,
  "status": "REQUEST_SUCCESSFUL",
  "message": "Your request was received and data processed successfully",
  "data": {
    "key1": "value1",
    "key2": "value2"
  }
}
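
A response with these mandatory fields could be assembled as in the sketch below. The helper name `build_response` and the choice to derive the status from the first digit of the HTTP code are assumptions made for illustration; the status names and messages are taken from the status table.

```python
import json

# Status name and message per status class (2 -> 20X, 4 -> 40X, 5 -> 50X).
STATUS_BY_CLASS = {
    2: ("REQUEST_SUCCESSFUL",
        "Your request was received and data processed successfully"),
    4: ("REQUEST_UNSUCCESSFUL",
        "Your request was received but there was an issue with processing the data"),
    5: ("SERVICE_UNREACHABLE",
        "Your request was not received by the server"),
}


def build_response(status_code: int, data=None) -> dict:
    """Wrap a payload in the standard envelope: status_code, status, message, data."""
    status_class = status_code // 100
    if status_class not in STATUS_BY_CLASS:
        raise ValueError(f"unsupported status code: {status_code}")
    status, message = STATUS_BY_CLASS[status_class]
    return {
        "status_code": status_code,
        "status": status,
        "message": message,
        "data": data if data is not None else {},
    }


print(json.dumps(build_response(200, {"key1": "value1", "key2": "value2"}), indent=2))
```

Keeping the mapping in one place means each service only supplies a status code and a payload, and the envelope stays uniform.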

Development Best Practices

| # | Practice | Description |
|---|---|---|
| 1 | Commit Early and Often | Don't wait until a large feature is complete. Commit small, logical, and self-contained changes. |
| 2 | Atomic Commits | Each commit should represent a single, coherent change or a small set of related changes. If you're fixing two different bugs, create two separate commits. |
| 3 | Don't Commit Half-Done Work (to shared branches) | While local commits can be frequent, avoid pushing incomplete or broken code to shared development branches. Use "git stash" if you need a clean working directory temporarily. |
| 4 | Test Before Committing | Ensure your code works as expected and passes tests before committing. |
| 5 | Review Before Committing | Use "git diff" to review your own changes before committing to catch unintended modifications. |
| 6 | Conventional Commits | Consider adopting a convention like Conventional Commits (feat:, fix:, chore:, docs:, ci:, refactor:, test:) to categorize changes and enable automated changelog generation. For example, "feat: Add CHIRPS as a climate data source". Ref: https://www.conventionalcommits.org |
| 7 | Consistent Naming Conventions | Establish clear and consistent naming conventions for branches (e.g., feat/feature-name, fix/issue-description, refactor/performance-improvement, etc). |
| 8 | Pull Regularly | Each feature or fix should be developed on a dedicated branch. These branches should be short-lived and merged back into a main development branch (e.g., develop or main) as soon as the work is complete and reviewed. Pull frequently to avoid merge conflicts. |
| 9 | Branching Strategies | GitLab Flow will be used. It will have the following branches: a. main: This branch should always be stable and deployable. Direct commits to this branch should be prohibited; all changes must come through pull requests. b. staging: This branch is for the UAT/QA environment. Direct commits to this branch should be prohibited; all changes must come through pull requests. Maintainers can have "force push" access. |
| 10 | Well-Documented Pull Requests (PRs) | Summarize the PR's purpose effectively in the subject. The PR should also have a detailed description that covers: a. Problem Statement: Clearly describe the problem or feature addressed by the PR. b. Solution Overview: Explain how the PR solves the problem or implements the feature. c. Technical Details (if necessary): Provide any necessary technical context, architectural decisions, or trade-offs. d. Screenshots/Videos: For UI changes, include screenshots or short videos to demonstrate the changes. e. Testing Instructions: Provide clear steps for reviewers to test the changes, including any specific configurations or data needed. f. Related Issues/Tickets: Link to relevant issues in your issue tracker. |
| 11 | DevOps | Automating the build, test, and deployment process ensures that code changes are integrated frequently and validated quickly. This catches issues early and provides rapid feedback. This will be implemented using GitHub Actions since it is native to GitHub. |
| 12 | Conversation Trails | Keep implementation discussions on the ticket in the Kanban system. This makes it easier to maintain a trail of the conversations and decisions regarding a proposed feature or fix. If discussions are held outside of the ticket (e.g., on Teams due to confidentiality), the conclusions from those discussions should be transferred to the ticket itself. This will still allow the project to maintain a trail of the conversation and decisions affecting the implementation of the feature. |
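
The commit-message and branch-naming conventions (practices 6 and 7) lend themselves to an automated check, for example in a local hook or a CI step. The sketch below is one possible checker, not something the project ships: the function names are hypothetical, the optional "(scope)" and "!" parts follow the Conventional Commits format, and limiting branch prefixes to feat, fix and refactor is an assumption based on the examples given.

```python
import re

COMMIT_TYPES = ("feat", "fix", "chore", "docs", "ci", "refactor", "test")
BRANCH_PREFIXES = ("feat", "fix", "refactor")  # assumed from the examples; extend as needed

# type, optional (scope), optional "!", then ": " and a non-empty description
COMMIT_PATTERN = re.compile(
    rf"^(?:{'|'.join(COMMIT_TYPES)})(?:\([\w./-]+\))?!?: \S.*$"
)
# prefix/lowercase-words-separated-by-hyphens
BRANCH_PATTERN = re.compile(
    rf"^(?:{'|'.join(BRANCH_PREFIXES)})/[a-z0-9]+(?:-[a-z0-9]+)*$"
)


def is_conventional_commit(subject: str) -> bool:
    """Check the first line of a commit message against the convention."""
    return COMMIT_PATTERN.match(subject) is not None


def is_valid_branch_name(name: str) -> bool:
    """Check a branch name such as feat/feature-name."""
    return BRANCH_PATTERN.match(name) is not None


print(is_conventional_commit("feat: Add CHIRPS as a climate data source"))  # True
print(is_valid_branch_name("fix/issue-description"))  # True
```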

Contributing

We welcome PRs and suggestions!

  1. Fork the repo
  2. Work in a feature branch
  3. Follow module layout and formatting
  4. Submit a pull request with a clear description

License

This project is licensed under the MIT License.
