Climate Data Toolkit

A unified toolkit for retrieving climate data from global datasets including CHIRPS, AgERA5, TerraClimate, IMERG, TAMSAT, CHIRTS, ERA5, NEX-GDDP, NASA POWER, CMIP6 and SoilGrids.

About

The Climate Toolkit offers a unified, programmatic interface to:

Retrieve climate data from CHIRPS, AgERA5, TerraClimate, IMERG, TAMSAT, CHIRTS, ERA5, NEX-GDDP, NASA POWER, CMIP6 and SoilGrids
Compute rainfall statistics, anomalies, and hazard indicators
Compare climate trends over historical and seasonal periods

Project Structure

```
climate_toolkit/
├── calculate_hazards/       # Hazard metrics like SPI
├── climate_statistics/      # Stats and anomalies
├── compare_periods/         # Compare historic trends
├── fetch_data/              # Modular data downloaders
└── season_analysis/         # Onset/cessation detection
```

Getting Started

Clone the repository

```
git clone https://github.com/Sammyjoseph999/climate-toolkit.git
cd climate-toolkit
```

Set up a virtual environment

```
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Install dependencies
```
pip install -r requirements.txt
```
Create and configure your .env
```
cp .env.example .env
```

Performance: Cold-Cache vs Warm-Cache

GEE-backed sources (ERA5, AgERA5, CHIRPS, CHIRTS, TerraClimate, IMERG, NEX-GDDP) fetch data over the network from Google Earth Engine. Runtime differs significantly depending on whether the GEE tile cache is warm:

| Scenario | Typical runtime |
| --- | --- |
| Cold-cache fetch (first run, 1 site, 1 year) | 20 – 80 seconds |
| Warm-cache rerun (same request shortly after) | < 1 second |
| Multi-source historical path (CHIRPS + AgERA5) | 60 – 120 seconds cold |

What this means in practice:

The first run after starting a session or switching locations will always be slower — this is normal, not a hang or failure.
Repeat runs against the same date range and location are much faster once GEE tiles are cached.
NEX-GDDP fetches over long date ranges (30 years) are chunked into ~13-year windows internally; expect proportionally longer cold-cache times.
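
The windowing mentioned in the last point can be pictured with a small helper. This is a minimal sketch, not the toolkit's actual implementation: the function name `split_into_windows` and its `max_years` default are assumptions chosen only to mirror the ~13-year figure above.

```python
from typing import List, Tuple


def split_into_windows(
    start_year: int, end_year: int, max_years: int = 13
) -> List[Tuple[int, int]]:
    """Split an inclusive year range into consecutive windows of at most max_years."""
    if max_years < 1:
        raise ValueError("max_years must be at least 1")
    if end_year < start_year:
        raise ValueError("end_year must not be earlier than start_year")

    windows = []
    window_start = start_year
    while window_start <= end_year:
        # Both ends are inclusive, so a full window covers max_years calendar years.
        window_end = min(window_start + max_years - 1, end_year)
        windows.append((window_start, window_end))
        window_start = window_end + 1
    return windows


if __name__ == "__main__":
    # A 30-year request turns into three sequential fetches.
    print(split_into_windows(1991, 2020))
    # [(1991, 2003), (2004, 2016), (2017, 2020)]
```

Each window is a separate cold-cache request on a first run, which is why long NEX-GDDP ranges take proportionally longer.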

To reduce cold-cache impact:

Use a project-local cache directory (e.g. outputs/cache/) and reuse downloaded CSVs where possible rather than re-fetching.
Keep GEE authenticated (earthengine authenticate) and GCP_PROJECT_ID set in .env to avoid auth overhead on every call.
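
One way to reuse downloaded CSVs is to key each file on the request that produced it. The sketch below uses only the standard library and is an illustration under stated assumptions: `cache_path`, `fetch_with_cache`, and the `fetch` callback are hypothetical names, not part of the toolkit, and the callback stands in for whichever downloader is actually called.

```python
import csv
import hashlib
from pathlib import Path
from typing import Callable, Dict, List

Row = Dict[str, str]


def cache_path(
    cache_dir: Path, source: str, lat: float, lon: float, start: str, end: str
) -> Path:
    """Build a deterministic CSV path for one (source, location, date range) request."""
    key = f"{source}|{lat:.4f}|{lon:.4f}|{start}|{end}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return cache_dir / f"{source.lower()}_{digest}.csv"


def fetch_with_cache(
    cache_dir: Path,
    source: str,
    lat: float,
    lon: float,
    start: str,
    end: str,
    fetch: Callable[[], List[Row]],
) -> List[Row]:
    """Return rows from the local CSV if present; otherwise fetch and save them."""
    path = cache_path(cache_dir, source, lat, lon, start, end)
    if path.exists():
        # Cache hit: no network call is made.
        with path.open(newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))

    rows = fetch()  # Cache miss: this is the slow, network-bound step.
    if rows:
        cache_dir.mkdir(parents=True, exist_ok=True)
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    return rows
```

Calling `fetch_with_cache(Path("outputs/cache"), "CHIRPS", lat, lon, start, end, fetch)` twice with the same arguments invokes `fetch` only the first time. Values read back from CSV are strings, so numeric columns need converting after a cache hit.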

How to Use

To download climate data from the terminal, run:

```
python climate_toolkit/fetch_data/source_data/source_data.py
```

This will trigger the configured download process based on the parameters defined in the SourceData class within the script.

Development

Setting Up

All configuration values (e.g., API keys) are managed via .env using python-dotenv.
Modular dataset handlers are found in fetch_data/, each with DownloadData classes.
Common utilities like enums and settings are stored in fetch_data/sources/utils/.
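
As an illustration of the first point, configuration can be read once at start-up and validated before any fetch runs. This is a hedged sketch rather than the toolkit's own settings code: `load_settings` is a hypothetical helper, and `GCP_PROJECT_ID` is used only because it is the variable named earlier in this README. The import is guarded so the example still runs where python-dotenv is not installed.

```python
import os
from typing import Dict, Sequence

try:
    # python-dotenv copies entries from a .env file into os.environ.
    from dotenv import load_dotenv
except ImportError:
    load_dotenv = None  # Fall back to variables already exported in the shell.


def load_settings(required: Sequence[str] = ("GCP_PROJECT_ID",)) -> Dict[str, str]:
    """Load .env (when python-dotenv is available) and return the required settings."""
    if load_dotenv is not None:
        # By default, variables already set in the environment are not overridden.
        load_dotenv()

    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return {name: os.environ[name] for name in required}
```

Failing early with a clear message is usually easier to debug than an authentication error raised midway through a download.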

Solution Architecture

Technology Stack

Climate data processing workflow diagram showing the flow from data sources through processing and analysis to end consumers.

The core engine of the Climate Toolkit will reuse existing scripts, APIs, and code, with preference for lazy-execution engines.
Interoperability between R, Python, and other languages will be ensured via OpenAPI-compliant interfaces.
No permanent storage is provisioned initially—caching strategies will be considered for efficiency.
A notebook environment will support non-technical users in exploring climate data.
Technical users will have access to source code and APIs through GitHub.
The Solution Design & Architecture is a living document and will evolve over time.
Timestamps will follow the ISO 8601 format and be recorded in UTC.
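
The timestamp convention in the last point can be produced with the standard library alone. A minimal sketch follows; the helper name `to_utc_iso8601` is an assumption for illustration.

```python
from datetime import datetime, timedelta, timezone


def to_utc_iso8601(moment: datetime) -> str:
    """Convert a timezone-aware datetime to an ISO 8601 string in UTC."""
    if moment.tzinfo is None:
        # A naive datetime has no offset, so it cannot be converted reliably.
        raise ValueError("naive datetimes are ambiguous; attach a timezone first")
    return moment.astimezone(timezone.utc).isoformat(timespec="seconds")


if __name__ == "__main__":
    # 15:30 at UTC+03:00 is 12:30 in UTC.
    local = datetime(2024, 3, 1, 15, 30, tzinfo=timezone(timedelta(hours=3)))
    print(to_utc_iso8601(local))  # 2024-03-01T12:30:00+00:00
```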

Application Modules

Below are the core modules that form the foundation of the application. Each module addresses a specific category of user stories and is designed with future scalability in mind—allowing for independent microservice deployment as the application evolves.

| SN | Title | Type | Description |
| --- | --- | --- | --- |
| 1.a | fetch_data | Module | Fetches data from a climate database and returns an enriched, analysis-ready dataset. |
| 1.b | source_data | Function | Retrieves raw data from a climate database in its native format. |
| 1.c | transform_data | Function | Standardizes external source data to align with the toolkit’s internal data dictionary. |
| 1.d | preprocess_data | Function | Prepares raw source data into an analysis-ready format (e.g., downscaling, bias correction). This step excludes enrichment like climate statistics, which is handled by dedicated services. |
| 2 | climate_statistics | Module | Generates climate statistics from pre-processed datasets. |
| 3 | calculate_hazards | Module | Retrieves crop hazard indices for specific locations. |
| 4 | compare_datasets | Module | Compares datasets from various climate sources to help users select preferred datasets. |
| 5 | compare_periods | Module | Allows comparison of climate statistics between two time periods. |
| 6 | season_analysis | Module | Estimates crop growing seasons in a specific location and returns relevant climate indicators. |

Interaction diagram showing how modules depend on and communicate with each other.

The diagram illustrates how the different modules interact within the Climate Toolkit. The numbering on the bottom right of each module indicates the suggested implementation order.

At the center is the fetch_data module, which orchestrates the retrieval and preprocessing of climate data from various sources. It ensures the data is transformed and standardized before being made available for further analysis.
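
The orchestration described above can be sketched as three small steps composed by `fetch_data`. The function names come from the module table; everything else is an assumption made for illustration: the bodies are placeholders, `COLUMN_MAP` is a hypothetical data dictionary, the sample records are made up, and dropping missing values merely stands in for real preprocessing such as downscaling or bias correction.

```python
from typing import Dict, List

Record = Dict[str, object]

# Hypothetical mapping from a source's native column names to internal names.
COLUMN_MAP = {"time": "date", "precipitation": "precip_mm"}


def source_data() -> List[Record]:
    """Stand-in for a raw download; returns records in the source's native format."""
    return [
        {"time": "2020-01-01", "precipitation": 3.2},
        {"time": "2020-01-02", "precipitation": None},
    ]


def transform_data(raw: List[Record]) -> List[Record]:
    """Rename native columns so they match the internal data dictionary."""
    return [
        {COLUMN_MAP.get(name, name): value for name, value in record.items()}
        for record in raw
    ]


def preprocess_data(records: List[Record]) -> List[Record]:
    """Placeholder preprocessing: keep only records with no missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def fetch_data() -> List[Record]:
    """Run the three steps in order and return an analysis-ready dataset."""
    return preprocess_data(transform_data(source_data()))


if __name__ == "__main__":
    print(fetch_data())  # [{'date': '2020-01-01', 'precip_mm': 3.2}]
```

Keeping each step a separate function means a new source only needs its own `source_data` and column mapping; the later steps stay unchanged.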

This centralized workflow enables reuse across climate analysis operations like season_analysis, climate_statistics, and compare_periods, ensuring consistency in results and reducing duplication of effort.
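
To show what a downstream consumer of that standardized output might look like, here is a rough sketch of a period comparison. It is not the `compare_periods` module itself: the function names, the record layout (`year` plus one variable), and the rainfall values are illustrative assumptions.

```python
from statistics import mean
from typing import Dict, List, Tuple

Record = Dict[str, float]


def period_mean(
    records: List[Record], variable: str, start_year: int, end_year: int
) -> float:
    """Average a variable over the records whose year falls in [start_year, end_year]."""
    values = [r[variable] for r in records if start_year <= r["year"] <= end_year]
    if not values:
        raise ValueError(f"no records between {start_year} and {end_year}")
    return mean(values)


def compare_two_periods(
    records: List[Record],
    variable: str,
    baseline: Tuple[int, int],
    comparison: Tuple[int, int],
) -> Dict[str, float]:
    """Return both period means and the change from baseline to comparison."""
    baseline_mean = period_mean(records, variable, *baseline)
    comparison_mean = period_mean(records, variable, *comparison)
    return {
        "baseline_mean": baseline_mean,
        "comparison_mean": comparison_mean,
        "difference": comparison_mean - baseline_mean,
    }


if __name__ == "__main__":
    # Made-up annual rainfall totals, for illustration only.
    annual = [
        {"year": 2000, "precip_mm": 800.0},
        {"year": 2001, "precip_mm": 900.0},
        {"year": 2010, "precip_mm": 700.0},
        {"year": 2011, "precip_mm": 600.0},
    ]
    print(compare_two_periods(annual, "precip_mm", (2000, 2001), (2010, 2011)))
```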

The compare_datasets module is reserved for future implementation. Its placement in the diagram demonstrates its anticipated integration with existing components, providing the ability to assess and select preferred data sources.

API Statuses & Response Format

The application uses the following API statuses.

| Status Code | Status | Message |
| --- | --- | --- |
| 20X | REQUEST_SUCCESSFUL | "Your request was received and data processed successfully" |
| 40X | REQUEST_UNSUCCESSFUL | "Your request was received but there was an issue with processing the data" |
| 50X | SERVICE_UNREACHABLE | "Your request was not received by the server" |

Every API response shares the basic structure below, with four mandatory fields, so that all services consumed in this toolkit return a standardised format. The key-value pairs inside data depend on the return values of the application logic:

status_code: integer
status: string
message: string
data: json

```
{
  "status_code": 200,
  "status": "REQUEST_SUCCESSFUL",
  "message": "Your request was received and data processed successfully",
  "data": {
    "key1": "value1",
    "key2": "value2"
  }
}
```
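
A response in this shape could be assembled by a single helper so that every service returns the same envelope. The sketch below is an assumption about how that might be done, not the toolkit's code: `build_response` is a hypothetical name, and the status table is copied from the statuses listed above.

```python
import json
from typing import Any, Dict, Optional

# Status and message per status-code class (2 = 20X, 4 = 40X, 5 = 50X).
_STATUS_BY_CLASS = {
    2: ("REQUEST_SUCCESSFUL",
        "Your request was received and data processed successfully"),
    4: ("REQUEST_UNSUCCESSFUL",
        "Your request was received but there was an issue with processing the data"),
    5: ("SERVICE_UNREACHABLE",
        "Your request was not received by the server"),
}


def build_response(
    status_code: int, data: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Wrap a payload in the four mandatory response fields."""
    try:
        status, message = _STATUS_BY_CLASS[status_code // 100]
    except KeyError:
        raise ValueError(f"unsupported status code: {status_code}") from None
    return {
        "status_code": status_code,
        "status": status,
        "message": message,
        "data": data if data is not None else {},
    }


if __name__ == "__main__":
    print(json.dumps(build_response(200, {"key1": "value1"}), indent=2))
```

Centralising the envelope keeps the status text consistent across modules and leaves each service responsible only for its own `data` payload.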

Development Best Practices

| # | Practice | Description |
| --- | --- | --- |
| 1 | Commit Early and Often | Don't wait until a large feature is complete. Commit small, logical, and self-contained changes. |
| 2 | Atomic Commits | Each commit should represent a single, coherent change or a small set of related changes. If you're fixing two different bugs, create two separate commits. |
| 3 | Don't Commit Half-Done Work (to shared branches) | While local commits can be frequent, avoid pushing incomplete or broken code to shared development branches. Use "git stash" if you need a clean working directory temporarily. |
| 4 | Test Before Committing | Ensure your code works as expected and passes tests before committing. |
| 5 | Review Before Committing | Use "git diff" to review your own changes before committing to catch unintended modifications. |
| 6 | Conventional Commits | Consider adopting a convention like Conventional Commits (feat:, fix:, chore:, docs:, ci:, refactor:, test:) to categorize changes and enable automated changelog generation. For example, "feat: Add CHIRPS as a climate data source". Ref: https://www.conventionalcommits.org |
| 7 | Consistent Naming Conventions | Establish clear and consistent naming conventions for branches (e.g., feat/feature-name, fix/issue-description, refactor/performance-improvement, etc). |
| 8 | Pull Regularly | Each feature or fix should be developed on a dedicated branch. These branches should be short-lived and merged back into a main development branch (e.g., develop or main) as soon as the work is complete and reviewed. Pull frequently to avoid merge conflicts. |
| 9 | Branching Strategies | GitLab Flow will be used. It will have the following branches: a. main: This branch should always be stable and deployable. Direct commits to this branch should be prohibited; all changes must come through pull requests. b. staging: This branch is for the UAT/QA environment. Direct commits to this branch should be prohibited; all changes must come through pull requests. Maintainers can have "force push" access. |
| 10 | Well-Documented Pull Requests (PRs) | Summarize the PR's purpose effectively in the subject. The PR should also have a detailed description that covers: a. Problem Statement: Clearly describe the problem or feature addressed by the PR. b. Solution Overview: Explain how the PR solves the problem or implements the feature. c. Technical Details (if necessary): Provide any necessary technical context, architectural decisions, or trade-offs. d. Screenshots/Videos: For UI changes, include screenshots or short videos to demonstrate the changes. e. Testing Instructions: Provide clear steps for reviewers to test the changes, including any specific configurations or data needed. f. Related Issues/Tickets: Link to relevant issues in your issue tracker. |
| 11 | DevOps | Automating the build, test, and deployment process ensures that code changes are integrated frequently and validated quickly. This catches issues early and provides rapid feedback. This will be implemented using GitHub Actions since it is native to GitHub. |
| 12 | Conversation Trails | Keep implementation discussions on the ticket in the Kanban system. This makes it easier to maintain a trail of the conversations and decisions regarding a proposed feature or fix. If discussions are held outside of the ticket (e.g., on Teams due to confidentiality), the conclusions from those discussions should be transferred to the ticket itself. This will still allow the project to maintain a trail of the conversation and decisions affecting the implementation of the feature. |
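
Practice 6 lends itself to an automated check, for example in a commit hook or CI step. The following is a small sketch of such a check, assuming the type list given in that row; `is_conventional` is a hypothetical helper, and the pattern covers only the subject line (type, optional scope, optional `!`, then a colon, a space, and a description).

```python
import re

# Commit types listed under practice 6.
ALLOWED_TYPES = ("feat", "fix", "chore", "docs", "ci", "refactor", "test")

_SUBJECT_PATTERN = re.compile(
    r"^(?P<type>" + "|".join(ALLOWED_TYPES) + r")"
    r"(\((?P<scope>[^()\s]+)\))?"   # optional scope, e.g. (fetch_data)
    r"(?P<breaking>!)?"             # optional breaking-change marker
    r": (?P<description>\S.*)$"     # colon, one space, non-empty description
)


def is_conventional(subject: str) -> bool:
    """Return True if a commit subject line follows the convention."""
    return _SUBJECT_PATTERN.match(subject) is not None


if __name__ == "__main__":
    for subject in (
        "feat: Add CHIRPS as a climate data source",
        "fix(fetch_data): handle empty response",
        "Added CHIRPS",
    ):
        print(f"{subject!r}: {is_conventional(subject)}")
```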

Contributing

We welcome PRs and suggestions!

Fork the repo
Work in a feature branch
Follow module layout and formatting
Submit a pull request with a clear description

License

This project is licensed under the MIT License.
