A unified toolkit for retrieving climate data from various global datasets such as CHIRPS, AgERA5, TerraClimate, IMERG, TAMSAT, CHIRTS, ERA5, NEX-GDDP, NASA POWER, CMIP6 and SoilGrids.
The Climate Toolkit offers a unified, programmatic interface to:
- Retrieve climate data from CHIRPS, AgERA5, TerraClimate, IMERG, TAMSAT, CHIRTS, ERA5, NEX-GDDP, NASA POWER, CMIP6 and SoilGrids
- Compute rainfall statistics, anomalies, and hazard indicators
- Compare climate trends over historical and seasonal periods
climate_toolkit/
├── calculate_hazards/ # Hazard metrics like SPI
├── climate_statistics/ # Stats and anomalies
├── compare_periods/ # Compare historic trends
├── fetch_data/ # Modular data downloaders
└── season_analysis/ # Onset/cessation detection
- Clone the repository:

      git clone https://github.com/Sammyjoseph999/climate-toolkit.git
      cd climate-toolkit

- Set up a virtual environment:

      python -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate

- Install dependencies:

      pip install -r requirements.txt

- Create and configure your `.env`:

      cp .env.example .env

GEE-backed sources (ERA5, AgERA5, CHIRPS, CHIRTS, TerraClimate, IMERG, NEX-GDDP) fetch data over the network from Google Earth Engine. Runtime differs significantly depending on whether the GEE tile cache is warm:
| Scenario | Typical runtime |
|---|---|
| Cold-cache fetch (first run, 1 site, 1 year) | 20 – 80 seconds |
| Warm-cache rerun (same request shortly after) | < 1 second |
| Multi-source historical path (CHIRPS + AgERA5) | 60 – 120 seconds cold |
What this means in practice:
- The first run after starting a session or switching locations will always be slower — this is normal, not a hang or failure.
- Repeat runs against the same date range and location are much faster once GEE tiles are cached.
- NEX-GDDP fetches over long date ranges (30 years) are chunked into ~13-year windows internally; expect proportionally longer cold-cache times.
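The chunking of long NEX-GDDP requests can be pictured with a minimal sketch. The helper name `split_into_windows` and the fixed 13-year default are assumptions for illustration; the toolkit's internal chunking may differ.

```python
from datetime import date, timedelta


def split_into_windows(start: date, end: date, max_years: int = 13):
    """Split [start, end] into consecutive windows of at most max_years years.

    Hypothetical helper for illustration; not the toolkit's actual code.
    """
    if end < start:
        raise ValueError("end must not be before start")
    windows = []
    window_start = start
    while window_start <= end:
        try:
            # Same calendar date, max_years later.
            next_start = window_start.replace(year=window_start.year + max_years)
        except ValueError:
            # window_start is 29 Feb and the target year is not a leap year.
            next_start = window_start.replace(
                year=window_start.year + max_years, day=28
            )
        # Each window ends the day before the next one starts, capped at end.
        window_end = min(next_start - timedelta(days=1), end)
        windows.append((window_start, window_end))
        window_start = window_end + timedelta(days=1)
    return windows
```

A 30-year request such as 1991-01-01 to 2020-12-31 splits into three windows under these assumptions, which is why cold-cache time grows roughly with the number of windows.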
To reduce cold-cache impact:
- Use a project-local cache directory (e.g. `outputs/cache/`) and reuse downloaded CSVs where possible rather than re-fetching.
- Keep GEE authenticated (`earthengine authenticate`) and `GCP_PROJECT_ID` set in `.env` to avoid auth overhead on every call.
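One way to reuse downloaded CSVs is a thin wrapper that checks the cache directory before fetching. This is a sketch only: `cache_path`, `fetch_with_cache`, the file-naming scheme, and the `fetch_fn` callback are hypothetical and not part of the toolkit's API.

```python
import csv
from pathlib import Path

CACHE_DIR = Path("outputs/cache")  # project-local cache directory


def cache_path(source, lat, lon, start, end, cache_dir=CACHE_DIR):
    """Build a deterministic file name so identical requests map to one CSV."""
    name = f"{source.lower()}_{lat:.4f}_{lon:.4f}_{start}_{end}.csv"
    return Path(cache_dir) / name


def fetch_with_cache(source, lat, lon, start, end, fetch_fn, cache_dir=CACHE_DIR):
    """Return cached rows if the CSV exists; otherwise fetch and cache them.

    fetch_fn is a placeholder for whichever downloader you use. It must
    return a list of dicts that all share the same keys.
    """
    path = cache_path(source, lat, lon, start, end, cache_dir)
    if path.exists():
        # Values read back from CSV are strings; convert types as needed.
        with path.open(newline="") as handle:
            return list(csv.DictReader(handle))
    rows = fetch_fn(source, lat, lon, start, end)
    if rows:  # do not cache empty results
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    return rows
```

Keying the file name on source, location, and date range means a changed request never silently reuses stale data; it simply misses the cache and triggers a fresh fetch.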
To download climate data from the terminal, run:

    python climate_toolkit/fetch_data/source_data/source_data.py

This triggers the download process configured by the parameters defined in the `SourceData` class within the script.
- All configuration values (e.g., API keys) are managed via `.env` using `python-dotenv`.
- Modular dataset handlers are found in `fetch_data/`, each with `DownloadData` classes.
- Common utilities like enums and settings are stored in `fetch_data/sources/utils/`.
Climate data processing workflow diagram showing the flow from data sources through processing and analysis to end consumers.
- The core engine of the Climate Toolkit will reuse existing scripts, APIs, and code, with preference for lazy-execution engines.
- Interoperability between R, Python, and other languages will be ensured via OpenAPI-compliant interfaces.
- No permanent storage is provisioned initially—caching strategies will be considered for efficiency.
- A notebook environment will support non-technical users in exploring climate data.
- Technical users will have access to source code and APIs through GitHub.
- The Solution Design & Architecture is a living document and will evolve over time.
- Timestamps will follow the ISO8601 format and be recorded in UTC.
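The timestamp convention above can be met with the standard library alone. The helper name `utc_timestamp` is an assumption for illustration, not a function the toolkit is known to expose.

```python
from datetime import datetime, timezone
from typing import Optional


def utc_timestamp(moment: Optional[datetime] = None) -> str:
    """Return an ISO 8601 timestamp in UTC with second precision."""
    if moment is None:
        moment = datetime.now(timezone.utc)
    elif moment.tzinfo is None:
        # Refuse naive datetimes rather than guess their timezone.
        raise ValueError("naive datetime: attach a timezone before converting")
    return moment.astimezone(timezone.utc).isoformat(timespec="seconds")
```

Rejecting naive datetimes is a deliberate choice in this sketch: silently assuming local time is a common source of off-by-hours errors when data from several sources is merged.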
Below are the core modules that form the foundation of the application. Each module addresses a specific category of user stories and is designed with future scalability in mind—allowing for independent microservice deployment as the application evolves.
| SN | Title | Type | Description |
|---|---|---|---|
| 1.a | fetch_data | Module | Fetches data from a climate database and returns an enriched, analysis-ready dataset. |
| 1.b | source_data | Function | Retrieves raw data from a climate database in its native format. |
| 1.c | transform_data | Function | Standardizes external source data to align with the toolkit’s internal data dictionary. |
| 1.d | preprocess_data | Function | Prepares raw source data into an analysis-ready format (e.g., downscaling, bias correction). This step excludes enrichment like climate statistics, which is handled by dedicated services. |
| 2 | climate_statistics | Module | Generates climate statistics from pre-processed datasets. |
| 3 | calculate_hazards | Module | Retrieves crop hazard indices for specific locations. |
| 4 | compare_datasets | Module | Compares datasets from various climate sources to help users select preferred datasets. |
| 5 | compare_periods | Module | Allows comparison of climate statistics between two time periods. |
| 6 | season_analysis | Module | Estimates crop growing seasons in a specific location and returns relevant climate indicators. |
The diagram illustrates how the different modules interact within the Climate Toolkit. The numbering on the bottom right of each module indicates the suggested implementation order.
At the center is the fetch_data module, which orchestrates the retrieval and preprocessing of climate data from various sources. It ensures the data is transformed and standardized before being made available for further analysis.
This centralized workflow enables reuse across climate analysis operations like season_analysis, climate_statistics, and compare_periods, ensuring consistency in results and reducing duplication of effort.
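The orchestration role of fetch_data can be sketched as a simple composition of the three functions from the module table. The function bodies, the signatures, and the source-then-transform-then-preprocess ordering are assumptions inferred from the table's numbering; the real implementations will be more involved.

```python
from typing import Dict, List

Record = Dict[str, object]


def source_data(raw_rows: List[Record]) -> List[Record]:
    """Stand-in for retrieving raw data in the source's native format."""
    return list(raw_rows)


def transform_data(rows: List[Record], column_map: Dict[str, str]) -> List[Record]:
    """Rename source-specific columns to the internal data dictionary names."""
    return [{column_map.get(key, key): value for key, value in row.items()}
            for row in rows]


def preprocess_data(rows: List[Record]) -> List[Record]:
    """Placeholder preparation step: here it only drops rows with missing values."""
    return [row for row in rows if all(v is not None for v in row.values())]


def fetch_data(raw_rows: List[Record], column_map: Dict[str, str]) -> List[Record]:
    """Run the three steps in sequence and return analysis-ready rows."""
    return preprocess_data(transform_data(source_data(raw_rows), column_map))
```

Because every downstream module would consume the output of the same `fetch_data` call, a fix to column mapping or cleaning is made once and applies to all analyses.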
The compare_datasets module is reserved for future implementation. Its placement in the diagram demonstrates its anticipated integration with existing components, providing the ability to assess and select preferred data sources.
These are the API statuses that will be applicable to this application.
| Status Code | Status | Message |
|---|---|---|
| 20X | REQUEST_SUCCESSFUL | "Your request was received and data processed successfully" |
| 40X | REQUEST_UNSUCCESSFUL | "Your request was received but there was an issue with processing the data" |
| 50X | SERVICE_UNREACHABLE | "Your request was not received by the server" |
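The table can be expressed as a lookup keyed on the status code family. The names `STATUS_BY_FAMILY` and `status_for` are hypothetical, and raising an error for codes outside the three listed families is an assumption.

```python
STATUS_BY_FAMILY = {
    2: ("REQUEST_SUCCESSFUL",
        "Your request was received and data processed successfully"),
    4: ("REQUEST_UNSUCCESSFUL",
        "Your request was received but there was an issue with processing the data"),
    5: ("SERVICE_UNREACHABLE",
        "Your request was not received by the server"),
}


def status_for(status_code: int):
    """Map a status code (e.g. 200, 404, 503) to its (status, message) pair."""
    family = status_code // 100  # 20X -> 2, 40X -> 4, 50X -> 5
    if family not in STATUS_BY_FAMILY:
        raise ValueError(f"unsupported status code: {status_code}")
    return STATUS_BY_FAMILY[family]
```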
This is a basic structure of the API response format containing the mandatory fields. This enables the responses for various services consumed in this toolkit to have a standardised response format. It should be noted that the payload key-value pairs will depend on the return values of the application logic:
- `status_code`: integer
- `status`: string
- `message`: string
- `data`: json
{
"status_code": 200,
"status": "REQUEST_SUCCESSFUL",
"message": "Your request was received and data processed successfully",
"data": {
# Payload depends on the app logic's return values
"key1": "value1",
"key2": "value2"
}
}
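A service could build and check this envelope with a small helper. This is a sketch under assumptions: `build_response` and `validate_response` are hypothetical names, and `data` is treated as a JSON object (a Python dict), although the format above only says "json".

```python
import json
from typing import Any, Dict

# Mandatory fields and the Python types assumed for them.
MANDATORY_FIELDS = {"status_code": int, "status": str, "message": str, "data": dict}


def validate_response(payload: Dict[str, Any]) -> None:
    """Raise if a mandatory field is missing or has an unexpected type."""
    for field, expected_type in MANDATORY_FIELDS.items():
        if field not in payload:
            raise KeyError(f"missing mandatory field: {field}")
        if not isinstance(payload[field], expected_type):
            raise TypeError(f"{field} must be of type {expected_type.__name__}")


def build_response(status_code: int, status: str, message: str,
                   data: Dict[str, Any]) -> str:
    """Assemble the standard envelope and serialise it to a JSON string."""
    payload = {
        "status_code": status_code,
        "status": status,
        "message": message,
        "data": data,
    }
    validate_response(payload)
    return json.dumps(payload)
```

Validating before serialising keeps malformed envelopes from leaving the service, so consumers in R, Python, or other languages can rely on the four fields always being present.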
| # | Practice | Description |
|---|---|---|
| 1 | Commit Early and Often | Don't wait until a large feature is complete. Commit small, logical, and self-contained changes. |
| 2 | Atomic Commits | Each commit should represent a single, coherent change or a small set of related changes. If you're fixing two different bugs, create two separate commits. |
| 3 | Don't Commit Half-Done Work (to shared branches) | While local commits can be frequent, avoid pushing incomplete or broken code to shared development branches. Use "git stash" if you need a clean working directory temporarily. |
| 4 | Test Before Committing | Ensure your code works as expected and passes tests before committing. |
| 5 | Review Before Committing | Use "git diff" to review your own changes before committing to catch unintended modifications. |
| 6 | Conventional Commits | Consider adopting a convention like Conventional Commits (feat:, fix:, chore:, docs:, ci:, refactor:, test:) to categorize changes and enable automated changelog generation. For example, "feat: Add CHIRPS as a climate data source". Ref: https://www.conventionalcommits.org |
| 7 | Consistent Naming Conventions | Establish clear and consistent naming conventions for branches (e.g., feat/feature-name, fix/issue-description, refactor/performance-improvement, etc). |
| 8 | Pull Regularly | Each feature or fix should be developed on a dedicated branch. These branches should be short-lived and merged back into a main development branch (e.g., develop or main) as soon as the work is complete and reviewed. Pull frequently to avoid merge conflicts. |
| 9 | Branching Strategies | GitLab Flow will be used. It will have the following branches: a. main: This branch should always be stable and deployable. Direct commits to this branch should be prohibited; all changes must come through pull requests. b. staging: This branch is for the UAT/QA environment. Direct commits to this branch should be prohibited; all changes must come through pull requests. Maintainers can have "force push" access. |
| 10 | Well-Documented Pull Requests (PRs) | Summarize the PR's purpose effectively in the subject. The PR should also have a detailed description that covers: a. Problem Statement: Clearly describe the problem or feature addressed by the PR. b. Solution Overview: Explain how the PR solves the problem or implements the feature. c. Technical Details (if necessary): Provide any necessary technical context, architectural decisions, or trade-offs. d. Screenshots/Videos: For UI changes, include screenshots or short videos to demonstrate the changes. e. Testing Instructions: Provide clear steps for reviewers to test the changes, including any specific configurations or data needed. f. Related Issues/Tickets: Link to relevant issues in your issue tracker. |
| 11 | DevOps | Automating the build, test, and deployment process ensures that code changes are integrated frequently and validated quickly. This catches issues early and provides rapid feedback. This will be implemented using GitHub Actions since it is native to GitHub. |
| 12 | Conversation Trails | Keep implementation discussions on the ticket in the Kanban system. This makes it easier to maintain a trail of the conversations and decisions regarding a proposed feature or fix. If discussions are held outside of the ticket (e.g., on Teams due to confidentiality), the conclusions from those discussions should be transferred to the ticket itself. This will still allow the project to maintain a trail of the conversations and decisions affecting the implementation of the feature. |
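The Conventional Commits practice (row 6) can be checked mechanically, for example in a commit hook or CI step. The sketch below is an assumption about how such a check might look, not part of the project's tooling; it accepts the types listed in the table plus the optional scope and `!` marker allowed by the Conventional Commits specification.

```python
import re

COMMIT_TYPES = ("feat", "fix", "chore", "docs", "ci", "refactor", "test")

# <type>[(scope)][!]: <description>
SUBJECT_PATTERN = re.compile(
    r"^(?:%s)(?:\([\w\-./]+\))?!?: \S.*$" % "|".join(COMMIT_TYPES)
)


def is_conventional(subject: str) -> bool:
    """Return True if the commit subject line follows the convention."""
    return SUBJECT_PATTERN.match(subject) is not None
```

For instance, the example subject from the table, "feat: Add CHIRPS as a climate data source", passes this check, while a free-form subject such as "Added CHIRPS" does not.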
We welcome PRs and suggestions!
- Fork the repo
- Work in a feature branch
- Follow module layout and formatting
- Submit a pull request with a clear description
This project is licensed under the MIT License.
