Analyze GitHub repositories to track commits, code churn, trends, and spikes over time.
This project shows how PyDriller can be used to extract commit information from a GitHub repository and analyze it to better understand development activity.
The main research question guiding this work is: How does development activity evolve over time? Specifically, we aim to analyze how monthly commit counts and code churn (lines added or removed) fluctuate across a repository’s lifetime, highlighting long-term trends and notable spikes.
The approach is demonstrated on the NumPy repository, focusing on the last six months of 2025.
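To make the research question concrete, here is a minimal sketch of the monthly aggregation and spike detection it implies. It uses only the standard library; the function names, the sample data, and the spike rule (a month counts as a spike when its value exceeds the mean by more than `k` population standard deviations) are assumptions for illustration, not necessarily what the notebook does.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, pstdev


def monthly_activity(commits):
    """Aggregate (commit_date, lines_added, lines_removed) tuples by month."""
    totals = defaultdict(lambda: {"commits": 0, "churn": 0})
    for commit_date, added, removed in commits:
        month = commit_date.strftime("%Y-%m")
        totals[month]["commits"] += 1
        # Churn is counted here as lines added plus lines removed.
        totals[month]["churn"] += added + removed
    return dict(sorted(totals.items()))


def find_spikes(monthly, metric="churn", k=1.5):
    """Return months whose metric exceeds mean + k * population std dev.

    The threshold rule and the default k are illustrative assumptions.
    """
    values = [row[metric] for row in monthly.values()]
    if len(values) < 2:
        return []
    threshold = mean(values) + k * pstdev(values)
    return [month for month, row in monthly.items() if row[metric] > threshold]


# Hypothetical sample data, not taken from the NumPy repository.
sample = [
    (date(2025, 7, 3), 10, 5),
    (date(2025, 7, 20), 20, 5),
    (date(2025, 8, 11), 15, 10),
    (date(2025, 9, 2), 400, 250),
    (date(2025, 10, 9), 12, 8),
]
monthly = monthly_activity(sample)
spikes = find_spikes(monthly)
```

With very few months, a standard-deviation threshold is coarse; over longer ranges a rolling median or a percentile cut-off may separate trends from spikes more reliably.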
Ensure Python 3.10+ is installed. Create and activate a virtual environment, then install dependencies:
# From the workspace root
python -m venv venv
& .\venv\Scripts\Activate.ps1 # Windows PowerShell
pip install -r requirements.txt
Launch the interactive extraction script and provide a repository path or URL and a date range:
python data-extraction.py
# Example inputs:
# Repository local path or Git URL: https://github.com/numpy/numpy
# Start date (YYYY-MM-DD): 2018-01-01
# End date (YYYY-MM-DD): 2024-12-31
Output: CSV files are saved under /data using the naming pattern:
{repo_name}_{YYYY-MM-DD}_to_{YYYY-MM-DD}_activity.csv
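The extraction step can be sketched roughly as follows. This is not the contents of data-extraction.py: the function names, the CSV column names, and the way the repository name is derived from the path or URL are assumptions. The PyDriller calls used (`Repository` with `since` and `to`, `traverse_commits()`, and the commit attributes `hash`, `author_date`, `insertions`, `deletions`) are part of PyDriller 2.x.

```python
import csv
import re
from datetime import datetime
from pathlib import Path


def output_path(repo, start, end, data_dir="data"):
    """Build the CSV path following the naming pattern described above."""
    # Assumption: the repo name is the last segment of the path or URL.
    last_segment = re.split(r"[\\/]", repo.rstrip("\\/"))[-1]
    repo_name = last_segment.removesuffix(".git")
    return Path(data_dir) / f"{repo_name}_{start}_to_{end}_activity.csv"


def extract_activity(repo, start, end, data_dir="data"):
    """Write one CSV row per commit with its hash, date, and churn."""
    from pydriller import Repository  # imported lazily; requires PyDriller

    since = datetime.strptime(start, "%Y-%m-%d")
    # Parsed as midnight, so commits later on the end date are excluded.
    to = datetime.strptime(end, "%Y-%m-%d")
    target = output_path(repo, start, end, data_dir)
    target.parent.mkdir(parents=True, exist_ok=True)  # create data/ if missing

    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # Column names are hypothetical.
        writer.writerow(["hash", "date", "lines_added", "lines_removed"])
        for commit in Repository(repo, since=since, to=to).traverse_commits():
            writer.writerow([
                commit.hash,
                commit.author_date.date().isoformat(),
                commit.insertions,
                commit.deletions,
            ])
    return target


example = output_path("https://github.com/numpy/numpy", "2018-01-01", "2024-12-31")
```

For the example inputs above, `example` points at `data/numpy_2018-01-01_to_2024-12-31_activity.csv`. Passing a remote URL makes PyDriller clone the repository to a temporary directory first, which can take a while for a project the size of NumPy.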
The analysis notebook analyzing_numpy_github_repository.ipynb serves as a case study based on the NumPy repository. It can be reused as a reference template for analyzing other repositories.
- /data: CSV files containing commits and change metrics.
- /plots: Charts and figures visualizing commits and churn over time for the NumPy repository.
- data-extraction.py: Extracts commit data using PyDriller and outputs CSVs.
- analyzing_numpy_github_repository.ipynb: Analysis notebook used as a case study on the NumPy repository. It serves as a reusable reference for analyzing commit activity and code churn in other repositories.
- data/: Stores extracted CSVs (auto-created if missing).
- plots/: Stores visualizations from analysis.
- requirements.txt: Python dependencies.
- README.md: Project documentation.
- Analyze churn at the file-type or directory level to identify hotspots.
- Create a web-based interface where the user can enter a repository and select a date range to generate a report with interactive visuals and LLM-powered explanations.
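As a starting point for the directory-level idea, the sketch below sums churn per top-level directory. The function name and the sample data are hypothetical. In practice the `(path, added, deleted)` tuples could be built from PyDriller's `commit.modified_files`, using each file's `new_path`, `added_lines`, and `deleted_lines`; the sketch assumes paths use forward slashes.

```python
from collections import Counter
from pathlib import PurePosixPath


def churn_by_directory(changes, depth=1):
    """Sum added + deleted lines per directory prefix of the given depth."""
    churn = Counter()
    for path, added, deleted in changes:
        parts = PurePosixPath(path).parts[:-1]  # drop the file name
        key = "/".join(parts[:depth]) or "."    # "." groups repo-root files
        churn[key] += added + deleted
    # Highest-churn directories first.
    return churn.most_common()


# Hypothetical per-file changes, not real NumPy data.
changes = [
    ("numpy/core/numeric.py", 30, 10),
    ("numpy/linalg/linalg.py", 5, 5),
    ("doc/source/index.rst", 8, 2),
    ("setup.py", 1, 1),
]
hotspots = churn_by_directory(changes)
```

Raising `depth` to 2 would split `numpy/core` from `numpy/linalg`; grouping by `PurePosixPath(path).suffix` instead would give the file-type view.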
- PyDriller – for mining Git repositories.
- pandas – for data processing and aggregation.
- plotly – for interactive visualizations.
- openai – for commit analysis.