GitHub Activity Analyzer: Commit & Code Churn Insights with PyDriller

Analyze GitHub repositories to track commits, code churn, trends, and spikes over time.

Overview

This project shows how PyDriller can be used to extract commit information from a GitHub repository and analyze it to better understand development activity.

The main research question guiding this work is: How does development activity evolve over time? Specifically, we aim to analyze how monthly commit counts and code churn (lines added or removed) fluctuate across a repository’s lifetime, highlighting long-term trends and notable spikes.

The approach is demonstrated on the NumPy repository, focusing on the last six months of 2025.

Getting Started

1. Install Python and dependencies

Ensure Python 3.10+ is installed. Create and activate a virtual environment, then install dependencies:

# From the workspace root
python -m venv venv
& .\venv\Scripts\Activate.ps1   # Windows PowerShell
# source venv/bin/activate      # macOS/Linux alternative
pip install -r requirements.txt

2. Run Data Extraction

Launch the interactive extraction script, then provide the repository path or URL and a date range when prompted:

python data-extraction.py
# Example inputs:
# Repository local path or Git URL: https://github.com/numpy/numpy
# Start date (YYYY-MM-DD): 2018-01-01
# End date   (YYYY-MM-DD): 2024-12-31

Output: CSV files are saved under data/ using the naming pattern:

{repo_name}_{YYYY-MM-DD}_to_{YYYY-MM-DD}_activity.csv
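As a rough illustration of the extraction step, the sketch below shows one way to build the output file name and write one CSV row per commit with PyDriller. It is not the contents of data-extraction.py: the function names `output_name` and `extract_activity` and the CSV column names are assumptions made for this example. The PyDriller import is deferred so the naming helper can be used on its own.

```python
import csv
from datetime import datetime
from pathlib import Path


def output_name(repo: str, start: datetime, end: datetime) -> str:
    """Build '{repo_name}_{start}_to_{end}_activity.csv' from a local path or Git URL."""
    # Normalize Windows separators, drop a trailing slash and a '.git' suffix.
    tail = repo.replace("\\", "/").rstrip("/").rsplit("/", 1)[-1]
    repo_name = tail.removesuffix(".git")
    return f"{repo_name}_{start:%Y-%m-%d}_to_{end:%Y-%m-%d}_activity.csv"


def extract_activity(repo: str, start: datetime, end: datetime,
                     out_dir: str = "data") -> Path:
    """Write one row per commit in [start, end] and return the CSV path."""
    # Imported here so output_name() works even without PyDriller installed.
    from pydriller import Repository

    out_path = Path(out_dir) / output_name(repo, start, end)
    out_path.parent.mkdir(parents=True, exist_ok=True)  # create data/ if missing

    with out_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        # Column names are an assumption for this sketch.
        writer.writerow(["hash", "author_date", "insertions", "deletions", "files"])
        for commit in Repository(repo, since=start, to=end).traverse_commits():
            writer.writerow([
                commit.hash,
                commit.author_date.isoformat(),
                commit.insertions,   # lines added in this commit
                commit.deletions,    # lines removed in this commit
                commit.files,        # number of files touched
            ])
    return out_path


if __name__ == "__main__":
    print(output_name("https://github.com/numpy/numpy",
                      datetime(2018, 1, 1), datetime(2024, 12, 31)))
```

Passing a remote URL makes PyDriller clone the repository to a temporary location first, so a local clone is usually faster for long date ranges.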

3. Analyze Data

The analysis notebook analyzing_numpy_github_repository.ipynb serves as a case study based on the NumPy repository. It can be reused as a reference template for analyzing other repositories.
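To make the analysis step concrete, here is a minimal sketch of how monthly commit counts and churn could be derived from an extracted CSV, and how spike months might be flagged. It uses only the standard library so it runs without extra dependencies; the notebook itself may well do this with pandas. The column names (`author_date`, `insertions`, `deletions`), the helper names, the mean-plus-standard-deviation spike rule, and the sample rows are all assumptions made for illustration, not data from NumPy.

```python
import csv
import io
from collections import defaultdict
from statistics import mean, pstdev


def monthly_activity(rows):
    """Return {'YYYY-MM': {'commits': n, 'churn': lines}} ordered by month."""
    months = defaultdict(lambda: {"commits": 0, "churn": 0})
    for row in rows:
        month = row["author_date"][:7]  # ISO 8601 dates start with YYYY-MM
        months[month]["commits"] += 1
        # Churn here means lines added plus lines removed.
        months[month]["churn"] += int(row["insertions"]) + int(row["deletions"])
    return dict(sorted(months.items()))


def spikes(series, threshold=1.5):
    """Flag months whose value exceeds mean + threshold * standard deviation."""
    values = list(series.values())
    if len(values) < 2:
        return []
    cutoff = mean(values) + threshold * pstdev(values)
    return [month for month, value in series.items() if value > cutoff]


# Synthetic rows standing in for a file under data/.
sample = io.StringIO(
    "hash,author_date,insertions,deletions\n"
    "a1,2024-01-05T10:00:00+00:00,10,2\n"
    "b2,2024-01-20T09:30:00+00:00,5,5\n"
    "c3,2024-02-11T14:00:00+00:00,3,1\n"
    "d4,2024-03-02T08:15:00+00:00,400,100\n"
)
summary = monthly_activity(csv.DictReader(sample))
churn = {month: stats["churn"] for month, stats in summary.items()}

print(summary)
print(spikes(churn, threshold=1.0))
```

With so few months in the sample, a threshold of 1.0 is used; on a multi-year history a stricter threshold or a rolling baseline is likely a better fit.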

Outputs

data/: CSV files containing commits and change metrics.
plots/: Charts and figures visualizing commits and churn over time for the NumPy repository.

Project Structure

data-extraction.py: Extracts commit data using PyDriller and outputs CSVs.
analyzing_numpy_github_repository.ipynb: Analysis notebook; a case study on the NumPy repository that doubles as a reference for analyzing commit activity and code churn in other repositories.
data/: Stores extracted CSVs (auto-created if missing).
plots/: Stores visualizations from analysis.
requirements.txt: Python dependencies.
README.md: Project documentation.

Future Work

Analyze churn at the file-type or directory level to identify hotspots.
Create a web-based interface where users enter a repository and select a date range to generate a report with interactive visuals and LLM-powered explanations.
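The hotspot idea in the first item could start from something like the sketch below, which sums churn per top-level directory or per file extension. Nothing here exists in the project yet: `churn_by_group` is a hypothetical helper and the paths are invented. In PyDriller, the input tuples could be gathered from each commit's `modified_files`, using `new_path` (falling back to `old_path` for deleted files), `added_lines`, and `deleted_lines`.

```python
from collections import Counter
from pathlib import PurePosixPath


def churn_by_group(changes, by="directory"):
    """Sum churn per top-level directory or per file extension.

    `changes` is an iterable of (path, added_lines, deleted_lines) tuples.
    Returns (group, churn) pairs, largest churn first.
    """
    totals = Counter()
    for path, added, deleted in changes:
        p = PurePosixPath(path)
        if by == "directory":
            # Files at the repository root have no parent directory.
            key = p.parts[0] if len(p.parts) > 1 else "(root)"
        else:
            key = p.suffix or "(no extension)"
        totals[key] += added + deleted
    return totals.most_common()


# Invented paths and line counts, for illustration only.
changes = [
    ("src/core/a.py", 10, 2),
    ("src/lib/b.py", 3, 1),
    ("docs/index.rst", 5, 0),
    ("setup.py", 1, 1),
]
print(churn_by_group(changes, by="directory"))
print(churn_by_group(changes, by="extension"))
```

Grouping on the first path component keeps the output small; a deeper grouping level would be a natural next parameter.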

Tools Used

PyDriller – for mining Git repositories.
pandas – for data processing and aggregation.
plotly – for interactive visualizations.
openai – for commit analysis.
