adimidania/github-activity-analyzer

GitHub Activity Analyzer: Commit & Code Churn Insights with PyDriller

Analyze GitHub repositories to track commits, code churn, trends, and spikes over time.


Overview

This project shows how PyDriller can be used to extract commit information from a GitHub repository and analyze it to better understand development activity.

The main research question guiding this work is: How does development activity evolve over time? Specifically, we aim to analyze how monthly commit counts and code churn (lines added or removed) fluctuate across a repository’s lifetime, highlighting long-term trends and notable spikes.
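As a rough illustration of the two metrics, the sketch below groups per-commit rows by calendar month using only the standard library. The column names (`hash`, `date`, `insertions`, `deletions`), the sample rows, and the definition of churn as lines added plus lines removed are assumptions made for this example; they are not taken from the project's scripts, and the notebook itself relies on pandas for this step.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime


def monthly_activity(rows):
    """Aggregate per-commit rows into {"YYYY-MM": {commits, added, removed, churn}}."""
    months = defaultdict(lambda: {"commits": 0, "added": 0, "removed": 0, "churn": 0})
    for row in rows:
        # Assumed column names; adjust them to match the extracted CSV.
        month = datetime.fromisoformat(row["date"]).strftime("%Y-%m")
        added, removed = int(row["insertions"]), int(row["deletions"])
        bucket = months[month]
        bucket["commits"] += 1
        bucket["added"] += added
        bucket["removed"] += removed
        # Churn is taken here as lines added plus lines removed (an assumption).
        bucket["churn"] += added + removed
    # Sort by month so the result reads chronologically.
    return dict(sorted(months.items()))


# Made-up sample standing in for an extracted CSV file.
SAMPLE_CSV = """hash,date,insertions,deletions
a1,2025-07-03T10:00:00+00:00,10,2
b2,2025-07-19T08:30:00+00:00,5,5
c3,2025-08-01T12:00:00+00:00,100,40
"""

if __name__ == "__main__":
    reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
    for month, stats in monthly_activity(reader).items():
        print(month, stats)
```

Months without any commits do not appear in the result, so a plot built from it would need to fill those gaps explicitly.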

The approach is demonstrated on the NumPy repository, focusing on the last six months of 2025.


Getting Started

1. Install Python and dependencies

Ensure Python 3.10+ is installed. Create and activate a virtual environment, then install dependencies:

# From the workspace root
python -m venv venv
& .\venv\Scripts\Activate.ps1   # Windows PowerShell
# source venv/bin/activate      # macOS/Linux alternative
pip install -r requirements.txt

2. Run Data Extraction

Launch the interactive extraction script, then provide the repository path or URL and a date range:

python data-extraction.py
# Example inputs:
# Repository local path or Git URL: https://github.com/numpy/numpy
# Start date (YYYY-MM-DD): 2018-01-01
# End date   (YYYY-MM-DD): 2024-12-31
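For readers curious what such an extraction might look like, here is a minimal sketch built on PyDriller's `Repository(...).traverse_commits()`. It is not the contents of data-extraction.py: the helper names (`parse_day`, `commit_rows`, `mine`) and the chosen output fields are hypothetical. PyDriller is imported lazily inside `mine` so that the flattening helper can be tried without the package installed.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def parse_day(text):
    """Parse a YYYY-MM-DD string, as entered at the prompts, into a datetime."""
    return datetime.strptime(text, "%Y-%m-%d")


def commit_rows(commits):
    """Flatten commit objects into one dict per commit (hypothetical field set)."""
    for commit in commits:
        yield {
            "hash": commit.hash,
            "date": commit.author_date.isoformat(),
            "author": commit.author.name,
            "insertions": commit.insertions,
            "deletions": commit.deletions,
            "files_changed": commit.files,
        }


def mine(repo, since, to):
    """Walk a local path or Git URL with PyDriller between two datetimes."""
    from pydriller import Repository  # third-party: pip install pydriller

    return list(commit_rows(Repository(repo, since=since, to=to).traverse_commits()))


if __name__ == "__main__":
    # Stand-in object with the same attribute names, so the demo needs no clone.
    fake = SimpleNamespace(
        hash="abc123",
        author_date=datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc),
        author=SimpleNamespace(name="Ada"),
        insertions=7,
        deletions=3,
        files=2,
    )
    print(list(commit_rows([fake])))
    # Real run (clones the repository when given a URL, which can take a while):
    # rows = mine("https://github.com/numpy/numpy",
    #             parse_day("2018-01-01"), parse_day("2024-12-31"))
```

Keeping the flattening step separate from the repository walk makes it possible to check the row format against small stand-in objects before mining a large history.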

Output: CSV files saved under data/ using the naming pattern:

{repo_name}_{YYYY-MM-DD}_to_{YYYY-MM-DD}_activity.csv
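One way to build a file name that follows this pattern is sketched below. The function name `activity_csv_path` and the rule for deriving `repo_name` (last segment of the path or URL, minus a trailing `.git`) are assumptions for illustration; the script may derive the name differently.

```python
from datetime import date
from pathlib import Path


def activity_csv_path(repo, start, end, out_dir="data"):
    """Return out_dir/{repo_name}_{start}_to_{end}_activity.csv for a path or URL."""
    # Use the last path or URL segment as the repository name (assumed rule).
    repo_name = repo.replace("\\", "/").rstrip("/").rsplit("/", 1)[-1]
    if repo_name.endswith(".git"):
        repo_name = repo_name[: -len(".git")]
    # date.isoformat() yields YYYY-MM-DD, matching the pattern above.
    filename = f"{repo_name}_{start.isoformat()}_to_{end.isoformat()}_activity.csv"
    return Path(out_dir) / filename


if __name__ == "__main__":
    path = activity_csv_path(
        "https://github.com/numpy/numpy", date(2018, 1, 1), date(2024, 12, 31)
    )
    print(path.as_posix())
```

Calling `path.parent.mkdir(parents=True, exist_ok=True)` before writing would create the output directory when it is missing.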

3. Analyze Data

The analysis notebook analyzing_numpy_github_repository.ipynb serves as a case study based on the NumPy repository. It can be reused as a reference template for analyzing other repositories.
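The overview mentions spotting notable spikes. A simple way to flag them, shown here as a sketch and not as the notebook's actual method, is to mark any month whose value exceeds the mean by more than `k` population standard deviations. The threshold rule, the default `k = 1.5`, and the monthly counts below are all made up for illustration.

```python
from statistics import mean, pstdev


def find_spikes(monthly_values, k=1.5):
    """Return the months whose value exceeds mean + k * population std deviation."""
    if not monthly_values:
        return []
    values = list(monthly_values.values())
    threshold = mean(values) + k * pstdev(values)
    return [month for month, value in monthly_values.items() if value > threshold]


# Made-up monthly commit counts, not real NumPy figures.
SAMPLE_COMMITS = {
    "2025-07": 100,
    "2025-08": 110,
    "2025-09": 95,
    "2025-10": 300,
    "2025-11": 105,
    "2025-12": 90,
}

if __name__ == "__main__":
    print(find_spikes(SAMPLE_COMMITS))
```

Because a large outlier inflates both the mean and the standard deviation, a median-based threshold may behave better on short or very uneven series.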


Outputs

  • data/: CSV files containing commits and change metrics.
  • plots/: Charts and figures visualizing commits and churn over time for the NumPy repository.

Project Structure

  • data-extraction.py: Extracts commit data using PyDriller and outputs CSVs.
  • analyzing_numpy_github_repository.ipynb: Analysis notebook used as a case study on the NumPy repository. It serves as a reusable reference for analyzing commit activity and code churn in other repositories.
  • data/: Stores extracted CSVs (auto-created if missing).
  • plots/: Stores visualizations from analysis.
  • requirements.txt: Python dependencies.
  • README.md: Project documentation.

Future Work

  • Analyze churn at the file-type or directory level to identify hotspots.
  • Create a web-based interface where the user can enter a repository and select a date range to generate a web-based report with interactive visuals and explanations powered by LLMs.
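The first item above could start from something like the following sketch, which sums churn per top-level directory and per file extension. The function name `churn_hotspots`, the `(path, added, deleted)` tuple format, and the sample paths are hypothetical; paths are assumed to use forward slashes.

```python
from collections import Counter
from pathlib import PurePosixPath


def churn_hotspots(changes, depth=1):
    """Sum churn (added + deleted lines) per directory prefix and per file extension."""
    by_dir, by_ext = Counter(), Counter()
    for path, added, deleted in changes:
        p = PurePosixPath(path)
        churn = added + deleted
        # Keep the first `depth` directory components; "." marks root-level files.
        directory = "/".join(p.parts[:-1][:depth]) or "."
        by_dir[directory] += churn
        by_ext[p.suffix or "(none)"] += churn
    return by_dir, by_ext


# Made-up file changes: (path, lines added, lines deleted).
SAMPLE_CHANGES = [
    ("numpy/core/foo.c", 10, 5),
    ("numpy/linalg/bar.py", 3, 1),
    ("doc/index.rst", 2, 0),
    ("setup.py", 1, 1),
]

if __name__ == "__main__":
    dirs, exts = churn_hotspots(SAMPLE_CHANGES)
    print(dirs.most_common())
    print(exts.most_common())
```

With PyDriller, the input tuples could plausibly be built from each commit's `modified_files`, using `new_path or old_path`, `added_lines`, and `deleted_lines`, after normalizing path separators.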

Tools Used

  • PyDriller – for mining Git repositories.
  • pandas – for data processing and aggregation.
  • plotly – for interactive visualizations.
  • openai – for commit analysis.
