josephkern/tool-py-scorecard

scorecard

A single-command code quality tool for Python projects. Runs 5 analysis tools, aggregates their output into a composite scorecard, detects regressions between runs, and emits prioritized remediation instructions.

$ python scorecard.py

 myproject scorecard                    score: 91/100 (A)
 8 files  3,219 SLOC  0 errors  0 type errors
 1 dead item  3 dup blocks (27 lines)  1 smell

 File                 MI    CC-max  CC-avg  Halstead   LOC  Issues
 db.py              A 33     B 8    A 2.1       412   245  —
 llm.py             A 50     A 5    A 1.8       318   270  —
 pipeline.py        B 18     B 10   A 4.7      1439   782  2 CC boundary, 1 smell
 batch.py           A 32     B 8    A 3.2       890   340  3 dup blocks

 Regressions since last run: none

 Remediation (3 items):
   1. [DUPLICATE] batch.py + pipeline.py: 3 duplicate blocks (27 lines)
      Action: Extract shared logic into pipeline helpers
   2. [SMELL] pipeline.py:run_hierarchy_refine  too-many-locals (25/15)
      Action: Extract variable clusters into NamedTuples
   3. [DEAD] pipeline.py:368 unused parameter 'batch_mode'
      Action: Remove parameter from signature

Why

Code quality tools exist (ruff, mypy, radon, pylint, vulture) but nothing ties them together. Running 5 commands, reading 5 outputs, and deciding what to fix first is tedious. This tool does it in one step and tells you exactly what to do.

It's also designed to be used by LLMs. The --json output and structured remediation items make it easy for Claude or other agents to self-assess code quality and act on the results. See program.md for LLM-specific instructions.

Install

Requires Python 3.11+ and uv.

git clone <repo-url> tools/tool-py-scorecard
cd tools/tool-py-scorecard
uv venv && uv pip install -e .

The tool installs its own dependencies (ruff, mypy, radon, pylint, vulture) into .venv/ so it doesn't pollute your project.

Usage

Run from your project root:

# Full scorecard
tools/tool-py-scorecard/.venv/bin/python tools/tool-py-scorecard/scorecard.py

# JSON output (for CI or programmatic use)
tools/tool-py-scorecard/.venv/bin/python tools/tool-py-scorecard/scorecard.py --json

# Only show regressions since last run
tools/tool-py-scorecard/.venv/bin/python tools/tool-py-scorecard/scorecard.py --diff

# Only show remediation items
tools/tool-py-scorecard/.venv/bin/python tools/tool-py-scorecard/scorecard.py --remediate

# Scope to specific files
tools/tool-py-scorecard/.venv/bin/python tools/tool-py-scorecard/scorecard.py --files src/db.py src/pipeline.py

# Run a single dimension
tools/tool-py-scorecard/.venv/bin/python tools/tool-py-scorecard/scorecard.py --dimension D7

By default, the scorecard scans all *.py files in the current directory (recursively), excluding test_*.py, .venv/, dev/, tools/, and scorecard.py itself.
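
The default file selection described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function name `discover_files` is hypothetical, and it assumes the excluded directories are skipped at any depth, which the README does not specify.

```python
from pathlib import Path

# Directory names skipped by default, per the rule above.
EXCLUDED_DIRS = {".venv", "dev", "tools"}


def discover_files(root: Path) -> list[Path]:
    """Return *.py files under root (relative paths), minus the default exclusions."""
    found = []
    for path in sorted(root.rglob("*.py")):
        rel = path.relative_to(root)
        # Assumption: an excluded directory is skipped wherever it appears in the path.
        if any(part in EXCLUDED_DIRS for part in rel.parts[:-1]):
            continue
        # Skip test modules and the tool's own entry point.
        if path.name.startswith("test_") or path.name == "scorecard.py":
            continue
        found.append(rel)
    return found
```

Passing --files bypasses this discovery and scopes the run to the listed paths.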

What it measures

Nine dimensions across three tiers, weighted by impact:

Tier 1 — Correctness (20%)

Dim  Name         Tool  Weight
D1   Lint errors  ruff  10%
D2   Type errors  mypy  10%

Tier 2 — Structure (30%)

Dim  Name                     Tool       Weight
D3   Maintainability Index    radon mi   10%
D6   File size / raw metrics  radon raw  5%
D7   Duplicate code           pylint     15%

Tier 3 — Cognitive load (50%)

Dim  Name                   Tool       Weight
D4   Cyclomatic complexity  radon cc   20%
D5   Halstead metrics       radon hal  15%
D8   Code smells            pylint     10%
D9   Dead code              vulture    5%

Complexity (D4) has the highest weight because it's the strongest predictor of bug density. Duplicates (D7) are weighted heavily because they cause divergent bugs — a fix applied to one copy but not the other.

Scoring

Each dimension produces a 0–100 score. The composite is the weighted average of those scores, using the weights listed under "What it measures", and maps to a letter grade:

Grade  Score   Meaning
A      90–100  Ship it
B      75–89   Acceptable, minor issues
C      50–74   Needs attention before new features
F      < 50    Stop and remediate

The exit code is 0 for scores >= 75 and 1 otherwise, so you can use it in CI gates.
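
As a rough sketch of how the composite, grade, and exit code fit together: the weights below are copied from the tier tables, while the function names and the per-dimension scores in the example are hypothetical. How the real tool rounds scores near a grade boundary is not stated here, so this sketch compares the unrounded value; see SPEC.md for the actual formulas.

```python
# Dimension weights in percent, copied from the tier tables.
WEIGHTS = {"D1": 10, "D2": 10, "D3": 10, "D4": 20, "D5": 15,
           "D6": 5, "D7": 15, "D8": 10, "D9": 5}


def composite(scores: dict[str, float]) -> float:
    """Weighted average of the per-dimension scores (each 0-100)."""
    return sum(scores[d] * w for d, w in WEIGHTS.items()) / sum(WEIGHTS.values())


def grade(score: float) -> str:
    """Map a composite score to the letter grades in the table above."""
    if score >= 90:
        return "A"
    if score >= 75:
        return "B"
    if score >= 50:
        return "C"
    return "F"


def exit_code(score: float) -> int:
    """0 for scores >= 75, 1 otherwise, matching the CI gate rule."""
    return 0 if score >= 75 else 1


# Hypothetical per-dimension scores, chosen only to illustrate the arithmetic.
example = {"D1": 100, "D2": 100, "D3": 100, "D4": 95, "D5": 100,
           "D6": 100, "D7": 60, "D8": 80, "D9": 100}
total = composite(example)
print(total, grade(total), exit_code(total))  # 91.0 A 0
```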

Regression detection

Each run saves a snapshot to results/.scorecard_history.json. The next run compares against it and flags:

  • Composite score dropped >= 5 points
  • Any dimension dropped >= 10 points
  • New file exceeding 500 SLOC
  • New duplicate blocks, new smells, or increased dead code
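
The four checks above could be expressed roughly as below. The snapshot layout (keys `composite`, `dimensions`, `files`, `dup_blocks`, `smells`, `dead_items`) is an assumption made for this sketch and is not the documented format of results/.scorecard_history.json. It also reads "new file" as a file absent from the previous snapshot and approximates "new duplicate blocks" and "new smells" by comparing counts.

```python
def find_regressions(prev: dict, curr: dict) -> list[str]:
    """Compare two snapshots (hypothetical layout) and list regression flags."""
    flags = []

    # Composite score dropped by 5 points or more.
    drop = prev["composite"] - curr["composite"]
    if drop >= 5:
        flags.append(f"composite dropped {drop:g} points")

    # Any single dimension dropped by 10 points or more.
    for dim, old in prev["dimensions"].items():
        new = curr["dimensions"].get(dim)
        if new is not None and old - new >= 10:
            flags.append(f"{dim} dropped {old - new:g} points")

    # A file that was not in the previous snapshot and exceeds 500 SLOC.
    for name, sloc in curr["files"].items():
        if name not in prev["files"] and sloc > 500:
            flags.append(f"new file {name} has {sloc} SLOC")

    # Counts that went up since the last run.
    for key in ("dup_blocks", "smells", "dead_items"):
        if curr[key] > prev[key]:
            flags.append(f"{key} increased from {prev[key]} to {curr[key]}")

    return flags
```

An empty list corresponds to the "Regressions since last run: none" line in the sample output.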

Remediation

The remediation section is the actionable output. Items are sorted by impact:

  1. DUPLICATE: Extract shared code into helpers (highest bug risk)
  2. SMELL: Refactor functions that trip pylint smell checks, such as too-many-locals
  3. COMPLEXITY: Split or simplify functions with high cyclomatic complexity
  4. DEAD: Remove unused code
  5. LINT: Fix style violations
  6. TYPE: Fix type errors

Each item includes the file, line number, what's wrong, what to do, and the expected score impact.
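
A minimal sketch of that ordering, assuming each item carries a category tag like the ones in the sample output. The `Item` class and its field names are hypothetical, and the expected score impact is left out because its format is not shown in this README.

```python
from dataclasses import dataclass

# Category order from the list above, highest priority first.
PRIORITY = ["DUPLICATE", "SMELL", "COMPLEXITY", "DEAD", "LINT", "TYPE"]


@dataclass
class Item:
    category: str   # one of PRIORITY
    location: str   # file, optionally with a line number or function name
    message: str    # what is wrong
    action: str     # what to do about it


def sort_items(items: list[Item]) -> list[Item]:
    """Order items by category priority; sorted() keeps ties in input order."""
    return sorted(items, key=lambda item: PRIORITY.index(item.category))


# Items taken from the sample output at the top of this README, shuffled.
items = [
    Item("DEAD", "pipeline.py:368", "unused parameter 'batch_mode'",
         "Remove parameter from signature"),
    Item("DUPLICATE", "batch.py + pipeline.py", "3 duplicate blocks (27 lines)",
         "Extract shared logic into pipeline helpers"),
    Item("SMELL", "pipeline.py:run_hierarchy_refine", "too-many-locals (25/15)",
         "Extract variable clusters into NamedTuples"),
]
for n, item in enumerate(sort_items(items), start=1):
    print(f"{n}. [{item.category}] {item.location} {item.message}")
```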

Graceful degradation

If a tool is missing, the scorecard skips that dimension and redistributes its weight across the remaining dimensions. It will never crash due to a missing dependency — it just warns and continues.
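
One plausible reading of "redistributes its weight" is proportional rescaling of the surviving dimensions, sketched below. The README does not say whether the real tool rescales proportionally or splits the weight evenly, so treat that choice as an assumption. The tool-to-dimension mapping is taken from the tier tables; the function name is hypothetical.

```python
# Dimension weights in percent and the tool behind each dimension,
# both copied from the tier tables.
WEIGHTS = {"D1": 10, "D2": 10, "D3": 10, "D4": 20, "D5": 15,
           "D6": 5, "D7": 15, "D8": 10, "D9": 5}
TOOL_DIMS = {
    "ruff": ["D1"],
    "mypy": ["D2"],
    "radon": ["D3", "D4", "D5", "D6"],
    "pylint": ["D7", "D8"],
    "vulture": ["D9"],
}


def redistribute(weights: dict[str, float], missing_tools: set[str]) -> dict[str, float]:
    """Drop dimensions whose tool is missing and rescale the rest to the same total.

    Assumption: the freed weight is shared in proportion to the original weights.
    """
    skipped = {dim for tool in missing_tools for dim in TOOL_DIMS[tool]}
    remaining = {d: w for d, w in weights.items() if d not in skipped}
    if not remaining:
        raise ValueError("no analysis tools available")
    scale = sum(weights.values()) / sum(remaining.values())
    return {d: w * scale for d, w in remaining.items()}


# Without radon, D3-D6 (50% of the weight) drop out and the rest doubles.
print(redistribute(WEIGHTS, {"radon"}))
```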

Integration

With Claude / LLMs: Copy this repo into tools/tool-py-scorecard/ in your project. This follows a lightweight convention where LLM tools live under tools/, each with their own program.md (the interface doc the LLM reads), their own .venv, and --json support. The LLM bootstraps the venv on first use and discovers tools by looking for tools/*/program.md. See program.md for the full instructions.

As a pre-commit hook: Run scorecard.py --json and fail on composite < threshold or D1/D2 errors.
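
A hook along those lines might look like the sketch below. The JSON keys `composite`, `lint_errors`, and `type_errors` are placeholders invented for this example; the real --json field names are defined in SPEC.md and should be substituted. The default threshold of 75 mirrors the exit-code rule but is otherwise arbitrary.

```python
import json


def gate(report_json: str, threshold: float = 75.0) -> list[str]:
    """Return the reasons a commit should be blocked (empty list means pass).

    The keys read here are hypothetical stand-ins for the real --json schema.
    """
    report = json.loads(report_json)
    failures = []
    if report["composite"] < threshold:
        failures.append(f"composite {report['composite']} is below {threshold}")
    if report["lint_errors"] > 0:
        failures.append(f"{report['lint_errors']} lint errors (D1)")
    if report["type_errors"] > 0:
        failures.append(f"{report['type_errors']} type errors (D2)")
    return failures
```

The hook script would feed the tool's --json output to `gate` and exit non-zero when the returned list is not empty.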

In CI: Use --json output and check the exit code.
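
For a CI job driven from Python, checking the exit code can be as small as the sketch below. The helper name `run_gate` is hypothetical; the command in the comment is the one from the Usage section and assumes the tools/ layout described there.

```python
import subprocess


def run_gate(cmd: list[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0


# Example invocation from a project root (paths as in the Usage section):
#
#   ok = run_gate([
#       "tools/tool-py-scorecard/.venv/bin/python",
#       "tools/tool-py-scorecard/scorecard.py",
#       "--json",
#   ])
#   raise SystemExit(0 if ok else 1)
```

Because the scorecard exits with 1 for scores below 75, most CI systems will fail the step without any wrapper; the helper is only useful when the result feeds further Python logic.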

Project structure

scorecard.py          # The tool (single file, all 9 dimensions)
pyproject.toml        # Dependencies and build config
program.md            # LLM-facing usage instructions
SPEC.md               # Detailed technical specification
README.md             # This file

Full specification

See SPEC.md for the complete technical spec including scoring formulas, thresholds, tool invocation details, and output format definitions.
