A single-command code quality tool for Python projects. Runs 5 analysis tools, aggregates their output into a composite scorecard, detects regressions between runs, and emits prioritized remediation instructions.
$ python scorecard.py
myproject scorecard score: 91/100 (A)
8 files 3,219 SLOC 0 errors 0 type errors
0 dead items 3 dup blocks (27 lines) 1 smell
File MI CC-max CC-avg Halstead LOC Issues
db.py A 33 B 8 A 2.1 412 245 —
llm.py A 50 A 5 A 1.8 318 270 —
pipeline.py B 18 B 10 A 4.7 1439 782 2 CC boundary, 1 smell
batch.py A 32 B 8 A 3.2 890 340 3 dup blocks
Regressions since last run: none
Remediation (3 items):
1. [DUPLICATE] batch.py + pipeline.py: 3 duplicate blocks (27 lines)
Action: Extract shared logic into pipeline helpers
2. [SMELL] pipeline.py:run_hierarchy_refine too-many-locals (25/15)
Action: Extract variable clusters into NamedTuples
3. [DEAD] pipeline.py:368 unused parameter 'batch_mode'
Action: Remove parameter from signature
Code quality tools exist (ruff, mypy, radon, pylint, vulture) but nothing ties them together. Running 5 commands, reading 5 outputs, and deciding what to fix first is tedious. This tool does it in one step and tells you exactly what to do.
It's also designed to be used by LLMs. The --json output and structured remediation items make it easy for Claude or other agents to self-assess code quality and act on the results. See program.md for LLM-specific instructions.
Requires Python 3.11+ and uv.
git clone <repo-url> tools/tool-py-scorecard
cd tools/tool-py-scorecard
uv venv && uv pip install -e .
The tool installs its own dependencies (ruff, mypy, radon, pylint, vulture) into .venv/ so it doesn't pollute your project.
Run from your project root:
# Full scorecard
tools/tool-py-scorecard/.venv/bin/python tools/tool-py-scorecard/scorecard.py
# JSON output (for CI or programmatic use)
tools/tool-py-scorecard/.venv/bin/python tools/tool-py-scorecard/scorecard.py --json
# Only show regressions since last run
tools/tool-py-scorecard/.venv/bin/python tools/tool-py-scorecard/scorecard.py --diff
# Only show remediation items
tools/tool-py-scorecard/.venv/bin/python tools/tool-py-scorecard/scorecard.py --remediate
# Scope to specific files
tools/tool-py-scorecard/.venv/bin/python tools/tool-py-scorecard/scorecard.py --files src/db.py src/pipeline.py
# Run a single dimension
tools/tool-py-scorecard/.venv/bin/python tools/tool-py-scorecard/scorecard.py --dimension D7
By default, the scorecard scans all *.py files in the current directory (recursively), excluding test_*.py, .venv/, dev/, tools/, and scorecard.py itself.
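The default file selection can be approximated with a short sketch. This is not the code in scorecard.py; `discover_files` and `EXCLUDED_DIRS` are hypothetical names, and it assumes the excluded directory names apply at any depth in the tree rather than only at the project root.

```python
from pathlib import Path

# Directory names taken from the default exclusions described above.
# Assumption: a directory with one of these names is skipped at any depth.
EXCLUDED_DIRS = {".venv", "dev", "tools"}


def discover_files(root: Path) -> list[Path]:
    """Return project-relative *.py paths under root, minus the default exclusions."""
    found = []
    for path in sorted(root.rglob("*.py")):
        rel = path.relative_to(root)
        # Skip anything nested inside an excluded directory.
        if any(part in EXCLUDED_DIRS for part in rel.parts[:-1]):
            continue
        # Skip test modules and the scorecard script itself.
        if rel.name.startswith("test_") or rel.name == "scorecard.py":
            continue
        found.append(rel)
    return found
```

Calling `discover_files(Path.cwd())` from the project root would list the files a default run is expected to cover; passing `--files` bypasses this selection.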
Nine dimensions across three tiers, weighted by impact:
Tier 1 — Correctness (20%)
| Dim | Name | Tool | Weight |
|---|---|---|---|
| D1 | Lint errors | ruff | 10% |
| D2 | Type errors | mypy | 10% |
Tier 2 — Structure (30%)
| Dim | Name | Tool | Weight |
|---|---|---|---|
| D3 | Maintainability Index | radon mi | 10% |
| D6 | File size / raw metrics | radon raw | 5% |
| D7 | Duplicate code | pylint | 15% |
Tier 3 — Cognitive load (50%)
| Dim | Name | Tool | Weight |
|---|---|---|---|
| D4 | Cyclomatic complexity | radon cc | 20% |
| D5 | Halstead metrics | radon hal | 15% |
| D8 | Code smells | pylint | 10% |
| D9 | Dead code | vulture | 5% |
Complexity (D4) carries the highest single weight because the scorecard treats it as the strongest predictor of bug density. Duplicates (D7) are weighted heavily because they cause divergent bugs: a fix applied to one copy but not the other.
Each dimension produces a 0–100 score. The composite is the weighted average of those scores, using the weights above, and maps to a letter grade:
| Grade | Score | Meaning |
|---|---|---|
| A | 90–100 | Ship it |
| B | 75–89 | Acceptable, minor issues |
| C | 50–74 | Needs attention before new features |
| F | < 50 | Stop and remediate |
The exit code is 0 for scores >= 75 and 1 otherwise, so you can use it in CI gates.
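The weighting, grading, and exit-code rules above fit in a few lines. The sketch below uses the weights from the dimension tables. The function names are hypothetical, and comparing the unrounded composite against the grade boundaries is an assumption; SPEC.md has the actual formulas.

```python
# Weights in percent, copied from the three tier tables.
WEIGHTS = {
    "D1": 10, "D2": 10,                      # Tier 1: correctness
    "D3": 10, "D6": 5, "D7": 15,             # Tier 2: structure
    "D4": 20, "D5": 15, "D8": 10, "D9": 5,   # Tier 3: cognitive load
}


def composite(scores: dict[str, float]) -> float:
    """Weighted average of the 0-100 dimension scores that are present."""
    total_weight = sum(WEIGHTS[dim] for dim in scores)
    return sum(scores[dim] * WEIGHTS[dim] for dim in scores) / total_weight


def grade(score: float) -> str:
    """Map a composite score to the letter grades in the table."""
    if score >= 90:
        return "A"
    if score >= 75:
        return "B"
    if score >= 50:
        return "C"
    return "F"


def exit_code(score: float) -> int:
    """0 for scores >= 75 (grade A or B), 1 otherwise."""
    return 0 if score >= 75 else 1
```

For example, a project scoring 100 on every dimension except a 50 on D4 has a composite of (100 × 80 + 50 × 20) / 100 = 90, which is still an A and exits 0.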
Each run saves a snapshot to results/.scorecard_history.json. The next run compares against it and flags:
- Composite score dropped >= 5 points
- Any dimension dropped >= 10 points
- New file exceeding 500 SLOC
- New duplicate blocks, new smells, or increased dead code
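A minimal sketch of these four checks, assuming a hypothetical snapshot layout (`composite`, `dimensions`, `sloc`, `dup_blocks`, `smells`, `dead_items`); the real layout of results/.scorecard_history.json may differ. It reads "new file exceeding 500 SLOC" as any file above 500 SLOC now that was not above it in the previous snapshot, and it approximates "new" duplicates and smells by comparing counts.

```python
def find_regressions(prev: dict, curr: dict) -> list[str]:
    """Compare two snapshots and return human-readable regression flags."""
    flags = []

    # Composite score dropped by 5 points or more.
    drop = prev["composite"] - curr["composite"]
    if drop >= 5:
        flags.append(f"composite dropped {drop:g} points")

    # Any single dimension dropped by 10 points or more.
    for dim, old in prev["dimensions"].items():
        new = curr["dimensions"].get(dim)
        if new is not None and old - new >= 10:
            flags.append(f"{dim} dropped {old - new:g} points")

    # Files that crossed the 500 SLOC threshold since the last run.
    for name, sloc in curr["sloc"].items():
        if sloc > 500 and prev["sloc"].get(name, 0) <= 500:
            flags.append(f"{name} now exceeds 500 SLOC ({sloc})")

    # Counts that went up: duplicate blocks, smells, dead code items.
    for key in ("dup_blocks", "smells", "dead_items"):
        if curr[key] > prev[key]:
            flags.append(f"{key} increased from {prev[key]} to {curr[key]}")

    return flags
```

An empty list corresponds to the "Regressions since last run: none" line in the sample output.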
The remediation section is the actionable output. Items are sorted by impact:
- DUPLICATE — Extract shared code into helpers (highest bug risk)
- SMELL — Refactor functions exceeding complexity thresholds
- COMPLEXITY — Split or simplify high-CC functions
- DEAD — Remove unused code
- LINT — Fix style violations
- TYPE — Fix type errors
Each item includes the file, line number, what's wrong, what to do, and the expected score impact.
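The ordering can be sketched as a sort on category rank. The `Item` class and its field names are hypothetical stand-ins for whatever scorecard.py uses internally, and breaking ties within a category by expected score impact is an assumption.

```python
from dataclasses import dataclass

# Category order as listed above, highest priority first.
CATEGORY_ORDER = ["DUPLICATE", "SMELL", "COMPLEXITY", "DEAD", "LINT", "TYPE"]


@dataclass
class Item:
    category: str   # one of CATEGORY_ORDER
    location: str   # file, optionally with line number or function name
    problem: str    # what's wrong
    action: str     # what to do
    impact: float   # expected composite-score gain (hypothetical field)


def prioritize(items: list[Item]) -> list[Item]:
    """Sort by category rank, then by expected impact (largest first)."""
    return sorted(
        items,
        key=lambda it: (CATEGORY_ORDER.index(it.category), -it.impact),
    )
```

This matches the sample output near the top of this README, where the DUPLICATE item is listed before the SMELL item and the DEAD item comes last.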
If a tool is missing, the scorecard skips that dimension and redistributes its weight across the remaining dimensions. It will never crash due to a missing dependency — it just warns and continues.
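One way to redistribute weight is proportionally, so the remaining dimensions keep their relative importance. The sketch below assumes that scheme; the source does not say whether the redistribution is proportional or equal, and `effective_weights` is a hypothetical name. The tool mapping comes from the dimension tables.

```python
# Weights in percent and the tool behind each dimension, from the tables above.
WEIGHTS = {
    "D1": 10, "D2": 10, "D3": 10, "D4": 20, "D5": 15,
    "D6": 5, "D7": 15, "D8": 10, "D9": 5,
}
TOOL_FOR_DIM = {
    "D1": "ruff", "D2": "mypy",
    "D3": "radon", "D4": "radon", "D5": "radon", "D6": "radon",
    "D7": "pylint", "D8": "pylint",
    "D9": "vulture",
}


def effective_weights(missing_tools: set[str]) -> dict[str, float]:
    """Drop dimensions whose tool is missing and rescale the rest to 100%."""
    kept = {
        dim: weight
        for dim, weight in WEIGHTS.items()
        if TOOL_FOR_DIM[dim] not in missing_tools
    }
    total = sum(kept.values())
    if total == 0:
        return {}  # nothing left to score
    # Assumption: proportional rescaling, not an equal split.
    return {dim: 100 * weight / total for dim, weight in kept.items()}
```

With vulture missing, D9 is skipped and the other eight weights are scaled by 100/95, so D4 rises from 20% to about 21.05%.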
With Claude / LLMs: Copy this repo into tools/tool-py-scorecard/ in your project. This follows a lightweight convention where LLM tools live under tools/, each with its own program.md (the interface doc the LLM reads), its own .venv, and --json support. The LLM bootstraps the venv on first use and discovers tools by looking for tools/*/program.md. See program.md for the full instructions.
As a pre-commit hook: Run scorecard.py --json and fail the commit if the composite score is below your threshold or if D1 (lint) or D2 (type) reports any errors.
In CI: Use --json output and check the exit code.
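A hook or CI step along these lines could wrap the tool as sketched below. The JSON keys used here (`composite`, `dimensions`, `errors`) are assumptions made for illustration, not the documented output format, so check SPEC.md for the real field names before relying on them. `run_scorecard` and `gate` are hypothetical helper names.

```python
import json
import subprocess


def run_scorecard(python: str, script: str) -> tuple[int, dict]:
    """Run the scorecard with --json and return (exit code, parsed report)."""
    # No check=True: the tool exits 1 for scores below 75, which is not a crash.
    proc = subprocess.run(
        [python, script, "--json"], capture_output=True, text=True
    )
    return proc.returncode, json.loads(proc.stdout)


def gate(report: dict, threshold: float = 75.0) -> list[str]:
    """Return the reasons to block a commit; an empty list means pass."""
    reasons = []
    # Hypothetical key: overall 0-100 score.
    if report["composite"] < threshold:
        reasons.append(f"composite {report['composite']} is below {threshold}")
    # Hypothetical keys: per-dimension error counts for lint (D1) and types (D2).
    for dim in ("D1", "D2"):
        errors = report["dimensions"][dim]["errors"]
        if errors:
            reasons.append(f"{dim} reports {errors} error(s)")
    return reasons
```

In CI the exit code alone is enough for the default gate at 75; parsing the JSON is only needed for a custom threshold or for per-dimension rules such as the D1/D2 check.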
scorecard.py # The tool (single file, all 9 dimensions)
pyproject.toml # Dependencies and build config
program.md # LLM-facing usage instructions
SPEC.md # Detailed technical specification
README.md # This file
See SPEC.md for the complete technical spec including scoring formulas, thresholds, tool invocation details, and output format definitions.