Token Efficiency Ratio (TER) calculator for Claude Code sessions. Measures how efficiently an AI coding agent uses its token budget by classifying output token spans as aligned (contributing to the task) or waste (redundant reasoning, unnecessary tool calls, over-explanation), and surfaces the economics of each session -- cost, cache efficiency, context growth, and where waste concentrates. Supports grouped analysis of parent + subagent sessions and input-side analysis including prompt redundancy, intent drift, and prompt-response alignment.
Every Claude Code session consumes tokens across two axes:
- Output tokens -- what the model generates (thinking, tool calls, responses). This is where waste happens: reasoning loops, duplicate tool calls, verbose explanations.
- Input tokens -- context sent to the model each turn (prompt, history, tool results, cached context). This is where cost accumulates and context bloat appears.
TER bridges both. The core score measures output quality (0-1, where 1 = every token was aligned with user intent). The economics layer reveals input cost structure, cache efficiency, and whether your context is growing faster than it should.
| Signal | What it means | What to do |
|---|---|---|
| Low TER (< 0.7) | Model is generating significant waste | Check which phase (reasoning/tool/generation) is dragging the score down |
| High waste % in tool use | Duplicate or unnecessary tool calls | Improve prompt specificity, reduce ambiguity |
| Low generation TER | Over-explanation or context restatement | The model is being verbose or repeating itself |
| Low cache hit rate | Prompt caching not effective | Restructure prompts for cacheability |
| Context bloat detected | Input tokens growing super-linearly | Break long sessions into smaller tasks |
| Late-session waste | Positional TER drops toward end | Session may be too long; the model is losing focus |
| Early-session waste | Waste concentrated at start | Prompt was unclear; the model needed iterations to understand intent |
| High prompt redundancy | User is repeating similar asks | Consolidate requests, be more specific upfront |
| Convergent drift | User keeps refining the same request | Model may not be addressing the core ask |
| Low prompt-response alignment | Responses don't match what was asked | Rephrase prompts or check if model is going off-track |
| Bash anti-patterns | Model using Bash where dedicated tools exist | Configure hooks or instructions to prefer Read/Grep/Glob |
| Failed tool retries | Tool calls failing and being retried | Check for incorrect paths, permissions, or assumptions |
| Edit fragmentation | Many sequential edits to the same file | Model should batch changes into fewer operations |
The Python package lives in the TER/ subdirectory of this repository (where pyproject.toml is). From the repo root:
cd TER
pip install -e .
For development:
pip install -e ".[dev]"
ter analyze path/to/session.jsonl
When a session spawns subagents, use --group to analyze the entire run together:
ter analyze path/to/session.jsonl --group
This discovers subagent sessions automatically from the filesystem layout (SESSION_ID/subagents/*.jsonl), analyzes each one, and reports token-weighted aggregate TER, total cost, and per-session breakdown.
ter analyze path/to/session.jsonl --format json
ter compare session1.jsonl session2.jsonl --sort ter
When you have exactly two session files (e.g. before and after a rules change), --baseline produces a Markdown delta table:
ter compare before.jsonl after.jsonl --baseline
The baseline comparison uses the same default thresholds as a plain ter analyze (see docs/TER_GOAL_AND_CHANGES.md for extending this).
ter list
ter list ~/.claude/projects/
Sessions with subagents show the count (e.g. SESSION_ID (128.5 KB, 6 subagents)). Subagent files are hidden from the listing.
ter report path/to/session.jsonl
ter report path/to/session.jsonl -o report.md
Prints a Markdown one-pager to stdout, or writes it to a file with -o / --output. Content includes TER, waste %, cost, output calibration ratio, cache, positional TER, top structural patterns, and suggested next steps. Same analysis pipeline and flags as analyze (except --format / --group).
ter analyze <path>
--format text|json Output format (default: text)
--similarity-threshold Cosine similarity threshold (default: 0.40)
--confidence-threshold Classifier confidence threshold (default: 0.75)
--restatement-threshold Context restatement threshold (default: 0.85)
--phase-weights r,t,g Phase weights (default: 0.3,0.4,0.3)
--no-waste-patterns Skip waste pattern detection
--cost-model MODEL Pricing: 'sonnet' (default) or 'input,output,cache_read,cache_write'
--group Include subagent sessions in grouped analysis
--no-input-analysis Disable input analysis (token breakdown, drift, alignment)
--prompt-similarity-threshold Cosine similarity for flagging redundant prompts (default: 0.75)
ter compare <paths_or_dirs...>
--format text|json
--sort ter|tokens|waste
--baseline Exactly two .jsonl files: before/after Markdown delta
Accepts directories (expands to all *.jsonl files inside)
ter list [path]
--format text|json
--limit N
ter report <path>
-o, --output FILE Write Markdown to FILE instead of stdout
(same threshold/cost flags as analyze)
Sample sessions are included in sample_sessions/. Run TER against them to see what the output looks like:
# Analyze a single session
ter analyze sample_sessions/b1a1450c-b006-40fe-8f9c-f15622a94324.jsonl
# Grouped analysis (parent + subagents)
ter analyze sample_sessions/b1a1450c-b006-40fe-8f9c-f15622a94324.jsonl --group
# Compare all sessions in a directory
ter compare sample_sessions/ --sort ter
# JSON output for programmatic use
ter analyze sample_sessions/b1a1450c-b006-40fe-8f9c-f15622a94324.jsonl --format json
TER Report: b1a1450c...
════════════════════════════════════════
TER: 0.97 | Waste: 7.5% | Cost: $2.45 | Waste $: $0.06
Drift: stable | Alignment: 0.62 | Redundancy: 0% | User: 12%
Phases: Reasoning Tool Use Generation
1.00 0.92 1.00
Output Tokens: 52,497 (aligned: 48,553 waste: 3,944)
Input: 7,700 Cache Read: 3,239,712 Cache Hit: 99.8%
Context Growth: 5.7x over 49 turns [BLOAT]
Positional TER: 1.00 (early) / 0.66 (mid) / 0.89 (late)
Waste Breakdown:
Source Tokens % Cost Count
Duplicate Tool Calls 300 50% $0.0045 2
Redundant Reasoning 200 33% $0.0030 3
Over-Explanation 100 17% $0.0015 1
Total 600 100% $0.0090
Group Analysis: b1a1450c...
══════════════════════════════════════════════════════
TER: 0.94 | Waste: 8.2% | Cost: $15.30 | Waste $: $0.42
Sessions: 1 parent + 6 subagent(s) | Tokens: 312,450
Role Session TER Waste% Tokens Cost Waste $ Patterns
parent b1a1450c... 0.97 7.5 52,497 $2.45 $0.06 1
agent agent-001 0.92 10.1 48,210 $2.10 $0.08 2
agent agent-002 0.95 6.3 44,800 $1.95 $0.05 0
...
Total 0.94 8.2 312,450 $15.30 $0.42
TER Comparison
════════════════════════════════════════
# Session TER Waste% Cache% Cost Waste $ Patterns
1 64948793... 0.99 2.7 100% $10.20 $0.11 0
2 b1a1450c... 0.97 7.5 100% $2.45 $0.06 1
3 a3b73c37... 0.94 9.5 100% $8.25 $0.42 3
4 ff410fa9... 0.88 5.9 100% $10.47 $0.14 2
5 3331fd66... 0.83 15.2 100% $7.46 $0.31 5
Average TER: 0.92 | Total Cost: $38.84 | Total Waste: $1.04
TER is computed per-phase and combined with configurable weights:
| Phase | What it covers | Default Weight |
|---|---|---|
| Reasoning | Thinking blocks, planning | 0.3 |
| Tool Use | Tool calls and results | 0.4 |
| Generation | Text responses to the user | 0.3 |
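
As a rough illustration of how per-phase scores might combine, the sketch below assumes a plain weighted average using the default weights from the table. The function name `combine_phase_ter` is hypothetical and the exact combination rule inside the package may differ.

```python
def combine_phase_ter(phase_ter, weights):
    """Weighted average of per-phase TER scores (assumed combination rule)."""
    total_weight = sum(weights[phase] for phase in phase_ter)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted = sum(phase_ter[phase] * weights[phase] for phase in phase_ter)
    return weighted / total_weight


# Default weights from the table above (reasoning, tool use, generation).
DEFAULT_WEIGHTS = {"reasoning": 0.3, "tool_use": 0.4, "generation": 0.3}

# Per-phase scores taken from the sample report in this README.
sample = {"reasoning": 1.00, "tool_use": 0.92, "generation": 1.00}
print(round(combine_phase_ter(sample, DEFAULT_WEIGHTS), 2))  # 0.97
```

Under this assumption the sample report's phase scores reproduce its headline TER of 0.97.
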
A span is aligned by default. It is only classified as waste when a specific signal fires:
- Self-repetition -- duplicates a recent same-phase span (cosine similarity >= 0.88)
- Filler reasoning -- very low relevance (< 0.10) and fewer than 15 words
- Verbose generation -- extremely low relevance (< 0.08) and more than 50 words
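
The three signals above can be read as a small decision function. The sketch below is illustrative only: the `Span` fields and the `classify` function are hypothetical names, and it assumes the filler rule applies to reasoning spans and the verbose rule to generation spans, as their names suggest.

```python
from dataclasses import dataclass

# Thresholds quoted from the list above.
SELF_REPETITION_SIM = 0.88
FILLER_RELEVANCE, FILLER_MAX_WORDS = 0.10, 15
VERBOSE_RELEVANCE, VERBOSE_MIN_WORDS = 0.08, 50


@dataclass
class Span:
    phase: str             # "reasoning", "tool_use", or "generation"
    text: str
    relevance: float       # similarity to the intent vector (hypothetical field)
    max_recent_sim: float  # highest similarity to a recent same-phase span


def classify(span):
    """Return 'aligned' unless one of the three waste signals fires."""
    words = len(span.text.split())
    if span.max_recent_sim >= SELF_REPETITION_SIM:
        return "waste:self_repetition"
    if span.phase == "reasoning" and span.relevance < FILLER_RELEVANCE and words < FILLER_MAX_WORDS:
        return "waste:filler_reasoning"
    if span.phase == "generation" and span.relevance < VERBOSE_RELEVANCE and words > VERBOSE_MIN_WORDS:
        return "waste:verbose_generation"
    return "aligned"
```

Because aligned is the fall-through case, a span with low relevance but a moderate word count is still counted as aligned, which matches the "aligned by default" stance.
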
Structural and behavioural patterns detected across the session:
- Reasoning loops -- 3+ consecutive redundant reasoning spans
- Duplicate tool calls -- identical tool invocations within a 5-step window
- Context restatement -- response text repeating prior responses (similarity > 0.85)
- Repetitive reads -- same file read 3+ times (first read necessary, rest redundant)
- Edit fragmentation -- 3+ consecutive edits to the same file (could be batched)
- Bash anti-patterns -- Bash commands that should use dedicated tools (`cat` → Read, `grep`/`rg` → Grep, `find` → Glob, `head`/`tail` → Read)
- Failed tool retries -- tool calls that error and are retried (wasted tokens on the failed attempt + error result)
- Repeated commands -- same Bash command run 3+ times (normalized to ignore trailing `| tail -N` / `| head -N`)
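
To make one of these detectors concrete, here is a minimal sketch of duplicate tool call detection. It assumes a call is a `(tool_name, arguments)` pair, that "identical" means same name and same arguments, and that "within a 5-step window" means at most five steps after the previous identical call. `find_duplicate_tool_calls` is a hypothetical name, not the package's API.

```python
import json


def find_duplicate_tool_calls(calls, window=5):
    """Return indices of calls repeating an identical call made within `window` steps."""
    last_seen = {}   # call key -> index of its most recent occurrence
    duplicates = []
    for index, (name, args) in enumerate(calls):
        # Serialize arguments with sorted keys so dict ordering does not matter.
        key = (name, json.dumps(args, sort_keys=True))
        previous = last_seen.get(key)
        if previous is not None and index - previous <= window:
            duplicates.append(index)
        last_seen[key] = index
    return duplicates


calls = [
    ("Read", {"path": "a.py"}),
    ("Grep", {"pattern": "TODO"}),
    ("Read", {"path": "a.py"}),  # repeats step 0 within the window
]
print(find_duplicate_tool_calls(calls))  # [2]
```
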
Surfaces the actual API token usage data from each session:
- Input/Output tokens -- real token counts from API usage (not heuristic estimates)
- Cache hit rate -- `cache_read / (cache_read + input_tokens)` -- measures prompt caching effectiveness
- Estimated cost -- USD estimate using configurable per-MTok rates (Sonnet defaults: $3 input, $15 output, $0.30 cache read, $3.75 cache write)
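
The cache hit rate formula and the per-MTok pricing above translate directly into arithmetic. The sketch below uses the Sonnet default rates quoted in this README; the function names and the `SONNET_RATES` dictionary are illustrative, not the package's API.

```python
# USD per million tokens, as quoted above for the Sonnet defaults.
SONNET_RATES = {"input": 3.00, "output": 15.00, "cache_read": 0.30, "cache_write": 3.75}


def cache_hit_rate(cache_read, input_tokens):
    """cache_read / (cache_read + input_tokens), or 0.0 when both are zero."""
    denominator = cache_read + input_tokens
    return cache_read / denominator if denominator else 0.0


def estimate_cost(input_tokens, output_tokens, cache_read, cache_write, rates=SONNET_RATES):
    """Estimated USD cost from token counts and per-MTok rates."""
    total = (
        input_tokens * rates["input"]
        + output_tokens * rates["output"]
        + cache_read * rates["cache_read"]
        + cache_write * rates["cache_write"]
    )
    return total / 1_000_000


# Input and cache-read counts from the sample report in this README.
print(f"{cache_hit_rate(3_239_712, 7_700):.1%}")  # 99.8%
```
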
Splits classified spans into thirds (early/mid/late) and computes TER per segment. Reveals whether waste concentrates early (unclear prompts) or late (session fatigue / context overload).
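
A minimal sketch of the positional split, assuming spans are divided into thirds by span count and that each segment's TER is the token-weighted aligned share. Both choices are assumptions, and `positional_ter` is a hypothetical name.

```python
def positional_ter(spans):
    """spans: list of (token_count, is_aligned) tuples in session order."""
    n = len(spans)
    bounds = [0, n // 3, 2 * n // 3, n]
    result = {}
    for label, start, end in zip(("early", "mid", "late"), bounds, bounds[1:]):
        segment = spans[start:end]
        total = sum(tokens for tokens, _ in segment)
        aligned = sum(tokens for tokens, ok in segment if ok)
        # None marks an empty segment (fewer than three spans overall).
        result[label] = aligned / total if total else None
    return result


spans = [(100, True), (100, True), (100, False), (100, True), (50, True), (50, True)]
print(positional_ter(spans))  # {'early': 1.0, 'mid': 0.5, 'late': 1.0}
```
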
Analyzes the user side of the conversation:
- Token breakdown -- classifies all tokens by origin (user prompt text, tool results, model reasoning, tool calls, generation) and computes the user/model ratio
- Prompt redundancy -- pairwise cosine similarity between user prompts; flags near-duplicate asks above a threshold (default 0.75) and reports a redundancy score
- Intent drift -- tracks how user intent evolves between consecutive prompts: convergent (refining same ask), divergent (new topic), or evolving (gradual shift). Reports overall trajectory (stable/convergent/divergent/mixed)
- Prompt-response alignment -- measures how well each model response matches its triggering prompt. Low alignment indicates the model went off-track
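
The prompt redundancy check can be sketched as pairwise cosine similarity over prompt embeddings. The example below takes embeddings as plain lists of floats so it runs without a model download. The function names are hypothetical, and the redundancy score shown here (the fraction of prompts that near-duplicate an earlier prompt) is an assumed definition.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def redundant_prompt_pairs(embeddings, threshold=0.75):
    """Index pairs (i, j), i < j, whose similarity meets the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(embeddings), 2)
        if cosine(a, b) >= threshold
    ]


def redundancy_score(embeddings, threshold=0.75):
    """Fraction of prompts that near-duplicate an earlier prompt (assumed definition)."""
    if not embeddings:
        return 0.0
    repeated = {j for _, j in redundant_prompt_pairs(embeddings, threshold)}
    return len(repeated) / len(embeddings)


toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy 2-dim vectors, not real embeddings
print(redundant_prompt_pairs(toy))  # [(0, 1)]
```
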
Tracks total context size (input + cache read tokens) per turn. Detects:
- Growth rate -- ratio of final context size to first meaningful turn
- Super-linear growth -- context accelerating faster than linear (via second differences)
- Context bloat -- flagged when growth is both super-linear and > 2x
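
The growth checks above can be sketched with first and second differences of the per-turn context sizes. This version assumes the "first meaningful turn" is the first turn with a non-zero context size and that growth counts as super-linear when the mean second difference is positive. Both are assumptions, and `analyze_context_growth` is a hypothetical name.

```python
def analyze_context_growth(context_sizes, bloat_ratio=2.0):
    """context_sizes: per-turn totals of input + cache read tokens."""
    sizes = [size for size in context_sizes if size > 0]
    if len(sizes) < 3:
        # Too few turns to compute a second difference.
        return {"growth": None, "super_linear": False, "bloat": False}
    growth = sizes[-1] / sizes[0]
    first_diffs = [b - a for a, b in zip(sizes, sizes[1:])]
    second_diffs = [b - a for a, b in zip(first_diffs, first_diffs[1:])]
    super_linear = sum(second_diffs) / len(second_diffs) > 0
    return {
        "growth": growth,
        "super_linear": super_linear,
        "bloat": super_linear and growth > bloat_ratio,
    }


print(analyze_context_growth([1000, 2000, 4000, 8000]))
# {'growth': 8.0, 'super_linear': True, 'bloat': True}
```

A session that grows by a constant amount each turn has zero second differences, so it is not flagged even when its growth ratio exceeds 2x.
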
When --group is used, TER discovers subagent sessions from the filesystem and analyzes the full run:
- Token-weighted TER -- aggregate TER weighted by each session's token count, so large sessions dominate appropriately
- Total cost/waste -- summed across parent + all subagents
- Per-session breakdown -- each session shown with role (parent/agent), individual TER, waste%, cost, and pattern counts
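
Token weighting means each session contributes to the aggregate in proportion to its token count. A minimal sketch, with `token_weighted_ter` as a hypothetical name:

```python
def token_weighted_ter(sessions):
    """sessions: list of (ter, token_count) pairs for the parent and each subagent."""
    total_tokens = sum(tokens for _, tokens in sessions)
    if total_tokens == 0:
        return 0.0
    return sum(ter * tokens for ter, tokens in sessions) / total_tokens


# A small, clean session barely moves the aggregate set by a large, wasteful one.
print(token_weighted_ter([(1.0, 100), (0.5, 300)]))  # 0.625
```
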
- Load -- parses Claude Code JSONL session files, deduplicates streaming entries by `requestId`
- Segment -- splits content blocks into token spans, assigns phases by block type
- Intent extraction -- embeds user prompts using `all-MiniLM-L6-v2` (384-dim) to create an intent vector
- Classification -- embeds each span, checks for self-repetition, applies phase-specific heuristics
- TER computation -- calculates aligned/total ratio per phase, combines with weights
- Waste detection -- scans for structural patterns (loops, duplicates, restatement, bash anti-patterns, failed retries, repeated commands, repetitive reads, edit fragmentation)
- Economics -- aggregates real API token usage, computes cache efficiency, cost, positional TER, and context growth
- Input analysis -- token breakdown by origin, prompt redundancy, intent drift, prompt-response alignment
- Grouping -- discovers and analyzes subagent sessions, computes token-weighted aggregates
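
As an illustration of the Load step, the sketch below deduplicates streamed JSONL entries by `requestId`. It assumes `requestId` is a top-level key and that a later entry supersedes an earlier one with the same id. The real loader may merge entries differently, and `load_deduplicated` is a hypothetical name.

```python
import json


def load_deduplicated(lines):
    """Parse JSONL lines, keeping the last entry seen for each requestId."""
    entries = []
    index_by_request = {}  # requestId -> position in `entries`
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        request_id = entry.get("requestId")
        if request_id is None:
            # Entries without a requestId are kept as-is.
            entries.append(entry)
        elif request_id in index_by_request:
            # A later streamed entry replaces the earlier one in place.
            entries[index_by_request[request_id]] = entry
        else:
            index_by_request[request_id] = len(entries)
            entries.append(entry)
    return entries
```

Replacing in place keeps each entry at the position where its request first appeared, so turn order is preserved for the later pipeline stages.
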
src/ter_calculator/
cli.py CLI (analyze, report, compare, list; --group)
analyze_pipeline.py Shared full-session analysis (analyze + report)
config_parse.py Cost model & phase-weight parsing
session_report.py Markdown report + baseline delta formatting
loader.py JSONL parsing, deduplication, span segmentation, subagent discovery
intent.py Intent extraction and embedding
classifier.py Span classification (aligned vs waste)
compute.py TER score computation
waste.py Waste pattern detection (8 detectors) and summarization
economics.py Session economics, cost, positional analysis, growth
input_analysis.py Input-side analysis (token breakdown, redundancy, drift, alignment)
formatter.py Rich/text/JSON output formatting (single, comparison, grouped)
models.py Data models and enums
cd src
# Run tests
pytest
# Lint
ruff check .
TER is a heuristic tool, not a tokenizer clone of Anthropic’s API:
- Spans use `len(text) // 4` for rough token counts; they will not match billed tokens line-for-line.
- Waste classification uses embeddings and thresholds; it is not ground-truth labeling.
- Waste $ uses assistant-origin waste and calibrates to API `output_tokens` when usage data exists; see `waste_output_calibration_ratio` in JSON. Ratios far from 1.0 mean the heuristic mass diverges from billing; interpret dollars cautiously.
- Input-priced vs output-priced rows in the waste breakdown reflect whether waste behaved like context re-injection vs generated text (see UPDATES.md).
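
To show why the heuristic counts drift from billing, here is a small sketch of the `len(text) // 4` estimate and one possible calibration ratio. The ratio's direction (API output tokens divided by the heuristic total) is an assumption. Check UPDATES.md for how `waste_output_calibration_ratio` is actually defined.

```python
def estimate_tokens(text):
    """Rough token count: four characters per token, rounded down."""
    return len(text) // 4


def calibration_ratio(span_texts, api_output_tokens):
    """API-reported output tokens over the heuristic total (assumed direction)."""
    heuristic_total = sum(estimate_tokens(text) for text in span_texts)
    if heuristic_total == 0:
        return None  # nothing to calibrate against
    return api_output_tokens / heuristic_total


spans = ["a" * 40, "b" * 40]         # 10 heuristic tokens each
print(calibration_ratio(spans, 20))  # 1.0
print(calibration_ratio(spans, 30))  # 1.5
```
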
For what changed and why (measurement pass, decoupled tools, ter report), read docs/TER_GOAL_AND_CHANGES.md.
Economics calibration, waste $ pricing, classifier flags, and maintainer notes: UPDATES.md.
Evolution toward the project end goal and maintainer proposal: docs/TER_GOAL_AND_CHANGES.md.
- Python 3.11+
- sentence-transformers (embeddings)
- numpy (similarity computation)
- rich (terminal formatting)