LLM Eval Tracker

Live index of LLM evaluation tools and benchmarks, refreshed every 15 minutes from GitHub

⭐ Star this repo to bookmark — fresh data every 15 minutes

English · 中文 · 日本語 · 한국어 · Español · Português

💡 What is this?

Automatically discovers and indexes new LLM evaluation frameworks, benchmarks, and harnesses as they appear on GitHub. Generates a structured, searchable catalog with metadata like stars, activity, and category tags. Designed for ML engineers who need to stay current without manually scanning repositories.

This list is auto-updated every 15 minutes by a GitHub Actions cron job. Each commit reflects a real change in the upstream data source (new items added, expired items removed), so what you see stays current.

📋 Current Items

⏰ Last updated: 2026-06-25 19:30 UTC

Data source: GitHub Search API

The table below is rewritten on every cron tick.

#	Name	⭐	Lang	Updated	Description
1	G59-Toneli/dataset-eval-skill	0	JavaScript	2026-06-25	A Claude skill for building golden sets to test AI systems — matching, RAG, LLM-as-judge — without false greens.
2	valbaudo/awf	1	Go	2026-06-25	Run agents you don't babysit, and trust the result. awf runs agentic workflows with an independent gate that checks ever
3	saddled-panicattack529/idea-evaluation-pipeline	0	—	2026-06-25	Streamline research idea evaluation for finance and economics to reach top journal quality using an iterative, AI-assist
4	promptfoo/promptfoo	22600	TypeScript	2026-06-25	Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, C
5	Kondwani10/Origin-Continuum	0	—	2026-06-25	🌐 Define and explore the Origin ↔ Continuum framework, ensuring proper attribution and continuity in dependency relation
6	Sans-cell-art/-Project-Phoenix-The-E-Waste-Supercomputer-	0	—	2026-06-25	♻️ Transform e-waste into a powerful, low-cost cloud operating system, unlocking computing potential and promoting resou
7	bhavya7995/AI_governance	1	PowerShell	2026-06-25	🤖 Streamline AI-assisted development with a governance kit for rules, enforcement, and decision-making, ensuring speed a
8	multivon-ai/multivon-eval	8	Python	2026-06-25	Practical LLM evaluation for teams that ship to production. Deterministic + LLM-as-judge evaluators, dataset support, CI
9	NoesisVision/nasde-toolkit	10	Python	2026-06-25	CLI for benchmarks & evals of AI coding agents — on tasks you already understand, using your Claude / Codex / Gemini ind
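10	thewonderofyou777z-dot/tjoe-reviewkit	0	Python	2026-06-25	TjoeReviewKit: tjoe's local, offline workflow review and inspection tool; does not run tasks, connect to the network, take over tool calls, or collect production logs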
11	Giskard-AI/giskard-oss	5465	Python	2026-06-25	🐢 Open-Source Evaluation & Testing library for LLM Agents
12	Arize-ai/phoenix	10278	Python	2026-06-25	AI Observability & Evaluation
13	verifywise-ai/verifywise	312	TypeScript	2026-06-25	Complete AI governance and LLM Evals platform with support for EU AI Act, ISO 42001, NIST AI RMF and 20+ more AI framewo
14	homemade-software-inc/completion-kit	1	Ruby	2026-06-25	Your prompts need tests too. Run prompts against real datasets, score outputs with LLM judges, version everything, and c
15	jeremylongshore/j-rig-skill-binary-eval	0	TypeScript	2026-06-25	Binary-criteria evaluation harness for Claude skills with planned extension to plugins, agents, and MCP servers. Score e
16	RaphaelFakhri/reagent	0	Python	2026-06-24	Tool-using ReAct + RAG agent (enterprise assistant) with a built-in evaluation harness scoring accuracy, tool selection,
17	tkarim45/agent-eval-harness	0	Python	2026-06-24	Agent eval harness — measure task success, tool-call accuracy, step efficiency, and cost for tool-using LLM agents (Clau
18	melody-ling-L/eval-resume	0	HTML	2026-06-24	Chinese-language LLM resume-rewriting honesty benchmark: 20 anonymized resumes × 3 models × 4 dimensions · promptfoo + LLM-as-judge · includes online report
19	TheAnacondA57/BidAgent	0	Python	2026-06-23	Agentic RAG over public telecom concession documents (DSP/RIP), designed eval-first and checked in CI.
20	IonDen/mlx-quant-fidelity	1	Python	2026-06-23	Measure MLX quantization quality loss — KL divergence, perplexity, top-token agreement for KV cache and weights
21	ahwurm/localshift	3	Python	2026-06-22	Migrate headless Claude/AI workloads to local LLMs with a derived, per-workload quality eval — cron job in, zero-margina
22	gashel01/evalmcp	0	Python	2026-06-22	Evaluation for AI agents — judge-based scoring and native RAG metrics (faithfulness, relevancy, context precision/recall
23	anejakartik/evalstack	0	TypeScript	2026-06-22	Open-source LLM evaluation framework — drop-in SDK + CI plugin. LLM-as-judge, regression detection, free + self-hostable
24	truera/trulens	3399	Python	2026-06-21	Evaluation and Tracking for LLM Experiments and AI Agents
25	jmpei/nl2sql-agents	0	Python	2026-06-21	NL→SQL multi-agent pipeline (LangGraph + Claude) with deterministic SQL-injection guardrails and golden-set eval.
26	lokesh75-kank/agenteval	0	TypeScript	2026-06-21	Reliability and audit-evidence testing for LLM agents - wrap any agent, assert behavior, measure determinism, check grou
27	TeracAI/svg-arena	0	TypeScript	2026-06-20	A forkable example of the human-in-the-loop model-improvement loop: AI generates, humans judge via the Terac MCP, you im
28	ozlar34/job-match-radar	0	Python	2026-06-20	Self-hosted n8n + Supabase pipeline that scrapes LinkedIn and a watchlist of company ATS endpoints, scores listings agai
29	kilocommits/campaign-eval-harness	0	Python	2026-06-20	An LLM-as-judge harness that scores AI-generated campaign phone scripts against a weighted quality rubric with a real Ha
30	Ayubjon/refusal-radar	0	JavaScript	2026-06-20	Zero-dependency detector and classifier for LLM refusals, deflections, and capability disclaimers — CLI + library with s
31	melody-ling-L/judgebuddy	0	HTML	2026-06-20	Single-file labeling tool for LLM-as-judge calibration. Three-pane comparison + multi-dim scoring. Zero deployment.
32	ramenprotokol/hallucination-hunter	0	Python	2026-06-20	Detect & score LLM hallucinations by groundedness — labeled data, precision/recall/F1, runs offline with no API key. Plu
33	pdxlab/trustmodel-mcp-server	0	TypeScript	2026-06-19	TrustModel MCP Server — trust evaluation, red-team, and governance for AI agents via the Model Context Protocol. npm: @t
34	gititya/Quality-Agency-support	0	Python	2026-06-17	Five local QA judges that review B2B and B2C customer-support replies, catch the risky parts, and explain what to fix.
35	tushariitr-19/assay	2	Go	2026-06-17	Framework-agnostic evaluation harness for Go — test your MCP servers and AI agents with scored, CI-ready checks.
36	jedobe/skill-evaluator	0	Python	2026-06-17	Score any Claude Code skill against a research-backed rubric derived from the top 9 most-starred skill repos on GitHub
37	ALEX-nlp/OpenSkillEval	12	Python	2026-06-15	OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents
38	mpuodziukas-labs/eval-harness-template	0	Python	2026-06-14	Eval harness template for LLM systems: golden regression, LLM-as-judge, invariants
39	mizcausevic-dev/agent-eval-arena	0	TypeScript	2026-06-22	Agent and LLM evaluation harness — golden datasets, multi-scorer execution, regression detection across model versions,
40	ejentum/eval	3	Python	2026-06-11	A/B evaluate any LLM task with and without Ejentum cognitive injection. n8n workflow + TypeScript module.
41	akanjilal-work/agent-eval-harness	0	Python	2026-06-10	A lightweight harness to test agent behaviour (tool-call correctness, injection refusal, cost ceilings) before deploymen
42	karlmehta/trustmodel-mcp	0	TypeScript	2026-06-10	TrustModel MCP Server — trust evaluation, red-team & governance for AI agents via the Model Context Protocol. Public can
43	reaatech/agent-eval-harness	0	TypeScript	2026-06-22	End-to-end agent evaluation — trajectory eval, tool-use correctness, cost-per-task, latency budgets, regression suites w
44	alyssadata/continuity-keys	1	—	2026-06-08	Continuity Keys: tests for “same someone” returns. Behavioral identity consistency under pressure. Origin (Alyssa Solen)
45	reaatech/classifier-evals	0	TypeScript	2026-06-24	Offline classifier evaluation harness — dataset loader, confusion matrices, LLM-as-judge with cost accounting, regressio
46	reaatech/rag-eval-pack	0	TypeScript	2026-06-22	RAG evaluation toolkit — faithfulness, answer relevance, context precision/recall, cost accounting, CI gates. Pairs with
47	Juanllenato/llm-eval-harness	0	Python	2026-06-03	A small, production-minded evaluation and observability harness for LLM/RAG features. Runs offline or live, gates CI on
48	Victor-David-Medina/llm-eval-harness	0	Python	2026-06-03	LLM evaluation harness that gates quality in CI: golden datasets, regression detection, grounding and faithfulness check
49	harnexa/nexa-gauge	38	Python	2026-06-22	A graph-eval framework for LLMs
50	thestio/thest-eval	0	Python	2026-06-02	The CI regression gate and governance-evidence layer for LLM systems — zero-dependency, vendor-neutral, offline.
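
As a rough illustration of how rows like the ones above could be derived, here is a minimal sketch, not the actual tracker.py. It assumes the GitHub Search API repository endpoint sorted by update time, and it assumes each row is built from the `full_name`, `stargazers_count`, `language`, `updated_at`, and `description` fields of a result item. The 120-character description limit is inferred from the truncated rows, and the function names and the example query are hypothetical.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"
MAX_DESC = 120          # assumed truncation width, inferred from the table rows
NO_LANG = "\u2014"      # dash placeholder the table shows when no language is set


def build_search_url(query: str, per_page: int = 50) -> str:
    """Build a repository-search URL, most recently updated first."""
    params = {"q": query, "sort": "updated", "order": "desc", "per_page": per_page}
    return f"{SEARCH_URL}?{urlencode(params)}"


def to_row(rank: int, item: dict) -> str:
    """Turn one entry of the response's "items" array into a tab-separated row."""
    description = (item.get("description") or "")[:MAX_DESC]
    language = item.get("language") or NO_LANG
    updated = (item.get("updated_at") or "")[:10]  # ISO timestamp -> YYYY-MM-DD
    return "\t".join(
        [str(rank), item["full_name"], str(item["stargazers_count"]),
         language, updated, description]
    )


# Hypothetical query; the real search terms used by the tracker are not stated.
url = build_search_url("topic:llm-eval")
```

Fetching `url` and passing each element of the JSON response's `items` list through `to_row` would yield lines in the same column order as the table.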

🔍 How it works

Every 15 minutes, a GitHub Action runs tracker.py. That script:

1. Fetches the latest state from the GitHub Search API.
2. Diffs against data/items.json (the previous snapshot).
3. Rewrites the table above between its start and end markers.
4. Commits with the message "feat: +N added, -M removed (timestamp)" if anything changed.
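
The diff, rewrite, and commit-message steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the contents of tracker.py: the marker strings, the `name` key in the snapshot, the timestamp format, and all function names are hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timezone

# Hypothetical marker names; the README does not show the real ones.
START, END = "<!-- ITEMS:START -->", "<!-- ITEMS:END -->"


def diff_items(previous: list[dict], current: list[dict]) -> tuple[list[str], list[str]]:
    """Return (added, removed) repository names between two snapshots."""
    old = {item["name"] for item in previous}
    new = {item["name"] for item in current}
    return sorted(new - old), sorted(old - new)


def replace_between_markers(readme: str, table: str) -> str:
    """Swap whatever sits between the two markers for the new table."""
    if START not in readme or END not in readme:
        raise ValueError("README is missing the table markers")
    head, _, rest = readme.partition(START)
    _, _, tail = rest.partition(END)
    return f"{head}{START}\n{table}\n{END}{tail}"


def commit_message(added: list[str], removed: list[str],
                   now: datetime | None = None) -> str | None:
    """Build the commit subject, or return None when nothing changed."""
    if not added and not removed:
        return None
    now = now or datetime.now(timezone.utc)
    # Timestamp format is an assumption; the source only says "(timestamp)".
    return f"feat: +{len(added)} added, -{len(removed)} removed ({now:%Y-%m-%d %H:%M} UTC)"
```

Returning `None` from `commit_message` when both lists are empty mirrors the "if anything changed" condition: the workflow can skip the commit entirely on a quiet run.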

No external services. No paid APIs. Just a public data source and a free GitHub Action.

🤝 Contributing

See CONTRIBUTING.md. Usually there is nothing to contribute by hand, because the tracker keeps itself current. If you spot a data-source bug or want to suggest a new column for the table, open an issue.

🔗 Related live trackers

If you find this useful, you might also like these other auto-updated trackers from the same maintainer — same mechanism, different upstream:

trending-claude-skills — What's shipping in Claude Skills this week (topic:claude-skills)
mcp-servers-live — Live index of newest MCP servers (topic:mcp-server)
cursor-rules-live — Newest Cursor rules and .cursorrules patterns (topic:cursor-rules)
claude-code-plugin-tracker — Claude Code plugins and hook configs (topic:claude-code)
llm-agents-radar — Newest LLM agent frameworks (topic:llm-agent)
rag-radar — Newest RAG implementations and tools (topic:rag)
agent-framework-radar — Newest agent frameworks shipping on GitHub (topic:agent-framework)
vector-db-live — Newest vector DB projects and integrations (topic:vector-database)
llmops-radar — Newest LLMOps tooling (observability, deployment) (topic:llmops)
prompt-tools-live — Newest prompt-engineering tools and prompt repos (topic:prompt-engineering)
agent-eval-harness — Live benchmark of AI coding agents (topic:llm-eval)
skills-tracker — Tracking new GitHub 'skills' repos (topic:agent-skills)
awesome-agent-skills — Curated auto-updated awesome-list of AI agent skills (topic:agent-skills)

📜 License

MIT — see LICENSE.
