Live index of LLM evaluation tools and benchmarks, refreshed every 15 minutes from GitHub
⭐ Star this repo to bookmark — fresh data every 15 minutes
Automatically discovers and indexes new LLM evaluation frameworks, benchmarks, and harnesses as they appear on GitHub. Generates a structured, searchable catalog with metadata like stars, activity, and category tags. Designed for ML engineers who need to stay current without manually scanning repositories.
This list is auto-updated every 15 minutes by a GitHub Actions cron job. Each commit reflects a real change in the upstream data source (new items added, expired items removed), so the table is current as of the timestamp below.
⏰ Last updated: 2026-06-25 19:30 UTC
Data source: GitHub Search API

The table below is rewritten on every cron tick. Star the repo to bookmark.
| # | Name | ⭐ | Lang | Updated | Description |
|---|---|---|---|---|---|
| 1 | G59-Toneli/dataset-eval-skill | 0 | JavaScript | 2026-06-25 | A Claude skill for building golden sets to test AI systems — matching, RAG, LLM-as-judge — without false greens. |
| 2 | valbaudo/awf | 1 | Go | 2026-06-25 | Run agents you don't babysit, and trust the result. awf runs agentic workflows with an independent gate that checks ever |
| 3 | saddled-panicattack529/idea-evaluation-pipeline | 0 | — | 2026-06-25 | Streamline research idea evaluation for finance and economics to reach top journal quality using an iterative, AI-assist |
| 4 | promptfoo/promptfoo | 22600 | TypeScript | 2026-06-25 | Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, C |
| 5 | Kondwani10/Origin-Continuum | 0 | — | 2026-06-25 | 🌐 Define and explore the Origin ↔ Continuum framework, ensuring proper attribution and continuity in dependency relation |
| 6 | Sans-cell-art/-Project-Phoenix-The-E-Waste-Supercomputer- | 0 | — | 2026-06-25 | ♻️ Transform e-waste into a powerful, low-cost cloud operating system, unlocking computing potential and promoting resou |
| 7 | bhavya7995/AI_governance | 1 | PowerShell | 2026-06-25 | 🤖 Streamline AI-assisted development with a governance kit for rules, enforcement, and decision-making, ensuring speed a |
| 8 | multivon-ai/multivon-eval | 8 | Python | 2026-06-25 | Practical LLM evaluation for teams that ship to production. Deterministic + LLM-as-judge evaluators, dataset support, CI |
| 9 | NoesisVision/nasde-toolkit | 10 | Python | 2026-06-25 | CLI for benchmarks & evals of AI coding agents — on tasks you already understand, using your Claude / Codex / Gemini ind |
| 10 | thewonderofyou777z-dot/tjoe-reviewkit | 0 | Python | 2026-06-25 | TjoeReviewKit: a local, offline workflow retrospective checker for tjoe; it does not run tasks, go online, take over tool calls, or collect production logs |
| 11 | Giskard-AI/giskard-oss | 5465 | Python | 2026-06-25 | 🐢 Open-Source Evaluation & Testing library for LLM Agents |
| 12 | Arize-ai/phoenix | 10278 | Python | 2026-06-25 | AI Observability & Evaluation |
| 13 | verifywise-ai/verifywise | 312 | TypeScript | 2026-06-25 | Complete AI governance and LLM Evals platform with support for EU AI Act, ISO 42001, NIST AI RMF and 20+ more AI framewo |
| 14 | homemade-software-inc/completion-kit | 1 | Ruby | 2026-06-25 | Your prompts need tests too. Run prompts against real datasets, score outputs with LLM judges, version everything, and c |
| 15 | jeremylongshore/j-rig-skill-binary-eval | 0 | TypeScript | 2026-06-25 | Binary-criteria evaluation harness for Claude skills with planned extension to plugins, agents, and MCP servers. Score e |
| 16 | RaphaelFakhri/reagent | 0 | Python | 2026-06-24 | Tool-using ReAct + RAG agent (enterprise assistant) with a built-in evaluation harness scoring accuracy, tool selection, |
| 17 | tkarim45/agent-eval-harness | 0 | Python | 2026-06-24 | Agent eval harness — measure task success, tool-call accuracy, step efficiency, and cost for tool-using LLM agents (Clau |
| 18 | melody-ling-L/eval-resume | 0 | HTML | 2026-06-24 | Chinese-language LLM resume-rewriting honesty benchmark: 20 anonymized resumes × 3 models × 4 dimensions · promptfoo + LLM-as-judge · includes an online report |
| 19 | TheAnacondA57/BidAgent | 0 | Python | 2026-06-23 | Agentic RAG over public telecom concession documents (DSP/RIP), designed eval-first and checked in CI. |
| 20 | IonDen/mlx-quant-fidelity | 1 | Python | 2026-06-23 | Measure MLX quantization quality loss — KL divergence, perplexity, top-token agreement for KV cache and weights |
| 21 | ahwurm/localshift | 3 | Python | 2026-06-22 | Migrate headless Claude/AI workloads to local LLMs with a derived, per-workload quality eval — cron job in, zero-margina |
| 22 | gashel01/evalmcp | 0 | Python | 2026-06-22 | Evaluation for AI agents — judge-based scoring and native RAG metrics (faithfulness, relevancy, context precision/recall |
| 23 | anejakartik/evalstack | 0 | TypeScript | 2026-06-22 | Open-source LLM evaluation framework — drop-in SDK + CI plugin. LLM-as-judge, regression detection, free + self-hostable |
| 24 | truera/trulens | 3399 | Python | 2026-06-21 | Evaluation and Tracking for LLM Experiments and AI Agents |
| 25 | jmpei/nl2sql-agents | 0 | Python | 2026-06-21 | NL→SQL multi-agent pipeline (LangGraph + Claude) with deterministic SQL-injection guardrails and golden-set eval. |
| 26 | lokesh75-kank/agenteval | 0 | TypeScript | 2026-06-21 | Reliability and audit-evidence testing for LLM agents - wrap any agent, assert behavior, measure determinism, check grou |
| 27 | TeracAI/svg-arena | 0 | TypeScript | 2026-06-20 | A forkable example of the human-in-the-loop model-improvement loop: AI generates, humans judge via the Terac MCP, you im |
| 28 | ozlar34/job-match-radar | 0 | Python | 2026-06-20 | Self-hosted n8n + Supabase pipeline that scrapes LinkedIn and a watchlist of company ATS endpoints, scores listings agai |
| 29 | kilocommits/campaign-eval-harness | 0 | Python | 2026-06-20 | An LLM-as-judge harness that scores AI-generated campaign phone scripts against a weighted quality rubric with a real Ha |
| 30 | Ayubjon/refusal-radar | 0 | JavaScript | 2026-06-20 | Zero-dependency detector and classifier for LLM refusals, deflections, and capability disclaimers — CLI + library with s |
| 31 | melody-ling-L/judgebuddy | 0 | HTML | 2026-06-20 | Single-file labeling tool for LLM-as-judge calibration. Three-pane comparison + multi-dim scoring. Zero deployment. |
| 32 | ramenprotokol/hallucination-hunter | 0 | Python | 2026-06-20 | Detect & score LLM hallucinations by groundedness — labeled data, precision/recall/F1, runs offline with no API key. Plu |
| 33 | pdxlab/trustmodel-mcp-server | 0 | TypeScript | 2026-06-19 | TrustModel MCP Server — trust evaluation, red-team, and governance for AI agents via the Model Context Protocol. npm: @t |
| 34 | gititya/Quality-Agency-support | 0 | Python | 2026-06-17 | Five local QA judges that review B2B and B2C customer-support replies, catch the risky parts, and explain what to fix. |
| 35 | tushariitr-19/assay | 2 | Go | 2026-06-17 | Framework-agnostic evaluation harness for Go — test your MCP servers and AI agents with scored, CI-ready checks. |
| 36 | jedobe/skill-evaluator | 0 | Python | 2026-06-17 | Score any Claude Code skill against a research-backed rubric derived from the top 9 most-starred skill repos on GitHub |
| 37 | ALEX-nlp/OpenSkillEval | 12 | Python | 2026-06-15 | OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents |
| 38 | mpuodziukas-labs/eval-harness-template | 0 | Python | 2026-06-14 | Eval harness template for LLM systems: golden regression, LLM-as-judge, invariants |
| 39 | mizcausevic-dev/agent-eval-arena | 0 | TypeScript | 2026-06-22 | Agent and LLM evaluation harness — golden datasets, multi-scorer execution, regression detection across model versions, |
| 40 | ejentum/eval | 3 | Python | 2026-06-11 | A/B evaluate any LLM task with and without Ejentum cognitive injection. n8n workflow + TypeScript module. |
| 41 | akanjilal-work/agent-eval-harness | 0 | Python | 2026-06-10 | A lightweight harness to test agent behaviour (tool-call correctness, injection refusal, cost ceilings) before deploymen |
| 42 | karlmehta/trustmodel-mcp | 0 | TypeScript | 2026-06-10 | TrustModel MCP Server — trust evaluation, red-team & governance for AI agents via the Model Context Protocol. Public can |
| 43 | reaatech/agent-eval-harness | 0 | TypeScript | 2026-06-22 | End-to-end agent evaluation — trajectory eval, tool-use correctness, cost-per-task, latency budgets, regression suites w |
| 44 | alyssadata/continuity-keys | 1 | — | 2026-06-08 | Continuity Keys: tests for “same someone” returns. Behavioral identity consistency under pressure. Origin (Alyssa Solen) |
| 45 | reaatech/classifier-evals | 0 | TypeScript | 2026-06-24 | Offline classifier evaluation harness — dataset loader, confusion matrices, LLM-as-judge with cost accounting, regressio |
| 46 | reaatech/rag-eval-pack | 0 | TypeScript | 2026-06-22 | RAG evaluation toolkit — faithfulness, answer relevance, context precision/recall, cost accounting, CI gates. Pairs with |
| 47 | Juanllenato/llm-eval-harness | 0 | Python | 2026-06-03 | A small, production-minded evaluation and observability harness for LLM/RAG features. Runs offline or live, gates CI on |
| 48 | Victor-David-Medina/llm-eval-harness | 0 | Python | 2026-06-03 | LLM evaluation harness that gates quality in CI: golden datasets, regression detection, grounding and faithfulness check |
| 49 | harnexa/nexa-gauge | 38 | Python | 2026-06-22 | A graph-eval framework for LLMs |
| 50 | thestio/thest-eval | 0 | Python | 2026-06-02 | The CI regression gate and governance-evidence layer for LLM systems — zero-dependency, vendor-neutral, offline. |
Every 15 minutes, a GitHub Action runs `tracker.py`. That script:
- Fetches the latest state from GitHub Search API.
- Diffs against `data/items.json` (the previous snapshot).
- Rewrites the table above between the `<!-- TRACKER_TABLE_* -->` markers.
- Commits `feat: +N added, -M removed (timestamp)` if anything changed.
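
The diff, rewrite, and commit steps above can be sketched as follows. This is a minimal illustration, not the actual `tracker.py`: the marker names `TRACKER_TABLE_START` / `TRACKER_TABLE_END`, the `name` key in each snapshot item, and all function names are assumptions made for the example.

```python
# Hypothetical marker names; the README only says "<!-- TRACKER_TABLE_* -->".
START = "<!-- TRACKER_TABLE_START -->"
END = "<!-- TRACKER_TABLE_END -->"


def diff_snapshots(previous, current):
    """Return (added, removed) repo names between two snapshots.

    Each snapshot is assumed to be a list of dicts with a "name" key.
    """
    prev_names = {item["name"] for item in previous}
    curr_names = {item["name"] for item in current}
    added = sorted(curr_names - prev_names)
    removed = sorted(prev_names - curr_names)
    return added, removed


def commit_message(added, removed, timestamp):
    """Format the commit subject in the style the README describes."""
    return f"feat: +{len(added)} added, -{len(removed)} removed ({timestamp})"


def replace_table(readme, table):
    """Swap whatever sits between the two markers for the new table text."""
    start = readme.index(START) + len(START)
    end = readme.index(END, start)
    return readme[:start] + "\n" + table + "\n" + readme[end:]
```

Because `replace_table` keeps both markers in place, running it twice with the same table yields the same README, so a run with no upstream change produces no diff and therefore no commit.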
No external services. No paid APIs. Just a public data source and a free GitHub Action.
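
As a rough sketch of what "a public data source" involves, the snippet below builds a request URL for GitHub's repository search endpoint and turns one result item into a table row. The endpoint and the response fields (`full_name`, `stargazers_count`, `language`, `pushed_at`, `description`) are part of the public REST API; the search query, the 120-character description cutoff, and the choice of `pushed_at` for the Updated column are assumptions, since the README does not state them.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(query, per_page=50):
    """Build a repository search URL, most recently updated first."""
    params = {"q": query, "sort": "updated", "order": "desc", "per_page": per_page}
    return f"{SEARCH_URL}?{urlencode(params)}"


def to_row(rank, item, max_desc=120):
    """Render one search result as a markdown table row.

    max_desc is an assumed cutoff; pipes are escaped so they
    cannot break the table layout.
    """
    desc = (item.get("description") or "")[:max_desc].replace("|", "\\|")
    # "\u2014" is the dash the table shows when a repo has no language.
    lang = item.get("language") or "\u2014"
    updated = (item.get("pushed_at") or "")[:10]  # keep YYYY-MM-DD only
    return (
        f"| {rank} | {item['full_name']} | {item['stargazers_count']} "
        f"| {lang} | {updated} | {desc} |"
    )


# Placeholder query; the tracker's real query is not documented here.
url = build_search_url("llm evaluation")
```

Unauthenticated search requests are rate-limited by GitHub, so a scheduled job would typically send the token that GitHub Actions provides to the workflow.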
See CONTRIBUTING.md — usually you don't need to: the tracker keeps itself current.
If you spot a data-source bug or want to suggest a new column for the table, open an issue.
If you find this useful, you might also like these other auto-updated trackers from the same maintainer — same mechanism, different upstream:
- trending-claude-skills: What's shipping in Claude Skills this week (`topic:claude-skills`)
- mcp-servers-live: Live index of newest MCP servers (`topic:mcp-server`)
- cursor-rules-live: Newest Cursor rules and .cursorrules patterns (`topic:cursor-rules`)
- claude-code-plugin-tracker: Claude Code plugins and hook configs (`topic:claude-code`)
- llm-agents-radar: Newest LLM agent frameworks (`topic:llm-agent`)
- rag-radar: Newest RAG implementations and tools (`topic:rag`)
- agent-framework-radar: Newest agent frameworks shipping on GitHub (`topic:agent-framework`)
- vector-db-live: Newest vector DB projects and integrations (`topic:vector-database`)
- llmops-radar: Newest LLMOps tooling (observability, deployment) (`topic:llmops`)
- prompt-tools-live: Newest prompt-engineering tools and prompt repos (`topic:prompt-engineering`)
- agent-eval-harness: Live benchmark of AI coding agents (`topic:llm-eval`)
- skills-tracker: Tracking new GitHub 'skills' repos (`topic:agent-skills`)
- awesome-agent-skills: Curated auto-updated awesome-list of AI agent skills (`topic:agent-skills`)
MIT — see LICENSE.