Live, open-source benchmark for comparing AI coding agents on real GitHub issues
⭐ Star this repo to bookmark — fresh data every 15 minutes
A standardized benchmark suite that runs coding agents against live, real-world GitHub issues with reproduction steps. Unlike static academic benchmarks, it outputs a weekly-updated public leaderboard, enabling developers to compare agents like OpenCode, Codex, and Claude Code in realistic scenarios.
This list is auto-updated every 15 minutes by a GitHub Actions cron. Each commit reflects a real change in the upstream data source (new items added, expired items removed), so the list is current as of the last run.
⏰ Last updated: 2026-06-24 22:38 UTC
Data source: GitHub Search API

The table below is rewritten on every cron tick. Star the repo to bookmark.
| # | Name | ⭐ | Lang | Updated | Description |
|---|---|---|---|---|---|
| 1 | promptfoo/promptfoo | 22557 | TypeScript | 2026-06-24 | Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, C |
| 2 | saddled-panicattack529/idea-evaluation-pipeline | 0 | — | 2026-06-24 | Streamline research idea evaluation for finance and economics to reach top journal quality using an iterative, AI-assist |
| 3 | Arize-ai/phoenix | 10264 | Python | 2026-06-24 | AI Observability & Evaluation |
| 4 | Kondwani10/Origin-Continuum | 0 | — | 2026-06-24 | 🌐 Define and explore the Origin ↔ Continuum framework, ensuring proper attribution and continuity in dependency relation |
| 5 | Sans-cell-art/-Project-Phoenix-The-E-Waste-Supercomputer- | 0 | — | 2026-06-24 | ♻️ Transform e-waste into a powerful, low-cost cloud operating system, unlocking computing potential and promoting resou |
| 6 | bhavya7995/AI_governance | 1 | PowerShell | 2026-06-24 | 🤖 Streamline AI-assisted development with a governance kit for rules, enforcement, and decision-making, ensuring speed a |
| 7 | multivon-ai/multivon-eval | 8 | Python | 2026-06-24 | Practical LLM evaluation for teams that ship to production. Deterministic + LLM-as-judge evaluators, dataset support, CI |
| 8 | valbaudo/awf | 1 | Go | 2026-06-24 | Run agents you don't babysit, and trust the result. awf runs agentic workflows with an independent gate that checks ever |
| 9 | verifywise-ai/verifywise | 312 | TypeScript | 2026-06-24 | Complete AI governance and LLM Evals platform with support for EU AI Act, ISO 42001, NIST AI RMF and 20+ more AI framewo |
| 10 | jeremylongshore/j-rig-skill-binary-eval | 0 | TypeScript | 2026-06-24 | Binary-criteria evaluation harness for Claude skills with planned extension to plugins, agents, and MCP servers. Score e |
| 11 | Giskard-AI/giskard-oss | 5462 | Python | 2026-06-24 | 🐢 Open-Source Evaluation & Testing library for LLM Agents |
| 12 | RaphaelFakhri/reagent | 0 | Python | 2026-06-24 | Tool-using ReAct + RAG agent (enterprise assistant) with a built-in evaluation harness scoring accuracy, tool selection, |
| 13 | NoesisVision/nasde-toolkit | 10 | Python | 2026-06-24 | CLI for benchmarks & evals of AI coding agents — on tasks you already understand, using your Claude / Codex / Gemini ind |
| 14 | tkarim45/agent-eval-harness | 0 | Python | 2026-06-24 | Agent eval harness — measure task success, tool-call accuracy, step efficiency, and cost for tool-using LLM agents (Clau |
| 15 | melody-ling-L/eval-resume | 0 | HTML | 2026-06-24 | Chinese-language LLM resume-rewriting honesty benchmark: 20 anonymized resumes × 3 models × 4 dimensions · promptfoo + LLM-as-judge · includes online report |
| 16 | TheAnacondA57/BidAgent | 0 | Python | 2026-06-23 | Agentic RAG over public telecom concession documents (DSP/RIP), designed eval-first and checked in CI. |
| 17 | IonDen/mlx-quant-fidelity | 1 | Python | 2026-06-23 | Measure MLX quantization quality loss — KL divergence, perplexity, top-token agreement for KV cache and weights |
| 18 | ahwurm/localshift | 3 | Python | 2026-06-22 | Migrate headless Claude/AI workloads to local LLMs with a derived, per-workload quality eval — cron job in, zero-margina |
| 19 | gashel01/evalmcp | 0 | Python | 2026-06-22 | Evaluation for AI agents — judge-based scoring and native RAG metrics (faithfulness, relevancy, context precision/recall |
| 20 | anejakartik/evalstack | 0 | TypeScript | 2026-06-22 | Open-source LLM evaluation framework — drop-in SDK + CI plugin. LLM-as-judge, regression detection, free + self-hostable |
| 21 | truera/trulens | 3399 | Python | 2026-06-21 | Evaluation and Tracking for LLM Experiments and AI Agents |
| 22 | jmpei/nl2sql-agents | 0 | Python | 2026-06-21 | NL→SQL multi-agent pipeline (LangGraph + Claude) with deterministic SQL-injection guardrails and golden-set eval. |
| 23 | lokesh75-kank/agenteval | 0 | TypeScript | 2026-06-21 | Reliability and audit-evidence testing for LLM agents - wrap any agent, assert behavior, measure determinism, check grou |
| 24 | TeracAI/svg-arena | 0 | TypeScript | 2026-06-20 | A forkable example of the human-in-the-loop model-improvement loop: AI generates, humans judge via the Terac MCP, you im |
| 25 | ozlar34/job-match-radar | 0 | Python | 2026-06-20 | Self-hosted n8n + Supabase pipeline that scrapes LinkedIn and a watchlist of company ATS endpoints, scores listings agai |
| 26 | kilocommits/campaign-eval-harness | 0 | Python | 2026-06-20 | An LLM-as-judge harness that scores AI-generated campaign phone scripts against a weighted quality rubric with a real Ha |
| 27 | Ayubjon/refusal-radar | 0 | JavaScript | 2026-06-20 | Zero-dependency detector and classifier for LLM refusals, deflections, and capability disclaimers — CLI + library with s |
| 28 | melody-ling-L/judgebuddy | 0 | HTML | 2026-06-20 | Single-file labeling tool for LLM-as-judge calibration. Three-pane comparison + multi-dim scoring. Zero deployment. |
| 29 | ramenprotokol/hallucination-hunter | 0 | Python | 2026-06-20 | Detect & score LLM hallucinations by groundedness — labeled data, precision/recall/F1, runs offline with no API key. Plu |
| 30 | pdxlab/trustmodel-mcp-server | 0 | TypeScript | 2026-06-19 | TrustModel MCP Server — trust evaluation, red-team, and governance for AI agents via the Model Context Protocol. npm: @t |
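| 31 | thewonderofyou777z-dot/tjoe-reviewkit | 0 | Python | 2026-06-18 | TjoeReviewKit: tjoe's local, offline workflow retrospective-review checker; does not run tasks, connect to the network, take over tool calls, or collect production logs |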
| 32 | gititya/Quality-Agency-support | 0 | Python | 2026-06-17 | Five local QA judges that review B2B and B2C customer-support replies, catch the risky parts, and explain what to fix. |
| 33 | tushariitr-19/assay | 2 | Go | 2026-06-17 | Framework-agnostic evaluation harness for Go — test your MCP servers and AI agents with scored, CI-ready checks. |
| 34 | jedobe/skill-evaluator | 0 | Python | 2026-06-17 | Score any Claude Code skill against a research-backed rubric derived from the top 9 most-starred skill repos on GitHub |
| 35 | ALEX-nlp/OpenSkillEval | 11 | Python | 2026-06-15 | OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents |
| 36 | mpuodziukas-labs/eval-harness-template | 0 | Python | 2026-06-14 | Eval harness template for LLM systems: golden regression, LLM-as-judge, invariants |
| 37 | homemade-software-inc/completion-kit | 1 | Ruby | 2026-06-24 | Your prompts need tests too. Run prompts against real datasets, score outputs with LLM judges, version everything, and c |
| 38 | mizcausevic-dev/agent-eval-arena | 0 | TypeScript | 2026-06-22 | Agent and LLM evaluation harness — golden datasets, multi-scorer execution, regression detection across model versions, |
| 39 | ejentum/eval | 3 | Python | 2026-06-11 | A/B evaluate any LLM task with and without Ejentum cognitive injection. n8n workflow + TypeScript module. |
| 40 | akanjilal-work/agent-eval-harness | 0 | Python | 2026-06-10 | A lightweight harness to test agent behaviour (tool-call correctness, injection refusal, cost ceilings) before deploymen |
| 41 | karlmehta/trustmodel-mcp | 0 | TypeScript | 2026-06-10 | TrustModel MCP Server — trust evaluation, red-team & governance for AI agents via the Model Context Protocol. Public can |
| 42 | reaatech/agent-eval-harness | 0 | TypeScript | 2026-06-22 | End-to-end agent evaluation — trajectory eval, tool-use correctness, cost-per-task, latency budgets, regression suites w |
| 43 | alyssadata/continuity-keys | 1 | — | 2026-06-08 | Continuity Keys: tests for “same someone” returns. Behavioral identity consistency under pressure. Origin (Alyssa Solen) |
| 44 | reaatech/classifier-evals | 0 | TypeScript | 2026-06-24 | Offline classifier evaluation harness — dataset loader, confusion matrices, LLM-as-judge with cost accounting, regressio |
| 45 | reaatech/rag-eval-pack | 0 | TypeScript | 2026-06-22 | RAG evaluation toolkit — faithfulness, answer relevance, context precision/recall, cost accounting, CI gates. Pairs with |
| 46 | Juanllenato/llm-eval-harness | 0 | Python | 2026-06-03 | A small, production-minded evaluation and observability harness for LLM/RAG features. Runs offline or live, gates CI on |
| 47 | Victor-David-Medina/llm-eval-harness | 0 | Python | 2026-06-03 | LLM evaluation harness that gates quality in CI: golden datasets, regression detection, grounding and faithfulness check |
| 48 | harnexa/nexa-gauge | 38 | Python | 2026-06-22 | A graph-eval framework for LLMs |
| 49 | thestio/thest-eval | 0 | Python | 2026-06-02 | The CI regression gate and governance-evidence layer for LLM systems — zero-dependency, vendor-neutral, offline. |
| 50 | monkeyin92/voice-agent-testops | 0 | TypeScript | 2026-06-01 | Regression testing for voice agents: scripted conversations, safety assertions, CI-ready reports. |
| 51 | fastxyz/skill-optimizer | 66 | TypeScript | 2026-05-28 | Benchmark, evaluate, and optimize skills to ensure reliable performance across all LLMs |
| 52 | ajmeese7/local-llms | 1 | Python | 2026-05-27 | Use local Large Language Models for production use cases, and perform benchmarking for task-specific performance evaluat |
| 53 | rogue-socket/focusgroup | 0 | Python | 2026-05-27 | Persona-driven dynamic testing for conversational AI products. Focus groups for your agents. |
| 54 | chquandogong/mission-spec | 0 | TypeScript | 2026-06-22 | Mission Spec: a task contract layer for AI agent workflows |
| 55 | sanya2025/edututor-eval | 0 | Python | 2026-05-21 | A lightweight evaluation framework for AI tutoring responses, built for education-focused LLM systems |
| 56 | Alexanderk30/context-override-resistance | 0 | Python | 2026-05-19 | RL-style eval measuring intent/action divergence in frontier agents: model acknowledges a correction, then acts on the s |
| 57 | GiuseppeSp/n8n-customer-interview-synthesizer | 0 | — | 2026-05-19 | Multi-agent customer-interview synthesis pipeline in n8n with LLM-as-judge eval, Slack human-in-the-loop approval, and d |
| 58 | gmitt98/fieldtest | 0 | Python | 2026-05-16 | LLM evaluation framework — define what correct, well-formed, and safe means before you measure |
| 59 | verifywise-ai/plugin-marketplace | 3 | TypeScript | 2026-05-15 | VerifyWise AI Governance Plugin Marketplace |
| 60 | AI-QL/tuui | 1148 | TypeScript | 2026-05-14 | A desktop MCP client designed as a tool unitary utility integration, accelerating AI adoption through the Model Context |
| 61 | prompt-foundry/typescript-sdk | 6 | TypeScript | 2026-05-13 | The prompt engineering, prompt management, and prompt evaluation tool for TypeScript, JavaScript, and NodeJS. |
| 62 | prompt-foundry/python-sdk | 8 | Python | 2026-05-13 | The prompt engineering, prompt management, and prompt evaluation tool for Python |
| 63 | Ruthwik-Data/mechanictrust | 0 | — | 2026-05-11 | AI product case study for trust, pricing transparency, and explainable diagnosis in auto repair. |
| 64 | SAY-5/eval-observability | 0 | Python | 2026-05-10 | Python LLM eval framework with full OTel tracing, structured logs, and daily Welch's-t-test regression detection persist |
| 65 | Ruthwik-Data/finrag-eval | 0 | Python | 2026-05-10 | RAG eval pipeline on Apple's FY 2024 10-K — found confident hallucinations, filed a metric-level bug in DeepEval, and bu |
| 66 | Ruthwik-Data/self-improving-prompt-agent | 0 | Python | 2026-05-10 | Prompt optimization loop that improves prompts through iterative mutation and LLM-as-judge evaluation. Score went 0.10 → |
| 67 | SAY-5/genai-eval | 0 | Python | 2026-05-07 | Multilingual GenAI evaluation service across 5 task types and 3 languages, with regression-trend dashboard |
| 68 | HumphreySun98/repoagentbench | 32 | Python | 2026-04-30 | SWE-bench for your codebase — mine your merged PRs into local, contamination-free coding-agent benchmarks. Adapters: cla |
| 69 | YagneshKhamar/phasio | 0 | TypeScript | 2026-04-29 | Jest-style testing for LLM prompts. Version prompts, run evals across OpenAI and Anthropic, catch regressions in CI. |
| 70 | lehigh-university-libraries/htr | 2 | Go | 2026-06-24 | Handwritten Text Recognition llm eval tool |
| 71 | JSLEEKR/evaltrack | 0 | TypeScript | 2026-04-24 | Local-first regression and trend CLI for promptfoo eval histories — the git log + git diff for LLM eval outputs. |
| 72 | izam-mohammed/ragrank | 47 | Python | 2026-04-21 | 🎯 Your free LLM evaluation toolkit helps you assess the accuracy of facts, how well it understands context, its tone, an |
| 73 | arthursoares/openclaw-llm-bench | 2 | Python | 2026-04-11 | A reasoning benchmark runner for comparing LLMs as OpenClaw agents use them. 52 prompts, 3 eval sets, 11 traps, LLM-as-j |
| 74 | YuanyangLiNEU/mini-claude | 0 | TypeScript | 2026-04-11 | A minimal Claude Code built from scratch — agent loop, tool calling, web search, permissions, and a black-box LLM eval h |
| 75 | webrenew/models-dilemma | 4 | TypeScript | 2026-04-08 | The Prisoner's Dilemma played by LLMs |
| 76 | AdirAmsalem/openclaw-eval | 0 | Python | 2026-03-31 | Compare OpenClaw setups against the same scenario suite. Run prompts across multiple configurations, capture answers, la |
| 77 | Data-ScienceTech/forcefield | 1 | Python | 2026-03-30 | ForceField Python SDK -- AI security in 3 lines of code. Prompt injection detection, PII redaction, security evals, tool |
| 78 | klausners/prompt-optimizer | 0 | TypeScript | 2026-03-26 | Config-driven CLI that runs promptfoo evals, identifies low-scoring prompts, rewrites them via Claude API, and re-evalua |
| 79 | Aysnc-Labs/llm-eval | 1 | PHP | 2026-03-20 | A PHP package for evaluating LLM outputs. Test your prompts, validate responses, and ensure your AI features work correc |
| 80 | asarnaout/veritail | 6 | Python | 2026-03-15 | LLM-as-a-Judge evaluation platform for ecommerce search. Scores relevance, computes IR metrics, and flags quality issues |
| 81 | vola-trebla/llm-infrastructure | 0 | — | 2026-03-14 | Full-stack AI infrastructure - 5 projects from data ingestion to autonomous agents |
| 82 | whitecircle/circle-guard-bench | 70 | Python | 2026-03-07 | First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (g |
| 83 | tpertner/squeeze | 5 | Python | 2026-03-01 | Squeeze your model with pressure prompts to see if its behavior leaks. |
| 84 | grigio/llm-eval-simple | 70 | Python | 2026-02-28 | llm-eval-simple is a simple LLM evaluation framework with intermediate actions and prompt pattern selection |
| 85 | QuesmaOrg/BinaryAudit | 92 | Shell | 2026-02-27 | An open-source benchmark for evaluating AI agents' ability to find backdoors hidden in compiled binaries. |
| 86 | paradime-io/dbt-llm-evals | 29 | Python | 2026-02-10 | The warehouse-native LLM evaluation package for dbt™ - monitor AI quality without data egress |
| 87 | Striveworks/valor | 41 | Python | 2026-02-09 | Valor is a lightweight, numpy-based library designed for fast and seamless evaluation of machine learning models. |
| 88 | TADSTech/llm-output-grader | 0 | Python | 2026-01-24 | systematic llm grading |
| 89 | 3ahmood/Agentic-Author-CrewAI | 1 | Jupyter Notebook | 2026-01-15 | On device autonomous research and content writing using open-sourced LLMs and Crew AI. |
| 90 | Supahands/llm-comparison-backend | 22 | Python | 2026-01-13 | This is an opensource project allowing you to compare two LLM's head to head with a given prompt, this section will be r |
| 91 | thedataquarry/structured-outputs | 28 | Python | 2025-12-23 | Structured output benchmarks comparing DSPy and BAML with different LLMs |
| 92 | higuseonhye/worldsim-eval | 0 | — | 2025-12-20 | Evaluate AI agents by simulating world-level consequences. |
| 93 | yukincom/llm-SugarScape | 6 | Python | 2025-11-28 | Multi-agent simulation using LLMs. Agents autonomously decide actions for survival, reproduction, and social behavior in |
| 94 | IAAR-Shanghai/GuessArena | 10 | Python | 2025-11-15 | [ACL 2025] GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Re |
| 95 | artefactop/promptdev | 2 | Python | 2025-09-22 | A prompt evaluation framework that provides comprehensive testing for AI agents across multiple providers. |
| 96 | multinear/multinear | 45 | Python | 2025-09-02 | Develop reliable AI apps |
| 97 | attogram/ollama-multirun | 16 | Shell | 2025-08-30 | Run a prompt against all, or some, of your models running on Ollama. Creates web pages with the output, performance stat |
| 98 | khoj-ai/llm-coup | 14 | TypeScript | 2025-08-18 | Let LLMs play coup with each other and see who's the best at deception & strategy |
| 99 | jaaack-wang/multi-problem-eval-llm | 3 | Jupyter Notebook | 2025-08-08 | Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities |
| 100 | alan-turing-institute/prompto | 38 | Python | 2025-07-18 | An open source library for asynchronous querying of LLM endpoints |
Every 15 minutes, a GitHub Action runs `tracker.py`. That script:

- Fetches the latest state from the GitHub Search API.
- Diffs against `data/items.json` (the previous snapshot).
- Rewrites the table above between the `<!-- TRACKER_TABLE_* -->` markers.
- Commits `feat: +N added, -M removed (timestamp)` if anything changed.
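
The steps above can be sketched roughly as follows. This is an illustrative sketch, not the actual `tracker.py`: the exact marker names (`TRACKER_TABLE_START` / `TRACKER_TABLE_END`), the snapshot field names (`name`, `stars`, `lang`, `updated`, `description`), and all function names are assumptions made for the example.

```python
import re

# Hypothetical marker names: the README only specifies "<!-- TRACKER_TABLE_* -->".
START = "<!-- TRACKER_TABLE_START -->"
END = "<!-- TRACKER_TABLE_END -->"


def diff_snapshots(previous, current):
    """Return (added, removed) repo names between two snapshots."""
    prev_names = {item["name"] for item in previous}
    curr_names = {item["name"] for item in current}
    return sorted(curr_names - prev_names), sorted(prev_names - curr_names)


def render_table(items):
    """Render items as a Markdown table with this README's columns."""
    lines = [
        "| # | Name | ⭐ | Lang | Updated | Description |",
        "|---|---|---|---|---|---|",
    ]
    for rank, item in enumerate(items, start=1):
        lines.append(
            f"| {rank} | {item['name']} | {item['stars']} | {item['lang']} "
            f"| {item['updated']} | {item['description']} |"
        )
    return "\n".join(lines)


def replace_between_markers(readme, table):
    """Swap whatever sits between the two markers for the new table."""
    pattern = re.compile(re.escape(START) + r".*?" + re.escape(END), re.DOTALL)
    # A function replacement avoids backslash-escape handling in the table text.
    return pattern.sub(lambda _match: f"{START}\n{table}\n{END}", readme, count=1)


def commit_message(added, removed, timestamp):
    """Build the commit subject, or None when nothing changed."""
    if not added and not removed:
        return None
    return f"feat: +{len(added)} added, -{len(removed)} removed ({timestamp})"
```

Returning `None` from `commit_message` when both lists are empty mirrors the "if anything changed" condition: a caller would skip the commit entirely in that case.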
No external services. No paid APIs. Just a public data source and a free GitHub Action.
See CONTRIBUTING.md, though you usually won't need to: the tracker keeps itself current.
If you spot a data-source bug or want to suggest a new column for the table, open an issue.
If you find this useful, you might also like these other auto-updated trackers from the same maintainer (same mechanism, different upstream):

- trending-claude-skills: What's shipping in Claude Skills this week (`topic:claude-skills`)
- mcp-servers-live: Live index of newest MCP servers (`topic:mcp-server`)
- cursor-rules-live: Newest Cursor rules and .cursorrules patterns (`topic:cursor-rules`)
- claude-code-plugin-tracker: Claude Code plugins and hook configs (`topic:claude-code`)
- llm-agents-radar: Newest LLM agent frameworks (`topic:llm-agent`)
- rag-radar: Newest RAG implementations and tools (`topic:rag`)
- llm-eval-tracker: Newest LLM evaluation tools and benchmarks (`topic:llm-eval`)
- agent-framework-radar: Newest agent frameworks shipping on GitHub (`topic:agent-framework`)
- vector-db-live: Newest vector DB projects and integrations (`topic:vector-database`)
- llmops-radar: Newest LLMOps tooling (observability, deployment) (`topic:llmops`)
- prompt-tools-live: Newest prompt-engineering tools and prompt repos (`topic:prompt-engineering`)
- skills-tracker: Tracking new GitHub 'skills' repos (`topic:agent-skills`)
- awesome-agent-skills: Curated auto-updated awesome-list of AI agent skills (`topic:agent-skills`)
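
Each tracker above is keyed to a `topic:` qualifier. As a minimal sketch of how such a qualifier might map to a GitHub Search API request, the function below builds a repository-search URL. The function name, the sort order (`updated`, descending), and the page size of 100 are assumptions; the README does not state the exact query these trackers use.

```python
from urllib.parse import urlencode

# Public GitHub REST endpoint for repository search.
SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_url(topic, per_page=100):
    """Build a repository-search URL for one topic qualifier (hypothetical helper)."""
    params = {
        "q": f"topic:{topic}",   # e.g. topic:llm-eval
        "sort": "updated",       # assumed: most recently updated first
        "order": "desc",
        "per_page": per_page,    # 100 is the API's maximum page size
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


print(build_search_url("llm-eval"))
```

Keeping URL construction separate from the HTTP call makes the query easy to inspect and test without touching the network.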
MIT — see LICENSE.
- Awesome Agent Skills: Curated, auto-updated awesome-list of vetted AI agent skills with quality ratings for Claude, GPT, and open-source agents (⭐ 0)
- Agent Skills Daily Tracker: Real-time tracking of every new GitHub 'skills' repo to capture the AI agent skill ecosystem trend (⭐ 0)
- Agent Eval Harness: Live, open-source benchmark for comparing AI coding agents on real GitHub issues (⭐ 0)
- Prompt Tools Live: Live-updating tracker of prompt engineering tools, libraries, and techniques, refreshed every 15 minutes (⭐ 0)
- LLMOps Radar: Live index of the newest LLMOps tooling; track what's shipping in LLM observability and deployment (⭐ 0)