Skill Evaluation

skill-evaluation is a Codex / OpenSkills skill for creating evidence-based Skill and Agent evaluation reports.

It reads a target SKILL.md, local repositories, code, logs, traces, metrics, test artifacts, and reference workflows to understand the evaluation object, build a three-layer model, define metrics, run or review tests, attribute failures, and produce a structured reusable report. It can also compare multiple candidate skills on the same task and explain which one works better, where each one fits, and what gaps remain.

Before formal execution, it first produces an evaluation design draft that covers the three-layer breakdown, goals and metrics, data plan, judging methods and rules, data records, and test flow. Only after confirmation should it proceed to test execution and final report generation.

Use Cases

Evaluate whether a new skill is usable and worth keeping.
Compare two or more candidate skills on the same task.
Evaluate the capability chain of a multi-step skill.
Evaluate an agent-like workflow with layered evidence.
Generate an evaluation report from real logs, test data, or historical artifacts.
Build reusable datasets with sample sources, reference outputs, and dataset splits.
Design systematic evaluation scenarios with core scenarios and coverage tags.
Save evaluation run records, sample records, metric results, and artifact paths.
Decide whether multi-agent or sub-agent collaboration is needed during testing.
Evaluate skill routing, invocation readiness, necessity, no-skill baselines, and failure feedback loops.
Generate candidate run matrices, dimension-by-dimension comparisons, scenario fit recommendations, and sample-level pairwise comparisons.
Use only three judging modes: rule-based judging, model-based judging, and human judging.
Reuse valid prior results and rerun only missing or invalid evidence units to reduce token and runtime cost.
After skill changes, add samples that target the intended improvement and rerun affected optimization samples plus related holdout samples.
Output reports suitable for Markdown or Feishu / Lark documents.

Core Method

Evaluation reports use a fixed three-layer architecture:

Layer	Focus
Runtime framework	Environment, credentials, permissions, APIs, logs, available events, output files, routing and invocation. Cost and duration are included only when actually collected.
Capability chain	Data preparation, toolchain, state handoff, schemas, intermediate artifacts, and process orchestration.
Business outcome	Final output quality, relevance, accuracy, false positives / false negatives, user value, and skill necessity.

Every report should include:

Evaluation object and evidence scope
Three-layer breakdown
Evaluation goals and metrics
Evaluation design confirmation record
Scenario catalog, scenario-level split matrix, and coverage tag matrix
Invocation readiness evaluation
Necessity / no-skill baseline evaluation
Multi-agent / sub-agent orchestration decision
Evaluation data records and storage location
Test flow and execution results
Failure points, disagreement samples, and improvement directions
Reproducible artifacts and references
LLM model names and versions used in the run
Judging rules, scoring rubrics, model-judge prompts, and fallback reason records
Resource efficiency and result reuse records

Multi-candidate comparison reports should also include:

Candidate skill list and version evidence
Shared dataset and candidate run matrix
Dimension-by-dimension winner table
Scenario fit recommendations
Sample-level pairwise comparison
Candidate traces and failure attribution

Installation

Clone this repository into a skills directory discoverable by Codex / OpenSkills:

mkdir -p ~/.codex/skills
git clone https://github.com/Scofy0123/skill-evaluation.git ~/.codex/skills/skill-evaluation

If your environment uses .agents/skills:

mkdir -p ~/.agents/skills
git clone https://github.com/Scofy0123/skill-evaluation.git ~/.agents/skills/skill-evaluation

Restart Codex after installation.

Usage Examples

Use $skill-evaluation to evaluate this local skill:
/path/to/your-skill/SKILL.md
Run every safe executable test and output a Markdown evaluation report.

Use $skill-evaluation with this workflow document and historical logs to evaluate a multi-step skill.
The report should include the three-layer breakdown, invocation readiness, necessity, test results, and improvement directions.

Use $skill-evaluation to compare these two PRD-generation skills.
Use a shared dataset, a no-skill baseline, and sample-level pairwise comparison. Explain which one works better, what scenarios each fits, and where the gaps are.

Use $skill-evaluation to update this Feishu evaluation report.
Preserve comments and update only the relevant sections instead of replacing the whole document.

Example Reports

This README is available in English and Chinese. The following evaluation report examples show the expected final report shape:

Repository Structure

SKILL.md
agents/openai.yaml
references/
  evaluation-runbook.md
  metrics-and-judging.md
  report-template.md
  feishu-output.md
  comparative-evaluation.md
  scenario-design-examples.md
  report-examples/
    单一Skill评估报告样例-backend-estimation.md
    多Skill对比评估报告样例-两个开源PRD.md

Reference Files

evaluation-runbook.md: Runbook for RUN-xx execution steps, including the three-layer architecture, dataset splitting, evidence reuse, and feedback loops.
metrics-and-judging.md: Metric design, judging modes and rules, static evaluation tags, and metadata compatibility checks.
report-template.md: Evaluation report template.
feishu-output.md: Feishu / Lark document creation and partial-update guidance.
comparative-evaluation.md: Same-task comparison method for multiple candidate skills.
scenario-design-examples.md: Scenario catalog examples, counterexamples, and common mistakes.
report-examples/: Finished report examples for single-skill evaluation and multi-skill comparison.

Design Principles

Do not fabricate test data. Untested items must be marked as 待补测 or 间接证据.
Before marking an item as 待补测, first check whether it can be measured with available tools, scripts, rule-based judging, historical artifact review, isolated no-skill baselines, Feishu / Lark CLI, or APIs.
Do not rename untested items as engineering gap, not applicable, or not directly measurable to make them look complete. When something truly cannot be measured, document attempted paths, blockers, and the next step to fill the evidence gap.
Read the target skill and related code before designing metrics.
Clarify key evaluation decisions with the user: goals and metrics, three-layer breakdown, data sources, dataset construction, human judging criteria, side-effect permissions, and output format.
Do not run bulk tests, perform online writes, or generate the final report before the evaluation design is confirmed.
Core test scenarios must come from one coherent taxonomy. Define the scenario catalog name and primary classification dimension first; use language, region, technical topic, industry, and risk as coverage tags.
Run real tests when an executable entry point exists. If not, use static evaluation, artifact review, or actively constructed substitute judging.
Inventory existing valid evidence before running new tests. Reuse same-sample, same-input, same-version, same-model results when artifacts and logs are traceable.
Save script-generated data to reusable table files or databases.
The judging mode field may contain only 规则评判, 模型评判, and 人工评判.
Scored metrics must have scoring rubrics. Model-based judging must record prompt files and model versions. Human judging must record concrete review steps.
Record fallback reason category, trigger point, handling action, and whether the fallback counts toward the quality conclusion.
Do not create a weighted aggregate score unless the user explicitly asks for one.
Prefer partial updates for online documents so existing comments are preserved.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
agents		agents
references		references
LICENSE		LICENSE
README.md		README.md
README.zh-CN.md		README.zh-CN.md
SKILL.md		SKILL.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Skill Evaluation

Use Cases

Core Method

Installation

Usage Examples

Example Reports

Repository Structure

Reference Files

Design Principles

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Skill Evaluation

Use Cases

Core Method

Installation

Usage Examples

Example Reports

Repository Structure

Reference Files

Design Principles

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages