SkillOpt optimizes Claude Agent Skills using DSPy's GEPA (Greedy Evolutionary Prompt Adaptation) algorithm. It analyzes skills against best practices, removes filler and verbose explanations, and consolidates redundant content while preserving code blocks and structure. Training data with bad/good example pairs teaches the optimizer what to fix.
uv syncSet your OpenAI API key before running:
export OPENAI_API_KEY=your-api-key
# Optional: Set your API base to use with other providers (e.g. vLLM, Ollama, llama.cpp)
# Specify the model name with --model openai/<model-name> when running
# export OPENAI_API_BASE=http://localhost:11434/v1All commands go through main.py:
# Analyze a skill without optimizing
uv run python main.py analyze <skill_directory>
# Static-only optimization
uv run python main.py optimize <skill_directory> -o <output_directory>
# Eval-based optimization (recommended)
uv run python main.py optimize-evals <skill_directory> --evals <evals.json>
uv run python main.py optimize-evals <skill_directory> --generate-evals
# Variance benchmark
uv run python main.py benchmark <skill_directory> [--num-runs 10]Two Jupyter notebooks in notebooks/ provide step-by-step walkthroughs:
- Static-only optimization — uses the composite static metric (filler, conciseness, code blocks, structure)
- Eval-based optimization — adds LLM-as-judge assertion scoring on top of static analysis
Each subcommand documents its own flags:
uv run python main.py optimize --help
uv run python main.py optimize-evals --help- Parse — Load skill directory (SKILL.md + references) and extract content
- Analyze — Score against Claude Agent Skills best practices
- Optimize — GEPA evolves prompt instructions over multiple iterations, scored by the chosen metric
- Validate — (eval-based only) LLM-as-judge checks the optimized skill against all assertions
- Save — Output optimized SKILL.md and benchmark results
When using eval-based optimization, GEPA scores each candidate with:
- 40% static analysis — filler removal, conciseness, code-block preservation, structure
- 60% assertion satisfaction — an LLM judge evaluates whether the optimized skill enables handling every test case
Assertions can be provided in the evals file or auto-generated (3-6 per test case, filtered for objectivity).
The benchmark subcommand tests analyzer determinism by running multiple analysis passes and reporting score stability.
main.py # CLI entry point (analyze, optimize, optimize-evals, benchmark)
skillopt/ # Core library (parser, analyzer)
scripts/ # Optimization and benchmark scripts
notebooks/ # Interactive Jupyter walkthroughs
examples/ # Example skills, evals, and training data
MIT