| description | CI baseline compare workflow |
|---|
TraceCore CI integrates in two complementary patterns. Choose based on your workflow:
| Pattern | When to use | Gate command |
|---|---|---|
| Record/Replay (bundle-based) | New workflow. Record once locally or via manual dispatch; replay/strict enforces the sealed bundle on every PR. | --replay-bundle <path> --strict |
| Baseline compare (artifact-based) | Existing workflow. Compare a fresh run artifact against a checked-in JSON baseline using policy gates (step/tool-call delta thresholds). | tracecore baseline --compare |
The record/replay pattern is the recommended approach for new projects. The baseline-compare pattern is preserved for backwards compatibility.
If you want a maintained GitHub-native wrapper around tracecore run and tracecore verify, see tracecore-action. Public external consumer-validation examples for that wrapper live in tracecore-test.
[Developer, once]
tracecore run --agent ... --task ... --seed 0 --record
git add .agent_bench/baselines/<run_id>
git commit -m "seal: baseline for my_task@1"
[CI, every PR]
tracecore run --agent ... --task ... --seed 0 \
--replay-bundle .agent_bench/baselines/<run_id> --strict
If the trace diverges from the sealed bundle, CI exits 1 and uploads the run log for triage.
ci/templates/github-record-replay.yml: copy-ready GitHub Actions workflow — records viaworkflow_dispatch, enforces replay/strict on pull requests, uploads artifacts on success and failure.ci/templates/gitlab-record-replay.yml: copy-ready GitLab pipeline — manual record stage, merge-request strict gate, artifact upload.- Both templates now emit
verify.jsonfromtracecore baseline --verify <bundle>during the record job so the sealed bundle has a deterministic integrity report committed alongside the artifacts. scripts/policy_gate.pyand the reusable GitHub workflow also acceptverify.jsonso CI can fail immediately if bundle integrity drifts; pass--bundle-verify-jsontopolicy_gate.pyor rely on theEnforce bundle verify reportstep in.github/workflows/baseline-compare.yml..github/workflows/baseline-compare.yml(reusable, legacy): accepts agent, task, seed, baseline, and optional policy gates; emits run artifacts plusrun.json; fails with exit code1for mismatches and2for incompatible agent/task pairs..github/workflows/chain-agent-baseline.yml(caller, legacy): pins the chain agent +rate_limited_chain@1baseline.
Treat these as the source of truth for command order, artifact upload, and failure messaging.
For teams that prefer consuming a published GitHub Action instead of copying workflow templates, tracecore-action provides a thinner integration surface around these same TraceCore CLI flows.
jobs:
compare:
uses: ./.github/workflows/baseline-compare.yml
with:
agent_path: agents/toy_agent.py
task_ref: filesystem_hidden_config@1
seed: "0"
baseline: .agent_bench/baselines/rate_limited_chain_chain_agent.json
require_success: "true"
max_steps: "200"
max_tool_calls: "50"
max_step_delta: "15"
max_tool_call_delta: "10"You can also run the workflow from the Actions tab using workflow_dispatch inputs.
This repo also ships .github/workflows/chain-agent-baseline.yml, which runs the reusable workflow for:
- agent:
agents/chain_agent.py - task:
rate_limited_chain@1 - baseline:
.agent_bench/baselines/rate_limited_chain_chain_agent.json
Triggers: pull requests to main, pushes to main, and manual dispatch.
baselineaccepts either a run ID or a path to a run artifact.- Policy gates are optional; omit them or pass an empty string to disable a check.
max_step_delta/max_tool_call_deltacompare the current run to the baseline run metrics.- The repo includes
.agent_bench/baselines/rate_limited_chain_chain_agent.jsonas a checked-in reference baseline. - Exit codes:
0(identical),1(different),2(incompatible task/agent). - The workflow uploads
.agent_bench/runsandrun.jsonas artifacts for inspection.
name: tracecore-ci
on:
pull_request:
workflow_dispatch:
jobs:
tracecore-compare:
uses: ./.github/workflows/baseline-compare.yml
with:
agent_path: agents/chain_agent.py
task_ref: rate_limited_chain@1
seed: "0"
baseline: .agent_bench/baselines/rate_limited_chain_chain_agent.json
require_success: "true"
max_steps: "180"
max_tool_calls: "60"
max_step_delta: "10"
max_tool_call_delta: "5"Failure visibility
- Baseline mismatches bubble up through the reusable workflow via
agent-bench baseline --compare(exit codes above) and the policy gate script, so GitHub marks the job red and prints messages such assteps_used delta 14 exceeds max_step_delta 10. - Artifacts (
.agent_bench/runs,run.json) automatically attach for diff triage.
stages:
- run
- compare
- gate
variables:
PYTHON_VERSION: "3.12"
run_agent:
stage: run
image: python:$PYTHON_VERSION
script:
- pip install -e .[dev]
- tracecore run --agent agents/chain_agent.py --task rate_limited_chain@1 --seed 0 > run.json
artifacts:
paths:
- run.json
- .agent_bench/runs/
compare_baseline:
stage: compare
image: python:$PYTHON_VERSION
needs: [run_agent]
script:
- pip install -e .[dev]
- tracecore baseline --compare .agent_bench/baselines/rate_limited_chain_chain_agent.json $(python -c "import json;print(json.load(open('run.json'))['run_id'])") --format text
policy_gates:
stage: gate
image: python:$PYTHON_VERSION
needs: [compare_baseline]
script:
- pip install -e .[dev]
- python scripts/policy_gate.py --run-json run.json --baseline .agent_bench/baselines/rate_limited_chain_chain_agent.json --max-steps 180 --max-step-delta 10Failure visibility
agent-bench baseline --comparesets the job status (0/1/2). Keep--format textso failures show up in job logs.- The gate step can emit explicit
Policy gate failures:messages (mirror the reusable workflow’s Python snippet) and exit non-zero to stop the pipeline. - Artifacts from the
run_agentjob remain available for download from later stages.
- Schedule cadence – cron or orchestrator triggers (e.g., hourly) call a small runner script.
- Runner script
Note: Legacy alias
tracecore run --agent agents/chain_agent.py --task rate_limited_chain@1 --seed "$TRACECORE_SEED" > run.json RUN_ID=$(python -c "import json;print(json.load(open('run.json'))['run_id'])") tracecore baseline --compare "$TRACECORE_BASELINE" "$RUN_ID" --format json > compare.json || echo "Compare exit code: $?" python scripts/policy_gate.py --run-json run.json --baseline "$TRACECORE_BASELINE" --max-steps 180 --max-step-delta 10
agent-benchremains optional for backwards compatibility. - Evidence capture – persist
.agent_bench/runs/${RUN_ID}.json,run.json, andcompare.jsonto your internal evidence store (S3, artifact bucket) along with a metadata row{run_id, task_ref, agent_path, timestamp}. - Alerting – wire the policy gate script’s exit code into your orchestrator’s alert channel (PagerDuty, Slack) so failures are noisy, mirroring the GitHub/GitLab examples.
This keeps TraceCore integrations consistent across CI providers and internal automation without duplicating logic.