description	CI baseline compare workflow

CI Baseline Compare Workflow

TraceCore integrates with CI in two complementary patterns. Choose the one that fits your workflow:

Which pattern to use

| Pattern | When to use | Gate command |
| --- | --- | --- |
| Record/Replay (bundle-based) | New workflow. Record once locally or via manual dispatch; replay/strict enforces the sealed bundle on every PR. | `--replay-bundle <path> --strict` |
| Baseline compare (artifact-based) | Existing workflow. Compare a fresh run artifact against a checked-in JSON baseline using policy gates (step/tool-call delta thresholds). | `tracecore baseline --compare` |

The record/replay pattern is the recommended approach for new projects. The baseline-compare pattern is preserved for backwards compatibility.

If you want a maintained GitHub-native wrapper around tracecore run and tracecore verify, see tracecore-action. Public external consumer-validation examples for that wrapper live in tracecore-test.

Mental model: record locally, enforce in CI

[Developer, once]
tracecore run --agent ... --task ... --seed 0 --record
git add .agent_bench/baselines/<run_id>
git commit -m "seal: baseline for my_task@1"

[CI, every PR]
tracecore run --agent ... --task ... --seed 0 \
  --replay-bundle .agent_bench/baselines/<run_id> --strict

If the trace diverges from the sealed bundle, CI exits 1 and uploads the run log for triage.
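As a minimal sketch, the CI-side command above could be assembled by a small Python helper so that every pipeline passes the same arguments. The function name `build_replay_command` is hypothetical and not part of TraceCore; it uses only the flags shown above, and the agent, task, and bundle values are supplied by the caller.

```python
from typing import List


def build_replay_command(agent: str, task: str, bundle: str, seed: int = 0) -> List[str]:
    """Return the argv for a strict replay run (hypothetical helper).

    Only flags that appear in the commands above are used.
    """
    return [
        "tracecore", "run",
        "--agent", agent,
        "--task", task,
        "--seed", str(seed),
        "--replay-bundle", bundle,
        "--strict",
    ]
```

A CI script could pass the returned list to `subprocess.run(...)` and propagate its `returncode`, so a diverging trace still fails the job with exit code 1.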

Repo-provided workflow patterns

ci/templates/github-record-replay.yml: copy-ready GitHub Actions workflow — records via workflow_dispatch, enforces replay/strict on pull requests, uploads artifacts on success and failure.
ci/templates/gitlab-record-replay.yml: copy-ready GitLab pipeline — manual record stage, merge-request strict gate, artifact upload.
Both templates now emit verify.json from tracecore baseline --verify <bundle> during the record job so the sealed bundle has a deterministic integrity report committed alongside the artifacts.
scripts/policy_gate.py and the reusable GitHub workflow also accept verify.json so CI can fail immediately if bundle integrity drifts; pass --bundle-verify-json to policy_gate.py or rely on the Enforce bundle verify report step in .github/workflows/baseline-compare.yml.
.github/workflows/baseline-compare.yml (reusable, legacy): accepts agent, task, seed, baseline, and optional policy gates; emits run artifacts plus run.json; fails with exit code 1 for mismatches and 2 for incompatible agent/task pairs.
.github/workflows/chain-agent-baseline.yml (caller, legacy): pins the chain agent + rate_limited_chain@1 baseline.

Treat these as the source of truth for command order, artifact upload, and failure messaging.

For teams that prefer consuming a published GitHub Action instead of copying workflow templates, tracecore-action provides a thinner integration surface around these same TraceCore CLI flows.

Reusable workflow usage

jobs:
  compare:
    uses: ./.github/workflows/baseline-compare.yml
    with:
      agent_path: agents/chain_agent.py
      task_ref: rate_limited_chain@1
      seed: "0"
      baseline: .agent_bench/baselines/rate_limited_chain_chain_agent.json
      require_success: "true"
      max_steps: "200"
      max_tool_calls: "50"
      max_step_delta: "15"
      max_tool_call_delta: "10"

Manual trigger

You can also run the workflow from the Actions tab using workflow_dispatch inputs.

Dedicated chain-agent workflow

This repo also ships .github/workflows/chain-agent-baseline.yml, which runs the reusable workflow for:

agent: agents/chain_agent.py
task: rate_limited_chain@1
baseline: .agent_bench/baselines/rate_limited_chain_chain_agent.json

Triggers: pull requests to main, pushes to main, and manual dispatch.

Notes

baseline accepts either a run ID or a path to a run artifact.
Policy gates are optional; omit them or pass an empty string to disable a check.
max_step_delta/max_tool_call_delta compare the current run to the baseline run metrics.
The repo includes .agent_bench/baselines/rate_limited_chain_chain_agent.json as a checked-in reference baseline.
Exit codes: 0 (identical), 1 (different), 2 (incompatible task/agent).
The workflow uploads .agent_bench/runs and run.json as artifacts for inspection.
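For wrappers that call the compare step themselves, the exit codes listed above can be turned into readable log messages. This is a sketch only; `describe_compare_exit` is a hypothetical helper, and the mapping simply restates the three codes from the notes.

```python
# Exit codes as documented in the notes above.
EXIT_MEANINGS = {
    0: "identical",
    1: "different",
    2: "incompatible task/agent",
}


def describe_compare_exit(code: int) -> str:
    """Map a baseline-compare exit code to a short description (hypothetical helper)."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")
```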

GitHub Actions example (policy gates + artifacts)

name: tracecore-ci

on:
  pull_request:
  workflow_dispatch:

jobs:
  tracecore-compare:
    uses: ./.github/workflows/baseline-compare.yml
    with:
      agent_path: agents/chain_agent.py
      task_ref: rate_limited_chain@1
      seed: "0"
      baseline: .agent_bench/baselines/rate_limited_chain_chain_agent.json
      require_success: "true"
      max_steps: "180"
      max_tool_calls: "60"
      max_step_delta: "10"
      max_tool_call_delta: "5"

Failure visibility

Baseline mismatches bubble up through the reusable workflow via tracecore baseline --compare (exit codes above) and the policy gate script, so GitHub marks the job red and prints messages such as steps_used delta 14 exceeds max_step_delta 10.
Artifacts (.agent_bench/runs, run.json) are attached automatically for diff triage.
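To illustrate how a delta gate can produce a message like the one above, here is a minimal sketch. It is not the contents of scripts/policy_gate.py. The function name `check_deltas`, the `tool_calls_used` field name, and the choice to treat the delta as current minus baseline (so only increases fail) are all assumptions; only `steps_used` and the message wording come from the example above.

```python
from typing import Dict, List, Optional


def check_deltas(
    current: Dict[str, int],
    baseline: Dict[str, int],
    max_step_delta: Optional[int] = None,
    max_tool_call_delta: Optional[int] = None,
) -> List[str]:
    """Return failure messages for each delta gate that is exceeded.

    A limit of None disables that gate, mirroring the optional policy gates.
    """
    gates = [
        ("steps_used", "max_step_delta", max_step_delta),
        # "tool_calls_used" is an assumed field name for illustration.
        ("tool_calls_used", "max_tool_call_delta", max_tool_call_delta),
    ]
    failures: List[str] = []
    for field, gate_name, limit in gates:
        if limit is None:
            continue
        # Assumption: signed delta, so only regressions relative to the baseline fail.
        delta = current[field] - baseline[field]
        if delta > limit:
            failures.append(f"{field} delta {delta} exceeds {gate_name} {limit}")
    return failures
```

A caller would print the returned messages and exit non-zero when the list is non-empty.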

GitLab CI example (separate stages)

stages:
  - run
  - compare
  - gate

variables:
  PYTHON_VERSION: "3.12"

run_agent:
  stage: run
  image: python:$PYTHON_VERSION
  script:
    - pip install -e .[dev]
    - tracecore run --agent agents/chain_agent.py --task rate_limited_chain@1 --seed 0 > run.json
  artifacts:
    paths:
      - run.json
      - .agent_bench/runs/

compare_baseline:
  stage: compare
  image: python:$PYTHON_VERSION
  needs: [run_agent]
  script:
    - pip install -e .[dev]
    - tracecore baseline --compare .agent_bench/baselines/rate_limited_chain_chain_agent.json $(python -c "import json;print(json.load(open('run.json'))['run_id'])") --format text

policy_gates:
  stage: gate
  image: python:$PYTHON_VERSION
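  # List run_agent explicitly: a job with needs only downloads artifacts from the jobs it names.
  needs: [run_agent, compare_baseline]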
  script:
    - pip install -e .[dev]
    - python scripts/policy_gate.py --run-json run.json --baseline .agent_bench/baselines/rate_limited_chain_chain_agent.json --max-steps 180 --max-step-delta 10

Failure visibility

tracecore baseline --compare sets the job status (0/1/2). Keep --format text so failures show up in job logs.
The gate step can emit explicit "Policy gate failures:" messages (mirroring the reusable workflow’s Python snippet) and exit non-zero to stop the pipeline.
Artifacts from the run_agent job remain available to later jobs; a job that declares needs only downloads artifacts from the jobs it lists.

Internal tooling / scheduler sketch

Schedule cadence – cron or orchestrator triggers (e.g., hourly) call a small runner script.

Runner script

tracecore run --agent agents/chain_agent.py --task rate_limited_chain@1 --seed "$TRACECORE_SEED" > run.json
RUN_ID=$(python -c "import json;print(json.load(open('run.json'))['run_id'])")
tracecore baseline --compare "$TRACECORE_BASELINE" "$RUN_ID" --format json > compare.json || echo "Compare exit code: $?"
python scripts/policy_gate.py --run-json run.json --baseline "$TRACECORE_BASELINE" --max-steps 180 --max-step-delta 10

Note: Legacy alias agent-bench remains optional for backwards compatibility.

Evidence capture – persist .agent_bench/runs/${RUN_ID}.json, run.json, and compare.json to your internal evidence store (S3, artifact bucket) along with a metadata row {run_id, task_ref, agent_path, timestamp}.
Alerting – wire the policy gate script’s exit code into your orchestrator’s alert channel (PagerDuty, Slack) so failures are noisy, mirroring the GitHub/GitLab examples.
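The evidence-capture step could be sketched as follows. `build_evidence_row` is a hypothetical helper; it assumes only that run.json contains a `run_id` key (as the runner script above reads it) and that the caller supplies the task and agent values. Where the row is stored is left to your own tooling.

```python
import json
from datetime import datetime, timezone
from typing import Dict


def build_evidence_row(run_json_text: str, task_ref: str, agent_path: str) -> Dict[str, str]:
    """Build the {run_id, task_ref, agent_path, timestamp} metadata row (hypothetical helper)."""
    run = json.loads(run_json_text)
    return {
        "run_id": run["run_id"],
        "task_ref": task_ref,
        "agent_path": agent_path,
        # UTC ISO 8601 timestamp taken when the row is built.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

The runner would call this with the contents of run.json after the compare step and write the row next to the uploaded artifacts.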

This keeps TraceCore integrations consistent across CI providers and internal automation without duplicating logic.
