feat: whiteboard layout quality eval harness #425

Draft
wyuc wants to merge 15 commits into main from worktree-whiteboard-eval-harness

Conversation

wyuc commented Apr 14, 2026

Summary

Architecture

eval/whiteboard-layout/
├── scenarios/        # 8 constructed eval scenarios (JSON)
├── runner.ts         # Main entry: multi-turn chat loop + orchestration
├── chat-client.ts    # HTTP POST to /api/chat, SSE stream parser
├── state-manager.ts  # Headless Zustand stores + ActionEngine bridge
├── capture.ts        # Playwright screenshot logic
├── scorer.ts         # VLM scoring with structured rubric
├── reporter.ts       # JSON + Markdown report generation
└── types.ts          # Shared type definitions

Plus app/eval/whiteboard/page.tsx — minimal render page for Playwright screenshots.
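The tree above lists an SSE stream parser in chat-client.ts. The event schema of /api/chat is not shown in this PR, so the following is only a minimal sketch of generic `data:` framing as defined by the Server-Sent Events format; `parseSseBuffer` and `SseParseResult` are hypothetical names, not the PR's actual code.

```typescript
// Hypothetical helper: splits buffered SSE text into complete event payloads.
// The real payload schema of /api/chat is not shown in this PR.
interface SseParseResult {
  events: string[]; // joined `data:` payloads, one entry per complete event
  rest: string; // trailing partial event, to be prepended to the next chunk
}

function parseSseBuffer(buffer: string): SseParseResult {
  // Normalize line endings; SSE events are separated by a blank line.
  const normalized = buffer.replace(/\r\n/g, "\n");
  const blocks = normalized.split("\n\n");
  // The last block has no terminating blank line yet, so it is incomplete.
  const rest = blocks.pop() ?? "";
  const events: string[] = [];
  for (const block of blocks) {
    const dataLines: string[] = [];
    for (const line of block.split("\n")) {
      if (line.startsWith(":")) continue; // comment or keep-alive line
      if (line.startsWith("data:")) {
        // Strip the field name and one optional leading space.
        const value = line.slice(5);
        dataLines.push(value.startsWith(" ") ? value.slice(1) : value);
      }
    }
    // Blocks without any data field (e.g. pure comments) produce no event.
    if (dataLines.length > 0) events.push(dataLines.join("\n"));
  }
  return { events, rest };
}
```

A caller would keep `rest` between network chunks and pass `rest + nextChunk` on the next read, so an event split across chunks is only emitted once it is complete.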

Key Design Decisions

  • No simulation drift: Uses the real ActionEngine with headless Zustand stores, not a separate simulator
  • VLM-based scoring: Screenshots are evaluated by a vision model with a 4-dimension rubric (readability, overlap, space utilization, layout logic) — more reliable than pure geometric metrics
  • Multi-turn support: Each scenario defines a sequence of user messages; whiteboard state carries forward between turns, matching production behavior
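To make the 4-dimension rubric concrete, here is a rough sketch of how a scorer might validate a vision model's reply and reduce it to one number. The dimension names come from the list above; the camelCase keys, the 1 to 5 scale, the JSON reply shape, and the unweighted mean are all assumptions for illustration, not the rubric actually used in scorer.ts.

```typescript
// Assumed key names for the four rubric dimensions named in the PR description.
const DIMENSIONS = ["readability", "overlap", "spaceUtilization", "layoutLogic"] as const;
type Dimension = (typeof DIMENSIONS)[number];
type RubricScores = Record<Dimension, number>;

// Hypothetical: extract and validate a JSON score object from a model reply.
function parseRubricReply(reply: string): RubricScores {
  // Models often wrap JSON in prose or code fences; take the outermost braces.
  const start = reply.indexOf("{");
  const end = reply.lastIndexOf("}");
  if (start === -1 || end <= start) throw new Error("no JSON object in reply");
  const raw = JSON.parse(reply.slice(start, end + 1)) as Record<string, unknown>;
  const scores = {} as RubricScores;
  for (const dim of DIMENSIONS) {
    const value = raw[dim];
    // Assumed scale: 1 (worst) to 5 (best).
    if (typeof value !== "number" || value < 1 || value > 5) {
      throw new Error(`invalid score for ${dim}`);
    }
    scores[dim] = value;
  }
  return scores;
}

// Assumed aggregation: unweighted mean over the four dimensions.
function overallScore(scores: RubricScores): number {
  const total = DIMENSIONS.reduce((sum, dim) => sum + scores[dim], 0);
  return total / DIMENSIONS.length;
}
```

Rejecting malformed replies instead of silently defaulting keeps a flaky model response from being recorded as a real layout score.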

Usage

pnpm eval:whiteboard --scenario multi-step-math --api-key $OPENAI_API_KEY

Related Issues

Closes #38 (provides the evaluation foundation)
Related: #74, #115

Test plan

  • npx tsc --noEmit passes
  • pnpm lint passes
  • pnpm check passes
  • Dev server + single scenario run produces screenshot + VLM score report

🤖 Generated with Claude Code

wyuc and others added 15 commits April 14, 2026 13:56
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…hots

Creates app/eval/whiteboard/page.tsx — a headless client page that
seeds the stageStore with a synthetic slide scene, exposes
window.__setElements() for Playwright to inject PPTElement[], and
renders them via ScreenElement inside a 1000×562.5px white canvas.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
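The commit above says the eval page exposes window.__setElements() so Playwright can inject elements. A minimal sketch of the calling side might look like the following. `PageLike` is a hypothetical structural stand-in that mirrors the call shape of Playwright's `page.evaluate(pageFunction, arg)`, and `ElementLike` is a placeholder for the app's real PPTElement type, which is not shown here.

```typescript
// Placeholder for the app's PPTElement; the real type lives in the app code.
type ElementLike = Record<string, unknown>;

// Hypothetical stand-in for the part of a Playwright Page used below.
interface PageLike {
  evaluate<Arg>(fn: (arg: Arg) => void, arg: Arg): Promise<void>;
}

// Sends elements into the page and hands them to the hook the eval page exposes.
async function injectElements(page: PageLike, elements: ElementLike[]): Promise<void> {
  await page.evaluate((els) => {
    // Runs in the page context, where globalThis is the window object.
    const hook = (globalThis as any).__setElements;
    if (typeof hook !== "function") {
      throw new Error("__setElements is not available; is the eval page loaded?");
    }
    hook(els);
  }, elements);
}
```

Typing against a small interface rather than the full Page keeps the helper testable with a fake page. A screenshot call would follow the injection in the real capture step.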
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace 8 generic scenarios with 6 that match real usage patterns
- Multi-agent discussion with short user replies (嗯 "mm-hm", 明白了 "got it", 继续 "continue")
- Include real slide scene data as initialStoreState
- Generated agent configs with Chinese names and proper roles
- Cover: physics, math, finance, primary school, economics, medical

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract the core agent loop logic into lib/chat/agent-loop.ts as a pure
async function with callback injection. Both the frontend React hook and
the eval harness now share the same loop — SSE parsing, exit conditions
(END/cue_user/empty turns/max turns), and director state accumulation.

The frontend wires StreamBuffer callbacks for UI pacing; the eval wires
ActionEngine + message accumulation for headless execution. If loop logic
changes in the shared module, both consumers automatically stay in sync.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
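The shared loop described in the commit above (a pure async function with injected callbacks and the exit conditions END, cue_user, empty turns, and max turns) could be sketched roughly as follows. All names and the exact empty-turn rule here are assumptions for illustration; the real signature in lib/chat/agent-loop.ts is not shown in this PR page.

```typescript
type TurnOutcome = "END" | "cue_user" | "continue";
type ExitReason = "END" | "cue_user" | "empty_turns" | "max_turns";

interface TurnResult {
  outcome: TurnOutcome;
  actions: string[]; // simplified stand-in for whiteboard actions
}

// Injected by each consumer: the UI wires stream pacing, the eval wires headless execution.
interface LoopCallbacks {
  runTurn(turnIndex: number): Promise<TurnResult>;
  onAction(action: string): void;
}

interface LoopOptions {
  maxTurns: number;
  maxEmptyTurns: number; // assumed: consecutive turns with no actions
}

async function runAgentLoop(
  callbacks: LoopCallbacks,
  options: LoopOptions,
): Promise<{ reason: ExitReason; turns: number }> {
  let emptyTurns = 0;
  for (let turn = 0; turn < options.maxTurns; turn++) {
    const result = await callbacks.runTurn(turn);
    for (const action of result.actions) callbacks.onAction(action);
    if (result.outcome === "END") return { reason: "END", turns: turn + 1 };
    if (result.outcome === "cue_user") return { reason: "cue_user", turns: turn + 1 };
    // Reset the counter whenever a turn produces at least one action.
    emptyTurns = result.actions.length === 0 ? emptyTurns + 1 : 0;
    if (emptyTurns >= options.maxEmptyTurns) {
      return { reason: "empty_turns", turns: turn + 1 };
    }
  }
  return { reason: "max_turns", turns: options.maxTurns };
}
```

Because the loop owns no I/O or UI state, the frontend hook and the eval runner can differ only in the callbacks they pass, which is the property the commit message relies on.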
…nfig

- Rewrite scorer to use resolveModel() + generateText() from AI SDK
  instead of raw fetch — supports all providers (OpenAI, Google, Anthropic)
- Model config via env vars (EVAL_CHAT_MODEL, EVAL_SCORER_MODEL),
  matching the pattern from outline-language eval
- Fix eval page: bootstrap store before SceneProvider mounts
- Fix __dirname for tsx CJS mode
- Remove --api-key/--scorer-model CLI args (use env vars instead)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
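As an illustration of the env-var based model config in the commit above, a small reader for EVAL_CHAT_MODEL and EVAL_SCORER_MODEL might look like this. The function name, the fail-fast behavior, and the absence of defaults are assumptions; the harness may well fall back to default models instead, and the format of the model strings is not specified here.

```typescript
interface EvalModelConfig {
  chatModel: string;
  scorerModel: string;
}

// Hypothetical reader: takes the environment as a parameter so it is easy to test.
function readEvalModelConfig(env: Record<string, string | undefined>): EvalModelConfig {
  const read = (name: string): string => {
    const value = env[name]?.trim();
    // Assumed behavior: fail fast instead of guessing a default model.
    if (!value) throw new Error(`${name} is not set`);
    return value;
  };
  return {
    chatModel: read("EVAL_CHAT_MODEL"),
    scorerModel: read("EVAL_SCORER_MODEL"),
  };
}
```

In a Node script this would be called as `readEvalModelConfig(process.env)`, with the resulting strings handed to whatever model resolution the harness uses.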
Development

Successfully merging this pull request may close these issues.

feat: Whiteboard element overlap detection and auto-layout