feat: whiteboard layout quality eval harness by wyuc · Pull Request #425 · THU-MAIC/OpenMAIC

wyuc · 2026-04-14T08:18:23Z

Summary

Add an end-to-end eval harness for whiteboard layout quality (issues #38 "feat: Whiteboard element overlap detection and auto-layout", #74 "[Bug]: Whiteboard overlaps when there are multiple discussions", #115 "[Bug]: Whiteboard and exercise area are too small to view")
Runs constructed multi-turn chat scenarios against /api/chat, executes whiteboard actions via the real ActionEngine, renders results with Playwright, and scores layout quality via VLM
Includes 8 constructed scenarios covering math derivations, tables+formulas, diagrams, multi-agent collaboration, and stress tests

Architecture

eval/whiteboard-layout/
├── scenarios/        # 8 constructed eval scenarios (JSON)
├── runner.ts         # Main entry: multi-turn chat loop + orchestration
├── chat-client.ts    # HTTP POST to /api/chat, SSE stream parser
├── state-manager.ts  # Headless Zustand stores + ActionEngine bridge
├── capture.ts        # Playwright screenshot logic
├── scorer.ts         # VLM scoring with structured rubric
├── reporter.ts       # JSON + Markdown report generation
└── types.ts          # Shared type definitions

Plus app/eval/whiteboard/page.tsx — minimal render page for Playwright screenshots.
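The SSE stream parsing that chat-client.ts is described as doing could look roughly like the sketch below. It follows the generic Server-Sent Events framing (events separated by a blank line, `event:` and `data:` fields). The `SseEvent` type and the `parseSse` name are illustrative assumptions; the actual /api/chat payloads and the real parser are not shown in this PR.

```typescript
// Hypothetical event shape; the real /api/chat payload format is not shown in this PR.
interface SseEvent {
  event: string; // event name, "message" when the stream does not set one
  data: string; // all data lines of the event joined with "\n"
}

// Parses every complete event in `buffer` and returns the unparsed tail,
// which the caller should prepend to the next network chunk.
function parseSse(buffer: string): { events: SseEvent[]; rest: string } {
  const blocks = buffer.replace(/\r\n/g, "\n").split("\n\n");
  const rest = blocks.pop() ?? ""; // last block may be incomplete
  const events: SseEvent[] = [];

  for (const block of blocks) {
    let event = "message";
    const data: string[] = [];
    for (const line of block.split("\n")) {
      if (line === "" || line.startsWith(":")) continue; // blank line or comment
      const idx = line.indexOf(":");
      const field = idx === -1 ? line : line.slice(0, idx);
      let value = idx === -1 ? "" : line.slice(idx + 1);
      if (value.startsWith(" ")) value = value.slice(1); // strip one leading space
      if (field === "event") event = value;
      else if (field === "data") data.push(value);
    }
    if (data.length > 0) events.push({ event, data: data.join("\n") });
  }
  return { events, rest };
}
```

Returning the incomplete tail keeps the parser stateless, so the same function can be fed chunk by chunk from a fetch response body.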

Key Design Decisions

No simulation drift: Uses the real ActionEngine with headless Zustand stores, not a separate simulator
VLM-based scoring: Screenshots are evaluated by a vision model with a 4-dimension rubric (readability, overlap, space utilization, layout logic) — more reliable than pure geometric metrics
Multi-turn support: Each scenario defines a sequence of user messages; whiteboard state carries forward between turns, matching production behavior
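As an illustration of the 4-dimension rubric, here is a minimal sketch of how a scorer might validate the model's reply before trusting it. The key names (`spaceUtilization`, `layoutLogic`), the 1 to 5 score range, and the function names are assumptions made for this example; the rubric schema actually used by scorer.ts is not shown in this PR.

```typescript
// The four dimensions named in the PR description; key spellings are assumed.
const DIMENSIONS = ["readability", "overlap", "spaceUtilization", "layoutLogic"] as const;
type Dimension = (typeof DIMENSIONS)[number];
type RubricScore = Record<Dimension, number>;

// Assumed score range; the real rubric scale is not stated in the PR.
const MIN_SCORE = 1;
const MAX_SCORE = 5;

// Extracts the first-to-last brace span from the raw model reply and checks
// that every dimension is present and is a finite number within range.
function parseRubricScore(raw: string): RubricScore {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("VLM output contains no JSON object");
  }
  const parsed = JSON.parse(raw.slice(start, end + 1)) as Record<string, unknown>;

  const score = {} as RubricScore;
  for (const dim of DIMENSIONS) {
    const value = parsed[dim];
    if (
      typeof value !== "number" ||
      !Number.isFinite(value) ||
      value < MIN_SCORE ||
      value > MAX_SCORE
    ) {
      throw new Error(`Invalid or missing score for "${dim}"`);
    }
    score[dim] = value;
  }
  return score;
}

// Unweighted mean across the four dimensions.
function meanScore(score: RubricScore): number {
  const total = DIMENSIONS.reduce((sum, dim) => sum + score[dim], 0);
  return total / DIMENSIONS.length;
}
```

Rejecting malformed replies up front means a single bad model response fails loudly instead of silently skewing the aggregate report.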

Usage

pnpm eval:whiteboard --scenario multi-step-math --api-key $OPENAI_API_KEY

Related Issues

Closes #38 (provides the evaluation foundation)
Related: #74, #115

Test plan

npx tsc --noEmit passes
pnpm lint passes
pnpm check passes
Dev server + single scenario run produces screenshot + VLM score report

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…hots

Creates app/eval/whiteboard/page.tsx, a headless client page that seeds the stageStore with a synthetic slide scene, exposes window.__setElements() so Playwright can inject PPTElement[], and renders the elements via ScreenElement inside a 1000×562.5px white canvas.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Replace 8 generic scenarios with 6 that match real usage patterns
- Multi-agent discussion with short user replies (嗯 "mm-hm", 明白了 "got it", 继续 "continue")
- Include real slide scene data as initialStoreState
- Generated agent configs with Chinese names and proper roles
- Cover: physics, math, finance, primary school, economics, medical

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
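To make the scenario description concrete, the sketch below shows one possible shape for a scenario file and a small sanity check that could run before the harness starts. Only `initialStoreState` is named in the commit message; every other field name, and the `validateScenario` helper, is a hypothetical stand-in rather than the schema in types.ts.

```typescript
// Hypothetical scenario shape. Only `initialStoreState` appears in the commit
// message; the remaining field names are assumptions for illustration.
interface EvalScenario {
  id: string;
  agents: { name: string; role: string }[];
  initialStoreState: unknown; // slide scene data seeded into the stores
  userMessages: string[]; // one entry per turn, e.g. short replies
}

// Returns a list of human-readable problems; an empty list means the
// scenario passed these basic checks.
function validateScenario(scenario: EvalScenario): string[] {
  const problems: string[] = [];
  if (scenario.id.trim() === "") problems.push("id is empty");
  if (scenario.agents.length === 0) problems.push("no agents configured");
  if (scenario.userMessages.length === 0) problems.push("no user messages");
  if (scenario.userMessages.some((message) => message.trim() === "")) {
    problems.push("blank user message");
  }
  return problems;
}
```

Collecting all problems instead of throwing on the first one makes it easier to fix a hand-written JSON scenario in a single pass.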

Extract the core agent loop logic into lib/chat/agent-loop.ts as a pure async function with callback injection. Both the frontend React hook and the eval harness now share the same loop: SSE parsing, exit conditions (END/cue_user/empty turns/max turns), and director state accumulation. The frontend wires StreamBuffer callbacks for UI pacing; the eval wires ActionEngine + message accumulation for headless execution. If loop logic changes in the shared module, both consumers automatically stay in sync.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
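The callback-injection idea can be sketched as follows. This is not the code in lib/chat/agent-loop.ts; the `LoopEvent` union, the `LoopCallbacks` interface, and the `runAgentLoop` signature are assumptions chosen to show how one loop with the four exit conditions listed above can serve two different consumers.

```typescript
// Hypothetical event and callback types; the real module's types are not shown.
type LoopEvent =
  | { type: "text"; content: string }
  | { type: "action"; name: string }
  | { type: "cue_user" }
  | { type: "end" };

interface LoopCallbacks {
  onEvent(event: LoopEvent): void; // UI pacing in the app, ActionEngine in the eval
}

type ExitReason = "end" | "cue_user" | "empty_turn" | "max_turns";

// Runs turns until one of the exit conditions is met. `fetchTurn` stands in
// for the request to /api/chat plus stream parsing for a single turn.
async function runAgentLoop(
  fetchTurn: (turn: number) => AsyncIterable<LoopEvent>,
  callbacks: LoopCallbacks,
  maxTurns: number,
): Promise<ExitReason> {
  for (let turn = 0; turn < maxTurns; turn++) {
    let sawContent = false;
    for await (const event of fetchTurn(turn)) {
      callbacks.onEvent(event);
      if (event.type === "end") return "end";
      if (event.type === "cue_user") return "cue_user";
      sawContent = true;
    }
    if (!sawContent) return "empty_turn"; // the turn produced nothing
  }
  return "max_turns";
}
```

Because the loop only talks to its callbacks, a headless consumer can pass an `onEvent` that applies actions to stores, while a UI consumer can pass one that buffers text for display.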

…nfig

- Rewrite scorer to use resolveModel() + generateText() from AI SDK instead of raw fetch, which supports all providers (OpenAI, Google, Anthropic)
- Model config via env vars (EVAL_CHAT_MODEL, EVAL_SCORER_MODEL), matching the pattern from outline-language eval
- Fix eval page: bootstrap store before SceneProvider mounts
- Fix __dirname for tsx CJS mode
- Remove --api-key/--scorer-model CLI args (use env vars instead)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
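A minimal sketch of reading the two environment variables named above, failing fast when one is missing. The `EvalModelConfig` type and `readEvalModelConfig` helper are hypothetical, and the value format of the variables is not specified in this PR, so the sketch treats them as opaque strings that would be handed on to resolveModel().

```typescript
// Hypothetical config shape; only the env var names come from the commit message.
interface EvalModelConfig {
  chatModel: string;
  scorerModel: string;
}

// Takes the environment as a parameter (pass process.env in Node) so the
// function stays pure and easy to test.
function readEvalModelConfig(env: Record<string, string | undefined>): EvalModelConfig {
  const required = (name: string): string => {
    const value = env[name]?.trim();
    if (!value) {
      throw new Error(`Missing required environment variable ${name}`);
    }
    return value;
  };
  return {
    chatModel: required("EVAL_CHAT_MODEL"),
    scorerModel: required("EVAL_SCORER_MODEL"),
  };
}
```

Checking both variables at startup surfaces a misconfigured run before any chat requests or screenshots are made.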

wyuc and others added 15 commits April 14, 2026 13:56

feat(eval): add state manager bridging ActionEngine for eval

522a24d

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat(eval): add shared types for whiteboard layout eval harness

c98dd9d

feat(eval): add SSE chat client for whiteboard eval

d9a57c0

feat(eval): add Playwright capture module for whiteboard screenshots

f4c7c4e

feat(eval): add VLM scorer for whiteboard layout evaluation

5da5150

feat(eval): add report generator for whiteboard eval results

9b33b48

feat(eval): add 8 constructed scenarios for whiteboard layout eval

2390c0c

feat(eval): add main runner for whiteboard layout eval

284ce47

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat(eval): add eval:whiteboard script, install tsx, gitignore results

210fb5a

fix(eval): fix TS errors, lint, and prettier formatting

032b9e0

fix(eval): address code review: add cue_user/empty turn guards, validate VLM output, fix empty array crash

15d9ac2

wyuc wants to merge 15 commits into main from worktree-whiteboard-eval-harness
