Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…hots
Creates app/eval/whiteboard/page.tsx, a headless client page that seeds the stageStore with a synthetic slide scene, exposes window.__setElements() so Playwright can inject PPTElement[], and renders the elements via ScreenElement inside a 1000×562.5px white canvas.
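The injection hook described above could look roughly like the sketch below. It is only an illustration: `PPTElementLike`, `EvalWindowLike`, and `installSetElementsHook` are hypothetical names, and the element shape is a simplified stand-in for the project's real PPTElement type. The idea is that the page installs a function on the window object and the test driver calls it with serializable element data.

```typescript
// Hypothetical, simplified element shape; the project's real PPTElement type is richer.
interface PPTElementLike {
  id: string;
  type: string;
  left: number;
  top: number;
  width: number;
  height: number;
}

// Hypothetical view of the window object once the hook is installed.
interface EvalWindowLike {
  __setElements?: (elements: PPTElementLike[]) => void;
}

// Installs the hook on `target` and returns a cleanup function that removes it.
// `onElements` is whatever the page uses to trigger a re-render (for example a
// React state setter); that wiring is an assumption, not taken from the PR.
function installSetElementsHook(
  target: EvalWindowLike,
  onElements: (elements: PPTElementLike[]) => void,
): () => void {
  target.__setElements = (elements) => {
    if (!Array.isArray(elements)) {
      throw new TypeError("__setElements expects an array of elements");
    }
    // Shallow-copy each element so later caller-side mutation cannot leak in.
    onElements(elements.map((el) => ({ ...el })));
  };
  return () => {
    delete target.__setElements;
  };
}
```

On the Playwright side, a call along the lines of `page.evaluate((els) => window.__setElements(els), elements)` would then push the elements into the page before taking a screenshot, assuming the page has finished mounting.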
…date VLM output, fix empty array crash
- Replace 8 generic scenarios with 6 that match real usage patterns
- Multi-agent discussion with short user replies (嗯 "mm-hm", 明白了 "got it", 继续 "continue")
- Include real slide scene data as initialStoreState
- Generated agent configs with Chinese names and proper roles
- Cover: physics, math, finance, primary school, economics, medical
Extract the core agent loop logic into lib/chat/agent-loop.ts as a pure async function with callback injection. Both the frontend React hook and the eval harness now share the same loop: SSE parsing, exit conditions (END, cue_user, empty turns, max turns), and director state accumulation. The frontend wires StreamBuffer callbacks for UI pacing; the eval wires ActionEngine and message accumulation for headless execution. Because the loop lives in one shared module, any change to it reaches both consumers.
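A minimal sketch of the callback-injection pattern described above, under stated assumptions: the names `runAgentLoop`, `TurnResult`, `LoopCallbacks`, and `LoopOptions` are hypothetical, the empty-turn exit is assumed to count consecutive empty turns, and SSE parsing and director state accumulation are left out. The actual module in lib/chat/agent-loop.ts may be shaped quite differently.

```typescript
// Hypothetical types for illustration only.
type ExitSignal = "END" | "cue_user";
type ExitReason = ExitSignal | "empty_turns" | "max_turns";

interface TurnResult {
  text: string;         // text produced by the agent in this turn
  signal?: ExitSignal;  // explicit exit signal, if the turn emitted one
}

interface LoopCallbacks {
  // Performs one turn; the frontend and the eval harness would inject different implementations.
  runTurn: (turnIndex: number) => Promise<TurnResult>;
  // Optional observer, e.g. UI pacing in the frontend or message accumulation in the eval.
  onTurn?: (result: TurnResult, turnIndex: number) => void;
}

interface LoopOptions {
  maxTurns: number;
  maxEmptyTurns: number; // assumption: counts consecutive empty turns
}

async function runAgentLoop(
  callbacks: LoopCallbacks,
  options: LoopOptions,
): Promise<{ reason: ExitReason; turns: number }> {
  let consecutiveEmpty = 0;
  for (let turn = 0; turn < options.maxTurns; turn++) {
    const result = await callbacks.runTurn(turn);
    callbacks.onTurn?.(result, turn);

    // An explicit signal ends the loop immediately.
    if (result.signal !== undefined) {
      return { reason: result.signal, turns: turn + 1 };
    }

    // Track empty turns; any non-empty turn resets the counter.
    consecutiveEmpty = result.text.trim() === "" ? consecutiveEmpty + 1 : 0;
    if (consecutiveEmpty >= options.maxEmptyTurns) {
      return { reason: "empty_turns", turns: turn + 1 };
    }
  }
  return { reason: "max_turns", turns: options.maxTurns };
}
```

Because the loop itself touches no UI or store, the two consumers differ only in the callbacks they pass in, which is what keeps them from drifting apart.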
…nfig
- Rewrite scorer to use resolveModel() + generateText() from AI SDK instead of raw fetch, so it supports all providers (OpenAI, Google, Anthropic)
- Model config via env vars (EVAL_CHAT_MODEL, EVAL_SCORER_MODEL), matching the pattern from outline-language eval
- Fix eval page: bootstrap store before SceneProvider mounts
- Fix __dirname for tsx CJS mode
- Remove --api-key/--scorer-model CLI args (use env vars instead)
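As a rough illustration of the env-var based model configuration mentioned above, a reader might be modeled like this. Only the variable names EVAL_CHAT_MODEL and EVAL_SCORER_MODEL come from the commit message; `readEvalModelConfig` and `EvalModelConfig` are hypothetical, and failing fast on a missing variable (rather than falling back to a default) is an assumption of this sketch.

```typescript
// Hypothetical config shape; the values are passed through as opaque model identifiers.
interface EvalModelConfig {
  chatModel: string;
  scorerModel: string;
}

// Reads both model identifiers from an env-like record and reports every
// missing variable in a single error, so a misconfigured run fails up front.
function readEvalModelConfig(env: Record<string, string | undefined>): EvalModelConfig {
  const chatModel = env.EVAL_CHAT_MODEL?.trim();
  const scorerModel = env.EVAL_SCORER_MODEL?.trim();

  if (!chatModel || !scorerModel) {
    const missing: string[] = [];
    if (!chatModel) missing.push("EVAL_CHAT_MODEL");
    if (!scorerModel) missing.push("EVAL_SCORER_MODEL");
    throw new Error(`Missing required env vars: ${missing.join(", ")}`);
  }
  return { chatModel, scorerModel };
}
```

In a Node script this would typically be called as `readEvalModelConfig(process.env)`, with the returned identifiers handed to whatever resolves a model for the chat and scorer calls.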
Summary
/api/chat, executes whiteboard actions via the real ActionEngine, renders results with Playwright, and scores layout quality via VLM
Architecture
Plus
app/eval/whiteboard/page.tsx: minimal render page for Playwright screenshots.
Key Design Decisions
ActionEngine with headless Zustand stores, not a separate simulator
Usage
pnpm eval:whiteboard --scenario multi-step-math --api-key $OPENAI_API_KEY
Related Issues
Closes #38 (provides the evaluation foundation)
Related: #74, #115
Test plan
- npx tsc --noEmit passes
- pnpm lint passes
- pnpm check passes
🤖 Generated with Claude Code