For AI agents: Read this file first to understand where the project is. Update it after every meaningful task or group of tasks.
Last updated: 2026-02-15 (Session 13 - Read-triggered camera screenshot)
Branch: 001-minerva-mvp
Overall status: Phases 1-8 COMPLETE (T001-T067). Session 13: read-triggered camera screenshot; removed paper detection and scan button.
Goal: Conversational homework help — when the user stops talking (push-to-talk release) and their words contain "read", send a screenshot of the camera with the message so the model can see the homework. Remove paper-detection and scan-button UI to minimize friction and latency.
Changes:
- Removed: Paper detection loop and "Paper detected" overlay in FloatingVideoOverlay; Scan button in overlay (gallery) and in BottomControlBar; handleScan and onScan wiring from session page.
- Added: In `useSession`, when `onUserMessage(text)` fires, if `text` contains "read" (case-insensitive) and the camera is on, capture one frame via `captureFrame(userCamera.videoRef.current)` and call `brain.handleStudentMessage(text, result)`; otherwise call `brain.handleStudentMessage(text)`.
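A simplified sketch of this dispatch (the hook wiring and types are condensed; `containsReadTrigger` is a name introduced here for illustration):

```typescript
// Hypothetical, simplified sketch of the onUserMessage dispatch in useSession.
// Assumes captureFrame returns image data as a string; exact types simplified.

/** True when the utterance should trigger a camera snapshot. */
export function containsReadTrigger(text: string): boolean {
  // Note: plain substring match, so words like "ready" also trigger.
  return text.toLowerCase().includes("read");
}

interface Brain {
  handleStudentMessage(text: string, imageData?: string): void;
}

export function onUserMessage(
  text: string,
  brain: Brain,
  cameraOn: boolean,
  video: HTMLVideoElement | null,
  captureFrame: (v: HTMLVideoElement) => string,
): void {
  if (containsReadTrigger(text) && cameraOn && video) {
    const imageData = captureFrame(video); // one frame, captured on release
    brain.handleStudentMessage(text, imageData);
  } else {
    brain.handleStudentMessage(text);
  }
}
```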
Files changed:
| File | Change |
|---|---|
| `src/app/student/session/page.tsx` | Removed `captureFrame` import, `handleScan`, `onScan` props |
| `src/components/session/FloatingVideoOverlay.tsx` | Removed `onScan` prop, DocumentOverlay, ScanFlash, detection state/loop, `handleScan`, Scan button |
| `src/components/session/BottomControlBar.tsx` | Removed `onScan` prop and scan button |
| `src/hooks/useSession.ts` | Import `captureFrame`; in `onUserMessage`, if "read" then capture frame and pass `imageData` |
Note: `src/lib/camera/detector.ts` is now unused (left in the repo for possible future use).
Problem: When Claude called tools (setContentMode, executeCanvasCommands), it would sometimes generate ONLY tool calls without any speech text. The avatar would remain silent.
Root Cause: Two issues:
- The prompt didn't explicitly require speech with every response
- The code was checking for a non-existent `step-finish` event instead of `text-end`
Fixes Applied:
- Removed the check for the non-existent `step-finish` event
- Speech is now emitted when we see a `tool-call` event (before yielding the tool)
- Added a `text-end` handler for text-only responses
- Fallback still catches edge cases
- Added a "CRITICAL: ALWAYS GENERATE SPEECH TEXT" section to the prompt
- Explicitly tells Claude: "Never call tools without also generating speech"
- Shows an example response flow: generate speech FIRST, then call tools
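An illustrative shape of such a prompt section (the exact wording in prompts.ts may differ; only the quoted rules are confirmed):

```text
## CRITICAL: ALWAYS GENERATE SPEECH TEXT
Never call tools without also generating speech.
Generate speech FIRST, then call tools. Example flow:
  1. speech: "Let me pull that up for you..."
  2. tool call: setContentMode("sandbox")
  3. tool call: executeCanvasCommands([...])
```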
| File | Change |
|---|---|
| `src/lib/claude/client.ts` | Fixed multi-step stream handling for tools with `execute()` |
| `src/lib/claude/prompts.ts` | Added mandatory speech requirement to prompt |
When a tool has an `execute()` function (like `getExistingVideos`), the AI SDK handles it automatically:

Step 1:
- `start-step` → speechBuffer reset
- `text-delta` events → Claude's intro speech
- `text-end` → speech emitted
- `tool-call` → getExistingVideos
- `tool-result` → AI SDK executes, returns result

Step 2 (automatic continuation):
- `start-step` → speechBuffer reset
- `text-delta` events → Claude's follow-up based on tool result
- `text-end` → speech emitted
- `done`
Key changes:
- Added a `start-step` handler to reset `speechBuffer` for each step
- Removed the `speechEmitted` flag; speech is now emitted per step, not once
- `text-end` emits speech immediately (not waiting for tool calls)
- Safety: also emit speech on `tool-call` if `text-end` didn't fire
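The per-step emission logic can be sketched as a small reducer over stream events (event names match the ones we handle; the real code yields into an async generator instead of returning an array):

```typescript
// Sketch of per-step speech emission over AI SDK stream events.
// Buffer handling is simplified; event names match those handled above.

type StreamEvent =
  | { type: "start-step" }
  | { type: "text-delta"; text: string }
  | { type: "text-end" }
  | { type: "tool-call"; name: string };

export function collectSpeechPerStep(events: StreamEvent[]): string[] {
  const emitted: string[] = [];
  let speechBuffer = "";
  let emittedThisStep = false;

  const emit = () => {
    if (!emittedThisStep && speechBuffer.trim()) {
      emitted.push(speechBuffer.trim());
      emittedThisStep = true;
    }
  };

  for (const ev of events) {
    switch (ev.type) {
      case "start-step": // reset per step, not once per response
        speechBuffer = "";
        emittedThisStep = false;
        break;
      case "text-delta":
        speechBuffer += ev.text;
        break;
      case "text-end": // normal path: emit as soon as text completes
        emit();
        break;
      case "tool-call": // safety net: emit before yielding the tool
        emit();
        break;
    }
  }
  return emitted;
}
```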
Merged two branches:
- `anton/latency-test` — SSE streaming for faster time-to-first-word
- HEAD — Sandbox token optimization + Manim video integration + Design system

Architecture:
- SSE streaming pipeline: speech arrives early (~1s), avatar starts talking while the remaining fields generate
- `respondStream()` async generator on TutorBrain — uses `client.messages.stream()` + regex speech extraction
- API route returns `text/event-stream` with a `ReadableStream`
- Frontend consumes SSE via the `consumeStream()` helper in `useTutorBrain`
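The consuming side might look roughly like this; `parseSseChunk` is a hypothetical helper shown for illustration (the real `consumeStream()` may be structured differently):

```typescript
// Hypothetical SSE frame parser for illustration. An SSE frame looks like
// "event: speech\ndata: {...}\n\n"; frames are separated by blank lines.

export interface SseEvent {
  event: string;
  data: string;
}

export function parseSseChunk(chunk: string): SseEvent[] {
  const events: SseEvent[] = [];
  for (const frame of chunk.split("\n\n")) {
    let event = "message"; // SSE default event name
    const dataLines: string[] = [];
    for (const line of frame.split("\n")) {
      if (line.startsWith("event:")) event = line.slice(6).trim();
      else if (line.startsWith("data:")) dataLines.push(line.slice(5).trim());
    }
    if (dataLines.length) events.push({ event, data: dataLines.join("\n") });
  }
  return events;
}
```

On a `speech` event the avatar starts speaking immediately; the `result` event carries the remaining fields.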
Sandbox Token Optimization (80-90% token reduction):
- Claude outputs `sandboxContent` + `sandboxAccent` (not full HTML)
- Frontend wraps the content with a Twind template in `buildSandboxHtml()`
- Subject-based accent colors (physics=blue, chemistry=emerald, etc.)
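A sketch of what this wrapping might look like (the wrapper markup and the hex values for the subject accents are assumptions; the real `buildSandboxHtml()` and its Twind setup may differ):

```typescript
// Illustrative sketch of buildSandboxHtml(). Accent hex values are assumed;
// the real template also loads Twind, which is elided here.

const ACCENTS: Record<string, string> = {
  physics: "#3B82F6",   // blue (assumed hex)
  chemistry: "#10B981", // emerald (assumed hex)
  general: "#22D3EE",   // cyan (assumed hex)
};

export function buildSandboxHtml(content: string, accent: string): string {
  const color = ACCENTS[accent] ?? ACCENTS.general;
  // Claude only produced `content`; the wrapper is added client-side,
  // which is where the 80-90% token saving comes from.
  return `<!doctype html>
<html><head><style>
  :root { --accent: ${color}; }
  html,body { margin:0; padding:0; overflow:hidden; }
</style></head><body>${content}</body></html>`;
}
```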
Manim Video Integration:
- `manimVideoFile` — reuse an existing video by filename
- `manimPrompt` — generate a new video (30-120s)
- `videoUrl` — resolved URL added by the server
- Server auto-corrects `contentMode` to "video" if video fields are present
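The auto-correction can be sketched as follows (the function name is hypothetical; the field names match those above):

```typescript
// Hypothetical sketch of the server-side auto-correction: if any video
// field is present, force contentMode to "video" regardless of what the
// model chose.

type ContentMode = "welcome" | "math" | "sandbox" | "video";

interface TutorFields {
  contentMode: ContentMode;
  manimVideoFile?: string;
  manimPrompt?: string;
  videoUrl?: string;
}

export function normalizeContentMode(fields: TutorFields): ContentMode {
  const hasVideo =
    !!fields.manimVideoFile || !!fields.manimPrompt || !!fields.videoUrl;
  return hasVideo ? "video" : fields.contentMode;
}
```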
Content Modes: "welcome" | "math" | "sandbox" | "video"
Push-to-Talk Enhancement:
- `avatarFlush()` — immediately sends accumulated transcription on Space release
- Fixes latency from waiting out the debounce
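The flush behavior can be sketched as below (the class name and debounce window are illustrative, not the actual implementation):

```typescript
// Hypothetical sketch of the flush-on-release behavior behind avatarFlush().
// Normally the transcription is debounced; releasing Space calls flush()
// to skip the remaining debounce wait.

export class TranscriptBuffer {
  private parts: string[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private send: (text: string) => void,
    private debounceMs = 800, // assumed debounce window
  ) {}

  append(text: string): void {
    this.parts.push(text);
    if (this.timer) clearTimeout(this.timer);
    this.timer = setTimeout(() => this.flush(), this.debounceMs);
  }

  /** Called on Space release: send whatever has accumulated, immediately. */
  flush(): void {
    if (this.timer) { clearTimeout(this.timer); this.timer = null; }
    const text = this.parts.join(" ").trim();
    this.parts = [];
    if (text) this.send(text);
  }
}
```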
| File | Resolution |
|---|---|
| `src/types/session.ts` | Keep sandboxContent/sandboxAccent/videoUrl (HEAD) |
| `src/stores/sessionStore.ts` | Keep HEAD's fields + actions |
| `src/lib/claude/client.ts` | Merge: SSE streaming + our Zod schema with sandbox/manim fields |
| `src/hooks/useTutorBrain.ts` | Merge: SSE consumption + content mode validation + video/sandbox handling |
| `src/hooks/useSession.ts` | Keep HEAD's fields + add avatarFlush |
| `src/app/api/tutor/respond/route.ts` | Merge: SSE streaming + Manim generation in result event |
| `src/app/student/session/page.tsx` | Keep HEAD + add avatarFlush to push-to-talk |
| `src/components/session/ContentMode.tsx` | Keep HEAD's sandboxContent/accent/videoUrl props |
| `src/components/session/SandboxPanel.tsx` | Keep HEAD's content/accent + Twind template |
| `progress.md` | Combined both sessions' notes |
Two major changes: (1) Cohesive Soft Lavender (#A78BFA) + Aqua (#67E8F9) design identity across entire app. (2) Complete AI prompt rewrite with sandbox HTML templates and tighter speech rules.
- globals.css — Full lavender/aqua color palette replacing defaults. `--font-display` variable. SVG grain texture overlay (3% opacity). Safari input fix (`-webkit-appearance: none`).
- layout.tsx — Space Grotesk display font via `next/font/google`. `class="dark"` on `<html>`. Body includes `${spaceGrotesk.variable}`.
- prompts.ts — MAJOR rewrite:
- Fixed HTML skeleton for sandbox (consistent layout every time)
- 6 layout templates: centered, split, steps, comparison, chart, interactive
- Subject-based accent colors (Physics=blue, Chemistry=emerald, Biology=green, History=amber, Literature=purple, General=cyan)
- BANNED PHRASES: "Great question!", "Absolutely!", "Excellent!", "Fantastic!", "Not quite"
- USE INSTEAD: "yeah that's right", "nice, so...", "hmm what if..."
- Speech: 1-2 sentences MAX, always end with question, sound like cool older sibling
- Content routing: first response = visual, follow-ups = speech only unless needed
- Hard constraints: 3500 chars max, no CDN, no scrolling, clamp() for responsive sizing
- client.ts — Added `sandboxTemplate` to the Zod schema (enum of 6 templates)
- SandboxPanel — Fade-in transition, lavender empty state, updated viewport CSS
- ChatSheet — Lavender user bubbles (`bg-[#A78BFA]`), violet-tinted AI bubbles, 3 bouncing lavender dots for typing indicator, lavender focus ring
- BottomControlBar — Lavender join button (was green), lavender timer text, `-webkit-backdrop-filter` for Safari
- FloatingVideoOverlay — Lavender status dots, lavender thinking pulse/glow (was blue), lavender view mode icons
- Landing page — Dark bg (#0A0A0A), Space Grotesk headings, lavender TreeHacks badge, lavender feature cards with hover, lavender tech badges, lavender CTA section
- Login page — `font-display` on title
- Session page — Lavender/aqua/violet mode badge dots, `-webkit-backdrop-filter` on badge, aqua push-to-talk active state
- Parent layout — Dark sidebar (`bg-[#0E0C18]`), lavender logo, lavender nav hover
- Parent dashboard — `font-display` title, lavender/aqua stat card borders + values, lavender session badges
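The `sandboxTemplate` field added to the Zod schema constrains layouts to the 6 templates; a dependency-free sketch of the same constraint (the real code in client.ts uses a Zod enum):

```typescript
// Illustrative, dependency-free equivalent of the sandboxTemplate Zod enum.
// Template names match the 6 layouts listed above.

export const SANDBOX_TEMPLATES = [
  "centered",
  "split",
  "steps",
  "comparison",
  "chart",
  "interactive",
] as const;

export type SandboxTemplate = (typeof SANDBOX_TEMPLATES)[number];

export function isSandboxTemplate(value: string): value is SandboxTemplate {
  return (SANDBOX_TEMPLATES as readonly string[]).includes(value);
}
```

With Zod this is roughly `z.enum(SANDBOX_TEMPLATES)` on the response schema.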
Major change: Rearchitected the tutor response pipeline from single JSON response to SSE streaming. Speech field is extracted early via regex and emitted immediately, so the avatar starts speaking while sandboxHtml/canvasCommands are still generating.
- `respondStream()` async generator — New method on TutorBrain that uses `client.messages.stream()` + regex-based speech extraction. Yields a `speech` event as soon as the speech field is complete, then a `result` event with the remaining fields.
- SSE API route — `/api/tutor/respond` now returns `text/event-stream` with a `ReadableStream`. Events: `speech`, `result`, `done`, `error`. Perplexity enrichment still runs before the stream starts.
- Frontend SSE consumption — `useTutorBrain` reads SSE events via `fetch()` + a `ReadableStream` reader. Avatar speaks on the `speech` event (fire-and-forget). Sandbox/canvas/progress update on the `result` event.
- `buildClaudeRequest()` helper — Extracted shared message-building logic from `respond()` to avoid duplication with `respondStream()`.
- Prompt caching — `cache_control: { type: "ephemeral" }` on system prompts saves ~200-500ms after the first request.
- Module-level Anthropic client — Reuses HTTP connections, avoiding a TLS handshake per request.
- Speech extraction regex: `/"speech"\s*:\s*"((?:[^"\\]|\\.)*)"\s*[,}]/` — detects a complete speech value in the JSON token stream. Works because `speech` is the first field in the Zod schema.
- Two SSE events: `speech` (emitted early) + `result` (everything else, emitted when the stream ends). Simpler than per-field events.
- AbortController cascade: Frontend abort cancels the fetch → SSE ReadableStream cancel fires → server AbortController aborts the Claude stream.
- Backward compatible: `respond()` still exists as a non-streaming fallback.
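The extraction can be demonstrated in isolation (unescaping via `JSON.parse` here is a simplification of whatever the production code does):

```typescript
// Demonstrates the speech-extraction regex from respondStream() on a
// partial JSON stream: the speech field is complete while later fields
// are still being generated.

const SPEECH_RE = /"speech"\s*:\s*"((?:[^"\\]|\\.)*)"\s*[,}]/;

export function extractSpeech(partialJson: string): string | null {
  const m = SPEECH_RE.exec(partialJson);
  if (!m) return null;
  // Re-wrap the raw captured value so JSON.parse handles escape sequences.
  return JSON.parse(`"${m[1]}"`);
}
```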
Major change: Replaced side-by-side react-resizable-panels video grid with a true Zoom-style floating PiP overlay. Reverted CSS design system injection that made sandbox output look generic.
- Created `FloatingVideoOverlay.tsx` using `react-rnd` — draggable + resizable floating PiP
- Three view modes matching Zoom's actual behavior:
  - Strip (— icon): Thin dark bar showing "Talking: Minerva" or status text
  - Speaker (□ icon): One large video tile with name label + hover controls
  - Gallery (⋮⋮⋮ icon): Two stacked video tiles (avatar top, camera bottom)
- View mode switch icons + minimize button only visible on hover (group-hover pattern)
- Video persistence: `<video>` elements always mounted as `sr-only`; `<canvas>` mirrors via `requestAnimationFrame` + `drawImage()` — the stream is never lost across mode/minimize changes
- Resize handles with stripe patterns (matching Zoom): bottom (horizontal stripes), right (vertical stripes), corner (diagonal lines SVG)
- `lockAspectRatio` for speaker mode, per-mode min/max sizes
- Document detection + scan button preserved on camera tile in gallery mode
- Minimizable to small pill (top-right corner)
- Deleted `VideoGrid.tsx`, removed the `react-resizable-panels` package
- Injected minimal CSS: `html,body{margin:0;padding:0;overflow:hidden;width:100%;height:100vh;max-height:100vh;}`
- Added "Content MUST fit in one screen" to the Claude prompt sandbox rules
- Reverted CSS design system injection (user feedback: made output look "AI-ish generic")
- Reverted prompt changes that increased char limit and added design patterns
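The canvas-mirroring persistence trick described above (always-mounted `<video>`, visible `<canvas>` drawing from it) can be sketched as follows; `fitContain` is a hypothetical helper, and the real overlay may simply stretch the frame:

```typescript
// Sketch of the canvas-mirroring loop: the <video> stays mounted (sr-only)
// so the MediaStream survives, and each visible tile draws from it on
// every animation frame. fitContain is a hypothetical sizing helper.

export interface DrawRect { x: number; y: number; w: number; h: number }

/** Fit src into dst preserving aspect ratio (letterbox/pillarbox). */
export function fitContain(
  srcW: number, srcH: number, dstW: number, dstH: number,
): DrawRect {
  const scale = Math.min(dstW / srcW, dstH / srcH);
  const w = srcW * scale;
  const h = srcH * scale;
  return { x: (dstW - w) / 2, y: (dstH - h) / 2, w, h };
}

export function mirrorVideoToCanvas(
  video: HTMLVideoElement,
  canvas: HTMLCanvasElement,
): () => void {
  const ctx = canvas.getContext("2d")!;
  let raf = 0;
  const draw = () => {
    const { x, y, w, h } = fitContain(
      video.videoWidth || 1, video.videoHeight || 1,
      canvas.width, canvas.height,
    );
    ctx.drawImage(video, x, y, w, h);
    raf = requestAnimationFrame(draw);
  };
  draw();
  return () => cancelAnimationFrame(raf); // cleanup on unmount
}
```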
- `npx tsc --noEmit` — 0 errors (pending verification after merge)
- `npm run build` — compiles successfully (pre-existing DB error on /parent SSR, unrelated)
- Session 11 merge is complete — SSE streaming + sandbox optimization + Manim videos
- SSE latency benefit: speech arrives in ~1s, so the avatar starts talking immediately
- Sandbox token savings: 80-90% reduction (sandboxContent + sandboxAccent vs full HTML)
- Manim videos: Claude can reuse by filename or generate new (30-120s generation time)
- Push-to-talk: `avatarFlush()` sends accumulated text immediately on Space release
- Design identity: Soft Lavender (#A78BFA) primary + Aqua (#67E8F9) accent on deep purple-black (#0C0A14)
- Typography: Space Grotesk (display/headlines) + Geist (body). Use the `font-display` class for headings.
- Safari: `-webkit-backdrop-filter` added alongside `backdrop-filter` in key components
- FloatingVideoOverlay uses `react-rnd` + canvas mirroring — videos never unmount
- Pre-existing build error: `/parent` page fails during static generation (local DB "kimsanov" doesn't exist) — unrelated to our code