
Minerva — Progress Tracker

For AI agents: Read this file first to understand where the project is. Update it after every meaningful task or group of tasks.

Last updated: 2026-02-15 (Session 13 - Read-triggered camera screenshot)
Branch: 001-minerva-mvp
Overall status: Phases 1-8 COMPLETE (T001-T067)
Session 13: read-triggered camera screenshot; removed paper detection and scan button.


Session 13: Read-triggered camera screenshot

Goal: Conversational homework help — when the user stops talking (push-to-talk release) and their words contain "read", send a screenshot of the camera with the message so the model can see the homework. Remove paper-detection and scan-button UI to minimize friction and latency.

Changes:

  • Removed: Paper detection loop and "Paper detected" overlay in FloatingVideoOverlay; Scan button in overlay (gallery) and in BottomControlBar; handleScan and onScan wiring from session page.
  • Added: In useSession, when onUserMessage(text) fires, if text contains "read" (case-insensitive) and the camera is on, capture one frame via captureFrame(userCamera.videoRef.current) and call brain.handleStudentMessage(text, result); otherwise brain.handleStudentMessage(text).

Files changed:

| File | Change |
| --- | --- |
| src/app/student/session/page.tsx | Removed captureFrame import, handleScan, onScan props |
| src/components/session/FloatingVideoOverlay.tsx | Removed onScan prop, DocumentOverlay, ScanFlash, detection state/loop, handleScan, Scan button |
| src/components/session/BottomControlBar.tsx | Removed onScan prop and scan button |
| src/hooks/useSession.ts | Import captureFrame; in onUserMessage, if "read" then capture frame and pass imageData |

Note: src/lib/camera/detector.ts is now unused (left in repo for possible future use).
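
A minimal sketch of the onUserMessage branch described above (simplified; the real hook also checks the camera stream via userCamera.videoRef). containsReadTrigger and routeUserMessage are hypothetical helper names; handleStudentMessage and captureFrame come from the notes.

```typescript
// Hypothetical helper: case-insensitive substring match, as described.
function containsReadTrigger(text: string): boolean {
  return /read/i.test(text);
}

// Hypothetical shape standing in for the real TutorBrain wiring.
type Brain = {
  handleStudentMessage: (text: string, imageData?: string) => void;
};

function routeUserMessage(
  brain: Brain,
  text: string,
  cameraOn: boolean,
  captureFrame: () => string // stands in for captureFrame(videoRef.current)
): void {
  if (containsReadTrigger(text) && cameraOn) {
    // Trigger word + live camera: attach one frame to the message.
    brain.handleStudentMessage(text, captureFrame());
  } else {
    brain.handleStudentMessage(text);
  }
}
```

Note that a plain substring match also fires on words like "already" or "ready"; a word-boundary regex (/\bread\b/i) would be stricter if that becomes a problem.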


Session 12: Tool Calling Speech Fix

Problem: When Claude called tools (setContentMode, executeCanvasCommands), it would sometimes generate ONLY tool calls without any speech text. The avatar would remain silent.

Root Cause: Two issues:

  1. The prompt didn't explicitly require speech with every response
  2. The code was checking for a non-existent step-finish event instead of text-end

Fixes Applied:

1. Fixed stream event handling in client.ts

  • Removed check for non-existent step-finish event
  • Speech is now emitted when we see a tool-call event (before yielding the tool)
  • Added text-end handler for text-only responses
  • Fallback still catches edge cases

2. Updated system prompt in prompts.ts

  • Added CRITICAL: ALWAYS GENERATE SPEECH TEXT section
  • Explicitly tells Claude: "Never call tools without also generating speech"
  • Shows example response flow: generate speech FIRST, then call tools

Files Changed

| File | Change |
| --- | --- |
| src/lib/claude/client.ts | Fixed multi-step stream handling for tools with execute() |
| src/lib/claude/prompts.ts | Added mandatory speech requirement to prompt |

Multi-Step Tool Execution Flow (NEW)

When a tool has an execute() function (like getExistingVideos), the AI SDK handles it automatically:

```
Step 1:
  start-step   → speechBuffer reset
  text-delta   → Claude's intro speech
  text-end     → speech emitted
  tool-call    → getExistingVideos
  tool-result  → AI SDK executes, returns result

Step 2 (automatic continuation):
  start-step   → speechBuffer reset
  text-delta   → Claude's follow-up based on tool result
  text-end     → speech emitted

done
```

Key changes:

  • Added start-step handler to reset speechBuffer for each step
  • Removed speechEmitted flag - now emit speech per-step, not once
  • text-end emits speech immediately (not waiting for tool calls)
  • Safety: also emit speech on tool-call if text-end didn't fire
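
The per-step emission logic above can be sketched as a pure reducer over the stream events. Event names (start-step, text-delta, text-end, tool-call) come from these notes; the reducer shape itself is a hypothetical simplification of client.ts.

```typescript
type StreamEvent =
  | { type: "start-step" }
  | { type: "text-delta"; delta: string }
  | { type: "text-end" }
  | { type: "tool-call"; name: string };

function collectSpeechPerStep(events: StreamEvent[]): string[] {
  const emitted: string[] = [];
  let speechBuffer = "";
  let emittedThisStep = false;

  const emit = () => {
    // Emit at most once per step, and only if there is any text.
    if (!emittedThisStep && speechBuffer.trim()) {
      emitted.push(speechBuffer.trim());
      emittedThisStep = true;
    }
  };

  for (const ev of events) {
    switch (ev.type) {
      case "start-step": // reset per step, not once per response
        speechBuffer = "";
        emittedThisStep = false;
        break;
      case "text-delta":
        speechBuffer += ev.delta;
        break;
      case "text-end": // emit immediately, without waiting for tool calls
        emit();
        break;
      case "tool-call": // safety net if text-end never fired
        emit();
        break;
    }
  }
  return emitted;
}
```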

Session 11: Merge — SSE Streaming + Design System + Manim Videos

Merged two branches:

  1. anton/latency-test — SSE streaming for faster time-to-first-word
  2. HEAD — Sandbox token optimization + Manim video integration + Design system

Key Changes After Merge

Architecture:

  • SSE streaming pipeline: speech arrives early (~1s), avatar starts talking while remaining fields generate
  • respondStream() async generator on TutorBrain — uses client.messages.stream() + regex speech extraction
  • API route returns text/event-stream with ReadableStream
  • Frontend consumes SSE via consumeStream() helper in useTutorBrain

Sandbox Token Optimization (80-90% token reduction):

  • Claude outputs sandboxContent + sandboxAccent (not full HTML)
  • Frontend wraps with Twind template in buildSandboxHtml()
  • Subject-based accent colors (physics=blue, chemistry=emerald, etc.)
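
A sketch of the frontend wrapping described above. buildSandboxHtml is the helper named in these notes, but the template markup, the assumption that sandboxAccent carries a subject key, and the hex values are all illustrative stand-ins, not the real implementation.

```typescript
// Illustrative subject → accent mapping (hex values are guesses, not the real palette).
const SUBJECT_ACCENTS: Record<string, string> = {
  physics: "#3B82F6",    // blue
  chemistry: "#10B981",  // emerald
  biology: "#22C55E",    // green
  history: "#F59E0B",    // amber
  literature: "#A855F7", // purple
  general: "#06B6D4",    // cyan
};

function buildSandboxHtml(sandboxContent: string, sandboxAccent: string): string {
  const accent = SUBJECT_ACCENTS[sandboxAccent] ?? SUBJECT_ACCENTS.general;
  // Claude only produced the inner content; the fixed frame (styling, viewport
  // lock, accent variable) is added once on the frontend, which is where the
  // 80-90% token saving comes from.
  return [
    "<!doctype html><html><head>",
    `<style>:root{--accent:${accent}}html,body{margin:0;overflow:hidden;height:100vh}</style>`,
    "</head><body>",
    sandboxContent,
    "</body></html>",
  ].join("");
}
```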

Manim Video Integration:

  • manimVideoFile — reuse existing video by filename
  • manimPrompt — generate new video (30-120s)
  • videoUrl — resolved URL added by server
  • Server auto-corrects contentMode to "video" if video fields present
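
The server-side auto-correction above amounts to a small normalization step. The field names come from these notes; the response type and function name are hypothetical simplifications.

```typescript
type ContentMode = "welcome" | "math" | "sandbox" | "video";

interface TutorResult {
  contentMode: ContentMode;
  manimVideoFile?: string; // reuse an existing rendered video by filename
  manimPrompt?: string;    // request a fresh render (30-120s)
  videoUrl?: string;       // resolved by the server, never by Claude
}

// If either Manim field is present, force contentMode to "video" so the
// frontend renders the video panel even when Claude picked another mode.
function normalizeContentMode(result: TutorResult): TutorResult {
  if (result.manimVideoFile || result.manimPrompt) {
    return { ...result, contentMode: "video" };
  }
  return result;
}
```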

Content Modes: "welcome" | "math" | "sandbox" | "video"

Push-to-Talk Enhancement:

  • avatarFlush() — immediately sends accumulated transcription on Space release
  • Fixes latency from debounce waiting
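
The debounce-plus-flush pattern behind avatarFlush() can be sketched as follows: transcription chunks accumulate and are normally sent after a quiet period, but flush() sends whatever is buffered immediately on Space release. All names and the delay value here are illustrative, not the real hook internals.

```typescript
class FlushableBuffer {
  private buffer = "";
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private send: (text: string) => void,
    private delayMs = 800 // hypothetical debounce window
  ) {}

  // Each transcription chunk restarts the debounce timer.
  push(chunk: string): void {
    this.buffer += chunk;
    if (this.timer) clearTimeout(this.timer);
    this.timer = setTimeout(() => this.flush(), this.delayMs);
  }

  // Called on push-to-talk release: skip the remaining debounce wait.
  flush(): void {
    if (this.timer) clearTimeout(this.timer);
    this.timer = null;
    if (this.buffer) {
      this.send(this.buffer);
      this.buffer = "";
    }
  }
}
```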

Files Touched in Merge

| File | Resolution |
| --- | --- |
| src/types/session.ts | Keep sandboxContent/sandboxAccent/videoUrl (HEAD) |
| src/stores/sessionStore.ts | Keep HEAD's fields + actions |
| src/lib/claude/client.ts | Merge: SSE streaming + our Zod schema with sandbox/manim fields |
| src/hooks/useTutorBrain.ts | Merge: SSE consumption + content mode validation + video/sandbox handling |
| src/hooks/useSession.ts | Keep HEAD's fields + add avatarFlush |
| src/app/api/tutor/respond/route.ts | Merge: SSE streaming + Manim generation in result event |
| src/app/student/session/page.tsx | Keep HEAD + add avatarFlush to push-to-talk |
| src/components/session/ContentMode.tsx | Keep HEAD's sandboxContent/accent/videoUrl props |
| src/components/session/SandboxPanel.tsx | Keep HEAD's content/accent + Twind template |
| progress.md | Combined both sessions' notes |

Session 11A: Design System + AI Prompt Overhaul (HEAD)

Two major changes: (1) Cohesive Soft Lavender (#A78BFA) + Aqua (#67E8F9) design identity across entire app. (2) Complete AI prompt rewrite with sandbox HTML templates and tighter speech rules.

Phase 1: Design System Foundation

  • globals.css — Full lavender/aqua color palette replacing defaults. --font-display variable. SVG grain texture overlay (3% opacity). Safari input fix (-webkit-appearance: none).
  • layout.tsx — Space Grotesk display font via next/font/google. class="dark" on <html>. Body includes ${spaceGrotesk.variable}.

Phase 2: AI Prompt Rewrite

  • prompts.ts — MAJOR rewrite:
    • Fixed HTML skeleton for sandbox (consistent layout every time)
    • 6 layout templates: centered, split, steps, comparison, chart, interactive
    • Subject-based accent colors (Physics=blue, Chemistry=emerald, Biology=green, History=amber, Literature=purple, General=cyan)
    • BANNED PHRASES: "Great question!", "Absolutely!", "Excellent!", "Fantastic!", "Not quite"
    • USE INSTEAD: "yeah that's right", "nice, so...", "hmm what if..."
    • Speech: 1-2 sentences MAX, always end with question, sound like cool older sibling
    • Content routing: first response = visual, follow-ups = speech only unless needed
    • Hard constraints: 3500 chars max, no CDN, no scrolling, clamp() for responsive sizing
  • client.ts — Added sandboxTemplate to Zod schema (enum of 6 templates)
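
The six-template enum added to the Zod schema accepts exactly the layout names listed above. A plain-TS sketch of what that enum validates (template names from the notes; the helper name is hypothetical, and the real code uses z.enum rather than this hand-rolled check):

```typescript
const SANDBOX_TEMPLATES = [
  "centered",
  "split",
  "steps",
  "comparison",
  "chart",
  "interactive",
] as const;

type SandboxTemplate = (typeof SANDBOX_TEMPLATES)[number];

// Narrowing type guard: accepts only the six known layout template names.
function isSandboxTemplate(value: string): value is SandboxTemplate {
  return (SANDBOX_TEMPLATES as readonly string[]).includes(value);
}
```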

Phase 3: Component Theming (15 files)

  • SandboxPanel — Fade-in transition, lavender empty state, updated viewport CSS
  • ChatSheet — Lavender user bubbles (bg-[#A78BFA]), violet-tinted AI bubbles, 3 bouncing lavender dots for typing indicator, lavender focus ring
  • BottomControlBar — Lavender join button (was green), lavender timer text, -webkit-backdrop-filter for Safari
  • FloatingVideoOverlay — Lavender status dots, lavender thinking pulse/glow (was blue), lavender view mode icons
  • Landing page — Dark bg (#0A0A0A), Space Grotesk headings, lavender TreeHacks badge, lavender feature cards with hover, lavender tech badges, lavender CTA section
  • Login page — font-display on title
  • Session page — Lavender/aqua/violet mode badge dots, -webkit-backdrop-filter on badge, aqua push-to-talk active state
  • Parent layout — Dark sidebar (bg-[#0E0C18]), lavender logo, lavender nav hover
  • Parent dashboard — font-display title, lavender/aqua stat card borders + values, lavender session badges

Session 11B: SSE Streaming Pipeline (anton/latency-test)

Major change: Rearchitected the tutor response pipeline from single JSON response to SSE streaming. Speech field is extracted early via regex and emitted immediately, so the avatar starts speaking while sandboxHtml/canvasCommands are still generating.

What Changed

  • respondStream() async generator — New method on TutorBrain that uses client.messages.stream() + regex-based speech extraction. Yields speech event as soon as the speech field is complete, then result event with remaining fields.
  • SSE API route — /api/tutor/respond now returns text/event-stream with ReadableStream. Events: speech, result, done, error. Perplexity enrichment still runs before stream starts.
  • Frontend SSE consumption — useTutorBrain reads SSE events via fetch() + ReadableStream reader. Avatar speaks on speech event (fire-and-forget). Sandbox/canvas/progress update on result event.
  • buildClaudeRequest() helper — Extracted shared message-building logic from respond() to avoid duplication with respondStream().
  • Prompt caching — cache_control: { type: "ephemeral" } on system prompts saves ~200-500ms after first request.
  • Module-level Anthropic client — Reuses HTTP connections, avoids TLS handshake per request.

Architecture Notes

  • Speech extraction regex: /"speech"\s*:\s*"((?:[^"\\]|\\.)*)"\s*[,}]/ — detects complete speech value in the JSON token stream. Works because speech is the first field in the Zod schema.
  • Two SSE events: speech (emitted early) + result (everything else, emitted when stream ends). Simpler than per-field events.
  • AbortController cascade: Frontend abort cancels the fetch → SSE ReadableStream cancel fires → server AbortController aborts Claude stream.
  • Backward compatible: respond() still exists as a non-streaming fallback.
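
The extraction regex quoted above can be demonstrated against a growing JSON token stream. The regex itself is copied from these notes; tryExtractSpeech is a hypothetical wrapper name, and the JSON.parse trick is one reasonable way to unescape the captured value.

```typescript
// Matches a complete "speech" string value followed by , or } in partial JSON.
const SPEECH_RE = /"speech"\s*:\s*"((?:[^"\\]|\\.)*)"\s*[,}]/;

function tryExtractSpeech(partialJson: string): string | null {
  const m = SPEECH_RE.exec(partialJson);
  // The captured group is still JSON-escaped; round-trip through JSON.parse
  // to turn \" and \n back into real characters.
  return m ? (JSON.parse(`"${m[1]}"`) as string) : null;
}
```

This works because speech is the first field in the Zod schema, so its closing quote and trailing comma arrive before any other field finishes streaming.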

Session 10b: Floating Video Overlay + Sandbox Viewport Fix

Major change: Replaced side-by-side react-resizable-panels video grid with a true Zoom-style floating PiP overlay. Reverted CSS design system injection that made sandbox output look generic.

Floating Video Overlay (replaces VideoGrid)

  • Created FloatingVideoOverlay.tsx using react-rnd — draggable + resizable floating PiP
  • Three view modes matching Zoom's actual behavior:
    • Strip (— icon): Thin dark bar showing "Talking: Minerva" or status text
    • Speaker (□ icon): One large video tile with name label + hover controls
    • Gallery (⋮⋮⋮ icon): Two stacked video tiles (avatar top, camera bottom)
  • View mode switch icons + minimize button only visible on hover (group-hover pattern)
  • Video persistence: <video> elements always mounted as sr-only, <canvas> mirrors via requestAnimationFrame + drawImage() — stream never lost across mode/minimize changes
  • Resize handles with stripe patterns (matching Zoom): bottom (horizontal stripes), right (vertical stripes), corner (diagonal lines SVG)
  • lockAspectRatio for speaker mode, per-mode min/max sizes
  • Document detection + scan button preserved on camera tile in gallery mode
  • Minimizable to small pill (top-right corner)
  • Deleted VideoGrid.tsx, removed react-resizable-panels package

Sandbox Viewport Fix

  • Injected minimal CSS: html,body{margin:0;padding:0;overflow:hidden;width:100%;height:100vh;max-height:100vh;}
  • Added "Content MUST fit in one screen" to Claude prompt sandbox rules
  • Reverted CSS design system injection (user feedback: made output look "AI-ish generic")
  • Reverted prompt changes that increased char limit and added design patterns

Build Status

  • npx tsc --noEmit — 0 errors (pending verification after merge)
  • npm run build — compiles successfully (pre-existing DB error on /parent SSR unrelated)

Notes for Next Session

  • Session 11 merge is complete — SSE streaming + sandbox optimization + Manim videos
  • SSE latency benefit: Speech arrives ~1s, avatar starts talking immediately
  • Sandbox token savings: 80-90% reduction (sandboxContent + sandboxAccent vs full HTML)
  • Manim videos: Claude can reuse by filename or generate new (30-120s generation time)
  • Push-to-talk: avatarFlush() sends accumulated text immediately on Space release
  • Design identity: Soft Lavender (#A78BFA) primary + Aqua (#67E8F9) accent on deep purple-black (#0C0A14)
  • Typography: Space Grotesk (display/headlines) + Geist (body). Use font-display class for headings.
  • Safari: -webkit-backdrop-filter added alongside backdrop-filter in key components
  • FloatingVideoOverlay uses react-rnd + canvas mirroring — videos never unmount
  • Pre-existing build error: /parent page fails during static generation (local DB "kimsanov" doesn't exist) — unrelated to our code