A knowledge sidecar for Claude Code.
Prefer `just <recipe>` over raw commands. See `just --list` or the justfile for all recipes.
hivedev vs hive: During development, always use `hivedev` (or `uv run python -m keephive`) to run the local dev build. The global `hive`/`keephive` commands point to the last `uv tool install` and may lag behind. The justfile defines `hivedev` as a variable for this reason. Terminal E2E tests use `python -m keephive` because they run inside an activated venv.
| Command | What |
|---|---|
| `just test` | Run all unit/integration tests (<35s) |
| `just test-e2e` | Terminal E2E tests (real tmux, ~65s) |
| `just test-golden` | Regenerate golden file baselines |
| `just test-llm` | LLM E2E tests (real claude -p, slow) |
| `just test-one tests/test_X.py` | Single file, stop on first failure, verbose |
| `just test-one "-k test_verify"` | Run by name pattern |
| `just test-integration` | Multi-step state machine tests |
| `just lint` | ruff check + format check |
| `just fmt` | ruff format in place |
| `just serve` | Live dashboard with hot reload |
| `just test-save` | Run tests, save output to .test-results.txt |
| `just test-save-one tests/test_X.py` | Single file with output saved to disk |
| `just check` | All checks (test + lint + secrets) |
| `hivedev s` | Status against real data (local dev build) |
| `hivedev v` | Verify stale facts (needs claude -p, 10-20s) |
- `cli.py`: Dispatch table. Maps command names to (module, function) tuples. `_CANONICAL` maps aliases to canonical names (used for stats tracking and help).
- `claude.py`: ALL Anthropic API interaction. One function, Pydantic validation. THE critical module. Models: haiku 4.5, sonnet 4.6. Contains the privacy gate: `run_claude_pipe()` raises `LLMPausedError` immediately when the `.llm-paused` flag exists — before any subprocess or network call.
- `llm/exceptions.py`: Exception hierarchy. `ClaudePipeError` is the base (all hook `except` blocks catch this silently). `LLMPausedError(ClaudePipeError)` signals a blocked call due to `hive privacy on`.
- `clock.py`: Centralized time functions. `get_today()`/`get_now()` respect the `HIVE_DATE` env var for time-travel testing.
- `models.py`: Pydantic models for every structured response (VerifyResponse, PreCompactResponse, ReflectAnalyzeResponse, VaultPerspective, CleanerPerspective, StrategistPerspective, AuditSynthesis, SubagentExtractionResponse).
- `output.py`: Console markup, `prompt_yn()`, `prompt_choice()`. Shared output helpers.
- `nudge.py`: Shared nudge infrastructure. Priority-based lifecycle state machine (TODOs > stale facts > pending facts > unreflected logs > context-specific). Intervals: prompt/tool every 5, stop every 8. Recency gate: `_RECENCY_THRESHOLD = 15` suppresses priorities 1-4 if the same category fired within 15 calls this session. Priority 5 (generic fallback) is never gated. State stored as a `last_surfaced` dict inside counter files (`{count, session_id, last_surfaced: {category: count}}`). `read_recency(name, session_id)` and `record_surfaced(name, category, count, session_id)` manage the state. All three getter functions (`get_prompt_nudge`, `get_tool_nudge`, `get_stop_nudge`) accept `session_id` and pass it through to `_lifecycle_nudge`. Priority 1.5: KB queue (recency gate 3, fires when kb_queue_depth > 0). Priority 4.5: review lag (recency gate 8, fires when last_cmd_date for v/a/mem > 1 day, only when `.stats.json` exists).
- `storage.py`: All file I/O for the hive directory. Includes stats tracking, profiles, session metrics. `read_cc_sessions()` reads Claude Code session-meta as authoritative session data (user messages, tools, duration, code impact). `session_metrics()` prefers CC data, falls back to keephive data. Privacy helpers: `llm_paused_file()` → `.llm-paused`, `is_llm_paused()` → presence check, `set_llm_paused(bool)` → touch/unlink. `force_cli_file()` → `.force-cli`, `is_force_cli()` → presence check, `set_force_cli(bool)` → touch/unlink. `kb_queue_file()` → `.kb-queue.md`. `append_kb_message()`, `read_kb_queue()`, `kb_queue_depth()`, `clear_kb_queue()`. `last_cmd_date(cmd)` scans `.stats.json` daily keys backward. `pending_rules_file()` → `.pending-rules.md`. Auto-trust flag: `auto_improve_trusted_file()` → `.auto-improve-trusted`, `is_auto_improve_trusted()` → presence check, `set_auto_improve_trusted(bool)` → touch/unlink. Closed-loop: `daemon_hints_file()` → `.daemon-hints.json`, `read/write_daemon_hints()`. `improvement_history_file()` → `.improvement-history.json`, `read_improvement_history()`, `append_improvement_history()` (rolling 200-item cap). `experiment_baselines_file()` → `.experiment-baselines.json`, `read/write_experiment_baselines()`.
- `insights.py`: Pure deterministic aggregation of Claude Code `/insights` session data. Reads `~/.claude/usage-data/facets/` + `session-meta/`, joins on session_id, produces outcome/type/satisfaction/goal/friction distributions and cross-field pattern detection. No LLM.
- `commands/audit.py`: Three-perspective LLM audit (parallel) + Cook synthesis. Uses `run_claude_pipe()` for all 4 calls.
- `commands/memory.py`: `hive mem` and `hive rule` (add/remove/learn/review/try). `rule learn` reads `/insights` friction data from `~/.claude/usage-data/facets/`, maps it to behavioral rules, deduplicates via trigram overlap, queues in `.pending-rules.md`. `rule try "text" [--days N]` adds experimental rules with an `[experiment:Nd:expiry_date]` tag + friction baseline snapshot in `.experiment-baselines.json`. `expire_experimental_rules()` removes expired rules at session start, logs `EXPIRED-RULE:` to the daily log.
- `commands/edit.py`: `hive e` targets (memory, rules, claude, today, todo, etc.). Opens `$EDITOR`.
- `commands/knowledge.py`: List, view, create/edit knowledge guides and prompt templates.
- `commands/note.py`: Multi-slot scratchpad. Open, copy, clear, list, restore, template start. `hive n todo` extracts TODOs via edit-buffer (full note, candidates pre-marked `-`, mtime cancel detection). `hive 4 "text"` quick-appends without editor.
- `commands/profile.py`: Profile CRUD (create/list/use/delete). Sibling directories: `~/.keephive/hive-{name}/`.
- `commands/seed.py`: Demo data seeder. Deterministic RNG (`Random(42)`), loads from `data/demo/entries.json`.
- `commands/session.py`: Interactive session launcher. Reuses `build_context()` from sessionstart, replaces the process with `claude` via `os.execvpe`. kb mode: dynamically builds a prompt from `read_kb_queue()` pending messages + `read_soul()` summary. Queue not cleared here — soul_update does that.
- `commands/skill.py`: Plugin/skill system for extensible commands.
- `commands/stats.py`: Usage statistics with per-project breakdown, streaks, and activity sparklines. `_display_full()` renders: Pipeline health, Capture mix, Sessions, Session Quality, Trends, Activity chart, Command Activity (new), Most Recalled, Projects, Pipeline Actions, KingBee status. `_display_command_activity(data, days=7)` aggregates `commands.*`, `daemon_tasks.*`, and `loops.*` from `.stats.json` using `_sum_counters()` and renders a two-column grid (top 8 commands | all 6 daemon task run counts) plus a loops summary line. Empty-state: shows "none yet — enable with hive daemon" when `daemon_tasks` has no entries. `.stats.json` now tracks three activity categories: `commands` (per-command invocations via cli.py), `daemon_tasks` (per-task-name, only on True return), `loops` (started + iteration counts). The token line appends a blended cost estimate (60% haiku / 40% sonnet rate).
- `commands/todo.py`: TODO lifecycle: list, add, done, edit, recurring. Fuzzy dedup at 0.8 similarity threshold.
- `commands/transfer.py`: Export/import hive data as tar.gz with manifest.json.
- `commands/verify.py`: LLM-powered fact verification. Checks facts against the codebase, auto-corrects when deterministic.
- `commands/reflect.py`: Five-stage flow: scan (deterministic) → analyze (LLM) → apply (interactive review) → draft (generate guide) → insights (session quality patterns from /insights facets, no LLM).
- `commands/standup.py`: Standup generation from daily logs + GitHub PR data. Weekend-aware cutoff, clipboard copy.
- `commands/doctor.py`: Health check (hooks, MCP, deps, data). Uses LLM for semantic TODO deduplication.
- `commands/growth.py`: `hive growth [--json]` (alias: `gr`). 30-day visible compounding metrics: knowledge state (facts, freshness, guides, recall), sparkline trends, week-over-week deltas, deterministic growth narrative. Uses `trend_metrics()` and `growth_snapshot()` from storage.py. No LLM.
- `commands/setup.py`: Registers the MCP server in ~/.claude.json and hooks in ~/.claude/settings.json. Auto-syncs global install.
- `hooks/sessionstart.py`: Injects context at session start (memory, rules, TODOs, matched guides, cross-project hints). No LLM call. `extract_style_hint(days=7)` is fully deterministic: reads 7 days of daily logs, computes average entry length + dominant category (FACT/DECISION/etc.) + recurring themes via `STOPWORDS` from wander.py. Returns `[style: ~N chars/entry; dominant: CAT/CAT; active themes: x, y]` or `""` if fewer than 5 entries exist. Injected into `build_context()` after the rules block.
- `hooks/precompact.py`: Layer 1 extraction (deterministic) + Layer 2 auto-write to the daily log with project attribution (claude -p). TODO discipline: max 2 user-requested TODOs per compaction, speculative TODOs demoted to FACT. Auto-close: passes open TODOs to the LLM, writes DONE for resolved items. Dedup threshold 0.8.
- `hooks/posttooluse.py`: Counter-based periodic nudge after Edit/Write tool use.
- `hooks/userpromptsubmit.py`: Counter-based periodic nudge. Injects `.ui-queue` content before the nudge when present. KB detection: regex matches 'KingBee'/'King B'/'@KB' → appends to `.kb-queue.md` (deduped) + injects a [DIRECT MESSAGE TO KINGBEE] context block (replaces the nudge that turn).
- `hooks/stop.py`: Stop hook. Increments the turn counter per session, periodic micro-nudge (interval 8) to capture decisions or mark TODOs done. Iteration continuation: framed progress banner with PROGRESS CHECK line. Loop completion: ASCII box closing ceremony. Passes `req['task']` to the loop-extract subprocess for auto-TODO close.
- `hooks/sessionend.py`: SessionEnd hook. Finalizes session stats with an accurate end timestamp. No stdout.
- `hooks/taskcompleted.py`: TaskCompleted hook. Auto-logs a DONE entry to the daily log when a task is marked complete.
- `commands/serve.py`: Live web dashboard (HTTP server, 7 views, markdown rendering, SSE real-time updates, `/ui-feedback` POST endpoint). Zero external deps. Uses `_ThreadedHTTPServer` (ThreadingMixIn + HTTPServer) so each SSE connection gets its own thread. `_file_watcher_thread` polls HIVE_HOME mtimes every 0.5s; `_broadcast_panel_updates` re-renders only changed panels (md5 hash compare) and pushes `panel-update` SSE events to all connected tabs. The client swaps individual `[data-panel-id]` panels without a full-view refresh. A `visibilitychange` listener reconnects SSE on tab focus (guarantees fresh time-sensitive panels). The `stats-tokens` panel (`_get_stats_tokens_data`/`_render_stats_tokens_panel`) shows 30-day token totals, a 7-day bar chart, and a blended cost estimate; appears in the `/stats` view.
- `commands/ui.py`: UI feedback queue CLI (hive ui/ui-install/ui-clear/ui log) + bookmarklet source as a `javascript:` URL. `hive ui log` scans daily logs newest-first for persisted `[UI Feedback]` entries (last 30 days).
- `commands/daemon.py`: KingBee background daemon. `hive daemon [start|stop|status|run|edit|log|enable|disable]`. Manages soul-update, self-improve, morning-briefing, stale-check, standup-draft, and wander tasks. Enable/disable toggles per-task via daemon.json. `hive daemon log` tails daemon.log. The wander task uses WebSearch (built-in, not MCP): pass `tools=["WebSearch"]`, `restrict_mcp=True`, `max_turns=3`. `restrict_mcp=True` prevents 9 MCP servers from loading before inference. All other tasks pass no tools (existing behavior unchanged). The `_VOICE_DISCIPLINE` constant is injected into all four daemon task prompts (morning_briefing, stale_check, soul_update, wander): blocks "Here is"/"I will" openers, hedging ("might"/"could"), empty affirmations ("Great!"), and markdown headers. Silence is explicitly valid. soul_update reads the KB queue (`.kb-queue.md` pending messages) + pending-rules count; clears the kb-queue after a successful run. Wander task: on an actionable hypothesis, appends to `.pending-improvements.json` (type: run|edit|rule) — shares the queue with self_improve. `_execute_task()` calls `track_event("daemon_tasks", task_name)` after `fn()` returns True — throttled/skipped runs (False return) are not counted. `_task_self_improve()` marks skill and rule proposals with `trusted: True` (tasks/edits remain untrusted); `_execute_task()` calls `_run_auto_apply()` after self-improve succeeds and logs `[AUTO-APPLIED ...]` to daemon.log. Closed-loop: `_has_priority_boost()` reads `.daemon-hints.json` and can override day-of-week scheduling constraints when a task has boost > 1.0. `_write_reflect_hints()` writes hints after reflect-draft finds an uncovered theme (boosts stale-check + soul-update 1.5x for 7 days).
- `commands/loop.py`: Autonomous iteration loop (hive run). `_extract_soul_wisdom()` extracts "What I've Learned" bullets from SOUL.md. `_build_first_iter_output()` shows an ASCII box banner with SOUL wisdom. `_do_loop_extract()` accepts a task arg for auto-TODO close (0.55 threshold). Loop tracking: `_cmd_run_task()` and `_launch_background()` both call `track_event("loops", "started")` after writing the loop state file. `cmd_loop_extract()` calls `track_event("loops", "iteration")` at entry — this subprocess runs once per completed iteration.
- `commands/wander.py`: Wander CLI + seed selection (`select_wander_seed()`). Seed priority: user-queued > cross-pollination > recurring-topic > stale-todo. The `STOPWORDS` constant filters common words from recurring-topic detection. `random.Random(today.isoformat())` ensures intra-day determinism without mocking.
- `commands/checkup.py`: `hive checkup`/`hive ck`. 7-stage read-only health monitor: Stage 0 privacy gate, hook pipeline, daemon task freshness, queue depths, SOUL.md age, JSON integrity, magic number audit. Stage 3 also checks KB queue depth + wander activity (days since the last wander doc). `--snapshot`/`--diff` use git-in-hive-dir. `--json` includes `privacy_paused` and `force_cli` flags. No LLM calls.
- `commands/inbox.py`: `hive inbox [--days N]` (alias: `ib`). Surfaces recent KingBee daemon output (wander docs, standup drafts, stale-check findings) and pending review queue depths (facts, rules, improvements, TODOs). KingBee entries parsed via regex (`_KINGBEE_RE`) from daily logs. `--days` clamped to [1, 30], default 2. Navigation hints are type-specific (`_TYPE_HINTS`): wander → `hive wander show`, stale-check → `hive verify`. Long entries with no type-specific hint fall back to `hive log`.
- `commands/improve.py`: `hive improve [list|review|clear-stale|trust on|trust off]` (alias: `im`). Review and apply KingBee self-improvement proposals from `.pending-improvements.json`. Four types: skill (add guide), task (add daemon task), rule (queue to `.pending-rules.md`), edit (LLM merge into an existing guide via haiku, falls back to `$EDITOR` on `ClaudePipeError`). Dismissed items stored in `.dismissed-improvements.json` with a rolling cap of 100. `_skill_sync()` auto-fires after skill/task installs. Auto-trust pipeline: `trust on/off` toggles the `.auto-improve-trusted` flag; `_auto_apply_trusted()` filters items with `trusted=True` and `type in (skill, rule)`, applies them, and returns the remainder; `_run_auto_apply()` reads queue → applies → writes back; `_improve_review()` runs auto-apply before the interactive loop; `hive improve list` shows an `[auto]` tag on trusted items. Effectiveness tracking: `_record_applied()` writes every accepted improvement to `.improvement-history.json` (rolling 200-item cap) via `append_improvement_history()`.
- `commands/privacy.py`: `hive privacy [on|off|cli|status]` (alias: `pv`). Toggles the `.llm-paused` flag (kill switch) or `.force-cli` flag (CLI-only routing) via `set_llm_paused()`/`set_force_cli()`. `off` clears both. Shows state with flag file paths. No subcommand = status display.
keephive and Claude Code track sessions independently. They serve different purposes and must not be confused.
Claude Code (source of truth for session analytics):
- Location: `~/.claude/usage-data/session-meta/{session_id}.json`
- Written at session end. Contains: `user_message_count`, `tool_counts` (all tools), `duration_minutes`, `lines_added/removed`, `input/output_tokens`, `git_commits`
- Also: `facets/{session_id}.json` has outcome, satisfaction, friction. Read by `insights.py`.
- Read by: `storage.read_cc_sessions()`, called from `session_metrics()`, `_session_productivity()`, and serve.py panels.
keephive (source of truth for workflow analytics):
- Location: `~/.keephive/hive/.stats.json`
- Written by hooks during the session. Contains: command usage, hourly patterns, project breakdown, daily aggregates, compacted flag
- Also: `.prompt-counter`, `.tool-counter`, `.stop-counter` for nudge cadence (tuned for hook invocation frequency, not user counts)
Critical rules:
- Session IDs from hooks and session-meta are different ID spaces. Zero overlap. Never join on session_id across these systems.
- keephive's `session["prompts"]` is a hook invocation count (~71x real user messages). Never display it as "user messages."
- keephive's `session["tools"]` only has Edit/Write (2 of ~15). For full tool data, use session-meta `tool_counts`.
- For session display in `hive stats` and `hive serve`: use `read_cc_sessions()`, which reads session-meta.
- Test isolation: the `HIVE_CC_META_DIR` env var overrides the session-meta path. Set it to an empty temp dir in the `hive_env` fixture.
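The fallback rule above can be pictured as a small sketch. Function and field names are illustrative, not the real `session_metrics()` in storage.py:

```python
def session_metrics(cc_sessions: list[dict], kh_sessions: list[dict]) -> list[dict]:
    """Prefer Claude Code session-meta; fall back to keephive hook data."""
    if cc_sessions:
        # CC data carries real user_message_count and full tool_counts.
        return cc_sessions
    # keephive fallback: session["prompts"] counts hook invocations,
    # NOT user messages; callers must label it accordingly.
    return kh_sessions
```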
Every claude -p callsite uses `run_claude_pipe()` with a Pydantic response model. No raw JSON parsing anywhere else. If you add a new claude -p call, it goes through claude.py.
Tests must catch real bugs. test_claude_pipe.py uses the ACTUAL response format from production (including system init messages in the array). If a test passes but production fails, the test is wrong, not the code.
Every test must answer: "What bug would this catch?"
Before committing any test, it must pass all three:
- Can this test fail? If the SUT has a bug, will this test actually catch it? If the test passes regardless of SUT correctness (e.g., asserting a mock returns what you told it), delete it.
- Is this test unique? Does it exercise a different code path than existing tests? If two tests differ only in input values but hit the same branch, keep only the boundary case.
- Does this test assert correctness? `assert result` is not a test. `assert result == expected_value` is. `assert "keyword" in output` is acceptable only when exact output is non-deterministic.
Anti-patterns (delete on sight):
- Testing that a mock returns what you configured it to return
- 3+ tests for the same function with trivially different inputs (keep 1 representative + 1 boundary)
- Tests that only assert "no crash" without checking the actual result
- Tests where the assertion is weaker than the function's contract (e.g., `assert len(result) > 0` when you know the exact expected length)
Coverage requirements:
- Every test class needs at least one negative test (error input, missing data, corrupt state)
- State-changing functions need a "roundtrip" test (write -> read back -> verify)
- Functions with thresholds/boundaries need tests AT the boundary, not just well within it
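The roundtrip and negative-test rules in one self-contained sketch (the store here is hypothetical, not keephive code):

```python
import json
import tempfile
from pathlib import Path

def write_store(path: Path, data: dict) -> None:
    path.write_text(json.dumps(data))

def read_store(path: Path) -> dict:
    return json.loads(path.read_text())

def test_store_roundtrip():
    # write -> read back -> verify the exact value, not just truthiness
    path = Path(tempfile.mkdtemp()) / "store.json"
    write_store(path, {"facts": 3})
    assert read_store(path) == {"facts": 3}

def test_store_missing_file():
    # negative test: a missing store must raise, not silently return {}
    path = Path(tempfile.mkdtemp()) / "missing.json"
    try:
        read_store(path)
        assert False, "expected FileNotFoundError"
    except FileNotFoundError:
        pass
```

Both tests can fail if the SUT has a bug, each hits a different code path, and each asserts the contract exactly.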
LLM-dependent tests use the `llm_hive_env` fixture + `@pytest.mark.llm`. Run: `just test-llm`
HIVE_SKIP_LLM=1 is ONLY for fast-path fixtures. NEVER use it to "test" an LLM feature — that skips the feature entirely and proves nothing.
Functions that open `$EDITOR` via `subprocess.run([editor, path])` use mtime to detect cancel (no write = mtime unchanged). Test mocks must account for this:

```python
# WRONG — no-op mock looks like cancel, 0 TODOs added
monkeypatch.setattr("subprocess.run", lambda *a, **kw: None)

# RIGHT — touch updates mtime, content (already written) is read back
def accept_all(*args, **kwargs):
    Path(args[0][1]).touch()
monkeypatch.setattr("subprocess.run", accept_all)

# RIGHT — test cancel explicitly with a no-op
monkeypatch.setattr("subprocess.run", lambda *a, **kw: None)
# assert nothing was added

# RIGHT — simulate deleting the first TODO candidate line
def delete_first_todo(*args, **kwargs):
    path = Path(args[0][1])
    lines = path.read_text().splitlines()
    idx = next(i for i, ln in enumerate(lines) if ln.startswith("- "))
    lines.pop(idx)
    path.write_text("\n".join(lines))
monkeypatch.setattr("subprocess.run", delete_first_todo)
```

`args[0]` is the command list `[editor, str(path)]`, so `args[0][1]` is the file path.
keephive has three test tiers. Each answers different questions.
```bash
just test                        # all tests
just test-one tests/test_X.py    # single file
```

Fast, isolated, mocked. Uses the `hive_env` fixture (temp dir + HIVE_HOME). Tests individual functions, data transformations, file I/O. No real terminal, no real LLM.
Use for: Pure logic, parsing, storage operations, model validation, error paths, edge cases where you control all inputs.
```bash
just test-e2e                                                    # run all
just test-one "-m terminal -k test_single_fact -v -o addopts="   # one test
just test-golden                                                 # regen baselines
```

Real terminal sessions via tmux. Types actual commands, reads actual screen output. The HIVE_DATE env var enables time-travel without mocking. Rich renders real ANSI to a real TTY.
Use for: Multi-command workflows, output format validation, time-travel scenarios (staleness, lifecycle), CLI argument handling, profile isolation, anything where the user experience matters.
Fixtures: term (empty hive), term_seeded (45 days of demo data), save_terminal_output (JSON artifact), update_golden (baseline flag).
Golden files: tests/e2e_outputs/golden/*.txt stores baseline output. Tests compare against baselines and fail with unified diff on mismatch. --update-golden regenerates them.
```bash
just test-llm
```

Real LLM calls. Tests the full pipeline: prompt -> claude -p -> Pydantic validation -> CLI output. Expensive and slow. Uses the `llm_hive_env` fixture.
Use for: Verifying LLM prompt quality, response parsing, model behavior changes. Run before releases or after changing prompts/models.
| Scenario | Tier |
|---|---|
| Testing a pure function | Unit (Tier 1) |
| Testing file read/write logic | Unit (Tier 1) |
| Testing Pydantic model validation | Unit (Tier 1) |
| Testing CLI output format | Terminal (Tier 2) |
| Testing multi-day workflow | Terminal (Tier 2) |
| Testing Rich rendering/colors | Terminal (Tier 2) |
| Testing time-sensitive behavior (staleness) | Terminal (Tier 2) |
| Testing command interaction sequences | Terminal (Tier 2) |
| Verifying LLM prompt produces valid output | LLM (Tier 3) |
| Testing after changing a prompt template | LLM (Tier 3) |
| Regression testing after model upgrade | LLM (Tier 3) |
Use the terminal emulator (Tier 2) when testing things the user sees and interacts with. The emulator gives you a real shell with persistent env vars, real Rich/ANSI rendering, real command sequencing. Use it for:
- Verifying CLI output text and formatting
- Multi-command workflows (remember -> recall -> verify staleness)
- Time-travel with `HIVE_DATE` across multiple commands
- Testing that commands create the right files
Use direct commands (hive a, hive v, hive stats) when you need to verify something works against real user data, not test data. Direct commands hit your actual ~/.keephive/hive/ directory. Use them for:
- Smoke-testing a fix against real accumulated data
- Verifying LLM-dependent features (audit, verify) actually call the model
- Checking serve dashboard renders with real content
- Quick validation before committing
Use unit tests (Tier 1) for everything that doesn't need a terminal or real LLM. Logic, parsing, storage, validation. These run in <35s and catch 80% of bugs.
Write the terminal test first, then make it pass. This is the standard approach for any feature that affects CLI behavior.
```python
@pytest.mark.terminal
class TestNewFeature:
    def test_basic_workflow(self, term, save_terminal_output):
        """New feature does X when user does Y."""
        term.type("python -m keephive new-command arg").has("expected output")
        save_terminal_output("new_feature/basic", term)
```

Run it: `just test-one "-m terminal -k test_basic_workflow -v -o addopts="`

Cover error paths, validation, and boundary conditions in Tier 1 tests where mocking is faster.
If output changed intentionally, regenerate baselines with `just test-golden`, then run `just test && just test-e2e` before committing.

The tmux driver lives at tests/terminal.py. Key patterns:
```python
# Basic: type command, assert output
term.type("python -m keephive s").has("keephive")

# Chain assertions
term.type("python -m keephive todo").has("Task A").lacks("completed")

# Time travel
term.set_date("2026-01-01")
term.type("python -m keephive r 'FACT: past event'")
term.set_date("2026-02-01")
term.type("python -m keephive s").has("stale")

# Read files created by commands
content = term.read_file("daily/2026-01-01.md")
assert "past event" in content

# Check ANSI rendering
term.type("python -m keephive s").has_ansi()

# Regex match
term.type("python -m keephive --version").matches(r"keephive v\d+\.\d+")

# Line count range
term.type("seq 1 50").line_count_between(49, 51)

# Save history artifact
save_terminal_output("scenario_name", term)
```

Gotchas:
- TODO text must be distinct enough to survive fuzzy dedup (0.8 SequenceMatcher threshold). "Task A"/"Task B" will dedup. Use descriptive names.
- Single quotes in `send-keys` args need care. Prefer double quotes for fact text.
- `HIVE_HOME` isolation means commands never touch the real `~/.keephive/hive/`.
- Each `term` fixture creates a fresh tmux session with a unique name. Cleanup is automatic.
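The 0.8 dedup threshold is easy to check directly with `difflib.SequenceMatcher` (the same stdlib primitive a 0.8-ratio dedup would use; the helper name here is illustrative):

```python
from difflib import SequenceMatcher

def too_similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """True if two TODO strings would collapse under fuzzy dedup."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

# "Task A" vs "Task B": 5 of 6 chars match, ratio = 2*5/12 ≈ 0.83 → dedups.
# Descriptive names share far less text and survive.
```

Use this to sanity-check test fixture names before a terminal test mysteriously "loses" a TODO.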