137 changes: 137 additions & 0 deletions .agent/knowledge/_shared/ai-knowledge-base-guide.md
@@ -0,0 +1,137 @@
# AI Knowledge Base Architecture Guide

How to set up, maintain, and extend the neops AI knowledge base for effective AI-assisted development.

## Architecture Overview

The knowledge base follows Anthropic's context engineering principles: token efficiency, progressive disclosure, and graceful degradation. Every file must justify its existence by providing high-signal context that agents cannot infer from code alone.

### File Structure (Per Repository)

```
AGENTS.md                            # Universal AI context (~100 lines, always loaded)
CLAUDE.md                            # Claude Code pointer to AGENTS.md
.cursor/rules/                       # Cursor-specific rules with glob matching
  project-context.mdc                # Always-on project context
  documentation-writing.mdc          # Triggered when editing docs/**
.agent/knowledge/
  _shared/                           # Shared across all neops repos (future: neops-ai-context submodule)
    neops-ecosystem-overview.md      # Platform overview, components, data flow
    component-architectures.md       # All component architectures in one file
    documentation-playbook.md        # How to write docs for any neops component
    documentation-personas.md        # Review persona definitions
    cross-project-patterns.md        # Cross-repo conventions, testing patterns
    ai-knowledge-base-guide.md       # This file
  <project-specific files>           # Unique to each repo (audits, link tracking, etc.)
```

### Design Principles

1. **Progressive disclosure**: AGENTS.md gives enough context for any agent to start working. Deeper knowledge is discovered on-demand when agents explore `.agent/knowledge/`.
2. **Token efficiency**: ~570 lines across 6 shared files (down from 1400+ across 12+ duplicated files). No filler, and nothing that duplicates rules already enforced by linters or CI.
3. **Single source of truth**: `_shared/` files are identical across all repos. Future plan: extract to `neops-ai-context` repo as git submodule at `.agent/shared/`.
4. **Graceful degradation**: If an agent only reads AGENTS.md, it can still function. If `_shared/` isn't available, project-specific files and AGENTS.md provide sufficient context.
5. **Agent-agnostic**: Works with Claude Code, Cursor, GitHub Copilot, Codex, Gemini CLI, and any tool that reads markdown files from the repo.

### Root File Strategy

| File | Purpose | Loaded by |
|---|---|---|
| `AGENTS.md` | Primary AI context (vendor-neutral, AGENTS.md open standard) | Cursor, Copilot, Codex, Gemini CLI, Claude Code |
| `CLAUDE.md` | Pointer to AGENTS.md + Claude-specific notes | Claude Code |
| `.cursor/rules/*.mdc` | Glob-matched rules (e.g., docs/** triggers writing conventions) | Cursor only |

## Maintenance Guidelines

### When to Update

- **After implementing a new feature**: update component-architectures.md and AGENTS.md
- **After writing documentation**: update project-specific audit files
- **After discovering implementation gaps**: update ecosystem overview's status section
- **After establishing new conventions**: update documentation-playbook.md or cross-project-patterns.md
- **After a persona review round**: update documentation-personas.md if review process changed

### How to Keep Shared Files in Sync

Until the `neops-ai-context` repo exists, shared files must be manually kept identical:

1. Edit the file in one repo
2. Copy it to each of the other repos, for example: `cp neops-workflow-engine/.agent/knowledge/_shared/file.md neops-worker-sdk-py/.agent/knowledge/_shared/file.md` (repeat for every remaining repo)
3. Commit in all repos, referencing the same change

Future: `neops-ai-context` repo as submodule at `.agent/shared/` eliminates manual sync.
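
The manual sync above can be wrapped in a small script. This is a sketch, not tooling that exists in any repo; the repo names in the usage comment are placeholders for your local checkouts:

```shell
# sync_shared SRC DEST...: copy SRC's _shared/ files into each DEST repo,
# then verify the copies are identical.
sync_shared() {
  src="$1"; shift
  for dest in "$@"; do
    mkdir -p "$dest/.agent/knowledge/_shared"
    cp "$src"/.agent/knowledge/_shared/*.md "$dest/.agent/knowledge/_shared/"
    # diff exits non-zero if the copies ever drift apart
    diff -r "$src/.agent/knowledge/_shared" "$dest/.agent/knowledge/_shared" >/dev/null
  done
}

# Usage sketch:
# sync_shared neops-workflow-engine neops-worker-sdk-py
```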

### What Goes Where

| Content Type | Location |
|---|---|
| Ecosystem-wide knowledge | `_shared/` |
| Component architecture details | `_shared/component-architectures.md` |
| Project-specific doc audit | Project root `.agent/knowledge/` |
| Project-specific link tracking | Project root `.agent/knowledge/` |
| Review findings | Project root `.agent/knowledge/` |

## Bootstrapping a New Neops Component

### Prompt Template

Use this prompt with any AI coding agent to bootstrap knowledge files for a new neops component repository:

---

**Prompt for AI agents:**

> You are setting up AI knowledge files for the `{REPO_NAME}` repository, a component of the neops network automation platform.
>
> **Step 1: Copy shared knowledge**
> Copy all files from an existing neops repo's `.agent/knowledge/_shared/` directory to this repo's `.agent/knowledge/_shared/`. These files contain ecosystem-wide context that must be identical across all repos.
>
> **Step 2: Create AGENTS.md**
> Create an `AGENTS.md` at the repo root following this structure (~100-120 lines):
> - Overview (3-4 sentences about this specific component)
> - Tech Stack (bullet list)
> - Architecture (key concepts and data flow, brief)
> - Development (build, test, lint commands — copy from README or Makefile)
> - Project Structure (key directories with one-line descriptions)
> - Conventions (coding style, naming patterns, import rules)
> - Neops Ecosystem (brief context, ~20 lines, pointer to `.agent/knowledge/_shared/`)
> - AI Agent Guidance (how agents should approach work in this repo)
>
> **Step 3: Create CLAUDE.md**
> Create a `CLAUDE.md` with: "See AGENTS.md for full project context. For deeper knowledge, explore .agent/knowledge/."
>
> **Step 4: Create .cursor/rules/**
> Create `.cursor/rules/project-context.mdc` (alwaysApply: true) with project-specific context.
> If the repo has documentation, create `.cursor/rules/documentation-writing.mdc` (globs: docs/**) with writing conventions.
>
> **Step 5: Create project-specific knowledge**
> Explore the codebase and create project-specific knowledge files in `.agent/knowledge/`:
> - `{repo-name}-docs-audit.md` (if docs exist: structure, quality, gaps)
> - `missing-external-links.md` (cross-project links that need resolution)
>
> **Step 6: Verify**
> - Ensure `_shared/` files are byte-identical to other repos (use `diff` to confirm)
> - Verify commands in AGENTS.md Development section actually run successfully
> - Verify Project Structure section matches the filesystem (`ls -la`)
> - Verify configuration table matches actual environment variables in source code
> - Ensure .cursor/rules/ glob patterns match actual directory names (e.g., `docs/**` not `documentation/**`)
> - Ensure .cursor/rules/ don't duplicate content already in AGENTS.md
>
> **Fallback for Step 1**: If no other neops repo is available locally, clone any neops component repo
> and copy its `_shared/` directory. The canonical files are kept in sync across all repos.

---
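
As a sketch of Step 4, a minimal `.cursor/rules/project-context.mdc` might look like this. The frontmatter keys follow the Cursor rule conventions referenced above (`alwaysApply`, `globs`); the body text is illustrative:

```
---
description: Project-wide context for this neops component
alwaysApply: true
---

This repo is a component of the neops network automation platform.
See AGENTS.md for full project context; explore .agent/knowledge/ for
deeper knowledge. Do not duplicate AGENTS.md content here.
```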

### Checklist for New Component

- [ ] `.agent/knowledge/_shared/` contains all 6 shared files (identical to other repos)
- [ ] `AGENTS.md` exists at repo root with accurate project context
- [ ] `CLAUDE.md` exists at repo root pointing to AGENTS.md
- [ ] `.cursor/rules/project-context.mdc` exists with project-specific context
- [ ] `.cursor/rules/documentation-writing.mdc` exists if repo has docs
- [ ] Project-specific knowledge files created in `.agent/knowledge/`
- [ ] `.gitignore` does NOT exclude `.cursor/` (rules should be committed)

## Future: neops-ai-context Repository

Planned: extract `_shared/` to a dedicated `neops-ai-context` git submodule at `.agent/shared/` in each repo, eliminating manual sync. The current structure is designed for trivial extraction.
89 changes: 89 additions & 0 deletions .agent/knowledge/_shared/component-architectures.md
@@ -0,0 +1,89 @@
# Neops Component Architectures

Consolidated architecture reference for all neops platform components.

## Workflow Engine (NestJS/TypeScript)

### Workflow Definition (YAML)

Top-level fields: `label`, `package`, `name`, `majorVersion`/`minorVersion`/`patchVersion`, `seedEntity` (device|interface|group|global), `description`, `parameterSchema`, `acquire[]`, `type: workflow`, `steps[]`.

Step types: `functionBlock` (execute registered FB), `workflow` (inline nested), `workflowReference` (reference another definition — not yet implemented).

Parameters use `{{ jmesPath }}` interpolation against the execution context. Conditions (`condition.jmes`) skip steps; assertions (`assert.jmes`) fail execution.
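
Putting the fields above together, a definition might look like the sketch below. The top-level fields are taken from the list above and the package name follows the `fb.examples.neops.io` convention, but the step-level nesting and parameter names are assumptions for illustration, not the verified schema:

```yaml
label: Show Version Example
package: fb.examples.neops.io
name: show-version-example        # hypothetical definition name
majorVersion: 1
minorVersion: 0
patchVersion: 0
seedEntity: device
description: Run show_version against the seed device.
type: workflow
steps:
  - functionBlock:                # step nesting is a sketch
      name: show_version
      package: fb.examples.neops.io
      parameters:
        hostname: "{{ device.hostname }}"        # JMESPath interpolation
      condition:
        jmes: "device.connection_state == 'OK'"  # skip step if false
```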

### Blackboard & Job Lifecycle

`PENDING` → `POLLED` (assigned to worker) → `PUSHED` (result received). Job types: `ACQUIRE`, `EXECUTE`, `ROLLBACK`.

Worker API: `/workers/register` (POST), `/workers/:uuid/ping` (POST heartbeat), `/function-blocks/register` (POST), `/blackboard/job` (POST poll), `/blackboard/job/result` (POST push).

Worker states: ONLINE → UNREACHABLE (2min no ping) → OFFLINE (6min) → deleted (24h). Stuck jobs (>12min POLLED) auto-failed.
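
The worker lifecycle thresholds above can be sketched as a simple mapping from heartbeat age to state (state names from this document; the function itself is illustrative, not engine code):

```python
from enum import Enum

class WorkerState(Enum):
    ONLINE = "ONLINE"
    UNREACHABLE = "UNREACHABLE"
    OFFLINE = "OFFLINE"
    DELETED = "DELETED"

# Thresholds from the lifecycle above (2 min, 6 min, 24 h), in seconds.
UNREACHABLE_AFTER = 2 * 60
OFFLINE_AFTER = 6 * 60
DELETE_AFTER = 24 * 60 * 60

def worker_state(seconds_since_ping: float) -> WorkerState:
    """Map time since the last heartbeat to the worker lifecycle state."""
    if seconds_since_ping >= DELETE_AFTER:
        return WorkerState.DELETED
    if seconds_since_ping >= OFFLINE_AFTER:
        return WorkerState.OFFLINE
    if seconds_since_ping >= UNREACHABLE_AFTER:
        return WorkerState.UNREACHABLE
    return WorkerState.ONLINE
```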

### Pure/Idempotent Semantics

Engine tracks `isPureExecution` and `isIdempotentExecution` across steps. Failed workflow with only pure steps → `FAILED_SAFE`. Non-pure failure → `FAILED_UNSAFE`. Auto-retry for pure/idempotent is planned but not implemented; retry count is hardcoded to 3 in `job-executor.ts`.
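
The FAILED_SAFE / FAILED_UNSAFE rule reduces to a single aggregate over the executed steps. A minimal sketch (the `is_pure` record shape is hypothetical; the real engine tracks this as `isPureExecution`):

```python
def classify_failed_execution(executed_steps: list[dict]) -> str:
    """A failed workflow is FAILED_SAFE iff every executed step was pure."""
    is_pure_execution = all(step["is_pure"] for step in executed_steps)
    return "FAILED_SAFE" if is_pure_execution else "FAILED_UNSAFE"
```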

### Configuration

Database: PostgreSQL (host port 5434 default, container 5432). CMS: GraphQL endpoint. Port: 3030. Schema: `GET /schema`. Swagger: `/api`. Health: `/health/`. Local dev: `docker-compose.yml` at repo root (engine + postgres + monitor app). Build override: `docker-compose.build.yml` for local source builds.

## Worker SDK (Python 3.12+)

### Function Block System

```python
class FunctionBlock(Generic[ParamsT, ResultDataT], ABC):
    async def run(self, params, context) -> FunctionBlockResult[ResultDataT]: ...
    async def acquire(self, params) -> FunctionBlockAcquireResult: ...
    async def rollback(self, params, context, result_from_failed) -> FunctionBlockRollbackResult: ...
```

Registration via `@register_function_block(Registration(name, package, version, run_on, fb_type, is_pure, is_idempotent))`. ParamsT: Pydantic model (`extra="ignore"`). ResultDataT: Pydantic model (`extra="forbid"`).

### Worker Architecture

Hybrid sync/async: main loop (async) handles heartbeat/polling/API; FBs execute sync in `ThreadPoolExecutor(max_workers=1)`. Sequential job processing. Blocking detector warns on sync calls in async loop.

Config: `URL_BLACKBOARD`, `DIR_FUNCTION_BLOCKS`, `WORKER_NAME`. Entry point: `neops_worker`.

### Connection System (Three Layers)

1. **Capability interfaces**: abstract method contracts (e.g., `DeviceInfoCapability.get_version()`)
2. **ConnectionProxy**: user-facing API, composes capabilities via inheritance, delegates to plugin at runtime
3. **ConnectionPlugin**: platform/library implementations. Resolution: platform → connection_type → library → capabilities

Base plugins: Scrapli, Netmiko, Napalm, NETCONF, RESTCONF, API. ProxyMeta metaclass generates fallback methods raising `NotImplementedForThisPlatform`.
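
The fallback-generation idea behind ProxyMeta can be illustrated with a small metaclass. The real ProxyMeta differs in detail; this only shows the pattern of auto-generated methods raising `NotImplementedForThisPlatform`, and the `capability_methods` attribute is a hypothetical stand-in for capability discovery:

```python
class NotImplementedForThisPlatform(Exception):
    pass

class ProxyMeta(type):
    """Generate a fallback for each declared capability method not defined."""
    def __new__(mcls, name, bases, ns):
        for method_name in ns.get("capability_methods", ()):
            if method_name not in ns:
                def fallback(self, *args, _m=method_name, **kwargs):
                    raise NotImplementedForThisPlatform(_m)
                ns[method_name] = fallback
        return super().__new__(mcls, name, bases, ns)

class ConnectionProxy(metaclass=ProxyMeta):
    capability_methods = ("get_version",)  # hypothetical capability list
```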

### Data & Context

WorkflowContext holds entity state (devices, groups, interfaces). Change tracking via deep-copy snapshot at init; `compute_db_updates()` diffs current vs. snapshot to generate `EntityCreateDto`/`EntityPatchDto`/`EntityDeleteDto`.
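
The snapshot-and-diff approach can be sketched like this (class and entity shapes are illustrative; the real `compute_db_updates()` emits typed DTOs rather than key lists):

```python
import copy

class ChangeTracker:
    """Track entity changes by diffing against an init-time deep copy."""
    def __init__(self, entities: dict[str, dict]):
        self.entities = entities
        self._snapshot = copy.deepcopy(entities)  # taken at context init

    def compute_db_updates(self) -> dict[str, list]:
        created = [k for k in self.entities if k not in self._snapshot]
        deleted = [k for k in self._snapshot if k not in self.entities]
        patched = [k for k in self.entities
                   if k in self._snapshot and self.entities[k] != self._snapshot[k]]
        return {"create": created, "patch": patched, "delete": deleted}
```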

## CMS (Django/GraphQL)

### Data Models

- **Device**: hostname, ip, username, password (encrypted), platform (FK), groups (M2M), connection_state (NEW|UNREACHABLE|NOSSH|AUTHFAILURE|OK), facts/checks (JSON auto-aggregated), soft-deletable, lockable
- **Interface**: name, ifindex, device (FK CASCADE), state (UP|DOWN|ADMIN_SHUTDOWN|ERROR_DISABLED), neighbor (self one-to-one), facts/checks
- **DeviceGroup**: name (unique), title, devices (M2M), facts/checks
- **Facts/Checks**: versioned records (key, value JSON, valid_till, purge_at), auto-aggregated into parent entity

### Integration Pattern (Acquire → Execute → Unlock)

1. Engine calls `getAndLockResources` GraphQL mutation → CMS locks entities, resolves Elasticsearch queries
2. Locked entities serialized as DTOs → passed to workers as job context
3. Workers modify entities in memory → compute diff
4. Diff sent as `dbUpdates` in job result → engine aggregates
5. Engine calls `unlockResources` with aggregated updates → CMS applies atomically

Authentication: JWT with RS256, JWKS at `/.well-known/jwks.json`, role-based permissions (BitField).

## Remote Lab (FastAPI)

Session-based lab allocation: `POST /session` → wait in FIFO queue → `ACTIVE` → use lab → `DELETE /session`. Heartbeat required (300s timeout). One lab at a time per session.
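
The FIFO allocation behavior can be modeled with a small queue sketch (illustrative only, not the Remote Lab API; it assumes one active session per lab at a time):

```python
from collections import deque

class SessionQueue:
    """FIFO allocation: one session active, the rest wait in order."""
    def __init__(self):
        self.waiting: deque[str] = deque()
        self.active: str | None = None

    def request(self, session_id: str) -> str:
        self.waiting.append(session_id)
        self._promote()
        return "ACTIVE" if self.active == session_id else "QUEUED"

    def release(self, session_id: str) -> None:
        if self.active == session_id:
            self.active = None
        self._promote()

    def _promote(self) -> None:
        if self.active is None and self.waiting:
            self.active = self.waiting.popleft()
```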

Lab lifecycle: upload topology (`POST /lab`), topology hash comparison for reuse, reference counting for shared labs, release/destroy.

Worker SDK integration via pytest fixtures: `remote_lab_fixture("tests/topologies/simple_iol.yml")`. Available topologies: `simple_iol` (2 Cisco IOL), `simple_frr` (2 FRRouting). `RemoteLabDevice.to_neops_device()` converts lab devices to `DeviceTypeDto`.

Config: `REMOTE_LAB_URL` (unset = local mode), `REMOTE_LAB_REQUEST_TIMEOUT` (30s), `REMOTE_LAB_SESSION_TIMEOUT` (300s).
70 changes: 70 additions & 0 deletions .agent/knowledge/_shared/cross-project-patterns.md
@@ -0,0 +1,70 @@
# Cross-Project Patterns

Conventions for documentation, examples, and testing that span multiple neops repositories.

## Cross-Project URL Convention

Links between project docs use absolute paths rooted at the project directory:
- Engine → SDK: `/neops-worker-sdk-py/docs/...`
- SDK → Engine: `/neops-workflow-engine/docs/...`

These resolve in the unified MkDocs multi-repo build (the `neops/` mono-repo includes all component repos as submodules under `docs/`).

## Terminology Alignment

Canonical definitions for shared terms live in `neops-ecosystem-overview.md` (Key Concepts table). When adding a new term to one project's glossary, add it to all relevant projects. Terms MUST match across all glossaries.

## Example Alignment

- All getting-started examples use **`fb.examples.neops.io`** package
- Workflow YAML `name`, `package`, and `version` fields must match between engine and SDK examples
- Engine CI: `make validate-examples` validates YAML against JSON schema
- SDK CI: pytest validates FB signatures and test cases
- When updating an example in one repo, check the other repo and update accordingly

### Runnable vs. Illustrative Examples

- **Runnable**: `echo`, `show_version`, `ping`, `configBackup` have real SDK implementations
- **Illustrative**: intermediate/advanced workflow examples use hypothetical FB names to demonstrate patterns
- Always label clearly which examples are runnable and which are illustrative

## Implementation Status Sync

Both projects must agree on implementation status for shared features:
- If the engine marks a feature as unimplemented, SDK docs must not describe it as available
- Use identical admonition style: `!!! warning "Implementation Status"`
- Periodically audit cross-project status to catch drift

## Cross-Project Onboarding

Each project's Getting Started links to the other:
- Engine "Your First Workflow" → SDK "Write Your First Function Block"
- SDK "Your First Function Block" → Engine "Run Your First Workflow"

This creates a complete onboarding loop regardless of entry point.

## Testing Patterns

### Worker SDK Testing

Two test decorators:
- `@fb_test_case(description, params, context, succeeds, assertions)` — local tests with mocked context
- `@fb_test_case_with_lab(description, params, remote_lab_fixture, assertions)` — remote lab with real devices

Context factory: `create_workflow_context(run_on, entity_id, devices, device_groups, interfaces)`.

Available lab topologies: `simple_iol` (2 Cisco IOL), `simple_frr` (2 FRRouting).

### Workflow Engine Testing

- Unit tests: Jest (`npm run test`)
- E2E tests: Supertest + PostgreSQL (`npm run test:e2e`)
- Example validation: `make validate-examples` (JSON schema validation of all YAML examples)
- CI in Docker: multi-stage Dockerfile with `--target run-ci`

### Common Pitfalls

- Hyphenated directories (`examples/getting-started/`) can't be imported as Python packages — use `sys.path` manipulation
- `DeviceTypeDto.platform` must be a `PlatformTypeDto`, not a string
- Plugin imports register via decorators at import time — order matters
- Default pytest config excludes `function_block` marker; use `-m "function_block"` explicitly
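
The first pitfall above (hyphenated example directories) is typically worked around by putting the directory itself on `sys.path` and importing its modules as top-level names. A minimal sketch; the example path and module name in the usage comment are hypothetical:

```python
import sys

def make_importable(directory: str) -> None:
    """Allow modules inside a hyphenated directory to be imported directly."""
    if directory not in sys.path:
        sys.path.insert(0, directory)

# Usage sketch:
# make_importable("examples/getting-started")
# import echo_example  # hypothetical module inside that directory
```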
37 changes: 37 additions & 0 deletions .agent/knowledge/_shared/documentation-personas.md
@@ -0,0 +1,37 @@
# Documentation Personas

Reusable persona definitions for reviewing and writing neops documentation.

## Sam — Junior Network Engineer

- **Experience**: 2 years in network operations
- **Skills**: Python basics, YAML from Ansible playbooks, CLI comfort. No typed Python (Pydantic, dataclasses) or TypeScript.
- **Mindset**: Eager to learn, needs hand-holding, appreciates fun and approachable tools
- **Reading pattern**: Getting Started → Concepts → Workflows. Skips architecture docs initially.
- **Success criteria**: Can run a hello-world workflow end-to-end within 30 minutes following only the docs

## Priya — Senior Network Engineer

- **Experience**: 15+ years across multi-vendor environments
- **Skills**: Expert in Ansible, Nornir, custom Python tooling, NETCONF/YANG. Familiar with CI/CD.
- **Mindset**: Critical, pragmatic, demands clear ROI before adopting a new tool
- **Reading pattern**: Architecture and Concepts first, then advanced features (acquire, retry, rollback). Compares with existing tools.
- **Success criteria**: Understands why neops is better than Ansible/Nornir for transaction-safe multi-device automation

## Marcus — Implementation Wizard

- **Experience**: Staff-level engineer, modern Python and TypeScript fluency
- **Skills**: Pydantic, NestJS, gRPC, MikroORM. Reads source when docs fall short.
- **Mindset**: Demands precision, completeness, and internal consistency. Notices mismatched types, missing edge cases.
- **Reading pattern**: Source-level docs, extension points, schema references first. Tutorials only if they show non-obvious patterns.
- **Success criteria**: Can extend neops with a custom handler, gateway, or FB type without asking questions

## Diana — Technical Writer (Meta-Reviewer)

- **Experience**: Multi-product technical documentation across developer platforms
- **Skills**: Evaluates structure, maintainability, audience awareness, cross-project consistency
- **Mindset**: Documentation as product. Cares about navigation, progressive disclosure, long-term maintainability.
- **Reading pattern**: Full nav structure review, then spot-checks for consistency, voice, completeness.
- **Success criteria**: Docs are navigable, internally consistent, each page serves a clear audience

For the persona review process, see `documentation-playbook.md` (QA phase).