Build, evaluate and understand grounded AI systems — from RAG pipelines to multi-agent orchestration.
A practical learning laboratory for AI Engineering: RAG, agents, evals, observability and safety — built incrementally from a working local foundation.
⚠️ This project is a learning and experimentation platform for modern AI systems. It intentionally exposes internal mechanics such as RAG pipelines, agent orchestration, evaluation, observability and safety layers.
New here? Pick the path that fits your goal:
| I want to… | Go to |
|---|---|
| Try the working RAG pipeline right now | ▶ What works today |
| Run it in 5 minutes | ⚙ Quick start |
| Understand the code end to end | 📄 Phase 1 RAG internals guide |
| See what changed in Conceitos Lab UX | 🧭 Phase 7 summary |
| Understand what we're building and why | 🎯 Objectives |
| See the architecture | 🏗 Architecture |
| Learn AI concepts behind the code | 📖 Concepts |
| Follow a guided learning path | 📘 Study Tracks |
| Understand key design decisions | 📐 ADRs |
| Operate auth/jobs locally | 🛠 Operational Runbook |
| Contribute | 🤝 Contributing |
This section tracks what is actually runnable, not what is planned. Planned work lives in the Roadmap.
- Uniform Document Schema (SourceDocument, NormalizedDocument) in packages/core
- ETL pipeline ingests text and PDF files → NormalizedDocument with sections and lineage
- Image and audio extractors exist as registered stubs (return NOT_IMPLEMENTED)
- Sample dataset registered in datasets/registry.json with checksum and metadata
- npm run ingest:smoke runs the full ETL pipeline locally
- Upload a document (inline text or text/PDF file) via API or CLI
- Index it: chunk → embed → store in local in-memory or persisted index (.groundedos/indexes/)
- Ask a grounded question: process query → retrieve top-K chunks → rerank → extractive answer → Dev Mode output
- Dev Mode output per request: chunk IDs, relevance scores, source metadata, offsets, embedding provider, cache, cost, workflow steps and retrieval spans
- Embedding providers: api-lexical (default, no server required), local-hash (deterministic), ollama (opt-in, requires Ollama)
- Index management API: list, delete persisted indexes
- Phase 1 is complete. Baseline metrics recorded in datasets/golden/baselines/phase-1-baseline.json.
- Query understanding runs before retrieval and is visible in Dev Mode
- Hybrid retrieval and reranking are implemented in the API path
- Semantic cache, cost tracking and rolling trade-off metrics are exposed through API and web
- Session-scoped memory persists locally under .groundedos/memory/sessions/ when sessionId is supplied
- Hybrid-vs-dense benchmark artifact is recorded in datasets/golden/baselines/phase-2-hybrid-benchmark.json
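One way to combine a dense ranking and a sparse ranking is Reciprocal Rank Fusion (RRF). The sketch below shows the idea only: the source does not say which fusion method packages/rag uses, so RRF, the function name `reciprocalRankFusion` and the chunk IDs are all assumptions made for illustration.

```typescript
// Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank) per item.
// k = 60 is the conventional default, not a value taken from this project.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

// Hypothetical chunk IDs as ranked by a dense and a sparse retriever.
const denseRanking = ["c2", "c1", "c3"];
const sparseRanking = ["c1", "c4", "c2"];
const fused = reciprocalRankFusion([denseRanking, sparseRanking]);
console.log(fused.map((r) => r.id)); // [ 'c1', 'c2', 'c4', 'c3' ]
```

RRF only needs rank positions, so it avoids normalizing dense and sparse scores onto a common scale.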
- @groundedos/agents exposes a DocumentQAAgent, reasoning loop and tool registry with timeout handling
- API exposes POST /agents/execute for document-qa
- @groundedos/safety includes prompt-injection, PII, jailbreak, hallucination, prompt-leakage and indirect-injection guardrails
- @groundedos/evals includes faithfulness, relevance and recall evaluators
- @groundedos/test-harness provides first-slice reusable helpers for provider, RAG, eval, agent, jobs, replay and experiment harness flows
- Prompt A/B testing is executable with npm run experiment:prompts
- Model/provider benchmarking is executable with npm run benchmark:models
- Persisted-index embedding visualization is available in the web app with section cluster labels
- Completed local-vs-cloud benchmark artifact: Ollama (qwen2.5:0.5b) + Groq (llama-3.1-8b-instant, free-tier cloud) both completed in the same run; datasets/golden/baselines/phase-4-model-benchmark.json records phase4ModelBenchmarkPassed: true
- Fine-tuning, LoRA and distillation each have real measured artifacts under datasets/experiments/phase-5/
- Quantization has a runnable measured lexical-vector benchmark that preserves retrieval quality while reducing memory; it is the current Phase 5 quantization slice, not a full model-weight quantization pipeline
- GET /lab/experiments exposes Phase 5 experiment summaries through the API and the web lab surface
Authentication & Authorization:
- JWT login/refresh/logout, API keys, admin-gated routes, audit logging and per-user rate limiting are implemented in the API.
- Auth enforcement is opt-in in local dev and defaults to enabled in non-dev/non-test environments when AUTH_ENFORCEMENT is unset.
- Optional PostgreSQL-backed auth users/sessions are available with memory fallback (AUTH_USER_BACKEND=postgres, AUTH_SESSION_BACKEND=postgres).
Queue Hardening & Observability:
- ✅ Async jobs via BullMQ (/jobs/*) with centralized retry policies per job type
- ✅ Queue hardening: fixed/exponential backoff, DLQ envelope with full metadata, structured lifecycle logging
- ✅ Prometheus-compatible metrics export: success rate, failure rate, retry rate, DLQ depth, duration percentiles
- ✅ DLQ operations: list, inspect, re-drive with audit trail and dry-run validation
- ✅ Grafana dashboard auto-imported with queue hardening visualizations
- ✅ Alert rules for DLQ accumulation, low success rate, high latency, high retry rate
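The fixed and exponential backoff policies listed above can be sketched as a small delay calculator. This is an illustration under stated assumptions: `RetryPolicy`, `delayForAttempt` and `shouldDeadLetter` are hypothetical names, and the real per-job-type policies and BullMQ options may differ.

```typescript
// Hypothetical retry policy shape; field names are illustrative only.
type RetryPolicy = {
  strategy: "fixed" | "exponential";
  baseDelayMs: number;
  maxAttempts: number;
};

// Delay to wait before retry number `attempt` (1-based).
function delayForAttempt(policy: RetryPolicy, attempt: number): number {
  if (attempt < 1) throw new RangeError("attempt must be >= 1");
  return policy.strategy === "fixed"
    ? policy.baseDelayMs
    : policy.baseDelayMs * 2 ** (attempt - 1); // doubles on every retry
}

// A job moves to the dead-letter queue once its attempts are exhausted.
function shouldDeadLetter(policy: RetryPolicy, attemptsMade: number): boolean {
  return attemptsMade >= policy.maxAttempts;
}

const exponentialPolicy: RetryPolicy = {
  strategy: "exponential",
  baseDelayMs: 500,
  maxAttempts: 4,
};
const retryDelays = [1, 2, 3].map((n) => delayForAttempt(exponentialPolicy, n));
console.log(retryDelays); // [ 500, 1000, 2000 ]
```

Exponential backoff spaces retries further apart on each failure, which tends to help when the failure is caused by a temporarily overloaded dependency.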
Operational Guides:
Quick Start:
# 1) Start API and worker in separate terminals
npm run api:dev
npm run api:jobs:worker
# 2) Enqueue async jobs
JOB_ID=$(curl -s -X POST http://localhost:3001/jobs/phase5 \
-H 'content-type: application/json' \
-d '{"track": "quantization"}' | jq -r '.jobId')
# 3) Poll job status
curl "http://localhost:3001/jobs/${JOB_ID}"
# 4) View queue metrics
curl "http://localhost:3001/jobs/metrics?format=prometheus"
# 5) Start observability stack for Grafana/Prometheus
docker compose --profile observability up -d
# Grafana: http://localhost:3100

Full operational examples and troubleshooting are in docs/operational-runbook.md and docs/prometheus-grafana-setup.md.
- Sidebar now supports full-text concept search and category/status filters.
- Concept flow is consolidated in a modal with tabs (Details, Dependencies, Paths).
- Dependency graph now supports multi-level expansion with clear hierarchical columns and mandatory directed edges.
- Node interactions include primary-path highlighting, direct-relations focus mode and auto-scroll into a selected concept summary.
- Educational summary panel is structured in Portuguese for didactic use (definition, when to use, common pitfalls, computational cost, popular libs, and why it matters for RAG).
- Learning Path panel tracks progress and recommends next concepts.
Reference docs:
| Feature | Planned phase |
|---|---|
| OAuth / external identity providers | Phase 6+ |
| Production-grade vector database deployment (Qdrant) | Phase 6+ |
| Redis DLQ persistence (in-process memory working) | Phase 6+ |
| Multi-worker distributed coordination | Phase 6+ |
| Long-term trace storage and retention | Phase 7+ |
| Streaming responses and WebSockets | Phase 7+ |
| Multimodal (complete image + audio extraction) | Phase 7+ |
Requirements: Node.js ≥ 20, npm ≥ 8
# 1. Install dependencies
npm install
# 2. Ask a question against the sample dataset (no config needed)
npm run rag:smoke -- --dataset phase-0-smoke-text --query "What does this command verify?"
# 3. Ask against your own file
npm run rag:ask -- --file datasets/samples/phase-0-smoke.txt --type text --query "Your question"
# 4. Start the local API server (port 3001)
npm run api:dev
# 5. Start the web interface (port 3000) in another terminal
npm run web:dev

Both CLI commands print a JSON response with the query, a grounded answer, retrieved chunk IDs, scores, source metadata and offsets. See docs/phase-1-local-rag.md for the full usage guide.
Quick harness smoke check:
npm run eval:harness

This command runs a minimal deterministic eval harness pipeline and prints a compact report summary.
GroundedOS Lab has three goals, listed in priority order. Order matters: if there is ever a conflict, the primary goal wins.
- Primary: practical learning laboratory. A hands-on environment where developers build each component of a grounded AI system from scratch, observe its internals and understand why it works. Every feature exists to teach something, not to ship a product.
- Secondary: architecture reference. A structured, documented monorepo that shows how RAG, agents, evals, observability and safety fit together in a real codebase. Decisions are recorded in ADRs so the reasoning is visible.
- Tertiary: usable product. Once the learning foundation is solid, the system should also work as a usable local AI assistant. This is never the reason to add complexity; it is the outcome of doing the first two well.
Every feature in this project orbits a single observable loop:
Upload document
↓
Index (chunk → embed → store)
↓
Ask a question
↓
Retrieve relevant chunks
↓
Generate a grounded answer
↓
Show sources + Dev Mode metadata
↓
Evaluate quality (faithfulness, relevance, latency, cost)
↓
Trace the full request
This loop is already runnable through local RAG, persisted indexes, reranking, trade-off metrics and session memory. Agents, guardrails and evals have package-level baselines; fine-tuning and production infrastructure remain later-phase work. Build and understand the loop first.
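The first half of the loop (index, then retrieve with source offsets) can be sketched in a few lines. This is a toy illustration, not the project's pipeline: the embedding step is replaced by plain token overlap, and `chunkText`, `overlapScore` and `retrieveTopK` are hypothetical names.

```typescript
type Chunk = { chunkId: string; text: string; start: number; end: number };

// Fixed-size chunking that records character offsets for source attribution.
function chunkText(docId: string, text: string, size: number): Chunk[] {
  const chunks: Chunk[] = [];
  for (let start = 0; start < text.length; start += size) {
    const end = Math.min(start + size, text.length);
    chunks.push({ chunkId: `${docId}#${chunks.length}`, text: text.slice(start, end), start, end });
  }
  return chunks;
}

function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Stand-in for embedding similarity: fraction of query tokens found in the chunk.
function overlapScore(query: string, chunk: Chunk): number {
  const queryTokens = tokenize(query);
  if (queryTokens.length === 0) return 0;
  const chunkTokens = new Set(tokenize(chunk.text));
  return queryTokens.filter((t) => chunkTokens.has(t)).length / queryTokens.length;
}

function retrieveTopK(query: string, chunks: Chunk[], k: number) {
  return chunks
    .map((chunk) => ({ chunk, score: overlapScore(query, chunk) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

const sampleChunks = chunkText(
  "doc-1",
  "Refunds are issued within 30 days. Shipping takes one week.",
  35,
);
const [best] = retrieveTopK("when are refunds issued", sampleChunks, 1);
console.log(best.chunk.chunkId, best.chunk.start, best.chunk.end, best.score);
// doc-1#0 0 35 0.75
```

Keeping `start` and `end` on every chunk is what lets a later step show exactly which span of the source document grounded the answer.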
GroundedOS Lab runs locally first, with optional cloud integration as the project evolves.
- Enable local model execution for experimentation
- Compare local vs cloud performance
- Reduce or eliminate dependency on paid APIs during experimentation
Planned / target integrations:
- Local Transformers (quantized models)
- Ollama-based local execution (opt-in, already available for embeddings)
- OpenAI / Anthropic APIs (optional, Phase 4+)
- Chat with documents, images and audio
- Grounded responses with source attribution
- Memory-aware conversations
- Inspect:
  - retrieved chunks
  - latency
  - model routing decisions
  - orchestration steps and reasoning summaries
  - semantic cache hit/quality diagnostics
  - per-request eval scores and cost breakdown
  - grounding sources
- Toggle adaptive behavior per request:
  - multi-model orchestration
  - reasoning summary mode
  - shadow retrieval checks
- Prompt A/B testing
- Jailbreak playground
- Model benchmarking
- Embedding visualization
- Cost analysis
The architecture is described at three levels of maturity. Not everything in the Target or Experimental views exists yet — see What works today for the running baseline.
User Input (CLI, HTTP or Web)
↓
apps/api (NestJS) or CLI script
↓
packages/etl → NormalizedDocument
↓
packages/rag → Query Understanding + Chunks + Embeddings + Hybrid Retrieval + Reranking
↓
packages/observability → Cost + Latency + Trade-off Metrics
↓
packages/memory → Optional session memory when sessionId is supplied
↓
Extractive Answer + Dev Mode Output (chunks, scores, offsets, metadata)
This is the working loop. It produces observable output — retrieved chunks with scores and source attribution — on every request.
--------------------------------
| Session / Request Manager | ← lifecycle owner for the entire request
--------------------------------
↓
User Input
↓
Prompt / Context Engineering
↓
(Adaptive RAG Decision)
↓
--------------------------------
| RAG Pipeline |
| - ETL |
| - Chunking |
| - Embeddings |
| - Hybrid Search |
| - Re-ranking |
--------------------------------
↓
--------------------------------
| Semantic Cache (optional) | ← operates on (query + retrieved context)
--------------------------------
↓
Model Routing ← context-informed: considers retrieved content,
↓ context length, cost and reasoning requirements
Multi-Agent Orchestration
↓
Tool Calling Layer
↓
LLM Inference
↓
Self-Reflection / Validation
↓
Guardrails & Safety Layer
↓
Response + Data Lineage
↓
--------------------------------
| Feedback Loop |
| → Evaluation (Evals) |
| → Observability |
| → Memory Update |
--------------------------------
Architecture notes
- Session / Request Manager owns the full request lifecycle. It is the component that will manage state across multiple tool calls in agent flows.
- RAG before Model Routing: retrieved context (volume, domain, complexity) informs which model to use. Routing before retrieval loses this signal.
- Semantic Cache after RAG: the cache key is (query, retrieved_context), not the raw query alone. Keying on the raw query alone produces false hits when the same query string yields different retrieved contexts.
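The cache-key note can be made concrete with a small sketch. It uses exact matching on a normalized query plus the set of retrieved chunk IDs, which is a simplification of a semantic (similarity-based) cache; `cacheKey`, `RetrievedChunk` and the chunk IDs are hypothetical.

```typescript
// Hypothetical chunk shape; only the ID contributes to the cache key.
type RetrievedChunk = { chunkId: string; score: number };

// Key = normalized query + sorted set of retrieved chunk IDs.
function cacheKey(query: string, chunks: RetrievedChunk[]): string {
  const normalizedQuery = query.trim().toLowerCase().replace(/\s+/g, " ");
  const contextIds = chunks.map((c) => c.chunkId).sort();
  return JSON.stringify([normalizedQuery, contextIds]);
}

const answerCache = new Map<string, string>();
const question = "What is the refund policy?";
answerCache.set(
  cacheKey(question, [{ chunkId: "doc-a#1", score: 0.9 }]),
  "Answer grounded in doc-a",
);

// Same query and same context (scores may differ): a hit.
const hitSameContext = answerCache.has(
  cacheKey(question, [{ chunkId: "doc-a#1", score: 0.7 }]),
);
// Same query but different retrieved context: a miss, not a false hit.
const hitOtherContext = answerCache.has(
  cacheKey(question, [{ chunkId: "doc-b#4", score: 0.9 }]),
);
console.log(hitSameContext, hitOtherContext); // true false
```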
The experiments/ folder holds independent experiments that are not part of the request path. They produce artifacts (fine-tuned weights, quantized models) that can later feed into Model Routing as new provider options.
experiments/
fine-tuning/ ← offline training runs, produce weights
lora/ ← LoRA adapter training
quantization/ ← quantize models for local inference
distillation/ ← teacher → student model compression
jailbreak-defense/ ← red-teaming, produces attack fixtures for packages/safety
bias-tests/ ← produces eval fixtures for packages/evals
These experiments orbit the core loop — they make it better — but they are never prerequisites for the core loop to run.
Some concepts below are already implemented in the runnable Phase 0-1 loop. Others are documented and roadmapped so contributors can learn the system in the order it will be built. See What works today for the implementation boundary.
- LLM
- Transformer
- Weights
- Context Window
- Inference
- RAG
- Embeddings
- Vector Database
- Chunking
- Hybrid Search
- Re-ranking
- Knowledge Graphs (GraphRAG)
- Data Lineage
- Prompt Engineering
- Context Engineering
- System Prompt
- Few-shot / Zero-shot Learning
- Chain-of-Thought (CoT)
- Self-Reflection / Self-Correction
- Grounding
- Context Pruning / Trimming
- Adaptive RAG
- Multi-agents
- Tool Calling / Function Calling
- Memory
- Model Routing
- Quantization
- LoRA
- Distillation
- Fine-tuning
- Temperature
- Top-P / Top-K
- Tokenization
- ETL for LLM
- Data Augmentation
- Synthetic Data Generation
- Uniform Document Schema
- Latency / Throughput
- Semantic Caching
- Evaluation (Evals)
- Observability
- Cost Analysis (Showback/Chargeback)
- A/B Testing of Prompts
- Guardrails
- Hallucination Detection
- Bias Evaluation
- PII Stripping
- Jailbreaking Defense
- Text
- Image
- Audio
- Structured Outputs (planned via Pydantic / schema validation)
GroundedOS Lab is built for:
| Audience | What you get |
|---|---|
| AI/ML Engineers | A structured monorepo to experiment with RAG, agents, evals, and safety in a real-world architecture |
| Backend Engineers | Hands-on exposure to LLM-powered pipelines, model routing, observability, and async workers |
| Students & Researchers | A documented learning map that connects concepts (embeddings, CoT, guardrails) directly to working code |
| Technical Leaders | A reference architecture for grounded AI systems, including cost tracking, evaluation and safety layers |
⚠️ This project assumes basic Python and TypeScript knowledge. No prior AI/ML experience is required — the goal is to build it as you learn.
Use this repository as a structured learning path:
- New here? Start in 5 minutes: → ⚡ Quick Start and ✅ What works today
- Want to understand what we're building? → 🎯 Objectives and 🔁 Core Product Loop
- Interested in the hands-on module concepts? → Jump to 🔬 Laboratory Modules
- Looking for evals, agents, guardrails, routing, or prompt experimentation? → Browse the packages/ and experiments/ folders; each has its own README.md
- Want to understand AI concepts behind the system? → Start at docs/concepts/
- Want a guided learning path by topic? → Explore the 📘 Guided Study Tracks
- Want to understand why the system is built the way it is? → Read the Architecture Decision Records
- Want to know how quality is measured? → Read the Evaluation Strategy
Guided Study Tracks are topic-based routes through the existing GroundedOS Lab documentation, packages, experiments and roadmap phases.
Start here:
- Track 1 - LLM Foundations: model basics, Transformer concepts, inference, context windows and generation controls
- Track 2 - Multi-Modal & Agents: multimodal ingestion, tool calling, multi-agent flows and memory
- Track 3 - Open-Source Ecosystem: Hugging Face, local models, quantization and inference trade-offs
- Track 4 - Evaluation & Comparison: evals, observability, cost analysis, A/B testing and benchmarking
- Track 5 - Advanced RAG: embeddings, chunking, vector databases, hybrid search, reranking, grounding and lineage
- Track 6 - Fine-tuning & Adaptation: fine-tuning, LoRA, distillation, data augmentation, synthetic data and RLHF
- Track 7 - Autonomous AI Systems: planning, self-reflection, memory, multi-agent collaboration and guardrails
- Batch testing:
  - prompts
  - temperature
  - top-p
  - models
- Compare:
  - local vs cloud models
  - latency
  - cost
  - quality
- Embedding visualization (t-SNE / UMAP)
- similarity maps
- clustering
- Prompt injection detection
- Jailbreak protection
- PII sanitization
- Output validation
- Grounding enforcement
Authentication and authorization include a multi-tenant isolation baseline for local and staged environments:
- JWT + refresh flow via POST /auth/login, POST /auth/refresh and POST /auth/logout
- API key management with tenant/user binding and scope-based policies
- Ownership scoping for indexes and session memory (tenantId, userId, createdBy)
- Deny-by-default access checks for cross-tenant/cross-owner retrieval and CRUD operations
- Structured audit hooks for denied access, cross-tenant attempts and API key usage
- Rate limiting on protected paths
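A deny-by-default ownership check over the tenantId, userId and createdBy fields might look like the sketch below. The types, the `canAccess` helper and the `rag:read` scope string are assumptions for illustration; only the wildcard scopes such as rag:* appear in this document.

```typescript
type Principal = { tenantId: string; userId: string; scopes: string[] };
type OwnedResource = { tenantId: string; userId: string; createdBy: string };

// A scope matches exactly or through the area wildcard, e.g. "rag:*".
function hasScope(principal: Principal, required: string): boolean {
  const area = required.split(":")[0];
  return principal.scopes.includes(required) || principal.scopes.includes(`${area}:*`);
}

// Deny by default: access is granted only when every check passes.
function canAccess(principal: Principal, resource: OwnedResource, required: string): boolean {
  if (principal.tenantId !== resource.tenantId) return false; // cross-tenant
  if (!hasScope(principal, required)) return false; // missing scope
  return principal.userId === resource.userId || principal.userId === resource.createdBy;
}

const alice: Principal = { tenantId: "t1", userId: "alice", scopes: ["rag:*"] };
const ownIndex: OwnedResource = { tenantId: "t1", userId: "alice", createdBy: "alice" };
const foreignIndex: OwnedResource = { tenantId: "t2", userId: "bob", createdBy: "bob" };

console.log(canAccess(alice, ownIndex, "rag:read")); // true
console.log(canAccess(alice, foreignIndex, "rag:read")); // false
console.log(canAccess(alice, ownIndex, "jobs:read")); // false
```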
Auth enforcement behavior:
- Local development: opt-in (AUTH_ENFORCEMENT=true)
- Non-dev/non-test: enabled by default when AUTH_ENFORCEMENT is unset
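The enforcement default can be expressed as a small resolver. This sketch assumes the environment is identified by NODE_ENV and that an unset NODE_ENV counts as development; neither detail is stated in this document, and `isAuthEnforced` is a hypothetical name.

```typescript
type Env = Record<string, string | undefined>;

// An explicit AUTH_ENFORCEMENT always wins; otherwise fall back to the environment default.
function isAuthEnforced(env: Env): boolean {
  const explicit = env.AUTH_ENFORCEMENT;
  if (explicit !== undefined && explicit !== "") {
    return explicit.toLowerCase() === "true";
  }
  // Assumption: NODE_ENV names the environment and defaults to "development".
  const nodeEnv = env.NODE_ENV ?? "development";
  return nodeEnv !== "development" && nodeEnv !== "test";
}

console.log(isAuthEnforced({ NODE_ENV: "development" })); // false
console.log(isAuthEnforced({ NODE_ENV: "production" })); // true
console.log(isAuthEnforced({ NODE_ENV: "development", AUTH_ENFORCEMENT: "true" })); // true
```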
Current limitations / next steps:
- RBAC/ABAC expansion: API key scopes are intentionally simple (rag:*, jobs:*, admin:*) and can evolve into richer role/policy engines.
- External identity providers: OAuth/OIDC provider integration and production-grade identity lifecycle.
- Production hardening: distributed key management, advanced observability and additional tenancy controls for external vector backends.
This strategy is tracked as a Phase 6 success criterion. Implementation decisions will be recorded in docs/adr/.
The Jailbreak Playground (experiments/jailbreak-defense/) is a red-teaming surface that deliberately tests adversarial inputs. Before it is exposed beyond local development:
- All playground inputs are logged with the authenticated user identifier — no anonymous red-teaming.
- Playground outputs (successful jailbreaks, bypass patterns) are never exposed publicly; results are stored in datasets/ with access controls.
- External contributors must review the security policy in experiments/jailbreak-defense/README.md before submitting new attack patterns.
Image and audio extractors are registered stubs. They will re-enter the roadmap when:
- A concrete use case is identified (e.g. PDF-with-images ingestion, audio transcription for meeting notes).
- The relevant privacy and content-moderation implications for user-uploaded media are documented.
- A Phase milestone explicitly includes multimodal success criteria.
- Token usage
- Cost per request
- Latency per stage
- Model usage
- Error rates
- Hallucination signals
- Cache hit rate
Try to break the system and see:
- why it was blocked
- which rule triggered
See:
- which chunks were used
- relevance score
- document origin
Compare:
- latency
- cost
- quality
Compare prompts with automatic eval scoring
The monorepo scaffold below is already created. Each folder contains a README.md describing its responsibilities. Code implementation follows the roadmap phases.
groundedos-lab/
apps/
api/ ← Backend API server (REST, local RAG, agents and metrics)
web/ ← Frontend application (React + Vite)
worker/ ← Async jobs worker (Phase 6 in progress)
packages/
core/ ← Shared types, utilities, and base abstractions
rag/ ← Full RAG pipeline (chunking, embeddings, hybrid search, re-ranking)
agents/ ← Multi-agent orchestration and tool calling layer
memory/ ← Conversation and long-term memory management
model-routing/ ← LLM routing logic (local vs cloud, cost-aware)
safety/ ← Guardrails, PII stripping, jailbreak defense
observability/ ← OpenTelemetry tracing, cost tracking, latency metrics
evals/ ← Evaluation framework (RAGAS, custom scorers)
etl/ ← Document ingestion and preprocessing pipelines
experiment-toolkit/ ← Batch prompt testing, parameter sweeps
benchmarks/ ← Local vs cloud model benchmarking
viz/ ← Embedding visualization (t-SNE / UMAP)
experiments/
fine-tuning/ ← Full fine-tuning experiments
lora/ ← LoRA adapter training
distillation/ ← Knowledge distillation
quantization/ ← Model quantization experiments
jailbreak-defense/ ← Red-teaming and prompt injection defense
bias-tests/ ← Bias evaluation across models and prompts
docs/
concepts/ ← One file per AI concept, linked to code
adr/ ← Architecture Decision Records (why the system is built this way)
study-tracks/ ← Guided learning routes by topic
datasets/ ← Raw, processed and synthetic datasets registry
infra/ ← Docker, Compose, K8s, environment configs
instructions/ ← Agent instruction layer entrypoint and global manifest
agents/ ← Agent behavior profiles (planner, implementer, reviewer)
skills/ ← Skill routing registry
context/ ← Reusable project and contribution context
prompts/ ← Prompt templates by intent
evals/ ← Instruction adherence rubrics
configs/ ← Instruction profile and adapter configuration
The repository includes a reusable instruction layer for Codex, Copilot Chat (VS Code), and GitHub Copilot so contributors do not need to restate project standards in every prompt.
Primary entrypoints:
Local validation:
npm run instructions:validate
npm run instructions:check
npm run instructions:resolve
npm run instructions:migrate:plan -- --from 1.1 --to 1.1
npm run instructions:migrate:apply -- --from 1.0 --to 1.1

This check is also wired into CI in warning mode during the current rollout.
No external services required for the default local path:
| Layer | What | Note |
|---|---|---|
| API server | Node.js + NestJS (Fastify adapter) | Runs with npm run api:dev. Migrated from raw Fastify; see ADR-001. |
| Web | React 19 + Vite + TypeScript | Runs with npm run web:dev (Vite dev server, proxies /api to NestJS). |
| Storage | Local JSON files (.groundedos/) | Persisted indexes, memory sessions and cost ledger |
| Embeddings | api-lexical (built-in) | Default, no server. local-hash and ollama are opt-in. |
| Retrieval | In-memory hybrid search + reranking (packages/rag) | No external vector DB required yet |
| Observability | Local cost tracking + trade-off metrics | No tracing server required yet |
| Layer | What | When | Notes |
|---|---|---|---|
| Database | PostgreSQL | Phase 6 | Production persistence for indexes, memory and users |
| Vector DB | pgvector → Qdrant | Phase 6 | pgvector first, migrate when needed. See ADR-002. |
| Queue | Redis + BullMQ | Phase 3+ | API → Worker communication boundary. See ADR-003. |
| Workers | Python (ML pipelines) | Phase 3+ | Consume BullMQ jobs for compute-heavy tasks |
| Observability | OpenTelemetry + Grafana | Phase 6 | Distributed tracing, cost per stage |
| AI providers | OpenAI / Anthropic (optional) | Phase 4+ | Cloud LLM option alongside local Ollama |
| Containers | Docker + docker-compose | Phase 6 | Full local stack in one command |
| CI | GitHub Actions | Phase 6 | Lint, typecheck, test on every PR |
- Uniform Document Schema
- Multimodal ingestion standardization
- ETL pipeline
✅ Success Criteria:
- packages/core defines SourceDocument and NormalizedDocument (the Uniform Document Schema)
- packages/etl ingests text and PDF files into NormalizedDocument
- Image and audio ingestion remain registered stubs for a later multimodal slice
- At least one sample dataset registered in datasets/
- ETL pipeline is runnable locally with a single smoke command
- Chunking
- Embeddings
- Vector DB
- Chat
✅ Success Criteria:
- User can upload a document and ask a question grounded in its content
- Local RAG smoke command can ask a question against a registered dataset
- Retrieved chunks have a documented Dev Mode output contract with relevance scores
- packages/rag has integration tests covering the full retrieval flow
- Phase 1 baseline metrics recorded in datasets/golden/baselines/phase-1-baseline.json before Phase 2 begins; see Evaluation Strategy
- Hybrid search (dense + sparse)
- Re-ranking
- Observability
✅ Success Criteria:
- Runtime contract validation is enforced at package and API boundaries (schema-first validation layer in packages/core)
- Query understanding runs before retrieval (rewrite, expansion, intent detection) and is visible in Dev Mode output
- RAG ask execution is exposed as explicit workflow steps with per-step status and duration in Dev Mode
- Semantic cache lookup is integrated before retrieval and reported in Dev Mode (cache.hit, similarity, hit/miss counters)
- Request-level cost tracking with budget enforcement is integrated and exposed in Dev Mode (cost.breakdown, totalCostUsd, withinBudget)
- Trade-off metrics dashboard exposes request/provider aggregates (latency, cost, grounded rate, cache hit rate) via API and web tab
- Hybrid search (dense + sparse) is benchmarked against dense-only retrieval on the Phase 0 smoke dataset; Recall@3 does not regress and expected-chunk score improves in datasets/golden/baselines/phase-2-hybrid-benchmark.json
- Re-ranking is applied and token usage / latency per stage is logged per request
- Retrieval observability spans (chunk count, scores, latency) appear in the Dev Mode output
Note: Persistent memory between sessions is a separate product and infrastructure concern (storage, user identity, privacy) and is tracked in Phase 2b, not here. Mixing it with retrieval quality improvements would cause one to delay the other.
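The Recall@3 non-regression criterion above can be checked with a helper like the following. It is a minimal sketch: `recallAtK` is a hypothetical name, and the rankings and chunk IDs are made up rather than taken from the benchmark artifact.

```typescript
// Recall@K: fraction of expected chunks that appear in the top K results.
function recallAtK(ranked: string[], expected: string[], k: number): number {
  if (expected.length === 0) throw new Error("expected chunk list must not be empty");
  const topK = new Set(ranked.slice(0, k));
  const found = expected.filter((id) => topK.has(id)).length;
  return found / expected.length;
}

// Hypothetical rankings for one golden query whose answer lives in chunk "c7".
const expectedChunks = ["c7"];
const denseOnly = ["c2", "c9", "c7", "c4"];
const hybrid = ["c7", "c2", "c9", "c4"];

const denseRecall = recallAtK(denseOnly, expectedChunks, 3);
const hybridRecall = recallAtK(hybrid, expectedChunks, 3);
const regressed = hybridRecall < denseRecall;
console.log({ denseRecall, hybridRecall, regressed });
// { denseRecall: 1, hybridRecall: 1, regressed: false }
```

In this example both retrievers reach Recall@3 of 1, so the gain from hybrid search shows up only in the rank (and score) of the expected chunk, which is why the criterion tracks both.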
- Conversation memory across sessions
- Storage backend for memory entries
- Memory read/write contracts
✅ Success Criteria:
- Conversation memory persists and is retrievable across independent API restarts using a documented storage contract
- Memory entries are associated with a session identifier; no cross-session leakage
- Memory scope, retention policy and privacy implications are documented in packages/memory/README.md
Implemented via @groundedos/memory and integrated into POST /rag/ask with optional sessionId plus GET /rag/memory/:sessionId for session inspection.
- Agents
- Tool calling
- Guardrails
- Evals
- Self-reflection / validation layer
✅ Success Criteria:
- @groundedos/agents: DocumentQAAgent with tool calling and reasoning loop, API integration via POST /agents/execute
- @groundedos/safety: Six-risk guardrail suite (prompt injection, PII, jailbreak, hallucination, prompt leakage, indirect injection) with GuardrailChain orchestration
- @groundedos/evals: FaithfulnessEvaluator, RelevanceEvaluator, RecallEvaluator with per-query and aggregate scoring
- Tool calling layer fully implemented in agents/src/tools.ts with registry and timeout handling
- Phase 3 baseline metrics recorded in datasets/golden/baselines/phase-3-baseline.json (avg faithfulness 0.87, relevance 0.92, recall 1.0)
- Full test coverage: 197 tests passing (added 42 new tests: 7 agents, 21 guardrails, 14 evals)
- Benchmarking
- A/B testing
- Visualization
- Model routing
✅ Success Criteria:
- A/B prompt test runs automatically and reports winner with statistical summary (sample size, confidence interval): npm run experiment:prompts writes datasets/golden/baselines/phase-4-ab-prompt-test.json; current result is not statistically conclusive because the golden dataset has one query
- Benchmark compares at least two models (local Ollama + one cloud provider) on latency, cost and quality using the Phase 0 smoke dataset as the shared baseline: npm run benchmark:models -- --providers local-extractive,ollama,groq completed with Ollama (qwen2.5:0.5b) + Groq (llama-3.1-8b-instant); artifact at datasets/golden/baselines/phase-4-model-benchmark.json records phase4ModelBenchmarkPassed: true
- Embedding visualization renders in the web app with section cluster labels for persisted indexes
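A statistical summary with sample size and confidence interval could be computed as sketched below, which also shows why a one-query dataset cannot be conclusive. The function and type names are hypothetical, and the interval uses a normal approximation (z = 1.96); for small samples a t-distribution would be more appropriate.

```typescript
type AbSummary = {
  sampleSize: number;
  meanDiff: number; // mean of (scoreA - scoreB) per query
  ci95: [number, number] | null; // null when the interval cannot be estimated
  conclusive: boolean; // true only when the interval excludes 0
};

function summarizeAbTest(scoresA: number[], scoresB: number[]): AbSummary {
  if (scoresA.length === 0 || scoresA.length !== scoresB.length) {
    throw new Error("both variants need scores for the same non-empty set of queries");
  }
  const diffs = scoresA.map((a, i) => a - scoresB[i]);
  const n = diffs.length;
  const meanDiff = diffs.reduce((sum, d) => sum + d, 0) / n;
  if (n < 2) {
    // One paired sample has no variance estimate, so no interval exists.
    return { sampleSize: n, meanDiff, ci95: null, conclusive: false };
  }
  const variance = diffs.reduce((sum, d) => sum + (d - meanDiff) ** 2, 0) / (n - 1);
  const halfWidth = 1.96 * Math.sqrt(variance / n);
  const ci95: [number, number] = [meanDiff - halfWidth, meanDiff + halfWidth];
  return { sampleSize: n, meanDiff, ci95, conclusive: ci95[0] > 0 || ci95[1] < 0 };
}

// With a single golden query, ci95 is null and conclusive is false.
const singleQuery = summarizeAbTest([0.9], [0.7]);
console.log(singleQuery.sampleSize, singleQuery.ci95, singleQuery.conclusive);
```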
- LoRA
- Quantization
- Fine-tuning
- Distillation
✅ Success Criteria:
- Initial deterministic scaffolds exist for fine-tuning, LoRA, quantization and distillation with documented Python environment setup
- Scaffold artifacts compare baseline vs candidate variants on task-specific metrics such as faithfulness, relevance, quality, latency, memory or parameter ratio
- Scaffold results are logged under datasets/experiments/phase-5/ with input dataset, hyperparameters and output metrics recorded
- Quantization track: INT8 and INT4 symmetric vector quantization with Recall@1=1.0 at 85.8% memory reduction vs FP32 baseline
- LoRA track: real adapter training on GPT-2 (PyTorch + PEFT), 294k trainable params vs 124M baseline (99.76% reduction), comparison.passed: true, test in scripts/lora-experiment.test.ts
- Fine-tuning track: real SFT on GPT-2 with 6 instruction pairs from the Phase 5 retrieval dataset; loss improved 0.48 over 3 steps, all 124M parameters updated, comparison.passed: true, test in scripts/sft-experiment.test.ts
- Distillation track: real teacher-student run (gpt2 -> distilgpt2) with 34.17% compression and quality gate passing, test in scripts/distillation-experiment.test.ts
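For readers new to the quantization track, symmetric INT8 vector quantization can be sketched as follows. This illustrates the general technique only; it is not the project's benchmark code, it does not reproduce the reported figures, and the function names are hypothetical.

```typescript
type QuantizedVector = { values: Int8Array; scale: number };

// Symmetric INT8 quantization: scale = max(|x|) / 127, q = round(x / scale).
function quantizeInt8(vector: number[]): QuantizedVector {
  const maxAbs = Math.max(0, ...vector.map((x) => Math.abs(x)));
  const scale = maxAbs === 0 ? 1 : maxAbs / 127; // avoid division by zero
  const values = Int8Array.from(vector, (x) => Math.round(x / scale));
  return { values, scale };
}

// Approximate reconstruction: x ≈ q * scale.
function dequantize(q: QuantizedVector): number[] {
  return Array.from(q.values, (v) => v * q.scale);
}

const originalVector = [0.5, -1.0, 0.25, 0.0];
const quantized = quantizeInt8(originalVector);
const restored = dequantize(quantized);

// Largest per-dimension reconstruction error; bounded by scale / 2.
const maxError = Math.max(...restored.map((x, i) => Math.abs(x - originalVector[i])));
console.log(quantized.values, maxError <= quantized.scale / 2 + 1e-12);
```

Each dimension shrinks from 4 bytes (FP32) to 1 byte, plus one shared scale per vector; the rounding error is what a Recall@K check on the quantized index is meant to catch.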
- Docker and docker-compose for local full-stack environment
- CI pipeline (lint, typecheck, test on every PR)
- Environment configuration and secrets management
- OpenTelemetry observability stack (Jaeger, Prometheus, Grafana)
- Authentication and authorization baseline
- Staging deployment (optional cloud target)
✅ Success Criteria:
- docker-compose up configuration exists for the full local stack (API, web, worker, Redis, Postgres)
- docker-compose --profile observability up -d starts Jaeger, Prometheus, Grafana with OTEL collection enabled
- OTEL trace export is enabled in docker-compose: API and worker send traces to OTEL Collector
- Jaeger receives and visualizes distributed traces with span hierarchy, timing, and attributes
- Prometheus scrapes metrics from API, worker, and infrastructure with working alert rules
- Grafana auto-provisions Prometheus and Jaeger datasources; sample dashboard queries work
- GitHub Actions CI runs lint, typecheck and tests on every PR (.github/workflows/ci.yml)
- .env.example files for root, API and web document required variables (including OTEL_EXPORT_ENABLED)
- Authentication strategy is documented (see ADR-014)
- Security hardening audit complete: auth middleware, rate limiting, audit logging, CORS (see PHASE-6-OBSERVABILITY-VALIDATION.md)
Phases 0, 1, 2, 2b, 3, 4 and 5 are complete. Phase 6 Infrastructure & Deploy is in active rollout.
- Phase 5 is complete with real runs across quantization, LoRA, fine-tuning and distillation artifacts in datasets/experiments/phase-5/.
- Phase 6 observability stack ✅ complete:
  - OTEL trace export enabled (API and worker export to Jaeger)
  - Prometheus scrape config for all services
  - Grafana auto-provisioning with Prometheus + Jaeger datasources
  - Queue hardening metrics: success/failure/retry rates, DLQ depth, duration percentiles
  - Queue hardening dashboard: auto-imported with queue operational visualizations
  - Alert rules: API/worker/infrastructure + Queue Hardening alerts (DLQ accumulation, low success rate, high latency, high retry rate)
  - Validation guide: docs/PHASE-6-OBSERVABILITY-VALIDATION.md
  - Setup guide: docs/prometheus-grafana-setup.md
- Phase 6 baseline infra already landed: docker-compose.yml with observability profile, Dockerfile, Dockerfile.web, apps/worker/Dockerfile, .github/workflows/ci.yml.
- Authentication/authorization baseline already landed: POST /auth/login, POST /auth/refresh, POST /auth/logout, API-key management, admin endpoints, owner scoping for indexes and memory, and rate limiting / audit hooks.
- Auth enforcement now follows environment defaults: opt-in in local development, and enabled by default in non-dev/non-test environments when AUTH_ENFORCEMENT is unset.
- Security hardening verified: bearer token + cookie auth, rate limiting, audit logging, multipart limits, response validation.
- Next technical priorities: OAuth/provider-based identity, production deployment hardening, database scaling, queue observability and multi-worker orchestration.
- Keep roadmap checkboxes and package READMEs synchronized with implementation status.
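The auth enforcement default described above can be sketched as a small pure function. This is only an illustration: the variable name AUTH_ENFORCEMENT comes from this README, but the helper name `isAuthEnforced`, the accepted values ("true"/"false"), the use of NODE_ENV, and treating an unset NODE_ENV as development are all assumptions, not the project's actual implementation.

```typescript
// Hypothetical helper. Accepted values and the NODE_ENV convention are assumptions.
type Env = Record<string, string | undefined>;

function isAuthEnforced(env: Env): boolean {
  const explicit = env["AUTH_ENFORCEMENT"]?.trim().toLowerCase();

  // An explicit setting always wins over the environment default.
  if (explicit === "true") return true;
  if (explicit === "false") return false;

  // Unset: opt-in (off) in development and test, on everywhere else.
  // Treating a missing NODE_ENV as "development" is an assumption of this sketch.
  const nodeEnv = env["NODE_ENV"] ?? "development";
  return nodeEnv !== "development" && nodeEnv !== "test";
}
```

Under these assumptions, a production process with no AUTH_ENFORCEMENT set enforces auth, while a local developer has to opt in explicitly.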
The local RAG usage guide is documented in
docs/phase-1-local-rag.md.
The Ollama installation and integration guide is documented in
docs/ollama-setup.md.
The complete observability stack validation guide is documented in
docs/PHASE-6-OBSERVABILITY-VALIDATION.md.
Reference environment files live in .env.example,
apps/api/.env.example and
apps/web/.env.example. Node-side commands load
local .env files from the repository root, and app-specific files when
applicable; shell-exported variables keep priority.
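The precedence rule above (local .env files are loaded, but shell-exported variables keep priority) can be illustrated with a minimal merge. The function names `parseDotenv` and `resolveEnv` are hypothetical, the parser handles only plain KEY=VALUE lines (no quoting or interpolation), and letting the app-specific file override the root file is an assumption of this sketch.

```typescript
type EnvMap = Record<string, string>;

// Parse a minimal KEY=VALUE body; blank lines and # comments are skipped.
function parseDotenv(text: string): EnvMap {
  const out: EnvMap = {};
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // no key before "=": ignore the line
    out[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  return out;
}

// Later spreads override earlier ones, so shell-exported values are applied last.
function resolveEnv(rootEnv: string, appEnv: string, shell: EnvMap): EnvMap {
  return { ...parseDotenv(rootEnv), ...parseDotenv(appEnv), ...shell };
}
```

For example, `resolveEnv("PORT=1\n", "PORT=2\n", { PORT: "3" })` yields `PORT` equal to `"3"`, because the shell value is merged last.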
A complete observability stack is available for local development and debugging of distributed traces and metrics across the API and worker services.
Start the observability stack:
# Start core services (with OTEL export enabled by default)
docker-compose up -d
# Verify core services are healthy
docker-compose ps | grep -E "api|worker|postgres|redis"
# Start observability services (Jaeger, Prometheus, Grafana, OTEL Collector)
docker-compose --profile observability up -d
# Verify observability services are healthy
docker-compose ps | grep -E "otelcol|jaeger|prometheus|grafana"
Access the observability UIs:
- Jaeger UI (distributed traces): http://localhost:16686
- Prometheus UI (metrics, PromQL queries): http://localhost:9090
- Grafana UI (dashboards, alerts): http://localhost:3100 (default: admin/admin)
Trace export is automatically enabled:
- API service: OTEL_EXPORT_ENABLED=true → sends traces to OTEL Collector (http://otelcol:4318)
- Worker service: OTEL_EXPORT_ENABLED=true → sends traces to OTEL Collector
- OTEL Collector receives traces via OTLP gRPC (4317) / HTTP (4318) and exports to Jaeger
View traces in Jaeger:
- Open http://localhost:16686
- Select a service from the dropdown (e.g., "groundedos-api", "groundedos-worker")
- Click "Find Traces" to view recent requests
- Click a trace to inspect span hierarchy, timing, duration and attributes
Example trace flow for a /rag/ask request:
groundedos-api (span)
├── http.request (HTTP handler)
│   └── rag.ask.process (RAG pipeline)
│       ├── query_understanding (rewrite, expansion, intent)
│       ├── rag.retrieve (hybrid search, reranking)
│       └── llm.inference (if applicable)
└── (if worker job enqueued: traceparent propagated to groundedos-worker)
    └── job.process (model execution, experiment tracking)
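The traceparent value that links the API span to the worker span follows the W3C Trace Context header format (`version-traceid-parentid-flags`). The sketch below shows how such a header can be parsed and re-serialized. It is not the project's propagation code: in practice the OpenTelemetry SDK would normally handle this, and the names `parseTraceparent` and `formatTraceparent` are hypothetical.

```typescript
interface TraceContext {
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters (span id of the caller)
  sampled: boolean; // lowest bit of the trace-flags byte
}

// Only version "00" of the header is accepted in this sketch.
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT.exec(header.trim());
  if (!m) return null;
  const [, traceId, parentId, flags] = m;
  if (traceId === undefined || parentId === undefined || flags === undefined) return null;
  // All-zero trace or parent ids are invalid per the W3C Trace Context spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Serializes only the sampled flag; any other flag bits are dropped.
function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.parentId}-${ctx.sampled ? "01" : "00"}`;
}
```

A job payload could carry the formatted string so the worker can start `job.process` as a child of the originating trace; where exactly the project stores it is not shown here.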
Query metrics in Prometheus:
# Service health
up{job=~"groundedos-.*"}
# Request rate (requests/sec)
rate(http_requests_total[5m])
# Error rate (errors/sec)
rate(http_requests_total{status=~"5.."}[5m])
# P99 latency (milliseconds); histogram_quantile needs per-bucket rates grouped by le
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) * 1000
# Job queue depth (number of pending jobs)
job_queue_depth
# Cache hit ratio (0 to 1; multiply by 100 for a percentage)
rate(cache_hits_total[5m]) / rate(cache_requests_total[5m])
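To make the P99 latency query less of a black box, here is a simplified sketch of how a quantile is estimated from cumulative histogram buckets using linear interpolation within the matching bucket. It illustrates the idea behind histogram_quantile and is not Prometheus's exact algorithm: it works on raw cumulative counts rather than per-second rates, and the names `Bucket` and `histogramQuantile` are hypothetical.

```typescript
interface Bucket {
  le: number;    // upper bound in seconds; Infinity stands for the +Inf bucket
  count: number; // cumulative number of observations <= le
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const last = sorted[sorted.length - 1];
  if (last === undefined || last.count === 0) return NaN;

  const rank = q * last.count; // how many observations lie at or below the quantile
  let prevLe = 0;              // lower bound of the current bucket (first bucket starts at 0)
  let prevCount = 0;

  for (const b of sorted) {
    if (b.count >= rank) {
      // The quantile falls in the +Inf bucket: report the highest finite bound.
      if (b.le === Infinity) return prevLe;
      const inBucket = b.count - prevCount;
      if (inBucket === 0) return prevLe;
      // Assume observations are spread evenly inside the bucket.
      return prevLe + (b.le - prevLe) * ((rank - prevCount) / inBucket);
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return NaN;
}
```

The estimate is only as precise as the bucket boundaries, which is why latency buckets are usually chosen around the thresholds that matter for alerting.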
Create Grafana dashboards:
- New > Dashboard
- Add panel > Prometheus
- Use queries from above
- Visualize as graph, gauge, or table
- Save and set refresh interval (15s or 30s)
Configure alerts:
- Alert rules are defined in config/prometheus-alert-rules.yml
- Pre-configured alerts:
- API Alerts: APIDown (unavailable > 1 min), APIHighErrorRate (> 5% over 5 min), APIHighLatency (p99 > 1 sec over 5 min)
- Worker Alerts: WorkerDown, QueueDepthHigh, JobFailureRate
- Queue Hardening Alerts: DLQAccumulation (> 10 jobs in DLQ), LowJobSuccessRate (< 95% over 10 min), HighJobDurationP95 (> 5000ms over 5 min), HighRetryRate (> 20% over 5 min)
- Infrastructure Alerts: Database, cache, memory, disk
- To enable notifications, configure Alertmanager (future work)
- For queue-specific metrics and alerts, see docs/prometheus-grafana-setup.md
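The Queue Hardening thresholds listed above can be expressed as a small check, which is handy for reasoning about when each alert would fire. This is a sketch only: the real rules live in config/prometheus-alert-rules.yml, the evaluation windows ("over 5 min", "over 10 min") are not modeled, and the `QueueSnapshot` shape and `firingQueueAlerts` name are hypothetical.

```typescript
interface QueueSnapshot {
  dlqDepth: number;      // jobs currently in the dead-letter queue
  successRate: number;   // fraction between 0 and 1
  durationP95Ms: number; // 95th percentile job duration in milliseconds
  retryRate: number;     // fraction between 0 and 1
}

// Returns the names of the alerts whose thresholds are exceeded.
function firingQueueAlerts(s: QueueSnapshot): string[] {
  const firing: string[] = [];
  if (s.dlqDepth > 10) firing.push("DLQAccumulation");
  if (s.successRate < 0.95) firing.push("LowJobSuccessRate");
  if (s.durationP95Ms > 5000) firing.push("HighJobDurationP95");
  if (s.retryRate > 0.2) firing.push("HighRetryRate");
  return firing;
}
```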
See docs/PHASE-6-OBSERVABILITY-VALIDATION.md for the complete validation guide, troubleshooting and performance tuning.
Run a registered dataset through ETL, chunking, embeddings, in-memory vector search and Dev Mode retrieval output:
npm run rag:smoke -- --dataset phase-0-smoke-text --query "What does this command verify?"
Ask a grounded question against a local text or PDF file:
npm run rag:ask -- --file datasets/samples/phase-0-smoke.txt --type text --query "What does this command verify?"
Both commands print JSON containing the query, a simple grounded answer, retrieved chunk IDs, scores, source metadata and offsets.
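The in-memory vector search step that produces the retrieved chunk IDs and scores can be pictured as a cosine-similarity ranking over stored embeddings. The sketch below is a toy version under stated assumptions: the `IndexedChunk` and `ScoredChunk` shapes and the `topK` function are hypothetical, and the project's actual index, scoring and reranking may differ.

```typescript
interface IndexedChunk {
  chunkId: string;
  embedding: number[];
}

interface ScoredChunk {
  chunkId: string;
  score: number; // cosine similarity in [-1, 1]
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / Math.sqrt(normA * normB);
}

// Brute-force scan: score every chunk, sort descending, keep the best k.
function topK(query: number[], index: IndexedChunk[], k: number): ScoredChunk[] {
  return index
    .map((c) => ({ chunkId: c.chunkId, score: cosineSimilarity(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

A brute-force scan like this is adequate for small local indexes; larger corpora typically call for an approximate nearest-neighbor structure.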
Run the local API:
npm run api:dev
The first API slice exposes GET /health, POST /rag/index, POST /rag/ask,
GET /rag/indexes, and DELETE /rag/indexes/:documentId, with support for inline
JSON text, multipart text/PDF uploads, selectable local embedding providers and
persisted local indexes under .groundedos/indexes/.
The ollama embedding provider requires a running local Ollama server and an
embedding model such as embeddinggemma.
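As a quick reference for the endpoints listed above, the following sketch builds method and URL pairs for each route. Only the paths come from this README. The helper name `ragRoutes`, the `RequestSpec` shape and the base URL you pass in are assumptions, and request bodies are deliberately left out because their schema is not documented in this section.

```typescript
interface RequestSpec {
  method: "GET" | "POST" | "DELETE";
  url: string;
}

// Hypothetical route helper: maps the documented endpoints to method + URL.
function ragRoutes(baseUrl: string) {
  const base = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    health: (): RequestSpec => ({ method: "GET", url: `${base}/health` }),
    index: (): RequestSpec => ({ method: "POST", url: `${base}/rag/index` }),
    ask: (): RequestSpec => ({ method: "POST", url: `${base}/rag/ask` }),
    listIndexes: (): RequestSpec => ({ method: "GET", url: `${base}/rag/indexes` }),
    // documentId is URL-encoded so ids with spaces or slashes stay in one path segment.
    deleteIndex: (documentId: string): RequestSpec => ({
      method: "DELETE",
      url: `${base}/rag/indexes/${encodeURIComponent(documentId)}`,
    }),
  };
}
```

Each `RequestSpec` can then be handed to whatever HTTP client you prefer, together with a body that matches the API's actual contract.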
Run the local web surface in another terminal:
npm run web:dev
The web server listens on http://localhost:3000 and proxies requests to the
local API. Use the embedding provider selector for new inline/upload requests,
click Index to persist the current source, pick a saved index from the list,
then click Ask to query that saved local index by documentId.
Initial TypeScript workspace tooling is configured so Phase 0 packages can be validated locally.
Current stack:
| Layer | Tool | Purpose |
|---|---|---|
| Package manager | npm workspaces | Manage JS/TS packages |
| Type checking | TypeScript strict mode | Static analysis across JS/TS packages |
| Testing (JS/TS) | Vitest | Unit and integration tests |
Planned additions:
| Layer | Tool | Purpose |
|---|---|---|
| Build orchestration | Turborepo | Incremental builds, task pipelines |
| Python environment | Poetry (per package) | Isolate ML package dependencies |
| Linting (JS/TS) | ESLint + Prettier | Code style and formatting |
| Linting (Python) | Ruff | Fast Python linter |
| Testing (Python) | pytest | Unit and integration tests |
| Containers | Docker + docker-compose | Local environment |
| CI | GitHub Actions | Test, lint and build on every PR |
Repo conventions:
- All packages declare their dependencies explicitly — no implicit sharing
- Active TypeScript packages are validated through root build and test scripts
- The root package scripts define the current validation pipeline
- Python packages pin dependencies via pyproject.toml and poetry.lock
Anyone — from students exploring AI to engineers building production systems. Contributions at all levels are welcome: documentation, experiments, package implementations, bug reports, and feature ideas.
- Fork the repository and create a new branch from main
- Pick a phase from the Roadmap and check the success criteria
- Find or open an issue describing what you want to work on before starting large changes
- Follow the folder conventions: each package or experiment has its own README.md; keep it updated
| Type | Where | Notes |
|---|---|---|
| AI concept documentation | docs/concepts/ | Follow the template in docs/concepts/README.md |
| Architecture decision | docs/adr/ | Write an ADR before implementing a hard-to-reverse decision |
| Package implementation | packages/<name>/ | Start with the README.md in that package |
| Experiment | experiments/<name>/ | Include a reproducible notebook or script |
| Dataset | datasets/ | Include metadata (source, license, size) |
| Bug report / feature request | GitHub Issues | Use the issue templates |
- Write clear, self-contained code with inline comments for non-obvious logic
- Every package must have at least one test before being merged
- Use the tooling defined in ⚙️ Monorepo Tooling
- All Python code must include type hints
- All TypeScript code must pass strict type checking
- Branch is up to date with main
- Code follows the style guide for the language (see Monorepo Tooling)
- Tests pass locally
- Roadmap phase impact is documented in the PR description
- README.md is updated when phase status/scope changed
- The impacted module README.md is updated
- The relevant docs/* contract/runbook/phase file is updated
- Docs follow the policy in docs/documentation-governance.md
- The PR description explains what changed and why
Open a GitHub Discussion or comment on an existing issue. No question is too basic.
GroundedOS Lab exists to help developers:
- move beyond basic AI usage
- understand real-world AI systems
- build reliable and observable pipelines
- experiment safely and systematically
- Not a wrapper around an LLM API
- Not just a chatbot interface
- Not a toy project
This project focuses on system design, reliability, and real-world AI engineering.
This is not a demo.
This is a laboratory for understanding grounded AI systems in production.
