The Episodic Memory Platform (EMP) is a production-style AI backend designed to model user knowledge as time-stamped semantic memory events. It leverages the high-performance Endee vector database for metadata-aware semantic search, time-aware re-ranking, and episodic Retrieval-Augmented Generation (RAG).
This project is submitted for the RAGxthon 2026 Hackathon.
```mermaid
flowchart TD
    %% User Inputs
    UserQuery[/"👤 User Query"/] --> Router{"Agent Intent Classification\n(Recall / Summarize / Recommend)"}

    %% Ingestion Flow
    MemInput[/"📝/🖼️ Memory Input\n(Text or Image)"/] -.-> |If Image| VisionLLM["Vision LLM\n(Gemini Flash)"]
    MemInput -.-> |If Text| Embeddings1
    VisionLLM --> |Descriptive Text| Embeddings1

    %% Embedding
    Router --> Embeddings2["🧮 Embeddings Model\n(Llama 3 Nemotron)"]
    Embeddings1["🧮 Embeddings Model\n(Llama 3 Nemotron)"] --> VDB[("🗄️ Endee Vector Database\n(Custom C++ / SIMD)")]

    %% Retrieval Pipeline
    Embeddings2 --> Filter[("RoaringBitmap\nMetadata Filters")]
    Filter -. "(tags, etc)" .-> VDB
    VDB --> |"HNSW Semantic Search"| TopK["Raw Top-K Retrieval"]
    TopK --> TimeDecay["⏳ Time-Decay Re-ranking\n(Exponential Decay Formula)"]
    TimeDecay --> EpisodicGroup["📅 Episodic Grouping\n(Temporal Clustering)"]

    %% Generation
    EpisodicGroup --> LLM["🧠 LLM Controller\n(Gemma 3)"]
    LLM --> Response[/"💬 Grounded Response"/]

    %% Async Processes
    VDB -.-> Reflection["🌙 Memory Reflection\n(Background Consolidation)"]
    Reflection -.-> |"Synthesized Insights"| VDB
```
Our implementation extends standard RAG by adding temporal context, intent routing, and cognitive consolidation to solve the "context fragmentation" problem in typical LLM systems:
- Data Source (Episodic Memories): Memories (text, or images converted to text via a Vision LLM) are ingested as discrete events with timestamps and categorical metadata tags.
- Embeddings: Text is vectorized using `nvidia/llama-nemotron-embed` via OpenRouter.
- Vector Database: We use Endee, a highly optimized, SIMD-accelerated open-source C++ vector database. It executes pre-filtering via RoaringBitmaps before HNSW graph traversal, which keeps our metadata-heavy queries fast.
- Retrieval (3-Stage Process):
  - Semantic Search: Fast, hardware-accelerated top-K retrieval in Endee.
  - Time-Decay Re-ranking: Re-weights semantic similarity scores using an exponential decay function (`e^(-λ·days_old)`), surfacing memories that are both relevant and recent.
  - Episodic Grouping: Clusters temporally adjacent memories into chronological "Episodes" to maintain narrative flow.
- LLM Generation: The grouped episodes form a strict grounding prompt for `google/gemma-3-27b-it`, keeping the output anchored to retrieved memories rather than model priors.
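The re-ranking and grouping stages above can be sketched in a few lines. The decay constant and the six-hour episode gap below are illustrative assumptions, not the platform's actual parameters:

```python
import math

# Illustrative decay constant (lambda), not the platform's real setting.
DECAY_LAMBDA = 0.05

def time_decay_score(similarity: float, days_old: float, lam: float = DECAY_LAMBDA) -> float:
    """Re-weight a raw semantic similarity: score = similarity * e^(-lambda * days_old)."""
    return similarity * math.exp(-lam * max(0.0, days_old))

def group_into_episodes(memories: list[dict], gap_hours: float = 6.0) -> list[list[dict]]:
    """Cluster time-sorted memories into episodes: start a new episode
    whenever the gap to the previous memory exceeds `gap_hours`.
    The 6-hour threshold is an assumption for this sketch."""
    episodes: list[list[dict]] = []
    for mem in sorted(memories, key=lambda m: m["timestamp"]):
        if episodes and mem["timestamp"] - episodes[-1][-1]["timestamp"] <= gap_hours * 3600:
            episodes[-1].append(mem)  # close in time: extend the current episode
        else:
            episodes.append([mem])  # large gap: begin a new episode
    return episodes
```

With this weighting, an older memory needs a proportionally higher raw similarity to outrank a recent one, which is exactly the "relevant and recent" trade-off described above.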
Unlike static corpora (like PDFs or Wikipedia dumps), EMP relies on a dynamically growing dataset of user-generated events.
A curated dataset representing 30 days of simulated user memories (tech notes, personal updates, reflections) is included in the project. Reviewers can instantly load this dataset using our seed script to demonstrate the platform's time-awareness and analytical capabilities without needing to manually generate an entire history.
Modern LLM applications often struggle with "context fragmentation" or simply lack a coherent, time-aware memory of past user interactions. Generic "chat with your data" systems rely on naive semantic similarity (e.g., top-K cosine distance), which completely ignores:
- When a memory occurred (temporal context).
- The overarching "episode" or chronological theme tying multiple memories together.
- Metadata categorizations (e.g., extracting intent before querying).
EMP solves this by treating memory not just as text, but as a time-stamped, metadata-rich event. By grouping retrieved memories into temporal "episodes" before passing them to the LLM, the model generates responses that are fundamentally grounded in the user's personal timeline.
Vector databases are essential here because memory retrieval is inherently semantic. When a user asks, "What did we decide about the architecture?", keyword searches will fail if the memory states, "We are going to structure the backend using Domain-Driven Design."
We need a system that can:
- Understand Meaning: Convert text into dense mathematical vectors (embeddings) to measure conceptual proximity.
- Filter by Metadata: Rapidly narrow the search space using categorical tags (e.g., `tag: "architecture"`) before doing the heavy vector math.
- Scale: Handle thousands to millions of distinct memory events in milliseconds.
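The "conceptual proximity" point can be made concrete: embeddings map both the query and the memory to dense vectors, and cosine similarity scores them even with zero keyword overlap. A minimal sketch with toy vectors (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: the standard measure of 'conceptual proximity'
    between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

This is why "decide about the architecture" can still match "structure the backend using Domain-Driven Design": the two sentences land near each other in embedding space despite sharing no keywords.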
Endee is central to this platform. EMP relies on Endee's highly optimized C++ core, which uses SIMD instructions (AVX2/NEON) for high performance.
- Ingestion: When a memory is inserted, the text is vectorized locally or via an API (OpenRouter embeddings). The vector is stored in Endee alongside metadata: `text`, `timestamp` (numeric integer), and `tags` (string list).
- Filtering: EMP uses Endee's JSON-based filtering (e.g., `{$in: [...]}`). Endee's pre-filtering architecture executes these tag filters first (using RoaringBitmaps), keeping retrieval fast even for narrow filters.
- Retrieval Logic: EMP queries Endee with the semantic vector to fetch the top `K` most relevant memories matching the metadata filters. Endee switches to a brute-force exact search when the filtered subset is very small, bypassing HNSW graph overhead.
- Time-Aware Ranking: After Endee returns the semantic top `K`, EMP applies an exponential decay function based on the `timestamp` metadata, so a memory from five minutes ago ranks higher than an equally relevant memory from five months ago.
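As a sketch, a tag filter in the `$in` style above might be assembled like this. The field name and payload shape are assumptions for illustration, not Endee's documented request format:

```python
def build_tag_filter(tags: list[str]) -> dict:
    """Build a metadata filter in the `$in` style described above.
    The 'tags' field name and the overall payload shape are assumed
    here for illustration; check Endee's docs for the exact format."""
    return {"tags": {"$in": list(tags)}}
```

Keeping filter construction in one helper makes it easy to adapt if the database's filter grammar changes.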
The system uses a clean, modular Python backend:
- FastAPI: Provides the asynchronous REST API.
- Endee HTTP Client: A dedicated wrapper to orchestrate Endee index creation, upserts, and filtered queries.
- Embedding & LLM Clients: Connect to OpenRouter to fetch text embeddings (`nvidia/llama-nemotron-embed...`) and generate strict, grounded chat responses (`google/gemma-3...`).
- Memory Service: Manages Endee interactions, ingestion, metadata preparation, and the time-decay algorithm.
- Agent Service: Analyzes incoming query intents (`recall`, `summarize`, `recommend`), applies chronological clustering to form "Context Episodes", and formats the strict grounding LLM prompt.
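A minimal sketch of the Agent Service's prompt-formatting step. The wording and layout of the prompt here are invented for illustration; the README does not show the actual template:

```python
from datetime import datetime, timezone

def build_grounding_prompt(episodes: list[list[dict]]) -> str:
    """Format clustered episodes into a strict grounding prompt.
    Prompt wording and layout are illustrative, not the real template."""
    lines = [
        "Answer ONLY from the memories below.",
        "If they do not contain the answer, say you don't know.",
        "",
    ]
    for i, episode in enumerate(episodes, start=1):
        lines.append(f"Episode {i}:")
        for mem in episode:
            # Each memory carries a numeric `timestamp`; render it as a date.
            day = datetime.fromtimestamp(mem["timestamp"], tz=timezone.utc).date().isoformat()
            lines.append(f"- [{day}] {mem['text']}")
        lines.append("")
    return "\n".join(lines)
```

Presenting memories grouped by episode, each stamped with its date, is what lets the LLM answer in terms of the user's timeline rather than a flat bag of snippets.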
Ensure you have Docker and Docker Compose installed.
Create a .env file in the root directory and add your OpenRouter API key:
```
OPENROUTER_API_KEY=your_openrouter_api_key_here
```

You can run the entire Episodic Memory Platform (Endee Vector DB + FastAPI Backend + Streamlit UI) with a single command:

```bash
docker compose up --build
```

The services will be available at:
- Frontend UI: `http://localhost:8501`
- Backend API: `http://localhost:8000`
- Endee VDB: `http://localhost:8080` (internal)
To see the system in action with a realistic, 30-day timeline of episodic memories, run the seed script:
```bash
# If running locally (not in Docker):
python scripts/seed_data.py

# If running via Docker Compose:
docker compose exec emp-backend python scripts/seed_data.py
```

After seeding, refresh the frontend UI to interact with the Memory Map and Agent features.
You can interact with the API directly using curl or by writing a simple client script.
1. Ingesting a Memory
```bash
curl -X POST "http://localhost:8000/api/memory/add" \
  -H "Content-Type: application/json" \
  -d '{"text": "I set up the Endee database for the memory platform.", "tags": ["tech"]}'
```

2. Querying the Agent
```bash
curl -X POST "http://localhost:8000/api/agent/ask" \
  -H "Content-Type: application/json" \
  -d '{"user_input": "What did I do today regarding the database?", "intent": "recall"}'
```