Normalize script names; enforce numeric ordering, restore missing chapters, and strip embedded non‑ToC links #2
Open
samuelymh wants to merge 35 commits into FaustoS88:main from
Conversation
…nd tests
- Add server scaffolding with health endpoint and placeholder routes
- Implement adaptive chunking models and 3-pattern code detection
- Configure auth stubs, rate limiting, and comprehensive test suite
- All 28 tests passing, ready for Step 2 ingestion
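A minimal sketch of what that 3-pattern detection could look like, assuming the three patterns are fenced blocks, inline backticks, and the Pine Script version marker (the ingestion commit below lists exactly these); the function and pattern names here are illustrative, not the repo's actual code:

```python
import re

FENCE = "`" * 3  # triple backtick, built up so this example's own fencing stays intact

# Illustrative patterns; the repo's detector may differ.
FENCE_RE = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
INLINE_RE = re.compile(r"`[^`\n]+`")                    # single-backtick inline code
PINE_RE = re.compile(r"^//@version=\d+", re.MULTILINE)  # Pine Script version marker

def contains_code(text: str) -> bool:
    """True if any of the three code patterns appears in the text."""
    return bool(FENCE_RE.search(text) or INLINE_RE.search(text) or PINE_RE.search(text))
```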
Implement complete document ingestion pipeline with scanning, parsing, chunking, embedding generation, and Supabase persistence.

New Components:
- supabase_client.py: Database operations (upsert, manifest, stats)
- embed_client.py: OpenAI embeddings with batching and retry logic
- ingest.py: Main pipeline orchestrating scan → parse → embed → index
- SUPABASE_SCHEMA.md: Complete database schema documentation

Features:
- Smart document chunking by H1/H2 headings (>1500 tokens, 150 overlap)
- Code pattern detection (triple backticks, single backticks, Pine marker)
- Deterministic document IDs (SHA256 of filename + chunk_index)
- Incremental indexing with manifest-based change detection
- Batch embedding generation (100 texts/batch) with tenacity retries
- Full reindex support with data clearing
- Cost estimation before embedding generation

Endpoint Updates:
- /internal/index: Fully implemented with incremental/full reindex
- /status: Now queries actual document stats from Supabase

Tests:
- test_ingest.py: 18 tests for parsing, chunking, manifest diffing
- test_embed_client.py: 20 tests for batching, retries, cost estimation
- Updated test_app_health.py for Step 2 behavior

All 65 tests passing with comprehensive coverage.
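For illustration, the deterministic-ID scheme could look like the sketch below. Only "SHA256 of filename + chunk_index" comes from the commit message; the function name and the `:` separator are assumptions:

```python
import hashlib

def make_doc_id(filename: str, chunk_index: int) -> str:
    """Derive a stable document ID so re-ingesting the same chunk upserts
    rather than duplicates. The exact encoding/separator is an assumption."""
    raw = f"{filename}:{chunk_index}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()
```

Because the ID depends only on the file and chunk position, incremental reindexing can overwrite a changed chunk in place instead of accumulating stale rows.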
…ting by token limit (3 chars/token)
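Assuming the ~3 chars/token heuristic named in this commit, a hypothetical splitter might look like the following; the paragraph-level strategy and helper names are illustrative, not the repo's code:

```python
def estimate_tokens(text: str) -> int:
    """Cheap token estimate using the ~3 chars/token heuristic from the commit."""
    return max(1, len(text) // 3)

def split_by_token_limit(text: str, max_tokens: int = 1500) -> list[str]:
    """Greedy paragraph-level split keeping each piece under the token budget.
    A single oversized paragraph is emitted as-is rather than broken mid-text."""
    pieces, current = [], ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if estimate_tokens(candidate) > max_tokens and current:
            pieces.append(current)
            current = para
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces
```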
… tests

Summary: Implements core RAG retrieval and LLM client pieces (vector + BM25 fallback, hybrid merging, context assembly with token budgeting and code-aware trimming), plus unit tests and configuration tuning parameters. Returns structured provenance derived from retrieval metadata for reliable source attribution.

Files Added:
- server/retriever.py — vector_search(), bm25_search(), hybrid_search(), assemble_context(), _trim_content_for_budget()
- server/llm_client.py — system prompt builder, model selection, chat completion wrapper, resilient SDK parsing, structured sources output

Files Modified:
- server/utils.py — added estimate_prompt_tokens()
- server/models.py — added RetrievedDocument DTO
- server/config.py — added retrieval/LLM tuning fields
- server/tests/* — added tests for retriever and llm client

Notes:
- Vector search uses existing supabase_client.search_similar_documents RPC.
- BM25 fallback is a high-level Supabase full-text search wrapper.
- hybrid_search normalizes and merges scores using configurable weight.
- assemble_context enforces prompt token budget and avoids splitting triple-backtick code fences when trimming.
- Provenance is returned as structured `sources` derived from retrieval metadata.

Next steps:
- Wire these modules into the /chat endpoint (Step 4).
- Optionally implement LLM-emitted JSON provenance parsing and heading-aware trimming.
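A sketch of the score-merging described in the notes (normalize each result set, then blend with a configurable weight). The dict shapes, field names, and default weight are assumptions, not server/retriever.py's actual code:

```python
def merge_hybrid_scores(vector_hits, bm25_hits, vector_weight: float = 0.7):
    """Min-max normalize each result set's scores to [0, 1], then blend them
    with a configurable weight and rank by the combined score."""
    def normalize(hits):
        if not hits:
            return {}
        scores = [h["score"] for h in hits]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid division by zero when all scores tie
        return {h["id"]: (h["score"] - lo) / span for h in hits}

    v, b = normalize(vector_hits), normalize(bm25_hits)
    merged = {
        doc_id: vector_weight * v.get(doc_id, 0.0) + (1 - vector_weight) * b.get(doc_id, 0.0)
        for doc_id in set(v) | set(b)
    }
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
```

Normalizing before blending matters because cosine similarities and full-text rank scores live on different scales; without it one retriever would dominate regardless of the weight.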
…orn env config
- Merge match_documents into migrations/0001_create_documents_and_file_manifest.sql
- Remove redundant migrations/0002_create_match_documents.sql
- Add GUNICORN_CMD_ARGS to Dockerfile (default --timeout 180)
- Add support with default
- Update for needed runtime packages
… rate-limit key
- Implement /chat flow (embedding -> retrieval -> LLM)
- Add JWT verification (HS256/JWKS) and JWKS caching
- Use JWT sub for rate-limiter key when available
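A hedged sketch of the "JWT sub as rate-limiter key" behavior, assuming a Starlette/FastAPI-style request object; the attribute names and fallback are assumptions about this repo:

```python
def rate_limit_key(request) -> str:
    """Prefer the verified JWT's `sub` claim as the limiter key so limits are
    per-user, falling back to client IP for anonymous or unverified requests."""
    claims = getattr(request.state, "jwt_claims", None)  # assumed to be set by auth middleware
    if claims and claims.get("sub"):
        return f"user:{claims['sub']}"
    return f"ip:{request.client.host}"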
- Add and material
- Add integration tests for API flows (mocked external services)
…ical and README a pointer; mark STARTUP.md as developer-focused
feat(rag): add RAG API, ingestion pipeline, DB migration, and deployment docs
Add /health endpoint for platform health checks
Add retry/backoff for Supabase RPC vector search to handle transient StreamReset errors
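The retry shape this commit describes, sketched with tenacity (which the ingestion commit already uses for embeddings). The exception type and RPC parameter names are assumptions, with `ConnectionError` standing in for the transient stream errors:

```python
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(ConnectionError),  # stand-in for the transient StreamReset errors
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=0.5, max=8),
)
def search_similar_documents(client, query_embedding, match_count: int = 8):
    """Call the match_documents RPC with exponential backoff on transient failures.
    The RPC name matches the migration; parameter names are assumptions."""
    return client.rpc(
        "match_documents",
        {"query_embedding": query_embedding, "match_count": match_count},
    ).execute()
```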
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…flow
ci: add workflow to generate processed docs
This PR refines the existing PineScript docs pipeline by renaming scripts for consistent numeric ordering and updating the processing logic so outputs are deterministic and readable.
Key changes
- Rename pipeline scripts so their numeric prefixes match the intended processing order (sketched below)
- Restore chapters that were missing from the processed output
- Strip embedded links that are not part of the table of contents
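A sketch of the numeric-ordering idea from the first bullet, using the processed_N_*.md naming shown in the verify section below; the helper is illustrative, not the pipeline's actual code:

```python
import re
from pathlib import Path

def numeric_order(path: Path) -> tuple[int, str]:
    """Sort key: leading integer prefix first (processed_2_* before processed_10_*),
    then the name itself, so lexicographic quirks can't reorder chapters."""
    match = re.match(r"processed_(\d+)_", path.name)
    return (int(match.group(1)) if match else 10**9, path.name)

for doc in sorted(Path(".").glob("processed_*.md"), key=numeric_order):
    print(doc.name)
```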
Why this matters
- Deterministic, numerically ordered outputs are easier to review and reprocess
- Removing non‑ToC embedded links yields cleaner processed docs
How to verify (copy/paste friendly)
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python 3_scrap_and_process.py
```

Confirm the outputs (processed_1_*.md, processed_2_*.md, ...).

Notes for reviewers
Suggested labels & milestone
Changelog (one-liner)
Normalize script names and processing order; restore missing chapters and remove non‑ToC embedded links for cleaner processed docs.