
Normalize script names; enforce numeric ordering, restore missing chapters, and strip embedded non‑ToC links #2

Open

samuelymh wants to merge 35 commits into FaustoS88:main from samuelymh:main

Conversation

@samuelymh

This PR refines the existing PineScript docs pipeline by renaming scripts for consistent numeric ordering and updating the processing logic so outputs are deterministic and readable.

Key changes

  • Renamed/normalized script filenames to a numeric ordering convention (e.g., 1_scrap_docs.py, 2_process_docs.py, 3_scrap_and_process.py) to make the processing order explicit.
  • Changed the processing logic so scraped source files and processed outputs are sorted and handled numerically (a sketch follows this list). This guarantees the final combined file (processed_all_docs.md) contains chapters in the intended order.
  • Fixed an issue where some chapters were skipped when pages lacked enough content; processing now includes these chapters reliably.
  • Removed embedded links that were not part of the Table of Contents from the main content so processed Markdown focuses on content, not inline navigation links.
  • Updated README.md where needed and committed the regenerated processed outputs under pinescript_docs as a snapshot for review.
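
As context for the numeric-sorting change, here is a minimal sketch of the kind of sort key involved (the directory name and glob pattern are illustrative, not taken from the repo):

```python
import re
from pathlib import Path

def numeric_key(path: Path) -> int:
    """Extract the embedded chapter number so 10 sorts after 2, not after 1."""
    match = re.search(r"\d+", path.name)
    return int(match.group()) if match else 0

# A plain lexicographic sort places processed_10_*.md before processed_2_*.md;
# sorting on the extracted integer restores the intended chapter order.
files = sorted(Path("pinescript_docs").glob("processed_*.md"), key=numeric_key)
```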

Why this matters

  • Deterministic ordering makes reviews and diffs meaningful and prevents content reordering noise in future runs.
  • Restoring missing chapters prevents incomplete documentation when source pages have sparse content.
  • Stripping non-ToC embedded links improves readability of the processed documentation and reduces spurious inline link noise for downstream tooling.

How to verify (copy/paste friendly)

  1. Install deps and run the pipeline (zsh):

     python -m venv .venv
     source .venv/bin/activate
     pip install -r requirements.txt
     python 3_scrap_and_process.py

  2. Confirm ordering:
  • Open the processed output files and verify they are named/ordered numerically (e.g., processed_1_*.md, processed_2_*.md, ...).
  • Open processed_all_docs.md and confirm chapters appear in numeric order.
  3. Confirm missing chapters are present:
  • Compare against a prior snapshot (if available) and spot-check that previously missing chapters are now included.
  4. Confirm embedded-link removal (a search sketch follows this list):
  • In a few processed files and in processed_all_docs.md, search for anchor-style in-content links that are not part of the ToC (e.g., inline "See section" links) and verify they are removed.
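
A minimal sketch of that last spot-check, assuming generic Markdown inline links; the pipeline's actual pattern may be narrower and deliberately spare ToC entries:

```python
import re
from pathlib import Path

# Generic [text](url) Markdown links; the real stripping logic may differ.
INLINE_LINK = re.compile(r"\[[^\]]+\]\([^)]+\)")

for path in Path("pinescript_docs").glob("processed_*.md"):
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        for match in INLINE_LINK.finditer(line):
            print(f"{path.name}:{lineno}: {match.group()}")
```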

Notes for reviewers

  • Focus on the processing changes that implement numeric sorting and the logic that detects/keeps low-content pages.
  • Check that legitimate inline links (e.g., links inside code examples or references) are preserved, while non-ToC navigation links are removed.
  • Consider whether the numeric naming should be a formal CLI option or configurable in a follow-up.

Suggested labels & milestone

  • Labels: enhancement, docs, cleanup
  • Milestone: docs-improvements

Changelog (one-liner)
Normalize script names and processing order; restore missing chapters and remove non‑ToC embedded links for cleaner processed docs.

samuelymh and others added 30 commits October 31, 2025 11:37
…nd tests

- Add server scaffolding with health endpoint and placeholder routes
- Implement adaptive chunking models and 3-pattern code detection
- Configure auth stubs, rate limiting, and comprehensive test suite
- All 28 tests passing, ready for Step 2 ingestion
Implement complete document ingestion pipeline with scanning, parsing,
chunking, embedding generation, and Supabase persistence.

New Components:
- supabase_client.py: Database operations (upsert, manifest, stats)
- embed_client.py: OpenAI embeddings with batching and retry logic
- ingest.py: Main pipeline orchestrating scan → parse → embed → index
- SUPABASE_SCHEMA.md: Complete database schema documentation

Features:
- Smart document chunking by H1/H2 headings (>1500 tokens, 150 overlap)
- Code pattern detection (triple backticks, single backticks, Pine marker)
- Deterministic document IDs (SHA256 of filename + chunk_index)
- Incremental indexing with manifest-based change detection
- Batch embedding generation (100 texts/batch) with tenacity retries
- Full reindex support with data clearing
- Cost estimation before embedding generation

Endpoint Updates:
- /internal/index: Fully implemented with incremental/full reindex
- /status: Now queries actual document stats from Supabase

Tests:
- test_ingest.py: 18 tests for parsing, chunking, manifest diffing
- test_embed_client.py: 20 tests for batching, retries, cost estimation
- Updated test_app_health.py for Step 2 behavior

All 65 tests passing with comprehensive coverage.
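
(Reading aid, not part of the commit: a minimal sketch of a deterministic ID scheme like the one described above; the exact concatenation and encoding are assumptions, see ingest.py for the real code.)

```python
import hashlib

def document_id(filename: str, chunk_index: int) -> str:
    """Stable ID: the same chunk always maps to the same row on re-ingestion."""
    return hashlib.sha256(f"{filename}:{chunk_index}".encode()).hexdigest()

# Identical inputs yield identical IDs, making incremental indexing idempotent.
assert document_id("processed_1_intro.md", 0) == document_id("processed_1_intro.md", 0)
```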
… tests

Summary:
Implements core RAG retrieval and LLM client pieces (vector + BM25 fallback, hybrid merging, context assembly with token budgeting and code-aware trimming), plus unit tests and configuration tuning parameters. Returns structured provenance derived from retrieval metadata for reliable source attribution.

Files Added:
- server/retriever.py — vector_search(), bm25_search(), hybrid_search(), assemble_context(), _trim_content_for_budget()
- server/llm_client.py — system prompt builder, model selection, chat completion wrapper, resilient SDK parsing, structured sources output

Files Modified:
- server/utils.py — added estimate_prompt_tokens()
- server/models.py — added RetrievedDocument DTO
- server/config.py — added retrieval/LLM tuning fields
- server/tests/* — added tests for retriever and llm client

Notes:
- Vector search uses existing supabase_client.search_similar_documents RPC.
- BM25 fallback is a high-level Supabase full-text search wrapper.
- hybrid_search normalizes and merges scores using configurable weight.
- assemble_context enforces prompt token budget and avoids splitting triple-backtick code fences when trimming.
- Provenance is returned as structured `sources` derived from retrieval metadata.

Next steps:
- Wire these modules into the /chat endpoint (Step 4).
- Optionally implement LLM-emitted JSON provenance parsing and heading-aware trimming.
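
(Reading aid, not part of the commit: a hedged sketch of the weighted score merge the notes describe; min-max normalization and the parameter name vector_weight are assumptions, see server/retriever.py for the actual logic.)

```python
def merge_scores(vector_hits: dict[str, float],
                 bm25_hits: dict[str, float],
                 vector_weight: float = 0.7) -> dict[str, float]:
    """Normalize each score set to [0, 1], then blend with a configurable weight."""
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc_id: (s - lo) / span for doc_id, s in scores.items()}

    vec, bm = normalize(vector_hits), normalize(bm25_hits)
    return {doc_id: vector_weight * vec.get(doc_id, 0.0)
                    + (1 - vector_weight) * bm.get(doc_id, 0.0)
            for doc_id in vec.keys() | bm.keys()}
```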
…orn env config

- Merge match_documents into migrations/0001_create_documents_and_file_manifest.sql
- Remove redundant migrations/0002_create_match_documents.sql
- Add GUNICORN_CMD_ARGS to Dockerfile (default --timeout 180)
- Add  support with default
- Update  for needed runtime packages
… rate-limit key

- Implement /chat flow (embedding -> retrieval -> LLM)
- Add JWT verification (HS256/JWKS) and JWKS caching
- Use JWT sub for rate-limiter key when available
- Add  and  material
- Add integration tests for API flows (mocked external services)
…ical and README a pointer; mark STARTUP.md as developer-focused
feat(rag): add RAG API, ingestion pipeline, DB migration, and deployment docs
Add /health endpoint for platform health checks
Add retry/backoff for Supabase RPC vector search to handle transient …
OdynBrouwer and others added 5 commits January 19, 2026 14:47
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…flow

ci: add workflow to generate processed docs