
Normalize script names; enforce numeric ordering, restore missing chapters, and strip embedded non‑ToC links #2

Open

samuelymh wants to merge 35 commits into FaustoS88:main from samuelymh:main

Conversation

@samuelymh

This PR refines the existing PineScript docs pipeline by renaming scripts for consistent numeric ordering and updating the processing logic so outputs are deterministic and readable.

Key changes

  • Renamed/normalized script filenames to a numeric ordering convention (e.g., 1_scrap_docs.py, 2_process_docs.py, 3_scrap_and_process.py) to make the processing order explicit.
  • Changed the processing logic so scraped source files and processed outputs are sorted and handled numerically (a sketch follows this list). This guarantees the final combined file (processed_all_docs.md) contains chapters in the intended order.
  • Fixed an issue where some chapters were skipped when pages lacked enough content; processing now includes these chapters reliably.
  • Removed embedded links that were not part of the Table of Contents from the main content so processed Markdown focuses on content, not inline navigation links.
  • Updated README.md where needed and committed the regenerated processed outputs under pinescript_docs as a snapshot for review.
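
As context for the numeric-sorting change, here is a minimal sketch of the kind of sort key involved (the directory name and glob pattern are illustrative, not taken from the repo):

```python
import re
from pathlib import Path

def numeric_key(path: Path) -> int:
    """Extract the embedded chapter number so 10 sorts after 2, not after 1."""
    match = re.search(r"\d+", path.name)
    return int(match.group()) if match else 0

# A plain lexicographic sort places processed_10_*.md before processed_2_*.md;
# sorting on the extracted integer restores the intended chapter order.
files = sorted(Path("pinescript_docs").glob("processed_*.md"), key=numeric_key)
```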

Why this matters

  • Deterministic ordering makes reviews and diffs meaningful and prevents content reordering noise in future runs.
  • Restoring missing chapters prevents incomplete documentation when source pages have sparse content.
  • Stripping non-ToC embedded links improves readability of the processed documentation and reduces spurious inline link noise for downstream tooling.

How to verify (copy/paste friendly)

  1. Install deps and run the pipeline (zsh):

     python -m venv .venv
     source .venv/bin/activate
     pip install -r requirements.txt
     python 3_scrap_and_process.py

  2. Confirm ordering:
  • Open the processed output files and verify they are named/ordered numerically (e.g., processed_1_*.md, processed_2_*.md, ...).
  • Open processed_all_docs.md and confirm chapters appear in numeric order.
  3. Confirm missing chapters are present:
  • Compare against a prior snapshot (if available) and spot-check that previously missing chapters are now included.
  4. Confirm embedded-link removal (a search sketch follows this list):
  • In a few processed files and in processed_all_docs.md, search for anchor-style in-content links that are not part of the ToC (e.g., inline "See section" links) and verify they are removed.
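
A minimal sketch of that last spot-check, assuming generic Markdown inline links; the pipeline's actual pattern may be narrower and deliberately spare ToC entries:

```python
import re
from pathlib import Path

# Generic [text](url) Markdown links; the real stripping logic may differ.
INLINE_LINK = re.compile(r"\[[^\]]+\]\([^)]+\)")

for path in Path("pinescript_docs").glob("processed_*.md"):
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        for match in INLINE_LINK.finditer(line):
            print(f"{path.name}:{lineno}: {match.group()}")
```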

Notes for reviewers

  • Focus on the processing changes that implement numeric sorting and the logic that detects/keeps low-content pages.
  • Check that legitimate inline links (e.g., links inside code examples or references) are preserved, while non-ToC navigation links are removed.
  • Consider whether the numeric naming should be a formal CLI option or configurable in a follow-up.

Suggested labels & milestone

  • Labels: enhancement, docs, cleanup
  • Milestone: docs-improvements

Changelog (one-liner)
Normalize script names and processing order; restore missing chapters and remove non‑ToC embedded links for cleaner processed docs.

samuelymh and others added 30 commits October 31, 2025 11:37
…nd tests

- Add server scaffolding with health endpoint and placeholder routes
- Implement adaptive chunking models and 3-pattern code detection
- Configure auth stubs, rate limiting, and comprehensive test suite
- All 28 tests passing, ready for Step 2 ingestion
Implement complete document ingestion pipeline with scanning, parsing,
chunking, embedding generation, and Supabase persistence.

New Components:
- supabase_client.py: Database operations (upsert, manifest, stats)
- embed_client.py: OpenAI embeddings with batching and retry logic
- ingest.py: Main pipeline orchestrating scan → parse → embed → index
- SUPABASE_SCHEMA.md: Complete database schema documentation

Features:
- Smart document chunking by H1/H2 headings (>1500 tokens, 150 overlap)
- Code pattern detection (triple backticks, single backticks, Pine marker)
- Deterministic document IDs (SHA256 of filename + chunk_index)
- Incremental indexing with manifest-based change detection
- Batch embedding generation (100 texts/batch) with tenacity retries
- Full reindex support with data clearing
- Cost estimation before embedding generation

Endpoint Updates:
- /internal/index: Fully implemented with incremental/full reindex
- /status: Now queries actual document stats from Supabase

Tests:
- test_ingest.py: 18 tests for parsing, chunking, manifest diffing
- test_embed_client.py: 20 tests for batching, retries, cost estimation
- Updated test_app_health.py for Step 2 behavior

All 65 tests passing with comprehensive coverage.
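
(Reading aid, not part of the commit: a minimal sketch of a deterministic ID scheme like the one described above; the exact concatenation and encoding are assumptions, see ingest.py for the real code.)

```python
import hashlib

def document_id(filename: str, chunk_index: int) -> str:
    """Stable ID: the same chunk always maps to the same row on re-ingestion."""
    return hashlib.sha256(f"{filename}:{chunk_index}".encode()).hexdigest()

# Identical inputs yield identical IDs, making incremental indexing idempotent.
assert document_id("processed_1_intro.md", 0) == document_id("processed_1_intro.md", 0)
```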
… tests

Summary:
Implements core RAG retrieval and LLM client pieces (vector + BM25 fallback, hybrid merging, context assembly with token budgeting and code-aware trimming), plus unit tests and configuration tuning parameters. Returns structured provenance derived from retrieval metadata for reliable source attribution.

Files Added:
- server/retriever.py — vector_search(), bm25_search(), hybrid_search(), assemble_context(), _trim_content_for_budget()
- server/llm_client.py — system prompt builder, model selection, chat completion wrapper, resilient SDK parsing, structured sources output

Files Modified:
- server/utils.py — added estimate_prompt_tokens()
- server/models.py — added RetrievedDocument DTO
- server/config.py — added retrieval/LLM tuning fields
- server/tests/* — added tests for retriever and llm client

Notes:
- Vector search uses existing supabase_client.search_similar_documents RPC.
- BM25 fallback is a high-level Supabase full-text search wrapper.
- hybrid_search normalizes and merges scores using configurable weight.
- assemble_context enforces prompt token budget and avoids splitting triple-backtick code fences when trimming.
- Provenance is returned as structured `sources` derived from retrieval metadata.

Next steps:
- Wire these modules into the /chat endpoint (Step 4).
- Optionally implement LLM-emitted JSON provenance parsing and heading-aware trimming.
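
(Reading aid, not part of the commit: a hedged sketch of the weighted score merge the notes describe; min-max normalization and the parameter name vector_weight are assumptions, see server/retriever.py for the actual logic.)

```python
def merge_scores(vector_hits: dict[str, float],
                 bm25_hits: dict[str, float],
                 vector_weight: float = 0.7) -> dict[str, float]:
    """Normalize each score set to [0, 1], then blend with a configurable weight."""
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc_id: (s - lo) / span for doc_id, s in scores.items()}

    vec, bm = normalize(vector_hits), normalize(bm25_hits)
    return {doc_id: vector_weight * vec.get(doc_id, 0.0)
                    + (1 - vector_weight) * bm.get(doc_id, 0.0)
            for doc_id in vec.keys() | bm.keys()}
```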
…orn env config

- Merge match_documents into migrations/0001_create_documents_and_file_manifest.sql
- Remove redundant migrations/0002_create_match_documents.sql
- Add GUNICORN_CMD_ARGS to Dockerfile (default --timeout 180)
- Add  support with default
- Update  for needed runtime packages
… rate-limit key

- Implement /chat flow (embedding -> retrieval -> LLM)
- Add JWT verification (HS256/JWKS) and JWKS caching
- Use JWT sub for rate-limiter key when available
- Add  and  material
- Add integration tests for API flows (mocked external services)
…ical and README a pointer; mark STARTUP.md as developer-focused
feat(rag): add RAG API, ingestion pipeline, DB migration, and deployment docs
Add /health endpoint for platform health checks
Add retry/backoff for Supabase RPC vector search to handle transient …
OdynBrouwer and others added 5 commits January 19, 2026 14:47
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…flow

ci: add workflow to generate processed docs