Serverless Wikipedia Search 🔍

Instant, serverless full-text search over thousands of documents using SQLite, WASM, and HTTP range requests. Hosted entirely on a CDN.

🚀 Overview

This project demonstrates a serverless search engine architecture. Instead of running a backend server (like Elasticsearch or Solr), the search engine runs entirely in the user's browser.

It uses sql.js-httpvfs to lazy-load pages of a SQLite database hosted as static files (see the loading sketch below). This allows for:

  • Zero backend maintenance: Just host static files.
  • Instant results: Low latency search using FTS5 (Full Text Search).
  • Cost efficiency: Runs on free or near-free static hosting such as GitHub Pages or Cloudflare R2.
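
A minimal loading sketch, assuming the published sql.js-httpvfs API and a single-file database under public/db/; a chunk-split database would instead point at the config.json generated by the build step, and the articles table name is illustrative:

import { createDbWorker } from "sql.js-httpvfs";

const workerUrl = new URL("sql.js-httpvfs/dist/sqlite.worker.js", import.meta.url);
const wasmUrl = new URL("sql.js-httpvfs/dist/sql-wasm.wasm", import.meta.url);

// In "full" mode, every SQLite page read becomes an HTTP range request
// for just the bytes the query actually touches.
const worker = await createDbWorker(
  [{ from: "inline", config: { serverMode: "full", url: "/db/db.sqlite3", requestChunkSize: 4096 } }],
  workerUrl.toString(),
  wasmUrl.toString()
);

const [{ n }] = await worker.db.query("SELECT COUNT(*) AS n FROM articles");
console.log(`${n} articles indexed`);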

✨ Features

  • Massive Scale: Capable of indexing 100k+ articles using streaming ingestion.
  • Premium UI: Polished "Google-like" interface built with Tailwind CSS.
  • Smart Search (see the query sketch below):
    • Prefix Matching: "comp" matches "Computer".
    • Highlighting: Search terms are bolded in snippets.
    • Instant Feedback: Search stats and total article count.
  • Automated Pipeline: Scripts to download Wikipedia dumps and stream-build the database.
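
A hedged sketch of such a query (the articles schema is an assumption, not the repo's actual code): the trailing * makes FTS5 treat the last token as a prefix, and snippet() returns an excerpt with matched terms wrapped in <b> tags.

const results = await worker.db.query(
  `SELECT title,
          snippet(articles, 1, '<b>', '</b>', '…', 32) AS excerpt
     FROM articles
    WHERE articles MATCH ?
    ORDER BY rank
    LIMIT 20`,
  ["comp*"]  // user typed "comp"; '*' is appended for prefix matching
);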

🛠️ Architecture

  1. Data Ingestion: scripts/process-dump.js streams Wikipedia XML dumps (bz2 compressed) and converts them to NDJSON.
  2. Database Build: scripts/build-db.sh pipes the NDJSON into SQLite to create an FTS5 index, tunes the page size, and splits the file into chunks (sketched below).
  3. Frontend: React + Vite app loads the DB engine (WASM) and issues SQL queries.
  4. Hosting: The split DB chunks and the app are served as static assets.
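
The real build step is a shell script driving the sqlite3 CLI; the core idea can be sketched in Node (better-sqlite3 and the file names here are illustrative assumptions):

import fs from "node:fs";
import readline from "node:readline";
import Database from "better-sqlite3";

const db = new Database("public/db/wiki.sqlite3");
db.pragma("page_size = 32768"); // must be set before any table exists (see Performance Findings)
db.exec("CREATE VIRTUAL TABLE articles USING fts5(title, body)");

const insert = db.prepare("INSERT INTO articles (title, body) VALUES (?, ?)");
const rl = readline.createInterface({ input: fs.createReadStream("articles.ndjson") });

db.exec("BEGIN");
rl.on("line", (line) => {
  const { title, body } = JSON.parse(line); // one article per NDJSON line
  insert.run(title, body);
});
rl.on("close", () => {
  db.exec("COMMIT");
  db.exec("INSERT INTO articles(articles) VALUES ('optimize')"); // merge FTS5 b-trees
  db.close();
});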

📦 Installation & Usage Guide

1. Clone & Install

git clone https://github.com/ihiteshagrawal/serverless-wiki-search.git
cd serverless-wiki-search
npm install

2. Download Data

You can download the Simple English Wikipedia (~250MB) for testing or the full English Wikipedia (~20GB).

# Download Simple Wikipedia (Recommended for testing)
./scripts/download-dump.sh simplewiki

# OR Download Full English Wikipedia
# ./scripts/download-dump.sh enwiki

3. Build the Database

This script streams the XML dump, parses it, and builds the SQLite database chunks in public/db/.

# Point to the downloaded file
DATA_FILE=simplewiki-latest-pages-articles.xml.bz2 ./scripts/build-db.sh

4. Run Locally

Start the development server to test the search engine.

npm run dev

Open http://localhost:5173 to see it in action.

🚀 Deployment

This project is configured for GitHub Pages.

  1. Build the Database Locally: Ensure you have run step 3 above. The chunks in public/db/ will be committed.
  2. Commit & Push:
    git add .
    git commit -m "feat: deploy new index"
    git push
  3. Watch Deployment: Go to the "Actions" tab in your repository to see the deployment progress.
  4. View Site: Your search engine will be live at https://<username>.github.io/serverless-wiki-search/.

🔧 Customization

  • Data Source: Edit scripts/process-dump.js to change the filtering logic (e.g., exclude specific namespaces; see the sketch below).
  • UI: Modify src/App.jsx to change the search query or styling.
  • Database: Edit scripts/build-db.sh to tune the chunk size or page size.
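
For example, a hypothetical filter predicate inside scripts/process-dump.js (the field names are assumptions about the parsed page object, not the repo's actual code):

function shouldIndex(page) {
  if (page.ns !== 0) return false;            // 0 = main/article namespace
  if (page.redirect) return false;            // skip redirect stubs
  return page.text && page.text.length > 200; // drop near-empty pages
}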

📊 Performance Findings

We benchmarked this architecture with 100,000 articles (~120MB database) on GitHub Pages.

| Environment  | Page Size | Latency | Notes                                          |
|--------------|-----------|---------|------------------------------------------------|
| Localhost    | 4KB       | ~20ms   | Instant; no network overhead.                  |
| GitHub Pages | 4KB       | >5s     | Unusable; too many round-trips.                |
| GitHub Pages | 8KB       | ~4.6s   | Still too slow.                                |
| GitHub Pages | 32KB      | ~1.5s   | Sweet spot; reduced round-trips significantly. |

Key Learnings:

  1. Page Size Matters: On high-latency hosts like static CDNs, larger SQLite pages (32KB+) increase B-tree fan-out, so the index tree is shallower and each lookup needs fewer sequential HTTP round-trips.
  2. Caching is Tricky: Aggressive CDN caching can mix old DB chunks with a new config, causing "database disk image is malformed" errors. Versioned filenames (e.g., db.17328...sqlite3) are mandatory.
  3. Static Optimizations: Pre-calculating stats (stats.json) and warming the cache with a trivial query (SELECT 1) make the app feel responsive before the first real search completes; see the sketch below.
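
A sketch of those two tricks together (the stats.json shape and the renderHeader helper are hypothetical):

// Stats are pre-computed at build time, so the UI never pays for a COUNT(*).
const statsPromise = fetch("/db/stats.json").then((r) => r.json());

// A trivial query forces the worker to fetch the WASM engine and the first
// database pages while the user is still looking at an empty search box.
const warmup = worker.db.query("SELECT 1");

const { totalArticles } = await statsPromise;
renderHeader(`${totalArticles.toLocaleString()} articles indexed`);
await warmup;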

🔮 Future Work

To achieve sub-second latency on the web, we recommend:

  1. Better Hosting: Move from GitHub Pages to Cloudflare R2 or AWS S3 + CloudFront, which handle HTTP Range Requests more efficiently.
  2. HTTP/2 Multiplexing: Ensure the host supports multiplexing to allow parallel fetching of DB pages.
  3. Vector Search: Implement semantic search using transformers.js and a WASM vector database (like USearch) for "meaning-based" results.

📄 License

MIT
