Instant, serverless full-text search over thousands of documents using SQLite, WASM, and HTTP range requests. Hosted entirely on a CDN.
This project demonstrates a serverless search engine architecture. Instead of running a backend server (like Elasticsearch or Solr), the search engine runs entirely in the user's browser.
It uses sql.js-httpvfs to lazy-load parts of a SQLite database hosted as static files. This allows for:
- Zero backend maintenance: Just host static files.
- Instant results: Low latency search using FTS5 (Full Text Search).
- Cost efficiency: Hosted on free CDNs like GitHub Pages or Cloudflare R2.
- Massive Scale: Capable of indexing 100k+ articles using streaming ingestion.
- Premium UI: Polished "Google-like" interface built with Tailwind CSS.
- Smart Search:
  - Prefix Matching: "comp" matches "Computer".
  - Highlighting: Search terms are bolded in snippets.
  - Instant Feedback: Search stats and total article count.
- Automated Pipeline: Scripts to download Wikipedia dumps and stream-build the database.
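The Smart Search behaviors above (prefix matching plus highlighting) can be sketched with two small helpers. These are illustrative stand-ins, not the exact code in `src/App.jsx`:

```javascript
// Turn raw input into an FTS5 prefix query: "comp sci" -> "comp* sci*",
// so typing "comp" already matches "Computer".
function toPrefixQuery(input) {
  return input
    .trim()
    .split(/\s+/)
    .filter(Boolean)
    .map((term) => `${term.replace(/"/g, "")}*`) // strip quotes, add prefix star
    .join(" ");
}

// Bold every search term inside a result snippet (case-insensitive).
function highlight(snippet, input) {
  const terms = input.trim().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return snippet;
  // Escape regex metacharacters so user input can't break the pattern.
  const escaped = terms.map((t) => t.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
  return snippet.replace(new RegExp(`(${escaped.join("|")})`, "gi"), "<b>$1</b>");
}

console.log(toPrefixQuery("comp sci"));             // → "comp* sci*"
console.log(highlight("Computer science", "comp")); // → "<b>Comp</b>uter science"
```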
- Data Ingestion: `scripts/process-dump.js` streams Wikipedia XML dumps (bz2-compressed) and converts them to NDJSON.
- Database Build: `scripts/build-db.sh` pipes the NDJSON into SQLite to create an FTS5 index, tunes the page size, and splits the result into chunks.
- Frontend: A React + Vite app loads the DB engine (WASM) and issues SQL queries.
- Hosting: The split DB chunks and the app are served as static assets.
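A minimal sketch of the frontend side, assuming the chunks live under `/db/` and the FTS5 table is named `articles` (paths, sizes, and column indexes here are illustrative, not the project's exact values). The actual worker creation is shown in comments because it needs the `sql.js-httpvfs` package and a browser worker context:

```javascript
// Worker config for sql.js-httpvfs "chunked" server mode: the database was
// split into fixed-size files such as /db/db.sqlite3.000, /db/db.sqlite3.001, ...
const dbConfig = {
  from: "inline",
  config: {
    serverMode: "chunked",
    urlPrefix: "/db/db.sqlite3.",      // prefix shared by all chunk files
    serverChunkSize: 10 * 1024 * 1024, // size each chunk was split at (assumed)
    suffixLength: 3,                   // .000, .001, ...
    requestChunkSize: 32 * 1024,       // should match the SQLite page size
  },
};

// FTS5 query: MATCH does the full-text lookup, snippet() returns a
// highlighted excerpt (table name and column number are assumptions).
const searchSql = `
  SELECT title, snippet(articles, 1, '<b>', '</b>', '…', 12) AS snip
  FROM articles
  WHERE articles MATCH ?
  LIMIT 20`;

// In the app this is wired up roughly like:
//   import { createDbWorker } from "sql.js-httpvfs";
//   const worker = await createDbWorker([dbConfig], workerUrl, wasmUrl);
//   const rows = await worker.db.query(searchSql, [query + "*"]);
```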
```bash
git clone https://github.com/ihiteshagrawal/serverless-wiki-search.git
cd serverless-wiki-search
npm install
```
You can download the Simple English Wikipedia (~250MB) for testing or the full English Wikipedia (~20GB).
```bash
# Download Simple Wikipedia (Recommended for testing)
./scripts/download-dump.sh simplewiki

# OR Download Full English Wikipedia
# ./scripts/download-dump.sh enwiki
```
The build script streams the XML dump, parses it, and builds the SQLite database chunks in `public/db/`.
```bash
# Point to the downloaded file
DATA_FILE=simplewiki-latest-pages-articles.xml.bz2 ./scripts/build-db.sh
```
Start the development server to test the search engine.
```bash
npm run dev
```
Open http://localhost:5173 to see it in action.
This project is configured for GitHub Pages.
- Build the Database Locally: Ensure you have run step 3 above. The chunks in `public/db/` will be committed.
- Commit & Push:
  ```bash
  git add .
  git commit -m "feat: deploy new index"
  git push
  ```
- Watch Deployment: Go to the "Actions" tab in your repository to see the deployment progress.
- View Site: Your search engine will be live at `https://<username>.github.io/serverless-wiki-search/`.
- Data Source: Edit `scripts/process-dump.js` to change filtering logic (e.g., exclude specific namespaces).
- UI: Modify `src/App.jsx` to change the search query or styling.
- Database: Edit `scripts/build-db.sh` to tune the chunk size or page size.
We benchmarked this architecture with 100,000 articles (~120MB database) on GitHub Pages.
| Environment | Page Size | Latency | Notes |
|---|---|---|---|
| Localhost | 4KB | ~20ms | Instant. No network overhead. |
| GitHub Pages | 4KB | >5s | Unusable. Too many round-trips. |
| GitHub Pages | 8KB | ~4.6s | Still too slow. |
| GitHub Pages | 32KB | ~1.5s | Sweet spot. Reduced round-trips significantly. |
Key Learnings:
- Page Size Matters: On high-latency networks (like static hosting), larger SQLite page sizes (32KB+) are critical to reduce the number of HTTP requests required to traverse the B-Tree index.
- Caching is Tricky: Aggressive CDN caching can mix old DB chunks with new configs, causing "Malformed Disk Image" errors. Versioned filenames (e.g., `db.17328...sqlite3`) are mandatory.
- Static Optimizations: Pre-calculating stats (`stats.json`) and warming up the cache (`SELECT 1...`) make the app feel much faster than it actually is.
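The page-size effect can be grounded with a back-of-envelope model: on a cold cache, each B-tree level costs roughly one HTTP round-trip, and the tree gets shallower as pages (and therefore fanout) grow. This is a rough model with an assumed average index-entry size, not a measurement:

```javascript
// depth ≈ ceil(log_fanout(pages)), fanout ≈ pageSize / bytesPerEntry.
// bytesPerEntry = 64 is an assumed average index-entry size.
function estimateRoundTrips(dbBytes, pageSize, bytesPerEntry = 64) {
  const pages = Math.ceil(dbBytes / pageSize);
  const fanout = Math.max(2, Math.floor(pageSize / bytesPerEntry));
  return Math.ceil(Math.log(pages) / Math.log(fanout));
}

const dbSize = 120 * 1024 * 1024; // the ~120 MB benchmark database
for (const pageSize of [4 * 1024, 8 * 1024, 32 * 1024]) {
  console.log(`${pageSize / 1024}KB pages: ~${estimateRoundTrips(dbSize, pageSize)} round-trips per lookup`);
}
```

A single search issues many such lookups sequentially, so even one fewer level of depth compounds into the latency gap seen in the benchmark table.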
To achieve sub-second latency on the web, we recommend:
- Better Hosting: Move from GitHub Pages to Cloudflare R2 or AWS S3 + CloudFront, which handle HTTP Range Requests more efficiently.
- HTTP/2 Multiplexing: Ensure the host supports multiplexing to allow parallel fetching of DB pages.
- Vector Search: Implement semantic search using `transformers.js` and a WASM vector database (like USearch) for "meaning-based" results.
MIT