Serverless Wikipedia Search 🔍

Instant, serverless full-text search over thousands of documents using SQLite, WASM, and HTTP range requests. Hosted entirely on a CDN.

🚀 Overview

This project demonstrates a serverless search engine architecture. Instead of running a backend server (like Elasticsearch or Solr), the search engine runs entirely in the user's browser.

It uses sql.js-httpvfs to lazy-load pages of a SQLite database hosted as static files (see the loading sketch below). This allows for:

  • Zero backend maintenance: Just host static files.
  • Instant results: Low latency search using FTS5 (Full Text Search).
  • Cost efficiency: Runs on free or near-free static hosting such as GitHub Pages or Cloudflare R2.
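
A minimal loading sketch, assuming the published sql.js-httpvfs API and a single-file database under public/db/; a chunk-split database would instead point at the config.json generated by the build step, and the articles table name is illustrative:

import { createDbWorker } from "sql.js-httpvfs";

const workerUrl = new URL("sql.js-httpvfs/dist/sqlite.worker.js", import.meta.url);
const wasmUrl = new URL("sql.js-httpvfs/dist/sql-wasm.wasm", import.meta.url);

// In "full" mode, every SQLite page read becomes an HTTP range request
// for just the bytes the query actually touches.
const worker = await createDbWorker(
  [{ from: "inline", config: { serverMode: "full", url: "/db/db.sqlite3", requestChunkSize: 4096 } }],
  workerUrl.toString(),
  wasmUrl.toString()
);

const [{ n }] = await worker.db.query("SELECT COUNT(*) AS n FROM articles");
console.log(`${n} articles indexed`);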

✨ Features

  • Massive Scale: Capable of indexing 100k+ articles using streaming ingestion.
  • Premium UI: Polished "Google-like" interface built with Tailwind CSS.
  • Smart Search (see the query sketch below):
    • Prefix Matching: "comp" matches "Computer".
    • Highlighting: Search terms are bolded in snippets.
    • Instant Feedback: Search stats and total article count.
  • Automated Pipeline: Scripts to download Wikipedia dumps and stream-build the database.
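
A hedged sketch of such a query (the articles schema is an assumption, not the repo's actual code): the trailing * makes FTS5 treat the last token as a prefix, and snippet() returns an excerpt with matched terms wrapped in <b> tags.

const results = await worker.db.query(
  `SELECT title,
          snippet(articles, 1, '<b>', '</b>', '…', 32) AS excerpt
     FROM articles
    WHERE articles MATCH ?
    ORDER BY rank
    LIMIT 20`,
  ["comp*"]  // user typed "comp"; '*' is appended for prefix matching
);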

🛠️ Architecture

  1. Data Ingestion: scripts/process-dump.js streams Wikipedia XML dumps (bz2 compressed) and converts them to NDJSON.
  2. Database Build: scripts/build-db.sh pipes the NDJSON into SQLite to create an FTS5 index, tunes the page size, and splits the file into chunks (sketched below).
  3. Frontend: React + Vite app loads the DB engine (WASM) and issues SQL queries.
  4. Hosting: The split DB chunks and the app are served as static assets.
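
The real build step is a shell script driving the sqlite3 CLI; the core idea can be sketched in Node (better-sqlite3 and the file names here are illustrative assumptions):

import fs from "node:fs";
import readline from "node:readline";
import Database from "better-sqlite3";

const db = new Database("public/db/wiki.sqlite3");
db.pragma("page_size = 32768"); // must be set before any table exists (see Performance Findings)
db.exec("CREATE VIRTUAL TABLE articles USING fts5(title, body)");

const insert = db.prepare("INSERT INTO articles (title, body) VALUES (?, ?)");
const rl = readline.createInterface({ input: fs.createReadStream("articles.ndjson") });

db.exec("BEGIN");
rl.on("line", (line) => {
  const { title, body } = JSON.parse(line); // one article per NDJSON line
  insert.run(title, body);
});
rl.on("close", () => {
  db.exec("COMMIT");
  db.exec("INSERT INTO articles(articles) VALUES ('optimize')"); // merge FTS5 b-trees
  db.close();
});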

📦 Installation & Usage Guide

1. Clone & Install

git clone https://github.com/ihiteshagrawal/serverless-wiki-search.git
cd serverless-wiki-search
npm install

2. Download Data

You can download the Simple English Wikipedia (~250MB) for testing or the full English Wikipedia (~20GB).

# Download Simple Wikipedia (Recommended for testing)
./scripts/download-dump.sh simplewiki

# OR Download Full English Wikipedia
# ./scripts/download-dump.sh enwiki

3. Build the Database

This script streams the XML dump, parses it, and builds the SQLite database chunks in public/db/.

# Point to the downloaded file
DATA_FILE=simplewiki-latest-pages-articles.xml.bz2 ./scripts/build-db.sh

4. Run Locally

Start the development server to test the search engine.

npm run dev

Open http://localhost:5173 to see it in action.

🚀 Deployment

This project is configured for GitHub Pages.

  1. Build the Database Locally: Ensure you have run step 3 above. The chunks in public/db/ will be committed.
  2. Commit & Push:
    git add .
    git commit -m "feat: deploy new index"
    git push
  3. Watch Deployment: Go to the "Actions" tab in your repository to see the deployment progress.
  4. View Site: Your search engine will be live at https://<username>.github.io/serverless-wiki-search/.

🔧 Customization

  • Data Source: Edit scripts/process-dump.js to change the filtering logic (e.g., exclude specific namespaces; see the sketch below).
  • UI: Modify src/App.jsx to change the search query or styling.
  • Database: Edit scripts/build-db.sh to tune the chunk size or page size.
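
For example, a hypothetical filter predicate inside scripts/process-dump.js (the field names are assumptions about the parsed page object, not the repo's actual code):

function shouldIndex(page) {
  if (page.ns !== 0) return false;            // 0 = main/article namespace
  if (page.redirect) return false;            // skip redirect stubs
  return page.text && page.text.length > 200; // drop near-empty pages
}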

📊 Performance Findings

We benchmarked this architecture with 100,000 articles (~120MB database) on GitHub Pages.

| Environment  | Page Size | Latency | Notes                                          |
|--------------|-----------|---------|------------------------------------------------|
| Localhost    | 4KB       | ~20ms   | Instant; no network overhead.                  |
| GitHub Pages | 4KB       | >5s     | Unusable; too many round-trips.                |
| GitHub Pages | 8KB       | ~4.6s   | Still too slow.                                |
| GitHub Pages | 32KB      | ~1.5s   | Sweet spot; reduced round-trips significantly. |

Key Learnings:

  1. Page Size Matters: On high-latency hosts like static CDNs, larger SQLite pages (32KB+) increase B-tree fan-out, so the index tree is shallower and each lookup needs fewer sequential HTTP round-trips.
  2. Caching is Tricky: Aggressive CDN caching can mix old DB chunks with a new config, causing "database disk image is malformed" errors. Versioned filenames (e.g., db.17328...sqlite3) are mandatory.
  3. Static Optimizations: Pre-calculating stats (stats.json) and warming the cache with a trivial query (SELECT 1) make the app feel responsive before the first real search completes; see the sketch below.
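
A sketch of those two tricks together (the stats.json shape and the renderHeader helper are hypothetical):

// Stats are pre-computed at build time, so the UI never pays for a COUNT(*).
const statsPromise = fetch("/db/stats.json").then((r) => r.json());

// A trivial query forces the worker to fetch the WASM engine and the first
// database pages while the user is still looking at an empty search box.
const warmup = worker.db.query("SELECT 1");

const { totalArticles } = await statsPromise;
renderHeader(`${totalArticles.toLocaleString()} articles indexed`);
await warmup;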

🔮 Future Work

To achieve sub-second latency on the web, we recommend:

  1. Better Hosting: Move from GitHub Pages to Cloudflare R2 or AWS S3 + CloudFront, which handle HTTP Range Requests more efficiently.
  2. HTTP/2 Multiplexing: Ensure the host supports multiplexing to allow parallel fetching of DB pages.
  3. Vector Search: Implement semantic search using transformers.js and a WASM vector database (like USearch) for "meaning-based" results.

📄 License

MIT
