Primary goal of this project: Build a program that can quickly return the shortest path between 2 Wikipedia pages via page links
- Reading the Wikipedia dump files `-page.sql`, `-redirect.sql`, and `-pagelinks.sql` using one file-reading thread (line by line) and at least one parsing thread (work is spread with a queue), which uses a regex to parse the SQL `INSERT` statements (sketched below)
- Remapping the page, redirect, and link lists into hashmaps that can be (de-)serialized in CBOR format (see the CBOR sketch below)
- Shortest-path search using BFS (see the BFS sketch below)
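
The reading/parsing step can be pictured with a minimal sketch: one thread streams a dump file line by line into a channel (the queue), and a parsing thread pulls out the `INSERT INTO` lines and extracts id/title tuples with a regex. The regex, the assumed column layout, and the namespace filter below are simplifications for illustration; the actual parser also handles the redirect and pagelinks tables and can run on several threads.

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::sync::mpsc;
use std::thread;

use regex::Regex;

/// Builds an id -> title map from a `-page.sql` dump.
/// One reader thread streams the file line by line into a channel; the
/// current thread acts as the parsing thread and applies a regex to every
/// tuple of the `INSERT INTO` statements.
fn parse_page_dump(path: &str) -> HashMap<u32, String> {
    let (tx, rx) = mpsc::channel::<String>();

    // Reader thread: only forward the (very long) INSERT lines.
    let path = path.to_owned();
    let reader = thread::spawn(move || {
        let file = File::open(path).expect("cannot open dump file");
        for line in BufReader::new(file).lines() {
            let line = line.expect("read error");
            if line.starts_with("INSERT INTO") {
                tx.send(line).expect("parser hung up");
            }
        }
        // Dropping `tx` closes the channel and ends the parsing loop below.
    });

    // Assumed tuple shape: (page_id, namespace, 'title', ...). The real dump
    // has more columns and needs more careful quote/escape handling.
    let tuple_re = Regex::new(r"\((\d+),(\d+),'((?:[^'\\]|\\.)*)'").unwrap();
    let mut pages = HashMap::new();
    for line in rx {
        for cap in tuple_re.captures_iter(&line) {
            let id: u32 = cap[1].parse().unwrap();
            let namespace: u32 = cap[2].parse().unwrap();
            if namespace == 0 {
                // Keep only articles; titles use underscores instead of spaces.
                pages.insert(id, cap[3].replace('_', " "));
            }
        }
    }
    reader.join().unwrap();
    pages
}
```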
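The remapped hashmaps are what end up in the serialized db file. Here is a sketch of such a CBOR round trip, assuming `serde` derives plus the `ciborium` crate; the crate choice and the struct layout are assumptions, not necessarily what this project uses.

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufReader, BufWriter};

use serde::{Deserialize, Serialize};

/// Illustrative shape of the serialized database; the real structs differ.
#[derive(Serialize, Deserialize)]
struct LinkDb {
    titles: HashMap<u32, String>,
    links: HashMap<u32, Vec<u32>>,
}

/// Write the hashmaps as CBOR to disk.
fn save(db: &LinkDb, path: &str) {
    let writer = BufWriter::new(File::create(path).expect("cannot create db file"));
    ciborium::ser::into_writer(db, writer).expect("CBOR serialization failed");
}

/// Read the hashmaps back from disk (the loading step of the `interactive` command).
fn load(path: &str) -> LinkDb {
    let reader = BufReader::new(File::open(path).expect("cannot open db file"));
    ciborium::de::from_reader(reader).expect("CBOR deserialization failed")
}
```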
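Finally, a compact BFS sketch over an assumed adjacency map from page id to linked page ids. It counts path lengths the same way as the example queries below: the end page is included, the start page is not.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Breadth-first search over a page-id adjacency map. Returns the pages on a
/// shortest path, excluding the start page and including the end page, or
/// `None` if the goal is unreachable. The map layout is an illustrative
/// assumption, not the project's exact data structure.
fn shortest_path(links: &HashMap<u32, Vec<u32>>, start: u32, goal: u32) -> Option<Vec<u32>> {
    let mut predecessor: HashMap<u32, u32> = HashMap::new();
    let mut visited: HashSet<u32> = HashSet::from([start]);
    let mut queue = VecDeque::from([start]);

    while let Some(page) = queue.pop_front() {
        if page == goal {
            // Walk the predecessor chain back until we hit the start page.
            let mut path = vec![page];
            let mut current = page;
            while let Some(&prev) = predecessor.get(&current) {
                if prev == start {
                    break;
                }
                path.push(prev);
                current = prev;
            }
            path.reverse();
            return Some(path);
        }
        if let Some(neighbors) = links.get(&page) {
            for &next in neighbors {
                if visited.insert(next) {
                    predecessor.insert(next, page);
                    queue.push_back(next);
                }
            }
        }
    }
    None
}
```

Storing only a predecessor per visited page is enough to reconstruct the path afterwards, so memory use during a query stays proportional to the number of pages visited.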
I've used the `derive-db` command with the dewiki-20240501 dump of Wikipedia on a 12-core Apple Silicon M2 Pro machine (using 10 threads):
- Took 67 seconds
- Used about 2.9 GB RAM max (htop)
- Resulted in a 970.46 MB serialized db
Using the database with the `interactive` command:
- Loading took 6 seconds
- Uses 1.9 GB RAM (htop, before querying)
- A few test queries:
  - Seekröten > Linux: 6 ms (path length: 3, including the end page, excluding the start page)
  - Briefmarke > Linux Torvalds: 53 ms (path length: 3)
  - Wasserrakete > Punktierter Stumpfzangenläufer: 2152 ms (path length: 6)
- Build the project, ideally in release mode (use `cargo build -r`, or just use `cargo run -r --` to run)
- Download the needed files from one of the Wikipedia dump mirrors (https://dumps.wikimedia.org/mirrors.html):
  - Choose a mirror
  - Depending on the mirror, go to `{language}wiki/{date of dump}`
  - Download `-page.sql.gz`, `-pagelinks.sql.gz`, and `-redirect.sql.gz`
  - Extract the `.gz` files
- Derive the database from the downloaded files: `./target/release/wikipedia-link-db derive-db -p {file prefix}-page.sql -r {file prefix}-redirect.sql -l {file prefix}-pagelinks.sql -o output.db -t {number of threads to use}`
- Use the `interactive` command to interactively query paths from the db: `./target/release/wikipedia-link-db interactive -d output.db`