Open Data Link is a search engine for open data. It supports the following search methods:
- Semantic keyword search over metadata
- Similar dataset search using semantic similarity of metadata
- Joinable table search
- Unionable table search
There are three main components:
- The crawler script downloads datasets and metadata from Socrata.
sketch_columnsandprocess_metadatacreate data sketches and metadata embedding vectors.- The server builds indices on the data columns and metadata and serves the frontend.
scripts/download_socrata_datasets.sh [app token file]
Create the column_sketches table:
sqlite3 opendatalink.sqlite < sql/create_column_sketches_table.sql
Run sketch_columns to sketch (minhash) dataset columns and store them in the
column_sketches table:
go run cmd/sketch_columns/main.go
curl -O https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
unzip crawl-300d-2M.zip
go run cmd/build_fasttext/main.go < crawl-300d-2M.vec
This will create the fasttext.sqlite database.
Create the metadata and metadata_vectors tables:
sqlite3 opendatalink.sqlite < sql/create_metadata_tables.sql
Run process_metadata:
go run cmd/process_metadata/main.go
This will create metadata embedding vectors for each dataset and save them in
the metadata_vectors table. The metadata is saved in the metadata table.
go run cmd/server/main.go
The server, sketch_columns, and process_metadata look for databases named
opendatalink.sqlite and fasttext.sqlite in the current directory by default.
Alternate paths can be specified in the OPENDATALINK_DB and FASTTEXT_DB
environment variables.