Merged

331 commits
5b1248d
Add quantization option for pgvector with support for halfvec
lucagiac81 Aug 9, 2024
1678077
Add cli support for running benchmark with custom dataset in pgvector…
Sheharyar570 Oct 2, 2024
cba7043
feat: upgrade pgvecto.rs sdk to v0.2.2
cutecutecat Oct 8, 2024
d323821
Randomly pick start idx of test dataset in concurrency search.
Sheharyar570 Oct 14, 2024
7f32779
Added optimzation for Opensearch
navneet1v Oct 16, 2024
467b00a
record the test time; add version / note info for milvus and zillizcl…
alwayslove2013 Oct 18, 2024
a64d326
fix bug: date to datetime
alwayslove2013 Oct 23, 2024
dc73c2a
update leaderboard data
alwayslove2013 Oct 24, 2024
b9a0ce5
fix leaderboard data: zillizcloud version
alwayslove2013 Oct 25, 2024
37b0c7c
Fixed custom_case key error in parameters dict in CLI command.
Sheharyar570 Oct 25, 2024
c5aa67e
Refactored command options for consistency.
Sheharyar570 Oct 25, 2024
61e2808
Updated readme, added custom case related command options information.
Sheharyar570 Oct 25, 2024
b531680
update the instruction for adding custom_case support in new CLI impl…
Sheharyar570 Oct 25, 2024
32c1a53
add key for plotly_chart
alwayslove2013 Oct 28, 2024
523726f
add key for plotly_chart
alwayslove2013 Oct 28, 2024
6939b91
fix pinecone client
alwayslove2013 Oct 28, 2024
d209659
Support for pgdiskann client (#388)
wahajali Oct 29, 2024
6a05c30
increase timeout
alwayslove2013 Oct 29, 2024
1d8b218
Binary Quantization Support for pgvector HNSW Algorithm (#389)
Sheharyar570 Oct 29, 2024
7619acb
fix weaviate client bug
alwayslove2013 Oct 30, 2024
36316d4
Fix code
acanadil Oct 31, 2024
6cb5898
remove older zillizcloud test results from leaderboard
alwayslove2013 Nov 5, 2024
df18476
fixed pgvectorivvflat cli reranking key error bug
Sheharyar570 Nov 5, 2024
653bd39
set default value of quantized_fetch_limit to 100 in case of ivfflat,…
Sheharyar570 Nov 5, 2024
f498e71
update comment
Sheharyar570 Nov 5, 2024
183f47b
Add rate runner
XuanYang-cn Sep 14, 2024
983ea4c
fix conc_latency_p99 calculation; add conc_latency_avg metric; conc_t…
alwayslove2013 Nov 25, 2024
0230184
Added AlloyDB client
Sheharyar570 Nov 25, 2024
a641244
Remove query that set storage to plain.
Sheharyar570 Nov 25, 2024
6a1477a
Add default value for pre_reordering_num_neighbors in cli options.
Sheharyar570 Nov 25, 2024
b0553bc
fix: Donot refresh load
XuanYang-cn Nov 28, 2024
c6a2e79
fix: invalid value for --max-num-levels when using CLI.
Sheharyar570 Nov 29, 2024
2250e62
Add Milvus auth support through user_name and password fields (#416)
teynar Dec 4, 2024
fa14f04
support alibaba cloud elasticsearch (#418)
xingshaomin Dec 11, 2024
0f9d9c8
enhance: refine read write cases
XuanYang-cn Dec 10, 2024
185541a
add aliyun opensearch client
Nov 19, 2024
8a1e18e
bug fix: cost time should be removed from the results of the serial_s…
alwayslove2013 Dec 12, 2024
0b524d9
fix: Opensearch requirements
XuanYang-cn Dec 13, 2024
91d1eed
add aliyun Opensearch requirements
Dec 16, 2024
9aa58f0
update readme
alwayslove2013 Dec 31, 2024
4481615
Removed the Filter Path from Search, so we can get the full response
Jan 7, 2025
ba6bd9b
add support to provide custom port in pgvector
shaharuk-yb Jan 7, 2025
834c440
show_default=True
shaharuk-yb Jan 7, 2025
378841f
updated the opensearch to use column id instead of _id
Jan 7, 2025
370106f
Modified the code to use internal column _id instead of id.
Jan 7, 2025
c0be2c7
added HNSW params to index creation and search
Luka958Pixion Jan 8, 2025
2a0c43b
fixed types
Luka958Pixion Jan 8, 2025
8627b8f
fixed query
Luka958Pixion Jan 8, 2025
c002756
fixed ef_runtime
Luka958Pixion Jan 8, 2025
f9dec08
enhance: Refine the coding style and enable lint-action
XuanYang-cn Jan 8, 2025
4940bf7
fix bug
alwayslove2013 Jan 9, 2025
6814767
fix: Unable to run vebbench and cli
XuanYang-cn Jan 10, 2025
6dfcadf
enhance: Unify optimize and remove ready_to_load
XuanYang-cn Jan 13, 2025
b1c43fd
add mongodb client
zhuwenxing Jan 14, 2025
ac02b14
add mongodb client in readme
zhuwenxing Jan 14, 2025
94f4c6e
add some risk warnings for custom dataset
alwayslove2013 Jan 19, 2025
7a00ca0
Bump grpcio from 1.53.0 to 1.53.2 in /install
dependabot[bot] Jan 20, 2025
7811d73
add mongodb config
zhuwenxing Jan 14, 2025
2359747
Opensearch interal configuration parameters (#463)
Xavierantony1982 Jan 31, 2025
ec439e4
ui control num of concurrencies
Caroline-an777 Feb 10, 2025
e4d987c
Update README.md
xiaofan-luan Feb 12, 2025
9c3446f
environs version should <14.1.0
alwayslove2013 Feb 13, 2025
26b483c
Support GPU_BRUTE_FORCE index for Milvus (#476)
Rachit-Chaudhary11 Feb 24, 2025
040041b
Add table quantization type
lucagiac81 Nov 5, 2024
38a9a32
Support MariaDB database (#375)
HugoWenTD Mar 11, 2025
f7d9210
Add TiDB backend (#484)
breezewish Mar 13, 2025
d2f102e
CLI fix for GPU index (#485)
Rachit-Chaudhary11 Mar 14, 2025
4d0cedd
remove duplicated code
yuyuankang Mar 25, 2025
809024f
feat: initial commit
MansorY23 Apr 8, 2025
d6364b7
Add vespa integration
nuvotex-tk Apr 8, 2025
8432f6f
remove redundant empty_field config check for qdrant and tidb
alwayslove2013 Apr 14, 2025
0a96299
reformat all
alwayslove2013 Apr 14, 2025
6be5c2b
fix cli crush
alwayslove2013 Apr 16, 2025
6d3f4a4
downgrade streamlit version
pauvez Apr 17, 2025
b979e79
add more milvus index types: hnsw sq/pq/prq; ivf rabitq
alwayslove2013 Apr 18, 2025
3f9c498
add more milvus index types: ivf_pq
alwayslove2013 Apr 23, 2025
f7f551e
Add HNSW support for Clickhouse client (#500)
MansorY23 Apr 24, 2025
42af186
fix bugs when use custom_dataset without groundtruth file
alwayslove2013 Apr 30, 2025
9a912f6
fix: prevent the frontend from crashing on invalid indexes in results
s-h-a-d-o-w May 3, 2025
7720bd4
fix ruff warnings
s-h-a-d-o-w May 6, 2025
de9aa90
Fix formatting
s-h-a-d-o-w May 6, 2025
0122126
Add lancedb
s-h-a-d-o-w Apr 26, 2025
75bbdfb
Add --task-label option for cli (#517)
LoveYou3000 May 7, 2025
67f0f2e
Add qdrant cli
s-h-a-d-o-w May 6, 2025
2b966c3
Update README.md
yuyuankang May 12, 2025
4efbe83
Fixing Bugs in Benchmarking ClickHouse with vectordbbench (#523)
yuyuankang May 13, 2025
9f5ea99
Add --concurrency-timeout option to avoid long time waiting (#521)
LoveYou3000 May 14, 2025
aa13197
add alias: VDBBench
alwayslove2013 May 15, 2025
7a1dc5e
LanceDB: Improve serial latency by only selecting id
s-h-a-d-o-w May 15, 2025
a898095
add --num-shards option for milvus performance test case (#526)
LoveYou3000 May 20, 2025
bf03df3
Add a batch cli to support the batch execution of multiple cases. (#530)
LoveYou3000 Jun 9, 2025
6c37626
Fixing bugs in aws opensearch client and added fp16 support (#529)
navneet1v Jun 9, 2025
564b75e
Bugfix: add num_shards option to MilvusHNSW
LoveYou3000 May 30, 2025
ed6a291
BugFix: An error occurs when the password option is not passed.
LoveYou3000 May 30, 2025
f0c88d1
Add support for Qdrant local setup (#533)
ZebinRen Jun 9, 2025
1943931
Fix python import in MemoryDB client
ChristophKoerner Jun 5, 2025
0bda245
upgrade ruff / black, reformat all
alwayslove2013 Jun 9, 2025
2884761
change lancedb vector type to float32
ZebinRen Jun 9, 2025
af8cc1a
add num_shards to MilvusConfig.to_dict()
ZebinRen Jun 10, 2025
4d6d8df
expose lancedb index parameters to the cli interface (#537)
ZebinRen Jun 11, 2025
40da33c
Add parameters of aws opensearch, support hnsw engine options, suppor…
norrishuang Jun 11, 2025
34b3b25
Add OceanBase Database Support to VectorDBBench (#540)
wyfanxiao Jun 14, 2025
ef25859
VectorDBBench 1.0 (#543)
alwayslove2013 Jun 16, 2025
22fe3ed
generate leaderboard_v2 data
alwayslove2013 Jun 17, 2025
9e8003d
update some docs
alwayslove2013 Jun 17, 2025
9edb001
fix bug: set default num_shards to 1
alwayslove2013 Jun 19, 2025
2447d7f
update elastic_cloud results
alwayslove2013 Jun 19, 2025
8b464f5
Fix: Correct typos in README.md (#550)
triplechecker-com Jul 2, 2025
2682e9a
fix bugs: remove None from download_files
alwayslove2013 Jul 2, 2025
d70c751
upgrade aliyun opensearch client (#552)
hust-xing Jul 4, 2025
adcef10
add ivf rabitq for command #553 (#554)
MageChiu Jul 4, 2025
00ad2ec
Fixed the issue where the welcome page image could not be loaded. (#556)
zhuwenxing Jul 7, 2025
aa9ff4b
Fix return to result page error (#557)
zhuwenxing Jul 7, 2025
099d404
Fix run tidb will return error (#559)
JaySon-Huang Jul 11, 2025
e5965fe
feat: Add OSS OpenSearch client support (#562)
akhilpathivada Jul 17, 2025
2a4d0ef
feat: test client for aws s3 vectors
alwayslove2013 Jul 22, 2025
d342beb
upgrade black/ruff and fix lint issues
alwayslove2013 Jul 24, 2025
73d1560
s3vectors standard test results
alwayslove2013 Jul 22, 2025
dc3efcd
add int filter (#545)
Caroline-an777 Jul 24, 2025
caa0979
fix(oss-opensearch): Resolve streaming crashes and improve code relia…
akhilpathivada Jul 25, 2025
d6d1693
feat: Add P95 latency metrics alongside P99 system-wide (#573)
akhilpathivada Jul 25, 2025
3d135a4
update the max level from 3 to 10 for zillizcloud
alwayslove2013 Jul 25, 2025
21dd5d0
Add S3vector Engine for AWS OpenSearch, fixed some bugs of AWS OpenSe…
norrishuang Jul 25, 2025
ea3af3e
Fix missing assets (#575)
emmanuel-ferdman Jul 27, 2025
1a4023a
Add nbits parameter to IVF_PQ index and adapt new filter logic (#576)
wyfanxiao Jul 31, 2025
f253a41
Added support for Product Quantization in pg_diskann (#579)
wahajali Aug 6, 2025
4e01e4a
update leaderboard: add s3vectors results; add streaming results
alwayslove2013 Aug 6, 2025
df52ec8
update leaderboard data: use 90p search stage results as streaming pe…
alwayslove2013 Aug 13, 2025
44fd93e
fix(rate_runner): ensure thread safety for pgvector in concurrent ins…
alwayslove2013 Aug 21, 2025
0a9f309
add env parameters of dataset download from AWS S3 Or Aliyun OSS (#583)
norrishuang Aug 27, 2025
c12fe70
optimize:milvus add replica-number parameter (#588)
liyunqiu666 Aug 27, 2025
b6c3088
Added Alibaba Cloud Hologres support to VectorDBBench. (#591)
xiaolanlianhua Aug 27, 2025
80dcfb7
fix: fix the arguments missing for `LanceDB::insert_embeddings` (#592)
TheR1sing3un Aug 29, 2025
800d7d0
feat: add EnVector support to VectorDBBench
Aug 31, 2025
782e5fe
feat: add EnVector support to VectorDBBench
Aug 31, 2025
0c413da
add ivf flat
euphoria0-0 Sep 25, 2025
0e9ce3e
add ivf flat
euphoria0-0 Sep 25, 2025
53ec88d
fix readme
euphoria0-0 Sep 25, 2025
94b0b7d
fix readme
euphoria0-0 Sep 25, 2025
ce3a636
fix readme
euphoria0-0 Sep 25, 2025
cc11163
fix readme
euphoria0-0 Sep 25, 2025
e619faf
add num_per_batch env var
euphoria0-0 Sep 25, 2025
4487d98
add num_per_batch env var
euphoria0-0 Sep 25, 2025
3120c24
support hyper params
euphoria0-0 Sep 26, 2025
bcdddf3
support hyper params
euphoria0-0 Sep 26, 2025
0819eea
WIP: fix pickle error
euphoria0-0 Sep 26, 2025
6fb69bf
WIP: fix pickle error
euphoria0-0 Sep 26, 2025
dec9d75
fix missing ivf-flat configs
euphoria0-0 Sep 28, 2025
3bf28db
fix missing ivf-flat configs
euphoria0-0 Sep 28, 2025
7f5f432
create_index when no index
euphoria0-0 Sep 30, 2025
c4041be
create_index when no index
euphoria0-0 Sep 30, 2025
92cbf7d
fix batch size
euphoria0-0 Sep 30, 2025
4bee9d6
fix batch size
euphoria0-0 Sep 30, 2025
20f49bd
Update README.md
euphoria0-0 Oct 3, 2025
2a2e921
Update README.md
euphoria0-0 Oct 3, 2025
e9744bf
Update .env.example
euphoria0-0 Oct 3, 2025
faf964d
Update .env.example
euphoria0-0 Oct 3, 2025
5cb4f95
add mm mode
euphoria0-0 Oct 10, 2025
2e83651
add mm mode
euphoria0-0 Oct 10, 2025
79091d7
Merge branch 'fix/config' of github.com:CryptoLabInc/VectorDBBench in…
euphoria0-0 Oct 10, 2025
7c2a91e
Merge branch 'fix/config' of github.com:CryptoLabInc/VectorDBBench in…
euphoria0-0 Oct 10, 2025
4fca725
cli add supported eval mode list
euphoria0-0 Oct 14, 2025
f1b21a7
cli add supported eval mode list
euphoria0-0 Oct 14, 2025
8c1a319
rm indextype
euphoria0-0 Oct 14, 2025
9c745e2
rm indextype
euphoria0-0 Oct 14, 2025
b65af1a
fix choices type
euphoria0-0 Oct 14, 2025
d88c988
fix choices type
euphoria0-0 Oct 14, 2025
9afe622
fix readme && fix NUM_PER_BATCH
euphoria0-0 Oct 15, 2025
5d68a00
fix readme && fix NUM_PER_BATCH
euphoria0-0 Oct 15, 2025
601fe0e
add trained_centroids
euphoria0-0 Oct 30, 2025
2be382d
add trained_centroids
euphoria0-0 Oct 30, 2025
348d630
update
euphoria0-0 Oct 30, 2025
1d9371b
update
euphoria0-0 Oct 30, 2025
505f6e2
revert
euphoria0-0 Nov 4, 2025
34bdc41
revert
euphoria0-0 Nov 4, 2025
756d997
add centroids path
euphoria0-0 Nov 4, 2025
ba9ad94
add centroids path
euphoria0-0 Nov 4, 2025
989d277
fix readme
euphoria0-0 Nov 5, 2025
8f767e3
fix readme
euphoria0-0 Nov 5, 2025
e7b0a7a
add example result in readme
euphoria0-0 Nov 5, 2025
86a9f65
add example result in readme
euphoria0-0 Nov 5, 2025
8411937
WIP: fix lock error
euphoria0-0 Nov 5, 2025
ef85613
WIP: fix lock error
euphoria0-0 Nov 5, 2025
417a748
fix batch size
euphoria0-0 Nov 5, 2025
49cd0b4
fix batch size
euphoria0-0 Nov 5, 2025
edd6747
fix num per batch env var
euphoria0-0 Nov 5, 2025
ab75253
fix num per batch env var
euphoria0-0 Nov 5, 2025
99803c4
fix config
euphoria0-0 Nov 6, 2025
a154a06
fix config
euphoria0-0 Nov 6, 2025
7df4c0d
support virtual cluster tree
euphoria0-0 Nov 17, 2025
7c1bc7f
fix config
euphoria0-0 Nov 17, 2025
b7bb8ca
add benchmark script
euphoria0-0 Nov 17, 2025
e157710
fix log
euphoria0-0 Nov 17, 2025
698d9c2
fix benchmark script
euphoria0-0 Nov 17, 2025
a70f71c
fix numpy sequence stack
euphoria0-0 Nov 18, 2025
e153442
add type config in benchmark
euphoria0-0 Nov 18, 2025
3e85e98
add logs
euphoria0-0 Nov 18, 2025
337b0ca
rm copy state
euphoria0-0 Nov 18, 2025
aa6a5b1
fix eliminated
euphoria0-0 Nov 18, 2025
2e275fa
fix eliminated
euphoria0-0 Nov 18, 2025
2d37643
fix np take
euphoria0-0 Nov 18, 2025
a7cbd4f
fix np take
euphoria0-0 Nov 18, 2025
76d7e64
fix insert metadata
euphoria0-0 Nov 18, 2025
3ae5f92
fix benchmark script
euphoria0-0 Nov 18, 2025
0cabd0c
fix overwrite nprobe
euphoria0-0 Nov 19, 2025
4615ba1
fix
euphoria0-0 Nov 19, 2025
b962650
fix get tree
euphoria0-0 Nov 19, 2025
7c7a857
fix centroid file path
euphoria0-0 Nov 19, 2025
9e84cd3
add debug log for centroid list
euphoria0-0 Nov 19, 2025
8b37bf0
fix to delete the index instead of all indexes when drop_old
euphoria0-0 Nov 20, 2025
134fa04
fix drop_index
euphoria0-0 Nov 20, 2025
7f9d4aa
update vct as dfs version
euphoria0-0 Nov 21, 2025
4feb093
fix
euphoria0-0 Nov 21, 2025
66fa44b
rename centroids as centroids_path
euphoria0-0 Nov 21, 2025
228e91c
rename centroids as centroids_path
euphoria0-0 Nov 21, 2025
6d19492
add envector config file
euphoria0-0 Nov 21, 2025
37bfe16
fix
euphoria0-0 Nov 21, 2025
4f006b9
add neighbors file
euphoria0-0 Nov 21, 2025
4cd059a
fix
euphoria0-0 Nov 21, 2025
4c0870f
fix
euphoria0-0 Nov 21, 2025
1eacc4a
rename
euphoria0-0 Nov 21, 2025
5ff4dd8
fix
euphoria0-0 Nov 21, 2025
7814d0a
fix comment
euphoria0-0 Nov 21, 2025
44cd1ec
rm unused commented
euphoria0-0 Nov 21, 2025
440ff55
fix
euphoria0-0 Nov 21, 2025
1935c65
update
euphoria0-0 Nov 21, 2025
62db666
fix
euphoria0-0 Nov 21, 2025
b84ccaa
update vct
euphoria0-0 Nov 21, 2025
8cb2cfb
fix comments
euphoria0-0 Nov 21, 2025
810c875
fix vct opt
euphoria0-0 Nov 21, 2025
5f035fb
fix readme
euphoria0-0 Nov 21, 2025
8c3b06b
fix
euphoria0-0 Nov 21, 2025
c36621e
fix
euphoria0-0 Nov 21, 2025
9f71c13
fix preprare datset
euphoria0-0 Nov 21, 2025
6b5ae17
fix preprare datset
euphoria0-0 Nov 21, 2025
1aee4b4
fix
euphoria0-0 Nov 21, 2025
b9bd9fd
fix log
euphoria0-0 Nov 21, 2025
71da08a
rm req
euphoria0-0 Nov 21, 2025
ace9835
Merge branch 'fix/config' into feat/add-vct-pubmed-centroids
euphoria0-0 Nov 21, 2025
0b7a3f8
add centroid download
euphoria0-0 Nov 21, 2025
36e92ef
fix readme
euphoria0-0 Nov 21, 2025
0cf66ad
add comments
euphoria0-0 Nov 21, 2025
880cb49
activate comments
euphoria0-0 Nov 21, 2025
8ffa0da
add envector readme link
euphoria0-0 Nov 21, 2025
4eaa3f1
add bloomberg
euphoria0-0 Nov 21, 2025
b494caa
add options
euphoria0-0 Nov 21, 2025
8444440
fix readme
euphoria0-0 Nov 21, 2025
76ee6df
update readme
euphoria0-0 Nov 21, 2025
4884064
Merge pull request #3 from CryptoLabInc/feat/add-vct-pubmed-centroids
euphoria0-0 Nov 21, 2025
16d417b
Update vectordb_bench/backend/clients/envector/config.py
euphoria0-0 Nov 21, 2025
4 changes: 2 additions & 2 deletions .env.example
@@ -3,9 +3,9 @@
# LOG_NAME=
# TIMEZONE=

-# NUM_PER_BATCH=
+NUM_PER_BATCH=4096
# DEFAULT_DATASET_URL=

-DATASET_LOCAL_DIR="/tmp/vectordb_bench/dataset"
+DATASET_LOCAL_DIR="/data/vectordb_bench/dataset"

# DROP_OLD = True
11 changes: 11 additions & 0 deletions README.md
@@ -1,3 +1,11 @@
# enVector with ANN (GAS) in VectorDBBench

The guide on how to use enVector with an ANN index in VectorDBBench is available in [README_ENVECTOR.md](README_ENVECTOR.md).

What follows is the original content of the VectorDBBench README:

---

# VectorDBBench(VDBBench): A Benchmark Tool for VectorDB

[![version](https://img.shields.io/pypi/v/vectordb-bench.svg?color=blue)](https://pypi.org/project/vectordb-bench/)
@@ -422,6 +430,9 @@ python -m vectordb_bench

OR:

If you are using [dev container](https://code.visualstudio.com/docs/devcontainers/containers), create
the following dataset directory first:

```shell
init_bench
```
132 changes: 132 additions & 0 deletions README_ENVECTOR.md
@@ -0,0 +1,132 @@
# enVector with ANN (GAS) in VectorDBBench

This guide demonstrates how to use enVector with an ANN index in VectorDBBench.

Basic usage of enVector with VectorDBBench follows the standard procedure for [VectorDBBench](https://github.com/zilliztech/VectorDBBench).

## Structure

```bash
.
├── centroids
│   └── embeddinggemma-300m
│       ├── centroids.npy              # centroids file for ANN
│       └── tree_info.pkl              # tree metadata for ANN
├── dataset
│   └── pubmed768d400k                 # VectorDB ANN benchmark dataset
│       ├── neighbors.parquet
│       ├── test.npy
│       └── train.pkl
├── README_ENVECTOR.md
└── scripts
    ├── run_benchmark.sh               # benchmark script
    ├── envector_pubmed_config.yml     # benchmark config file
    └── prepare_dataset.py             # download and prepare ground truth neighbors for dataset
```

## Prerequisites

### Install Python Dependencies
```bash
# 1. Create your environment
python -m venv .venv
source .venv/bin/activate

# 2. Install VectorDBBench
pip install -e .

# 3. Install es2
pip install es2==1.2.0a4
```

### Prepare dataset

Prepare the following artifacts for the ANN benchmark with `scripts/prepare_dataset.py`:

- download datasets from HuggingFace
- prepare ground-truth neighbors
- download the centroids and tree metadata for the GAS index corresponding to the embedding model

For the ANN benchmark, we provide two datasets via HuggingFace:
- PUBMED768D400K: [cryptolab-playground/pubmed-arxiv-abstract-embedding-gemma-300m](https://huggingface.co/datasets/cryptolab-playground/pubmed-arxiv-abstract-embedding-gemma-300m)
- BLOOMBERG768D368K: [cryptolab-playground/Bloomberg-Financial-News-embedding-gemma-300m](https://huggingface.co/datasets/cryptolab-playground/Bloomberg-Financial-News-embedding-gemma-300m)

We also provide centroids and tree metadata for the embedding model used in the ANN benchmark:
- GAS Centroids: [cryptolab-playground/gas-centroids](https://huggingface.co/datasets/cryptolab-playground/gas-centroids)

To prepare a dataset, run a command like the following:

```bash
# Prepare dataset
python ./scripts/prepare_dataset.py \
    -d cryptolab-playground/pubmed-arxiv-abstract-embedding-gemma-300m \
    -e embeddinggemma-300m
```
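The ground truth produced by `scripts/prepare_dataset.py` comes from an exact (brute-force) inner-product search with `faiss.IndexFlatIP`. As a sanity check of what that step computes, here is a minimal NumPy-only sketch of the same search on toy data (the array shapes and `k` here are illustrative, not the benchmark's):

```python
import numpy as np

def exact_inner_product_neighbors(train: np.ndarray, test: np.ndarray, k: int) -> np.ndarray:
    """Top-k train ids per test query by inner product, as faiss.IndexFlatIP computes them."""
    scores = test @ train.T                    # (n_test, n_train) similarity matrix
    return np.argsort(-scores, axis=1)[:, :k]  # sort descending, keep top-k ids

# Toy stand-ins for the real embeddings (the real data is 768-dimensional)
rng = np.random.default_rng(0)
train = rng.standard_normal((100, 8)).astype("float32")
test = rng.standard_normal((5, 8)).astype("float32")

neighbors = exact_inner_product_neighbors(train, test, k=10)
print(neighbors.shape)  # one row of 10 neighbor ids per test query
```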

Then, you can find the following generated files:

```bash
.
├── centroids
│   └── embeddinggemma-300m
│       ├── centroids.npy
│       └── tree_info.pkl
└── dataset
    └── pubmed768d400k
        ├── neighbors.parquet
        ├── test.npy
        └── train.pkl
```

### Prepare enVector Server

To run an enVector server with ANN, refer to the [enVector Deployment repository](https://github.com/CryptoLabInc/envector-deployment).
For example, you can start the server with the following command:

```bash
# Start enVector server
git clone https://github.com/CryptoLabInc/envector-deployment
cd envector-deployment/docker-compose
./start_envector.sh
```
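The configs in this guide assume the server is reachable at `localhost:50050`. Before starting a long benchmark run, it can help to confirm that the port is accepting connections; the probe below is a generic TCP check (a convenience sketch, not part of enVector's tooling):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# The default uri used throughout this guide's configs
print(can_connect("localhost", 50050))
```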

We provide four enVector Docker Images:
- `cryptolabinc/es2e:v1.2.0-alpha.4`
- `cryptolabinc/es2b:v1.2.0-alpha.4`
- `cryptolabinc/es2o:v1.2.0-alpha.4`
- `cryptolabinc/es2c:v1.2.0-alpha.4`

### Set Environment Variables

```bash
# Set environment variables
export DATASET_LOCAL_DIR="./dataset"
export NUM_PER_BATCH=4096
```
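`NUM_PER_BATCH` controls how many vectors are sent per insert call during the load stage. The sketch below is a hypothetical illustration of how a loader might slice a dataset by this value; the benchmark's actual loader may differ:

```python
import os
import numpy as np

# Read the batch size the way a loader would (default mirrors .env.example)
num_per_batch = int(os.environ.get("NUM_PER_BATCH", 4096))

def iter_batches(vectors: np.ndarray, batch_size: int):
    """Yield consecutive row slices of at most batch_size vectors."""
    for start in range(0, len(vectors), batch_size):
        yield vectors[start:start + batch_size]

# A stand-in dataset: 10,000 vectors of dimension 768
vectors = np.zeros((10_000, 768), dtype="float32")
sizes = [len(batch) for batch in iter_batches(vectors, num_per_batch)]
print(sizes)
```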

## Run Benchmark

Refer to `./scripts/run_benchmark.sh` or a config file such as `./scripts/envector_pubmed_config.yml` for benchmarking enVector with ANN (VCT), or use the following command:

```bash
export NUM_PER_BATCH=500000 # set to the database size for efficiency with IVF_FLAT
python -m vectordb_bench.cli.vectordbbench envectorivfflat \
    --uri "localhost:50050" \
    --eval-mode mm \
    --case-type PerformanceCustomDataset \
    --db-label "PUBMED768D400K-IVF" \
    --custom-case-name PUBMED768D400K \
    --custom-dataset-name PUBMED768D400K \
    --custom-dataset-dir "" \
    --custom-dataset-size 400335 \
    --custom-dataset-dim 768 \
    --custom-dataset-file-count 1 \
    --custom-dataset-with-gt \
    --skip-custom-dataset-use-shuffled \
    --train-centroids True \
    --is-vct True \
    --centroids-path "./centroids/embeddinggemma-300m/centroids.npy" \
    --vct-path "./centroids/embeddinggemma-300m/tree_info.pkl" \
    --nlist 32768 \
    --nprobe 6
```
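For intuition on the `--nlist`/`--nprobe` trade-off: IVF_FLAT assigns every vector to one of `nlist` clusters and, at query time, scans only the `nprobe` clusters whose centroids best match the query. The NumPy sketch below illustrates the idea on toy data; it is not enVector's GAS/VCT implementation:

```python
import numpy as np

def build_ivf(train: np.ndarray, nlist: int, n_iter: int = 5, seed: int = 0):
    """Crude k-means: nlist centroids plus an inverted list of member ids per centroid."""
    rng = np.random.default_rng(seed)
    centroids = train[rng.choice(len(train), nlist, replace=False)].copy()
    for _ in range(n_iter):
        assign = np.argmax(train @ centroids.T, axis=1)
        for c in range(nlist):
            members = train[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    assign = np.argmax(train @ centroids.T, axis=1)
    lists = {c: np.flatnonzero(assign == c) for c in range(nlist)}
    return centroids, lists

def ivf_search(query: np.ndarray, train: np.ndarray, centroids, lists, nprobe: int, k: int):
    """Scan only the nprobe inverted lists whose centroids best match the query."""
    probe = np.argsort(-(query @ centroids.T))[:nprobe]
    cand = np.concatenate([lists[c] for c in probe])
    scores = train[cand] @ query
    return cand[np.argsort(-scores)[:k]]

rng = np.random.default_rng(1)
train = rng.standard_normal((200, 8)).astype("float32")
query = rng.standard_normal(8).astype("float32")

centroids, lists = build_ivf(train, nlist=8)
approx = ivf_search(query, train, centroids, lists, nprobe=2, k=5)
exact = np.argsort(-(train @ query))[:5]
print(len(set(approx) & set(exact)), "of 5 exact neighbors found with nprobe=2")
```

Raising `nprobe` scans more clusters, improving recall at the cost of latency; `nprobe` equal to `nlist` degenerates to exact search.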
6 changes: 6 additions & 0 deletions pyproject.toml
@@ -40,6 +40,12 @@ dependencies = [
"pydantic<v2",
"scikit-learn",
"pymilvus", # with pandas, numpy, ujson
"ujson",
"pgvector",
"psycopg",
"psycopg[binary]",
"datasets",
"faiss-cpu"
]
dynamic = ["version"]

42 changes: 42 additions & 0 deletions scripts/envector_bloomberg_config.yml
@@ -0,0 +1,42 @@
envectorflat:
  uri: localhost:50050
  eval_mode: mm
  case_type: PerformanceCustomDataset
  db_label: BLOOMBERG768D368K-FLAT
  custom_case_name: BLOOMBERG768D368K
  custom_case_description: BLOOMBERG768D368K benchmark (768D, 368K vectors)
  custom_dataset_name: BLOOMBERG768D368K
  custom_dataset_dir:
  custom_dataset_size: 368816
  custom_dataset_dim: 768
  custom_dataset_file_count: 1
  custom_dataset_use_shuffled: false
  custom_dataset_with_gt: true
  k: 10
  drop_old: true
  load: true

envectorivfflat:
  uri: localhost:50050
  eval_mode: mm
  case_type: PerformanceCustomDataset
  db_label: BLOOMBERG768D368K-IVF
  custom_case_name: BLOOMBERG768D368K
  custom_case_description: BLOOMBERG768D368K benchmark (768D, 368K vectors)
  custom_dataset_name: BLOOMBERG768D368K
  custom_dataset_dir:
  custom_dataset_size: 368816
  custom_dataset_dim: 768
  custom_dataset_file_count: 1
  custom_dataset_use_shuffled: false
  custom_dataset_with_gt: true
  k: 10
  nlist: 32768
  nprobe: 6
  train_centroids: true
  is_vct: true
  centroids_path: centroids/embeddinggemma-300m/centroids.npy
  vct_path: centroids/embeddinggemma-300m/tree_info.pkl
  drop_old: true
  load: true

42 changes: 42 additions & 0 deletions scripts/envector_pubmed_config.yml
@@ -0,0 +1,42 @@
envectorflat:
  uri: localhost:50050
  eval_mode: mm
  case_type: PerformanceCustomDataset
  db_label: PUBMED768D400K-FLAT
  custom_case_name: PUBMED768D400K
  custom_case_description: PUBMED768D400K benchmark (768D, 400K vectors)
  custom_dataset_name: PUBMED768D400K
  custom_dataset_dir:
  custom_dataset_size: 400335
  custom_dataset_dim: 768
  custom_dataset_file_count: 1
  custom_dataset_use_shuffled: false
  custom_dataset_with_gt: true
  k: 10
  drop_old: true
  load: true

envectorivfflat:
  uri: localhost:50050
  eval_mode: mm
  case_type: PerformanceCustomDataset
  db_label: PUBMED768D400K-IVF
  custom_case_name: PUBMED768D400K
  custom_case_description: PUBMED768D400K benchmark (768D, 400K vectors)
  custom_dataset_name: PUBMED768D400K
  custom_dataset_dir:
  custom_dataset_size: 400335
  custom_dataset_dim: 768
  custom_dataset_file_count: 1
  custom_dataset_use_shuffled: false
  custom_dataset_with_gt: true
  k: 10
  nlist: 32768
  nprobe: 6
  train_centroids: true
  is_vct: true
  centroids_path: centroids/embeddinggemma-300m/centroids.npy
  vct_path: centroids/embeddinggemma-300m/tree_info.pkl
  drop_old: true
  load: true

114 changes: 114 additions & 0 deletions scripts/prepare_dataset.py
@@ -0,0 +1,114 @@
import os
import wget
import argparse
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

from datasets import load_dataset

import faiss

def get_args():
    parser = argparse.ArgumentParser(
        description="Prepare dataset and ground truth neighbors for benchmarking."
    )
    parser.add_argument(
        "-d", "--dataset-name",
        type=str,
        default="cryptolab-playground/pubmed-arxiv-abstract-embedding-gemma-300m",
        help="Huggingface dataset name to download.",
        choices=[
            "cryptolab-playground/pubmed-arxiv-abstract-embedding-gemma-300m",
            "cryptolab-playground/Bloomberg-Financial-News-embedding-gemma-300m",
        ],
    )
    parser.add_argument(
        "--dataset-dir",
        type=str,
        default="./dataset/pubmed768d400k",
        help="Dataset directory to save the dataset and neighbors.",
    )
    parser.add_argument(
        "-e", "--embedding-model",
        type=str,
        default="embeddinggemma-300m",
        help="Embedding model name to download centroids for.",
    )
    parser.add_argument(
        "--centroids-dir",
        type=str,
        default="./centroids",
        help="Directory to save the centroids and tree info.",
    )
    return parser.parse_args()

def download_dataset(
    dataset_name: str,
    output_dir: str = "./dataset/pubmed768d400k",
) -> None:
    """Download dataset from Huggingface and save as Parquet files."""
    # load dataset
    ds = load_dataset(dataset_name)
    train = ds["train"].to_pandas()
    test = ds["test"].to_pandas()

    # write to parquet
    train_table = pa.Table.from_pandas(train)
    pq.write_table(train_table, f"{output_dir}/train.parquet")

    test_table = pa.Table.from_pandas(test)
    pq.write_table(test_table, f"{output_dir}/test.parquet")

def prepare_neighbors(
    data_dir: str = "./dataset/pubmed768d400k",
) -> None:
    """Prepare ground truth neighbors using brute-force flat search and save as Parquet."""
    # load dataset
    train = pd.read_parquet(f"{data_dir}/train.parquet")
    test = pd.read_parquet(f"{data_dir}/test.parquet")

    train = np.stack(train["emb"].to_list()).astype("float32")
    test = np.stack(test["emb"].to_list()).astype("float32")
    dim = train.shape[1]

    # exact inner-product search over all train vectors
    index = faiss.IndexFlatIP(dim)
    index.add(train)

    # retrieve len(test) neighbors per query; the benchmark reads only the top-k it needs
    k = len(test)
    distances, indices = index.search(test, k)
    print(distances.shape, indices.shape)

    # save flat search result as neighbors
    df = pd.DataFrame({
        "id": np.arange(len(indices)),
        "neighbors_id": indices.tolist(),
    })

    table = pa.Table.from_pandas(df)
    pq.write_table(table, f"{data_dir}/neighbors.parquet")

def download_centroids(embedding_model: str, dataset_dir: str) -> None:
    """Download pre-computed centroids and tree info for GAS VCT index."""
    if embedding_model != "embeddinggemma-300m":
        raise ValueError(f"Centroids for {embedding_model} currently not available.")

    # https://huggingface.co/datasets/cryptolab-playground/gas-centroids
    dataset_link = f"https://huggingface.co/datasets/cryptolab-playground/gas-centroids/resolve/main/{embedding_model}"

    # download
    os.makedirs(os.path.join(dataset_dir, embedding_model), exist_ok=True)
    wget.download(f"{dataset_link}/centroids.npy", out=os.path.join(dataset_dir, embedding_model, "centroids.npy"))
    wget.download(f"{dataset_link}/tree_info.pkl", out=os.path.join(dataset_dir, embedding_model, "tree_info.pkl"))


if __name__ == "__main__":
    args = get_args()
    os.makedirs(args.dataset_dir, exist_ok=True)

    download_dataset(args.dataset_name, args.dataset_dir)
    prepare_neighbors(args.dataset_dir)
    download_centroids(args.embedding_model, args.centroids_dir)