forked from zilliztech/VectorDBBench
Add enVector #1
Merged
Commits (331 commits, changes from all commits)
Add quantization option for pgvector with support for halfvec
lucagiac81 1678077
Add cli support for running benchmark with custom dataset in pgvector…
Sheharyar570 cba7043
feat: upgrade pgvecto.rs sdk to v0.2.2
cutecutecat d323821
Randomly pick start idx of test dataset in concurrency search.
Sheharyar570 7f32779
Added optimzation for Opensearch
navneet1v 467b00a
record the test time; add version / note info for milvus and zillizcl…
alwayslove2013 a64d326
fix bug: date to datetime
alwayslove2013 dc73c2a
update leaderboard data
alwayslove2013 b9a0ce5
fix leaderboard data: zillizcloud version
alwayslove2013 37b0c7c
Fixed custom_case key error in parameters dict in CLI command.
Sheharyar570 c5aa67e
Refactored command options for consistency.
Sheharyar570 61e2808
Updated readme, added custom case related command options information.
Sheharyar570 b531680
update the instruction for adding custom_case support in new CLI impl…
Sheharyar570 32c1a53
add key for plotly_chart
alwayslove2013 523726f
add key for plotly_chart
alwayslove2013 6939b91
fix pinecone client
alwayslove2013 d209659
Support for pgdiskann client (#388)
wahajali 6a05c30
increase timeout
alwayslove2013 1d8b218
Binary Quantization Support for pgvector HNSW Algorithm (#389)
Sheharyar570 7619acb
fix weaviate client bug
alwayslove2013 36316d4
Fix code
acanadil 6cb5898
remove older zillizcloud test results from leaderboard
alwayslove2013 df18476
fixed pgvectorivvflat cli reranking key error bug
Sheharyar570 653bd39
set default value of quantized_fetch_limit to 100 in case of ivfflat,…
Sheharyar570 f498e71
update comment
Sheharyar570 183f47b
Add rate runner
XuanYang-cn 983ea4c
fix conc_latency_p99 calculation; add conc_latency_avg metric; conc_t…
alwayslove2013 0230184
Added AlloyDB client
Sheharyar570 a641244
Remove query that set storage to plain.
Sheharyar570 6a1477a
Add default value for pre_reordering_num_neighbors in cli options.
Sheharyar570 b0553bc
fix: Donot refresh load
XuanYang-cn c6a2e79
fix: invalid value for --max-num-levels when using CLI.
Sheharyar570 2250e62
Add Milvus auth support through user_name and password fields (#416)
teynar fa14f04
support alibaba cloud elasticsearch (#418)
xingshaomin 0f9d9c8
enhance: refine read write cases
XuanYang-cn 185541a
add aliyun opensearch client
8a1e18e
bug fix: cost time should be removed from the results of the serial_s…
alwayslove2013 0b524d9
fix: Opensearch requirements
XuanYang-cn 91d1eed
add aliyun Opensearch requirements
9aa58f0
update readme
alwayslove2013 4481615
Removed the Filter Path from Search, so we can get the full response
ba6bd9b
add support to provide custom port in pgvector
shaharuk-yb 834c440
show_default=True
shaharuk-yb 378841f
updated the opensearch to use column id instead of _id
370106f
Modified the code to use internal column _id instead of id.
c0be2c7
added HNSW params to index creation and search
Luka958Pixion 2a0c43b
fixed types
Luka958Pixion 8627b8f
fixed query
Luka958Pixion c002756
fixed ef_runtime
Luka958Pixion f9dec08
enhance: Refine the coding style and enable lint-action
XuanYang-cn 4940bf7
fix bug
alwayslove2013 6814767
fix: Unable to run vebbench and cli
XuanYang-cn 6dfcadf
enhance: Unify optimize and remove ready_to_load
XuanYang-cn b1c43fd
add mongodb client
zhuwenxing ac02b14
add mongodb client in readme
zhuwenxing 94f4c6e
add some risk warnings for custom dataset
alwayslove2013 7a00ca0
Bump grpcio from 1.53.0 to 1.53.2 in /install
dependabot[bot] 7811d73
add mongodb config
zhuwenxing 2359747
Opensearch interal configuration parameters (#463)
Xavierantony1982 ec439e4
ui control num of concurrencies
Caroline-an777 e4d987c
Update README.md
xiaofan-luan 9c3446f
environs version should <14.1.0
alwayslove2013 26b483c
Support GPU_BRUTE_FORCE index for Milvus (#476)
Rachit-Chaudhary11 040041b
Add table quantization type
lucagiac81 38a9a32
Support MariaDB database (#375)
HugoWenTD f7d9210
Add TiDB backend (#484)
breezewish d2f102e
CLI fix for GPU index (#485)
Rachit-Chaudhary11 4d0cedd
remove duplicated code
yuyuankang 809024f
feat: initial commit
MansorY23 d6364b7
Add vespa integration
nuvotex-tk 8432f6f
remove redundant empty_field config check for qdrant and tidb
alwayslove2013 0a96299
reformat all
alwayslove2013 6be5c2b
fix cli crush
alwayslove2013 6d3f4a4
downgrade streamlit version
pauvez b979e79
add more milvus index types: hnsw sq/pq/prq; ivf rabitq
alwayslove2013 3f9c498
add more milvus index types: ivf_pq
alwayslove2013 f7f551e
Add HNSW support for Clickhouse client (#500)
MansorY23 42af186
fix bugs when use custom_dataset without groundtruth file
alwayslove2013 9a912f6
fix: prevent the frontend from crashing on invalid indexes in results
s-h-a-d-o-w 7720bd4
fix ruff warnings
s-h-a-d-o-w de9aa90
Fix formatting
s-h-a-d-o-w 0122126
Add lancedb
s-h-a-d-o-w 75bbdfb
Add --task-label option for cli (#517)
LoveYou3000 67f0f2e
Add qdrant cli
s-h-a-d-o-w 2b966c3
Update README.md
yuyuankang 4efbe83
Fixing Bugs in Benchmarking ClickHouse with vectordbbench (#523)
yuyuankang 9f5ea99
Add --concurrency-timeout option to avoid long time waiting (#521)
LoveYou3000 aa13197
add alias: VDBBench
alwayslove2013 7a1dc5e
LanceDB: Improve serial latency by only selecting id
s-h-a-d-o-w a898095
add --num-shards option for milvus performance test case (#526)
LoveYou3000 bf03df3
Add a batch cli to support the batch execution of multiple cases. (#530)
LoveYou3000 6c37626
Fixing bugs in aws opensearch client and added fp16 support (#529)
navneet1v 564b75e
Bugfix: add num_shards option to MilvusHNSW
LoveYou3000 ed6a291
BugFix: An error occurs when the password option is not passed.
LoveYou3000 f0c88d1
Add support for Qdrant local setup (#533)
ZebinRen 1943931
Fix python import in MemoryDB client
ChristophKoerner 0bda245
upgrade ruff / black, reformat all
alwayslove2013 2884761
change lancedb vector type to float32
ZebinRen af8cc1a
add num_shards to MilvusConfig.to_dict()
ZebinRen 4d6d8df
expose lancedb index parameters to the cli interface (#537)
ZebinRen 40da33c
Add parameters of aws opensearch, support hnsw engine options, suppor…
norrishuang 34b3b25
Add OceanBase Database Support to VectorDBBench (#540)
wyfanxiao ef25859
VectorDBBench 1.0 (#543)
alwayslove2013 22fe3ed
generate leaderboard_v2 data
alwayslove2013 9e8003d
update some docs
alwayslove2013 9edb001
fix bug: set default num_shards to 1
alwayslove2013 2447d7f
update elastic_cloud results
alwayslove2013 8b464f5
Fix: Correct typos in README.md (#550)
triplechecker-com 2682e9a
fix bugs: remove None from download_files
alwayslove2013 d70c751
upgrade aliyun opensearch client (#552)
hust-xing adcef10
add ivf rabitq for command #553 (#554)
MageChiu 00ad2ec
Fixed the issue where the welcome page image could not be loaded. (#556)
zhuwenxing aa9ff4b
Fix return to result page error (#557)
zhuwenxing 099d404
Fix run tidb will return error (#559)
JaySon-Huang e5965fe
feat: Add OSS OpenSearch client support (#562)
akhilpathivada 2a4d0ef
feat: test client for aws s3 vectors
alwayslove2013 d342beb
upgrade black/ruff and fix lint issues
alwayslove2013 73d1560
s3vectors standard test results
alwayslove2013 dc3efcd
add int filter (#545)
Caroline-an777 caa0979
fix(oss-opensearch): Resolve streaming crashes and improve code relia…
akhilpathivada d6d1693
feat: Add P95 latency metrics alongside P99 system-wide (#573)
akhilpathivada 3d135a4
update the max level from 3 to 10 for zillizcloud
alwayslove2013 21dd5d0
Add S3vector Engine for AWS OpenSearch, fixed some bugs of AWS OpenSe…
norrishuang ea3af3e
Fix missing assets (#575)
emmanuel-ferdman 1a4023a
Add nbits parameter to IVF_PQ index and adapt new filter logic (#576)
wyfanxiao f253a41
Added support for Product Quantization in pg_diskann (#579)
wahajali 4e01e4a
update leaderboard: add s3vectors results; add streaming results
alwayslove2013 df52ec8
update leaderboard data: use 90p search stage results as streaming pe…
alwayslove2013 44fd93e
fix(rate_runner): ensure thread safety for pgvector in concurrent ins…
alwayslove2013 0a9f309
add env parameters of dataset download from AWS S3 Or Aliyun OSS (#583)
norrishuang c12fe70
optimize:milvus add replica-number parameter (#588)
liyunqiu666 b6c3088
Added Alibaba Cloud Hologres support to VectorDBBench. (#591)
xiaolanlianhua 80dcfb7
fix: fix the arguments missing for `LanceDB::insert_embeddings` (#592)
TheR1sing3un 800d7d0
feat: add EnVector support to VectorDBBench
782e5fe
feat: add EnVector support to VectorDBBench
0c413da
add ivf flat
euphoria0-0 0e9ce3e
add ivf flat
euphoria0-0 53ec88d
fix readme
euphoria0-0 94b0b7d
fix readme
euphoria0-0 ce3a636
fix readme
euphoria0-0 cc11163
fix readme
euphoria0-0 e619faf
add num_per_batch env var
euphoria0-0 4487d98
add num_per_batch env var
euphoria0-0 3120c24
support hyper params
euphoria0-0 bcdddf3
support hyper params
euphoria0-0 0819eea
WIP: fix pickle error
euphoria0-0 6fb69bf
WIP: fix pickle error
euphoria0-0 dec9d75
fix missing ivf-flat configs
euphoria0-0 3bf28db
fix missing ivf-flat configs
euphoria0-0 7f5f432
create_index when no index
euphoria0-0 c4041be
create_index when no index
euphoria0-0 92cbf7d
fix batch size
euphoria0-0 4bee9d6
fix batch size
euphoria0-0 20f49bd
Update README.md
euphoria0-0 2a2e921
Update README.md
euphoria0-0 e9744bf
Update .env.example
euphoria0-0 faf964d
Update .env.example
euphoria0-0 5cb4f95
add mm mode
euphoria0-0 2e83651
add mm mode
euphoria0-0 79091d7
Merge branch 'fix/config' of github.com:CryptoLabInc/VectorDBBench in…
euphoria0-0 7c2a91e
Merge branch 'fix/config' of github.com:CryptoLabInc/VectorDBBench in…
euphoria0-0 4fca725
cli add supported eval mode list
euphoria0-0 f1b21a7
cli add supported eval mode list
euphoria0-0 8c1a319
rm indextype
euphoria0-0 9c745e2
rm indextype
euphoria0-0 b65af1a
fix choices type
euphoria0-0 d88c988
fix choices type
euphoria0-0 9afe622
fix readme && fix NUM_PER_BATCH
euphoria0-0 5d68a00
fix readme && fix NUM_PER_BATCH
euphoria0-0 601fe0e
add trained_centroids
euphoria0-0 2be382d
add trained_centroids
euphoria0-0 348d630
update
euphoria0-0 1d9371b
update
euphoria0-0 505f6e2
revert
euphoria0-0 34bdc41
revert
euphoria0-0 756d997
add centroids path
euphoria0-0 ba9ad94
add centroids path
euphoria0-0 989d277
fix readme
euphoria0-0 8f767e3
fix readme
euphoria0-0 e7b0a7a
add example result in readme
euphoria0-0 86a9f65
add example result in readme
euphoria0-0 8411937
WIP: fix lock error
euphoria0-0 ef85613
WIP: fix lock error
euphoria0-0 417a748
fix batch size
euphoria0-0 49cd0b4
fix batch size
euphoria0-0 edd6747
fix num per batch env var
euphoria0-0 ab75253
fix num per batch env var
euphoria0-0 99803c4
fix config
euphoria0-0 a154a06
fix config
euphoria0-0 7df4c0d
support virtual cluster tree
euphoria0-0 7c1bc7f
fix config
euphoria0-0 b7bb8ca
add benchmark script
euphoria0-0 e157710
fix log
euphoria0-0 698d9c2
fix benchmark script
euphoria0-0 a70f71c
fix numpy sequence stack
euphoria0-0 e153442
add type config in benchmark
euphoria0-0 3e85e98
add logs
euphoria0-0 337b0ca
rm copy state
euphoria0-0 aa6a5b1
fix eliminated
euphoria0-0 2e275fa
fix eliminated
euphoria0-0 2d37643
fix np take
euphoria0-0 a7cbd4f
fix np take
euphoria0-0 76d7e64
fix insert metadata
euphoria0-0 3ae5f92
fix benchmark script
euphoria0-0 0cabd0c
fix overwrite nprobe
euphoria0-0 4615ba1
fix
euphoria0-0 b962650
fix get tree
euphoria0-0 7c7a857
fix centroid file path
euphoria0-0 9e84cd3
add debug log for centroid list
euphoria0-0 8b37bf0
fix to delete the index instead of all indexes when drop_old
euphoria0-0 134fa04
fix drop_index
euphoria0-0 7f9d4aa
update vct as dfs version
euphoria0-0 4feb093
fix
euphoria0-0 66fa44b
rename centroids as centroids_path
euphoria0-0 228e91c
rename centroids as centroids_path
euphoria0-0 6d19492
add envector config file
euphoria0-0 37bfe16
fix
euphoria0-0 4f006b9
add neighbors file
euphoria0-0 4cd059a
fix
euphoria0-0 4c0870f
fix
euphoria0-0 1eacc4a
rename
euphoria0-0 5ff4dd8
fix
euphoria0-0 7814d0a
fix comment
euphoria0-0 44cd1ec
rm unused commented
euphoria0-0 440ff55
fix
euphoria0-0 1935c65
update
euphoria0-0 62db666
fix
euphoria0-0 b84ccaa
update vct
euphoria0-0 8cb2cfb
fix comments
euphoria0-0 810c875
fix vct opt
euphoria0-0 5f035fb
fix readme
euphoria0-0 8c3b06b
fix
euphoria0-0 c36621e
fix
euphoria0-0 9f71c13
fix preprare datset
euphoria0-0 6b5ae17
fix preprare datset
euphoria0-0 1aee4b4
fix
euphoria0-0 b9bd9fd
fix log
euphoria0-0 71da08a
rm req
euphoria0-0 ace9835
Merge branch 'fix/config' into feat/add-vct-pubmed-centroids
euphoria0-0 0b7a3f8
add centroid download
euphoria0-0 36e92ef
fix readme
euphoria0-0 0cf66ad
add comments
euphoria0-0 880cb49
activate comments
euphoria0-0 8ffa0da
add envector readme link
euphoria0-0 4eaa3f1
add bloomberg
euphoria0-0 b494caa
add options
euphoria0-0 8444440
fix readme
euphoria0-0 76ee6df
update readme
euphoria0-0 4884064
Merge pull request #3 from CryptoLabInc/feat/add-vct-pubmed-centroids
euphoria0-0 16d417b
Update vectordb_bench/backend/clients/envector/config.py
euphoria0-0

Files changed

File: README_ENVECTOR.md (new file, 132 lines)

# enVector with ANN (GAS) in VectorDBBench

This guide demonstrates how to use enVector with an ANN index in VectorDBBench.

Basic usage of enVector with VectorDBBench follows the standard procedure for [VectorDBBench](https://github.com/zilliztech/VectorDBBench).

## Structure

```bash
.
├── centroids
│   └── embeddinggemma-300m
│       ├── centroids.npy              # centroids file for ANN
│       └── tree_info.pkl              # tree metadata for ANN
├── dataset
│   └── pubmed768d400k                 # VectorDB ANN benchmark dataset
│       ├── neighbors.parquet
│       ├── test.parquet
│       └── train.parquet
├── README_ENVECTOR.md
└── scripts
    ├── run_benchmark.sh               # benchmark script
    ├── envector_pubmed_config.yml     # benchmark config file
    └── prepare_dataset.py             # download and prepare ground-truth neighbors for the dataset
```

## Prerequisites

### Install Python Dependencies

```bash
# 1. Create your environment
python -m venv .venv
source .venv/bin/activate

# 2. Install VectorDBBench
pip install -e .

# 3. Install es2
pip install es2==1.2.0a4
```
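
You can optionally confirm the pinned client version installed correctly:

```bash
pip show es2  # should report Version: 1.2.0a4
```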

### Prepare dataset

Prepare the artifacts for the ANN benchmark with `scripts/prepare_dataset.py`, which will:

- download the dataset from HuggingFace
- prepare the ground-truth neighbors
- download the centroids and tree metadata for the GAS index corresponding to the embedding model

For the ANN benchmark, we provide two datasets via HuggingFace:

- PUBMED768D400K: [cryptolab-playground/pubmed-arxiv-abstract-embedding-gemma-300m](https://huggingface.co/datasets/cryptolab-playground/pubmed-arxiv-abstract-embedding-gemma-300m)
- BLOOMBERG768D368K: [cryptolab-playground/Bloomberg-Financial-News-embedding-gemma-300m](https://huggingface.co/datasets/cryptolab-playground/Bloomberg-Financial-News-embedding-gemma-300m)
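
As a quick sanity check, you can load a dataset and inspect its layout first (a minimal sketch; it assumes the `emb` embedding column that `scripts/prepare_dataset.py` reads):

```python
from datasets import load_dataset
import numpy as np

# Expect "train" and "test" splits.
ds = load_dataset("cryptolab-playground/pubmed-arxiv-abstract-embedding-gemma-300m")
print(ds)

# Each row carries a 768-dimensional embedding in the "emb" column.
emb = np.asarray(ds["train"][0]["emb"], dtype="float32")
print(emb.shape)  # expected: (768,)
```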

We also provide centroids and tree metadata for the embedding model used in the ANN benchmark:

- GAS Centroids: [cryptolab-playground/gas-centroids](https://huggingface.co/datasets/cryptolab-playground/gas-centroids)
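
Once downloaded, the artifacts can be inspected directly (a sketch; the expected centroid count matches the `nlist: 32768` used in the configs below, which is an assumption about how these centroids were built):

```python
import pickle
import numpy as np

# Centroids are a plain NumPy array of shape (num_centroids, dim).
centroids = np.load("./centroids/embeddinggemma-300m/centroids.npy")
print(centroids.shape)  # expected: (32768, 768)

# Tree metadata for the VCT index; its structure is specific to GAS.
with open("./centroids/embeddinggemma-300m/tree_info.pkl", "rb") as f:
    tree_info = pickle.load(f)
print(type(tree_info))
```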

To prepare a dataset, run, for example:

```bash
# Prepare dataset
python ./scripts/prepare_dataset.py \
    -d cryptolab-playground/pubmed-arxiv-abstract-embedding-gemma-300m \
    -e embeddinggemma-300m
```

Then you can find the following generated files:

```bash
.
├── centroids
│   └── embeddinggemma-300m
│       ├── centroids.npy
│       └── tree_info.pkl
└── dataset
    └── pubmed768d400k
        ├── neighbors.parquet
        ├── test.parquet
        └── train.parquet
```

### Prepare enVector Server

To run an enVector server with ANN, refer to the [enVector Deployment repository](https://github.com/CryptoLabInc/envector-deployment). For example, you can start the server with the following commands:

```bash
# Start enVector server
git clone https://github.com/CryptoLabInc/envector-deployment
cd envector-deployment/docker-compose
./start_envector.sh
```

We provide four enVector Docker images:

- `cryptolabinc/es2e:v1.2.0-alpha.4`
- `cryptolabinc/es2b:v1.2.0-alpha.4`
- `cryptolabinc/es2o:v1.2.0-alpha.4`
- `cryptolabinc/es2c:v1.2.0-alpha.4`
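
For example, to pull one of the images ahead of time:

```bash
docker pull cryptolabinc/es2e:v1.2.0-alpha.4
```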

### Set Environment Variables

```bash
# Set environment variables
export DATASET_LOCAL_DIR="./dataset"
export NUM_PER_BATCH=4096
```

## Run Benchmark

Refer to `./scripts/run_benchmark.sh` or `./scripts/envector_benchmark_config.yml` for benchmarking enVector with ANN (VCT), or use the following command:

```bash
export NUM_PER_BATCH=500000  # set to the database size for efficiency with IVF_FLAT
python -m vectordb_bench.cli.vectordbbench envectorivfflat \
    --uri "localhost:50050" \
    --eval-mode mm \
    --case-type PerformanceCustomDataset \
    --db-label "PUBMED768D400K-IVF" \
    --custom-case-name PUBMED768D400K \
    --custom-dataset-name PUBMED768D400K \
    --custom-dataset-dir "" \
    --custom-dataset-size 400335 \
    --custom-dataset-dim 768 \
    --custom-dataset-file-count 1 \
    --custom-dataset-with-gt \
    --skip-custom-dataset-use-shuffled \
    --train-centroids True \
    --is-vct True \
    --centroids-path "./centroids/embeddinggemma-300m/centroids.npy" \
    --vct-path "./centroids/embeddinggemma-300m/tree_info.pkl" \
    --nlist 32768 \
    --nprobe 6
```

File: benchmark config for the BLOOMBERG768D368K dataset (new file, 42 lines)

```yaml
envectorflat:
  uri: localhost:50050
  eval_mode: mm
  case_type: PerformanceCustomDataset
  db_label: BLOOMBERG768D368K-FLAT
  custom_case_name: BLOOMBERG768D368K
  custom_case_description: BLOOMBERG768D368K benchmark (768D, 368K vectors)
  custom_dataset_name: BLOOMBERG768D368K
  custom_dataset_dir:
  custom_dataset_size: 368816
  custom_dataset_dim: 768
  custom_dataset_file_count: 1
  custom_dataset_use_shuffled: false
  custom_dataset_with_gt: true
  k: 10
  drop_old: true
  load: true

envectorivfflat:
  uri: localhost:50050
  eval_mode: mm
  case_type: PerformanceCustomDataset
  db_label: BLOOMBERG768D368K-IVF
  custom_case_name: BLOOMBERG768D368K
  custom_case_description: BLOOMBERG768D368K benchmark (768D, 368K vectors)
  custom_dataset_name: BLOOMBERG768D368K
  custom_dataset_dir:
  custom_dataset_size: 368816
  custom_dataset_dim: 768
  custom_dataset_file_count: 1
  custom_dataset_use_shuffled: false
  custom_dataset_with_gt: true
  k: 10
  nlist: 32768
  nprobe: 6
  train_centroids: true
  is_vct: true
  centroids_path: centroids/embeddinggemma-300m/centroids.npy
  vct_path: centroids/embeddinggemma-300m/tree_info.pkl
  drop_old: true
  load: true
```

File: scripts/envector_pubmed_config.yml (new file, 42 lines)

```yaml
envectorflat:
  uri: localhost:50050
  eval_mode: mm
  case_type: PerformanceCustomDataset
  db_label: PUBMED768D400K-FLAT
  custom_case_name: PUBMED768D400K
  custom_case_description: PUBMED768D400K benchmark (768D, 400K vectors)
  custom_dataset_name: PUBMED768D400K
  custom_dataset_dir:
  custom_dataset_size: 400335
  custom_dataset_dim: 768
  custom_dataset_file_count: 1
  custom_dataset_use_shuffled: false
  custom_dataset_with_gt: true
  k: 10
  drop_old: true
  load: true

envectorivfflat:
  uri: localhost:50050
  eval_mode: mm
  case_type: PerformanceCustomDataset
  db_label: PUBMED768D400K-IVF
  custom_case_name: PUBMED768D400K
  custom_case_description: PUBMED768D400K benchmark (768D, 400K vectors)
  custom_dataset_name: PUBMED768D400K
  custom_dataset_dir:
  custom_dataset_size: 400335
  custom_dataset_dim: 768
  custom_dataset_file_count: 1
  custom_dataset_use_shuffled: false
  custom_dataset_with_gt: true
  k: 10
  nlist: 32768
  nprobe: 6
  train_centroids: true
  is_vct: true
  centroids_path: centroids/embeddinggemma-300m/centroids.npy
  vct_path: centroids/embeddinggemma-300m/tree_info.pkl
  drop_old: true
  load: true
```
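
To sanity-check a config before a run, you can confirm it parses and echo the index parameters (a minimal sketch using PyYAML):

```python
import yaml  # pip install pyyaml

with open("scripts/envector_pubmed_config.yml") as f:
    cfg = yaml.safe_load(f)

ivf = cfg["envectorivfflat"]
print(ivf["nlist"], ivf["nprobe"])  # expected: 32768 6
```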

File: scripts/prepare_dataset.py (new file, 114 lines)

```python
import argparse
import os

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import wget
from datasets import load_dataset

import faiss


def get_args():
    parser = argparse.ArgumentParser(
        description="Prepare dataset and ground truth neighbors for benchmarking."
    )
    parser.add_argument(
        "-d", "--dataset-name",
        type=str,
        default="cryptolab-playground/pubmed-arxiv-abstract-embedding-gemma-300m",
        help="Huggingface dataset name to download.",
        choices=[
            "cryptolab-playground/pubmed-arxiv-abstract-embedding-gemma-300m",
            "cryptolab-playground/Bloomberg-Financial-News-embedding-gemma-300m",
        ],
    )
    parser.add_argument(
        "--dataset-dir",
        type=str,
        default="./dataset/pubmed768d400k",
        help="Dataset directory to save the dataset and neighbors.",
    )
    parser.add_argument(
        "-e", "--embedding-model",
        type=str,
        default="embeddinggemma-300m",
        help="Embedding model name to download centroids for.",
    )
    parser.add_argument(
        "--centroids-dir",
        type=str,
        default="./centroids",
        help="Directory to save the centroids and tree info.",
    )
    return parser.parse_args()


def download_dataset(
    dataset_name: str,
    output_dir: str = "./dataset/pubmed768d400k",
) -> None:
    """Download dataset from Huggingface and save as Parquet files."""
    # load dataset
    ds = load_dataset(dataset_name)
    train = ds["train"].to_pandas()
    test = ds["test"].to_pandas()

    # write to parquet
    train_table = pa.Table.from_pandas(train)
    pq.write_table(train_table, f"{output_dir}/train.parquet")

    test_table = pa.Table.from_pandas(test)
    pq.write_table(test_table, f"{output_dir}/test.parquet")


def prepare_neighbors(
    data_dir: str = "./dataset/pubmed768d400k",
) -> None:
    """Prepare ground truth neighbors using brute-force flat search and save as Parquet."""
    # load dataset
    train = pd.read_parquet(f"{data_dir}/train.parquet")
    test = pd.read_parquet(f"{data_dir}/test.parquet")

    train = np.stack(train["emb"].to_list()).astype("float32")
    test = np.stack(test["emb"].to_list()).astype("float32")
    dim = train.shape[1]

    # flat search
    index = faiss.IndexFlatIP(dim)
    index.add(train)
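
    # NOTE: IndexFlatIP ranks by raw inner product; this matches cosine
    # similarity only if the embeddings are unit-normalized (assumed here,
    # not enforced).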

    k = len(test)
    distances, indices = index.search(test, k)
    print(distances.shape, indices.shape)

    # save flat search result as neighbors
    df = pd.DataFrame({
        "id": np.arange(len(indices)),
        "neighbors_id": indices.tolist(),
    })

    table = pa.Table.from_pandas(df)
    pq.write_table(table, f"{data_dir}/neighbors.parquet")


def download_centroids(embedding_model: str, dataset_dir: str) -> None:
    """Download pre-computed centroids and tree info for GAS VCT index."""
    if embedding_model != "embeddinggemma-300m":
        raise ValueError(f"Centroids for {embedding_model} currently not available.")

    # https://huggingface.co/datasets/cryptolab-playground/gas-centroids
    dataset_link = f"https://huggingface.co/datasets/cryptolab-playground/gas-centroids/resolve/main/{embedding_model}"

    # download
    os.makedirs(os.path.join(dataset_dir, embedding_model), exist_ok=True)
    wget.download(f"{dataset_link}/centroids.npy", out=os.path.join(dataset_dir, embedding_model, "centroids.npy"))
    wget.download(f"{dataset_link}/tree_info.pkl", out=os.path.join(dataset_dir, embedding_model, "tree_info.pkl"))


if __name__ == "__main__":
    args = get_args()
    os.makedirs(args.dataset_dir, exist_ok=True)

    download_dataset(args.dataset_name, args.dataset_dir)
    prepare_neighbors(args.dataset_dir)
    download_centroids(args.embedding_model, args.centroids_dir)
```