diff --git a/README.md b/README.md
index a6784fabd..0a99c28ed 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# enVector with ANN (GAS) in VectorDBBench
+# enVector in VectorDBBench
 
 The guide on how to use enVector with ANN index in VectorDBBench is available in [README_ENVECTOR.md](README_ENVECTOR.md).
 
@@ -19,7 +19,7 @@ To add more relevance and practicality, we provide cost-effectiveness reports pa
 Closely mimicking real-world production environments, we've set up diverse testing scenarios including insertion, searching, and filtered searching. To provide you with credible and reliable data, we've included public datasets from actual production scenarios, such as [SIFT](http://corpus-texmex.irisa.fr/), [GIST](http://corpus-texmex.irisa.fr/), [Cohere](https://huggingface.co/datasets/Cohere/wikipedia-22-12/tree/main/en), and a dataset generated by OpenAI from an opensource [raw dataset](https://huggingface.co/datasets/allenai/c4). It's fascinating to discover how a relatively unknown open-source database might excel in certain circumstances!
 
-Prepare to delve into the world of VDBBench, and let it guide you in uncovering your perfect vector database match.
+Prepare to delve into the world of VDBBench, and let it guide you in uncovering your perfect vector database match. VDBBench is sponsored by Zilliz, the leading open-source vector database company behind Milvus.
 
 Choose smarter with VDBBench - start your free test on [zilliz cloud](https://zilliz.com/) today!
 
@@ -38,35 +38,35 @@ pip install vectordb-bench
 **Install all database clients**
 ``` shell
-pip install 'vectordb-bench[all]'
+pip install vectordb-bench[all]
 ```
 
 **Install the specific database client**
 ```shell
-pip install 'vectordb-bench[pinecone]'
+pip install vectordb-bench[pinecone]
 ```
 All the database client supported
 
 | Optional database client | install command |
 |--------------------------|---------------------------------------------|
 | pymilvus, zilliz_cloud (*default*) | `pip install vectordb-bench` |
-| all (*clients requirements might be conflict with each other*) | `pip install 'vectordb-bench[all]'` |
-| qdrant | `pip install 'vectordb-bench[qdrant]'` |
-| pinecone | `pip install 'vectordb-bench[pinecone]'` |
-| weaviate | `pip install 'vectordb-bench[weaviate]'` |
-| elastic, aliyun_elasticsearch| `pip install 'vectordb-bench[elastic]'` |
-| pgvector, pgvectorscale, pgdiskann, alloydb | `pip install 'vectordb-bench[pgvector]'` |
-| pgvecto.rs | `pip install 'vectordb-bench[pgvecto_rs]'` |
-| redis | `pip install 'vectordb-bench[redis]'` |
-| memorydb | `pip install 'vectordb-bench[memorydb]'` |
-| chromadb | `pip install 'vectordb-bench[chromadb]'` |
-| awsopensearch | `pip install 'vectordb-bench[opensearch]'` |
-| aliyun_opensearch | `pip install 'vectordb-bench[aliyun_opensearch]'` |
-| mongodb | `pip install 'vectordb-bench[mongodb]'` |
-| tidb | `pip install 'vectordb-bench[tidb]'` |
-| vespa | `pip install 'vectordb-bench[vespa]'` |
-| oceanbase | `pip install 'vectordb-bench[oceanbase]'` |
-| hologres | `pip install 'vectordb-bench[hologres]'` |
+| all (*client requirements might conflict with each other*) | `pip install vectordb-bench[all]` |
+| qdrant | `pip install vectordb-bench[qdrant]` |
+| pinecone | `pip install vectordb-bench[pinecone]` |
+| weaviate | `pip install vectordb-bench[weaviate]` |
+| elastic, aliyun_elasticsearch| `pip install vectordb-bench[elastic]` |
+| pgvector, pgvectorscale, pgdiskann, alloydb | `pip install vectordb-bench[pgvector]` |
+| pgvecto.rs | `pip install 
vectordb-bench[pgvecto_rs]` | +| redis | `pip install vectordb-bench[redis]` | +| memorydb | `pip install vectordb-bench[memorydb]` | +| chromadb | `pip install vectordb-bench[chromadb]` | +| awsopensearch | `pip install vectordb-bench[opensearch]` | +| aliyun_opensearch | `pip install vectordb-bench[aliyun_opensearch]` | +| mongodb | `pip install vectordb-bench[mongodb]` | +| tidb | `pip install vectordb-bench[tidb]` | +| vespa | `pip install vectordb-bench[vespa]` | +| oceanbase | `pip install vectordb-bench[oceanbase]` | +| hologres | `pip install vectordb-bench[hologres]` | ### Run @@ -198,7 +198,7 @@ Options: --number-of-shards INTEGER Number of primary shards for the index --number-of-replicas INTEGER Number of replica copies for each primary shard - # Indexing Performance + # Indexing Performance --index-thread-qty INTEGER Thread count for native engine indexing --index-thread-qty-during-force-merge INTEGER Thread count during force merge operations @@ -214,13 +214,9 @@ Options: --engine TEXT type of engine to use valid values [faiss, lucene, s3vector] # Memory Management --cb-threshold TEXT k-NN Memory circuit breaker threshold - - --ondisk Ondisk mode with binary quantization(32x compression) - --oversample-factor Controls the degree of oversampling applied to minority classes in imbalanced datasets to improve model performance by balancing class distributions.(default 1.0) - # Quantization Type - --quantization-type TEXT which type of quantization to use valid values [fp32, fp16, bq] + --quantization-type TEXT which type of quantization to use valid values [fp32, fp16] --help Show this message and exit. ``` ### Run OceanBase from command line @@ -292,18 +288,12 @@ Options: ### Run Hologres from command line -It is recommended to use the following code for installation. -```shell -pip install 'vectordb-bench[hologres]' 'psycopg[binary]' pgvector -``` - Execute tests for the index types: HGraph. ```shell -NUM_PER_BATCH=10000 vectordbbench hologreshgraph --host Hologres_Endpoint --port 80 \ ---user ACCESS_ID --password ACCESS_KEY --database DATABASE_NAME \ ---m 64 --ef-construction 400 --case-type Performance768D10M \ ---index-type HGraph --ef-search 400 --k 10 --num-concurrency 1,60,70,75,80,90,95,100,110,120 +vectordbbench hologreshgraph --host xxx --port xxx --user ACCESS_ID --password ACCESS_KEY --database test \ +--m 64 --ef-construction 400 --case-type Performance768D1M \ +--index-type HGraph --ef-search 51 --k 10 ``` To list the options for Hologres, execute `vectordbbench hologreshgraph --help`, The following are some Hologres-specific command-line options. @@ -331,8 +321,8 @@ Options: The vectordbbench command can optionally read some or all the options from a yaml formatted configuration file. -By default, configuration files are expected to be in vectordb_bench/config-files/, this can be overridden by setting -the environment variable CONFIG_LOCAL_DIR or by passing the full path to the file. +By default, configuration files are expected to be in vectordb_bench/config-files/, this can be overridden by setting +the environment variable CONFIG_LOCAL_DIR or by passing the full path to the file. 
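+
+For example, if you keep your own yaml configuration files outside the installed package, you can point the tool at that directory before running it. This is only a sketch; the directory path below is illustrative:
+
+```shell
+# Read yaml configuration files from a custom directory instead of vectordb_bench/config-files/
+export CONFIG_LOCAL_DIR="$HOME/my-vdbbench-configs"
+```
+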
The required format is: ```yaml @@ -361,7 +351,7 @@ milvushnsw: drop_old: False load: False ``` -> Notes: +> Notes: > - Options passed on the command line will override the configuration file* > - Parameter names use an _ not - @@ -369,8 +359,8 @@ milvushnsw: The vectordbbench command can read a batch configuration file to run all the test cases in the yaml formatted configuration file. -By default, configuration files are expected to be in vectordb_bench/config-files/, this can be overridden by setting -the environment variable CONFIG_LOCAL_DIR or by passing the full path to the file. +By default, configuration files are expected to be in vectordb_bench/config-files/, this can be overridden by setting +the environment variable CONFIG_LOCAL_DIR or by passing the full path to the file. The required format is: ```yaml @@ -399,7 +389,7 @@ milvushnsw: drop_old: False load: False ``` -> Notes: +> Notes: > - Options can only be passed through configuration files > - Parameter names use an _ not - @@ -414,11 +404,11 @@ To facilitate the presentation of test results and provide a comprehensive perfo ### Scoring Rules -1. For each case, select a base value and score each system based on relative values. - - For QPS and QP$, we use the highest value as the reference, denoted as `base_QPS` or `base_QP$`, and the score of each system is `(QPS/base_QPS) * 100` or `(QP$/base_QP$) * 100`. - - For Latency, we use the lowest value as the reference, that is, `base_Latency`, and the score of each system is `(base_Latency + 10ms)/(Latency + 10ms) * 100`. +1. For each case, select a base value and score each system based on relative values. + - For QPS and QP$, we use the highest value as the reference, denoted as `base_QPS` or `base_QP$`, and the score of each system is `(QPS/base_QPS) * 100` or `(QP$/base_QP$) * 100`. + - For Latency, we use the lowest value as the reference, that is, `base_Latency`, and the score of each system is `(base_Latency + 10ms)/(Latency + 10ms) * 100`. - We want to give equal weight to different cases, and not let a case with high absolute result values become the sole reason for the overall scoring. Therefore, when scoring different systems in each case, we need to use relative values. + We want to give equal weight to different cases, and not let a case with high absolute result values become the sole reason for the overall scoring. Therefore, when scoring different systems in each case, we need to use relative values. Also, for Latency, we add 10ms to the numerator and denominator to ensure that if every system performs particularly well in a case, its advantage will not be infinitely magnified when latency tends to 0. @@ -482,7 +472,7 @@ All standard benchmark results are generated by a client running on an 8 core, 3 1. Initially, you select the systems to be tested - multiple selections are allowed. Once selected, corresponding forms will pop up to gather necessary information for using the chosen databases. The db_label is used to differentiate different instances of the same system. We recommend filling in the host size or instance type here (as we do in our standard results). 2. The next step is to select the test cases you want to perform. You can select multiple cases at once, and a form to collect corresponding parameters will appear. 3. Finally, you'll need to provide a task label to distinguish different test results. Using the same label for different tests will result in the previous results being overwritten. -Now we can only run one task at the same time. 
+Now we can only run one task at the same time. ![image](vectordb_bench/fig/run_test_select_db.png) ![image](vectordb_bench/fig/run_test_select_case.png) ![image](vectordb_bench/fig/run_test_submit.png) @@ -523,11 +513,11 @@ We have strict requirements for the data set format, please follow them. - Vectors data files: The file must be named `train.parquet` and should have two columns: `id` as an incrementing `int` and `emb` as an array of `float32`. - Query test vectors: The file must be named `test.parquet` and should have two columns: `id` as an incrementing `int` and `emb` as an array of `float32`. - We recommend limiting the number of test query vectors, like 1,000. - When conducting concurrent query tests, Vdbbench creates a large number of processes. - To minimize additional communication overhead during testing, + When conducting concurrent query tests, Vdbbench creates a large number of processes. + To minimize additional communication overhead during testing, we prepare a complete set of test queries for each process, allowing them to run independently. - However, this means that as the number of concurrent processes increases, - the number of copied query vectors also increases significantly, + However, this means that as the number of concurrent processes increases, + the number of copied query vectors also increases significantly, which can place substantial pressure on memory resources. - Ground truth file: The file must be named `neighbors.parquet` and should have two columns: `id` corresponding to query vectors and `neighbors_id` as an array of `int`. @@ -557,10 +547,10 @@ VDBBench aims to provide a more comprehensive, multi-faceted testing environment **Step 2: Implement new_client.py and config.py** -1. Open new_client.py and define the NewClient class, which should inherit from the clients/api.py file's VectorDB abstract class. The VectorDB class serves as the API for benchmarking, and all DB clients must implement this abstract class. +1. Open new_client.py and define the NewClient class, which should inherit from the clients/api.py file's VectorDB abstract class. The VectorDB class serves as the API for benchmarking, and all DB clients must implement this abstract class. Example implementation in new_client.py: new_client.py -```python +```python from ..api import VectorDB class NewClient(VectorDB): # Implement the abstract methods defined in the VectorDB class @@ -589,7 +579,7 @@ class NewDBCaseConfig(DBCaseConfig): In this final step, you will import your DB client into clients/__init__.py and update the initialization process. 1. Open clients/__init__.py and import your NewClient from new_client.py. -2. Add your NewClient to the DB enum. +2. Add your NewClient to the DB enum. 3. Update the db2client dictionary by adding an entry for your NewClient. Example implementation in clients/__init__.py: @@ -687,14 +677,14 @@ def ZillizAutoIndex(**parameters: Unpack[ZillizTypedDict]): ) ``` 3. Update cli by adding: - 1. Add database specific options as an Annotated TypedDict, see ZillizTypedDict above. + 1. Add database specific options as an Annotated TypedDict, see ZillizTypedDict above. 2. Add index configuration specific options as an Annotated TypedDict. (example: vectordb_bench/backend/clients/pgvector/cli.py) 1. May not be needed if there is only one index config. - 2. Repeat for each index configuration, nesting them if possible. + 2. Repeat for each index configuration, nesting them if possible. 2. 
Add a index config specific function for each index type, see Zilliz above. The function name, in lowercase, will be the command name passed to the vectordbbench command.
   3. Update db_config and db_case_config to match client requirements
   4. Continue to add new functions for each index config.
-  5. Import the client cli module and command to vectordb_bench/cli/vectordbbench.py (for databases with multiple commands (index configs), this only needs to be done for one command) 
+  5. Import the client cli module and command to vectordb_bench/cli/vectordbbench.py (for databases with multiple commands (index configs), this only needs to be done for one command)
   6. Import the `get_custom_case_config` function from `vectordb_bench/cli/cli.py` and use it to add a new key `custom_case` to the `parameters` variable within the command.
@@ -712,7 +702,7 @@ For the system under test, we use the default server-side configuration to maint
 For the Client, we welcome any parameter tuning to obtain better results.
 ### Incomplete Results
 Many databases may not be able to complete all test cases due to issues such as Out of Memory (OOM), crashes, or timeouts. In these scenarios, we will clearly state these occurrences in the test results.
-### Mistake Or Misrepresentation 
+### Mistake Or Misrepresentation
 We strive for accuracy in learning and supporting various vector databases, yet there might be oversights or misapplications. For any such occurrences, feel free to [raise an issue](https://github.com/zilliztech/VectorDBBench/issues/new) or make amendments on our GitHub page.
 ## Timeout
 In our pursuit to ensure that our benchmark reflects the reality of a production environment while guaranteeing the practicality of the system, we have implemented a timeout plan based on our experiences for various tests.
diff --git a/README_ENVECTOR.md b/README_ENVECTOR.md
index 68bd785de..dfea67f56 100644
--- a/README_ENVECTOR.md
+++ b/README_ENVECTOR.md
@@ -1,4 +1,4 @@
-# enVector with ANN (GAS) in VectorDBBench
+# enVector with ANN (IVF-GAS) in VectorDBBench
 
 This guide demonstrates how to use enVector with an ANN index in VectorDBBench.
 
@@ -63,7 +63,7 @@ python ./scripts/prepare_dataset.py \
   -e embeddinggemma-300m
 ```
 
-Then, you can find the following generated files:
+Then, you can find the generated files as follows:
 
 ```bash
 .
@@ -104,16 +104,29 @@ export DATASET_LOCAL_DIR="./dataset"
 export NUM_PER_BATCH=4096
 ```
 
-## Run Benchmark
+## Run Our Benchmarks
 
-Refer to `./scripts/run_benchmark.sh` or `./scripts/envector_benchmark_config.yml` for benchmarks with enVector with ANN (VCT), or use the following command:
+We provide two benchmark datasets:
+- `PUBMED768D400K`
+- `BLOOMBERG768D368K`
+
+Run the provided shell script (`./scripts/run_benchmark.sh`) as follows:
 
 ```bash
-export NUM_PER_BATCH=500000 # set to the database size for efficiency with IVF_FLAT
+./scripts/run_benchmark.sh --type flat # FLAT
+./scripts/run_benchmark.sh --type ivf # IVF-FLAT with random centroids
+./scripts/run_benchmark.sh --type ivf-trained # IVF-FLAT with trained centroids (w/ k-means clustering, etc.)
+./scripts/run_benchmark.sh --type ivf-gas # IVF-FLAT with enVector-customized ANN (GAS)
+```
+
+For more details, refer to `run_benchmark.sh` or `envector_{benchmark}_config.yml` in the `scripts` directory for enVector benchmarks with ANN (VCT), or run the ivf-gas benchmark directly with the following command:
+
+```bash
+# ivf-gas: IVF-FLAT with our ANN (GAS)
 export NUM_PER_BATCH=500000 # set to the database size for efficiency
 python -m vectordb_bench.cli.vectordbbench envectorivfflat \
   --uri "localhost:50050" \
-  --eval-mode mm \
-  --case-type PerformanceCustomDataset \
+  --case-type "PerformanceCustomDataset" \
   --db-label "PUBMED768D400K-IVF" \
   --custom-case-name PUBMED768D400K \
   --custom-dataset-name PUBMED768D400K \
@@ -123,10 +136,65 @@ python -m vectordb_bench.cli.vectordbbench envectorivfflat \
   --custom-dataset-file-count 1 \
   --custom-dataset-with-gt \
   --skip-custom-dataset-use-shuffled \
+  --eval-mode mm \
   --train-centroids True \
   --is-vct True \
   --centroids-path "./centroids/embeddinggemma-300m/centroids.npy" \
   --vct-path "./centroids/embeddinggemma-300m/tree_info.pkl" \
   --nlist 32768 \
   --nprobe 6
-```
\ No newline at end of file
+```
+
+### Run VectorDBBench Cases
+
+```bash
+# flat
+python -m vectordb_bench.cli.vectordbbench envectorflat \
+  --uri "localhost:50050" \
+  --case-type "Performance1536D500K" \
+  --db-label "Performance1536D500K-FLAT"
+
+# ivf: IVF-FLAT with random centroids
+export NUM_PER_BATCH=500000 # set database size for efficiency
+python -m vectordb_bench.cli.vectordbbench envectorivfflat \
+  --uri "localhost:50050" \
+  --case-type "Performance1536D500K" \
+  --db-label "Performance1536D500K-IVF-FLAT" \
+  --nlist 250 \
+  --nprobe 6
+
+# ivf-trained: IVF-FLAT with trained centroids via k-means
+export NUM_PER_BATCH=500000 # set to the database size for efficiency
+python -m vectordb_bench.cli.vectordbbench envectorivfflat \
+  --uri "localhost:50050" \
+  --case-type "Performance1536D500K" \
+  --db-label "Performance1536D500K-IVF-FLAT" \
+  --train-centroids True \
+  --centroids-path "./centroids/kmeans_centroids.npy" \
+  --nlist 250 \
+  --nprobe 6
+```
+
+Note that the standard cases provided by VectorDBBench, including Performance1536D500K, use an **unknown** embedding model (only described as OpenAI's), so we cannot apply our IVF-GAS approach for ANN to them.
+
+### CLI Options
+
+enVector index types for VectorDBBench
+- `envectorflat`: FLAT as the index type for enVector
+- `envectorivfflat`: IVF_FLAT as the index type for enVector
+
+Common Options for enVector
+- `--uri`: enVector server URI
+- `--eval-mode`: FHE evaluation mode on the server. Use `mm` for enhanced performance.
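+
+As a quick illustration of the common options, a minimal FLAT run against a local enVector server could look like the sketch below (the URI and `--db-label` value are placeholders); the ANN-specific options follow in the next list:
+
+```bash
+# Sketch: FLAT index on a local enVector server, using the mm evaluation mode
+python -m vectordb_bench.cli.vectordbbench envectorflat \
+  --uri "localhost:50050" \
+  --eval-mode mm \
+  --case-type "Performance1536D500K" \
+  --db-label "Performance1536D500K-FLAT-mm"
+```
+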
+ +ANN Options for enVector +- `--nlist`: Number of coarse clusters for IVF_FLAT +- `--nprobe`: Number of clusters to scan during search for IVF_FLAT +- `--train-centroids`: whether to use trained centroids for IVF_FLAT +- `--centroids-path`: path to the trained centroids +- `--is-vct`: whether to use VCT approach for IVF_GAS +- `--vct-path`: path to the trained VCT metadata for IVF_GAS + +Benchmark Options: + follows conventions of VectorDBBench, + see details in [VectorDBBench Options](https://github.com/zilliztech/VectorDBBench?tab=readme-ov-file#custom-dataset-for-performance-case) \ No newline at end of file diff --git a/scripts/run_benchmark.sh b/scripts/run_benchmark.sh index 37037df90..3b2a6aba4 100755 --- a/scripts/run_benchmark.sh +++ b/scripts/run_benchmark.sh @@ -5,9 +5,14 @@ set -euo pipefail export DATASET_LOCAL_DIR="./dataset" export NUM_PER_BATCH=4096 +CASE_TYPE="PerformanceCustomDataset" +DATASET_NAME="PUBMED768D400K" CENTROID_PATH=centroids/embeddinggemma-300m/centroids.npy VCT_PATH=centroids/embeddinggemma-300m/tree_info.pkl ENVECTOR_URI="localhost:50050" +NLIST=32768 +NPROBE=6 + REQUESTED_TYPE="" while [[ $# -gt 0 ]]; do @@ -28,18 +33,18 @@ while [[ $# -gt 0 ]]; do done case "$REQUESTED_TYPE" in - ""|flat|ivf) ;; + ""|flat|ivf|ivf-trained|ivf-gas) ;; *) - echo "Invalid --type: $REQUESTED_TYPE (expected: flat or ivf)" >&2 + echo "Invalid --type: $REQUESTED_TYPE (expected: flat / ivf / ivf-trained / ivf-gas)" >&2 exit 1 ;; esac COMMON_ARGS=( --uri "$ENVECTOR_URI" --eval-mode mm - --case-type PerformanceCustomDataset - --custom-case-name PUBMED768D400K - --custom-dataset-name PUBMED768D400K + --case-type "$CASE_TYPE" + --custom-case-name "$DATASET_NAME" + --custom-dataset-name "$DATASET_NAME" --custom-dataset-dir "" --custom-dataset-size 400335 --custom-dataset-dim 768 @@ -60,16 +65,32 @@ run_case() { } if [[ -z "$REQUESTED_TYPE" || "$REQUESTED_TYPE" == "flat" ]]; then - run_case envectorflat "PUBMED768D400K-FLAT" + run_case envectorflat "$DATASET_NAME-FLAT" fi if [[ -z "$REQUESTED_TYPE" || "$REQUESTED_TYPE" == "ivf" ]]; then export NUM_PER_BATCH=500000 # set database size for efficiency - run_case envectorivfflat "PUBMED768D400K-IVF" \ + run_case envectorivfflat "$DATASET_NAME-IVF-RANDOM" \ + --nlist "$NLIST" \ + --nprobe "$NPROBE" +fi + +if [[ -z "$REQUESTED_TYPE" || "$REQUESTED_TYPE" == "ivf-trained" ]]; then + export NUM_PER_BATCH=500000 # set database size for efficiency + run_case envectorivfflat "$DATASET_NAME-IVF-FLAT" \ + --train-centroids True \ + --centroids-path "$CENTROID_PATH" \ + --nlist "$NLIST" \ + --nprobe "$NPROBE" +fi + +if [[ -z "$REQUESTED_TYPE" || "$REQUESTED_TYPE" == "ivf-gas" ]]; then + export NUM_PER_BATCH=500000 # set database size for efficiency + run_case envectorivfflat "$DATASET_NAME-IVF-GAS" \ --is-vct True \ --train-centroids True \ --centroids-path "$CENTROID_PATH" \ --vct-path "$VCT_PATH" \ - --nlist 32768 \ - --nprobe 6 + --nlist "$NLIST" \ + --nprobe "$NPROBE" fi
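+
+# Usage notes (assumes the dataset and centroid files from ./scripts/prepare_dataset.py are already in place):
+#   ./scripts/run_benchmark.sh                  # no --type: runs flat, ivf, ivf-trained, and ivf-gas in sequence
+#   ./scripts/run_benchmark.sh --type ivf-gas   # runs only the IVF-GAS (VCT) variant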