
Support IVF-RaBitQ in cuVS Library #1866

Open
Stardust-SJF wants to merge 160 commits into rapidsai:main from Stardust-SJF:cuvs_ivf_rabitq

Conversation

@Stardust-SJF

This PR introduces IVF-RaBitQ, a GPU-native ANNS solution that integrates the cluster-based IVF method with RaBitQ quantization into an efficient GPU index build/search pipeline. It achieves a strong recall–throughput trade-off with fast index builds and a small storage footprint.

jamxia155 and others added 30 commits November 3, 2025 08:03
- Currently built as a separate library.
- To be merged with existing `cuvs_objs` library.
- Dependency on `Eigen` yet to be removed.
- RABITQ_BENCH_TEST for standalone testing; to be removed as integration work
is completed.
- CUVS_IVF_RABITQ_ANN_BENCH for benchmarking as part of ANN benchmarking suite
- `bits_per_dim` = `ex_bits` + 1
- Also update supported range of `bits_per_dim` to 2-9 inclusive
* Fix cuVS build issues with RaBitQ

* Align line formatting && Delete unused variables in robust_prune.cuh
* Download Eigen automatically by rapids-cmake

* Disable FAISS and DISKANN benchmarks

* add config files and update readme

* Update Readme and openai_1M config

* Update python bench command line

* update README

* update README

---------

Co-authored-by: James Xia <jamxia@nvidia.com>
- Error-checking
- Stream-ordered CUDA calls
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@tfeher
Contributor

tfeher commented Mar 25, 2026

/ok to test 52793fc

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Adds a complete IVF‑RaBitQ ANN implementation (public API, GPU index, quantizer, rotator, initializer, searcher pipelines, persistence), CMake wiring and benchmark integration, FAISS wrapper, tests, Python benchmark config, utilities, and documentation.

Changes

IVF‑RaBitQ Feature

  • Public API / Types (`cpp/include/cuvs/neighbors/ivf_rabitq.hpp`): adds public parameter structs, a `search_mode` enum, the `index<IdxT>` class template, and build/search/serialize/deserialize declarations and overloads.
  • Facade implementation (`cpp/src/neighbors/ivf_rabitq.cu`): implements the public build/search/serialize/deserialize and index methods; selects the memory/streaming strategy, runs k-means sampling/training, and delegates to the detail GPU index/search paths.
  • Internal types / constants (`cpp/src/neighbors/ivf_rabitq/defines.hpp`): adds internal constants and types (`FAST_SIZE`, `PID`, `Candidate`, `ExFactor`) used by the GPU index.
  • I/O / memory / utilities (`cpp/src/neighbors/ivf_rabitq/utils/*`): adds file I/O helpers (`IO.hpp`), aligned allocation (`memory.hpp`), a timer (`StopW.hpp`), space ops (`space.hpp`), and searcher helpers (`searcher_gpu_utils.*`).
  • GPU rotator (`cpp/src/neighbors/ivf_rabitq/gpu_index/rotator_gpu.{cu,cuh}`): adds `RotatorGPU`: padded rotation-matrix construction, load/save, and `rotate()` (GEMM) on the GPU.
  • GPU initializer (`cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.{cu,cuh}`): adds the `InitializerGPU` base and the concrete `FlatInitializerGPU` centroid storage with `GetCentroid`/`AddVectors`/`Load`/`Save` implementations.
  • GPU quantizer (`cpp/src/neighbors/ivf_rabitq/gpu_index/quantizer_gpu.{cu,cuh}`): implements `DataQuantizerGPU` with fused kernels for RaBitQ/ExRaBitQ: gather/pad/rotate, subtract/normalize/binarize, packing of short and long codes, factor computation, constant and per-query scaling computations, and buffer management.
  • IVF GPU core (`cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.{cu,cuh}`): introduces `IVFGPU`: device/host allocations, `load`/`save_transposed`, `construct`/`construct_on_gpu`/`construct_on_gpu_streaming`, cluster-metadata management, `quantize_cluster` integration, and batched search orchestration.
  • Searcher implementations (`cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu*.{cu,cuh}`): adds `SearcherGPU` and three search pipelines: a baseline LUT path, a shared-memory BF16-optimized path, and a quantized-query bitwise path (block-sort/direct), plus common kernel params and quantized-query kernels.
  • Kernel common helpers (`cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_common.cuh`): introduces shared kernel parameter structs and device helpers (e.g., `extract_code`) used by the searcher kernels.
  • Persistence / I/O helpers (`cpp/src/neighbors/ivf_rabitq/utils/IO.hpp`): adds file helpers to load raw vector files, plus file-existence and file-size helpers used by centroid/rotator loading and duplicated-row readers.
  • Build wiring / CMake (`cpp/CMakeLists.txt`, `cpp/bench/ann/CMakeLists.txt`, `cpp/tests/CMakeLists.txt`): adds a new static `ivf_rabitq` target and includes `ivf_rabitq.cu` in the cuvs objects; introduces CMake options for bench toggles and registers ANN bench/test targets referencing `ivf_rabitq`.
  • Benchmark wrappers and param parsing (`cpp/bench/ann/src/cuvs/*`, `cpp/bench/ann/src/faiss/*`): adds the `cuvs_ivf_rabitq` wrapper and explicit instantiation, the FAISS CPU `faiss_cpu_ivfrabitq` wrapper, parameter parsing for RaBitQ build/search params, and factory wiring to create algorithm/search-param instances.
  • Bench compilation unit (`cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq.cu`): provides the explicit template instantiation for `cuvs_ivf_rabitq<float, int64_t>`.
  • C++ tests (`cpp/tests/*`, `cpp/tests/neighbors/ann_ivf_rabitq.*`): adds a parameterized GoogleTest harness and TU exercising device/host builds, serialization/deserialization, forced streaming, and multiple bits/modes/probe counts, and registers the CMake test target.
  • Python benchmark config and tests (`python/cuvs_bench/.../algorithms.yaml`, `python/cuvs/cuvs/tests/test_compute_matrix_product.py`): adds RaBitQ algorithm entries to the bench config and a pytest module for matrix-product iteration behavior.
  • Documentation (`docs/source/cpp_api/neighbors.rst`, `docs/source/cpp_api/neighbors_ivf_rabitq.rst`): adds a Sphinx toctree entry and a new API page documenting `cuvs::neighbors::ivf_rabitq` with Doxygen group wiring.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 25.76%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check (✅ Passed): the title accurately and concisely describes the main feature being added: IVF-RaBitQ support in the cuVS library.
  • Description check (✅ Passed): the description clearly explains what IVF-RaBitQ is, its benefits, and its context within the cuVS library.
  • Linked Issues check (✅ Passed): skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai Bot left a comment


Actionable comments posted: 7

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

🟠 Major comments (16)
cpp/CMakeLists.txt-1070-1070 (1)

1070-1070: ⚠️ Potential issue | 🟠 Major

Add ivf_rabitq to the static cuvs_static target.

The shared cuvs target links ivf_rabitq, but cuvs_static does not. Since src/neighbors/ivf_rabitq.cu is included in cuvs_objs for static builds, calls into the helper target become unresolved for cuvs_static consumers.

Proposed fix
     target_link_libraries(
       cuvs_static
       INTERFACE $<COMPILE_ONLY:rmm::rmm>
       PUBLIC raft::raft
              cuvs::cuvs_cpp_headers
              ${CUVS_CTK_MATH_DEPENDENCIES}
              $<TARGET_NAME_IF_EXISTS:NCCL::NCCL>
              $<BUILD_LOCAL_INTERFACE:$<TARGET_NAME_IF_EXISTS:hnswlib::hnswlib>>
       PRIVATE rmm::rmm
               $<TARGET_NAME_IF_EXISTS:OpenMP::OpenMP_CXX>
               CUDA::nvJitLink
               CUDA::nvrtc
+              ivf_rabitq
               $<$<BOOL:${CUVS_NVTX}>:CUDA::nvtx3>
               $<COMPILE_ONLY:nvidia::cutlass::cutlass>
               $<COMPILE_ONLY:cuco::cuco>
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/CMakeLists.txt` at line 1070, The static target cuvs_static is missing
linkage to ivf_rabitq causing unresolved symbols from
src/neighbors/ivf_rabitq.cu; update the CMake target definition that builds
cuvs_static (which aggregates cuvs_objs) to link the ivf_rabitq library the same
way the shared cuvs target does (i.e., add ivf_rabitq to the
target_link_libraries or equivalent linkage list for cuvs_static so consumers of
cuvs_static get the helper resolved).
cpp/CMakeLists.txt-228-253 (1)

228-253: ⚠️ Potential issue | 🟠 Major

Condition OpenMP linking and compilation flags on the DISABLE_OPENMP option.

The ivf_rabitq target unconditionally links OpenMP::OpenMP_CXX and injects -fopenmp compiler flags, even when DISABLE_OPENMP=ON. This breaks the build because OpenMP is not discovered when that option is enabled. Other targets in this file (lines 980–1130) correctly use the $<TARGET_NAME_IF_EXISTS:OpenMP::OpenMP_CXX> pattern to make OpenMP optional.

Use conditional generator expressions for the OpenMP target dependency and remove the hardcoded -fopenmp flags from the compile options:

Proposed fix
-  target_link_libraries(ivf_rabitq PRIVATE OpenMP::OpenMP_CXX CUDA::cudart raft::raft rmm)
+  target_link_libraries(
+    ivf_rabitq
+    PRIVATE $<TARGET_NAME_IF_EXISTS:OpenMP::OpenMP_CXX>
+            CUDA::cudart
+            raft::raft
+            rmm::rmm
+  )

   target_compile_options(
     ivf_rabitq
-    PRIVATE $<$<COMPILE_LANGUAGE:CUDA>: $<$<CONFIG:Debug>:-G;-g> --extended-lambda
-            --expt-relaxed-constexpr -Xcompiler=-fopenmp > $<$<COMPILE_LANGUAGE:CXX>:-fopenmp>
+    PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:$<$<CONFIG:Debug>:-G;-g> --extended-lambda
+            --expt-relaxed-constexpr>
   )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/CMakeLists.txt` around lines 228 - 253, The ivf_rabitq target currently
always links OpenMP and always injects -fopenmp into compile options; change
target_link_libraries(ivf_rabitq ...) to use the optional generator expression
$<TARGET_NAME_IF_EXISTS:OpenMP::OpenMP_CXX> instead of OpenMP::OpenMP_CXX, and
remove the hardcoded -fopenmp from target_compile_options so the CXX and CUDA
compile flags do not force OpenMP when DISABLE_OPENMP=ON (make any CUDA
-Xcompiler=-fopenmp and CXX -fopenmp additions conditional on the OpenMP target
existing or the option being enabled).
cpp/bench/ann/src/faiss/faiss_cpu_wrapper.h-14-14 (1)

14-14: ⚠️ Potential issue | 🟠 Major

Guard the RaBitQ FAISS include and wrapper behind the build option.

The header is shared by all FAISS CPU benchmarks. Unconditional inclusion of <faiss/IndexIVFRaBitQ.h> and the faiss_cpu_ivfrabitq class definition will fail the entire build if FAISS lacks RaBitQ support, even when CUVS_ANN_BENCH_USE_FAISS_CPU_IVF_RABITQ is disabled.

Proposed fix
+#ifdef CUVS_ANN_BENCH_USE_FAISS_CPU_IVF_RABITQ
 #include <faiss/IndexIVFRaBitQ.h>
+#endif
+#ifdef CUVS_ANN_BENCH_USE_FAISS_CPU_IVF_RABITQ
 template <typename T>
 class faiss_cpu_ivfrabitq : public faiss_cpu<T> {
  public:
   struct build_param : public faiss_cpu<T>::build_param {};
@@
   }
 };
+#endif
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/bench/ann/src/faiss/faiss_cpu_wrapper.h` at line 14, Guard the
RaBitQ-specific include and wrapper behind the build option by wrapping the
`#include <faiss/IndexIVFRaBitQ.h>` and the faiss_cpu_ivfrabitq class definition
with an `#if defined(CUVS_ANN_BENCH_USE_FAISS_CPU_IVF_RABITQ)` / `#endif` block;
locate the unconditional include and the faiss_cpu_ivfrabitq symbol in
faiss_cpu_wrapper.h and ensure both the header inclusion and the class (or any
RaBitQ-specific functions/typedefs) are only compiled when
CUVS_ANN_BENCH_USE_FAISS_CPU_IVF_RABITQ is defined to avoid build failures when
FAISS lacks RaBitQ support.
cpp/src/neighbors/ivf_rabitq/utils/tools.hpp-10-15 (1)

10-15: ⚠️ Potential issue | 🟠 Major

Header must include <cstddef> to be self-contained.

size_t is used without including a declaring header. Although this currently works in memory.hpp due to transitive includes (<cstdlib>), the header is not self-contained and including it directly in other translation units could fail depending on include order. Add #include <cstddef>.

The codebase convention uses bare size_t, not std::size_t, so no style change is needed.

Minimal fix
 #pragma once
+
+#include <cstddef>
 
 namespace cuvs::neighbors::ivf_rabitq::detail {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/utils/tools.hpp` around lines 10 - 15, The
header uses size_t in the inline functions div_rd_up and rd_up_to_multiple_of
but doesn't include the declaring header; add `#include <cstddef>` at the top of
the file so the header is self-contained and other translation units can include
tools.hpp directly without relying on transitive includes (keep using bare
size_t as per project convention).
cpp/bench/ann/CMakeLists.txt-51-51 (1)

51-51: ⚠️ Potential issue | 🟠 Major

Add IVF-RaBitQ to the cuVS benchmark option bookkeeping.

CUVS_ANN_BENCH_USE_CUVS_IVF_RABITQ defaults ON, but it is not disabled in BUILD_CPU_ONLY and is omitted from the CUVS_ANN_BENCH_USE_CUVS aggregate. CPU-only builds can still try to configure the .cu RabitQ benchmark/link ivf_rabitq, and cuVS benchmark detection can be wrong when only RabitQ is enabled.

Proposed fix
 if(BUILD_CPU_ONLY)
   set(CUVS_FAISS_ENABLE_GPU OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_IVF_FLAT OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_IVF_PQ OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_CAGRA OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_BRUTE_FORCE OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_CAGRA_HNSWLIB OFF)
   set(CUVS_ANN_BENCH_USE_GGNN OFF)
   set(CUVS_KNN_BENCH_USE_CUVS_BRUTE_FORCE OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_MG OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_VAMANA OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_CAGRA_DISKANN OFF)
+  set(CUVS_ANN_BENCH_USE_CUVS_IVF_RABITQ OFF)
 else()
   set(CUVS_FAISS_ENABLE_GPU ON)
 endif()
 
 set(CUVS_ANN_BENCH_USE_CUVS OFF)
 if(CUVS_ANN_BENCH_USE_CUVS_IVF_PQ
    OR CUVS_ANN_BENCH_USE_CUVS_BRUTE_FORCE
    OR CUVS_ANN_BENCH_USE_CUVS_IVF_FLAT
+   OR CUVS_ANN_BENCH_USE_CUVS_IVF_RABITQ
    OR CUVS_ANN_BENCH_USE_CUVS_CAGRA
    OR CUVS_ANN_BENCH_USE_CUVS_CAGRA_HNSWLIB
    OR CUVS_KNN_BENCH_USE_CUVS_BRUTE_FORCE

Also applies to: 84-110, 251-256

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/bench/ann/CMakeLists.txt` at line 51, CUVS_ANN_BENCH_USE_CUVS_IVF_RABITQ
is defaulted ON but not included in the CUVS_ANN_BENCH_USE_CUVS aggregate and
not disabled under BUILD_CPU_ONLY, so CPU-only builds can attempt to
configure/link the .cu ivf_rabitq target; update the CMake logic to (1) add
CUVS_ANN_BENCH_USE_CUVS_IVF_RABITQ into the aggregated CUVS_ANN_BENCH_USE_CUVS
option and (2) wrap/force-disable CUVS_ANN_BENCH_USE_CUVS_IVF_RABITQ when
BUILD_CPU_ONLY is set (same treatment as other GPU-only CUVS flags) so
ivf_rabitq is not considered/configured for CPU-only builds.
cpp/src/neighbors/ivf_rabitq/defines.hpp-23-25 (1)

23-25: ⚠️ Potential issue | 🟠 Major

Make operator> false for equal distances.

!(*this < other) returns true when distances are equal, which breaks comparator expectations.

Proposed fix
-  bool operator>(const Candidate& other) const { return !(*this < other); }
+  bool operator>(const Candidate& other) const { return distance > other.distance; }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/defines.hpp` around lines 23 - 25, The current
Candidate::operator>(const Candidate& other) uses '!(*this < other)' which
yields true when distances are equal; update operator> to compare distances
strictly (e.g., return distance > other.distance or return other.distance <
distance) so it returns false for equal distances and preserves strict weak
ordering; modify the Candidate struct's operator< and operator> pair
accordingly.
cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_common.cuh-1-10 (1)

1-10: ⚠️ Potential issue | 🟠 Major

Add an include guard to this CUDA header.

This .cuh defines constants, a struct, and a device function; including it twice in one translation unit will cause redefinition errors.

Proposed fix
 /*
  * SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION.
  * SPDX-License-Identifier: Apache-2.0
  */
 
+#pragma once
+
 //
 // Created by Stardust on 4/14/25.
 //
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_common.cuh` around lines
1 - 10, This CUDA header lacks an include guard and can be included multiple
times causing redefinition of its constants, struct, and device function; fix it
by adding a header guard (or a top-line `#pragma once`) around the entire
contents of searcher_gpu.cuh using a unique macro name (e.g.
SEARCHER_GPU_COMMON_CUH_) so the declarations inside (the constants, the struct,
and the device function defined in this file) are only processed once per
translation unit.
cpp/src/neighbors/ivf_rabitq/defines.hpp-8-12 (1)

8-12: ⚠️ Potential issue | 🟠 Major

Include <cstddef> and use std::size_t.

The header only includes <cstdint>, which does not declare std::size_t per C++17 standard. Line 12 uses bare size_t without a qualifying namespace or using declaration, which will not compile on conforming implementations.

Proposed fix
+#include <cstddef>
 #include <cstdint>
 
 namespace cuvs::neighbors::ivf_rabitq::detail {
 
-constexpr size_t FAST_SIZE = 32;
+constexpr std::size_t FAST_SIZE = 32;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/defines.hpp` around lines 8 - 12, The header
defines FAST_SIZE using bare size_t but only includes <cstdint>, so add
`#include <cstddef>` and change the type to std::size_t for FAST_SIZE in the
cuvs::neighbors::ivf_rabitq::detail namespace (symbol: FAST_SIZE in defines.hpp)
to ensure portable, standards-conforming compilation.
cpp/src/neighbors/ivf_rabitq/utils/searcher_gpu_utils.cu-36-64 (1)

36-64: ⚠️ Potential issue | 🟠 Major

Fix type safety and synchronization for the probed vector reduction.

The reduction over d_probed_vectors_count (stored as unsigned long long) mixes incompatible types: init 0 is int, comparator is thrust::maximum<size_t>(), and the memset uses sizeof(size_t) instead of sizeof(unsigned long long). Additionally, raft::copy on line 57 is asynchronous, yet the function returns without synchronizing the stream. Callers immediately access max_probed_vectors_count on the next line, creating a race condition.

Introduce a type alias for the vector count, use it consistently in memset and reduce operations, and synchronize the stream before returning.

Proposed fix
+  using CountT = unsigned long long;
+
   auto d_max_probed_cluster_size = raft::make_device_scalar<uint32_t>(handle, 0);
-  auto d_probed_vectors_count    = raft::make_device_vector<unsigned long long, int64_t>(
-    handle, get_max_probed_vectors_count ? num_queries : 0);
+  auto d_probed_vectors_count =
+    raft::make_device_vector<CountT, int64_t>(handle,
+                                              get_max_probed_vectors_count ? num_queries : 0);
   // raw pointers for passing by value to device lambda
   auto d_max_probed_cluster_size_ptr = d_max_probed_cluster_size.data_handle();
   auto d_probed_vectors_count_ptr    = d_probed_vectors_count.data_handle();
   if (get_max_probed_vectors_count) {
     RAFT_CUDA_TRY(cudaMemsetAsync(
-      d_probed_vectors_count_ptr, 0, num_queries * sizeof(size_t), stream));  // Initialize to 0
+      d_probed_vectors_count_ptr, 0, num_queries * sizeof(CountT), stream));  // Initialize to 0
   }
   
   auto count = thrust::make_counting_iterator<int64_t>(0);
   thrust::for_each(
     raft::resource::get_thrust_policy(handle), count, count + num_pairs, [=] __device__(int64_t i) {
       auto [cluster_idx, query_idx] = d_cluster_query_pairs[i];
       auto cluster_size             = d_cluster_meta[cluster_idx].num;
       atomicMax(d_max_probed_cluster_size_ptr, cluster_size);
       if (get_max_probed_vectors_count)
         atomicAdd(&d_probed_vectors_count_ptr[query_idx],
                   static_cast<unsigned long long>(cluster_size));
     });
   raft::copy(&max_probed_cluster_size, d_max_probed_cluster_size_ptr, 1, stream);
   if (get_max_probed_vectors_count) {
-    max_probed_vectors_count = thrust::reduce(raft::resource::get_thrust_policy(handle),
-                                              d_probed_vectors_count_ptr,
-                                              d_probed_vectors_count_ptr + num_queries,
-                                              0,
-                                              thrust::maximum<size_t>());
+    auto max_count = thrust::reduce(raft::resource::get_thrust_policy(handle),
+                                    d_probed_vectors_count_ptr,
+                                    d_probed_vectors_count_ptr + num_queries,
+                                    CountT{0},
+                                    thrust::maximum<CountT>());
+    max_probed_vectors_count = static_cast<size_t>(max_count);
   }
+  RAFT_CUDA_TRY(cudaStreamSynchronize(stream));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/utils/searcher_gpu_utils.cu` around lines 36 -
64, The reduction mixes types and is unsynchronized: introduce a vector-count
alias (e.g., using vec_count_t = unsigned long long) and use it for
d_probed_vectors_count / d_probed_vectors_count_ptr, change the cudaMemsetAsync
size to num_queries * sizeof(vec_count_t) and initialize the reduce identity to
static_cast<vec_count_t>(0) while using thrust::maximum<vec_count_t>() for
thrust::reduce; after copying device scalar to max_probed_cluster_size and after
the thrust::reduce that writes max_probed_vectors_count, synchronize the stream
(e.g., cudaStreamSynchronize(stream)) before returning or before any host reads
of max_probed_vectors_count to avoid the race; update references in the code
paths guarded by get_max_probed_vectors_count as needed.
cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq_wrapper.h-120-125 (1)

120-125: ⚠️ Potential issue | 🟠 Major

Wire refine_ratio into search or remove the advertised refinement path.

needs_dataset() requests the dataset for refine_ratio > 1.0f, but search() never uses refine_ratio_ or dataset_. Benchmark runs configured with refinement will silently report unrefined IVF-RaBitQ results.

Also applies to: 155-156

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq_wrapper.h` around lines 120 - 125, The
set_search_param stores refine_ratio_ but the search() path never uses
refine_ratio_ or dataset_, so either wire the advertised refinement into the
search pipeline or remove the refinement plumbing; specifically, in
cuvs_ivf_rabitq<T, IdxT>::search() detect refine_ratio_ > 1.0f and perform the
additional refinement step using dataset_ and search_params_ (e.g., extra
re-ranking / distance computations on candidates) so needs_dataset() is valid,
or if refinement is unimplemented remove refine_ratio_ handling from
cuvs_ivf_rabitq<T, IdxT>::set_search_param and stop requesting the dataset in
needs_dataset(); update references to search_params_, refine_ratio_, dataset_,
needs_dataset(), and search() accordingly.
cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu-1451-1457 (1)

1451-1457: ⚠️ Potential issue | 🟠 Major

Guard zero-norm query quantization before computing delta.

For an all-zero query, norm and norm_quan can be zero, making delta NaN. That value is later used as query_width, so distances can become NaN and poison top-k selection.

Proposed fix
-    float norm_quan      = sqrtf(fmaxf(xu_sq, 0.f));
-    float cos_similarity = ip_resi_xucb / (norm * norm_quan);
-    float delta          = norm / norm_quan * cos_similarity;
+    float norm_quan = sqrtf(fmaxf(xu_sq, 0.f));
+    float delta     = 0.0f;
+    if (norm > 0.0f && norm_quan > 0.0f) {
+      float cos_similarity = ip_resi_xucb / (norm * norm_quan);
+      delta = norm / norm_quan * cos_similarity;
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu` around
lines 1451 - 1457, The computation of delta can produce NaN when norm or
norm_quan are zero (all-zero query); update the block around tid==0 to guard
divisions: check if norm_quan <= 0 or norm <= 0 and in that case set delta to 0
(or another defined safe fallback) instead of performing the division, then
store that safe value into d_delta[row]; ensure you reference and modify the
delta computation using the existing symbols norm_quan, norm, cos_similarity,
delta and d_delta (and row) so downstream use as query_width cannot become NaN.
cpp/src/neighbors/ivf_rabitq/gpu_index/rotator_gpu.cu-41-58 (1)

41-58: ⚠️ Potential issue | 🟠 Major

Honor the documented row-major serialization layout.

load() and save() copy the matrix linearly, but the header says files are row-major and device storage is transposed for column-major GEMM use. This breaks compatibility with any writer/reader following the documented format.

Proposed fix direction
-  for (size_t i = 0; i < D * D; ++i) {
-    input.read(reinterpret_cast<char*>(&host_buf(i)), sizeof(float));
-  }
+  for (size_t r = 0; r < D; ++r) {
+    for (size_t c = 0; c < D; ++c) {
+      float v;
+      input.read(reinterpret_cast<char*>(&v), sizeof(float));
+      host_buf(c * D + r) = v;  // row-major file -> column-major/device layout
+    }
+  }

Apply the inverse transpose in save(), or update the header and all serializers to declare the raw layout.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/rotator_gpu.cu` around lines 41 - 58,
The code currently copies the matrix bytes linearly between host_buf and
rotation_matrix_ but device storage is transposed for column-major GEMM while
files are documented as row-major; update RotatorGPU::load and RotatorGPU::save
to perform the inverse transpose when moving between disk and device.
Concretely: in load(), read the file into host_buf (row-major) then transpose
that host buffer into the device layout expected by rotation_matrix_ (or copy
host_buf to a temporary host_transposed and raft::copy that into
rotation_matrix_); in save(), copy rotation_matrix_ back to a host buffer in
device layout, transpose that buffer to row-major order, then write the
row-major bytes to output. Use the existing symbols (RotatorGPU::load,
RotatorGPU::save, rotation_matrix_, host_buf, D, stream_, handle_) and a host or
GPU transpose utility (e.g., raft transpose or a small custom transpose) to
perform the conversions so on-disk layout remains row-major while device storage
stays transposed for GEMM.
cpp/src/neighbors/ivf_rabitq/gpu_index/quantizer_gpu.cu-961-968 (1)

961-968: ⚠️ Potential issue | 🟠 Major

Add stream synchronization after the D2H copy before reading the host buffer.

raft::copy() is asynchronous; the CPU loop can read h_rand_row_normalized_abs before the copy completes, causing stale or undefined data. The fully-GPU path at line 1450 correctly syncs before reading—apply the same pattern here.

Proposed fix
   raft::copy(h_rand_row_normalized_abs.data_handle(),
              rand.data_handle(),
              kConstNum * dim,
              raft::resource::get_cuda_stream(handle));
+  raft::resource::sync_stream(handle);
 
   double sum = 0;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/quantizer_gpu.cu` around lines 961 -
968, The host loop reads h_rand_row_normalized_abs immediately after an
asynchronous raft::copy from rand; add a CUDA stream synchronization using the
same stream from raft::resource::get_cuda_stream(handle) after the raft::copy
and before the loop that calls best_rescale_factor so the D2H transfer is
complete; locate the raft::copy(...) call and the subsequent use of
h_rand_row_normalized_abs and insert a stream synchronization (using the stream
handle retrieved from raft::resource::get_cuda_stream(handle) or an equivalent
RAFT sync helper) between them.
cpp/src/neighbors/ivf_rabitq/utils/IO.hpp-19-35 (1)

19-35: ⚠️ Potential issue | 🟠 Major

Return -1 as size_t creates a potential SIZE_MAX allocation.

The function get_filesize returns -1 on error, which as a size_t becomes SIZE_MAX. While the file_exits check mitigates this in the happy path, a TOCTOU race condition exists: the file could be deleted or become inaccessible between the existence check and the stat64 call, causing get_filesize to return SIZE_MAX and subsequent division on line 88 to produce an enormous row count.

Additionally, these helper functions should be marked inline since they're defined in a header file used by multiple instantiations of template functions.

Proposed fix
-size_t get_filesize(const char* filename)
+inline std::optional<size_t> get_filesize(const char* filename)
 {
   struct stat64 stat_buf;
   int rc = stat64(filename, &stat_buf);
-  return rc == 0 ? stat_buf.st_size : -1;
+  if (rc != 0) { return std::nullopt; }
+  return static_cast<size_t>(stat_buf.st_size);
 }
 
-bool file_exits(const char* filename)
+inline bool file_exits(const char* filename)

Then handle the empty optional before computing rows.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/utils/IO.hpp` around lines 19 - 35, get_filesize
currently returns size_t and uses -1 for errors which converts to SIZE_MAX
(risking huge allocations if the file disappears between file_exits and stat64);
mark both helpers inline, change get_filesize to return an optional-like error
indicator (e.g., std::optional<size_t> or ssize_t) instead of size_t so errors
are represented safely, update callers (the code that computes rows/divisions)
to check the optional/negative value and bail out early rather than performing
divisions, and keep file_exits inline as well; specifically update get_filesize
and file_exits signatures and ensure the place that computes rows (where
get_filesize() is used) validates the returned value before using it.
cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu-452-527 (1)

452-527: ⚠️ Potential issue | 🟠 Major

int grid/block arithmetic can overflow for large inputs.

Several launch-config computations mix size_t operands into int results, which silently truncates or overflows when num_vectors, num_centroids, or batch_size * num_centroids exceeds ~2³¹:

  • Line 453 / 526 / 539 / 695: int num_blocks = (num_vectors + block_size - 1) / block_size;
  • Line 468 / 618: int num_levels = num_centroids + 1; (passed as int to CUB)
  • Lines 1238/1346/1455: int grid = num_centroids + batch_size;
  • Lines 1254/1362/1471: (batch_size * num_centroids + add_threads - 1) / add_threads assigned to int

Given this index supports datasets too large to fit in GPU memory (streaming path), num_vectors in the tens-of-billions range is plausible. Please use size_t/int64_t for these intermediates (CUB histogram/scan use int num_levels, so a runtime check that num_centroids fits in int is also warranted).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu` around lines 452 - 527,
The grid/block arithmetic and CUB parameters use int and can overflow; change
intermediate counters like num_blocks, num_levels, and grid calculations to
size_t or int64_t (compute e.g. num_blocks = (size_t(num_vectors) + block_size -
1) / block_size and similar for (batch_size * num_centroids + add_threads - 1) /
add_threads), keep block_size as an int for launch, and cast to the appropriate
launch type only when invoking kernels such as build_cluster_meta_kernel;
additionally add runtime checks that values passed into CUB APIs (num_centroids
-> num_levels) fit into int and fail/throw if not, and ensure d_offsets/
histogram allocations use matching wider index types (size_t) so no truncation
occurs before any kernel or cub call.
cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cuh-163-166 (1)

163-166: ⚠️ Potential issue | 🟠 Major

Default constructor leaves scalar members uninitialized.

The default constructor only initializes handle_, initializer, and Rota, but num_vectors, num_dimensions, num_padded_dim, num_centroids, max_cluster_length, and ex_bits (declared at lines 381-386) are left indeterminate. If anything reads these (e.g. a getter call, or an early exit in save/load_transposed) before load_transposed populates them, you get UB. Also, passing a hard-coded 128 to RotatorGPU is wasted work since load_transposed immediately replaces Rota.

🛡️ Proposed fix
   IVFGPU(raft::resources const& handle)
-    : handle_(handle), initializer(nullptr), Rota(std::make_unique<RotatorGPU>(handle_, 128))
+    : handle_(handle),
+      num_vectors(0),
+      num_dimensions(0),
+      num_padded_dim(0),
+      num_centroids(0),
+      max_cluster_length(0),
+      ex_bits(0),
+      initializer(nullptr),
+      DQ(nullptr),
+      Rota(nullptr)
   {
   }

Alternatively, give the data members default member initializers where declared.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cuh` around lines 163 - 166,
The IVFGPU(raft::resources const& handle) constructor leaves scalar members
num_vectors, num_dimensions, num_padded_dim, num_centroids, max_cluster_length,
and ex_bits uninitialized and eagerly constructs Rota with a hard-coded 128; fix
by initializing those scalars to safe defaults (e.g. 0) either via default
member initializers where they are declared or by initializing them in the
IVFGPU constructor initializer list, and avoid wasteful RotatorGPU construction
by initializing Rota to nullptr or deferring its construction until
load_transposed replaces it (or construct with the correct parameter), ensuring
consistency with load_transposed and initializer usage.
🟡 Minor comments (12)
cpp/src/neighbors/ivf_rabitq/utils/StopW.hpp-18-40 (1)

18-40: ⚠️ Potential issue | 🟡 Minor

Use floating-point durations to preserve elapsed time precision.

All four methods cast to integral duration types before calling .count(), which truncates fractional time. For example, duration_cast<seconds> drops sub-second precision, causing short phases to report 0.0 seconds. Use floating-point duration types directly.

Proposed fix
-  float getElapsedTimeSec()
+  float getElapsedTimeSec() const
   {
-    std::chrono::steady_clock::time_point time_end = std::chrono::steady_clock::now();
-    return (std::chrono::duration_cast<std::chrono::seconds>(time_end - time_begin).count());
+    return std::chrono::duration<float>(std::chrono::steady_clock::now() - time_begin).count();
   }
 
-  float getElapsedTimeMili()
+  float getElapsedTimeMili() const
   {
-    std::chrono::steady_clock::time_point time_end = std::chrono::steady_clock::now();
-    return (std::chrono::duration_cast<std::chrono::milliseconds>(time_end - time_begin).count());
+    return std::chrono::duration<float, std::milli>(std::chrono::steady_clock::now() - time_begin)
+      .count();
   }
 
-  float getElapsedTimeMicro()
+  float getElapsedTimeMicro() const
   {
-    std::chrono::steady_clock::time_point time_end = std::chrono::steady_clock::now();
-    return (std::chrono::duration_cast<std::chrono::microseconds>(time_end - time_begin).count());
+    return std::chrono::duration<float, std::micro>(std::chrono::steady_clock::now() - time_begin)
+      .count();
   }
 
-  float getElapsedTimeNano()
+  float getElapsedTimeNano() const
   {
-    std::chrono::steady_clock::time_point time_end = std::chrono::steady_clock::now();
-    return (std::chrono::duration_cast<std::chrono::nanoseconds>(time_end - time_begin).count());
+    return std::chrono::duration<float, std::nano>(std::chrono::steady_clock::now() - time_begin)
+      .count();
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/utils/StopW.hpp` around lines 18 - 40, The
current
getElapsedTimeSec/getElapsedTimeMili/getElapsedTimeMicro/getElapsedTimeNano
functions use duration_cast to integral duration types which truncates
fractions; change each to compute elapsed = std::chrono::steady_clock::now() -
time_begin and return elapsed.count() from a floating-point chrono::duration
(e.g., std::chrono::duration<float> for seconds, std::chrono::duration<float,
std::milli> for milliseconds, std::chrono::duration<float, std::micro> for
microseconds, and std::chrono::duration<float, std::nano> for nanoseconds) so
the functions return fractional values instead of being truncated to integers.
cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq_wrapper.h-122-126 (1)

122-126: ⚠️ Potential issue | 🟡 Minor

Replace the release-stripped assert with runtime validation.

n_probes comes from benchmark config, and assert is compiled out under NDEBUG; invalid configs can then reach the search path unchecked.

Proposed fix
-  assert(search_params_.n_probes <= index_params_.n_lists);
+  if (search_params_.n_probes > index_params_.n_lists) {
+    throw std::runtime_error("ivf_rabitq n_probes must be <= n_lists");
+  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq_wrapper.h` around lines 122 - 126,
Replace the release-only assert with a runtime validation: after assigning
search_params_ and refine_ratio_ check if search_params_.n_probes >
index_params_.n_lists and if so throw a std::invalid_argument (or another
appropriate exception) with a clear message including the offending values
(e.g., mention search_params_.n_probes and index_params_.n_lists); update the
block that currently uses assert(...) to this if-check so invalid benchmark
configs are rejected at runtime (refer to the variables search_params_,
index_params_.n_lists, and refine_ratio_ to locate the code).
cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu-459-460 (1)

459-460: ⚠️ Potential issue | 🟡 Minor

Remove redundant uint32_t* cast in queue.store() calls.

The explicit (uint32_t*) cast is unnecessary since params.d_topk_pids is already PID*, which is defined as uint32_t*. Removing the cast improves code clarity without affecting functionality.

Proposed fix
-      queue.store(params.d_topk_dists + output_offset,
-                  (uint32_t*)(params.d_topk_pids + output_offset));
+      queue.store(params.d_topk_dists + output_offset,
+                  params.d_topk_pids + output_offset);

Also applies to: 787-788

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu` around
lines 459 - 460, The calls to queue.store currently include an unnecessary cast
(uint32_t*) on params.d_topk_pids; remove the redundant cast so the call becomes
queue.store(params.d_topk_dists + output_offset, params.d_topk_pids +
output_offset). Update both occurrences (the one around queue.store(...) at the
shown location and the second occurrence around lines ~787-788) to drop the
explicit (uint32_t*) cast; retain all other arguments and behavior unchanged and
rely on PID being typedef'd to uint32_t.
cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cu-37-75 (1)

37-75: ⚠️ Potential issue | 🟡 Minor

Silent UB when D is not a multiple of 4.

ComputeDistancesKernelWarp reinterprets centroid/query pointers as float4* and iterates over D_vec = D/4, but the host call site uses integer division D / 4 and passes it without checking. If D is ever not a multiple of 4, the last few scalar dimensions are silently dropped (and the float4 cast reads past the valid range). In the current pipeline D = num_padded_dim is a multiple of 64, but nothing in this TU enforces that — add a defensive RAFT_EXPECTS(D % 4 == 0, ...) in ComputeCentroidsDistances (and/or at construction) to fail loudly if a future caller violates the assumption.

Also applies to: 115-134

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cu` around lines 37 -
75, The kernel ComputeDistancesKernelWarp assumes vectors are packed as float4
(D_vec = D/4) which causes silent UB if D is not a multiple of 4; add a
defensive runtime check using RAFT_EXPECTS(D % 4 == 0, "D must be multiple of
4") in the host-side function ComputeCentroidsDistances (and/or the
class/constructor that sets num_padded_dim) before computing/passing D_vec, and
add the same guard where similar kernels are invoked (the other occurrence
around lines 115-134) so any future caller fails loudly instead of
reading/padding past valid memory.
cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu-560-594 (1)

560-594: ⚠️ Potential issue | 🟡 Minor

Threshold tightening loop skips legitimate zero distances.

In both block-sort variants, the "find max topk distance" scan filters with dist > 0 && ... && dist < INFINITY. Valid query-vector pairs can yield dist == 0.0f (e.g., identical vectors, or after numerical cancellation). When every top-k distance is 0, max_topk_dist remains -INFINITY and the threshold is never updated — which is safe but wastes the tightening opportunity; when only some slots are 0, they are excluded from the max, which is still correct but inconsistent. Consider dist >= 0 && dist < INFINITY (or just dist < INFINITY, since the init value is INFINITY) to let zero distances participate.

Also applies to: 891-924

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu` around lines 560 -
594, The max-top-k scan currently excludes zero distances via `if (dist > 0 &&
...)` so zero-valued valid distances are ignored; update the condition in the
scan that computes max_topk_dist (the loop that reads params.d_topk_dists at
output_offset using query_idx, probe_slot and params.topk) to include zeros
(e.g., `if (dist >= 0 && dist < INFINITY)` or simply `if (dist < INFINITY)`)
and apply the same change in the other block-sort variant (the similar loop
around lines 891-924) so atomic threshold tightening uses zero distances too
before performing the `atomicMin` on threshold_ptr.
cpp/include/cuvs/neighbors/ivf_rabitq.hpp-91-97 (1)

91-97: ⚠️ Potential issue | 🟡 Minor

Document search_mode enumerators.

Each enumerator (LUT16, LUT32, QUANT4, QUANT8) affects recall/throughput/memory trade-offs materially, but the only inline doc is the group comment "A type for specifying the mode...". Please add per-value Doxygen explaining what each mode does and when to pick it; users will otherwise have to read the .cu implementation to choose a mode.

As per coding guidelines: "For public C++ API headers, additionally check: Doxygen documentation for all public functions/classes".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/cuvs/neighbors/ivf_rabitq.hpp` around lines 91 - 97, Add Doxygen
comments for each enumerator of enum class search_mode (LUT16, LUT32, QUANT4,
QUANT8) describing what the mode does, its impact on recall/throughput/memory,
and guidance on when to choose it (e.g., LUT16: smaller LUT, fastest lookup,
lower recall; LUT32: larger LUT, higher recall but more memory; QUANT4/QUANT8:
quantized ADC trade-offs with QUANT4 being lowest memory/best speed and QUANT8
higher accuracy). Place the /// or /** */ comments directly above each
enumerator so the public API header documents per-value behavior for consumers.
cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cuh-65-91 (1)

65-91: ⚠️ Potential issue | 🟡 Minor

Doxygen for SearchClusterQueryPairs is out of sync with the signature.

The @param list references all_topk_results, h_query, d_centroid, and stream, none of which are parameters of the declared function. Conversely, topk, d_final_dists, and d_final_pids are undocumented. Please bring the doc comment in line with the actual signature (and apply the same check to SearchClusterQueryPairsSharedMemOpt / SearchClusterQueryPairsQuantizeQuery, which have no Doxygen at all).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cuh` around lines 65 -
91, The Doxygen block above SearchClusterQueryPairs is stale: remove references
to non-existent params (all_topk_results, h_query, d_centroid, stream) and add
documentation for the actual parameters topk, d_final_dists, and d_final_pids,
describing their purpose and types to match the signature of
SearchClusterQueryPairs(const IVFGPU& cur_ivf, IVFGPU::GPUClusterMeta*
d_cluster_meta, ClusterQueryPair* d_sorted_pairs, size_t num_queries, const
float* d_query, const float* d_G_k1xSumq, const float* d_G_kbxSumq, size_t
nprobe, size_t topk, float* d_final_dists, PID* d_final_pids); also add or
update matching Doxygen for the related functions
SearchClusterQueryPairsSharedMemOpt and SearchClusterQueryPairsQuantizeQuery so
their comments reflect their exact parameter lists.
cpp/tests/neighbors/ann_ivf_rabitq.cuh-370-391 (1)

370-391: ⚠️ Potential issue | 🟡 Minor

Tests never exercise build_only() — only the post-serialize path is validated.

The comment on lines 372-373 states that deserialization is required to reorganize data for correct search. As a result, every TEST_P macro (build_serialize_search, build_host_input_serialize_search, build_forced_streaming) runs through serialize→deserialize before calling search. There is no test that confirms build_only() (the path users call when they don't persist the index) produces correct results — which means a regression that broke direct-build search would not be caught here.

If the library is expected to require a roundtrip for correctness, that should be enforced at the API level (e.g., search refuses to run on a non-finalized index) rather than documented only in a test comment. Otherwise, please add a TEST_BUILD_SEARCH variant that calls build_only() and asserts recall, so the no-persistence workflow has regression coverage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/neighbors/ann_ivf_rabitq.cuh` around lines 370 - 391, Add a test
variant that exercises the direct-build path instead of always going through
serialize/deserialize: create a TEST_BUILD_SEARCH(type) macro (mirroring
TEST_BUILD_SERIALIZE_SEARCH) whose TEST_P registers build_search and inside
calls this->run([this]() { return this->build_only(); }); then ensure the
resulting test calls search/asserts recall the same way the other tests do;
reference the existing TEST_P macros and the build_only(), build_serialize(),
build_host_input_serialize(), build_with_forced_streaming(), and search() flow
so the new macro is wired into the same test fixture and assertions.
cpp/include/cuvs/neighbors/ivf_rabitq.hpp-293-293 (1)

293-293: ⚠️ Potential issue | 🟡 Minor

Doxygen typo: "IVF-PQ" should be "IVF-RaBitQ".

Copy-paste from the IVF-PQ docs. Also consider documenting, for the index parameter, that *index must be a non-null, default-constructed ivf_rabitq::index<int64_t> (the implementation RAFT_FAILs on nullptr but does not document this precondition).

📝 Proposed fix
- * @param[out] index IVF-PQ index
+ * @param[out] index IVF-RaBitQ index (must be non-null)

As per coding guidelines: "For public C++ API headers, additionally check: Doxygen documentation for all public functions/classes".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/cuvs/neighbors/ivf_rabitq.hpp` at line 293, Update the Doxygen
param description to correct the algorithm name from "IVF-PQ" to "IVF-RaBitQ"
and clarify the precondition for the parameter `index`: document that the caller
must pass a non-null pointer and that `*index` must be a default-constructed
ivf_rabitq::index<int64_t> (the implementation currently RAFT_FAILs on nullptr).
Locate the doc block that currently reads "@param[out] index IVF-PQ index" and
replace it with a concise note stating "IVF-RaBitQ index; must be non-null and
point to a default-constructed ivf_rabitq::index<int64_t>" so callers know the
required precondition.
cpp/src/neighbors/ivf_rabitq.cu-222-257 (1)

222-257: ⚠️ Potential issue | 🟡 Minor

Add a fall-through guard for unknown search_mode in the dispatch chain.

The if/else-if ladder has no terminal else — if search_mode ever gains a new enumerator, search will silently produce final_ids of uninitialized values (since make_device_vector at line 220 does not zero-initialize), and the subsequent raft::linalg::map at 260-264 will emit garbage neighbor IDs without any diagnostic. The search_mode_to_string lambda already covers the new-enumerator case with RAFT_FAIL, but it runs earlier (for SearcherGPU construction), so for readers of this function the invariant is non-local. A final else { RAFT_FAIL(...); } makes the dispatch self-checking.

🛡️ Proposed guard
   } else if (params.mode == search_mode::QUANT4) {
     idx.rabitq_index().BatchClusterSearchQuantizeQuery(...,
                                                        4);
+  } else {
+    RAFT_FAIL("Invalid search mode");
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq.cu` around lines 222 - 257, The dispatch over
params.mode lacks a terminal guard: after the existing branches handling
search_mode::LUT32, LUT16, QUANT8, and QUANT4, add a final else that calls
RAFT_FAIL (or similar) to fail fast for unknown search_mode values (use
search_mode_to_string(params.mode) if available for a readable message); this
ensures BatchClusterSearch* branches cannot silently leave final_ids
uninitialized.
cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu-947-952 (1)

947-952: ⚠️ Potential issue | 🟡 Minor

Use 0.0f (or remove) the LUT pre-fill instead of -infinity.

Filling the LUT buffer with -std::numeric_limits<float>::infinity() is risky: if precomputeAllLUTs ever leaves any entry unwritten (e.g., due to a mismatch between host-side cur_ivf.get_num_padded_dim() and the searcher's D, kernel launch failure, or future refactor), those entries silently leak -inf into the IP accumulation and produce corrupted distances without any indication. Since the kernel currently writes every entry covered by the searcher's D, either drop the fill entirely or, if kept defensively, use 0.0f which degrades gracefully.

🔒️ Proposed change
   rmm::device_uvector<float> d_lut_for_queries(lut_size / sizeof(float), stream_);
-  thrust::fill(thrust::cuda::par.on(stream_),
-               d_lut_for_queries.data(),
-               d_lut_for_queries.data() + d_lut_for_queries.size(),
-               -std::numeric_limits<float>::infinity());
   // precompute LUTS
   launchPrecomputeLUTs(d_query, d_lut_for_queries.data(), num_queries, D, stream_);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu` around lines 947 -
952, The LUT buffer is being pre-filled with
-std::numeric_limits<float>::infinity() which can silently corrupt inner-product
accumulations if any entry remains unwritten; change the pre-fill for
d_lut_for_queries (used before launchPrecomputeLUTs) to use 0.0f instead (or
remove the thrust::fill call entirely) so unwritten entries degrade gracefully;
update the thrust::fill target d_lut_for_queries.data() to use 0.0f and keep
launchPrecomputeLUTs(d_query, d_lut_for_queries.data(), num_queries, D, stream_)
as-is.
cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu-528-540 (1)

528-540: ⚠️ Potential issue | 🟡 Minor

Remove the commented debug line.

Line 530 contains a stale debug artifact (// ex_dist = ex_dist+1;) that should be deleted. The all-threads invariant for queue.add() is correctly implemented: the else branch feeds dummy values (INFINITY/0) to ensure every thread calls queue.add() exactly once per round, which matches the block-sort requirement. The comment at line 539 already affirms this invariant.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu` around lines 528 -
540, Remove the stale debug comment line that reads "//                ex_dist =
ex_dist+1;" located near the calculation of ex_dist and pid; this is a leftover
artifact and should be deleted so the code only contains the real assignment to
ex_dist, the pid assignment from params.d_pids[global_vec_idx], and the
subsequent queue.add(ex_dist, pid) call (ensure references to ex_dist, pid and
queue.add remain unchanged).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 3c486d41-58fb-412e-9cf6-6c82835e1bbf

📥 Commits

Reviewing files that changed from the base of the PR and between f2bffb6 and e28491b.

📒 Files selected for processing (36)
  • cpp/CMakeLists.txt
  • cpp/bench/ann/CMakeLists.txt
  • cpp/bench/ann/src/cuvs/cuvs_ann_bench_param_parser.h
  • cpp/bench/ann/src/cuvs/cuvs_benchmark.cu
  • cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq.cu
  • cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq_wrapper.h
  • cpp/bench/ann/src/faiss/faiss_cpu_benchmark.cpp
  • cpp/bench/ann/src/faiss/faiss_cpu_wrapper.h
  • cpp/include/cuvs/neighbors/ivf_rabitq.hpp
  • cpp/src/neighbors/ivf_rabitq.cu
  • cpp/src/neighbors/ivf_rabitq/defines.hpp
  • cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cuh
  • cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cuh
  • cpp/src/neighbors/ivf_rabitq/gpu_index/quantizer_gpu.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/quantizer_gpu.cuh
  • cpp/src/neighbors/ivf_rabitq/gpu_index/rotator_gpu.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/rotator_gpu.cuh
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cuh
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_common.cuh
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu
  • cpp/src/neighbors/ivf_rabitq/utils/IO.hpp
  • cpp/src/neighbors/ivf_rabitq/utils/StopW.hpp
  • cpp/src/neighbors/ivf_rabitq/utils/memory.hpp
  • cpp/src/neighbors/ivf_rabitq/utils/searcher_gpu_utils.cu
  • cpp/src/neighbors/ivf_rabitq/utils/searcher_gpu_utils.hpp
  • cpp/src/neighbors/ivf_rabitq/utils/space.hpp
  • cpp/src/neighbors/ivf_rabitq/utils/tools.hpp
  • cpp/tests/CMakeLists.txt
  • cpp/tests/neighbors/ann_ivf_rabitq.cuh
  • cpp/tests/neighbors/ann_ivf_rabitq/test_float_int64_t.cu
  • python/cuvs/cuvs/tests/test_compute_matrix_product.py
  • python/cuvs_bench/cuvs_bench/config/algorithms.yaml

Comment thread cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cu Outdated
Comment thread cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu
Comment thread cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu
Comment on lines +332 to +338
if (is_candidate) {
int candidate_slot = atomicAdd(&num_candidates, 1);
if (candidate_slot < params.max_candidates_per_pair) {
shared_candidate_ips[candidate_slot] = local_ip;
shared_candidate_indices[candidate_slot] = vec_idx;
}
}

@coderabbitai coderabbitai Bot Apr 23, 2026


⚠️ Potential issue | 🔴 Critical

Clamp candidate count before reading shared candidate arrays.

num_candidates keeps increasing after candidate_slot >= params.max_candidates_per_pair, but later loops iterate up to num_candidates and read entries that were never written. A dense cluster or loose threshold can turn this into shared-memory OOB reads and corrupt top-k results.

Proposed fix pattern
   __syncthreads();
+  const int stored_candidates = min(num_candidates, params.max_candidates_per_pair);
 
-  if (num_candidates > 0) {
+  if (stored_candidates > 0) {
     ...
-    for (int cand_idx = warp_id; cand_idx < num_candidates; cand_idx += num_warps) {
+    for (int cand_idx = warp_id; cand_idx < stored_candidates; cand_idx += num_warps) {
       ...
     }
     ...
-    const int adds_per_thread = (num_candidates + num_threads - 1) / num_threads;
+    const int adds_per_thread = (stored_candidates + num_threads - 1) / num_threads;

Apply the same cap to the NoEX block-sort path and threshold-update checks.

Also applies to: 378-379, 411-419, 735-742, 758-766

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu` around
lines 332 - 338, The bug is that num_candidates can grow past
params.max_candidates_per_pair so later loops read unwritten
shared_candidate_ips/shared_candidate_indices; after the candidate-collection
phase, clamp num_candidates = min(num_candidates,
params.max_candidates_per_pair) (or compute a local bounded_count =
min(num_candidates, params.max_candidates_per_pair)) and use that bounded count
for any subsequent loops or threshold-update checks that iterate over
shared_candidate_ips/shared_candidate_indices; apply the same clamp in the NoEX
block-sort path and any threshold-update code paths that currently iterate up to
num_candidates to avoid shared-memory OOB reads.

Contributor

On line 654, params.max_candidates_per_pair was set (before the kernel call) to max_cluster_size. Even if all vectors in the largest cluster are candidates, num_candidates is still less than or equal to max_cluster_size. Therefore, num_candidates is naturally clamped by the algorithm.


@jamxia155, you're correct. Since params.max_candidates_per_pair is set to max_cluster_size before the kernel launch, and the atomicAdd only fires for vectors belonging to the current cluster (vec_idx < num_vectors_in_cluster), the count is naturally bounded:

num_candidates ≤ num_vectors_in_cluster ≤ max_cluster_size == params.max_candidates_per_pair

The shared-memory arrays are therefore always large enough to hold all candidates, and no OOB access can occur. My original concern was a false positive — the algorithm already provides the necessary bound implicitly. Thanks for clarifying!

Comment thread cpp/src/neighbors/ivf_rabitq/utils/memory.hpp
for (int j = tid; j < D; j += BlockSize) {
s_xp_norm[j] = d_XP_norm[row * D + j];
}
__syncthreads();
Contributor

Unnecessary sync: each thread accesses only the data it wrote to smem just before; a thread won't read values written by another thread.

Contributor

Astute as ever. Updated. Thanks.

}

/**
* @defgroup ivf_rabitq_cpp_index_params IVF-RaBitQ index build parameters
Contributor

Add all declared groups to the Doxygen docs

Contributor

Added to docs/source/cpp_api. Thanks.

Comment on lines +113 to +115
raft::device_matrix<float, int64_t, raft::row_major> centroids_ =
raft::make_device_matrix<float, int64_t, raft::row_major>(
handle_, K, D); // Stored in GPU device memory. Points to the parent centroids' array
Contributor

It should be fine to remove that default and only keep the constructor initialization of centroids_

Contributor

Good point. Moving this to the constructor to avoid any potential subtlety with default initialization.

centroids_(raft::make_device_matrix<float, int64_t, raft::row_major>(handle_, K, D))
{
dist_func = L2SqrGPU;
raft::resource::sync_stream(handle_);
Contributor

Unnecessary sync

Contributor

Removed. Thanks.

Comment on lines +146 to +151
// Initialize candidate IDs as a sequence 0,1,...,K-1 using a custom kernel
int block_size = 256;
int grid_size = (K + block_size - 1) / block_size;
init_sequence_kernel<<<grid_size, block_size, 0, stream_>>>(d_candidate_ids, K);
RAFT_CUDA_TRY(cudaPeekAtLastError());
raft::resource::sync_stream(handle_); // Wait for kernel completion
Contributor

Prefer raft::linalg::range

Contributor

Fortunately, this was part of the removed dead code. Will keep in mind going forward. Thanks.

const_cast<float*>(rotation_matrix_.data_handle()), D, D),
raft::make_device_matrix_view<float, int64_t, raft::col_major>(const_cast<float*>(d_A), D, N),
raft::make_device_matrix_view<float, int64_t, raft::col_major>(d_RAND_A, D, N));
raft::resource::sync_stream(handle_);
Contributor

Remove sync


Removed. Thanks.

Comment on lines +27 to +30
auto rd_up_to_multiple_of = [](uint32_t dim, uint32_t mult) -> size_t {
return ((dim + mult - 1) / mult) * mult;
};
D = rd_up_to_multiple_of(dim, 64);

raft::round_up_safe


Your suggestion (along with the one re: raft::div_rounding_up_safe) allowed me to remove an entire header that implements these functionalities that are already in RAFT. Thanks!
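For reference, the padding lambda being replaced computes a round-up-to-multiple; this plain C++ sketch shows the same arithmetic (raft::round_up_safe provides it with additional safety checks, so prefer that in cuVS code):

```cpp
#include <cassert>
#include <cstdint>

// Round dim up to the next multiple of mult -- the same computation as the
// removed rd_up_to_multiple_of lambda (e.g., padding dim to a multiple of 64).
constexpr uint32_t round_up_to_multiple(uint32_t dim, uint32_t mult)
{
  return ((dim + mult - 1) / mult) * mult;
}
```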


namespace cuvs::neighbors::ivf_rabitq::detail {

#define MAX_D 2048

static constexpr int ... or something else than a macro


Turns out this is dead code. Removed. Thanks.


static constexpr int BITS_PER_CHUNK = 4;
static constexpr int LUT_SIZE = (1 << BITS_PER_CHUNK); // 16
static constexpr int WARP_SIZE = 32;

use raft::WarpSize


Updated. Thanks.

Comment on lines +1423 to +1424
RAFT_CUDA_TRY(cudaFuncSetAttribute(
fully_fused_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, shared_mem_size));

Why is it needed?
The shared mem size specified in <<<>>> should be enough?


Using >48KB of dynamic shared memory requires opt-in (see CUDA Programming Guide). However, your comment reminded me to use the recently added cuvs::neighbors::detail::safely_launch_kernel_with_smem_size which is additionally thread-safe. Updated. Thanks.
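For context, current NVIDIA GPUs cap a launch at 48 KB of dynamic shared memory unless the kernel opts in (via cudaFuncSetAttribute or a helper such as the one mentioned above). A host-side sizing sketch, with hypothetical LUT parameters (one 16-entry fp16 LUT per 4-bit chunk of the padded dimension):

```cpp
#include <cassert>
#include <cstddef>

// Default per-block dynamic shared memory limit; requests above it need opt-in.
constexpr std::size_t kDefaultSmemLimit = 48 * 1024;

// Illustrative LUT sizing only -- chunking and entry type are assumptions,
// not the exact layout used by the kernels in this PR.
constexpr std::size_t lut_smem_bytes(std::size_t padded_dim,
                                     std::size_t bits_per_chunk = 4,
                                     std::size_t lut_entries    = 16)
{
  return (padded_dim / bits_per_chunk) * lut_entries * sizeof(unsigned short);
}

constexpr bool needs_smem_opt_in(std::size_t bytes) { return bytes > kDefaultSmemLimit; }
```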

@coderabbitai Bot left a comment

Actionable comments posted: 6

♻️ Duplicate comments (4)
cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu (2)

332-338: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Still applies: clamp num_candidates to max_candidates_per_pair before downstream reads.

The atomic increment continues to grow num_candidates past the storage capacity, while later loops (lines 378, 411–419) and the threshold-update path (line 465) iterate up to num_candidates, indexing shared_candidate_ips/shared_candidate_indices at slots that were never written. The NoEX block-sort variant has the same shape at lines 735–742 / 758–766. Please apply a single stored_candidates = min(num_candidates, params.max_candidates_per_pair) clamp at the post-collection sync point and use it consistently in all subsequent loops and threshold checks.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu` around
lines 332 - 338, The atomic increment of num_candidates can exceed the storage
arrays and later loops/readers (which access
shared_candidate_ips/shared_candidate_indices) must not iterate past stored
entries: after the post-collection __syncthreads() point compute int
stored_candidates = min(num_candidates, params.max_candidates_per_pair) and
replace all subsequent uses of num_candidates in loops and in the
threshold-update path with stored_candidates (apply the same clamp in the NoEX
block-sort variant that mirrors the same collection shape) so readers only index
written slots.
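A host-side analogue of the proposed clamp (struct and member names are hypothetical): writers bump the counter past capacity, so readers must bound their loops by the number of slots actually written, not by the raw counter.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Writers grab slots with a post-increment (atomicAdd in the kernel) and only
// store when the slot fits; the counter itself keeps growing past capacity.
struct CandidateBuffer {
  uint32_t capacity;
  uint32_t count = 0;  // analogue of num_candidates
  std::vector<float> ips;

  explicit CandidateBuffer(uint32_t cap) : capacity(cap), ips(cap) {}

  void push(float ip)
  {
    uint32_t slot = count++;  // atomicAdd(&num_candidates, 1) analogue
    if (slot < capacity) ips[slot] = ip;
  }

  // Readers must use the clamped value, never the raw counter.
  uint32_t stored() const { return std::min(count, capacity); }
};
```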

459-460: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Still applies: drop the (uint32_t*) casts on params.d_topk_pids to keep type safety with the templated index type.

params.d_topk_pids is PID* (currently uint32_t*), but if IdxT is ever templated to int64_t (as already done for raft::make_device_vector<PID, int64_t>) these explicit casts will silently bypass the mismatch. Removing them surfaces the issue at compile time. The same comment was raised on the matching searcher_gpu_quantize_query.cu and searcher_gpu.cu lines.

Also applies to: 787-788

🤖 Prompt for AI Agents

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu` around
lines 459 - 460, The cast (uint32_t*) on params.d_topk_pids in the queue.store
call must be removed to preserve type safety for templated index types
(PID/IdxT); update the two occurrences (queue.store(params.d_topk_dists +
output_offset, (uint32_t*)(params.d_topk_pids + output_offset))) to pass
params.d_topk_pids with its native PID* type (no cast) so mismatched PID/IdxT
instantiations fail at compile time, and apply the same removal at the other
occurrence around lines 787-788; ensure the surrounding code in
searcher_gpu_shared_mem_opt.cu (and analogous spots in
searcher_gpu_quantize_query.cu / searcher_gpu.cu) accepts PID* rather than
forcing uint32_t*.
cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu (2)

353-358: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Still applies: clamp num_candidates to max_candidates_per_pair across all four block-sort kernels.

The pattern (atomicAdd increments num_candidates, then later loops iterate up to num_candidates) is repeated unchanged in all four block-sort kernels here:

  • Lines 353-358 + 438-439 + 471-479 (8-bit, EX)
  • Lines 664-669 + 691-699 (8-bit, NoEX)
  • Lines 895-900 + 982-983 + 1015-1023 (4-bit, EX)
  • Lines 1204-1209 + 1231-1239 (4-bit, NoEX)

Please introduce a stored_candidates = min(num_candidates, params.max_candidates_per_pair) at the post-collection sync point in each variant (or, better, factor the kernel body and fix once — see the duplication comment below) and use it for every subsequent loop bound and threshold check.

🤖 Prompt for AI Agents

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu` around
lines 353 - 358, The block-sort kernels use atomicAdd(&num_candidates, 1) to
collect candidates but then iterate up to num_candidates without clamping, which
can exceed params.max_candidates_per_pair; update each of the four variants
(8-bit EX/NoEX and 4-bit EX/NoEX) by introducing a local stored_candidates =
min(num_candidates, params.max_candidates_per_pair) at the post-collection
__syncthreads() point and replace every subsequent loop bound and threshold
check that currently uses num_candidates with stored_candidates; ensure this
change is applied where shared_candidate_ips and shared_candidate_indices are
read after collection (the block-sort kernel bodies that perform candidate
loops) so the loops never iterate past params.max_candidates_per_pair.

520-521: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Still applies: remove the (uint32_t*) casts on params.d_topk_pids in every queue.store(...) call.

The casts hide a type mismatch with the templated IdxT and would mask future regressions if PID/IdxT widen. Same finding as the prior review on searcher_gpu.cu and searcher_gpu_shared_mem_opt.cu; please address consistently across the four call-sites here (lines 521, 755, 1065, 1295) and the two in the shared-mem-opt file.

Also applies to: 754-755, 1064-1065, 1294-1295

🤖 Prompt for AI Agents

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu` around
lines 520 - 521, The queue.store(...) calls currently cast params.d_topk_pids to
(uint32_t*) which hides a type mismatch with the templated index type; remove
the explicit (uint32_t*) casts and pass params.d_topk_pids (typed as PID/IdxT)
directly to queue.store so the compiler enforces correct template types (update
all four occurrences of queue.store(...) in searcher_gpu_quantize_query.cu that
reference params.d_topk_pids and the two analogous calls in the shared-mem-opt
implementation); ensure the queue.store signature and any template parameters
align with IdxT/PID so no unsafe narrowing casts remain.
🧹 Nitpick comments (2)
cpp/src/neighbors/ivf_rabitq/gpu_index/rotator_gpu.cu (1)

37-55: ⚡ Quick win

Replace per-element file I/O with bulk read/write.

Both load and save iterate one float at a time over D*D elements. For padded dims like 512–1024, that's 256k–1M syscalls/format-state checks per call. A single read/write of the whole buffer is materially faster and equally safe (host buffer is contiguous).

♻️ Proposed fix
 void RotatorGPU::load(std::ifstream& input)
 {
   auto host_buf = raft::make_host_vector<float, int64_t>(D * D);
-  for (size_t i = 0; i < D * D; ++i) {
-    input.read(reinterpret_cast<char*>(&host_buf(i)), sizeof(float));
-  }
+  input.read(reinterpret_cast<char*>(host_buf.data_handle()),
+             static_cast<std::streamsize>(sizeof(float) * D * D));
   raft::copy(rotation_matrix_.data_handle(), host_buf.data_handle(), D * D, stream_);
   raft::resource::sync_stream(handle_);
 }

 void RotatorGPU::save(std::ofstream& output) const
 {
   auto host_buf = raft::make_host_vector<float, int64_t>(D * D);
   raft::copy(host_buf.data_handle(), rotation_matrix_.data_handle(), D * D, stream_);
   raft::resource::sync_stream(handle_);
-  for (size_t i = 0; i < D * D; ++i) {
-    output.write(reinterpret_cast<char*>(&host_buf(i)), sizeof(float));
-  }
+  output.write(reinterpret_cast<char*>(host_buf.data_handle()),
+               static_cast<std::streamsize>(sizeof(float) * D * D));
 }
🤖 Prompt for AI Agents

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/rotator_gpu.cu` around lines 37 - 55,
The load/save loops in RotatorGPU::load and RotatorGPU::save perform per-element
file I/O; replace them with a single bulk read/write of the contiguous host
buffer (host_buf) sized D*D floats to avoid many syscalls: in RotatorGPU::load
do one input.read into host_buf's contiguous storage for sizeof(float) * D * D
before the raft::copy and sync, and in RotatorGPU::save do the raft::copy and
sync then one output.write from host_buf's contiguous storage for sizeof(float)
* D * D; ensure you keep the existing raft::copy(rotation_matrix_...) and
raft::resource::sync_stream(handle_) calls.
cpp/src/neighbors/ivf_rabitq/gpu_index/quantizer_gpu.cuh (1)

162-187: 💤 Low value

Unused FAST_SIZE/num_factors parameters across multiple helpers — clean up the API.

factor_ip, factor_sumxb, factor_err, and next_block all accept a size_t FAST_SIZE (or num_factors) argument that is silently ignored. The TODO at line 180 already acknowledges num_factors is unused. Either the layout truly does not depend on FAST_SIZE/num_factors (in which case remove the parameter), or the helpers need to multiply by FAST_SIZE (in which case they currently produce wrong offsets if the SoA layout changes). The two block_factor overloads at lines 150-153 and 157-161 likewise differ only by an unused size_t D.

These vestigial parameters make the contract ambiguous and create a maintenance trap: a reader will reasonably assume call-sites must pass meaningful FAST_SIZE values, but every implementation discards them.

♻️ Suggested cleanup
-  __host__ inline float* factor_x2(uint32_t* block_fac) const
-  {
-    return reinterpret_cast<float*>(block_fac);
-  }
-  __host__ inline float* factor_ip(uint32_t* block_fac, size_t FAST_SIZE) const
-  {
-    return reinterpret_cast<float*>(block_fac) + 1;
-  }
-  __host__ inline float* factor_sumxb(uint32_t* block_fac, size_t FAST_SIZE) const
-  {
-    return reinterpret_cast<float*>(block_fac) + 2;
-  }
-  __host__ inline float* factor_err(uint32_t* block_fac, size_t FAST_SIZE) const
-  {
-    return reinterpret_cast<float*>(block_fac) + 3;
-  }
-  // TODO: remove unused num_factors
-  __host__ inline uint32_t* next_block(uint32_t* block,
-                                       size_t code_len,
-                                       size_t FAST_SIZE,
-                                       size_t num_factors = 1) const
-  {
-    return block + code_len + num_factors;
-  }
+  __host__ inline float* factor_x2(uint32_t* block_fac) const     { return reinterpret_cast<float*>(block_fac);     }
+  __host__ inline float* factor_ip(uint32_t* block_fac) const     { return reinterpret_cast<float*>(block_fac) + 1; }
+  __host__ inline float* factor_sumxb(uint32_t* block_fac) const  { return reinterpret_cast<float*>(block_fac) + 2; }
+  __host__ inline float* factor_err(uint32_t* block_fac) const    { return reinterpret_cast<float*>(block_fac) + 3; }
+  __host__ inline uint32_t* next_block(uint32_t* block, size_t code_len) const
+  {
+    return block + code_len + /*num_factors=*/1;
+  }

If a layout migration is actually planned (where FAST_SIZE would matter), please leave a clear // TODO(...) instead of silently ignoring the argument.

🤖 Prompt for AI Agents

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/quantizer_gpu.cuh` around lines 162 -
187, The helpers factor_ip, factor_sumxb, factor_err, next_block and the
duplicate block_factor overloads include unused size_t parameters (FAST_SIZE,
num_factors, D) that are confusing; remove these unused parameters from the
function signatures (factor_ip(uint32_t*), factor_sumxb(uint32_t*),
factor_err(uint32_t*), next_block(uint32_t*, size_t code_len) and the redundant
block_factor overload) and update all call-sites accordingly, or if the layout
actually depends on FAST_SIZE/num_factors, change the implementations to compute
offsets using that value (e.g., add offsets multiplied by FAST_SIZE or
num_factors) and replace the TODO with a clear comment; pick one approach and
make signatures and implementations consistent (refer to functions factor_ip,
factor_sumxb, factor_err, next_block and the two block_factor overloads).
🤖 Prompt for all review comments with AI agents

Inline comments:
In `@cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cu`:
- Around line 40-43: LoadCentroids/SaveCentroids currently assume file I/O
succeeded and immediately use centroids_ and host_buf; update both functions to
validate the stream read/write succeeded and the number of bytes processed
matches data_bytes() before calling raft::copy or returning. After
input.read(...), check input.good() / input.gcount() (or input.fail()/eof() and
compare gcount() == data_bytes()) and if the check fails, return or throw an
error and do not call raft::copy(centroids_.data_handle(),
host_buf.data_handle(), data_elements(), stream_) or sync_stream(handle_);
likewise, after output.write(...) verify output.good() (or check bytes written)
and handle/report the error instead of silently succeeding. Reference
centroids_, host_buf, data_bytes(), data_elements(), stream_, raft::copy, and
raft::resource::sync_stream(handle_) when making these checks; apply the same
validation to the other occurrence noted (around the second read/write at the
other location).
- Around line 27-30: FlatInitializerGPU::AddVectors currently calls raft::copy
from cent without validating the input pointer; add a null-pointer guard at the
start of AddVectors that checks if cent is nullptr and handles it (e.g., throw
std::invalid_argument, return an error code, or log and abort) before calling
raft::copy(centroids_.data_handle(), cent, data_elements(), stream_); ensure the
check references the cent parameter and preserves existing behavior for valid
inputs and uses the same error-handling policy as other functions in this
module.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cuh`:
- Around line 33-35: The constructor InitializerGPU(raft::resources const&
handle, size_t d, size_t k) must validate the input dimensions D and K
immediately: check that d>0 and k>0 (and optionally that multiplication won't
overflow) and fail fast (throw std::invalid_argument or use an assertion) before
any GPU allocation or storing handle_ occurs; update the ctor body around the
InitializerGPU initializer list to perform these checks and produce a clear
error message referencing D and K so downstream copy/load paths never see
K*D==0.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_common.cuh`:
- Line 27: The code declares lut_dtype = __half (FP16) but many symbols are
named with "bf16", causing a misleading mismatch; fix by either (A) renaming all
identifiers that contain "_bf16" to "_fp16" to reflect __half usage (e.g.
shared_lut_bf16 -> shared_lut_fp16, precomputeAllLUTs_bf16_simple ->
precomputeAllLUTs_fp16_simple, precomputeAllLUTs_bf16_optimized,
launchPrecomputeLUTs_bf16, neg_inf_bf16, etc.), or (B) if BF16 was intended,
change lut_dtype to __nv_bfloat16 and swap conversion intrinsics to
__float2bfloat16/__bfloat162float; apply the chosen change consistently across
declarations, function signatures, variable names and conversion calls to
eliminate ambiguity.
- Around line 65-74: The extract_code function has a 16-bit accumulator with a
shift calculation that becomes negative when EX_BITS exceeds 9, causing
undefined behavior and silent data corruption. Since EX_BITS is derived from the
user-facing bits_per_dim parameter (ex_bits = bits_per_dim - 1), add a
precondition check in the IVFGPU constructor using RAFT_EXPECTS to validate that
bits_per_dim is constrained to the range [2..10], which maps to safe EX_BITS
values of [1..9]. This prevents out-of-range configurations from silently
corrupting results during search operations by catching invalid inputs at index
construction time.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu`:
- Around line 252-1333: The four near-duplicate kernels
(computeInnerProductsWithBitwiseOpt8bitBlockSort,
computeInnerProductsWithBitwiseOpt8bitNoEXBlockSort,
computeInnerProductsWithBitwiseOpt4bitBlockSort,
computeInnerProductsWithBitwiseOpt4bitNoEXBlockSort) should be refactored into a
single templated kernel to remove duplication: create template<int NumBits, bool
WithEx> __global__ computeInnerProductsWithBitwiseBlockSort(const
ComputeInnerProductsKernelParams params) that implements the shared setup
(shared memory layout, packed-query load, candidate filtering, exact-IP
recompute, queue.add/done/store, threshold update) and uses NumBits to unroll
the popcount accumulator (handling the last sign plane with subtraction) and if
constexpr(WithEx) to include/exclude the long-code IP2 path; dispatch to the
four concrete instantiations (4/8 bits × with/without EX) from the callers so
all logic (including num_candidates clamp, output/store, and atomicMin threshold
update) exists in one place.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8ebd77f7-f56e-4d04-a51d-1dfcc8809d5a

📥 Commits

Reviewing files that changed from the base of the PR and between e28491b and 4ddf3f9.

📒 Files selected for processing (15)
  • cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cuh
  • cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/quantizer_gpu.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/quantizer_gpu.cuh
  • cpp/src/neighbors/ivf_rabitq/gpu_index/rotator_gpu.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cuh
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_common.cuh
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu
  • cpp/src/neighbors/ivf_rabitq/utils/IO.hpp
  • cpp/src/neighbors/ivf_rabitq/utils/memory.hpp
  • docs/source/cpp_api/neighbors.rst
  • docs/source/cpp_api/neighbors_ivf_rabitq.rst
✅ Files skipped from review due to trivial changes (4)
  • docs/source/cpp_api/neighbors.rst
  • docs/source/cpp_api/neighbors_ivf_rabitq.rst
  • cpp/src/neighbors/ivf_rabitq/gpu_index/quantizer_gpu.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu
🚧 Files skipped from review as they are similar to previous changes (4)
  • cpp/src/neighbors/ivf_rabitq/utils/memory.hpp
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cuh
  • cpp/src/neighbors/ivf_rabitq/utils/IO.hpp
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu

@coderabbitai Bot left a comment

Actionable comments posted: 6

🧹 Nitpick comments (1)
cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cu (1)

23-26: ⚡ Quick win

const_cast here signals a const-correctness mismatch in the API.

GetCentroid(PID id) const returns a non-const float* to underlying storage by stripping const from centroids_.data_handle(). Either the method should not be const (it hands out mutable access to member state), or the return type should be const float*. The const_cast is a workaround that hides the contract from callers and makes it easy to mutate centroids through what looks like a read-only accessor.

Consider exposing a typed view to make layout explicit and avoid the cast:

Suggested refactor
-__host__ __device__ float* FlatInitializerGPU::GetCentroid(PID id) const
-{
-  return const_cast<float*>(centroids_.data_handle()) + id * D;
-}
+__host__ __device__ const float* FlatInitializerGPU::GetCentroid(PID id) const
+{
+  return centroids_.data_handle() + id * D;
+}

(and update the base-class declaration / call sites accordingly; if mutation is intended, drop const from the method instead.)

🤖 Prompt for AI Agents

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cu` around lines 23 -
26, The const_cast in FlatInitializerGPU::GetCentroid(PID id) const hides a
const-correctness bug: either make the method non-const if callers are expected
to mutate the centroid data, or change its return type to const float* so it
does not strip const from centroids_.data_handle(); remove the const_cast
entirely. Update the base-class declaration and all call sites of
FlatInitializerGPU::GetCentroid (and any overrides) to match the chosen
signature, or alternatively return a typed read-only/view wrapper around
centroids_.data_handle() to make layout explicit instead of casting.
🤖 Prompt for all review comments with AI agents

Inline comments:
In `@cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cu`:
- Around line 28-32: The AddVectors implementation calls raft::copy(…, stream_)
which is asynchronous, so the pointer parameter cent must remain valid until the
stream work completes; either synchronize the stream before returning (to match
LoadCentroids behavior) or update the initializer_gpu.cuh header docstring to
state the asynchronous/lifetime contract. Specifically, modify
FlatInitializerGPU::AddVectors to call cudaStreamSynchronize(stream_) (or the
RAFT/stream abstraction equivalent) after raft::copy if you want synchronous
semantics, or else update the comment in initializer_gpu.cuh (near the
FlatInitializerGPU/AddVectors documentation) to explicitly state that cent must
remain valid until stream_ is synchronized or until subsequent work on stream_
that consumes centroids_ completes; reference cent, centroids_, stream_,
raft::copy, AddVectors, and LoadCentroids when making the change.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu`:
- Around line 794-806: The kernel launch for exrabitq_quantize_query computes
shared_mem based on D and launches directly, which can exceed device
shared-memory limits; replace this direct launch with a guarded launch using
safely_launch_kernel_with_smem_size (or equivalent helper) that validates
shared_mem against deviceProp.sharedMemPerBlock and either reduces/adjusts
parameters or returns a clear error; ensure you reference the computed
shared_mem, block_size, grid_size, stream_, and the kernel name
exrabitq_quantize_query when making the check and use the helper to actually
perform the launch so runtime failures are avoided and a validation error is
surfaced if limits are exceeded.
- Around line 551-555: The computed delta (via cos_similarity = ip_resi_xucb /
(norm * norm_quan) and delta = norm / norm_quan * cos_similarity) can become
NaN/Inf when norm or norm_quan is zero; guard the division by computing a denom
= norm * norm_quan and compare against a small epsilon (e.g. 1e-8f) before
dividing, and if denom is <= epsilon set cos_similarity and delta to safe
defaults (0.0f) or clamp them to a bounded range; update the block that computes
norm_quan, cos_similarity and delta (referencing variables norm, norm_quan,
ip_resi_xucb, cos_similarity, delta) to use this epsilon-guarded logic so
query_width downstream cannot be poisoned by a degenerate query.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu`:
- Around line 560-575: Guard against empty work early: if num_queries == 0 ||
nprobe == 0 || topk == 0, return/skip GPU work before computing
lut_elements/lut_size, allocating d_lut_for_queries, or calling
launchPrecomputeLUTs_fp16; also avoid computing expressions like lut_size /
num_queries when num_queries can be zero by moving those computations after the
short-circuit. Additionally, before any kernel launches that use num_pairs or
derived grid/block sizes, validate that grid and block dimensions are non-zero
(e.g., derived from num_pairs, nprobe, or num_queries) and skip the launch if
invalid. Ensure these checks are applied in the block that contains
lut_elements/lut_size/d_lut_for_queries and the similar section around the later
code (the second region noted at ~665-690).
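The entry guard suggested above amounts to a short-circuit before any sizing or launch math. A minimal host-side sketch (the real method is void and would simply `return;`; the helper names are illustrative):

```cpp
#include <cassert>
#include <cstddef>

// True only when there is actual work; check this at function entry,
// before computing lut_elements/lut_size or allocating d_lut_for_queries.
bool has_search_work(std::size_t num_queries, std::size_t nprobe, std::size_t topk)
{
  return num_queries != 0 && nprobe != 0 && topk != 0;
}

// The division the guard protects; only evaluated once work exists.
std::size_t smem_bytes_per_query(std::size_t lut_size, std::size_t num_queries)
{
  return lut_size / num_queries;  // safe: caller ensured num_queries != 0
}
```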

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu`:
- Around line 631-642: The LUT is allocated using cur_ivf.get_num_padded_dim()
but kernels use kernelParams.D (and launchPrecomputeLUTs is called with D),
causing tail dimensions to be dropped; fix by propagating the padded dimension
end-to-end: compute a single variable paddedD = cur_ivf.get_num_padded_dim(),
use paddedD for lut_size allocation, pass paddedD into launchPrecomputeLUTs and
into kernelParams.D (or rename to kernelParams.paddedD) so the kernel uses
params.D = paddedD when computing params.D / BITS_PER_CHUNK and params.D / 32;
additionally, add an input validation/assert that paddedD is non-negative, a
multiple of the packing granularity and that d_query/d_lut_for_queries are
non-null before launching GPU work.
- Around line 44-52: The constructor currently only logs an error when mode_ is
"quant4" or "quant8" but fast_quantize_factors is null, leaving
best_rescaling_factor unset and later causing undefined behavior in
exrabitq_quantize_query; change this to fail fast by validating the combination
up front: inside the same initialization block (referencing mode_,
fast_quantize_factors, and best_rescaling_factor) throw a descriptive exception
(e.g., std::invalid_argument) or call a fatal error routine when mode_ is
"quant4" or "quant8" and fast_quantize_factors is null so object construction
stops immediately and downstream calls like exrabitq_quantize_query cannot run
with an uninitialized rescaling factor.
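The fail-fast pattern can be sketched with a small stand-in for the searcher constructor. The class name, members, and std::invalid_argument are illustrative only; cuVS code would more likely use RAFT_EXPECTS.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

struct Searcher {
  std::string mode_;
  float best_rescaling_factor_;

  Searcher(const std::string& mode, const float* fast_quantize_factors) : mode_(mode)
  {
    // Validate the combination up front so construction stops immediately.
    if ((mode_ == "quant4" || mode_ == "quant8") && fast_quantize_factors == nullptr) {
      throw std::invalid_argument("mode '" + mode_ + "' requires fast_quantize_factors");
    }
    // Safe to read only after the guard above.
    best_rescaling_factor_ = fast_quantize_factors ? fast_quantize_factors[0] : 1.0f;
  }
};
```

Downstream calls like exrabitq_quantize_query then never observe an uninitialized rescaling factor, because the object never exists in that state.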

---

Nitpick comments:
In `@cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cu`:
- Around line 23-26: The const_cast in FlatInitializerGPU::GetCentroid(PID id)
const hides a const-correctness bug: either make the method non-const if callers
are expected to mutate the centroid data, or change its return type to const
float* so it does not strip const from centroids_.data_handle(); remove the
const_cast entirely. Update the base-class declaration and all call sites of
FlatInitializerGPU::GetCentroid (and any overrides) to match the chosen
signature, or alternatively return a typed read-only/view wrapper around
centroids_.data_handle() to make layout explicit instead of casting.
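The const-correct variant can be sketched with a plain host stand-in for FlatInitializerGPU (the real class holds a raft device matrix; this just shows the signature change that removes the const_cast):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class FlatInitializer {
 public:
  FlatInitializer(std::size_t num_centroids, std::size_t dim)
    : dim_(dim), centroids_(num_centroids * dim, 0.0f) {}

  // Read-only view of one centroid; the const method no longer strips
  // const from the underlying storage, so no const_cast is needed.
  const float* GetCentroid(std::size_t id) const { return centroids_.data() + id * dim_; }

 private:
  std::size_t dim_;
  std::vector<float> centroids_;
};
```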

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 945d30a0-937b-4e6f-bee4-af62b408f8cc

📥 Commits

Reviewing files that changed from the base of the PR and between 4ddf3f9 and 3626313.

📒 Files selected for processing (6)
  • cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cuh
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu
  • cpp/src/neighbors/ivf_rabitq/utils/memory.hpp

Comment thread cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cu
Comment thread cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu Outdated
Comment thread cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu Outdated

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (4)
cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu (2)

557-575: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Return early for empty work before LUT sizing and kernel launch setup.

This path still falls through when num_queries == 0, nprobe == 0, or topk == 0. That leads to a zero-grid LUT launch and later lut_size / num_queries shared-memory sizing, so the failure mode is either an invalid launch or a division-by-zero. Please short-circuit those cases at function entry before any allocation or launch math.

Also applies to: 630-698

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu` around
lines 557 - 575, Short-circuit the function at entry when there's no work by
returning immediately if num_queries == 0 || nprobe == 0 || topk == 0 (do this
before computing lut_elements/lut_size, allocating d_lut_for_queries, or calling
launchPrecomputeLUTs_fp16) to avoid zero-grid kernel launches and
division-by-zero when computing shared-memory per-query (e.g., lut_size /
num_queries); ensure the check occurs before any use of num_queries,
lut_elements, lut_size, d_lut_for_queries, or launchPrecomputeLUTs_fp16 so all
subsequent allocation/launch math is skipped for empty work.

303-425: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Clamp num_candidates before consuming the shared candidate buffers.

num_candidates keeps incrementing even after candidate_slot >= params.max_candidates_per_pair, but the later IP2/queue/threshold loops still iterate up to num_candidates. Once a probed cluster produces more candidates than the shared arrays can hold, this reads unwritten shared memory and corrupts the block-sort results. Compute a bounded stored_candidates = min(num_candidates, (int)params.max_candidates_per_pair) after collection and use that for every subsequent loop and threshold check in both the WithEx and NoEx branches.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu` around
lines 303 - 425, After collecting candidates into shared_candidate_* but before
any reads/use of those buffers (immediately after the __syncthreads() following
the candidate collection), compute an int stored_candidates =
min(num_candidates, (int)params.max_candidates_per_pair) and use
stored_candidates everywhere you currently use num_candidates for subsequent
loops and checks (the warp loop that reads vec_long_code, the adds_per_thread
calculation and its loop, the candidates_per_thread path and its loop, the final
if (num_candidates >= params.topk) threshold calculation and any iterations that
scan shared_candidate_ips/indices); this ensures you never read unwritten shared
memory when num_candidates exceeded params.max_candidates_per_pair.
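The clamp itself is one std::min; the point is that every consumer loop must be bounded by it. A host-side sketch of a bounded consumer (the function and the "sum" payload are illustrative, standing in for the IP2/queue/threshold loops):

```cpp
#include <algorithm>
#include <cassert>

// The raw counter keeps growing past capacity, so reads must be bounded by
// the number of slots actually written to the (simulated) shared buffer.
float sum_written_candidates(const float* candidate_ips,
                             int num_candidates,
                             int max_candidates_per_pair)
{
  int stored_candidates = std::min(num_candidates, max_candidates_per_pair);
  float total = 0.0f;
  for (int i = 0; i < stored_candidates; ++i) { total += candidate_ips[i]; }
  return total;
}
```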
cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu (2)

797-811: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard the quantization kernel’s shared-memory requirement before launch.

shared_mem scales directly with D, but this launch bypasses safely_launch_kernel_with_smem_size. Larger padded dimensions will fail at runtime once the shared-memory request crosses the device limit, instead of producing a clean validation error or fallback. Please launch this the same guarded way as the later search kernels.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu` around
lines 797 - 811, The kernel launch for exrabitq_quantize_query uses a computed
shared_mem (based on D) without checking device limits; replace the raw
triple-chevron launch and cudaPeekAtLastError call with the same guarded helper
used for later kernels (safely_launch_kernel_with_smem_size) so the requested
shared memory is validated/falls back; pass the same template/kernel symbol
exrabitq_quantize_query<block_size>, the grid_size, block_size, computed
shared_mem and stream_, and forward the arguments (d_query, num_queries, D,
num_bits, best_rescaling_factor, 1.9f, d_quantized_queries.data_handle(),
d_widths.data_handle()) to that launcher.
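A pure-host sketch of the check such a guarded launcher performs. On device, the limit comes from cudaDeviceProp::sharedMemPerBlock (or the opt-in maximum via cudaFuncAttributeMaxDynamicSharedMemorySize); the D-based sizing below is an assumption standing in for the kernel's actual formula, and safely_launch_kernel_with_smem_size in the PR presumably wraps this logic around the real launch.

```cpp
#include <cassert>
#include <cstddef>

// Assumed sizing: dynamic shared memory proportional to the padded dim D.
std::size_t quantize_query_smem_bytes(std::size_t D) { return D * sizeof(float); }

// Launch only when the request fits; otherwise surface a validation error
// instead of a runtime launch failure.
bool smem_request_fits(std::size_t shared_mem, std::size_t smem_per_block)
{
  return shared_mem <= smem_per_block;
}
```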

250-430: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Clamp the block-sort candidate count to the number actually stored.

num_candidates still counts overflowed candidates after the shared arrays hit params.max_candidates_per_pair, but the later exact-IP/IP2/queue/threshold logic indexes the shared buffers up to num_candidates. For dense clusters, that means reading shared memory that was never written. Please derive a bounded stored_candidates after collection and use it consistently in the WithEx and NoEx branches.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu` around
lines 250 - 430, The kernel lets num_candidates exceed the actual number of
entries written into shared_candidate_* (when atomicAdd overflowed the shared
buffer), causing out-of-bounds reads; after the collection phase compute an int
stored_candidates = min(num_candidates, params.max_candidates_per_pair) and use
stored_candidates everywhere instead of num_candidates (including the
candidates_per_thread calculation, all loops that iterate candidates, the
warp-level loop that writes shared_ip2_results, the queue-add loops in both
WithEx and NoEx branches, and the threshold/update logic and early return check)
so all indexing into
shared_candidate_ips/shared_candidate_indices/shared_ip2_results is bounded by
the true stored count.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu`:
- Around line 478-543: The CUB calls and raw kernel launches currently have no
immediate error checking; wrap the direct CUB invocations
(cub::DeviceHistogram::HistogramEven and cub::DeviceScan::ExclusiveSum) with
RAFT_CUDA_TRY(...) and after each cuda kernel launch
(build_cluster_meta_kernel<<<...>>> and scatter_pids_kernel<<<...>>>) call
RAFT_CUDA_TRY(cudaPeekAtLastError()) to surface launch/CUB failures immediately;
apply the same pattern to the later occurrences around lines 628-702 so every
direct CUB call and <<<...>>> launch is followed by the RAFT_CUDA_TRY checks.
- Around line 92-117: The load_transposed implementation currently uses
assert(input.is_open()) and unchecked input.read calls which can leave members
like num_vectors, num_dimensions, num_centroids, ex_bits and cluster_sizes
partially populated; replace the assert with an explicit runtime check that the
file opened and add a small helper (e.g., read_exact or similar) used for every
metadata/header read in IVFGPU::load_transposed so each read validates that the
expected number of bytes were returned, and on failure throw or return a clear
runtime error instead of relying on asserts; apply the same checked-read pattern
when loading DataQuantizerGPU fast-quantize factors and the cluster_sizes vector
and replace the final assert(std::accumulate(...)) with an explicit condition
that emits a runtime error if the totals mismatch before proceeding to allocate
GPU resources or construct DataQuantizerGPU/RotatorGPU.
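A checked-read helper along these lines is small. This sketch validates the byte count on every read instead of relying on assert(); throwing std::runtime_error is illustrative, and RAFT_EXPECTS would be the idiomatic cuVS choice.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>  // for the usage example only
#include <stdexcept>

// Read exactly nbytes or fail loudly; use this for every metadata/header
// read in load_transposed so partially populated state is impossible.
void read_exact(std::istream& input, void* dst, std::size_t nbytes)
{
  input.read(static_cast<char*>(dst), static_cast<std::streamsize>(nbytes));
  if (static_cast<std::size_t>(input.gcount()) != nbytes) {
    throw std::runtime_error("truncated index file: short read");
  }
}
```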

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu`:
- Around line 551-615: num_candidates can exceed the number of entries actually
written into shared_candidate_ips/shared_candidate_pids (capped by
params.max_candidates_per_pair), causing out-of-bounds reads; fix by computing a
bounded stored_candidates = min(num_candidates, params.max_candidates_per_pair)
and use stored_candidates instead of num_candidates for the queue fill loop
(candidates_per_thread calculation and cand_idx check) and for the subsequent
"if (num_candidates >= params.topk)" check/logic (replace with stored_candidates
>= params.topk) so only the written shared slots are read when calling queue.add
and when computing max_topk_dist/threshold updates; keep all other logic
(queue.done, probe_slot, store) unchanged.

---

Duplicate comments:
In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu`:
- Around line 797-811: The kernel launch for exrabitq_quantize_query uses a
computed shared_mem (based on D) without checking device limits; replace the raw
triple-chevron launch and cudaPeekAtLastError call with the same guarded helper
used for later kernels (safely_launch_kernel_with_smem_size) so the requested
shared memory is validated/falls back; pass the same template/kernel symbol
exrabitq_quantize_query<block_size>, the grid_size, block_size, computed
shared_mem and stream_, and forward the arguments (d_query, num_queries, D,
num_bits, best_rescaling_factor, 1.9f, d_quantized_queries.data_handle(),
d_widths.data_handle()) to that launcher.
- Around line 250-430: The kernel lets num_candidates exceed the actual number
of entries written into shared_candidate_* (when atomicAdd overflowed the shared
buffer), causing out-of-bounds reads; after the collection phase compute an int
stored_candidates = min(num_candidates, params.max_candidates_per_pair) and use
stored_candidates everywhere instead of num_candidates (including the
candidates_per_thread calculation, all loops that iterate candidates, the
warp-level loop that writes shared_ip2_results, the queue-add loops in both
WithEx and NoEx branches, and the threshold/update logic and early return check)
so all indexing into
shared_candidate_ips/shared_candidate_indices/shared_ip2_results is bounded by
the true stored count.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu`:
- Around line 557-575: Short-circuit the function at entry when there's no work
by returning immediately if num_queries == 0 || nprobe == 0 || topk == 0 (do
this before computing lut_elements/lut_size, allocating d_lut_for_queries, or
calling launchPrecomputeLUTs_fp16) to avoid zero-grid kernel launches and
division-by-zero when computing shared-memory per-query (e.g., lut_size /
num_queries); ensure the check occurs before any use of num_queries,
lut_elements, lut_size, d_lut_for_queries, or launchPrecomputeLUTs_fp16 so all
subsequent allocation/launch math is skipped for empty work.
- Around line 303-425: After collecting candidates into shared_candidate_* but
before any reads/use of those buffers (immediately after the __syncthreads()
following the candidate collection), compute an int stored_candidates =
min(num_candidates, (int)params.max_candidates_per_pair) and use
stored_candidates everywhere you currently use num_candidates for subsequent
loops and checks (the warp loop that reads vec_long_code, the adds_per_thread
calculation and its loop, the candidates_per_thread path and its loop, the final
if (num_candidates >= params.topk) threshold calculation and any iterations that
scan shared_candidate_ips/indices); this ensures you never read unwritten shared
memory when num_candidates exceeded params.max_candidates_per_pair.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 56b3bc84-8727-424b-a567-838abefa4328

📥 Commits

Reviewing files that changed from the base of the PR and between 3626313 and b6d64f2.

📒 Files selected for processing (8)
  • cpp/src/neighbors/ivf_rabitq.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cuh
  • cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cuh
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_common.cuh
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu

Comment thread cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu Outdated
Comment thread cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu Outdated
Comment thread cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 7

♻️ Duplicate comments (1)
cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu (1)

1113-1138: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

cub::DeviceRadixSort::SortPairs calls are not wrapped with RAFT_CUDA_TRY.

Both the sizing call (lines 1115–1124) and the actual sort (lines 1129–1138) return cudaError_t but the value is discarded, so CUB failures are silently dropped. All other CUB calls in this file (HistogramEven, ExclusiveSum) have already been fixed to wrap with RAFT_CUDA_TRY.

🛡️ Proposed fix
-  cub::DeviceRadixSort::SortPairs(nullptr,
-                                  temp_storage_bytes,
-                                  d_cluster_keys.data_handle(),
-                                  d_sorted_clusters.data_handle(),
-                                  d_query_values.data_handle(),
-                                  d_sorted_queries.data_handle(),
-                                  total_pairs,
-                                  0,
-                                  sizeof(int) * 8,
-                                  stream);
+  RAFT_CUDA_TRY(cub::DeviceRadixSort::SortPairs(nullptr,
+                                                 temp_storage_bytes,
+                                                 d_cluster_keys.data_handle(),
+                                                 d_sorted_clusters.data_handle(),
+                                                 d_query_values.data_handle(),
+                                                 d_sorted_queries.data_handle(),
+                                                 total_pairs,
+                                                 0,
+                                                 sizeof(int) * 8,
+                                                 stream));

   rmm::device_buffer d_temp_storage(temp_storage_bytes, stream);

-  cub::DeviceRadixSort::SortPairs(d_temp_storage.data(),
-                                  temp_storage_bytes,
-                                  d_cluster_keys.data_handle(),
-                                  d_sorted_clusters.data_handle(),
-                                  d_query_values.data_handle(),
-                                  d_sorted_queries.data_handle(),
-                                  total_pairs,
-                                  0,
-                                  sizeof(int) * 8,
-                                  stream);
+  RAFT_CUDA_TRY(cub::DeviceRadixSort::SortPairs(d_temp_storage.data(),
+                                                 temp_storage_bytes,
+                                                 d_cluster_keys.data_handle(),
+                                                 d_sorted_clusters.data_handle(),
+                                                 d_query_values.data_handle(),
+                                                 d_sorted_queries.data_handle(),
+                                                 total_pairs,
+                                                 0,
+                                                 sizeof(int) * 8,
+                                                 stream));

As per coding guidelines, "All CUDA calls (kernel launches, memory operations, synchronization) must have explicit error checking using RAFT macros (RAFT_CUDA_TRY, ...)."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu` around lines 1113 - 1138,
The two calls to cub::DeviceRadixSort::SortPairs (the sizing call and the actual
sort) currently discard their return values; wrap both invocations with
RAFT_CUDA_TRY so any cudaError_t is checked and propagated. Locate the two
SortPairs calls that pass nullptr for temp storage and the one that passes
d_temp_storage.data(), and wrap each entire call expression with
RAFT_CUDA_TRY(...), keeping the same arguments (including stream) and leaving
temp_storage_bytes, d_cluster_keys, d_sorted_clusters, d_query_values,
d_sorted_queries, and total_pairs unchanged.
🧹 Nitpick comments (1)
cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu (1)

1221-1544: 🏗️ Heavy lift

~100 lines of identical code duplicated verbatim across all three BatchClusterSearch* methods.

BatchClusterSearch, BatchClusterSearchLUT16, and BatchClusterSearchQuantizeQuery each repeat identical logic for: (1) matmul, (2) row_norms_fused_kernel, (3) add_norms_kernel, (4) raft::matrix::select_k, (5) sort_cluster_query_pairs, and (6) computeQueryFactors. The only structural difference is the final SearchClusterQuery* dispatch call. Any bug fix or tweak to the shared logic must be applied three times.

Extract the shared preparation into a private helper:

+// Private helper — runs Steps 1-6 for all BatchClusterSearch variants
+void IVFGPU::PrepareClusterSearchInputs(
+    const float* d_query,
+    size_t batch_size,
+    size_t nprobe,
+    SearcherGPU* searcher_batch,
+    raft::device_matrix<ClusterQueryPair, int64_t>& d_sorted_pairs,  // out
+    raft::device_vector<float, int64_t>& d_G_k1xSumq,               // out
+    raft::device_vector<float, int64_t>& d_G_kbxSumq)               // out
+{
+    // Steps 1-6 moved here (matmul, row_norms, add_norms, select_k, sort, computeQueryFactors)
+}

 void IVFGPU::BatchClusterSearch(...) {
+    PrepareClusterSearchInputs(...);
     searcher_batch->SearchClusterQueryPairs(...);
 }
 void IVFGPU::BatchClusterSearchLUT16(...) {
+    PrepareClusterSearchInputs(...);
     searcher_batch->SearchClusterQueryPairsSharedMemOpt(...);
 }
 void IVFGPU::BatchClusterSearchQuantizeQuery(...) {
+    PrepareClusterSearchInputs(...);
     searcher_batch->SearchClusterQueryPairsQuantizeQuery(...);
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu` around lines 1221 - 1544,
Multiple BatchClusterSearch* methods duplicate the same preparation steps
(matmul, row_norms_fused_kernel, add_norms_kernel, raft::matrix::select_k,
sort_cluster_query_pairs, computeQueryFactors); extract these into a private
helper (e.g., PrepareBatchClusterSearch) that accepts the common inputs (const
float* d_query, size_t batch_size, size_t nprobe, SearcherGPU* searcher_batch,
cudaStream_t stream_ (or use handle_/stream_ captured)) and returns or outputs
the computed artifacts needed by the caller: device handle(s) for sorted pairs
(d_sorted_pairs) and the query factor vectors (d_G_k1xSumq, d_G_kbxSumq) as raft
device_vectors or data handles; replace the duplicated blocks in
BatchClusterSearch, BatchClusterSearchLUT16, and BatchClusterSearchQuantizeQuery
with a call to PrepareBatchClusterSearch and then only perform the final
dispatch to SearchClusterQueryPairs / SearchClusterQueryPairsSharedMemOpt /
SearchClusterQueryPairsQuantizeQuery using the returned handles. Ensure the
helper uses the same symbols from the diff (raft::linalg::detail::matmul,
row_norms_fused_kernel, add_norms_kernel, raft::matrix::select_k,
sort_cluster_query_pairs, computeQueryFactors, and searcher_batch->get_*
accessors) and preserves stream_/handle_ and memory lifetimes so the subsequent
SearchClusterQueryPairs* calls can use the returned device data.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/src/neighbors/ivf_rabitq.cu`:
- Around line 176-179: The call to detail::search must validate tensor shapes
and dimensions before dispatching GPU kernels to avoid out-of-bounds device
writes: add strict guards that verify queries.extent(0) == NQ > 0,
queries.extent(1) == dim (and not zero), neighbors.extent(0) == NQ and
neighbors.extent(1) == k, distances (if present) matches neighbors shape, and
that params.n_probes is within valid range (e.g., >0 and <= number of lists);
perform these checks in the function that calls BatchClusterSearch* (reference
detail::search, BatchClusterSearch*, queries, neighbors, distances,
params.n_probes) and early-return or throw on mismatch before any GPU work.
- Around line 104-106: The code uses uint32_t internals (labels and final_ids)
while the public API uses int64_t, which can silently truncate for n_rows >
UINT32_MAX; add an input validation at the start of the IVF/RabitQ index
creation/query path (before any GPU allocations) that checks params.n_rows (or
the public input n_rows) is within [0, UINT32_MAX] and return/throw a clear
error if it exceeds that limit, or alternatively refactor the internal storage
to use uint64_t for labels/final_ids; update checks for invalid/negative
dimensions and null pointers before any rmm::device_uvector allocations and
ensure all cast paths that convert final_ids to int64_t are safe (use bounds
check or 64-bit internals) so no silent wrapping occurs.
- Around line 295-303: The constructor index<IdxT>::index currently constructs
rabitq_index_ (detail::IVFGPU) before validating bits_per_dim; move the
RAFT_EXPECTS(bits_per_dim >= 1 && bits_per_dim <= 9, "Unsupported bits_per_dim")
check to the top of index<IdxT>::index so invalid bits_per_dim is rejected
before any GPU-side object (detail::IVFGPU) is created; ensure the validation
runs prior to the std::make_unique<detail::IVFGPU>(...) call to avoid allocating
GPU resources with invalid parameters.
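The narrowing guard for the uint32_t internals can be sketched as a standalone check run before any GPU allocation. The free-function form and std::invalid_argument are illustrative; the bound itself is just UINT32_MAX.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Reject row counts that cannot round-trip through the uint32_t internal
// id type (labels, final_ids) before any device memory is allocated.
void check_n_rows_fits_u32(std::int64_t n_rows)
{
  if (n_rows < 0 || static_cast<std::uint64_t>(n_rows) > UINT32_MAX) {
    throw std::invalid_argument("n_rows outside the uint32_t internal id range");
  }
}
```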

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu`:
- Around line 241-264: init_clusters enqueues an asynchronous raft::copy to
cluster_meta_ on stream_ but never synchronizes, so callers (via
load_transposed) may read device cluster_meta_ before the copy completes; fix by
synchronizing the stream after the copy (use
raft::resource::sync_stream(handle_) or equivalent) at the end of
IVFGPU::init_clusters so that cluster_meta_ is fully transferred before
returning (reference IVFGPU::init_clusters, raft::copy(..., stream_), and
raft::resource::sync_stream(handle_)).
- Around line 316-398: The IVFGPU::save method currently prints errors to stderr
and returns without signaling failure and never checks std::ofstream write/close
errors; change it to propagate failures: replace the early std::cerr+return
branches (the "IVF not constructed" and "Failed to open file for saving" paths)
with RAFT_EXPECTS or throw an exception so callers can detect failure, and after
each output.write (or after the block of writes) and after output.close() check
output.fail()/output.bad() (or use output.good() negative check) and
raise/RAFT_EXPECTS on error; ensure the checks reference the same std::ofstream
variable named output and the IVFGPU::save function so the behavior is
consistent with other failure paths in the file.
- Around line 873-879: Replace the abort() call used when validating cluster IDs
with RAFT_EXPECTS so the error becomes a recoverable exception: where the loop
iterates over host_cluster_ids and checks "if (cid >= num_centroids) {
std::cerr...; abort(); }" (variables PID, host_cluster_ids, num_centroids,
counts), change that to a RAFT_EXPECTS(cid < num_centroids, "Bad cluster id:
%zu", cid) (or equivalent RAFT_EXPECTS usage in this file) and remove the
explicit std::cerr/abort so the function throws consistently with other
validation sites.
- Around line 35-46: The constructor IVFGPU uses bits_per_dim - 1 to initialize
ex_bits and to construct DQ (DataQuantizerGPU) and currently underflows when
bits_per_dim == 0; fix by ensuring bits_per_dim is validated before using it —
either implement a static factory helper that checks RAFT_EXPECTS(bits_per_dim >
0) and then calls IVFGPU or convert to two-phase initialization (compute/assign
ex_bits and construct DQ/RotatorGPU in the constructor body after asserting
bits_per_dim > 0), so references to bits_per_dim - 1 are only evaluated after
the check and you avoid underflow when creating DataQuantizerGPU and
initializing ex_bits.
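The underflow flagged for the constructor comes from evaluating bits_per_dim - 1 on an unsigned value before any validation. A sketch of the validate-then-compute ordering (the factory-style free function and the [1, 9] range follow the review comments; they are illustrative, not the PR's exact code):

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// With unsigned bits_per_dim, bits_per_dim - 1 wraps to a huge value when
// bits_per_dim == 0, so the range check must come first.
std::size_t compute_ex_bits(std::size_t bits_per_dim)
{
  if (bits_per_dim < 1 || bits_per_dim > 9) {
    throw std::invalid_argument("bits_per_dim must be in [1, 9]");
  }
  return bits_per_dim - 1;  // safe: bits_per_dim >= 1
}
```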
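The stream-checked save pattern from the earlier comment can likewise be sketched host-side. Taking std::ostream keeps it testable with string streams; the real IVFGPU::save writes to a std::ofstream and should also verify close(). The helper name is illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>  // for the usage example only
#include <stdexcept>

// Surface write failures instead of silently returning; apply after each
// output.write (and after close()) in the real save path.
void write_checked(std::ostream& output, const void* src, std::size_t nbytes)
{
  output.write(static_cast<const char*>(src), static_cast<std::streamsize>(nbytes));
  if (!output.good()) { throw std::runtime_error("IVFGPU::save: write failed"); }
}
```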

---

Duplicate comments:
In `@cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu`:
- Around line 1113-1138: The two calls to cub::DeviceRadixSort::SortPairs (the
sizing call and the actual sort) currently discard their return values; wrap
both invocations with RAFT_CUDA_TRY so any cudaError_t is checked and
propagated. Locate the two SortPairs calls that pass nullptr for temp storage
and the one that passes d_temp_storage.data(), and wrap each entire call
expression with RAFT_CUDA_TRY(...), keeping the same arguments (including
stream) and leaving temp_storage_bytes, d_cluster_keys, d_sorted_clusters,
d_query_values, d_sorted_queries, and total_pairs unchanged.

---

Nitpick comments:
In `@cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu`:
- Around line 1221-1544: Multiple BatchClusterSearch* methods duplicate the same
preparation steps (matmul, row_norms_fused_kernel, add_norms_kernel,
raft::matrix::select_k, sort_cluster_query_pairs, computeQueryFactors); extract
these into a private helper (e.g., PrepareBatchClusterSearch) that accepts the
common inputs (const float* d_query, size_t batch_size, size_t nprobe,
SearcherGPU* searcher_batch, cudaStream_t stream_ (or use handle_/stream_
captured)) and returns or outputs the computed artifacts needed by the caller:
device handle(s) for sorted pairs (d_sorted_pairs) and the query factor vectors
(d_G_k1xSumq, d_G_kbxSumq) as raft device_vectors or data handles; replace the
duplicated blocks in BatchClusterSearch, BatchClusterSearchLUT16, and
BatchClusterSearchQuantizeQuery with a call to PrepareBatchClusterSearch and
then only perform the final dispatch to SearchClusterQueryPairs /
SearchClusterQueryPairsSharedMemOpt / SearchClusterQueryPairsQuantizeQuery using
the returned handles. Ensure the helper uses the same symbols from the diff
(raft::linalg::detail::matmul, row_norms_fused_kernel, add_norms_kernel,
raft::matrix::select_k, sort_cluster_query_pairs, computeQueryFactors, and
searcher_batch->get_* accessors) and preserves stream_/handle_ and memory
lifetimes so the subsequent SearchClusterQueryPairs* calls can use the returned
device data.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d1763da7-48b9-4a55-b8f2-c8fffab71ef5

📥 Commits

Reviewing files that changed from the base of the PR and between b6d64f2 and 7ac97f6.

📒 Files selected for processing (2)
  • cpp/src/neighbors/ivf_rabitq.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu

Comment thread cpp/src/neighbors/ivf_rabitq.cu
Comment thread cpp/src/neighbors/ivf_rabitq.cu Outdated
Comment thread cpp/src/neighbors/ivf_rabitq.cu
Comment thread cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu
Comment thread cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu
Comment thread cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu
Comment thread cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu
@tfeher
Contributor

tfeher commented May 7, 2026

/ok to test b9e779d


Labels

C++ · feature request (New feature or request) · non-breaking (Introduces a non-breaking change)

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

6 participants