Switch to public RAFT numpy_serialize APIs#2038
Switch to public RAFT numpy_serialize APIs#2038julianmi wants to merge 7 commits intorapidsai:mainfrom
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (6)
📝 WalkthroughSummary by CodeRabbit
WalkthroughMultiple source and header files switch from internal ChangesNumPy serializer API migration
RAFT third‑party pin and fork
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@c/src/neighbors/brute_force.cpp`:
- Around line 242-244: The code reads 4 bytes into dtype_string using is.read
and immediately calls raft::numpy_serializer::parse_descr on those bytes;
however it doesn't check whether the read succeeded. Update the block around
dtype_string/is.read to verify the stream read (e.g., check is.gcount() == 4 or
is.fail()/is.good()) before calling raft::numpy_serializer::parse_descr, and on
failure handle the truncated header by returning an error/throwing an exception
or logging and aborting the parse so parse_descr never receives uninitialized
data.
In `@c/src/neighbors/cagra.cpp`:
- Around line 877-880: The code reads 4 bytes into dtype_string then calls
raft::numpy_serializer::parse_descr without validating the read; first check the
read succeeded (e.g., verify is.read(...) and that is.gcount() == 4 or that the
stream is in a good state) before constructing std::string(dtype_string, 4) and
calling raft::numpy_serializer::parse_descr; if the read fails/returns fewer
than 4 bytes, handle the error (throw, return an error code, or log and abort)
instead of passing potentially truncated data to parse_descr.
In `@c/src/neighbors/ivf_flat.cpp`:
- Around line 303-306: The code reads four bytes into dtype_string then calls
raft::numpy_serializer::parse_descr without verifying the read succeeded;
validate the read length immediately after is.read(dtype_string, 4) (e.g., check
is.gcount() == 4 or test stream state like if (!is || is.gcount() != 4)) and on
short reads/failure throw or return a clear error (or set an error status)
before calling raft::numpy_serializer::parse_descr, so malformed or truncated
files are handled safely.
In `@c/src/neighbors/mg_cagra.cpp`:
- Around line 403-406: The 4-byte dtype header read into dtype_string before
calling raft::numpy_serializer::parse_descr is not validated; check the istream
read result (e.g., is.read(...) and then is.gcount() == 4 or !is.fail()) and
handle short/truncated reads by logging/throwing/returning an error instead of
calling parse_descr with partial data; apply the same validation to the other
occurrence that reads 4 bytes so neither is.read -> parse_descr path can receive
garbage.
In `@c/src/neighbors/mg_ivf_flat.cpp`:
- Around line 400-403: The code reads 4 bytes into dtype_string and calls
raft::numpy_serializer::parse_descr without validating the read; update both the
deserialize and distribute code paths to check is.read(...) and the stream state
(e.g., verify gcount() == 4 or is.good()/is) before constructing
std::string(dtype_string,4) and calling parse_descr, and handle truncated reads
by returning/throwing an error or reporting via existing error path; reference
the dtype_string buffer, the is.read call, and the parse_descr invocation so you
change both occurrences (around the calls near deserialize and distribute).
In `@c/src/neighbors/mg_ivf_pq.cpp`:
- Around line 392-395: The code calls is.read into dtype_string and immediately
hands those bytes to raft::numpy_serializer::parse_descr without checking the
read result; update the logic around dtype_string/is.read to validate that 4
bytes were actually read (e.g., check is.gcount() == 4 and/or
is.good()/is.fail()) before calling parse_descr, and handle the short-read error
path (return an error, throw, or log and exit) so parse_descr is never called
with incomplete data; reference the dtype_string buffer, the is.read call, and
raft::numpy_serializer::parse_descr when making the change and ensure is.close()
still runs in all code paths.
In `@cpp/cmake/thirdparty/get_raft.cmake`:
- Around line 9-10: The CMake variables RAFT_FORK and RAFT_PINNED_TAG are
hard-pinned to a personal fork/branch; change the defaults to use the official
upstream (e.g., "rapidsai/raft") and a safe default tag (empty or a release
tag), and expose RAFT_FORK and RAFT_PINNED_TAG as CMake CACHE variables so they
can only be overridden explicitly via -D on the command line; additionally gate
any non-upstream usage behind an opt-in flag (e.g., USE_CUSTOM_RAFT or
RAFT_USE_FORK) so CI/releases use the official repo by default and personal-fork
overrides are explicitly documented and temporary.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 393379db-9f11-4936-953c-40a2df4cec4c
📒 Files selected for processing (15)
c/src/neighbors/brute_force.cppc/src/neighbors/cagra.cppc/src/neighbors/ivf_flat.cppc/src/neighbors/mg_cagra.cppc/src/neighbors/mg_ivf_flat.cppc/src/neighbors/mg_ivf_pq.cppcpp/cmake/thirdparty/get_raft.cmakecpp/include/cuvs/neighbors/cagra.hppcpp/include/cuvs/util/file_io.hppcpp/src/neighbors/brute_force_serialize.cucpp/src/neighbors/detail/cagra/cagra_build.cuhcpp/src/neighbors/detail/cagra/cagra_serialize.cuhcpp/src/neighbors/detail/hnsw.hppcpp/src/neighbors/ivf_flat/ivf_flat_serialize.cuhcpp/src/neighbors/mg/snmg.cuh
cuVS uses
parse_descr,read_header,get_numpy_dtype,write_headerfromraft::detail::numpy_serializeras discussed here. This PR proposes to expose them publicly in RAFT instead and use them throughout cuVS.There are similar issues with the
raft/core/detail/macros.hppto be address in a separate PR.Depends on rapidsai/raft#3003