feat(compass-collection): Avoid sending sample values for binary fields - CLOUDP-350484 #7439
Pull Request Overview
This PR optimizes schema-to-field-info transformation to prevent LLM request payload size issues by excluding sample values for Binary fields (which often contain large embeddings) while preserving type information. It also implements several payload size optimizations including reducing sample value counts, truncating long strings, and rounding probabilities.
- Exclude sample values for Binary fields to avoid massive base64 strings from embeddings
- Reduce maximum sample values from 10 to 5 and implement string length limits
- Round probabilities to 2 decimal places for cleaner output
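The optimizations listed above can be sketched roughly as follows. This is a minimal illustration, not the code in transform-schema-to-field-info.ts: the `SchemaType` and `FieldInfo` shapes, the `toFieldInfo` and `truncateIfString` helpers, and the value 100 for `MAX_STRING_SAMPLE_VALUE_LENGTH` are assumptions made for the example. Only the sample cap of 5, the two-decimal rounding, and the exclusion of Binary sample values come from the PR description.

```typescript
// Hypothetical, simplified shapes; the real schema types in Compass differ.
interface SchemaType {
  bsonType: string; // e.g. 'String', 'Binary', 'Number'
  probability: number;
  values?: unknown[];
}

interface FieldInfo {
  type: string;
  probability: number;
  sampleValues?: unknown[];
}

const MAX_SAMPLE_VALUES = 5; // reduced from 10, per the PR description
const MAX_STRING_SAMPLE_VALUE_LENGTH = 100; // assumed value, not stated in the PR

// Truncate long strings and mark the cut with an ellipsis.
function truncateIfString(value: unknown): unknown {
  if (
    typeof value === 'string' &&
    value.length > MAX_STRING_SAMPLE_VALUE_LENGTH
  ) {
    return value.substring(0, MAX_STRING_SAMPLE_VALUE_LENGTH) + '...';
  }
  return value;
}

function toFieldInfo(type: SchemaType): FieldInfo {
  const info: FieldInfo = {
    type: type.bsonType,
    // Round to 2 decimal places.
    probability: Math.round(type.probability * 100) / 100,
  };
  // Binary fields keep their type information but get no sample values.
  if (type.bsonType !== 'Binary' && type.values) {
    info.sampleValues = type.values
      .slice(0, MAX_SAMPLE_VALUES)
      .map(truncateIfString);
  }
  return info;
}
```

With this shape, a Binary field still reports `type: 'Binary'`, so a consumer can map the field by type alone without receiving any base64 payload.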
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| transform-schema-to-field-info.ts | Implements Binary field sample value exclusion, string truncation, and probability rounding |
| transform-schema-to-field-info.spec.ts | Updates tests to reflect new limits and adds comprehensive test coverage for new features |
      if (value instanceof Binary) {
    -   return value.toString('base64');
    +   // For Binary data, provide a descriptive placeholder instead of the actual data
    +   // to avoid massive base64 strings that can break LLM requests
    +   // (Defensive check: should never be called, since sample values for binary are skipped)
    +   const sizeInBytes = value.buffer?.length || 0;
    +   return `<Binary data: ${sizeInBytes} bytes>`;
      }
Copilot
AI
Oct 9, 2025
This defensive code in convertBSONToPrimitive is unreachable since Binary sample values are now excluded at line 307. Consider removing this code path or converting it to a warning/error if it's truly defensive.
Copilot is right about this one, no?
> converting it to a warning/error if it's truly defensive.

It does seem like an invariant to me where throwing an error would be better.
Agreed! Addressing in a fast-follow PR to unblock debugging server call failures
PR: #7456
    if (value instanceof ObjectId) {
      return value.toString();
    }
    if (value instanceof Binary) {
Side note: avoid instanceof if you can and rely on _bsontype instead. instanceof is notoriously fragile in JS and will break if, e.g., value and Binary come from different instances of the BSON library (like when somebody accidentally breaks package hoisting in the package lock).
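To illustrate the failure mode described here, the sketch below simulates two separately loaded copies of a BSON library with a hypothetical `loadBsonCopy` factory; the stand-in `Binary` class and the `isBinary` helper are assumptions for the example, not the real library's API. The only behavior it relies on is that BSON values carry a `_bsontype` tag.

```typescript
// Stand-in for loading a BSON library: every call defines a fresh Binary
// class, just as two installed copies of a package would.
function loadBsonCopy() {
  class Binary {
    readonly _bsontype = 'Binary';
    constructor(public buffer: Uint8Array) {}
  }
  return { Binary };
}

const bsonA = loadBsonCopy();
const bsonB = loadBsonCopy();

// Hypothetical helper: identify Binary values by tag, not by constructor.
function isBinary(value: unknown): boolean {
  return (
    typeof value === 'object' &&
    value !== null &&
    (value as { _bsontype?: unknown })._bsontype === 'Binary'
  );
}

const value = new bsonA.Binary(new Uint8Array([1, 2, 3]));

console.log(value instanceof bsonA.Binary); // true: same copy
console.log(value instanceof bsonB.Binary); // false: different constructor identity
console.log(isBinary(value)); // true: the tag check ignores which copy built it
```

The tag check trades nominal identity for a structural test, so it keeps working when the value and the class come from different copies of the library.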
> like when somebody accidentally breaks package hoisting in the package lock

I suspect that will cause all sorts of other issues? Could we check for that earlier as a precondition to allow ourselves to use idiomatic JS instead? Perhaps BSON could register something in the global to warn if two instances are loaded into a program (in any case, certainly not for this PR to fix 🙂).
If we really want to avoid instanceof on BSON types, perhaps we should have a lint rule checking that?
We should just avoid instanceof altogether 🙂
> I suspect that will cause all sorts of other issues?

It will work mostly fine, except for the part where the object identities mismatch

> Could we check for that earlier as a precondition to allow ourselves to use idiomatic JS instead?

Well, I'd argue that avoiding instanceof is the idiomatic thing to do in JS; in any case, as a developer, you kind of need to be aware of the potential pitfalls when using it.

> If we really want to avoid instanceof on BSON types, perhaps we should have a lint rule checking that?

Yeah, I wouldn't mind that. There don't seem to be many uses of instanceof in the codebase that couldn't easily be replaced with a better check.
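One way such a lint rule could look is sketched below, using ESLint's built-in `no-restricted-syntax` rule with an AST selector for `instanceof` expressions. This is an assumption about how the idea might be implemented, not the repository's actual configuration; the `noInstanceofConfig` name and the message text are made up for the example.

```typescript
// Hypothetical flat-config entry; it would be included in the array that an
// ESLint config file exports.
const noInstanceofConfig = {
  rules: {
    'no-restricted-syntax': [
      'error',
      {
        // Matches every `a instanceof B` expression.
        selector: "BinaryExpression[operator='instanceof']",
        message: 'Avoid instanceof; check _bsontype or another tag instead.',
      },
    ],
  },
};

const config = [noInstanceofConfig];
```

A rule like this flags every use of instanceof, so a narrower selector or per-file overrides would be needed if only BSON types are meant to be covered.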
PR: #7456
> We should just avoid instanceof altogether 🙂

Sent you a DM - I'd love to discuss that more 🙂
      typeof value === 'string' &&
      value.length > MAX_STRING_SAMPLE_VALUE_LENGTH
    ) {
      return value.substring(0, MAX_STRING_SAMPLE_VALUE_LENGTH) + '...';
Just wondering - is it still worth it to use a truncated string?
    // to avoid massive base64 strings that can break LLM requests
    // (Defensive check: should never be called, since sample values for binary are skipped)
    const sizeInBytes = value.buffer?.length || 0;
    return `<Binary data: ${sizeInBytes} bytes>`;
I know this is unlikely to happen, but how would this be handled down the line if it ever does return this?
Removed fallback value in favor of throwing error if ever getting to this state: #7456
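A minimal sketch of what throwing on this invariant could look like is shown below. It is not the code from #7456: the `BinaryLike` stand-in, the `isBinaryLike` guard, the error message, and the trimmed-down body of `convertBSONToPrimitive` are all assumptions made to keep the example self-contained.

```typescript
// Stand-in for the BSON Binary type, identified by its _bsontype tag.
interface BinaryLike {
  _bsontype: 'Binary';
}

function isBinaryLike(value: unknown): value is BinaryLike {
  return (
    typeof value === 'object' &&
    value !== null &&
    (value as { _bsontype?: unknown })._bsontype === 'Binary'
  );
}

// Trimmed-down converter: only the Binary branch is shown.
function convertBSONToPrimitive(value: unknown): unknown {
  if (isBinaryLike(value)) {
    // Invariant: callers skip sample values for Binary fields, so reaching
    // this branch means something upstream is broken. Fail loudly instead
    // of returning a placeholder string.
    throw new Error('Binary sample values must be excluded before conversion');
  }
  return value;
}
```

Throwing here means a violated invariant surfaces as an error at the point of failure rather than as a placeholder string that downstream code would have to recognize.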
Minor suggestions - nothing blocking a merge IMO.
Description
Binary fields containing large embeddings are converted to massive base64 strings in sample values, causing LLM request payloads to exceed size limits.
Solution:
Exclude sample values for Binary fields while preserving type information. The LLM can still map Binary fields appropriately using just the MongoDB type.
Also, to limit request schema size: