Skip to content

Conversation

@jcobis
Copy link
Collaborator

@jcobis jcobis commented Oct 9, 2025

Description

Binary fields containing large embeddings are converted to massive base64 strings in sample values, causing LLM request payloads to exceed size limits.

Solution:
Exclude sample values for Binary fields while preserving type information. The LLM can still map Binary fields appropriately using just the MongoDB type.

Also, to limit request schema size:

  • Changed number of sample values sent from 10 to 5
  • Limited long strings
  • Rounded probabilities

Checklist

  • New tests and/or benchmarks are included
  • Documentation is changed or added
  • If this change updates the UI, screenshots/videos are added and a design review is requested
  • I have signed the MongoDB Contributor License Agreement (https://www.mongodb.com/legal/contributor-agreement)

Types of changes

  • Backport Needed
  • Patch (non-breaking change which fixes an issue)
  • Minor (non-breaking change which adds functionality)
  • Major (fix or feature that would cause existing functionality to change)

@github-actions github-actions bot added the feat label Oct 9, 2025
@jcobis jcobis added the no release notes Fix or feature not for release notes label Oct 9, 2025
@jcobis jcobis changed the title feat(compass-collection): Skip binary for sample values - CLOUDP-350484 feat(compass-collection): Avoid sending sample values for binary fields - CLOUDP-350484 Oct 9, 2025
@jcobis jcobis marked this pull request as ready for review October 9, 2025 20:31
@jcobis jcobis requested a review from a team as a code owner October 9, 2025 20:31
@jcobis jcobis requested review from Copilot and nbbeeken October 9, 2025 20:31
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR optimizes schema-to-field-info transformation to prevent LLM request payload size issues by excluding sample values for Binary fields (which often contain large embeddings) while preserving type information. It also implements several payload size optimizations including reducing sample value counts, truncating long strings, and rounding probabilities.

  • Exclude sample values for Binary fields to avoid massive base64 strings from embeddings
  • Reduce maximum sample values from 10 to 5 and implement string length limits
  • Round probabilities to 2 decimal places for cleaner output

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
transform-schema-to-field-info.ts Implements Binary field sample value exclusion, string truncation, and probability rounding
transform-schema-to-field-info.spec.ts Updates tests to reflect new limits and adds comprehensive test coverage for new features

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines 122 to 128
if (value instanceof Binary) {
return value.toString('base64');
// For Binary data, provide a descriptive placeholder instead of the actual data
// to avoid massive base64 strings that can break LLM requests
// (Defensive check: should never be called, since sample values for binary are skipped)
const sizeInBytes = value.buffer?.length || 0;
return `<Binary data: ${sizeInBytes} bytes>`;
}
Copy link

Copilot AI Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This defensive code in convertBSONToPrimitive is unreachable since Binary sample values are now excluded at line 307. Consider removing this code path or converting it to a warning/error if it's truly defensive.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot is right about this one, no?

Copy link
Contributor

@kraenhansen kraenhansen Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

converting it to a warning/error if it's truly defensive.

It does seem like an invariant to me where throwing an error would be better.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed! Addressing in a fast-follow PR to unblock debugging server call failures

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR: #7456

@codeowners-service-app
Copy link

Assigned kraenhansen for team compass-developers because nbbeeken is out of office.

if (value instanceof ObjectId) {
return value.toString();
}
if (value instanceof Binary) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: Avoid instanceof if you can and rely on _bsontype instead, instanceof is notoriously fragile in JS and will break if e.g. value and Binary come from different instances of the BSON library (like when somebody accidentally breaks package hoisting in the package lock)

Copy link
Contributor

@kraenhansen kraenhansen Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like when somebody accidentally breaks package hoisting in the package lock

I suspect that will cause all sorts of other issues? Could we check for that earlier as a precondition to allow ourselves to use idiomatic JS instead? Perhaps BSON could register something in the global to warn if two instances are loaded into a program ( - in any case, certainly not for this PR to fix 🙂)

If we really want to avoid instanceof on BSON types, perhaps we should have a lint rule checking that?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should just avoid instanceof altogether 🙂

I suspect that will cause all sorts of other issues?

It will work mostly fine, except for the part where the object identities mismatch

Could we check for that earlier as a precondition to allow ourselves to use idiomatic JS instead?

Well, I'd argue that avoiding instanceof is the idiomatic thing to do in JS – in any case, as a developer, you kind of need to be aware of the potential pitfalls when using it.

If we really want to avoid instanceof on BSON types, perhaps we should have a lint rule checking that?

Yeah, I wouldn't mind that. There don't seem to be many uses of instanceof in the codebase that couldn't easily be replaced with a better check.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR: #7456

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should just avoid instanceof altogether 🙂

Sent you a DM - I'd love to discuss that more 🙂

Comment on lines 122 to 128
if (value instanceof Binary) {
return value.toString('base64');
// For Binary data, provide a descriptive placeholder instead of the actual data
// to avoid massive base64 strings that can break LLM requests
// (Defensive check: should never be called, since sample values for binary are skipped)
const sizeInBytes = value.buffer?.length || 0;
return `<Binary data: ${sizeInBytes} bytes>`;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot is right about this one, no?

typeof value === 'string' &&
value.length > MAX_STRING_SAMPLE_VALUE_LENGTH
) {
return value.substring(0, MAX_STRING_SAMPLE_VALUE_LENGTH) + '...';
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering - is it still worth it to use a truncated string?

// to avoid massive base64 strings that can break LLM requests
// (Defensive check: should never be called, since sample values for binary are skipped)
const sizeInBytes = value.buffer?.length || 0;
return `<Binary data: ${sizeInBytes} bytes>`;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is unlikely to happen, but how would this be handled down the line if it ever does return this?

Copy link
Collaborator Author

@jcobis jcobis Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed fallback value in favor of throwing error if ever getting to this state: #7456

Copy link
Contributor

@kraenhansen kraenhansen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestions - nothing blocking a merge IMO.

@jcobis jcobis merged commit 54d7c77 into main Oct 14, 2025
61 checks passed
@jcobis jcobis deleted the CLOUDP-350484 branch October 14, 2025 17:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feat no release notes Fix or feature not for release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants