[swe] Add truncation fix for miles agentic tool call#785
maocheng23 wants to merge 2 commits into main from
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the agentic tool call system by introducing a configurable mechanism to limit the total token length of agent responses. It integrates this limit across the server and client components, ensuring that responses exceeding the specified token count are either truncated or result in an 'OutputLimitExceeded' status, thereby preventing excessively long outputs and improving resource management.
Code Review
This pull request adds a mechanism to limit the total number of tokens in an agentic tool call by truncating the response. The changes span the server, the agent function, and the rollout generation logic, introducing a new `max_total_response_tokens` parameter. The implementation correctly propagates this parameter and applies truncation when the limit is exceeded. I've found one potential robustness issue in the new truncation logic that could lead to a `TypeError` if certain sample properties are `None`.
```python
sample.rollout_log_probs = sample.rollout_log_probs[:keep_tokens]
sample.loss_mask = sample.loss_mask[:keep_tokens]
```
The `rollout_log_probs` and `loss_mask` attributes of a `Sample` can be `None` according to their type hints in `miles/utils/types.py`. Slicing a `None` value will raise a `TypeError`. To make this utility function more robust, you should check that these attributes are not `None` before attempting to slice them.
Suggested change:

```diff
-sample.rollout_log_probs = sample.rollout_log_probs[:keep_tokens]
-sample.loss_mask = sample.loss_mask[:keep_tokens]
+if sample.rollout_log_probs is not None:
+    sample.rollout_log_probs = sample.rollout_log_probs[:keep_tokens]
+if sample.loss_mask is not None:
+    sample.loss_mask = sample.loss_mask[:keep_tokens]
```
```python
if exc_type in _TIMEOUT_EXCEPTIONS:
    return "LimitsExceeded"
if exc_type in _OUTPUT_LIMIT_EXCEPTIONS:
    return "OutputLimitExceeded"
```
It's a bit odd to have both `LimitsExceeded` and `OutputLimitExceeded`. I'd propose renaming `LimitsExceeded` to `TimeLimitExceeded` and `OutputLimitExceeded` to `SequenceLengthLimitExceeded`.

More generally, I don't think passing literal strings around is best practice. I'm OK with it as-is, though, since everything is under experimental.
```python
metadata = input.sample.metadata
if max_total_response_tokens is not None:
    metadata = {**metadata, "max_total_response_tokens": max_total_response_tokens}
```
Super nit: I think the Python style guide would recommend `metadata |= {"max_total_response_tokens": max_total_response_tokens}`, since this modifies the dictionary `metadata` in place instead of creating a new dictionary instance and assigning it to `metadata`.
Here I would prefer the copy, as it's safer.
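The trade-off behind this exchange can be demonstrated directly: `|=` mutates the dict object shared with the sample, while `{**metadata, ...}` leaves it untouched. A minimal sketch, with a plain dict standing in for `input.sample.metadata`:

```python
# In-place update (the |= suggestion): also mutates the dict the sample holds.
sample_metadata = {"task": "tool_call"}
metadata = sample_metadata
metadata |= {"max_total_response_tokens": 400}
assert "max_total_response_tokens" in sample_metadata  # side effect on the sample

# Copy-then-update (what the diff does): the sample's own dict is unchanged.
sample_metadata = {"task": "tool_call"}
metadata = {**sample_metadata, "max_total_response_tokens": 400}
assert "max_total_response_tokens" not in sample_metadata
```

This is why the copy is the safer choice here: the per-call limit never leaks back into the sample's stored metadata.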
```python
allowed_output = sample.response_length - overshoot
if allowed_output <= 0:
    break
```
Here, if I understand correctly, if the prompt part of a sample already exceeds the length limit, the entire sample is dropped. I think this is quite dangerous: if `max_total_response_tokens` is so small that the prompt of the first sample crosses the limit, the returned list will be empty, which crashes `merge_samples` in `sample_utils.py`.

I'm not sure whether we can do the following: if the prompt in a sample has 500 tokens and the response in the same sample has 200 tokens, with a token limit of 400, then we clip the sample, keeping just the first 400 tokens of the prompt with no response.

Also, I'd recommend renaming `max_total_response_tokens` to `max_total_sample_tokens` or something similar, since we are counting both prompt and response tokens.
There are two parts:
- Here we actually collect all samples from a single trajectory and merge later, so we will still have some data (unless the initial prompt is super long).
- I'll update the name to `max_seq_len`.
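The clipping behavior debated above can be sketched as a standalone budget loop. This is a hypothetical illustration, not the PR's code: samples are modeled as `(prompt_len, response_len)` pairs, and the function name and signature are invented for the sketch:

```python
def clip_samples(samples: list[tuple[int, int]], max_seq_len: int) -> list[tuple[int, int]]:
    """Keep samples until the running token count reaches max_seq_len.

    Unlike a plain `break` on overshoot, the last sample is clipped rather
    than dropped, so the result is non-empty even when the very first
    prompt alone crosses the limit (the failure mode raised in review).
    """
    kept: list[tuple[int, int]] = []
    used = 0
    for prompt_len, response_len in samples:
        total = prompt_len + response_len
        if used + total <= max_seq_len:
            kept.append((prompt_len, response_len))
            used += total
            continue
        budget = max_seq_len - used
        if budget <= 0:
            break
        if prompt_len >= budget:
            # Prompt alone exhausts the budget: keep a clipped prompt, no response.
            kept.append((budget, 0))
        else:
            # Keep the full prompt and clip the response to the remaining budget.
            kept.append((prompt_len, budget - prompt_len))
        break
    return kept
```

For the reviewer's example (a 500-token prompt, 200-token response, limit 400), this keeps a single `(400, 0)` sample instead of returning an empty list.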
As title: add a length limit for agentic tool calls.