Add data checkpointing capability #222
Merged
derekhiggins
commented
Jul 26, 2024
7b233b8 to
33e9a60
Compare
markmc
reviewed
Jul 26, 2024
33e9a60 to
c1c1618
Compare
c1c1618 to
e8d90e2
Compare
markmc
reviewed
Jul 26, 2024
markmc
reviewed
Jul 26, 2024
markmc
reviewed
Jul 26, 2024
markmc
reviewed
Jul 26, 2024
e8d90e2 to
2514000
Compare
Contributor
Author
this in CI is, I think, because we have concurrency turned on against llama_cpp
Introduce a comprehensive data checkpointing mechanism that saves intermediate states during data generation. This allows for more granular progress tracking and recovery.

Checkpoints are saved periodically based on the save_freq setting, preventing data loss and enabling resumption from the last saved state in case of interruptions. The system can resume from a saved checkpoint by comparing the generated data with the seed data to identify and process missing data.

Each checkpoint is uniquely identified using UUIDs, ensuring distinct and traceable save points.

Co-authored-by: shiv <[email protected]>
Co-authored-by: Derek Higgins <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Signed-off-by: Derek Higgins <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>
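The save-and-resume behaviour described in this commit message can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the code merged in this PR: the class name `SketchCheckpointer`, the JSONL file layout, the `data_checkpoint_` file prefix, and the `key` argument used to match rows are all assumptions made for the example. Only the ideas of a `save_freq` setting, UUID-named checkpoints, and comparing generated data against seed data come from the commit message.

```python
import json
import uuid
from pathlib import Path


class SketchCheckpointer:
    """Hypothetical stdlib-only checkpointer; not the project's actual class."""

    def __init__(self, checkpoint_dir, save_freq=1):
        self.checkpoint_dir = Path(checkpoint_dir)
        self.save_freq = save_freq
        self._buffer = []  # rows generated since the last save
        self._calls = 0    # number of checkpoint() calls so far

    def checkpoint(self, rows):
        """Buffer newly generated rows and flush every save_freq calls."""
        self._buffer.extend(rows)
        self._calls += 1
        if self._calls % self.save_freq == 0:
            self.save()

    def save(self):
        """Write buffered rows to a uniquely named JSONL file."""
        if not self._buffer:
            return
        self.checkpoint_dir.mkdir(parents=True, exist_ok=True)
        # A fresh UUID per file keeps every save point distinct (assumed naming).
        path = self.checkpoint_dir / f"data_checkpoint_{uuid.uuid4().hex}.jsonl"
        with path.open("w", encoding="utf-8") as f:
            for row in self._buffer:
                f.write(json.dumps(row) + "\n")
        self._buffer = []

    def load(self, seed_rows, key):
        """Return (seed rows still to process, rows already generated)."""
        done = []
        if self.checkpoint_dir.is_dir():
            for path in sorted(self.checkpoint_dir.glob("*.jsonl")):
                with path.open(encoding="utf-8") as f:
                    done.extend(json.loads(line) for line in f if line.strip())
        seen = {row[key] for row in done}
        remaining = [row for row in seed_rows if row[key] not in seen]
        return remaining, done
```

After an interruption, a new `SketchCheckpointer` pointed at the same directory can call `load(seed_rows, key)` to get back only the seed rows that still need processing, plus the rows recovered from disk.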
Fix the logic error whereby, if a pipeline removes a column that was present in the original dataset, checkpointing causes that column to be missing before the pipeline starts.

Add a test case to the unit tests to cover this; note that I had to add an additional column to ensure there is at least one column in common between the original dataset and the checkpoint dataset.

Signed-off-by: Mark McLoughlin <[email protected]>
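One way to picture the "missing data" check described in this commit is the sketch below. It is an illustration under assumptions, not the merged fix: the helper name `split_missing`, the list-of-dicts row representation, and the choice to raise `ValueError` when no columns overlap are all hypothetical. Rows are matched only on columns present in both datasets, and the unprocessed rows are returned in their original form, so a column the pipeline later removes is still present when the pipeline starts.

```python
def split_missing(seed_rows, checkpoint_rows):
    """Hypothetical helper: return seed rows with no match in the checkpoint."""
    if not checkpoint_rows:
        return list(seed_rows)
    if not seed_rows:
        return []
    # Compare only on columns that exist in both datasets, since the pipeline
    # may have removed some seed columns and added new ones.
    common = sorted(set(seed_rows[0]) & set(checkpoint_rows[0]))
    if not common:
        raise ValueError("no columns in common between seed and checkpoint data")
    # Assumes the values in the common columns are hashable (e.g. str, int).
    seen = {tuple(row[c] for c in common) for row in checkpoint_rows}
    # Return the full original rows, so columns the pipeline will drop
    # are still available as pipeline input.
    return [row for row in seed_rows if tuple(row[c] for c in common) not in seen]
```

The explicit error for the no-overlap case mirrors the note above that the test needed at least one column in common between the original dataset and the checkpoint dataset.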
derekhiggins
commented
Jul 26, 2024
Contributor
Author
derekhiggins
left a comment
lgtm
markmc
approved these changes
Jul 26, 2024
Fixes #195
Needs the following CLI changes to enable it:
- --batch-size -- Enable SDG batching with vLLM instructlab#1797
- --checkpoint-dir -- Add the checkpoint-dir arg instructlab#1889

c1c70a5 Add data checkpointing capability
63b3599 checkpointing: fix "missing data" check with removed columns