@derekhiggins
Contributor

@derekhiggins derekhiggins commented Jul 26, 2024

Fixes #195

Needs the following CLI changes to enable it:


c1c70a5 Add data checkpointing capability

    Introduce a comprehensive data checkpointing mechanism that saves
    intermediate states during data generation. This allows for more
    granular progress tracking and recovery.
    
    Checkpoints are saved periodically based on the save_freq setting,
    preventing data loss and enabling resumption from the last saved state
    in case of interruptions.
    
    The system can resume from a saved checkpoint by comparing the
    generated data with the seed data to identify and process missing
    data.
    
    Each checkpoint is uniquely identified using UUIDs, ensuring distinct
    and traceable save points.
    
    Co-authored-by: shiv <[email protected]>
    Co-authored-by: Derek Higgins <[email protected]>
    Co-authored-by: Mark McLoughlin <[email protected]>
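
The flow described in this commit message (periodic saves driven by `save_freq`, UUID-named checkpoint files, and resumption by comparing generated data against the seed data) could look roughly like the sketch below. This is only an illustration, not the code in this PR: the `Checkpointer` class, its method names, the `data_checkpoint_*.jsonl` file naming, and the `key` argument are all hypothetical, and rows are modelled as plain dicts written to JSON Lines files so that the example needs nothing beyond the standard library.

```python
import json
import uuid
from pathlib import Path


class Checkpointer:
    """Hypothetical sketch: rows are plain dicts, checkpoints are JSONL files."""

    def __init__(self, checkpoint_dir, save_freq=1):
        self.dir = Path(checkpoint_dir)
        self.save_freq = save_freq  # flush to disk every `save_freq` batches
        self._pending = []          # generated rows not yet written to disk
        self._batches = 0

    def checkpoint(self, rows):
        """Buffer one batch of generated rows; flush every save_freq batches."""
        self._pending.extend(rows)
        self._batches += 1
        if self._batches % self.save_freq == 0:
            self.save()

    def save(self):
        """Write the buffered rows to a new, uniquely named checkpoint file."""
        if not self._pending:
            return
        self.dir.mkdir(parents=True, exist_ok=True)
        # A fresh UUID per file keeps every save point distinct and traceable.
        path = self.dir / f"data_checkpoint_{uuid.uuid4().hex}.jsonl"
        with path.open("w", encoding="utf-8") as f:
            for row in self._pending:
                f.write(json.dumps(row) + "\n")
        self._pending = []

    def load(self, seed_rows, key):
        """Return (seed rows still to process, rows already generated)."""
        generated = []
        if self.dir.is_dir():
            for path in sorted(self.dir.glob("data_checkpoint_*.jsonl")):
                with path.open(encoding="utf-8") as f:
                    generated.extend(json.loads(line) for line in f if line.strip())
        # Seed rows whose key already appears in a checkpoint are skipped.
        done = {row[key] for row in generated}
        missing = [row for row in seed_rows if row[key] not in done]
        return missing, generated
```

In this sketch an interrupted run loses at most the batches buffered since the last flush. On restart, `load()` returns only the seed rows that still need processing, along with the results already on disk.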

63b3599 checkpointing: fix "missing data" check with removed columns

    Fix a logic error: if a pipeline removes a column that was present
    in the original dataset, checkpointing causes that column to be
    missing before the pipeline starts.
    
    Add a test case to the unit test to cover this, but note I
    had to add an additional column to ensure there is at least
    one column in common between the original dataset and the
    checkpoint dataset.
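
One way to read this fix: the "missing data" check should match seed rows against checkpoint rows using only the columns the two datasets share, and should hand back the unprocessed seed rows with all of their original columns intact. The sketch below illustrates that idea under stated assumptions. `find_missing_rows` is a hypothetical helper, rows are plain dicts with hashable values, and the actual change in this PR may be structured differently.

```python
def find_missing_rows(seed_rows, checkpoint_rows):
    """Return seed rows that have no match in the checkpoint (hypothetical helper).

    Rows are compared only on the columns present in both datasets, so a
    column that the pipeline removed does not affect the comparison and is
    still present on the rows returned for processing.
    """
    if not seed_rows:
        return []
    if not checkpoint_rows:
        return list(seed_rows)

    # Columns shared by the seed data and the checkpointed output.
    common = sorted(set(seed_rows[0]) & set(checkpoint_rows[0]))
    if not common:
        raise ValueError("seed and checkpoint data share no columns to compare on")

    done = {tuple(row[col] for col in common) for row in checkpoint_rows}
    # Keep the full seed row, including columns the pipeline later drops.
    return [row for row in seed_rows
            if tuple(row[col] for col in common) not in done]
```

The `ValueError` branch mirrors the note in the commit message: the comparison needs at least one column in common between the original dataset and the checkpoint dataset.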

@mergify mergify bot added the ci-failure label Jul 26, 2024
@markmc markmc added this to the 0.2.2 milestone Jul 26, 2024
@mergify mergify bot added the ci-failure label Jul 26, 2024
@mergify mergify bot added ci-failure testing and removed ci-failure labels Jul 26, 2024
@derekhiggins
Contributor Author

derekhiggins commented Jul 26, 2024

instructlab.sdg.pipeline.PipelineBlockError: PipelineBlockError(<class 'instructlab.sdg.llmblock.LLMBlock'>/gen_skill_freeform): Service Unavailable

I think this failure in CI is because we have concurrency turned on against llama_cpp
426bc37

@mergify mergify bot removed the ci-failure label Jul 26, 2024
@markmc markmc marked this pull request as ready for review July 26, 2024 21:48
@markmc markmc changed the title Checkpointing Add data checkpointing capability Jul 26, 2024
Signed-off-by: Derek Higgins <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>
Contributor Author

@derekhiggins derekhiggins left a comment

lgtm

Labels

testing Relates to testing

Development

Successfully merging this pull request may close these issues.

[Epic] Add data checkpointing and recovery

3 participants