Add data checkpointing capability #222

derekhiggins · 2024-07-26T12:03:58Z

Fixes #195

Needs the following CLI changes to enable it:

--batch-size -- Enable SDG batching with vLLM instructlab#1797
--checkpoint-dir -- Add the checkpoint-dir arg instructlab#1889

c1c70a5 Add data checkpointing capability

    Introduce a comprehensive data checkpointing mechanism that saves
    intermediate states during data generation. This allows for more
    granular progress tracking and recovery.
    
    Checkpoints are saved periodically based on the save_freq setting,
    preventing data loss and enabling resumption from the last saved state
    in case of interruptions.
    
    The system can resume from a saved checkpoint by comparing the
    generated data with the seed data to identify and process missing
    data.
    
    Each checkpoint is uniquely identified using UUIDs, ensuring distinct
    and traceable save points.
    
    Co-authored-by: shiv <[email protected]>
    Co-authored-by: Derek Higgins <[email protected]>
    Co-authored-by: Mark McLoughlin <[email protected]>
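
The mechanism described in this commit message can be illustrated with a minimal sketch. It is not the code in src/instructlab/sdg/checkpointing.py: the class name, method names, file naming scheme, and JSONL format below are all assumptions made for illustration. Only save_freq, UUID-named checkpoints, and resuming by comparing against the seed data come from the description above.

```python
import json
import uuid
from pathlib import Path


class Checkpointer:
    """Toy checkpointer. All names here are hypothetical, not the instructlab.sdg API."""

    def __init__(self, checkpoint_dir, save_freq=1):
        self.checkpoint_dir = Path(checkpoint_dir)
        self.save_freq = save_freq
        self._pending = []        # rows generated since the last save
        self._batches_seen = 0

    def checkpoint(self, rows):
        """Buffer one batch of generated rows and flush every save_freq batches."""
        self._pending.extend(rows)
        self._batches_seen += 1
        if self._batches_seen % self.save_freq == 0:
            self.save()

    def save(self):
        """Write buffered rows to a new file whose name contains a fresh UUID."""
        if not self._pending:
            return
        self.checkpoint_dir.mkdir(parents=True, exist_ok=True)
        path = self.checkpoint_dir / f"data_checkpoint_{uuid.uuid4().hex}.jsonl"
        with path.open("w", encoding="utf-8") as f:
            for row in self._pending:
                f.write(json.dumps(row) + "\n")
        self._pending = []

    def load(self, seed_rows, key):
        """Return (seed rows still to process, rows already generated)."""
        done = []
        if self.checkpoint_dir.is_dir():
            for path in sorted(self.checkpoint_dir.glob("data_checkpoint_*.jsonl")):
                with path.open(encoding="utf-8") as f:
                    done.extend(json.loads(line) for line in f if line.strip())
        seen = {row[key] for row in done}
        remaining = [row for row in seed_rows if row[key] not in seen]
        return remaining, done
```

In this sketch a run that is interrupted loses at most the batches buffered since the last save, and a restarted run calls load() first so that only the seed rows without generated output are sent through the pipeline again.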

63b3599 checkpointing: fix "missing data" check with removed columns

    Fix a logic error: if a pipeline removes a column that was
    present in the original dataset, then checkpointing causes
    that column not to be present before the pipeline starts.
    
    Add a test case to the unit test to cover this, but note I
    had to add an additional column to ensure there is at least
    one column in common between the original dataset and the
    checkpoint dataset.
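
One way to picture the fixed check is the sketch below. It is an illustration under stated assumptions, not the code from this commit: find_missing is a hypothetical helper, rows are plain dicts rather than dataset objects, and column values are assumed to be hashable. Rows are matched only on the columns that the seed data and the checkpoint share, and the rows returned come from the seed data, so they still carry any column the pipeline later removes.

```python
def find_missing(seed_rows, checkpoint_rows):
    """Return seed rows with no match in the checkpoint, compared on shared columns.

    Hypothetical helper for illustration only.
    """
    if not seed_rows:
        return []
    if not checkpoint_rows:
        return list(seed_rows)

    # Compare only on columns present in both datasets.
    common = sorted(set(seed_rows[0]) & set(checkpoint_rows[0]))
    if not common:
        raise ValueError("seed and checkpoint data have no columns in common")

    done = {tuple(row[c] for c in common) for row in checkpoint_rows}
    # Returned rows are the original seed rows, with all their columns intact.
    return [row for row in seed_rows if tuple(row[c] for c in common) not in done]


seed = [
    {"instruction": "a", "document": "d1"},
    {"instruction": "b", "document": "d2"},
]
# The pipeline dropped "document" and added "output".
checkpoint = [{"instruction": "a", "output": "o1"}]
missing = find_missing(seed, checkpoint)
```

Here missing holds the second seed row with its "document" column still present. The ValueError branch mirrors the note above about needing at least one column in common between the original dataset and the checkpoint dataset.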

src/instructlab/sdg/pipeline.py

src/instructlab/sdg/checkpointing.py

src/instructlab/sdg/generate_data.py

derekhiggins · 2024-07-26T21:10:17Z

instructlab.sdg.pipeline.PipelineBlockError: PipelineBlockError(<class 'instructlab.sdg.llmblock.LLMBlock'>/gen_skill_freeform): Service Unavailable

I think this happens in CI because we have concurrency turned on against llama_cpp
426bc37

Introduce a comprehensive data checkpointing mechanism that saves intermediate states during data generation. This allows for more granular progress tracking and recovery.

Checkpoints are saved periodically based on the save_freq setting, preventing data loss and enabling resumption from the last saved state in case of interruptions.

The system can resume from a saved checkpoint by comparing the generated data with the seed data to identify and process missing data.

Each checkpoint is uniquely identified using UUIDs, ensuring distinct and traceable save points.

Co-authored-by: shiv <[email protected]>
Co-authored-by: Derek Higgins <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Signed-off-by: Derek Higgins <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>

Fix a logic error: if a pipeline removes a column that was present in the original dataset, then checkpointing causes that column not to be present before the pipeline starts.

Add a test case to the unit test to cover this, but note I had to add an additional column to ensure there is at least one column in common between the original dataset and the checkpoint dataset.

Signed-off-by: Mark McLoughlin <[email protected]>

derekhiggins

lgtm

mergify bot added the ci-failure label Jul 26, 2024

markmc added this to the 0.2.2 milestone Jul 26, 2024

derekhiggins commented Jul 26, 2024

src/instructlab/sdg/pipeline.py (outdated, resolved)

derekhiggins force-pushed the checkpointing branch from 7b233b8 to 33e9a60 on July 26, 2024 12:58

mergify bot added ci-failure and removed ci-failure labels Jul 26, 2024