Skip to content

CPU OOM issue #263

@sfc-gh-jrasley

Description

@sfc-gh-jrasley

Environment setup:

NGC torch 25.03 base image
latest arctic training (default deps)
llama 70b
zero 3 offload
single 8xH200 node

Observation, during the zero 3 offload init the cpu memory explodes up to 2TB and runs out of memory. Not happening on AT 591ddf5.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions