Model Training Gets Stuck, GPU Memory Usage High but Utilization at 0% #658

@wjhme

When training an Orion anomaly detection model, the program hangs during the training phase. GPU memory usage is abnormally high while GPU utilization remains at 0%. Details are as follows:

Environment Information
Hardware Configuration:

GPU: 2× NVIDIA A40 (42GB memory per card, 84GB total)

System RAM: 232GB

Software Environment:

TensorFlow version: 2.14.1

Orion version: 0.7.1

CUDA version: 11.8
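
For anyone trying to confirm the same symptom, here is a minimal sketch that polls `nvidia-smi` and flags GPUs whose memory is mostly reserved while utilization is 0%. The helper names (`parse_gpu_stats`, `looks_stalled`, `query_gpus`) and the 50% memory threshold are hypothetical choices for illustration, not part of Orion or TensorFlow. Note that TensorFlow by default reserves most of the available GPU memory at startup, so high memory usage on its own does not necessarily indicate a problem; the idle utilization is the more telling signal.

```python
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "index,memory.used,memory.total,utilization.gpu"


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts.

    Assumes every field is numeric; some setups may report "[N/A]" instead.
    """
    stats = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total, util = [field.strip() for field in line.split(",")]
        stats.append({
            "index": int(index),
            "memory_used_mib": int(used),
            "memory_total_mib": int(total),
            "utilization_pct": int(util),
        })
    return stats


def looks_stalled(gpu, memory_fraction=0.5):
    """Hypothetical heuristic: memory mostly reserved while the GPU is idle."""
    return (gpu["utilization_pct"] == 0
            and gpu["memory_used_mib"] >= memory_fraction * gpu["memory_total_mib"])


def query_gpus():
    """Run nvidia-smi once and return the parsed per-GPU statistics."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_stats(out)


if __name__ == "__main__":
    for gpu in query_gpus():
        flag = "possibly stalled" if looks_stalled(gpu) else "ok"
        print(f"GPU {gpu['index']}: {gpu['memory_used_mib']} MiB used, "
              f"{gpu['utilization_pct']}% utilization ({flag})")
```

Running this a few times while training is stuck shows whether utilization stays at 0% or the job is only slow between batches.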
