When training an Orion anomaly detection model, the program becomes stuck during the training phase. GPU memory usage is abnormally high while GPU utilization remains at 0%. Specific symptoms are as follows:
Environment Information
Hardware Configuration:
GPU: 2× NVIDIA A40 (42GB memory per card, 84GB total)
System RAM: 232GB
Software Environment:
TensorFlow version: 2.14.1
Orion version: 0.7.1
CUDA version: 11.8
