Description
Is this a duplicate?
- I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct
Area
Not sure
Is your feature request related to a problem? Please describe.
We have seen certain GPUs in our cluster degrade significantly in performance, but the problem does not reproduce when we simply restart the tasks, which makes it hard to pinpoint the faulty GPUs. A way to query and list the GPU kernels that the CPU has already submitted but that are still queued for execution would greatly shorten fault diagnosis.
If no such function is available, would you be happy to provide such, either by assisting us in a PR or doing it on your own?
Describe the solution you'd like
Similar to how nvidia-smi reports real-time GPU status, we are looking for a lightweight tool or API that can query and list a GPU's current execution queue, including pending kernel tasks.
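To make the request more concrete, here is a rough Python sketch of the kind of output we have in mind. Everything in it is hypothetical: `PendingKernel`, its fields, and `format_queue` are names invented for illustration and do not correspond to any existing CUDA, NVML, or nvidia-smi interface, and the sample values are placeholders rather than real measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PendingKernel:
    """Hypothetical record for one kernel that is submitted but not yet running."""
    gpu_index: int
    pid: int
    stream_id: int
    kernel_name: str
    queued_ms: float  # time spent waiting in the queue so far


def format_queue(entries: List[PendingKernel]) -> str:
    """Render pending kernels as a plain-text table, longest wait first."""
    header = f"{'GPU':>3}  {'PID':>7}  {'STREAM':>6}  {'QUEUED(ms)':>10}  KERNEL"
    rows = [
        f"{e.gpu_index:>3}  {e.pid:>7}  {e.stream_id:>6}  {e.queued_ms:>10.1f}  {e.kernel_name}"
        for e in sorted(entries, key=lambda e: e.queued_ms, reverse=True)
    ]
    return "\n".join([header, *rows])


if __name__ == "__main__":
    # Placeholder values for illustration only, not real measurements.
    sample = [
        PendingKernel(0, 4242, 7, "example_kernel_a", 12.5),
        PendingKernel(0, 4242, 9, "example_kernel_b", 830.0),
    ]
    print(format_queue(sample))
```

Sorting by queue time is just one possible choice; the point is that a per-GPU list of pending kernels, with their owning process and how long they have waited, would let us spot a GPU whose queue is not draining.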
Describe alternatives you've considered
No response
Additional context
No response