[FEA]: Ability to list the kernel execution queue #696

@Ind1x1

Is this a duplicate?

Area

Not sure

Is your feature request related to a problem? Please describe.

We have encountered cases where certain GPUs in our cluster degrade significantly in performance, but the problem cannot be reproduced by simply restarting the tasks. This makes it difficult to pinpoint which GPUs are faulty. A way to query and list, on demand, the GPU kernels that the CPU has already submitted but that are still queued for execution would greatly reduce the time needed for fault diagnosis.
If no such function exists, would you be willing to provide one, either by helping us with a PR or by implementing it yourselves?

Describe the solution you'd like

Similar to how nvidia-smi provides real-time GPU status, we are looking for a lightweight tool or API that can query and list a GPU's current execution queue, including kernels that are still pending.

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Assignees: No one assigned
Labels: triage (Needs the team's attention)
Type: No type
Project status: Done
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests