
Add multi MIG GPU support in release-v1.3 as fix for #1586 #1589

Draft
Tishwings wants to merge 1 commit into nanoporetech:release-v1.3 from Tishwings:issue-1586-mig-fix-v1.3

@Tishwings
fix for #1586 in release-v1.3

Summary of changes

This PR improves support for NVIDIA MIG GPU devices by taking the device count as the maximum of the counts reported by NVML and by the CUDA runtime (via torch). This ensures MIG instances are recognized correctly even when CUDA exposes more devices than NVML reports.

Key points:

- The device count is now the maximum of the NVML count and the torch count (torch::cuda::device_count()), allowing proper enumeration of MIG instances.
- Device mapping and validation now reference this unified count.
- When CUDA_VISIBLE_DEVICES specifies too many devices, a warning is logged instead of raising an error, for more robust operation.
- Deprecated NVML API warnings (CUDA 13+) are suppressed to reduce CI/CD build noise.
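
The counting and validation logic described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `unified_device_count`, `split_visible_devices` and `check_visible_devices` are hypothetical names, the two counts are passed in as plain numbers rather than queried from NVML and torch, and warnings are collected in a vector instead of going through the project's logger.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: in a real build the two inputs would come from NVML's
// device count query and from torch::cuda::device_count().
inline std::size_t unified_device_count(std::size_t nvml_count, std::size_t cuda_count) {
    return std::max(nvml_count, cuda_count);
}

// Splits a CUDA_VISIBLE_DEVICES-style value on commas. Entries are kept as
// strings because they may be indices, GPU UUIDs or MIG identifiers.
inline std::vector<std::string> split_visible_devices(const std::string& value) {
    std::vector<std::string> entries;
    std::stringstream stream(value);
    std::string item;
    while (std::getline(stream, item, ',')) {
        if (!item.empty()) {
            entries.push_back(item);
        }
    }
    return entries;
}

// Returns true when the number of listed entries fits within device_count.
// Otherwise a warning is recorded (assumption: warn, do not throw) and
// false is returned so the caller can decide how to proceed.
inline bool check_visible_devices(const std::string& value,
                                  std::size_t device_count,
                                  std::vector<std::string>& warnings) {
    const auto entries = split_visible_devices(value);
    if (entries.size() > device_count) {
        warnings.push_back("CUDA_VISIBLE_DEVICES lists " + std::to_string(entries.size()) +
                           " entries but only " + std::to_string(device_count) +
                           " devices were detected");
        return false;
    }
    return true;
}
```

For instance, with NVML reporting 1 device and the CUDA runtime reporting 7 MIG instances, `unified_device_count(1, 7)` yields 7, so a CUDA_VISIBLE_DEVICES value with up to seven entries would pass the check without a warning.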

Why warning suppression is needed

Suppressing NVML deprecation warnings is necessary because builds against newer CUDA versions (13+) emit many warnings for NVML APIs that are deprecated but still required. Suppressing them keeps our build logs clean and avoids unnecessary alarm over warnings that cannot currently be avoided.
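
One common way to silence deprecation warnings around a specific call site is with compiler diagnostic pragmas. The sketch below shows that pattern only; it is an assumption that this PR does something similar, and `legacy_device_count` is a hypothetical stand-in for a deprecated NVML entry point, not a real NVML function.

```cpp
// Hypothetical stand-in for an NVML function marked deprecated in newer headers.
[[deprecated("use the newer API")]] inline int legacy_device_count() {
    return 1;
}

// Disable the deprecation diagnostic only for the code that must call the
// deprecated function, then restore the previous diagnostic state.
#if defined(__GNUC__) || defined(__clang__)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
#elif defined(_MSC_VER)
#pragma warning(push)
#pragma warning(disable : 4996)
#endif

inline int query_device_count() {
    // Without the pragmas above, this call would trigger a deprecation warning.
    return legacy_device_count();
}

#if defined(__GNUC__) || defined(__clang__)
#pragma GCC diagnostic pop
#elif defined(_MSC_VER)
#pragma warning(pop)
#endif
```

Scoping the suppression with push/pop keeps deprecation warnings active for the rest of the translation unit, so unrelated deprecated calls are still reported.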

Next steps (out of scope for this PR)

Proper validation should include tests on hardware other than ours, covering different MIG and non-MIG GPUs. Migrating to newer NVIDIA device-management APIs is also recommended once they are available. For now, these changes safely extend current support for our setup without risking regressions elsewhere.
