@mresvanis
This PR adds Fabric Manager (FM) Shared NVSwitch virtualization model support when NVSwitch devices are detected and the newly introduced FABRIC_MANAGER_FABRIC_MODE env var is set to 1 (shared-nvswitch).

No changes are introduced when FABRIC_MANAGER_FABRIC_MODE=0 (the default FM mode, full-passthrough).

Changes

  • Add the FABRIC_MANAGER_FABRIC_MODE env var to control the fabric manager FABRIC_MODE: 0 (the default) for full-passthrough, 1 for shared-nvswitch.
  • Blacklist GPU devices from the NVIDIA driver when FABRIC_MODE is set to shared-nvswitch.
    • Bind them to vfio-pci instead.
  • Create a JSON file that maps GPU physical module IDs to PCIe addresses.
  • Do not run nvidia-persistenced, since GPU devices are blacklisted from the NVIDIA driver.
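To make the mode switch concrete, here is a minimal Python sketch of how the env var could be resolved. It is not the PR's code: the helper name `resolve_fabric_mode` and the `FABRIC_MODES` table are hypothetical, and rejecting values other than 0 and 1 is an assumption, since the description does not say how other values are handled.

```python
import os

# Mode names as given in the PR description; the table itself is illustrative.
FABRIC_MODES = {"0": "full-passthrough", "1": "shared-nvswitch"}


def resolve_fabric_mode(environ=None):
    """Return (value, name) for FABRIC_MANAGER_FABRIC_MODE, defaulting to 0."""
    environ = os.environ if environ is None else environ
    # An unset or empty variable falls back to the default mode, 0.
    raw = environ.get("FABRIC_MANAGER_FABRIC_MODE", "0").strip() or "0"
    if raw not in FABRIC_MODES:
        # Assumption: anything other than 0 or 1 is treated as an error.
        raise ValueError(f"unsupported FABRIC_MANAGER_FABRIC_MODE: {raw!r}")
    return int(raw), FABRIC_MODES[raw]
```

With the variable unset, the sketch returns the full-passthrough default, which matches the "no changes" behavior described above.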

Flow when FABRIC_MANAGER_FABRIC_MODE=1 (shared-nvswitch)

  1. Updates fabric manager config to use shared-nvswitch mode.
  2. Configures UNIX socket communication instead of TCP.
  3. Captures and persists the GPU physical module ID to PCIe address mapping via nvidia-smi.
  4. Blacklists GPU devices from the NVIDIA driver using driver_override.
  5. Binds GPU devices to vfio-pci for passthrough scenarios.
  6. Skips nvidia-persistenced startup since GPU devices are no longer managed by the NVIDIA driver.
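Steps 4 and 5 rely on the kernel's PCI sysfs interface. The following Python sketch shows one common way to do this with `driver_override`, unbind, and `drivers_probe`; it is an illustration under stated assumptions, not the script from this PR. The function name `bind_to_vfio_pci` and the `sysfs_root` parameter are hypothetical (the latter exists only so the logic can be exercised against a scratch directory). On a real host this needs root privileges and the vfio-pci module loaded.

```python
from pathlib import Path


def bind_to_vfio_pci(pci_address, sysfs_root="/sys"):
    """Rebind one PCI device to vfio-pci through the sysfs PCI interface."""
    bus = Path(sysfs_root) / "bus" / "pci"
    device = bus / "devices" / pci_address

    # 1. Pin the device to vfio-pci so that no other driver
    #    (here: the NVIDIA driver) can claim it on the next probe.
    (device / "driver_override").write_text("vfio-pci\n")

    # 2. Detach the device from its current driver, if it has one.
    unbind = device / "driver" / "unbind"
    if unbind.exists():
        unbind.write_text(pci_address)

    # 3. Ask the kernel to re-probe the device; because of the override,
    #    only vfio-pci is allowed to bind to it.
    (bus / "drivers_probe").write_text(pci_address)
```

Setting the override before unbinding is a deliberate ordering choice in this sketch: it closes the window in which the original driver could reclaim the device.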

@copy-pr-bot
copy-pr-bot bot commented Jan 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

@mresvanis force-pushed the fabric-manager-configuration branch 2 times, most recently from 7182624 to d59cb29, on January 15, 2026 18:58
The changes include:

- add the `FABRIC_MANAGER_FABRIC_MODE` env var that configures FM with
  either full-passthrough (0) or shared-nvswitch (1) mode. It defaults
  to 0.
- when the fabric manager mode is set to 0, the flow is unchanged, i.e.
  the fabric manager daemon runs with its default configuration.
- when fabric manager mode is set to 1:
  - edit the fabric manager configuration file and set `FABRIC_MODE=1`.
  - persist the mapping of physical GPU module IDs to their PCIe
    addresses by creating a JSON file on disk (the physical GPU module
    IDs are available through nvidia-smi).
  - blacklist GPU devices from the NVIDIA driver.
  - disable `nvidia-persistenced`, as the GPU devices are now
    blacklisted from the NVIDIA driver and bound to VFIO.

Signed-off-by: Michail Resvanis <[email protected]>
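The configuration edit and the mapping file described in the commit message can be sketched in Python as below. This is a hedged illustration, not the code in this commit: `set_config_value` and `module_map_json` are hypothetical helpers, the JSON layout (module ID as key, PCIe address as value) is an assumption because the PR does not show the file's schema, and the pairs are assumed to have been extracted from nvidia-smi output beforehand.

```python
import json
import re


def set_config_value(config_text, key, value):
    """Set KEY=value in KEY=VALUE style config text, appending it if absent."""
    pattern = re.compile(rf"^{re.escape(key)}=.*$", flags=re.MULTILINE)
    line = f"{key}={value}"
    if pattern.search(config_text):
        # A callable replacement avoids backslash escapes in `line`
        # being interpreted by the regex engine.
        return pattern.sub(lambda _match: line, config_text)
    separator = "" if not config_text or config_text.endswith("\n") else "\n"
    return f"{config_text}{separator}{line}\n"


def module_map_json(pairs):
    """Serialize (module_id, pci_address) pairs as a JSON object.

    Assumed layout: {"<module id>": "<PCIe address>", ...}.
    """
    mapping = {str(module_id): pci_address for module_id, pci_address in pairs}
    return json.dumps(mapping, indent=2, sort_keys=True)
```

For example, `set_config_value(text, "FABRIC_MODE", 1)` rewrites an existing `FABRIC_MODE=0` line in place and leaves every other line untouched.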
@mresvanis force-pushed the fabric-manager-configuration branch from d59cb29 to 01dca1b, on January 15, 2026 18:59