diff --git a/Guide/src/SUMMARY.md b/Guide/src/SUMMARY.md index 17ffdb70c5..2b9b51a978 100644 --- a/Guide/src/SUMMARY.md +++ b/Guide/src/SUMMARY.md @@ -93,7 +93,7 @@ - [virtio-net]() - [virtio-pmem]() - [VMBus]() - - [storvsp]() + - [storvsp](./reference/devices/vmbus/storvsp.md) - [netvsp]() - [vpci]() - [serial]() @@ -107,15 +107,15 @@ - [Serial]() - [Legacy x86]() - [i440BX + PIIX4 chipset]() - - [IDE HDD/Optical]() - - [Floppy]() + - [IDE HDD/Optical](./reference/emulated/legacy_x86/ide.md) + - [Floppy](./reference/emulated/legacy_x86/floppy.md) - [PCI]() - [VGA]() - [Direct Assigned]() - [Device Backends]() - [Serial]() - [Graphics and Input]() - - [Storage]() + - [Storage](./reference/backends/storage.md) - [Networking]() - [Architecture](./reference/architecture.md) - [OpenVMM Architecture](./reference/architecture/openvmm.md) @@ -129,6 +129,8 @@ - [Boot Flow](./reference/architecture/openhcl/boot.md) - [Sidecar](./reference/architecture/openhcl/sidecar.md) - [IGVM](./reference/architecture/openhcl/igvm.md) + - [Device Architecture](./reference/architecture/devices.md) + - [Storage Pipeline](./reference/architecture/devices/storage.md) --- diff --git a/Guide/src/dev_guide/contrib/style_guide.md b/Guide/src/dev_guide/contrib/style_guide.md index 1870b42626..f58a0157bc 100644 --- a/Guide/src/dev_guide/contrib/style_guide.md +++ b/Guide/src/dev_guide/contrib/style_guide.md @@ -120,7 +120,15 @@ cargo run -- --disk memdiff:file:C:\vhds\disk.vhdx ### Length Keep code blocks under 30 lines. If longer, split with explanatory text -between blocks. Comments inside code blocks should explain *why*, not *what*. +between blocks. Diagrams (ASCII art in `text` fences) are exempt from +this limit — keep them as a single block so the visual structure isn't +broken. Comments inside code blocks should explain *why*, not *what*. + +## Line wrapping + +Wrap prose lines at approximately 80 characters. This keeps diffs +readable and makes review easier. 
Lines inside tables and code blocks +are exempt. ## Links diff --git a/Guide/src/reference/architecture/devices.md b/Guide/src/reference/architecture/devices.md new file mode 100644 index 0000000000..efcea36c70 --- /dev/null +++ b/Guide/src/reference/architecture/devices.md @@ -0,0 +1,13 @@ +# Device architecture + +This section covers the internal architecture of device emulators and +their backends — the shared machinery that both OpenVMM and OpenHCL +use to connect guest-visible storage, networking, and other devices to +their backing implementations. + +## Pages + +- [Storage pipeline](./devices/storage.md) — how guest I/O flows from + a storage frontend (NVMe, SCSI, IDE) through the + [`DiskIo`](https://openvmm.dev/rustdoc/linux/disk_backend/trait.DiskIo.html) + abstraction to a concrete backing store. diff --git a/Guide/src/reference/architecture/devices/storage.md b/Guide/src/reference/architecture/devices/storage.md new file mode 100644 index 0000000000..a29e93cd75 --- /dev/null +++ b/Guide/src/reference/architecture/devices/storage.md @@ -0,0 +1,255 @@ +# Storage pipeline + +The storage stack carries guest I/O requests from a guest-visible +controller to a backing store and back. It's shared between OpenVMM +and OpenHCL. Every disk backend implements the +[`DiskIo`](https://openvmm.dev/rustdoc/linux/disk_backend/trait.DiskIo.html) +trait, and frontends hold a +[`Disk`](https://openvmm.dev/rustdoc/linux/disk_backend/struct.Disk.html) +wrapper — a cheap, cloneable handle to any backend. For the `DiskIo` +trait surface, method contracts, and error model, see the +[`disk_backend` rustdoc](https://openvmm.dev/rustdoc/linux/disk_backend/index.html). 
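As a rough illustration of the "cheap, cloneable handle" idea, the sketch below uses simplified, synchronous stand-ins. `ToyDiskIo`, `ToyDisk`, and `ToyRamBackend` are hypothetical names invented for this example; the real `DiskIo` trait has a different, larger surface, so treat this only as a picture of the ownership pattern.

```rust
use std::sync::Arc;

/// Hypothetical, heavily simplified stand-in for a disk backend trait.
trait ToyDiskIo: Send + Sync {
    fn sector_count(&self) -> u64;
    fn sector_size(&self) -> u32;
}

/// Hypothetical backend that only knows its own geometry.
struct ToyRamBackend {
    sectors: u64,
}

impl ToyDiskIo for ToyRamBackend {
    fn sector_count(&self) -> u64 {
        self.sectors
    }
    fn sector_size(&self) -> u32 {
        512
    }
}

/// Hypothetical handle: cloning copies a pointer, not the backend.
#[derive(Clone)]
struct ToyDisk(Arc<dyn ToyDiskIo>);

impl ToyDisk {
    fn new(backend: impl ToyDiskIo + 'static) -> Self {
        ToyDisk(Arc::new(backend))
    }

    /// Frontend-style query that works for any backend behind the handle.
    fn capacity_bytes(&self) -> u64 {
        self.0.sector_count() * u64::from(self.0.sector_size())
    }
}

fn main() {
    let disk = ToyDisk::new(ToyRamBackend { sectors: 8 });
    let for_frontend = disk.clone();
    println!("capacity: {} bytes", for_frontend.capacity_bytes());
}
```

The frontend-side code only sees `ToyDisk`, so swapping `ToyRamBackend` for another implementation would not change it.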
+ +## The pipeline + +Every storage I/O flows through the same layered pipeline: + +```text + ┌──────────────────────────────────────────────────────────┐ + │ Guest I/O │ + └────────────────────────┬─────────────────────────────────┘ + │ + ┌────────────────────────┼─────────────────────────────────┐ + │ Frontend │ │ + │ (NVMe · StorVSP · IDE)│ │ + └────┬───────────────────┼────────────────────┬────────────┘ + │ │ │ + │ NVMe: direct │ SCSI / IDE │ + │ ▼ │ + │ ┌────────────────────────┐ │ + │ │ SCSI adapter │ │ + │ │ (SimpleScsiDisk / │ │ + │ │ SimpleScsiDvd) │ │ + │ └───────────┬────────────┘ │ + │ │ │ + ▼ ▼ ▼ + ┌──────────────────────────────────────────────────────────┐ + │ Disk (DiskIo trait boundary) │ + └────────────────────────┬─────────────────────────────────┘ + │ + ┌────────────────────────┼─────────────────────────────────┐ + │ Decorator wrappers │ (optional: crypt · delay · PR) │ + └────────────────────────┼─────────────────────────────────┘ + │ + ┌──────────────┴──────────────┐ + ▼ ▼ + ┌──────────────────┐ ┌──────────────────────────────┐ + │ Backend │ │ Layered disk │ + │ (file · block │ │ (optional: RAM + backing) │ + │ device · blob │ │ ├── Layer 0 (RAM/sqlite) │ + │ · VHD · ...) │ │ └── Layer 1 (backend) │ + └──────────────────┘ └──────────────────────────────┘ +``` + +Key vocabulary: + +- **Frontend.** Speaks a guest-visible storage protocol and translates + requests into `DiskIo` calls. +- **SCSI adapter.** For the SCSI and IDE paths, an intermediate layer + ([`SimpleScsiDisk`](https://openvmm.dev/rustdoc/linux/scsidisk/struct.SimpleScsiDisk.html) + or + [`SimpleScsiDvd`](https://openvmm.dev/rustdoc/linux/scsidisk/scsidvd/struct.SimpleScsiDvd.html)) + that parses SCSI CDB opcodes before calling `DiskIo`. +- **Backend.** A `DiskIo` implementation that reads and writes to a + specific backing store. +- **Decorator.** A `DiskIo` implementation that wraps another `Disk` + and transforms I/O in transit (encryption, delay, persistent + reservations). 
+- **Layered disk.** A `DiskIo` implementation composed of ordered + layers with per-sector presence tracking. + +## Frontends + +Three frontends exist. Each speaks a different guest-visible protocol +but they all produce `DiskIo` calls on the backend side. + +| Frontend | Protocol | Transport | Crate | +|----------|----------|-----------|-------| +| [NVMe](../../emulated/NVMe/overview.md) | NVMe 2.0 | PCI MMIO + MSI-X | [`nvme`](https://openvmm.dev/rustdoc/linux/nvme/index.html) | +| [StorVSP](../../devices/vmbus/storvsp.md) | SCSI CDB over VMBus | VMBus ring buffers | [`storvsp`](https://openvmm.dev/rustdoc/linux/storvsp/index.html) | +| [IDE](../../emulated/legacy_x86/ide.md) | ATA / ATAPI | PCI/ISA I/O ports + DMA | [`ide`](https://openvmm.dev/rustdoc/linux/ide/index.html) | + +**NVMe** is the simplest path. The NVMe controller's namespace directly holds a `Disk`. NVM opcodes (READ, WRITE, FLUSH, DSM) map nearly 1:1 to `DiskIo` methods. The FUA bit from the NVMe write command is forwarded directly. + +**StorVSP / SCSI** has a two-layer design. StorVSP handles the VMBus transport — negotiation, ring buffer management, sub-channel allocation. It dispatches each SCSI request to an [`AsyncScsiDisk`](https://openvmm.dev/rustdoc/linux/scsi_core/trait.AsyncScsiDisk.html) implementation. For hard drives, that's [`SimpleScsiDisk`](https://openvmm.dev/rustdoc/linux/scsidisk/struct.SimpleScsiDisk.html), which parses the SCSI CDB and translates it to `DiskIo` calls. For optical drives, it's [`SimpleScsiDvd`](https://openvmm.dev/rustdoc/linux/scsidisk/scsidvd/struct.SimpleScsiDvd.html). + +**IDE** is the legacy path. ATA commands for hard drives call `DiskIo` directly. ATAPI commands for optical drives delegate to `SimpleScsiDvd` through an ATAPI-to-SCSI translation layer — the same DVD implementation that StorVSP uses. 
IDE also supports [enlightened INT13 commands](../../emulated/legacy_x86/ide.md#enlightened-io), a Microsoft-specific optimization that collapses the multi-exit register-programming sequence into a single VM exit. + +## Backends + +A backend is a `DiskIo` implementation that reads and writes to a specific backing store. Backends are interchangeable — swap one for another without changing the frontend. The frontend holds a `Disk` and doesn't know what's behind it. See the [storage backends](../../backends/storage.md) page for the full catalog and platform details. + +## Decorators + +A decorator is a `DiskIo` implementation that wraps another `Disk` and transforms I/O in transit. Features compose by stacking decorators without modifying backends: + +```text + CryptDisk + └── BlockDeviceDisk +``` + +Three decorators exist: [`CryptDisk`](https://openvmm.dev/rustdoc/linux/disk_crypt/struct.CryptDisk.html) (XTS-AES-256 encryption), [`DelayDisk`](https://openvmm.dev/rustdoc/linux/disk_delay/struct.DelayDisk.html) (injected latency), and [`DiskWithReservations`](https://openvmm.dev/rustdoc/linux/disk_prwrap/struct.DiskWithReservations.html) (in-memory persistent reservation emulation). All three forward metadata (sector count, sector size, disk ID, `wait_resize`) to the inner disk unchanged. See the [storage backends](../../backends/storage.md) page for the decorator catalog. + +## The layered disk model + +A [`LayeredDisk`](https://openvmm.dev/rustdoc/linux/disk_layered/struct.LayeredDisk.html) is a `DiskIo` implementation composed of multiple layers, ordered from top to bottom. Each layer is a block device with per-sector *presence* tracking. This model powers diff disks, RAM overlays, and caching. + +### Reads fall through + +When a read arrives, the layered disk checks layers top-to-bottom. The first layer that has the requested sectors provides the data. Sectors not present in any layer are zeroed. + +### Writes go to the top + +Writes always go to the topmost layer. 
If that layer is configured with *write-through*, the write also propagates to the next layer. + +### Read caching + +A layer can be configured to cache read misses: when sectors are fetched from a lower layer, they're written back to the cache layer. This uses a `write_no_overwrite` operation to avoid overwriting sectors that were written between the read and the cache population. + +### Layer implementations + +Two concrete layers exist today: + +- **RamDiskLayer** ([`disklayer_ram`](https://openvmm.dev/rustdoc/linux/disklayer_ram/index.html)) — ephemeral, in-memory. Data is stored in a `BTreeMap` keyed by sector number. Fast, but lost when the VM stops. +- **SqliteDiskLayer** ([`disklayer_sqlite`](https://openvmm.dev/rustdoc/linux/disklayer_sqlite/index.html)) — persistent, backed by a SQLite database (`.dbhd` file). Designed for dev/test scenarios — no stability guarantees on the on-disk format. + +A full `Disk` can appear at the bottom of the stack as a fully-present layer (`DiskAsLayer`). This is the typical case: a RAM or sqlite layer on top of a file or block device. + +### Worked example: `memdiff:file:disk.vhdx` + +```text + Layer 0: RamDiskLayer (empty, writable) + Layer 1: DiskAsLayer wrapping FileDisk (fully present, read-only + from the layered disk's perspective) +``` + +- Guest write → sector goes to the RAM layer. +- Guest read → check RAM; if the sector is present, return it. If absent, fall through to the file. +- Sectors absent from both layers → zero-filled. + +Changes are ephemeral — they live in the RAM layer and are lost when the VM stops. The [Running OpenVMM](../../../user_guide/openvmm/run.md) page shows concrete `memdiff:` examples. + +## How configuration becomes a concrete stack + +The resource resolver connects configuration (CLI flags, VTL2 settings) to concrete backends. A resource *handle* describes what backend to use; a *resolver* creates it. + +The storage resolver chain is recursive. 
An NVMe controller resolves each namespace's disk, which may be a layered disk, which resolves each layer in parallel, which may itself be a disk that needs resolving. + +**Example:** `--disk memdiff:file:path/to/disk.vhdx` + +1. CLI parses this into a `LayeredDiskHandle` with two layers: + - Layer 0: `RamDiskLayerHandle { len: None }` (RAM diff, inherits size from backing disk) + - Layer 1: `DiskLayerHandle(FileDiskHandle(...))` (the file) +2. The layered disk resolver resolves both layers in parallel. +3. The RAM layer attaches on top of the file layer, inheriting its sector size and capacity. +4. The resulting `LayeredDisk` is wrapped in a `Disk` and handed to the NVMe namespace or SCSI controller. + +For the OpenHCL settings model (`StorageController`, `Lun`, `PhysicalDevice`), see [Storage Translation](../openhcl/storage_translation.md) and [Storage Configuration Model](../openhcl/storage_configuration.md). + +## Backend catalog + +| Backend | Crate | Wraps | Platform | Note | +|---------|-------|-------|----------|------| +| FileDisk | [`disk_file`](https://openvmm.dev/rustdoc/linux/disk_file/index.html) | Host file | Cross-platform | Simplest backend | +| Vhd1Disk | [`disk_vhd1`](https://openvmm.dev/rustdoc/linux/disk_vhd1/index.html) | VHD1 fixed file | Cross-platform | Parses VHD footer | +| VhdmpDisk | `disk_vhdmp` | Windows vhdmp driver | Windows | Dynamic/differencing VHD/VHDX | +| BlobDisk | [`disk_blob`](https://openvmm.dev/rustdoc/linux/disk_blob/index.html) | HTTP / Azure Blob | Cross-platform | Read-only, HTTP range requests | +| BlockDeviceDisk | [`disk_blockdevice`](https://openvmm.dev/rustdoc/linux/disk_blockdevice/index.html) | Linux block device | Linux | io_uring, resize via uevent, PR passthrough | +| NvmeDisk | [`disk_nvme`](https://openvmm.dev/rustdoc/linux/disk_nvme/index.html) | Physical NVMe (VFIO) | Linux/Windows | User-mode NVMe driver, resize via AEN | +| StripedDisk | 
[`disk_striped`](https://openvmm.dev/rustdoc/linux/disk_striped/index.html) | Multiple Disks | Cross-platform | Data striping | + +## Online disk resize + +Disk resize is a cross-cutting concern that spans backends and frontends. + +### Backend detection + +Only two backends detect capacity changes at runtime: + +- **BlockDeviceDisk** — listens for Linux uevent notifications on the block device. When the host resizes the device, a uevent fires, the backend re-queries the size via ioctl, and `wait_resize` completes. +- **NvmeDisk** — the user-mode NVMe driver monitors Async Event Notifications (AEN) from the physical controller and rescans namespace capacity. + +All other backends default to never signaling (`wait_resize` returns `pending()`). Decorators and layered disks delegate `wait_resize` to the inner backend. + +```admonish warning +`FileDisk` never signals resize. If you attach a file backend and resize the file at runtime, nothing happens — the guest won't be notified. Use `BlockDeviceDisk` or `NvmeDisk` for runtime resize. +``` + +### Frontend notification + +Once a backend detects a resize, the frontend notifies the guest: + +| Frontend | Mechanism | How it works | +|----------|-----------|-------------| +| NVMe | Async Event Notification | Background task per namespace calls `wait_resize`. On change, completes a queued AER command with a changed-namespace-list log page. Guest re-identifies the namespace. | +| StorVSP / SCSI | UNIT_ATTENTION | On the next SCSI command after a resize, `SimpleScsiDisk` detects the capacity change and returns CHECK_CONDITION with UNIT_ATTENTION / CAPACITY_DATA_CHANGED. Guest retries and re-reads capacity. | +| IDE | Not supported | IDE has no capacity-change notification mechanism. | + +The resize path is the same in OpenHCL and standalone — `BlockDeviceDisk` detects the uevent from the host, `wait_resize` completes, and the frontend notifies the guest through the standard mechanism. No special paravisor-level interception. 
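The SCSI row of the frontend table can be pictured with a small sketch: compare the capacity last reported to the guest with the backend's current capacity on each command. `ToyScsiDisk` and `ScsiOutcome` are hypothetical names for illustration only, not the `scsidisk` or `scsi_core` types, and the real implementation may track this state differently.

```rust
/// Hypothetical outcome of a SCSI command; not the real `scsi_core` types.
#[derive(Debug, PartialEq)]
enum ScsiOutcome {
    Good,
    /// Stands for CHECK_CONDITION with UNIT_ATTENTION / CAPACITY_DATA_CHANGED.
    CapacityChanged,
}

/// Toy stand-in for the SCSI adapter's view of one disk.
struct ToyScsiDisk {
    /// Capacity the guest was last told about, in sectors.
    reported_sectors: u64,
}

impl ToyScsiDisk {
    /// Handle one command, given the backend's current sector count.
    fn handle_command(&mut self, current_sectors: u64) -> ScsiOutcome {
        if current_sectors != self.reported_sectors {
            // Remember the new size so the condition is reported only once.
            self.reported_sectors = current_sectors;
            return ScsiOutcome::CapacityChanged;
        }
        ScsiOutcome::Good
    }
}

fn main() {
    let mut disk = ToyScsiDisk { reported_sectors: 1000 };
    println!("{:?}", disk.handle_command(1000));
    // The backend grew; the next command reports the change.
    println!("{:?}", disk.handle_command(2000));
    // The guest retries and the command now succeeds.
    println!("{:?}", disk.handle_command(2000));
}
```

This is why a resized SCSI disk is only noticed by the guest when it next issues a command, whereas the NVMe path pushes an event through a queued AER.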
+ +## Virtual optical / DVD + +DVD and CD-ROM drives use a different model from disk devices. + +[`SimpleScsiDvd`](https://openvmm.dev/rustdoc/linux/scsidisk/scsidvd/struct.SimpleScsiDvd.html) implements `AsyncScsiDisk` and manages media state: a disk can be `Loaded` or `Unloaded`. Optical media always uses a 2048-byte sector size. The implementation handles optical-specific SCSI commands: `GET_EVENT_STATUS_NOTIFICATION`, `GET_CONFIGURATION`, `START_STOP_UNIT` (eject), and media change events. + +### Eject + +Two eject paths exist: + +- **Guest-initiated** (SCSI `START_STOP_UNIT` with the load/eject flag): the DVD handler checks the prevent flag, replaces media with `Unloaded`, and calls `disk.eject()`. Once ejected via SCSI, the media is **permanently removed** for the VM lifetime. +- **Host-initiated** (`change_media` via the resolver's background task): can insert new media or remove existing media dynamically. + +### Frontend support + +| Frontend | DVD support | How | +|----------|-------------|-----| +| StorVSP / SCSI | Yes | `SimpleScsiDvd` implements `AsyncScsiDisk` directly. | +| IDE | Yes | ATAPI wraps `SimpleScsiDvd` through the ATAPI-to-SCSI layer. | +| NVMe | No | NVMe has no removable media concept. Explicitly rejected. | + +### CLI + +- `--disk file:my.iso,dvd` → SCSI optical drive. +- `--ide file:my.iso,dvd` → IDE optical drive (ATAPI). + +The `dvd` flag implicitly sets `read_only = true`. + +## `mem:` and `memdiff:` CLI mapping + +Both CLI options map to the layered disk model: + +- **`mem:1G`** creates a single-layer `LayeredDisk` with a `RamDiskLayer` sized to 1 GB. No backing disk — the RAM layer is the entire disk. +- **`memdiff:file:disk.vhdx`** creates a two-layer `LayeredDisk`: a `RamDiskLayer` (inheriting size from the backing disk) on top of the file. Writes go to the RAM layer; reads fall through to the file for sectors not yet written. + +Both use `RamDiskLayerHandle` under the hood. 
The difference is `len: Some(size)` for `mem:` (standalone RAM disk with explicit size) vs. `len: None` for `memdiff:` (inherits from backing disk). The [Running OpenVMM](../../../user_guide/openvmm/run.md) page shows concrete examples. + +## Controller identity and Azure disk classification + +In Azure, which controller a disk sits on is a de facto compatibility boundary. Azure VMs present four SCSI controllers (this may change), each with a distinct instance ID. One controller carries the OS disk, resource (temporary) disk, and related infrastructure disks; a separate controller carries remote data disks. For Gen1 VMs, the IDE controllers logically replace that first SCSI controller, while data disks remain on SCSI. + +Guest agents use controller identity to classify disks. The [azure-vm-utils udev rules](https://github.com/Azure/azure-vm-utils/blob/main/udev/80-azure-disk.rules) match on SCSI controller instance IDs to create stable symlinks under `/dev/disk/azure/`. Moving a disk from one StorVSP controller instance to another changes its classification and can break guest-side automation. For SCSI disk mapping details, see the [Azure disk mapping docs](https://learn.microsoft.com/en-us/azure/virtual-machines/windows/azure-to-guest-disk-mapping). + +For NVMe, the mapping uses namespace IDs: NSID 1 is the OS disk, NSID 2+ are data disks (portal LUN = NSID − 2). On newer VM sizes (v7+), disks are split across multiple NVMe controllers by caching policy. NVMe is Gen2-only. See the [NVMe overview](https://learn.microsoft.com/en-us/azure/virtual-machines/nvme-overview) and [NVMe disk identification FAQ](https://learn.microsoft.com/en-us/azure/virtual-machines/enable-nvme-remote-faqs) for the full Azure perspective. 
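The NSID rule above is simple arithmetic, sketched here for the single-controller case. `AzureNvmeDisk` and `classify_nsid` are hypothetical names for this example, and the sketch does not model the multi-controller split on newer VM sizes.

```rust
/// Hypothetical classification of an Azure NVMe namespace.
#[derive(Debug, PartialEq)]
enum AzureNvmeDisk {
    Os,
    Data { lun: u32 },
}

/// Apply the rule from the text: NSID 1 is the OS disk,
/// NSID 2+ are data disks with portal LUN = NSID - 2.
fn classify_nsid(nsid: u32) -> Option<AzureNvmeDisk> {
    match nsid {
        // 0 is not a valid namespace ID; u32::MAX is the broadcast value.
        0 | u32::MAX => None,
        1 => Some(AzureNvmeDisk::Os),
        n => Some(AzureNvmeDisk::Data { lun: n - 2 }),
    }
}

fn main() {
    for nsid in [1, 2, 5] {
        println!("NSID {nsid}: {:?}", classify_nsid(nsid));
    }
}
```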
+ +## Implementation map + +| Component | Why read it | Source | Rustdoc | +|-----------|-------------|--------|---------| +| `disk_backend` | `DiskIo` trait, `Disk` wrapper, error model | [source](https://github.com/microsoft/openvmm/blob/main/vm/devices/storage/disk_backend/src/lib.rs) | [rustdoc](https://openvmm.dev/rustdoc/linux/disk_backend/index.html) | +| `disk_layered` | Layered disk, `LayerIo` trait, bitmap tracking | [source](https://github.com/microsoft/openvmm/blob/main/vm/devices/storage/disk_layered/src/lib.rs) | [rustdoc](https://openvmm.dev/rustdoc/linux/disk_layered/index.html) | +| `nvme` | NVMe controller emulator | [source](https://github.com/microsoft/openvmm/blob/main/vm/devices/storage/nvme/src/lib.rs) | [rustdoc](https://openvmm.dev/rustdoc/linux/nvme/index.html) | +| `storvsp` | VMBus SCSI controller | [source](https://github.com/microsoft/openvmm/blob/main/vm/devices/storage/storvsp/src/lib.rs) | [rustdoc](https://openvmm.dev/rustdoc/linux/storvsp/index.html) | +| `scsidisk` | SCSI CDB parser (`SimpleScsiDisk`, `SimpleScsiDvd`) | [source](https://github.com/microsoft/openvmm/blob/main/vm/devices/storage/scsidisk/src/lib.rs) | [rustdoc](https://openvmm.dev/rustdoc/linux/scsidisk/index.html) | +| `ide` | IDE controller emulator | [source](https://github.com/microsoft/openvmm/blob/main/vm/devices/storage/ide/src/lib.rs) | [rustdoc](https://openvmm.dev/rustdoc/linux/ide/index.html) | +| `scsi_core` | `AsyncScsiDisk` trait, `Request`, `ScsiResult` | [source](https://github.com/microsoft/openvmm/blob/main/vm/devices/storage/scsi_core/src/lib.rs) | [rustdoc](https://openvmm.dev/rustdoc/linux/scsi_core/index.html) | diff --git a/Guide/src/reference/architecture/openhcl/storage_configuration.md b/Guide/src/reference/architecture/openhcl/storage_configuration.md index 7164ee4dec..1be82bf7a3 100644 --- a/Guide/src/reference/architecture/openhcl/storage_configuration.md +++ b/Guide/src/reference/architecture/openhcl/storage_configuration.md @@ -1,6 
+1,11 @@ -# OpenHCL Storage Configuration Model - -The VTL2 settings model describes guest-visible storage controllers, child devices, and their backing devices. +# OpenHCL storage configuration model + +The VTL2 settings model describes guest-visible storage controllers, +child devices, and their backing devices. For the internal +architecture of the disk backend abstraction, the layered disk model, +and how frontends translate guest I/O into +[`DiskIo`](https://openvmm.dev/rustdoc/linux/disk_backend/trait.DiskIo.html) +calls, see the [storage pipeline](../devices/storage.md) page. ## Overview diff --git a/Guide/src/reference/architecture/openhcl/storage_translation.md b/Guide/src/reference/architecture/openhcl/storage_translation.md index b9c8a14c34..b7a300d96c 100644 --- a/Guide/src/reference/architecture/openhcl/storage_translation.md +++ b/Guide/src/reference/architecture/openhcl/storage_translation.md @@ -1,6 +1,11 @@ -# OpenHCL Storage Translation - -OpenHCL maps storage offered into VTL2 onto the controller and disk model that VTL0 sees. +# OpenHCL storage translation + +OpenHCL maps storage offered into VTL2 onto the controller and disk +model that VTL0 sees. This page covers that mapping — the *outside* +of the shell. For the *inside* (how guest I/O flows from a storage +frontend through the SCSI adapter and disk backend abstraction to a +concrete backing store), see the +[storage pipeline](../devices/storage.md) page. ## Overview diff --git a/Guide/src/reference/backends/storage.md b/Guide/src/reference/backends/storage.md new file mode 100644 index 0000000000..6df9cecb99 --- /dev/null +++ b/Guide/src/reference/backends/storage.md @@ -0,0 +1,53 @@ +# Storage backends + +Storage backends implement the +[`DiskIo`](https://openvmm.dev/rustdoc/linux/disk_backend/trait.DiskIo.html) +trait, the shared abstraction that all storage frontends use to read +and write data. 
A frontend holds a +[`Disk`](https://openvmm.dev/rustdoc/linux/disk_backend/struct.Disk.html) +handle and doesn't know what kind of backend is behind it — the same +frontend code works with a local file, a Linux block device, a remote +blob, or a layered composition of multiple backends. + +## Backend catalog + +| Backend | Crate | Wraps | Platform | Key characteristic | +|---------|-------|-------|----------|--------------------| +| FileDisk | [`disk_file`](https://openvmm.dev/rustdoc/linux/disk_file/index.html) | Host file | Cross-platform | Simplest backend. Blocking I/O via `unblock()`. | +| Vhd1Disk | [`disk_vhd1`](https://openvmm.dev/rustdoc/linux/disk_vhd1/index.html) | VHD1 fixed file | Cross-platform | Parses VHD footer for geometry. | +| VhdmpDisk | `disk_vhdmp` | Windows vhdmp driver | Windows | Dynamic and differencing VHD/VHDX. | +| BlobDisk | [`disk_blob`](https://openvmm.dev/rustdoc/linux/disk_blob/index.html) | HTTP / Azure Blob | Cross-platform | Read-only. HTTP range requests. | +| BlockDeviceDisk | [`disk_blockdevice`](https://openvmm.dev/rustdoc/linux/disk_blockdevice/index.html) | Linux block device | Linux | io_uring, resize via uevent, PR passthrough. | +| NvmeDisk | [`disk_nvme`](https://openvmm.dev/rustdoc/linux/disk_nvme/index.html) | Physical NVMe (VFIO) | Linux/Windows | User-mode NVMe driver. Resize via AEN. | +| StripedDisk | [`disk_striped`](https://openvmm.dev/rustdoc/linux/disk_striped/index.html) | Multiple Disks | Cross-platform | Stripes data across underlying disks. | + +## Decorators + +Decorators wrap another +[`Disk`](https://openvmm.dev/rustdoc/linux/disk_backend/struct.Disk.html) +and transform I/O in transit. Features compose by stacking decorators +without modifying the backends underneath. + +| Decorator | Crate | Transform | +|-----------|-------|-----------| +| CryptDisk | [`disk_crypt`](https://openvmm.dev/rustdoc/linux/disk_crypt/index.html) | XTS-AES-256 encryption. Encrypts on write, decrypts on read. 
| +| DelayDisk | [`disk_delay`](https://openvmm.dev/rustdoc/linux/disk_delay/index.html) | Adds configurable latency to each I/O operation. | +| DiskWithReservations | [`disk_prwrap`](https://openvmm.dev/rustdoc/linux/disk_prwrap/index.html) | In-memory SCSI persistent reservation emulation. | + +## Layered disks + +A [`LayeredDisk`](https://openvmm.dev/rustdoc/linux/disk_layered/index.html) +composes multiple layers into a single `DiskIo` implementation. Each +layer tracks which sectors it has; reads fall through from top to +bottom until a layer has the requested data. This powers the +`memdiff:` and `mem:` CLI options. + +Two layer implementations exist today: + +- **RamDiskLayer** ([`disklayer_ram`](https://openvmm.dev/rustdoc/linux/disklayer_ram/index.html)) — ephemeral, in-memory. +- **SqliteDiskLayer** ([`disklayer_sqlite`](https://openvmm.dev/rustdoc/linux/disklayer_sqlite/index.html)) — persistent, file-backed (dev/test only). + +The [storage pipeline](../architecture/devices/storage.md) page covers +the full architecture: how frontends, backends, decorators, and the +layered disk model connect, plus cross-cutting concerns like online +disk resize and virtual optical media. diff --git a/Guide/src/reference/devices/vmbus/storvsp.md b/Guide/src/reference/devices/vmbus/storvsp.md new file mode 100644 index 0000000000..2aec7c250c --- /dev/null +++ b/Guide/src/reference/devices/vmbus/storvsp.md @@ -0,0 +1,46 @@ +# StorVSP + +StorVSP is the VMBus SCSI controller emulator. It presents a virtual +SCSI adapter to the guest over a VMBus channel and translates SCSI +requests into calls against the shared disk backend abstraction. + +## Overview + +StorVSP implements the Hyper-V synthetic SCSI protocol — a +VMBus-based transport that carries SCSI CDBs (Command Descriptor +Blocks) between the guest's `storvsc` driver and the host. 
This +isn't a standard SCSI transport (like iSCSI or SAS); it's a +Hyper-V-specific wire format defined in +[`storvsp_protocol`](https://openvmm.dev/rustdoc/linux/storvsp_protocol/index.html). +The guest side (`storvsc`) is in the Linux kernel and Windows inbox +drivers. + +Each SCSI path (channel / target / LUN) maps to an +[`AsyncScsiDisk`](https://openvmm.dev/rustdoc/linux/scsi_core/trait.AsyncScsiDisk.html) +implementation — typically +[`SimpleScsiDisk`](https://openvmm.dev/rustdoc/linux/scsidisk/struct.SimpleScsiDisk.html) +for hard drives or +[`SimpleScsiDvd`](https://openvmm.dev/rustdoc/linux/scsidisk/scsidvd/struct.SimpleScsiDvd.html) +for optical media. Those implementations parse the SCSI CDB and +translate it into +[`DiskIo`](https://openvmm.dev/rustdoc/linux/disk_backend/trait.DiskIo.html) +calls (read, write, flush, unmap). + +## Key characteristics + +- **Transport.** VMBus ring buffers with GPADL-backed memory. +- **Protocol.** Hyper-V SCSI (SRB-based), with version negotiation + (Win6 through Blue). +- **Sub-channels.** StorVSP supports multiple VMBus sub-channels + for parallel I/O, one worker per channel. +- **Hot-add / hot-remove.** SCSI devices can be attached and + detached at runtime via `ScsiControllerRequest`. +- **Performance.** Poll-mode optimization — when pending I/O count + exceeds `poll_mode_queue_depth`, switches from interrupt-driven + to busy-poll for new requests, reducing guest exit frequency. +- **Crate.** [`storvsp`](https://openvmm.dev/rustdoc/linux/storvsp/index.html) + +The [storage pipeline](../../architecture/devices/storage.md) page +covers the full frontend-to-backend architecture, including the SCSI +adapter layer and how `SimpleScsiDisk` translates CDB opcodes to +`DiskIo` calls. 
diff --git a/Guide/src/reference/emulated/NVMe/doorbells.md b/Guide/src/reference/emulated/NVMe/doorbells.md index 5fbf80bc34..903dff5891 100644 --- a/Guide/src/reference/emulated/NVMe/doorbells.md +++ b/Guide/src/reference/emulated/NVMe/doorbells.md @@ -1,5 +1,12 @@ # Doorbells -The doorbell notification system in the NVMe emulator is built around two core structures: `DoorbellMemory` and `DoorbellState`. These components work together to coordinate doorbell updates between the guest and the device, following a server-client like model. + +The doorbell notification system in the NVMe emulator is built around +two core structures: `DoorbellMemory` and `DoorbellState`. +These components work together to coordinate doorbell updates between +the guest and the device, following a client-server-like model. For +how the NVMe emulator fits into the broader storage pipeline, see the +[storage pipeline](../../architecture/devices/storage.md) page and the +[`nvme` rustdoc](https://openvmm.dev/rustdoc/linux/nvme/index.html). ![Figure that shows the basic layout of the doorbell memory and doorbell state. There is 1 doorbell memory struct containing a vector of registered wakers and a pointer in to guest memory at "offset". There are 3 doorbell state structs that each track a different doorbell but all have pointers to the doorbell memory struct](images/Doorbell%20Setup.png "Doorbell Setup") Fig: Basic layout of DoorbellMemory and DoorbellStates. diff --git a/Guide/src/reference/emulated/NVMe/overview.md b/Guide/src/reference/emulated/NVMe/overview.md index 8393c56d73..0d8e36fa59 100644 --- a/Guide/src/reference/emulated/NVMe/overview.md +++ b/Guide/src/reference/emulated/NVMe/overview.md @@ -1,8 +1,8 @@ -# NVMe Emulator +# NVMe emulator Among the devices that OpenVMM emulates, an NVMe controller is one. 
The OpenVMM NVMe emulator comes in two flavors: - An NVMe emulator that can be used to serve IO workloads (but pragmatically is only used by OpenVMM for test scenarios today) -- An NVMe emulator used to test OpenHCL (`nvme_test`), which allows test authors to inject faults and inspect the state of NVMe devices used by the guest, and +- An NVMe emulator used to test OpenHCL ([`nvme_test`](https://openvmm.dev/rustdoc/linux/nvme_test/index.html)), which allows test authors to inject faults and inspect the state of NVMe devices used by the guest. -This guide provides a brief overview of the architecture shared by the NVMe emulators. +This guide provides a brief overview of the architecture shared by the NVMe emulators. For how NVMe fits into the broader storage pipeline — including how namespaces map to [`DiskIo`](https://openvmm.dev/rustdoc/linux/disk_backend/trait.DiskIo.html) backends, online disk resize via AEN, and the layered disk model — see the [storage pipeline](../../architecture/devices/storage.md) page. diff --git a/Guide/src/reference/emulated/legacy_x86/floppy.md b/Guide/src/reference/emulated/legacy_x86/floppy.md new file mode 100644 index 0000000000..e752e4fbe3 --- /dev/null +++ b/Guide/src/reference/emulated/legacy_x86/floppy.md @@ -0,0 +1,81 @@ +# Floppy + +The floppy controller emulates an +[Intel 82077AA](https://en.wikipedia.org/wiki/Intel_82077AA) CHMOS +single-chip floppy disk controller. It connects to the storage stack +through +[`Disk`](https://openvmm.dev/rustdoc/linux/disk_backend/struct.Disk.html) +— the same backend abstraction used by NVMe and SCSI. Data transfers +use ISA DMA channel 2; interrupts use IRQ 6. + +Two variants exist: + +- [`FloppyDiskController`](https://openvmm.dev/rustdoc/linux/floppy/struct.FloppyDiskController.html) + — full emulator with disk I/O. 
+- [`StubFloppyDiskController`](https://openvmm.dev/rustdoc/linux/floppy_pcat_stub/struct.StubFloppyDiskController.html)
+  — reports "no drives" for PCAT BIOS compatibility when no floppy is
+  configured.
+
+## Supported media
+
+The controller auto-detects the floppy format from the disk image byte
+size. See
+[Wikipedia's list of floppy disk formats](https://en.wikipedia.org/wiki/List_of_floppy_disk_formats)
+for background on these formats.
+
+| Format | Capacity | Sectors/track | Notes |
+|--------|----------|---------------|-------|
+| Low density (SS) | 360 KB | 9 | Single-sided (one head) |
+| Low density | 720 KB | 9 | |
+| Medium density | 1.2 MB | 15 | |
+| High density | 1.44 MB | 18 | Most common format |
+| [DMF](https://en.wikipedia.org/wiki/Distribution_Media_Format) | 1.68 MB | 21 | Microsoft Distribution Media Format |
+| XDF | 1.72 MB | 23 | Extended density (fixed 23 SPT variant) |
+
+All formats use 512-byte sectors, 80 cylinders, CHS addressing. The
+controller rejects images that don't match a known format size.
+
+## I/O port layout
+
+Register offsets from base (typically 0x3F0):
+
+| Offset | Read | Write | Purpose |
+|--------|------|-------|---------|
+| +0 | STATUS_A | — | Fixed 0xFF (not emulated) |
+| +1 | STATUS_B | — | Fixed 0xFC (no tape drives) |
+| +2 | DOR | DOR | Motor control, drive select, DMA gate, reset |
+| +4 | MSR | DSR | Main status (busy, direction, RQM) / data rate select |
+| +5 | DATA | DATA | Command/parameter/result FIFO (16-byte) |
+| +7 | DIR | CCR | Disk change signal / config control |
+
+The controller claims port 0x3F7 for DIR/CCR separately from the
+6-byte base region, because 0x3F6 is shared with the IDE controller's
+alternate status register.
+
+## Limitations and deviations
+
+The real 82077AA supports four drives; OpenVMM supports one. The
+emulator implements a pragmatic subset of the command set — enough for
+MS-DOS, Windows, and Linux floppy drivers to detect the controller,
+identify media, and perform read/write/format operations. Commands that
+interact with physical media timing (perpendicular recording mode,
+power management) are accepted but largely no-op'd.
+
+Key differences from real hardware:
+
+- No multi-drive support (real hardware supports drives 0–3).
+- Physical media timing (step rate, head load/unload from SPECIFY) is
+  accepted but doesn't affect I/O timing.
+- CHS-to-LBA translation is straightforward — the controller doesn't
+  emulate track-level interleave or skew.
+- STATUS_A and STATUS_B registers return fixed values rather than reflecting physical drive state.
+
+## Crates
+
+| Crate | Purpose | Rustdoc |
+|-------|---------|---------|
+| `floppy` | Full 82077AA emulator | [rustdoc](https://openvmm.dev/rustdoc/linux/floppy/index.html) |
+| `floppy_pcat_stub` | Stub controller (no drives) | [rustdoc](https://openvmm.dev/rustdoc/linux/floppy_pcat_stub/index.html) |
+| `floppy_resources` | Config types (Resource-based instantiation not yet implemented) | [rustdoc](https://openvmm.dev/rustdoc/linux/floppy_resources/index.html) |
+
+The [storage pipeline](../../architecture/devices/storage.md) page covers how the floppy controller connects to the broader disk backend abstraction.
diff --git a/Guide/src/reference/emulated/legacy_x86/ide.md b/Guide/src/reference/emulated/legacy_x86/ide.md
new file mode 100644
index 0000000000..b831c07694
--- /dev/null
+++ b/Guide/src/reference/emulated/legacy_x86/ide.md
@@ -0,0 +1,139 @@
+# IDE HDD/Optical
+
+The IDE controller emulates the storage portion of an Intel PIIX4
+(82371AB) PCI-to-ISA bridge. It provides two IDE channels (primary and
+secondary), each supporting up to two devices — four total. Devices are
+either ATA hard drives or ATAPI optical drives.
+
+The controller connects to the storage stack through
+[`Disk`](https://openvmm.dev/rustdoc/linux/disk_backend/struct.Disk.html)
+for hard drives and
+[`AsyncScsiDisk`](https://openvmm.dev/rustdoc/linux/scsi_core/trait.AsyncScsiDisk.html)
+(via
+[`SimpleScsiDvd`](https://openvmm.dev/rustdoc/linux/scsidisk/scsidvd/struct.SimpleScsiDvd.html))
+for optical drives. Interrupts use IRQ 14 (primary) and IRQ 15
+(secondary). The emulator implements a subset of
+[ATA/ATAPI-6](https://www.t13.org/standards-published) (48-bit LBA)
+and the ATAPI packet interface from ATA/ATAPI-4, with a PCI config
+space layout based on the Intel PIIX4 (82371AB) datasheet (PCI
+vendor/device ID `8086:7111`).
+
+## I/O port layout
+
+Command block registers (per channel):
+
+| Port (pri / sec) | Register | Access | Purpose |
+|-------------------|----------|--------|---------|
+| 0x1F0 / 0x170 | Data | R/W | 16-bit PIO data transfer |
+| 0x1F1 / 0x171 | Error (R) / Features (W) | R/W | Error status / command parameters |
+| 0x1F2 / 0x172 | Sector count | R/W | Transfer size in sectors |
+| 0x1F3–0x1F5 / 0x173–0x175 | LBA low / mid / high | R/W | LBA address (28-bit or 48-bit with HOB) |
+| 0x1F6 / 0x176 | Device / head | R/W | Drive select + LBA[24:27] or head |
+| 0x1F7 / 0x177 | Status (R) / Command (W) | R/W | Status flags / command issue |
+| 0x3F6 / 0x376 | Alt status (R) / Device control (W) | R/W | Non-interrupt status / reset + nIEN |
+
+The IDE controller claims port 0x3F6 (shared region with the floppy
+controller's 0x3F7).
+
+## Bus master DMA
+
+The controller provides PCI bus master DMA via BAR4. Each channel has
+its own registers (primary at BAR4+0, secondary at BAR4+8):
+
+| Offset | Register | Purpose |
+|--------|----------|---------|
+| +0 | Command | Start DMA, read/write direction |
+| +2 | Status | Active, DMA error, interrupt flags |
+| +4 | PRD table pointer | Physical Region Descriptor table address |
+
+The PRD table is a scatter-gather list in guest memory. Each entry
+contains a 32-bit physical base address, a 16-bit byte count, and an
+end-of-table flag. DMA transfers iterate entries until the end-of-table
+bit or the requested byte count is reached.
+
+PCI config space includes PIIX4-specific timing registers
+(`PRIMARY_TIMING_REG_ADDR` at 0x40, `SECONDARY_TIMING_REG_ADDR` at
+0x44) and a UDMA control register (`UDMA_CTL_REG_ADDR` at 0x48).
+
+## ATA hard drives
+
+The ATA (AT Attachment) protocol defines a register-based command
+interface for hard drives. The guest programs LBA, sector count, and
+command into the command block registers, then transfers data via PIO
+or DMA. The emulator implements the subset that OS drivers actually
+use:
+
+- Data transfer: `READ SECTORS`, `WRITE SECTORS` (PIO), `READ DMA`,
+  `WRITE DMA` (DMA), plus 48-bit LBA extended variants.
+- `WRITE DMA FUA EXT` — force unit access, mapped to
+  `Disk::write_vectored` with `fua: true`.
+- `IDENTIFY DEVICE` — returns 512 bytes of drive geometry,
+  capabilities, and supported command sets.
+- `FLUSH CACHE` / `FLUSH CACHE EXT` — mapped to `Disk::sync_cache`.
+- `SET FEATURES`, `SET MULTI BLOCK MODE`, power management
+  (`STANDBY`, `IDLE`, `SLEEP`, `CHECK POWER MODE`).
+
+Commands not implemented (including SMART, security, and device
+configuration overlays) return an error. The emulator doesn't emulate
+PIO timing — transfers complete as fast as the backend can serve them.
+
+## ATAPI optical drives
+
+The ATAPI (ATA Packet Interface) extension transports SCSI commands
+over the ATA register interface. The guest issues `PACKET COMMAND`
+(0xA0), then writes a 12-byte SCSI CDB through the data register.
+The controller forwards this CDB to
+[`SimpleScsiDvd`](https://openvmm.dev/rustdoc/linux/scsidisk/scsidvd/struct.SimpleScsiDvd.html),
+which handles optical-specific SCSI commands (READ,
+GET_CONFIGURATION, START_STOP_UNIT for eject,
+GET_EVENT_STATUS_NOTIFICATION for media change).
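To make the packet step above concrete, here is a minimal sketch of the guest side: laying out a 12-byte ATAPI packet that carries a SCSI READ(10) CDB, then splitting it into the six 16-bit words written through the data register. The helper names `read10_packet` and `packet_words` are hypothetical and are not part of the `ide` crate. The CDB layout follows the standard READ(10) format (opcode 0x28, big-endian LBA in bytes 2 to 5, big-endian transfer length in bytes 7 and 8), and the little-endian word order assumes ordinary x86 16-bit PIO writes.

```rust
/// Hypothetical helper (not from the `ide` crate): build a 12-byte ATAPI
/// packet holding a SCSI READ(10) CDB. READ(10) itself is 10 bytes; the
/// last two bytes of the packet are zero padding.
pub fn read10_packet(lba: u32, blocks: u16) -> [u8; 12] {
    let mut packet = [0u8; 12];
    packet[0] = 0x28; // READ(10) opcode
    packet[2..6].copy_from_slice(&lba.to_be_bytes()); // big-endian LBA
    packet[7..9].copy_from_slice(&blocks.to_be_bytes()); // big-endian transfer length
    packet
}

/// Hypothetical helper: split the packet into the six 16-bit words a guest
/// would write to the data register (0x1F0 / 0x170) after issuing
/// PACKET COMMAND (0xA0). Assumes little-endian PIO, as on x86.
pub fn packet_words(packet: [u8; 12]) -> [u16; 6] {
    let mut words = [0u16; 6];
    for (word, pair) in words.iter_mut().zip(packet.chunks_exact(2)) {
        *word = u16::from_le_bytes([pair[0], pair[1]]);
    }
    words
}
```

The sketch only covers the byte layout of the packet; status polling, interrupt handling, and the data phase that follows are left out.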
+
+This layering means the ATAPI drive is a thin ATA-to-SCSI bridge —
+the same `SimpleScsiDvd` implementation serves both StorVSP (direct
+SCSI) and IDE (via ATAPI). See the
+[storage pipeline — virtual optical / DVD](../../architecture/devices/storage.md#virtual-optical--dvd)
+section for the DVD model and eject behavior.
+
+`IDENTIFY PACKET DEVICE` (0xA1) returns device identification with
+ATAPI-specific fields (general config word indicates removable media,
+ATAPI device type).
+
+## Enlightened I/O
+
+The IDE controller supports a Microsoft-specific performance
+optimization: enlightened INT13 commands. Instead of the guest issuing
+a sequence of register writes to set up an ATA/ATAPI command (LBA,
+sector count, command register, then DMA start), the guest writes a
+single `EnlightenedInt13Command` packet to guest memory and writes
+the packet's GPA to the enlightened port.
+
+Enlightened ports: 0x1E0 (primary channel), 0x160 (secondary channel).
+
+The `EnlightenedInt13Command` struct contains the ATA command opcode,
+device/head select, full 48-bit LBA, block count, a GPA for the data
+buffer, byte count, skip-bytes for partial sector transfers, and a
+result status field written by the controller on completion.
+
+This collapses the multi-exit register-programming sequence into a
+single VM exit, significantly reducing overhead for legacy IDE I/O.
+The enlightened path uses the same `DeferredWrite` async I/O
+mechanism — the I/O port write returns deferred, and the controller
+completes it when the disk operation finishes.
+
+Both HDD and optical drives support enlightened commands, with
+separate completion paths.
+
+## Limitations
+
+- No hot-add or hot-remove.
+- No online disk resize (IDE has no capacity-change notification).
+- Maximum four devices (two channels × two drives).
+- No native command queuing (NCQ) — one command at a time per channel.
+
+## Crate
+
+[`ide`](https://openvmm.dev/rustdoc/linux/ide/index.html).
See also +[`ide_resources`](https://openvmm.dev/rustdoc/linux/ide_resources/index.html) +for the `GuestMedia` enum and `IdeDeviceConfig` types. The +[storage pipeline](../../architecture/devices/storage.md) page covers +how IDE fits into the broader frontend-to-backend architecture. diff --git a/openvmm/openvmm_entry/src/cli_args.rs b/openvmm/openvmm_entry/src/cli_args.rs index dca0a70905..10944b3756 100644 --- a/openvmm/openvmm_entry/src/cli_args.rs +++ b/openvmm/openvmm_entry/src/cli_args.rs @@ -136,8 +136,16 @@ valid disk kinds: : length of ramdisk, e.g.: `1G` `memdiff:` memory backed diff disk : lower disk, e.g.: `file:base.img` - `file:` file-backed disk + `file:[;create=]` file-backed disk : path to file + `sql:[;create=]` SQLite-backed disk (dev/test) + `sqldiff:[;create]:` SQLite diff layer on a backing disk + `autocache::` auto-cached SQLite layer (use `autocache::` to omit key; needs OPENVMM_AUTO_CACHE_PATH) + `blob::` HTTP blob (read-only) + : `flat` or `vhd1` + `crypt:::` encrypted disk wrapper + : `xts-aes-256` + `prwrap:` persistent reservations wrapper flags: `ro` open disk as read-only @@ -160,8 +168,16 @@ valid disk kinds: : length of ramdisk, e.g.: `1G` `memdiff:` memory backed diff disk : lower disk, e.g.: `file:base.img` - `file:` file-backed disk + `file:[;create=]` file-backed disk : path to file + `sql:[;create=]` SQLite-backed disk (dev/test) + `sqldiff:[;create]:` SQLite diff layer on a backing disk + `autocache::` auto-cached SQLite layer (use `autocache::` to omit key; needs OPENVMM_AUTO_CACHE_PATH) + `blob::` HTTP blob (read-only) + : `flat` or `vhd1` + `crypt:::` encrypted disk wrapper + : `xts-aes-256` + `prwrap:` persistent reservations wrapper flags: `ro` open disk as read-only @@ -504,8 +520,17 @@ valid disk kinds: : length of ramdisk, e.g.: `1G` `memdiff:` memory backed diff disk : lower disk, e.g.: `file:base.img` - `file:` file-backed disk + `file:[;create=]` file-backed disk : path to file + `sql:[;create=]` SQLite-backed disk 
(dev/test) + `sqldiff:[;create]:` SQLite diff layer on a backing disk + `blob::` HTTP blob (read-only) + : `flat` or `vhd1` + `crypt:::` encrypted disk wrapper + : `xts-aes-256` + +additional wrapper kinds (e.g., `autocache`, `prwrap`) are also supported; +this list is not exhaustive. flags: `ro` open disk as read-only @@ -518,7 +543,7 @@ flags: /// attach a floppy drive (should be able to be passed multiple times). VM must be generation 1 (no UEFI) /// #[clap(long_help = r#" -e.g: --floppy memdiff:/path/to/disk.vfd,ro +e.g: --floppy memdiff:file:/path/to/disk.vfd,ro syntax: | kind:[,flag,opt=arg,...] @@ -527,8 +552,14 @@ valid disk kinds: : length of ramdisk, e.g.: `1G` `memdiff:` memory backed diff disk : lower disk, e.g.: `file:base.img` - `file:` file-backed disk + `file:[;create=]` file-backed disk : path to file + `sql:[;create=]` SQLite-backed disk (dev/test) + `sqldiff:[;create]:` SQLite diff layer on a backing disk + `blob::` HTTP blob (read-only) + : `flat` or `vhd1` + `crypt:::` encrypted disk wrapper + : `xts-aes-256` flags: `ro` open disk as read-only diff --git a/vm/devices/storage/disk_backend/src/lib.rs b/vm/devices/storage/disk_backend/src/lib.rs index 14c7d57f55..40caebe8a7 100644 --- a/vm/devices/storage/disk_backend/src/lib.rs +++ b/vm/devices/storage/disk_backend/src/lib.rs @@ -1,13 +1,74 @@ // Copyright (c) Microsoft Corporation. // Licensed under the MIT License. -//! Defines the [`Disk`] type, which provides an interface to a block -//! device, used for different disk frontends (such as the floppy disk, IDE, -//! SCSI, or NVMe emulators) as well as direct disk access for other purposes -//! (such as the VMGS file system). +//! The shared disk backend abstraction for OpenVMM storage. //! -//! `Disk`s are backed by a [`DiskIo`] implementation. Specific disk -//! backends should be in their own crates. +//! This crate defines [`Disk`] and the [`DiskIo`] trait, the central +//! 
interface between storage frontends (NVMe, SCSI/StorVSP, IDE) and disk +//! backends (host files, block devices, remote blobs, and more). +//! +//! # Architecture +//! +//! Every disk backend implements [`DiskIo`]. Frontends don't interact with +//! backends directly — they hold a [`Disk`], which wraps a type-erased +//! backend (`DynDisk`, an adapter around [`DiskIo`] that normalizes return +//! futures) behind an `Arc` for cheap, concurrent cloning. The `Disk` +//! wrapper caches immutable metadata (sector size, physical sector size, +//! disk ID, FUA support) at construction time and validates that sector +//! sizes are powers of two and at least 512 bytes. +//! +//! # I/O model +//! +//! All I/O is **async** and uses **scatter-gather** buffers via +//! [`RequestBuffers`]. Callers must pass +//! buffers that are an integral number of sectors. +//! +//! The key operations are: +//! +//! - [`DiskIo::read_vectored`] / [`DiskIo::write_vectored`] — async +//! scatter-gather read and write. The `fua` parameter on writes requests +//! Force Unit Access (write-through to stable storage). Whether FUA is +//! actually respected depends on the backend — check +//! [`DiskIo::is_fua_respected`]. +//! - [`DiskIo::sync_cache`] — flush (equivalent to SCSI SYNCHRONIZE CACHE +//! or NVMe FLUSH). +//! - [`DiskIo::unmap`] — trim / deallocate sectors. The +//! [`DiskIo::unmap_behavior`] method reports whether unmapped sectors +//! become zero, become indeterminate, or whether unmap is ignored +//! entirely. +//! - [`DiskIo::eject`] — eject media (optical drives only). The default +//! returns [`DiskError::UnsupportedEject`]. Eject is a media state change +//! managed by the SCSI DVD layer, not by the backend. +//! - [`DiskIo::wait_resize`] — block until the disk's sector count changes. +//! The default returns [`std::future::pending()`], meaning the backend +//! never signals a resize. Only backends that can detect runtime capacity +//! 
changes (e.g., `BlockDeviceDisk` via Linux uevent, `NvmeDisk` via AEN) +//! should override this. Decorators and layered disks delegate to the +//! inner backend. +//! +//! # Error model +//! +//! All I/O methods return [`DiskError`], which frontends translate into +//! protocol-specific errors (NVMe status codes, SCSI sense keys). The +//! variants cover out-of-range LBAs, I/O errors, medium errors with +//! sub-classification, guest memory access failures, read-only violations, +//! persistent reservation conflicts, and unsupported eject. +//! +//! # Available backends +//! +//! | Backend | Crate | Description | +//! |---------|-------|-------------| +//! | `FileDisk` | `disk_file` | Host file, cross-platform | +//! | `Vhd1Disk` | `disk_vhd1` | VHD1 fixed format | +//! | `VhdmpDisk` | `disk_vhdmp` | Windows vhdmp driver | +//! | `BlobDisk` | `disk_blob` | Read-only HTTP / Azure Blob | +//! | `BlockDeviceDisk` | `disk_blockdevice` | Linux block device (io_uring) | +//! | `NvmeDisk` | `disk_nvme` | Physical NVMe (user-mode driver) | +//! | `StripedDisk` | `disk_striped` | Striped across multiple disks | +//! | `CryptDisk` | `disk_crypt` | XTS-AES-256 encryption wrapper | +//! | `DelayDisk` | `disk_delay` | Injected I/O latency wrapper | +//! | `DiskWithReservations` | `disk_prwrap` | In-memory PR emulation wrapper | +//! | `LayeredDisk` | `disk_layered` | Layered disk with per-sector presence | #![forbid(unsafe_code)] @@ -120,6 +181,13 @@ pub trait DiskIo: 'static + Send + Sync + Inspect { ) -> impl Future> + Send; /// Returns the behavior of the unmap operation. + /// + /// This tells callers what happens to the content of unmapped sectors: + /// + /// - [`UnmapBehavior::Zeroes`] — unmapped sectors read back as zero. + /// - [`UnmapBehavior::Unspecified`] — content may or may not change, and + /// not necessarily to zero. + /// - [`UnmapBehavior::Ignored`] — unmap is a no-op; content is unchanged. 
fn unmap_behavior(&self) -> UnmapBehavior; /// Returns the optimal granularity for unmaps, in sectors. @@ -134,6 +202,11 @@ pub trait DiskIo: 'static + Send + Sync + Inspect { } /// Issues an asynchronous eject media operation to the disk. + /// + /// The default implementation returns [`DiskError::UnsupportedEject`]. + /// Eject is primarily a media state change managed by the SCSI DVD layer + /// (`SimpleScsiDvd`), not by disk backends. Backends generally do not + /// need to override this. fn eject(&self) -> impl Future> + Send { ready(Err(DiskError::UnsupportedEject)) } @@ -164,7 +237,18 @@ pub trait DiskIo: 'static + Send + Sync + Inspect { /// Issues an asynchronous flush operation to the disk. fn sync_cache(&self) -> impl Future> + Send; - /// Waits for the disk sector size to be different than the specified value. + /// Waits for the disk sector count to change from the specified value. + /// + /// Returns the new sector count once [`DiskIo::sector_count`] would return + /// a value different from `sector_count`. Frontends use this to detect + /// runtime capacity changes and notify the guest (NVMe via AEN, SCSI via + /// UNIT_ATTENTION). + /// + /// The default implementation returns [`std::future::pending()`], meaning + /// the disk never signals a resize. Only backends that can detect runtime + /// capacity changes should override this — for example, `BlockDeviceDisk` + /// (via Linux uevent) and `NvmeDisk` (via NVMe AEN). Decorator wrappers + /// and `LayeredDisk` should delegate to the inner disk. fn wait_resize(&self, sector_count: u64) -> impl Future + Send { let _ = sector_count; std::future::pending() @@ -359,20 +443,27 @@ impl Disk { self.0.disk.sync_cache() } - /// Waits for the disk sector size to be different than the specified value. + /// Waits for the disk sector count to change from the specified value. 
pub fn wait_resize(&self, sector_count: u64) -> impl use<'_> + Future { self.0.disk.wait_resize(sector_count) } } -/// The behavior of unmap. +/// The behavior of the [`DiskIo::unmap`] operation. +/// +/// This describes what happens to the content of unmapped sectors. Frontends +/// use this to report the correct behavior to the guest (e.g., SCSI +/// `LBPRZ` bit or NVMe DLFEAT field). #[derive(Clone, Copy, Debug, PartialEq, Eq, Inspect)] pub enum UnmapBehavior { /// Unmap may or may not change the content, and not necessarily to zero. + /// The guest cannot assume anything about the content of unmapped sectors. Unspecified, - /// Unmaps are guaranteed to be ignored. + /// Unmaps are guaranteed to be ignored — the content is unchanged. + /// The disk reports that unmap is not supported. Ignored, - /// Unmap will deterministically zero the content. + /// Unmap will deterministically zero the content. The guest can rely on + /// reading back zeroes from unmapped sectors. Zeroes, } diff --git a/vm/devices/storage/disk_layered/src/lib.rs b/vm/devices/storage/disk_layered/src/lib.rs index d9b015b8ce..ee77ea039b 100644 --- a/vm/devices/storage/disk_layered/src/lib.rs +++ b/vm/devices/storage/disk_layered/src/lib.rs @@ -20,6 +20,30 @@ //! which would be needed for caches that are smaller than the disk. These //! require potentially complicated cache management policies and are probably //! best implemented in a separate disk implementation. +//! +//! # Layer types +//! +//! Each layer implements [`LayerIo`], which is similar to [`DiskIo`] +//! but adds per-sector presence tracking via [`SectorMarker`]. Two concrete +//! layer implementations exist: +//! +//! - **`RamDiskLayer`** (`disklayer_ram`) — ephemeral, in-memory. +//! - **`SqliteDiskLayer`** (`disklayer_sqlite`) — persistent, file-backed +//! (dev/test only). +//! +//! A full [`Disk`] can appear at the bottom of the stack +//! as a fully-present layer via `DiskLayer::from_disk`, which wraps it in +//! 
`DiskAsLayer` — a layer that marks all sectors as present on every read. +//! +//! # Construction and validation +//! +//! [`LayeredDisk::new`] validates the layer stack at construction time: +//! +//! - All layers must have matching sector sizes. +//! - Write-through layers must be contiguous from the top. +//! - The last layer must not be write-through. +//! - Layers used as read caches must support [`WriteNoOverwrite`]. +//! - If the disk is writable, all layers in the write path must be writable. #![forbid(unsafe_code)] diff --git a/vm/devices/storage/floppy_resources/src/lib.rs b/vm/devices/storage/floppy_resources/src/lib.rs index 64b741e526..0ba1493252 100644 --- a/vm/devices/storage/floppy_resources/src/lib.rs +++ b/vm/devices/storage/floppy_resources/src/lib.rs @@ -1,10 +1,11 @@ // Copyright (c) Microsoft Corporation. // Licensed under the MIT License. -//! Client definitions for describing floppy controller configuration. +//! Configuration types for the floppy controller. //! -//! TODO: refactor to support `Resource`-based instantiation of floppy -//! controllers, at which point this crate name makes sense. +//! Resource-based instantiation of floppy controllers is not yet implemented; +//! these types exist in anticipation of that work. The controller is currently +//! instantiated directly as part of the chipset configuration. #![forbid(unsafe_code)] diff --git a/vm/devices/storage/ide/src/lib.rs b/vm/devices/storage/ide/src/lib.rs index 7b40d87a8d..99d8c91226 100644 --- a/vm/devices/storage/ide/src/lib.rs +++ b/vm/devices/storage/ide/src/lib.rs @@ -1,6 +1,33 @@ // Copyright (c) Microsoft Corporation. // Licensed under the MIT License. +//! Legacy PCI/ISA IDE controller emulator (PIIX4-compatible). +//! +//! Emulates the storage portion of an Intel PIIX4 (82371AB) PCI-to-ISA bridge +//! with two IDE channels (primary + secondary), each supporting up to two +//! devices. PCI vendor/device ID: `8086:7111`. +//! +//! # Drive types +//! +//! 
- **ATA hard drives** — use [`Disk`] for I/O. Support +//! PIO and DMA modes, 28-bit and 48-bit LBA, `IDENTIFY DEVICE`, `FLUSH CACHE`. +//! - **ATAPI optical drives** — use `PACKET COMMAND` (0xA0) to transport SCSI +//! CDBs over the ATA interface, delegating to +//! [`SimpleScsiDvd`](scsidisk::scsidvd::SimpleScsiDvd). +//! +//! # Port I/O +//! +//! Primary channel: 0x1F0–0x1F7 + 0x3F6. Secondary: 0x170–0x177 + 0x376. +//! Bus master DMA via PCI BAR4 (PRD scatter-gather table). +//! +//! # Enlightened I/O +//! +//! Microsoft-specific optimization: enlightened INT13 commands via ports +//! 0x1E0 (primary) and 0x160 (secondary). The guest writes an +//! `EnlightenedInt13Command` packet GPA, collapsing the multi-exit register +//! programming sequence into a single VM exit. Uses the `DeferredWrite` +//! pattern for async completion. + #![expect(missing_docs)] #![forbid(unsafe_code)] diff --git a/vm/devices/storage/nvme/src/lib.rs b/vm/devices/storage/nvme/src/lib.rs index e793a72104..6e2c599cac 100644 --- a/vm/devices/storage/nvme/src/lib.rs +++ b/vm/devices/storage/nvme/src/lib.rs @@ -1,7 +1,42 @@ // Copyright (c) Microsoft Corporation. // Licensed under the MIT License. -//! An implementation of an NVMe controller emulator. +//! NVMe controller emulator (NVMe 2.0, NVM command set). +//! +//! This crate emulates an NVMe controller as a PCI device with MMIO BAR0, +//! MSI-X, and admin + I/O queue pairs. It targets the +//! [NVMe Base 2.0](https://nvmexpress.org/specifications/) specification +//! (version register reports 0x00020000) with vendor ID 0x1414 (Microsoft). +//! +//! # Architecture +//! +//! - **PCI layer** ([`NvmeController`]) — MMIO BAR0 register handling, PCI +//! config space, MSI-X interrupt routing, doorbell writes. +//! - **Coordinator** — manages enable/reset sequencing, namespace add/remove. +//! - **Admin worker** — processes admin commands: Identify Controller/Namespace, +//! Create/Delete I/O Queue, Get/Set Features, Async Event Request. +//! 
- **I/O workers** — pool of tasks (one per completion queue) processing NVM +//! commands: READ, WRITE, FLUSH, Dataset Management (TRIM), and persistent +//! reservation commands. +//! +//! # What it doesn't implement +//! +//! Firmware update, admin-level namespace management (create/delete), multi-path +//! I/O, end-to-end data protection (PI), and save/restore (`SaveRestore` +//! returns not-supported). +//! +//! # Namespace management +//! +//! Namespaces can be added and removed at runtime via [`NvmeControllerClient`]. +//! Each namespace wraps a [`Disk`](disk_backend::Disk) and a background task +//! monitors capacity changes via `wait_resize`, completing Async Event Requests +//! with `CHANGED_NAMESPACE_LIST` when the disk size changes. +//! +//! # Key constants +//! +//! - `MAX_DATA_TRANSFER_SIZE`: 256 KB +//! - `MAX_QES`: 256 queue entries +//! - `BAR0_LEN`: 64 KB #![forbid(unsafe_code)] diff --git a/vm/devices/storage/nvme_common/src/lib.rs b/vm/devices/storage/nvme_common/src/lib.rs index 0534568948..f44822c6f8 100644 --- a/vm/devices/storage/nvme_common/src/lib.rs +++ b/vm/devices/storage/nvme_common/src/lib.rs @@ -1,8 +1,9 @@ // Copyright (c) Microsoft Corporation. // Licensed under the MIT License. -//! Common routines for interoperating between [`nvme_spec`] and -//! [`disk_backend`] types. +//! Conversion routines between [`nvme_spec`] and [`disk_backend`] types, +//! primarily for persistent reservation mapping between NVMe and SCSI PR +//! models. #![forbid(unsafe_code)] diff --git a/vm/devices/storage/nvme_resources/src/lib.rs b/vm/devices/storage/nvme_resources/src/lib.rs index fc384ebb5b..a80eef079f 100644 --- a/vm/devices/storage/nvme_resources/src/lib.rs +++ b/vm/devices/storage/nvme_resources/src/lib.rs @@ -2,6 +2,10 @@ // Licensed under the MIT License. //! Resource definitions for NVMe controllers. +//! +//! [`NvmeControllerHandle`] configures the controller with its initial +//! namespaces, MSI-X count, and queue limits. 
[`NvmeControllerRequest`] enables +//! runtime namespace add/remove. #![forbid(unsafe_code)] diff --git a/vm/devices/storage/nvme_spec/src/lib.rs b/vm/devices/storage/nvme_spec/src/lib.rs index 467b99585d..32a65a3500 100644 --- a/vm/devices/storage/nvme_spec/src/lib.rs +++ b/vm/devices/storage/nvme_spec/src/lib.rs @@ -1,7 +1,11 @@ // Copyright (c) Microsoft Corporation. // Licensed under the MIT License. -//! Definitions from the NVMe specifications: +//! NVMe specification definitions (NVMe Base 2.0c and PCIe Transport 1.0c). +//! +//! Provides bitfield structs, command/completion queue entry formats, status +//! codes, and register definitions. The [`nvm`] submodule defines the NVM +//! command set (read, write, flush, DSM, reservations, namespace identification). //! //! Base 2.0c: //! PCIe transport 1.0c: diff --git a/vm/devices/storage/scsi_core/src/lib.rs b/vm/devices/storage/scsi_core/src/lib.rs index ef43c22c47..7cbacbaf52 100644 --- a/vm/devices/storage/scsi_core/src/lib.rs +++ b/vm/devices/storage/scsi_core/src/lib.rs @@ -1,7 +1,15 @@ // Copyright (c) Microsoft Corporation. // Licensed under the MIT License. -//! Core SCSI traits and types. +//! Core SCSI traits and types for the OpenVMM storage stack. +//! +//! Defines the [`AsyncScsiDisk`] trait — the interface between SCSI transport +//! layers (`storvsp`, IDE/ATAPI) and SCSI device implementations +//! (`SimpleScsiDisk`, `SimpleScsiDvd`). Also defines +//! [`Request`], [`ScsiResult`], and save/restore types for SCSI devices. +//! +//! Implementations must fit within [`ASYNC_SCSI_DISK_STACK_SIZE`] or the +//! returned future will be heap-allocated via `StackFuture::from_or_box`. #![forbid(unsafe_code)] diff --git a/vm/devices/storage/scsi_defs/src/lib.rs b/vm/devices/storage/scsi_defs/src/lib.rs index 704942d4c4..e455df85a2 100644 --- a/vm/devices/storage/scsi_defs/src/lib.rs +++ b/vm/devices/storage/scsi_defs/src/lib.rs @@ -1,6 +1,23 @@ // Copyright (c) Microsoft Corporation. 
// Licensed under the MIT License. +//! SCSI protocol definitions: opcodes, status codes, sense data structures, +//! and CDB (Command Descriptor Block) layouts. +//! +//! Based on the SCSI Primary Commands (SPC) and SCSI Block Commands (SBC) +//! specifications from the [T10 committee](https://www.t10.org/). All +//! multi-byte integers use big-endian byte order per the SCSI spec. +//! +//! This crate defines wire-format types only — no I/O logic. Consumers +//! include `scsidisk` (CDB parsing) and `storvsp` (SRB status handling). +//! +//! # Modules +//! +//! - [`srb`] — SRB (SCSI Request Block) types for the Hyper-V SCSI protocol. +//! [`SrbStatus`](srb::SrbStatus) reports command-level status; +//! [`SrbStatusAndFlags`](srb::SrbStatusAndFlags) packs status with additional +//! flags. + #![expect(missing_docs)] #![forbid(unsafe_code)] @@ -557,6 +574,11 @@ pub struct SenseDataHeader { pub additional_sense_length: u8, } +/// Fixed-format SCSI sense data (SPC-4 §4.5.3). +/// +/// Contains the sense key (broad error category), additional sense code (ASC), +/// and additional sense code qualifier (ASCQ) that together identify the +/// specific error condition. Constructed via [`SenseData::new`]. #[repr(C)] #[derive(Debug, Copy, Clone, IntoBytes, Immutable, KnownLayout, FromBytes)] pub struct SenseData { @@ -1025,6 +1047,10 @@ pub struct Cdb16 { pub control: u8, } +/// Flags from a SCSI CDB (Command Descriptor Block) for 10-byte commands. +/// +/// Packs FUA (Force Unit Access), DPO (Disable Page Out), and protection +/// information fields into a single byte. #[bitfield(u8)] #[derive(IntoBytes, Immutable, KnownLayout, FromBytes)] pub struct CdbFlags { diff --git a/vm/devices/storage/scsi_defs/src/srb.rs b/vm/devices/storage/scsi_defs/src/srb.rs index cca8bc2d66..c65b64c63f 100644 --- a/vm/devices/storage/scsi_defs/src/srb.rs +++ b/vm/devices/storage/scsi_defs/src/srb.rs @@ -1,6 +1,12 @@ // Copyright (c) Microsoft Corporation. // Licensed under the MIT License. +//! 
SRB (SCSI Request Block) types for the Hyper-V SCSI protocol. +//! +//! [`SrbStatus`] reports command-level completion status (success, error, +//! aborted, etc.). [`SrbStatusAndFlags`] packs the status with additional +//! flags (autosense valid, queue frozen, etc.) into a single byte. + use bitfield_struct::bitfield; use open_enum::open_enum; use zerocopy::FromBytes; diff --git a/vm/devices/storage/scsidisk/src/lib.rs b/vm/devices/storage/scsidisk/src/lib.rs index bb8a3035ca..af06d3fb77 100644 --- a/vm/devices/storage/scsidisk/src/lib.rs +++ b/vm/devices/storage/scsidisk/src/lib.rs @@ -1,6 +1,31 @@ // Copyright (c) Microsoft Corporation. // Licensed under the MIT License. +//! SCSI CDB parser and disk/DVD emulation. +//! +//! This crate translates SCSI commands (CDBs) into [`DiskIo`](disk_backend::DiskIo) +//! calls. It's used by `storvsp` for hard drives and by `ide` (via ATAPI) +//! for optical drives. It doesn't implement the SCSI transport — that's the +//! frontend's job. +//! +//! # Key types +//! +//! - [`SimpleScsiDisk`] — hard drive emulation. Implements +//! [`AsyncScsiDisk`], holds a +//! [`Disk`], and parses SCSI CDB opcodes. Handles +//! READ/WRITE (6/10/12/16), READ_CAPACITY, INQUIRY, MODE_SENSE, UNMAP, +//! WRITE_SAME, SYNCHRONIZE_CACHE, and PERSISTENT_RESERVE. +//! - [`SimpleScsiDvd`](scsidvd::SimpleScsiDvd) — optical drive emulation. +//! Manages media state (`Loaded` / `Unloaded`), handles MMC optical commands +//! (GET_EVENT_STATUS, GET_CONFIGURATION, READ_TOC, START_STOP_UNIT for eject). +//! +//! # Capacity change detection +//! +//! On every SCSI command, `SimpleScsiDisk` checks the current sector count +//! against the last-known value. If the disk resized, it returns +//! UNIT_ATTENTION with CAPACITY_DATA_CHANGED. The guest retries and re-reads +//! capacity. 
+
 #![expect(missing_docs)]
 #![forbid(unsafe_code)]

diff --git a/vm/devices/storage/storvsp/src/lib.rs b/vm/devices/storage/storvsp/src/lib.rs
index a68b2bdfe5..1f6f2a1c6d 100644
--- a/vm/devices/storage/storvsp/src/lib.rs
+++ b/vm/devices/storage/storvsp/src/lib.rs
@@ -1,6 +1,41 @@
 // Copyright (c) Microsoft Corporation.
 // Licensed under the MIT License.

+//! VMBus SCSI controller emulator (StorVSP).
+//!
+//! StorVSP implements the Hyper-V synthetic SCSI protocol — a VMBus-based
+//! transport that carries SCSI CDBs between the guest's `storvsc` driver and
+//! the VMM. This is not a standard SCSI transport (like iSCSI or SAS); it's a
+//! Hyper-V-specific wire format defined in [`storvsp_protocol`].
+//!
+//! # Architecture
+//!
+//! The crate uses a multi-worker model. The primary VMBus channel handles
+//! protocol version negotiation (Win6 through Blue); sub-channels process I/O
+//! in parallel. Each worker owns a VMBus ring and processes packets
+//! concurrently via `FuturesUnordered`.
+//!
+//! StorVSP handles the transport (ring buffer management, GPADL setup, packet
+//! framing, sub-channel lifecycle) and a few SCSI control commands directly
+//! (`REPORT_LUNS`, `INQUIRY` for absent targets). All actual I/O is delegated
+//! to [`AsyncScsiDisk`] implementations — StorVSP
+//! never interprets SCSI data CDBs itself.
+//!
+//! # Key types
+//!
+//! - [`StorageDevice`] — the VMBus device. Implements `VmbusDevice` and
+//! `SaveRestoreVmbusDevice`.
+//! - [`ScsiController`] — manages attached disks by [`ScsiPath`]. Supports
+//! runtime attach/remove.
+//! - [`ScsiControllerDisk`] — wraps `Arc`.
+//!
+//! # Performance
+//!
+//! Poll-mode optimization: when pending I/O count exceeds
+//! `poll_mode_queue_depth`, the worker switches from interrupt-driven to
+//! busy-poll for new requests, reducing guest exit frequency. Future storage
+//! for SCSI request processing is pooled to avoid allocation on the hot path.
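The poll-mode switch in the `storvsp` crate docs comes down to a threshold comparison. A minimal sketch, assuming the decision depends only on the pending I/O count; `WaitMode` and `choose_wait_mode` are hypothetical names, and the real worker may apply further conditions.

```rust
/// Hypothetical description of how a worker waits for new ring packets.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WaitMode {
    /// Sleep until the guest signals the channel.
    Interrupt,
    /// Keep checking the ring without waiting for a signal.
    Poll,
}

/// Pick the wait mode from the number of in-flight requests.
/// "Exceeds" is read as a strict comparison, so a count equal to
/// `poll_mode_queue_depth` stays interrupt-driven.
pub fn choose_wait_mode(pending_io: usize, poll_mode_queue_depth: usize) -> WaitMode {
    if pending_io > poll_mode_queue_depth {
        WaitMode::Poll
    } else {
        WaitMode::Interrupt
    }
}
```

The idea is that a busy queue is likely to receive more requests soon, so polling avoids the cost of the guest signaling the host for each one.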
+
 #![expect(missing_docs)]
 #![forbid(unsafe_code)]

diff --git a/vm/devices/storage/storvsp_protocol/src/lib.rs b/vm/devices/storage/storvsp_protocol/src/lib.rs
index 9f0cf7d0d6..c721710230 100644
--- a/vm/devices/storage/storvsp_protocol/src/lib.rs
+++ b/vm/devices/storage/storvsp_protocol/src/lib.rs
@@ -1,6 +1,11 @@
 // Copyright (c) Microsoft Corporation.
 // Licensed under the MIT License.

+//! Wire-format definitions for the Hyper-V SCSI VMBus protocol.
+//!
+//! Defines packet structures, interface GUIDs (`SCSI_INTERFACE_ID`,
+//! `IDE_ACCELERATOR_INTERFACE_ID`), and protocol version negotiation types.
+
 #![expect(missing_docs)]
 #![forbid(unsafe_code)]

diff --git a/vm/devices/storage/storvsp_resources/src/lib.rs b/vm/devices/storage/storvsp_resources/src/lib.rs
index 707fe4b049..c8687ab174 100644
--- a/vm/devices/storage/storvsp_resources/src/lib.rs
+++ b/vm/devices/storage/storvsp_resources/src/lib.rs
@@ -1,7 +1,11 @@
 // Copyright (c) Microsoft Corporation.
 // Licensed under the MIT License.

-//! Resource definitions for storvsp.
+//! Resource definitions for the StorVSP SCSI controller.
+//!
+//! [`ScsiControllerHandle`] configures the controller with its initial devices,
+//! instance ID, and queue depth. [`ScsiControllerRequest`] enables runtime
+//! device add/remove.

 #![forbid(unsafe_code)]