Document the storage stack: disk backends and the frontend-to-backend pipeline
Summary
PR #2932 adds documentation for the VTL2 storage translation and settings model — how OpenHCL maps backing devices onto guest-visible controllers. That covers the outer shell: what the guest sees, what OpenHCL is offered, and how the configuration surface connects them.
What it does not cover is the inside of that shell: the disk backend abstraction, the concrete disk backends, the layered disk model, and the path from a storage frontend (NVMe, StorVSP, IDE) through the SCSI adapter down to a disk backend. That is the scope of this issue.
This should be written so it is useful for both OpenVMM and OpenHCL contexts, since the same DiskIo trait, the same backends, and the same frontend implementations are shared.
What should be documented
The DiskIo trait and the Disk wrapper
The central abstraction is the DiskIo trait in vm/devices/storage/disk_backend/src/lib.rs. Every disk backend implements it. The key operations are:
- read_vectored / write_vectored (async, scatter-gather)
- sync_cache (flush)
- unmap (TRIM / deallocate)
- capacity and sector size queries
- optional persistent reservation support
The Disk struct wraps Arc<dyn DynDisk> for cheap concurrent cloning. This is what frontends hold.
The doc should explain the trait, the wrapper, and the design choices (async, scatter-gather, FUA, sector-aligned I/O).
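To make the shape of the abstraction concrete, here is a deliberately simplified sketch. The real DiskIo trait is async and operates on scatter-gather request buffers; the trait, type names, and error type below are illustrative stand-ins, not the disk_backend API.

```rust
// Hypothetical, simplified sketch of a DiskIo-style backend trait.
// The real trait is async and uses scatter-gather buffers; this
// synchronous version only illustrates the shape of the contract.
trait SimpleDiskIo {
    /// Total capacity in sectors.
    fn sector_count(&self) -> u64;
    /// Sector size in bytes (commonly 512 or 4096).
    fn sector_size(&self) -> u32;
    /// Read whole sectors starting at `sector` into `buf`.
    fn read(&self, sector: u64, buf: &mut [u8]) -> Result<(), String>;
    /// Write whole sectors starting at `sector` from `buf`.
    fn write(&mut self, sector: u64, buf: &[u8]) -> Result<(), String>;
    /// Flush any cached writes to durable storage.
    fn sync_cache(&mut self) -> Result<(), String>;
}

/// A trivial in-memory backend, analogous in spirit to a RAM disk.
struct MemDisk {
    sectors: Vec<u8>,
    sector_size: u32,
}

impl SimpleDiskIo for MemDisk {
    fn sector_count(&self) -> u64 {
        (self.sectors.len() / self.sector_size as usize) as u64
    }
    fn sector_size(&self) -> u32 {
        self.sector_size
    }
    fn read(&self, sector: u64, buf: &mut [u8]) -> Result<(), String> {
        let off = sector as usize * self.sector_size as usize;
        buf.copy_from_slice(&self.sectors[off..off + buf.len()]);
        Ok(())
    }
    fn write(&mut self, sector: u64, buf: &[u8]) -> Result<(), String> {
        let off = sector as usize * self.sector_size as usize;
        self.sectors[off..off + buf.len()].copy_from_slice(buf);
        Ok(())
    }
    fn sync_cache(&mut self) -> Result<(), String> {
        Ok(()) // nothing is buffered for an in-memory disk
    }
}
```

Note that all I/O is expressed in whole sectors; the sector-aligned contract is what lets frontends translate guest block protocols onto any backend uniformly.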
Storage frontends
How each frontend consumes a Disk:
| Frontend | Protocol | Transport | Crate |
|---|---|---|---|
| NVMe | NVMe 2.0 | PCI MMIO + MSI-X | nvme/ |
| StorVSP | SCSI CDB over VMBus | VMBus ring buffers | storvsp/ |
| IDE | ATA/ATAPI | PCI/ISA I/O ports + DMA | ide/ |
The doc should cover the data flow from guest I/O to DiskIo method call. The SCSI path is the most interesting because it goes through two layers:
- StorVSP: dequeues SCSI requests from the VMBus ring
- SimpleScsiDisk (scsidisk/): parses CDB opcodes and translates them to DiskIo calls
NVMe is simpler: the NVMe controller's namespace directly holds a Disk and calls into it.
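The CDB-parsing step is the crux of the SCSI path. The following sketch shows the kind of translation SimpleScsiDisk performs; the opcode values come from the SCSI spec, but the enum, function, and error handling are illustrative, not the scsidisk/ implementation.

```rust
// Hypothetical sketch of the SCSI-to-disk translation step: inspect
// the CDB opcode and turn it into the disk-backend operation it
// implies. Opcode constants are from the SCSI command set; the rest
// of the names here are illustrative.
const READ_10: u8 = 0x28;
const WRITE_10: u8 = 0x2a;
const SYNCHRONIZE_CACHE_10: u8 = 0x35;
const UNMAP: u8 = 0x42;

enum DiskOp {
    Read { lba: u64, blocks: u32 },
    Write { lba: u64, blocks: u32 },
    Flush,
    Unmap,
}

/// Translate a READ(10)/WRITE(10)-class CDB into a disk operation.
fn translate_cdb(cdb: &[u8]) -> Result<DiskOp, String> {
    match cdb[0] {
        READ_10 | WRITE_10 => {
            // READ(10)/WRITE(10): 32-bit big-endian LBA at bytes 2..6,
            // 16-bit transfer length (in blocks) at bytes 7..9.
            let lba = u32::from_be_bytes(cdb[2..6].try_into().unwrap()) as u64;
            let blocks = u16::from_be_bytes(cdb[7..9].try_into().unwrap()) as u32;
            Ok(if cdb[0] == READ_10 {
                DiskOp::Read { lba, blocks }
            } else {
                DiskOp::Write { lba, blocks }
            })
        }
        SYNCHRONIZE_CACHE_10 => Ok(DiskOp::Flush),
        UNMAP => Ok(DiskOp::Unmap),
        op => Err(format!("unsupported opcode {op:#x}")),
    }
}
```

In the real adapter, each DiskOp-like result becomes a call on the Disk handle (read_vectored, write_vectored, sync_cache, unmap), with the guest's data buffers passed through as scatter-gather lists.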
Concrete disk backends
All the backends that implement DiskIo:
| Backend | What it wraps | Crate |
|---|---|---|
| FileDisk | Host file (read-write-at) | disk_file/ |
| Vhd1Disk | VHD1 fixed format | disk_vhd1/ |
| VhdmpDisk | Windows VHD via vhdmp driver | disk_vhdmp/ |
| BlobDisk | Remote HTTP/Azure blob (read-only) | disk_blob/ |
| BlockDeviceDisk | Linux block device (io_uring) | disk_blockdevice/ |
| NvmeDisk | Physical NVMe via VFIO | disk_nvme/ |
| StripedDisk | Data striped across multiple backends | disk_striped/ |
Wrapping backends (decorators)
Backends that wrap another Disk and transform I/O:
| Wrapper | Transform | Crate |
|---|---|---|
| CryptDisk | XTS-AES256 encryption | disk_crypt/ |
| DelayDisk | Injected I/O latency | disk_delay/ |
| DiskWithReservations | In-memory persistent reservations | disk_prwrap/ |
The wrapping pattern is important to document because it is how features compose without modifying backends.
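A minimal sketch of the pattern, under stated assumptions: the Disk trait here is a stand-in for DiskIo, and the XOR "cipher" is a stand-in for the XTS-AES256 transform in disk_crypt. The point is only the shape: the wrapper implements the same trait as a concrete backend and delegates to an inner disk, transforming data in transit.

```rust
// Hypothetical sketch of the wrapping/decorator pattern. `Disk` and
// the XOR transform are illustrative stand-ins; the real CryptDisk
// wraps the DiskIo trait and uses XTS-AES256.
use std::collections::HashMap;

trait Disk {
    fn read(&self, sector: u64) -> Vec<u8>;
    fn write(&mut self, sector: u64, data: &[u8]);
}

/// A trivial concrete backend for the example.
struct RamDisk {
    sectors: HashMap<u64, Vec<u8>>,
}

impl Disk for RamDisk {
    fn read(&self, sector: u64) -> Vec<u8> {
        self.sectors
            .get(&sector)
            .cloned()
            .unwrap_or_else(|| vec![0; 512])
    }
    fn write(&mut self, sector: u64, data: &[u8]) {
        self.sectors.insert(sector, data.to_vec());
    }
}

/// Decorator: implements the same trait, transforms data, delegates
/// everything else to `inner`.
struct XorCryptDisk<D: Disk> {
    inner: D,
    key: u8,
}

impl<D: Disk> Disk for XorCryptDisk<D> {
    fn read(&self, sector: u64) -> Vec<u8> {
        // Decrypt on the way out.
        self.inner.read(sector).iter().map(|b| b ^ self.key).collect()
    }
    fn write(&mut self, sector: u64, data: &[u8]) {
        // Encrypt on the way in.
        let enc: Vec<u8> = data.iter().map(|b| b ^ self.key).collect();
        self.inner.write(sector, &enc);
    }
}
```

Because the wrapper and the backend are interchangeable behind the same trait, wrappers compose freely: encryption over delay injection over any concrete backend, with no backend aware it is wrapped.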
The layered disk model
disk_layered/ is its own subsystem. A layered disk stacks multiple layers with read-through and optional write-through semantics:
- Each layer implements LayerIo (similar to DiskIo but tracks sector presence via a bitmap)
- Reads check layers top-to-bottom; the first layer with the requested sectors wins
- Writes go to the topmost writable layer
- Layers can optionally cache read-miss data from below (read_cache)
- Layers can optionally write through to the next layer (write_through)
The two concrete layer implementations today are:
- RamDiskLayer (disklayer_ram/) — ephemeral, fast
- SqliteDiskLayer (disklayer_sqlite/) — persistent, portable
This is what powers the memdiff: disk configuration in the CLI. It deserves its own section because the bitmap-based presence tracking and the layer configuration model are not obvious from reading a single file.
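The read-through and write semantics can be sketched as follows. This is an illustrative model, not the disk_layered API: the real LayerIo tracks presence with a bitmap and supports read_cache/write_through, while here a HashSet stands in for the bitmap and those options are omitted.

```rust
// Hypothetical sketch of the layered read/write path. A HashSet
// stands in for the real per-layer presence bitmap; names are
// illustrative, not the disk_layered API.
use std::collections::{HashMap, HashSet};

struct Layer {
    present: HashSet<u64>,       // sectors this layer holds
    data: HashMap<u64, Vec<u8>>, // sector -> contents
}

struct LayeredDisk {
    /// layers[0] is the topmost (writable) layer.
    layers: Vec<Layer>,
}

impl LayeredDisk {
    /// Read-through: the first layer (top-to-bottom) that has the
    /// sector wins.
    fn read(&self, sector: u64) -> Vec<u8> {
        for layer in &self.layers {
            if layer.present.contains(&sector) {
                return layer.data[&sector].clone();
            }
        }
        vec![0; 512] // no layer has it: reads as zeros
    }

    /// Writes land in the topmost layer and mark the sector present
    /// there, shadowing any copy held by lower layers.
    fn write(&mut self, sector: u64, data: Vec<u8>) {
        let top = &mut self.layers[0];
        top.present.insert(sector);
        top.data.insert(sector, data);
    }
}
```

This is exactly the memdiff shape: a RAM layer on top absorbs writes, while unwritten sectors keep resolving to the backing layer below.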
Resolver integration
The doc should show how the resolver pattern connects configuration to concrete backends. The storage resolver chain is the best example of recursive resolution in the codebase:
- NVMe controller → resolves each namespace's disk
- Layered disk → resolves each layer in parallel
- Each layer or backend → resolves to a concrete DiskIo implementation
This ties back to the resolver documentation (separate issue) but deserves a storage-specific section showing the concrete resolution flow.
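The recursive shape can be sketched with a toy configuration tree. All names here (DiskHandle, ResolvedDisk, resolve) are illustrative stand-ins for the resolver machinery, not its API; the point is that composite handles resolve their children before producing a concrete disk.

```rust
// Hypothetical sketch of recursive storage resolution: a handle
// (configuration) resolves to a concrete disk, and composite handles
// (wrapped, layered) recurse into their children first. This mirrors
// the shape of the resolver chain, not its actual types.
enum DiskHandle {
    File { path: String },
    Ram { len: u64 },
    Crypt { inner: Box<DiskHandle> },
    Layered { layers: Vec<DiskHandle> },
}

/// Stand-in for a resolved, concrete backend.
#[derive(Debug, PartialEq)]
enum ResolvedDisk {
    File(String),
    Ram(u64),
    Crypt(Box<ResolvedDisk>),
    Layered(Vec<ResolvedDisk>),
}

fn resolve(handle: &DiskHandle) -> ResolvedDisk {
    match handle {
        // Leaf handles resolve directly to a backend.
        DiskHandle::File { path } => ResolvedDisk::File(path.clone()),
        DiskHandle::Ram { len } => ResolvedDisk::Ram(*len),
        // Composite handles recurse into their children.
        DiskHandle::Crypt { inner } => {
            ResolvedDisk::Crypt(Box::new(resolve(inner)))
        }
        DiskHandle::Layered { layers } => {
            ResolvedDisk::Layered(layers.iter().map(resolve).collect())
        }
    }
}
```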
Online disk resize
Online disk resize is an interesting cross-cutting concern because the behavior differs by frontend, backend, and OpenHCL vs standalone context.
Frontend notification mechanisms:
| Frontend | Resize notification | How it works |
|---|---|---|
| NVMe | AEN (Asynchronous Event Notification) | Background task calls disk.wait_resize() per namespace; on change, completes a queued AER command with CHANGED_NAMESPACE_LIST |
| StorVSP/SCSI | UNIT_ATTENTION sense key | On the next SCSI command after a resize, SimpleScsiDisk detects the capacity change and returns UNIT_ATTENTION; guest retries and re-reads capacity |
| IDE | Not supported | IDE has no standardized capacity-change notification |
Backend wait_resize support:
The DiskIo trait has a wait_resize method that defaults to pending() (never completes). Only backends that can detect runtime capacity changes override it:
| Backend | wait_resize | How |
|---|---|---|
| disk_blockdevice | ✅ Event-driven | Linux uevent listener for block device resize events |
| disk_nvme | ✅ Event-driven | NVMe driver monitors AENs from the physical controller; rescans namespaces to detect capacity changes |
| disk_file | ❌ Default (pending) | No file-change monitoring |
| disk_vhd1 | ❌ Default (pending) | Fixed-size format |
| disk_blob | ❌ Default (pending) | Remote blob, no resize |
| disk_layered | ✅ Delegates | Delegates to the bottom-most layer |
| Wrappers (crypt, delay, prwrap) | ✅ Delegates | Forward to the inner disk |
OpenHCL vs standalone:
In OpenHCL, the resize path is the same as standalone: disk_blockdevice detects the uevent from the host, wait_resize completes, and the NVMe or SCSI frontend notifies the VTL0 guest through the standard mechanism. There is no special paravisor-level resize interception.
The doc should explain this end-to-end flow and make clear which backends actually support it, since a contributor attaching a disk_file backend and expecting runtime resize will be confused when nothing happens.
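The contract can be illustrated with a synchronous, polled stand-in for the real async method (whose default is a future that stays pending forever). The trait and type names below are hypothetical; only the default-never behavior and the override-on-capable-backends split mirror the source.

```rust
// Hypothetical, polled sketch of the wait_resize contract. The real
// method is async and defaults to pending() (never completes); a
// "Never" result here stands in for that forever-pending future.
#[derive(Debug, PartialEq)]
enum Resize {
    /// Backend cannot observe capacity changes (the default).
    Never,
    /// Capacity changed; carries the new sector count.
    Changed(u64),
}

trait ResizeWatch {
    fn poll_resize(&mut self) -> Resize {
        Resize::Never // default: equivalent to a forever-pending future
    }
}

/// A fixed-format backend (e.g. a VHD1-style disk) keeps the default.
struct FixedDisk;
impl ResizeWatch for FixedDisk {}

/// A block-device-style backend that can observe resize events
/// (modeled here as a capacity field updated out of band).
struct GrowableDisk {
    last_seen: u64,
    current: u64,
}
impl ResizeWatch for GrowableDisk {
    fn poll_resize(&mut self) -> Resize {
        if self.current != self.last_seen {
            self.last_seen = self.current;
            Resize::Changed(self.current)
        } else {
            Resize::Never
        }
    }
}
```

A frontend's background task is the consumer of this contract: when the wait completes with a new capacity, it raises the protocol-appropriate notification (NVMe AEN, SCSI UNIT_ATTENTION) toward the guest.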
RAM disk (mem:<len>)
The CLI supports standalone RAM disks via mem:<len> (e.g., --disk mem:1G). This is distinct from memdiff:<disk>, which stacks a RAM layer on top of a backing disk.
Under the hood, mem:<len> creates a RamDiskLayerHandle { len: Some(len) } wrapped in a single-layer LayeredDiskHandle. So even a "standalone" RAM disk is actually the layered disk machinery with one layer.
memdiff:<disk> creates a RamDiskLayerHandle { len: None } (sized from the backing disk) stacked on top of the inner disk. Writes go to the RAM layer; reads fall through to the backing disk for sectors not yet written.
The doc should explain this because the CLI surface (mem: vs memdiff:) hides the underlying layered disk model, and contributors reading the code will see RamDiskLayerHandle in both cases and wonder what the difference is.
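The difference boils down to two configurations of the same machinery. The struct names below echo the real handles, but their definitions and the helper functions are illustrative stand-ins for how the CLI might assemble them.

```rust
// Hypothetical sketch of how mem:<len> and memdiff:<disk> differ only
// in how the RAM layer is configured. Struct names echo the real
// handles; the definitions and helpers here are stand-ins.
#[derive(Debug, PartialEq)]
struct RamDiskLayerHandle {
    /// Some(len): standalone RAM disk with an explicit size.
    /// None: size is inherited from the layer/disk below.
    len: Option<u64>,
}

#[derive(Debug, PartialEq)]
enum LayerHandle {
    Ram(RamDiskLayerHandle),
    /// Stand-in for a resolved backing disk (e.g. a file path).
    Backing(String),
}

#[derive(Debug, PartialEq)]
struct LayeredDiskHandle {
    /// Topmost layer first.
    layers: Vec<LayerHandle>,
}

/// mem:<len> -> a single RAM layer with an explicit size.
fn mem(len: u64) -> LayeredDiskHandle {
    LayeredDiskHandle {
        layers: vec![LayerHandle::Ram(RamDiskLayerHandle { len: Some(len) })],
    }
}

/// memdiff:<disk> -> an unsized RAM layer stacked over the backing disk.
fn memdiff(backing: &str) -> LayeredDiskHandle {
    LayeredDiskHandle {
        layers: vec![
            LayerHandle::Ram(RamDiskLayerHandle { len: None }),
            LayerHandle::Backing(backing.to_string()),
        ],
    }
}
```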
Virtual optical / DVD
The storage stack supports virtual DVD/CD-ROM drives, which have a different model from disk devices.
How it works:
- SimpleScsiDvd (in scsidisk/src/scsidvd/) implements AsyncScsiDisk and handles optical-specific SCSI commands: GET_EVENT_STATUS_NOTIFICATION, GET_CONFIGURATION, START_STOP_UNIT (eject), media change events, and the standard read path.
- The GuestMedia enum (in ide_resources/) distinguishes GuestMedia::Dvd from GuestMedia::Disk. DVD wraps a SimpleScsiDvdHandle which holds a Resource<ScsiDeviceHandleKind>, while Disk wraps a Resource<DiskHandleKind>.
- Eject is supported: the DiskIo trait has an eject() method (defaults to UnsupportedEject), and SimpleScsiDvd handles the SCSI START_STOP_UNIT command with the load/eject flag. Once ejected, the media is permanently removed for the lifetime of the VM.
Frontend support:
| Frontend | DVD support |
|---|---|
| StorVSP/SCSI | ✅ Via SimpleScsiDvd |
| IDE | ✅ Via AtapiDrive wrapping SimpleScsiDvd through ATAPI |
| NVMe | ❌ Explicitly rejected ("dvd not supported with nvme") |
CLI surface:
DVD is specified with the dvd flag on --disk or --ide:
- --disk file:my.iso,dvd → SCSI optical drive
- --ide file:my.iso,dvd → IDE optical drive (ATAPI)
The dvd flag implicitly sets read_only = true.
The doc should cover the DVD model because it is a common source of confusion: the guest media enum, the SCSI-vs-ATAPI layering, why NVMe rejects DVD, and how eject works.
Where this belongs
This is architecture reference content. I think it belongs as a new page or set of pages under the architecture section, near the existing OpenVMM and OpenHCL architecture pages. It should cross-link to:
- the storage translation page from PR #2932 ("Guide: add OpenHCL VMBus relay and storage architecture pages") for the OpenHCL settings model
- the resolver docs (for how backends are wired up)
- the NVMe and StorVSP device pages (once those exist)
Possible locations:
- Guide/src/reference/architecture/openvmm/storage.md — for the shared storage pipeline
- Or a section under Guide/src/reference/devices/ if we want it closer to device docs
I lean toward the architecture section since this is about the internal pipeline, not about a single device.
What should be rustdoc vs Guide
| Content | Location |
|---|---|
| DiskIo trait semantics, method contracts, scatter-gather model | Rustdoc on disk_backend |
| wait_resize method contract and default behavior | Rustdoc on disk_backend |
| LayerIo trait, bitmap semantics, layer configuration | Rustdoc on disk_layered |
| Per-backend implementation notes (e.g., VHD1 format details, io_uring usage) | Rustdoc on each backend crate |
| Architecture overview: how frontends, the SCSI adapter, and backends connect | Guide page |
| Data flow diagrams (guest I/O → frontend → SCSI → DiskIo → backend) | Guide page |
| Layered disk model explanation with examples | Guide page |
| RAM disk vs memdiff CLI semantics | Guide page |
| Online disk resize: which backends support it, how frontends notify the guest | Guide page |
| Virtual optical / DVD: GuestMedia model, eject, frontend support matrix | Guide page |
| Backend catalog (what exists, when to use each) | Guide page |
| Wrapping/decorator pattern explanation | Guide page |
| Resolver integration for storage | Guide page |
Goals
- A contributor can understand how a guest I/O request flows from the frontend to the disk backend
- A contributor adding a new disk backend knows what to implement and how it gets wired up
- The layered disk model is explained clearly enough that someone can reason about layer composition without reading the implementation
- The backend catalog gives a quick reference for what exists and when each is appropriate
Non-goals
- Documenting the VTL2 storage translation settings model (covered by PR #2932, "Guide: add OpenHCL VMBus relay and storage architecture pages")
- Documenting every SCSI CDB opcode that SimpleScsiDisk handles
- Redesigning the storage stack
- Performance tuning guidance (stack futures, buffer reuse, etc. are implementation details for now)
Rough implementation plan
- Write a Guide page covering the storage pipeline: DiskIo trait, frontends, the SCSI adapter, backends, wrappers, layered disks
- Include data flow diagrams (mermaid) for the NVMe path and the StorVSP/SCSI path
- Add a backend catalog table
- Add a section on the layered disk model with a worked example
- Expand rustdoc on disk_backend and disk_layered crate-level docs
- Cross-link to the storage translation page from PR #2932 and to the resolver docs