Document the storage stack: disk backends and the frontend-to-backend pipeline #2939

@mattkur

Description

Summary

PR #2932 adds documentation for the VTL2 storage translation and settings model — how OpenHCL maps backing devices onto guest-visible controllers. That covers the outer shell: what the guest sees, what OpenHCL is offered, and how the configuration surface connects them.

What it does not cover is the inside of that shell: the disk backend abstraction, the concrete disk backends, the layered disk model, and the path from a storage frontend (NVMe, StorVSP, IDE) through the SCSI adapter down to a disk backend. That is the scope of this issue.

This should be written so it is useful for both OpenVMM and OpenHCL contexts, since the same DiskIo trait, the same backends, and the same frontend implementations are shared.


What should be documented

The DiskIo trait and the Disk wrapper

The central abstraction is the DiskIo trait in vm/devices/storage/disk_backend/src/lib.rs. Every disk backend implements it. The key operations are:

  • read_vectored / write_vectored (async, scatter-gather)
  • sync_cache (flush)
  • unmap (TRIM / deallocate)
  • capacity and sector size queries
  • optional persistent reservation support

The Disk struct wraps Arc<dyn DynDisk> for cheap concurrent cloning. This is what frontends hold.

The doc should explain the trait, the wrapper, and the design choices (async, scatter-gather, FUA, sector-aligned I/O).
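To make the shape concrete, here is a deliberately simplified, hypothetical sketch of the trait-plus-wrapper design (the real DiskIo is async with scatter-gather buffers; all names and signatures here are illustrative, not the actual disk_backend API):

```rust
use std::sync::{Arc, Mutex};

// Illustrative stand-in for DiskIo. The real trait is async and vectored.
trait SimpleDiskIo: Send + Sync {
    fn sector_count(&self) -> u64;
    fn sector_size(&self) -> u32;
    fn read(&self, sector: u64, buf: &mut [u8]);
    fn write(&self, sector: u64, buf: &[u8]);
    fn flush(&self) {} // stands in for sync_cache
}

/// Cheap-to-clone handle mirroring Disk's Arc<dyn DynDisk> design: frontends
/// hold clones and issue concurrent I/O against the same backend.
#[derive(Clone)]
struct SimpleDisk(Arc<dyn SimpleDiskIo>);

/// Trivial RAM-backed backend for the sketch.
struct RamBacked {
    data: Mutex<Vec<u8>>,
    sector_size: u32,
}

impl RamBacked {
    fn new(sectors: u64, sector_size: u32) -> Self {
        RamBacked {
            data: Mutex::new(vec![0; (sectors * sector_size as u64) as usize]),
            sector_size,
        }
    }
}

impl SimpleDiskIo for RamBacked {
    fn sector_count(&self) -> u64 {
        self.data.lock().unwrap().len() as u64 / self.sector_size as u64
    }
    fn sector_size(&self) -> u32 {
        self.sector_size
    }
    fn read(&self, sector: u64, buf: &mut [u8]) {
        let off = (sector * self.sector_size as u64) as usize;
        buf.copy_from_slice(&self.data.lock().unwrap()[off..off + buf.len()]);
    }
    fn write(&self, sector: u64, buf: &[u8]) {
        let off = (sector * self.sector_size as u64) as usize;
        self.data.lock().unwrap()[off..off + buf.len()].copy_from_slice(buf);
    }
}
```

The Arc-based wrapper is the part worth highlighting in the doc: cloning the handle is cheap, and every clone reaches the same backend state.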

Storage frontends

How each frontend consumes a Disk:

| Frontend | Protocol | Transport | Crate |
| --- | --- | --- | --- |
| NVMe | NVMe 2.0 | PCI MMIO + MSI-X | nvme/ |
| StorVSP | SCSI CDB over VMBus | VMBus ring buffers | storvsp/ |
| IDE | ATA/ATAPI | PCI/ISA I/O ports + DMA | ide/ |

The doc should cover the data flow from guest I/O to DiskIo method call. The SCSI path is the most interesting because it goes through two layers:

  1. StorVSP: dequeues SCSI requests from the VMBus ring
  2. SimpleScsiDisk (scsidisk/): parses CDB opcodes and translates them to DiskIo calls

NVMe is simpler: the NVMe controller's namespace directly holds a Disk and calls into it.
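The CDB-translation step in the SCSI path can be sketched like this (a hypothetical helper, not the actual SimpleScsiDisk code; only the READ(10) field offsets are taken from the SCSI command layout):

```rust
/// Parse a SCSI READ(10) CDB into a (starting sector, sector count) pair
/// that a DiskIo-style read could consume.
fn parse_read10(cdb: &[u8; 10]) -> Option<(u64, u32)> {
    const READ_10: u8 = 0x28; // READ(10) opcode
    if cdb[0] != READ_10 {
        return None; // other opcodes dispatch to other handlers
    }
    // Bytes 2..=5: logical block address, big-endian.
    let lba = u32::from_be_bytes([cdb[2], cdb[3], cdb[4], cdb[5]]) as u64;
    // Bytes 7..=8: transfer length in sectors, big-endian.
    let len = u16::from_be_bytes([cdb[7], cdb[8]]) as u32;
    Some((lba, len))
}
```

The real scsidisk/ crate handles many more opcodes (capacity, mode sense, unmap, reservations), but every data command follows this parse-then-translate shape.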

Concrete disk backends

All the backends that implement DiskIo:

| Backend | What it wraps | Crate |
| --- | --- | --- |
| FileDisk | Host file (read-write-at) | disk_file/ |
| Vhd1Disk | VHD1 fixed format | disk_vhd1/ |
| VhdmpDisk | Windows VHD via vhdmp driver | disk_vhdmp/ |
| BlobDisk | Remote HTTP/Azure blob (read-only) | disk_blob/ |
| BlockDeviceDisk | Linux block device (io_uring) | disk_blockdevice/ |
| NvmeDisk | Physical NVMe via VFIO | disk_nvme/ |
| StripedDisk | Data striped across multiple backends | disk_striped/ |

Wrapping backends (decorators)

Backends that wrap another Disk and transform I/O:

| Wrapper | Transform | Crate |
| --- | --- | --- |
| CryptDisk | XTS-AES256 encryption | disk_crypt/ |
| DelayDisk | Injected I/O latency | disk_delay/ |
| DiskWithReservations | In-memory persistent reservations | disk_prwrap/ |

The wrapping pattern is important to document because it is how features compose without modifying backends.
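A minimal sketch of the pattern, using a made-up mini trait and a hypothetical read-counting wrapper (the real wrappers transform I/O with encryption, delay, or reservations, but the delegation shape is the same):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Illustrative mini trait standing in for DiskIo.
trait Io: Send + Sync {
    fn read(&self, sector: u64, buf: &mut [u8]);
}

/// Inner backend: every sector reads as zeros.
struct ZeroDisk;
impl Io for ZeroDisk {
    fn read(&self, _sector: u64, buf: &mut [u8]) {
        buf.fill(0);
    }
}

/// Decorator in the disk_crypt/disk_delay style: wraps another disk and
/// observes or transforms I/O without modifying the inner backend. This
/// particular wrapper (read counting) is hypothetical.
struct CountingDisk {
    inner: Arc<dyn Io>,
    reads: AtomicU64,
}

impl Io for CountingDisk {
    fn read(&self, sector: u64, buf: &mut [u8]) {
        self.reads.fetch_add(1, Ordering::Relaxed);
        self.inner.read(sector, buf); // delegate to the wrapped disk
    }
}
```

Because the wrapper implements the same trait it consumes, wrappers stack arbitrarily (e.g., delay over crypt over a file backend) with no backend changes.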

The layered disk model

disk_layered/ is its own subsystem. A layered disk stacks multiple layers with read-through and optional write-through semantics:

  • Each layer implements LayerIo (similar to DiskIo but tracks sector presence via bitmap)
  • Reads check layers top-to-bottom; the first layer with the requested sectors wins
  • Writes go to the topmost writable layer
  • Layers can optionally cache read-miss data from below (read_cache)
  • Layers can optionally write through to the next layer (write_through)

The two concrete layer implementations today are:

  • RamDiskLayer (disklayer_ram/) — ephemeral, fast
  • SqliteDiskLayer (disklayer_sqlite/) — persistent, portable

This is what powers the memdiff: disk configuration in the CLI. It deserves its own section because the bitmap-based presence tracking and the layer configuration model are not obvious from reading a single file.
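The read-through and write-to-top semantics above can be sketched in a few lines (one byte per sector, a Vec<bool> standing in for the presence bitmap; all names are illustrative, not the LayerIo API):

```rust
/// A layer tracks which sectors it holds plus their data.
struct Layer {
    present: Vec<bool>, // stands in for the per-sector presence bitmap
    data: Vec<u8>,
}

impl Layer {
    fn empty(sectors: usize) -> Self {
        Layer { present: vec![false; sectors], data: vec![0; sectors] }
    }
}

/// Index 0 is the topmost layer. Reads walk top-to-bottom; the first layer
/// holding the sector wins.
fn read_sector(layers: &[Layer], sector: usize) -> u8 {
    for layer in layers {
        if layer.present[sector] {
            return layer.data[sector];
        }
    }
    0 // never-written sectors read as zeros
}

/// Writes land in the topmost (writable) layer only.
fn write_sector(layers: &mut [Layer], sector: usize, value: u8) {
    let top = &mut layers[0];
    top.present[sector] = true;
    top.data[sector] = value;
}
```

This is exactly the memdiff shape: a RAM top layer shadows a backing bottom layer, and the backing data is never modified.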

Resolver integration

The doc should show how the resolver pattern connects configuration to concrete backends. The storage resolver chain is the best example of recursive resolution in the codebase:

  • NVMe controller → resolves each namespace's disk
  • Layered disk → resolves each layer in parallel
  • Each layer or backend → resolves to a concrete DiskIo implementation

This ties back to the resolver documentation (separate issue) but deserves a storage-specific section showing the concrete resolution flow.
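The recursive shape can be illustrated with a hypothetical config tree (none of these types are the real resolver API; resolution here just produces a description string where the real chain produces concrete DiskIo implementations):

```rust
/// Hypothetical config tree mirroring the recursive resolution shape: a
/// layered-disk config contains layer configs, which resolve to concrete
/// backends; a controller would resolve one such tree per namespace or LUN.
enum DiskConfig {
    Ram { sectors: u64 },
    File { path: &'static str },
    Layered { layers: Vec<DiskConfig> },
}

/// Resolution walks the tree recursively, turning each node into a
/// concrete backend (represented here as a string).
fn resolve(config: &DiskConfig) -> String {
    match config {
        DiskConfig::Ram { sectors } => format!("ram({sectors})"),
        DiskConfig::File { path } => format!("file({path})"),
        DiskConfig::Layered { layers } => {
            let resolved: Vec<String> = layers.iter().map(resolve).collect();
            format!("layered[{}]", resolved.join(", "))
        }
    }
}
```

The point for the doc: each level only knows how to resolve its immediate children, so arbitrarily nested configurations (a layered disk whose layer wraps another disk) fall out of the same mechanism.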

Online disk resize

Online disk resize is an interesting cross-cutting concern because the behavior differs by frontend, backend, and OpenHCL vs standalone context.

Frontend notification mechanisms:

| Frontend | Resize notification | How it works |
| --- | --- | --- |
| NVMe | AEN (Async Event Notification) | Background task calls disk.wait_resize() per namespace; on change, completes a queued AER command with CHANGED_NAMESPACE_LIST |
| StorVSP/SCSI | UNIT_ATTENTION sense key | On the next SCSI command after a resize, SimpleScsiDisk detects the capacity change and returns UNIT_ATTENTION; the guest retries and re-reads capacity |
| IDE | Not supported | IDE has no standardized capacity-change notification |

Backend wait_resize support:

The DiskIo trait has a wait_resize method that defaults to pending() (never completes). Only backends that can detect runtime capacity changes override it:

| Backend | wait_resize | How |
| --- | --- | --- |
| disk_blockdevice | ✅ Event-driven | Linux uevent listener for block device resize events |
| disk_nvme | ✅ Event-driven | NVMe driver monitors AENs from the physical controller; rescans the namespace to detect capacity changes |
| disk_file | ❌ Default (pending) | No file-change monitoring |
| disk_vhd1 | ❌ Default (pending) | Fixed-size format |
| disk_blob | ❌ Default (pending) | Remote blob, no resize |
| disk_layered | ✅ Delegates | Delegates to the bottom-most layer |
| Wrappers (crypt, delay, prwrap) | ✅ Delegates | Forward to the inner disk |

OpenHCL vs standalone:

In OpenHCL, the resize path is the same as standalone: disk_blockdevice detects the uevent from the host, wait_resize completes, and the NVMe or SCSI frontend notifies the VTL0 guest through the standard mechanism. There is no special paravisor-level resize interception.

The doc should explain this end-to-end flow and make clear which backends actually support it, since a contributor attaching a disk_file backend and expecting runtime resize will be confused when nothing happens.
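The default-pending contract is the key API detail. Modeled synchronously with made-up names (the real wait_resize is an async method whose default future never completes):

```rust
/// Sketch of the wait_resize contract: the trait default never reports a
/// change (mirroring pending()), and only backends with a real event source
/// override it. All names here are illustrative.
trait ResizeAware {
    /// New sector count if a resize has been observed; the default mirrors
    /// the never-completing pending() future.
    fn poll_resize(&self) -> Option<u64> {
        None
    }
}

/// disk_file-style backend: no monitoring, keeps the default.
struct FileLike;
impl ResizeAware for FileLike {}

/// disk_blockdevice-style backend: an event source (uevent in the real
/// crate) supplies the new capacity.
struct BlockDeviceLike {
    new_sector_count: u64,
}
impl ResizeAware for BlockDeviceLike {
    fn poll_resize(&self) -> Option<u64> {
        Some(self.new_sector_count)
    }
}
```

A frontend awaiting the default implementation simply never wakes, which is exactly the "nothing happens" behavior a contributor sees with disk_file.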

RAM disk (mem:<len>)

The CLI supports standalone RAM disks via mem:<len> (e.g., --disk mem:1G). This is distinct from memdiff:<disk>, which stacks a RAM layer on top of a backing disk.

Under the hood, mem:<len> creates a RamDiskLayerHandle { len: Some(len) } wrapped in a single-layer LayeredDiskHandle. So even a "standalone" RAM disk is actually the layered disk machinery with one layer.

memdiff:<disk> creates a RamDiskLayerHandle { len: None } (sized from the backing disk) stacked on top of the inner disk. Writes go to the RAM layer; reads fall through to the backing disk for sectors not yet written.

The doc should explain this because the CLI surface (mem: vs memdiff:) hides the underlying layered disk model, and contributors reading the code will see RamDiskLayerHandle in both cases and wonder what the difference is.
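A sketch of the two compositions side by side, using simplified stand-ins for the real handle types (field shapes follow the RamDiskLayerHandle/LayeredDiskHandle description above, but these types are otherwise hypothetical):

```rust
/// Simplified mirror of the handle shapes the CLI builds.
struct RamLayer {
    len: Option<u64>, // Some = fixed size, None = sized from the layer below
}

enum LayerCfg {
    Ram(RamLayer),
    Backing(&'static str), // stands in for the inner disk resource
}

struct LayeredDisk {
    layers: Vec<LayerCfg>,
}

/// mem:<len> — a single fixed-size RAM layer.
fn mem(len: u64) -> LayeredDisk {
    LayeredDisk { layers: vec![LayerCfg::Ram(RamLayer { len: Some(len) })] }
}

/// memdiff:<disk> — a RAM layer sized from the backing disk, stacked on top.
fn memdiff(backing: &'static str) -> LayeredDisk {
    LayeredDisk {
        layers: vec![
            LayerCfg::Ram(RamLayer { len: None }),
            LayerCfg::Backing(backing),
        ],
    }
}
```

So the only structural differences are the layer count and whether the RAM layer's length is explicit (Some) or derived from the layer below (None).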

Virtual optical / DVD

The storage stack supports virtual DVD/CD-ROM drives, which have a different model from disk devices.

How it works:

  • SimpleScsiDvd (in scsidisk/src/scsidvd/) implements AsyncScsiDisk and handles optical-specific SCSI commands: GET_EVENT_STATUS_NOTIFICATION, GET_CONFIGURATION, START_STOP_UNIT (eject), media change events, and the standard read path.
  • The GuestMedia enum (in ide_resources/) distinguishes GuestMedia::Dvd from GuestMedia::Disk. DVD wraps a SimpleScsiDvdHandle which holds a Resource<ScsiDeviceHandleKind>, while Disk wraps a Resource<DiskHandleKind>.
  • Eject is supported: the DiskIo trait has an eject() method (defaults to UnsupportedEject), and SimpleScsiDvd handles the SCSI START_STOP_UNIT command with the load/eject flag. Once ejected, the media is permanently removed for the lifetime of the VM.

Frontend support:

| Frontend | DVD support |
| --- | --- |
| StorVSP/SCSI | ✅ Via SimpleScsiDvd |
| IDE | ✅ Via AtapiDrive wrapping SimpleScsiDvd through ATAPI |
| NVMe | ❌ Explicitly rejected ("dvd not supported with nvme") |

CLI surface:

DVD is specified with the dvd flag on --disk or --ide:

  • --disk file:my.iso,dvd → SCSI optical drive
  • --ide file:my.iso,dvd → IDE optical drive (ATAPI)

The dvd flag implicitly sets read_only = true.

The doc should cover the DVD model because it is a common source of confusion: the guest media enum, the SCSI-vs-ATAPI layering, why NVMe rejects DVD, and how eject works.


Where this belongs

This is architecture reference content. I think it belongs as a new page or set of pages under the architecture section, near the existing OpenVMM and OpenHCL architecture pages. It should cross-link to the VTL2 storage translation page from PR #2932 and to the resolver documentation.

Possible locations:

  • Guide/src/reference/architecture/openvmm/storage.md — for the shared storage pipeline
  • Or a section under Guide/src/reference/devices/ if we want it closer to device docs

I lean toward the architecture section since this is about the internal pipeline, not about a single device.


What should be rustdoc vs Guide

| Content | Location |
| --- | --- |
| DiskIo trait semantics, method contracts, scatter-gather model | Rustdoc on disk_backend |
| wait_resize method contract and default behavior | Rustdoc on disk_backend |
| LayerIo trait, bitmap semantics, layer configuration | Rustdoc on disk_layered |
| Per-backend implementation notes (e.g., VHD1 format details, io_uring usage) | Rustdoc on each backend crate |
| Architecture overview: how frontends, the SCSI adapter, and backends connect | Guide page |
| Data flow diagrams (guest I/O → frontend → SCSI → DiskIo → backend) | Guide page |
| Layered disk model explanation with examples | Guide page |
| RAM disk vs memdiff CLI semantics | Guide page |
| Online disk resize: which backends support it, how frontends notify the guest | Guide page |
| Virtual optical / DVD: GuestMedia model, eject, frontend support matrix | Guide page |
| Backend catalog (what exists, when to use each) | Guide page |
| Wrapping/decorator pattern explanation | Guide page |
| Resolver integration for storage | Guide page |

Goals

  • A contributor can understand how a guest I/O request flows from the frontend to the disk backend
  • A contributor adding a new disk backend knows what to implement and how it gets wired up
  • The layered disk model is explained clearly enough that someone can reason about layer composition without reading the implementation
  • The backend catalog gives a quick reference for what exists and when each is appropriate

Non-goals


Rough implementation plan

  1. Write a Guide page covering the storage pipeline: DiskIo trait, frontends, the SCSI adapter, backends, wrappers, layered disks
  2. Include data flow diagrams (mermaid) for the NVMe path and the StorVSP/SCSI path
  3. Add a backend catalog table
  4. Add a section on the layered disk model with a worked example
  5. Expand rustdoc on disk_backend and disk_layered crate-level docs
  6. Cross-link to the storage translation page from PR #2932 (Guide: add OpenHCL VMBus relay and storage architecture pages) and to the resolver docs
