From 33130540ae64dc67357dc2ff752e56a4248be87d Mon Sep 17 00:00:00 2001 From: Chao Liu Date: Wed, 29 Apr 2026 15:10:11 +0800 Subject: [PATCH 1/2] docs: consolidate 10 docs per language into 3 + bilingual index Reorganize docs/ following triton-cpu/docs/riscv64 structure: - getting-started.md: build, launch, HIP test, GPU driver init - architecture.md: system design, memory sharing, GART, xGMI - reference.md: parameters, pitfalls, debugging, PTE layout Create docs/README.md bilingual index with Quick Links table. Update README.md and README.zh.md to reference new structure. Delete all 20 old doc files (10 per language). Signed-off-by: Chao Liu --- README.md | 15 +- README.zh.md | 15 +- docs/README.md | 14 + docs/en/architecture.md | 1003 ++++++++++++++++++++++++++ docs/en/build-disk-china-mirror.md | 26 - docs/en/cosim-debugging-pitfalls.md | 186 ----- docs/en/cosim-dev-story.md | 343 --------- docs/en/cosim-guest-gpu-init.md | 158 ---- docs/en/cosim-memory-architecture.md | 405 ----------- docs/en/cosim-technical-notes.md | 352 --------- docs/en/cosim-usage-guide.md | 579 --------------- docs/en/getting-started.md | 532 ++++++++++++++ docs/en/gpu-fs-guide.md | 323 --------- docs/en/mi300x-memory-management.md | 332 --------- docs/en/reference.md | 564 +++++++++++++++ docs/en/xgmi-model.md | 83 --- docs/zh/architecture.md | 1003 ++++++++++++++++++++++++++ docs/zh/build-disk-china-mirror.md | 25 - docs/zh/cosim-debugging-pitfalls.md | 186 ----- docs/zh/cosim-dev-story.md | 342 --------- docs/zh/cosim-guest-gpu-init.md | 158 ---- docs/zh/cosim-memory-architecture.md | 405 ----------- docs/zh/cosim-technical-notes.md | 352 --------- docs/zh/cosim-usage-guide.md | 579 --------------- docs/zh/getting-started.md | 532 ++++++++++++++ docs/zh/gpu-fs-guide.md | 321 --------- docs/zh/mi300x-memory-management.md | 332 --------- docs/zh/reference.md | 564 +++++++++++++++ docs/zh/xgmi-model.md | 77 -- 29 files changed, 4222 insertions(+), 5584 deletions(-) create mode 100644 
docs/README.md create mode 100644 docs/en/architecture.md delete mode 100644 docs/en/build-disk-china-mirror.md delete mode 100644 docs/en/cosim-debugging-pitfalls.md delete mode 100644 docs/en/cosim-dev-story.md delete mode 100644 docs/en/cosim-guest-gpu-init.md delete mode 100644 docs/en/cosim-memory-architecture.md delete mode 100644 docs/en/cosim-technical-notes.md delete mode 100644 docs/en/cosim-usage-guide.md create mode 100644 docs/en/getting-started.md delete mode 100644 docs/en/gpu-fs-guide.md delete mode 100644 docs/en/mi300x-memory-management.md create mode 100644 docs/en/reference.md delete mode 100644 docs/en/xgmi-model.md create mode 100644 docs/zh/architecture.md delete mode 100644 docs/zh/build-disk-china-mirror.md delete mode 100644 docs/zh/cosim-debugging-pitfalls.md delete mode 100644 docs/zh/cosim-dev-story.md delete mode 100644 docs/zh/cosim-guest-gpu-init.md delete mode 100644 docs/zh/cosim-memory-architecture.md delete mode 100644 docs/zh/cosim-technical-notes.md delete mode 100644 docs/zh/cosim-usage-guide.md create mode 100644 docs/zh/getting-started.md delete mode 100644 docs/zh/gpu-fs-guide.md delete mode 100644 docs/zh/mi300x-memory-management.md create mode 100644 docs/zh/reference.md delete mode 100644 docs/zh/xgmi-model.md diff --git a/README.md b/README.md index c68fdd4..f483e91 100644 --- a/README.md +++ b/README.md @@ -191,16 +191,11 @@ to gem5's vfio-user server and maps all BARs through the standard vfio-user prot ## Documentation -Detailed technical documentation is available in [`docs/`](docs/): - -- [Complete Usage Guide](docs/en/cosim-usage-guide.md) — build, run, test -- [Technical Notes](docs/en/cosim-technical-notes.md) — architecture, pitfalls, fixes -- [MI300X Memory Management](docs/en/mi300x-memory-management.md) — GART, address translation -- [GPU FS Guide](docs/en/gpu-fs-guide.md) — gem5 standalone GPU full-system simulation -- [Guest GPU Init](docs/en/cosim-guest-gpu-init.md) — driver initialization flow -- [Memory 
Architecture](docs/en/cosim-memory-architecture.md) — shared memory, VRAM routing, DMA -- [Debugging Pitfalls](docs/en/cosim-debugging-pitfalls.md) — common issues and solutions -- [Development Story](docs/en/cosim-dev-story.md) — how this project was built in one day with Claude +Detailed technical documentation is available in [`docs/`](docs/README.md): + +- [Getting Started](docs/en/getting-started.md) — build, launch, and run your first HIP test +- [Architecture](docs/en/architecture.md) — system design, memory sharing, address translation +- [Reference](docs/en/reference.md) — parameters, troubleshooting, debugging commands ## License diff --git a/README.zh.md b/README.zh.md index 41fdc5d..f95bb34 100644 --- a/README.zh.md +++ b/README.zh.md @@ -161,16 +161,11 @@ cosim/ ## 技术文档 -详细技术文档位于 [`docs/`](docs/) 目录下: - -- [完整使用指南](docs/zh/cosim-usage-guide.md) — 从编译到运行 HIP 测试的全流程 -- [技术笔记](docs/zh/cosim-technical-notes.md) — 架构设计、踩坑记录、修复方案 -- [MI300X 内存管理](docs/zh/mi300x-memory-management.md) — GART、地址翻译、内存映射 -- [GPU 全系统仿真指南](docs/zh/gpu-fs-guide.md) — gem5 单机 GPU FS 仿真复现 -- [Guest GPU 初始化流程](docs/zh/cosim-guest-gpu-init.md) — 驱动加载与设备初始化 -- [内存架构](docs/zh/cosim-memory-architecture.md) — 共享内存、VRAM 路由、DMA -- [调试踩坑记录](docs/zh/cosim-debugging-pitfalls.md) — 常见问题与解决方案 -- [开发故事](docs/zh/cosim-dev-story.md) — 一天时间用 Claude 构建 cosim-gpu 的全过程 +详细技术文档位于 [`docs/`](docs/README.md) 目录下: + +- [快速入门](docs/zh/getting-started.md) — 从编译到运行首个 HIP 测试的全流程 +- [架构文档](docs/zh/architecture.md) — 系统设计、内存共享、地址翻译 +- [参考手册](docs/zh/reference.md) — 参数表、故障排查、调试命令 ## 版本矩阵 diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..39ce476 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,14 @@ +# QEMU + gem5 MI300X Co-simulation Documentation + +Documentation is available in two languages: + +- **中文文档 (Chinese)** — 中文版本 +- **English Documentation** — English version + +## Quick Links + +| Document | 中文 | English | +|----------|------|---------| +| Getting Started | 
[快速入门](zh/getting-started.md) | [Getting Started](en/getting-started.md) | +| Architecture | [架构文档](zh/architecture.md) | [Architecture](en/architecture.md) | +| Reference | [参考手册](zh/reference.md) | [Reference](en/reference.md) | diff --git a/docs/en/architecture.md b/docs/en/architecture.md new file mode 100644 index 0000000..3fe617b --- /dev/null +++ b/docs/en/architecture.md @@ -0,0 +1,1003 @@ +[中文](../zh/architecture.md) + +# Co-simulation Architecture + +This document provides a deep-dive into the architecture and design of the QEMU + gem5 MI300X co-simulation system. It covers the system-level structure, memory sharing mechanisms, GPU address translation, DMA data flows, interrupt forwarding, the xGMI interconnect model, and key design decisions made during development. + +--- + +## Table of Contents + +- [System Architecture Overview](#system-architecture-overview) + - [Component Diagram](#component-diagram) + - [Key Components](#key-components) + - [Communication Channels](#communication-channels) +- [vfio-user and Legacy Backends](#vfio-user-and-legacy-backends) + - [vfio-user Backend (Default)](#vfio-user-backend-default) + - [Legacy Socket Backend](#legacy-socket-backend) + - [Backend Comparison](#backend-comparison) +- [PCI BAR Layout](#pci-bar-layout) +- [Memory Sharing Architecture](#memory-sharing-architecture) + - [Three Sharing Channels](#three-sharing-channels) + - [VRAM Sharing (BAR0)](#vram-sharing-bar0) + - [Guest RAM Sharing (GTT Pages)](#guest-ram-sharing-gtt-pages) + - [Memory Split (Q35)](#memory-split-q35) + - [Sink Mechanism](#sink-mechanism) +- [GPU Address Translation and GART](#gpu-address-translation-and-gart) + - [GPU Address Spaces and Apertures](#gpu-address-spaces-and-apertures) + - [Aperture Registers](#aperture-registers) + - [GART Structure and Table Layout](#gart-structure-and-table-layout) + - [PTE Format](#pte-format) + - [getGARTAddr Transform](#getgartaddr-transform) + - [Translation Flow](#translation-flow) + - 
[gartTable Hash Map vs. Shared VRAM](#garttable-hash-map-vs-shared-vram) + - [Address Classification After Translation](#address-classification-after-translation) + - [MMHUB Aperture](#mmhub-aperture) + - [User-Space Translation (VMID > 0)](#user-space-translation-vmid-0) +- [DMA Data Flow](#dma-data-flow) + - [PM4 Packet Processor Routing](#pm4-packet-processor-routing) + - [SDMA Engine Routing](#sdma-engine-routing) + - [VRAM vs. System Memory Detection](#vram-vs-system-memory-detection) + - [vfio-user Backend: Shared Memory Direct Access](#vfio-user-backend-shared-memory-direct-access) + - [Legacy Backend: Socket DMA Protocol](#legacy-backend-socket-dma-protocol) + - [Interrupt Handler (IH) DMA](#interrupt-handler-ih-dma) + - [Complete Data Flow Example](#complete-data-flow-example) +- [MSI-X Interrupt Forwarding](#msi-x-interrupt-forwarding) + - [Interrupt Delivery Path](#interrupt-delivery-path) + - [IH Ring Buffer Interaction](#ih-ring-buffer-interaction) +- [xGMI Interconnect Model](#xgmi-interconnect-model) + - [Packet Format](#packet-format) + - [Address Mapping](#address-mapping) + - [Topology Configuration](#topology-configuration) + - [Link Parameters](#link-parameters) + - [Flow Control](#flow-control) + - [Architecture Phases](#architecture-phases) +- [Design History and Key Decisions](#design-history-and-key-decisions) + - [Why vfio-user Over a Custom Protocol](#why-vfio-user-over-a-custom-protocol) + - [Why Q35 + KVM](#why-q35-kvm) + - [Shared Memory Design](#shared-memory-design) + - [SIGIO Edge-Triggered Drain](#sigio-edge-triggered-drain) + - [GART Fallback Approach](#gart-fallback-approach) + - [VRAM Routing Discovery](#vram-routing-discovery) + +--- + +## System Architecture Overview + +The co-simulation system splits GPU workload execution across two processes: QEMU (with KVM) handles the host CPU, guest OS, and amdgpu driver at near-native speed, while gem5 models the MI300X GPU device -- shader arrays, command processors, SDMA engines, and 
the Ruby cache hierarchy -- with cycle-level accuracy. The two processes communicate via a Unix domain socket and share memory through POSIX shared memory files for zero-copy DMA. + +### Component Diagram + +``` ++--------------------------------------+ +| QEMU (Q35 + KVM) | +| +--------------------------------+ | +| | Guest Linux (Ubuntu 24) | | +| | amdgpu driver (ROCm 7) | | +| | ROCm userspace | | +| +--------------+-----------------+ | +| | MMIO / Doorbell | +| +--------------v-----------------+ | +| | vfio-user-pci | | +| | (QEMU built-in device) | | +| +--------------+-----------------+ | +| | vfio-user protocol | ++-----------------+--------------------+ + | /tmp/gem5-mi300x.sock + | (Unix socket) ++-----------------+--------------------+ +| gem5 | | +| +--------------v-----------------+ | +| | MI300XVfioUser | | +| | (mi300x_vfio_user.cc) | | +| | [libvfio-user server] | | +| +--------------+-----------------+ | +| | AMDGPUDevice API | +| +--------------v-----------------+ | +| | AMDGPUDevice | | +| | PM4PacketProcessor | | +| | SDMAEngine | | +| | Shader / CU array | | +| +--------------------------------+ | ++--------------------------------------+ + +Shared Memory: + /dev/shm/cosim-guest-ram Guest physical RAM (QEMU <-> gem5 DMA) + /dev/shm/mi300x-vram GPU VRAM (QEMU BAR0 <-> gem5 device memory) +``` + +gem5 runs inside a Docker container with a `StubWorkload` (no Linux kernel of its own). It starts as a vfio-user server, listens on the Unix socket, and waits for MMIO requests from QEMU. 
+ +### Key Components + +| Component | Location | Purpose | +|---|---|---| +| `MI300XVfioUser` | `src/dev/amdgpu/mi300x_vfio_user.{cc,hh}` | gem5 vfio-user server; handles BAR access and interrupts via libvfio-user (default backend) | +| `vfio-user-pci` | QEMU built-in device | QEMU-side vfio-user client; no custom QEMU code needed | +| `CosimBridge` | `src/dev/amdgpu/cosim_bridge.hh` | Abstract co-simulation bridge interface, implemented by both backends | +| `MI300XGem5Cosim` | `src/dev/amdgpu/mi300x_gem5_cosim.{cc,hh}` | Legacy socket bridge SimObject | +| `mi300x_gem5.c` | `qemu/hw/misc/` | Legacy QEMU PCI device; forwards MMIO/doorbell via custom socket protocol | +| `mi300_cosim.py` | `configs/example/gpufs/` | gem5 config; selects backend via `--cosim-backend=vfio-user|legacy` | +| `cosim_launch.sh` | `scripts/` | Orchestrates Docker (gem5) + QEMU launch sequence | + +### Communication Channels + +The system uses three distinct channels between QEMU and gem5: + +1. **VRAM shared memory** (`/dev/shm/mi300x-vram`, 16 GiB) -- GPU VRAM including GART page tables. Both sides mmap the same file for zero-copy access. +2. **Guest RAM shared memory** (`/dev/shm/cosim-guest-ram`, 8 GiB) -- Host physical memory containing ring buffers, fences, GTT pages. QEMU uses `memory-backend-file` with `share=on`; gem5 uses `shared_backstore`. +3. **vfio-user socket** (`/tmp/gem5-mi300x.sock`) -- Carries MMIO reads/writes, config space access, doorbell writes, and interrupt notifications via the vfio-user protocol. + +--- + +## vfio-user and Legacy Backends + +The co-simulation system supports two communication backends, selectable via `--cosim-backend=vfio-user|legacy` in the gem5 configuration. + +### vfio-user Backend (Default) + +The vfio-user backend uses the industry-standard vfio-user protocol (QEMU 10.0+ built-in support). On the gem5 side, Nutanix's libvfio-user library acts as the server. + +- **QEMU side**: Uses the built-in `vfio-user-pci` device. 
No custom QEMU code is required; any stock QEMU 10.0+ build works. +- **gem5 side**: `MI300XVfioUser` registers BAR regions, configuration space, and MSI-X capabilities with libvfio-user, then serves requests from QEMU. +- **DMA**: gem5 accesses Guest RAM directly through the Ruby memory system's shared backstore, with no socket round-trips. +- **Interrupts**: Delivered via `irq_fd` (eventfd injected into KVM), eliminating custom interrupt messages. + +### Legacy Socket Backend + +The legacy backend uses a custom `mi300x-gem5` QEMU PCI device and a custom binary protocol over two Unix socket connections: + +- **Synchronous connection**: MMIO request-response pairs (QEMU sends write/read, gem5 responds). +- **Asynchronous connection**: gem5 sends IRQ raise/lower events and DMA read/write requests to QEMU. + +This backend requires a QEMU build from the `cosim/qemu/` directory. + +### Backend Comparison + +| Dimension | vfio-user Backend | Legacy Socket Backend | +|-----------|-------------------|----------------------| +| Guest RAM DMA | Ruby memory system direct access to shared backstore | Socket request-response protocol | +| VRAM access | mmap zero-copy | mmap zero-copy | +| Interrupts | irq_fd (eventfd -> KVM) | Custom socket messages | +| MMIO | vfio-user message passing | Custom binary protocol | +| QEMU-side device | Built-in `vfio-user-pci` | Custom `mi300x_gem5.c` | +| Address translation | gem5-internal GART translation | QEMU-side `pci_dma_read/write` | +| QEMU version | Stock QEMU 10.0+ | Custom fork required | + +--- + +## PCI BAR Layout + +The PCI BAR layout must match the expectations hardcoded in the amdgpu driver (`AMDGPU_VRAM_BAR=0`, `AMDGPU_DOORBELL_BAR=2`, `AMDGPU_MMIO_BAR=5`). 
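As a rough illustration of why both halves of the 64-bit BARs in this section matter during size probing, the sketch below computes the values a device returns after software writes all ones to a BAR pair, following the standard PCI sizing convention. The helper names `bar_size_mask` and `decode_bar_size` are hypothetical and are not taken from the gem5 or QEMU sources; the BAR sizes are the ones listed in this section.

```python
# Sketch of PCI 64-bit memory BAR size probing (hypothetical helper names).
MEM_TYPE_64BIT = 0x4     # bits [2:1] = 10b: 64-bit memory BAR
MEM_PREFETCHABLE = 0x8   # bit 3: prefetchable

GiB = 1 << 30
MiB = 1 << 20
MASK64 = 0xFFFFFFFFFFFFFFFF


def bar_size_mask(size, flags):
    """Return the (low, high) dwords read back after writing all ones."""
    if size <= 0 or size & (size - 1):
        raise ValueError("BAR size must be a power of two")
    mask = ~(size - 1) & MASK64
    low = (mask & 0xFFFFFFF0) | flags  # low nibble carries type flags
    high = mask >> 32                  # upper half of the 64-bit size mask
    return low, high


def decode_bar_size(low, high):
    """Recover the BAR size the way probing software does."""
    mask = (high << 32) | (low & 0xFFFFFFF0)
    return (~mask & MASK64) + 1


# 16 GiB prefetchable VRAM BAR and 4 MiB doorbell BAR.
vram_low, vram_high = bar_size_mask(16 * GiB, MEM_TYPE_64BIT | MEM_PREFETCHABLE)
db_low, db_high = bar_size_mask(4 * MiB, MEM_TYPE_64BIT)
print(hex(vram_low), hex(vram_high))
print(hex(db_low), hex(db_high))
```

For the 16 GiB BAR every size bit lives in the upper dword, so a device model that answers the probe only on the lower dword would leave the guest unable to recover the correct size.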
+ +``` +BAR0+1 VRAM 64-bit prefetchable 16 GiB (shared memory) +BAR2+3 Doorbell 64-bit 4 MiB +BAR4 MSI-X exclusive 256 vectors +BAR5 MMIO regs 32-bit 512 KiB (forwarded to gem5) +``` + +| BAR | Content | Size | Communication Method | +|-----|---------|------|---------------------| +| BAR0+1 | VRAM | 16 GiB | Shared memory (zero-copy mmap) | +| BAR2+3 | Doorbell | 4 MiB | Socket forwarding (vfio-user or legacy) | +| BAR4 | MSI-X | 256 vectors | QEMU local | +| BAR5 | MMIO registers | 512 KiB | Socket forwarding (vfio-user or legacy) | + +BAR0+1 and BAR2+3 are 64-bit BARs (16 GiB VRAM cannot fit in the 32-bit address space). During PCI BAR size probing, the upper half of each 64-bit BAR must return the high 32 bits of the size mask. + +The PCI class code is set to `PCI_CLASS_DISPLAY_VGA (0x0300)` rather than `PCI_CLASS_DISPLAY_OTHER (0x0380)`, so the kernel detects the device as a "video device with shadowed ROM" and enables VGA ROM lookup at `0xC0000`. + +--- + +## Memory Sharing Architecture + +In co-simulation, the GPU device model (gem5) and the host system (QEMU/KVM) run as separate processes. The GPU needs access to two types of memory: + +- **VRAM** (local video memory): GPU-private storage for textures, buffers, GART page tables, and device-local allocations. +- **GTT** (Graphics Translation Table / System Memory): Host physical memory regions mapped by the GPU, used for ring buffers, fences, IH cookies, and DMA buffers. + +Both types are shared via POSIX shared memory files, enabling bidirectional visibility without socket communication. 
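The zero-copy property can be seen in a minimal sketch, assuming a Linux host. A temporary file stands in for the `/dev/shm` files, and two independent mappings inside one process stand in for the QEMU and gem5 processes; the names `qemu_view` and `gem5_view` are illustrative only.

```python
import mmap
import os
import tempfile

SIZE = 4096  # one page stands in for the multi-GiB regions

# A temporary file plays the role of a /dev/shm backing file (stand-in).
fd_a, path = tempfile.mkstemp()
os.ftruncate(fd_a, SIZE)

# "QEMU side": first shared mapping of the file.
qemu_view = mmap.mmap(fd_a, SIZE, flags=mmap.MAP_SHARED)

# "gem5 side": a separate open() and mmap() of the same file.
fd_b = os.open(path, os.O_RDWR)
gem5_view = mmap.mmap(fd_b, SIZE, flags=mmap.MAP_SHARED)

# A store through one mapping is visible through the other with no copy
# and no message exchange.
qemu_view[0:8] = (0x1122334455667788).to_bytes(8, "little")
seen = int.from_bytes(gem5_view[0:8], "little")

# The reverse direction works the same way.
gem5_view[8:12] = b"done"
echo = bytes(qemu_view[8:12])
print(hex(seen), echo)

qemu_view.close()
gem5_view.close()
os.close(fd_a)
os.close(fd_b)
os.unlink(path)
```

Both mappings use `MAP_SHARED`, which is the same requirement the `share=on` option places on QEMU's memory backend.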
+ +### Three Sharing Channels + +``` ++----------------------------+ +-----------------------------+ +| QEMU (Q35 + KVM) | | gem5 (Docker) | +| | | | +| Guest Linux | | MI300X GPU Model | +| amdgpu driver | | Shader / CU / SDMA | +| | | PM4 / IH / Ruby caches | +| | | | +| +--------+ +---------+ | vfio-user (Unix) | +------------+ +--------+ | +| | BAR0 | | BAR5 |<---(MMIO/CFG/Doorbell)--->|MI300XVfio | |GPU core| | +| | (VRAM) | | (MMIO) | | | |User bridge | | | | +| +---+----+ +---------+ | | +-----+------+ +--------+ | +| | | | | | ++------+---------------------+ +--------+--------------------+ + | | + v v + /dev/shm/mi300x-vram (16 GiB) mmap same file + (VRAM: GPU data + GART page tables) (vramShmemPtr) + | | + v v + /dev/shm/cosim-guest-ram (8 GiB) mmap same file + (Guest RAM: ring buffers, fences, (system->getPhysMem()) + GTT pages, kernel/user data) +``` + +| Channel | File/Socket | Size | Purpose | Access Method | +|---------|-------------|------|---------|---------------| +| VRAM Shared Memory | `/dev/shm/mi300x-vram` | 16 GiB | GPU VRAM + GART page tables | mmap (zero-copy) | +| Guest RAM Shared Memory | `/dev/shm/cosim-guest-ram` | 8 GiB | Host physical memory (GTT pages) | QEMU: mmap; gem5: Ruby memory system direct access to shared backstore | +| vfio-user Socket | `/tmp/gem5-mi300x.sock` | -- | MMIO/config space/doorbell; interrupts via irq_fd (eventfd -> KVM) | vfio-user protocol | + +### VRAM Sharing (BAR0) + +#### Initialization + +On the gem5 side (`mi300x_vfio_user.cc:setupVramShm`): + +```cpp +shmemFd = shm_open(shmemPath.c_str(), O_CREAT | O_RDWR, 0666); +ftruncate(shmemFd, vramSize); +shmemPtr = mmap(nullptr, vramSize, PROT_READ | PROT_WRITE, MAP_SHARED, shmemFd, 0); + +// Pass the shared pointer to the GART translator +gpuDevice->getVM().vramShmemPtr = (uint8_t *)shmemPtr; +gpuDevice->getVM().vramShmemSize = vramSize; +``` + +QEMU obtains the BAR0 mapping through the vfio-user DMA region mapping mechanism -- it no longer directly opens the 
VRAM shared memory file, but instead receives the mapping through the vfio-user protocol. + +#### VRAM Content Layout + +``` +Offset 0x000000000 +------------------------------+ + | GPU Data Area | + | - hipMalloc allocations | + | - Kernel args, textures | + | - Driver internal allocs | + | | + | ... | + | | +Offset ~0x3EE600000 +------------------------------+ +(ptBase) | GART Page Table (PTEs) | + | 8 bytes per PTE | + | Maps GPU VA -> phys addr | +Offset 0x400000000 +------------------------------+ +(16 GiB) +``` + +#### Access Patterns + +| Scenario | Writer | Reader | Path | +|----------|--------|--------|------| +| GPU buffer allocation | Driver (via BAR0 write) | gem5 (via vramShmemPtr) | Shared memory direct access | +| GART PTE writes | Driver (via BAR0 write) | gem5 GART translator | memcpy from vramShmemPtr | +| IP Discovery table | gem5 initialization | Driver (via BAR0 read) | Shared memory direct access | + +Since QEMU's BAR0 and gem5's `vramShmemPtr` both mmap the same `/dev/shm` file, data written by the driver to BAR0 is immediately visible to gem5 with no socket communication required. + +### Guest RAM Sharing (GTT Pages) + +In AMD GPUs, GTT = GART = Graphics Address Remapping Table. It is a single-level page table (VMID 0) that maps GPU virtual addresses to host physical addresses. The host physical memory pages being mapped are "GTT pages." 
+ +Typical GTT page contents: + +| Data Structure | Description | Access Direction | +|---------------|-------------|-----------------| +| PM4 Ring Buffer | GFX command queue | Driver writes -> GPU reads | +| SDMA Ring Buffer | DMA command queue | Driver writes -> GPU reads | +| IH Ring Buffer | Interrupt handler queue | GPU writes -> Driver reads | +| Fence values | Completion signals | GPU writes -> Driver reads | +| MQD (Map Queue Descriptor) | Queue descriptors | Driver writes -> GPU reads | +| User DMA buffers | hipMemcpy src/dst | Bidirectional | + +#### Initialization + +QEMU side (command-line): + +```bash +-object memory-backend-file,id=mem0,size=8G,\ + mem-path=/dev/shm/cosim-guest-ram,share=on +-numa node,memdev=mem0 +``` + +`share=on` ensures `MAP_SHARED`, making QEMU's modifications visible to other processes. + +gem5 side (`mi300_cosim.py`): + +```python +system.shared_backstore = args.shmem_host_path # "/cosim-guest-ram" +system.auto_unlink_shared_backstore = True +system.memories[0].shared_backstore = args.shmem_host_path +``` + +gem5's `PhysicalMemory` uses the same POSIX shared memory file as its backing store. + +#### Why GTT Needs No Extra Sharing Mechanism + +GTT pages reside in Guest RAM, which is already shared via `/dev/shm/cosim-guest-ram`: + +1. **Driver writes to ring buffer** -> writes to Guest RAM -> shared memory -> gem5 can read +2. **gem5 writes fence** -> Ruby memory controller writes to shared backstore -> driver can read +3. 
**Physical addresses in GART PTEs** -> offsets within Guest RAM -> accessible by both sides + +### Memory Split (Q35) + +QEMU Q35 splits memory into two regions when RAM >= 2.75 GiB: + +- **Below-4G region**: first 2 GiB (file offset 0) +- **Above-4G region**: the remainder at file offset 2 GiB, mapped to guest physical address 0x100000000+ + +gem5's `mi300_cosim.py` replicates this split to ensure both sides maintain consistent file offsets: + +```python +total_mem = convert.toMemorySize(args.mem_size) +lowmem_limit = 0x80000000 if total_mem >= 0xB0000000 else 0xB0000000 +below_4g = min(total_mem, lowmem_limit) +above_4g = total_mem - below_4g +``` + +If the two sides disagree on where above-4G memory sits in the file, gem5 reads stale or zeroed data (e.g., GART PTEs reading as all zeros, causing infinite NOP loops in the PM4 command processor). + +### Sink Mechanism + +In co-simulation mode, some GART PTEs may be zero (uninitialized) or point to VRAM-internal addresses. If gem5 cannot translate these addresses, the original behavior was to throw a `GenericPageTableFault`, causing a DMA retry loop that hangs the simulation. + +The sink mechanism prevents this: + +```cpp +// amdgpu_vm.cc: GARTTranslationGen::translate() + +if (pte == 0) { + if (origAddr < vramShmemSize && vramShmemPtr) { + // VRAM address -> map to sink (paddr=0) + range.paddr = 0; + warn_once("GART: VRAM address mapped to sink -- " + "VRAM write-backs are no-ops in cosim"); + } else if (vramShmemPtr) { + // Unmapped GART page -> sink + range.paddr = 0; + warn_once("GART cosim: unmapped page -> sink"); + } +} +``` + +Sink semantics: + +- `paddr=0` is always a valid physical address in gem5 (system RAM base) +- DMA reads return zeros +- DMA writes are silently discarded +- Prevents the fault -> retry infinite loop + +This behavior is safe: diagnostics confirmed that the first GART page (ptStart itself) is normally unmapped, while subsequent PTEs contain valid entries. 
The sink ensures the simulation stays alive even when the GPU attempts DMA to pages the driver has not yet mapped. + +--- + +## GPU Address Translation and GART + +The MI300X (GFX 9.4.3) uses multiple address spaces and apertures to access memory. Each memory access issued by the GPU is first classified by aperture, then translated into a physical address. + +### GPU Address Spaces and Apertures + +``` +GPU Virtual Address (48-bit) +| ++-- AGP aperture [agpBot, agpTop] +| +-- Direct offset: paddr = vaddr - agpBot + agpBase +| ++-- GART aperture [ptStart<<12, ptEnd<<12] +| +-- Page table: paddr = GART_PTE[page_num].phys_addr | offset +| ++-- Framebuffer (FB) [fbBase, fbTop] +| +-- VRAM offset: vram_off = vaddr - fbBase +| ++-- System aperture [sysAddrL, sysAddrH] +| +-- Direct map: paddr = vaddr (system memory) +| ++-- MMHUB aperture [mmhubBase, mmhubTop] +| +-- VRAM mirror: vram_off = vaddr - mmhubBase +| ++-- User VM (VMID>0) [arbitrary VAs] + +-- Multi-level page table walk (4 or 5 levels) +``` + +### Aperture Registers + +These MMIO registers define the boundaries of each aperture. The values are programmed by the amdgpu driver during GMC (Graphics Memory Controller) initialization. 
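The shift-based formats in this section's register table can be checked with a short sketch. The function names are hypothetical, and the raw register values `0x8000` and `0x8400` are assumptions back-computed from the typical `fbBase` and `fbTop` values quoted in this section; they are not read from a driver log.

```python
# Sketch of the aperture register formats (hypothetical helper names).

def bits(value, hi, lo):
    """Extract value[hi:lo], inclusive on both ends."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_fb_base(raw):
    # MC_VM_FB_LOCATION_BASE format: bits[23:0] << 24
    return bits(raw, 23, 0) << 24


def decode_fb_top(raw):
    # MC_VM_FB_LOCATION_TOP format: bits[23:0] << 24 | 0xFFFFFF
    return (bits(raw, 23, 0) << 24) | 0xFFFFFF


def decode_sys_aperture(raw):
    # MC_VM_SYSTEM_APERTURE_*_ADDR format: bits[29:0] << 18
    return bits(raw, 29, 0) << 18


def gart_span(pt_start, pt_end, page_shift=12):
    """Return (first GPU VA, page count, bytes covered) for the GART aperture."""
    pages = pt_end - pt_start + 1
    return pt_start << page_shift, pages, pages << page_shift


fb_base = decode_fb_base(0x8000)  # assumed raw register value
fb_top = decode_fb_top(0x8400)    # assumed raw register value
first_va, pages, span = gart_span(0x7FFF00000, 0x7FFF1FFFF)
print(hex(fb_base), hex(fb_top))
print(hex(first_va), pages, span >> 20, "MiB")
```

With the typical `ptStart` and `ptEnd` values, the page count comes to 131072 (128K) pages of 4 KiB each, which is the 512 MiB GART range quoted in this section.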
+ +| Register | gem5 Field | Format | Description | +|----------|-----------|--------|-------------| +| `MC_VM_FB_LOCATION_BASE` | `vmContext0.fbBase` | `bits[23:0] << 24` | Start address of VRAM in MC address space | +| `MC_VM_FB_LOCATION_TOP` | `vmContext0.fbTop` | `bits[23:0] << 24 | 0xFFFFFF` | End address of VRAM | +| `MC_VM_FB_OFFSET` | `vmContext0.fbOffset` | `bits[23:0] << 24` | FB relocation offset | +| `MC_VM_AGP_BASE` | `vmContext0.agpBase` | `bits[23:0] << 24` | AGP remap base address | +| `MC_VM_AGP_BOT` | `vmContext0.agpBot` | `bits[23:0] << 24` | AGP aperture bottom | +| `MC_VM_AGP_TOP` | `vmContext0.agpTop` | `bits[23:0] << 24 | 0xFFFFFF` | AGP aperture top | +| `MC_VM_SYSTEM_APERTURE_LOW_ADDR` | `vmContext0.sysAddrL` | `bits[29:0] << 18` | System aperture low address | +| `MC_VM_SYSTEM_APERTURE_HIGH_ADDR` | `vmContext0.sysAddrH` | `bits[29:0] << 18` | System aperture high address | +| `VM_CONTEXT0_PAGE_TABLE_BASE_ADDR` | `vmContext0.ptBase` | raw 64-bit | Location of GART table in VRAM | +| `VM_CONTEXT0_PAGE_TABLE_START_ADDR` | `vmContext0.ptStart` | raw 64-bit | GART aperture start address (page number) | +| `VM_CONTEXT0_PAGE_TABLE_END_ADDR` | `vmContext0.ptEnd` | raw 64-bit | GART aperture end address (page number) | + +Typical values in co-simulation (from driver initialization diagnostics): + +``` +ptBase = 0x3EE600000 GART table at VRAM offset ~15.7 GiB +ptStart = 0x7FFF00000 GART covers GPU VAs from 0x7FFF00000000 +ptEnd = 0x7FFF1FFFF GART covers ~128K pages (512 MiB) +fbBase = 0x8000000000 VRAM starts at MC address 512 GiB +fbTop = 0x8400FFFFFF VRAM ends at ~528 GiB (16 GiB range) +sysAddrL = 0x0 System aperture start +sysAddrH = 0x3FFEC0000 System aperture end (~4 TiB) +``` + +### GART Structure and Table Layout + +GART is a single-level page table used by VMID 0 (kernel mode) to map GPU virtual addresses to system physical addresses. 
It enables the GPU to perform DMA access to host (guest) RAM for ring buffers, fence values, IH cookies, and other kernel-mode data structures. + +The GART table resides in VRAM at offset `ptBase`: + +``` +VRAM offset = ptBase (gartBase) ++-------------------+ ptBase + 0 +| PTE[0] (8 bytes) | maps page ptStart ++-------------------+ ptBase + 8 +| PTE[1] | maps page ptStart + 1 ++-------------------+ ptBase + 16 +| PTE[2] | maps page ptStart + 2 +| ... | ++-------------------+ +| PTE[N] | maps page ptStart + N ++-------------------+ ptBase + (ptEnd - ptStart + 1) * 8 +``` + +### PTE Format + +Each PTE is 8 bytes: + +``` +63 52 51 48 47 12 11 6 5 2 1 0 ++-------+------+-----------------+------+----+---+---+ +| Flags | BlkF | Physical Page | Rsvd |Frag|Sys| V | +| | | (PA >> 12) | | | | | ++-------+------+-----------------+------+----+---+---+ +``` + +| Bit Range | Field | Description | +|-----------|-------|-------------| +| 0 | Valid | Entry is valid | +| 1 | System | 1 = system memory (Guest RAM), 0 = local VRAM | +| 5:2 | Fragment | Page fragment size | +| 47:12 | Physical Page | Physical address >> 12 | +| 51:48 | Block Fragment | Block fragment size | +| 63:52 | Flags | MTYPE, PRT, etc. 
| + +Physical address extraction: `paddr = (bits(PTE, 47, 12) << 12) | page_offset` + +### getGARTAddr Transform + +Before GART lookup, addresses are transformed via `getGARTAddr()`, which multiplies the page number by 8 (the size of a PTE), converting a GPU VA into a byte offset within the GART table: + +```cpp +// In pm4_packet_processor.cc and sdma_engine.cc: +Addr getGARTAddr(Addr addr) const { + if (!gpuDevice->getVM().inAGP(addr)) { + Addr low_bits = bits(addr, 11, 0); + addr = (((addr >> 12) << 3) << 12) | low_bits; + } + return addr; +} +``` + +### Translation Flow + +The complete GART translation sequence: + +``` +Original GPU VA (e.g., 0x7FFF00032000) + | + v getGARTAddr() +Transformed addr = ((VA>>12) * 8) << 12 | low_bits + = 0x3FFF80019_0000 (example) + | + v GARTTranslationGen::translate() +gart_addr = bits(transformed, 63, 12) = page_num * 8 + | + +-- Look up gartTable hash map (populated by writeFrame / SDMA shadow) + | + +-- Cosim fallback: read PTE from shared VRAM + | pte_offset = gart_addr - (ptStart * 8) + | pte = *(vramShmemPtr + ptBase + pte_offset) + | + v Extract physical address +paddr = (bits(PTE, 47, 12) << 12) | bits(VA, 11, 0) +``` + +The driver writes GART PTEs through the following path: + +``` +amdgpu driver (guest) + | + +- amdgpu_gart_map(): compute PTE value + | pte = (phys_addr >> 12) << 12 | flags + | + +- write to BAR0 + ptBase + (gpu_page * 8) + | | + | +- QEMU BAR0 = mmap of /dev/shm/mi300x-vram + | +- data immediately appears in shared memory + | + +- TLB invalidate: write VM_INVALIDATE_ENG17 register + +- MMIO -> vfio-user -> gem5 -> invalidateTLBs() +``` + +### gartTable Hash Map vs. Shared VRAM + +In standalone gem5 mode, GART entries are maintained in a hash map (`AMDGPUVM::gartTable`), populated by: + +1. **Direct writes** (`amdgpu_device.cc:writeFrame()`): When the driver writes to the GART region of VRAM via BAR0, the values are stored in `gartTable[offset]`. +2. 
**SDMA shadow copies** (`sdma_engine.cc`): When SDMA writes to the GART range in device memory, the shadow copy updates `gartTable`. + +In co-simulation mode, the driver writes GART PTEs through QEMU's BAR0 mapping, going directly into shared VRAM without passing through gem5's `writeFrame()`. Therefore, `gartTable` is essentially empty. The co-simulation fallback reads PTEs directly from shared VRAM at `vramShmemPtr + ptBase`: + +```cpp +Addr pte_table_offset = gart_addr - (ptStart * 8); +Addr pte_vram_offset = gartBase() + pte_table_offset; +memcpy(&pte, vramShmemPtr + pte_vram_offset, sizeof(pte)); +``` + +If a PTE is 0 (unmapped page), co-simulation mode maps to a sink (`paddr=0`) instead of faulting (see [Sink Mechanism](#sink-mechanism)). + +### Address Classification After Translation + +After GART translation yields a physical address, gem5 determines where it points: + +``` +Physical address paddr + | + +- Within fbBase ~ fbTop range? + | +- YES -> VRAM address + | +- Access directly via vramShmemPtr (zero-copy) + | + +- Within sysAddrL ~ sysAddrH range? + | +- YES -> Guest RAM address (GTT page) + | +- Access via Ruby memory system (shared memory direct access) + | + +- Neither? + +- Sink (paddr=0, safely discarded) +``` + +### MMHUB Aperture + +MMHUB (Memory Management Hub) provides a shadow mapping of VRAM. Addresses within the `[mmhubBase, mmhubTop]` range are translated by subtracting the base address: + +``` +vram_offset = vaddr - mmhubBase +``` + +SDMA uses this aperture to access device memory in VMID 0 mode. + +### User-Space Translation (VMID > 0) + +User-space GPU programs (such as HIP applications) use multi-level page tables similar to x86-64 paging. Each VMID (1-15) has its own page table base register. 
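For contrast with the multi-level walk shown next, the VMID 0 path from the preceding subsections (the `getGARTAddr()` transform, the PTE read from shared VRAM, and the sink for zero PTEs) can be condensed into one sketch. It is a simplified model, not the gem5 implementation: a `bytearray` stands in for `vramShmemPtr`, `pt_base` is scaled down to fit a small buffer, and the AGP check is omitted.

```python
import struct

PAGE_SHIFT = 12
PTE_SIZE = 8


def bits(value, hi, lo):
    """Extract value[hi:lo], inclusive on both ends."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def get_gart_addr(vaddr):
    """Multiply the page number by 8, keeping the low 12 bits (non-AGP case)."""
    return (((vaddr >> PAGE_SHIFT) << 3) << PAGE_SHIFT) | bits(vaddr, 11, 0)


def gart_translate(vaddr, vram, pt_base, pt_start):
    """Translate a VMID 0 GPU VA using PTEs stored in a VRAM byte buffer."""
    gart_addr = bits(get_gart_addr(vaddr), 63, 12)  # page_num * 8
    pte_offset = gart_addr - pt_start * PTE_SIZE    # byte offset in the table
    (pte,) = struct.unpack_from("<Q", vram, pt_base + pte_offset)
    if pte == 0:
        return 0  # unmapped page: fall back to the sink (paddr = 0)
    return (bits(pte, 47, 12) << PAGE_SHIFT) | bits(vaddr, 11, 0)


# Scaled-down stand-in for shared VRAM; pt_base here is an assumption.
vram = bytearray(0x2000)
pt_base = 0x1000
pt_start = 0x7FFF00000

# Map GPU page ptStart + 0x32 to guest physical 0x12345000 (Valid | System).
struct.pack_into("<Q", vram, pt_base + 0x32 * PTE_SIZE, 0x12345000 | 0x3)

mapped = gart_translate(0x7FFF00032ABC, vram, pt_base, pt_start)
unmapped = gart_translate(0x7FFF00033000, vram, pt_base, pt_start)
print(hex(get_gart_addr(0x7FFF00032000)), hex(mapped), hex(unmapped))
```

The transformed address printed for `0x7FFF00032000` is the same value as in the translation flow example earlier in this section.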
+ +``` +VM_CONTEXT[N]_PAGE_TABLE_BASE_ADDR -> Page Directory Base + | + v 4-level walk (PDE3 -> PDE2 -> PDE1 -> PDE0 -> PTE) +Physical address +``` + +The `UserTranslationGen` class performs this walk using the GPU's page table walker (`VegaISA::Walker`). SDMA in user mode (vmid > 0) uses this path. + +VMID 0 (kernel mode) GART page tables are fully visible via shared VRAM. VMID > 0 (user mode) multi-level page tables are walked by `VegaISA::Walker`, which uses gem5's internal TLB/page walker rather than reading directly from shared memory. The practical impact is limited: after the driver writes page tables, it sends TLB invalidate MMIOs, gem5 flushes its TLB, and subsequent walker traversals read from the correct physical addresses. + +--- + +## DMA Data Flow + +### PM4 Packet Processor Routing + +``` +PM4PacketProcessor::translate(vaddr, size) + | + +-- inAGP(vaddr)? -> AGPTranslationGen (direct offset) + | + +-- else -> GARTTranslationGen (page table lookup) +``` + +All PM4 DMA uses GART translation (VMID 0). Addresses are transformed via `getGARTAddr()` before the DMA call. + +### SDMA Engine Routing + +SDMA has more aperture awareness than PM4, as it handles both kernel-mode (VMID 0) and user-mode (VMID > 0) operations: + +``` +SDMAEngine::translate(vaddr, size) + | + +-- cur_vmid > 0? -> UserTranslationGen (multi-level page table) + | + +-- inAGP(vaddr)? -> AGPTranslationGen + | + +-- inMMHUB(vaddr)?-> MMHUBTranslationGen (VRAM shadow) + | + +-- else -> GARTTranslationGen +``` + +### VRAM vs. System Memory Detection + +For PM4's RELEASE_MEM and WRITE_DATA packets, the destination can be either VRAM or system memory. The routing logic: + +```cpp +bool vram = isVRAMAddress(pkt->addr); // addr < gpuDevice->getVRAMSize() +Addr addr = vram ? 
pkt->addr : getGARTAddr(pkt->addr); + +if (vram) + gpuDevice->getMemMgr()->writeRequest(addr, data, size); // device memory +else + dmaWriteVirt(addr, size, cb, data); // system memory via GART +``` + +Without this check, VRAM addresses fed through `getGARTAddr()` have their page numbers multiplied by 8, and GART translation fails because VRAM addresses have no corresponding page table entries. The three-layer defense (PM4 layer, SDMA layer, GART fallback sink) prevents this from crashing the simulation. + +### vfio-user Backend: Shared Memory Direct Access + +With the vfio-user backend, gem5 accesses Guest RAM directly through the Ruby memory system's shared backstore, with no socket-based DMA operations: + +``` +gem5 GPU model (PM4/SDMA/IH) + | + | Needs to read ring buffer commands / write fence values + | + v Ruby memory system request + | + +- Address translated by GART -> Guest physical address + | + +- Ruby memory controller accesses PhysicalMemory + | | + | +- PhysicalMemory backed by /dev/shm/cosim-guest-ram (MAP_SHARED) + | +- read/write directly hits shared memory + | +- QEMU sees changes immediately (same mmap file) + | + +- Done (no socket round-trip needed) +``` + +Advantages: + +- **Zero-copy**: DMA reads and writes operate directly on shared memory with no serialization/deserialization +- **Low latency**: Eliminates the socket request-response round-trip overhead +- **Simplified architecture**: No custom DMA protocol needed; Ruby's memory system natively supports shared backstores + +### Legacy Backend: Socket DMA Protocol + +The legacy backend routes DMA through the socket using a custom binary protocol. 
+ +**gem5 reads from Guest RAM** (ring buffers / fences): + +``` +gem5 GPU model (PM4/SDMA/IH) + | + v cosimBridge->sendDmaRead(guestPhysAddr, length) + | + +- Construct DmaRead message (32-byte header) + | { type=DmaRead, addr=guestPhysAddr, data=length } + | + +- sendAll(eventFd, &msg, 32) --> QEMU event thread + | | + | +- pci_dma_read(addr, buf, len) + | | (reads from /dev/shm/cosim-guest-ram) + | | + | +- sendAll(eventFd, &resp, 32) + | <------------------------------------------+- sendAll(eventFd, data, len) + | + +- memcpy(dest, recvBuf, length) // data arrives at gem5 +``` + +**gem5 writes to Guest RAM** (fences / IH cookies): + +``` +gem5 GPU model + | + v cosimBridge->sendDmaWrite(guestPhysAddr, length, data) + | + +- Construct DmaWrite message + data payload + | { type=DmaWrite, addr=guestPhysAddr, data=length, size=length } + | + +- sendAll(eventFd, &msg, 32) --> QEMU event thread + +- sendAll(eventFd, data, length) --> | + | +- pci_dma_write(addr, buf, len) + | | (writes to /dev/shm/cosim-guest-ram) + | + +- Done (DMA writes don't wait for response) +``` + +Maximum single DMA transfer in the legacy backend is 4 MiB (`COSIM_DMA_BUF_SIZE`). In practice, the driver typically submits page-sized transfers. + +### Interrupt Handler (IH) DMA + +The interrupt handler uses raw system physical addresses (not GART): + +``` +IH Ring Buffer: regs.baseAddr (from IH_RB_BASE register) +Wptr Address: regs.WptrAddr (from IH_RB_WPTR_ADDR registers) +``` + +These are GPAs (Guest Physical Addresses) programmed by the driver. The IH write flow: + +1. Write the interrupt cookie (32 bytes) to `baseAddr + IH_Wptr` +2. Write the updated write pointer to `WptrAddr` +3. Call `intrPost()` to send an MSI-X interrupt to the guest + +In co-simulation mode, DMA writes land in shared guest RAM (`/dev/shm/cosim-guest-ram`), and interrupts are forwarded to QEMU via the vfio-user irq_fd mechanism (or event socket in the legacy backend). 
+ +### Complete Data Flow Example + +A HIP kernel dispatch illustrates the full memory interaction across both shared memory regions: + +``` +1. hipMalloc(&d_a, N*sizeof(int)) + Driver -> allocates buffer in VRAM + Writes GART PTEs to shared VRAM (BAR0) + +2. hipMemcpy(d_a, h_a, N*sizeof(int), hipMemcpyHostToDevice) + Driver -> constructs SDMA copy command -> writes to Guest RAM (ring buffer) + Driver -> writes Doorbell -> QEMU BAR2 -> vfio-user -> gem5 + gem5 -> reads ring buffer (Guest RAM via shared memory) + gem5 -> parses SDMA command -> GART translates source address -> Guest RAM + gem5 -> reads source data (Guest RAM via shared memory) + gem5 -> writes to VRAM destination (shared memory direct write) + +3. kernel<<<1, N>>>(d_a, d_b, d_c, N) + Driver -> constructs PM4 dispatch command -> writes to Guest RAM (ring buffer) + Driver -> writes Doorbell -> gem5 + gem5 -> reads PM4 command (Guest RAM via shared memory) + gem5 -> launches shader execution + gem5 -> shader reads/writes VRAM (shared memory direct access) + gem5 -> writes fence on completion (Guest RAM via Ruby memory write) + gem5 -> sends MSI-X interrupt (irq_fd -> KVM) + +4. hipDeviceSynchronize() + Driver -> polls fence value (until Guest RAM value matches) + +- fence written by gem5 via Ruby memory write to shared backstore +``` + +A fence write (RELEASE_MEM) example showing address translation detail: + +``` +1. PM4 RELEASE_MEM packet: addr=0x113100000 (guest phys), data=0x1234 +2. isVRAMAddress(0x113100000)? No (< 16 GiB but not a VRAM offset) +3. getGARTAddr(0x113100000) -> 0x898800000 (page * 8 transform) +4. dmaWriteVirt(0x898800000, 8, cb, &data) +5. GARTTranslationGen::translate() + - gart_addr = 0x898800 + - Look up PTE from shared VRAM -> PTE has paddr bits + - paddr = extracted address (in guest RAM) +6. DMA write lands in /dev/shm/cosim-guest-ram at paddr offset +7. 
Guest driver reads fence value from same shared memory +``` + +--- + +## MSI-X Interrupt Forwarding + +### Interrupt Delivery Path + +The GPU signals completion events (fence write-backs, IH ring entries) to the guest via MSI-X interrupts. The interrupt delivery chain differs between backends: + +**vfio-user backend**: + +``` +gem5 AMDGPUDevice::intrPost() + | + +-> cosimBridge->sendIrqRaise(0) + | + +-> MI300XVfioUser: vfu_irq_trigger(irq_fd) + | eventfd write -> KVM + | + +-> KVM injects MSI-X interrupt into guest + | + +-> Guest IH handler processes interrupt + reads IH ring buffer from Guest RAM +``` + +The vfio-user backend uses eventfd descriptors (`irq_fd`) registered with KVM. When gem5 triggers an interrupt, it writes to the eventfd, and KVM directly injects the interrupt into the guest -- no QEMU involvement in the hot path. + +**Legacy backend**: + +``` +gem5 AMDGPUDevice::intrPost() + | + +-> cosimBridge->sendIrqRaise(0) + | + +-> MI300XGem5Cosim: send IrqRaise message via event socket + | + +-> QEMU mi300x_gem5.c: event thread receives message + | msix_notify(pci_dev, vector) + | + +-> KVM injects MSI-X interrupt into guest + | + +-> Guest IH handler processes interrupt +``` + +The device supports 256 MSI-X vectors (BAR4). + +### IH Ring Buffer Interaction + +After the MSI-X interrupt arrives, the guest's IH (Interrupt Handler) reads the interrupt cookie from the IH ring buffer in Guest RAM: + +1. gem5 writes a 32-byte interrupt cookie to `IH_RB_BASE + IH_Wptr` in Guest RAM +2. gem5 updates the write pointer at `IH_RB_WPTR_ADDR` +3. gem5 calls `intrPost()` to deliver the MSI-X interrupt +4. Guest IH handler wakes up, reads the cookie from the ring buffer, and processes the event + +Both the ring buffer and the write pointer reside in shared Guest RAM, so the data is immediately visible to the guest once written by gem5's Ruby memory system. 
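The write order in the four steps above (cookie first, then write pointer, then interrupt) can be sketched in a few lines. This is a minimal illustration and not gem5's implementation: `IhRingSketch`, its fields, the ring size, and the wrap-around rule are assumptions made for the example. Only the 32-byte cookie size and the ordering come from the text.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kCookieBytes = 32;  // cookie size stated in the text

// Hypothetical stand-in for the IH ring living in shared guest RAM.
struct IhRingSketch {
    std::vector<std::uint8_t> ring;    // models guest RAM at IH_RB_BASE
    std::uint32_t wptrInGuestRam = 0;  // models the word at IH_RB_WPTR_ADDR
    std::uint32_t wptr = 0;            // device-side write pointer (byte offset)
    unsigned interruptsRaised = 0;     // models calls to intrPost()

    explicit IhRingSketch(std::size_t bytes) : ring(bytes, 0) {
        // Assumed: the ring holds a whole number of cookies.
        assert(bytes > 0 && bytes % kCookieBytes == 0);
    }

    void post(const std::array<std::uint8_t, kCookieBytes> &cookie) {
        // Step 1: write the cookie at base + wptr.
        std::memcpy(ring.data() + wptr, cookie.data(), kCookieBytes);
        // Assumed: wptr advances by one cookie and wraps at the ring size.
        wptr = static_cast<std::uint32_t>((wptr + kCookieBytes) % ring.size());
        // Step 2: publish the updated write pointer.
        wptrInGuestRam = wptr;
        // Step 3: raise the interrupt only after both writes are in place.
        ++interruptsRaised;
    }
};
```

The ordering is the point of the sketch: if the interrupt were raised before the write pointer was published, the guest handler could wake up and find nothing new in the ring.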
+ +--- + +## xGMI Interconnect Model + +The xGMI (inter-chip Global Memory Interconnect) model provides GPU-to-GPU communication within a cosim-gpu multi-GPU hive. It attaches to each GPU's L2 cache (TCC) egress and routes remote VRAM accesses through a modeled xGMI link with configurable bandwidth, latency, and topology. + +### Packet Format + +| Field | Type | Description | +|-------|------|-------------| +| src_gpu | uint8 | Source GPU ID | +| dst_gpu | uint8 | Destination GPU ID | +| addr | uint64 | Target VRAM address | +| size | uint32 | Payload size in bytes | +| payload | bytes | Data (for write operations) | + +### Address Mapping + +Each GPU owns a contiguous VRAM address range: + +``` +GPU 0: [0, vram_size) +GPU 1: [vram_size, 2 * vram_size) +GPU N: [N * vram_size, (N+1) * vram_size) +``` + +The bridge determines whether an address is local or remote by checking which GPU's range it falls into. + +### Topology Configuration + +Launch-time parameter `--xgmi-topology`: + +- **mesh**: Every GPU has a direct link to every other GPU. An 8-GPU mesh creates 28 bidirectional links. +- **ring**: Each GPU connects to its two neighbors. Lower link count but multi-hop for non-adjacent GPUs. + +### Link Parameters + +| Parameter | Default | CLI Flag | +|-----------|---------|----------| +| Per-link bandwidth | 128 GB/s | `--xgmi-bandwidth` | +| Per-hop latency | 100 ns | `--xgmi-latency` | +| Lanes per link | 16 | (SimObject param) | +| Max links per GPU | 7 | (SimObject param) | +| Flow-control credits | 32 | (SimObject param) | + +### Flow Control + +Credit-based back-pressure prevents data loss: + +1. Each link starts with N credits (default 32). +2. Sending a packet consumes one credit. +3. The receiver returns a credit upon packet acceptance. +4. When credits reach zero, the sender stalls (never drops). 
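The credit rules listed above can be sketched as a small counter per link direction. This is an illustrative sketch under stated assumptions, not the `XGMIBridge` code: the class name `CreditLinkSketch` and its methods are hypothetical, and only the default of 32 credits and the stall-instead-of-drop rule come from the text.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical credit counter for one direction of an xGMI link.
class CreditLinkSketch {
  public:
    explicit CreditLinkSketch(std::uint32_t initial = 32)
        : maxCredits(initial), credits(initial) {}

    // Returns true if a packet may be sent (one credit consumed).
    // Returns false when no credits remain: the sender stalls and
    // retries later; the packet is never dropped.
    bool trySend() {
        if (credits == 0)
            return false;
        --credits;
        return true;
    }

    // Called when the receiver accepts a packet and hands a credit back.
    void returnCredit() {
        assert(credits < maxCredits);  // cannot return more than was issued
        ++credits;
    }

    std::uint32_t available() const { return credits; }

  private:
    std::uint32_t maxCredits;
    std::uint32_t credits;
};
```

A sender would call `trySend()` before each packet and, on `false`, keep the packet queued until a `returnCredit()` arrives.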
+ +### Architecture Phases + +**Path A (Self-built xGMI model)**: + +- Single-process multi-GPU: in-process function calls between GPU models +- Multi-process 8-GPU hive: IPC transport via shared memory ring buffers or Unix sockets + +**Path B (SST Merlin integration)**: + +- Replace xGMI transport with SST Merlin network engine +- Three-layer synchronization: QEMU (functional) <-> gem5 (GPU timing) <-> SST (network timing) +- Supports arbitrary topologies (fat-tree, dragonfly) + +### Key Source Files + +- `gem5/src/dev/amdgpu/XGMIBridge.py` -- SimObject definition +- `gem5/src/dev/amdgpu/xgmi_bridge.hh` -- C++ header +- `gem5/src/dev/amdgpu/xgmi_bridge.cc` -- C++ implementation +- `gem5/configs/example/gpufs/mi300_cosim.py` -- Configuration and wiring + +--- + +## Design History and Key Decisions + +This section documents the key architectural decisions and critical bug-fix insights that shaped the co-simulation system. + +### Why vfio-user Over a Custom Protocol + +The initial implementation used a custom binary protocol over two Unix socket connections (one synchronous for MMIO, one asynchronous for events). This worked but required maintaining a custom QEMU PCI device (`mi300x_gem5.c`) and a custom protocol definition. + +The migration to vfio-user was driven by three factors: + +1. **No custom QEMU code**: Any stock QEMU 10.0+ build can connect to gem5 directly via the built-in `vfio-user-pci` device, eliminating the need to maintain a QEMU fork. +2. **Protocol standardization**: BAR mapping, configuration space, interrupts, and DMA are all defined by the vfio-user specification, reducing the surface area for protocol bugs. +3. **Simpler deployment**: Users only need to build gem5 with libvfio-user support; QEMU is used as-is. + +Issues resolved during the vfio-user migration: + +- libvfio-user's BAR size field was `uint32_t`, unable to represent 16 GiB VRAM -- changed to `uint64_t`. 
+- The upper half of 64-bit BARs must return the high 32 bits of the size mask during PCI BAR size probing. +- PCI Express and MSI-X capabilities must be registered before `vfu_realize_ctx()`. +- SDMA ring test timeout: `sdma_delay=1e9` caused ~500 ms wall-clock delay, exceeding the driver's ~200 ms timeout window -- reduced to 1000 and increased `KEEPALIVE_INTERVAL` to `1e9`. + +### Why Q35 + KVM + +The co-simulation uses QEMU's Q35 machine type with KVM acceleration: + +- **KVM**: Runs the guest CPU at near-native speed. A full Linux boot + driver loading completes in under a minute, compared to 10+ minutes under gem5's full-system mode. This dramatically reduces the debug cycle time. +- **Q35**: Provides a modern PCIe-capable chipset that supports 64-bit BARs (required for the 16 GiB VRAM BAR) and MSI-X interrupts. +- **StubWorkload on gem5**: gem5 runs no kernel of its own. It starts a minimal event loop and waits for MMIO requests from QEMU. This avoids dual-kernel complexity and focuses gem5 purely on GPU modeling. + +### Shared Memory Design + +The decision to use two separate POSIX shared memory files (`/dev/shm/cosim-guest-ram` and `/dev/shm/mi300x-vram`) rather than a single unified memory was driven by the fundamentally different nature of the two memory regions: + +- **Guest RAM** must be the backing store for QEMU's `memory-backend-file` (with `share=on`) and gem5's `PhysicalMemory` (via `shared_backstore`). The file layout must exactly replicate Q35's below-4G/above-4G memory split. +- **VRAM** is exposed to QEMU as BAR0 and to gem5 as device memory. It has its own internal layout (data area + GART page table) unrelated to guest physical address space. + +Combining them into one file would introduce complex offset arithmetic and coupling between two independent address spaces. 
+ +### SIGIO Edge-Triggered Drain + +gem5's `PollQueue` uses `FASYNC`/`SIGIO` for socket monitoring, which is edge-triggered: the kernel sends one `SIGIO` when the socket buffer transitions from empty to non-empty, and only one. + +The amdgpu driver frequently writes an INDEX register (selecting which internal register to access) then immediately reads the DATA register (getting the value). These two messages arrive back-to-back in gem5's socket buffer, but only one SIGIO fires. If the message handler reads only one message per invocation, the second message sits in the buffer with no signal to wake gem5. QEMU blocks waiting for the read response. Result: deadlock after 15 messages. + +The fix: a `do/while` drain loop with `poll(fd, POLLIN, 0)` that consumes all pending messages on each SIGIO arrival: + +```cpp +struct pollfd pfd; // declared outside the loop so the while condition can see it +do { + // read and process one message + ... + pfd = {fd, POLLIN, 0}; +} while (poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN)); +``` + +This issue only affects the legacy backend. The vfio-user backend uses libvfio-user's non-blocking poll mechanism. + +### GART Fallback Approach + +In standalone gem5 mode, GART entries are maintained in a hash map (`gartTable`), populated by `writeFrame()` and SDMA shadow copies. In co-simulation, the driver writes GART PTEs through QEMU's BAR0 mapping, going directly into shared VRAM without passing through gem5's `writeFrame()`. The hash map is empty. + +The co-simulation fallback reads PTEs directly from shared VRAM at `vramShmemPtr + ptBase`. When a PTE is zero (unmapped), the entry maps to a sink (`paddr=0`) instead of faulting. This prevents the endless `GenericPageTableFault` -> DMA retry loop that previously caused memory exhaustion and segfaults. + +Diagnostics confirmed that GART PTEs at `gartBase` (= `ptBase`) in shared VRAM were correctly populated by the driver. The first page (ptStart itself) is simply unmapped -- normal behavior -- while subsequent PTEs (offset 0x32E0+) contain valid entries. 
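The fallback described above can be sketched as a single bounds-checked read from the shared VRAM mapping. This is a hedged illustration rather than the code in `amdgpu_vm.cc`: the helper name `readSharedPte`, its parameters, and the bounds check are assumptions, and the sketch deliberately returns the raw 8-byte PTE without decoding address bits, since the PTE layout is documented elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: read one 8-byte GART PTE from the shared VRAM mapping.
// Returns false when the offset is out of range or the PTE is zero (unmapped);
// in both cases the caller would route the access to the sink (paddr = 0)
// instead of raising a page-table fault.
bool readSharedPte(const std::uint8_t *vramShmemPtr, std::size_t vramSize,
                   std::uint64_t ptBase, std::uint64_t pteByteOffset,
                   std::uint64_t &pteOut)
{
    assert(vramShmemPtr != nullptr);
    const std::uint64_t off = ptBase + pteByteOffset;
    // Assumed safety check: never read past the end of the mapping.
    if (off > vramSize || vramSize - off < sizeof(std::uint64_t))
        return false;
    std::uint64_t pte = 0;
    std::memcpy(&pte, vramShmemPtr + off, sizeof(pte));
    if (pte == 0)
        return false;  // unmapped page: use the sink, do not fault
    pteOut = pte;
    return true;
}
```

Treating a zero PTE as "use the sink" rather than an error is what breaks the fault-and-retry cycle: the DMA completes against a harmless address instead of being rescheduled indefinitely.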
+ +### VRAM Routing Discovery + +Address `0x1f72fa8000` triggered over 861,000 GART translation errors, memory exhaustion, and a segfault. The root cause: SDMA rptr writeback addresses and PM4 RELEASE_MEM destination addresses can point to VRAM (address < 16 GiB). When these addresses are fed through `getGARTAddr()`, the page number is multiplied by 8, and GART translation fails because VRAM addresses have no corresponding page table entries. + +The fix was a three-layer defense: + +1. **PM4 layer** (`pm4_packet_processor.cc`): `writeData()`, `releaseMem()`, `queryStatus()` check `isVRAMAddress(addr)` and route VRAM writes through `gpuDevice->getMemMgr()->writeRequest()` (device memory) instead of `dmaWriteVirt()` (system memory via GART). +2. **SDMA layer** (`sdma_engine.cc`): `setGfxRptrLo/Hi()` and rptr writeback skip `getGARTAddr()` for VRAM addresses, using `getMemMgr()->writeRequest()` instead. +3. **GART fallback** (`amdgpu_vm.cc`): `GARTTranslationGen::translate()` detects VRAM addresses by reversing the `getGARTAddr` transform (`orig_page = page_num >> 3`) and maps them to `paddr=0` as a sink instead of faulting. 
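The page-times-8 transform and the reversal used by the third layer can be written out explicitly. The sketch below assumes 4 KiB pages with the low 12 bits preserved across the transform; the function names `toGartAddr`, `fromGartAddr`, and `looksLikeVram` are hypothetical and only mirror the arithmetic described above, not gem5's actual signatures.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kPageShift = 12;                       // assumed 4 KiB pages
constexpr std::uint64_t kPageMask = (1ULL << kPageShift) - 1;  // low 12 bits

// Forward transform: multiply the page number by 8 (one 8-byte PTE per page),
// keeping the offset within the page.
constexpr std::uint64_t toGartAddr(std::uint64_t addr) {
    return (((addr >> kPageShift) << 3) << kPageShift) | (addr & kPageMask);
}

// Reverse transform: orig_page = page_num >> 3.
constexpr std::uint64_t fromGartAddr(std::uint64_t gartAddr) {
    return (((gartAddr >> kPageShift) >> 3) << kPageShift) | (gartAddr & kPageMask);
}

// Hypothetical check: does the pre-transform address fall inside VRAM?
inline bool looksLikeVram(std::uint64_t gartAddr, std::uint64_t vramSize) {
    assert(vramSize > 0);
    return fromGartAddr(gartAddr) < vramSize;
}
```

Under this arithmetic, reversing `0x1f72fa8000` gives `0x3ee5f5000`, which is below 16 GiB. That is consistent with the failing address being a VRAM offset that had already been passed through the transform.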
+ +--- + +## Key Source Files + +| File | Purpose | +|------|---------| +| `src/dev/amdgpu/mi300x_vfio_user.{cc,hh}` | vfio-user server SimObject (default backend) | +| `src/dev/amdgpu/mi300x_gem5_cosim.{cc,hh}` | Legacy socket bridge SimObject | +| `src/dev/amdgpu/cosim_bridge.hh` | Abstract CosimBridge interface | +| `src/dev/amdgpu/amdgpu_vm.{cc,hh}` | All translation generators (GART, AGP, MMHUB, User) | +| `src/dev/amdgpu/pm4_packet_processor.{cc,hh}` | PM4 DMA routing, VRAM detection, `getGARTAddr` | +| `src/dev/amdgpu/sdma_engine.{cc,hh}` | SDMA DMA routing, GART shadow copies | +| `src/dev/amdgpu/interrupt_handler.cc` | IH ring buffer DMA and interrupt delivery | +| `src/dev/amdgpu/amdgpu_device.cc` | Device-level `intrPost()`, `writeFrame()` | +| `src/dev/amdgpu/xgmi_bridge.{cc,hh}` | xGMI interconnect bridge | +| `configs/example/gpufs/mi300_cosim.py` | System config, memory setup, backend selection | +| `scripts/cosim_launch.sh` | Launch orchestration | diff --git a/docs/en/build-disk-china-mirror.md b/docs/en/build-disk-china-mirror.md deleted file mode 100644 index fc758bc..0000000 --- a/docs/en/build-disk-china-mirror.md +++ /dev/null @@ -1,26 +0,0 @@ -[中文](../zh/build-disk-china-mirror.md) - -# Disk Image Build Acceleration Patch (China Network) - -## Problem - -On a China-based direct connection, `./scripts/run_mi300x_fs.sh build-disk` -often hangs because `apt` inside the VM fetches packages from -`us.archive.ubuntu.com` (Packer reports `Timeout waiting for SSH`, or the -provisioner aborts while installing ROCm). - -## Apply the patch - -```bash -cd gem5-resources -git apply ../scripts/patches/0001-user-data-cn-mirror.patch -``` - -Revert: - -```bash -cd gem5-resources -git apply -R ../scripts/patches/0001-user-data-cn-mirror.patch -``` - -To use a different mirror, edit the URI in the patch and re-apply. 
diff --git a/docs/en/cosim-debugging-pitfalls.md b/docs/en/cosim-debugging-pitfalls.md deleted file mode 100644 index a74d4dc..0000000 --- a/docs/en/cosim-debugging-pitfalls.md +++ /dev/null @@ -1,186 +0,0 @@ -[中文](../zh/cosim-debugging-pitfalls.md) - -# MI300X Co-simulation: Debugging Pitfalls and Fixes - -This document records bugs encountered and fixed during the QEMU+gem5 MI300X co-simulation bringup process, including some non-obvious root cause analyses. - -## 1. SIGIO Coalescing Deadlock (handleClientData Single Read) - -> **Note**: This issue is specific to the legacy cosim backend (MI300XGem5Cosim). The vfio-user backend uses libvfio-user's non-blocking poll mechanism and does not use FASYNC/SIGIO. - -**Symptom**: The driver hangs on its first access to the PCIe INDEX2/DATA2 register pair. gem5 stops responding after processing approximately 15 messages. - -**Root Cause**: Linux FASYNC/SIGIO is **edge-triggered**. When QEMU sends a fire-and-forget MMIO write immediately followed by a blocking MMIO read, both messages may arrive before gem5's SIGIO handler fires. In this case, only one signal is delivered. The original `handleClientData()` read only one message per SIGIO, leaving the second message stranded forever. - -**Fix** (`mi300x_gem5_cosim.cc`): Changed `handleClientData()` to a drain loop that checks for more data using `poll(fd, POLLIN, 0)` after processing each message: - -```cpp -void MI300XGem5Cosim::handleClientData(int fd) { - struct pollfd pfd; - do { - CosimMsgHeader msg; - if (!recvAll(fd, &msg, COSIM_MSG_HDR_SIZE)) { - closeClient(fd); return; - } - processMessage(fd, msg); - pfd = {fd, POLLIN, 0}; - } while (poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN)); -} -``` - -**Lesson**: Any FASYNC-based I/O handler must drain all pending data rather than reading just one message. This pattern (write + read coalescing) is common in PCIe indirect register access. - ---- - -## 2. 
ip_block_mask Uses Discovery Order, Not Type Enum Values - -**Symptom**: `PSP load tmr failed!`, `hw_init of IP block failed -22`, `Fatal error during GPU init`. - -**Root Cause**: The ROCm 7.0 DKMS driver (`amdgpu_device.c:2807`) checks `(amdgpu_ip_block_mask & (1 << i))`, where `i` is the **discovery order index**, not the `amd_ip_block_type` enum value. - -MI300X discovery order (from dmesg): - -| Index | IP Block | Bit in Mask | -|-------|-----------------|-------------| -| 0 | soc15_common | 0x01 | -| 1 | gmc_v9_0 | 0x02 | -| 2 | vega20_ih | 0x04 | -| 3 | psp | 0x08 | -| 4 | smu | 0x10 | -| 5 | gfx_v9_4_3 | 0x20 | -| 6 | sdma_v4_4_2 | 0x40 | -| 7 | vcn_v4_0_3 | 0x80 | -| 8 | jpeg_v4_0_3 | 0x100 | - -**Fix**: Changed `ip_block_mask` from `0x6f` to `0x67`: -- `0x6f` = `0110_1111` -- enables common, gmc, ih, **psp**, gfx, sdma -- `0x67` = `0110_0111` -- enables common, gmc, ih, gfx, sdma (disables psp at index 3 and smu at index 4) - -**Pitfall**: The `amd_ip_block_type` enum in `amd_shared.h` shows PSP=4, but the actual mask bit for PSP is `(1 << 3)` because PSP is the third block discovered during IP discovery (index 3). The documentation and enum values are misleading. - ---- - -## 3. NULL Deref in amdgpu_atom_parse_data_header (Missing VGA ROM) - -**Symptom**: `modprobe amdgpu` causes kernel NULL pointer dereference at `amdgpu_atom_parse_data_header+0x1b`. Call chain: `amdgpu_ras_init → amdgpu_atomfirmware_mem_ecc_supported → amdgpu_atom_parse_data_header`. RAX=0 (NULL `atom_context`). 
- -**Root Cause**: The amdgpu driver's BIOS discovery chain has 5 methods, all of which fail in cosim mode: - -| Method | Why it fails | -|--------|-------------| -| `amdgpu_atrm_get_bios()` | No ACPI ATRM method in QEMU Q35 | -| `amdgpu_acpi_vfct_bios()` | No ACPI VFCT table | -| `amdgpu_read_bios_from_rom()` | Reads via SMU registers, but SMU is disabled by `ip_block_mask=0x67` | -| `amdgpu_read_platform_bios()` | No platform-provided ROM | -| `amdgpu_read_disabled_bios()` | Not functional in cosim | - -The driver logs `"Unable to locate a BIOS ROM"` and `"VBIOS image optional, proceeding"`, but the RAS init path unconditionally calls `amdgpu_atom_parse_data_header()` without checking for NULL `atom_context`. - -**Fix**: Write the VGA ROM to physical address `0xC0000` (shared memory) **before** `modprobe`: - -```bash -dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128 -modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2 -``` - -The ROM data at `0xC0000` is accessible by gem5 via `/dev/shm/cosim-guest-ram`. When the driver reads the ROM via SMU MMIO registers, gem5's `AMDGPUDevice::readROM()` reads from `system->getPhysMem()` at `VGA_ROM_DEFAULT + offset` and returns the ROM content through the cosim socket. - -**Pitfall**: QEMU's `romfile=` property loads the ROM into the PCI expansion ROM BAR, but the amdgpu driver does **not** read from the PCI ROM BAR directly -- it uses SMU register-based ROM access. The `romfile` alone is insufficient; the `dd` step is always required. - ---- - -## 5. PM4ReleaseMem.dataSelect Panic - -**Symptom**: gem5 panics with `Unimplemented PM4ReleaseMem.dataSelect`. - -**Root Cause**: `pm4_packet_processor.cc` only implemented `dataSelect == 1` (32-bit data write). The driver uses other modes during GFX initialization. 
- -**Fix**: Added handling for all common dataSelect values: - -| dataSelect | Behavior | -|------------|-----------------------------------------| -| 0 | No data written (event trigger only) | -| 1 | Write 32-bit value (already existed) | -| 2 | Write 64-bit value | -| 3 | Write 64-bit GPU clock counter | -| Other | Warn and treat as no-op | - ---- - -## 6. GART Table Not Populated in Co-simulation Mode - -**Symptom**: Massive `GART translation for X not found` warnings. PM4 processor reads all-zero memory (opcode 0x0). KIQ ring test times out. - -**Root Cause**: In co-simulation mode, QEMU's BAR2 (VRAM, 16GB) is backed by a shared memory file (`/dev/shm/mi300x-vram`). Driver writes to VRAM go directly into the shared file, **completely bypassing gem5's socket protocol**. gem5's `AMDGPUVM::gartTable` hash table is populated in `AMDGPUDevice::writeFrame()`, which only executes when writes go through gem5's memory system. Since VRAM writes bypass gem5, `gartTable` remains empty. - -> **Note**: This issue applies to both the legacy cosim and vfio-user backends, because in both architectures VRAM is passed through a shared memory file (`/dev/shm/mi300x-vram`), and driver writes to VRAM always bypass the gem5 memory system. - -**Fix** (`amdgpu_vm.cc` + `amdgpu_vm.hh`): Added a shared VRAM fallback in `GARTTranslationGen::translate()`: - -1. Added `vramShmemPtr` / `vramShmemSize` fields to `AMDGPUVM` -2. `MI300XGem5Cosim` sets these fields after mapping the shared VRAM -3. 
When `gartTable` misses, read the PTE directly from shared VRAM: - -```cpp -Addr gart_byte_offset = bits(range.vaddr, 63, 12); -Addr pte_vram_offset = (gartBase() - getFBBase()) + gart_byte_offset; -memcpy(&pte, vramShmemPtr + pte_vram_offset, sizeof(pte)); -``` - -**Key Detail**: `getGARTAddr()` (called before translate) already multiplies the page index by 8 to get a byte offset: -```cpp -addr = (((addr >> 12) << 3) << 12) | low_bits; // page_num *= 8 -``` -Therefore `bits(vaddr, 63, 12)` in the translate function is already the PTE's **byte offset**, not a page index. Multiplying by 8 again would cause the address to overshoot 8x into the GART table. - -**Architecture Note**: The "expansion formula" in the original translate code (`gart_addr += lsb * 7`) is effectively a no-op for addresses processed by `getGARTAddr()`, because `lsb = (page_num * 8) & 7 = 0` (`page_num * 8` is always 8-aligned, so the lower 3 bits are always zero). - ---- - -## 7. SDMA Ring Test Timeout (sdma_delay Timing Issue) - -**Symptom**: SDMA ring test returns `-110` (`-ETIMEDOUT`) during driver initialization. - -**Root Cause**: The `sdma_delay` parameter in gem5's `sdma_engine.hh` defaults to `1e9` ticks. In co-simulation mode, the ratio between gem5's simulation clock and wall-clock time causes `1e9` ticks to correspond to approximately 500ms of real delay. The amdgpu driver's SDMA ring test timeout threshold is approximately 200ms, far shorter than this delay. - -Detailed flow: -1. The driver writes to the SDMA ring buffer and rings the doorbell -2. gem5 receives the doorbell and schedules the SDMA processing event with a delay of `sdma_delay` ticks -3. Due to the excessive delay, the driver times out before gem5 completes processing -4. 
The driver reports `sdma v4_4_2: ring 0 test failed (-110)` - -**Fix**: -- Reduced `sdma_delay` from `1e9` to `1000` ticks (`sdma_engine.hh`) -- Increased the cosim `KEEPALIVE_INTERVAL` to `1e9` to prevent keepalive messages from interfering with timing - -**Lesson**: Timing parameters in co-simulation mode cannot be directly reused from standalone simulation defaults. The ratio difference between gem5's simulation clock and wall-clock time amplifies or reduces delay effects. - ---- - -## General Notes on Co-simulation Architecture - -### Operations That Bypass the Communication Protocol - -**Legacy backend (custom socket protocol):** - -| Resource | QEMU BAR | gem5 BAR | Via Socket? | Via Shared Memory? | -|------------------|----------|----------|-------------|--------------------| -| MMIO Registers | BAR0 | BAR5 | Yes | No | -| VRAM (16GB) | BAR2 | BAR0 | **No** | Yes | -| Doorbells | BAR4 | BAR2 | Yes | No | - -**vfio-user backend (standard vfio-user protocol):** - -| Resource | QEMU Mapping Method | gem5 Side | Via vfio-user? | Via Shared Memory? | -|------------------|---------------------------|----------------|----------------|--------------------| -| MMIO Registers | vfio-user region callback | BAR5 | Yes | No | -| VRAM (16GB) | vfio-user DMA region | BAR0 | **No** | Yes | -| Doorbells | vfio-user region callback | BAR2 | Yes | No | - -> **Note**: With the vfio-user backend, QEMU uses its built-in `vfio-user-pci` device. No custom QEMU device code is needed. QEMU maps all BARs through the vfio-user protocol: BAR0 (VRAM) is mapped via DMA region, BAR2 (doorbell) and BAR5 (MMIO) use vfio-user region callbacks. - -Any gem5 data structure populated by intercepting VRAM writes (such as `gartTable`, page tables, ring buffers) will **not** be populated in co-simulation mode. These structures require explicit fallback mechanisms to read data from the shared VRAM. This limitation applies to both backends. 
- -### Guest Must Be Rebooted After Driver Load Failure - -After a driver `hw_init` failure, executing `rmmod amdgpu` causes a kernel oops (page fault in `kgd2kfd_device_exit`). The module gets stuck in a "busy" state and cannot be reloaded. The only workaround is to restart the entire co-simulation environment (kill QEMU, restart the gem5 Docker container, restart QEMU). diff --git a/docs/en/cosim-dev-story.md b/docs/en/cosim-dev-story.md deleted file mode 100644 index 3bc42fa..0000000 --- a/docs/en/cosim-dev-story.md +++ /dev/null @@ -1,343 +0,0 @@ -[中文](../zh/cosim-dev-story.md) - -# Two Days, Two Submodules, Fifteen Bugs: How I Used Claude to Bring a $14,000 MI300X GPU into QEMU - -> AMD Instinct MI300X: 304 compute units, 192GB HBM3, retail price over $14,000 per card. -> Now, all you need is an ordinary x86 Linux machine to run full ROCm/HIP workloads on QEMU. - -## 01 -Origin: When Boot Time Outlasts Debugging - -I've been working on GPU simulators for a while. gem5 has a device model for the MI300X and supports full-system simulation, but its KVM fast-forward mode is still slow -- a Linux boot takes 5 minutes, driver loading takes another 5, and every time you debug an MMIO register issue, you're staring at a 10-minute blank wait. - -I'd been wanting to do something: let QEMU run Linux and the amdgpu driver, while gem5 handles only the GPU compute model, bridged by some IPC mechanism. That way, QEMU uses KVM for the CPU part at near-native speed, and gem5 only processes GPU MMIO/Doorbell/DMA, focusing on simulation accuracy. - -The idea sounds straightforward, but in practice it touches QEMU PCIe device models, gem5 SimObject architecture, Linux amdgpu driver initialization flow, GART address translation, shared memory file offset alignment, and Unix domain socket edge-triggered semantics -- and every intersection of these is a pitfall. - -On the morning of March 6, 2026, I opened Claude Code and started this project. 
By the early hours of March 8, the first HIP vector addition test printed `PASSED!` in the co-simulation environment. - -This article documents the pitfalls encountered and key decisions made throughout the process. - ---- - -## 02 -Architecture: The One-Liner Version - -``` -+-----------------------------+ +----------------------------+ -| QEMU (Q35 + KVM) | | gem5 (Docker) | -| +-----------------------+ | | +----------------------+ | -| | Guest Linux | | | | MI300X GPU Model | | -| | amdgpu driver | | | | Shader / CU / SDMA | | -| | ROCm 7.0 / HIP | | | | PM4 / Ruby caches | | -| +----------+------------+ | | +---------+------------+ | -| | | | | | -| +----------v------------+ | | +---------v------------+ | -| | vfio-user-pci (built- |<-------->| | MI300XVfioUser | | -| | in) | |vfio- | +----------------------+ | -| +-----------------------+ |user | | -| |socket | | -+-----------------------------+ +----------------------------+ - | | - v v - /dev/shm/cosim-guest-ram /dev/shm/mi300x-vram - (shared guest RAM) (shared GPU VRAM) -``` - -> **Backend selection**: The default is the vfio-user backend (`MI300XVfioUser`), where QEMU uses its built-in `vfio-user-pci` device with no custom QEMU code required. The legacy backend (`MI300XGem5Cosim` + custom `mi300x-gem5` QEMU device) is also supported via `--cosim-backend=legacy`. - -On the QEMU side, there's a full Q35 virtual machine running Ubuntu 24.04 + ROCm 7.0 + amdgpu driver. The vfio-user backend uses QEMU's built-in `vfio-user-pci` device, which forwards all MMIO reads/writes and doorbell writes to gem5 via the standard vfio-user protocol. - -On the gem5 side, it runs the MI300X GPU device model -- Shader, CU arrays, PM4 command processor, SDMA engines, Ruby cache hierarchy -- but **no Linux kernel**. It starts with a `StubWorkload` shell and just waits for MMIO requests from QEMU over the socket. 
- -Guest physical memory and GPU VRAM are each backed by a shared memory file under `/dev/shm/` that both QEMU and gem5 mmap directly, achieving zero-copy DMA. - -The BAR layout must strictly match the amdgpu driver's hardcoded expectations: - -| BAR | Content | Size | Communication | -|-----|---------|------|---------------| -| BAR0+1 | VRAM | 16 GiB | Shared memory | -| BAR2+3 | Doorbell | 4 MiB | Socket forwarding | -| BAR4 | MSI-X | 256 vectors | QEMU local | -| BAR5 | MMIO registers | 512 KiB | Socket forwarding | - ---- - -## 03 -First Steps: Writing a PCIe Device from Scratch - -At 6:30 AM on March 6, I had Claude help me write the QEMU-side `mi300x_gem5.c`. It's a standard QEMU PCIe device, but with several special aspects: - -1. **Six BARs**, three of which need 64-bit address space (16GB VRAM can't fit below 4G) -2. **Two socket connections**: one synchronous (MMIO request/response), one asynchronous (interrupts and DMA events) -3. **MSI-X support**: 256 interrupt vectors, gem5 notifies QEMU via the event socket to trigger `msix_notify()` - -The gem5-side `MI300XGem5Cosim` SimObject is slightly more complex -- it's a socket server that listens for QEMU connections, dispatches received MMIO messages to `AMDGPUDevice` for processing, and sends results back. - -The first version was about 1,500 lines (QEMU 700 + gem5 800), clean in structure but full of bugs. - ---- - -## 04 -Pitfalls: From SIGIO Deadlock to GART Translation - -### Bug #1: SIGIO Edge-Triggered Deadlock -- The Most Insidious Problem - -gem5's event system uses `FASYNC`/`SIGIO` to monitor socket data. This is **edge-triggered** -- when the socket buffer transitions from empty to non-empty, the kernel sends one `SIGIO`, and only one. - -The problem lies in the amdgpu driver's register access pattern. The driver frequently writes an INDEX register (selecting which internal register to access), then immediately reads the DATA register (getting the value).
The write is fire-and-forget, the read blocks waiting for a response. When these two messages arrive back-to-back in gem5's socket buffer, only one SIGIO fires. - -My initial `handleClientData()` read only one message per invocation. Result: gem5 reads the write message, processes it, then waits for the next SIGIO. But the read message is already in the buffer, and no new SIGIO will come to wake it up. QEMU blocks waiting for the read response. **Perfect deadlock.** - -gem5 processed 15 messages and then hung forever. - -The fix was simple -- change single-read to a drain loop: - -```cpp -void MI300XGem5Cosim::handleClientData(int fd) { - struct pollfd pfd; - do { - CosimMsgHeader msg; - if (!recvAll(fd, &msg, COSIM_MSG_HDR_SIZE)) { - closeClient(fd); return; - } - processMessage(fd, msg); - pfd = {fd, POLLIN, 0}; - } while (poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN)); -} -``` - -After this fix, MMIO message count jumped from 15 to **35,181**. Driver initialization pushed all the way to the PSP firmware loading stage. - -**Lesson: Any FASYNC-based I/O handler must drain all pending data. This is inevitable in PCIe indirect register access scenarios.** - -### Bug #2: ip_block_mask -- The Documentation Lies - -The amdgpu driver has an `ip_block_mask` parameter controlling which IP blocks to initialize. In cosim mode, PSP (security processor) and SMU (power management) aren't needed and must be disabled. - -I initially used `0x6f`, thinking I'd disabled PSP (enum value 4) while keeping everything else. But PSP was still being initialized, firmware loading failed with `-EINVAL`, and the entire GPU init failed. - -It took a while to figure out: `ip_block_mask` bits correspond to the **IP discovery detection order index**, not the `amd_ip_block_type` enum values. MI300X's detection order is: - -``` -0: soc15_common 1: gmc_v9_0 2: vega20_ih -3: psp 4: smu 5: gfx_v9_4_3 -6: sdma_v4_4_2 7: vcn_v4_0_3 8: jpeg_v4_0_3 -``` - -PSP is 4 in the enum but 3 in detection order. 
`0x6f` = `0110_1111` disables index 4 (smu), but index 3 (psp) remains enabled. The correct value is `0x67` = `0110_0111`, disabling both index 3 and 4. - -**Lesson: There's no correspondence between the enum values in amd_shared.h and the actual bitmask the driver uses. Only the dmesg detection log tells the truth.** - -### Bug #3: Shared Memory Offset -- Two Systems Disagree on Memory Layout - -This bug was the most bizarre. GART page table entries read back as all zeros, the PM4 command processor kept reading opcode 0x0 (NOP) in an infinite loop. - -The issue was a disagreement between QEMU Q35 and gem5 on memory splitting. With 8GB RAM configured: - -- **QEMU Q35** hardcodes `below_4g = 2 GiB` (when `ram_size >= 0xB0000000`), placing the upper 6GB at file offset 2G -- **gem5** defaults to `below_4g = 3 GiB`, placing the upper 5GB at file offset 3G - -Both sides mmap the same shared memory file, but disagree on "where above-4G memory sits in the file." gem5 reads GART page tables from offset 3G -- which is all zeros, because QEMU wrote the data at offset 2G. - -Fix: Replicate Q35's split logic exactly in `mi300_cosim.py`. - -**Lesson: When sharing a memory-backend-file, both parties must agree on file offsets for every range, not just total size.** - -### Bug #4: VRAM Addresses Incorrectly Routed Through GART Translation - -PM4's `RELEASE_MEM` and SDMA's rptr write-back sometimes target VRAM addresses (address < 16 GiB). The original code fed all addresses through `getGARTAddr()` for translation, but VRAM addresses have no corresponding GART page table entries. Translation failed 861,000+ times, eventually exhausting memory and segfaulting. - -The fix used three layers of defense: - -1. **PM4 layer**: `writeData()` / `releaseMem()` check `isVRAMAddress(addr)`, routing VRAM writes directly to device memory -2. **SDMA layer**: rptr write-back skips `getGARTAddr()` for VRAM addresses -3. 
**GART fallback**: Unmapped GART pages map to `paddr=0` (sink) instead of faulting - ---- - -## 05 -Validation: HIP Vector Addition PASSED - -Early morning of March 8. All bugs fixed, driver loading normal, `rocm-smi` sees MI300X (0x74a0), `rocminfo` reports gfx942 architecture with 320 CUs. - -In the guest, I wrote the simplest HIP test -- four-element vector addition: - -```cpp -__global__ void add(int *a, int *b, int *c, int n) { - int i = blockIdx.x * blockDim.x + threadIdx.x; - if (i < n) c[i] = a[i] + b[i]; -} -``` - -Compile, run: - -``` -Result: 11 22 33 44 -PASSED! -``` - -`{1+10, 2+20, 3+30, 4+40}` = `{11, 22, 33, 44}`. hipMalloc, hipMemcpy (host-to-device / device-to-host), kernel dispatch, hipDeviceSynchronize all returned normally. MSI-X interrupts were forwarded from gem5 through the event socket to QEMU, QEMU triggered `msix_notify()`, and the guest IH handler processed them correctly -- the entire interrupt chain ran end-to-end for the first time. - -This shows gem5 serving as a "remote GPU", driven by a real amdgpu driver inside a QEMU guest, for actual computation. - ---- - -## 06 -Collaboration: Not a Code Tool, but a Systems Partner - -The entire development happened in one massive conversation session, resumed as context ran out. The workflow was: - -1. **I provide raw terminal output**: dmesg logs, gem5 panic messages, socket communication hexdumps -2. **Claude analyzes the output**, searches gem5/QEMU/Linux kernel source code to locate root causes -3. **Claude proposes and implements fixes** -- directly editing gem5 C++ code, QEMU C code, Python configs, shell scripts -4. **Background builds**: gem5 compilation ~30 min, QEMU ~5 min, disk image ~40 min -- all running in the background -5. **I test and post new output**, cycle continues - -Claude's role in this project wasn't "a tool that writes code for me," but more like **a collaborator with deep understanding of gem5 and QEMU internals**.
A few typical scenarios: - -- **SIGIO deadlock**: I only posted "gem5 hangs after 15 messages," Claude immediately identified the FASYNC edge-triggered semantics and proposed the drain loop -- **ip_block_mask**: I posted the dmesg IP discovery log, Claude directly mapped out the detection order vs. bitmask mismatch -- **GART translation**: Claude traced the `getGARTAddr()` multiply-by-8 transformation through gem5 source code, discovering VRAM addresses being misdirected into the GART path -- **Q35 memory split**: Claude dug out the hardcoded 2GiB boundary at `qemu/hw/i386/pc_q35.c:161` and compared it with gem5's 3GiB default - -Throughout the process, 15 blocking bugs were resolved one by one. Each fix was built on accurate understanding of underlying system behavior -- not trial and error, but root cause analysis. - ---- - -## 07 -Memory: Knowledge Persistence Across Sessions - -Development of this project spanned multiple conversation sessions -- Claude Code's context window is finite, and when a marathon debugging session exhausts the context, a new conversation must pick up where the old one left off. This raises a critical question: how does the new conversation know what was already done, which bugs are fixed, and which are still in progress? - -The answer is Claude's auto memory system. Under `~/.claude/projects/`, Claude automatically maintains a set of memory files that record key information across sessions. This project had three memory files: - -1. **MEMORY.md** (main memory, 43 lines): project structure, gem5 runtime configuration (Docker image names, build flags, Python versions), DRM Client -13 crash fix record, overall co-simulation status -2. **cosim-details.md** (architecture details, 69 lines): complete BAR layout, summaries of 8 key fixes, gem5/QEMU launch commands, precise GART page table parameters (ptBase, fbBase, PTE format) -3. 
**cosim-debugging.md** (debugging progress, 63 lines): file locations and root causes for each bug, fix status (including intermediate states like "partially fixed"), current blockers - -These memory files played several critical roles during actual development: - -**Avoiding repeated diagnosis.** When a new session began, Claude didn't need to re-analyze the entire codebase to understand the project state. The memory files recorded that "SIGIO deadlock is fixed, ip_block_mask changed to 0x67, GART fallback implemented," allowing work to resume exactly where it left off. - -**Maintaining environment consistency.** gem5 must be built and run inside a specific Docker image (`ghcr.io/gem5/gpu-fs:latest`), QEMU's serial parameters can't be mixed with `-nographic`, disk images need packer with specific flags -- these environment details were scattered across different sessions but unified in the memory files. New sessions wouldn't waste time using wrong Docker images or build parameters. - -**Tracking incremental progress.** Debugging isn't linear. The GART translation fix went through a "partially fixed" to "fully fixed" progression -- the memory files faithfully recorded this intermediate state, preventing new sessions from mistakenly assuming the problem was fully resolved and skipping verification. - -**Cross-codebase associative indexing.** The memory files recorded key file paths (`mi300x_gem5_cosim.cc`, `amdgpu_vm.cc`, `mi300_cosim.py`), key constants (`ptBase=0x3EE600000`, `fbBase=0x8000000000`), and key formulas (`getGARTAddr()`'s multiply-by-8 transformation). This information was scattered across three different codebases; the memory system consolidated it into an efficient associative index. - -If Claude's value in a single conversation is "rapid root cause identification," then the memory system's value is **making that capability persist across sessions**. 
Without the memory system, every resumed conversation needed 10-15 minutes to rebuild context; with it, a new session could return to the previous working state in seconds. - ---- - -## 08 -Results: What Two Days Delivered - -| Metric | Data | -|--------|------| -| Development time | ~24 hours (Mar 6 06:30 - Mar 8 06:00) | -| New code | ~2,500 lines (gem5 C++ ~800, QEMU C ~700, Python config ~200, shell scripts ~800) | -| Blocking bugs resolved | 15 | -| Technical documentation | 6 articles (bilingual zh+en, ~2,000 lines total) | -| Git commits | 16 (cosim main repo) | -| MMIO operations | 65,000+ without crashes | -| HIP compute test | PASSED | -| vfio-user migration | Completed (Mar 9), vector_add / transpose / gemm all PASSED | - -The final system supports: - -- **Full amdgpu driver loading**: DRM initialized, 7 XCP partitions, gfx942 architecture -- **ROCm toolchain**: rocm-smi, rocminfo working normally -- **HIP GPU compute**: hipMalloc, kernel dispatch, hipDeviceSynchronize -- **MSI-X interrupt forwarding**: gem5 to QEMU event notification -- **Shared memory DMA**: zero-copy VRAM + Guest RAM -- **vfio-user backend**: standard protocol, no custom QEMU code needed -- **One-click launch**: `./scripts/cosim_launch.sh` - ---- - -## 08.5 -vfio-user Migration: From Custom Protocol to Industry Standard - -After validating end-to-end feasibility with the custom socket protocol in the initial version, we migrated the QEMU-gem5 communication to the standard vfio-user protocol on March 9. - -vfio-user is a standard protocol for exposing PCI devices to remote processes (QEMU 10.0+ includes a built-in `vfio-user-pci` client). On the gem5 side, Nutanix's libvfio-user library serves as the server. 
This means: - -- **No custom QEMU code required**: any QEMU with vfio-user support can connect to gem5 directly -- **Protocol standardization**: BAR mapping, configuration space, interrupts, and DMA are all defined by the vfio-user specification -- **Simpler deployment**: no need to maintain a QEMU fork - -Several key issues were resolved during the migration: -- libvfio-user's BAR size field was `uint32_t`, unable to represent 16GB VRAM → changed to `uint64_t` -- The upper half of 64-bit BARs must return the high 32 bits of the size mask during size probing -- PCI Express and MSI-X capabilities must be registered before `vfu_realize_ctx()` -- SDMA ring test timeout: `sdma_delay=1e9` caused ~500ms wall-clock delay → reduced to 1000 - -Post-migration test results: vector_add (120ms), transpose (6.5s), gemm (4.7s) all PASSED. - ---- - -## 09 -Significance: A $14,000 GPU Within Reach - -The MI300X is AMD's most powerful data center GPU, priced over $14,000 per card -- ordinary developers simply can't get their hands on one. But through QEMU + gem5 co-simulation, you can, on any x86 Linux machine: - -- Run the full ROCm 7.0 software stack -- Compile and run HIP programs -- Perform performance analysis on a cycle-accurate GPU model -- Debug the amdgpu driver initialization flow -- Develop and validate new GPU architecture features - -All code is open source: [github.com/zevorn/cosim-gpu](https://github.com/zevorn/cosim-gpu) - -```bash -git clone --recurse-submodules git@github.com:zevorn/cosim-gpu.git -cd cosim-gpu -GEM5_BUILD_IMAGE=ghcr.io/gem5/gpu-fs:latest ./scripts/run_mi300x_fs.sh build-all -cd scripts && docker build -t gem5-run:local -f Dockerfile.run . && cd .. -./scripts/cosim_launch.sh -``` - ---- - -## 10 -Afterword: An Amplifier, Not a Replacement - -Some might ask: "Can code written in two days be reliable?" - -Honestly, without Claude, this project would have taken at least two weeks.
Not because of the code volume -- 2,500 lines isn't much for a PCIe device bridge -- but because the debugging process requires simultaneously understanding the internal behavior of three systems: QEMU's Q35 memory layout, gem5's event-driven I/O model, and the Linux amdgpu driver's IP block initialization sequence. Misunderstanding any single aspect means hours in a debugging black hole. - -Claude's value isn't in writing code for me, but in **dramatically shortening the time from "seeing a symptom" to "understanding the root cause."** When I paste a segment of dmesg output, Claude can correlate it in seconds to specific functions in gem5 source code and hardcoded constants in QEMU -- this kind of cross-codebase correlation analysis simply can't be done at human speed by manually reading source code. - -Of course, Claude isn't omnipotent. All testing was done by me, all architectural decisions were mine (like choosing two socket connections instead of one, choosing StubWorkload instead of full-system boot), and all final verification required confirmation in the real environment. AI is an amplifier, not a replacement. - -But this amplifier is genuinely powerful. Two days, one person, one AI, and a $14,000 GPU was brought into QEMU. 
- ---- - -## References - -**Project & Source Code** - -- [cosim-gpu](https://github.com/zevorn/cosim-gpu) — This project repository (with gem5, QEMU, gem5-resources submodules) -- [Complete Usage Guide](cosim-usage-guide.md) — Build, run, and test walkthrough -- [Technical Notes](cosim-technical-notes.md) — Architecture, pitfalls, and fixes -- [MI300X Memory Management](mi300x-memory-management.md) — GART, address translation, memory mapping -- [Guest GPU Init](cosim-guest-gpu-init.md) — Driver loading and device initialization -- [Debugging Pitfalls](cosim-debugging-pitfalls.md) — Common issues and solutions - -**Upstream Projects** - -- [gem5](https://www.gem5.org/) — Modular computer architecture simulator -- [QEMU](https://www.qemu.org/) — Open-source machine emulator and virtualizer -- [ROCm](https://rocm.docs.amd.com/) — AMD open-source GPU computing platform -- [AMD Instinct MI300X](https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html) — Product specifications -- [libvfio-user](https://github.com/nutanix/libvfio-user) — vfio-user protocol server library - -**Development Tools** - -- [Claude Code](https://docs.anthropic.com/en/docs/claude-code) — Anthropic's CLI programming assistant - ---- - -*Zewen, March 2026* diff --git a/docs/en/cosim-guest-gpu-init.md b/docs/en/cosim-guest-gpu-init.md deleted file mode 100644 index 7b5c583..0000000 --- a/docs/en/cosim-guest-gpu-init.md +++ /dev/null @@ -1,158 +0,0 @@ -[中文](../zh/cosim-guest-gpu-init.md) - -# MI300X Co-simulation: Guest GPU Initialization Guide - -## Overview - -The MI300X GPU driver can be loaded **automatically** or **manually** after the QEMU guest boots. The disk image includes a systemd service (`cosim-gpu-setup.service`) that handles the full initialization sequence at boot time. - -All required files (ROM, firmware, kernel modules) are already included in the disk image. 
- -## Automatic Loading (Default) - -The disk image ships with `cosim-gpu-setup.service`, which runs at boot and performs: - -1. `dd` the VGA ROM to `0xC0000` (required for gem5's `readROM()` via shared memory) -2. Symlink IP discovery firmware -3. `modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2` - -The service completes in ~40 seconds. After guest login, GPU is ready: - -```bash -rocm-smi # should show device 0x74a0 -rocminfo # should show gfx942 -``` - -The service file: - -```ini -# /etc/systemd/system/cosim-gpu-setup.service -[Unit] -Description=MI300X GPU Setup for Co-simulation -After=local-fs.target -Before=multi-user.target - -[Service] -Type=oneshot -RemainAfterExit=yes -ExecStart=/usr/local/bin/cosim-gpu-setup.sh - -[Install] -WantedBy=multi-user.target -``` - -> **Note:** `modprobe.blacklist=amdgpu` must remain in the kernel command line to prevent the PCI subsystem from auto-loading the driver before the ROM is written to shared memory. The systemd service handles the explicit `modprobe` after `dd`. - -## Manual Loading - -If the systemd service is not installed, run these commands manually after guest boot. - -### Prerequisites - -- `cosim_launch.sh` is running (gem5 + QEMU are connected) -- The guest has booted and you have a root shell -- `modprobe.blacklist=amdgpu` was passed on the kernel command line - -### Quick Reference (Copy-Paste Ready) - -```bash -dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128 -ln -sf /usr/lib/firmware/amdgpu/mi300_discovery /usr/lib/firmware/amdgpu/ip_discovery.bin -modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2 -``` - -## Detailed Steps - -### Step 1: Load the VGA BIOS ROM - -```bash -dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128 -``` - -**What it does**: Writes the MI300X VBIOS ROM image to the legacy VGA ROM region at physical address `0xC0000` (768 KB). 
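
The `bs`, `seek`, and `count` values in the `dd` command above can be sanity-checked with a little arithmetic. The sketch below is only an illustration: `dd_target_range` is a made-up helper name, not something shipped with the project. It converts the block parameters into the physical byte range that `dd` writes.

```python
def dd_target_range(bs, seek, count):
    """Return (first_byte, last_byte) covered by a dd write.

    bs    -- block size in bytes (dd's bs=1k means 1024)
    seek  -- number of output blocks skipped before writing
    count -- number of blocks written
    """
    start = seek * bs
    end = start + count * bs - 1
    return start, end


# Values taken from the dd command in Step 1.
start, end = dd_target_range(bs=1024, seek=768, count=128)
print(f"ROM window: {start:#x} .. {end:#x} ({(end - start + 1) // 1024} KiB)")
```

If the printed window does not start at `0xc0000`, the `seek` value does not match the legacy VGA ROM region described in this step.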
- -**Why it is needed**: The amdgpu driver reads the VBIOS from the legacy VGA ROM space (`0xC0000--0xDFFFF`, 128 KB) during initialization. The QEMU co-simulation device registers as `PCI_CLASS_DISPLAY_VGA`, so the kernel recognizes that address range as "shadowed ROM". Without the ROM, the driver will report `"Unable to locate a BIOS ROM"`. - -**Parameter description**: -| Parameter | Value | Meaning | -|-----------|-------|---------| -| `if` | `/root/roms/mi300.rom` | ROM binary file (in the disk image) | -| `of` | `/dev/mem` | Physical memory device | -| `bs` | `1k` | Block size = 1024 bytes | -| `seek` | `768` | Seek to 768 x 1024 = `0xC0000` | -| `count` | `128` | Write 128 x 1024 = 128 KB | - -### Step 2: Symlink the IP Discovery Firmware - -```bash -ln -sf /usr/lib/firmware/amdgpu/mi300_discovery \ - /usr/lib/firmware/amdgpu/ip_discovery.bin -``` - -**What it does**: Points the driver's IP discovery firmware path to the MI300X-specific discovery binary. - -**Why it is needed**: The amdgpu driver uses `discovery=2` mode, which reads GPU IP block information from a firmware file on disk rather than from the GPU's own ROM/registers. The gem5 GPU model provides this file via its `ipt_binary` parameter (empty string = use on-disk firmware). The driver looks for `/usr/lib/firmware/amdgpu/ip_discovery.bin`, which must point to the MI300X-specific file. - -**Note**: Both files are already included in the disk image; this command only creates the correct symlink. If `mi300_discovery` does not exist, the driver will fall back to built-in defaults (which may not match MI300X). - -### Step 3: Load the amdgpu Kernel Module - -```bash -modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2 -``` - -**What it does**: Loads the amdgpu driver with co-simulation parameters. 
- -**amdgpu module parameters**: - -| Parameter | Value | Meaning | -|-----------|-------|---------| -| `ip_block_mask` | `0x67` | Disable PSP (bit 3) and SMU (bit 4); cosim does not model these | -| `ppfeaturemask` | `0` | Disable PowerPlay features; cosim has no power management hardware | -| `dpm` | `0` | Disable Dynamic Power Management | -| `audio` | `0` | Disable audio; no HDMI/DP audio in cosim | -| `ras_enable` | `0` | Disable RAS — prevents NULL deref on `atom_context` when VBIOS is minimal | -| `discovery` | `2` | Use firmware file for IP discovery | - -> **Warning**: Using `ip_block_mask=0x6f` (only disables SMU) will cause PSP firmware load failure and kernel panic. Always use `0x67`. - -> **Warning**: The `dd` step (Step 1) is **mandatory** before `modprobe`. Without it, the driver's BIOS discovery chain fails (ACPI unavailable, SMU disabled), resulting in `"Unable to locate a BIOS ROM"` followed by a NULL pointer crash in `amdgpu_ras_init` → `amdgpu_atom_parse_data_header`. - -## Verification - -After completing step 3, check that the driver has loaded: - -```bash -# Check dmesg for amdgpu initialization -dmesg | grep -i amdgpu | tail -20 - -# Check PCI device -lspci | grep -i amd - -# Check ROCm (if available) -rocm-smi -rocminfo | head -40 -``` - -**Expected results**: `dmesg` should show amdgpu initializing the GPU with no fatal errors. MMIO traffic should appear in the gem5 debug log. 
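
If PSP initialization still shows up in `dmesg`, the `ip_block_mask` value is the first thing to check. The following is a small sketch rather than part of the project's scripts. `DETECTION_ORDER` is an assumption: it copies the IP discovery order this setup reports in dmesg (PSP at index 3, SMU at index 4) and should be re-checked against your own log. `disabled_blocks` is a hypothetical helper name.

```python
# Assumed IP discovery order for MI300X in this cosim setup (from dmesg);
# bit N of ip_block_mask corresponds to entry N of this list.
DETECTION_ORDER = [
    "soc15_common", "gmc_v9_0", "vega20_ih",
    "psp", "smu", "gfx_v9_4_3",
    "sdma_v4_4_2", "vcn_v4_0_3", "jpeg_v4_0_3",
]


def disabled_blocks(mask, order=DETECTION_ORDER):
    """List the IP blocks whose bit is cleared in the given mask."""
    return [name for bit, name in enumerate(order) if not (mask >> bit) & 1]


for mask in (0x6F, 0x67):
    print(f"{mask:#04x}: disabled = {disabled_blocks(mask)}")
```

Under this ordering, `0x6f` leaves `psp` enabled while `0x67` clears both `psp` and `smu`. Both values also leave bits 7 and 8 clear, so the list for each mask includes the VCN and JPEG entries as well.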
- -## Troubleshooting - -| Symptom | Cause | Fix | -|---------|-------|-----| -| `Unable to locate a BIOS ROM` + NULL deref crash | Step 1 (dd ROM) was not executed before modprobe | Run `dd` first; check `/root/roms/mi300.rom` exists | -| `insmod: ERROR: could not load module` | Kernel version mismatch | Rebuild the disk image with a matching kernel | -| `cosim-gpu-setup.service` failed | Check `journalctl -u cosim-gpu-setup` | Verify ROM file and module exist in disk image | -| MMIO reads all return zero | gem5 is not connected or has crashed | Check `docker logs gem5-cosim` | -| `probe failed with error -12` | BAR layout mismatch | Rebuild QEMU with the correct BAR5=MMIO layout | -| gem5 crashes with `schedule()` assertion | Timer event overflow | Ensure `disable_rtc_events` and `disable_timer_events` are set | - -## File Locations (Inside the Guest Disk Image) - -| File | Path | Source | -|------|------|--------| -| VGA BIOS ROM | `/root/roms/mi300.rom` | Built by Packer | -| IP Discovery firmware | `/usr/lib/firmware/amdgpu/mi300_discovery` | Built by Packer | -| Auto-load service | `/etc/systemd/system/cosim-gpu-setup.service` | Installed via `guestmount` | -| Auto-load script | `/usr/local/bin/cosim-gpu-setup.sh` | Installed via `guestmount` | -| amdgpu module | `/lib/modules/$(uname -r)/updates/dkms/amdgpu.ko.zst` | ROCm 7.0 DKMS | diff --git a/docs/en/cosim-memory-architecture.md b/docs/en/cosim-memory-architecture.md deleted file mode 100644 index aed2782..0000000 --- a/docs/en/cosim-memory-architecture.md +++ /dev/null @@ -1,405 +0,0 @@ -[中文](../zh/cosim-memory-architecture.md) - -# QEMU+gem5 Co-simulation: Memory Sharing Architecture - -## 1. Background - -In the QEMU+gem5 MI300X co-simulation, the GPU device model (gem5) and the host system (QEMU/KVM) run as two separate processes. The GPU needs to access two types of memory: - -- **VRAM** (local video memory): GPU-private, stores textures, buffers, GART page tables, etc. 
-- **GTT** (Graphics Translation Table / System Memory): Host physical memory regions mapped by the GPU, used for ring buffers, fences, IH cookies, DMA buffers, etc. - -Both types of memory must be shared between QEMU and gem5 — otherwise gem5 cannot read commands written by the driver, and QEMU cannot see results written back by the GPU. - -### Key Takeaway - -> **Both VRAM and Guest RAM (where GTT pages reside) are already shared via shared memory for bidirectional visibility.** -> The GART page table itself lives in VRAM and is also shared. gem5 reads GART PTEs directly from shared VRAM, then accesses Guest RAM directly through the Ruby memory system's shared backstore for DMA operations. - -## 2. Overall Architecture - -``` -+----------------------------+ +-----------------------------+ -| QEMU (Q35 + KVM) | | gem5 (Docker) | -| | | | -| Guest Linux | | MI300X GPU Model | -| amdgpu driver | | Shader / CU / SDMA | -| | | PM4 / IH / Ruby caches | -| +--------+ +---------+ | vfio-user (Unix) | +------------+ +--------+ | -| | BAR0 | | BAR5 |<---(MMIO/CFG/Doorbell)--->|MI300XVfio | |GPU core| | -| | (VRAM) | | (MMIO) | | | |User bridge | | | | -| +---+----+ +---------+ | | +-----+------+ +--------+ | -| | | | | | -+------+---------------------+ +--------+--------------------+ - | | - v v - /dev/shm/mi300x-vram (16 GiB) mmap same file - (VRAM: GPU data + GART page tables) (vramShmemPtr) - | | - v v - /dev/shm/cosim-guest-ram (8 GiB) mmap same file - (Guest RAM: ring buffers, fences, (system->getPhysMem()) - GTT pages, kernel/user data) -``` - -### 2.1 Three Sharing Channels - -| Channel | File/Socket | Size | Purpose | Access Method | -|---------|-------------|------|---------|---------------| -| VRAM Shared Memory | `/dev/shm/mi300x-vram` | 16 GiB | GPU VRAM + GART page tables | mmap (zero-copy) | -| Guest RAM Shared Memory | `/dev/shm/cosim-guest-ram` | 8 GiB | Host physical memory (GTT pages) | QEMU: mmap; gem5: Ruby memory system direct access to shared 
backstore | -| vfio-user Socket | `/tmp/gem5-mi300x.sock` | — | MMIO/config space/doorbell via vfio-user message passing; DMA via `vfu_dma_transfer()` or shared memory direct access; interrupts via irq_fd (eventfd -> KVM) | vfio-user protocol (single connection) | - -## 3. VRAM Sharing (BAR0) - -### 3.1 Initialization Flow - -**QEMU side** (vfio-user backend): - -QEMU uses the built-in `vfio-user-pci` device to connect to gem5's vfio-user server. BAR0 is exposed to QEMU through the vfio-user DMA region mapping mechanism — QEMU no longer directly opens the VRAM shared memory file, but instead obtains the BAR mapping through the vfio-user protocol. - -**gem5 side** (`mi300x_vfio_user.cc:setupVramShm`): - -```cpp -shmemFd = shm_open(shmemPath.c_str(), O_CREAT | O_RDWR, 0666); -ftruncate(shmemFd, vramSize); -shmemPtr = mmap(nullptr, vramSize, PROT_READ | PROT_WRITE, MAP_SHARED, shmemFd, 0); - -// Key: pass the shared pointer to the GART translator -gpuDevice->getVM().vramShmemPtr = (uint8_t *)shmemPtr; -gpuDevice->getVM().vramShmemSize = vramSize; -``` - -### 3.2 VRAM Content Layout - -``` -Offset 0x000000000 +------------------------------+ - | GPU Data Area | - | - hipMalloc allocations | - | - Kernel args, textures | - | - Driver internal allocs | - | | - | ... 
| - | | -Offset ~0x3EE600000 +------------------------------+ -(ptBase) | GART Page Table (PTEs) | - | 8 bytes per PTE | - | Maps GPU VA -> phys addr | -Offset 0x400000000 +------------------------------+ -(16 GiB) -``` - -### 3.3 Access Patterns - -| Scenario | Writer | Reader | Path | -|----------|--------|--------|------| -| GPU buffer allocation | Driver (via BAR0 write) | gem5 (via vramShmemPtr) | Shared memory direct access | -| GART PTE writes | Driver (via BAR0 write) | gem5 GART translator | memcpy from vramShmemPtr | -| IP Discovery table | gem5 initialization | Driver (via BAR0 read) | Shared memory direct access | - -**Zero-copy**: Since QEMU's BAR0 and gem5's `vramShmemPtr` both mmap the same `/dev/shm` file, data written by the driver to BAR0 is **immediately visible** to gem5 with no socket communication required. - -## 4. Guest RAM Sharing (GTT Pages) - -### 4.1 What GTT Really Is - -In AMD GPUs, **GTT = GART = Graphics Address Remapping Table**. It is a single-level page table (VMID 0) that maps GPU virtual addresses to host physical addresses. The host physical memory pages being mapped are the so-called "GTT pages." 
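
As a rough illustration of what "single-level page table" means here, the sketch below walks a toy GART: one 8-byte PTE per 4 KiB GPU page, with bit 0 as the valid flag, bit 1 as the system-memory flag, and bits 47:12 holding the physical page number. The table contents, the `make_pte` and `gart_lookup` names, and the plain `bytearray` standing in for shared VRAM are all assumptions made for the example. gem5's real translator also applies base offsets that are left out here.

```python
import struct

PAGE_SHIFT = 12                      # 4 KiB pages
PTE_SIZE = 8                         # one 64-bit entry per GPU page
PTE_VALID = 1 << 0                   # bit 0: entry is valid
PTE_SYSTEM = 1 << 1                  # bit 1: 1 = system memory, 0 = local VRAM
PAGE_OFFSET_MASK = (1 << PAGE_SHIFT) - 1
PFN_MASK = ((1 << 48) - 1) & ~PAGE_OFFSET_MASK   # bits 47:12


def make_pte(phys_addr, system):
    """Build a toy PTE from a page-aligned physical address."""
    flags = PTE_VALID | (PTE_SYSTEM if system else 0)
    return (phys_addr & PFN_MASK) | flags


def gart_lookup(table, gpu_va):
    """Translate gpu_va with a single table read; None if unmapped."""
    offset = (gpu_va >> PAGE_SHIFT) * PTE_SIZE
    if offset + PTE_SIZE > len(table):
        return None
    (pte,) = struct.unpack_from("<Q", table, offset)
    if not pte & PTE_VALID:
        return None
    paddr = (pte & PFN_MASK) | (gpu_va & PAGE_OFFSET_MASK)
    return paddr, ("system" if pte & PTE_SYSTEM else "vram")


# Toy table with four entries; only GPU page 1 is mapped (to system memory).
table = bytearray(4 * PTE_SIZE)
struct.pack_into("<Q", table, 1 * PTE_SIZE, make_pte(0x1_2345_6000, system=True))

print(gart_lookup(table, 0x1ABC))    # mapped page
print(gart_lookup(table, 0x2000))    # unmapped page
```

Returning `None` for an unmapped page is a simplification chosen for this sketch. There is exactly one memory read per translation and no directory levels to walk.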
- -Typical GTT page contents: - -| Data Structure | Description | Access Direction | -|---------------|-------------|-----------------| -| PM4 Ring Buffer | GFX command queue | Driver writes -> GPU reads | -| SDMA Ring Buffer | DMA command queue | Driver writes -> GPU reads | -| IH Ring Buffer | Interrupt handler queue | GPU writes -> Driver reads | -| Fence values | Completion signals | GPU writes -> Driver reads | -| MQD (Map Queue Descriptor) | Queue descriptors | Driver writes -> GPU reads | -| User DMA buffers | hipMemcpy src/dst | Bidirectional | - -### 4.2 Guest RAM Sharing Initialization - -**QEMU side** (command-line arguments): - -```bash --object memory-backend-file,id=mem0,size=8G,\ - mem-path=/dev/shm/cosim-guest-ram,share=on --numa node,memdev=mem0 -``` - -`share=on` ensures the file mapping uses `MAP_SHARED`, making QEMU's modifications to guest memory visible to other processes. - -**gem5 side** (`mi300_cosim.py`): - -```python -system.shared_backstore = args.shmem_host_path # "/cosim-guest-ram" -system.auto_unlink_shared_backstore = True -system.memories[0].shared_backstore = args.shmem_host_path -``` - -gem5's `PhysicalMemory` uses the same POSIX shared memory file as its backing store, achieving memory sharing with QEMU. `MI300XVfioUser` also sets `gpuDevice->getVM().vramShmemPtr` to enable the GART translator to correctly access shared VRAM. - -### 4.3 Why GTT Needs No Extra Sharing Mechanism - -GTT pages reside in Guest RAM. Guest RAM is already shared between QEMU and gem5 via `/dev/shm/cosim-guest-ram`. Therefore: - -1. **Driver writes to ring buffer** -> writes to Guest RAM -> `/dev/shm/cosim-guest-ram` -> gem5 can read -2. **gem5 writes fence** -> Ruby memory controller writes to Guest RAM -> `/dev/shm/cosim-guest-ram` -> driver can read -3. **Physical addresses in GART PTEs** -> offsets within Guest RAM -> accessible by both sides - -**vfio-user backend**: Both VRAM and Guest RAM are accessed via zero-copy mmap. 
gem5's SDMA/PM4 DMA operations go through the Ruby memory system which directly accesses the shared backstore memory, with no socket relay needed. - -## 5. GART Translation Flow - -### 5.1 Driver Writes GART PTEs - -``` -amdgpu driver (guest) - | - +- amdgpu_gart_map(): compute PTE value - | pte = (phys_addr >> 12) << 12 | flags - | - +- write to BAR0 + ptBase + (gpu_page * 8) - | | - | +- QEMU BAR0 = mmap of /dev/shm/mi300x-vram - | +- data immediately appears in shared memory - | - +- TLB invalidate: write VM_INVALIDATE_ENG17 register - +- MMIO -> vfio-user -> gem5 -> invalidateTLBs() -``` - -### 5.2 gem5 Reads GART PTEs - -```cpp -// amdgpu_vm.cc: GARTTranslationGen::translate() - -// Step 1: compute PTE offset within VRAM -gart_addr = bits(transformedAddr, 63, 12); // GPU VA page number -pte_table_offset = gart_addr - (ptStart * 8); - -// Step 2: read PTE directly from shared VRAM (zero-copy) -pte_vram_offset = gartBase() + pte_table_offset; -memcpy(&pte, vramShmemPtr + pte_vram_offset, sizeof(uint64_t)); - -// Step 3: extract physical address -if (pte != 0) { - paddr = (bits(pte, 47, 12) << 12) | bits(vaddr, 11, 0); - // paddr points to Guest RAM (GTT page) or VRAM -} -``` - -### 5.3 PTE Format - -``` -63 52 51 48 47 12 11 6 5 2 1 0 -+-------+------+-----------------+------+----+---+---+ -| Flags | BlkF | Physical Page | Rsvd |Frag|Sys| V | -| | | (PA >> 12) | | | | | -+-------+------+-----------------+------+----+---+---+ - -Bit 0: Valid -- PTE is valid -Bit 1: System -- 1=system memory (Guest RAM), 0=local VRAM -Bit 47:12 -- physical page number -``` - -### 5.4 Address Classification - -After GART translation yields a physical address, gem5 determines where it points: - -``` -Physical address paddr - | - +- Within fbBase ~ fbTop range? - | +- YES -> VRAM address - | +- Access directly via vramShmemPtr (zero-copy) - | - +- Within sysAddrL ~ sysAddrH range? 
- | +- YES -> Guest RAM address (GTT page) - | +- Access via Ruby memory system (shared memory direct access) - | - +- Neither? - +- Sink (paddr=0, safely discarded) -``` - -## 6. DMA Flow - -### 6.1 vfio-user Backend: Shared Memory Direct Access - -With the vfio-user backend, gem5 accesses Guest RAM directly through the Ruby memory system's shared backstore (`/dev/shm/cosim-guest-ram`), with no socket-based DMA operations needed. - -``` -gem5 GPU model (PM4/SDMA/IH) - | - | Needs to read ring buffer commands / write fence values - | - v Ruby memory system request - | - +- Address translated by GART -> Guest physical address - | - +- Ruby memory controller accesses PhysicalMemory - | | - | +- PhysicalMemory backed by /dev/shm/cosim-guest-ram (MAP_SHARED) - | +- read/write directly hits shared memory - | +- QEMU sees changes immediately (same mmap file) - | - +- Done (no socket round-trip needed) -``` - -**Key advantages**: - -- **Zero-copy**: DMA reads and writes operate directly on shared memory with no serialization/deserialization -- **Low latency**: Eliminates the socket request-response round-trip overhead -- **Simplified architecture**: No custom DMA protocol needed; Ruby's memory system natively supports shared backstores - -Interrupts are delivered via vfio-user's irq_fd mechanism (eventfd -> KVM), with no custom interrupt messages needed. - -### 6.2 Legacy Backend: Socket DMA Protocol - -> The following describes the legacy custom cosim socket backend (`MI300XGem5Cosim`), retained for reference. 
- -#### 6.2.1 gem5 Reads from Guest RAM (ring buffers / fences) - -``` -gem5 GPU model (PM4/SDMA/IH) - | - | Needs to read ring buffer commands from Guest RAM - | - v cosimBridge->sendDmaRead(guestPhysAddr, length) - | - +- Construct DmaRead message (32-byte header) - | { type=DmaRead, addr=guestPhysAddr, data=length } - | - +- sendAll(eventFd, &msg, 32) --> QEMU event thread - | | - | +- pci_dma_read(addr, buf, len) - | | (reads from /dev/shm/cosim-guest-ram) - | | - | +- sendAll(eventFd, &resp, 32) - | <------------------------------------------+- sendAll(eventFd, data, len) - | - +- memcpy(dest, recvBuf, length) // data arrives at gem5 -``` - -#### 6.2.2 gem5 Writes to Guest RAM (fences / IH cookies) - -``` -gem5 GPU model - | - | Needs to write fence value to Guest RAM - | - v cosimBridge->sendDmaWrite(guestPhysAddr, length, data) - | - +- Construct DmaWrite message + data payload - | { type=DmaWrite, addr=guestPhysAddr, data=length, size=length } - | - +- sendAll(eventFd, &msg, 32) --> QEMU event thread - +- sendAll(eventFd, data, length) --> | - | +- pci_dma_write(addr, buf, len) - | | (writes to /dev/shm/cosim-guest-ram) - | | - +- Done (DMA writes don't wait for response) +- Driver can see data immediately -``` - -### 6.3 vfio-user vs Legacy Backend Comparison - -| Dimension | vfio-user Backend | Legacy Socket Backend | -|-----------|-------------------|----------------------| -| Guest RAM DMA | Ruby memory system direct access to shared backstore | Socket request-response protocol | -| VRAM access | mmap zero-copy (same) | mmap zero-copy (same) | -| Interrupts | irq_fd (eventfd -> KVM) | Socket messages | -| MMIO | vfio-user message passing | Custom socket protocol | -| QEMU-side device | Built-in `vfio-user-pci` | Custom `mi300x_gem5.c` | -| Address translation | gem5-internal GART translation | QEMU-side `pci_dma_read/write` | - -The reasons the legacy backend routed Guest RAM DMA through the socket (address translation, event-driven simulation, IOMMU 
compatibility, etc.) no longer apply with vfio-user: gem5's Ruby memory controllers directly access the shared backstore memory, and GART address translation is performed internally within gem5. - -## 7. Sink Mechanism - -### 7.1 Problem Scenario - -In co-simulation mode, some GART PTEs may be zero (uninitialized) or point to VRAM-internal addresses. If gem5 cannot translate these addresses, it throws a `GenericPageTableFault`, causing a DMA retry loop that hangs the simulation. - -### 7.2 Solution - -```cpp -// amdgpu_vm.cc: GARTTranslationGen::translate() - -if (pte == 0) { - if (origAddr < vramShmemSize && vramShmemPtr) { - // VRAM address -> map to sink (paddr=0) - range.paddr = 0; - warn_once("GART: VRAM address mapped to sink -- " - "VRAM write-backs are no-ops in cosim"); - } else if (vramShmemPtr) { - // Unmapped GART page -> sink - range.paddr = 0; - warn_once("GART cosim: unmapped page -> sink"); - } -} -``` - -**Sink semantics**: -- `paddr=0` is always a valid physical address in gem5 (system RAM base) -- DMA reads return zeros -- DMA writes are silently discarded -- Prevents the fault -> retry infinite loop - -## 8. Complete Data Flow Example - -The following uses a HIP kernel dispatch to illustrate the full memory interaction: - -``` -1. hipMalloc(&d_a, N*sizeof(int)) - Driver -> allocates buffer in VRAM - Writes GART PTEs to shared VRAM (BAR0) - -2. hipMemcpy(d_a, h_a, N*sizeof(int), hipMemcpyHostToDevice) - Driver -> constructs SDMA copy command -> writes to Guest RAM (ring buffer) - Driver -> writes Doorbell -> QEMU BAR2 -> vfio-user -> gem5 - gem5 -> reads ring buffer (Guest RAM via shared memory) - gem5 -> parses SDMA command -> GART translates source address -> Guest RAM - gem5 -> reads source data (Guest RAM via shared memory) - gem5 -> writes to VRAM destination (shared memory direct write) - -3.
kernel<<<1, N>>>(d_a, d_b, d_c, N) - Driver -> constructs PM4 dispatch command -> writes to Guest RAM (ring buffer) - Driver -> writes Doorbell -> gem5 - gem5 -> reads PM4 command (Guest RAM via shared memory) - gem5 -> launches shader execution - gem5 -> shader reads/writes VRAM (shared memory direct access) - gem5 -> writes fence on completion (Guest RAM via Ruby memory write) - gem5 -> sends MSI-X interrupt (irq_fd -> KVM) - -4. hipDeviceSynchronize() - Driver -> polls fence value (until Guest RAM value matches) - +- fence written by gem5 via Ruby memory write to shared backstore -``` - -## 9. Known Limitations - -### 9.1 DMA Buffer Size (Legacy Backend) - -> This limitation only applies to the legacy socket backend. The vfio-user backend accesses shared memory directly and has no such limit. - -Maximum single DMA transfer is 4 MiB (`COSIM_DMA_BUF_SIZE`). Transfers exceeding this size must be chunked. In practice, the driver typically submits page-sized transfers, so this limit is rarely hit. - -### 9.2 User-space Page Tables (VMID > 0) - -VMID 0 (kernel mode) GART page tables are fully visible via shared VRAM. However, VMID > 0 (user mode) multi-level page tables are walked by `VegaISA::Walker`, which uses gem5's internal TLB/page walker rather than reading directly from shared memory. - -The practical impact is limited: after the driver writes page tables, it sends TLB invalidate MMIOs. gem5 flushes its TLB upon receiving these, and subsequent walker traversals read from the correct physical addresses (which point to shared VRAM or Guest RAM). - -### 9.3 VRAM Write-back Semantics - -Some GART addresses in gem5 point back to VRAM itself (VRAM-to-VRAM DMA). These addresses are routed to the sink (paddr=0), and writes are silently discarded. For pure compute workloads, this does not affect correctness. - -## 10. 
File Reference - -| File | Key Function/Region | Role | -|------|---------------------|------| -| `gem5/src/dev/amdgpu/amdgpu_vm.cc:396-557` | `GARTTranslationGen::translate()` | Core GART translation logic | -| `gem5/src/dev/amdgpu/amdgpu_vm.hh` | `AMDGPUSysVMContext`, `vramShmemPtr` | GART data structures | -| `gem5/src/dev/amdgpu/mi300x_vfio_user.cc` | `setupVramShm()` | VRAM shared memory initialization (vfio-user backend) | -| `gem5/src/dev/amdgpu/mi300x_vfio_user.hh` | `MI300XVfioUser` | vfio-user server-side bridge | -| `gem5/src/dev/amdgpu/mi300x_gem5_cosim.cc` | `setupSharedMemory()`, `sendDmaRead/Write()` | Legacy socket backend (VRAM init + DMA) | -| `gem5/configs/example/gpufs/mi300_cosim.py` | `shared_backstore` config, `--cosim-backend` | Guest RAM sharing setup + backend selection | -| `gem5/src/dev/amdgpu/MI300XVfioUser.py` | SimObject definition | vfio-user backend Python binding | diff --git a/docs/en/cosim-technical-notes.md b/docs/en/cosim-technical-notes.md deleted file mode 100644 index ac596a5..0000000 --- a/docs/en/cosim-technical-notes.md +++ /dev/null @@ -1,352 +0,0 @@ -[中文](../zh/cosim-technical-notes.md) - -# QEMU + gem5 MI300X Co-simulation: Technical Notes - -This document summarizes the architecture, implementation details, resolved issues, and known limitations of the QEMU + gem5 MI300X co-simulation system. - -## 1. 
Architecture Overview - -``` -+--------------------------------------+ -| QEMU (Q35 + KVM) | -| +--------------------------------+ | -| | Guest Linux (Ubuntu 24) | | -| | amdgpu driver (ROCm 7) | | -| | ROCm userspace | | -| +--------------+-----------------+ | -| | MMIO / Doorbell | -| +--------------v-----------------+ | -| | vfio-user-pci | | -| | (QEMU built-in device) | | -| +--------------+-----------------+ | -| | vfio-user protocol | -+-----------------+--------------------+ - | /tmp/gem5-mi300x.sock - | (Unix socket) -+-----------------+--------------------+ -| gem5 | | -| +--------------v-----------------+ | -| | MI300XVfioUser | | -| | (mi300x_vfio_user.cc) | | -| | [libvfio-user server] | | -| +--------------+-----------------+ | -| | AMDGPUDevice API | -| +--------------v-----------------+ | -| | AMDGPUDevice | | -| | PM4PacketProcessor | | -| | SDMAEngine | | -| | Shader / CU array | | -| +--------------------------------+ | -+--------------------------------------+ - -Shared Memory: - /dev/shm/cosim-guest-ram Guest physical RAM (QEMU <-> gem5 DMA) - /dev/shm/mi300x-vram GPU VRAM (QEMU BAR0 <-> gem5 device memory) -``` - -> **Note**: The legacy backend (`mi300x-gem5` QEMU device + `MI300XGem5Cosim` gem5 bridge) is still available via `--cosim-backend=legacy`. The vfio-user backend is the current default. 
- -### Key Components - -| Component | Location | Purpose | -|---|---|---| -| `MI300XVfioUser` | `src/dev/amdgpu/mi300x_vfio_user.{cc,hh}` | gem5 vfio-user server; handles BAR access and interrupts via libvfio-user (**default backend**) | -| `vfio-user-pci` | QEMU built-in device | QEMU-side vfio-user client; no custom QEMU code needed | -| `CosimBridge` | `src/dev/amdgpu/cosim_bridge.hh` | Abstract co-simulation bridge interface, implemented by both vfio-user and legacy backends | -| `MI300XGem5Cosim` | `src/dev/amdgpu/mi300x_gem5_cosim.{cc,hh}` | Legacy socket bridge SimObject (**legacy backend**) | -| `mi300x_gem5.c` | `qemu/hw/misc/` (legacy) | Legacy QEMU PCI device; forwards MMIO/doorbell via custom socket protocol (**legacy backend**) | -| `mi300_cosim.py` | `configs/example/gpufs/` | gem5 config; select backend via `--cosim-backend=vfio-user\|legacy` | -| `cosim_launch.sh` | `scripts/` | Orchestrates Docker (gem5) + QEMU launch sequence | - -### PCI BAR Layout - -``` -BAR0+1 VRAM 64-bit prefetchable 16 GiB (shared memory) -BAR2+3 Doorbell 64-bit 4 MiB -BAR4 MSI-X exclusive -BAR5 MMIO regs 32-bit 512 KiB (forwarded to gem5) -``` - -This layout **must** match the expectations hardcoded in the amdgpu driver (`AMDGPU_VRAM_BAR=0`, `AMDGPU_DOORBELL_BAR=2`, `AMDGPU_MMIO_BAR=5`). - -## 2. Resolved Issues (Pitfall Log) - -### 2.1 Shared Memory File Offset Mismatch (Critical) - -**Symptom**: GART page table entries read back as all zeros; PM4 opcode 0x0 (NOP, count 0) repeats infinitely. - -**Root cause**: QEMU Q35 and gem5 split memory below/above 4G differently, resulting in different file offsets within the shared backing store. - -- QEMU Q35 with 8 GiB RAM: `below_4g = 2 GiB` (hardcoded when `ram_size >= 0xB0000000`). See `qemu/hw/i386/pc_q35.c:161`. -- gem5 configured as 3 GiB below / 5 GiB above. -- QEMU places above-4G data at file offset 2 GiB; gem5 reads from offset 3 GiB -> all zeros. 
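To make the mismatch concrete, here is a minimal sketch (not taken from the cosim sources) of how a guest physical address maps to an offset in the shared backing file under each split. It assumes the usual x86 layout in which above-4G RAM starts at guest physical address 4 GiB and is stored in the file directly after the below-4G portion; the function name `gpa_to_file_offset` is hypothetical.

```python
GiB = 1 << 30


def gpa_to_file_offset(gpa, below_4g, total):
    """Map a guest physical address to an offset in the backing file.

    Assumption: RAM below 4 GiB occupies file offsets [0, below_4g) and
    RAM above 4 GiB follows immediately after it in the file.
    """
    above_4g = total - below_4g
    if gpa < below_4g:
        return gpa
    if 4 * GiB <= gpa < 4 * GiB + above_4g:
        return below_4g + (gpa - 4 * GiB)
    raise ValueError(f"GPA {gpa:#x} is not backed by RAM")


total = 8 * GiB
gpa = 4 * GiB + 0x1000  # a page just above the 4G boundary

# QEMU Q35 puts 2 GiB below 4G; the mismatched gem5 config used 3 GiB.
qemu_off = gpa_to_file_offset(gpa, below_4g=2 * GiB, total=total)
gem5_off = gpa_to_file_offset(gpa, below_4g=3 * GiB, total=total)
print(hex(qemu_off), hex(gem5_off))
```

Under these assumptions the same guest page resolves to two file offsets exactly 1 GiB apart, which is why gem5 read zeros where QEMU had written data.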
- -**Fix**: `mi300_cosim.py` replicates the Q35 split logic: - -```python -total_mem = convert.toMemorySize(args.mem_size) -lowmem_limit = 0x80000000 if total_mem >= 0xB0000000 else 0xB0000000 -below_4g = min(total_mem, lowmem_limit) -above_4g = total_mem - below_4g -``` - -**Key lesson**: When two systems share a memory-backend-file, they must agree on file offsets for each range, not just the total size. - -### 2.2 SIGIO Edge-Triggered Drain Issue (Critical, Legacy Backend) - -**Symptom**: gem5 hangs forever after processing the first MMIO message. QEMU's socket buffer fills up. - -**Root cause**: gem5's `PollQueue` uses `FASYNC`/`SIGIO`, which is **edge-triggered**. If multiple messages arrive before the first one is processed, only one `SIGIO` fires. After handling one message, the remaining messages sit in the socket buffer with no signal to wake gem5. - -**Fix**: `mi300x_gem5_cosim.cc:handleClientData()` uses a `do/while` loop with a zero-timeout `poll()` on `POLLIN` to drain **all** pending messages on each SIGIO arrival. - -```cpp -struct pollfd pfd = {fd, POLLIN, 0}; // declared outside the loop so the while condition can see it -do { - // read and process one message - ... -} while (poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN)); -``` - -> **Note**: This issue only affects the legacy backend. The vfio-user backend uses libvfio-user's non-blocking poll mechanism and does not rely on SIGIO signals. - -### 2.3 VRAM Address GART Translation Error (Critical) - -**Symptom**: Address `0x1f72fa8000` triggers over 861,000 GART translation errors, memory exhaustion, and segfault. - -**Root cause**: SDMA rptr writeback addresses and PM4 RELEASE_MEM destination addresses may point to VRAM (address < 16 GiB). When these addresses go through `getGARTAddr()`, the page number is multiplied by 8, and GART translation fails because VRAM addresses have no corresponding page table entries. - -**Fix (three-layer defense)**: - -1.
**PM4 layer** (`pm4_packet_processor.cc`): `writeData()`, `releaseMem()`, `queryStatus()` check `isVRAMAddress(addr)` and route VRAM writes through `gpuDevice->getMemMgr()->writeRequest()` (device memory) instead of `dmaWriteVirt()` (system memory via GART). - -2. **SDMA layer** (`sdma_engine.cc`): `setGfxRptrLo/Hi()` and rptr writeback skip `getGARTAddr()` for VRAM addresses, using `getMemMgr()->writeRequest()` instead. - -3. **GART fallback** (`amdgpu_vm.cc`): `GARTTranslationGen::translate()` detects VRAM addresses by reversing the `getGARTAddr` transform (`orig_page = page_num >> 3`) and maps them to `paddr=0` as a sink instead of faulting. - -### 2.4 Timer Overflow in Co-simulation Mode - -**Symptom**: After billions of ticks, gem5 crashes due to `curTick()` integer overflow (RTC and PIT timers continuously scheduling events). - -**Fix**: Added a `disable_rtc_events` parameter to `Cmos` and a `disable_timer_events` parameter to `I8254`. Both are disabled in `mi300_cosim.py`. A keepalive event in `MI300XGem5Cosim` prevents the event queue from becoming empty. - -### 2.5 PSP / SMU Firmware Load Failure - -**Symptom**: `modprobe amdgpu` with `ip_block_mask=0x6f` fails with `-EINVAL` during PSP firmware loading. - -**Root cause**: In ROCm 7.0's `amdgpu_discovery.c`, the IP block enumeration order is: -``` -0: soc15_common 1: gmc_v9_0 2: vega20_ih -3: psp 4: smu 5: gfx_v9_4_3 -6: sdma_v4_4_2 7: vcn_v4_0_3 8: jpeg_v4_0_3 -``` - -`ip_block_mask=0x6f` = `0b01101111` disables bit 4 (SMU) but does **not** disable bit 3 (PSP). Use `ip_block_mask=0x67` = `0b01100111` to disable both PSP (bit 3) and SMU (bit 4). - -### 2.6 QEMU Serial Console Conflict with `-nographic` - -**Symptom**: No serial output from guest when using `-serial unix:/tmp/serial.sock -nographic` together. - -**Root cause**: `-nographic` implies `-serial mon:stdio`, which creates serial0 mapped to stdio. The explicit `-serial unix:...` becomes serial1 (ttyS1), but the kernel uses `console=ttyS0`. 
- -**Fix**: Use `-nographic` alone (serial output goes to stdio). For programmatic access, run QEMU inside `screen`: -```bash -screen -dmS qemu-cosim -L -Logfile /tmp/log -screen -S qemu-cosim -X stuff 'command\n' -``` - -### 2.7 Unsupported PM4 Opcodes - -| Opcode | Name | Description | Fix | -|--------|------|-------------|-----| -| `0x58` | `ACQUIRE_MEM` | Memory barrier / cache flush | NOP (skip packet body) | -| `0xA0` | `SET_RESOURCES` | Queue resource configuration | NOP (skip packet body) | - -Both have been added to `pm4_defines.hh` and handled in `pm4_packet_processor.cc:decodeHeader()` as skip-and-continue. - -### 2.8 Out-of-Memory (OOM) During Linking - -**Symptom**: Linker killed by OOM killer even with `-j2`. - -**Fix**: Use the gold linker and limit to a single job: -```bash -scons build/VEGA_X86/gem5.opt -j1 GOLD_LINKER=True --linker=gold -``` - -### 2.9 PCI Class Code - -**Symptom**: amdgpu driver skips the legacy VGA ROM check at `0xC0000`. - -**Fix**: Changed PCI class from `PCI_CLASS_DISPLAY_OTHER (0x0380)` to `PCI_CLASS_DISPLAY_VGA (0x0300)`. With VGA class, the kernel automatically detects it as a "video device with shadowed ROM". - -### 2.10 GART Unmapped Page Crash (Critical) - -**Symptom**: After a HIP program outputs `hipMalloc OK`, gem5 segfaults with repeated `GART translation for 0x3fff800000000 not found` warnings. - -**Root cause**: The GPU's PM4/SDMA engines attempt DMA to GART pages that the driver has not yet mapped (PTE = 0 in shared VRAM). The original code created a `GenericPageTableFault`, but the DMA callback chain retried the same failing address infinitely, exhausting memory and crashing. - -**Fix**: In co-simulation mode, unmapped GART pages are mapped to a sink (`paddr=0`) instead of faulting. DMA reads return zeros, writes are discarded, but the simulation stays alive. GART sink diagnostics also log `fbBase` to aid debugging. 
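As a rough model of this fallback (a Python sketch, not the gem5 C++ code), the function below decodes a 64-bit GART PTE using the field positions from the documented PTE layout (bit 0 valid, bit 1 system, bits 47:12 physical page number) and returns the sink address for a zero PTE. The names `translate_gart` and `SINK_PADDR` are illustrative only and do not appear in gem5.

```python
SINK_PADDR = 0  # cosim convention: unmapped pages are redirected here


def translate_gart(pte, vaddr):
    """Return (paddr, is_system, is_sink) for one GART lookup."""
    if pte == 0:
        # Unmapped page: route to the sink instead of raising a fault.
        return SINK_PADDR, False, True
    page = (pte >> 12) & ((1 << 36) - 1)  # bits 47:12
    is_system = bool(pte & 0x2)           # bit 1: 1 = Guest RAM, 0 = VRAM
    paddr = (page << 12) | (vaddr & 0xFFF)
    return paddr, is_system, False


# A valid system-memory PTE for physical page 0x12345, and an empty PTE.
mapped = translate_gart((0x12345 << 12) | 0x3, vaddr=0xABC)
unmapped = translate_gart(0, vaddr=0xABC)
print(mapped, unmapped)
```

Returning a flag instead of raising keeps the caller's DMA path on its normal completion route, which is the property the fix relies on to avoid the retry loop.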
- -**Key finding**: GART PTEs at `gartBase` (= `ptBase`) in shared VRAM were correctly populated by the driver. Diagnostics confirmed that subsequent PTEs (offset 0x32E0+) contain valid entries, while the first page (ptStart itself) is simply unmapped -- this is normal behavior. - -### 2.11 SDMA Ring Test Timeout - -**Symptom**: SDMA ring test returns -110 (ETIMEDOUT) during driver initialization. - -**Root cause**: `sdma_delay = 1e9` in `sdma_engine.hh` causes each SDMA processing step to take 1 billion simulation ticks. Combined with the keepalive-driven event loop, SDMA completes in ~500ms wall-clock time, exceeding the driver's ~200ms timeout window. - -**Fix**: Reduced `sdma_delay` from `1e9` to `1000` and increased `KEEPALIVE_INTERVAL` to `1e9`. This dramatically shortens the wall-clock latency of SDMA operations, allowing the ring test to complete within the driver's timeout window. - -## 3. Current Status - -### Implemented Features - -- **vfio-user backend (default)**: QEMU uses its built-in `vfio-user-pci` device, gem5 runs `MI300XVfioUser` as a vfio-user server. No custom QEMU code needed; stock QEMU 10.0+ works out of the box -- **Driver initialization**: amdgpu 3.64.0 fully loaded - - IP discovery from firmware files (`discovery=2`) - - GMC (memory controller), GFX (compute), SDMA, IH (interrupt handler) - - 8 KIQ rings mapped (mec 2 pipe 1 q 0) - - 4 SDMA engines x 4 queues = 16 SDMA rings - - 64+ compute rings across 8 XCP partitions - - 7 DRM XCP device nodes (`/dev/dri/renderD129..135`) - - SDMA ring test passes (after `sdma_delay` tuning) - - Fence fallback timer issue resolved -- **ROCm tools**: - - `rocm-smi`: device 0x74a0, SPX partition, 1% VRAM - - `rocminfo`: Agent gfx942, 320 CU, 4 SIMD/CU, KERNEL_DISPATCH -- **KFD** (Kernel Fusion Driver): node added, 16383 MB VRAM, HSA agent registered -- **GPU compute (HIP)**: fully functional! 
- - `hipMalloc` / `hipMemcpy` (host-to-device, device-to-host) - - Kernel dispatch (`addKernel<<<1, N>>>`) runs on gfx942 - - `hipDeviceSynchronize` returns `hipSuccess` - - Results verified correct: `{1+10, 2+20, 3+30, 4+40}` = `{11, 22, 33, 44}` - - Test results: vector_add (120ms), transpose (6.5s), gemm (4.7s) all PASSED -- **MSI-X interrupt forwarding**: gem5 -> QEMU via vfio-user protocol (vfio-user backend) or event socket (legacy backend) - - `AMDGPUDevice::intrPost()` -> `cosimBridge->sendIrqRaise(0)` - - QEMU -> guest IH handler -- **GART translation**: co-simulation fallback reads PTEs from shared VRAM; unmapped pages safely routed to sink -- **65,000+ MMIO operations** handled without crashes -- **Disk image**: `cosim-gpu-setup.service` auto-loads driver at boot (dd ROM → modprobe with `ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2`) - -### Known Limitations - -1. **VGA BIOS ROM must be dd'd first**: The `dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128` step is mandatory before `modprobe`. The driver's BIOS discovery chain (ACPI ATRM/VFCT, SMU ROM read, platform ROM) all fail in cosim mode. Without the ROM at `0xC0000`, `atom_context` is NULL and `amdgpu_ras_init` crashes with a NULL pointer dereference. - -2. **GART unmapped pages**: Some GART pages have PTE=0 and are routed to sink. This is safe but means DMA reads to those addresses return zeros. - -## 4. 
File Change Summary - -### gem5 (New Files - vfio-user Backend) -| File | Description | -|---|---| -| `src/dev/amdgpu/mi300x_vfio_user.{cc,hh}` | vfio-user server SimObject | -| `src/dev/amdgpu/MI300XVfioUser.py` | SimObject Python wrapper | -| `src/dev/amdgpu/cosim_bridge.hh` | Abstract CosimBridge interface (implemented by both vfio-user and legacy backends) | -| `ext/libvfio-user/` | libvfio-user library (submodule) | - -### gem5 (New Files - Legacy Backend) -| File | Description | -|---|---| -| `src/dev/amdgpu/mi300x_gem5_cosim.{cc,hh}` | Socket bridge SimObject | -| `src/dev/amdgpu/MI300XGem5Cosim.py` | SimObject Python wrapper | - -### gem5 (New Files - Common) -| File | Description | -|---|---| -| `configs/example/gpufs/mi300_cosim.py` | Co-simulation system config (`--cosim-backend=vfio-user\|legacy`) | -| `scripts/cosim_launch.sh` | Launch orchestration script | - -### gem5 (Modified Files) -| File | Changes | -|---|---| -| `src/dev/amdgpu/pm4_packet_processor.{cc,hh}` | VRAM write routing, `isVRAMAddress()`, ACQUIRE_MEM/SET_RESOURCES NOP | -| `src/dev/amdgpu/pm4_defines.hh` | Added `IT_ACQUIRE_MEM`, `IT_SET_RESOURCES` | -| `src/dev/amdgpu/sdma_engine.{cc,hh}` | VRAM rptr writeback routing, `sdma_delay` tuning | -| `src/dev/amdgpu/amdgpu_vm.{cc,hh}` | GART co-simulation fallback (shared VRAM PTE reads), VRAM address sink | -| `src/dev/amdgpu/amdgpu_device.cc` | Co-simulation integration hooks | -| `src/dev/amdgpu/amdgpu_nbio.cc` | ASIC initialization complete register | -| `src/dev/intel_8254_timer.{cc,hh}` | `disable_timer_events` parameter | -| `src/dev/mc146818.{cc,hh}` | `disable_rtc_events` parameter | - -### QEMU (New Files - Legacy Backend) -| File | Description | -|---|---| -| `hw/misc/mi300x_gem5.c` | MI300X PCI device with socket bridge | -| `hw/misc/mi300x_gem5.h` | Header file | -| `hw/misc/trace-events` | Trace event definitions | - -> **Note**: The vfio-user backend uses QEMU's built-in `vfio-user-pci` device and requires no custom QEMU code. 
- -## 5. How to Run - -### Prerequisites -- Docker installed with `gem5-run:local` image built -- QEMU 10.0+ (native vfio-user support); legacy backend requires QEMU compiled from `cosim/qemu/` -- Disk image `x86-ubuntu-rocm70` + kernel `vmlinux-rocm70` - -### Quick Start -```bash -cd cosim -./scripts/cosim_launch.sh -# GPU driver loads automatically via cosim-gpu-setup.service (~40s) -# After guest boots, verify: -rocm-smi # should show device 0x74a0 -rocminfo # should show gfx942 -``` - -### Manual Launch (for Debugging) -```bash -# 1. Run gem5 in Docker -docker run -d --name gem5-cosim \ - -v "$PWD:/gem5" -v /tmp:/tmp -v /dev/shm:/dev/shm -w /gem5 \ - -e PYTHONPATH=/usr/lib/python3.12/lib-dynload \ - gem5-run:local build/VEGA_X86/gem5.opt \ - --debug-flags=MI300XCosim --listener-mode=on \ - configs/example/gpufs/mi300_cosim.py \ - --socket-path=/tmp/gem5-mi300x.sock \ - --shmem-path=/mi300x-vram \ - --shmem-host-path=/cosim-guest-ram \ - --dgpu-mem-size=16GiB --num-compute-units=40 --mem-size=8G - -# 2. Wait for socket creation and fix permissions -docker exec gem5-cosim chmod 777 /tmp/gem5-mi300x.sock -docker exec gem5-cosim chmod 666 /dev/shm/mi300x-vram - -# 3. 
Run QEMU in screen (vfio-user backend, default) -screen -dmS qemu-cosim -L -Logfile /tmp/qemu-cosim-screen.log \ - qemu-system-x86_64 \ - -machine q35 -enable-kvm -cpu host -m 8G -smp 4 \ - -object memory-backend-file,id=mem0,size=8G,\ - mem-path=/dev/shm/cosim-guest-ram,share=on \ - -numa node,memdev=mem0 \ - -kernel ../gem5-resources/src/x86-ubuntu-gpu-ml/vmlinux-rocm70 \ - -append "console=ttyS0,115200 root=/dev/vda1 \ - modprobe.blacklist=amdgpu earlyprintk=serial,ttyS0,115200" \ - -drive file=../gem5-resources/src/x86-ubuntu-gpu-ml/disk-image/x86-ubuntu-rocm70,\ - format=raw,if=virtio \ - -device vfio-user-pci,socket=/tmp/gem5-mi300x.sock \ - -nographic -no-reboot - -# For legacy backend, replace the -device line above with: -# -device mi300x-gem5,gem5-socket=/tmp/gem5-mi300x.sock,\ -# shmem-path=/dev/shm/mi300x-vram,vram-size=17179869184 -# and use QEMU compiled from cosim/qemu/ - -# 4. Manual GPU setup (if cosim-gpu-setup.service is not installed) -screen -S qemu-cosim -X stuff 'dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128\n' -screen -S qemu-cosim -X stuff 'modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2\n' -``` - -## 6. 
Debugging Tips - -- **gem5 debug flags**: `--debug-flags=MI300XCosim,AMDGPUDevice,PM4PacketProcessor` -- **QEMU trace**: `--qemu-trace 'mi300x_gem5_*'` -- **Check gem5 logs**: `docker logs gem5-cosim 2>&1 | grep -E "warn|error|GART"` -- **Check guest dmesg**: `screen -S qemu-cosim -X stuff 'dmesg | tail -20\n'` -- **Incremental rebuild**: Delete stale `.o` files and rebuild with gold linker: - ```bash - docker run --rm -v "$PWD:/gem5" -w /gem5 gem5-run:local \ - sh -c 'rm -f build/VEGA_X86/dev/amdgpu/*.o' - docker run --rm -v "$PWD:/gem5" -w /gem5 \ - gem5-run:local scons build/VEGA_X86/gem5.opt -j1 - ``` diff --git a/docs/en/cosim-usage-guide.md b/docs/en/cosim-usage-guide.md deleted file mode 100644 index 7810f98..0000000 --- a/docs/en/cosim-usage-guide.md +++ /dev/null @@ -1,579 +0,0 @@ -[中文](../zh/cosim-usage-guide.md) - -# QEMU + gem5 MI300X Co-simulation Usage Guide - -A complete workflow from compilation to running HIP GPU compute. - -## Architecture Overview - -``` -+---------------------------------+ +------------------------------+ -| QEMU (Q35 + KVM) | | gem5 (inside Docker) | -| +---------------------------+ | | +------------------------+ | -| | Guest Linux (Ubuntu 24.04)| | | | MI300X GPU Model | | -| | amdgpu driver | | | | - Shader + CU | | -| | ROCm 7.0 / HIP runtime | | | | - PM4 / SDMA Engines | | -| +-----------+---------------+ | | | - Ruby Cache Hierarchy | | -| | MMIO/Doorbell | | +----------+-------------+ | -| +-----------v---------------+ | | +----------v-------------+ | -| | vfio-user-pci (built-in) |<--------->| MI300XVfioUser Server | | -| +---------------------------+ |vfio-| +------------------------+ | -| |user | | -+---------------------------------+ +------------------------------+ - | | - v v - /dev/shm/cosim-guest-ram /dev/shm/mi300x-vram - (Guest Physical Memory, Shared) (GPU VRAM, Shared) -``` - -- **QEMU** is responsible for: CPU execution, Linux kernel boot, PCIe enumeration, amdgpu driver loading -- **gem5** is responsible
for: MI300X GPU compute model (Shader, CU, Cache, DMA engines) -- They communicate via the **vfio-user protocol** (over a Unix domain socket). QEMU uses its built-in `vfio-user-pci` device, while gem5 runs `MI300XVfioUser` as a vfio-user server, transparently handling MMIO / Doorbell / PCI Config accesses. Data is shared via **shared memory** - -## Prerequisites - -| Requirement | Description | -|---|---| -| Host OS | Linux x86_64 with KVM support (verified on WSL2 6.6.x) | -| Docker | Daemon running, current user in `docker` group | -| KVM | `/dev/kvm` accessible | -| Disk Space | At least 120 GB (55G disk image + build artifacts) | -| Memory | 16 GB or more recommended (gem5 compilation and runtime are memory-intensive) | -| Tools | `git`, `screen`, `unzip` | - -## Directory Structure - -``` -/home/zevorn/cosim/ - gem5/ # gem5 source (cosim branch) - build/VEGA_X86/gem5.opt # gem5 binary - configs/example/gpufs/ - mi300_cosim.py # cosim config script - scripts/ - run_mi300x_fs.sh # orchestration script - cosim_launch.sh # cosim one-click launch script - Dockerfile.run # runtime Docker image - gem5-resources/ # disk images, kernels, GPU apps - src/x86-ubuntu-gpu-ml/ - disk-image/x86-ubuntu-rocm70 # 55G raw disk image - vmlinux-rocm70 # kernel - docs/ # documentation - qemu/ # QEMU source (only needed for legacy backend) - build/qemu-system-x86_64 # QEMU binary -``` - ---- - -## Step 1: Build gem5 - -The gem5 binary links against Ubuntu 24.04 libraries and must be compiled in a compatible environment. - -> **Note:** The vfio-user backend requires `libjson-c-dev` (build-time) and `libjson-c5` (runtime). The `ghcr.io/gem5/gpu-fs:latest` image already includes this dependency. If building directly on the host, install `libjson-c-dev` first. 
- -### Option 1: Build inside Docker (Recommended) - -```bash -cd /home/zevorn/cosim/gem5 - -# Build using the gpu-fs image (amd64, includes all dependencies) -docker run --rm \ - -v "$(pwd):/gem5" -w /gem5 \ - gem5-run:local \ - scons build/VEGA_X86/gem5.opt -j4 -``` - -> **Note:** Reduce parallelism (`-j1` or `-j2`) if running out of memory. Using the gold linker reduces memory usage during the linking stage. - -### Option 2: Orchestration Script - -```bash -./scripts/run_mi300x_fs.sh build-gem5 -``` - -Output: `build/VEGA_X86/gem5.opt` (approximately 1.1 GB). - -### Build the Runtime Docker Image - -```bash -cd scripts -docker build -t gem5-run:local -f Dockerfile.run . -``` - -This image is based on `ghcr.io/gem5/gpu-fs` with Python 3.12 support added, used for running gem5. - ---- - -## Step 2: Build QEMU - -With the vfio-user backend, a **stock QEMU 10.0+** build works out of the box (the `vfio-user-pci` device is built-in) -- no custom QEMU code is needed. Standard build: - -```bash -# Any QEMU 10.0+ source tree works -mkdir -p qemu-build && cd qemu-build -/path/to/qemu/configure --target-list=x86_64-softmmu -make -j$(nproc) -``` - -Output: `qemu-system-x86_64`. - -> **Legacy backend:** If using `--cosim-backend=legacy`, the `cosim/qemu/` source containing the `mi300x-gem5` device is required. The build procedure is the same, but you must use the cosim branch QEMU source. - -Alternatively, via the orchestration script: - -```bash -cd /home/zevorn/cosim/gem5 -./scripts/run_mi300x_fs.sh build-qemu -``` - ---- - -## Step 3: Prepare the Disk Image and Kernel - -The disk image contains Ubuntu 24.04 + ROCm 7.0 + kernel 6.8.0-79-generic with amdgpu DKMS modules. - -### Automated Build - -```bash -./scripts/run_mi300x_fs.sh build-disk -``` - -### Manual Build - -```bash -cd ../gem5-resources/src/x86-ubuntu-gpu-ml -./build.sh -var "qemu_path=/usr/sbin/qemu-system-x86_64" -``` - -> On Arch Linux the QEMU path is `/usr/sbin/`, other distributions may use `/usr/bin/`. 
- -### Output - -| Artifact | Path | Size | -|---|---|---| -| Disk Image | `../gem5-resources/src/x86-ubuntu-gpu-ml/disk-image/x86-ubuntu-rocm70` | ~55 GB | -| Kernel | `../gem5-resources/src/x86-ubuntu-gpu-ml/vmlinux-rocm70` | ~64 MB | - ---- - -## Step 4: Launch cosim - -### Option 1: One-click Launch Script (Recommended) - -```bash -cd /home/zevorn/cosim/gem5 -./scripts/cosim_launch.sh -``` - -This script automatically performs all the following steps (starts the gem5 container, waits for readiness, fixes permissions, starts QEMU), and enters the QEMU serial console in interactive mode. - -Available options: - -```bash -./scripts/cosim_launch.sh --help -./scripts/cosim_launch.sh --gem5-debug MI300XCosim # enable gem5 debug output -./scripts/cosim_launch.sh --vram-size 32GiB # custom VRAM size -./scripts/cosim_launch.sh --num-cus 80 # custom CU count -./scripts/cosim_launch.sh --cosim-backend=vfio-user # use vfio-user backend (default) -./scripts/cosim_launch.sh --cosim-backend=legacy # use legacy custom socket backend -``` - -### Option 2: Manual Step-by-step Launch - -#### 4.1 Start gem5 (Docker Container) - -```bash -docker run -d --name gem5-cosim \ - -v /home/zevorn/cosim/gem5:/gem5 \ - -v /tmp:/tmp \ - -v /dev/shm:/dev/shm \ - -w /gem5 \ - -e PYTHONPATH=/usr/lib/python3.12/lib-dynload \ - gem5-run:local \ - /gem5/build/VEGA_X86/gem5.opt --listener-mode=on \ - /gem5/configs/example/gpufs/mi300_cosim.py \ - --socket-path=/tmp/gem5-mi300x.sock \ - --shmem-path=/mi300x-vram \ - --shmem-host-path=/cosim-guest-ram \ - --dgpu-mem-size=16GiB \ - --num-compute-units=40 \ - --mem-size=8G -``` - -#### 4.2 Wait for gem5 to be Ready - -```bash -# Watch gem5 logs, wait for "listening" or "ready" -docker logs -f gem5-cosim -``` - -The following output indicates readiness: - -``` -============================================================ -gem5 MI300X co-simulation server ready - Socket: /tmp/gem5-mi300x.sock - VRAM SHM: /mi300x-vram - Host SHM: /cosim-guest-ram - VRAM 
size: 16GiB - Host RAM: 8GiB - CUs: 40 -Waiting for QEMU to connect... -============================================================ -``` - -#### 4.3 Fix Permissions - -Files created by Docker are owned by root; permissions must be fixed so QEMU can access them: - -```bash -docker exec gem5-cosim chmod 777 /tmp/gem5-mi300x.sock -docker exec gem5-cosim chmod 666 /dev/shm/mi300x-vram -``` - -#### 4.4 Start QEMU - -```bash -# Foreground interactive mode (vfio-user backend, stock QEMU 10.0+) -qemu-system-x86_64 \ - -machine q35 -enable-kvm -cpu host \ - -m 8G -smp 4 \ - -object memory-backend-file,id=mem0,size=8G,mem-path=/dev/shm/cosim-guest-ram,share=on \ - -numa node,memdev=mem0 \ - -kernel /home/zevorn/cosim/gem5-resources/src/x86-ubuntu-gpu-ml/vmlinux-rocm70 \ - -append "console=ttyS0,115200 root=/dev/vda1 modprobe.blacklist=amdgpu" \ - -drive file=/home/zevorn/cosim/gem5-resources/src/x86-ubuntu-gpu-ml/disk-image/x86-ubuntu-rocm70,format=raw,if=virtio \ - -device 'vfio-user-pci,socket={"type":"unix","path":"/tmp/gem5-mi300x.sock"}' \ - -nographic -no-reboot -``` - -> **Important:** The kernel command line must include `modprobe.blacklist=amdgpu` to prevent the PCI subsystem from auto-loading the driver before the VGA ROM is written to shared memory. The `cosim-gpu-setup.service` handles the correct initialization order (dd ROM → modprobe). -> -> **Note:** With the vfio-user backend, there is no need to specify `shmem-path` or `vram-size` on the QEMU side. Shared memory is created and managed by the `MI300XVfioUser` server in gem5. 
- -Or run in background screen mode: - -```bash -screen -dmS qemu-cosim -L -Logfile /tmp/qemu-cosim-screen.log \ - qemu-system-x86_64 \ - -machine q35 -enable-kvm -cpu host \ - -m 8G -smp 4 \ - -object memory-backend-file,id=mem0,size=8G,mem-path=/dev/shm/cosim-guest-ram,share=on \ - -numa node,memdev=mem0 \ - -kernel /home/zevorn/cosim/gem5-resources/src/x86-ubuntu-gpu-ml/vmlinux-rocm70 \ - -append "console=ttyS0,115200 root=/dev/vda1 modprobe.blacklist=amdgpu" \ - -drive file=/home/zevorn/cosim/gem5-resources/src/x86-ubuntu-gpu-ml/disk-image/x86-ubuntu-rocm70,format=raw,if=virtio \ - -device 'vfio-user-pci,socket={"type":"unix","path":"/tmp/gem5-mi300x.sock"}' \ - -nographic -no-reboot - -# Attach to the screen session to view serial output -screen -r qemu-cosim -# Detach from screen: Ctrl-A D -``` - -#### 4.5 SSH Access to Guest - -The `cosim_launch.sh` script enables user networking and SSH port forwarding by default (`-netdev user,id=net0,hostfwd=tcp::2222-:22` + `virtio-net-pci`). To use SSH access to the guest, configure networking inside the guest first. - -**1. Identify the network interface name:** - -```bash -ip a -``` - -Look for the virtio NIC interface (e.g., `enp0s2`). The exact name may vary depending on the PCI topology. - -**2. Configure netplan:** - -Edit `/etc/netplan/50-cloud-init.yaml`: - -```yaml -network: - version: 2 - ethernets: - enp0s2: - dhcp4: true -``` - -> **Note:** Replace `enp0s2` with the actual interface name from the `ip a` output. - -**3. Apply the configuration:** - -```bash -netplan apply -``` - -**4. SSH from the host:** - -Open another terminal on the host and connect: - -```bash -ssh -p 2222 gem5@localhost -``` - -Default password: `12345`. - -> **Tip:** SSH access is much more convenient than the QEMU serial console for interactive use, file transfers (`scp -P 2222`), and running multiple sessions. 
- ---- - -## Step 5: Load the GPU Driver - -After the guest Linux finishes booting (auto-login as root), run the following commands to load the amdgpu driver. - -### Option 1: Automatic Loading (Default) - -The disk image includes `cosim-gpu-setup.service` which runs at boot: - -1. Writes VGA ROM to `0xC0000` via `dd` (required for gem5 `readROM()`) -2. Symlinks IP discovery firmware -3. Runs `modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2` - -The service completes in ~40 seconds. After login, verify with `rocm-smi`. - -### Option 2: Manual Loading - -```bash -# 1. Load VGA ROM (REQUIRED before modprobe) -dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128 - -# 2. Symlink the IP discovery firmware -ln -sf /usr/lib/firmware/amdgpu/mi300_discovery \ - /usr/lib/firmware/amdgpu/ip_discovery.bin - -# 3. Load the amdgpu driver -modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2 -``` - -> **Key parameter notes:** -> - `ip_block_mask=0x67` (binary 0110_0111) enables GMC, IH, DCN, GFX, SDMA, VCN, and disables PSP and SMU -> - Using an incorrect mask (e.g., 0x6f) will cause PSP initialization to trigger a GPU reset, resulting in a kernel panic -> - `ras_enable=0` is required to prevent a NULL pointer crash in `amdgpu_atom_parse_data_header` (the 3KB cosim ROM has minimal ATOMBIOS data) -> - The `dd` step is **mandatory** -- without it, the driver's BIOS discovery chain fails and `atom_context` is NULL - -### Verify Driver Loading - -```bash -# Check dmesg - should see "amdgpu: DRM initialized" and "7 XCP partitions" -dmesg | grep -i amdgpu | tail -20 - -# Verify device recognition -rocm-smi - -# Verify GPU capabilities -rocminfo | head -40 -``` - -Expected output: - -``` -# rocm-smi output -GPU[0] : Device Name: 0x74a0 -GPU[0] : Partition: SPX - -# rocminfo output -Name: gfx942 -Compute Unit: 320 -KERNEL_DISPATCH capable -``` - -> Approximately 80 fence fallback timer warnings may 
appear during the loading process. This is normal -- the DRM subsystem uses a polling-mode timeout fallback mechanism when probing all ring buffers. - ---- - -## Step 6: Run GPU Compute Tests - -### Compile a HIP Test Program - -Write a simple vector addition program inside the guest: - -```bash -cat > /tmp/vec_add.cpp << 'EOF' -#include <hip/hip_runtime.h> -#include <cstdio> - -__global__ void vec_add(int *a, int *b, int *c, int n) { - int i = blockIdx.x * blockDim.x + threadIdx.x; - if (i < n) c[i] = a[i] + b[i]; -} - -int main() { - const int N = 4; - int ha[N] = {1, 2, 3, 4}; - int hb[N] = {10, 20, 30, 40}; - int hc[N] = {0}; - - int *da, *db, *dc; - hipMalloc(&da, N * sizeof(int)); - hipMalloc(&db, N * sizeof(int)); - hipMalloc(&dc, N * sizeof(int)); - - hipMemcpy(da, ha, N * sizeof(int), hipMemcpyHostToDevice); - hipMemcpy(db, hb, N * sizeof(int), hipMemcpyHostToDevice); - - vec_add<<<1, N>>>(da, db, dc, N); - - hipMemcpy(hc, dc, N * sizeof(int), hipMemcpyDeviceToHost); - - printf("Result: %d %d %d %d\n", hc[0], hc[1], hc[2], hc[3]); - - bool pass = (hc[0]==11 && hc[1]==22 && hc[2]==33 && hc[3]==44); - printf("%s\n", pass ? "PASSED!" : "FAILED!"); - - hipFree(da); hipFree(db); hipFree(dc); - return pass ? 0 : 1; -} -EOF -``` - -Compile and run: - -```bash -# Compile (gfx942 = MI300X architecture) -/opt/rocm/bin/hipcc --offload-arch=gfx942 -o /tmp/vec_add /tmp/vec_add.cpp - -# Run -/tmp/vec_add -``` - -### Expected Output - -``` -Result: 11 22 33 44 -PASSED! -``` - -### Using the square Test from gem5-resources - -You can also use the square test program included in gem5-resources.
First compile it on the host: - -```bash -cd /home/zevorn/cosim/gem5 -./scripts/run_mi300x_fs.sh build-app square -``` - -Then copy the compiled binary into the guest (via scp or by directly mounting the disk image) and run it inside the guest: - -```bash -./square.default -``` - ---- - -## Shutting Down cosim - -### In the QEMU Serial Console - -``` -# Normal shutdown -poweroff - -# Or force quit QEMU -Ctrl-A X -``` - -### Clean Up Docker Container and Shared Memory - -```bash -docker rm -f gem5-cosim -rm -f /dev/shm/mi300x-vram /dev/shm/cosim-guest-ram -rm -f /tmp/gem5-mi300x.sock -``` - -> When using `cosim_launch.sh`, cleanup is performed automatically after exiting QEMU. - ---- - -## Troubleshooting - -### gem5 Container Exits Immediately After Starting - -```bash -docker logs gem5-cosim -``` - -Common causes: -- `gem5.opt` not compiled or incorrect path -- Python module import failure (check PYTHONPATH) -- Shared memory creation permission issues - -### QEMU Fails to Connect to gem5 - -``` -Failed to connect to /tmp/gem5-mi300x.sock -``` - -- Confirm gem5 has finished initialization (look for "Waiting for QEMU to connect") -- Confirm socket permissions have been fixed (`chmod 777`) - -### Driver Loading Fails -- PSP GPU Reset Panic - -``` -BUG: kernel NULL pointer dereference at psp_gpu_reset+0x43 -``` - -- An incorrect `ip_block_mask` was used. 
Must use `0x67` (disables PSP+SMU), not `0x6f` - -### gem5 Crash -- GART Translation Not Found - -``` -GART translation for 0x3fff800000000 not found -``` - -- This is a fixed bug: unmapped GART pages are now routed to a sink address (paddr=0) and no longer cause crashes -- If this still occurs, confirm you are using the latest compiled gem5 binary - -### hipcc Compilation Error -- Offload Arch - -``` -error: cannot find ROCm device library -``` - -- Confirm ROCm is properly installed: `ls /opt/rocm/lib/` -- Use the correct architecture flag: `--offload-arch=gfx942` - -### GPU Compute Timeout - -- Check gem5 logs (`docker logs gem5-cosim`) for errors -- A small number of fence timeouts is normal; a large number may indicate issues with the DMA or interrupt path - ---- - -## Key Parameter Reference - -| Parameter | Default | Description | -|---|---|---| -| `--socket-path` | `/tmp/gem5-mi300x.sock` | QEMU <-> gem5 communication socket (vfio-user protocol) | -| `--shmem-path` | `/mi300x-vram` | GPU VRAM shared memory name (under /dev/shm) | -| `--shmem-host-path` | `/cosim-guest-ram` | Guest RAM shared memory name | -| `--dgpu-mem-size` | `16GiB` | GPU VRAM size | -| `--num-compute-units` | `40` | Number of GPU compute units | -| `--mem-size` | `8GiB` | Guest physical memory size | -| `--cosim-backend` | `vfio-user` | Cosim backend type (`vfio-user` or `legacy`) | -| `ip_block_mask` | `0x67` | amdgpu driver IP block mask | -| `discovery` | `2` | Use IP discovery firmware | - -## Key File Reference - -| File | Purpose | -|---|---| -| `scripts/cosim_launch.sh` | cosim one-click launch script | -| `scripts/run_mi300x_fs.sh` | Orchestration script (compile, build image, run) | -| `configs/example/gpufs/mi300_cosim.py` | gem5 cosim configuration | -| `src/dev/amdgpu/mi300x_vfio_user.{cc,hh}` | gem5-side vfio-user server (default backend) | -| `src/dev/amdgpu/mi300x_gem5_cosim.{cc,hh}` | gem5-side legacy bridge (legacy backend) | -| `src/dev/amdgpu/amdgpu_device.cc` | GPU 
device model | -| `src/dev/amdgpu/amdgpu_vm.cc` | GPU address translation (GART, etc.) | -| `qemu/hw/misc/mi300x_gem5.c` | QEMU-side mi300x-gem5 PCIe device (legacy backend only) | - -## Version Matrix - -| Component | Version | -|---|---| -| Guest OS | Ubuntu 24.04.2 LTS | -| Guest Kernel | 6.8.0-79-generic | -| ROCm | 7.0.0 | -| amdgpu DKMS | Matches ROCm 7.0 | -| gem5 Build Target | VEGA_X86 | -| GPU Device | MI300X (gfx942, DeviceID 0x74A0) | -| Coherence Protocol | GPU_VIPER | -| QEMU | 10.0+ (vfio-user backend) or cosim branch (legacy backend) | diff --git a/docs/en/getting-started.md b/docs/en/getting-started.md new file mode 100644 index 0000000..9a98e02 --- /dev/null +++ b/docs/en/getting-started.md @@ -0,0 +1,532 @@ +[中文](../zh/getting-started.md) + +# Getting Started + +A quick-start guide for newcomers to the QEMU + gem5 MI300X co-simulation project. +From building the components to running your first HIP GPU compute test. + +## Overview + +``` ++---------------------------------+ +------------------------------+ +| QEMU (Q35 + KVM) | | gem5 (inside Docker) | +| +---------------------------+ | | +------------------------+ | +| | Guest Linux (Ubuntu 24.04)| | | | MI300X GPU Model | | +| | amdgpu driver | | | | - Shader + CU | | +| | ROCm 7.0 / HIP runtime | | | | - PM4 / SDMA Engines | | +| +-----------+---------------+ | | | - Ruby Cache Hierarchy | | +| | MMIO/Doorbell | | +----------+-------------+ | +| +-----------v---------------+ | | +----------v-------------+ | +| | vfio-user-pci (built-in) |<--------->| MI300XVfioUser Server | | +| +---------------------------+ |vfio-| +------------------------+ | +| |user | | ++---------------------------------+ +------------------------------+ + | | + v v + /dev/shm/cosim-guest-ram /dev/shm/mi300x-vram + (Guest Physical Memory, Shared) (GPU VRAM, Shared) +``` + +- **QEMU** handles CPU execution, Linux kernel boot, PCIe enumeration, and amdgpu driver loading. 
+- **gem5** models the MI300X GPU: Shader, Compute Units, Cache hierarchy, and DMA engines. +- They communicate via the **vfio-user protocol** over a Unix domain socket. QEMU uses its built-in `vfio-user-pci` device; gem5 runs `MI300XVfioUser` as the server. +- Guest RAM and GPU VRAM are shared via **shared memory** under `/dev/shm/`. + +For a deeper dive into the memory architecture and BAR layout, see [Architecture](architecture.md#memory-sharing-architecture). + +## Prerequisites + +| Requirement | Description | +|---|---| +| Host OS | Linux x86_64 with KVM support (verified on WSL2 6.6.x) | +| Docker | Daemon running, current user in `docker` group | +| KVM | `/dev/kvm` accessible | +| QEMU | `qemu-system-x86_64` installed (used by Packer during disk image build) | +| Disk Space | At least 120 GB (55G disk image + build artifacts) | +| Memory | 16 GB or more recommended (gem5 compilation and runtime are memory-intensive) | +| Tools | `git`, `screen`, `unzip` | + +## Building gem5 and QEMU + +### Build the Runtime Docker Image + +Before building gem5, create the runtime Docker image: + +```bash +cd scripts +docker build -t gem5-run:local -f Dockerfile.run . +``` + +This image is based on `ghcr.io/gem5/gpu-fs` with Python 3.12 support added. + +### Build gem5 + +The gem5 binary links against Ubuntu 24.04 libraries and must be compiled in a compatible environment. + +> **Note:** The vfio-user backend requires `libjson-c-dev` (build-time) and `libjson-c5` (runtime). The `gem5-run:local` image already includes this dependency. + +**Option 1: Orchestration Script** + +```bash +./scripts/run_mi300x_fs.sh build-gem5 +``` + +**Option 2: Build inside Docker (Manual)** + +```bash +cd /home/zevorn/cosim/gem5 + +docker run --rm \ + -v "$(pwd):/gem5" -w /gem5 \ + gem5-run:local \ + scons build/VEGA_X86/gem5.opt -j4 +``` + +> **Tip:** Reduce parallelism (`-j1` or `-j2`) if OOM-killed during linking. + +Output: `build/VEGA_X86/gem5.opt` (approximately 1.1 GB). 
+ +### Build QEMU + +With the vfio-user backend, a **stock QEMU 10.0+** build works out of the box -- the `vfio-user-pci` device is built-in and no custom QEMU code is needed. + +```bash +mkdir -p qemu-build && cd qemu-build +/path/to/qemu/configure --target-list=x86_64-softmmu +make -j$(nproc) +``` + +Or via the orchestration script: + +```bash +./scripts/run_mi300x_fs.sh build-qemu +``` + +Output: `qemu-system-x86_64`. + +> **Legacy backend:** If using `--cosim-backend=legacy`, the `cosim/qemu/` source containing the `mi300x-gem5` device is required. The build procedure is the same, but you must use the cosim branch QEMU source. + +## Building the Disk Image + +The disk image contains Ubuntu 24.04 + ROCm 7.0 + kernel 6.8.0-79-generic with amdgpu DKMS modules. + +### Automated Build + +```bash +./scripts/run_mi300x_fs.sh build-disk +``` + +If `gem5-resources` does not exist, it will be cloned automatically before the build begins. + +### Manual Build + +```bash +cd ../gem5-resources/src/x86-ubuntu-gpu-ml +./build.sh -var "qemu_path=/usr/sbin/qemu-system-x86_64" +``` + +> On Arch Linux the QEMU path is `/usr/sbin/`; other distributions may use `/usr/bin/`. + +### Output + +| Artifact | Path | Size | +|---|---|---| +| Disk Image | `gem5-resources/src/x86-ubuntu-gpu-ml/disk-image/x86-ubuntu-rocm70` | ~55 GB | +| Kernel | `gem5-resources/src/x86-ubuntu-gpu-ml/vmlinux-rocm70` | ~64 MB | + +> **Tip (China network):** If the build hangs on package downloads, apply the China mirror patch to speed up `apt` inside the VM. See [Reference §7](reference.md#7-china-mirror-configuration) for instructions. + +## Launching Co-simulation + +### Option 1: One-click Launch Script (Recommended) + +```bash +./scripts/cosim_launch.sh +``` + +This script automatically starts the gem5 container, waits for readiness, fixes permissions, starts QEMU, and enters the serial console in interactive mode. 
+ +Common options: + +```bash +./scripts/cosim_launch.sh --gem5-debug MI300XCosim # enable gem5 debug output +./scripts/cosim_launch.sh --vram-size 32GiB # custom VRAM size +./scripts/cosim_launch.sh --num-cus 80 # custom CU count +./scripts/cosim_launch.sh --cosim-backend=legacy # use legacy socket backend +``` + +### Option 2: Manual Step-by-step Launch + +#### Start gem5 (Docker Container) + +```bash +docker run -d --name gem5-cosim \ + -v /home/zevorn/cosim/gem5:/gem5 \ + -v /tmp:/tmp \ + -v /dev/shm:/dev/shm \ + -w /gem5 \ + -e PYTHONPATH=/usr/lib/python3.12/lib-dynload \ + gem5-run:local \ + /gem5/build/VEGA_X86/gem5.opt --listener-mode=on \ + /gem5/configs/example/gpufs/mi300_cosim.py \ + --socket-path=/tmp/gem5-mi300x.sock \ + --shmem-path=/mi300x-vram \ + --shmem-host-path=/cosim-guest-ram \ + --dgpu-mem-size=16GiB \ + --num-compute-units=40 \ + --mem-size=8G +``` + +#### Wait for gem5 to be Ready + +```bash +docker logs -f gem5-cosim +``` + +The following output indicates readiness: + +``` +============================================================ +gem5 MI300X co-simulation server ready + Socket: /tmp/gem5-mi300x.sock + VRAM SHM: /mi300x-vram + Host SHM: /cosim-guest-ram + VRAM size: 16GiB + Host RAM: 8GiB + CUs: 40 +Waiting for QEMU to connect... 
+============================================================ +``` + +#### Fix Permissions + +Files created by Docker are owned by root; permissions must be fixed so QEMU can access them: + +```bash +docker exec gem5-cosim chmod 777 /tmp/gem5-mi300x.sock +docker exec gem5-cosim chmod 666 /dev/shm/mi300x-vram +``` + +#### Start QEMU + +```bash +qemu-system-x86_64 \ + -machine q35 -enable-kvm -cpu host \ + -m 8G -smp 4 \ + -object memory-backend-file,id=mem0,size=8G,mem-path=/dev/shm/cosim-guest-ram,share=on \ + -numa node,memdev=mem0 \ + -kernel /home/zevorn/cosim/gem5-resources/src/x86-ubuntu-gpu-ml/vmlinux-rocm70 \ + -append "console=ttyS0,115200 root=/dev/vda1 modprobe.blacklist=amdgpu" \ + -drive file=/home/zevorn/cosim/gem5-resources/src/x86-ubuntu-gpu-ml/disk-image/x86-ubuntu-rocm70,format=raw,if=virtio \ + -device 'vfio-user-pci,socket={"type":"unix","path":"/tmp/gem5-mi300x.sock"}' \ + -nographic -no-reboot +``` + +> **Important:** The kernel command line must include `modprobe.blacklist=amdgpu` to prevent auto-loading the driver before the VGA ROM is written to shared memory. The `cosim-gpu-setup.service` handles the correct initialization order. + +#### SSH Access to Guest + +The `cosim_launch.sh` script enables user networking and SSH port forwarding by default. After configuring a network interface inside the guest with `netplan`, connect from the host: + +```bash +ssh -p 2222 gem5@localhost +# Default password: 12345 +``` + +### Shutting Down + +```bash +# In the QEMU serial console: +poweroff +# Or force quit: Ctrl-A X + +# Clean up Docker container and shared memory: +docker rm -f gem5-cosim +rm -f /dev/shm/mi300x-vram /dev/shm/cosim-guest-ram +rm -f /tmp/gem5-mi300x.sock +``` + +> When using `cosim_launch.sh`, cleanup is performed automatically after exiting QEMU. + +## GPU Driver Initialization + +The MI300X GPU driver can be loaded **automatically** or **manually** after the QEMU guest boots. 
All required files (ROM, firmware, kernel modules) are already included in the disk image. + +### Automatic Loading (Default) + +The disk image ships with `cosim-gpu-setup.service`, which runs at boot and performs: + +1. `dd` the VGA ROM to `0xC0000` (required for gem5's `readROM()` via shared memory) +2. Symlink IP discovery firmware +3. `modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2` + +The service completes in ~40 seconds. After guest login, the GPU is ready: + +```bash +rocm-smi # should show device 0x74a0 +rocminfo # should show gfx942 +``` + +The service file: + +```ini +# /etc/systemd/system/cosim-gpu-setup.service +[Unit] +Description=MI300X GPU Setup for Co-simulation +After=local-fs.target +Before=multi-user.target + +[Service] +Type=oneshot +RemainAfterExit=yes +ExecStart=/usr/local/bin/cosim-gpu-setup.sh + +[Install] +WantedBy=multi-user.target +``` + +> **Note:** `modprobe.blacklist=amdgpu` must remain in the kernel command line to prevent the PCI subsystem from auto-loading the driver before the ROM is written to shared memory. The systemd service handles the explicit `modprobe` after `dd`. + +### Manual Loading + +If the systemd service is not installed, or you need to reload the driver, run these commands manually after guest boot. + +**Prerequisites:** `cosim_launch.sh` is running (gem5 + QEMU are connected), the guest has booted with a root shell, and `modprobe.blacklist=amdgpu` was passed on the kernel command line. 
+ +**Quick reference (copy-paste ready):** + +```bash +dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128 +ln -sf /usr/lib/firmware/amdgpu/mi300_discovery /usr/lib/firmware/amdgpu/ip_discovery.bin +modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2 +``` + +### Detailed Steps + +#### Step 1: Load the VGA BIOS ROM + +```bash +dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128 +``` + +Writes the MI300X VBIOS ROM image to the legacy VGA ROM region at physical address `0xC0000` (768 KB). The amdgpu driver reads the VBIOS from this address during initialization. Without the ROM, the driver will report `"Unable to locate a BIOS ROM"`. + +| Parameter | Value | Meaning | +|-----------|-------|---------| +| `if` | `/root/roms/mi300.rom` | ROM binary file (in the disk image) | +| `of` | `/dev/mem` | Physical memory device | +| `bs` | `1k` | Block size = 1024 bytes | +| `seek` | `768` | Seek to 768 x 1024 = `0xC0000` | +| `count` | `128` | Write 128 x 1024 = 128 KB | + +#### Step 2: Symlink the IP Discovery Firmware + +```bash +ln -sf /usr/lib/firmware/amdgpu/mi300_discovery \ + /usr/lib/firmware/amdgpu/ip_discovery.bin +``` + +Points the driver's IP discovery firmware path to the MI300X-specific discovery binary. The `discovery=2` mode reads GPU IP block information from this firmware file rather than from GPU ROM/registers. 
+ +#### Step 3: Load the amdgpu Kernel Module + +```bash +modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2 +``` + +Key parameters: + +| Parameter | Value | Meaning | +|-----------|-------|---------| +| `ip_block_mask` | `0x67` | Disable PSP (bit 3) and SMU (bit 4); cosim does not model these | +| `ppfeaturemask` | `0` | Disable PowerPlay features; cosim has no power management hardware | +| `dpm` | `0` | Disable Dynamic Power Management | +| `audio` | `0` | Disable audio; no HDMI/DP audio in cosim | +| `ras_enable` | `0` | Disable RAS -- prevents NULL deref when VBIOS is minimal | +| `discovery` | `2` | Use firmware file for IP discovery | + +> **Warning**: Using `ip_block_mask=0x6f` (only disables SMU) will cause PSP firmware load failure and kernel panic. Always use `0x67`. + +> **Warning**: The `dd` step (Step 1) is **mandatory** before `modprobe`. Without it, the driver's BIOS discovery chain fails, resulting in a NULL pointer crash in `amdgpu_atom_parse_data_header`. + +### Verification + +```bash +# Check dmesg for amdgpu initialization +dmesg | grep -i amdgpu | tail -20 + +# Check PCI device +lspci | grep -i amd + +# Verify device recognition and capabilities +rocm-smi +rocminfo | head -40 +``` + +Expected output: + +``` +# rocm-smi +GPU[0] : Device Name: 0x74a0 +GPU[0] : Partition: SPX + +# rocminfo +Name: gfx942 +Compute Unit: 320 +KERNEL_DISPATCH capable +``` + +> Approximately 80 fence fallback timer warnings may appear during loading. This is normal -- the DRM subsystem uses a polling-mode timeout fallback when probing ring buffers. 
+ +### File Locations (Inside the Guest Disk Image) + +| File | Path | +|------|------| +| VGA BIOS ROM | `/root/roms/mi300.rom` | +| IP Discovery firmware | `/usr/lib/firmware/amdgpu/mi300_discovery` | +| Auto-load service | `/etc/systemd/system/cosim-gpu-setup.service` | +| Auto-load script | `/usr/local/bin/cosim-gpu-setup.sh` | +| amdgpu module | `/lib/modules/$(uname -r)/updates/dkms/amdgpu.ko.zst` | + +## Running HIP Tests + +### Compile a HIP Test Program + +Write a simple vector addition program inside the guest: + +```bash +cat > /tmp/vec_add.cpp << 'EOF' +#include <hip/hip_runtime.h> +#include <cstdio> + +__global__ void vec_add(int *a, int *b, int *c, int n) { + int i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < n) c[i] = a[i] + b[i]; +} + +int main() { + const int N = 4; + int ha[N] = {1, 2, 3, 4}; + int hb[N] = {10, 20, 30, 40}; + int hc[N] = {0}; + + int *da, *db, *dc; + hipMalloc(&da, N * sizeof(int)); + hipMalloc(&db, N * sizeof(int)); + hipMalloc(&dc, N * sizeof(int)); + + hipMemcpy(da, ha, N * sizeof(int), hipMemcpyHostToDevice); + hipMemcpy(db, hb, N * sizeof(int), hipMemcpyHostToDevice); + + vec_add<<<1, N>>>(da, db, dc, N); + + hipMemcpy(hc, dc, N * sizeof(int), hipMemcpyDeviceToHost); + + printf("Result: %d %d %d %d\n", hc[0], hc[1], hc[2], hc[3]); + + bool pass = (hc[0]==11 && hc[1]==22 && hc[2]==33 && hc[3]==44); + printf("%s\n", pass ? "PASSED!" : "FAILED!"); + + hipFree(da); hipFree(db); hipFree(dc); + return pass ? 0 : 1; +} +EOF +``` + +Compile and run: + +```bash +# Compile (gfx942 = MI300X architecture) +/opt/rocm/bin/hipcc --offload-arch=gfx942 -o /tmp/vec_add /tmp/vec_add.cpp + +# Run +/tmp/vec_add +``` + +### Expected Output + +``` +Result: 11 22 33 44 +PASSED! +``` + +### Using the square Test from gem5-resources + +You can also use the `square` test program included in gem5-resources.
Compile it on the host: + +```bash +./scripts/run_mi300x_fs.sh build-app square +``` + +Then copy the compiled binary into the guest (via `scp -P 2222` or by mounting the disk image) and run it: + +```bash +./square.default +``` + +Expected output: + +``` +info: running on device AMD Instinct MI300X +info: allocate host and device mem ( 7.63 MB) +info: launch 'vector_square' kernel +info: check result +PASSED! +``` + +## Appendix: Standalone gem5 GPU FS Simulation + +The co-simulation workflow described above uses QEMU for fast KVM-accelerated boot with gem5 providing only the GPU model. An alternative workflow runs **everything inside gem5** (CPU + GPU), with no QEMU involved. This is the standard gem5 full-system GPU simulation. + +### Key Differences + +| Aspect | Co-simulation (QEMU + gem5) | Standalone gem5 | +|---|---|---| +| CPU execution | KVM (near-native speed) | gem5 atomic/timing model | +| Boot time | ~30 seconds | ~2-5 minutes (KVM fast-forward) | +| GPU model | gem5 MI300X via vfio-user | gem5 MI300X (same model) | +| Driver loading | systemd service or manual `modprobe` | Automated via `m5 readfile` | +| Use case | Driver development, interactive debugging | Microarchitecture research, benchmarking | + +### Quick Start + +**1. Build gem5 and disk image** (same as the co-simulation steps above). + +**2. Build a GPU test application:** + +```bash +./scripts/run_mi300x_fs.sh build-app square +``` + +**3. Run the simulation:** + +```bash +./scripts/run_mi300x_fs.sh run \ + ../gem5-resources/src/gpu/square/bin.default/square.default +``` + +> **Important:** The `--app` parameter must always be specified. Without it, the driver is never loaded inside the guest. + +**4. Monitor output:** + +```bash +tail -f m5out/board.pc.com_1.device +``` + +The simulation uses KVM to fast-forward through Linux boot, then automatically loads the GPU driver and runs the specified application. The guest calls `m5 exit` when the test completes. 
+ +For full details on the standalone workflow, including legacy configuration, disk image verification with `guestfish`, and build process internals, see the gem5 documentation. + +## Quick Troubleshooting + +The five most common issues and their fixes: + +| Symptom | Cause | Fix | +|---------|-------|-----| +| gem5 container exits immediately | `gem5.opt` not compiled, wrong path, or Python import failure | Run `docker logs gem5-cosim` to see the error | +| `Failed to connect to /tmp/gem5-mi300x.sock` | gem5 not ready or socket permissions wrong | Wait for "Waiting for QEMU to connect" in gem5 logs; run `chmod 777` on the socket | +| NULL deref crash at `amdgpu_atom_parse_data_header` | VGA ROM was not written before `modprobe` | Run `dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128` before loading the driver | +| PSP GPU reset kernel panic | Wrong `ip_block_mask` (e.g., `0x6f` instead of `0x67`) | Always use `ip_block_mask=0x67` to disable both PSP and SMU | +| `hipcc` error: cannot find ROCm device library | ROCm not installed or wrong arch flag | Verify `/opt/rocm/lib/` exists; use `--offload-arch=gfx942` | + +For the complete troubleshooting table and debugging techniques, see [Reference §4](reference.md#4-known-issues-and-pitfalls). diff --git a/docs/en/gpu-fs-guide.md b/docs/en/gpu-fs-guide.md deleted file mode 100644 index 27960c8..0000000 --- a/docs/en/gpu-fs-guide.md +++ /dev/null @@ -1,323 +0,0 @@ -[中文](../zh/gpu-fs-guide.md) - -# gem5 MI300X Full-System GPU Simulation Reproduction Guide - -Reproduce the full-system GPU simulation of AMD Instinct MI300X on the cosim branch from scratch, -until the `square` test passes.
- -## Prerequisites - -| Requirement | Description | -|---|---| -| Host OS | Linux x86_64 with KVM support (verified on WSL2 6.6.x) | -| Docker | Daemon running, current user in `docker` group | -| KVM | `/dev/kvm` accessible (required for both disk image build and simulation) | -| QEMU | `qemu-system-x86_64` installed (used by Packer to build disk images) | -| Disk space | At least 120 GB free (55G disk image + build intermediates) | -| Tools | `git`, `unzip`, `guestfish` (optional, for disk image verification) | - -### Docker Images - -| Image | Purpose | -|---|---| -| `ghcr.io/gem5/gpu-fs:latest` | Base image for gem5 runtime container (amd64) | -| `gem5-run:local` | Runtime image built from `scripts/Dockerfile.run` | -| `ghcr.io/gem5/ubuntu-24.04_all-dependencies:v24-0` | gem5 compilation (arm64 only, see note below) | - -> **Note:** `ghcr.io/gem5/ubuntu-24.04_all-dependencies:v24-0` is arm64 only. -> On amd64 hosts, use `ghcr.io/gem5/gpu-fs` as the build image or compile natively. -> You can override the default image by setting the `GEM5_BUILD_IMAGE` environment variable. - -## Directory Structure - -``` -/home/zevorn/cosim/ - gem5/ # gem5 source (cosim branch) - build/VEGA_X86/gem5.opt # gem5 binary - configs/example/ - gem5_library/x86-mi300x-gpu.py # stdlib config - gpufs/mi300.py # legacy config - scripts/ - run_mi300x_fs.sh # orchestration script - Dockerfile.run # runtime Docker image - gem5-resources/ # disk images, kernels, GPU apps - src/x86-ubuntu-gpu-ml/ - disk-image/x86-ubuntu-rocm70 # 55G raw disk image - vmlinux-rocm70 # extracted kernel - src/gpu/square/ # square test app - docs/ # documentation - qemu/ # QEMU source (cosim device) - build/qemu-system-x86_64 -``` - -## Step 1: Build gem5 - -```bash -cd /home/zevorn/cosim/gem5 -./scripts/run_mi300x_fs.sh build-gem5 -``` - -This command runs `scons build/VEGA_X86/gem5.opt` inside Docker. -Output: `build/VEGA_X86/gem5.opt` (approximately 1.1 GB). 
- -Manual build without Docker: - -```bash -scons build/VEGA_X86/gem5.opt -j$(nproc) -``` - -## Step 2: Build QEMU (Optional, Only Required for Cosim Mode) - -```bash -./scripts/run_mi300x_fs.sh build-qemu -``` - -Requires QEMU source at `../qemu/`. Configures with `--target-list=x86_64-softmmu` and builds. -Output: `../qemu/build/qemu-system-x86_64`. - -## Step 3: Obtain gem5-resources - -```bash -./scripts/run_mi300x_fs.sh build-disk -# If gem5-resources does not exist, it will be cloned automatically, then disk image build begins -``` - -Or clone manually: - -```bash -cd /home/zevorn/cosim -git clone --depth 1 https://github.com/gem5/gem5-resources.git gem5-resources -``` - -## Step 4: Build the Disk Image - -The disk image build uses Packer + QEMU/KVM to install Ubuntu 24.04.2 + ROCm 7.0 + -kernel 6.8.0-79-generic with all required DKMS modules. - -### Automated Build (via Orchestration Script) - -```bash -./scripts/run_mi300x_fs.sh build-disk -``` - -### Manual Build - -```bash -cd ../gem5-resources/src/x86-ubuntu-gpu-ml - -# Download Packer and build -./build.sh -var "qemu_path=/usr/sbin/qemu-system-x86_64" -``` - -> **Important:** The default `qemu_path` in `x86-ubuntu-gpu-ml.pkr.hcl` is -> `/usr/bin/qemu-system-x86_64`. Some distributions (e.g., Arch) install it at -> `/usr/sbin/qemu-system-x86_64`, which requires overriding with `-var`. - -### Build Process Details - -1. Boot Ubuntu 24.04.2 ISO via QEMU/KVM for unattended installation -2. Run `scripts/rocm-install.sh`, which performs the following in order: - - Compile and install the `m5` tool from gem5 source (`/sbin/m5`) - - Install ROCm 7.0 from `repo.radeon.com/amdgpu/7.0/ubuntu` - - Install `amdgpu-dkms` (compile DKMS kernel modules) - - Install kernel `6.8.0-79-generic` and corresponding headers - - Extract `vmlinux` kernel for gem5 use - - Compile `gem5_wmi.ko` (ACPI patch module) - - Install PyTorch (ROCm 6.0 support) -3. 
Copy GPU BIOS ROM (`mi300.rom`), IP discovery files, and boot scripts into the image -4. Download the extracted kernel from the VM as `vmlinux-rocm70` - -### Output - -| Artifact | Path | Size | -|---|---|---| -| Disk image | `disk-image/x86-ubuntu-rocm70` | ~55 GB | -| Kernel | `vmlinux-rocm70` | ~64 MB | - -### Build Time - -Approximately 30-60 minutes, depending on network speed and host performance. - -### Verify the Disk Image (Optional) - -Use `guestfish` to inspect disk image contents without mounting: - -```bash -LIBGUESTFS_BACKEND=direct guestfish --ro \ - -a disk-image/x86-ubuntu-rocm70 -m /dev/sda1 <<'EOF' -echo "=== DKMS modules ===" -ls /lib/modules/6.8.0-79-generic/updates/dkms/ -echo "=== ROCm version ===" -cat /opt/rocm/.info/version -echo "=== load_amdgpu.sh ===" -cat /home/gem5/load_amdgpu.sh -echo "=== m5 binary ===" -is-file /sbin/m5 -echo "=== gem5_wmi module ===" -is-file /home/gem5/gem5_wmi.ko -EOF -``` - -Expected DKMS module list (all dependencies for the amdgpu driver): - -``` -amd-sched.ko.zst -amddrm_buddy.ko.zst -amddrm_exec.ko.zst # Critical module -- missing in older builds -amddrm_ttm_helper.ko.zst -amdgpu.ko.zst -amdkcl.ko.zst -amdttm.ko.zst -amdxcp.ko.zst -``` - -## Step 5: Build the GPU Test Application - -```bash -./scripts/run_mi300x_fs.sh build-app square -``` - -Compiles using Docker (`ghcr.io/gem5/gpu-fs`) or local `hipcc`. -Output: `../gem5-resources/src/gpu/square/bin.default/square.default`. - -## Step 6: Build the Runtime Docker Image - -The gem5 binary is linked against Ubuntu 24.04 libraries and requires a compatible runtime environment: - -```bash -cd scripts -docker build -t gem5-run:local -f Dockerfile.run . 
-``` - -## Step 7: Run the Simulation - -### stdlib Configuration (Recommended) - -```bash -./scripts/run_mi300x_fs.sh run \ - ../gem5-resources/src/gpu/square/bin.default/square.default -``` - -> **Important: The `--app` parameter must be specified.** Without it, `readfile_contents` is -> an empty string `""`, which Python evaluates as falsy, so `KernelDiskWorkload._set_readfile_contents` -> is never called, and the amdgpu driver in the guest is never loaded. - -### Legacy Configuration - -```bash -./scripts/run_mi300x_fs.sh run-legacy \ - ../gem5-resources/src/gpu/square/bin.default/square.default -``` - -### Simulation Process Details - -1. **KVM fast-boot phase** (~2-5 minutes): gem5 uses KVM to fast-forward Linux boot. - Guest kernel boots, systemd initializes, and auto-login as root occurs. -2. **readfile execution**: The guest runs `/home/gem5/run_gem5_app.sh` via `.bashrc`, - which calls `m5 readfile` to retrieve the host-injected script. -3. **Driver loading**: The script writes the GPU BIOS ROM to `/dev/mem`, creates symlinks - for IP discovery files, then runs `load_amdgpu.sh` to insmod all DKMS modules in dependency order. -4. **GPU application execution**: The script decodes the base64-encoded GPU binary, runs it, - then calls `m5 exit` to end the simulation. - -### Monitoring Output - -Guest serial console output is written to `m5out/board.pc.com_1.device`: - -```bash -tail -f m5out/board.pc.com_1.device -``` - -### Expected Output from the square Test - -``` -3+0 records in -3+0 records out -3072 bytes (3.1 kB, 3.0 KiB) copied, ... -info: running on device AMD Instinct MI300X -info: allocate host and device mem ( 7.63 MB) -info: launch 'vector_square' kernel -info: check result -PASSED! -``` - -## Troubleshooting - -### `Failed to init DRM client: -13` Followed by Kernel Panic - -**Root cause:** The disk image is missing the `amddrm_exec.ko.zst` DKMS module. 
Without this module, -the amdgpu TTM memory manager fails to initialize, `drm_dev_enter()` finds the device in an -"unplugged" state, and returns `-EACCES` (-13). The subsequent cleanup path triggers a NULL pointer -dereference in `ttm_resource_move_to_lru_tail`. - -**Fix:** Rebuild the disk image using the latest `gem5-resources` (`origin/stable` branch). -The updated `rocm-install.sh` installs kernel `6.8.0-79-generic`, which fully matches -the ROCm 7.0 DKMS packages and includes all required modules. - -**Verification:** Use `guestfish` to confirm that `amddrm_exec.ko.zst` exists in -`/lib/modules/6.8.0-79-generic/updates/dkms/`. - -### `Can't open /dev/gem5_bridge: No such file or directory` - -**Harmless warning.** The `m5` tool first attempts the `gem5_bridge` device driver, and falls back to -address-mapped MMIO mode (available when running as root) on failure. The readfile mechanism -still works correctly. - -### Packer Build Fails: `output_directory already exists` - -A leftover `disk-image/` directory from a previous build blocks Packer: - -```bash -mv disk-image disk-image-old -# Then re-run the build -``` - -### Packer Build Fails: git clone Fails Inside the VM - -Network issues inside the QEMU VM can cause `git clone` to fail. The `rocm-install.sh` script has -built-in retry logic (3 attempts, 10-second intervals). If it still fails, check the host network -connectivity and DNS resolution. - -### GPU Driver Does Not Load When `--app` Is Not Specified - -When running with `x86-mi300x-gpu.py` without the `--app` parameter, `readfile_contents` is -an empty string `""`. Python's truthiness check `elif readfile_contents:` evaluates to `False`, -so `_set_readfile_contents` is never called and the readfile is not written. The guest's -`run_gem5_app.sh` receives an empty file from `m5 readfile` and exits immediately. - -**Solution:** Always specify the `--app` parameter when running GPU simulations. 
- -### DRAM Capacity Warning - -``` -DRAM device capacity (16384 Mbytes) does not match the address range assigned (8192 Mbytes) -``` - -This is a configuration warning from the gem5 memory system and does not affect simulation correctness. - -## Key File Reference - -| File | Purpose | -|---|---| -| `scripts/run_mi300x_fs.sh` | Main orchestration script | -| `scripts/Dockerfile.run` | Runtime Docker image definition | -| `configs/example/gem5_library/x86-mi300x-gpu.py` | stdlib simulation config | -| `configs/example/gpufs/mi300.py` | Legacy simulation config | -| `src/python/gem5/prebuilt/viper/board.py` | ViperBoard: readfile injection, driver loading | -| `src/python/gem5/components/devices/gpus/amdgpu.py` | MI300X device definition | -| `src/dev/amdgpu/amdgpu_device.cc` | GPU device model core (modified in cosim branch) | -| `../gem5-resources/src/x86-ubuntu-gpu-ml/scripts/rocm-install.sh` | Disk image configuration script | -| `../gem5-resources/src/x86-ubuntu-gpu-ml/files/load_amdgpu.sh` | Guest-side driver loading script | -| `../gem5-resources/src/x86-ubuntu-gpu-ml/x86-ubuntu-gpu-ml.pkr.hcl` | Packer configuration | - -## Version Matrix - -| Component | Version | -|---|---| -| Guest OS | Ubuntu 24.04.2 LTS | -| Guest kernel | 6.8.0-79-generic | -| ROCm | 7.0.0 | -| amdgpu DKMS | Matches ROCm 7.0 | -| gem5 build target | VEGA_X86 | -| GPU device | MI300X (DeviceID 0x74A1) | -| Coherence protocol | GPU_VIPER | diff --git a/docs/en/mi300x-memory-management.md b/docs/en/mi300x-memory-management.md deleted file mode 100644 index fcc4ba7..0000000 --- a/docs/en/mi300x-memory-management.md +++ /dev/null @@ -1,332 +0,0 @@ -[中文](../zh/mi300x-memory-management.md) - -# MI300X Memory Management, Address Translation, and Mapping - -This document describes how the AMD MI300X GPU manages memory addresses in both standalone gem5 simulation and QEMU+gem5 co-simulation environments. - -## 1. 
GPU Address Spaces - -The MI300X (GFX 9.4.3) GPU uses multiple address spaces and apertures to access memory. Each memory access issued by the GPU is first classified by aperture, then translated into a physical address. - -``` -GPU Virtual Address (48-bit) -| -+-- AGP aperture [agpBot, agpTop] -| +-- Direct offset: paddr = vaddr - agpBot + agpBase -| -+-- GART aperture [ptStart<<12, ptEnd<<12] -| +-- Page table: paddr = GART_PTE[page_num].phys_addr | offset -| -+-- Framebuffer (FB) [fbBase, fbTop] -| +-- VRAM offset: vram_off = vaddr - fbBase -| -+-- System aperture [sysAddrL, sysAddrH] -| +-- Direct map: paddr = vaddr (system memory) -| -+-- MMHUB aperture [mmhubBase, mmhubTop] -| +-- VRAM mirror: vram_off = vaddr - mmhubBase -| -+-- User VM (VMID>0) [arbitrary VAs] - +-- Multi-level page table walk (4 or 5 levels) -``` - -### 1.1 Aperture Registers - -These MMIO registers define the boundaries of each aperture. The values are programmed by the amdgpu driver during GMC (Graphics Memory Controller) initialization. 
- -| Register | gem5 Field | Format | Description | -|----------|-----------|--------|-------------| -| `MC_VM_FB_LOCATION_BASE` | `vmContext0.fbBase` | `bits[23:0] << 24` | Start address of VRAM in MC address space | -| `MC_VM_FB_LOCATION_TOP` | `vmContext0.fbTop` | `bits[23:0] << 24 \| 0xFFFFFF` | End address of VRAM | -| `MC_VM_FB_OFFSET` | `vmContext0.fbOffset` | `bits[23:0] << 24` | FB relocation offset | -| `MC_VM_AGP_BASE` | `vmContext0.agpBase` | `bits[23:0] << 24` | AGP remap base address | -| `MC_VM_AGP_BOT` | `vmContext0.agpBot` | `bits[23:0] << 24` | AGP aperture bottom | -| `MC_VM_AGP_TOP` | `vmContext0.agpTop` | `bits[23:0] << 24 \| 0xFFFFFF` | AGP aperture top | -| `MC_VM_SYSTEM_APERTURE_LOW_ADDR` | `vmContext0.sysAddrL` | `bits[29:0] << 18` | System aperture low address | -| `MC_VM_SYSTEM_APERTURE_HIGH_ADDR` | `vmContext0.sysAddrH` | `bits[29:0] << 18` | System aperture high address | -| `VM_CONTEXT0_PAGE_TABLE_BASE_ADDR` | `vmContext0.ptBase` | raw 64-bit | Location of GART table in VRAM | -| `VM_CONTEXT0_PAGE_TABLE_START_ADDR` | `vmContext0.ptStart` | raw 64-bit | GART aperture start address (page number) | -| `VM_CONTEXT0_PAGE_TABLE_END_ADDR` | `vmContext0.ptEnd` | raw 64-bit | GART aperture end address (page number) | - -**Typical values in co-simulation** (from driver initialization diagnostics): -``` -ptBase = 0x3EE600000 GART table at VRAM offset ~15.7 GiB -ptStart = 0x7FFF00000 GART covers GPU VAs from 0x7FFF00000000 -ptEnd = 0x7FFF1FFFF GART covers ~128K pages (512 MiB) -fbBase = 0x8000000000 VRAM starts at MC address 512 GiB -fbTop = 0x8400FFFFFF VRAM ends at ~528 GiB (16 GiB range) -sysAddrL = 0x0 System aperture start -sysAddrH = 0x3FFEC0000 System aperture end (~4 TiB) -``` - -## 2. GART (Graphics Address Remapping Table) - -### 2.1 Overview - -GART is a single-level page table used by VMID 0 (kernel mode) to map GPU virtual addresses to system physical addresses. 
It enables the GPU to perform DMA access to host (guest) RAM for ring buffers, fence values, IH cookies, and other kernel-mode data structures. - -### 2.2 Table Layout - -``` -VRAM offset = ptBase (gartBase) -+-------------------+ ptBase + 0 -| PTE[0] (8 bytes) | maps page ptStart -+-------------------+ ptBase + 8 -| PTE[1] | maps page ptStart + 1 -+-------------------+ ptBase + 16 -| PTE[2] | maps page ptStart + 2 -| ... | -+-------------------+ -| PTE[N] | maps page ptStart + N -+-------------------+ ptBase + (ptEnd - ptStart + 1) * 8 -``` - -Each PTE is 8 bytes with the following format: - -| Bit Range | Field | Description | -|------|-------|-------------| -| 0 | Valid | Entry is valid | -| 1 | System | 1 = system memory, 0 = local VRAM | -| 5:2 | Fragment | Page fragment size | -| 47:12 | Physical Page | Physical address >> 12 | -| 51:48 | Block Fragment | Block fragment size | -| 63:52 | Flags | MTYPE, PRT, etc. | - -**Physical address extraction**: `paddr = (bits(PTE, 47, 12) << 12) | page_offset` - -### 2.3 getGARTAddr Transform - -Before GART lookup, addresses are transformed via `getGARTAddr()`: - -```cpp -// In pm4_packet_processor.cc and sdma_engine.cc: -Addr getGARTAddr(Addr addr) const { - if (!gpuDevice->getVM().inAGP(addr)) { - Addr low_bits = bits(addr, 11, 0); - addr = (((addr >> 12) << 3) << 12) | low_bits; - } - return addr; -} -``` - -This function multiplies the page number by 8 (the size of a PTE), effectively converting a GPU VA into a byte offset within the GART table. The subsequent GART translation uses this transformed address to look up the PTE. 
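The PTE layout and the `getGARTAddr()` transform described above can be checked with a few lines of Python. This is a minimal sketch rather than gem5 code: the helper names (`bits`, `get_gart_addr`, `pte_to_paddr`, `decode_pte`) are illustrative, the `in_agp` flag stands in for `gpuDevice->getVM().inAGP(addr)`, and the 64-bit mask is an assumption that mimics the wrap-around of gem5's `Addr` type.

```python
PAGE_SHIFT = 12          # 4 KiB pages
PTE_SIZE = 8             # each GART PTE is 8 bytes
MASK64 = (1 << 64) - 1   # assumption: emulate a 64-bit Addr


def bits(value, hi, lo):
    """Return bits [hi:lo] of value, inclusive (same idea as gem5's bits())."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def get_gart_addr(addr, in_agp=False):
    """Multiply the page number by the PTE size, keeping the page offset."""
    if in_agp:
        return addr  # AGP addresses are passed through untouched
    low_bits = bits(addr, 11, 0)
    page_num = addr >> PAGE_SHIFT
    return (((page_num * PTE_SIZE) << PAGE_SHIFT) | low_bits) & MASK64


def pte_to_paddr(pte, vaddr):
    """Physical address = PTE bits 47:12 as the page, VA bits 11:0 as the offset."""
    return (bits(pte, 47, 12) << PAGE_SHIFT) | bits(vaddr, 11, 0)


def decode_pte(pte):
    """Split a PTE into the fields listed in the table in section 2.2."""
    return {
        "valid": bool(bits(pte, 0, 0)),
        "system": bool(bits(pte, 1, 1)),
        "fragment": bits(pte, 5, 2),
        "phys_page": bits(pte, 47, 12),
    }


if __name__ == "__main__":
    print(hex(get_gart_addr(0x7FFF00032000)))
    # Hypothetical PTE: page 0x113100, valid + system bits set
    example_pte = (0x113100 << PAGE_SHIFT) | 0x3
    print(decode_pte(example_pte), hex(pte_to_paddr(example_pte, 0x7FFF00032010)))
```

Keeping the transform and the PTE decode as separate functions mirrors the two-stage flow in the text: the transform only produces a table offset, and the physical page comes from the PTE that the offset selects.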
- -### 2.4 Translation Flow - -``` -Original GPU VA (e.g., 0x7FFF00032000) - | - v getGARTAddr() -Transformed addr = ((VA>>12) * 8) << 12 | low_bits - = 0x3FFF80019_0000 (example) - | - v GARTTranslationGen::translate() -gart_addr = bits(transformed, 63, 12) = page_num * 8 - | - +-- Look up gartTable hash map (populated by writeFrame / SDMA shadow) - | - +-- Cosim fallback: read PTE from shared VRAM - | pte_offset = gart_addr - (ptStart * 8) - | pte = *(vramShmemPtr + ptBase + pte_offset) - | - v Extract physical address -paddr = (bits(PTE, 47, 12) << 12) | bits(VA, 11, 0) -``` - -### 2.5 gartTable Hash Map vs. Shared VRAM - -In standalone gem5 mode, GART entries are maintained in a hash map (`AMDGPUVM::gartTable`), populated by: - -1. **Direct writes** (`amdgpu_device.cc:writeFrame()`): When the driver writes to the GART region of VRAM via BAR0, the values are stored in `gartTable[offset]`. - -2. **SDMA shadow copies** (`sdma_engine.cc`): When SDMA writes to the GART range in device memory, the shadow copy updates `gartTable`. - -In co-simulation mode, the driver writes GART PTEs through QEMU's BAR0 mapping, going directly into shared VRAM without passing through gem5's `writeFrame()`. Therefore, `gartTable` is essentially empty. The co-simulation fallback reads PTEs directly from shared VRAM at `vramShmemPtr + ptBase`. - -## 3. MMHUB Aperture - -MMHUB (Memory Management Hub) provides a shadow mapping of VRAM. Addresses within the `[mmhubBase, mmhubTop]` range are translated by subtracting the base address: - -``` -vram_offset = vaddr - mmhubBase -``` - -SDMA uses this aperture to access device memory in VMID 0 mode. - -## 4. User-Space Translation (VMID > 0) - -User-space GPU programs (such as HIP applications) use multi-level page tables similar to x86-64 paging. Each VMID (1-15) has its own page table base register. 
- -``` -VM_CONTEXT[N]_PAGE_TABLE_BASE_ADDR -> Page Directory Base - | - v 4-level walk (PDE3 -> PDE2 -> PDE1 -> PDE0 -> PTE) -Physical address -``` - -The `UserTranslationGen` class performs this walk using the GPU's page table walker (`VegaISA::Walker`). SDMA in user mode (vmid > 0) uses this path. - -## 5. DMA Routing in gem5 - -### 5.1 PM4 Packet Processor - -``` -PM4PacketProcessor::translate(vaddr, size) - | - +-- inAGP(vaddr)? -> AGPTranslationGen (direct offset) - | - +-- else -> GARTTranslationGen (page table lookup) -``` - -All PM4 DMA uses GART translation (VMID 0). Addresses are transformed via `getGARTAddr()` before the DMA call. - -### 5.2 SDMA Engine - -``` -SDMAEngine::translate(vaddr, size) - | - +-- cur_vmid > 0? -> UserTranslationGen (multi-level page table) - | - +-- inAGP(vaddr)? -> AGPTranslationGen - | - +-- inMMHUB(vaddr)?-> MMHUBTranslationGen (VRAM shadow) - | - +-- else -> GARTTranslationGen -``` - -SDMA has more aperture awareness than PM4, as it handles both kernel-mode (VMID 0) and user-mode (VMID > 0) operations. - -### 5.3 VRAM vs. System Memory Detection - -For PM4's RELEASE_MEM and WRITE_DATA packets, the destination can be either VRAM or system memory. Routing works as follows: - -```cpp -bool vram = isVRAMAddress(pkt->addr); // addr < gpuDevice->getVRAMSize() -Addr addr = vram ? pkt->addr : getGARTAddr(pkt->addr); - -if (vram) - gpuDevice->getMemMgr()->writeRequest(addr, data, size); // device memory -else - dmaWriteVirt(addr, size, cb, data); // system memory via GART -``` - -## 6. Interrupt Handler (IH) DMA - -The interrupt handler uses raw system physical addresses (not GART): - -``` -IH Ring Buffer: regs.baseAddr (from IH_RB_BASE register) -Wptr Address: regs.WptrAddr (from IH_RB_WPTR_ADDR registers) -``` - -These are GPAs (Guest Physical Addresses) programmed by the driver. The IH write flow: -1. Write the interrupt cookie (32 bytes) to `baseAddr + IH_Wptr` -2. Write the updated write pointer to `WptrAddr` -3. 
Then call `intrPost()` to send an MSI-X interrupt to the guest - -In co-simulation mode, DMA writes land in shared guest RAM (`/dev/shm/cosim-guest-ram`), and interrupts are forwarded to QEMU via the event socket. - -## 7. Co-simulation Memory Architecture - -``` -+-----------------------------------------------------+ -| Host (Linux) | -| | -| /dev/shm/cosim-guest-ram (8 GiB) | -| +--------------------------------------------+ | -| | Guest Physical RAM | | -| | <- QEMU memory-backend-file (share=on) | | -| | <- gem5 system.shared_backstore | | -| | | | -| | Contains: page tables, ring buffers, | | -| | IH ring, fence values, kernel code/data | | -| +--------------------------------------------+ | -| | -| /dev/shm/mi300x-vram (16 GiB) | -| +--------------------------------------------+ | -| | GPU VRAM | | -| | <- QEMU BAR0 mmap (driver writes here) | | -| | <- gem5 vramShmemPtr (GPU model reads) | | -| | | | -| | Contains: GART page table, GPU page tables,| | -| | frame data, device-local allocations | | -| | | | -| | Layout: | | -| | [0, ~15.7G) General VRAM allocations | | -| | [0x3EE600000] GART page table (ptBase) | | -| | [~15.7G, 16G) Reserved / metadata | | -| +--------------------------------------------+ | -| | -| /tmp/gem5-mi300x.sock (Unix domain socket) | -| +--------------------------------------------+ | -| | MMIO connection: QEMU <-> gem5 (sync) | | -| | Event connection: gem5 -> QEMU (async) | | -| | - IRQ raise/lower | | -| | - DMA read/write requests | | -| +--------------------------------------------+ | -+-----------------------------------------------------+ -``` - -### 7.1 Memory Split (Q35) - -QEMU Q35 splits memory into two regions when RAM >= 2.75 GiB: -- Below-4G region: first 2 GiB (file offset 0) -- Above-4G region: the remainder at file offset 2 GiB, mapped to PA 0x100000000+ - -gem5's `mi300_cosim.py` replicates this split to ensure both sides maintain a consistent file layout. 
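The split above determines where a guest physical address lands inside the shared backing file. The sketch below illustrates that mapping under stated assumptions: the threshold `0xB0000000` (2.75 GiB) and the 2 GiB low region follow the description here, the behavior for smaller RAM sizes is taken from QEMU's usual Q35 defaults, and the function names `q35_split` and `gpa_to_file_offset` are hypothetical, not taken from `mi300_cosim.py`.

```python
GIB = 1 << 30
FOUR_GIB = 4 * GIB


def q35_split(total_mem):
    """Return (below_4g, above_4g) sizes in bytes for a given RAM size."""
    # Assumption: Q35 uses a 2 GiB low region once RAM reaches 2.75 GiB,
    # otherwise up to 2.75 GiB stays below 4G.
    lowmem = 0x80000000 if total_mem >= 0xB0000000 else 0xB0000000
    below_4g = min(total_mem, lowmem)
    return below_4g, total_mem - below_4g


def gpa_to_file_offset(gpa, total_mem):
    """Map a guest physical address to an offset in the shared RAM file."""
    below_4g, above_4g = q35_split(total_mem)
    if gpa < below_4g:
        return gpa  # low region starts at file offset 0
    if FOUR_GIB <= gpa < FOUR_GIB + above_4g:
        # high region is stored right after the low region in the file
        return below_4g + (gpa - FOUR_GIB)
    raise ValueError(f"GPA {gpa:#x} is not backed by guest RAM")


if __name__ == "__main__":
    below, above = q35_split(8 * GIB)
    print(f"below_4g={below // GIB} GiB, above_4g={above // GIB} GiB")
    print(hex(gpa_to_file_offset(0x113100000, 8 * GIB)))
```

If the two simulators disagree on `below_4g`, every address above 4 GiB resolves to a different file offset on each side, which is why both must compute the split the same way.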
- -### 7.2 GART PTE Co-simulation Fallback - -Since the driver writes GART PTEs through QEMU's BAR0 (shared memory), gem5's `gartTable` hash map is not populated. The co-simulation fallback reads PTEs directly from shared VRAM: - -```cpp -Addr pte_table_offset = gart_addr - (ptStart * 8); -Addr pte_vram_offset = gartBase() + pte_table_offset; -memcpy(&pte, vramShmemPtr + pte_vram_offset, sizeof(pte)); -``` - -If a PTE is 0 (unmapped page), co-simulation mode maps to a sink (`paddr=0`) instead of faulting, avoiding infinite DMA retry crashes caused by `GenericPageTableFault`. - -## 8. Address Flow Examples - -### 8.1 Fence Write (RELEASE_MEM) - -``` -1. PM4 RELEASE_MEM packet: addr=0x113100000 (guest phys), data=0x1234 -2. isVRAMAddress(0x113100000)? No (< 16 GiB but not a VRAM offset) -3. getGARTAddr(0x113100000) -> 0x898800000 (page * 8 transform) -4. dmaWriteVirt(0x898800000, 8, cb, &data) -5. GARTTranslationGen::translate() - - gart_addr = 0x898800 - - Look up PTE from shared VRAM -> PTE has paddr bits - - paddr = extracted address (in guest RAM) -6. DMA write lands in /dev/shm/cosim-guest-ram at paddr offset -7. Guest driver reads fence value from same shared memory -``` - -### 8.2 HIP Kernel Dispatch - -``` -1. User writes AQL packet to queue ring buffer (user VA) -2. User writes doorbell -> QEMU -> gem5 (socket MMIO) -3. gem5 PM4 reads queue MQD (GART address -> guest RAM) -4. gem5 GPU command processor dispatches kernel to CU array -5. CUs execute wavefronts (compute work) -6. On completion: RELEASE_MEM writes fence + triggers interrupt -7. IH writes cookie to IH ring (raw DMA to guest RAM) -8. intrPost() -> sendIrqRaise(0) -> QEMU event socket -9. QEMU msix_notify() -> guest IH handler processes interrupt -10. hipDeviceSynchronize() returns success -``` - -## 9. 
Key Source Files - -| File | Purpose | -|------|------| -| `src/dev/amdgpu/amdgpu_vm.{cc,hh}` | All translation generators (GART, AGP, MMHUB, User) | -| `src/dev/amdgpu/pm4_packet_processor.cc` | PM4 DMA routing and GART address transform | -| `src/dev/amdgpu/sdma_engine.cc` | SDMA DMA routing, GART shadow copies | -| `src/dev/amdgpu/interrupt_handler.cc` | IH ring buffer DMA and interrupt delivery | -| `src/dev/amdgpu/amdgpu_device.cc` | Device-level intrPost(), writeFrame() | -| `src/dev/amdgpu/mi300x_gem5_cosim.cc` | Co-simulation socket bridge, IRQ forwarding | -| `configs/example/gpufs/mi300_cosim.py` | Memory configuration, shared backstore setup | diff --git a/docs/en/reference.md b/docs/en/reference.md new file mode 100644 index 0000000..3e0a79a --- /dev/null +++ b/docs/en/reference.md @@ -0,0 +1,564 @@ +[中文](../zh/reference.md) + +# Co-simulation Reference Guide + +Consolidated lookup reference for the QEMU + gem5 MI300X co-simulation system. For conceptual explanations, see [architecture.md](architecture.md). For step-by-step build/run instructions, see [getting-started.md](getting-started.md). + +--- + +## 1. Parameter Reference + +### 1.1 cosim_launch.sh / mi300_cosim.py Options + +| Parameter | Default | Description | +|-----------|---------|-------------| +| `--socket-path` | `/tmp/gem5-mi300x.sock` | QEMU <-> gem5 communication socket (vfio-user protocol) | +| `--shmem-path` | `/mi300x-vram` | GPU VRAM shared memory name (under `/dev/shm`) | +| `--shmem-host-path` | `/cosim-guest-ram` | Guest RAM shared memory name (under `/dev/shm`) | +| `--dgpu-mem-size` | `16GiB` | GPU VRAM size | +| `--num-compute-units` | `40` | Number of GPU compute units | +| `--mem-size` | `8GiB` | Guest physical memory size | +| `--cosim-backend` | `vfio-user` | Cosim backend type: `vfio-user` (stock QEMU 10.0+) or `legacy` (custom QEMU) | +| `--gem5-debug` | (none) | gem5 debug flag(s), e.g. 
`MI300XCosim`, `AMDGPUDevice,PM4PacketProcessor` | +| `--vram-size` | `32GiB` | Custom VRAM size (alias for `--dgpu-mem-size`) | +| `--num-cus` | `80` | Custom CU count (alias for `--num-compute-units`) | + +### 1.2 amdgpu modprobe Parameters + +All parameters are required for co-simulation. The full command: + +```bash +modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2 +``` + +| Parameter | Value | Purpose | +|-----------|-------|---------| +| `ip_block_mask` | `0x67` | Binary `0110_0111`. Enables common, GMC, IH, GFX, SDMA; disables PSP (bit 3) and SMU (bit 4). See [Section 3](#3-ip-block-mask-reference) for details | +| `ppfeaturemask` | `0` | Disable all PowerPlay features; cosim has no power management hardware | +| `dpm` | `0` | Disable Dynamic Power Management | +| `audio` | `0` | Disable HDMI/DP audio; no audio hardware in cosim | +| `ras_enable` | `0` | Disable RAS (Reliability, Availability, Serviceability). Prevents NULL deref on `atom_context` when VBIOS is minimal (3 KB cosim ROM) | +| `discovery` | `2` | Use firmware file on disk for IP discovery instead of GPU ROM/registers | + +> **Warning**: Using `ip_block_mask=0x6f` (enables PSP at bit 3) causes PSP firmware load failure and kernel panic. Always use `0x67`. + +> **Warning**: `ras_enable=0` is mandatory. Without it, `amdgpu_ras_init` calls `amdgpu_atom_parse_data_header` on NULL `atom_context`, crashing with a NULL pointer dereference. 
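The `ip_block_mask` values in the warnings above are easy to get wrong by one bit, so it can help to decode them mechanically. The following is a minimal sketch, assuming the MI300X discovery order listed in Section 3 of this reference; the list `DISCOVERY_ORDER` and the helpers `enabled_blocks` and `build_mask` are illustrative names, not part of the driver or of the cosim scripts.

```python
# Assumed discovery order for MI300X (see Section 3); index == bit position.
DISCOVERY_ORDER = [
    "soc15_common",  # bit 0
    "gmc_v9_0",      # bit 1
    "vega20_ih",     # bit 2
    "psp",           # bit 3
    "smu",           # bit 4
    "gfx_v9_4_3",    # bit 5
    "sdma_v4_4_2",   # bit 6
    "vcn_v4_0_3",    # bit 7
    "jpeg_v4_0_3",   # bit 8
]


def enabled_blocks(mask):
    """List the IP blocks whose discovery-order bit is set in mask."""
    return [name for i, name in enumerate(DISCOVERY_ORDER) if mask & (1 << i)]


def build_mask(names):
    """Build a mask from block names; raises ValueError on an unknown name."""
    mask = 0
    for name in names:
        mask |= 1 << DISCOVERY_ORDER.index(name)
    return mask


if __name__ == "__main__":
    for mask in (0x67, 0x6F):
        print(f"{mask:#04x} ({mask:08b}): {', '.join(enabled_blocks(mask))}")
```

Running it for `0x67` and `0x6f` shows that the two masks differ only by the `psp` entry, which is the block the first warning says must stay disabled.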
+ +### 1.3 dd Command Parameters (VGA ROM) + +```bash +dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128 +``` + +| Parameter | Value | Meaning | +|-----------|-------|---------| +| `if` | `/root/roms/mi300.rom` | ROM binary file (inside disk image) | +| `of` | `/dev/mem` | Physical memory device | +| `bs` | `1k` | Block size = 1024 bytes | +| `seek` | `768` | Seek to 768 x 1024 = `0xC0000` (legacy VGA ROM region) | +| `count` | `128` | Write 128 x 1024 = 128 KB | + +The `dd` step writes the MI300X VBIOS to physical address `0xC0000`--`0xDFFFF` in shared memory (`/dev/shm/cosim-guest-ram`). gem5's `AMDGPUDevice::readROM()` reads from this address via `system->getPhysMem()`. This step is **mandatory** before `modprobe` -- all five BIOS discovery methods fail in cosim mode: + +| BIOS Discovery Method | Why It Fails in Cosim | +|-----------------------|----------------------| +| `amdgpu_atrm_get_bios()` | No ACPI ATRM method in QEMU Q35 | +| `amdgpu_acpi_vfct_bios()` | No ACPI VFCT table | +| `amdgpu_read_bios_from_rom()` | Reads via SMU registers, but SMU disabled by `ip_block_mask=0x67` | +| `amdgpu_read_platform_bios()` | No platform-provided ROM | +| `amdgpu_read_disabled_bios()` | Not functional in cosim | + +### 1.4 Kernel Command Line + +The kernel must be booted with: + +``` +console=ttyS0,115200 root=/dev/vda1 modprobe.blacklist=amdgpu +``` + +`modprobe.blacklist=amdgpu` prevents auto-loading the driver before the ROM is written to shared memory. The `cosim-gpu-setup.service` handles the correct initialization order (dd ROM, then modprobe). + +--- + +## 2. 
Version Matrix + +| Component | Version | +|-----------|---------| +| Guest OS | Ubuntu 24.04.2 LTS | +| Guest Kernel | 6.8.0-79-generic | +| ROCm | 7.0.0 | +| amdgpu DKMS | Matches ROCm 7.0 | +| gem5 Build Target | VEGA_X86 | +| GPU Device | MI300X (gfx942, DeviceID 0x74A0) | +| Coherence Protocol | GPU_VIPER | +| QEMU | 10.0+ (vfio-user backend) or cosim branch (legacy backend) | + +### Docker Images + +| Image | Purpose | +|-------|---------| +| `ghcr.io/gem5/gpu-fs:latest` | Base image for gem5 runtime container (amd64) | +| `gem5-run:local` | Runtime image built from `scripts/Dockerfile.run` (adds Python 3.12 support) | +| `ghcr.io/gem5/ubuntu-24.04_all-dependencies:v24-0` | gem5 compilation (arm64 only) | + +> On amd64 hosts, use `ghcr.io/gem5/gpu-fs` as the build image or compile natively. + +### Build Artifacts + +| Artifact | Path | Size | +|----------|------|------| +| gem5 binary | `build/VEGA_X86/gem5.opt` | ~1.1 GB | +| Disk image | `../gem5-resources/src/x86-ubuntu-gpu-ml/disk-image/x86-ubuntu-rocm70` | ~55 GB | +| Kernel | `../gem5-resources/src/x86-ubuntu-gpu-ml/vmlinux-rocm70` | ~64 MB | +| QEMU binary | `qemu/build/qemu-system-x86_64` | -- | + +--- + +## 3. IP Block Mask Reference + +### Discovery Order Table + +The `ip_block_mask` parameter uses the **discovery order index** as bit positions, NOT the `amd_ip_block_type` enum values from `amd_shared.h`. The enum values are misleading -- what matters is the order blocks appear during IP discovery. + +MI300X discovery order (ROCm 7.0 DKMS, from dmesg): + +| Index | IP Block | Bit in Mask | Enabled in 0x67? 
| +|-------|----------|-------------|-------------------| +| 0 | `soc15_common` | `0x01` | Yes | +| 1 | `gmc_v9_0` | `0x02` | Yes | +| 2 | `vega20_ih` | `0x04` | Yes | +| 3 | `psp` | `0x08` | **No** (disabled) | +| 4 | `smu` | `0x10` | **No** (disabled) | +| 5 | `gfx_v9_4_3` | `0x20` | Yes | +| 6 | `sdma_v4_4_2` | `0x40` | Yes | +| 7 | `vcn_v4_0_3` | `0x80` | No (not needed) | +| 8 | `jpeg_v4_0_3` | `0x100` | No (not needed) | + +### Bit Mask Calculation + +The driver checks `(amdgpu_ip_block_mask & (1 << i))` where `i` is the discovery order index (`amdgpu_device.c:2807`). + +``` +0x67 = 0110_0111 (binary) + ||||_|||| + |||| |||+-- bit 0: soc15_common (enabled) + |||| ||+--- bit 1: gmc_v9_0 (enabled) + |||| |+---- bit 2: vega20_ih (enabled) + |||| +----- bit 3: psp (DISABLED) + |||+------- bit 4: smu (DISABLED) + ||+-------- bit 5: gfx_v9_4_3 (enabled) + |+--------- bit 6: sdma_v4_4_2 (enabled) + +---------- bit 7: vcn_v4_0_3 (disabled) +``` + +### Common Mask Values + +| Mask | Binary | Enables | Use Case | +|------|--------|---------|----------| +| `0x67` | `0110_0111` | common, GMC, IH, GFX, SDMA | **Cosim (correct)** | +| `0x6f` | `0110_1111` | common, GMC, IH, PSP, GFX, SDMA | **Wrong -- PSP causes kernel panic** | +| `0xFF` | `1111_1111` | All blocks including PSP+SMU | Real hardware only | + +--- + +## 4. Known Issues and Pitfalls + +### 4.1 VGA ROM NULL Dereference + +| | | +|---|---| +| **Symptom** | `modprobe amdgpu` causes kernel NULL pointer dereference at `amdgpu_atom_parse_data_header+0x1b`. Call chain: `amdgpu_ras_init` -> `amdgpu_atomfirmware_mem_ecc_supported` -> `amdgpu_atom_parse_data_header`. RAX=0 (NULL `atom_context`) | +| **Root Cause** | All five BIOS discovery methods fail in cosim mode (see [Section 1.3](#13-dd-command-parameters-vga-rom)). The driver logs `"Unable to locate a BIOS ROM"` and proceeds, but the RAS init path unconditionally calls `amdgpu_atom_parse_data_header()` without NULL-checking `atom_context`. 
QEMU's `romfile=` property is insufficient -- the amdgpu driver uses SMU register-based ROM access, not the PCI ROM BAR | +| **Fix** | Run `dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128` **before** `modprobe`. The `cosim-gpu-setup.service` does this automatically | + +### 4.2 PSP / SMU Firmware Load Failure + +| | | +|---|---| +| **Symptom** | `PSP load tmr failed!`, `hw_init of IP block failed -22`, `Fatal error during GPU init` | +| **Root Cause** | `ip_block_mask=0x6f` enables PSP (discovery index 3) but cosim does not model PSP hardware. The `amd_ip_block_type` enum in `amd_shared.h` shows PSP=4, but the mask uses discovery order where PSP is index 3 | +| **Fix** | Use `ip_block_mask=0x67` to disable both PSP (bit 3) and SMU (bit 4). See [Section 3](#3-ip-block-mask-reference) | + +### 4.3 SIGIO Coalescing Deadlock (Legacy Backend Only) + +| | | +|---|---| +| **Symptom** | Driver hangs on first INDEX2/DATA2 register pair access. gem5 stops responding after ~15 messages. QEMU socket buffer fills up | +| **Root Cause** | Linux FASYNC/SIGIO is edge-triggered. When QEMU sends a write + read in quick succession, both arrive before gem5's SIGIO handler fires. Only one signal is delivered; the handler reads one message and the second is stranded forever | +| **Fix** | `MI300XGem5Cosim::handleClientData()` uses a `do/while` drain loop with `poll(fd, POLLIN, 0)` to read all pending messages per SIGIO. Not applicable to vfio-user backend (uses libvfio-user's non-blocking poll) | + +### 4.4 GART Table Not Populated in Co-simulation + +| | | +|---|---| +| **Symptom** | Massive `GART translation for X not found` warnings. PM4 reads all-zero memory (opcode 0x0). KIQ ring test times out | +| **Root Cause** | In both backends, VRAM is backed by shared memory (`/dev/shm/mi300x-vram`). 
Driver writes to VRAM bypass gem5's memory system entirely, so `AMDGPUVM::gartTable` hash map is never populated via `AMDGPUDevice::writeFrame()` | +| **Fix** | Co-simulation fallback in `GARTTranslationGen::translate()`: when `gartTable` misses, read the PTE directly from shared VRAM at `vramShmemPtr + (gartBase - fbBase) + gart_byte_offset`. Key detail: `getGARTAddr()` already multiplies the page index by 8, so `bits(vaddr, 63, 12)` is already a byte offset -- do not multiply by 8 again | + +### 4.5 GART Unmapped Page Crash + +| | | +|---|---| +| **Symptom** | After `hipMalloc OK`, gem5 segfaults with repeated `GART translation for 0x3fff800000000 not found`. Memory exhaustion from infinite DMA retry | +| **Root Cause** | GPU PM4/SDMA engines attempt DMA to GART pages the driver has not mapped (PTE=0). The original code created `GenericPageTableFault`, but the DMA callback chain retried the same failing address infinitely | +| **Fix** | Unmapped GART pages are mapped to a sink (`paddr=0`). DMA reads return zeros, writes are discarded, simulation stays alive. This is normal: the first page at `ptStart` is simply unmapped | + +### 4.6 SDMA Ring Test Timeout + +| | | +|---|---| +| **Symptom** | SDMA ring test returns `-110` (`-ETIMEDOUT`) during driver initialization. `sdma v4_4_2: ring 0 test failed (-110)` | +| **Root Cause** | `sdma_delay` in `sdma_engine.hh` defaults to `1e9` ticks. In cosim mode, this translates to ~500ms wall-clock time, exceeding the driver's ~200ms timeout window. Flow: driver writes to SDMA ring, rings doorbell, gem5 schedules SDMA event with `sdma_delay` ticks delay, driver times out before gem5 completes | +| **Fix** | Reduced `sdma_delay` from `1e9` to `1000` ticks. 
Increased `KEEPALIVE_INTERVAL` to `1e9` to prevent keepalive interference | + +### 4.7 VRAM Address GART Translation Error + +| | | +|---|---| +| **Symptom** | Address `0x1f72fa8000` triggers 861,000+ GART translation errors, memory exhaustion, segfault | +| **Root Cause** | SDMA rptr writeback and PM4 RELEASE_MEM destination addresses may point to VRAM (address < 16 GiB). When these pass through `getGARTAddr()`, page number is multiplied by 8, and GART lookup fails because VRAM has no page table entries | +| **Fix** | Three-layer defense: (1) PM4: `writeData()`, `releaseMem()`, `queryStatus()` check `isVRAMAddress(addr)` and route to `getMemMgr()->writeRequest()`. (2) SDMA: `setGfxRptrLo/Hi()` and rptr writeback skip `getGARTAddr()` for VRAM addresses. (3) GART fallback: detect VRAM addresses and map to sink (`paddr=0`) | + +### 4.8 Shared Memory File Offset Mismatch + +| | | +|---|---| +| **Symptom** | GART page table entries read back as all zeros. PM4 opcode 0x0 (NOP, count 0) repeats infinitely | +| **Root Cause** | QEMU Q35 with 8 GiB RAM: `below_4g = 2 GiB` (hardcoded when `ram_size >= 0xB0000000`). gem5 configured as 3 GiB below / 5 GiB above. QEMU places above-4G data at file offset 2 GiB; gem5 reads from offset 3 GiB -- all zeros | +| **Fix** | `mi300_cosim.py` replicates the Q35 split logic: `below_4g = min(total_mem, 0x80000000 if total_mem >= 0xB0000000 else 0xB0000000)` | + +### 4.9 Timer Overflow Crash + +| | | +|---|---| +| **Symptom** | After billions of ticks, gem5 crashes due to `curTick()` integer overflow. `schedule()` assertion failure | +| **Root Cause** | RTC and PIT timers continuously schedule events, causing tick counter overflow in cosim's long-running mode | +| **Fix** | Added `disable_rtc_events` parameter to `Cmos` and `disable_timer_events` to `I8254`. Both disabled in `mi300_cosim.py`. 
A keepalive event in the cosim bridge prevents the event queue from becoming empty | + +### 4.10 PM4ReleaseMem.dataSelect Panic + +| | | +|---|---| +| **Symptom** | gem5 panics with `Unimplemented PM4ReleaseMem.dataSelect` | +| **Root Cause** | `pm4_packet_processor.cc` only implemented `dataSelect == 1` (32-bit data write). Driver uses other modes during GFX initialization | +| **Fix** | Added all common dataSelect values: 0 = no data (event trigger only), 1 = 32-bit write (existed), 2 = 64-bit write, 3 = 64-bit GPU clock counter, other = warn and no-op | + +### 4.11 Unsupported PM4 Opcodes + +| | | +|---|---| +| **Symptom** | gem5 crashes on unrecognized PM4 opcode | +| **Root Cause** | `ACQUIRE_MEM` (0x58) and `SET_RESOURCES` (0xA0) were not handled | +| **Fix** | Both added to `pm4_defines.hh` and handled in `pm4_packet_processor.cc:decodeHeader()` as skip-and-continue (NOP) | + +### 4.12 PCI Class Code Mismatch + +| | | +|---|---| +| **Symptom** | amdgpu driver skips the legacy VGA ROM check at `0xC0000` | +| **Root Cause** | PCI class was `PCI_CLASS_DISPLAY_OTHER (0x0380)` instead of `PCI_CLASS_DISPLAY_VGA (0x0300)` | +| **Fix** | Changed to `PCI_CLASS_DISPLAY_VGA`. Kernel now recognizes the address range as "shadowed ROM" | + +### 4.13 QEMU Serial Console Conflict + +| | | +|---|---| +| **Symptom** | No serial output from guest when using `-serial unix:/tmp/serial.sock -nographic` together | +| **Root Cause** | `-nographic` implies `-serial mon:stdio`, creating serial0 on stdio. Explicit `-serial unix:...` becomes serial1 (ttyS1), but kernel uses `console=ttyS0` | +| **Fix** | Use `-nographic` alone. 
For programmatic access, run QEMU inside `screen` | + +### 4.14 OOM During gem5 Linking + +| | | +|---|---| +| **Symptom** | Linker killed by OOM killer even with `-j2` | +| **Root Cause** | Default linker uses too much memory | +| **Fix** | Use `scons build/VEGA_X86/gem5.opt -j1 GOLD_LINKER=True --linker=gold` | + +### 4.15 DRM Client Error -13 (Missing DKMS Module) + +| | | +|---|---| +| **Symptom** | `Failed to init DRM client: -13` followed by kernel panic. NULL pointer dereference in `ttm_resource_move_to_lru_tail` | +| **Root Cause** | Disk image missing `amddrm_exec.ko.zst` DKMS module. Without it, TTM memory manager fails, `drm_dev_enter()` returns `-EACCES` (-13) | +| **Fix** | Rebuild disk image using latest `gem5-resources` (`origin/stable` branch). Verify with `guestfish` that `amddrm_exec.ko.zst` exists in `/lib/modules/6.8.0-79-generic/updates/dkms/` | + +### 4.16 Driver rmmod After hw_init Failure + +| | | +|---|---| +| **Symptom** | After a driver `hw_init` failure, `rmmod amdgpu` causes kernel oops (page fault in `kgd2kfd_device_exit`). Module gets stuck in "busy" state | +| **Root Cause** | Cleanup path not robust after partial initialization | +| **Fix** | No workaround. Restart the entire cosim environment (kill QEMU, restart gem5 Docker container, restart QEMU) | + +--- + +## 5. Debugging Quick Reference + +### gem5 Debug Flags + +| Flag Combination | What It Shows | +|------------------|---------------| +| `MI300XCosim` | Cosim socket/vfio-user messages | +| `AMDGPUDevice` | MMIO register reads/writes | +| `PM4PacketProcessor` | PM4 packet decode and processing | +| `SDMAEngine` | SDMA operations | +| `AMDGPUDevice,PM4PacketProcessor` | MMIO + PM4 (combined) | +| `MI300XCosim,AMDGPUDevice,PM4PacketProcessor` | Full cosim debug | + +Usage: + +```bash +./scripts/cosim_launch.sh --gem5-debug MI300XCosim +# or manual: +build/VEGA_X86/gem5.opt --debug-flags=MI300XCosim,AMDGPUDevice ... 
+``` + +### QEMU Trace Events + +```bash +./scripts/cosim_launch.sh --qemu-trace 'mi300x_gem5_*' +``` + +### Log Inspection Commands + +```bash +# gem5 container logs (stderr) +docker logs gem5-cosim 2>&1 | tee /tmp/gem5.log + +# Filter for warnings/errors +docker logs gem5-cosim 2>&1 | grep -E "warn|error|GART" + +# Guest dmesg (via screen) +screen -S qemu-cosim -X stuff 'dmesg | tail -20\n' + +# Guest serial output (standalone sim) +tail -f m5out/board.pc.com_1.device +``` + +### Socket Test + +```bash +python3 scripts/cosim_test_client.py /tmp/gem5-mi300x.sock +``` + +### Incremental Rebuild + +```bash +# Delete stale object file, then rebuild +docker run --rm -v "$PWD:/gem5" -w /gem5 gem5-run:local \ + sh -c 'rm -f build/VEGA_X86/dev/amdgpu/.o' +docker run --rm -v "$PWD:/gem5" -w /gem5 \ + gem5-run:local scons build/VEGA_X86/gem5.opt -j1 +``` + +### Quick Diagnostic Table + +| Symptom | First Check | +|---------|-------------| +| gem5 container exits immediately | `docker logs gem5-cosim` | +| QEMU fails to connect | Is gem5 ready? (`chmod 777` socket?) | +| NULL deref at `psp_gpu_reset` | Wrong `ip_block_mask` (use `0x67`) | +| GART translation not found | Using latest gem5 binary? | +| SDMA ring test -110 | Check `sdma_delay` is `1000` | +| hipcc "cannot find ROCm device library" | `ls /opt/rocm/lib/`, use `--offload-arch=gfx942` | +| MMIO reads all return zero | gem5 not connected or crashed | +| `insmod: ERROR: could not load module` | Kernel version mismatch | +| `cosim-gpu-setup.service` failed | `journalctl -u cosim-gpu-setup` | +| BAR layout probe error -12 | Rebuild QEMU with correct BAR5=MMIO layout | + +--- + +## 6. GART Table Format and PTE Layout + +For a conceptual explanation of GPU address spaces and the translation flow, see architecture.md Section 5. 
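The PTE fields and the cosim fallback lookup documented in this section can be sketched in Python for quick offline checks, for example when inspecting a dump of `/dev/shm/mi300x-vram`. This is a minimal sketch, not gem5's implementation: `bits`, `decode_pte`, `pte_vram_offset`, and `read_pte` are hypothetical helper names, and the PTE is assumed to be stored little-endian, as on an x86 host.

```python
# Illustrative sketch only: helper names are hypothetical, not gem5 APIs.
import struct

PTE_SIZE = 8  # each GART page table entry is 8 bytes


def bits(value, hi, lo):
    """Return bits hi..lo (inclusive) of value, like gem5's bits() helper."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_pte(pte, page_offset=0):
    """Split a 64-bit GART PTE into the fields from the PTE format table."""
    return {
        "valid": bits(pte, 0, 0),
        "system": bits(pte, 1, 1),        # 1 = system memory, 0 = local VRAM
        "fragment": bits(pte, 5, 2),
        # paddr = (bits(PTE, 47, 12) << 12) | page_offset
        "paddr": (bits(pte, 47, 12) << 12) | (page_offset & 0xFFF),
        "block_fragment": bits(pte, 51, 48),
        "flags": bits(pte, 63, 52),
    }


def pte_vram_offset(gpu_va, pt_base, pt_start):
    """Byte offset inside VRAM of the PTE that maps gpu_va.

    pt_start is a page number, so the table index is
    (gpu_va >> 12) - pt_start and each entry is PTE_SIZE bytes.
    """
    page_num = gpu_va >> 12
    return pt_base + (page_num - pt_start) * PTE_SIZE


def read_pte(vram, gpu_va, pt_base, pt_start):
    """Read one PTE from a bytes-like VRAM image (assumed little-endian)."""
    offset = pte_vram_offset(gpu_va, pt_base, pt_start)
    (pte,) = struct.unpack_from("<Q", vram, offset)
    return pte
```

With the typical values listed in this section, `pte_vram_offset(0x7FFF00032000, 0x3EE600000, 0x7FFF00000)` selects entry 50 of the table. A PTE that reads back as 0 corresponds to the unmapped case, which the fallback maps to the sink instead of faulting.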
+ +### GART PTE Format + +Each GART page table entry is 8 bytes: + +| Bit Range | Field | Description | +|-----------|-------|-------------| +| 0 | Valid | Entry is valid | +| 1 | System | 1 = system memory, 0 = local VRAM | +| 5:2 | Fragment | Page fragment size | +| 47:12 | Physical Page | Physical address >> 12 | +| 51:48 | Block Fragment | Block fragment size | +| 63:52 | Flags | MTYPE, PRT, etc. | + +**Physical address extraction**: `paddr = (bits(PTE, 47, 12) << 12) | page_offset` + +### Aperture Registers + +| Register | gem5 Field | Format | Description | +|----------|-----------|--------|-------------| +| `MC_VM_FB_LOCATION_BASE` | `vmContext0.fbBase` | `bits[23:0] << 24` | Start of VRAM in MC address space | +| `MC_VM_FB_LOCATION_TOP` | `vmContext0.fbTop` | `bits[23:0] << 24 \| 0xFFFFFF` | End of VRAM | +| `MC_VM_FB_OFFSET` | `vmContext0.fbOffset` | `bits[23:0] << 24` | FB relocation offset | +| `MC_VM_AGP_BASE` | `vmContext0.agpBase` | `bits[23:0] << 24` | AGP remap base address | +| `MC_VM_AGP_BOT` | `vmContext0.agpBot` | `bits[23:0] << 24` | AGP aperture bottom | +| `MC_VM_AGP_TOP` | `vmContext0.agpTop` | `bits[23:0] << 24 \| 0xFFFFFF` | AGP aperture top | +| `MC_VM_SYSTEM_APERTURE_LOW_ADDR` | `vmContext0.sysAddrL` | `bits[29:0] << 18` | System aperture low | +| `MC_VM_SYSTEM_APERTURE_HIGH_ADDR` | `vmContext0.sysAddrH` | `bits[29:0] << 18` | System aperture high | +| `VM_CONTEXT0_PAGE_TABLE_BASE_ADDR` | `vmContext0.ptBase` | raw 64-bit | GART table location in VRAM | +| `VM_CONTEXT0_PAGE_TABLE_START_ADDR` | `vmContext0.ptStart` | raw 64-bit | GART aperture start (page number) | +| `VM_CONTEXT0_PAGE_TABLE_END_ADDR` | `vmContext0.ptEnd` | raw 64-bit | GART aperture end (page number) | + +### Typical Values in Co-simulation + +``` +ptBase = 0x3EE600000 GART table at VRAM offset ~15.7 GiB +ptStart = 0x7FFF00000 GART covers GPU VAs from 0x7FFF00000000 +ptEnd = 0x7FFF1FFFF GART covers ~128K pages (512 MiB) +fbBase = 0x8000000000 VRAM starts at MC address 512 
GiB +fbTop = 0x8400FFFFFF VRAM ends at ~528 GiB (16 GiB range) +sysAddrL = 0x0 System aperture start +sysAddrH = 0x3FFEC0000 System aperture end (~4 TiB) +``` + +### GART Table Layout in VRAM + +``` +VRAM offset = ptBase (gartBase) ++-------------------+ ptBase + 0 +| PTE[0] (8 bytes) | maps page ptStart ++-------------------+ ptBase + 8 +| PTE[1] | maps page ptStart + 1 ++-------------------+ ptBase + 16 +| PTE[2] | maps page ptStart + 2 +| ... | ++-------------------+ +| PTE[N] | maps page ptStart + N ++-------------------+ ptBase + (ptEnd - ptStart + 1) * 8 +``` + +### Co-simulation PTE Fallback Lookup + +In cosim mode, `gartTable` is empty (VRAM writes bypass gem5). The fallback reads PTEs directly from shared VRAM: + +```cpp +Addr pte_table_offset = gart_addr - (ptStart * 8); +Addr pte_vram_offset = gartBase() + pte_table_offset; +memcpy(&pte, vramShmemPtr + pte_vram_offset, sizeof(pte)); +``` + +If PTE is 0 (unmapped), the address is mapped to sink (`paddr=0`) instead of faulting. + +--- + +## 7. China Mirror Configuration + +When building disk images from China, `apt` inside the VM fetches from `us.archive.ubuntu.com`, which often hangs (Packer reports `Timeout waiting for SSH`, or the provisioner aborts during ROCm installation). + +### Apply the Patch + +```bash +cd gem5-resources +git apply ../scripts/patches/0001-user-data-cn-mirror.patch +``` + +### Revert the Patch + +```bash +cd gem5-resources +git apply -R ../scripts/patches/0001-user-data-cn-mirror.patch +``` + +To use a different mirror, edit the URI in the patch file and re-apply. + +--- + +## 8. 
File Reference + +### gem5 Source Files (`src/dev/amdgpu/`) + +| File | Purpose | +|------|---------| +| `mi300x_vfio_user.{cc,hh}` | vfio-user server SimObject (**default backend**) | +| `MI300XVfioUser.py` | SimObject Python wrapper (vfio-user) | +| `cosim_bridge.hh` | Abstract CosimBridge interface (both backends implement this) | +| `mi300x_gem5_cosim.{cc,hh}` | Legacy socket bridge SimObject | +| `MI300XGem5Cosim.py` | SimObject Python wrapper (legacy) | +| `amdgpu_device.cc` | GPU device model core, `readROM()`, `intrPost()`, `writeFrame()` | +| `amdgpu_vm.{cc,hh}` | All translation generators (GART, AGP, MMHUB, User), cosim VRAM fallback | +| `pm4_packet_processor.{cc,hh}` | PM4 packet decode, DMA routing, VRAM write routing, `isVRAMAddress()` | +| `pm4_defines.hh` | PM4 opcodes including `IT_ACQUIRE_MEM`, `IT_SET_RESOURCES` | +| `sdma_engine.{cc,hh}` | SDMA operations, rptr writeback routing, `sdma_delay` parameter | +| `interrupt_handler.cc` | IH ring buffer DMA and MSI-X interrupt delivery | +| `amdgpu_nbio.cc` | ASIC initialization complete register | + +### gem5 Configuration and Scripts + +| File | Purpose | +|------|---------| +| `configs/example/gpufs/mi300_cosim.py` | Cosim system config (`--cosim-backend=vfio-user\|legacy`) | +| `configs/example/gem5_library/x86-mi300x-gpu.py` | Standalone stdlib simulation config | +| `configs/example/gpufs/mi300.py` | Legacy standalone simulation config | +| `scripts/cosim_launch.sh` | Cosim orchestration (Docker + QEMU launch) | +| `scripts/run_mi300x_fs.sh` | Build orchestration (compile, disk image, run) | +| `scripts/Dockerfile.run` | Runtime Docker image definition | +| `scripts/cosim_test_client.py` | Socket connectivity test tool | +| `scripts/patches/0001-user-data-cn-mirror.patch` | China mirror patch for disk image build | + +### gem5 Modified Infrastructure Files + +| File | Changes | +|------|---------| +| `src/dev/intel_8254_timer.{cc,hh}` | `disable_timer_events` parameter (cosim timer overflow fix) 
| +| `src/dev/mc146818.{cc,hh}` | `disable_rtc_events` parameter (cosim timer overflow fix) | + +### gem5 Python Components + +| File | Purpose | +|------|---------| +| `src/python/gem5/prebuilt/viper/board.py` | ViperBoard: readfile injection, driver loading | +| `src/python/gem5/components/devices/gpus/amdgpu.py` | MI300X device definition | + +### QEMU Files (Legacy Backend Only) + +| File | Purpose | +|------|---------| +| `qemu/hw/misc/mi300x_gem5.c` | MI300X PCI device with socket bridge | +| `qemu/hw/misc/mi300x_gem5.h` | Header file | +| `qemu/hw/misc/trace-events` | Trace event definitions | + +> The vfio-user backend uses QEMU's built-in `vfio-user-pci` device. No custom QEMU code is needed. + +### External Dependencies + +| Path | Purpose | +|------|---------| +| `ext/libvfio-user/` | libvfio-user library (git submodule, vfio-user backend) | + +### Guest Disk Image Contents + +| File (inside guest) | Purpose | +|----------------------|---------| +| `/root/roms/mi300.rom` | VGA BIOS ROM binary | +| `/usr/lib/firmware/amdgpu/mi300_discovery` | IP discovery firmware | +| `/etc/systemd/system/cosim-gpu-setup.service` | Auto-load service unit | +| `/usr/local/bin/cosim-gpu-setup.sh` | Auto-load script | +| `/lib/modules/$(uname -r)/updates/dkms/amdgpu.ko.zst` | amdgpu kernel module (ROCm 7.0 DKMS) | +| `/home/gem5/load_amdgpu.sh` | Driver loading script (standalone sim) | +| `/sbin/m5` | gem5 pseudo-instruction tool | + +### PCI BAR Layout + +| BAR | Resource | Type | Size | +|-----|----------|------|------| +| BAR0+1 | VRAM | 64-bit prefetchable | 16 GiB (shared memory) | +| BAR2+3 | Doorbell | 64-bit | 4 MiB | +| BAR4 | MSI-X | exclusive | -- | +| BAR5 | MMIO registers | 32-bit | 512 KiB (forwarded to gem5) | + +Driver constants: `AMDGPU_VRAM_BAR=0`, `AMDGPU_DOORBELL_BAR=2`, `AMDGPU_MMIO_BAR=5`. + +### Resource Routing (Both Backends) + +| Resource | Via Socket/vfio-user? | Via Shared Memory? 
| +|----------|----------------------|--------------------| +| MMIO Registers (BAR5) | Yes | No | +| VRAM (BAR0, 16 GiB) | **No** | Yes (`/dev/shm/mi300x-vram`) | +| Doorbells (BAR2) | Yes | No | + +Any gem5 data structure populated by intercepting VRAM writes (e.g., `gartTable`, page tables, ring buffers) will **not** be populated in cosim mode and requires explicit shared-VRAM fallback. diff --git a/docs/en/xgmi-model.md b/docs/en/xgmi-model.md deleted file mode 100644 index fb26a81..0000000 --- a/docs/en/xgmi-model.md +++ /dev/null @@ -1,83 +0,0 @@ -[中文](../zh/xgmi-model.md) - -# xGMI Interconnect Model Design - -## Overview - -The xGMI (inter-chip Global Memory Interconnect) model provides GPU-to-GPU -communication within a cosim-gpu multi-GPU hive. It attaches to each GPU's -L2 cache (TCC) egress and routes remote VRAM accesses through a modeled -xGMI link with configurable bandwidth, latency, and topology. - -## Packet Format - -| Field | Type | Description | -|----------|--------|------------------------------------| -| src_gpu | uint8 | Source GPU ID | -| dst_gpu | uint8 | Destination GPU ID | -| addr | uint64 | Target VRAM address | -| size | uint32 | Payload size in bytes | -| payload | bytes | Data (for write operations) | - -## Address Mapping - -Each GPU owns a contiguous VRAM address range: - -``` -GPU 0: [0, vram_size) -GPU 1: [vram_size, 2 * vram_size) -GPU N: [N * vram_size, (N+1) * vram_size) -``` - -The bridge determines whether an address is local or remote by checking -which GPU's range it falls into. - -## Topology Configuration - -Launch-time parameter `--xgmi-topology`: - -- **mesh**: Every GPU has a direct link to every other GPU. - An 8-GPU mesh creates 28 bidirectional links. -- **ring**: Each GPU connects to its two neighbors. - Lower link count but multi-hop for non-adjacent GPUs. 
- -## Link Parameters - -| Parameter | Default | CLI Flag | -|---------------------|----------|---------------------| -| Per-link bandwidth | 128 GB/s | `--xgmi-bandwidth` | -| Per-hop latency | 100 ns | `--xgmi-latency` | -| Lanes per link | 16 | (SimObject param) | -| Max links per GPU | 7 | (SimObject param) | -| Flow-control credits| 32 | (SimObject param) | - -## Flow Control - -Credit-based back-pressure prevents data loss: - -1. Each link starts with N credits (default 32). -2. Sending a packet consumes one credit. -3. The receiver returns a credit upon packet acceptance. -4. When credits reach zero, the sender stalls (never drops). - -## Architecture Phases - -### Path A (Milestones 1-3): Self-built xGMI model - -- Single-process multi-GPU (Milestones 1-2): in-process function calls -- Multi-process 8-GPU hive (Milestone 3): IPC transport via shared - memory ring buffers or Unix sockets - -### Path B (Milestones 4-5): SST Merlin integration - -- Replace xGMI transport with SST Merlin network engine -- Three-layer synchronization: QEMU (functional) ↔ gem5 (GPU timing) ↔ - SST (network timing) -- Supports arbitrary topologies (fat-tree, dragonfly) - -## Key Source Files - -- `gem5/src/dev/amdgpu/XGMIBridge.py` — SimObject definition -- `gem5/src/dev/amdgpu/xgmi_bridge.hh` — C++ header -- `gem5/src/dev/amdgpu/xgmi_bridge.cc` — C++ implementation -- `gem5/configs/example/gpufs/mi300_cosim.py` — Configuration and wiring diff --git a/docs/zh/architecture.md b/docs/zh/architecture.md new file mode 100644 index 0000000..eefa9f9 --- /dev/null +++ b/docs/zh/architecture.md @@ -0,0 +1,1003 @@ +[English](../en/architecture.md) + +# 协同仿真架构 + +本文档深入介绍 QEMU + gem5 MI300X 协同仿真系统的架构与设计。涵盖系统级结构、内存共享机制、GPU 地址转换、DMA 数据流、中断转发、xGMI 互连模型以及开发过程中的关键设计决策。 + +--- + +## 目录 + +- [系统架构概述](#系统架构概述) + - [组件图](#组件图) + - [关键组件](#关键组件) + - [通信通道](#通信通道) +- [vfio-user 与 Legacy 后端](#vfio-user-与-legacy-后端) + - [vfio-user 后端(默认)](#vfio-user-后端默认) + - [Legacy Socket 后端](#legacy-socket-后端) + - 
[后端对比](#后端对比) +- [PCI BAR 布局](#pci-bar-布局) +- [内存共享架构](#内存共享架构) + - [三个共享通道](#三个共享通道) + - [VRAM 共享(BAR0)](#vram-共享bar0) + - [Guest RAM 共享(GTT 页面)](#guest-ram-共享gtt-页面) + - [内存分割(Q35)](#内存分割q35) + - [Sink 机制](#sink-机制) +- [GPU 地址转换与 GART](#gpu-地址转换与-gart) + - [GPU 地址空间与 Aperture](#gpu-地址空间与-aperture) + - [Aperture 寄存器](#aperture-寄存器) + - [GART 结构与表布局](#gart-结构与表布局) + - [PTE 格式](#pte-格式) + - [getGARTAddr 变换](#getgartaddr-变换) + - [转换流程](#转换流程) + - [gartTable 哈希表 vs. 共享 VRAM](#garttable-哈希表-vs-共享-vram) + - [转换后地址分类](#转换后地址分类) + - [MMHUB Aperture](#mmhub-aperture) + - [用户空间转换(VMID > 0)](#用户空间转换vmid-0) +- [DMA 数据流](#dma-数据流) + - [PM4 Packet Processor 路由](#pm4-packet-processor-路由) + - [SDMA 引擎路由](#sdma-引擎路由) + - [VRAM vs. 系统内存检测](#vram-vs-系统内存检测) + - [vfio-user 后端:共享内存直接访问](#vfio-user-后端共享内存直接访问) + - [Legacy 后端:Socket DMA 协议](#legacy-后端socket-dma-协议) + - [中断处理器(IH)DMA](#中断处理器ihdma) + - [完整数据流示例](#完整数据流示例) +- [MSI-X 中断转发](#msi-x-中断转发) + - [中断传递路径](#中断传递路径) + - [IH Ring Buffer 交互](#ih-ring-buffer-交互) +- [xGMI 互连模型](#xgmi-互连模型) + - [数据包格式](#数据包格式) + - [地址映射](#地址映射) + - [拓扑配置](#拓扑配置) + - [链路参数](#链路参数) + - [流量控制](#流量控制) + - [架构阶段](#架构阶段) +- [设计历程与关键决策](#设计历程与关键决策) + - [为什么选择 vfio-user 而非自定义协议](#为什么选择-vfio-user-而非自定义协议) + - [为什么选择 Q35 + KVM](#为什么选择-q35-kvm) + - [共享内存设计](#共享内存设计) + - [SIGIO 边沿触发排空](#sigio-边沿触发排空) + - [GART 回退方案](#gart-回退方案) + - [VRAM 路由发现](#vram-路由发现) + +--- + +## 系统架构概述 + +协同仿真系统将 GPU 工作负载的执行拆分到两个进程中:QEMU(配合 KVM)负责宿主 CPU、Guest OS 和 amdgpu 驱动,以接近原生的速度运行;gem5 则建模 MI300X GPU 设备——Shader 阵列、命令处理器、SDMA 引擎和 Ruby 缓存层次结构——提供 cycle 级精度。两个进程通过 Unix 域套接字通信,并通过 POSIX 共享内存文件实现零拷贝 DMA。 + +### 组件图 + +``` ++--------------------------------------+ +| QEMU (Q35 + KVM) | +| +--------------------------------+ | +| | Guest Linux (Ubuntu 24) | | +| | amdgpu driver (ROCm 7) | | +| | ROCm userspace | | +| +--------------+-----------------+ | +| | MMIO / Doorbell | +| +--------------v-----------------+ | +| | vfio-user-pci | | +| | (QEMU built-in device) | | +| 
+--------------+-----------------+ | +| | vfio-user protocol | ++-----------------+--------------------+ + | /tmp/gem5-mi300x.sock + | (Unix socket) ++-----------------+--------------------+ +| gem5 | | +| +--------------v-----------------+ | +| | MI300XVfioUser | | +| | (mi300x_vfio_user.cc) | | +| | [libvfio-user server] | | +| +--------------+-----------------+ | +| | AMDGPUDevice API | +| +--------------v-----------------+ | +| | AMDGPUDevice | | +| | PM4PacketProcessor | | +| | SDMAEngine | | +| | Shader / CU array | | +| +--------------------------------+ | ++--------------------------------------+ + +Shared Memory: + /dev/shm/cosim-guest-ram Guest physical RAM (QEMU <-> gem5 DMA) + /dev/shm/mi300x-vram GPU VRAM (QEMU BAR0 <-> gem5 device memory) +``` + +gem5 运行在 Docker 容器内,使用 `StubWorkload`(不运行 Linux 内核)。它作为 vfio-user 服务端启动,监听 Unix 套接字,等待来自 QEMU 的 MMIO 请求。 + +### 关键组件 + +| 组件 | 位置 | 作用 | +|---|---|---| +| `MI300XVfioUser` | `src/dev/amdgpu/mi300x_vfio_user.{cc,hh}` | gem5 vfio-user 服务端;通过 libvfio-user 处理 BAR 访问和中断(默认后端) | +| `vfio-user-pci` | QEMU 内建设备 | QEMU 侧 vfio-user 客户端;无需自定义 QEMU 代码 | +| `CosimBridge` | `src/dev/amdgpu/cosim_bridge.hh` | 抽象协同仿真桥接接口,vfio-user 和 legacy 后端均实现此接口 | +| `MI300XGem5Cosim` | `src/dev/amdgpu/mi300x_gem5_cosim.{cc,hh}` | 旧版 socket 桥接 SimObject | +| `mi300x_gem5.c` | `qemu/hw/misc/` | 旧版 QEMU PCI 设备;通过自定义 socket 协议转发 MMIO/doorbell | +| `mi300_cosim.py` | `configs/example/gpufs/` | gem5 配置;通过 `--cosim-backend=vfio-user|legacy` 选择后端 | +| `cosim_launch.sh` | `scripts/` | 编排 Docker (gem5) + QEMU 的启动流程 | + +### 通信通道 + +系统在 QEMU 和 gem5 之间使用三个不同的通道: + +1. **VRAM 共享内存**(`/dev/shm/mi300x-vram`,16 GiB)——GPU 显存,包括 GART 页表。双方 mmap 同一文件,实现零拷贝访问。 +2. **Guest RAM 共享内存**(`/dev/shm/cosim-guest-ram`,8 GiB)——宿主物理内存,包含 ring buffer、fence、GTT 页面。QEMU 使用 `memory-backend-file` 配合 `share=on`;gem5 使用 `shared_backstore`。 +3. 
**vfio-user socket**(`/tmp/gem5-mi300x.sock`)——承载 MMIO 读写、配置空间访问、Doorbell 写操作和中断通知,使用 vfio-user 协议。 + +--- + +## vfio-user 与 Legacy 后端 + +协同仿真系统支持两种通信后端,通过 gem5 配置中的 `--cosim-backend=vfio-user|legacy` 选择。 + +### vfio-user 后端(默认) + +vfio-user 后端使用行业标准的 vfio-user 协议(QEMU 10.0+ 内置支持)。gem5 侧使用 Nutanix 的 libvfio-user 库作为服务端。 + +- **QEMU 侧**:使用内建的 `vfio-user-pci` 设备。无需自定义 QEMU 代码;任何原生 QEMU 10.0+ 构建均可使用。 +- **gem5 侧**:`MI300XVfioUser` 向 libvfio-user 注册 BAR 区域、配置空间和 MSI-X capability,然后处理来自 QEMU 的请求。 +- **DMA**:gem5 通过 Ruby 内存系统的共享后端直接访问 Guest RAM,无需 socket 往返。 +- **中断**:通过 `irq_fd`(注入 KVM 的 eventfd)传递,不需要自定义中断消息。 + +### Legacy Socket 后端 + +旧版后端使用自定义的 `mi300x-gem5` QEMU PCI 设备和基于两条 Unix socket 连接的自定义二进制协议: + +- **同步连接**:MMIO 请求-响应对(QEMU 发送写/读,gem5 响应)。 +- **异步连接**:gem5 向 QEMU 发送 IRQ raise/lower 事件和 DMA 读写请求。 + +此后端需要从 `cosim/qemu/` 目录编译的 QEMU。 + +### 后端对比 + +| 维度 | vfio-user 后端 | Legacy Socket 后端 | +|------|---------------|-------------------| +| Guest RAM DMA | Ruby 内存系统直接访问共享后端 | Socket 请求-响应协议 | +| VRAM 访问 | mmap 零拷贝 | mmap 零拷贝 | +| 中断 | irq_fd(eventfd -> KVM) | 自定义 socket 消息 | +| MMIO | vfio-user 消息传递 | 自定义二进制协议 | +| QEMU 侧设备 | 内置 `vfio-user-pci` | 自定义 `mi300x_gem5.c` | +| 地址转换 | gem5 内部 GART 转换 | QEMU 端 `pci_dma_read/write` | +| QEMU 版本 | 原生 QEMU 10.0+ | 需要自定义分支 | + +--- + +## PCI BAR 布局 + +PCI BAR 布局必须与 amdgpu 驱动中硬编码的预期一致(`AMDGPU_VRAM_BAR=0`、`AMDGPU_DOORBELL_BAR=2`、`AMDGPU_MMIO_BAR=5`)。 + +``` +BAR0+1 VRAM 64-bit prefetchable 16 GiB (shared memory) +BAR2+3 Doorbell 64-bit 4 MiB +BAR4 MSI-X exclusive 256 vectors +BAR5 MMIO regs 32-bit 512 KiB (forwarded to gem5) +``` + +| BAR | 内容 | 大小 | 通信方式 | +|-----|------|------|---------| +| BAR0+1 | VRAM | 16 GiB | 共享内存(零拷贝 mmap) | +| BAR2+3 | Doorbell | 4 MiB | Socket 转发(vfio-user 或 legacy) | +| BAR4 | MSI-X | 256 vectors | QEMU 本地 | +| BAR5 | MMIO 寄存器 | 512 KiB | Socket 转发(vfio-user 或 legacy) | + +BAR0+1 和 BAR2+3 是 64 位 BAR(16 GiB VRAM 无法放入 32 位地址空间)。在 PCI BAR size probing 期间,每个 64 位 BAR 的上半部分必须返回 size mask 的高 32 位。 + +PCI 
class code 设置为 `PCI_CLASS_DISPLAY_VGA (0x0300)` 而非 `PCI_CLASS_DISPLAY_OTHER (0x0380)`,使内核将设备检测为"带有 shadowed ROM 的视频设备",从而启用 `0xC0000` 处的 VGA ROM 查找。 + +--- + +## 内存共享架构 + +在协同仿真中,GPU 设备模型(gem5)和宿主系统(QEMU/KVM)运行在两个独立进程中。GPU 需要访问两类内存: + +- **VRAM**(本地显存):GPU 私有,存放纹理、buffer、GART 页表和设备本地分配。 +- **GTT**(Graphics Translation Table / System Memory):宿主物理内存中被 GPU 映射的区域,用于 ring buffer、fence、IH cookie 和 DMA 缓冲。 + +这两类内存都通过 POSIX 共享内存文件实现双向可见,无需 socket 通信。 + +### 三个共享通道 + +``` ++----------------------------+ +-----------------------------+ +| QEMU (Q35 + KVM) | | gem5 (Docker) | +| | | | +| Guest Linux | | MI300X GPU Model | +| amdgpu driver | | Shader / CU / SDMA | +| | | PM4 / IH / Ruby caches | +| | | | +| +--------+ +---------+ | vfio-user (Unix) | +------------+ +--------+ | +| | BAR0 | | BAR5 |<---(MMIO/CFG/Doorbell)--->|MI300XVfio | |GPU core| | +| | (VRAM) | | (MMIO) | | | |User bridge | | | | +| +---+----+ +---------+ | | +-----+------+ +--------+ | +| | | | | | ++------+---------------------+ +--------+--------------------+ + | | + v v + /dev/shm/mi300x-vram (16 GiB) mmap 同一文件 + (VRAM: GPU 数据 + GART 页表) (vramShmemPtr) + | | + v v + /dev/shm/cosim-guest-ram (8 GiB) mmap 同一文件 + (Guest RAM: ring buffer, fence, (system->getPhysMem()) + GTT 页面, 内核/用户数据) +``` + +| 通道 | 文件/Socket | 大小 | 用途 | 访问方式 | +|------|-----------|------|------|---------| +| VRAM 共享内存 | `/dev/shm/mi300x-vram` | 16 GiB | GPU 显存 + GART 页表 | mmap(零拷贝) | +| Guest RAM 共享内存 | `/dev/shm/cosim-guest-ram` | 8 GiB | 宿主物理内存(GTT 页面) | QEMU: mmap; gem5: Ruby 内存系统直接访问共享后端 | +| vfio-user Socket | `/tmp/gem5-mi300x.sock` | -- | MMIO/配置空间/Doorbell;中断通过 irq_fd(eventfd -> KVM) | vfio-user 协议 | + +### VRAM 共享(BAR0) + +#### 初始化流程 + +gem5 侧(`mi300x_vfio_user.cc:setupVramShm`): + +```cpp +shmemFd = shm_open(shmemPath.c_str(), O_CREAT | O_RDWR, 0666); +ftruncate(shmemFd, vramSize); +shmemPtr = mmap(nullptr, vramSize, PROT_READ | PROT_WRITE, MAP_SHARED, shmemFd, 0); + +// Pass the shared pointer to the GART translator 
+gpuDevice->getVM().vramShmemPtr = (uint8_t *)shmemPtr; +gpuDevice->getVM().vramShmemSize = vramSize; +``` + +QEMU 通过 vfio-user DMA 区域映射机制获取 BAR0 映射——不再直接打开 VRAM 共享内存文件,而是通过 vfio-user 协议获取映射。 + +#### VRAM 内容布局 + +``` +Offset 0x000000000 +------------------------------+ + | GPU Data Area | + | - hipMalloc allocations | + | - Kernel args, textures | + | - Driver internal allocs | + | | + | ... | + | | +Offset ~0x3EE600000 +------------------------------+ +(ptBase) | GART Page Table (PTEs) | + | 8 bytes per PTE | + | Maps GPU VA -> phys addr | +Offset 0x400000000 +------------------------------+ +(16 GiB) +``` + +#### 访问模式 + +| 场景 | 写入方 | 读取方 | 路径 | +|------|--------|--------|------| +| GPU buffer 分配 | 驱动(via BAR0 write) | gem5(via vramShmemPtr) | 共享内存直接访问 | +| GART PTE 写入 | 驱动(via BAR0 write) | gem5 GART 翻译器 | memcpy from vramShmemPtr | +| IP Discovery 表 | gem5 初始化 | 驱动(via BAR0 read) | 共享内存直接访问 | + +由于 QEMU BAR0 和 gem5 的 `vramShmemPtr` 映射的是同一个 `/dev/shm` 文件,驱动写入 BAR0 的数据对 gem5 立即可见,无需任何 socket 通信。 + +### Guest RAM 共享(GTT 页面) + +在 AMD GPU 中,GTT = GART = Graphics Address Remapping Table。它是一个单级页表(VMID 0),将 GPU 虚拟地址映射到宿主物理地址。被映射的宿主物理内存页面就是所谓的"GTT 页面"。 + +典型的 GTT 页面内容: + +| 数据结构 | 说明 | 访问方向 | +|---------|------|---------| +| PM4 Ring Buffer | GFX 命令队列 | 驱动写 -> GPU 读 | +| SDMA Ring Buffer | DMA 命令队列 | 驱动写 -> GPU 读 | +| IH Ring Buffer | 中断处理队列 | GPU 写 -> 驱动读 | +| Fence 值 | 完成信号 | GPU 写 -> 驱动读 | +| MQD (Map Queue Descriptor) | 队列描述符 | 驱动写 -> GPU 读 | +| 用户 DMA 缓冲 | hipMemcpy 源/目标 | 双向 | + +#### 初始化流程 + +QEMU 侧(命令行参数): + +```bash +-object memory-backend-file,id=mem0,size=8G,\ + mem-path=/dev/shm/cosim-guest-ram,share=on +-numa node,memdev=mem0 +``` + +`share=on` 确保文件映射使用 `MAP_SHARED`,其他进程可以看到 QEMU 对 Guest 内存的修改。 + +gem5 侧(`mi300_cosim.py`): + +```python +system.shared_backstore = args.shmem_host_path # "/cosim-guest-ram" +system.auto_unlink_shared_backstore = True +system.memories[0].shared_backstore = args.shmem_host_path +``` + +gem5 的 `PhysicalMemory` 使用同一个 POSIX 
共享内存文件作为后端。 + +#### 为什么 GTT 不需要额外的共享机制 + +GTT 页面存在于 Guest RAM 中。Guest RAM 已经通过 `/dev/shm/cosim-guest-ram` 在 QEMU 和 gem5 之间共享: + +1. **驱动写入 ring buffer** -> 写入 Guest RAM -> 共享内存 -> gem5 可读 +2. **gem5 写入 fence** -> Ruby 内存控制器写入共享后端 -> 驱动可读 +3. **GART PTE 指向的物理地址** -> 就是 Guest RAM 中的偏移 -> 双方都能访问 + +### 内存分割(Q35) + +QEMU Q35 在 RAM >= 2.75 GiB 时将内存分为两个区域: + +- **4G 以下区域**:前 2 GiB(文件偏移 0) +- **4G 以上区域**:其余部分位于文件偏移 2 GiB 处,映射到 Guest 物理地址 0x100000000+ + +gem5 的 `mi300_cosim.py` 复制了此分割逻辑,以确保双方在文件布局上保持一致: + +```python +total_mem = convert.toMemorySize(args.mem_size) +lowmem_limit = 0x80000000 if total_mem >= 0xB0000000 else 0xB0000000 +below_4g = min(total_mem, lowmem_limit) +above_4g = total_mem - below_4g +``` + +如果双方在 4G 以上内存的文件偏移位置上不一致,gem5 会读到过期或全零的数据(例如 GART PTE 读出全零,导致 PM4 命令处理器中的无限 NOP 循环)。 + +### Sink 机制 + +在协同仿真模式下,部分 GART PTE 可能为零(未初始化)或指向 VRAM 内部地址。如果 gem5 无法转换这些地址,原始行为是抛出 `GenericPageTableFault`,导致 DMA 重试循环直至仿真挂死。 + +Sink 机制防止了这一问题: + +```cpp +// amdgpu_vm.cc: GARTTranslationGen::translate() + +if (pte == 0) { + if (origAddr < vramShmemSize && vramShmemPtr) { + // VRAM address -> map to sink (paddr=0) + range.paddr = 0; + warn_once("GART: VRAM address mapped to sink -- " + "VRAM write-backs are no-ops in cosim"); + } else if (vramShmemPtr) { + // Unmapped GART page -> sink + range.paddr = 0; + warn_once("GART cosim: unmapped page -> sink"); + } +} +``` + +Sink 语义: + +- `paddr=0` 在 gem5 中始终是有效的物理地址(系统 RAM 基址) +- DMA 读取返回零 +- DMA 写入被静默丢弃 +- 避免了 fault -> retry 死循环 + +此行为是安全的:诊断确认 GART 第一页(ptStart 本身)通常是未映射的,而后续 PTE 包含有效条目。Sink 确保即使 GPU 尝试 DMA 到驱动尚未映射的页面,仿真仍然保持存活。 + +--- + +## GPU 地址转换与 GART + +MI300X(GFX 9.4.3)使用多个地址空间和 aperture 来访问内存。GPU 发出的每次内存访问首先按 aperture 分类,然后转换为物理地址。 + +### GPU 地址空间与 Aperture + +``` +GPU Virtual Address (48-bit) +| ++-- AGP aperture [agpBot, agpTop] +| +-- Direct offset: paddr = vaddr - agpBot + agpBase +| ++-- GART aperture [ptStart<<12, ptEnd<<12] +| +-- Page table: paddr = GART_PTE[page_num].phys_addr | offset +| ++-- Framebuffer (FB) 
[fbBase, fbTop] +| +-- VRAM offset: vram_off = vaddr - fbBase +| ++-- System aperture [sysAddrL, sysAddrH] +| +-- Direct map: paddr = vaddr (system memory) +| ++-- MMHUB aperture [mmhubBase, mmhubTop] +| +-- VRAM mirror: vram_off = vaddr - mmhubBase +| ++-- User VM (VMID>0) [arbitrary VAs] + +-- Multi-level page table walk (4 or 5 levels) +``` + +### Aperture 寄存器 + +这些 MMIO 寄存器定义了每个 aperture 的边界。这些值由 amdgpu 驱动在 GMC(Graphics Memory Controller)初始化期间设置。 + +| 寄存器 | gem5 字段 | 格式 | 描述 | +|----------|-----------|--------|-------------| +| `MC_VM_FB_LOCATION_BASE` | `vmContext0.fbBase` | `bits[23:0] << 24` | MC 地址空间中 VRAM 的起始地址 | +| `MC_VM_FB_LOCATION_TOP` | `vmContext0.fbTop` | `bits[23:0] << 24 | 0xFFFFFF` | VRAM 结束地址 | +| `MC_VM_FB_OFFSET` | `vmContext0.fbOffset` | `bits[23:0] << 24` | FB 重定位偏移量 | +| `MC_VM_AGP_BASE` | `vmContext0.agpBase` | `bits[23:0] << 24` | AGP 重映射基地址 | +| `MC_VM_AGP_BOT` | `vmContext0.agpBot` | `bits[23:0] << 24` | AGP aperture 底部 | +| `MC_VM_AGP_TOP` | `vmContext0.agpTop` | `bits[23:0] << 24 | 0xFFFFFF` | AGP aperture 顶部 | +| `MC_VM_SYSTEM_APERTURE_LOW_ADDR` | `vmContext0.sysAddrL` | `bits[29:0] << 18` | System aperture 低地址 | +| `MC_VM_SYSTEM_APERTURE_HIGH_ADDR` | `vmContext0.sysAddrH` | `bits[29:0] << 18` | System aperture 高地址 | +| `VM_CONTEXT0_PAGE_TABLE_BASE_ADDR` | `vmContext0.ptBase` | raw 64-bit | GART 表在 VRAM 中的位置 | +| `VM_CONTEXT0_PAGE_TABLE_START_ADDR` | `vmContext0.ptStart` | raw 64-bit | GART aperture 起始地址(页号) | +| `VM_CONTEXT0_PAGE_TABLE_END_ADDR` | `vmContext0.ptEnd` | raw 64-bit | GART aperture 结束地址(页号) | + +协同仿真中的典型值(来自驱动初始化诊断): + +``` +ptBase = 0x3EE600000 GART table at VRAM offset ~15.7 GiB +ptStart = 0x7FFF00000 GART covers GPU VAs from 0x7FFF00000000 +ptEnd = 0x7FFF1FFFF GART covers ~128K pages (512 MiB) +fbBase = 0x8000000000 VRAM starts at MC address 512 GiB +fbTop = 0x8400FFFFFF VRAM ends at ~528 GiB (16 GiB range) +sysAddrL = 0x0 System aperture start +sysAddrH = 0x3FFEC0000 System aperture end (~4 TiB) +``` + +### GART 
结构与表布局 + +GART 是一个单级页表,供 VMID 0(内核模式)使用,将 GPU 虚拟地址映射到系统物理地址。它使 GPU 能够对主机(Guest)RAM 进行 DMA 访问,用于 ring buffer、fence 值、IH cookie 以及其他内核模式数据结构。 + +GART 表位于 VRAM 偏移 `ptBase` 处: + +``` +VRAM offset = ptBase (gartBase) ++-------------------+ ptBase + 0 +| PTE[0] (8 bytes) | maps page ptStart ++-------------------+ ptBase + 8 +| PTE[1] | maps page ptStart + 1 ++-------------------+ ptBase + 16 +| PTE[2] | maps page ptStart + 2 +| ... | ++-------------------+ +| PTE[N] | maps page ptStart + N ++-------------------+ ptBase + (ptEnd - ptStart + 1) * 8 +``` + +### PTE 格式 + +每个 PTE 为 8 字节: + +``` +63 52 51 48 47 12 11 6 5 2 1 0 ++-------+------+-----------------+------+----+---+---+ +| Flags | BlkF | Physical Page | Rsvd |Frag|Sys| V | +| | | (PA >> 12) | | | | | ++-------+------+-----------------+------+----+---+---+ +``` + +| 位域 | 字段 | 描述 | +|------|-------|-------------| +| 0 | Valid | 条目有效 | +| 1 | System | 1 = 系统内存(Guest RAM),0 = 本地 VRAM | +| 5:2 | Fragment | 页面片段大小 | +| 47:12 | Physical Page | 物理地址 >> 12 | +| 51:48 | Block Fragment | 块片段大小 | +| 63:52 | Flags | MTYPE、PRT 等 | + +物理地址提取:`paddr = (bits(PTE, 47, 12) << 12) | page_offset` + +### getGARTAddr 变换 + +在 GART 查找之前,地址通过 `getGARTAddr()` 进行变换。该函数将页号乘以 8(PTE 的大小),实际上是将 GPU VA 转换为 GART 表内的字节偏移量: + +```cpp +// In pm4_packet_processor.cc and sdma_engine.cc: +Addr getGARTAddr(Addr addr) const { + if (!gpuDevice->getVM().inAGP(addr)) { + Addr low_bits = bits(addr, 11, 0); + addr = (((addr >> 12) << 3) << 12) | low_bits; + } + return addr; +} +``` + +### 转换流程 + +完整的 GART 转换序列: + +``` +Original GPU VA (e.g., 0x7FFF00032000) + | + v getGARTAddr() +Transformed addr = ((VA>>12) * 8) << 12 | low_bits + = 0x3FFF80019_0000 (example) + | + v GARTTranslationGen::translate() +gart_addr = bits(transformed, 63, 12) = page_num * 8 + | + +-- Look up gartTable hash map (populated by writeFrame / SDMA shadow) + | + +-- Cosim fallback: read PTE from shared VRAM + | pte_offset = gart_addr - (ptStart * 8) + | pte = *(vramShmemPtr + ptBase + 
pte_offset) + | + v Extract physical address +paddr = (bits(PTE, 47, 12) << 12) | bits(VA, 11, 0) +``` + +驱动通过以下路径写入 GART PTE: + +``` +amdgpu driver (guest) + | + +- amdgpu_gart_map(): compute PTE value + | pte = (phys_addr >> 12) << 12 | flags + | + +- write to BAR0 + ptBase + (gpu_page * 8) + | | + | +- QEMU BAR0 = mmap of /dev/shm/mi300x-vram + | +- data immediately appears in shared memory + | + +- TLB invalidate: write VM_INVALIDATE_ENG17 register + +- MMIO -> vfio-user -> gem5 -> invalidateTLBs() +``` + +### gartTable 哈希表 vs. 共享 VRAM + +在独立 gem5 模式下,GART 条目维护在一个哈希表(`AMDGPUVM::gartTable`)中,由以下方式填充: + +1. **直接写入**(`amdgpu_device.cc:writeFrame()`):当驱动通过 BAR0 写入 VRAM 的 GART 区域时,值被存储到 `gartTable[offset]` 中。 +2. **SDMA 影子拷贝**(`sdma_engine.cc`):当 SDMA 写入设备内存中的 GART 范围时,影子拷贝会更新 `gartTable`。 + +在协同仿真模式下,驱动通过 QEMU 的 BAR0 映射写入 GART PTE,直接进入共享 VRAM,不经过 gem5 的 `writeFrame()`。因此,`gartTable` 基本为空。协同仿真回退机制直接从共享 VRAM 的 `vramShmemPtr + ptBase` 处读取 PTE: + +```cpp +Addr pte_table_offset = gart_addr - (ptStart * 8); +Addr pte_vram_offset = gartBase() + pte_table_offset; +memcpy(&pte, vramShmemPtr + pte_vram_offset, sizeof(pte)); +``` + +如果 PTE 为 0(未映射的页面),协同仿真模式将映射到 sink(`paddr=0`),而不是产生 fault(参见 [Sink 机制](#sink-机制))。 + +### 转换后地址分类 + +GART 转换得到物理地址后,gem5 判断该地址指向哪里: + +``` +Physical address paddr + | + +- Within fbBase ~ fbTop range? + | +- YES -> VRAM address + | +- Access directly via vramShmemPtr (zero-copy) + | + +- Within sysAddrL ~ sysAddrH range? + | +- YES -> Guest RAM address (GTT page) + | +- Access via Ruby memory system (shared memory direct access) + | + +- Neither? 
+ +- Sink (paddr=0, safely discarded) +``` + +### MMHUB Aperture + +MMHUB(Memory Management Hub)提供 VRAM 的影子映射。`[mmhubBase, mmhubTop]` 范围内的地址通过减去基地址进行转换: + +``` +vram_offset = vaddr - mmhubBase +``` + +SDMA 在 VMID 0 模式下使用此 aperture 访问设备内存。 + +### 用户空间转换(VMID > 0) + +用户空间 GPU 程序(如 HIP 应用)使用类似于 x86-64 分页的多级页表。每个 VMID(1-15)拥有自己的页表基址寄存器。 + +``` +VM_CONTEXT[N]_PAGE_TABLE_BASE_ADDR -> Page Directory Base + | + v 4-level walk (PDE3 -> PDE2 -> PDE1 -> PDE0 -> PTE) +Physical address +``` + +`UserTranslationGen` 类使用 GPU 的页表遍历器(`VegaISA::Walker`)执行此遍历。用户模式(vmid > 0)下的 SDMA 使用此路径。 + +VMID 0(内核模式)GART 页表通过共享 VRAM 完全可见。VMID > 0(用户模式)多级页表由 `VegaISA::Walker` 遍历,它使用 gem5 内部的 TLB/page walker,而非直接从共享内存读取。实际影响有限:驱动写入页表后会发送 TLB invalidate MMIO,gem5 收到后刷新 TLB,后续 walker 遍历时会从正确的物理地址读取。 + +--- + +## DMA 数据流 + +### PM4 Packet Processor 路由 + +``` +PM4PacketProcessor::translate(vaddr, size) + | + +-- inAGP(vaddr)? -> AGPTranslationGen (direct offset) + | + +-- else -> GARTTranslationGen (page table lookup) +``` + +所有 PM4 DMA 使用 GART 转换(VMID 0)。地址在 DMA 调用之前先通过 `getGARTAddr()` 变换。 + +### SDMA 引擎路由 + +SDMA 比 PM4 具有更多的 aperture 感知能力,因为它同时处理内核模式(VMID 0)和用户模式(VMID > 0)的操作: + +``` +SDMAEngine::translate(vaddr, size) + | + +-- cur_vmid > 0? -> UserTranslationGen (multi-level page table) + | + +-- inAGP(vaddr)? -> AGPTranslationGen + | + +-- inMMHUB(vaddr)?-> MMHUBTranslationGen (VRAM shadow) + | + +-- else -> GARTTranslationGen +``` + +### VRAM vs. 系统内存检测 + +对于 PM4 的 RELEASE_MEM 和 WRITE_DATA 数据包,目标可以是 VRAM 或系统内存。路由逻辑: + +```cpp +bool vram = isVRAMAddress(pkt->addr); // addr < gpuDevice->getVRAMSize() +Addr addr = vram ? 
pkt->addr : getGARTAddr(pkt->addr); + +if (vram) + gpuDevice->getMemMgr()->writeRequest(addr, data, size); // device memory +else + dmaWriteVirt(addr, size, cb, data); // system memory via GART +``` + +如果没有此检查,VRAM 地址会被送入 `getGARTAddr()` 导致页号乘以 8,GART 转换失败(VRAM 地址没有对应的页表项)。三层防护(PM4 层、SDMA 层、GART 回退 sink)防止仿真崩溃。 + +### vfio-user 后端:共享内存直接访问 + +在 vfio-user 后端下,gem5 通过 Ruby 内存系统的共享后端直接访问 Guest RAM,无需基于 socket 的 DMA 操作: + +``` +gem5 GPU model (PM4/SDMA/IH) + | + | Needs to read ring buffer commands / write fence values + | + v Ruby memory system request + | + +- Address translated by GART -> Guest physical address + | + +- Ruby memory controller accesses PhysicalMemory + | | + | +- PhysicalMemory backed by /dev/shm/cosim-guest-ram (MAP_SHARED) + | +- read/write directly hits shared memory + | +- QEMU sees changes immediately (same mmap file) + | + +- Done (no socket round-trip needed) +``` + +关键优势: + +- **零拷贝**:DMA 读写直接操作共享内存,无需序列化/反序列化 +- **低延迟**:省去了 socket 请求-响应的往返开销 +- **简化架构**:无需自定义 DMA 协议,Ruby 内存系统天然支持共享后端 + +### Legacy 后端:Socket DMA 协议 + +旧版后端通过 socket 使用自定义二进制协议路由 DMA。 + +**gem5 读取 Guest RAM**(ring buffer / fence): + +``` +gem5 GPU model (PM4/SDMA/IH) + | + v cosimBridge->sendDmaRead(guestPhysAddr, length) + | + +- Construct DmaRead message (32-byte header) + | { type=DmaRead, addr=guestPhysAddr, data=length } + | + +- sendAll(eventFd, &msg, 32) --> QEMU event thread + | | + | +- pci_dma_read(addr, buf, len) + | | (reads from /dev/shm/cosim-guest-ram) + | | + | +- sendAll(eventFd, &resp, 32) + | <------------------------------------------+- sendAll(eventFd, data, len) + | + +- memcpy(dest, recvBuf, length) // data arrives at gem5 +``` + +**gem5 写入 Guest RAM**(fence / IH cookie): + +``` +gem5 GPU model + | + v cosimBridge->sendDmaWrite(guestPhysAddr, length, data) + | + +- Construct DmaWrite message + data payload + | { type=DmaWrite, addr=guestPhysAddr, data=length, size=length } + | + +- sendAll(eventFd, &msg, 32) --> QEMU event thread + +- sendAll(eventFd, 
data, length) --> | + | +- pci_dma_write(addr, buf, len) + | | (writes to /dev/shm/cosim-guest-ram) + | + +- Done (DMA writes don't wait for response) +``` + +Legacy 后端的单次 DMA 最大传输量为 4 MiB(`COSIM_DMA_BUF_SIZE`)。实际场景中驱动通常以页为单位提交。 + +### 中断处理器(IH)DMA + +中断处理器使用原始系统物理地址(非 GART): + +``` +IH Ring Buffer: regs.baseAddr (from IH_RB_BASE register) +Wptr Address: regs.WptrAddr (from IH_RB_WPTR_ADDR registers) +``` + +这些是驱动设置的 GPA(Guest Physical Address)。IH 写入流程: + +1. 将中断 cookie(32 字节)写入 `baseAddr + IH_Wptr` +2. 将更新后的写指针写入 `WptrAddr` +3. 调用 `intrPost()` 向 Guest 发送 MSI-X 中断 + +在协同仿真模式下,DMA 写入落入共享 Guest RAM(`/dev/shm/cosim-guest-ram`),中断通过 vfio-user 的 irq_fd 机制(或 legacy 后端的 event socket)转发给 QEMU。 + +### 完整数据流示例 + +以 HIP kernel dispatch 为例,展示跨两个共享内存区域的完整内存交互: + +``` +1. hipMalloc(&d_a, N*sizeof(int)) + Driver -> allocates buffer in VRAM + Writes GART PTEs to shared VRAM (BAR0) + +2. hipMemcpy(d_a, h_a, N*sizeof(int), hipMemcpyHostToDevice) + Driver -> constructs SDMA copy command -> writes to Guest RAM (ring buffer) + Driver -> writes Doorbell -> QEMU BAR2 -> vfio-user -> gem5 + gem5 -> reads ring buffer (Guest RAM via shared memory) + gem5 -> parses SDMA command -> GART translates source address -> Guest RAM + gem5 -> reads source data (Guest RAM via shared memory) + gem5 -> writes to VRAM destination (shared memory direct write) + +3. kernel<<<1, N>>>(d_a, d_b, d_c, N) + Driver -> constructs PM4 dispatch command -> writes to Guest RAM (ring buffer) + Driver -> writes Doorbell -> gem5 + gem5 -> reads PM4 command (Guest RAM via shared memory) + gem5 -> launches shader execution + gem5 -> shader reads/writes VRAM (shared memory direct access) + gem5 -> writes fence on completion (Guest RAM via Ruby memory write) + gem5 -> sends MSI-X interrupt (irq_fd -> KVM) + +4. hipDeviceSynchronize() + Driver -> polls fence value (until Guest RAM value matches) + +- fence written by gem5 via Ruby memory write to shared backstore +``` + +Fence 写入(RELEASE_MEM)的地址转换细节: + +``` +1. 
PM4 RELEASE_MEM packet: addr=0x113100000 (guest phys), data=0x1234 +2. isVRAMAddress(0x113100000)? No (< 16 GiB but not a VRAM offset) +3. getGARTAddr(0x113100000) -> 0x898800000 (page * 8 transform) +4. dmaWriteVirt(0x898800000, 8, cb, &data) +5. GARTTranslationGen::translate() + - gart_addr = 0x898800 + - Look up PTE from shared VRAM -> PTE has paddr bits + - paddr = extracted address (in guest RAM) +6. DMA write lands in /dev/shm/cosim-guest-ram at paddr offset +7. Guest driver reads fence value from same shared memory +``` + +--- + +## MSI-X 中断转发 + +### 中断传递路径 + +GPU 通过 MSI-X 中断向 Guest 发出完成事件信号(fence 回写、IH ring 条目)。中断传递链在不同后端中有所不同: + +**vfio-user 后端**: + +``` +gem5 AMDGPUDevice::intrPost() + | + +-> cosimBridge->sendIrqRaise(0) + | + +-> MI300XVfioUser: vfu_irq_trigger(irq_fd) + | eventfd write -> KVM + | + +-> KVM injects MSI-X interrupt into guest + | + +-> Guest IH handler processes interrupt + reads IH ring buffer from Guest RAM +``` + +vfio-user 后端使用注册到 KVM 的 eventfd 描述符(`irq_fd`)。当 gem5 触发中断时,它写入 eventfd,KVM 直接将中断注入 Guest,热路径无需 QEMU 参与。 + +**Legacy 后端**: + +``` +gem5 AMDGPUDevice::intrPost() + | + +-> cosimBridge->sendIrqRaise(0) + | + +-> MI300XGem5Cosim: send IrqRaise message via event socket + | + +-> QEMU mi300x_gem5.c: event thread receives message + | msix_notify(pci_dev, vector) + | + +-> KVM injects MSI-X interrupt into guest + | + +-> Guest IH handler processes interrupt +``` + +设备支持 256 个 MSI-X 向量(BAR4)。 + +### IH Ring Buffer 交互 + +MSI-X 中断到达后,Guest 的 IH(Interrupt Handler)从 Guest RAM 中的 IH ring buffer 读取中断 cookie: + +1. gem5 将 32 字节中断 cookie 写入 Guest RAM 的 `IH_RB_BASE + IH_Wptr` +2. gem5 更新 `IH_RB_WPTR_ADDR` 处的写指针 +3. gem5 调用 `intrPost()` 传递 MSI-X 中断 +4. 
Guest IH handler 唤醒,从 ring buffer 读取 cookie,处理事件 + +Ring buffer 和写指针都位于共享 Guest RAM 中,gem5 的 Ruby 内存系统写入后 Guest 立即可见。 + +--- + +## xGMI 互连模型 + +xGMI(芯片间全局内存互连)模型提供 cosim-gpu 多 GPU hive 中的 GPU 间通信。它挂载在每个 GPU 的 L2 缓存(TCC)出口端口上,将远程 VRAM 访问通过可配置的带宽、延迟和拓扑的 xGMI 链路模型进行路由。 + +### 数据包格式 + +| 字段 | 类型 | 描述 | +|------|------|------| +| src_gpu | uint8 | 源 GPU ID | +| dst_gpu | uint8 | 目标 GPU ID | +| addr | uint64 | 目标 VRAM 地址 | +| size | uint32 | 负载大小(字节) | +| payload | bytes | 数据(写操作时) | + +### 地址映射 + +每个 GPU 拥有连续的 VRAM 地址范围: + +``` +GPU 0: [0, vram_size) +GPU 1: [vram_size, 2 * vram_size) +GPU N: [N * vram_size, (N+1) * vram_size) +``` + +桥接器通过检查地址落入哪个 GPU 的范围来判断本地或远程访问。 + +### 拓扑配置 + +启动参数 `--xgmi-topology`: + +- **mesh**:每个 GPU 与所有其他 GPU 直连。8 GPU mesh 创建 28 条双向链路。 +- **ring**:每个 GPU 连接其两个邻居。链路数更少但非相邻 GPU 需多跳。 + +### 链路参数 + +| 参数 | 默认值 | CLI 标志 | +|------|--------|----------| +| 每链路带宽 | 128 GB/s | `--xgmi-bandwidth` | +| 每跳延迟 | 100 ns | `--xgmi-latency` | +| 每链路通道数 | 16 | (SimObject 参数) | +| 每 GPU 最大链路 | 7 | (SimObject 参数) | +| 流控信用 | 32 | (SimObject 参数) | + +### 流量控制 + +基于信用的背压机制防止数据丢失: + +1. 每条链路初始 N 个信用(默认 32)。 +2. 发送一个数据包消耗一个信用。 +3. 接收方在接受数据包后归还信用。 +4. 信用归零时发送方阻塞(永不丢弃)。 + +### 架构阶段 + +**Path A(自建 xGMI 模型)**: + +- 单进程多 GPU:进程内函数调用 +- 多进程 8-GPU hive:通过共享内存环形缓冲区或 Unix socket 的 IPC 传输 + +**Path B(SST Merlin 集成)**: + +- 用 SST Merlin 网络引擎替换 xGMI 传输 +- 三层同步:QEMU(功能仿真)<-> gem5(GPU 时序)<-> SST(网络时序) +- 支持任意拓扑(fat-tree、dragonfly) + +### 关键源文件 + +- `gem5/src/dev/amdgpu/XGMIBridge.py` -- SimObject 定义 +- `gem5/src/dev/amdgpu/xgmi_bridge.hh` -- C++ 头文件 +- `gem5/src/dev/amdgpu/xgmi_bridge.cc` -- C++ 实现 +- `gem5/configs/example/gpufs/mi300_cosim.py` -- 配置和连线 + +--- + +## 设计历程与关键决策 + +本节记录了塑造协同仿真系统的关键架构决策和重要 bug 修复洞察。 + +### 为什么选择 vfio-user 而非自定义协议 + +初始实现使用自定义二进制协议,通过两条 Unix socket 连接传输(一条同步用于 MMIO,一条异步用于事件)。这种方式可以工作,但需要维护一个自定义 QEMU PCI 设备(`mi300x_gem5.c`)和自定义协议定义。 + +迁移到 vfio-user 由三个因素驱动: + +1. **无需自定义 QEMU 代码**:任何原生 QEMU 10.0+ 构建都可以通过内置的 `vfio-user-pci` 设备直接连接 gem5,无需维护 QEMU 分支。 +2. 
**协议标准化**:BAR 映射、配置空间、中断和 DMA 全部由 vfio-user 规范定义,降低了协议层面出现 bug 的可能性。 +3. **更简单的部署**:用户只需构建支持 libvfio-user 的 gem5;QEMU 直接使用原生版本。 + +vfio-user 迁移过程中解决的问题: + +- libvfio-user 的 BAR size 字段是 `uint32_t`,无法表示 16 GiB VRAM,因此改为 `uint64_t`。 +- 64 位 BAR 的上半部分在 PCI BAR size probing 时需要返回 size mask 的高 32 位。 +- PCI Express 和 MSI-X capability 必须在 `vfu_realize_ctx()` 之前注册。 +- SDMA ring test 超时:`sdma_delay=1e9` 导致约 500 ms 墙钟延迟,超过驱动端约 200 ms 的超时窗口;修复方法是将 `sdma_delay` 减小到 1000,同时将 `KEEPALIVE_INTERVAL` 增加到 `1e9`。 + +### 为什么选择 Q35 + KVM + +协同仿真使用 QEMU 的 Q35 机器类型配合 KVM 加速: + +- **KVM**:以接近原生的速度运行 Guest CPU。完整的 Linux 启动 + 驱动加载在一分钟内完成,而 gem5 全系统模式下需要 10 分钟以上。这大幅缩短了调试周期。 +- **Q35**:提供现代 PCIe 芯片组,支持 64 位 BAR(16 GiB VRAM BAR 所必需)和 MSI-X 中断。 +- **gem5 端的 StubWorkload**:gem5 不运行自己的内核。它启动一个最小事件循环,等待来自 QEMU 的 MMIO 请求。这避免了双内核的复杂性,使 gem5 专注于 GPU 建模。 + +### 共享内存设计 + +使用两个独立的 POSIX 共享内存文件(`/dev/shm/cosim-guest-ram` 和 `/dev/shm/mi300x-vram`)而非单一统一内存的决策,源于两个内存区域本质上的不同: + +- **Guest RAM** 必须作为 QEMU `memory-backend-file`(配合 `share=on`)和 gem5 `PhysicalMemory`(通过 `shared_backstore`)的后端存储。文件布局必须精确复制 Q35 的 4G 以下/4G 以上内存分割方式。 +- **VRAM** 作为 BAR0 暴露给 QEMU,作为设备内存暴露给 gem5。它有自己的内部布局(数据区 + GART 页表),与 Guest 物理地址空间无关。 + +将两者合并到一个文件中会引入复杂的偏移算术和两个独立地址空间之间的耦合。 + +### SIGIO 边沿触发排空 + +gem5 的 `PollQueue` 使用 `FASYNC`/`SIGIO` 监听 socket,这是边沿触发的:当 socket 缓冲区从空变为非空时,内核发送一次 `SIGIO`,且仅此一次。 + +amdgpu 驱动频繁地先写 INDEX 寄存器(选择要访问的内部寄存器),然后立即读 DATA 寄存器(获取值)。这两条消息背靠背到达 gem5 的 socket 缓冲区,但只会触发一次 SIGIO。如果消息处理器每次只读一条消息,第二条消息就会留在缓冲区中,没有信号唤醒 gem5。QEMU 阻塞等待读响应。结果:处理 15 条消息后死锁。 + +修复方案:使用 `do/while` 排空循环配合 `poll(fd, POLLIN, 0)`,在每次 SIGIO 到来时消费所有待处理消息: + +```cpp +do { + // read and process one message + ... 
+ struct pollfd pfd = {fd, POLLIN, 0}; +} while (poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN)); +``` + +此问题仅影响 legacy 后端。vfio-user 后端使用 libvfio-user 的非阻塞 poll 机制。 + +### GART 回退方案 + +在独立 gem5 模式下,GART 条目维护在哈希表(`gartTable`)中,由 `writeFrame()` 和 SDMA 影子拷贝填充。在协同仿真中,驱动通过 QEMU 的 BAR0 映射写入 GART PTE,直接进入共享 VRAM,不经过 gem5 的 `writeFrame()`。哈希表为空。 + +协同仿真回退机制直接从共享 VRAM 的 `vramShmemPtr + ptBase` 处读取 PTE。当 PTE 为零(未映射)时,条目映射到 sink(`paddr=0`),而非产生 fault。这防止了 `GenericPageTableFault` -> DMA 重试死循环(此前曾导致内存耗尽和段错误)。 + +诊断确认共享 VRAM 中 `gartBase`(= `ptBase`)处的 GART PTE 已被驱动正确填充。第一页(ptStart 本身)只是未映射——这是正常行为——而后续 PTE(偏移 0x32E0+)包含有效条目。 + +### VRAM 路由发现 + +地址 `0x1f72fa8000` 触发了超过 861,000 次 GART 转换错误,导致内存耗尽和段错误。根因:SDMA rptr 回写地址和 PM4 RELEASE_MEM 目标地址可能指向 VRAM(地址 < 16 GiB)。当这些地址被送入 `getGARTAddr()` 时,页号被乘以 8,GART 转换失败(VRAM 地址没有对应的页表项)。 + +修复采用三层防护: + +1. **PM4 层**(`pm4_packet_processor.cc`):`writeData()`、`releaseMem()`、`queryStatus()` 检查 `isVRAMAddress(addr)`,将 VRAM 写操作通过 `gpuDevice->getMemMgr()->writeRequest()`(设备内存)路由,而非 `dmaWriteVirt()`(通过 GART 的系统内存)。 +2. **SDMA 层**(`sdma_engine.cc`):`setGfxRptrLo/Hi()` 和 rptr 回写对 VRAM 地址跳过 `getGARTAddr()`,改用 `getMemMgr()->writeRequest()`。 +3. 
**GART 兜底**(`amdgpu_vm.cc`):`GARTTranslationGen::translate()` 通过逆向 `getGARTAddr` 变换(`orig_page = page_num >> 3`)检测 VRAM 地址,并将其映射到 `paddr=0` 作为 sink,而非产生 fault。 + +--- + +## 关键源文件 + +| 文件 | 作用 | +|------|------| +| `src/dev/amdgpu/mi300x_vfio_user.{cc,hh}` | vfio-user 服务端 SimObject(默认后端) | +| `src/dev/amdgpu/mi300x_gem5_cosim.{cc,hh}` | 旧版 socket 桥接 SimObject | +| `src/dev/amdgpu/cosim_bridge.hh` | 抽象 CosimBridge 接口 | +| `src/dev/amdgpu/amdgpu_vm.{cc,hh}` | 所有转换生成器(GART、AGP、MMHUB、User) | +| `src/dev/amdgpu/pm4_packet_processor.{cc,hh}` | PM4 DMA 路由、VRAM 检测、`getGARTAddr` | +| `src/dev/amdgpu/sdma_engine.{cc,hh}` | SDMA DMA 路由、GART 影子拷贝 | +| `src/dev/amdgpu/interrupt_handler.cc` | IH ring buffer DMA 和中断传递 | +| `src/dev/amdgpu/amdgpu_device.cc` | 设备级 `intrPost()`、`writeFrame()` | +| `src/dev/amdgpu/xgmi_bridge.{cc,hh}` | xGMI 互连桥接 | +| `configs/example/gpufs/mi300_cosim.py` | 系统配置、内存设置、后端选择 | +| `scripts/cosim_launch.sh` | 启动编排 | diff --git a/docs/zh/build-disk-china-mirror.md b/docs/zh/build-disk-china-mirror.md deleted file mode 100644 index 2618aa2..0000000 --- a/docs/zh/build-disk-china-mirror.md +++ /dev/null @@ -1,25 +0,0 @@ -[English](../en/build-disk-china-mirror.md) - -# 国内构建磁盘镜像加速补丁 - -## 问题 - -国内直连环境跑 `./scripts/run_mi300x_fs.sh build-disk` 时,VM 内 `apt` -会从 `us.archive.ubuntu.com` 拉包,常因网络波动挂住(Packer -`Timeout waiting for SSH` 或 provisioner 装 ROCm 时退出)。 - -## 应用补丁 - -```bash -cd gem5-resources -git apply ../scripts/patches/0001-user-data-cn-mirror.patch -``` - -回滚: - -```bash -cd gem5-resources -git apply -R ../scripts/patches/0001-user-data-cn-mirror.patch -``` - -切换其他镜像源:修改 patch 里的 URI 后重新 apply。 diff --git a/docs/zh/cosim-debugging-pitfalls.md b/docs/zh/cosim-debugging-pitfalls.md deleted file mode 100644 index ecf71c4..0000000 --- a/docs/zh/cosim-debugging-pitfalls.md +++ /dev/null @@ -1,186 +0,0 @@ -[English](../en/cosim-debugging-pitfalls.md) - -# MI300X 协同仿真:调试陷阱与修复 - -本文档记录了在 QEMU+gem5 MI300X 协同仿真启动过程中遇到并修复的 bug,包含一些不易察觉的根因分析。 - -## 1. 
SIGIO 合并导致的死锁(handleClientData 单次读取) - -> **注意**:此问题仅适用于 legacy cosim 后端(MI300XGem5Cosim)。vfio-user 后端使用 libvfio-user 的非阻塞轮询机制,不使用 FASYNC/SIGIO。 - -**现象**:驱动在首次访问 PCIe INDEX2/DATA2 寄存器对时挂起。gem5 处理约 15 条消息后停止响应。 - -**根因**:Linux FASYNC/SIGIO 是**边沿触发**的。当 QEMU 发送一个 fire-and-forget 的 MMIO write 后紧接着一个阻塞式 MMIO read 时,两条消息可能在 gem5 的 SIGIO handler 触发前同时到达。此时系统只会投递一个信号。原始的 `handleClientData()` 每次 SIGIO 只读取一条消息,导致第二条消息永远滞留。 - -**修复**(`mi300x_gem5_cosim.cc`):将 `handleClientData()` 改为排空循环,每处理一条消息后使用 `poll(fd, POLLIN, 0)` 检查是否还有更多数据: - -```cpp -void MI300XGem5Cosim::handleClientData(int fd) { - struct pollfd pfd; - do { - CosimMsgHeader msg; - if (!recvAll(fd, &msg, COSIM_MSG_HDR_SIZE)) { - closeClient(fd); return; - } - processMessage(fd, msg); - pfd = {fd, POLLIN, 0}; - } while (poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN)); -} -``` - -**教训**:任何基于 FASYNC 的 I/O handler 都必须排空所有待处理数据,而不能只读一条消息。这种模式(write + read 合并)在 PCIe 间接寄存器访问中很常见。 - ---- - -## 2. ip_block_mask 使用的是检测顺序而非类型枚举值 - -**现象**:`PSP load tmr failed!`、`hw_init of IP block failed -22`、`Fatal error during GPU init`。 - -**根因**:ROCm 7.0 DKMS 驱动(`amdgpu_device.c:2807`)检查 `(amdgpu_ip_block_mask & (1 << i))`,其中 `i` 是 **检测顺序索引**,而非 `amd_ip_block_type` 枚举值。 - -MI300X 检测顺序(来自 dmesg): - -| 索引 | IP Block | mask 中的位 | -|------|-----------------|-------------| -| 0 | soc15_common | 0x01 | -| 1 | gmc_v9_0 | 0x02 | -| 2 | vega20_ih | 0x04 | -| 3 | psp | 0x08 | -| 4 | smu | 0x10 | -| 5 | gfx_v9_4_3 | 0x20 | -| 6 | sdma_v4_4_2 | 0x40 | -| 7 | vcn_v4_0_3 | 0x80 | -| 8 | jpeg_v4_0_3 | 0x100 | - -**修复**:将 `ip_block_mask` 从 `0x6f` 改为 `0x67`: -- `0x6f` = `0110_1111` → 启用 common、gmc、ih、**psp**、gfx、sdma -- `0x67` = `0110_0111` → 启用 common、gmc、ih、gfx、sdma(禁用索引 3 的 psp 和索引 4 的 smu) - -**陷阱**:`amd_shared.h` 中的 `amd_ip_block_type` 枚举显示 PSP=4,但 PSP 的 mask 位实际是 `(1 << 3)`,因为 PSP 在 IP discovery 过程中排在第三个(索引 3)。文档和枚举值具有误导性。 - ---- - -## 3. 
amdgpu_atom_parse_data_header 空指针崩溃(缺少 VGA ROM) - -**现象**:`modprobe amdgpu` 导致内核空指针崩溃,位于 `amdgpu_atom_parse_data_header+0x1b`。调用链:`amdgpu_ras_init → amdgpu_atomfirmware_mem_ecc_supported → amdgpu_atom_parse_data_header`。RAX=0(NULL `atom_context`)。 - -**根因**:amdgpu 驱动的 BIOS 发现链有 5 种方法,在 cosim 模式下全部失败: - -| 方法 | 失败原因 | -|------|---------| -| `amdgpu_atrm_get_bios()` | QEMU Q35 无 ACPI ATRM 方法 | -| `amdgpu_acpi_vfct_bios()` | 无 ACPI VFCT 表 | -| `amdgpu_read_bios_from_rom()` | 通过 SMU 寄存器读取,但 SMU 被 `ip_block_mask=0x67` 禁用 | -| `amdgpu_read_platform_bios()` | 无平台提供的 ROM | -| `amdgpu_read_disabled_bios()` | cosim 下不可用 | - -驱动打印 `"Unable to locate a BIOS ROM"` 和 `"VBIOS image optional, proceeding"`,但 RAS 初始化路径无条件调用 `amdgpu_atom_parse_data_header()` 而不检查 NULL `atom_context`。 - -**修复**:在 `modprobe` **之前**将 VGA ROM 写入物理地址 `0xC0000`(共享内存): - -```bash -dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128 -modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2 -``` - -`0xC0000` 处的 ROM 数据通过 `/dev/shm/cosim-guest-ram` 可被 gem5 访问。当驱动通过 SMU MMIO 寄存器读取 ROM 时,gem5 的 `AMDGPUDevice::readROM()` 从 `system->getPhysMem()` 的 `VGA_ROM_DEFAULT + offset` 处读取,通过 cosim socket 返回 ROM 内容。 - -**陷阱**:QEMU 的 `romfile=` 属性将 ROM 加载到 PCI expansion ROM BAR,但 amdgpu 驱动**不会**直接从 PCI ROM BAR 读取——而是通过 SMU 寄存器访问 ROM。仅靠 `romfile` 不够;`dd` 步骤始终必要。 - ---- - -## 5. PM4ReleaseMem.dataSelect panic - -**现象**:gem5 panic,报错 `Unimplemented PM4ReleaseMem.dataSelect`。 - -**根因**:`pm4_packet_processor.cc` 中只实现了 `dataSelect == 1`(32 位数据写入)。驱动在 GFX 初始化过程中会使用其他模式。 - -**修复**:添加了所有常见 dataSelect 值的处理: - -| dataSelect | 行为 | -|------------|-------------------------------| -| 0 | 不写入数据(仅触发事件) | -| 1 | 写入 32 位值(已有实现) | -| 2 | 写入 64 位值 | -| 3 | 写入 64 位 GPU 时钟计数器 | -| 其他 | 发出警告并视为空操作 | - ---- - -## 6. 
协同仿真模式下 GART 表未填充 - -**现象**:大量 `GART translation for X not found` 警告。PM4 处理器读到全零内存(opcode 0x0)。KIQ ring test 超时。 - -**根因**:在协同仿真模式下,QEMU 的 BAR2(VRAM,16GB)由共享内存文件(`/dev/shm/mi300x-vram`)支撑。驱动对 VRAM 的写入直接进入共享文件,**完全绕过了 gem5 的 socket 协议**。gem5 的 `AMDGPUVM::gartTable` 哈希表在 `AMDGPUDevice::writeFrame()` 中填充,而该函数仅在写入通过 gem5 内存系统时才会执行。由于 VRAM 写入绕过了 gem5,`gartTable` 始终为空。 - -> **注意**:此问题同时适用于 legacy cosim 和 vfio-user 两种后端,因为在两种架构下 VRAM 都通过共享内存文件(`/dev/shm/mi300x-vram`)传递,驱动对 VRAM 的写入始终绕过 gem5 内存系统。 - -**修复**(`amdgpu_vm.cc` + `amdgpu_vm.hh`):在 `GARTTranslationGen::translate()` 中添加了共享 VRAM 回退机制: - -1. 在 `AMDGPUVM` 中添加 `vramShmemPtr` / `vramShmemSize` 字段 -2. `MI300XGem5Cosim` 在映射共享 VRAM 后设置这些字段 -3. 当 `gartTable` 未命中时,直接从共享 VRAM 读取 PTE: - -```cpp -Addr gart_byte_offset = bits(range.vaddr, 63, 12); -Addr pte_vram_offset = (gartBase() - getFBBase()) + gart_byte_offset; -memcpy(&pte, vramShmemPtr + pte_vram_offset, sizeof(pte)); -``` - -**关键细节**:`getGARTAddr()`(在 translate 之前调用)已经将页索引乘以 8 得到字节偏移: -```cpp -addr = (((addr >> 12) << 3) << 12) | low_bits; // page_num *= 8 -``` -因此 translate 函数中 `bits(vaddr, 63, 12)` 已经是 PTE 的**字节偏移**,而不是页索引。如果再乘以 8,会导致地址偏移到 GART 表中 8 倍远的位置。 - -**架构注释**:原始 translate 代码中的"扩展公式"(`gart_addr += lsb * 7`)对于经过 `getGARTAddr()` 处理的地址实际上是空操作,因为 `lsb = (page_num * 8) & 7 = 0`(`page_num * 8` 始终是 8 对齐的,所以低 3 位永远为零)。 - ---- - -## 7. SDMA Ring Test 超时(sdma_delay 时序问题) - -**现象**:驱动初始化过程中 SDMA ring test 返回 `-110`(`-ETIMEDOUT`)。 - -**根因**:gem5 中 `sdma_engine.hh` 的 `sdma_delay` 参数默认值为 `1e9` ticks。在协同仿真模式下,gem5 的模拟时钟与墙钟(wall-clock)之间的比率导致 `1e9` ticks 对应约 500ms 的实际延迟。而 amdgpu 驱动的 SDMA ring test 超时阈值约为 200ms,远小于这个延迟。 - -具体流程: -1. 驱动写入 SDMA ring buffer 并敲 doorbell -2. gem5 收到 doorbell 后调度 SDMA 处理事件,延迟 `sdma_delay` ticks -3. 由于延迟过长,驱动在 gem5 完成处理之前就已超时 -4. 
驱动报告 `sdma v4_4_2: ring 0 test failed (-110)` - -**修复**: -- 将 `sdma_delay` 从 `1e9` 减小到 `1000` ticks(`sdma_engine.hh`) -- 将 cosim 的 `KEEPALIVE_INTERVAL` 增大到 `1e9`,避免 keepalive 消息干扰时序 - -**教训**:协同仿真模式下的时序参数不能照搬独立仿真的默认值。gem5 模拟时钟和墙钟之间的比率差异会放大或缩小延迟效果。 - ---- - -## 协同仿真架构通用说明 - -### 哪些操作绕过了通信协议 - -**Legacy 后端(自定义 socket 协议):** - -| 资源 | QEMU BAR | gem5 BAR | 通过 Socket? | 通过共享内存? | -|----------------|----------|----------|---------------|----------------| -| MMIO 寄存器 | BAR0 | BAR5 | 是 | 否 | -| VRAM(16GB) | BAR2 | BAR0 | **否** | 是 | -| Doorbells | BAR4 | BAR2 | 是 | 否 | - -**vfio-user 后端(标准 vfio-user 协议):** - -| 资源 | QEMU 映射方式 | gem5 侧 | 通过 vfio-user? | 通过共享内存? | -|----------------|------------------------|----------------|------------------|----------------| -| MMIO 寄存器 | vfio-user region 回调 | BAR5 | 是 | 否 | -| VRAM(16GB) | vfio-user DMA region | BAR0 | **否** | 是 | -| Doorbells | vfio-user region 回调 | BAR2 | 是 | 否 | - -> **注意**:使用 vfio-user 后端时,QEMU 使用内置的 `vfio-user-pci` 设备,无需自定义 QEMU 设备代码。QEMU 通过 vfio-user 协议映射所有 BAR:BAR0(VRAM)通过 DMA region 映射,BAR2(doorbell)和 BAR5(MMIO)通过 vfio-user region 回调处理。 - -任何通过拦截 VRAM 写入来填充的 gem5 数据结构(如 `gartTable`、页表、ring buffer)在协同仿真模式下都**不会**被填充。这些结构需要显式的回退机制来从共享 VRAM 中读取数据。此限制同时适用于两种后端。 - -### 驱动加载失败后需要重启 guest - -驱动 `hw_init` 失败后执行 `rmmod amdgpu` 会导致 kernel oops(`kgd2kfd_device_exit` 中的 page fault)。模块会停留在 "busy" 状态,无法重新加载。唯一的解决方法是重启整个协同仿真环境(杀掉 QEMU,重启 gem5 Docker 容器,重启 QEMU)。 diff --git a/docs/zh/cosim-dev-story.md b/docs/zh/cosim-dev-story.md deleted file mode 100644 index 84686e4..0000000 --- a/docs/zh/cosim-dev-story.md +++ /dev/null @@ -1,342 +0,0 @@ -[English](../en/cosim-dev-story.md) - -# 只用两天:我用 Claude 把十万块的 MI300X GPU 搬进了 QEMU - -> AMD Instinct MI300X,304 个计算单元,192GB HBM3 显存,单卡零售价超过 10 万人民币。 -> 现在,你只需要一台普通的 x86 Linux 机器,就能在 QEMU 上跑起完整的 ROCm/HIP 工作负载。 - -## 01 -缘起:当 gem5 的启动时间比调试还长 - -我做 GPU 模拟器已经有一段时间了。gem5 有 MI300X 的设备模型,也有全系统仿真的能力,但它的 KVM 快进模式仍然很慢——一次 Linux 启动要等 5 分钟,驱动加载再等 5 分钟,每次调试一个 MMIO 寄存器的问题都意味着 10 分钟的空白等待。 - -我一直想做一件事:让 
QEMU 跑 Linux 和 amdgpu 驱动,gem5 只负责 GPU 计算模型,中间用某种 IPC 桥接起来。这样 QEMU 用 KVM 跑 CPU 部分,速度接近原生;gem5 只处理 GPU 的 MMIO/Doorbell/DMA,可以专注在计算仿真的精度上。 - -这个想法听起来不复杂,但实际做起来涉及到 QEMU PCIe 设备模型、gem5 SimObject 架构、Linux amdgpu 驱动的初始化流程、GART 地址翻译、共享内存文件偏移量对齐、Unix 域套接字的边沿触发语义——这些东西的交叉点上全是坑。 - -2026 年 3 月 6 日早上,我打开了 Claude Code,开始了这个项目。到 3 月 8 日凌晨,第一个 HIP 向量加法测试在联合仿真环境下跑出了 `PASSED!`。 - -这篇文章记录了整个过程中踩过的坑和关键决策。 - ---- - -## 02 -架构:一句话版本 - -``` -+-----------------------------+ +----------------------------+ -| QEMU (Q35 + KVM) | | gem5 (Docker) | -| +-----------------------+ | | +----------------------+ | -| | Guest Linux | | | | MI300X GPU Model | | -| | amdgpu driver | | | | Shader / CU / SDMA | | -| | ROCm 7.0 / HIP | | | | PM4 / Ruby caches | | -| +----------+------------+ | | +---------+------------+ | -| | | | | | -| +----------v------------+ | | +---------v------------+ | -| | vfio-user-pci (内置) |<-------->| | MI300XVfioUser | | -| +-----------------------+ |vfio- | +----------------------+ | -| |user | | -+-----------------------------+socket +----------------------------+ - | | - v v - /dev/shm/cosim-guest-ram /dev/shm/mi300x-vram - (shared guest RAM) (shared GPU VRAM) -``` - -> **后端选择**:默认使用 vfio-user 后端(`MI300XVfioUser`),QEMU 侧使用内置的 `vfio-user-pci` 设备,无需自定义 QEMU 代码。也支持 legacy 后端(`MI300XGem5Cosim` + 自定义 `mi300x-gem5` QEMU 设备),通过 `--cosim-backend=legacy` 切换。 - -QEMU 这边是一个完整的 Q35 虚拟机,跑 Ubuntu 24.04 + ROCm 7.0 + amdgpu 驱动。vfio-user 后端使用 QEMU 内置的 `vfio-user-pci` 设备,通过标准 vfio-user 协议把所有 MMIO 读写和 Doorbell 写操作转发给 gem5。 - -gem5 这边跑的是 MI300X 的 GPU 设备模型——Shader、CU 阵列、PM4 命令处理器、SDMA 引擎、Ruby 缓存层次结构——但**没有 Linux 内核**。它用 `StubWorkload` 空壳启动,只等 QEMU 通过 socket 发来 MMIO 请求。 - -Guest 物理内存和 GPU VRAM 各有一块共享内存文件(`/dev/shm/`),QEMU 和 gem5 都能直接 mmap,实现零拷贝 DMA。 - -BAR 布局必须严格匹配 amdgpu 驱动的硬编码预期: - -| BAR | 内容 | 大小 | 通信方式 | -|-----|------|------|----------| -| BAR0+1 | VRAM | 16 GiB | 共享内存 | -| BAR2+3 | Doorbell | 4 MiB | Socket 转发 | -| BAR4 | MSI-X | 256 vectors | QEMU 本地 | -| BAR5 | MMIO 寄存器 | 512 KiB | 
Socket 转发 | - ---- - -## 03 -起步:从零开始写 PCIe 设备 - -6 号早上 6 点半,我让 Claude 帮我写了 QEMU 侧的 `mi300x_gem5.c`。这是一个标准的 QEMU PCIe 设备,但有几个特殊的地方: - -1. **六个 BAR**,其中三个需要 64 位地址空间(VRAM 16GB 不可能放在 4G 以下) -2. **两条 socket 连接**:一条同步(MMIO 请求/响应),一条异步(中断和 DMA 事件) -3. **MSI-X 支持**:256 个中断向量,gem5 通过 event socket 通知 QEMU 触发 `msix_notify()` - -gem5 侧的 `MI300XGem5Cosim` SimObject 稍微复杂一点——它是一个 socket 服务器,监听来自 QEMU 的连接,接收 MMIO 消息后分发给 `AMDGPUDevice` 处理,再把结果发回去。 - -第一版代码大约 1500 行(QEMU 700 行 + gem5 800 行),结构清晰但全是 bug。 - ---- - -## 04 -踩坑:从 SIGIO 死锁到 GART 翻译 - -### Bug #1:SIGIO 边沿触发死锁——最阴险的问题 - -gem5 的事件系统使用 `FASYNC`/`SIGIO` 来监听 socket 上的数据。这是**边沿触发**的——当 socket 缓冲区从空变非空时,内核发一次 `SIGIO`,仅此一次。 - -问题出在 amdgpu 驱动的寄存器访问模式上。驱动经常先写 INDEX 寄存器(选择要访问哪个内部寄存器),然后立即读 DATA 寄存器(拿到值)。write 是 fire-and-forget 的,read 是阻塞等待响应的。当这两条消息背靠背到达 gem5 的 socket 缓冲区时,只会触发一次 SIGIO。 - -我最初的 `handleClientData()` 每次只读一条消息。结果:gem5 读了 write 消息,处理完毕,然后就傻等下一次 SIGIO。但 read 消息已经在缓冲区里了,不会再有新的 SIGIO 来唤醒它。QEMU 那边死等 read 响应。**完美死锁。** - -gem5 处理了 15 条消息后就永远挂住了。 - -修复方法很简单——把单次读取改成排空循环: - -```cpp -void MI300XGem5Cosim::handleClientData(int fd) { - struct pollfd pfd; - do { - CosimMsgHeader msg; - if (!recvAll(fd, &msg, COSIM_MSG_HDR_SIZE)) { - closeClient(fd); return; - } - processMessage(fd, msg); - pfd = {fd, POLLIN, 0}; - } while (poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN)); -} -``` - -修完这个之后,MMIO 消息数从 15 跳到了 **35,181**。驱动初始化一路推进到了 PSP 固件加载阶段。 - -**教训:任何基于 FASYNC 的 I/O handler 都必须排空所有待处理数据。这在 PCIe 间接寄存器访问的场景下是必然的。** - -### Bug #2:ip_block_mask——文档骗人 - -amdgpu 驱动有一个 `ip_block_mask` 参数,用来控制哪些 IP 块需要初始化。cosim 模式下不需要 PSP(安全处理器)和 SMU(电源管理),需要禁用它们。 - -我最初用的是 `0x6f`,觉得禁用了 PSP(枚举值 4)和保留了其他。结果 PSP 还是被初始化了,加载固件时报 `-EINVAL` 然后整个 GPU init 失败。 - -花了好一阵子才搞明白:`ip_block_mask` 的位对应的是 **IP discovery 的检测顺序索引**,不是 `amd_ip_block_type` 枚举值。MI300X 的检测顺序是: - -``` -0: soc15_common 1: gmc_v9_0 2: vega20_ih -3: psp 4: smu 5: gfx_v9_4_3 -6: sdma_v4_4_2 7: vcn_v4_0_3 8: jpeg_v4_0_3 -``` - -PSP 在枚举值里是 4,但在检测顺序里是 3。`0x6f` = `0110_1111` 禁用的是索引 4(smu),但索引 
3(psp)还是被启用了。正确的值是 `0x67` = `0110_0111`,同时禁用索引 3 和 4。 - -**教训:amd_shared.h 的枚举值和驱动实际使用的位掩码之间没有对应关系。只有 dmesg 的检测日志才是真相。** - -### Bug #3:共享内存偏移量——两个系统的内存观不一致 - -这个 bug 最诡异。GART 页表项读出来全是零,PM4 命令处理器一直读到 opcode 0x0(NOP),无限循环。 - -问题出在 QEMU Q35 和 gem5 对内存拆分方式的不同。配置 8GB RAM 时: - -- **QEMU Q35** 硬编码 `below_4g = 2 GiB`(当 `ram_size >= 0xB0000000`),上方 6GB 放在文件偏移 2G 处 -- **gem5** 默认 `below_4g = 3 GiB`,上方 5GB 放在文件偏移 3G 处 - -两边 mmap 同一个共享内存文件,但对"第 4G 以上的内存在文件的哪个偏移"意见不一致。gem5 从偏移 3G 处读 GART 页表——那里全是零,因为 QEMU 把数据写在了偏移 2G 处。 - -修复:在 `mi300_cosim.py` 里完全复制 Q35 的拆分逻辑。 - -**教训:共享 memory-backend-file 时,双方必须在每个范围的文件偏移量上达成一致,不仅仅是总大小。** - -### Bug #4:VRAM 地址被错误地走了 GART 翻译 - -PM4 的 `RELEASE_MEM` 和 SDMA 的 rptr 回写,目标地址有时候指向 VRAM(地址 < 16GiB)。原来的代码把所有地址都扔进 `getGARTAddr()` 做翻译,但 VRAM 地址在 GART 里没有对应的页表项,翻译失败 861,000 多次,最后内存耗尽段错误。 - -修复用了三层防护: - -1. **PM4 层**:`writeData()` / `releaseMem()` 检查 `isVRAMAddress(addr)`,VRAM 写直接走设备内存 -2. **SDMA 层**:rptr 回写对 VRAM 地址跳过 `getGARTAddr()` -3. **GART 兜底**:未映射的 GART 页映射到 `paddr=0`(sink),不产生 fault - ---- - -## 05 -验证:HIP 向量加法 PASSED - -3 月 8 日凌晨,所有 bug 修完,驱动加载正常,`rocm-smi` 看到了 MI300X (0x74a0),`rocminfo` 报告 gfx942 架构、320 个 CU。 - -在 Guest 里写了一个最简单的 HIP 测试——四个元素的向量加法: - -```cpp -__global__ void add(int *a, int *b, int *c, int n) { - int i = blockIdx.x * blockDim.x + threadIdx.x; - if (i < n) c[i] = a[i] + b[i]; -} -``` - -编译,运行: - -``` -Result: 11 22 33 44 -PASSED! -``` - -`{1+10, 2+20, 3+30, 4+40}` = `{11, 22, 33, 44}`。hipMalloc、hipMemcpy(host-to-device / device-to-host)、kernel dispatch、hipDeviceSynchronize 全部正常返回。MSI-X 中断从 gem5 通过 event socket 转发给 QEMU,QEMU 触发 `msix_notify()`,guest 的 IH handler 正确处理——整个中断链路首次端到端跑通。 - -这是 gem5 作为"远程 GPU"被 QEMU guest 里的真实 amdgpu 驱动驱动起来做计算的最佳实践。 - ---- - -## 06 -协作:不是代码工具,而是系统搭档 - -整个开发过程在一个巨型对话里完成,上下文用完了就续上。工作流是这样的: - -1. **我提供原始的终端输出**:dmesg 日志、gem5 panic 信息、socket 通信的 hexdump -2. **Claude 分析输出**,搜索 gem5/QEMU/Linux 内核源码定位根因 -3. **Claude 提出并实现修复**——直接编辑 gem5 C++ 代码、QEMU C 代码、Python 配置、Shell 脚本 -4. 
**后台构建**:gem5 编译约 30 分钟,QEMU 约 5 分钟,磁盘镜像约 40 分钟——这些都在后台跑 -5. **我测试,贴新的输出**,循环继续 - -Claude 在这个项目里的角色不是"帮我写代码的工具",而更像一个**对 gem5 和 QEMU 内部机制有深入了解的协作者**。几个典型场景: - -- **SIGIO 死锁**:我只贴了"gem5 处理 15 条消息后挂住",Claude 立刻定位到 FASYNC 的边沿触发语义,给出了排空循环的方案 -- **ip_block_mask**:我贴了 dmesg 的 IP discovery 日志,Claude 直接对照出了检测顺序和位掩码的不匹配 -- **GART 翻译**:Claude 从 gem5 源码中追踪了 `getGARTAddr()` 的乘 8 变换,发现了 VRAM 地址被误导入 GART 路径的问题 -- **Q35 内存拆分**:Claude 翻出了 `qemu/hw/i386/pc_q35.c:161` 的硬编码 2GiB 边界,和 gem5 的 3GiB 默认值做对比 - -整个过程中,15 个 blocking bug 被逐一解决。每个 bug 的修复都建立在对底层系统行为的准确理解上——不是试错,而是溯源。 - ---- - -## 07 -记忆:跨会话的知识延续 - -这个项目的开发跨越了多个对话会话——Claude Code 的上下文窗口是有限的,一个超长的调试会话用完上下文后需要续上新的对话。这时候一个关键问题出现了:新的对话怎么知道之前做了什么、哪些 bug 已经修了、哪些还在处理中? - -答案是 Claude 的 auto memory 系统。在 `~/.claude/projects/` 目录下,Claude 会自动维护一组记忆文件,记录跨会话的关键信息。这个项目的记忆文件有三个: - -1. **MEMORY.md**(主记忆,43 行):项目结构、gem5 运行环境配置(Docker 镜像名、构建参数、Python 版本)、DRM Client -13 崩溃的修复记录、联合仿真的总体状态 -2. **cosim-details.md**(架构细节,69 行):完整的 BAR 布局、8 个关键修复的摘要、gem5/QEMU 启动命令、GART 页表的精确参数(ptBase、fbBase、PTE 格式) -3. 
**cosim-debugging.md**(调试进展,63 行):每个 bug 的文件位置、根因、修复状态(包括"部分修复"这种中间状态)、当前阻塞项 - -这些记忆文件在实际开发中发挥了几个关键作用: - -**避免重复诊断**。当一个新会话开始时,Claude 不需要重新分析整个代码库来理解项目状态。记忆文件里记着"SIGIO 死锁已修复、ip_block_mask 改成了 0x67、GART 回退已实现",可以直接从上次停下的地方继续。 - -**保持环境一致性**。gem5 必须在特定的 Docker 镜像里构建和运行(`ghcr.io/gem5/gpu-fs:latest`),QEMU 的串口参数不能和 `-nographic` 混用,磁盘镜像需要用 packer 加特定参数构建——这些环境细节散落在不同的会话中,但被记忆文件统一收集。新会话不会因为用错 Docker 镜像或构建参数而浪费时间。 - -**追踪增量进展**。调试过程不是线性的。GART 翻译的修复经历了"部分修复"到"完全修复"的过程——记忆文件忠实地记录了这个中间状态,避免了在新会话中误以为问题已经完全解决而跳过验证。 - -**跨代码库的关联索引**。记忆文件中记录了关键文件的路径(`mi300x_gem5_cosim.cc`、`amdgpu_vm.cc`、`mi300_cosim.py`)、关键常量(`ptBase=0x3EE600000`、`fbBase=0x8000000000`)和关键公式(`getGARTAddr()` 的乘 8 变换)。这些信息分散在三个不同的代码库中,记忆系统将它们集中在一起,形成了一个高效的关联索引。 - -如果说 Claude 在单次对话中的价值是"快速定位根因",那么记忆系统的价值就是**让这种能力跨会话延续**。没有记忆系统,每次续上新对话都需要花 10-15 分钟重新建立上下文;有了记忆系统,新会话在几秒钟内就能回到之前的工作状态。 - ---- - -## 08 -成果:两天交付了什么 - -| 指标 | 数据 | -|------|------| -| 开发耗时 | ~24 小时(3月6日 06:30 → 3月8日 06:00) | -| 新增代码 | ~2500 行(gem5 C++ ~800,QEMU C ~700,Python 配置 ~200,Shell 脚本 ~800) | -| 解决的 blocking bug | 15 个 | -| 技术文档 | 6 篇(中英双语,共 ~2000 行) | -| Git 提交 | 16 笔(cosim 主仓库) | -| MMIO 操作 | 65,000+ 次无崩溃 | -| HIP 计算测试 | PASSED | -| vfio-user 迁移 | 完成(3月9日),vector_add / transpose / gemm 全部 PASSED | - -最终的系统支持: - -- **完整 amdgpu 驱动加载**:DRM 初始化,7 个 XCP 分区,gfx942 架构 -- **ROCm 工具链**:rocm-smi、rocminfo 正常工作 -- **HIP GPU 计算**:hipMalloc、kernel dispatch、hipDeviceSynchronize -- **MSI-X 中断转发**:gem5 → QEMU 事件通知 -- **共享内存 DMA**:零拷贝 VRAM + Guest RAM -- **vfio-user 后端**:标准协议,无需自定义 QEMU 代码 -- **一键启动**:`./scripts/cosim_launch.sh` - ---- - -## 08.5 -vfio-user 迁移:从自定义协议到行业标准 - -在初始版本使用自定义 socket 协议验证了端到端可行性之后,我们在 3 月 9 日将 QEMU-gem5 通信迁移到了标准的 vfio-user 协议。 - -vfio-user 是一个用于将 PCI 设备暴露到远程进程的标准协议(QEMU 10.0+ 内置了 `vfio-user-pci` 客户端)。gem5 侧使用 Nutanix 的 libvfio-user 库作为服务端。这意味着: - -- **无需自定义 QEMU 代码**:任何支持 vfio-user 的 QEMU 都能直接连接 gem5 -- **协议标准化**:BAR 映射、配置空间、中断、DMA 全部由 vfio-user 规范定义 -- **更简单的部署**:不再需要维护 QEMU 分支 - -迁移过程中解决了几个关键问题: -- libvfio-user 的 BAR size 字段是 
`uint32_t`,无法表示 16GB VRAM → 改为 `uint64_t` -- 64 位 BAR 的上半部分在 size probing 时需要返回 size mask 的高 32 位 -- PCIe Express 和 MSI-X capability 必须在 `vfu_realize_ctx()` 之前注册 -- SDMA ring test 超时:`sdma_delay=1e9` 导致 ~500ms 墙钟延迟 → 减小到 1000 - -迁移后的测试结果:vector_add (120ms)、transpose (6.5s)、gemm (4.7s) 全部 PASSED。 - ---- - -## 09 -意义:十万块的 GPU 触手可及 - -MI300X 是 AMD 最强的数据中心 GPU,单卡价格超过 10 万人民币,普通开发者根本摸不到。但通过 QEMU + gem5 联合仿真,你可以在任何一台 x86 Linux 机器上: - -- 跑完整的 ROCm 7.0 软件栈 -- 编译和运行 HIP 程序 -- 在 cycle-accurate 的 GPU 模型上做性能分析 -- 调试 amdgpu 驱动的初始化流程 -- 开发和验证 GPU 架构的新特性 - -所有代码已开源:[github.com/zevorn/cosim-gpu](https://github.com/zevorn/cosim-gpu) - -```bash -git clone --recurse-submodules git@github.com:zevorn/cosim-gpu.git -cd cosim-gpu -GEM5_BUILD_IMAGE=ghcr.io/gem5/gpu-fs:latest ./scripts/run_mi300x_fs.sh build-all -cd scripts && docker build -t gem5-run:local -f Dockerfile.run . && cd .. -./scripts/cosim_launch.sh -``` - ---- - -## 10 -后记:放大器,不是替代品 - -有人可能会说:"两天写完的代码能靠谱吗?" - -说实话,如果没有 Claude,这个项目至少需要两周。不是因为代码量大——2500 行代码对于一个 PCIe 设备桥接来说并不多——而是因为调试过程中需要同时理解三个系统的内部行为:QEMU 的 Q35 内存布局、gem5 的事件驱动 I/O 模型、Linux amdgpu 驱动的 IP block 初始化顺序。任何一个环节理解错了,就是几小时的调试黑洞。 - -Claude 的价值不在于帮我写代码,而在于**大幅缩短了从"看到症状"到"理解根因"的时间**。当我贴上一段 dmesg 输出,Claude 能在几秒钟内关联到 gem5 源码中的具体函数和 QEMU 的硬编码常量——这种跨代码库的关联分析,是人工翻源码做不到的速度。 - -当然,Claude 也不是万能的。所有的测试都是我跑的,所有的架构决策都是我做的(比如选择两条 socket 连接而不是一条,选择 StubWorkload 而不是全系统启动),所有的最终验证都需要在真实环境里确认。AI 是放大器,不是替代品。 - -但这个放大器确实很强。两天,一个人,一个 AI,十万块的 GPU 搬进了 QEMU。 - ---- - -## 参考资料 - -**项目与源码** - -- [cosim-gpu](https://github.com/zevorn/cosim-gpu) — 本项目仓库(含 gem5、QEMU、gem5-resources 子模块) -- [完整使用指南](cosim-usage-guide.md) — 从编译到运行 HIP 测试的全流程 -- [技术笔记](cosim-technical-notes.md) — 架构设计、踩坑记录、修复方案 -- [MI300X 内存管理](mi300x-memory-management.md) — GART、地址翻译、内存映射 -- [Guest GPU 初始化流程](cosim-guest-gpu-init.md) — 驱动加载与设备初始化 -- [调试踩坑记录](cosim-debugging-pitfalls.md) — 常见问题与解决方案 - -**上游项目** - -- [gem5](https://www.gem5.org/) — 模块化计算机体系结构仿真器 -- [QEMU](https://www.qemu.org/) — 开源机器模拟器与虚拟化工具 -- 
[ROCm](https://rocm.docs.amd.com/) — AMD 开源 GPU 计算平台 -- [AMD Instinct MI300X](https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html) — 产品规格 -- [libvfio-user](https://github.com/nutanix/libvfio-user) — vfio-user 协议服务端库 - -**开发工具** - -- [Claude Code](https://docs.anthropic.com/en/docs/claude-code) — Anthropic 的 CLI 编程助手 - ---- - -*泽文,2026 年 3 月* diff --git a/docs/zh/cosim-guest-gpu-init.md b/docs/zh/cosim-guest-gpu-init.md deleted file mode 100644 index 78a78d5..0000000 --- a/docs/zh/cosim-guest-gpu-init.md +++ /dev/null @@ -1,158 +0,0 @@ -[English](../en/cosim-guest-gpu-init.md) - -# MI300X 协同仿真:客户机 GPU 初始化指南 - -## 概述 - -MI300X GPU 驱动可在 QEMU 客户机启动后**自动**或**手动**加载。磁盘镜像中已包含 systemd 服务(`cosim-gpu-setup.service`),会在开机时自动完成完整的初始化流程。 - -磁盘镜像中已包含所有必需的文件(ROM、固件、内核模块)。 - -## 自动加载(默认) - -磁盘镜像内置 `cosim-gpu-setup.service`,开机时自动执行: - -1. `dd` 写入 VGA ROM 到 `0xC0000`(gem5 通过共享内存的 `readROM()` 需要此数据) -2. 链接 IP discovery 固件 -3. `modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2` - -服务约 40 秒完成。登录后 GPU 即可使用: - -```bash -rocm-smi # 应显示设备 0x74a0 -rocminfo # 应显示 gfx942 -``` - -服务文件内容: - -```ini -# /etc/systemd/system/cosim-gpu-setup.service -[Unit] -Description=MI300X GPU Setup for Co-simulation -After=local-fs.target -Before=multi-user.target - -[Service] -Type=oneshot -RemainAfterExit=yes -ExecStart=/usr/local/bin/cosim-gpu-setup.sh - -[Install] -WantedBy=multi-user.target -``` - -> **注意:** 内核命令行必须保留 `modprobe.blacklist=amdgpu`,防止 PCI 子系统在 ROM 写入共享内存之前自动加载驱动。systemd 服务会在 `dd` 之后显式 `modprobe`。 - -## 手动加载 - -如果 systemd 服务未安装,在 guest 启动后手动执行以下命令。 - -### 前置条件 - -- `cosim_launch.sh` 正在运行(gem5 + QEMU 已连接) -- 客户机已启动并获取了 root shell -- 内核命令行中传递了 `modprobe.blacklist=amdgpu` - -### 快速参考(可直接复制粘贴) - -```bash -dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128 -ln -sf /usr/lib/firmware/amdgpu/mi300_discovery /usr/lib/firmware/amdgpu/ip_discovery.bin -modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 
discovery=2 -``` - -## 详细步骤 - -### 步骤 1:加载 VGA BIOS ROM - -```bash -dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128 -``` - -**功能说明**:将 MI300X VBIOS ROM 镜像写入物理地址 `0xC0000`(768 KB)处的传统 VGA ROM 区域。 - -**必要性**:amdgpu 驱动在初始化期间从传统 VGA ROM 空间(`0xC0000–0xDFFFF`,128 KB)读取 VBIOS。QEMU 协同仿真设备注册为 `PCI_CLASS_DISPLAY_VGA`,因此内核将该地址范围识别为 "shadowed ROM"。如果没有 ROM,驱动将报错 `"Unable to locate a BIOS ROM"`。 - -**参数说明**: -| 参数 | 值 | 含义 | -|-----------|-------|---------| -| `if` | `/root/roms/mi300.rom` | ROM 二进制文件(在磁盘镜像中) | -| `of` | `/dev/mem` | 物理内存设备 | -| `bs` | `1k` | 块大小 = 1024 字节 | -| `seek` | `768` | 跳转至 768 × 1024 = `0xC0000` | -| `count` | `128` | 写入 128 × 1024 = 128 KB | - -### 步骤 2:链接 IP Discovery 固件 - -```bash -ln -sf /usr/lib/firmware/amdgpu/mi300_discovery \ - /usr/lib/firmware/amdgpu/ip_discovery.bin -``` - -**功能说明**:将驱动的 IP discovery 固件路径指向 MI300X 专用的 discovery 二进制文件。 - -**必要性**:amdgpu 驱动使用 `discovery=2` 模式,该模式从磁盘上的固件文件读取 GPU IP 块信息,而非从 GPU 自身的 ROM/寄存器读取。gem5 GPU 模型通过其 `ipt_binary` 参数提供此文件(空字符串 = 使用磁盘固件)。驱动查找 `/usr/lib/firmware/amdgpu/ip_discovery.bin`,该文件必须指向 MI300X 专用文件。 - -**注意**:磁盘镜像中已包含这两个文件;此命令仅创建正确的符号链接。如果 `mi300_discovery` 不存在,驱动将回退到内置默认值(可能与 MI300X 不匹配)。 - -### 步骤 3:加载 amdgpu 内核模块 - -```bash -modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2 -``` - -**功能说明**:使用协同仿真参数加载 amdgpu 驱动。 - -**amdgpu 模块参数**: - -| 参数 | 值 | 含义 | -|-----------|-------|---------| -| `ip_block_mask` | `0x67` | 禁用 PSP(bit 3)和 SMU(bit 4);cosim 不模拟这些 IP 块 | -| `ppfeaturemask` | `0` | 禁用 PowerPlay 特性;cosim 无电源管理硬件 | -| `dpm` | `0` | 禁用动态电源管理 | -| `audio` | `0` | 禁用音频;cosim 无 HDMI/DP 音频 | -| `ras_enable` | `0` | 禁用 RAS — 防止 VBIOS 最小化时 `atom_context` 为 NULL 导致的空指针崩溃 | -| `discovery` | `2` | 使用固件文件进行 IP discovery | - -> **警告**:使用 `ip_block_mask=0x6f`(仅禁用 SMU)会导致 PSP 固件加载失败和内核 panic。务必使用 `0x67`。 - -> **警告**:`dd` 步骤(步骤 1)在 `modprobe` 之前**必须**执行。否则驱动的 BIOS 发现链全部失败(ACPI 不可用、SMU 已禁用),导致 `"Unable to locate a BIOS ROM"` 后在 `amdgpu_ras_init` → 
`amdgpu_atom_parse_data_header` 处发生空指针崩溃。 - -## 验证 - -完成步骤 3 后,检查驱动是否已加载: - -```bash -# Check dmesg for amdgpu initialization -dmesg | grep -i amdgpu | tail -20 - -# Check PCI device -lspci | grep -i amd - -# Check ROCm (if available) -rocm-smi -rocminfo | head -40 -``` - -**预期结果**:`dmesg` 应显示 amdgpu 正在初始化 GPU 且无致命错误。MMIO 流量应出现在 gem5 调试日志中。 - -## 故障排查 - -| 症状 | 原因 | 解决方法 | -|---------|-------|-----| -| `Unable to locate a BIOS ROM` + 空指针崩溃 | 步骤 1(dd ROM)未在 modprobe 之前执行 | 先执行 `dd`;检查 `/root/roms/mi300.rom` 是否存在 | -| `insmod: ERROR: could not load module` | 内核版本不匹配 | 使用匹配的内核重建磁盘镜像 | -| `cosim-gpu-setup.service` 失败 | 检查 `journalctl -u cosim-gpu-setup` | 确认磁盘镜像中 ROM 文件和模块存在 | -| MMIO 读取全部返回零 | gem5 未连接或已崩溃 | 检查 `docker logs gem5-cosim` | -| `probe failed with error -12` | BAR 布局不匹配 | 使用正确的 BAR5=MMIO 布局重建 QEMU | -| gem5 因 `schedule()` 断言崩溃 | 定时器事件溢出 | 确保设置了 `disable_rtc_events` 和 `disable_timer_events` | - -## 文件位置(客户机磁盘镜像内部) - -| 文件 | 路径 | 来源 | -|------|------|--------| -| VGA BIOS ROM | `/root/roms/mi300.rom` | 由 Packer 构建 | -| IP Discovery 固件 | `/usr/lib/firmware/amdgpu/mi300_discovery` | 由 Packer 构建 | -| 自动加载服务 | `/etc/systemd/system/cosim-gpu-setup.service` | 通过 `guestmount` 安装 | -| 自动加载脚本 | `/usr/local/bin/cosim-gpu-setup.sh` | 通过 `guestmount` 安装 | -| amdgpu 模块 | `/lib/modules/$(uname -r)/updates/dkms/amdgpu.ko.zst` | ROCm 7.0 DKMS | diff --git a/docs/zh/cosim-memory-architecture.md b/docs/zh/cosim-memory-architecture.md deleted file mode 100644 index 33d7268..0000000 --- a/docs/zh/cosim-memory-architecture.md +++ /dev/null @@ -1,405 +0,0 @@ -[English](../en/cosim-memory-architecture.md) - -# QEMU+gem5 协同仿真:内存共享架构详解 - -## 1. 
问题背景 - -在 QEMU+gem5 MI300X 协同仿真中,GPU 设备模型(gem5)和宿主系统(QEMU/KVM)运行在两个独立进程中。GPU 需要访问两类内存: - -- **VRAM**(本地显存):GPU 私有,存放纹理、buffer、GART 页表等 -- **GTT**(Graphics Translation Table / System Memory):宿主物理内存中被 GPU 映射的区域,用于 ring buffer、fence、IH cookie、DMA 缓冲等 - -这两类内存都必须在 QEMU 和 gem5 之间共享,否则 gem5 无法读取驱动写入的命令,QEMU 无法看到 GPU 写回的结果。 - -### 核心结论 - -> **VRAM 和 Guest RAM(GTT 页所在的宿主内存)都已通过共享内存实现双向可见。** -> GART 页表本身存放在 VRAM 中,也是共享的。gem5 直接从共享 VRAM 读取 GART PTE,然后通过 Ruby 内存系统直接访问 Guest RAM 共享内存完成 DMA 操作。 - -## 2. 总体架构 - -``` -+----------------------------+ +-----------------------------+ -| QEMU (Q35 + KVM) | | gem5 (Docker) | -| | | | -| Guest Linux | | MI300X GPU Model | -| amdgpu driver | | Shader / CU / SDMA | -| | | PM4 / IH / Ruby caches | -| +--------+ +---------+ | vfio-user (Unix) | +------------+ +--------+ | -| | BAR0 | | BAR5 |<---(MMIO/CFG/Doorbell)--->|MI300XVfio | |GPU core| | -| | (VRAM) | | (MMIO) | | | |User bridge | | | | -| +---+----+ +---------+ | | +-----+------+ +--------+ | -| | | | | | -+------+---------------------+ +--------+--------------------+ - | | - v v - /dev/shm/mi300x-vram (16 GiB) mmap 同一文件 - (VRAM: GPU 数据 + GART 页表) (vramShmemPtr) - | | - v v - /dev/shm/cosim-guest-ram (8 GiB) mmap 同一文件 - (Guest RAM: ring buffer, fence, (system->getPhysMem()) - GTT 页面, 内核/用户数据) -``` - -### 2.1 三个共享通道 - -| 通道 | 文件/Socket | 大小 | 用途 | 访问方式 | -|------|-----------|------|------|---------| -| VRAM 共享内存 | `/dev/shm/mi300x-vram` | 16 GiB | GPU 显存 + GART 页表 | mmap(零拷贝) | -| Guest RAM 共享内存 | `/dev/shm/cosim-guest-ram` | 8 GiB | 宿主物理内存(GTT 页面) | QEMU: mmap; gem5: Ruby 内存系统直接访问共享后端 | -| vfio-user Socket | `/tmp/gem5-mi300x.sock` | — | MMIO/配置空间/Doorbell 通过 vfio-user 消息传递;DMA 通过 `vfu_dma_transfer()` 或共享内存直接访问;中断通过 irq_fd(eventfd -> KVM) | vfio-user 协议(单连接) | - -## 3. 
VRAM 共享(BAR0) - -### 3.1 初始化流程 - -**QEMU 侧**(vfio-user 后端): - -QEMU 使用内置的 `vfio-user-pci` 设备连接 gem5 的 vfio-user 服务端。BAR0 通过 vfio-user DMA 区域映射机制暴露给 QEMU,QEMU 不再直接打开 VRAM 共享内存文件,而是通过 vfio-user 协议获取 BAR 映射。 - -**gem5 侧** (`mi300x_vfio_user.cc:setupVramShm`): - -```cpp -shmemFd = shm_open(shmemPath.c_str(), O_CREAT | O_RDWR, 0666); -ftruncate(shmemFd, vramSize); -shmemPtr = mmap(nullptr, vramSize, PROT_READ | PROT_WRITE, MAP_SHARED, shmemFd, 0); - -// Key: pass the shared pointer to the GART translator -gpuDevice->getVM().vramShmemPtr = (uint8_t *)shmemPtr; -gpuDevice->getVM().vramShmemSize = vramSize; -``` - -### 3.2 VRAM 内容布局 - -``` -Offset 0x000000000 +------------------------------+ - | GPU Data Area | - | - hipMalloc allocations | - | - Kernel args, textures | - | - Driver internal allocs | - | | - | ... | - | | -Offset ~0x3EE600000 +------------------------------+ -(ptBase) | GART Page Table (PTEs) | - | 8 bytes per PTE | - | Maps GPU VA -> phys addr | -Offset 0x400000000 +------------------------------+ -(16 GiB) -``` - -### 3.3 访问模式 - -| 场景 | 写入方 | 读取方 | 路径 | -|------|--------|--------|------| -| GPU buffer 分配 | 驱动(via BAR0 write) | gem5(via vramShmemPtr) | 共享内存直接访问 | -| GART PTE 写入 | 驱动(via BAR0 write) | gem5 GART 翻译器 | memcpy from vramShmemPtr | -| IP Discovery 表 | gem5 初始化 | 驱动(via BAR0 read) | 共享内存直接访问 | - -**零拷贝**:由于 QEMU BAR0 和 gem5 的 `vramShmemPtr` 映射的是同一个 `/dev/shm` 文件,驱动写入 BAR0 的数据对 gem5 **立即可见**,无需任何 socket 通信。 - -## 4. 
Guest RAM 共享(GTT 页面) - -### 4.1 GTT 的本质 - -在 AMD GPU 中,**GTT = GART = Graphics Address Remapping Table**。它是一个单级页表(VMID 0),将 GPU 虚拟地址映射到宿主物理地址。被映射的宿主物理内存页面就是所谓的"GTT 页面"。 - -典型的 GTT 页面内容: - -| 数据结构 | 说明 | 访问方向 | -|---------|------|---------| -| PM4 Ring Buffer | GFX 命令队列 | 驱动写 → GPU 读 | -| SDMA Ring Buffer | DMA 命令队列 | 驱动写 → GPU 读 | -| IH Ring Buffer | 中断处理队列 | GPU 写 → 驱动读 | -| Fence 值 | 完成信号 | GPU 写 → 驱动读 | -| MQD (Map Queue Descriptor) | 队列描述符 | 驱动写 → GPU 读 | -| 用户 DMA 缓冲 | hipMemcpy 源/目标 | 双向 | - -### 4.2 Guest RAM 共享初始化 - -**QEMU 侧**(命令行参数): - -```bash --object memory-backend-file,id=mem0,size=8G,\ - mem-path=/dev/shm/cosim-guest-ram,share=on --numa node,memdev=mem0 -``` - -`share=on` 确保文件映射使用 `MAP_SHARED`,其他进程可以看到 QEMU 对 guest 内存的修改。 - -**gem5 侧** (`mi300_cosim.py`): - -```python -system.shared_backstore = args.shmem_host_path # "/cosim-guest-ram" -system.auto_unlink_shared_backstore = True -system.memories[0].shared_backstore = args.shmem_host_path -``` - -gem5 的 `PhysicalMemory` 使用同一个 POSIX 共享内存文件作为后端,实现与 QEMU 的内存共享。`MI300XVfioUser` 同样通过 `gpuDevice->getVM().vramShmemPtr` 设置 VRAM 指针,使 GART 翻译器能正确访问共享 VRAM。 - -### 4.3 为什么 GTT 不需要额外的共享机制 - -GTT 页面存在于 Guest RAM 中。Guest RAM 已经通过 `/dev/shm/cosim-guest-ram` 在 QEMU 和 gem5 之间共享。因此: - -1. **驱动写入 ring buffer** → 写入 Guest RAM → `/dev/shm/cosim-guest-ram` → gem5 可读 -2. **gem5 写入 fence** → 通过 Ruby 内存控制器写入 Guest RAM → `/dev/shm/cosim-guest-ram` → 驱动可读 -3. **GART PTE 指向的物理地址** → 就是 Guest RAM 中的偏移 → 双方都能访问 - -**vfio-user 后端**:VRAM 和 Guest RAM 均通过 mmap 零拷贝访问。gem5 的 SDMA/PM4 DMA 操作通过 Ruby 内存系统直接访问共享后端内存,无需 socket 中转。 - -## 5. 
GART 翻译流程 - -### 5.1 驱动写入 GART PTE - -``` -amdgpu driver (guest) - | - +- amdgpu_gart_map(): compute PTE value - | pte = (phys_addr >> 12) << 12 | flags - | - +- write to BAR0 + ptBase + (gpu_page * 8) - | | - | +- QEMU BAR0 = mmap of /dev/shm/mi300x-vram - | +- data immediately appears in shared memory - | - +- TLB invalidate: write VM_INVALIDATE_ENG17 register - +- MMIO -> vfio-user -> gem5 -> invalidateTLBs() -``` - -### 5.2 gem5 读取 GART PTE - -```cpp -// amdgpu_vm.cc: GARTTranslationGen::translate() - -// Step 1: compute PTE offset within VRAM -gart_addr = bits(transformedAddr, 63, 12); // GPU VA page number -pte_table_offset = gart_addr - (ptStart * 8); - -// Step 2: read PTE directly from shared VRAM (zero-copy) -pte_vram_offset = gartBase() + pte_table_offset; -memcpy(&pte, vramShmemPtr + pte_vram_offset, sizeof(uint64_t)); - -// Step 3: extract physical address -if (pte != 0) { - paddr = (bits(pte, 47, 12) << 12) | bits(vaddr, 11, 0); - // paddr points to Guest RAM (GTT page) or VRAM -} -``` - -### 5.3 PTE 格式 - -``` -63 52 51 48 47 12 11 6 5 2 1 0 -+-------+------+-----------------+------+----+---+---+ -| Flags | BlkF | Physical Page | Rsvd |Frag|Sys| V | -| | | (PA >> 12) | | | | | -+-------+------+-----------------+------+----+---+---+ - -Bit 0: Valid -- PTE is valid -Bit 1: System -- 1=system memory (Guest RAM), 0=local VRAM -Bit 47:12 -- physical page number -``` - -### 5.4 地址分类 - -GART 翻译得到物理地址后,gem5 需要判断该地址指向哪里: - -``` -Physical address paddr - | - +- Within fbBase ~ fbTop range? - | +- YES -> VRAM address - | +- Access directly via vramShmemPtr (zero-copy) - | - +- Within sysAddrL ~ sysAddrH range? - | +- YES -> Guest RAM address (GTT page) - | +- Access via Ruby memory system (shared memory direct access) - | - +- Neither? - +- Sink (paddr=0, safely discarded) -``` - -## 6. 
DMA 流程 - -### 6.1 vfio-user 后端:共享内存直接访问 - -在 vfio-user 后端下,gem5 通过 Ruby 内存系统直接访问 Guest RAM 共享后端(`/dev/shm/cosim-guest-ram`),无需经过 socket 进行 DMA 操作。 - -``` -gem5 GPU model (PM4/SDMA/IH) - | - | Needs to read ring buffer commands / write fence values - | - v Ruby memory system request - | - +- Address translated by GART -> Guest physical address - | - +- Ruby memory controller accesses PhysicalMemory - | | - | +- PhysicalMemory backed by /dev/shm/cosim-guest-ram (MAP_SHARED) - | +- read/write directly hits shared memory - | +- QEMU sees changes immediately (same mmap file) - | - +- Done (no socket round-trip needed) -``` - -**关键优势**: - -- **零拷贝**:DMA 读写直接操作共享内存,无需序列化/反序列化 -- **低延迟**:省去了 socket 请求-响应的往返开销 -- **简化架构**:无需自定义 DMA 协议,Ruby 内存系统天然支持共享后端 - -中断通过 vfio-user 的 irq_fd 机制传递(eventfd -> KVM),无需自定义中断消息。 - -### 6.2 Legacy 后端:Socket DMA 协议 - -> 以下描述适用于旧版自定义 cosim socket 后端(`MI300XGem5Cosim`),保留作为参考。 - -#### 6.2.1 gem5 读取 Guest RAM(读 ring buffer / fence) - -``` -gem5 GPU model (PM4/SDMA/IH) - | - | Needs to read ring buffer commands from Guest RAM - | - v cosimBridge->sendDmaRead(guestPhysAddr, length) - | - +- Construct DmaRead message (32-byte header) - | { type=DmaRead, addr=guestPhysAddr, data=length } - | - +- sendAll(eventFd, &msg, 32) --> QEMU event thread - | | - | +- pci_dma_read(addr, buf, len) - | | (reads from /dev/shm/cosim-guest-ram) - | | - | +- sendAll(eventFd, &resp, 32) - | <------------------------------------------+- sendAll(eventFd, data, len) - | - +- memcpy(dest, recvBuf, length) // data arrives at gem5 -``` - -#### 6.2.2 gem5 写入 Guest RAM(写 fence / IH cookie) - -``` -gem5 GPU model - | - | Needs to write fence value to Guest RAM - | - v cosimBridge->sendDmaWrite(guestPhysAddr, length, data) - | - +- Construct DmaWrite message + data payload - | { type=DmaWrite, addr=guestPhysAddr, data=length, size=length } - | - +- sendAll(eventFd, &msg, 32) --> QEMU event thread - +- sendAll(eventFd, data, length) --> | - | +- pci_dma_write(addr, buf, len) - | 
| (writes to /dev/shm/cosim-guest-ram) - | | - +- Done (DMA writes don't wait for response) +- Driver can see data immediately -``` - -### 6.3 vfio-user 与 Legacy 后端的对比 - -| 维度 | vfio-user 后端 | Legacy Socket 后端 | -|------|---------------|-------------------| -| Guest RAM DMA | Ruby 内存系统直接访问共享后端 | Socket 请求-响应协议 | -| VRAM 访问 | mmap 零拷贝(相同) | mmap 零拷贝(相同) | -| 中断 | irq_fd(eventfd -> KVM) | Socket 消息 | -| MMIO | vfio-user 消息传递 | 自定义 socket 协议 | -| QEMU 侧设备 | 内置 `vfio-user-pci` | 自定义 `mi300x_gem5.c` | -| 地址翻译 | gem5 内部 GART 翻译 | QEMU 端 `pci_dma_read/write` | - -Legacy 后端的 Guest RAM DMA 走 socket 的原因(地址翻译、事件驱动、IOMMU 兼容等)在 vfio-user 后端下不再适用:gem5 的 Ruby 内存控制器直接访问共享后端内存,GART 地址翻译在 gem5 内部完成。 - -## 7. Sink 机制 - -### 7.1 问题场景 - -在协同仿真模式下,部分 GART PTE 可能为零(未初始化)或指向 VRAM 内部地址。如果 gem5 无法翻译这些地址,会抛出 `GenericPageTableFault`,导致 DMA 重试循环直至仿真挂死。 - -### 7.2 解决方案 - -```cpp -// amdgpu_vm.cc: GARTTranslationGen::translate() - -if (pte == 0) { - if (origAddr < vramShmemSize && vramShmemPtr) { - // VRAM address -> map to sink (paddr=0) - range.paddr = 0; - warn_once("GART: VRAM address mapped to sink — " - "VRAM write-backs are no-ops in cosim"); - } else if (vramShmemPtr) { - // Unmapped GART page -> sink - range.paddr = 0; - warn_once("GART cosim: unmapped page -> sink"); - } -} -``` - -**Sink 的语义**: -- `paddr=0` 是 gem5 中始终有效的物理地址(系统 RAM 基址) -- DMA 读取返回零 -- DMA 写入被静默丢弃 -- 避免了 fault → retry 死循环 - -## 8. 完整的数据流示例 - -以 HIP kernel dispatch 为例,展示完整的内存交互: - -``` -1. hipMalloc(&d_a, N*sizeof(int)) - Driver -> allocates buffer in VRAM - Writes GART PTEs to shared VRAM (BAR0) - -2. 
hipMemcpy(d_a, h_a, N*sizeof(int), hipMemcpyHostToDevice) - Driver -> constructs SDMA copy command -> writes to Guest RAM (ring buffer) - Driver -> writes Doorbell -> QEMU BAR2 -> vfio-user -> gem5 - gem5 -> reads ring buffer (Guest RAM via shared memory) - gem5 -> parses SDMA command -> GART translates source address -> Guest RAM - gem5 -> reads source data (Guest RAM via shared memory) - gem5 -> writes to VRAM destination (shared memory direct write) - -3. kernel<<<1, N>>>(d_a, d_b, d_c, N) - Driver -> constructs PM4 dispatch command -> writes to Guest RAM (ring buffer) - Driver -> writes Doorbell -> gem5 - gem5 -> reads PM4 command (Guest RAM via shared memory) - gem5 -> launches shader execution - gem5 -> shader reads/writes VRAM (shared memory direct access) - gem5 -> writes fence on completion (Guest RAM via Ruby memory write) - gem5 -> sends MSI-X interrupt (irq_fd -> KVM) - -4. hipDeviceSynchronize() - Driver -> polls fence value (until Guest RAM value matches) - +- fence written by gem5 via Ruby memory write to shared backstore -``` - -## 9. 已知限制 - -### 9.1 DMA 缓冲大小(Legacy 后端) - -> 此限制仅适用于旧版 socket 后端。vfio-user 后端通过共享内存直接访问,无此限制。 - -单次 DMA 最大 4 MiB(`COSIM_DMA_BUF_SIZE`)。超过此大小的传输需要分块。实际场景中驱动通常以页为单位提交,不会触及此限制。 - -### 9.2 User-space 页表(VMID > 0) - -VMID 0 (kernel mode) 的 GART 页表通过共享 VRAM 完全可见。但 VMID > 0 (user mode) 的多级页表由 `VegaISA::Walker` 遍历,它使用 gem5 内部的 TLB/page walker,而非直接从共享内存读取。 - -实际影响有限:驱动写入页表后会发送 TLB invalidate MMIO,gem5 收到后刷新 TLB,下次 walker 遍历时会从正确的物理地址读取(该地址指向共享 VRAM 或 Guest RAM)。 - -### 9.3 VRAM 写回语义 - -gem5 中某些 GART 地址指向 VRAM 本身(VRAM-to-VRAM DMA)。这些地址被路由到 sink(paddr=0),写入被静默丢弃。对于纯计算场景,这不影响正确性。 - -## 10. 
文件参考 - -| 文件 | 关键函数/区域 | 角色 | -|------|-------------|------| -| `gem5/src/dev/amdgpu/amdgpu_vm.cc:396-557` | `GARTTranslationGen::translate()` | GART 翻译核心逻辑 | -| `gem5/src/dev/amdgpu/amdgpu_vm.hh` | `AMDGPUSysVMContext`, `vramShmemPtr` | GART 数据结构 | -| `gem5/src/dev/amdgpu/mi300x_vfio_user.cc` | `setupVramShm()` | VRAM 共享内存初始化(vfio-user 后端) | -| `gem5/src/dev/amdgpu/mi300x_vfio_user.hh` | `MI300XVfioUser` | vfio-user 服务端桥接 | -| `gem5/src/dev/amdgpu/mi300x_gem5_cosim.cc` | `setupSharedMemory()`, `sendDmaRead/Write()` | Legacy socket 后端(VRAM 初始化 + DMA) | -| `gem5/configs/example/gpufs/mi300_cosim.py` | `shared_backstore` 配置, `--cosim-backend` | Guest RAM 共享设置 + 后端选择 | -| `gem5/src/dev/amdgpu/MI300XVfioUser.py` | SimObject 定义 | vfio-user 后端 Python 绑定 | diff --git a/docs/zh/cosim-technical-notes.md b/docs/zh/cosim-technical-notes.md deleted file mode 100644 index d1ad95b..0000000 --- a/docs/zh/cosim-technical-notes.md +++ /dev/null @@ -1,352 +0,0 @@ -[English](../en/cosim-technical-notes.md) - -# QEMU + gem5 MI300X 协同仿真:技术笔记 - -本文档总结了 QEMU + gem5 MI300X 协同仿真系统的架构、实现细节、已解决的问题和已知限制。 - -## 1. 
架构概述 - -``` -+--------------------------------------+ -| QEMU (Q35 + KVM) | -| +--------------------------------+ | -| | Guest Linux (Ubuntu 24) | | -| | amdgpu driver (ROCm 7) | | -| | ROCm userspace | | -| +--------------+-----------------+ | -| | MMIO / Doorbell | -| +--------------v-----------------+ | -| | vfio-user-pci | | -| | (QEMU built-in device) | | -| +--------------+-----------------+ | -| | vfio-user protocol | -+-----------------+--------------------+ - | /tmp/gem5-mi300x.sock - | (Unix socket) -+-----------------+--------------------+ -| gem5 | | -| +--------------v-----------------+ | -| | MI300XVfioUser | | -| | (mi300x_vfio_user.cc) | | -| | [libvfio-user server] | | -| +--------------+-----------------+ | -| | AMDGPUDevice API | -| +--------------v-----------------+ | -| | AMDGPUDevice | | -| | PM4PacketProcessor | | -| | SDMAEngine | | -| | Shader / CU array | | -| +--------------------------------+ | -+--------------------------------------+ - -Shared Memory: - /dev/shm/cosim-guest-ram Guest physical RAM (QEMU <-> gem5 DMA) - /dev/shm/mi300x-vram GPU VRAM (QEMU BAR0 <-> gem5 device memory) -``` - -> **注意**:旧版后端(`mi300x-gem5` QEMU 设备 + `MI300XGem5Cosim` gem5 桥接)仍然可用,可通过 `--cosim-backend=legacy` 选择。vfio-user 后端是当前默认选项。 - -### 关键组件 - -| 组件 | 位置 | 作用 | -|---|---|---| -| `MI300XVfioUser` | `src/dev/amdgpu/mi300x_vfio_user.{cc,hh}` | gem5 vfio-user 服务端;通过 libvfio-user 处理 BAR 访问和中断(**默认后端**) | -| `vfio-user-pci` | QEMU 内建设备 | QEMU 侧 vfio-user 客户端;无需自定义 QEMU 代码 | -| `CosimBridge` | `src/dev/amdgpu/cosim_bridge.hh` | 抽象协同仿真桥接接口,vfio-user 和 legacy 后端均实现此接口 | -| `MI300XGem5Cosim` | `src/dev/amdgpu/mi300x_gem5_cosim.{cc,hh}` | 旧版 socket 桥接 SimObject(**legacy 后端**) | -| `mi300x_gem5.c` | `qemu/hw/misc/`(legacy) | 旧版 QEMU PCI 设备;通过自定义 socket 协议转发 MMIO/doorbell(**legacy 后端**) | -| `mi300_cosim.py` | `configs/example/gpufs/` | gem5 配置;通过 `--cosim-backend=vfio-user\|legacy` 选择后端 | -| `cosim_launch.sh` | `scripts/` | 编排 Docker (gem5) + QEMU 的启动流程 | - -### PCI 
BAR 布局 - -``` -BAR0+1 VRAM 64-bit prefetchable 16 GiB (shared memory) -BAR2+3 Doorbell 64-bit 4 MiB -BAR4 MSI-X exclusive -BAR5 MMIO regs 32-bit 512 KiB (forwarded to gem5) -``` - -此布局**必须**与 amdgpu 驱动中硬编码的预期一致(`AMDGPU_VRAM_BAR=0`、`AMDGPU_DOORBELL_BAR=2`、`AMDGPU_MMIO_BAR=5`)。 - -## 2. 已解决的问题(踩坑日志) - -### 2.1 共享内存文件偏移量不匹配(严重) - -**现象**:GART 页表项读出全为零;PM4 opcode 0x0(NOP,count 为 0)无限重复。 - -**根因**:QEMU Q35 和 gem5 对 4G 以下/4G 以上的内存拆分方式不一致,导致共享后备存储中的文件偏移量不同。 - -- QEMU Q35 配置 8 GiB RAM 时:`below_4g = 2 GiB`(当 `ram_size >= 0xB0000000` 时硬编码)。参见 `qemu/hw/i386/pc_q35.c:161`。 -- gem5 配置为 3 GiB 以下 / 5 GiB 以上。 -- QEMU 将 4G 以上数据放在文件偏移 2 GiB 处;gem5 从偏移 3 GiB 处读取 → 全为零。 - -**修复**:`mi300_cosim.py` 复刻了 Q35 的拆分逻辑: - -```python -total_mem = convert.toMemorySize(args.mem_size) -lowmem_limit = 0x80000000 if total_mem >= 0xB0000000 else 0xB0000000 -below_4g = min(total_mem, lowmem_limit) -above_4g = total_mem - below_4g -``` - -**关键教训**:当两个系统共享 memory-backend-file 时,它们必须在每个范围的文件偏移量上达成一致,而不仅仅是总大小。 - -### 2.2 SIGIO 边沿触发排空问题(严重,legacy 后端) - -**现象**:gem5 处理完第一条 MMIO 消息后永远挂起。QEMU 的 socket 缓冲区被填满。 - -**根因**:gem5 的 `PollQueue` 使用 `FASYNC`/`SIGIO`,这是**边沿触发**的。如果在处理第一条消息之前有多条消息到达,只会触发一次 `SIGIO`。处理完一条消息后,剩余的消息留在 socket 缓冲区中,没有信号唤醒 gem5。 - -**修复**:`mi300x_gem5_cosim.cc:handleClientData()` 使用 `do/while` 循环配合 `poll(fd, POLLIN, 0)` 来排空每次 SIGIO 到来时**所有**待处理的消息。 - -```cpp -do { - // read and process one message - ... - struct pollfd pfd = {fd, POLLIN, 0}; -} while (poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN)); -``` - -> **注意**:此问题仅影响 legacy 后端。vfio-user 后端使用 libvfio-user 的非阻塞 poll 机制,不依赖 SIGIO 信号。 - -### 2.3 VRAM 地址 GART 翻译错误(严重) - -**现象**:地址 `0x1f72fa8000` 产生 861,000 多次 GART 翻译错误,内存耗尽,段错误。 - -**根因**:SDMA rptr 回写地址和 PM4 RELEASE_MEM 目标地址可能指向 VRAM(地址 < 16 GiB)。这些地址经过 `getGARTAddr()` 处理时会将页号乘以 8,然后 GART 翻译失败,因为 VRAM 地址没有对应的页表项。 - -**修复(三层防护)**: - -1. 
**PM4 层**(`pm4_packet_processor.cc`):`writeData()`、`releaseMem()`、`queryStatus()` 检查 `isVRAMAddress(addr)`,将 VRAM 写操作通过 `gpuDevice->getMemMgr()->writeRequest()`(设备内存)路由,而非 `dmaWriteVirt()`(通过 GART 的系统内存)。 - -2. **SDMA 层**(`sdma_engine.cc`):`setGfxRptrLo/Hi()` 和 rptr 回写对 VRAM 地址跳过 `getGARTAddr()`,改用 `getMemMgr()->writeRequest()`。 - -3. **GART 兜底**(`amdgpu_vm.cc`):`GARTTranslationGen::translate()` 通过逆向 `getGARTAddr` 变换(`orig_page = page_num >> 3`)检测 VRAM 地址,并将其映射到 `paddr=0` 作为 sink,而非产生 fault。 - -### 2.4 协同仿真模式下的定时器溢出 - -**现象**:经过数十亿 tick 后,gem5 因 `curTick()` 整数溢出而崩溃(RTC 和 PIT 定时器持续调度事件)。 - -**修复**:为 `Cmos` 添加了 `disable_rtc_events` 参数,为 `I8254` 添加了 `disable_timer_events` 参数。在 `mi300_cosim.py` 中均设为禁用。`MI300XGem5Cosim` 中的 keepalive 事件防止事件队列变空。 - -### 2.5 PSP / SMU 固件加载失败 - -**现象**:使用 `ip_block_mask=0x6f` 执行 `modprobe amdgpu` 时,在 PSP 固件加载阶段出现 `-EINVAL` 错误。 - -**根因**:在 ROCm 7.0 的 `amdgpu_discovery.c` 中,IP block 枚举顺序为: -``` -0: soc15_common 1: gmc_v9_0 2: vega20_ih -3: psp 4: smu 5: gfx_v9_4_3 -6: sdma_v4_4_2 7: vcn_v4_0_3 8: jpeg_v4_0_3 -``` - -`ip_block_mask=0x6f` = `0b01101111` 禁用了 bit 4(SMU)但**没有**禁用 bit 3(PSP)。应使用 `ip_block_mask=0x67` = `0b01100111` 来同时禁用 PSP(bit 3)和 SMU(bit 4)。 - -### 2.6 QEMU 串口控制台与 `-nographic` 的冲突 - -**现象**:同时使用 `-serial unix:/tmp/serial.sock -nographic` 时,guest 没有串口输出。 - -**根因**:`-nographic` 隐含了 `-serial mon:stdio`,它创建了映射到 stdio 的 serial0。显式的 `-serial unix:...` 变成了 serial1(ttyS1),但 kernel 使用的是 `console=ttyS0`。 - -**修复**:单独使用 `-nographic`(串口输出到 stdio)。如需程序化访问,在 `screen` 中运行 QEMU: -```bash -screen -dmS qemu-cosim -L -Logfile /tmp/log -screen -S qemu-cosim -X stuff 'command\n' -``` - -### 2.7 不支持的 PM4 操作码 - -| 操作码 | 名称 | 说明 | 修复方式 | -|--------|------|------|----------| -| `0x58` | `ACQUIRE_MEM` | 内存屏障 / 缓存刷新 | NOP(跳过包体) | -| `0xA0` | `SET_RESOURCES` | 队列资源配置 | NOP(跳过包体) | - -两者均已添加到 `pm4_defines.hh` 中,并在 `pm4_packet_processor.cc:decodeHeader()` 中作为跳过并继续处理。 - -### 2.8 链接时内存不足(OOM) - -**现象**:即使使用 `-j2`,链接器也被 OOM killer 终止。 - -**修复**:使用 gold 
链接器并限制单任务: -```bash -scons build/VEGA_X86/gem5.opt -j1 GOLD_LINKER=True --linker=gold -``` - -### 2.9 PCI Class Code - -**现象**:amdgpu 驱动跳过了 `0xC0000` 处的 legacy VGA ROM 检查。 - -**修复**:将 PCI class 从 `PCI_CLASS_DISPLAY_OTHER (0x0380)` 改为 `PCI_CLASS_DISPLAY_VGA (0x0300)`。使用 VGA class 后,kernel 自动检测为"带有 shadowed ROM 的视频设备"。 - -### 2.10 GART 未映射页崩溃(严重) - -**现象**:HIP 程序输出 `hipMalloc OK` 后,gem5 段错误,伴随重复的 `GART translation for 0x3fff800000000 not found` 警告。 - -**根因**:GPU 的 PM4/SDMA 引擎尝试 DMA 到驱动尚未映射的 GART 页(共享 VRAM 中 PTE = 0)。原始代码创建了 `GenericPageTableFault`,但 DMA 回调链无限重试同一个失败地址,耗尽内存并崩溃。 - -**修复**:在协同仿真模式下,将未映射的 GART 页映射到 sink(`paddr=0`)而非产生 fault。DMA 读操作返回零,写操作被丢弃,但仿真保持存活。GART sink 诊断信息还会记录 `fbBase` 以辅助调试。 - -**关键发现**:共享 VRAM 中 `gartBase`(= `ptBase`)处的 GART PTE 已被驱动正确填充。诊断信息确认后续的 PTE(偏移 0x32E0+)包含有效条目,而第一页(ptStart 本身)只是未映射——这是正常现象。 - -### 2.11 SDMA Ring 测试超时 - -**现象**:驱动初始化期间,SDMA ring 测试返回 -110(ETIMEDOUT)。 - -**根因**:`sdma_engine.hh` 中 `sdma_delay = 1e9` 导致每次 SDMA 处理步骤消耗 10 亿仿真 tick。结合 keepalive 驱动的事件循环,SDMA 完成的实际墙钟时间约 500ms,超过了驱动端约 200ms 的超时限制。 - -**修复**:将 `sdma_delay` 从 `1e9` 降低到 `1000`,同时将 `KEEPALIVE_INTERVAL` 增加到 `1e9`。这大幅缩短了 SDMA 操作的墙钟延迟,使 ring 测试能在驱动超时窗口内完成。 - -## 3. 当前状态 - -### 已实现的功能 - -- **vfio-user 后端(默认)**:QEMU 使用内建 `vfio-user-pci` 设备,gem5 运行 `MI300XVfioUser` 作为 vfio-user 服务端。无需自定义 QEMU 代码,原生 QEMU 10.0+ 即可使用 -- **驱动初始化**:amdgpu 3.64.0 完整加载 - - 从固件文件进行 IP discovery(`discovery=2`) - - GMC(内存控制器)、GFX(计算)、SDMA、IH(中断处理器) - - 8 个 KIQ ring 已映射(mec 2 pipe 1 q 0) - - 4 个 SDMA 引擎 × 4 队列 = 16 个 SDMA ring - - 跨 8 个 XCP 分区的 64 个以上 compute ring - - 7 个 DRM XCP 设备节点(`/dev/dri/renderD129..135`) - - SDMA ring 测试通过(`sdma_delay` 调优后正常完成) - - Fence 回退定时器问题已解决 -- **ROCm 工具**: - - `rocm-smi`:设备 0x74a0,SPX 分区,1% VRAM - - `rocminfo`:Agent gfx942,320 CU,4 SIMD/CU,KERNEL_DISPATCH -- **KFD**(Kernel Fusion Driver):节点已添加,16383 MB VRAM,HSA agent 已注册 -- **GPU 计算(HIP)**:完全可用! 
- - `hipMalloc` / `hipMemcpy`(host-to-device、device-to-host) - - Kernel dispatch(`addKernel<<<1, N>>>`)运行在 gfx942 上 - - `hipDeviceSynchronize` 返回 `hipSuccess` - - 结果验证正确:`{1+10, 2+20, 3+30, 4+40}` = `{11, 22, 33, 44}` - - 测试结果:vector_add (120ms)、transpose (6.5s)、gemm (4.7s) 均 PASSED -- **MSI-X 中断转发**:gem5 → QEMU 通过 vfio-user 协议(vfio-user 后端)或 event socket(legacy 后端) - - `AMDGPUDevice::intrPost()` → `cosimBridge->sendIrqRaise(0)` - - QEMU → guest IH 处理程序 -- **GART 翻译**:协同仿真兜底机制从共享 VRAM 读取 PTE;未映射页安全路由到 sink -- **65,000+ 次 MMIO 操作**处理无崩溃 -- **磁盘镜像**:`cosim-gpu-setup.service` 开机自动加载驱动(dd ROM → modprobe `ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2`) - -### 已知限制 - -1. **VGA BIOS ROM 必须先 dd**:`dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128` 必须在 `modprobe` 之前执行。驱动的 BIOS 发现链(ACPI ATRM/VFCT、SMU ROM 读取、Platform ROM)在 cosim 模式下全部失败。如果 `0xC0000` 处没有 ROM 数据,`atom_context` 为 NULL,`amdgpu_ras_init` 会触发空指针崩溃。 - -2. **GART 未映射页**:部分 GART 页的 PTE=0,路由到 sink。这是安全的,但意味着 DMA 到这些地址时读取到零。 - -## 4. 
文件变更总结 - -### gem5(新文件 - vfio-user 后端) -| 文件 | 说明 | -|---|---| -| `src/dev/amdgpu/mi300x_vfio_user.{cc,hh}` | vfio-user 服务端 SimObject | -| `src/dev/amdgpu/MI300XVfioUser.py` | SimObject Python 封装 | -| `src/dev/amdgpu/cosim_bridge.hh` | 抽象 CosimBridge 接口(vfio-user 和 legacy 后端均实现) | -| `ext/libvfio-user/` | libvfio-user 库(子模块) | - -### gem5(新文件 - legacy 后端) -| 文件 | 说明 | -|---|---| -| `src/dev/amdgpu/mi300x_gem5_cosim.{cc,hh}` | Socket 桥接 SimObject | -| `src/dev/amdgpu/MI300XGem5Cosim.py` | SimObject Python 封装 | - -### gem5(新文件 - 通用) -| 文件 | 说明 | -|---|---| -| `configs/example/gpufs/mi300_cosim.py` | 协同仿真系统配置(`--cosim-backend=vfio-user\|legacy`) | -| `scripts/cosim_launch.sh` | 启动编排脚本 | - -### gem5(修改的文件) -| 文件 | 变更内容 | -|---|---| -| `src/dev/amdgpu/pm4_packet_processor.{cc,hh}` | VRAM 写路由、`isVRAMAddress()`、ACQUIRE_MEM/SET_RESOURCES NOP | -| `src/dev/amdgpu/pm4_defines.hh` | 添加 `IT_ACQUIRE_MEM`、`IT_SET_RESOURCES` | -| `src/dev/amdgpu/sdma_engine.{cc,hh}` | VRAM rptr 回写路由、`sdma_delay` 调优 | -| `src/dev/amdgpu/amdgpu_vm.{cc,hh}` | GART 协同仿真兜底(共享 VRAM PTE 读取)、VRAM 地址 sink | -| `src/dev/amdgpu/amdgpu_device.cc` | 协同仿真集成钩子 | -| `src/dev/amdgpu/amdgpu_nbio.cc` | ASIC 初始化完成寄存器 | -| `src/dev/intel_8254_timer.{cc,hh}` | `disable_timer_events` 参数 | -| `src/dev/mc146818.{cc,hh}` | `disable_rtc_events` 参数 | - -### QEMU(新文件 - legacy 后端) -| 文件 | 说明 | -|---|---| -| `hw/misc/mi300x_gem5.c` | 带 socket 桥接的 MI300X PCI 设备 | -| `hw/misc/mi300x_gem5.h` | 头文件 | -| `hw/misc/trace-events` | trace 事件定义 | - -> **注意**:vfio-user 后端使用 QEMU 内建的 `vfio-user-pci` 设备,不需要任何自定义 QEMU 代码。 - -## 5. 运行方法 - -### 前置条件 -- 安装 Docker 并构建 `gem5-run:local` 镜像 -- QEMU 10.0+(原生支持 vfio-user);legacy 后端需要从 `cosim/qemu/` 编译的 QEMU -- 磁盘镜像 `x86-ubuntu-rocm70` + 内核 `vmlinux-rocm70` - -### 快速启动 -```bash -cd cosim -./scripts/cosim_launch.sh -# GPU 驱动通过 cosim-gpu-setup.service 自动加载(约 40 秒) -# Guest 启动后验证: -rocm-smi # 应显示设备 0x74a0 -rocminfo # 应显示 gfx942 -``` - -### 手动启动(用于调试) -```bash -# 1. 
在 Docker 中运行 gem5 -docker run -d --name gem5-cosim \ - -v "$PWD:/gem5" -v /tmp:/tmp -v /dev/shm:/dev/shm -w /gem5 \ - -e PYTHONPATH=/usr/lib/python3.12/lib-dynload \ - gem5-run:local build/VEGA_X86/gem5.opt \ - --debug-flags=MI300XCosim --listener-mode=on \ - configs/example/gpufs/mi300_cosim.py \ - --socket-path=/tmp/gem5-mi300x.sock \ - --shmem-path=/mi300x-vram \ - --shmem-host-path=/cosim-guest-ram \ - --dgpu-mem-size=16GiB --num-compute-units=40 --mem-size=8G - -# 2. 等待 socket 创建完成并修复权限 -docker exec gem5-cosim chmod 777 /tmp/gem5-mi300x.sock -docker exec gem5-cosim chmod 666 /dev/shm/mi300x-vram - -# 3. 在 screen 中运行 QEMU(vfio-user 后端,默认) -screen -dmS qemu-cosim -L -Logfile /tmp/qemu-cosim-screen.log \ - qemu-system-x86_64 \ - -machine q35 -enable-kvm -cpu host -m 8G -smp 4 \ - -object memory-backend-file,id=mem0,size=8G,\ - mem-path=/dev/shm/cosim-guest-ram,share=on \ - -numa node,memdev=mem0 \ - -kernel ../gem5-resources/src/x86-ubuntu-gpu-ml/vmlinux-rocm70 \ - -append "console=ttyS0,115200 root=/dev/vda1 \ - modprobe.blacklist=amdgpu earlyprintk=serial,ttyS0,115200" \ - -drive file=../gem5-resources/src/x86-ubuntu-gpu-ml/disk-image/x86-ubuntu-rocm70,\ - format=raw,if=virtio \ - -device vfio-user-pci,socket=/tmp/gem5-mi300x.sock \ - -nographic -no-reboot - -# 使用 legacy 后端时,将上面的 -device 行替换为: -# -device mi300x-gem5,gem5-socket=/tmp/gem5-mi300x.sock,\ -# shmem-path=/dev/shm/mi300x-vram,vram-size=17179869184 -# 并使用从 cosim/qemu/ 编译的 QEMU - -# 4. 手动 GPU 初始化(如果 cosim-gpu-setup.service 未安装) -screen -S qemu-cosim -X stuff 'dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128\n' -screen -S qemu-cosim -X stuff 'modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2\n' -``` - -## 6. 
调试技巧 - -- **gem5 调试标志**:`--debug-flags=MI300XCosim,AMDGPUDevice,PM4PacketProcessor` -- **QEMU trace**:`--qemu-trace 'mi300x_gem5_*'` -- **检查 gem5 日志**:`docker logs gem5-cosim 2>&1 | grep -E "warn|error|GART"` -- **检查 guest dmesg**:`screen -S qemu-cosim -X stuff 'dmesg | tail -20\n'` -- **增量重建**:删除过期的 `.o` 文件,使用 gold 链接器重建: - ```bash - docker run --rm -v "$PWD:/gem5" -w /gem5 gem5-run:local \ - sh -c 'rm -f build/VEGA_X86/dev/amdgpu/.o' - docker run --rm -v "$PWD:/gem5" -w /gem5 \ - gem5-run:local scons build/VEGA_X86/gem5.opt -j1 - ``` diff --git a/docs/zh/cosim-usage-guide.md b/docs/zh/cosim-usage-guide.md deleted file mode 100644 index e91924f..0000000 --- a/docs/zh/cosim-usage-guide.md +++ /dev/null @@ -1,579 +0,0 @@ -[English](../en/cosim-usage-guide.md) - -# QEMU + gem5 MI300X 联合仿真使用指南 - -从编译到运行 HIP GPU 计算的完整流程。 - -## 架构概述 - -``` -+---------------------------------+ +------------------------------+ -| QEMU (Q35 + KVM) | | gem5 (Docker 容器内) | -| +---------------------------+ | | +------------------------+ | -| | Guest Linux (Ubuntu 24.04)| | | | MI300X GPU 模型 | | -| | amdgpu 驱动 | | | | - Shader + CU | | -| | ROCm 7.0 / HIP 运行时 | | | | - PM4 / SDMA 引擎 | | -| +-----------+---------------+ | | | - Ruby 缓存层次 | | -| | MMIO/Doorbell | | +----------+-------------+ | -| +-----------v---------------+ | | +----------v-------------+ | -| | vfio-user-pci (built-in) |<--------->| MI300XVfioUser Server | | -| +---------------------------+ |vfio-| +------------------------+ | -| |user | | -+---------------------------------+ +------------------------------+ - | | - v v - /dev/shm/cosim-guest-ram /dev/shm/mi300x-vram - (Guest 物理内存, 共享) (GPU VRAM, 共享) -``` - -- **QEMU** 负责:CPU 执行、Linux 内核引导、PCIe 枚举、amdgpu 驱动加载 -- **gem5** 负责:MI300X GPU 计算模型(Shader、CU、缓存、DMA 引擎) -- 两者通过 **vfio-user 协议**(基于 Unix 域套接字)通信。QEMU 使用内置的 `vfio-user-pci` 设备,gem5 端运行 `MI300XVfioUser` 作为 vfio-user 服务端,透明处理 MMIO / Doorbell / PCI Config 访问。数据通过 **共享内存** 共享 - -## 前置条件 - -| 需求 | 说明 | -|---|---| -| 宿主机系统 | 
Linux x86_64,支持 KVM(已在 WSL2 6.6.x 验证) | -| Docker | 守护进程运行中,当前用户在 `docker` 组 | -| KVM | `/dev/kvm` 可访问 | -| 磁盘空间 | 至少 120 GB(55G 磁盘镜像 + 构建中间产物) | -| 内存 | 建议 16 GB 以上(gem5 编译和运行都比较占内存) | -| 工具 | `git`、`screen`、`unzip` | - -## 目录结构 - -``` -/home/zevorn/cosim/ - gem5/ # gem5 源码(cosim 分支) - build/VEGA_X86/gem5.opt # gem5 二进制 - configs/example/gpufs/ - mi300_cosim.py # cosim 配置脚本 - scripts/ - run_mi300x_fs.sh # 编排脚本 - cosim_launch.sh # cosim 一键启动脚本 - Dockerfile.run # 运行时 Docker 镜像 - gem5-resources/ # 磁盘镜像、内核、GPU 应用 - src/x86-ubuntu-gpu-ml/ - disk-image/x86-ubuntu-rocm70 # 55G raw 磁盘镜像 - vmlinux-rocm70 # 内核 - docs/ # 文档 - qemu/ # QEMU 源码(仅 legacy 后端需要) - build/qemu-system-x86_64 # QEMU 二进制 -``` - ---- - -## 第一步:编译 gem5 - -gem5 二进制链接了 Ubuntu 24.04 的库,需要在兼容环境中编译。 - -> **注意:** vfio-user 后端依赖 `libjson-c-dev`(编译时)和 `libjson-c5`(运行时)。`ghcr.io/gem5/gpu-fs:latest` 镜像已包含此依赖,无需额外安装。若在宿主机上直接编译,请先安装 `libjson-c-dev`。 - -### 方式一:Docker 内编译(推荐) - -```bash -cd /home/zevorn/cosim/gem5 - -# 使用 gpu-fs 镜像编译(amd64,包含所有依赖) -docker run --rm \ - -v "$(pwd):/gem5" -w /gem5 \ - gem5-run:local \ - scons build/VEGA_X86/gem5.opt -j4 -``` - -> **注意:** 内存不足时降低并行度(`-j1` 或 `-j2`)。使用 gold linker 可减少链接阶段内存占用。 - -### 方式二:编排脚本 - -```bash -./scripts/run_mi300x_fs.sh build-gem5 -``` - -产出:`build/VEGA_X86/gem5.opt`(约 1.1 GB)。 - -### 构建运行时 Docker 镜像 - -```bash -cd scripts -docker build -t gem5-run:local -f Dockerfile.run . 
-``` - -此镜像基于 `ghcr.io/gem5/gpu-fs`,添加了 Python 3.12 支持,用于运行 gem5。 - ---- - -## 第二步:编译 QEMU - -使用 vfio-user 后端时,**原版 QEMU 10.0+** 即可直接使用(内置 `vfio-user-pci` 设备),无需自定义 QEMU 代码。标准编译: - -```bash -# 任意 QEMU 10.0+ 源码均可 -mkdir -p qemu-build && cd qemu-build -/path/to/qemu/configure --target-list=x86_64-softmmu -make -j$(nproc) -``` - -产出:`qemu-system-x86_64`。 - -> **Legacy 后端:** 若使用 `--cosim-backend=legacy`,则需要 `cosim/qemu/` 中包含 `mi300x-gem5` 设备的源码。编译方式同上,但必须使用 cosim 分支的 QEMU 源码。 - -也可通过编排脚本: - -```bash -cd /home/zevorn/cosim/gem5 -./scripts/run_mi300x_fs.sh build-qemu -``` - ---- - -## 第三步:准备磁盘镜像和内核 - -磁盘镜像包含 Ubuntu 24.04 + ROCm 7.0 + 内核 6.8.0-79-generic 及 amdgpu DKMS 模块。 - -### 自动构建 - -```bash -./scripts/run_mi300x_fs.sh build-disk -``` - -### 手动构建 - -```bash -cd ../gem5-resources/src/x86-ubuntu-gpu-ml -./build.sh -var "qemu_path=/usr/sbin/qemu-system-x86_64" -``` - -> Arch Linux 上 QEMU 路径为 `/usr/sbin/`,其他发行版可能是 `/usr/bin/`。 - -### 产出 - -| 产物 | 路径 | 大小 | -|---|---|---| -| 磁盘镜像 | `../gem5-resources/src/x86-ubuntu-gpu-ml/disk-image/x86-ubuntu-rocm70` | 约 55 GB | -| 内核 | `../gem5-resources/src/x86-ubuntu-gpu-ml/vmlinux-rocm70` | 约 64 MB | - ---- - -## 第四步:启动 cosim - -### 方式一:一键启动脚本(推荐) - -```bash -cd /home/zevorn/cosim/gem5 -./scripts/cosim_launch.sh -``` - -此脚本会自动完成以下所有步骤(启动 gem5 容器、等待就绪、修复权限、启动 QEMU),并以交互模式进入 QEMU 串口控制台。 - -可用参数: - -```bash -./scripts/cosim_launch.sh --help -./scripts/cosim_launch.sh --gem5-debug MI300XCosim # 开启 gem5 调试输出 -./scripts/cosim_launch.sh --vram-size 32GiB # 自定义 VRAM 大小 -./scripts/cosim_launch.sh --num-cus 80 # 自定义 CU 数量 -./scripts/cosim_launch.sh --cosim-backend=vfio-user # 使用 vfio-user 后端(默认) -./scripts/cosim_launch.sh --cosim-backend=legacy # 使用 legacy 自定义套接字后端 -``` - -### 方式二:手动分步启动 - -#### 4.1 启动 gem5(Docker 容器) - -```bash -docker run -d --name gem5-cosim \ - -v /home/zevorn/cosim/gem5:/gem5 \ - -v /tmp:/tmp \ - -v /dev/shm:/dev/shm \ - -w /gem5 \ - -e PYTHONPATH=/usr/lib/python3.12/lib-dynload \ - gem5-run:local \ - 
/gem5/build/VEGA_X86/gem5.opt --listener-mode=on \ - /gem5/configs/example/gpufs/mi300_cosim.py \ - --socket-path=/tmp/gem5-mi300x.sock \ - --shmem-path=/mi300x-vram \ - --shmem-host-path=/cosim-guest-ram \ - --dgpu-mem-size=16GiB \ - --num-compute-units=40 \ - --mem-size=8G -``` - -#### 4.2 等待 gem5 就绪 - -```bash -# 查看 gem5 日志,等待出现 "listening" 或 "ready" -docker logs -f gem5-cosim -``` - -看到如下输出即表示就绪: - -``` -============================================================ -gem5 MI300X co-simulation server ready - Socket: /tmp/gem5-mi300x.sock - VRAM SHM: /mi300x-vram - Host SHM: /cosim-guest-ram - VRAM size: 16GiB - Host RAM: 8GiB - CUs: 40 -Waiting for QEMU to connect... -============================================================ -``` - -#### 4.3 修复权限 - -Docker 创建的文件归 root 所有,需要修复权限以便 QEMU 访问: - -```bash -docker exec gem5-cosim chmod 777 /tmp/gem5-mi300x.sock -docker exec gem5-cosim chmod 666 /dev/shm/mi300x-vram -``` - -#### 4.4 启动 QEMU - -```bash -# 前台交互模式(vfio-user 后端,使用原版 QEMU 10.0+) -qemu-system-x86_64 \ - -machine q35 -enable-kvm -cpu host \ - -m 8G -smp 4 \ - -object memory-backend-file,id=mem0,size=8G,mem-path=/dev/shm/cosim-guest-ram,share=on \ - -numa node,memdev=mem0 \ - -kernel /home/zevorn/cosim/gem5-resources/src/x86-ubuntu-gpu-ml/vmlinux-rocm70 \ - -append "console=ttyS0,115200 root=/dev/vda1 modprobe.blacklist=amdgpu" \ - -drive file=/home/zevorn/cosim/gem5-resources/src/x86-ubuntu-gpu-ml/disk-image/x86-ubuntu-rocm70,format=raw,if=virtio \ - -device 'vfio-user-pci,socket={"type":"unix","path":"/tmp/gem5-mi300x.sock"}' \ - -nographic -no-reboot -``` - -> **重要:** 内核命令行必须包含 `modprobe.blacklist=amdgpu`,防止 PCI 子系统在 VGA ROM 写入共享内存之前自动加载驱动。`cosim-gpu-setup.service` 会按正确顺序初始化(dd ROM → modprobe)。 -> -> **注意:** 使用 vfio-user 后端时,无需在 QEMU 侧指定 `shmem-path` 或 `vram-size` 参数,共享内存由 gem5 端的 `MI300XVfioUser` 服务端负责创建和管理。 - -或者以后台 screen 模式运行: - -```bash -screen -dmS qemu-cosim -L -Logfile /tmp/qemu-cosim-screen.log \ - qemu-system-x86_64 \ - -machine q35 -enable-kvm 
-cpu host \ - -m 8G -smp 4 \ - -object memory-backend-file,id=mem0,size=8G,mem-path=/dev/shm/cosim-guest-ram,share=on \ - -numa node,memdev=mem0 \ - -kernel /home/zevorn/cosim/gem5-resources/src/x86-ubuntu-gpu-ml/vmlinux-rocm70 \ - -append "console=ttyS0,115200 root=/dev/vda1 modprobe.blacklist=amdgpu" \ - -drive file=/home/zevorn/cosim/gem5-resources/src/x86-ubuntu-gpu-ml/disk-image/x86-ubuntu-rocm70,format=raw,if=virtio \ - -device 'vfio-user-pci,socket={"type":"unix","path":"/tmp/gem5-mi300x.sock"}' \ - -nographic -no-reboot - -# 连接 screen 查看串口输出 -screen -r qemu-cosim -# 退出 screen: Ctrl-A D(分离) -``` - -#### 4.5 SSH 访问 Guest - -`cosim_launch.sh` 脚本默认启用了用户态网络和 SSH 端口转发(`-netdev user,id=net0,hostfwd=tcp::2222-:22` + `virtio-net-pci`)。要通过 SSH 访问 Guest,需要先在 Guest 内配置网络。 - -**1. 查看网卡名称:** - -```bash -ip a -``` - -找到 virtio 网卡接口(如 `enp0s2`),具体名称取决于 PCI 拓扑,可能不同。 - -**2. 配置 netplan:** - -编辑 `/etc/netplan/50-cloud-init.yaml`: - -```yaml -network: - version: 2 - ethernets: - enp0s2: - dhcp4: true -``` - -> **注意:** 将 `enp0s2` 替换为 `ip a` 输出中的实际接口名称。 - -**3. 应用配置:** - -```bash -netplan apply -``` - -**4. 从宿主机 SSH 登录:** - -在宿主机上打开另一个终端,执行: - -```bash -ssh -p 2222 gem5@localhost -``` - -默认密码:`12345`。 - -> **提示:** 相比 QEMU 串口控制台,SSH 访问在交互操作、文件传输(`scp -P 2222`)以及多会话场景下更加方便。 - ---- - -## 第五步:加载 GPU 驱动 - -Guest Linux 启动完成后(自动以 root 登录),执行以下命令加载 amdgpu 驱动。 - -### 方式一:自动加载(默认) - -磁盘镜像内置 `cosim-gpu-setup.service`,开机时自动执行: - -1. 通过 `dd` 写入 VGA ROM 到 `0xC0000`(gem5 `readROM()` 需要此数据) -2. 链接 IP discovery 固件 -3. 执行 `modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2` - -服务约 40 秒完成。登录后用 `rocm-smi` 验证。 - -### 方式二:手动加载 - -```bash -# 1. 加载 VGA ROM(modprobe 之前必须执行) -dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128 - -# 2. 链接 IP discovery 固件 -ln -sf /usr/lib/firmware/amdgpu/mi300_discovery \ - /usr/lib/firmware/amdgpu/ip_discovery.bin - -# 3. 
加载 amdgpu 驱动 -modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2 -``` - -> **关键参数说明:** -> - `ip_block_mask=0x67`(二进制 0110_0111)启用 GMC、IH、DCN、GFX、SDMA、VCN,禁用 PSP 和 SMU -> - 若使用错误的 mask(如 0x6f),PSP 初始化会触发 GPU reset 导致内核 panic -> - `ras_enable=0` 防止 `amdgpu_atom_parse_data_header` 中的空指针崩溃(cosim ROM 仅 3KB,ATOMBIOS 数据最小化) -> - `dd` 步骤**必须**在 modprobe 之前执行 — 否则驱动的 BIOS 发现链全部失败,`atom_context` 为 NULL - -### 验证驱动加载 - -```bash -# 检查 dmesg — 应看到 "amdgpu: DRM initialized" 和 "7 XCP partitions" -dmesg | grep -i amdgpu | tail -20 - -# 验证设备识别 -rocm-smi - -# 验证 GPU 能力 -rocminfo | head -40 -``` - -预期输出: - -``` -# rocm-smi 输出 -GPU[0] : Device Name: 0x74a0 -GPU[0] : Partition: SPX - -# rocminfo 输出 -Name: gfx942 -Compute Unit: 320 -KERNEL_DISPATCH capable -``` - -> 加载过程中可能出现约 80 条 fence fallback timer 警告,这是正常现象——DRM 子系统在探测所有 ring buffer 时使用轮询模式的超时回退机制。 - ---- - -## 第六步:运行 GPU 计算测试 - -### 编译 HIP 测试程序 - -在 Guest 内编写一个简单的向量加法程序: - -```bash -cat > /tmp/vec_add.cpp << 'EOF' -#include <hip/hip_runtime.h> -#include <cstdio> - -__global__ void vec_add(int *a, int *b, int *c, int n) { - int i = blockIdx.x * blockDim.x + threadIdx.x; - if (i < n) c[i] = a[i] + b[i]; -} - -int main() { - const int N = 4; - int ha[N] = {1, 2, 3, 4}; - int hb[N] = {10, 20, 30, 40}; - int hc[N] = {0}; - - int *da, *db, *dc; - hipMalloc(&da, N * sizeof(int)); - hipMalloc(&db, N * sizeof(int)); - hipMalloc(&dc, N * sizeof(int)); - - hipMemcpy(da, ha, N * sizeof(int), hipMemcpyHostToDevice); - hipMemcpy(db, hb, N * sizeof(int), hipMemcpyHostToDevice); - - vec_add<<<1, N>>>(da, db, dc, N); - - hipMemcpy(hc, dc, N * sizeof(int), hipMemcpyDeviceToHost); - - printf("Result: %d %d %d %d\n", hc[0], hc[1], hc[2], hc[3]); - - bool pass = (hc[0]==11 && hc[1]==22 && hc[2]==33 && hc[3]==44); - printf("%s\n", pass ? "PASSED!" : "FAILED!"); - - hipFree(da); hipFree(db); hipFree(dc); - return pass ?
0 : 1; -} -EOF -``` - -编译并运行: - -```bash -# 编译(gfx942 = MI300X 架构) -/opt/rocm/bin/hipcc --offload-arch=gfx942 -o /tmp/vec_add /tmp/vec_add.cpp - -# 运行 -/tmp/vec_add -``` - -### 预期输出 - -``` -Result: 11 22 33 44 -PASSED! -``` - -### 使用 gem5-resources 中的 square 测试 - -也可以使用 gem5-resources 自带的 square 测试程序。需要先在宿主机编译: - -```bash -cd /home/zevorn/cosim/gem5 -./scripts/run_mi300x_fs.sh build-app square -``` - -然后将编译产物拷入 Guest(通过 scp 或直接挂载磁盘镜像),在 Guest 内运行: - -```bash -./square.default -``` - ---- - -## 关闭 cosim - -### 在 QEMU 串口控制台 - -``` -# 正常关机 -poweroff - -# 或强制退出 QEMU -Ctrl-A X -``` - -### 清理 Docker 容器和共享内存 - -```bash -docker rm -f gem5-cosim -rm -f /dev/shm/mi300x-vram /dev/shm/cosim-guest-ram -rm -f /tmp/gem5-mi300x.sock -``` - -> 使用 `cosim_launch.sh` 时,退出 QEMU 后会自动执行清理。 - ---- - -## 故障排查 - -### gem5 容器启动后立即退出 - -```bash -docker logs gem5-cosim -``` - -常见原因: -- `gem5.opt` 未编译或路径错误 -- Python 模块导入失败(检查 PYTHONPATH) -- 共享内存创建权限问题 - -### QEMU 连接 gem5 失败 - -``` -Failed to connect to /tmp/gem5-mi300x.sock -``` - -- 确认 gem5 已完成初始化(看到 "Waiting for QEMU to connect") -- 确认 socket 权限已修复(`chmod 777`) - -### 驱动加载失败 — PSP GPU reset panic - -``` -BUG: kernel NULL pointer dereference at psp_gpu_reset+0x43 -``` - -- 使用了错误的 `ip_block_mask`。必须用 `0x67`(禁用 PSP+SMU),不能用 `0x6f` - -### gem5 崩溃 — GART translation not found - -``` -GART translation for 0x3fff800000000 not found -``` - -- 这是已修复的 bug:未映射的 GART 页会被路由到 sink 地址(paddr=0),不再崩溃 -- 如果仍然出现,确认使用的是最新编译的 gem5 二进制 - -### hipcc 编译报错 — offload arch - -``` -error: cannot find ROCm device library -``` - -- 确认 ROCm 正确安装:`ls /opt/rocm/lib/` -- 使用正确的架构标志:`--offload-arch=gfx942` - -### GPU 计算超时 - -- 检查 gem5 日志(`docker logs gem5-cosim`)是否有错误 -- 少量 fence timeout 是正常的,大量超时可能表示 DMA 或中断路径有问题 - ---- - -## 关键参数参考 - -| 参数 | 默认值 | 说明 | -|---|---|---| -| `--socket-path` | `/tmp/gem5-mi300x.sock` | QEMU <-> gem5 通信套接字(vfio-user 协议) | -| `--shmem-path` | `/mi300x-vram` | GPU VRAM 共享内存名称(/dev/shm 下) | -| `--shmem-host-path` | `/cosim-guest-ram` | Guest RAM 共享内存名称 
| -| `--dgpu-mem-size` | `16GiB` | GPU VRAM 大小 | -| `--num-compute-units` | `40` | GPU 计算单元数量 | -| `--mem-size` | `8GiB` | Guest 物理内存大小 | -| `--cosim-backend` | `vfio-user` | cosim 后端类型(`vfio-user` 或 `legacy`) | -| `ip_block_mask` | `0x67` | amdgpu 驱动 IP 块掩码 | -| `discovery` | `2` | 使用 IP discovery 固件 | - -## 关键文件参考 - -| 文件 | 用途 | -|---|---| -| `scripts/cosim_launch.sh` | cosim 一键启动脚本 | -| `scripts/run_mi300x_fs.sh` | 编排脚本(编译、构建镜像、运行) | -| `configs/example/gpufs/mi300_cosim.py` | gem5 cosim 配置 | -| `src/dev/amdgpu/mi300x_vfio_user.{cc,hh}` | gem5 侧 vfio-user 服务端(默认后端) | -| `src/dev/amdgpu/mi300x_gem5_cosim.{cc,hh}` | gem5 侧 legacy 桥接(legacy 后端) | -| `src/dev/amdgpu/amdgpu_device.cc` | GPU 设备模型 | -| `src/dev/amdgpu/amdgpu_vm.cc` | GPU 地址翻译(GART 等) | -| `qemu/hw/misc/mi300x_gem5.c` | QEMU 侧 mi300x-gem5 PCIe 设备(仅 legacy 后端) | - -## 版本矩阵 - -| 组件 | 版本 | -|---|---| -| Guest 操作系统 | Ubuntu 24.04.2 LTS | -| Guest 内核 | 6.8.0-79-generic | -| ROCm | 7.0.0 | -| amdgpu DKMS | 匹配 ROCm 7.0 | -| gem5 构建目标 | VEGA_X86 | -| GPU 设备 | MI300X (gfx942, DeviceID 0x74A0) | -| 一致性协议 | GPU_VIPER | -| QEMU | 10.0+(vfio-user 后端)或 cosim 分支(legacy 后端) | diff --git a/docs/zh/getting-started.md b/docs/zh/getting-started.md new file mode 100644 index 0000000..17d7d89 --- /dev/null +++ b/docs/zh/getting-started.md @@ -0,0 +1,532 @@ +[English](../en/getting-started.md) + +# 快速入门 + +QEMU + gem5 MI300X 联合仿真项目的快速入门指南。 +从编译各组件到运行第一个 HIP GPU 计算测试。 + +## 架构概述 + +``` ++---------------------------------+ +------------------------------+ +| QEMU (Q35 + KVM) | | gem5 (Docker 容器内) | +| +---------------------------+ | | +------------------------+ | +| | Guest Linux (Ubuntu 24.04)| | | | MI300X GPU 模型 | | +| | amdgpu 驱动 | | | | - Shader + CU | | +| | ROCm 7.0 / HIP 运行时 | | | | - PM4 / SDMA 引擎 | | +| +-----------+---------------+ | | | - Ruby 缓存层次 | | +| | MMIO/Doorbell | | +----------+-------------+ | +| +-----------v---------------+ | | +----------v-------------+ | +| | vfio-user-pci (built-in) |<--------->| 
MI300XVfioUser Server | | +| +---------------------------+ |vfio-| +------------------------+ | +| |user | | ++---------------------------------+ +------------------------------+ + | | + v v + /dev/shm/cosim-guest-ram /dev/shm/mi300x-vram + (Guest 物理内存, 共享) (GPU VRAM, 共享) +``` + +- **QEMU** 负责:CPU 执行、Linux 内核引导、PCIe 枚举、amdgpu 驱动加载 +- **gem5** 负责:MI300X GPU 计算模型(Shader、CU、缓存、DMA 引擎) +- 两者通过 **vfio-user 协议**(基于 Unix 域套接字)通信。QEMU 使用内置的 `vfio-user-pci` 设备,gem5 端运行 `MI300XVfioUser` 作为 vfio-user 服务端 +- Guest 物理内存和 GPU VRAM 通过 `/dev/shm/` 下的**共享内存**共享 + +关于内存架构和 BAR 布局的详细说明,请参阅[架构文档](architecture.md#内存共享架构)。 + +## 前置条件 + +| 需求 | 说明 | +|---|---| +| 宿主机系统 | Linux x86_64,支持 KVM(已在 WSL2 6.6.x 验证) | +| Docker | 守护进程运行中,当前用户在 `docker` 组 | +| KVM | `/dev/kvm` 可访问 | +| QEMU | 安装 `qemu-system-x86_64`(用于 Packer 构建磁盘镜像) | +| 磁盘空间 | 至少 120 GB(55G 磁盘镜像 + 构建中间产物) | +| 内存 | 建议 16 GB 以上(gem5 编译和运行都比较占内存) | +| 工具 | `git`、`screen`、`unzip` | + +## 编译 gem5 和 QEMU + +### 构建运行时 Docker 镜像 + +编译 gem5 之前,先创建运行时 Docker 镜像: + +```bash +cd scripts +docker build -t gem5-run:local -f Dockerfile.run . 
+``` + +此镜像基于 `ghcr.io/gem5/gpu-fs`,添加了 Python 3.12 支持。 + +### 编译 gem5 + +gem5 二进制链接了 Ubuntu 24.04 的库,需要在兼容环境中编译。 + +> **注意:** vfio-user 后端依赖 `libjson-c-dev`(编译时)和 `libjson-c5`(运行时)。`gem5-run:local` 镜像已包含此依赖,无需额外安装。 + +**方式一:编排脚本** + +```bash +./scripts/run_mi300x_fs.sh build-gem5 +``` + +**方式二:Docker 内手动编译** + +```bash +cd /home/zevorn/cosim/gem5 + +docker run --rm \ + -v "$(pwd):/gem5" -w /gem5 \ + gem5-run:local \ + scons build/VEGA_X86/gem5.opt -j4 +``` + +> **提示:** 内存不足时降低并行度(`-j1` 或 `-j2`)。 + +产出:`build/VEGA_X86/gem5.opt`(约 1.1 GB)。 + +### 编译 QEMU + +使用 vfio-user 后端时,**原版 QEMU 10.0+** 即可直接使用——内置 `vfio-user-pci` 设备,无需自定义 QEMU 代码。 + +```bash +mkdir -p qemu-build && cd qemu-build +/path/to/qemu/configure --target-list=x86_64-softmmu +make -j$(nproc) +``` + +或通过编排脚本: + +```bash +./scripts/run_mi300x_fs.sh build-qemu +``` + +产出:`qemu-system-x86_64`。 + +> **Legacy 后端:** 若使用 `--cosim-backend=legacy`,则需要 `cosim/qemu/` 中包含 `mi300x-gem5` 设备的源码。编译方式同上,但必须使用 cosim 分支的 QEMU 源码。 + +## 构建磁盘镜像 + +磁盘镜像包含 Ubuntu 24.04 + ROCm 7.0 + 内核 6.8.0-79-generic 及 amdgpu DKMS 模块。 + +### 自动构建 + +```bash +./scripts/run_mi300x_fs.sh build-disk +``` + +若 `gem5-resources` 不存在,会自动克隆后开始构建。 + +### 手动构建 + +```bash +cd ../gem5-resources/src/x86-ubuntu-gpu-ml +./build.sh -var "qemu_path=/usr/sbin/qemu-system-x86_64" +``` + +> Arch Linux 上 QEMU 路径为 `/usr/sbin/`,其他发行版可能是 `/usr/bin/`。 + +### 产出 + +| 产物 | 路径 | 大小 | +|---|---|---| +| 磁盘镜像 | `gem5-resources/src/x86-ubuntu-gpu-ml/disk-image/x86-ubuntu-rocm70` | 约 55 GB | +| 内核 | `gem5-resources/src/x86-ubuntu-gpu-ml/vmlinux-rocm70` | 约 64 MB | + +> **提示(国内网络):** 如果构建过程中包下载卡住,可以应用国内镜像补丁加速 VM 内的 `apt`。详见[参考手册 §7](reference.md#7-国内镜像配置)。 + +## 启动联合仿真 + +### 方式一:一键启动脚本(推荐) + +```bash +./scripts/cosim_launch.sh +``` + +此脚本会自动完成以下所有步骤(启动 gem5 容器、等待就绪、修复权限、启动 QEMU),并以交互模式进入 QEMU 串口控制台。 + +可用参数: + +```bash +./scripts/cosim_launch.sh --gem5-debug MI300XCosim # 开启 gem5 调试输出 +./scripts/cosim_launch.sh --vram-size 32GiB # 自定义 VRAM 大小 +./scripts/cosim_launch.sh 
--num-cus 80 # 自定义 CU 数量 +./scripts/cosim_launch.sh --cosim-backend=legacy # 使用 legacy 自定义套接字后端 +``` + +### 方式二:手动分步启动 + +#### 启动 gem5(Docker 容器) + +```bash +docker run -d --name gem5-cosim \ + -v /home/zevorn/cosim/gem5:/gem5 \ + -v /tmp:/tmp \ + -v /dev/shm:/dev/shm \ + -w /gem5 \ + -e PYTHONPATH=/usr/lib/python3.12/lib-dynload \ + gem5-run:local \ + /gem5/build/VEGA_X86/gem5.opt --listener-mode=on \ + /gem5/configs/example/gpufs/mi300_cosim.py \ + --socket-path=/tmp/gem5-mi300x.sock \ + --shmem-path=/mi300x-vram \ + --shmem-host-path=/cosim-guest-ram \ + --dgpu-mem-size=16GiB \ + --num-compute-units=40 \ + --mem-size=8G +``` + +#### 等待 gem5 就绪 + +```bash +docker logs -f gem5-cosim +``` + +看到如下输出即表示就绪: + +``` +============================================================ +gem5 MI300X co-simulation server ready + Socket: /tmp/gem5-mi300x.sock + VRAM SHM: /mi300x-vram + Host SHM: /cosim-guest-ram + VRAM size: 16GiB + Host RAM: 8GiB + CUs: 40 +Waiting for QEMU to connect... +============================================================ +``` + +#### 修复权限 + +Docker 创建的文件归 root 所有,需要修复权限以便 QEMU 访问: + +```bash +docker exec gem5-cosim chmod 777 /tmp/gem5-mi300x.sock +docker exec gem5-cosim chmod 666 /dev/shm/mi300x-vram +``` + +#### 启动 QEMU + +```bash +qemu-system-x86_64 \ + -machine q35 -enable-kvm -cpu host \ + -m 8G -smp 4 \ + -object memory-backend-file,id=mem0,size=8G,mem-path=/dev/shm/cosim-guest-ram,share=on \ + -numa node,memdev=mem0 \ + -kernel /home/zevorn/cosim/gem5-resources/src/x86-ubuntu-gpu-ml/vmlinux-rocm70 \ + -append "console=ttyS0,115200 root=/dev/vda1 modprobe.blacklist=amdgpu" \ + -drive file=/home/zevorn/cosim/gem5-resources/src/x86-ubuntu-gpu-ml/disk-image/x86-ubuntu-rocm70,format=raw,if=virtio \ + -device 'vfio-user-pci,socket={"type":"unix","path":"/tmp/gem5-mi300x.sock"}' \ + -nographic -no-reboot +``` + +> **重要:** 内核命令行必须包含 `modprobe.blacklist=amdgpu`,防止 PCI 子系统在 VGA ROM 写入共享内存之前自动加载驱动。`cosim-gpu-setup.service` 会按正确顺序初始化。 + +#### SSH 访问客户机 + 
+`cosim_launch.sh` 脚本默认启用了用户态网络和 SSH 端口转发。在客户机内通过 `netplan` 配置网络接口后,从宿主机连接: + +```bash +ssh -p 2222 gem5@localhost +# 默认密码:12345 +``` + +### 关闭 + +```bash +# 在 QEMU 串口控制台: +poweroff +# 或强制退出:Ctrl-A X + +# 清理 Docker 容器和共享内存: +docker rm -f gem5-cosim +rm -f /dev/shm/mi300x-vram /dev/shm/cosim-guest-ram +rm -f /tmp/gem5-mi300x.sock +``` + +> 使用 `cosim_launch.sh` 时,退出 QEMU 后会自动执行清理。 + +## GPU 驱动初始化 + +MI300X GPU 驱动可在 QEMU 客户机启动后**自动**或**手动**加载。磁盘镜像中已包含所有必需的文件(ROM、固件、内核模块)。 + +### 自动加载(默认) + +磁盘镜像内置 `cosim-gpu-setup.service`,开机时自动执行: + +1. `dd` 写入 VGA ROM 到 `0xC0000`(gem5 通过共享内存的 `readROM()` 需要此数据) +2. 链接 IP discovery 固件 +3. `modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2` + +服务约 40 秒完成。登录后 GPU 即可使用: + +```bash +rocm-smi # 应显示设备 0x74a0 +rocminfo # 应显示 gfx942 +``` + +服务文件内容: + +```ini +# /etc/systemd/system/cosim-gpu-setup.service +[Unit] +Description=MI300X GPU Setup for Co-simulation +After=local-fs.target +Before=multi-user.target + +[Service] +Type=oneshot +RemainAfterExit=yes +ExecStart=/usr/local/bin/cosim-gpu-setup.sh + +[Install] +WantedBy=multi-user.target +``` + +> **注意:** 内核命令行必须保留 `modprobe.blacklist=amdgpu`,防止 PCI 子系统在 ROM 写入共享内存之前自动加载驱动。systemd 服务会在 `dd` 之后显式 `modprobe`。 + +### 手动加载 + +如果 systemd 服务未安装,或需要重新加载驱动,在客户机启动后手动执行以下命令。 + +**前置条件:** `cosim_launch.sh` 正在运行(gem5 + QEMU 已连接),客户机已启动并获取了 root shell,内核命令行中传递了 `modprobe.blacklist=amdgpu`。 + +**快速参考(可直接复制粘贴):** + +```bash +dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128 +ln -sf /usr/lib/firmware/amdgpu/mi300_discovery /usr/lib/firmware/amdgpu/ip_discovery.bin +modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2 +``` + +### 详细步骤 + +#### 步骤 1:加载 VGA BIOS ROM + +```bash +dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128 +``` + +将 MI300X VBIOS ROM 镜像写入物理地址 `0xC0000`(768 KB)处的传统 VGA ROM 区域。amdgpu 驱动在初始化期间从该地址读取 VBIOS。如果没有 ROM,驱动将报错 `"Unable to locate a BIOS ROM"`。 + +| 参数 | 值 | 含义 | 
+|-----------|-------|---------| +| `if` | `/root/roms/mi300.rom` | ROM 二进制文件(在磁盘镜像中) | +| `of` | `/dev/mem` | 物理内存设备 | +| `bs` | `1k` | 块大小 = 1024 字节 | +| `seek` | `768` | 跳转至 768 × 1024 = `0xC0000` | +| `count` | `128` | 写入 128 × 1024 = 128 KB | + +#### 步骤 2:链接 IP Discovery 固件 + +```bash +ln -sf /usr/lib/firmware/amdgpu/mi300_discovery \ + /usr/lib/firmware/amdgpu/ip_discovery.bin +``` + +将驱动的 IP discovery 固件路径指向 MI300X 专用的 discovery 二进制文件。`discovery=2` 模式从磁盘上的固件文件读取 GPU IP 块信息,而非从 GPU 自身的 ROM/寄存器读取。 + +#### 步骤 3:加载 amdgpu 内核模块 + +```bash +modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2 +``` + +关键参数: + +| 参数 | 值 | 含义 | +|-----------|-------|---------| +| `ip_block_mask` | `0x67` | 禁用 PSP(bit 3)和 SMU(bit 4);cosim 不模拟这些 IP 块 | +| `ppfeaturemask` | `0` | 禁用 PowerPlay 特性;cosim 无电源管理硬件 | +| `dpm` | `0` | 禁用动态电源管理 | +| `audio` | `0` | 禁用音频;cosim 无 HDMI/DP 音频 | +| `ras_enable` | `0` | 禁用 RAS — 防止 VBIOS 最小化时空指针崩溃 | +| `discovery` | `2` | 使用固件文件进行 IP discovery | + +> **警告**:使用 `ip_block_mask=0x6f`(仅禁用 SMU)会导致 PSP 固件加载失败和内核 panic。务必使用 `0x67`。 + +> **警告**:`dd` 步骤(步骤 1)在 `modprobe` 之前**必须**执行。否则驱动的 BIOS 发现链全部失败,`atom_context` 为 NULL,导致在 `amdgpu_atom_parse_data_header` 处发生空指针崩溃。 + +### 验证 + +```bash +# 检查 dmesg 中 amdgpu 初始化信息 +dmesg | grep -i amdgpu | tail -20 + +# 检查 PCI 设备 +lspci | grep -i amd + +# 验证设备识别和 GPU 能力 +rocm-smi +rocminfo | head -40 +``` + +预期输出: + +``` +# rocm-smi +GPU[0] : Device Name: 0x74a0 +GPU[0] : Partition: SPX + +# rocminfo +Name: gfx942 +Compute Unit: 320 +KERNEL_DISPATCH capable +``` + +> 加载过程中可能出现约 80 条 fence fallback timer 警告,这是正常现象——DRM 子系统在探测所有 ring buffer 时使用轮询模式的超时回退机制。 + +### 文件位置(客户机磁盘镜像内部) + +| 文件 | 路径 | +|------|------| +| VGA BIOS ROM | `/root/roms/mi300.rom` | +| IP Discovery 固件 | `/usr/lib/firmware/amdgpu/mi300_discovery` | +| 自动加载服务 | `/etc/systemd/system/cosim-gpu-setup.service` | +| 自动加载脚本 | `/usr/local/bin/cosim-gpu-setup.sh` | +| amdgpu 模块 | `/lib/modules/$(uname 
-r)/updates/dkms/amdgpu.ko.zst` | + +## 运行 HIP 测试 + +### 编译 HIP 测试程序 + +在客户机内编写一个简单的向量加法程序: + +```bash +cat > /tmp/vec_add.cpp << 'EOF' +#include <hip/hip_runtime.h> +#include <cstdio> + +__global__ void vec_add(int *a, int *b, int *c, int n) { + int i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < n) c[i] = a[i] + b[i]; +} + +int main() { + const int N = 4; + int ha[N] = {1, 2, 3, 4}; + int hb[N] = {10, 20, 30, 40}; + int hc[N] = {0}; + + int *da, *db, *dc; + hipMalloc(&da, N * sizeof(int)); + hipMalloc(&db, N * sizeof(int)); + hipMalloc(&dc, N * sizeof(int)); + + hipMemcpy(da, ha, N * sizeof(int), hipMemcpyHostToDevice); + hipMemcpy(db, hb, N * sizeof(int), hipMemcpyHostToDevice); + + vec_add<<<1, N>>>(da, db, dc, N); + + hipMemcpy(hc, dc, N * sizeof(int), hipMemcpyDeviceToHost); + + printf("Result: %d %d %d %d\n", hc[0], hc[1], hc[2], hc[3]); + + bool pass = (hc[0]==11 && hc[1]==22 && hc[2]==33 && hc[3]==44); + printf("%s\n", pass ? "PASSED!" : "FAILED!"); + + hipFree(da); hipFree(db); hipFree(dc); + return pass ? 0 : 1; +} +EOF +``` + +编译并运行: + +```bash +# 编译(gfx942 = MI300X 架构) +/opt/rocm/bin/hipcc --offload-arch=gfx942 -o /tmp/vec_add /tmp/vec_add.cpp + +# 运行 +/tmp/vec_add +``` + +### 预期输出 + +``` +Result: 11 22 33 44 +PASSED! +``` + +### 使用 gem5-resources 中的 square 测试 + +也可以使用 gem5-resources 自带的 `square` 测试程序。需要先在宿主机编译: + +```bash +./scripts/run_mi300x_fs.sh build-app square +``` + +然后将编译产物拷入客户机(通过 `scp -P 2222` 或直接挂载磁盘镜像),在客户机内运行: + +```bash +./square.default +``` + +预期输出: + +``` +info: running on device AMD Instinct MI300X +info: allocate host and device mem ( 7.63 MB) +info: launch 'vector_square' kernel +info: check result +PASSED!
+``` + +## 附录:独立 gem5 GPU 全系统仿真 + +上述联合仿真流程使用 QEMU 进行 KVM 加速启动,gem5 仅提供 GPU 模型。另一种方式是**完全在 gem5 内部运行**(CPU + GPU),无需 QEMU。这是标准的 gem5 全系统 GPU 仿真。 + +### 主要区别 + +| 方面 | 联合仿真(QEMU + gem5) | 独立 gem5 | +|---|---|---| +| CPU 执行 | KVM(接近原生速度) | gem5 atomic/timing 模型 | +| 启动时间 | 约 30 秒 | 约 2-5 分钟(KVM 快进) | +| GPU 模型 | gem5 MI300X(通过 vfio-user) | gem5 MI300X(同一模型) | +| 驱动加载 | systemd 服务或手动 `modprobe` | 通过 `m5 readfile` 自动化 | +| 适用场景 | 驱动开发、交互式调试 | 微架构研究、性能基准测试 | + +### 快速开始 + +**1. 编译 gem5 和磁盘镜像**(步骤与上述联合仿真相同)。 + +**2. 编译 GPU 测试应用:** + +```bash +./scripts/run_mi300x_fs.sh build-app square +``` + +**3. 运行仿真:** + +```bash +./scripts/run_mi300x_fs.sh run \ + ../gem5-resources/src/gpu/square/bin.default/square.default +``` + +> **重要:** 必须指定 `--app` 参数。不指定时,`readfile_contents` 为空字符串,驱动永远不会被加载。 + +**4. 监控输出:** + +```bash +tail -f m5out/board.pc.com_1.device +``` + +仿真使用 KVM 快进 Linux 启动过程,然后自动加载 GPU 驱动并运行指定的应用。Guest 在测试完成后调用 `m5 exit` 结束仿真。 + +关于独立仿真流程的完整细节(包括 legacy 配置、使用 `guestfish` 验证磁盘镜像、构建过程内部原理),请参阅 gem5 文档了解更多详情。 + +## 常见问题排查 + +五个最常见的问题及其解决方法: + +| 症状 | 原因 | 解决方法 | +|---------|-------|-----| +| gem5 容器启动后立即退出 | `gem5.opt` 未编译、路径错误或 Python 模块导入失败 | 执行 `docker logs gem5-cosim` 查看错误信息 | +| `Failed to connect to /tmp/gem5-mi300x.sock` | gem5 未就绪或 socket 权限不正确 | 等待 gem5 日志中出现 "Waiting for QEMU to connect";执行 `chmod 777` 修复 socket 权限 | +| `amdgpu_atom_parse_data_header` 处空指针崩溃 | `modprobe` 之前未写入 VGA ROM | 先执行 `dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128` | +| PSP GPU reset 导致内核 panic | 使用了错误的 `ip_block_mask`(如 `0x6f` 而非 `0x67`) | 务必使用 `ip_block_mask=0x67` 同时禁用 PSP 和 SMU | +| `hipcc` 报错:cannot find ROCm device library | ROCm 未安装或架构标志错误 | 确认 `/opt/rocm/lib/` 存在;使用 `--offload-arch=gfx942` | + +完整的故障排查表和调试技术,请参阅[参考手册 §4](reference.md#4-已知问题与陷阱)。 diff --git a/docs/zh/gpu-fs-guide.md b/docs/zh/gpu-fs-guide.md deleted file mode 100644 index 44f1092..0000000 --- a/docs/zh/gpu-fs-guide.md +++ /dev/null @@ -1,321 +0,0 @@ -[English](../en/gpu-fs-guide.md) - -# gem5 
MI300X 全系统 GPU 仿真复现指南 - -从零开始复现 cosim 分支上 AMD Instinct MI300X 的全系统 GPU 仿真流程, -直到 `square` 测试通过。 - -## 前置条件 - -| 需求 | 说明 | -|---|---| -| 宿主机系统 | Linux x86_64,支持 KVM(已在 WSL2 6.6.x 上验证) | -| Docker | 守护进程运行中,当前用户在 `docker` 组 | -| KVM | `/dev/kvm` 可访问(磁盘镜像构建和仿真均需要) | -| QEMU | 安装 `qemu-system-x86_64`(用于 Packer 构建磁盘镜像) | -| 磁盘空间 | 至少 120 GB 可用(55G 磁盘镜像 + 构建中间产物) | -| 工具 | `git`、`unzip`、`guestfish`(可选,用于磁盘镜像验证) | - -### Docker 镜像 - -| 镜像 | 用途 | -|---|---| -| `ghcr.io/gem5/gpu-fs:latest` | gem5 运行时容器的基础镜像(amd64) | -| `gem5-run:local` | 从 `scripts/Dockerfile.run` 构建的运行时镜像 | -| `ghcr.io/gem5/ubuntu-24.04_all-dependencies:v24-0` | gem5 编译用(仅 arm64,见下方说明) | - -> **注意:** `ghcr.io/gem5/ubuntu-24.04_all-dependencies:v24-0` 仅有 arm64 版本。 -> 在 amd64 宿主机上请使用 `ghcr.io/gem5/gpu-fs` 作为编译镜像或原生编译。 -> 可通过设置 `GEM5_BUILD_IMAGE` 环境变量来覆盖默认镜像。 - -## 目录结构 - -``` -/home/zevorn/cosim/ - gem5/ # gem5 源码(cosim 分支) - build/VEGA_X86/gem5.opt # gem5 二进制 - configs/example/ - gem5_library/x86-mi300x-gpu.py # stdlib 配置 - gpufs/mi300.py # legacy 配置 - scripts/ - run_mi300x_fs.sh # 编排脚本 - Dockerfile.run # 运行时 Docker 镜像 - gem5-resources/ # 磁盘镜像、内核、GPU 应用 - src/x86-ubuntu-gpu-ml/ - disk-image/x86-ubuntu-rocm70 # 55G raw 磁盘镜像 - vmlinux-rocm70 # 提取的内核 - src/gpu/square/ # square 测试应用 - docs/ # 文档 - qemu/ # QEMU 源码(cosim 设备) - build/qemu-system-x86_64 -``` - -## 第一步:编译 gem5 - -```bash -cd /home/zevorn/cosim/gem5 -./scripts/run_mi300x_fs.sh build-gem5 -``` - -此命令在 Docker 内执行 `scons build/VEGA_X86/gem5.opt`。 -产出:`build/VEGA_X86/gem5.opt`(约 1.1 GB)。 - -不使用 Docker 的手动编译方式: - -```bash -scons build/VEGA_X86/gem5.opt -j$(nproc) -``` - -## 第二步:编译 QEMU(可选,仅 cosim 模式需要) - -```bash -./scripts/run_mi300x_fs.sh build-qemu -``` - -要求 QEMU 源码位于 `../qemu/`。使用 `--target-list=x86_64-softmmu` 配置并编译。 -产出:`../qemu/build/qemu-system-x86_64`。 - -## 第三步:获取 gem5-resources - -```bash -./scripts/run_mi300x_fs.sh build-disk -# 若 gem5-resources 不存在会自动克隆,然后开始构建磁盘镜像 -``` - -或手动克隆: - -```bash -cd /home/zevorn/cosim -git clone --depth 1 
https://github.com/gem5/gem5-resources.git gem5-resources -``` - -## 第四步:构建磁盘镜像 - -磁盘镜像构建使用 Packer + QEMU/KVM,安装 Ubuntu 24.04.2 + ROCm 7.0 + -内核 6.8.0-79-generic 及全部所需的 DKMS 模块。 - -### 自动构建(通过编排脚本) - -```bash -./scripts/run_mi300x_fs.sh build-disk -``` - -### 手动构建 - -```bash -cd ../gem5-resources/src/x86-ubuntu-gpu-ml - -# 下载 Packer 并构建 -./build.sh -var "qemu_path=/usr/sbin/qemu-system-x86_64" -``` - -> **重要:** `x86-ubuntu-gpu-ml.pkr.hcl` 中默认 `qemu_path` 为 -> `/usr/bin/qemu-system-x86_64`。某些发行版(如 Arch)的实际路径是 -> `/usr/sbin/qemu-system-x86_64`,需要用 `-var` 覆盖。 - -### 构建过程详解 - -1. 通过 QEMU/KVM 启动 Ubuntu 24.04.2 ISO 进行自动安装 -2. 运行 `scripts/rocm-install.sh`,依次完成: - - 从 gem5 源码编译并安装 `m5` 工具(`/sbin/m5`) - - 从 `repo.radeon.com/amdgpu/7.0/ubuntu` 安装 ROCm 7.0 - - 安装 `amdgpu-dkms`(编译 DKMS 内核模块) - - 安装内核 `6.8.0-79-generic` 及对应 headers - - 提取 `vmlinux` 内核供 gem5 使用 - - 编译 `gem5_wmi.ko`(ACPI 补丁模块) - - 安装 PyTorch(ROCm 6.0 支持) -3. 复制 GPU BIOS ROM(`mi300.rom`)、IP discovery 文件和启动脚本到镜像中 -4. 从 VM 中下载提取的内核为 `vmlinux-rocm70` - -### 产出 - -| 产物 | 路径 | 大小 | -|---|---|---| -| 磁盘镜像 | `disk-image/x86-ubuntu-rocm70` | 约 55 GB | -| 内核 | `vmlinux-rocm70` | 约 64 MB | - -### 构建耗时 - -大约 30-60 分钟,取决于网络速度和宿主机性能。 - -### 验证磁盘镜像(可选) - -使用 `guestfish` 在不挂载的情况下检查磁盘镜像内容: - -```bash -LIBGUESTFS_BACKEND=direct guestfish --ro \ - -a disk-image/x86-ubuntu-rocm70 -m /dev/sda1 <<'EOF' -echo "=== DKMS 模块 ===" -ls /lib/modules/6.8.0-79-generic/updates/dkms/ -echo "=== ROCm 版本 ===" -cat /opt/rocm/.info/version -echo "=== load_amdgpu.sh ===" -cat /home/gem5/load_amdgpu.sh -echo "=== m5 二进制 ===" -is-file /sbin/m5 -echo "=== gem5_wmi 模块 ===" -is-file /home/gem5/gem5_wmi.ko -EOF -``` - -预期的 DKMS 模块列表(amdgpu 驱动的全部依赖): - -``` -amd-sched.ko.zst -amddrm_buddy.ko.zst -amddrm_exec.ko.zst # 关键模块——旧版构建中缺失 -amddrm_ttm_helper.ko.zst -amdgpu.ko.zst -amdkcl.ko.zst -amdttm.ko.zst -amdxcp.ko.zst -``` - -## 第五步:编译 GPU 测试应用 - -```bash -./scripts/run_mi300x_fs.sh build-app square -``` - -使用 Docker(`ghcr.io/gem5/gpu-fs`)或本地 `hipcc` 编译。 
-产出:`../gem5-resources/src/gpu/square/bin.default/square.default`。 - -## 第六步:构建运行时 Docker 镜像 - -gem5 二进制链接了 Ubuntu 24.04 的库,需要兼容的运行时环境: - -```bash -cd scripts -docker build -t gem5-run:local -f Dockerfile.run . -``` - -## 第七步:运行仿真 - -### stdlib 配置(推荐) - -```bash -./scripts/run_mi300x_fs.sh run \ - ../gem5-resources/src/gpu/square/bin.default/square.default -``` - -> **重要:必须指定 `--app` 参数。** 不指定时,`readfile_contents` 为空字符串 -> `""`,Python 将其判为 falsy,`KernelDiskWorkload._set_readfile_contents` 不会 -> 被调用,guest 中的 amdgpu 驱动永远不会被加载。 - -### legacy 配置 - -```bash -./scripts/run_mi300x_fs.sh run-legacy \ - ../gem5-resources/src/gpu/square/bin.default/square.default -``` - -### 仿真过程详解 - -1. **KVM 快速启动阶段**(约 2-5 分钟):gem5 使用 KVM 快进 Linux 启动过程。 - Guest 内核引导、systemd 初始化、自动以 root 登录。 -2. **readfile 执行**:Guest 通过 `.bashrc` 运行 `/home/gem5/run_gem5_app.sh`, - 调用 `m5 readfile` 获取宿主机注入的脚本。 -3. **驱动加载**:脚本将 GPU BIOS ROM 写入 `/dev/mem`,创建 IP discovery 文件 - 的符号链接,然后运行 `load_amdgpu.sh` 按依赖顺序 insmod 所有 DKMS 模块。 -4. **GPU 应用执行**:脚本解码 base64 编码的 GPU 二进制,运行它, - 然后调用 `m5 exit` 结束仿真。 - -### 监控输出 - -Guest 串口控制台输出写入 `m5out/board.pc.com_1.device`: - -```bash -tail -f m5out/board.pc.com_1.device -``` - -### square 测试预期输出 - -``` -3+0 records in -3+0 records out -3072 bytes (3.1 kB, 3.0 KiB) copied, ... -info: running on device AMD Instinct MI300X -info: allocate host and device mem ( 7.63 MB) -info: launch 'vector_square' kernel -info: check result -PASSED! 
-``` - -## 故障排查 - -### `Failed to init DRM client: -13` 后内核 panic - -**根因:** 磁盘镜像缺少 `amddrm_exec.ko.zst` DKMS 模块。缺少此模块时, -amdgpu TTM 内存管理器初始化失败,`drm_dev_enter()` 发现设备处于 "unplugged" -状态,返回 `-EACCES`(-13)。后续清理路径在 -`ttm_resource_move_to_lru_tail` 触发 NULL 指针解引用。 - -**修复:** 使用最新的 `gem5-resources`(`origin/stable` 分支)重新构建磁盘镜像。 -更新后的 `rocm-install.sh` 安装了内核 `6.8.0-79-generic`,与 ROCm 7.0 DKMS -包完全匹配,包含所有所需模块。 - -**验证:** 用 `guestfish` 确认 `amddrm_exec.ko.zst` 存在于 -`/lib/modules/6.8.0-79-generic/updates/dkms/` 中。 - -### `Can't open /dev/gem5_bridge: No such file or directory` - -**无害警告。** `m5` 工具优先尝试 `gem5_bridge` 设备驱动,失败后回退到 -地址映射 MMIO 模式(以 root 运行时可用)。readfile 机制仍然正常工作。 - -### Packer 构建失败:`output_directory already exists` - -上一次构建遗留的 `disk-image/` 目录会阻塞 Packer: - -```bash -mv disk-image disk-image-old -# 然后重新运行构建 -``` - -### Packer 构建失败:VM 内 git clone 失败 - -QEMU VM 内部的网络问题可能导致 `git clone` 失败。`rocm-install.sh` 脚本已内置 -重试逻辑(3 次尝试,间隔 10 秒)。若仍然失败,检查宿主机网络连接和 DNS 解析。 - -### 不指定 `--app` 时 GPU 驱动不加载 - -使用 `x86-mi300x-gpu.py` 不带 `--app` 参数运行时,`readfile_contents` 为空字符串 -`""`。Python 的真值检查 `elif readfile_contents:` 求值为 `False`,因此 -`_set_readfile_contents` 不会被调用,不会写入 readfile 文件。Guest 中的 -`run_gem5_app.sh` 从 `m5 readfile` 获得空文件后直接退出。 - -**解决方式:** 运行 GPU 仿真时始终指定 `--app` 参数。 - -### DRAM 容量警告 - -``` -DRAM device capacity (16384 Mbytes) does not match the address range assigned (8192 Mbytes) -``` - -这是 gem5 内存系统的配置警告,不影响仿真正确性。 - -## 关键文件参考 - -| 文件 | 用途 | -|---|---| -| `scripts/run_mi300x_fs.sh` | 主编排脚本 | -| `scripts/Dockerfile.run` | 运行时 Docker 镜像定义 | -| `configs/example/gem5_library/x86-mi300x-gpu.py` | stdlib 仿真配置 | -| `configs/example/gpufs/mi300.py` | legacy 仿真配置 | -| `src/python/gem5/prebuilt/viper/board.py` | ViperBoard:readfile 注入、驱动加载 | -| `src/python/gem5/components/devices/gpus/amdgpu.py` | MI300X 设备定义 | -| `src/dev/amdgpu/amdgpu_device.cc` | GPU 设备模型核心(cosim 分支修改) | -| `../gem5-resources/src/x86-ubuntu-gpu-ml/scripts/rocm-install.sh` | 磁盘镜像配置脚本 | -| 
`../gem5-resources/src/x86-ubuntu-gpu-ml/files/load_amdgpu.sh` | Guest 侧驱动加载脚本 | -| `../gem5-resources/src/x86-ubuntu-gpu-ml/x86-ubuntu-gpu-ml.pkr.hcl` | Packer 配置 | - -## 版本矩阵 - -| 组件 | 版本 | -|---|---| -| Guest 操作系统 | Ubuntu 24.04.2 LTS | -| Guest 内核 | 6.8.0-79-generic | -| ROCm | 7.0.0 | -| amdgpu DKMS | 匹配 ROCm 7.0 | -| gem5 构建目标 | VEGA_X86 | -| GPU 设备 | MI300X(DeviceID 0x74A1) | -| 一致性协议 | GPU_VIPER | diff --git a/docs/zh/mi300x-memory-management.md b/docs/zh/mi300x-memory-management.md deleted file mode 100644 index edc99cf..0000000 --- a/docs/zh/mi300x-memory-management.md +++ /dev/null @@ -1,332 +0,0 @@ -[English](../en/mi300x-memory-management.md) - -# MI300X 内存管理、地址转换与映射 - -本文档描述了 AMD MI300X GPU 在独立 gem5 仿真和 QEMU+gem5 协同仿真环境中如何管理内存地址。 - -## 1. GPU 地址空间 - -MI300X (GFX 9.4.3) GPU 使用多个地址空间和 aperture 来访问内存。GPU 发出的每次内存访问首先按 aperture 分类,然后转换为物理地址。 - -``` -GPU Virtual Address (48-bit) -| -+-- AGP aperture [agpBot, agpTop] -| +-- Direct offset: paddr = vaddr - agpBot + agpBase -| -+-- GART aperture [ptStart<<12, ptEnd<<12] -| +-- Page table: paddr = GART_PTE[page_num].phys_addr | offset -| -+-- Framebuffer (FB) [fbBase, fbTop] -| +-- VRAM offset: vram_off = vaddr - fbBase -| -+-- System aperture [sysAddrL, sysAddrH] -| +-- Direct map: paddr = vaddr (system memory) -| -+-- MMHUB aperture [mmhubBase, mmhubTop] -| +-- VRAM mirror: vram_off = vaddr - mmhubBase -| -+-- User VM (VMID>0) [arbitrary VAs] - +-- Multi-level page table walk (4 or 5 levels) -``` - -### 1.1 Aperture 寄存器 - -这些 MMIO 寄存器定义了每个 aperture 的边界。这些值由 amdgpu 驱动程序在 GMC(Graphics Memory Controller)初始化期间设置。 - -| 寄存器 | gem5 字段 | 格式 | 描述 | -|----------|-----------|--------|-------------| -| `MC_VM_FB_LOCATION_BASE` | `vmContext0.fbBase` | `bits[23:0] << 24` | MC 地址空间中 VRAM 的起始地址 | -| `MC_VM_FB_LOCATION_TOP` | `vmContext0.fbTop` | `bits[23:0] << 24 \| 0xFFFFFF` | VRAM 结束地址 | -| `MC_VM_FB_OFFSET` | `vmContext0.fbOffset` | `bits[23:0] << 24` | FB 重定位偏移量 | -| `MC_VM_AGP_BASE` | `vmContext0.agpBase` | `bits[23:0] 
<< 24` | AGP 重映射基地址 | -| `MC_VM_AGP_BOT` | `vmContext0.agpBot` | `bits[23:0] << 24` | AGP aperture 底部 | -| `MC_VM_AGP_TOP` | `vmContext0.agpTop` | `bits[23:0] << 24 \| 0xFFFFFF` | AGP aperture 顶部 | -| `MC_VM_SYSTEM_APERTURE_LOW_ADDR` | `vmContext0.sysAddrL` | `bits[29:0] << 18` | System aperture 低地址 | -| `MC_VM_SYSTEM_APERTURE_HIGH_ADDR` | `vmContext0.sysAddrH` | `bits[29:0] << 18` | System aperture 高地址 | -| `VM_CONTEXT0_PAGE_TABLE_BASE_ADDR` | `vmContext0.ptBase` | raw 64-bit | GART 表在 VRAM 中的位置 | -| `VM_CONTEXT0_PAGE_TABLE_START_ADDR` | `vmContext0.ptStart` | raw 64-bit | GART aperture 起始地址(页号) | -| `VM_CONTEXT0_PAGE_TABLE_END_ADDR` | `vmContext0.ptEnd` | raw 64-bit | GART aperture 结束地址(页号) | - -**协同仿真中的典型值**(来自驱动初始化诊断): -``` -ptBase = 0x3EE600000 GART table at VRAM offset ~15.7 GiB -ptStart = 0x7FFF00000 GART covers GPU VAs from 0x7FFF00000000 -ptEnd = 0x7FFF1FFFF GART covers ~128K pages (512 MiB) -fbBase = 0x8000000000 VRAM starts at MC address 512 GiB -fbTop = 0x8400FFFFFF VRAM ends at ~528 GiB (16 GiB range) -sysAddrL = 0x0 System aperture start -sysAddrH = 0x3FFEC0000 System aperture end (~4 TiB) -``` - -## 2. GART(Graphics Address Remapping Table) - -### 2.1 概述 - -GART 是一个单级页表,供 VMID 0(内核模式)使用,将 GPU 虚拟地址映射到系统物理地址。它使 GPU 能够对主机(guest)RAM 进行 DMA 访问,用于 ring buffer、fence 值、IH cookie 以及其他内核模式数据结构。 - -### 2.2 表布局 - -``` -VRAM offset = ptBase (gartBase) -+-------------------+ ptBase + 0 -| PTE[0] (8 bytes) | maps page ptStart -+-------------------+ ptBase + 8 -| PTE[1] | maps page ptStart + 1 -+-------------------+ ptBase + 16 -| PTE[2] | maps page ptStart + 2 -| ... 
| -+-------------------+ -| PTE[N] | maps page ptStart + N -+-------------------+ ptBase + (ptEnd - ptStart + 1) * 8 -``` - -每个 PTE 为 8 字节,格式如下: - -| 位域 | 字段 | 描述 | -|------|-------|-------------| -| 0 | Valid | 条目有效 | -| 1 | System | 1 = 系统内存,0 = 本地 VRAM | -| 5:2 | Fragment | 页面片段大小 | -| 47:12 | Physical Page | 物理地址 >> 12 | -| 51:48 | Block Fragment | 块片段大小 | -| 63:52 | Flags | MTYPE、PRT 等 | - -**物理地址提取**:`paddr = (bits(PTE, 47, 12) << 12) | page_offset` - -### 2.3 getGARTAddr 变换 - -在 GART 查找之前,地址会通过 `getGARTAddr()` 进行变换: - -```cpp -// In pm4_packet_processor.cc and sdma_engine.cc: -Addr getGARTAddr(Addr addr) const { - if (!gpuDevice->getVM().inAGP(addr)) { - Addr low_bits = bits(addr, 11, 0); - addr = (((addr >> 12) << 3) << 12) | low_bits; - } - return addr; -} -``` - -该函数将页号乘以 8(PTE 的大小),实际上是将 GPU VA 转换为 GART 表内的字节偏移量。随后 GART 转换使用这个变换后的地址来查找 PTE。 - -### 2.4 转换流程 - -``` -Original GPU VA (e.g., 0x7FFF00032000) - | - v getGARTAddr() -Transformed addr = ((VA>>12) * 8) << 12 | low_bits - = 0x3FFF80019_0000 (example) - | - v GARTTranslationGen::translate() -gart_addr = bits(transformed, 63, 12) = page_num * 8 - | - +-- Look up gartTable hash map (populated by writeFrame / SDMA shadow) - | - +-- Cosim fallback: read PTE from shared VRAM - | pte_offset = gart_addr - (ptStart * 8) - | pte = *(vramShmemPtr + ptBase + pte_offset) - | - v Extract physical address -paddr = (bits(PTE, 47, 12) << 12) | bits(VA, 11, 0) -``` - -### 2.5 gartTable 哈希表 vs. 共享 VRAM - -在独立 gem5 模式下,GART 条目维护在一个哈希表(`AMDGPUVM::gartTable`)中,由以下方式填充: - -1. **直接写入**(`amdgpu_device.cc:writeFrame()`):当驱动程序通过 BAR0 写入 VRAM 的 GART 区域时,值被存储到 `gartTable[offset]` 中。 - -2. **SDMA 影子拷贝**(`sdma_engine.cc`):当 SDMA 写入设备内存中的 GART 范围时,影子拷贝会更新 `gartTable`。 - -在协同仿真模式下,驱动程序通过 QEMU 的 BAR0 映射写入 GART PTE,直接进入共享 VRAM,不经过 gem5 的 `writeFrame()`。因此,`gartTable` 基本为空。协同仿真回退机制直接从共享 VRAM 的 `vramShmemPtr + ptBase` 处读取 PTE。 - -## 3. 
MMHUB Aperture - -MMHUB(Memory Management Hub)提供 VRAM 的影子映射。`[mmhubBase, mmhubTop]` 范围内的地址通过减去基地址进行转换: - -``` -vram_offset = vaddr - mmhubBase -``` - -SDMA 在 VMID 0 模式下使用此 aperture 访问设备内存。 - -## 4. 用户空间转换(VMID > 0) - -用户空间 GPU 程序(如 HIP 应用)使用类似于 x86-64 分页的多级页表。每个 VMID(1-15)拥有自己的页表基址寄存器。 - -``` -VM_CONTEXT[N]_PAGE_TABLE_BASE_ADDR -> Page Directory Base - | - v 4-level walk (PDE3 -> PDE2 -> PDE1 -> PDE0 -> PTE) -Physical address -``` - -`UserTranslationGen` 类使用 GPU 的页表遍历器(`VegaISA::Walker`)执行此遍历。用户模式(vmid > 0)下的 SDMA 使用此路径。 - -## 5. gem5 中的 DMA 路由 - -### 5.1 PM4 Packet Processor - -``` -PM4PacketProcessor::translate(vaddr, size) - | - +-- inAGP(vaddr)? -> AGPTranslationGen (direct offset) - | - +-- else -> GARTTranslationGen (page table lookup) -``` - -所有 PM4 DMA 使用 GART 转换(VMID 0)。地址在 DMA 调用之前先通过 `getGARTAddr()` 变换。 - -### 5.2 SDMA 引擎 - -``` -SDMAEngine::translate(vaddr, size) - | - +-- cur_vmid > 0? -> UserTranslationGen (multi-level page table) - | - +-- inAGP(vaddr)? -> AGPTranslationGen - | - +-- inMMHUB(vaddr)?-> MMHUBTranslationGen (VRAM shadow) - | - +-- else -> GARTTranslationGen -``` - -SDMA 比 PM4 具有更多的 aperture 感知能力,因为它同时处理内核模式(VMID 0)和用户模式(VMID > 0)的操作。 - -### 5.3 VRAM vs. 系统内存检测 - -对于 PM4 的 RELEASE_MEM 和 WRITE_DATA 数据包,目标可以是 VRAM 或系统内存。路由方式如下: - -```cpp -bool vram = isVRAMAddress(pkt->addr); // addr < gpuDevice->getVRAMSize() -Addr addr = vram ? pkt->addr : getGARTAddr(pkt->addr); - -if (vram) - gpuDevice->getMemMgr()->writeRequest(addr, data, size); // device memory -else - dmaWriteVirt(addr, size, cb, data); // system memory via GART -``` - -## 6. 中断处理器(IH)DMA - -中断处理器使用原始系统物理地址(非 GART): - -``` -IH Ring Buffer: regs.baseAddr (from IH_RB_BASE register) -Wptr Address: regs.WptrAddr (from IH_RB_WPTR_ADDR registers) -``` - -这些是驱动程序设置的 GPA(Guest Physical Address)。IH 写入流程: -1. 将中断 cookie(32 字节)写入 `baseAddr + IH_Wptr` -2. 将更新后的写指针写入 `WptrAddr` -3. 
然后调用 `intrPost()` → 向 guest 发送 MSI-X 中断 - -在协同仿真模式下,DMA 写入落入共享 guest RAM(`/dev/shm/cosim-guest-ram`),中断通过事件 socket 转发给 QEMU。 - -## 7. 协同仿真内存架构 - -``` -+-----------------------------------------------------+ -| Host (Linux) | -| | -| /dev/shm/cosim-guest-ram (8 GiB) | -| +--------------------------------------------+ | -| | Guest Physical RAM | | -| | <- QEMU memory-backend-file (share=on) | | -| | <- gem5 system.shared_backstore | | -| | | | -| | Contains: page tables, ring buffers, | | -| | IH ring, fence values, kernel code/data | | -| +--------------------------------------------+ | -| | -| /dev/shm/mi300x-vram (16 GiB) | -| +--------------------------------------------+ | -| | GPU VRAM | | -| | <- QEMU BAR0 mmap (driver writes here) | | -| | <- gem5 vramShmemPtr (GPU model reads) | | -| | | | -| | Contains: GART page table, GPU page tables,| | -| | frame data, device-local allocations | | -| | | | -| | Layout: | | -| | [0, ~15.7G) General VRAM allocations | | -| | [0x3EE600000] GART page table (ptBase) | | -| | [~15.7G, 16G) Reserved / metadata | | -| +--------------------------------------------+ | -| | -| /tmp/gem5-mi300x.sock (Unix domain socket) | -| +--------------------------------------------+ | -| | MMIO connection: QEMU <-> gem5 (sync) | | -| | Event connection: gem5 -> QEMU (async) | | -| | - IRQ raise/lower | | -| | - DMA read/write requests | | -| +--------------------------------------------+ | -+-----------------------------------------------------+ -``` - -### 7.1 内存分割(Q35) - -QEMU Q35 在 RAM >= 2.75 GiB 时将内存分为: -- 4G 以下区域:前 2 GiB(文件偏移 0) -- 4G 以上区域:其余部分位于文件偏移 2 GiB 处,映射到 PA 0x100000000+ - -gem5 的 `mi300_cosim.py` 复制了此分割方式,以确保双方在文件布局上保持一致。 - -### 7.2 GART PTE 协同仿真回退机制 - -由于驱动程序通过 QEMU 的 BAR0(共享内存)写入 GART PTE,gem5 的 `gartTable` 哈希表不会被填充。协同仿真回退机制直接从共享 VRAM 读取 PTE: - -```cpp -Addr pte_table_offset = gart_addr - (ptStart * 8); -Addr pte_vram_offset = gartBase() + pte_table_offset; -memcpy(&pte, vramShmemPtr + pte_vram_offset, sizeof(pte)); -``` - -如果 
PTE 为 0(未映射的页面),协同仿真模式将映射到 sink(`paddr=0`),而不是产生 fault,从而避免 `GenericPageTableFault` 导致的无限 DMA 重试崩溃。 - -## 8. 地址流程示例 - -### 8.1 Fence 写入(RELEASE_MEM) - -``` -1. PM4 RELEASE_MEM packet: addr=0x113100000 (guest phys), data=0x1234 -2. isVRAMAddress(0x113100000)? No (< 16 GiB but not a VRAM offset) -3. getGARTAddr(0x113100000) -> 0x899800000000 (page * 8 transform) -4. dmaWriteVirt(0x899800000000, 8, cb, &data) -5. GARTTranslationGen::translate() - - gart_addr = 0x89980000 - - Look up PTE from shared VRAM -> PTE has paddr bits - - paddr = extracted address (in guest RAM) -6. DMA write lands in /dev/shm/cosim-guest-ram at paddr offset -7. Guest driver reads fence value from same shared memory -``` - -### 8.2 HIP 内核调度 - -``` -1. User writes AQL packet to queue ring buffer (user VA) -2. User writes doorbell -> QEMU -> gem5 (socket MMIO) -3. gem5 PM4 reads queue MQD (GART address -> guest RAM) -4. gem5 GPU command processor dispatches kernel to CU array -5. CUs execute wavefronts (compute work) -6. On completion: RELEASE_MEM writes fence + triggers interrupt -7. IH writes cookie to IH ring (raw DMA to guest RAM) -8. intrPost() -> sendIrqRaise(0) -> QEMU event socket -9. QEMU msix_notify() -> guest IH handler processes interrupt -10. hipDeviceSynchronize() returns success -``` - -## 9. 
关键源文件 - -| 文件 | 作用 | -|------|------| -| `src/dev/amdgpu/amdgpu_vm.{cc,hh}` | 所有转换生成器(GART、AGP、MMHUB、User) | -| `src/dev/amdgpu/pm4_packet_processor.cc` | PM4 DMA 路由和 GART 地址变换 | -| `src/dev/amdgpu/sdma_engine.cc` | SDMA DMA 路由、GART 影子拷贝 | -| `src/dev/amdgpu/interrupt_handler.cc` | IH ring buffer DMA 和中断发送 | -| `src/dev/amdgpu/amdgpu_device.cc` | 设备级 intrPost()、writeFrame() | -| `src/dev/amdgpu/mi300x_gem5_cosim.cc` | 协同仿真 socket 桥接、IRQ 转发 | -| `configs/example/gpufs/mi300_cosim.py` | 内存配置、共享 backstore 设置 | diff --git a/docs/zh/reference.md b/docs/zh/reference.md new file mode 100644 index 0000000..ed69088 --- /dev/null +++ b/docs/zh/reference.md @@ -0,0 +1,564 @@ +[English](../en/reference.md) + +# 协同仿真参考手册 + +QEMU + gem5 MI300X 协同仿真系统的综合查阅参考。概念性说明请参阅[架构文档](architecture.md);分步构建和运行指南请参阅[快速入门](getting-started.md)。 + +--- + +## 1. 参数参考 + +### 1.1 cosim_launch.sh / mi300_cosim.py 选项 + +| 参数 | 默认值 | 说明 | +|------|--------|------| +| `--socket-path` | `/tmp/gem5-mi300x.sock` | QEMU <-> gem5 通信套接字(vfio-user 协议) | +| `--shmem-path` | `/mi300x-vram` | GPU VRAM 共享内存名称(/dev/shm 下) | +| `--shmem-host-path` | `/cosim-guest-ram` | Guest RAM 共享内存名称(/dev/shm 下) | +| `--dgpu-mem-size` | `16GiB` | GPU VRAM 大小 | +| `--num-compute-units` | `40` | GPU 计算单元数量 | +| `--mem-size` | `8GiB` | Guest 物理内存大小 | +| `--cosim-backend` | `vfio-user` | cosim 后端类型:`vfio-user`(原版 QEMU 10.0+)或 `legacy`(自定义 QEMU) | +| `--gem5-debug` | (无) | gem5 调试标志,例如 `MI300XCosim`、`AMDGPUDevice,PM4PacketProcessor` | +| `--vram-size` | `32GiB` | 自定义 VRAM 大小(`--dgpu-mem-size` 的别名) | +| `--num-cus` | `80` | 自定义 CU 数量(`--num-compute-units` 的别名) | + +### 1.2 amdgpu modprobe 参数 + +协同仿真模式下所有参数均为必需。完整命令: + +```bash +modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2 +``` + +| 参数 | 值 | 用途 | +|------|-----|------| +| `ip_block_mask` | `0x67` | 二进制 `0110_0111`。启用 common、GMC、IH、GFX、SDMA;禁用 PSP(bit 3)和 SMU(bit 4)。详见[第 3 节](#3-ip-block-mask-参考) | +| `ppfeaturemask` | `0` | 禁用所有 PowerPlay 
特性;cosim 无电源管理硬件 | +| `dpm` | `0` | 禁用动态电源管理 | +| `audio` | `0` | 禁用 HDMI/DP 音频;cosim 无音频硬件 | +| `ras_enable` | `0` | 禁用 RAS(可靠性、可用性、可维护性)。防止 VBIOS 最小化(cosim ROM 仅 3 KB)时 `atom_context` 为 NULL 导致的空指针崩溃 | +| `discovery` | `2` | 使用磁盘上的固件文件进行 IP discovery,而非从 GPU ROM/寄存器读取 | + +> **警告**:使用 `ip_block_mask=0x6f`(启用 bit 3 的 PSP)会导致 PSP 固件加载失败和内核 panic。务必使用 `0x67`。 + +> **警告**:`ras_enable=0` 为强制参数。缺少时,`amdgpu_ras_init` 会调用 `amdgpu_atom_parse_data_header` 访问 NULL 的 `atom_context`,触发空指针崩溃。 + +### 1.3 dd 命令参数(VGA ROM) + +```bash +dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128 +``` + +| 参数 | 值 | 含义 | +|------|-----|------| +| `if` | `/root/roms/mi300.rom` | ROM 二进制文件(在磁盘镜像中) | +| `of` | `/dev/mem` | 物理内存设备 | +| `bs` | `1k` | 块大小 = 1024 字节 | +| `seek` | `768` | 跳转至 768 x 1024 = `0xC0000`(传统 VGA ROM 区域) | +| `count` | `128` | 写入 128 x 1024 = 128 KB | + +`dd` 步骤将 MI300X VBIOS 写入共享内存(`/dev/shm/cosim-guest-ram`)中的物理地址 `0xC0000`--`0xDFFFF`。gem5 的 `AMDGPUDevice::readROM()` 通过 `system->getPhysMem()` 从该地址读取。此步骤在 `modprobe` 之前**必须**执行 -- amdgpu 驱动的五种 BIOS 发现方法在 cosim 模式下全部失败: + +| BIOS 发现方法 | 在 cosim 下失败的原因 | +|---------------|----------------------| +| `amdgpu_atrm_get_bios()` | QEMU Q35 无 ACPI ATRM 方法 | +| `amdgpu_acpi_vfct_bios()` | 无 ACPI VFCT 表 | +| `amdgpu_read_bios_from_rom()` | 通过 SMU 寄存器读取,但 SMU 被 `ip_block_mask=0x67` 禁用 | +| `amdgpu_read_platform_bios()` | 无平台提供的 ROM | +| `amdgpu_read_disabled_bios()` | cosim 下不可用 | + +### 1.4 内核命令行 + +内核必须使用以下命令行启动: + +``` +console=ttyS0,115200 root=/dev/vda1 modprobe.blacklist=amdgpu +``` + +`modprobe.blacklist=amdgpu` 防止 PCI 子系统在 ROM 写入共享内存之前自动加载驱动。`cosim-gpu-setup.service` 会按正确顺序初始化(dd ROM → modprobe)。 + +--- + +## 2. 
版本矩阵 + +| 组件 | 版本 | +|------|------| +| Guest 操作系统 | Ubuntu 24.04.2 LTS | +| Guest 内核 | 6.8.0-79-generic | +| ROCm | 7.0.0 | +| amdgpu DKMS | 匹配 ROCm 7.0 | +| gem5 构建目标 | VEGA_X86 | +| GPU 设备 | MI300X (gfx942, DeviceID 0x74A1) | +| 一致性协议 | GPU_VIPER | +| QEMU | 10.0+(vfio-user 后端)或 cosim 分支(legacy 后端) | + +### Docker 镜像 + +| 镜像 | 用途 | +|------|------| +| `ghcr.io/gem5/gpu-fs:latest` | gem5 运行时容器的基础镜像(amd64) | +| `gem5-run:local` | 从 `scripts/Dockerfile.run` 构建的运行时镜像(添加 Python 3.12 支持) | +| `ghcr.io/gem5/ubuntu-24.04_all-dependencies:v24-0` | gem5 编译用(仅 arm64) | + +> 在 amd64 宿主机上,请使用 `ghcr.io/gem5/gpu-fs` 作为编译镜像或原生编译。 + +### 构建产物 + +| 产物 | 路径 | 大小 | +|------|------|------| +| gem5 二进制 | `build/VEGA_X86/gem5.opt` | 约 1.1 GB | +| 磁盘镜像 | `../gem5-resources/src/x86-ubuntu-gpu-ml/disk-image/x86-ubuntu-rocm70` | 约 55 GB | +| 内核 | `../gem5-resources/src/x86-ubuntu-gpu-ml/vmlinux-rocm70` | 约 64 MB | +| QEMU 二进制 | `qemu/build/qemu-system-x86_64` | -- | + +--- + +## 3. IP Block Mask 参考 + +### 检测顺序表 + +`ip_block_mask` 参数使用的是 **检测顺序索引** 作为位位置,而非 `amd_shared.h` 中的 `amd_ip_block_type` 枚举值。枚举值具有误导性 -- 真正起作用的是 IP discovery 过程中各块出现的顺序。 + +MI300X 检测顺序(ROCm 7.0 DKMS,来自 dmesg): + +| 索引 | IP Block | mask 中的位 | 在 0x67 中是否启用? 
| +|------|----------|-------------|----------------------| +| 0 | `soc15_common` | `0x01` | 是 | +| 1 | `gmc_v9_0` | `0x02` | 是 | +| 2 | `vega20_ih` | `0x04` | 是 | +| 3 | `psp` | `0x08` | **否**(禁用) | +| 4 | `smu` | `0x10` | **否**(禁用) | +| 5 | `gfx_v9_4_3` | `0x20` | 是 | +| 6 | `sdma_v4_4_2` | `0x40` | 是 | +| 7 | `vcn_v4_0_3` | `0x80` | 否(非必需) | +| 8 | `jpeg_v4_0_3` | `0x100` | 否(非必需) | + +### 位掩码计算 + +驱动检查 `(amdgpu_ip_block_mask & (1 << i))`,其中 `i` 是检测顺序索引(`amdgpu_device.c:2807`)。 + +``` +0x67 = 0110_0111 (binary) + ||||_|||| + |||| |||+-- bit 0: soc15_common (enabled) + |||| ||+--- bit 1: gmc_v9_0 (enabled) + |||| |+---- bit 2: vega20_ih (enabled) + |||| +----- bit 3: psp (DISABLED) + |||+------- bit 4: smu (DISABLED) + ||+-------- bit 5: gfx_v9_4_3 (enabled) + |+--------- bit 6: sdma_v4_4_2 (enabled) + +---------- bit 7: vcn_v4_0_3 (disabled) +``` + +### 常见掩码值 + +| 掩码 | 二进制 | 启用的 IP 块 | 用途 | +|------|--------|-------------|------| +| `0x67` | `0110_0111` | common、GMC、IH、GFX、SDMA | **cosim(正确值)** | +| `0x6f` | `0110_1111` | common、GMC、IH、PSP、GFX、SDMA | **错误 -- PSP 导致内核 panic** | +| `0xFF` | `1111_1111` | 包含 PSP+SMU 在内的所有块 | 仅限真实硬件 | + +--- + +## 4. 
已知问题与陷阱 + +### 4.1 VGA ROM 空指针崩溃 + +| | | +|---|---| +| **症状** | `modprobe amdgpu` 导致内核空指针崩溃,位于 `amdgpu_atom_parse_data_header+0x1b`。调用链:`amdgpu_ras_init` -> `amdgpu_atomfirmware_mem_ecc_supported` -> `amdgpu_atom_parse_data_header`。RAX=0(NULL `atom_context`) | +| **根因** | amdgpu 驱动的五种 BIOS 发现方法在 cosim 模式下全部失败(详见[第 1.3 节](#13-dd-命令参数vga-rom))。驱动打印 `"Unable to locate a BIOS ROM"` 后继续执行,但 RAS 初始化路径无条件调用 `amdgpu_atom_parse_data_header()` 而不检查 NULL `atom_context`。QEMU 的 `romfile=` 属性无效 -- amdgpu 驱动通过 SMU 寄存器访问 ROM,而非 PCI ROM BAR | +| **修复** | 在 `modprobe` **之前**执行 `dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128`。`cosim-gpu-setup.service` 会自动完成此操作 | + +### 4.2 PSP / SMU 固件加载失败 + +| | | +|---|---| +| **症状** | `PSP load tmr failed!`、`hw_init of IP block failed -22`、`Fatal error during GPU init` | +| **根因** | `ip_block_mask=0x6f` 启用了 PSP(检测顺序索引 3),但 cosim 不模拟 PSP 硬件。`amd_shared.h` 中的 `amd_ip_block_type` 枚举显示 PSP=4,但 mask 使用的是检测顺序,PSP 的索引为 3 | +| **修复** | 使用 `ip_block_mask=0x67` 同时禁用 PSP(bit 3)和 SMU(bit 4)。详见[第 3 节](#3-ip-block-mask-参考) | + +### 4.3 SIGIO 合并导致的死锁(仅 Legacy 后端) + +| | | +|---|---| +| **症状** | 驱动在首次访问 INDEX2/DATA2 寄存器对时挂起。gem5 处理约 15 条消息后停止响应。QEMU socket 缓冲区被填满 | +| **根因** | Linux FASYNC/SIGIO 是边沿触发的。当 QEMU 快速连续发送一个 write 和一个 read 时,两条消息在 gem5 的 SIGIO handler 触发前同时到达。系统只投递一个信号;handler 读取一条消息后第二条永远滞留 | +| **修复** | `MI300XGem5Cosim::handleClientData()` 使用 `do/while` 排空循环配合 `poll(fd, POLLIN, 0)` 读取每次 SIGIO 到来时的所有待处理消息。不适用于 vfio-user 后端(使用 libvfio-user 的非阻塞 poll) | + +### 4.4 协同仿真模式下 GART 表未填充 + +| | | +|---|---| +| **症状** | 大量 `GART translation for X not found` 警告。PM4 读到全零内存(opcode 0x0)。KIQ ring test 超时 | +| **根因** | 在两种后端中,VRAM 均由共享内存(`/dev/shm/mi300x-vram`)支撑。驱动对 VRAM 的写入完全绕过 gem5 的内存系统,因此 `AMDGPUVM::gartTable` 哈希表不会通过 `AMDGPUDevice::writeFrame()` 被填充 | +| **修复** | `GARTTranslationGen::translate()` 中的协同仿真回退机制:当 `gartTable` 未命中时,直接从共享 VRAM 的 `vramShmemPtr + (gartBase - fbBase) + gart_byte_offset` 处读取 PTE。关键细节:`getGARTAddr()` 已将页索引乘以 8,因此 
`bits(vaddr, 63, 12)` 已经是字节偏移 -- 不可再乘以 8 | + +### 4.5 GART 未映射页崩溃 + +| | | +|---|---| +| **症状** | `hipMalloc OK` 后,gem5 段错误并伴随重复的 `GART translation for 0x3fff800000000 not found` 警告。无限 DMA 重试导致内存耗尽 | +| **根因** | GPU PM4/SDMA 引擎尝试 DMA 到驱动尚未映射的 GART 页(PTE=0)。原始代码创建 `GenericPageTableFault`,但 DMA 回调链无限重试同一个失败地址 | +| **修复** | 未映射的 GART 页被映射到 sink(`paddr=0`)。DMA 读操作返回零,写操作被丢弃,仿真保持存活。这是正常现象:`ptStart` 处的第一页本身就是未映射的 | + +### 4.6 SDMA Ring 测试超时 + +| | | +|---|---| +| **症状** | 驱动初始化过程中 SDMA ring 测试返回 `-110`(`-ETIMEDOUT`)。`sdma v4_4_2: ring 0 test failed (-110)` | +| **根因** | `sdma_engine.hh` 中 `sdma_delay` 默认值为 `1e9` ticks。在 cosim 模式下,对应约 500ms 墙钟时间,超过驱动约 200ms 的超时窗口。流程:驱动写入 SDMA ring 并敲 doorbell → gem5 以 `sdma_delay` ticks 延迟调度 SDMA 事件 → 驱动在 gem5 完成前超时 | +| **修复** | 将 `sdma_delay` 从 `1e9` 减小到 `1000` ticks。将 `KEEPALIVE_INTERVAL` 增大到 `1e9` 以避免 keepalive 干扰时序 | + +### 4.7 VRAM 地址 GART 翻译错误 + +| | | +|---|---| +| **症状** | 地址 `0x1f72fa8000` 产生 861,000 多次 GART 翻译错误,内存耗尽,段错误 | +| **根因** | SDMA rptr 回写地址和 PM4 RELEASE_MEM 目标地址可能指向 VRAM(地址 < 16 GiB)。这些地址经过 `getGARTAddr()` 处理时页号会被乘以 8,然后 GART 查找失败,因为 VRAM 没有对应的页表项 | +| **修复** | 三层防护:(1) PM4:`writeData()`、`releaseMem()`、`queryStatus()` 检查 `isVRAMAddress(addr)` 并路由到 `getMemMgr()->writeRequest()`。(2) SDMA:`setGfxRptrLo/Hi()` 和 rptr 回写对 VRAM 地址跳过 `getGARTAddr()`。(3) GART 兜底:检测 VRAM 地址并映射到 sink(`paddr=0`) | + +### 4.8 共享内存文件偏移量不匹配 + +| | | +|---|---| +| **症状** | GART 页表项读出全为零。PM4 opcode 0x0(NOP,count 0)无限重复 | +| **根因** | QEMU Q35 配置 8 GiB RAM 时:`below_4g = 2 GiB`(当 `ram_size >= 0xB0000000` 时硬编码)。gem5 配置为 3 GiB 以下 / 5 GiB 以上。QEMU 将 4G 以上数据放在文件偏移 2 GiB 处;gem5 从偏移 3 GiB 处读取 -- 全为零 | +| **修复** | `mi300_cosim.py` 复刻了 Q35 的拆分逻辑:`below_4g = min(total_mem, 0x80000000 if total_mem >= 0xB0000000 else 0xB0000000)` | + +### 4.9 定时器溢出崩溃 + +| | | +|---|---| +| **症状** | 经过数十亿 tick 后,gem5 因 `curTick()` 整数溢出而崩溃。`schedule()` 断言失败 | +| **根因** | RTC 和 PIT 定时器持续调度事件,在 cosim 的长期运行模式下导致 tick 计数器溢出 | +| **修复** | 为 `Cmos` 添加了 `disable_rtc_events` 参数,为 `I8254` 添加了 
`disable_timer_events` 参数。在 `mi300_cosim.py` 中均设为禁用。cosim 桥接中的 keepalive 事件防止事件队列变空 | + +### 4.10 PM4ReleaseMem.dataSelect Panic + +| | | +|---|---| +| **症状** | gem5 panic,报错 `Unimplemented PM4ReleaseMem.dataSelect` | +| **根因** | `pm4_packet_processor.cc` 仅实现了 `dataSelect == 1`(32 位数据写入)。驱动在 GFX 初始化过程中使用其他模式 | +| **修复** | 添加了所有常见 dataSelect 值:0 = 不写入数据(仅触发事件),1 = 32 位写入(已有),2 = 64 位写入,3 = 64 位 GPU 时钟计数器,其他 = 警告并视为空操作 | + +### 4.11 不支持的 PM4 操作码 + +| | | +|---|---| +| **症状** | gem5 在遇到未识别的 PM4 opcode 时崩溃 | +| **根因** | `ACQUIRE_MEM` (0x58) 和 `SET_RESOURCES` (0xA0) 未被处理 | +| **修复** | 两者均已添加到 `pm4_defines.hh` 并在 `pm4_packet_processor.cc:decodeHeader()` 中作为跳过并继续(NOP)处理 | + +### 4.12 PCI Class Code 不匹配 + +| | | +|---|---| +| **症状** | amdgpu 驱动跳过了 `0xC0000` 处的 legacy VGA ROM 检查 | +| **根因** | PCI class 为 `PCI_CLASS_DISPLAY_OTHER (0x0380)` 而非 `PCI_CLASS_DISPLAY_VGA (0x0300)` | +| **修复** | 改为 `PCI_CLASS_DISPLAY_VGA`。内核随即将该地址范围识别为"带有 shadowed ROM 的视频设备" | + +### 4.13 QEMU 串口控制台冲突 + +| | | +|---|---| +| **症状** | 同时使用 `-serial unix:/tmp/serial.sock -nographic` 时 guest 无串口输出 | +| **根因** | `-nographic` 隐含了 `-serial mon:stdio`,创建映射到 stdio 的 serial0。显式的 `-serial unix:...` 变成 serial1(ttyS1),但内核使用的是 `console=ttyS0` | +| **修复** | 单独使用 `-nographic`。如需程序化访问,在 `screen` 中运行 QEMU | + +### 4.14 gem5 链接时内存不足(OOM) + +| | | +|---|---| +| **症状** | 即使使用 `-j2`,链接器也被 OOM killer 终止 | +| **根因** | 默认链接器占用内存过多 | +| **修复** | 使用 `scons build/VEGA_X86/gem5.opt -j1 GOLD_LINKER=True --linker=gold` | + +### 4.15 DRM Client 错误 -13(缺少 DKMS 模块) + +| | | +|---|---| +| **症状** | `Failed to init DRM client: -13` 后内核 panic。`ttm_resource_move_to_lru_tail` 中空指针崩溃 | +| **根因** | 磁盘镜像缺少 `amddrm_exec.ko.zst` DKMS 模块。缺少此模块时 TTM 内存管理器初始化失败,`drm_dev_enter()` 返回 `-EACCES`(-13) | +| **修复** | 使用最新的 `gem5-resources`(`origin/stable` 分支)重新构建磁盘镜像。用 `guestfish` 确认 `amddrm_exec.ko.zst` 存在于 `/lib/modules/6.8.0-79-generic/updates/dkms/` 中 | + +### 4.16 驱动 hw_init 失败后 rmmod 导致 oops + +| | | +|---|---| +| **症状** | 驱动 `hw_init` 失败后,`rmmod 
amdgpu` 导致 kernel oops(`kgd2kfd_device_exit` 中的 page fault)。模块停留在 "busy" 状态 | +| **根因** | 部分初始化后清理路径不健壮 | +| **修复** | 无法绕过。需重启整个 cosim 环境(杀掉 QEMU,重启 gem5 Docker 容器,重启 QEMU) | + +--- + +## 5. 调试快速参考 + +### gem5 调试标志 + +| 标志组合 | 显示内容 | +|----------|----------| +| `MI300XCosim` | cosim socket/vfio-user 消息 | +| `AMDGPUDevice` | MMIO 寄存器读/写 | +| `PM4PacketProcessor` | PM4 包解码和处理 | +| `SDMAEngine` | SDMA 操作 | +| `AMDGPUDevice,PM4PacketProcessor` | MMIO + PM4(组合) | +| `MI300XCosim,AMDGPUDevice,PM4PacketProcessor` | 完整 cosim 调试 | + +用法: + +```bash +./scripts/cosim_launch.sh --gem5-debug MI300XCosim +# 或手动: +build/VEGA_X86/gem5.opt --debug-flags=MI300XCosim,AMDGPUDevice ... +``` + +### QEMU Trace 事件 + +```bash +./scripts/cosim_launch.sh --qemu-trace 'mi300x_gem5_*' +``` + +### 日志检查命令 + +```bash +# gem5 容器日志(stderr) +docker logs gem5-cosim 2>&1 | tee /tmp/gem5.log + +# 过滤警告/错误 +docker logs gem5-cosim 2>&1 | grep -E "warn|error|GART" + +# Guest dmesg(通过 screen) +screen -S qemu-cosim -X stuff 'dmesg | tail -20\n' + +# Guest 串口输出(独立仿真) +tail -f m5out/board.pc.com_1.device +``` + +### Socket 测试 + +```bash +python3 scripts/cosim_test_client.py /tmp/gem5-mi300x.sock +``` + +### 增量重建 + +```bash +# 删除过期的目标文件,然后重建 +docker run --rm -v "$PWD:/gem5" -w /gem5 gem5-run:local \ + sh -c 'rm -f build/VEGA_X86/dev/amdgpu/*.o' +docker run --rm -v "$PWD:/gem5" -w /gem5 \ + gem5-run:local scons build/VEGA_X86/gem5.opt -j1 +``` + +### 快速诊断表 + +| 症状 | 首先检查 | +|------|----------| +| gem5 容器启动后立即退出 | `docker logs gem5-cosim` | +| QEMU 连接失败 | gem5 是否就绪?(socket `chmod 777` 了吗?) | +| `psp_gpu_reset` 空指针崩溃 | `ip_block_mask` 错误(应使用 `0x67`) | +| GART translation not found | 是否使用了最新编译的 gem5 二进制? 
| +| SDMA ring test -110 | 检查 `sdma_delay` 是否为 `1000` | +| hipcc "cannot find ROCm device library" | `ls /opt/rocm/lib/`,使用 `--offload-arch=gfx942` | +| MMIO 读取全部返回零 | gem5 未连接或已崩溃 | +| `insmod: ERROR: could not load module` | 内核版本不匹配 | +| `cosim-gpu-setup.service` 失败 | `journalctl -u cosim-gpu-setup` | +| BAR 布局 probe 错误 -12 | 使用正确的 BAR5=MMIO 布局重建 QEMU | + +--- + +## 6. GART 表格式与 PTE 布局 + +GPU 地址空间和转换流程的概念性说明请参阅[架构文档 §5](architecture.md#gpu-地址转换与-gart)。 + +### GART PTE 格式 + +每个 GART 页表项为 8 字节: + +| 位域 | 字段 | 描述 | +|------|-------|------| +| 0 | Valid | 条目有效 | +| 1 | System | 1 = 系统内存,0 = 本地 VRAM | +| 5:2 | Fragment | 页面片段大小 | +| 47:12 | Physical Page | 物理地址 >> 12 | +| 51:48 | Block Fragment | 块片段大小 | +| 63:52 | Flags | MTYPE、PRT 等 | + +**物理地址提取**:`paddr = (bits(PTE, 47, 12) << 12) | page_offset` + +### Aperture 寄存器 + +| 寄存器 | gem5 字段 | 格式 | 描述 | +|--------|-----------|------|------| +| `MC_VM_FB_LOCATION_BASE` | `vmContext0.fbBase` | `bits[23:0] << 24` | MC 地址空间中 VRAM 的起始地址 | +| `MC_VM_FB_LOCATION_TOP` | `vmContext0.fbTop` | `bits[23:0] << 24 \| 0xFFFFFF` | VRAM 结束地址 | +| `MC_VM_FB_OFFSET` | `vmContext0.fbOffset` | `bits[23:0] << 24` | FB 重定位偏移量 | +| `MC_VM_AGP_BASE` | `vmContext0.agpBase` | `bits[23:0] << 24` | AGP 重映射基地址 | +| `MC_VM_AGP_BOT` | `vmContext0.agpBot` | `bits[23:0] << 24` | AGP aperture 底部 | +| `MC_VM_AGP_TOP` | `vmContext0.agpTop` | `bits[23:0] << 24 \| 0xFFFFFF` | AGP aperture 顶部 | +| `MC_VM_SYSTEM_APERTURE_LOW_ADDR` | `vmContext0.sysAddrL` | `bits[29:0] << 18` | System aperture 低地址 | +| `MC_VM_SYSTEM_APERTURE_HIGH_ADDR` | `vmContext0.sysAddrH` | `bits[29:0] << 18` | System aperture 高地址 | +| `VM_CONTEXT0_PAGE_TABLE_BASE_ADDR` | `vmContext0.ptBase` | raw 64-bit | GART 表在 VRAM 中的位置 | +| `VM_CONTEXT0_PAGE_TABLE_START_ADDR` | `vmContext0.ptStart` | raw 64-bit | GART aperture 起始地址(页号) | +| `VM_CONTEXT0_PAGE_TABLE_END_ADDR` | `vmContext0.ptEnd` | raw 64-bit | GART aperture 结束地址(页号) | + +### 协同仿真中的典型值 + +``` +ptBase = 0x3EE600000 GART table at VRAM offset 
~15.7 GiB +ptStart = 0x7FFF00000 GART covers GPU VAs from 0x7FFF00000000 +ptEnd = 0x7FFF1FFFF GART covers ~128K pages (512 MiB) +fbBase = 0x8000000000 VRAM starts at MC address 512 GiB +fbTop = 0x8400FFFFFF VRAM ends at ~528 GiB (16 GiB range) +sysAddrL = 0x0 System aperture start +sysAddrH = 0x3FFEC0000 System aperture end (~4 TiB) +``` + +### GART 表在 VRAM 中的布局 + +``` +VRAM offset = ptBase (gartBase) ++-------------------+ ptBase + 0 +| PTE[0] (8 bytes) | maps page ptStart ++-------------------+ ptBase + 8 +| PTE[1] | maps page ptStart + 1 ++-------------------+ ptBase + 16 +| PTE[2] | maps page ptStart + 2 +| ... | ++-------------------+ +| PTE[N] | maps page ptStart + N ++-------------------+ ptBase + (ptEnd - ptStart + 1) * 8 +``` + +### 协同仿真 PTE 回退查找 + +在 cosim 模式下,`gartTable` 为空(VRAM 写入绕过 gem5)。回退机制直接从共享 VRAM 读取 PTE: + +```cpp +Addr pte_table_offset = gart_addr - (ptStart * 8); +Addr pte_vram_offset = gartBase() + pte_table_offset; +memcpy(&pte, vramShmemPtr + pte_vram_offset, sizeof(pte)); +``` + +若 PTE 为 0(未映射),则映射到 sink(`paddr=0`)而非产生 fault。 + +--- + +## 7. 国内镜像配置 + +在国内构建磁盘镜像时,VM 内的 `apt` 会从 `us.archive.ubuntu.com` 拉包,常因网络波动挂住(Packer 报 `Timeout waiting for SSH`,或 provisioner 在安装 ROCm 时退出)。 + +### 应用补丁 + +```bash +cd gem5-resources +git apply ../scripts/patches/0001-user-data-cn-mirror.patch +``` + +### 回滚补丁 + +```bash +cd gem5-resources +git apply -R ../scripts/patches/0001-user-data-cn-mirror.patch +``` + +如需使用其他镜像源,修改 patch 文件中的 URI 后重新 apply。 + +--- + +## 8. 
文件参考
+
+### gem5 源文件(`src/dev/amdgpu/`)
+
+| 文件 | 用途 |
+|------|------|
+| `mi300x_vfio_user.{cc,hh}` | vfio-user 服务端 SimObject(**默认后端**) |
+| `MI300XVfioUser.py` | SimObject Python 封装(vfio-user) |
+| `cosim_bridge.hh` | 抽象 CosimBridge 接口(两种后端均实现此接口) |
+| `mi300x_gem5_cosim.{cc,hh}` | Legacy socket 桥接 SimObject |
+| `MI300XGem5Cosim.py` | SimObject Python 封装(legacy) |
+| `amdgpu_device.cc` | GPU 设备模型核心,`readROM()`、`intrPost()`、`writeFrame()` |
+| `amdgpu_vm.{cc,hh}` | 所有转换生成器(GART、AGP、MMHUB、User),cosim VRAM 回退 |
+| `pm4_packet_processor.{cc,hh}` | PM4 包解码、DMA 路由、VRAM 写路由、`isVRAMAddress()` |
+| `pm4_defines.hh` | PM4 操作码,包括 `IT_ACQUIRE_MEM`、`IT_SET_RESOURCES` |
+| `sdma_engine.{cc,hh}` | SDMA 操作、rptr 回写路由、`sdma_delay` 参数 |
+| `interrupt_handler.cc` | IH ring buffer DMA 和 MSI-X 中断发送 |
+| `amdgpu_nbio.cc` | ASIC 初始化完成寄存器 |
+
+### gem5 配置和脚本
+
+| 文件 | 用途 |
+|------|------|
+| `configs/example/gpufs/mi300_cosim.py` | cosim 系统配置(`--cosim-backend=vfio-user\|legacy`) |
+| `configs/example/gem5_library/x86-mi300x-gpu.py` | 独立 stdlib 仿真配置 |
+| `configs/example/gpufs/mi300.py` | Legacy 独立仿真配置 |
+| `scripts/cosim_launch.sh` | cosim 编排(Docker + QEMU 启动) |
+| `scripts/run_mi300x_fs.sh` | 构建编排(编译、磁盘镜像、运行) |
+| `scripts/Dockerfile.run` | 运行时 Docker 镜像定义 |
+| `scripts/cosim_test_client.py` | Socket 连通性测试工具 |
+| `scripts/patches/0001-user-data-cn-mirror.patch` | 磁盘镜像构建的国内镜像补丁 |
+
+### gem5 修改的基础设施文件
+
+| 文件 | 变更内容 |
+|------|----------|
+| `src/dev/intel_8254_timer.{cc,hh}` | `disable_timer_events` 参数(cosim 定时器溢出修复) |
+| `src/dev/mc146818.{cc,hh}` | `disable_rtc_events` 参数(cosim 定时器溢出修复) |
+
+### gem5 Python 组件
+
+| 文件 | 用途 |
+|------|------|
+| `src/python/gem5/prebuilt/viper/board.py` | ViperBoard:readfile 注入、驱动加载 |
+| `src/python/gem5/components/devices/gpus/amdgpu.py` | MI300X 设备定义 |
+
+### QEMU 文件(仅 Legacy 后端)
+
+| 文件 | 用途 |
+|------|------|
+| `qemu/hw/misc/mi300x_gem5.c` | 带 socket 桥接的 MI300X PCI 设备 |
+| `qemu/hw/misc/mi300x_gem5.h` | 头文件 |
+| `qemu/hw/misc/trace-events` | trace 事件定义 |
+
+> vfio-user 后端使用 QEMU 内建的 `vfio-user-pci` 设备,不需要任何自定义 QEMU 代码。
+
+### 外部依赖
+
+| 路径 | 用途 |
+|------|------|
+| `ext/libvfio-user/` | libvfio-user 库(git 子模块,vfio-user 后端) |
+
+### Guest 磁盘镜像内容
+
+| 文件(Guest 内部) | 用途 |
+|--------------------|------|
+| `/root/roms/mi300.rom` | VGA BIOS ROM 二进制 |
+| `/usr/lib/firmware/amdgpu/mi300_discovery` | IP discovery 固件 |
+| `/etc/systemd/system/cosim-gpu-setup.service` | 自动加载服务单元 |
+| `/usr/local/bin/cosim-gpu-setup.sh` | 自动加载脚本 |
+| `/lib/modules/$(uname -r)/updates/dkms/amdgpu.ko.zst` | amdgpu 内核模块(ROCm 7.0 DKMS) |
+| `/home/gem5/load_amdgpu.sh` | 驱动加载脚本(独立仿真) |
+| `/sbin/m5` | gem5 伪指令工具 |
+
+### PCI BAR 布局
+
+| BAR | 资源 | 类型 | 大小 |
+|-----|------|------|------|
+| BAR0+1 | VRAM | 64-bit prefetchable | 16 GiB(共享内存) |
+| BAR2+3 | Doorbell | 64-bit | 4 MiB |
+| BAR4 | MSI-X | exclusive | -- |
+| BAR5 | MMIO 寄存器 | 32-bit | 512 KiB(转发到 gem5) |
+
+驱动常量:`AMDGPU_VRAM_BAR=0`、`AMDGPU_DOORBELL_BAR=2`、`AMDGPU_MMIO_BAR=5`。
+
+### 资源路由(两种后端通用)
+
+| 资源 | 通过 Socket/vfio-user? | 通过共享内存? |
+|------|------------------------|---------------|
+| MMIO 寄存器(BAR5) | 是 | 否 |
+| VRAM(BAR0,16 GiB) | **否** | 是(`/dev/shm/mi300x-vram`) |
+| Doorbell(BAR2) | 是 | 否 |
+
+任何通过拦截 VRAM 写入来填充的 gem5 数据结构(如 `gartTable`、页表、ring buffer)在 cosim 模式下都**不会**被填充,需要显式的共享 VRAM 回退机制。
diff --git a/docs/zh/xgmi-model.md b/docs/zh/xgmi-model.md
deleted file mode 100644
index 7a31990..0000000
--- a/docs/zh/xgmi-model.md
+++ /dev/null
@@ -1,77 +0,0 @@
-[English](../en/xgmi-model.md)
-
-# xGMI 互连模型设计
-
-## 概述
-
-xGMI(芯片间全局内存互连)模型提供 cosim-gpu 多 GPU hive 中的 GPU 间通信。
-它挂载在每个 GPU 的 L2 缓存(TCC)出口端口上,将远程 VRAM 访问通过可配置的
-带宽、延迟和拓扑的 xGMI 链路模型进行路由。
-
-## 数据包格式
-
-| 字段 | 类型 | 描述 |
-|----------|--------|------------------------------------|
-| src_gpu | uint8 | 源 GPU ID |
-| dst_gpu | uint8 | 目标 GPU ID |
-| addr | uint64 | 目标 VRAM 地址 |
-| size | uint32 | 负载大小(字节) |
-| payload | bytes | 数据(写操作时) |
-
-## 地址映射
-
-每个 GPU 拥有连续的 VRAM 地址范围:
-
-```
-GPU 0: [0, vram_size)
-GPU 1: [vram_size, 2 * vram_size)
-GPU N: [N * vram_size, (N+1) * vram_size)
-```
-
-桥接器通过检查地址落入哪个 GPU 的范围来判断本地或远程访问。
-
-## 拓扑配置
-
-启动参数 `--xgmi-topology`:
-
-- **mesh**:每个 GPU 与所有其他 GPU 直连。8 GPU mesh 创建 28 条双向链路。
-- **ring**:每个 GPU 连接其两个邻居。链路数更少但非相邻 GPU 需多跳。
-
-## 链路参数
-
-| 参数 | 默认值 | CLI 标志 |
-|----------------|----------|---------------------|
-| 每链路带宽 | 128 GB/s | `--xgmi-bandwidth` |
-| 每跳延迟 | 100 ns | `--xgmi-latency` |
-| 每链路通道数 | 16 | (SimObject 参数) |
-| 每 GPU 最大链路 | 7 | (SimObject 参数) |
-| 流控信用 | 32 | (SimObject 参数) |
-
-## 流量控制
-
-基于信用的背压机制防止数据丢失:
-
-1. 每条链路初始 N 个信用(默认 32)。
-2. 发送一个数据包消耗一个信用。
-3. 接收方在接受数据包后归还信用。
-4. 信用归零时发送方阻塞(永不丢弃)。
-
-## 架构阶段
-
-### Path A(里程碑 1-3):自建 xGMI 模型
-
-- 单进程多 GPU(里程碑 1-2):进程内函数调用
-- 多进程 8-GPU hive(里程碑 3):通过共享内存环形缓冲区或 Unix socket 的 IPC 传输
-
-### Path B(里程碑 4-5):SST Merlin 集成
-
-- 用 SST Merlin 网络引擎替换 xGMI 传输
-- 三层同步:QEMU(功能仿真)↔ gem5(GPU 时序)↔ SST(网络时序)
-- 支持任意拓扑(fat-tree、dragonfly)
-
-## 关键源文件
-
-- `gem5/src/dev/amdgpu/XGMIBridge.py` — SimObject 定义
-- `gem5/src/dev/amdgpu/xgmi_bridge.hh` — C++ 头文件
-- `gem5/src/dev/amdgpu/xgmi_bridge.cc` — C++ 实现
-- `gem5/configs/example/gpufs/mi300_cosim.py` — 配置和连线
From 22eb4a08344a4a7a985af8583a18d4d06f06c576 Mon Sep 17 00:00:00 2001
From: Chao Liu
Date: Wed, 29 Apr 2026 15:47:00 +0800
Subject: [PATCH 2/2] fix(docs): correct manual build path and align ASCII diagrams

- Fix manual disk-image build path: cd gem5-resources/ (not ../gem5-resources/)
- Align component diagram box borders in architecture.md (en+zh)
- Replace getting-started overview diagram with README's proven layout

Signed-off-by: Chao Liu
---
 docs/en/architecture.md    |  8 ++++----
 docs/en/getting-started.md | 35 +++++++++++++++++------------------
 docs/zh/architecture.md    |  8 ++++----
 docs/zh/getting-started.md | 35 +++++++++++++++++------------------
 4 files changed, 42 insertions(+), 44 deletions(-)

diff --git a/docs/en/architecture.md b/docs/en/architecture.md
index 3fe617b..5b46db5 100644
--- a/docs/en/architecture.md
+++ b/docs/en/architecture.md
@@ -70,18 +70,18 @@ The co-simulation system splits GPU workload execution across two processes: QEM

 ```
 +--------------------------------------+
-| QEMU (Q35 + KVM) |
+| QEMU (Q35 + KVM) |
 | +--------------------------------+ |
 | | Guest Linux (Ubuntu 24) | |
 | | amdgpu driver (ROCm 7) | |
 | | ROCm userspace | |
 | +--------------+-----------------+ |
-| | MMIO / Doorbell |
+| | MMIO / Doorbell |
 | +--------------v-----------------+ |
 | | vfio-user-pci | |
 | | (QEMU built-in device) | |
 | +--------------+-----------------+ |
-| | vfio-user protocol |
+| | vfio-user protocol |
 +-----------------+--------------------+
 | /tmp/gem5-mi300x.sock
 | (Unix socket)
@@ -92,7 +92,7 @@ The co-simulation system splits GPU workload execution across two processes: QEM
 | | (mi300x_vfio_user.cc) | |
 | | [libvfio-user server] | |
 | +--------------+-----------------+ |
-| | AMDGPUDevice API |
+| | AMDGPUDevice API |
 | +--------------v-----------------+ |
 | | AMDGPUDevice | |
 | | PM4PacketProcessor | |
diff --git a/docs/en/getting-started.md b/docs/en/getting-started.md
index 9a98e02..8709b31 100644
--- a/docs/en/getting-started.md
+++ b/docs/en/getting-started.md
@@ -8,23 +8,22 @@ From building the components to running your first HIP GPU compute test.
 ## Overview

 ```
-+---------------------------------+ +------------------------------+
-| QEMU (Q35 + KVM) | | gem5 (inside Docker) |
-| +---------------------------+ | | +------------------------+ |
-| | Guest Linux (Ubuntu 24.04)| | | | MI300X GPU Model | |
-| | amdgpu driver | | | | - Shader + CU | |
-| | ROCm 7.0 / HIP runtime | | | | - PM4 / SDMA Engines | |
-| +-----------+---------------+ | | | - Ruby Cache Hierarchy | |
-| | MMIO/Doorbell | | +----------+-------------+ |
-| +-----------v---------------+ | | +----------v-------------+ |
-| | vfio-user-pci (built-in) |<--------->| MI300XVfioUser Server | |
-| +---------------------------+ |vfio-| +------------------------+ |
-| |user | |
-+---------------------------------+ +------------------------------+
- | |
- v v
- /dev/shm/cosim-guest-ram /dev/shm/mi300x-vram
- (Guest Physical Memory, Shared) (GPU VRAM, Shared)
++-----------------------------+ +----------------------------+
+| QEMU (Q35 + KVM) | | gem5 (Docker) |
+| +-----------------------+ | | +----------------------+ |
+| | Guest Linux | | | | MI300X GPU Model | |
+| | amdgpu driver | | | | Shader / CU / SDMA | |
+| | ROCm 7.0 / HIP | | | | PM4 / Ruby caches | |
+| +----------+------------+ | | +---------+------------+ |
+| +----------v------------+ | | +---------v------------+ |
+| | vfio-user-pci |<-------->| | MI300XVfioUser | |
+| | (QEMU built-in) | |vfio- | | (libvfio-user) | |
+| +-----------------------+ |user | +----------------------+ |
++-----------------------------+ +----------------------------+
+ | |
+ v v
+ /dev/shm/cosim-guest-ram /dev/shm/mi300x-vram
+ (shared guest RAM) (shared GPU VRAM)
 ```

 - **QEMU** handles CPU execution, Linux kernel boot, PCIe enumeration, and amdgpu driver loading.
@@ -121,7 +120,7 @@ If `gem5-resources` does not exist, it will be cloned automatically before the b
 ### Manual Build

 ```bash
-cd ../gem5-resources/src/x86-ubuntu-gpu-ml
+cd gem5-resources/src/x86-ubuntu-gpu-ml
 ./build.sh -var "qemu_path=/usr/sbin/qemu-system-x86_64"
 ```

diff --git a/docs/zh/architecture.md b/docs/zh/architecture.md
index eefa9f9..0b48705 100644
--- a/docs/zh/architecture.md
+++ b/docs/zh/architecture.md
@@ -70,18 +70,18 @@

 ```
 +--------------------------------------+
-| QEMU (Q35 + KVM) |
+| QEMU (Q35 + KVM) |
 | +--------------------------------+ |
 | | Guest Linux (Ubuntu 24) | |
 | | amdgpu driver (ROCm 7) | |
 | | ROCm userspace | |
 | +--------------+-----------------+ |
-| | MMIO / Doorbell |
+| | MMIO / Doorbell |
 | +--------------v-----------------+ |
 | | vfio-user-pci | |
 | | (QEMU built-in device) | |
 | +--------------+-----------------+ |
-| | vfio-user protocol |
+| | vfio-user protocol |
 +-----------------+--------------------+
 | /tmp/gem5-mi300x.sock
 | (Unix socket)
@@ -92,7 +92,7 @@
 | | (mi300x_vfio_user.cc) | |
 | | [libvfio-user server] | |
 | +--------------+-----------------+ |
-| | AMDGPUDevice API |
+| | AMDGPUDevice API |
 | +--------------v-----------------+ |
 | | AMDGPUDevice | |
 | | PM4PacketProcessor | |
diff --git a/docs/zh/getting-started.md b/docs/zh/getting-started.md
index 17d7d89..8f3f43a 100644
--- a/docs/zh/getting-started.md
+++ b/docs/zh/getting-started.md
@@ -8,23 +8,22 @@ QEMU + gem5 MI300X 联合仿真项目的快速入门指南。
 ## 架构概述

 ```
-+---------------------------------+ +------------------------------+
-| QEMU (Q35 + KVM) | | gem5 (Docker 容器内) |
-| +---------------------------+ | | +------------------------+ |
-| | Guest Linux (Ubuntu 24.04)| | | | MI300X GPU 模型 | |
-| | amdgpu 驱动 | | | | - Shader + CU | |
-| | ROCm 7.0 / HIP 运行时 | | | | - PM4 / SDMA 引擎 | |
-| +-----------+---------------+ | | | - Ruby 缓存层次 | |
-| | MMIO/Doorbell | | +----------+-------------+ |
-| +-----------v---------------+ | | +----------v-------------+ |
-| | vfio-user-pci (built-in) |<--------->| MI300XVfioUser Server | |
-| +---------------------------+ |vfio-| +------------------------+ |
-| |user | |
-+---------------------------------+ +------------------------------+
- | |
- v v
- /dev/shm/cosim-guest-ram /dev/shm/mi300x-vram
- (Guest 物理内存, 共享) (GPU VRAM, 共享)
++-----------------------------+ +----------------------------+
+| QEMU (Q35 + KVM) | | gem5 (Docker) |
+| +-----------------------+ | | +----------------------+ |
+| | Guest Linux | | | | MI300X GPU Model | |
+| | amdgpu driver | | | | Shader / CU / SDMA | |
+| | ROCm 7.0 / HIP | | | | PM4 / Ruby caches | |
+| +----------+------------+ | | +---------+------------+ |
+| +----------v------------+ | | +---------v------------+ |
+| | vfio-user-pci |<-------->| | MI300XVfioUser | |
+| | (QEMU built-in) | |vfio- | | (libvfio-user) | |
+| +-----------------------+ |user | +----------------------+ |
++-----------------------------+ +----------------------------+
+ | |
+ v v
+ /dev/shm/cosim-guest-ram /dev/shm/mi300x-vram
+ (shared guest RAM) (shared GPU VRAM)
 ```

 - **QEMU** 负责:CPU 执行、Linux 内核引导、PCIe 枚举、amdgpu 驱动加载
@@ -121,7 +120,7 @@ make -j$(nproc)
 ### 手动构建

 ```bash
-cd ../gem5-resources/src/x86-ubuntu-gpu-ml
+cd gem5-resources/src/x86-ubuntu-gpu-ml
 ./build.sh -var "qemu_path=/usr/sbin/qemu-system-x86_64"
 ```