Merged
4 changes: 2 additions & 2 deletions .claude/commands/cosim-debug.md
@@ -87,7 +87,7 @@ Without ROM in shared memory, `atom_context` is NULL.
**Fix**:
```bash
dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128
-modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0
+modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2
```

**Why it works**: The `dd` writes ROM to guest physical memory at 0xC0000, which maps
@@ -133,7 +133,7 @@ modules (returns exit 0 without loading).
**Fix**: The setup script must remove the runtime blacklist before calling modprobe:
```bash
rm -f /run/modprobe.d/*blacklist* 2>/dev/null
-modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0
+modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2
```
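
Because `modprobe` can exit 0 without loading anything while a runtime blacklist is in effect, the exit code alone does not prove the driver is present. A minimal sketch of an explicit check follows, assuming the usual Linux behavior that a loaded module appears as a directory under `/sys/module`. The helper name `module_is_loaded` and its optional sysfs-root argument are hypothetical and not part of this repository.

```bash
#!/bin/bash
# Sketch only: module_is_loaded is a hypothetical helper, not part of the
# repository. The optional second argument (sysfs root, default /sys) is an
# assumption added so the check can be pointed at a scratch directory.
module_is_loaded() {
    local name="$1"
    local sysfs_root="${2:-/sys}"
    # A loaded module is exposed as a directory under <sysfs>/module/.
    [ -d "${sysfs_root}/module/${name}" ]
}

# Intended use in the guest, right after the modprobe call:
#   module_is_loaded amdgpu || echo "amdgpu not loaded despite exit 0" >&2
```

Checking sysfs rather than parsing `lsmod` output keeps the test to a single directory lookup and avoids depending on column layout.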

## Debugging Tips
2 changes: 1 addition & 1 deletion .claude/commands/disk-image-edit.md
@@ -85,7 +85,7 @@ cp /path/to/local/file "$MOUNTPOINT/root/file"
```bash
# Create modprobe config
cat > "$MOUNTPOINT/etc/modprobe.d/amdgpu-cosim.conf" << 'EOF'
-options amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0
+options amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2
EOF
```

14 changes: 13 additions & 1 deletion .github/workflows/ci.yml
@@ -18,7 +18,19 @@ jobs:
     steps:
       - uses: actions/checkout@v4
       - name: Run ShellCheck
-        run: shellcheck scripts/*.sh tests/run_tests.sh
+        run: shellcheck scripts/*.sh tests/run_tests.sh tests/test_modprobe_params.sh

+  modprobe-params:
+    name: Modprobe Params (Regression #9)
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Init gem5-resources submodule only
+        run: |
+          git config --global url."https://github.com/".insteadOf "git@github.com:"
+          git submodule update --init --depth 1 gem5-resources
+      - name: Verify cosim modprobe parameters
+        run: bash tests/test_modprobe_params.sh
+
   python-lint:
     name: Python Lint
4 changes: 2 additions & 2 deletions CLAUDE.md
@@ -30,14 +30,14 @@ Use `-j1` for gem5 linking if OOM-killed.
```

After guest boots: driver auto-loads via `cosim-gpu-setup.service` (dd ROM + modprobe).
-Manual: `dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128 && modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0`
+Manual: `dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128 && modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2`

## Architecture

QEMU (Q35+KVM) ←Unix socket→ gem5 (MI300X GPU model, no kernel).
Shared memory: `/dev/shm/cosim-guest-ram` (guest RAM) + `/dev/shm/mi300x-vram` (VRAM).
BAR layout: 0+1=VRAM, 2+3=Doorbell, 4=MSI-X, 5=MMIO.
-Driver params: `ip_block_mask=0x67` (disable PSP+SMU), `discovery=2` (firmware).
+Driver params: `ip_block_mask=0x67` (disable PSP+SMU), `ppfeaturemask=0 dpm=0 audio=0` (disable power-play/DPM/audio), `discovery=2` (firmware).

## Debugging

2 changes: 1 addition & 1 deletion README.md
@@ -107,7 +107,7 @@ rocminfo # should show gfx942

# Manual setup (if the systemd service is not installed):
dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128
-modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0
+modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2

# Run a HIP test
cat > /tmp/test.cpp << 'EOF'
2 changes: 1 addition & 1 deletion README.zh.md
@@ -106,7 +106,7 @@ rocminfo # 应显示 gfx942

# 手动加载(如果 systemd 服务未安装):
dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128
-modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0
+modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2

# 运行 HIP 测试
cat > /tmp/test.cpp << 'EOF'
2 changes: 1 addition & 1 deletion docs/en/cosim-debugging-pitfalls.md
@@ -80,7 +80,7 @@ The driver logs `"Unable to locate a BIOS ROM"` and `"VBIOS image optional, proc

```bash
dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128
-modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0
+modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2
```

The ROM data at `0xC0000` is accessible by gem5 via `/dev/shm/cosim-guest-ram`. When the driver reads the ROM via SMU MMIO registers, gem5's `AMDGPUDevice::readROM()` reads from `system->getPhysMem()` at `VGA_ROM_DEFAULT + offset` and returns the ROM content through the cosim socket.
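
The link between the `dd` arguments and the `0xC0000` address is plain block arithmetic: `seek` and `count` are measured in units of `bs`, and GNU `dd` treats `1k` as 1024 bytes. A small sketch that derives the written range; `dd_target_range` is a hypothetical helper introduced only for this illustration:

```bash
#!/bin/bash
# Sketch only: dd_target_range is a hypothetical helper, not part of the
# repository. It computes the byte range that
# "dd bs=<bs> seek=<seek> count=<count>" writes in the output file.
dd_target_range() {
    local bs_bytes="$1" seek_blocks="$2" count_blocks="$3"
    local start=$(( bs_bytes * seek_blocks ))
    local length=$(( bs_bytes * count_blocks ))
    printf 'start=0x%X length=%d end=0x%X\n' \
        "$start" "$length" $(( start + length - 1 ))
}

# bs=1k seek=768 count=128, as in the dd command above
dd_target_range 1024 768 128
# prints: start=0xC0000 length=131072 end=0xDFFFF
```

So the command writes 128 KiB starting at guest physical address `0xC0000`, matching the address named in the paragraph above.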
9 changes: 6 additions & 3 deletions docs/en/cosim-guest-gpu-init.md
@@ -14,7 +14,7 @@ The disk image ships with `cosim-gpu-setup.service`, which runs at boot and perf

1. `dd` the VGA ROM to `0xC0000` (required for gem5's `readROM()` via shared memory)
2. Symlink IP discovery firmware
-3. `modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0`
+3. `modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2`

The service completes in ~40 seconds. After guest login, GPU is ready:

@@ -58,7 +58,7 @@ If the systemd service is not installed, run these commands manually after guest
```bash
dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128
ln -sf /usr/lib/firmware/amdgpu/mi300_discovery /usr/lib/firmware/amdgpu/ip_discovery.bin
-modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0
+modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2
```

## Detailed Steps
@@ -98,7 +98,7 @@ ln -sf /usr/lib/firmware/amdgpu/mi300_discovery \
### Step 3: Load the amdgpu Kernel Module

```bash
-modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0
+modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2
```

**What it does**: Loads the amdgpu driver with co-simulation parameters.
@@ -108,6 +108,9 @@ modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0
| Parameter | Value | Meaning |
|-----------|-------|---------|
| `ip_block_mask` | `0x67` | Disable PSP (bit 3) and SMU (bit 4); cosim does not model these |
+| `ppfeaturemask` | `0` | Disable PowerPlay features; cosim has no power management hardware |
+| `dpm` | `0` | Disable Dynamic Power Management |
+| `audio` | `0` | Disable audio; no HDMI/DP audio in cosim |
| `ras_enable` | `0` | Disable RAS — prevents NULL deref on `atom_context` when VBIOS is minimal |
| `discovery` | `2` | Use firmware file for IP discovery |
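
To see why `0x67` corresponds to "bit 3 and bit 4 disabled", it helps to list the cleared bit positions of the mask. The sketch below does that for `0x67` and, for comparison, for the `0x6f` value that this pull request attributes to standalone gem5. `cleared_bits` is a hypothetical helper, and the 7-bit width is an assumption chosen only to cover the highest set bit of `0x67`; which IP block each position refers to is taken from the table above, not derived here.

```bash
#!/bin/bash
# Sketch only: cleared_bits is a hypothetical helper, not part of the
# repository. It prints the positions (bit 0 = least significant) that are
# zero in the given mask, looking at the lowest <width> bits.
cleared_bits() {
    local mask=$(( $1 ))
    local width="$2"
    local bit
    local out=()
    for (( bit = 0; bit < width; bit++ )); do
        if (( ((mask >> bit) & 1) == 0 )); then
            out+=("$bit")
        fi
    done
    echo "${out[*]}"
}

cleared_bits 0x67 7   # prints: 3 4
cleared_bits 0x6f 7   # prints: 4
```

The second call shows that `0x6f` differs from `0x67` only in bit 3, which the table identifies as PSP.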

4 changes: 2 additions & 2 deletions docs/en/cosim-technical-notes.md
@@ -226,7 +226,7 @@ scons build/VEGA_X86/gem5.opt -j1 GOLD_LINKER=True --linker=gold
- QEMU -> guest IH handler
- **GART translation**: co-simulation fallback reads PTEs from shared VRAM; unmapped pages safely routed to sink
- **65,000+ MMIO operations** handled without crashes
-- **Disk image**: `cosim-gpu-setup.service` auto-loads driver at boot (dd ROM → modprobe with `ip_block_mask=0x67 discovery=2 ras_enable=0`)
+- **Disk image**: `cosim-gpu-setup.service` auto-loads driver at boot (dd ROM → modprobe with `ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2`)

### Known Limitations

@@ -334,7 +334,7 @@ screen -dmS qemu-cosim -L -Logfile /tmp/qemu-cosim-screen.log \

# 4. Manual GPU setup (if cosim-gpu-setup.service is not installed)
screen -S qemu-cosim -X stuff 'dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128\n'
-screen -S qemu-cosim -X stuff 'modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0\n'
+screen -S qemu-cosim -X stuff 'modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2\n'
```

## 6. Debugging Tips
4 changes: 2 additions & 2 deletions docs/en/cosim-usage-guide.md
@@ -326,7 +326,7 @@ The disk image includes `cosim-gpu-setup.service` which runs at boot:

1. Writes VGA ROM to `0xC0000` via `dd` (required for gem5 `readROM()`)
2. Symlinks IP discovery firmware
-3. Runs `modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0`
+3. Runs `modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2`

The service completes in ~40 seconds. After login, verify with `rocm-smi`.

@@ -341,7 +341,7 @@ ln -sf /usr/lib/firmware/amdgpu/mi300_discovery \
/usr/lib/firmware/amdgpu/ip_discovery.bin

# 3. Load the amdgpu driver
-modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0
+modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2
```

> **Key parameter notes:**
2 changes: 1 addition & 1 deletion docs/zh/cosim-debugging-pitfalls.md
@@ -80,7 +80,7 @@ MI300X 检测顺序(来自 dmesg):

```bash
dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128
-modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0
+modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2
```

`0xC0000` 处的 ROM 数据通过 `/dev/shm/cosim-guest-ram` 可被 gem5 访问。当驱动通过 SMU MMIO 寄存器读取 ROM 时,gem5 的 `AMDGPUDevice::readROM()` 从 `system->getPhysMem()` 的 `VGA_ROM_DEFAULT + offset` 处读取,通过 cosim socket 返回 ROM 内容。
9 changes: 6 additions & 3 deletions docs/zh/cosim-guest-gpu-init.md
@@ -14,7 +14,7 @@ MI300X GPU 驱动可在 QEMU 客户机启动后**自动**或**手动**加载。

1. `dd` 写入 VGA ROM 到 `0xC0000`(gem5 通过共享内存的 `readROM()` 需要此数据)
2. 链接 IP discovery 固件
-3. `modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0`
+3. `modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2`

服务约 40 秒完成。登录后 GPU 即可使用:

@@ -58,7 +58,7 @@ WantedBy=multi-user.target
```bash
dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128
ln -sf /usr/lib/firmware/amdgpu/mi300_discovery /usr/lib/firmware/amdgpu/ip_discovery.bin
-modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0
+modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2
```

## 详细步骤
@@ -98,7 +98,7 @@ ln -sf /usr/lib/firmware/amdgpu/mi300_discovery \
### 步骤 3:加载 amdgpu 内核模块

```bash
-modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0
+modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2
```

**功能说明**:使用协同仿真参数加载 amdgpu 驱动。
@@ -108,6 +108,9 @@ modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0
| 参数 | 值 | 含义 |
|-----------|-------|---------|
| `ip_block_mask` | `0x67` | 禁用 PSP(bit 3)和 SMU(bit 4);cosim 不模拟这些 IP 块 |
+| `ppfeaturemask` | `0` | 禁用 PowerPlay 特性;cosim 无电源管理硬件 |
+| `dpm` | `0` | 禁用动态电源管理 |
+| `audio` | `0` | 禁用音频;cosim 无 HDMI/DP 音频 |
| `ras_enable` | `0` | 禁用 RAS — 防止 VBIOS 最小化时 `atom_context` 为 NULL 导致的空指针崩溃 |
| `discovery` | `2` | 使用固件文件进行 IP discovery |

4 changes: 2 additions & 2 deletions docs/zh/cosim-technical-notes.md
@@ -226,7 +226,7 @@ scons build/VEGA_X86/gem5.opt -j1 GOLD_LINKER=True --linker=gold
- QEMU → guest IH 处理程序
- **GART 翻译**:协同仿真兜底机制从共享 VRAM 读取 PTE;未映射页安全路由到 sink
- **65,000+ 次 MMIO 操作**处理无崩溃
-- **磁盘镜像**:`cosim-gpu-setup.service` 开机自动加载驱动(dd ROM → modprobe `ip_block_mask=0x67 discovery=2 ras_enable=0`)
+- **磁盘镜像**:`cosim-gpu-setup.service` 开机自动加载驱动(dd ROM → modprobe `ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2`)

### 已知限制

@@ -334,7 +334,7 @@ screen -dmS qemu-cosim -L -Logfile /tmp/qemu-cosim-screen.log \

# 4. 手动 GPU 初始化(如果 cosim-gpu-setup.service 未安装)
screen -S qemu-cosim -X stuff 'dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128\n'
-screen -S qemu-cosim -X stuff 'modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0\n'
+screen -S qemu-cosim -X stuff 'modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2\n'
```

## 6. 调试技巧
4 changes: 2 additions & 2 deletions docs/zh/cosim-usage-guide.md
@@ -326,7 +326,7 @@ Guest Linux 启动完成后(自动以 root 登录),执行以下命令加

1. 通过 `dd` 写入 VGA ROM 到 `0xC0000`(gem5 `readROM()` 需要此数据)
2. 链接 IP discovery 固件
-3. 执行 `modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0`
+3. 执行 `modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2`

服务约 40 秒完成。登录后用 `rocm-smi` 验证。

@@ -341,7 +341,7 @@ ln -sf /usr/lib/firmware/amdgpu/mi300_discovery \
/usr/lib/firmware/amdgpu/ip_discovery.bin

# 3. 加载 amdgpu 驱动
-modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0
+modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2
```

> **关键参数说明:**
2 changes: 1 addition & 1 deletion gem5-resources
23 changes: 15 additions & 8 deletions scripts/cosim_guest_setup.sh
@@ -56,16 +56,23 @@ else
 fi

 # ---- Step 3: Load amdgpu driver ----
+# NOTE: Do NOT delegate to /home/gem5/load_amdgpu.sh — that script uses
+# ip_block_mask=0x6f (PSP enabled) which is for standalone gem5 only.
 echo "[3/4] Loading amdgpu kernel module..."
-if [ -f /home/gem5/load_amdgpu.sh ]; then
-    sh /home/gem5/load_amdgpu.sh
-elif [ -f "/lib/modules/$(uname -r)/updates/dkms/amdgpu.ko" ]; then
-    modprobe -v amdgpu \
-        ip_block_mask=0x67 \
-        ras_enable=0 \
-        discovery=2
+AMDGPU_ARGS=(ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2)
+
+# Kernel cmdline modprobe.blacklist=amdgpu creates a runtime blacklist that
+# causes modprobe to silently skip the module (exit 0 without loading).
+rm -f /run/modprobe.d/*blacklist* 2>/dev/null
+
+if modprobe -v amdgpu "${AMDGPU_ARGS[@]}"; then
+    echo " amdgpu loaded (modprobe)"
+elif insmod "/lib/modules/$(uname -r)/updates/dkms/amdgpu.ko.zst" "${AMDGPU_ARGS[@]}" 2>/dev/null; then
+    echo " amdgpu loaded (insmod .ko.zst)"
+elif insmod "/lib/modules/$(uname -r)/updates/dkms/amdgpu.ko" "${AMDGPU_ARGS[@]}" 2>/dev/null; then
+    echo " amdgpu loaded (insmod .ko)"
 else
-    echo " ERROR: amdgpu.ko not found for kernel $(uname -r)"
+    echo " ERROR: failed to load amdgpu for kernel $(uname -r)"
     echo " Make sure the disk image was built with the GPU ML packer config."
     exit 1
 fi
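
The new code keeps the parameters in one Bash array and expands it with `"${AMDGPU_ARGS[@]}"`, so each `name=value` pair reaches `modprobe` or `insmod` as its own argument and the list is defined once for all three loading paths. A minimal sketch of that expansion, using a hypothetical `show_args` function as a harmless stand-in for the real loaders:

```bash
#!/bin/bash
# Sketch only: show_args is a hypothetical stand-in for modprobe/insmod that
# prints each argument it receives on its own line instead of loading anything.
show_args() {
    printf '%s\n' "$@"
}

AMDGPU_ARGS=(ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2)

# Quoted [@] expansion yields one word per array element.
show_args amdgpu "${AMDGPU_ARGS[@]}"
```

Running it lists `amdgpu` followed by the six parameters, one per line, which is the argument vector the loaders in the hunk above receive.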
2 changes: 1 addition & 1 deletion scripts/cosim_launch.sh
@@ -306,7 +306,7 @@ echo " rocminfo # should show gfx942"
echo ""
echo "Manual setup (if service is not installed):"
echo " dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128"
-echo " modprobe amdgpu ip_block_mask=0x67 discovery=2 ras_enable=0"
+echo " modprobe amdgpu ip_block_mask=0x67 ppfeaturemask=0 dpm=0 audio=0 ras_enable=0 discovery=2"
echo ""
if [[ -n "$SHARE_DIR" ]]; then
echo "Shared directory: $SHARE_DIR"
95 changes: 95 additions & 0 deletions tests/test_modprobe_params.sh
@@ -0,0 +1,95 @@
#!/bin/bash
# Regression test: ensure all cosim modprobe/insmod commands include the
# required parameters. Missing ppfeaturemask/dpm/audio causes -EINVAL
# on ROCm 7.0+ (see issue #9).
#
# Two checks per file:
# 1. If AMDGPU_ARGS is defined, its definition must contain all params.
# 2. Every inline modprobe/insmod amdgpu line (with literal params, not
# a variable reference) must also contain all params independently.

set -uo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(dirname "$SCRIPT_DIR")"

REQUIRED_PARAMS=(
"ip_block_mask=0x67"
"ppfeaturemask=0"
"dpm=0"
"audio=0"
"ras_enable=0"
"discovery=2"
)

FAILED=0
CHECKED=0

check_line() {
    local file="$1"
    local lineno="$2"
    local line="$3"

    CHECKED=$((CHECKED + 1))
    for param in "${REQUIRED_PARAMS[@]}"; do
        if ! echo " $line " | tr '()"'"'"'' ' ' | grep -qF " $param "; then
            echo "FAIL: ${file}:${lineno} missing '${param}'"
            echo " line: ${line}"
            FAILED=$((FAILED + 1))
            return
        fi
    done
}

check_file() {
    local filepath="$1"
    local contents
    contents="$(cat "$filepath")"
    local lineno=0

    while IFS= read -r line; do
        lineno=$((lineno + 1))
        [[ "$line" =~ ^[[:space:]]*# ]] && continue

        if [[ "$line" =~ ^[[:space:]]*AMDGPU_ARGS= ]]; then
            check_line "$filepath" "$lineno" "$line"
            continue
        fi

        # Match modprobe/insmod amdgpu lines, but skip lines that pass
        # params via a variable (e.g., "${AMDGPU_ARGS[@]}") — those are
        # validated through the AMDGPU_ARGS definition check above.
        if echo "$line" | grep -qE '(modprobe|insmod).*amdgpu' &&
           ! echo "$line" | grep -qF 'AMDGPU_ARGS'; then
            check_line "$filepath" "$lineno" "$line"
        fi
    done <<< "$contents"
}

COSIM_SCRIPTS=(
"$REPO_ROOT/scripts/cosim_guest_setup.sh"
"$REPO_ROOT/gem5-resources/src/x86-ubuntu-gpu-ml/files/cosim-gpu-setup.sh"
)

for script in "${COSIM_SCRIPTS[@]}"; do
    if [[ -f "$script" ]]; then
        check_file "$script"
    else
        echo "SKIP: ${script} not found (submodule not checked out?)"
    fi
done

echo ""
echo "Checked $CHECKED definitions/invocations, $FAILED failures."

if [[ $FAILED -gt 0 ]]; then
    echo "FAIL: Missing required modprobe parameters (see issue #9)."
    exit 1
fi

if [[ $CHECKED -eq 0 ]]; then
    echo "WARN: No modprobe parameter definitions found."
    exit 0
fi

echo "PASS: All cosim modprobe commands include required parameters."
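
The matching in `check_line` works on whole tokens: parentheses and quotes are turned into spaces, the line is padded with a space on each side, and `grep -qF` then looks for the parameter surrounded by spaces. The sketch below isolates that pipeline in a hypothetical `has_param` function (the name is introduced here for illustration; the `tr`/`grep` expression is copied from the test) to show which inputs match.

```bash
#!/bin/bash
# Sketch only: has_param is a hypothetical wrapper around the same
# tr/grep pipeline that check_line uses in tests/test_modprobe_params.sh.
has_param() {
    local line="$1" param="$2"
    # Replace ( ) " ' with spaces, then require the param as a whole token.
    echo " $line " | tr '()"'"'"'' ' ' | grep -qF " $param "
}

# Plain invocation: the token is present.
has_param 'modprobe amdgpu dpm=0 audio=0' 'dpm=0' && echo "match"

# A longer value must not count as the required one.
has_param 'modprobe amdgpu dpm=01' 'dpm=0' || echo "no match"

# Array definition: the parentheses are stripped before matching.
has_param 'AMDGPU_ARGS=(dpm=0)' 'dpm=0' && echo "match"
```

Padding with spaces is what keeps `dpm=0` from matching inside `dpm=01`; a plain substring search would accept it.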