Add scale_i32_bf16 operator #99

@albiol2004

Problem

INT8 GEMM produces int32 accumulators that must be converted back to bf16 with a scale factor. The existing dequant_i32_bf16 operator (#96) uses per-group packed buffer formats that don't compose directly with GEMM output. Dequantizing on the CPU instead requires an expensive NPU↔CPU round trip on every GEMM call (16 MB int32 download + 8 MB bf16 upload).

Solution

New scale_i32_bf16 operator that takes:

  • Input 1: plain (size,) int32 buffer, directly from GEMM output
  • Input 2: tiny (num_cores × 16,) bf16 scale buffer (~256 bytes)
  • Output: plain (size,) bf16 buffer

No packed formats. Each core acquires the scale ObjectFIFO once and reuses it across all tile iterations. The operator beats CPU dequantization at prompt lengths of roughly 3500 tokens and above.
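For reference, the arithmetic the operator is meant to perform can be sketched on the CPU with NumPy. This is only an illustration under assumptions, not the NPU kernel: the function names are hypothetical, a single per-tensor scale is assumed (the issue does not say how the 16 values per core are used), and bf16 is emulated as uint16 bit patterns because NumPy has no native bf16 type.

```python
import numpy as np

def f32_to_bf16_bits(x):
    """Round finite float32 values to bf16 (round-to-nearest-even).

    Returns the bf16 bit patterns as uint16. NaN/Inf are not handled.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    # Add 0x7FFF plus the lowest kept bit, then drop the low 16 bits.
    rounding = ((bits >> 16) & 1) + np.uint32(0x7FFF)
    return ((bits + rounding) >> 16).astype(np.uint16)

def bf16_bits_to_f32(b):
    """Expand bf16 bit patterns (uint16) back to float32."""
    return (np.asarray(b, dtype=np.uint16).astype(np.uint32) << 16).view(np.float32)

def scale_i32_bf16_ref(acc, scale):
    """Hypothetical CPU reference: out[i] = bf16(acc[i] * scale).

    acc   : plain (size,) int32 buffer, e.g. GEMM accumulators
    scale : one Python float, assumed to be a per-tensor scale
    """
    # Quantize the scale to bf16 first, since the scale buffer is bf16.
    scale_bf16 = bf16_bits_to_f32(f32_to_bf16_bits(np.array([scale], dtype=np.float32)))[0]
    # int32 -> float32 is inexact above 2**24; acceptable for a sketch.
    scaled = acc.astype(np.float32) * scale_bf16
    return f32_to_bf16_bits(scaled)

def scale_buffer_bytes(num_cores):
    """Size of a (num_cores * 16,) bf16 buffer at 2 bytes per element."""
    return num_cores * 16 * 2

acc = np.array([0, 1, -2, 1000], dtype=np.int32)
out = bf16_bits_to_f32(scale_i32_bf16_ref(acc, 0.5))
print(out.tolist())           # values after scaling and bf16 rounding
print(scale_buffer_bytes(8))  # 8 cores -> 256 bytes
```

A reference like this could serve as the expected-output generator when comparing against the NPU result, with a tolerance that accounts for bf16's 8-bit significand.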

Tests

7 non-extensive parameter combinations, all passing. Configurations: 1-8 columns, 1-2 channels, tile sizes 256-8192.
