fix(prepare): use raw byte length for BPB token_bytes — avoids UTF-8 replacement inflation by warren618 · Pull Request #385 · karpathy/autoresearch

warren618 · 2026-03-22T18:52:36Z

Summary

Fixes #384

The BPB token byte count is computed by decoding each token via tiktoken and re-encoding the result to UTF-8. For tokens whose raw bytes are not valid standalone UTF-8, the decode step substitutes the replacement character U+FFFD, which takes 3 bytes in UTF-8. This inflates the byte count and produces artificially lower (better-looking) BPB scores.
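The inflation can be reproduced with the standard library alone. This is a minimal sketch, assuming the decode step behaves like Python's `bytes.decode("utf-8", errors="replace")`, as the description above implies; the helper name `roundtrip_len` is hypothetical and not part of the PR.

```python
def roundtrip_len(raw: bytes) -> int:
    """Byte length after a lossy decode and UTF-8 re-encode (the old approach)."""
    text = raw.decode("utf-8", errors="replace")  # invalid bytes become U+FFFD
    return len(text.encode("utf-8"))

lone_continuation = bytes([0x80])  # not valid UTF-8 on its own
ascii_token = b" the"              # valid UTF-8, round-trips unchanged

print(len(lone_continuation), roundtrip_len(lone_continuation))  # 1 3
print(len(ascii_token), roundtrip_len(ascii_token))              # 4 4
```

Only tokens that fail to decode are affected; valid UTF-8 tokens keep their true length under both approaches.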

Fix

Use mergeable_ranks directly to get the true raw byte length, bypassing the decode/re-encode roundtrip:

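# Invert mergeable_ranks (raw bytes -> rank) so each token id maps to its raw bytes.
rank_to_bytes = {rank: raw for raw, rank in mergeable_ranks.items()}
for token_id in range(enc.n_vocab):
    if token_id in rank_to_bytes:
        # Ordinary BPE token: count its true raw byte length.
        token_bytes_list.append(len(rank_to_bytes[token_id]))
    else:
        # Id not in mergeable_ranks (e.g. a special token): contributes 0 bytes.
        token_bytes_list.append(0)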
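For readers without the surrounding file, a self-contained version of the same idea might look like the sketch below. The `toy_ranks` dictionary and the `n_vocab` value are made up for illustration and stand in for tiktoken's real `mergeable_ranks` and `enc.n_vocab`; the function name `raw_token_byte_lengths` is also hypothetical.

```python
def raw_token_byte_lengths(mergeable_ranks: dict[bytes, int], n_vocab: int) -> list[int]:
    """Return the raw byte length of every token id in [0, n_vocab)."""
    rank_to_bytes = {rank: raw for raw, rank in mergeable_ranks.items()}
    # Ids missing from mergeable_ranks (assumed here to be special tokens) count as 0 bytes.
    return [len(rank_to_bytes.get(token_id, b"")) for token_id in range(n_vocab)]

# Toy vocabulary (an assumption, not real tiktoken data).
toy_ranks = {b"a": 0, b"\x80": 1, b"\xe2\x82": 2, b" the": 3}
lengths = raw_token_byte_lengths(toy_ranks, n_vocab=5)  # id 4 plays the special token
print(lengths)  # [1, 1, 2, 4, 0]
```

No decoding happens anywhere in this path, so tokens that are not valid standalone UTF-8 are counted at their true length.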

Why this matters

BPB is the core metric driving the entire autoresearch loop. An inflated byte count denominator means experiments appear to perform better than they actually do.
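To see the direction of the bias, assume BPB is computed as the summed cross-entropy in nats divided by ln 2 times the total byte count. That formula is an assumption about the evaluation code, which this PR does not show, and every number below is invented for illustration.

```python
import math

def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (nats) to bits and normalize by byte count."""
    return total_nats / (math.log(2) * total_bytes)

total_nats = 700_000.0   # made-up summed loss over an eval set
true_bytes = 1_000_000   # made-up raw byte count
# Suppose 10,000 one-byte tokens were each miscounted as 3 bytes (+2 each).
inflated_bytes = true_bytes + 2 * 10_000

true_bpb = bits_per_byte(total_nats, true_bytes)
inflated_bpb = bits_per_byte(total_nats, inflated_bytes)
print(f"true: {true_bpb:.4f}  inflated: {inflated_bpb:.4f}")
```

The loss in the numerator is unchanged, so any overcount in the denominator lowers the reported BPB without the model having improved.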

The previous approach decoded each token via tiktoken and then re-encoded to UTF-8 to get byte length. For BPE tokens whose raw bytes are not valid standalone UTF-8 (e.g. a single continuation byte 0x80 = 1 byte), tiktoken decode produces the replacement character U+FFFD which encodes to 3 UTF-8 bytes. This inflates the byte-count denominator in BPB, producing artificially lower (better-looking) scores. Use mergeable_ranks directly to get the true raw byte length of each token, avoiding the decode/re-encode roundtrip entirely.

svlandeg self-assigned this Mar 22, 2026

svlandeg reviewed Mar 22, 2026

Comment thread README_PROFILE.md Outdated

chore: remove accidentally committed README_PROFILE.md

54eb413

Per reviewer feedback — this file was not part of the fix.

prathamesh-lang approved these changes Mar 23, 2026

svlandeg mentioned this pull request Apr 7, 2026

fix: use raw token bytes for BPB calculation #493

Closed

ravyg mentioned this pull request Apr 8, 2026

feat: CLI analysis tool for experiment results #495

Closed

warren618 wants to merge 2 commits into karpathy:master from warren618:fix/bpb-token-byte-count
