fix(prepare): use raw byte length for BPB token_bytes — avoids UTF-8 replacement inflation #385

Open
warren618 wants to merge 2 commits into karpathy:master from warren618:fix/bpb-token-byte-count

Summary

Fixes #384

The BPB token byte count is computed by decoding each token via tiktoken and re-encoding the result to UTF-8. For tokens whose raw bytes are not valid standalone UTF-8, tiktoken's decode substitutes U+FFFD, which re-encodes to 3 bytes. This inflates the byte count and produces artificially lower (better-looking) BPB scores.
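The inflation can be reproduced without tiktoken. The sketch below assumes the decode step behaves like Python's `bytes.decode("utf-8", errors="replace")`, which matches tiktoken's default `errors="replace"` decoding; the helper name `roundtrip_len` is hypothetical and is not part of the PR.

```python
def roundtrip_len(raw: bytes) -> int:
    """Byte length after decoding with replacement and re-encoding to UTF-8."""
    return len(raw.decode("utf-8", errors="replace").encode("utf-8"))

lone_continuation = b"\x80"  # a continuation byte, not valid UTF-8 by itself
ascii_token = b" the"        # valid UTF-8, so the roundtrip is lossless

print(len(lone_continuation), roundtrip_len(lone_continuation))  # 1 3
print(len(ascii_token), roundtrip_len(ascii_token))              # 4 4
```

Only tokens that are not valid standalone UTF-8 are affected; valid tokens report the same length either way.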

Fix

Use mergeable_ranks directly to get the true raw byte length, bypassing the decode/re-encode roundtrip:

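# Invert mergeable_ranks (raw bytes -> rank) so each token id maps to its raw bytes.
rank_to_bytes = {rank: raw for raw, rank in mergeable_ranks.items()}
for token_id in range(enc.n_vocab):
    # Ids absent from mergeable_ranks (e.g. special tokens) count as 0 bytes.
    raw = rank_to_bytes.get(token_id)
    token_bytes_list.append(len(raw) if raw is not None else 0)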
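As a rough illustration of the lookup above, here is a self-contained sketch on a toy vocabulary. The function name `token_byte_lengths` and the `toy_ranks` table are made up for this example; a real `mergeable_ranks` would come from the tiktoken encoding.

```python
def token_byte_lengths(mergeable_ranks: dict[bytes, int], n_vocab: int) -> list[int]:
    """Raw byte length per token id; ids missing from the table get 0."""
    rank_to_bytes = {rank: raw for raw, rank in mergeable_ranks.items()}
    return [len(rank_to_bytes.get(token_id, b"")) for token_id in range(n_vocab)]

# Toy table: an ASCII byte, a lone continuation byte, and a 3-byte character.
toy_ranks = {b"a": 0, b"\x80": 1, b"\xe2\x82\xac": 2}

# n_vocab is one larger than the table, standing in for an id with no raw bytes.
lengths = token_byte_lengths(toy_ranks, n_vocab=4)
print(lengths)  # [1, 1, 3, 0]
```

The lone continuation byte at id 1 is counted as 1 byte here, where the decode/re-encode roundtrip would have counted 3.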

Why this matters

BPB is the core metric driving the entire autoresearch loop. An inflated byte count denominator means experiments appear to perform better than they actually do.
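To make the effect on the metric concrete, the following sketch assumes BPB is the summed cross-entropy in nats divided by ln 2 times the total byte count. The exact expression used in the repository is not shown in this PR, and all numbers below are invented for illustration.

```python
import math

def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Convert a summed loss in nats to bits, then normalise by byte count."""
    return total_nats / (math.log(2) * total_bytes)

total_nats = 1000.0    # made-up summed loss over an evaluation set
true_bytes = 1000      # raw byte count of the evaluated text
inflated_bytes = 1020  # e.g. ten 1-byte tokens each miscounted as 3 bytes

bpb_true = bits_per_byte(total_nats, true_bytes)
bpb_inflated = bits_per_byte(total_nats, inflated_bytes)
print(bpb_inflated < bpb_true)  # True
```

With the loss held fixed, a 2% larger denominator yields a proportionally smaller BPB, so the score looks better without the model having improved.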

The previous approach decoded each token via tiktoken and then
re-encoded to UTF-8 to get byte length. For BPE tokens whose raw
bytes are not valid standalone UTF-8 (e.g. a single continuation
byte 0x80 = 1 byte), tiktoken decode produces the replacement
character U+FFFD which encodes to 3 UTF-8 bytes. This inflates
the byte-count denominator in BPB, producing artificially lower
(better-looking) scores.

Use mergeable_ranks directly to get the true raw byte length of
each token, avoiding the decode/re-encode roundtrip entirely.
svlandeg self-assigned this on Mar 22, 2026
Comment thread README_PROFILE.md Outdated
Per reviewer feedback — this file was not part of the fix.
Development

Successfully merging this pull request may close these issues.

Bug: BPB metric inflated by UTF-8 replacement characters in token byte count