Fix GPT-2 compatibility and long-input handling in perplexity #767

Open
zanvari wants to merge 1 commit into huggingface:main from zanvari:fix-perplexity-gpt2-tokenizer

zanvari commented Jun 17, 2026

Summary

This PR fixes GPT-2 compatibility issues in the perplexity metric and measurement, and prevents overlong inputs from raising an IndexError.

The second problem surfaced while investigating GPT-2 compatibility: with max_length=None, overlong inputs could raise an IndexError. This PR addresses both issues and adds regression tests.

Changes

  • Replace tokenizer.special_tokens_map_extended with tokenizer.special_tokens_map.
  • Use the tokenizer EOS token as the padding token when padding is required and no pad token is defined.
  • Avoid padding when batch_size=1.
  • Default max_length to the tokenizer or model maximum length when it is not explicitly provided, preventing overlong inputs from causing indexing errors.
  • Add regression tests for GPT-2 perplexity computation and long-input handling.
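
For reviewers, the padding and max_length behavior described in the list above can be sketched roughly as follows. This is an illustration, not the actual diff: the helper name `resolve_padding_and_max_length`, its `model_max_positions` parameter, and the `SimpleNamespace` stand-in for a tokenizer are all hypothetical. It only assumes a tokenizer-like object exposing `pad_token`, `eos_token`, and `model_max_length`.

```python
from types import SimpleNamespace


def resolve_padding_and_max_length(tokenizer, batch_size, max_length=None,
                                   model_max_positions=None):
    """Hypothetical helper mirroring the behavior listed in this PR.

    Returns (use_padding, max_length).
    """
    # A single sequence per batch never needs padding.
    use_padding = batch_size > 1

    # GPT-2 style tokenizers define no pad token, so fall back to EOS,
    # but only when padding is actually required.
    if use_padding and tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # When the caller gives no max_length, fall back to the smaller of the
    # tokenizer limit and the model limit (assumption: the tokenizer limit
    # may be a very large placeholder when no real limit is known).
    if max_length is None:
        limits = [n for n in (tokenizer.model_max_length, model_max_positions)
                  if n is not None]
        max_length = min(limits) if limits else None

    return use_padding, max_length


# Stand-in for a GPT-2 tokenizer: no pad token, 1024-token limit.
gpt2_like = SimpleNamespace(pad_token=None, eos_token="<|endoftext|>",
                            model_max_length=1024)

use_padding, max_length = resolve_padding_and_max_length(gpt2_like, batch_size=4)
```

With `batch_size=1` the same call leaves `pad_token` untouched, which corresponds to the "avoid padding" item above.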

Tests

python3 -m pytest tests/test_perplexity.py -v

Result:

2 passed

Fixes #766

Development

Successfully merging this pull request may close these issues.

Perplexity metric fails with GPT-2 tokenizer