⚡️ Speed up function `sanitize_filename` by 32% #156

codeflash-ai · 2025-12-18T01:02:40Z

📄 32% (0.32x) speedup for `sanitize_filename` in `skyvern/forge/sdk/api/files.py`

⏱️ Runtime : 471 microseconds → 357 microseconds (best of 250 runs)
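
The headline percentage appears to be the time saved relative to the optimized runtime; that reading is consistent with the per-test annotations further down. A quick check of the arithmetic under that assumption (the helper name `speedup_percent` is made up for illustration and is not part of Codeflash):

```python
def speedup_percent(original_us: float, optimized_us: float) -> float:
    """Time saved relative to the optimized runtime, as a percentage."""
    return (original_us - optimized_us) / optimized_us * 100


# Runtimes reported above, in microseconds.
print(f"{speedup_percent(471, 357):.1f}%")  # 31.9%, which rounds to 32%
```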

📝 Explanation and details

The optimization replaces the inline list literal `["-", "_", ".", "%", " "]` with a pre-computed set `allowed` and converts the generator expression to a list comprehension. This yields a 32% speedup through two key improvements:

What was optimized:

1. Set membership lookup: Changed from `c in ["-", "_", ".", "%", " "]` to `c in allowed`, where `allowed` is a set, reducing each membership test from O(m) to O(1) on average (m is the number of allowed symbols)
2. List comprehension: Replaced the generator expression inside `join()` with a list comprehension for reduced overhead
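
The PR page does not show the diff itself, so the following is only a reconstruction of the two versions from the description and from the expression the generated tests use to build `expected`. The function names with `_before` and `_after` suffixes and the module-level placement of `allowed` are assumptions; the real code in `skyvern/forge/sdk/api/files.py` may differ in detail.

```python
# Hypothetical reconstruction from the PR description, not the actual diff.


def sanitize_filename_before(filename: str) -> str:
    # Generator expression with a list-literal membership test.
    return "".join(c for c in filename if c.isalnum() or c in ["-", "_", ".", "%", " "])


# Pre-computed set of allowed symbols (placement at module level is assumed).
allowed = {"-", "_", ".", "%", " "}


def sanitize_filename_after(filename: str) -> str:
    # List comprehension with a set membership test.
    return "".join([c for c in filename if c.isalnum() or c in allowed])


print(sanitize_filename_after("hello/world\\test:file*name?.txt"))  # helloworldtestfilename.txt
```

Both versions keep alphanumeric characters (including non-ASCII letters and digits) plus the five allowed symbols, so the change should be behavior-preserving.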

Why this is faster:

- The original code tests each character against the list `["-", "_", ".", "%", " "]` with a linear scan, resulting in O(n×m) comparisons in the worst case, where n is the filename length and m is the number of allowed symbols
- Set membership testing is O(1) on average vs O(m) for list membership. Only characters that fail `isalnum()` reach the membership test, so the savings grow with filename length and with the share of non-alphanumeric characters
- List comprehensions have lower overhead than generator expressions when the result is immediately consumed, as it is here by `join()`
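
To see the combined effect on a given machine, a small `timeit` comparison along these lines can be run. It is a sketch rather than the benchmark Codeflash used: the function names are illustrative, the sample string borrows the repeated pattern from one of the generated tests, and absolute timings will vary with interpreter version and hardware.

```python
import timeit

allowed = {"-", "_", ".", "%", " "}


def list_and_generator(name: str) -> str:
    # Original style: list-literal membership, generator expression.
    return "".join(c for c in name if c.isalnum() or c in ["-", "_", ".", "%", " "])


def set_and_listcomp(name: str) -> str:
    # Optimized style: set membership, list comprehension.
    return "".join([c for c in name if c.isalnum() or c in allowed])


sample = "abc!@#-_.% " * 100  # 1100 characters, mixed valid and invalid

# Both variants must agree before their timings are worth comparing.
assert list_and_generator(sample) == set_and_listcomp(sample)

runs = 500
for fn in (list_and_generator, set_and_listcomp):
    seconds = timeit.timeit(lambda: fn(sample), number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e6:.1f} µs per call")
```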

Performance impact by workload:
The function is called in file processing workflows (`download_file`, `rename_file`, `create_named_temporary_file`) where filenames are sanitized. Test results show:

- Large filenames benefit most: 19-75% speedup for 1000-character inputs
- High invalid character ratio: 62% speedup when every character is filtered out, and 75% for an alternating valid/invalid pattern
- Mixed content: 25-49% improvement for large inputs that mix valid and invalid characters
- Small inputs: typically around 10% faster for short filenames, ranging from about 5% slower to 40% faster

This optimization is particularly valuable since file operations often process user-generated filenames that can be long or contain many invalid characters, making the O(1) set lookup highly beneficial in the file handling hot path.

✅ Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 48 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime

import random  # used for generating large/edge test cases
import string  # used for generating large/edge test cases

# imports
import pytest  # used for our unit tests
from skyvern.forge.sdk.api.files import sanitize_filename

# unit tests

# ---------------------
# Basic Test Cases
# ---------------------

def test_all_valid_characters():
    # All valid characters should be preserved
    valid = string.ascii_letters + string.digits + "-_.% "
    codeflash_output = sanitize_filename(valid) # 3.62μs -> 3.13μs (15.5% faster)

def test_only_invalid_characters():
    # All invalid characters should be removed
    invalid = "!@#$^&*()+={}[]|\\:;\"'<>,/?~`"
    codeflash_output = sanitize_filename(invalid) # 2.77μs -> 2.04μs (36.2% faster)

def test_mixed_valid_and_invalid():
    # Mixed valid and invalid characters: invalid ones are removed
    input_str = "hello/world\\test:file*name?.txt"
    expected = "helloworldtestfilename.txt"
    codeflash_output = sanitize_filename(input_str) # 2.58μs -> 2.30μs (12.0% faster)

def test_spaces_preserved():
    # Spaces should be preserved
    input_str = "file name with spaces.txt"
    expected = "file name with spaces.txt"
    codeflash_output = sanitize_filename(input_str) # 2.28μs -> 2.09μs (9.23% faster)

def test_empty_string():
    # Empty string should return empty string
    codeflash_output = sanitize_filename("") # 1.06μs -> 924ns (14.5% faster)

def test_filename_with_percent():
    # Percent sign should be preserved
    codeflash_output = sanitize_filename("report%20final.pdf") # 2.05μs -> 1.88μs (9.09% faster)

# ---------------------
# Edge Test Cases
# ---------------------

def test_all_ascii_printable():
    # Only valid characters from all printable ASCII are preserved
    all_printable = "".join(chr(i) for i in range(32, 127))
    expected = "".join(c for c in all_printable if c.isalnum() or c in ["-", "_", ".", "%", " "])
    codeflash_output = sanitize_filename(all_printable) # 4.45μs -> 4.29μs (3.68% faster)

def test_unicode_characters():
    # Unicode letters and digits ARE preserved (str.isalnum() returns True for them)
    # This is a mutation-sensitive test: if the implementation changes, this test will fail
    input_str = "résumé_文件_123.txt"
    # 'é' and '文' and '件' are isalnum() True, so they are preserved
    expected = "résumé_文件_123.txt"
    codeflash_output = sanitize_filename(input_str) # 2.66μs -> 2.64μs (0.834% faster)

def test_leading_trailing_invalid():
    # Leading and trailing invalid characters are removed, valid ones preserved
    input_str = "!@#file_name.txt$%"
    expected = "file_name.txt%"
    codeflash_output = sanitize_filename(input_str) # 2.24μs -> 2.05μs (9.07% faster)

def test_filename_only_dots():
    # Only dots are valid
    input_str = "....."
    codeflash_output = sanitize_filename(input_str) # 1.42μs -> 1.49μs (4.75% slower)

def test_filename_with_newlines_and_tabs():
    # Newlines and tabs are removed
    input_str = "file\nname\twith\tspecial\rchars.txt"
    expected = "filenamewithspecialchars.txt"
    codeflash_output = sanitize_filename(input_str) # 2.65μs -> 2.30μs (15.2% faster)

def test_filename_with_multiple_spaces():
    # Multiple spaces are preserved
    input_str = "file    name  with   spaces.txt"
    expected = "file    name  with   spaces.txt"
    codeflash_output = sanitize_filename(input_str) # 2.73μs -> 2.37μs (14.8% faster)

def test_filename_with_surrogate_pairs():
    # Surrogate pairs/emoji are removed (they are not alnum and not in allowed list)
    input_str = "file😀name😎.txt"
    expected = "filename.txt"
    codeflash_output = sanitize_filename(input_str) # 2.22μs -> 2.08μs (6.74% faster)

def test_filename_with_control_characters():
    # Control characters are removed
    input_str = "file\x00name\x1f.txt"
    expected = "filename.txt"
    codeflash_output = sanitize_filename(input_str) # 1.84μs -> 1.64μs (12.0% faster)

def test_filename_with_only_spaces():
    # Only spaces are preserved
    input_str = "     "
    codeflash_output = sanitize_filename(input_str) # 1.53μs -> 1.41μs (7.92% faster)

def test_filename_with_percent_and_invalid():
    # Percent is preserved, invalids are removed
    input_str = "%file%na!me%.txt"
    expected = "%file%name%.txt"
    codeflash_output = sanitize_filename(input_str) # 2.07μs -> 1.91μs (8.31% faster)

# ---------------------
# Large Scale Test Cases
# ---------------------

def test_large_filename_all_valid():
    # Large filename with only valid characters
    valid_chars = string.ascii_letters + string.digits + "-_.% "
    large_input = "".join(random.choices(valid_chars, k=1000))
    codeflash_output = sanitize_filename(large_input) # 34.3μs -> 27.5μs (24.8% faster)

def test_large_filename_all_invalid():
    # Large filename with only invalid characters
    invalid_chars = "!@#$^&*()+={}[]|\\:;\"'<>,/?~`"
    large_input = "".join(random.choices(invalid_chars, k=1000))
    codeflash_output = sanitize_filename(large_input) # 54.4μs -> 33.6μs (62.1% faster)

def test_large_filename_mixed():
    # Large filename with a mix of valid and invalid characters
    valid_chars = string.ascii_letters + string.digits + "-_.% "
    invalid_chars = "!@#$^&*()+={}[]|\\:;\"'<>,/?~`"
    all_chars = valid_chars + invalid_chars
    # Make sure the expected output contains only valid chars
    large_input = "".join(random.choices(all_chars, k=1000))
    expected = "".join(c for c in large_input if c.isalnum() or c in ["-", "_", ".", "%", " "])
    codeflash_output = sanitize_filename(large_input) # 43.7μs -> 34.8μs (25.4% faster)

def test_large_filename_unicode():
    # Large filename with unicode and valid ascii
    unicode_chars = "文件测试éàü"
    valid_chars = string.ascii_letters + string.digits + "-_.% "
    all_chars = valid_chars + unicode_chars
    large_input = "".join(random.choices(all_chars, k=1000))
    expected = "".join(c for c in large_input if c.isalnum() or c in ["-", "_", ".", "%", " "])
    codeflash_output = sanitize_filename(large_input) # 36.3μs -> 30.6μs (18.8% faster)

def test_large_filename_with_spaces_and_invalid():
    # Large filename with spaces and invalids
    valid_chars = string.ascii_letters + string.digits + " "
    invalid_chars = "!@#$^&*()+={}[]|\\:;\"'<>,/?~`"
    all_chars = valid_chars + invalid_chars
    large_input = "".join(random.choices(all_chars, k=1000))
    expected = "".join(c for c in large_input if c.isalnum() or c in ["-", "_", ".", "%", " "])
    codeflash_output = sanitize_filename(large_input) # 42.8μs -> 33.2μs (29.0% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

import string  # for generating large scale test cases

# imports
import pytest  # used for our unit tests
from skyvern.forge.sdk.api.files import sanitize_filename

# unit tests

# ------------------- BASIC TEST CASES -------------------

def test_basic_alphanumeric():
    # A purely alphanumeric name should pass through unchanged
    codeflash_output = sanitize_filename("abc123") # 1.54μs -> 1.58μs (2.78% slower)

def test_basic_allowed_symbols():
    # Allowed symbols should be preserved
    codeflash_output = sanitize_filename("file-name_01.txt") # 2.09μs -> 2.00μs (4.65% faster)
    codeflash_output = sanitize_filename("hello world.txt") # 1.20μs -> 1.01μs (18.4% faster)
    codeflash_output = sanitize_filename("percent%file.txt") # 969ns -> 820ns (18.2% faster)

def test_basic_mixed_allowed_and_disallowed():
    # Disallowed symbols should be removed
    codeflash_output = sanitize_filename("my*file?name.txt") # 1.92μs -> 1.73μs (11.1% faster)
    codeflash_output = sanitize_filename("test|file<name>.txt") # 1.31μs -> 1.12μs (16.7% faster)

def test_basic_only_disallowed():
    # If only disallowed characters, result should be empty string
    codeflash_output = sanitize_filename("!@#$^&*()[]{};:'\",<>?/\\|") # 2.60μs -> 1.86μs (39.6% faster)

def test_basic_empty_string():
    # Empty string should return empty string
    codeflash_output = sanitize_filename("") # 1.02μs -> 824ns (24.3% faster)

# ------------------- EDGE TEST CASES -------------------

def test_edge_unicode_characters():
    # Unicode letters and digits should be preserved, symbols removed
    # 'é' and 'ü' are alphanumeric, 'ß' is alphanumeric, 'ø' is alphanumeric, 'π' is alphanumeric
    codeflash_output = sanitize_filename("café_über_ßøπ.txt") # 2.54μs -> 2.36μs (7.61% faster)
    # Emoji and non-alphanumeric unicode should be removed
    codeflash_output = sanitize_filename("file😀name💾.txt") # 1.72μs -> 1.55μs (10.8% faster)

def test_edge_spaces_and_dots():
    # Multiple spaces and dots should be preserved
    codeflash_output = sanitize_filename("   ...file   name...  ") # 2.44μs -> 2.14μs (14.1% faster)

def test_edge_leading_trailing_disallowed():
    # Disallowed characters at start/end should be removed
    codeflash_output = sanitize_filename("!file.txt?") # 1.68μs -> 1.54μs (9.09% faster)

def test_edge_only_allowed_symbols():
    # Only allowed symbols should be preserved
    codeflash_output = sanitize_filename("-_ .%") # 1.48μs -> 1.44μs (2.78% faster)

def test_edge_long_disallowed_sequence():
    # Long sequence of mostly disallowed characters: only the allowed "%" should remain
    codeflash_output = sanitize_filename("!@#$%^&*()+=~`[]{}|\\:;\"'<>,/?") # 2.89μs -> 2.25μs (28.5% faster)

def test_edge_filename_with_newlines_and_tabs():
    # Newlines and tabs should be removed
    codeflash_output = sanitize_filename("file\nname\t.txt") # 1.86μs -> 1.75μs (6.39% faster)

def test_edge_filename_with_percent():
    # Percent symbol should be preserved
    codeflash_output = sanitize_filename("100%_complete.txt") # 2.02μs -> 1.88μs (7.49% faster)

def test_edge_filename_with_multiple_dots():
    # Multiple dots are allowed
    codeflash_output = sanitize_filename("a.b.c.d.txt") # 1.68μs -> 1.63μs (2.57% faster)

def test_edge_filename_with_mixed_case():
    # Case should be preserved
    codeflash_output = sanitize_filename("FileNAME.TXT") # 1.65μs -> 1.64μs (0.854% faster)

def test_edge_filename_with_spaces_and_disallowed():
    # Spaces are allowed, disallowed removed
    codeflash_output = sanitize_filename("my file*name?.txt") # 2.01μs -> 1.82μs (10.3% faster)

def test_edge_filename_with_surrogate_pairs():
    # Surrogate pairs (e.g., emoji) are not alphanumeric and should be removed
    codeflash_output = sanitize_filename("file\U0001F4A9name.txt") # 2.18μs -> 2.08μs (4.96% faster)

def test_edge_filename_with_control_characters():
    # Control characters should be removed
    codeflash_output = sanitize_filename("file\x00name\x1F.txt") # 1.88μs -> 1.72μs (9.50% faster)

# ------------------- LARGE SCALE TEST CASES -------------------

def test_large_scale_long_filename():
    # Test with a very long filename (1000 characters, all allowed)
    long_filename = "a" * 1000
    codeflash_output = sanitize_filename(long_filename) # 26.9μs -> 21.1μs (27.8% faster)

def test_large_scale_long_filename_with_disallowed():
    # Test with a very long filename (1000 characters, half disallowed)
    allowed = "a" * 500
    disallowed = "!" * 500
    mixed = "".join([allowed[i] + disallowed[i] for i in range(500)])
    # Should only keep the allowed characters
    codeflash_output = sanitize_filename(mixed) # 40.4μs -> 23.1μs (74.7% faster)

def test_large_scale_all_ascii_printable():
    # Test with all ASCII printable characters
    all_ascii = "".join(chr(i) for i in range(32, 127))
    expected = "".join(c for c in all_ascii if c.isalnum() or c in ["-", "_", ".", "%", " "])
    codeflash_output = sanitize_filename(all_ascii) # 4.52μs -> 4.24μs (6.53% faster)

def test_large_scale_repeated_pattern():
    # Test with repeated pattern of allowed and disallowed
    pattern = "abc!@#-_.% "
    repeated = pattern * 100
    expected = "".join(c for c in repeated if c.isalnum() or c in ["-", "_", ".", "%", " "])
    codeflash_output = sanitize_filename(repeated) # 47.8μs -> 32.1μs (49.1% faster)

def test_large_scale_unicode_mixed():
    # Test with a large string of mixed unicode, allowed, and disallowed
    allowed = "üßøπ" * 100
    disallowed = "😀💾" * 100
    mixed = allowed + disallowed
    codeflash_output = sanitize_filename(mixed) # 23.4μs -> 19.5μs (19.8% faster)

def test_large_scale_filename_with_everything():
    # Test with a mix of all allowed, disallowed, unicode, control, and emoji
    allowed = string.ascii_letters + string.digits + "-_.% "
    disallowed = "!@#$^&*()+=~`[]{}|\\:;\"'<>,/?"
    unicode_allowed = "üßøπ"
    emoji_disallowed = "😀💾"
    control_disallowed = "\x00\x1F"
    long_string = (allowed + disallowed + unicode_allowed + emoji_disallowed + control_disallowed) * 10
    expected = "".join(c for c in long_string if c.isalnum() or c in ["-", "_", ".", "%", " "])
    codeflash_output = sanitize_filename(long_string) # 39.5μs -> 27.9μs (41.5% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
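
The generated tests above record `codeflash_output` without asserting on it, because the comparison between original and optimized behavior happens in Codeflash's harness. For a local check with explicit assertions, a seeded randomized comparison could look like the sketch below. `reference_sanitize` mirrors the expression the tests use to build `expected`; `check_equivalence` and `candidate` are hypothetical names, and the set-based `candidate` is a stand-in for the real `sanitize_filename` import.

```python
import random
import string


def reference_sanitize(filename: str) -> str:
    # Same expression the generated tests use to build `expected`.
    return "".join(c for c in filename if c.isalnum() or c in ["-", "_", ".", "%", " "])


def check_equivalence(candidate, trials: int = 200, seed: int = 0) -> int:
    """Compare `candidate` against the reference on seeded random inputs."""
    rng = random.Random(seed)
    # Printable ASCII plus a few non-ASCII letters, CJK characters, and an emoji.
    alphabet = string.printable + "éüßøπ文件😀"
    for _ in range(trials):
        name = "".join(rng.choices(alphabet, k=rng.randint(0, 50)))
        assert candidate(name) == reference_sanitize(name), repr(name)
    return trials


# Stand-in for the optimized function; in the repository, pass sanitize_filename instead.
_allowed = {"-", "_", ".", "%", " "}


def candidate(filename: str) -> str:
    return "".join([c for c in filename if c.isalnum() or c in _allowed])


print(f"{check_equivalence(candidate)} random inputs matched")
```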

To edit these changes, run `git checkout codeflash/optimize-sanitize_filename-mjaqkc4u` and push.

codeflash-ai bot requested a review from mashraf-222 December 18, 2025 01:02

codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Dec 18, 2025
