@codeflash-ai codeflash-ai bot commented Dec 18, 2025

📄 32% (0.32x) speedup for sanitize_filename in skyvern/forge/sdk/api/files.py

⏱️ Runtime: 471 microseconds → 357 microseconds (best of 250 runs)

📝 Explanation and details

The optimization replaces the inline list literal ["-", "_", ".", "%", " "] with a pre-computed set allowed and converts the generator expression to a list comprehension. This yields a roughly 32% speedup (471 μs → 357 μs) through two key improvements:

What was optimized:

  1. Set membership lookup: Changed from c in ["-", "_", ".", "%", " "] to c in allowed where allowed is a set, reducing each membership test from O(m) in the number of allowed symbols (m = 5) to O(1) on average
  2. List comprehension: Replaced the generator expression inside join() with a list comprehension to reduce iteration overhead
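
The two changes can be sketched as follows. This is a reconstruction, not the actual diff: it assumes the original body matches the expression the generated tests use to build their expected values, and the names sanitize_filename_before, sanitize_filename_after, and _ALLOWED are hypothetical. The real change may define the set elsewhere, for example as a local variable named allowed.

```python
# Hypothetical reconstruction of the change; not the actual diff from the PR.


def sanitize_filename_before(filename: str) -> str:
    # Generator expression with an inline list literal for the membership test.
    return "".join(c for c in filename if c.isalnum() or c in ["-", "_", ".", "%", " "])


# Assumed placement: a module-level constant built once at import time.
_ALLOWED = {"-", "_", ".", "%", " "}


def sanitize_filename_after(filename: str) -> str:
    # List comprehension with a set membership test.
    return "".join([c for c in filename if c.isalnum() or c in _ALLOWED])
```

Both versions keep a character when it is alphanumeric (including non-ASCII letters and digits) or is one of the five allowed symbols, so the output is expected to be identical for every input.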

Why this is faster:

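  • The original code tests every character that fails isalnum() against the list literal ["-", "_", ".", "%", " "] with a linear scan, so the worst case is O(n×m) comparisons, where n is the filename length and m is the number of allowed symbols. (CPython normally folds a constant list used in an "in" test into a tuple constant, so the cost is the scan itself rather than rebuilding the list.)
  • Set membership testing is O(1) on average vs O(m) for list membership. Since m is only 5 here, the gain is a constant factor, and it is most visible when many characters fail isalnum() and reach the membership test
  • In CPython, str.join() materializes a generator into a sequence before joining, so passing a list comprehension directly avoids the generator's per-item resume overhead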
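
A micro-benchmark along the following lines could be used to check these claims locally. It is a minimal sketch: the function names and the measure helper are hypothetical, the two variants are assumed to mirror the before and after code, and absolute timings depend on the interpreter version and machine.

```python
import timeit

# Assumed to match the set introduced by the optimization.
ALLOWED_SET = {"-", "_", ".", "%", " "}


def with_generator_and_list(name: str) -> str:
    # Mirrors the assumed original: generator expression + list literal.
    return "".join(c for c in name if c.isalnum() or c in ["-", "_", ".", "%", " "])


def with_listcomp_and_set(name: str) -> str:
    # Mirrors the assumed optimized version: list comprehension + set.
    return "".join([c for c in name if c.isalnum() or c in ALLOWED_SET])


def measure(func, name: str, number: int = 2000) -> float:
    """Return the best per-call time in microseconds over 5 repeats."""
    best = min(timeit.repeat(lambda: func(name), number=number, repeat=5))
    return best / number * 1e6


if __name__ == "__main__":
    # Half the characters are disallowed, so the membership test runs often.
    sample = "a!" * 500
    for func in (with_generator_and_list, with_listcomp_and_set):
        print(f"{func.__name__}: {measure(func, sample):.1f} us per call")
```

Using an input such as "a" * 1000 instead isolates the list-comprehension effect, because isalnum() succeeds for every character and the membership test never runs.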

Performance impact by workload:
The function is called in file processing workflows (download_file, rename_file, create_named_temporary_file) where filenames are sanitized. Test results show:

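  • Large filenames benefit most: roughly 19-75% speedup for inputs of about 1000 characters
  • High invalid character ratio: 62% speedup when every character is filtered out, and up to 75% when half of the characters are invalid
  • Mixed content: 25-49% improvement for long inputs that mix allowed and disallowed characters
  • Small inputs: usually about 5-15% faster for typical short filenames; two cases measured slightly slower (2.78% and 4.75%), which is likely noise at the microsecond scale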

This optimization is particularly valuable since file operations often process user-generated filenames that can be long or contain many invalid characters, making the O(1) set lookup highly beneficial in the file handling hot path.
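
The percentages quoted above can be reproduced from the reported timings. The sketch below assumes the report defines speedup as original / optimized − 1, which matches the headline "32% (0.32x)" figure; the helper name speedup_percent is hypothetical, and small differences from the per-test percentages are expected because the displayed timings are rounded.

```python
def speedup_percent(original_us: float, optimized_us: float) -> float:
    """Relative speedup in percent; negative means the optimized run was slower."""
    return (original_us / optimized_us - 1.0) * 100.0


# Timings in microseconds, copied from the report.
reported = {
    "overall (best of 250 runs)": (471.0, 357.0),
    "test_large_filename_all_invalid": (54.4, 33.6),
    "test_large_scale_long_filename_with_disallowed": (40.4, 23.1),
    "test_filename_only_dots": (1.42, 1.49),
}

for label, (before, after) in reported.items():
    print(f"{label}: {speedup_percent(before, after):+.1f}%")
```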

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 48 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import random  # used for generating large/edge test cases
import string  # used for generating large/edge test cases

# imports
import pytest  # used for our unit tests
from skyvern.forge.sdk.api.files import sanitize_filename

# unit tests

# ---------------------
# Basic Test Cases
# ---------------------

def test_all_valid_characters():
    # All valid characters should be preserved
    valid = string.ascii_letters + string.digits + "-_.% "
    codeflash_output = sanitize_filename(valid) # 3.62μs -> 3.13μs (15.5% faster)

def test_only_invalid_characters():
    # All invalid characters should be removed
    invalid = "!@#$^&*()+={}[]|\\:;\"'<>,/?~`"
    codeflash_output = sanitize_filename(invalid) # 2.77μs -> 2.04μs (36.2% faster)

def test_mixed_valid_and_invalid():
    # Mixed valid and invalid characters: invalid ones are removed
    input_str = "hello/world\\test:file*name?.txt"
    expected = "helloworldtestfilename.txt"
    codeflash_output = sanitize_filename(input_str) # 2.58μs -> 2.30μs (12.0% faster)

def test_spaces_preserved():
    # Spaces should be preserved
    input_str = "file name with spaces.txt"
    expected = "file name with spaces.txt"
    codeflash_output = sanitize_filename(input_str) # 2.28μs -> 2.09μs (9.23% faster)

def test_empty_string():
    # Empty string should return empty string
    codeflash_output = sanitize_filename("") # 1.06μs -> 924ns (14.5% faster)

def test_filename_with_percent():
    # Percent sign should be preserved
    codeflash_output = sanitize_filename("report%20final.pdf") # 2.05μs -> 1.88μs (9.09% faster)

# ---------------------
# Edge Test Cases
# ---------------------

def test_all_ascii_printable():
    # Only valid characters from all printable ASCII are preserved
    all_printable = "".join(chr(i) for i in range(32, 127))
    expected = "".join(c for c in all_printable if c.isalnum() or c in ["-", "_", ".", "%", " "])
    codeflash_output = sanitize_filename(all_printable) # 4.45μs -> 4.29μs (3.68% faster)

def test_unicode_characters():
    # Unicode letters and digits ARE preserved (since isalnum() returns True for them)
    # This is a mutation-sensitive test: if the implementation changes, this test will fail
    input_str = "résumé_文件_123.txt"
    # 'é' and '文' and '件' are isalnum() True, so they are preserved
    expected = "résumé_文件_123.txt"
    codeflash_output = sanitize_filename(input_str) # 2.66μs -> 2.64μs (0.834% faster)

def test_leading_trailing_invalid():
    # Leading and trailing invalid characters are removed, valid ones preserved
    input_str = "!@#file_name.txt$%"
    expected = "file_name.txt%"
    codeflash_output = sanitize_filename(input_str) # 2.24μs -> 2.05μs (9.07% faster)

def test_filename_only_dots():
    # Only dots are valid
    input_str = "....."
    codeflash_output = sanitize_filename(input_str) # 1.42μs -> 1.49μs (4.75% slower)

def test_filename_with_newlines_and_tabs():
    # Newlines and tabs are removed
    input_str = "file\nname\twith\tspecial\rchars.txt"
    expected = "filenamewithspecialchars.txt"
    codeflash_output = sanitize_filename(input_str) # 2.65μs -> 2.30μs (15.2% faster)

def test_filename_with_multiple_spaces():
    # Multiple spaces are preserved
    input_str = "file    name  with   spaces.txt"
    expected = "file    name  with   spaces.txt"
    codeflash_output = sanitize_filename(input_str) # 2.73μs -> 2.37μs (14.8% faster)

def test_filename_with_surrogate_pairs():
    # Surrogate pairs/emoji are removed (they are not alnum and not in allowed list)
    input_str = "file😀name😎.txt"
    expected = "filename.txt"
    codeflash_output = sanitize_filename(input_str) # 2.22μs -> 2.08μs (6.74% faster)

def test_filename_with_control_characters():
    # Control characters are removed
    input_str = "file\x00name\x1f.txt"
    expected = "filename.txt"
    codeflash_output = sanitize_filename(input_str) # 1.84μs -> 1.64μs (12.0% faster)

def test_filename_with_only_spaces():
    # Only spaces are preserved
    input_str = "     "
    codeflash_output = sanitize_filename(input_str) # 1.53μs -> 1.41μs (7.92% faster)

def test_filename_with_percent_and_invalid():
    # Percent is preserved, invalids are removed
    input_str = "%file%na!me%.txt"
    expected = "%file%name%.txt"
    codeflash_output = sanitize_filename(input_str) # 2.07μs -> 1.91μs (8.31% faster)

# ---------------------
# Large Scale Test Cases
# ---------------------

def test_large_filename_all_valid():
    # Large filename with only valid characters
    valid_chars = string.ascii_letters + string.digits + "-_.% "
    large_input = "".join(random.choices(valid_chars, k=1000))
    codeflash_output = sanitize_filename(large_input) # 34.3μs -> 27.5μs (24.8% faster)

def test_large_filename_all_invalid():
    # Large filename with only invalid characters
    invalid_chars = "!@#$^&*()+={}[]|\\:;\"'<>,/?~`"
    large_input = "".join(random.choices(invalid_chars, k=1000))
    codeflash_output = sanitize_filename(large_input) # 54.4μs -> 33.6μs (62.1% faster)

def test_large_filename_mixed():
    # Large filename with a mix of valid and invalid characters
    valid_chars = string.ascii_letters + string.digits + "-_.% "
    invalid_chars = "!@#$^&*()+={}[]|\\:;\"'<>,/?~`"
    all_chars = valid_chars + invalid_chars
    # Make sure the expected output contains only valid chars
    large_input = "".join(random.choices(all_chars, k=1000))
    expected = "".join(c for c in large_input if c.isalnum() or c in ["-", "_", ".", "%", " "])
    codeflash_output = sanitize_filename(large_input) # 43.7μs -> 34.8μs (25.4% faster)

def test_large_filename_unicode():
    # Large filename with unicode and valid ascii
    unicode_chars = "文件测试éàü"
    valid_chars = string.ascii_letters + string.digits + "-_.% "
    all_chars = valid_chars + unicode_chars
    large_input = "".join(random.choices(all_chars, k=1000))
    expected = "".join(c for c in large_input if c.isalnum() or c in ["-", "_", ".", "%", " "])
    codeflash_output = sanitize_filename(large_input) # 36.3μs -> 30.6μs (18.8% faster)

def test_large_filename_with_spaces_and_invalid():
    # Large filename with spaces and invalids
    valid_chars = string.ascii_letters + string.digits + " "
    invalid_chars = "!@#$^&*()+={}[]|\\:;\"'<>,/?~`"
    all_chars = valid_chars + invalid_chars
    large_input = "".join(random.choices(all_chars, k=1000))
    expected = "".join(c for c in large_input if c.isalnum() or c in ["-", "_", ".", "%", " "])
    codeflash_output = sanitize_filename(large_input) # 42.8μs -> 33.2μs (29.0% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import string  # for generating large scale test cases

# imports
import pytest  # used for our unit tests
from skyvern.forge.sdk.api.files import sanitize_filename

# unit tests

# ------------------- BASIC TEST CASES -------------------

def test_basic_alphanumeric():
    # Only alphanumeric characters should be preserved
    codeflash_output = sanitize_filename("abc123") # 1.54μs -> 1.58μs (2.78% slower)

def test_basic_allowed_symbols():
    # Allowed symbols should be preserved
    codeflash_output = sanitize_filename("file-name_01.txt") # 2.09μs -> 2.00μs (4.65% faster)
    codeflash_output = sanitize_filename("hello world.txt") # 1.20μs -> 1.01μs (18.4% faster)
    codeflash_output = sanitize_filename("percent%file.txt") # 969ns -> 820ns (18.2% faster)

def test_basic_mixed_allowed_and_disallowed():
    # Disallowed symbols should be removed
    codeflash_output = sanitize_filename("my*file?name.txt") # 1.92μs -> 1.73μs (11.1% faster)
    codeflash_output = sanitize_filename("test|file<name>.txt") # 1.31μs -> 1.12μs (16.7% faster)

def test_basic_only_disallowed():
    # If only disallowed characters, result should be empty string
    codeflash_output = sanitize_filename("!@#$^&*()[]{};:'\",<>?/\\|") # 2.60μs -> 1.86μs (39.6% faster)

def test_basic_empty_string():
    # Empty string should return empty string
    codeflash_output = sanitize_filename("") # 1.02μs -> 824ns (24.3% faster)

# ------------------- EDGE TEST CASES -------------------

def test_edge_unicode_characters():
    # Unicode letters and digits should be preserved, symbols removed
    # 'é' and 'ü' are alphanumeric, 'ß' is alphanumeric, 'ø' is alphanumeric, 'π' is alphanumeric
    codeflash_output = sanitize_filename("café_über_ßøπ.txt") # 2.54μs -> 2.36μs (7.61% faster)
    # Emoji and non-alphanumeric unicode should be removed
    codeflash_output = sanitize_filename("file😀name💾.txt") # 1.72μs -> 1.55μs (10.8% faster)

def test_edge_spaces_and_dots():
    # Multiple spaces and dots should be preserved
    codeflash_output = sanitize_filename("   ...file   name...  ") # 2.44μs -> 2.14μs (14.1% faster)

def test_edge_leading_trailing_disallowed():
    # Disallowed characters at start/end should be removed
    codeflash_output = sanitize_filename("!file.txt?") # 1.68μs -> 1.54μs (9.09% faster)

def test_edge_only_allowed_symbols():
    # Only allowed symbols should be preserved
    codeflash_output = sanitize_filename("-_ .%") # 1.48μs -> 1.44μs (2.78% faster)

def test_edge_long_disallowed_sequence():
    # Long sequence of mostly disallowed characters; the input contains "%", which is allowed,
    # so the result is "%" rather than an empty string
    codeflash_output = sanitize_filename("!@#$%^&*()+=~`[]{}|\\:;\"'<>,/?") # 2.89μs -> 2.25μs (28.5% faster)

def test_edge_filename_with_newlines_and_tabs():
    # Newlines and tabs should be removed
    codeflash_output = sanitize_filename("file\nname\t.txt") # 1.86μs -> 1.75μs (6.39% faster)

def test_edge_filename_with_percent():
    # Percent symbol should be preserved
    codeflash_output = sanitize_filename("100%_complete.txt") # 2.02μs -> 1.88μs (7.49% faster)

def test_edge_filename_with_multiple_dots():
    # Multiple dots are allowed
    codeflash_output = sanitize_filename("a.b.c.d.txt") # 1.68μs -> 1.63μs (2.57% faster)

def test_edge_filename_with_mixed_case():
    # Case should be preserved
    codeflash_output = sanitize_filename("FileNAME.TXT") # 1.65μs -> 1.64μs (0.854% faster)

def test_edge_filename_with_spaces_and_disallowed():
    # Spaces are allowed, disallowed removed
    codeflash_output = sanitize_filename("my file*name?.txt") # 2.01μs -> 1.82μs (10.3% faster)

def test_edge_filename_with_surrogate_pairs():
    # Surrogate pairs (e.g., emoji) are not alphanumeric and should be removed
    codeflash_output = sanitize_filename("file\U0001F4A9name.txt") # 2.18μs -> 2.08μs (4.96% faster)

def test_edge_filename_with_control_characters():
    # Control characters should be removed
    codeflash_output = sanitize_filename("file\x00name\x1F.txt") # 1.88μs -> 1.72μs (9.50% faster)

# ------------------- LARGE SCALE TEST CASES -------------------

def test_large_scale_long_filename():
    # Test with a very long filename (1000 characters, all allowed)
    long_filename = "a" * 1000
    codeflash_output = sanitize_filename(long_filename) # 26.9μs -> 21.1μs (27.8% faster)

def test_large_scale_long_filename_with_disallowed():
    # Test with a very long filename (1000 characters, half disallowed)
    allowed = "a" * 500
    disallowed = "!" * 500
    mixed = "".join([allowed[i] + disallowed[i] for i in range(500)])
    # Should only keep the allowed characters
    codeflash_output = sanitize_filename(mixed) # 40.4μs -> 23.1μs (74.7% faster)

def test_large_scale_all_ascii_printable():
    # Test with all ASCII printable characters
    all_ascii = "".join(chr(i) for i in range(32, 127))
    expected = "".join(c for c in all_ascii if c.isalnum() or c in ["-", "_", ".", "%", " "])
    codeflash_output = sanitize_filename(all_ascii) # 4.52μs -> 4.24μs (6.53% faster)

def test_large_scale_repeated_pattern():
    # Test with repeated pattern of allowed and disallowed
    pattern = "abc!@#-_.% "
    repeated = pattern * 100
    expected = "".join(c for c in repeated if c.isalnum() or c in ["-", "_", ".", "%", " "])
    codeflash_output = sanitize_filename(repeated) # 47.8μs -> 32.1μs (49.1% faster)

def test_large_scale_unicode_mixed():
    # Test with a large string of mixed unicode, allowed, and disallowed
    allowed = "üßøπ" * 100
    disallowed = "😀💾" * 100
    mixed = allowed + disallowed
    codeflash_output = sanitize_filename(mixed) # 23.4μs -> 19.5μs (19.8% faster)

def test_large_scale_filename_with_everything():
    # Test with a mix of all allowed, disallowed, unicode, control, and emoji
    allowed = string.ascii_letters + string.digits + "-_.% "
    disallowed = "!@#$^&*()+=~`[]{}|\\:;\"'<>,/?"
    unicode_allowed = "üßøπ"
    emoji_disallowed = "😀💾"
    control_disallowed = "\x00\x1F"
    long_string = (allowed + disallowed + unicode_allowed + emoji_disallowed + control_disallowed) * 10
    expected = "".join(c for c in long_string if c.isalnum() or c in ["-", "_", ".", "%", " "])
    codeflash_output = sanitize_filename(long_string) # 39.5μs -> 27.9μs (41.5% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
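
The generated tests above carry no explicit assertions; as the closing comment notes, equivalence is established by comparing outputs of the original and optimized code. A standalone check of that idea could look like the sketch below. It is not Codeflash's actual mechanism: original, optimized, and check_equivalence are hypothetical names, and both bodies are assumed reconstructions of the two versions.

```python
import random
import string


def original(filename: str) -> str:
    # Assumed original body, taken from the expression used in the tests.
    return "".join(c for c in filename if c.isalnum() or c in ["-", "_", ".", "%", " "])


ALLOWED = frozenset("-_.% ")


def optimized(filename: str) -> str:
    # Assumed optimized body: list comprehension + set membership.
    return "".join([c for c in filename if c.isalnum() or c in ALLOWED])


def check_equivalence(samples) -> None:
    """Raise AssertionError on the first input where the outputs differ."""
    for s in samples:
        assert original(s) == optimized(s), f"outputs differ for {s!r}"


rng = random.Random(0)  # fixed seed so any failure can be reproduced
alphabet = string.printable + "üßøπ文件😀💾\x00\x1f"
samples = ["", ".....", "!@#$%^&*()", "résumé_文件_123.txt"]
samples += ["".join(rng.choices(alphabet, k=rng.randint(0, 200))) for _ in range(200)]
check_equivalence(samples)
```

Seeding the random generator, unlike the unseeded random.choices calls in the generated tests, keeps the sampled inputs identical across runs.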

To edit these changes, run git checkout codeflash/optimize-sanitize_filename-mjaqkc4u, commit your edits, and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 December 18, 2025 01:02
@codeflash-ai codeflash-ai bot added the "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: High" (Optimization Quality according to Codeflash) labels Dec 18, 2025