@codeflash-ai codeflash-ai bot commented Dec 18, 2025

📄 66% (0.66x) speedup for extract_google_drive_file_id in skyvern/forge/sdk/api/files.py

⏱️ Runtime: 3.45 milliseconds → 2.08 milliseconds (best of 189 runs)

📝 Explanation and details

The optimization replaces the per-call re.search(pattern, url) with a regex that is compiled once and stored in the module-level variable _drive_file_id_re.

Key Change: Instead of calling re.search(pattern, url), which has to look the pattern up in the re module's compiled-pattern cache on every call (and compile it on a cache miss), the optimized version stores the result of re.compile(pattern) as _drive_file_id_re and calls its .search() method directly.
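A minimal before/after sketch of the change described above. The pattern string is an assumption (this description does not show the actual regex), and the function names extract_original and extract_optimized are hypothetical; only _drive_file_id_re is a name taken from the PR.

```python
import re
from typing import Optional

# Assumed pattern, for illustration only; the real one lives in
# skyvern/forge/sdk/api/files.py and may differ.
_PATTERN = r"/file/d/([a-zA-Z0-9_-]+)"


def extract_original(url: str) -> Optional[str]:
    """Original shape: the pattern string is handed to re.search() on every call."""
    match = re.search(_PATTERN, url)
    return match.group(1) if match else None


# Compiled once, when the module is imported.
_drive_file_id_re = re.compile(_PATTERN)


def extract_optimized(url: str) -> Optional[str]:
    """Optimized shape: reuse the pre-compiled pattern object."""
    match = _drive_file_id_re.search(url)
    return match.group(1) if match else None
```

Both variants return the same value for the same input; the only difference is where the compiled pattern comes from.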

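Why This is Faster: Compiling a regular expression is expensive, since it involves parsing the pattern and constructing the internal matcher. The re module caches compiled patterns, so the original code does not recompile on every call, but each re.search() call still pays for the cache lookup and an extra layer of function calls before matching starts. The line profiler shows the original re.search() line taking 16.23ms (79.4% of total time) versus 3.86ms (47.5% of total time) for the pre-compiled search: about 4.2x faster on that line, and roughly 2.5x less total profiled time (about 20.4ms versus 8.1ms).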
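To check the per-call difference locally, a micro-benchmark along these lines can be used. It is only a sketch: the pattern, the URL, and the helper names are assumptions for illustration, and absolute timings will vary by machine and Python version.

```python
import re
import timeit

# Assumed pattern and URL, for illustration only.
PATTERN = r"/file/d/([a-zA-Z0-9_-]+)"
URL = "https://drive.google.com/file/d/1A2B3C4D5E6F7G8H9I0J/view?usp=sharing"
COMPILED = re.compile(PATTERN)


def via_module_function():
    # Goes through the re module's pattern cache on each call.
    return re.search(PATTERN, URL)


def via_compiled_pattern():
    # Calls the compiled pattern object directly, skipping the cache lookup.
    return COMPILED.search(URL)


def benchmark(number=200_000):
    """Return the seconds taken by each variant for `number` calls."""
    t_module = timeit.timeit(via_module_function, number=number)
    t_compiled = timeit.timeit(via_compiled_pattern, number=number)
    return t_module, t_compiled


if __name__ == "__main__":
    t_module, t_compiled = benchmark()
    print(f"re.search(pattern, url): {t_module:.3f}s")
    print(f"compiled.search(url):    {t_compiled:.3f}s")
```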

Performance Impact: The download_file() function calls extract_google_drive_file_id() for every Google Drive URL processed, so this function runs on each file in download workflows. The measured saving is well under a microsecond per call (for example 1.67μs -> 1.11μs on a standard URL), so the 66% figure applies to this function alone; the effect on end-to-end download time depends on how often the function is called relative to the rest of the work.

Test Case Performance: The generated tests show speedups ranging from about 28% to 149% per test case. The largest gains are on inputs that fail quickly, such as the empty string and URLs without a /file/d/ segment (89-149% faster), likely because the fixed per-call overhead of re.search() makes up a larger share of the runtime there. The large-scale tests that loop over 1000 URLs remain 46-108% faster.

The module-level compilation happens only once at import time, so repeated calls pay no compilation or cache-lookup cost.

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 8091 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import re

# imports
import pytest  # used for our unit tests
from skyvern.forge.sdk.api.files import extract_google_drive_file_id

# unit tests

# -------------------------
# Basic Test Cases
# -------------------------

def test_basic_standard_url():
    # Test a standard Google Drive file URL
    url = "https://drive.google.com/file/d/1A2B3C4D5E6F7G8H9I0J/view?usp=sharing"
    expected = "1A2B3C4D5E6F7G8H9I0J"
    codeflash_output = extract_google_drive_file_id(url) # 1.67μs -> 1.11μs (49.9% faster)

def test_basic_url_with_extra_params():
    # Test URL with extra query parameters
    url = "https://drive.google.com/file/d/abc123XYZ/view?usp=sharing&foo=bar"
    expected = "abc123XYZ"
    codeflash_output = extract_google_drive_file_id(url) # 1.71μs -> 1.10μs (54.5% faster)

def test_basic_url_with_trailing_slash():
    # Test URL with trailing slash after file ID
    url = "https://drive.google.com/file/d/abc123XYZ/"
    expected = "abc123XYZ"
    codeflash_output = extract_google_drive_file_id(url) # 1.51μs -> 981ns (53.6% faster)

def test_basic_url_with_dash_and_underscore():
    # Test file ID containing dashes and underscores
    url = "https://drive.google.com/file/d/abc-123_XYZ/view"
    expected = "abc-123_XYZ"
    codeflash_output = extract_google_drive_file_id(url) # 1.67μs -> 1.03μs (61.1% faster)

def test_basic_url_with_minimum_id_length():
    # Test file ID with only one character
    url = "https://drive.google.com/file/d/a/view"
    expected = "a"
    codeflash_output = extract_google_drive_file_id(url) # 1.71μs -> 1.01μs (69.7% faster)

# -------------------------
# Edge Test Cases
# -------------------------

def test_edge_no_file_id():
    # Test URL missing file ID
    url = "https://drive.google.com/file/d//view"
    codeflash_output = extract_google_drive_file_id(url) # 1.58μs -> 949ns (67.0% faster)

def test_edge_url_with_similar_but_wrong_pattern():
    # Test URL that looks similar but not matching pattern
    url = "https://drive.google.com/drive/folders/abc123XYZ"
    codeflash_output = extract_google_drive_file_id(url) # 1.23μs -> 628ns (96.3% faster)

def test_edge_non_google_drive_url():
    # Test a completely unrelated URL
    url = "https://example.com/file/d/abc123XYZ/view"
    codeflash_output = extract_google_drive_file_id(url) # 1.77μs -> 1.17μs (51.3% faster)

def test_edge_empty_string():
    # Test empty string input
    url = ""
    codeflash_output = extract_google_drive_file_id(url) # 1.20μs -> 528ns (127% faster)

def test_edge_none_input():
    # Test None input (should raise TypeError)
    with pytest.raises(TypeError):
        extract_google_drive_file_id(None) # 2.24μs -> 1.44μs (55.5% faster)

def test_edge_file_id_with_special_characters():
    # Test file ID with invalid characters (should not match)
    url = "https://drive.google.com/file/d/abc$123XYZ/view"
    codeflash_output = extract_google_drive_file_id(url) # 1.90μs -> 1.23μs (54.1% faster)

def test_edge_file_id_with_multiple_file_id_segments():
    # Test URL with multiple /file/d/ segments (should match first)
    url = "https://drive.google.com/file/d/firstID/file/d/secondID/view"
    expected = "firstID"
    codeflash_output = extract_google_drive_file_id(url) # 1.76μs -> 1.18μs (49.2% faster)

def test_edge_file_id_with_long_id():
    # Test file ID with maximum reasonable length (100 chars)
    file_id = "a" * 100
    url = f"https://drive.google.com/file/d/{file_id}/view"
    expected = file_id
    codeflash_output = extract_google_drive_file_id(url) # 1.88μs -> 1.26μs (49.4% faster)

def test_edge_file_id_with_non_ascii():
    # Test file ID with non-ASCII characters (should not match)
    url = "https://drive.google.com/file/d/abc123éü/view"
    codeflash_output = extract_google_drive_file_id(url) # 1.94μs -> 1.28μs (52.0% faster)

def test_edge_url_with_spaces():
    # Test URL containing spaces (should not match)
    url = "https://drive.google.com/file/d/abc 123XYZ/view"
    codeflash_output = extract_google_drive_file_id(url) # 1.81μs -> 1.18μs (53.7% faster)

def test_edge_url_with_subdomain():
    # Test URL with subdomain (should still match)
    url = "https://subdomain.drive.google.com/file/d/abc123XYZ/view"
    expected = "abc123XYZ"
    codeflash_output = extract_google_drive_file_id(url) # 1.79μs -> 1.18μs (51.6% faster)

def test_edge_url_with_port():
    # Test URL with port number (should still match)
    url = "https://drive.google.com:443/file/d/abc123XYZ/view"
    expected = "abc123XYZ"
    codeflash_output = extract_google_drive_file_id(url) # 1.75μs -> 1.10μs (59.3% faster)

def test_edge_url_with_multiple_query_params():
    # Test URL with multiple query params after file ID
    url = "https://drive.google.com/file/d/abc123XYZ/view?foo=bar&baz=qux"
    expected = "abc123XYZ"
    codeflash_output = extract_google_drive_file_id(url) # 1.76μs -> 1.12μs (57.8% faster)

def test_edge_url_with_fragment():
    # Test URL with fragment after file ID
    url = "https://drive.google.com/file/d/abc123XYZ/view#section"
    expected = "abc123XYZ"
    codeflash_output = extract_google_drive_file_id(url) # 1.81μs -> 1.13μs (60.7% faster)

def test_edge_url_with_unusual_path():
    # Test URL with additional path segments after file ID
    url = "https://drive.google.com/file/d/abc123XYZ/view/extra"
    expected = "abc123XYZ"
    codeflash_output = extract_google_drive_file_id(url) # 1.74μs -> 1.12μs (54.9% faster)

def test_edge_url_with_uppercase_id():
    # Test file ID in uppercase
    url = "https://drive.google.com/file/d/ABCDEF123456/view"
    expected = "ABCDEF123456"
    codeflash_output = extract_google_drive_file_id(url) # 1.68μs -> 1.06μs (57.9% faster)

def test_edge_url_with_mixed_case_id():
    # Test file ID with mixed case
    url = "https://drive.google.com/file/d/aBcDeF123/view"
    expected = "aBcDeF123"
    codeflash_output = extract_google_drive_file_id(url) # 1.69μs -> 1.12μs (51.5% faster)

# -------------------------
# Large Scale Test Cases
# -------------------------

def test_large_scale_multiple_valid_urls():
    # Test extracting file IDs from a large list of valid URLs
    base_url = "https://drive.google.com/file/d/{}/view"
    file_ids = [f"id_{i:03d}" for i in range(1000)]  # 1000 unique file IDs
    urls = [base_url.format(fid) for fid in file_ids]
    # Check that each file ID is correctly extracted
    for url, expected in zip(urls, file_ids):
        codeflash_output = extract_google_drive_file_id(url) # 443μs -> 275μs (61.0% faster)

def test_large_scale_multiple_invalid_urls():
    # Test extracting file IDs from a large list of invalid URLs
    urls = [f"https://example.com/file/d/id_{i:03d}/view" for i in range(1000)]
    # None should match
    for url in urls:
        codeflash_output = extract_google_drive_file_id(url) # 442μs -> 268μs (65.0% faster)

def test_large_scale_mixed_valid_and_invalid_urls():
    # Test a mixed list of valid and invalid URLs
    valid_base = "https://drive.google.com/file/d/{}/view"
    invalid_base = "https://drive.google.com/drive/folders/{}"
    file_ids = [f"id_{i:03d}" for i in range(500)]
    valid_urls = [valid_base.format(fid) for fid in file_ids]
    invalid_urls = [invalid_base.format(fid) for fid in file_ids]
    urls = valid_urls + invalid_urls
    expected_results = file_ids + [None] * 500
    for url, expected in zip(urls, expected_results):
        codeflash_output = extract_google_drive_file_id(url) # 376μs -> 210μs (78.6% faster)

def test_large_scale_long_file_ids():
    # Test extracting file IDs with maximum allowed length in bulk
    long_ids = ["A" * 100 for _ in range(1000)]
    urls = [f"https://drive.google.com/file/d/{fid}/view" for fid in long_ids]
    for url, expected in zip(urls, long_ids):
        codeflash_output = extract_google_drive_file_id(url) # 526μs -> 360μs (45.9% faster)

def test_large_scale_urls_with_extra_path_and_params():
    # Test extracting file IDs from URLs with extra path segments and query params
    file_ids = [f"id_{i:03d}" for i in range(1000)]
    urls = [f"https://drive.google.com/file/d/{fid}/view/extra?foo=bar" for fid in file_ids]
    for url, expected in zip(urls, file_ids):
        codeflash_output = extract_google_drive_file_id(url) # 447μs -> 273μs (63.3% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import re

# imports
import pytest  # used for our unit tests
from skyvern.forge.sdk.api.files import extract_google_drive_file_id

# unit tests

# ------------------ Basic Test Cases ------------------

def test_basic_standard_url():
    # Standard Google Drive file URL
    url = "https://drive.google.com/file/d/1A2B3C4D5E6F7G8H9I0J/view?usp=sharing"
    codeflash_output = extract_google_drive_file_id(url) # 2.14μs -> 1.35μs (58.7% faster)

def test_basic_url_with_extra_params():
    # URL with extra query parameters
    url = "https://drive.google.com/file/d/abc123XYZ/view?usp=sharing&foo=bar"
    codeflash_output = extract_google_drive_file_id(url) # 1.79μs -> 1.11μs (60.6% faster)

def test_basic_url_with_different_file_id():
    # Different file ID, mixed case and underscores
    url = "https://drive.google.com/file/d/AbC_123-xYz/view"
    codeflash_output = extract_google_drive_file_id(url) # 1.76μs -> 1.10μs (60.6% faster)

def test_basic_url_with_trailing_slash():
    # URL with trailing slash after file ID
    url = "https://drive.google.com/file/d/1234567890/"
    codeflash_output = extract_google_drive_file_id(url) # 1.61μs -> 1.08μs (48.3% faster)

def test_basic_url_with_no_view():
    # URL without /view or any suffix
    url = "https://drive.google.com/file/d/abcdefg"
    codeflash_output = extract_google_drive_file_id(url) # 1.71μs -> 1.19μs (44.3% faster)

# ------------------ Edge Test Cases ------------------

def test_edge_url_with_no_file_id():
    # URL missing file ID
    url = "https://drive.google.com/file/d//view"
    codeflash_output = extract_google_drive_file_id(url) # 1.62μs -> 995ns (62.8% faster)

def test_edge_url_with_no_file_segment():
    # URL missing /file/d/ segment
    url = "https://drive.google.com/drive/folders/1A2B3C4D5E6F7G8H9I0J"
    codeflash_output = extract_google_drive_file_id(url) # 1.29μs -> 686ns (88.8% faster)

def test_edge_url_with_non_google_domain():
    # Non-Google Drive domain
    url = "https://example.com/file/d/1A2B3C4D5E6F7G8H9I0J/view"
    codeflash_output = extract_google_drive_file_id(url) # 1.84μs -> 1.26μs (46.2% faster)

def test_edge_url_with_multiple_file_ids():
    # Multiple /file/d/ segments; should match first one
    url = "https://drive.google.com/file/d/firstID/view/file/d/secondID/view"
    codeflash_output = extract_google_drive_file_id(url) # 1.83μs -> 1.22μs (50.2% faster)

def test_edge_url_with_special_characters_in_id():
    # File ID with allowed special characters
    url = "https://drive.google.com/file/d/aA-_Zz09/view"
    codeflash_output = extract_google_drive_file_id(url) # 1.86μs -> 1.15μs (61.2% faster)

def test_edge_empty_string():
    # Empty string as input
    url = ""
    codeflash_output = extract_google_drive_file_id(url) # 1.23μs -> 496ns (149% faster)

def test_edge_none_input():
    # None as input should raise TypeError
    with pytest.raises(TypeError):
        extract_google_drive_file_id(None) # 2.32μs -> 1.44μs (60.6% faster)

def test_edge_url_with_spaces():
    # URL with spaces (should not match)
    url = "https://drive.google.com/file/d/abc 123/view"
    codeflash_output = extract_google_drive_file_id(url) # 1.89μs -> 1.32μs (43.6% faster)

def test_edge_url_with_partial_match():
    # URL with /file/d/ but not followed by valid ID
    url = "https://drive.google.com/file/d//view"
    codeflash_output = extract_google_drive_file_id(url) # 1.68μs -> 972ns (73.1% faster)

def test_edge_url_with_long_id():
    # File ID at the upper length limit (e.g., 100 chars)
    file_id = "A" * 100
    url = f"https://drive.google.com/file/d/{file_id}/view"
    codeflash_output = extract_google_drive_file_id(url) # 1.87μs -> 1.32μs (41.8% faster)

def test_edge_url_with_invalid_characters():
    # File ID contains invalid characters (should only match up to first invalid char)
    url = "https://drive.google.com/file/d/abc123!@#/view"
    codeflash_output = extract_google_drive_file_id(url) # 1.85μs -> 1.14μs (62.5% faster)

def test_edge_url_with_subdomain():
    # Google Drive URL with subdomain
    url = "https://docs.drive.google.com/file/d/abc123/view"
    codeflash_output = extract_google_drive_file_id(url) # 1.75μs -> 1.19μs (47.5% faster)

def test_edge_url_with_uppercase_drive():
    # URL with uppercase DRIVE (should still match, regex is case-sensitive for path)
    url = "https://DRIVE.google.com/file/d/abc123/view"
    codeflash_output = extract_google_drive_file_id(url) # 1.78μs -> 1.12μs (58.4% faster)

# ------------------ Large Scale Test Cases ------------------

def test_large_scale_many_urls():
    # Test with a large number of valid URLs
    base_url = "https://drive.google.com/file/d/{}/view"
    ids = [f"id_{i:03d}" for i in range(1000)]  # 1000 unique file IDs
    urls = [base_url.format(file_id) for file_id in ids]
    # Ensure all IDs are extracted correctly
    for url, file_id in zip(urls, ids):
        codeflash_output = extract_google_drive_file_id(url) # 441μs -> 269μs (63.6% faster)

def test_large_scale_many_invalid_urls():
    # Test with a large number of invalid URLs
    invalid_urls = [f"https://drive.google.com/file/x/{i}/view" for i in range(1000)]
    # None should match
    for url in invalid_urls:
        codeflash_output = extract_google_drive_file_id(url) # 299μs -> 143μs (108% faster)

def test_large_scale_mixed_valid_and_invalid():
    # Mix valid and invalid URLs
    valid_ids = [f"valid_{i}" for i in range(500)]
    valid_urls = [f"https://drive.google.com/file/d/{file_id}/view" for file_id in valid_ids]
    invalid_urls = [f"https://drive.google.com/file/x/{i}/view" for i in range(500)]
    urls = valid_urls + invalid_urls
    # Valid URLs come first, followed by the invalid ones (no shuffling is performed)
    for url in urls:
        codeflash_output = extract_google_drive_file_id(url)

def test_large_scale_long_file_ids():
    # Test with very long file IDs (max 1000 chars)
    long_id = "A" * 1000
    url = f"https://drive.google.com/file/d/{long_id}/view"
    codeflash_output = extract_google_drive_file_id(url) # 2.86μs -> 2.23μs (28.0% faster)

def test_large_scale_urls_with_varied_patterns():
    # Test with URLs that have varied but valid patterns
    ids = [f"id_{i}" for i in range(10)]
    patterns = [
        "https://drive.google.com/file/d/{}/view",
        "https://drive.google.com/file/d/{}/",
        "https://drive.google.com/file/d/{}/edit",
        "https://drive.google.com/file/d/{}/preview",
        "https://drive.google.com/file/d/{}/somethingelse",
    ]
    for file_id in ids:
        for pattern in patterns:
            url = pattern.format(file_id)
            codeflash_output = extract_google_drive_file_id(url)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
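The equivalence check mentioned in the comment above can be sketched as follows. This is not Codeflash's actual harness: the pattern, the two function variants, and the run helper are hypothetical stand-ins that only illustrate comparing outputs (including raised exception types) of an original and an optimized implementation.

```python
import re

# Assumed pattern, for illustration only.
_PATTERN = r"/file/d/([a-zA-Z0-9_-]+)"
_drive_file_id_re = re.compile(_PATTERN)


def original(url):
    match = re.search(_PATTERN, url)
    return match.group(1) if match else None


def optimized(url):
    match = _drive_file_id_re.search(url)
    return match.group(1) if match else None


def run(func, arg):
    """Return ("ok", result) or ("error", exception type) so failures compare too."""
    try:
        return ("ok", func(arg))
    except Exception as exc:
        return ("error", type(exc))


SAMPLES = [
    "https://drive.google.com/file/d/abc123XYZ/view?usp=sharing",
    "https://drive.google.com/drive/folders/abc123XYZ",
    "https://drive.google.com/file/d//view",
    "",
    None,  # both variants are expected to raise TypeError here
]


def outputs_match(samples):
    """True when both implementations behave identically on every sample."""
    return all(run(original, s) == run(optimized, s) for s in samples)
```

Comparing the exception type as well as the return value keeps inputs such as None in scope, where both variants fail rather than return.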

To edit these changes, run git checkout codeflash/optimize-extract_google_drive_file_id-mjaq2z7g and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 December 18, 2025 00:49
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Dec 18, 2025