@codeflash-ai codeflash-ai bot commented Dec 17, 2025

📄 127% speedup (about 2.27x as fast) for is_remote_uri in xarray/core/utils.py

⏱️ Runtime : 3.48 milliseconds → 1.54 milliseconds (best of 19 runs)

📝 Explanation and details

The optimization compiles the regex once at module level and replaces the per-call re.search() with the compiled pattern's match() method plus an is not None check.

Key optimizations applied:

  1. Pre-compiled regex pattern: The regex pattern r"^[a-z][a-z0-9]*(\://|\:\:)" is compiled once at module import time and stored in _PATTERN_REMOTE_URI, so each call skips the pattern-cache lookup that the module-level re functions perform (and any recompilation after a cache miss).

  2. Switched from re.search() to re.match(): Since the pattern already starts with ^ (beginning of string anchor), re.match() is more efficient as it only checks from the start of the string rather than searching through the entire string.

  3. Explicit is not None comparison: Replaced bool() conversion with direct is not None check, which is slightly more efficient and clearer.
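
A minimal sketch of the before/after shape that the three changes above describe. The pattern and the name `_PATTERN_REMOTE_URI` come from the description; the `is_remote_uri_before` and `is_remote_uri_after` names are hypothetical and exist only to put both versions side by side, and the bodies are reconstructed from the description rather than copied from the diff.

```python
import re

# Pattern and constant name as given in the description above.
_PATTERN_REMOTE_URI = re.compile(r"^[a-z][a-z0-9]*(\://|\:\:)")


def is_remote_uri_before(path: str) -> bool:
    # Hypothetical name: per-call re.search() with a string pattern, wrapped in bool().
    return bool(re.search(r"^[a-z][a-z0-9]*(\://|\:\:)", path))


def is_remote_uri_after(path: str) -> bool:
    # Hypothetical name: compiled pattern, match() from the start, explicit None check.
    return _PATTERN_REMOTE_URI.match(path) is not None


samples = [
    "http://example.com",
    "s3::bucket/key",
    "/usr/local/file.txt",
    "C:\\Users\\file.txt",
    "",
]
# Each entry pairs the old result with the new one for the same input.
results = [(is_remote_uri_before(s), is_remote_uri_after(s)) for s in samples]
```

Because the pattern starts with ^ and is compiled without re.MULTILINE, search() and match() accept exactly the same strings here, so the change should not alter results.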

Why this leads to speedup:

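  • Eliminated per-call pattern lookup: Most of the gain likely comes from no longer going through the module-level re.search(). Python's re module caches compiled patterns, so the original code did not recompile the regex on every call, but each call still paid for a cache lookup and an extra layer of function calls, which is significant for a function this small and this frequently called.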
  • More efficient matching: re.match() is faster than re.search() for patterns anchored at the beginning since it doesn't need to scan the entire string.
  • Better object reuse: The compiled pattern object is reused across all function calls.
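
One way to check these claims locally is a small timeit comparison. This is a sketch under assumptions: the helper names (`via_module_search`, `via_compiled_match`, `best_time`) and the sample inputs are illustrative, and the absolute numbers depend on the machine and Python version, so no expected output is given.

```python
import re
import timeit

PATTERN = r"^[a-z][a-z0-9]*(\://|\:\:)"
COMPILED = re.compile(PATTERN)


def via_module_search(path: str) -> bool:
    # Module-level call: goes through re's internal pattern cache each time.
    return bool(re.search(PATTERN, path))


def via_compiled_match(path: str) -> bool:
    # Reuses the compiled pattern object directly.
    return COMPILED.match(path) is not None


def best_time(func, arg, number=5_000, repeat=3):
    # Best-of-N per-call time in seconds, as is conventional with timeit.
    return min(timeit.repeat(lambda: func(arg), number=number, repeat=repeat)) / number


inputs = {
    "matching": "http://example.com",
    "short non-matching": "/usr/local/file.txt",
    "long non-matching": "folder/" + "a" * 990,
}
timings = {
    label: (best_time(via_module_search, arg), best_time(via_compiled_match, arg))
    for label, arg in inputs.items()
}
for label, (old, new) in timings.items():
    print(f"{label}: {old * 1e6:.2f}us -> {new * 1e6:.2f}us")
```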

Impact on workloads:

Based on the function references, is_remote_uri() is called in critical paths within xarray's backend system:

  • File path validation in _get_default_engine() and _get_mtime()
  • Path normalization in _normalize_path() and _find_absolute_paths()

These functions are likely called during dataset opening and file operations, making this optimization particularly valuable for workloads that process many files or repeatedly check URI types.

Test case performance:

The generated tests show speedups of roughly 46% to 286% across all scenarios, with the largest gains on:

  • Invalid URI detection (non-matching cases): about 57-286% faster, as re.match() fails faster on invalid inputs
  • Large-scale processing: about 77-258% faster when processing many URIs in batches
  • Long invalid strings: up to 286% faster when a large non-URI string is rejected early (a 1000-character run of valid scheme characters gains only 63.4%, since both versions must consume the whole run)
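
The percentages in this report appear to be the time saved relative to the optimized runtime. That formula is an assumption inferred from the published figures, not something the report states; the sketch below applies it to three runtime pairs taken from this page. Since the page shows rounded runtimes, the results land within about a point of the reported values rather than matching them exactly.

```python
def percent_faster(original: float, optimized: float) -> float:
    # Assumed formula: time saved relative to the optimized runtime.
    return (original - optimized) / optimized * 100.0


# (original, optimized, reported percent) as shown on this page.
pairs = {
    "overall runtime (ms)": (3.48, 1.54, 127.0),
    "test_local_file_path (us)": (3.41, 1.65, 106.0),
    "test_large_invalid_path (us)": (7.82, 2.02, 286.0),
}
computed = {
    name: round(percent_faster(original, optimized), 1)
    for name, (original, optimized, _reported) in pairs.items()
}
for name, value in computed.items():
    print(f"{name}: computed {value}%, reported {pairs[name][2]}%")
```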

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 37 Passed |
| 🌀 Generated Regression Tests | 4498 Passed |
| ⏪ Replay Tests | 255 Passed |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| test_utils.py::test_is_remote_uri | 7.20μs | 3.88μs | 85.3%✅ |

🌀 Generated Regression Tests and Runtime
import re
import string  # used for generating large scale test cases

# imports
import pytest  # used for our unit tests
from xarray.core.utils import is_remote_uri

# unit tests

# ---------------------------
# Basic Test Cases
# ---------------------------


def test_http_protocol():
    # Standard HTTP URL
    codeflash_output = is_remote_uri(
        "http://example.com"
    )  # 4.26μs -> 2.68μs (58.9% faster)


def test_https_protocol():
    # Standard HTTPS URL
    codeflash_output = is_remote_uri(
        "https://example.com"
    )  # 3.83μs -> 2.31μs (66.0% faster)


def test_ftp_protocol():
    # Standard FTP URL
    codeflash_output = is_remote_uri(
        "ftp://example.com"
    )  # 3.66μs -> 2.31μs (58.1% faster)


def test_s3_protocol():
    # S3 protocol with double colon
    codeflash_output = is_remote_uri(
        "s3::bucket/key"
    )  # 3.60μs -> 2.20μs (63.5% faster)


def test_custom_protocol():
    # Custom protocol with single colon double slash
    codeflash_output = is_remote_uri(
        "myproto://some/path"
    )  # 3.71μs -> 2.17μs (71.5% faster)


def test_custom_protocol_double_colon():
    # Custom protocol with double colon
    codeflash_output = is_remote_uri(
        "myproto::some/path"
    )  # 3.69μs -> 2.21μs (66.9% faster)


def test_local_file_path():
    # Local file path should not be considered remote
    codeflash_output = is_remote_uri(
        "/usr/local/file.txt"
    )  # 3.41μs -> 1.65μs (106% faster)


def test_windows_path():
    # Windows path should not be considered remote
    codeflash_output = is_remote_uri(
        "C:\\Users\\file.txt"
    )  # 2.94μs -> 1.52μs (93.7% faster)


def test_relative_path():
    # Relative path should not be considered remote
    codeflash_output = is_remote_uri(
        "folder/file.txt"
    )  # 3.49μs -> 1.98μs (76.1% faster)


def test_url_like_but_no_protocol():
    # Looks like a URL but missing protocol
    codeflash_output = is_remote_uri(
        "://example.com"
    )  # 2.94μs -> 1.56μs (87.7% faster)


# ---------------------------
# Edge Test Cases
# ---------------------------


def test_empty_string():
    # Empty string should not be considered remote
    codeflash_output = is_remote_uri("")  # 2.00μs -> 1.28μs (56.6% faster)


def test_protocol_with_uppercase():
    # Protocol must start with lowercase letter, so this should fail
    codeflash_output = is_remote_uri(
        "HTTP://example.com"
    )  # 3.27μs -> 1.53μs (113% faster)


def test_protocol_with_number_start():
    # Protocol must start with a letter, not a number
    codeflash_output = is_remote_uri(
        "1http://example.com"
    )  # 3.23μs -> 1.59μs (103% faster)


def test_protocol_with_underscore():
    # Protocol cannot contain underscores
    codeflash_output = is_remote_uri(
        "my_proto://example.com"
    )  # 3.64μs -> 2.00μs (81.6% faster)


def test_protocol_with_hyphen():
    # Protocol cannot contain hyphens
    codeflash_output = is_remote_uri(
        "my-proto://example.com"
    )  # 3.75μs -> 2.01μs (86.9% faster)


def test_protocol_with_dot():
    # Protocol cannot contain dots
    codeflash_output = is_remote_uri(
        "my.proto://example.com"
    )  # 3.75μs -> 2.00μs (87.4% faster)


def test_protocol_with_mixed_case():
    # Protocol must be all lowercase, so this should fail
    codeflash_output = is_remote_uri(
        "hTtp://example.com"
    )  # 3.48μs -> 1.98μs (75.5% faster)


def test_protocol_with_trailing_colon():
    # Protocol with only a single colon should not match
    codeflash_output = is_remote_uri(
        "http:example.com"
    )  # 3.83μs -> 2.08μs (83.7% faster)


def test_protocol_with_triple_colon():
    # Triple colon still matches, because the prefix "http::" satisfies the pattern
    codeflash_output = is_remote_uri(
        "http:::example.com"
    )  # 3.76μs -> 2.23μs (68.1% faster)


def test_protocol_with_no_slash():
    # Protocol with colon but no slashes should not match
    codeflash_output = is_remote_uri(
        "http:example.com"
    )  # 3.69μs -> 2.00μs (84.4% faster)


def test_protocol_with_spaces():
    # Protocol with spaces is invalid
    codeflash_output = is_remote_uri(
        "ht tp://example.com"
    )  # 3.41μs -> 1.92μs (77.3% faster)


def test_protocol_with_special_chars():
    # Protocol with special chars is invalid
    codeflash_output = is_remote_uri(
        "ht@tp://example.com"
    )  # 3.71μs -> 1.92μs (93.7% faster)


def test_protocol_with_only_colon():
    # Only colon, no protocol
    codeflash_output = is_remote_uri("::example.com")  # 3.34μs -> 1.59μs (110% faster)


def test_protocol_with_only_slashes():
    # Only slashes, no protocol
    codeflash_output = is_remote_uri("///example.com")  # 3.30μs -> 1.60μs (106% faster)


def test_protocol_with_no_path():
    # Protocol with no path after
    codeflash_output = is_remote_uri("http://")  # 3.72μs -> 2.26μs (64.5% faster)
    codeflash_output = is_remote_uri("s3::")  # 1.08μs -> 602ns (79.4% faster)


def test_protocol_with_long_name():
    # Very long but valid protocol name
    proto = "a" * 50
    codeflash_output = is_remote_uri(
        f"{proto}://example.com"
    )  # 3.60μs -> 2.36μs (52.9% faster)
    codeflash_output = is_remote_uri(
        f"{proto}::example.com"
    )  # 1.15μs -> 709ns (62.8% faster)


def test_protocol_with_numbers():
    # Protocol with numbers after first letter is valid
    codeflash_output = is_remote_uri(
        "a1b2c3://example.com"
    )  # 3.76μs -> 2.18μs (72.5% faster)


def test_protocol_with_leading_zero():
    # Protocol starting with zero is invalid
    codeflash_output = is_remote_uri(
        "0abc://example.com"
    )  # 3.25μs -> 1.54μs (111% faster)


def test_protocol_with_colon_and_slash_wrong_order():
    # Colon and slash in wrong order
    codeflash_output = is_remote_uri(
        "http:/example.com"
    )  # 3.88μs -> 2.09μs (85.4% faster)
    codeflash_output = is_remote_uri(
        "http//:example.com"
    )  # 1.18μs -> 609ns (93.6% faster)


def test_protocol_with_multiple_double_colon():
    # Multiple double colons, only first should be checked
    codeflash_output = is_remote_uri(
        "s3::bucket::key"
    )  # 3.94μs -> 2.24μs (75.8% faster)


def test_protocol_with_trailing_whitespace():
    # Trailing whitespace should not affect detection
    codeflash_output = is_remote_uri(
        "http://example.com "
    )  # 3.72μs -> 2.25μs (65.1% faster)


def test_protocol_with_leading_whitespace():
    # Leading whitespace should prevent detection
    codeflash_output = is_remote_uri(
        " http://example.com"
    )  # 3.42μs -> 1.65μs (108% faster)


def test_protocol_with_embedded_whitespace():
    # Embedded whitespace in protocol
    codeflash_output = is_remote_uri(
        "ht tp://example.com"
    )  # 3.61μs -> 1.92μs (88.3% faster)


def test_protocol_with_unicode_letters():
    # Protocol with unicode letters (should not match)
    codeflash_output = is_remote_uri(
        "hттp://example.com"
    )  # 4.11μs -> 2.61μs (57.6% faster)


def test_protocol_with_non_ascii_digit():
    # Protocol with non-ascii digit (should not match)
    codeflash_output = is_remote_uri(
        "a١b://example.com"
    )  # 3.66μs -> 2.29μs (59.6% faster)


def test_protocol_with_valid_and_invalid_mix():
    # Valid protocol followed by invalid character
    codeflash_output = is_remote_uri(
        "http_://example.com"
    )  # 3.57μs -> 2.05μs (74.2% faster)


# ---------------------------
# Large Scale Test Cases
# ---------------------------


def test_many_valid_protocols():
    # Test a large number of valid protocol names
    for i in range(1, 100):
        proto = "a" + "b" * i
        codeflash_output = is_remote_uri(
            f"{proto}://example.com"
        )  # 72.5μs -> 41.0μs (76.8% faster)
        codeflash_output = is_remote_uri(f"{proto}::example.com")


def test_many_invalid_protocols():
    # Test a large number of invalid protocol names (start with non-letter)
    for i in range(1, 100):
        proto = "1" + "b" * i
        codeflash_output = is_remote_uri(
            f"{proto}://example.com"
        )  # 86.4μs -> 24.1μs (258% faster)
        codeflash_output = is_remote_uri(f"{proto}::example.com")


def test_large_path_string():
    # Test with a very long path after the protocol
    path = "http://" + "a" * 900
    codeflash_output = is_remote_uri(path)  # 3.59μs -> 2.37μs (51.8% faster)


def test_large_non_remote_string():
    # Large string that is not a remote URI
    path = "a" * 1000
    codeflash_output = is_remote_uri(path)  # 13.6μs -> 8.30μs (63.4% faster)


def test_protocol_with_max_length():
    # Protocol name with maximum allowed length (arbitrary, here 255)
    proto = "a" * 255
    codeflash_output = is_remote_uri(
        f"{proto}://example.com"
    )  # 4.27μs -> 2.66μs (60.9% faster)


def test_protocol_with_all_valid_protocol_chars():
    # Protocol with all valid chars (lowercase letters and digits)
    proto = "".join([c for c in string.ascii_lowercase + string.digits])
    codeflash_output = is_remote_uri(
        f"{proto}://example.com"
    )  # 3.98μs -> 2.36μs (68.7% faster)


def test_protocol_with_all_invalid_protocol_chars():
    # Protocol with all invalid chars (should not match)
    proto = "".join([c for c in string.punctuation if c not in [":", "/"]])
    codeflash_output = is_remote_uri(
        f"{proto}://example.com"
    )  # 3.28μs -> 1.61μs (103% faster)


def test_protocol_with_mixed_valid_and_invalid_chars():
    # Protocol with a mix of valid and invalid chars
    proto = "abc$def"
    codeflash_output = is_remote_uri(
        f"{proto}://example.com"
    )  # 3.50μs -> 2.00μs (74.5% faster)


def test_many_random_strings():
    # Test a variety of random strings that should not match
    for i in range(100):
        s = "".join([chr((i + j) % 128) for j in range(10)])
        # Call the function under test on each generated string
        codeflash_output = is_remote_uri(s)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import re
import string  # used for generating large test cases

# imports
import pytest  # used for our unit tests
from xarray.core.utils import is_remote_uri

# unit tests

# ------------------------
# 1. Basic Test Cases
# ------------------------


def test_http_uri():
    # Standard http URI
    codeflash_output = is_remote_uri(
        "http://example.com"
    )  # 3.91μs -> 2.35μs (65.9% faster)


def test_https_uri():
    # Standard https URI
    codeflash_output = is_remote_uri(
        "https://example.com"
    )  # 3.87μs -> 2.35μs (64.7% faster)


def test_s3_uri():
    # S3 protocol
    codeflash_output = is_remote_uri(
        "s3://bucket/key"
    )  # 3.90μs -> 2.31μs (68.7% faster)


def test_custom_protocol_double_colon():
    # Custom protocol with double colon
    codeflash_output = is_remote_uri(
        "myproto::some/path"
    )  # 3.78μs -> 2.27μs (66.2% faster)


def test_custom_protocol_single_char():
    # Single character protocol
    codeflash_output = is_remote_uri("a://foo")  # 3.73μs -> 2.30μs (62.2% faster)


def test_non_remote_file_path():
    # Plain file path, no protocol
    codeflash_output = is_remote_uri(
        "/home/user/file.txt"
    )  # 3.18μs -> 1.63μs (95.3% faster)


def test_relative_file_path():
    # Relative file path
    codeflash_output = is_remote_uri(
        "folder/file.txt"
    )  # 3.73μs -> 1.97μs (88.8% faster)


def test_windows_path():
    # Windows path with drive letter
    codeflash_output = is_remote_uri(
        "C:\\Users\\file.txt"
    )  # 2.96μs -> 1.60μs (85.2% faster)


def test_empty_string():
    # Empty string should not match
    codeflash_output = is_remote_uri("")  # 2.24μs -> 1.21μs (85.6% faster)


def test_protocol_with_numbers():
    # Protocol contains numbers
    codeflash_output = is_remote_uri(
        "proto123://foo"
    )  # 3.96μs -> 2.32μs (70.4% faster)


def test_protocol_with_uppercase():
    # Protocol starts with uppercase letter (should not match, regex is lowercase only)
    codeflash_output = is_remote_uri(
        "HTTP://example.com"
    )  # 3.42μs -> 1.67μs (105% faster)


def test_protocol_with_mixed_case():
    # Protocol with mixed case (should not match, regex is lowercase only)
    codeflash_output = is_remote_uri(
        "hTtP://example.com"
    )  # 3.69μs -> 1.97μs (87.3% faster)


def test_protocol_with_underscore():
    # Protocol with underscore (should not match, regex only allows a-z0-9)
    codeflash_output = is_remote_uri(
        "my_proto://foo"
    )  # 3.73μs -> 1.88μs (98.6% faster)


def test_protocol_with_dash():
    # Protocol with dash (should not match, regex only allows a-z0-9)
    codeflash_output = is_remote_uri(
        "my-proto://foo"
    )  # 3.33μs -> 1.94μs (71.3% faster)


def test_protocol_with_colon_in_path():
    # Colon in path, not protocol
    codeflash_output = is_remote_uri(
        "folder:subfolder/file.txt"
    )  # 3.44μs -> 2.14μs (60.8% faster)


def test_protocol_with_trailing_colon():
    # Protocol with trailing colon but not double colon
    codeflash_output = is_remote_uri("proto:/foo")  # 3.68μs -> 2.13μs (72.9% faster)


def test_protocol_with_triple_colon():
    # Protocol with triple colon (should match, as starts with double colon)
    codeflash_output = is_remote_uri("proto:::foo")  # 3.79μs -> 2.33μs (62.6% faster)


# ------------------------
# 2. Edge Test Cases
# ------------------------


def test_protocol_starts_with_digit():
    # Protocol starts with a digit (should not match, regex requires a-z)
    codeflash_output = is_remote_uri("1proto://foo")  # 3.36μs -> 1.62μs (107% faster)


def test_protocol_starts_with_special_char():
    # Protocol starts with special character (should not match)
    codeflash_output = is_remote_uri("_proto://foo")  # 3.17μs -> 1.62μs (95.3% faster)


def test_protocol_with_only_colons():
    # Only colons, no protocol
    codeflash_output = is_remote_uri("::foo")  # 3.08μs -> 1.60μs (92.4% faster)


def test_protocol_with_spaces():
    # Spaces in protocol (should not match)
    codeflash_output = is_remote_uri(
        "my proto://foo"
    )  # 3.72μs -> 1.92μs (93.8% faster)


def test_protocol_with_tab():
    # Tab character in protocol (should not match)
    codeflash_output = is_remote_uri(
        "my\tproto://foo"
    )  # 3.47μs -> 1.96μs (76.5% faster)


def test_protocol_with_non_ascii():
    # Non-ASCII protocol (should not match)
    codeflash_output = is_remote_uri(
        "prötocol://foo"
    )  # 3.55μs -> 2.03μs (74.7% faster)


def test_protocol_with_long_name():
    # Very long protocol name (should match if only a-z0-9)
    proto = "a" * 100
    codeflash_output = is_remote_uri(
        f"{proto}://foo"
    )  # 3.91μs -> 2.46μs (58.9% faster)


def test_protocol_with_no_path():
    # Protocol but no path (should match)
    codeflash_output = is_remote_uri("s3://")  # 3.73μs -> 2.37μs (57.9% faster)


def test_protocol_with_double_colon_and_slash():
    # Protocol with double colon and slash (should match)
    codeflash_output = is_remote_uri("proto::/foo")  # 3.61μs -> 2.44μs (48.2% faster)


def test_protocol_with_colon_and_slash_in_path():
    # Colon and slash in path, not protocol
    codeflash_output = is_remote_uri(
        "folder:/file.txt"
    )  # 3.75μs -> 2.17μs (72.7% faster)


def test_protocol_with_only_protocol_and_colons():
    # Only protocol and colons, no path
    codeflash_output = is_remote_uri("proto::")  # 4.05μs -> 2.34μs (73.3% faster)


def test_protocol_with_only_protocol_and_double_slash():
    # Only protocol and double slash, no path
    codeflash_output = is_remote_uri("proto://")  # 3.58μs -> 2.40μs (49.4% faster)


def test_protocol_with_trailing_spaces():
    # Space after the separator (still matches, because the prefix "proto://" satisfies the pattern)
    codeflash_output = is_remote_uri("proto:// foo")  # 3.81μs -> 2.31μs (64.5% faster)


def test_protocol_with_leading_spaces():
    # Leading spaces before protocol (should not match)
    codeflash_output = is_remote_uri(" proto://foo")  # 3.14μs -> 1.51μs (107% faster)


def test_protocol_with_leading_newline():
    # Leading newline before protocol (should not match)
    codeflash_output = is_remote_uri("\nproto://foo")  # 3.05μs -> 1.50μs (103% faster)


def test_protocol_with_unicode_spaces():
    # Unicode space before protocol (should not match)
    codeflash_output = is_remote_uri(
        "\u2003proto://foo"
    )  # 3.17μs -> 1.80μs (75.7% faster)


def test_protocol_with_path_containing_protocol():
    # Protocol in path, not at start
    codeflash_output = is_remote_uri(
        "folder/http://foo"
    )  # 3.65μs -> 2.11μs (73.1% faster)


def test_protocol_with_protocol_in_middle():
    # Protocol in middle of string
    codeflash_output = is_remote_uri("foo s3://bar")  # 3.58μs -> 1.92μs (86.3% faster)


def test_protocol_with_protocol_at_end():
    # Protocol and separator with nothing after them (should match)
    codeflash_output = is_remote_uri("foo://")  # 3.82μs -> 2.31μs (65.8% faster)


def test_protocol_with_multiple_protocols():
    # Multiple protocols in string, only first at start counts
    codeflash_output = is_remote_uri(
        "s3://http://foo"
    )  # 3.76μs -> 2.27μs (65.9% faster)


def test_protocol_with_double_colon_and_no_path():
    # Protocol with double colon and no path
    codeflash_output = is_remote_uri("proto::")  # 3.84μs -> 2.31μs (65.8% faster)


def test_protocol_with_double_colon_and_path_with_colon():
    # Protocol with double colon and path containing colon
    codeflash_output = is_remote_uri(
        "proto::foo:bar"
    )  # 3.50μs -> 2.33μs (50.4% faster)


def test_protocol_with_double_colon_and_path_with_slash():
    # Protocol with double colon and path containing slash
    codeflash_output = is_remote_uri(
        "proto::foo/bar"
    )  # 3.74μs -> 2.24μs (66.8% faster)


def test_protocol_with_double_colon_and_path_with_double_colon():
    # Protocol with double colon and path containing double colon
    codeflash_output = is_remote_uri(
        "proto::foo::bar"
    )  # 3.64μs -> 2.17μs (67.4% faster)


def test_protocol_with_double_colon_and_empty_path():
    # Protocol with double colon and empty path
    codeflash_output = is_remote_uri("proto::")  # 3.54μs -> 2.42μs (46.4% faster)


def test_protocol_with_double_slash_and_empty_path():
    # Protocol with double slash and empty path
    codeflash_output = is_remote_uri("proto://")  # 3.89μs -> 2.22μs (75.6% faster)


def test_protocol_with_double_slash_and_path_with_colon():
    # Protocol with double slash and path containing colon
    codeflash_output = is_remote_uri(
        "proto://foo:bar"
    )  # 3.85μs -> 2.27μs (69.5% faster)


def test_protocol_with_double_slash_and_path_with_slash():
    # Protocol with double slash and path containing slash
    codeflash_output = is_remote_uri(
        "proto://foo/bar"
    )  # 3.73μs -> 2.31μs (61.4% faster)


def test_protocol_with_double_slash_and_path_with_double_colon():
    # Protocol with double slash and path containing double colon
    codeflash_output = is_remote_uri(
        "proto://foo::bar"
    )  # 3.76μs -> 2.19μs (71.3% faster)


def test_protocol_with_double_slash_and_path_with_double_slash():
    # Protocol with double slash and path containing double slash
    codeflash_output = is_remote_uri(
        "proto://foo//bar"
    )  # 3.90μs -> 2.28μs (70.8% faster)


def test_protocol_with_protocol_and_no_separator():
    # Protocol with no separator (should not match)
    codeflash_output = is_remote_uri("protofoo")  # 3.54μs -> 1.95μs (81.7% faster)


def test_protocol_with_protocol_and_single_colon():
    # Protocol with single colon (should not match)
    codeflash_output = is_remote_uri("proto:foo")  # 3.73μs -> 1.95μs (91.2% faster)


def test_protocol_with_protocol_and_triple_colon():
    # Protocol with triple colon (should match, as starts with double colon)
    codeflash_output = is_remote_uri("proto:::foo")  # 3.80μs -> 2.30μs (65.2% faster)


def test_protocol_with_protocol_and_multiple_colons():
    # Protocol with multiple colons (should match if first is double colon)
    codeflash_output = is_remote_uri("proto::::foo")  # 3.73μs -> 2.27μs (64.1% faster)


# ------------------------
# 3. Large Scale Test Cases
# ------------------------


def test_many_remote_uris():
    # Test a large number of valid remote URIs
    for i in range(1000):
        proto = f"proto{i}"
        path = f"{proto}://some/path/{i}"
        codeflash_output = is_remote_uri(path)  # 602μs -> 287μs (110% faster)


def test_many_non_remote_uris():
    # Test a large number of invalid URIs
    for i in range(1000):
        path = f"folder_{i}/file.txt"
        codeflash_output = is_remote_uri(path)  # 665μs -> 275μs (142% faster)


def test_large_protocol_name():
    # Test with a protocol name at maximum reasonable length (255 chars)
    proto = "a" * 255
    path = f"{proto}://foo"
    codeflash_output = is_remote_uri(path)  # 4.36μs -> 2.73μs (59.8% faster)


def test_large_path():
    # Test with a very large path after valid protocol
    path = "s3://" + "a" * 990
    codeflash_output = is_remote_uri(path)  # 3.62μs -> 2.26μs (59.9% faster)


def test_large_invalid_path():
    # Test with a very large path with no protocol
    path = "folder/" + "a" * 990
    codeflash_output = is_remote_uri(path)  # 7.82μs -> 2.02μs (286% faster)


def test_protocols_with_all_valid_chars():
    # Test protocols with all valid chars (a-z, 0-9)
    valid_chars = string.ascii_lowercase + string.digits
    proto = "".join(valid_chars)
    path = f"{proto}://foo"
    codeflash_output = is_remote_uri(path)  # 3.79μs -> 2.37μs (60.3% faster)


def test_protocols_with_all_invalid_chars():
    # Test protocols with all invalid chars (upper, symbols)
    invalid_chars = string.ascii_uppercase + string.punctuation
    proto = "".join(invalid_chars)
    path = f"{proto}://foo"
    codeflash_output = is_remote_uri(path)  # 3.57μs -> 1.55μs (130% faster)


def test_mixed_large_scale():
    # Mix valid and invalid URIs in a large batch
    for i in range(500):
        # Valid
        codeflash_output = is_remote_uri(
            f"proto{i}://foo/bar/{i}"
        )  # 316μs -> 156μs (102% faster)
        # Invalid
        codeflash_output = is_remote_uri(f"folder_{i}/foo/bar/{i}")


def test_large_batch_double_colon():
    # Test a large batch of double colon URIs
    for i in range(500):
        proto = f"proto{i}"
        path = f"{proto}::foo/bar/{i}"
        codeflash_output = is_remote_uri(path)  # 304μs -> 143μs (112% faster)


def test_large_batch_invalid_double_colon():
    # Test a large batch of invalid double colon URIs (protocol starts with digit)
    for i in range(500):
        proto = f"{i}proto"
        path = f"{proto}::foo/bar/{i}"
        codeflash_output = is_remote_uri(path)  # 314μs -> 111μs (182% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from xarray.core.utils import is_remote_uri

def test_is_remote_uri():
    is_remote_uri('')

⏪ Replay Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| test_pytest_xarrayteststest_concat_py_xarrayteststest_computation_py_xarrayteststest_formatting_py_xarray__replay_test_0.py::test_xarray_core_utils_is_remote_uri | 242μs | 78.5μs | 209%✅ |

To edit these changes, run git checkout codeflash/optimize-is_remote_uri-mj9v9e20 and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 December 17, 2025 10:26
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Dec 17, 2025