@codeflash-ai codeflash-ai bot commented Dec 7, 2025

📄 124% (1.24x) speedup for regex_match in src/algorithms/string.py

⏱️ Runtime : 3.87 milliseconds → 1.73 milliseconds (best of 250 runs)

📝 Explanation and details

The optimization improves performance by compiling the regex pattern once, before the loop, instead of passing the pattern string to re.match on every iteration.

Key Change: Added regex = re.compile(pattern) before the loop and used regex.match(s) instead of re.match(pattern, s).
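
The diff itself is not reproduced in this comment, so the following is only a minimal sketch of the change described above, assuming regex_match filters a list of strings with re.match. The function bodies and the names regex_match_original and regex_match_optimized are illustrative assumptions, not the actual contents of src/algorithms/string.py.

```python
import re
from typing import List


def regex_match_original(strings: List[str], pattern: str) -> List[str]:
    # Assumed shape of the original: module-level re.match on every iteration.
    matched = []
    for s in strings:
        if re.match(pattern, s):
            matched.append(s)
    return matched


def regex_match_optimized(strings: List[str], pattern: str) -> List[str]:
    # Compile once before the loop, then reuse the compiled pattern object.
    regex = re.compile(pattern)
    matched = []
    for s in strings:
        if regex.match(s):
            matched.append(s)
    return matched


sample = ["apple", "banana", "applepie"]
print(regex_match_original(sample, r"apple"))
print(regex_match_optimized(sample, r"apple"))
```

Both calls return the same filtered list, because the matching semantics are unchanged. Only where the pattern is resolved differs.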

Why This is Faster: In the original code, every re.match(pattern, s) call has to resolve the pattern string to a compiled regex object. CPython's re module caches compiled patterns, so the pattern is not necessarily recompiled from scratch each time, but each call still pays for the module-level function call and the cache lookup. With thousands of strings to process, this per-call overhead becomes significant. The line profiler shows the original re.match(pattern, s) line took 83.4% of total runtime (34.9ms out of 41.9ms), while the optimized regex.match(s) line takes just 22% (5.4ms out of 24.7ms).

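Performance Impact: The optimization delivers a 124% speedup (from 3.87ms to 1.73ms) because:

  • The pattern is resolved to a compiled object once instead of 12,000+ times
  • regex.match(s) calls the compiled object's method directly; the matching itself is unchanged, but the wrapper overhead of re.match is gone
  • Per-call overhead inside the loop, including any temporary allocations, is likely reduced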
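
To check the per-call overhead on your own machine, a rough timing comparison can be sketched with timeit. The helper names, the input list, and the repetition count below are assumptions chosen for illustration. Absolute timings will vary with hardware and Python version, so no expected numbers are given.

```python
import re
import timeit

# Illustrative input, similar in size to the large-scale generated tests.
strings = [f"{i}foo" for i in range(1000)] + ["foo", "bar"]
pattern = r"^\d+foo"


def with_module_function():
    # Passes the pattern string to re.match on every iteration.
    return [s for s in strings if re.match(pattern, s)]


def with_precompiled():
    # Compiles once, then calls the compiled object's match method.
    regex = re.compile(pattern)
    return [s for s in strings if regex.match(s)]


module_time = timeit.timeit(with_module_function, number=200)
precompiled_time = timeit.timeit(with_precompiled, number=200)
print(f"re.match per call: {module_time:.4f}s")
print(f"precompiled:       {precompiled_time:.4f}s")
```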

Test Case Performance: The optimization particularly excels in:

  • Large-scale tests with 1000+ strings where pattern reuse is maximized
  • Complex patterns (unicode, lookaheads, quantifiers) where compilation cost is higher
  • Repeated pattern matching scenarios common in data processing pipelines

This optimization is especially valuable when the function processes large datasets or is called frequently, as the per-call savings grow with the number of strings processed.
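
One behavioural detail may be worth checking when a pattern is compiled ahead of a loop: an invalid pattern is rejected even when there are no strings to process. The sketch below uses two hypothetical helpers (filter_per_call and filter_precompiled) that mirror the before and after shapes described in the key change. Whether the real regex_match behaves this way depends on its actual implementation, which is not shown here.

```python
import re


def filter_per_call(strings, pattern):
    # Hypothetical original shape: the pattern is only used inside the loop.
    return [s for s in strings if re.match(pattern, s)]


def filter_precompiled(strings, pattern):
    # Hypothetical optimized shape: the pattern is compiled before the loop.
    regex = re.compile(pattern)
    return [s for s in strings if regex.match(s)]


bad_pattern = r"([a-z"  # unterminated character set

# With an empty input the loop body never runs, so the pattern is never checked.
print(filter_per_call([], bad_pattern))

# Compiling up front raises re.error regardless of the input size.
try:
    filter_precompiled([], bad_pattern)
except re.error as exc:
    print(f"re.error: {exc}")
```

For non-empty inputs both forms raise re.error on the first iteration or earlier, so the difference only shows up for an empty list.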

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 66 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 1 Passed |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
from __future__ import annotations

import re

# imports
import pytest  # used for our unit tests
from src.algorithms.string import regex_match

# unit tests

# --- Basic Test Cases ---


def test_basic_exact_match():
    # Test simple exact matching
    strings = ["apple", "banana", "apricot", "applepie"]
    pattern = r"apple"
    # 'apple' and 'applepie' should match, since re.match anchors only at the start of the string
    codeflash_output = regex_match(strings, pattern)


def test_basic_start_of_string():
    # Test ^ anchor (start of string)
    strings = ["cat", "catalog", "dog", "scatter"]
    pattern = r"^cat"
    # Should match 'cat' and 'catalog', but not 'scatter' (which starts with 'sca')
    codeflash_output = regex_match(strings, pattern)


def test_basic_dot_wildcard():
    # Test . wildcard
    strings = ["bat", "cat", "rat", "mat"]
    pattern = r".at"
    # All should match as . matches any single character
    codeflash_output = regex_match(strings, pattern)


def test_basic_digit_match():
    # Test \d digit matching
    strings = ["a1", "b2", "c3", "d", "12"]
    pattern = r".\d"
    # 'a1', 'b2', 'c3' and '12' should match (any single char + digit)
    codeflash_output = regex_match(strings, pattern)


def test_basic_alternation():
    # Test alternation (| operator)
    strings = ["dog", "cat", "cow", "bat"]
    pattern = r"dog|cat"
    # Should match 'dog' and 'cat'
    codeflash_output = regex_match(strings, pattern)


def test_basic_empty_pattern():
    # An empty pattern matches every string at the start
    strings = ["", "a", "b", " "]
    pattern = r""
    codeflash_output = regex_match(strings, pattern)


def test_basic_empty_strings():
    # Test empty strings in the input
    strings = ["", "a", "b"]
    pattern = r"^$"
    # Only the empty string should match
    codeflash_output = regex_match(strings, pattern)


# --- Edge Test Cases ---


def test_edge_empty_list():
    # Test with an empty list of strings
    strings = []
    pattern = r".*"
    codeflash_output = regex_match(strings, pattern)


def test_edge_special_characters():
    # Test special regex characters
    strings = ["a.c", "abc", "a-c", "a+c"]
    pattern = r"a\.c"
    # Only 'a.c' should match (escaped .)
    codeflash_output = regex_match(strings, pattern)


def test_edge_start_and_end_anchors():
    # Test ^ and $ anchors together (exact match)
    strings = ["yes", "no", "yesno", "noyes"]
    pattern = r"^yes$"
    codeflash_output = regex_match(strings, pattern)


def test_edge_case_sensitive():
    # Test case sensitivity
    strings = ["Hello", "hello", "HELLO"]
    pattern = r"hello"
    # Only 'hello' should match
    codeflash_output = regex_match(strings, pattern)


def test_edge_case_insensitive_flag():
    # Test case insensitivity using inline flag
    strings = ["Hello", "hello", "HELLO"]
    pattern = r"(?i)hello"
    # All should match due to case-insensitive flag
    codeflash_output = regex_match(strings, pattern)


def test_edge_greedy_quantifier():
    # Test greedy quantifiers
    strings = ["aaa", "aa", "a", ""]
    pattern = r"a+"
    # All except the empty string should match
    codeflash_output = regex_match(strings, pattern)


def test_edge_lazy_quantifier():
    # Test lazy quantifiers (should still match at least one 'a')
    strings = ["aaa", "aa", "a", ""]
    pattern = r"a+?"
    # All except the empty string should match
    codeflash_output = regex_match(strings, pattern)


def test_edge_non_ascii_characters():
    # Test Unicode/non-ASCII characters
    strings = ["café", "cafe", "CAFÉ"]
    pattern = r"caf."
    # 'café' and 'cafe' should match ('.' matches é or e)
    codeflash_output = regex_match(strings, pattern)


def test_edge_multiline_pattern():
    # Test multiline pattern (should not match newlines unless specified)
    strings = ["abc\ndef", "abc", "def"]
    pattern = r"^abc$"
    # Only 'abc' should match, not 'abc\ndef'
    codeflash_output = regex_match(strings, pattern)


def test_edge_lookahead():
    # Test lookahead
    strings = ["foo1", "foo2", "foo", "foobar"]
    pattern = r"foo(?=\d)"
    # Should match 'foo1' and 'foo2' (foo followed by a digit)
    codeflash_output = regex_match(strings, pattern)


def test_edge_lookbehind():
    # Test lookbehind
    strings = ["1bar", "2bar", "bar", "foobar"]
    pattern = r"(?<=\d)bar"
    # With re.match nothing should match: the lookbehind cannot succeed at position 0
    codeflash_output = regex_match(strings, pattern)


def test_edge_invalid_pattern():
    # Test invalid regex pattern should raise re.error
    strings = ["a", "b"]
    pattern = r"([a-z"  # missing closing bracket
    with pytest.raises(re.error):
        regex_match(strings, pattern)


def test_edge_pattern_longer_than_string():
    # Pattern longer than any string
    strings = ["a", "ab", "abc"]
    pattern = r"abcdef"
    codeflash_output = regex_match(strings, pattern)


def test_edge_match_entire_string():
    # Pattern matches the entire string only
    strings = ["abc", "abcd", "abcde"]
    pattern = r"^abc$"
    codeflash_output = regex_match(strings, pattern)


# --- Large Scale Test Cases ---


def test_large_all_match():
    # All strings match the pattern
    strings = ["test"] * 1000
    pattern = r"test"
    codeflash_output = regex_match(strings, pattern)


def test_large_none_match():
    # No strings match the pattern
    strings = ["foo"] * 1000
    pattern = r"bar"
    codeflash_output = regex_match(strings, pattern)


def test_large_some_match():
    # Half the strings match the pattern
    strings = ["match" if i % 2 == 0 else "nope" for i in range(1000)]
    pattern = r"match"
    expected = ["match"] * 500
    codeflash_output = regex_match(strings, pattern)


def test_large_varied_patterns():
    # Varied strings and a pattern that matches numbers at the start
    strings = [f"{i}foo" for i in range(1000)] + ["foo", "bar"]
    pattern = r"^\d+foo"
    expected = [f"{i}foo" for i in range(1000)]
    codeflash_output = regex_match(strings, pattern)


def test_large_long_strings():
    # Very long strings, only some match
    strings = ["a" * 500 + "b" for _ in range(500)] + ["b" * 501 for _ in range(500)]
    pattern = r"^a+b$"
    # Only the first 500 strings should match
    codeflash_output = regex_match(strings, pattern)


def test_large_pattern_performance():
    # Test with a pattern that could be catastrophic for backtracking if implemented incorrectly
    strings = ["a" * 50 + "b" for _ in range(100)] + ["a" * 51 for _ in range(100)]
    pattern = r"a{50,}b"
    # Only the first 100 strings should match
    codeflash_output = regex_match(strings, pattern)


# --- Additional Robustness Tests ---


def test_pattern_matches_at_start_only():
    # re.match only matches at the start, not in the middle
    strings = ["foo", "barfoo", "foobar", "bar"]
    pattern = r"foo"
    # Only 'foo' and 'foobar' should match
    codeflash_output = regex_match(strings, pattern)


def test_pattern_matches_middle_should_not_match():
    # Should not match if pattern is not at the start
    strings = ["abc", "xabc", "abcx", "xabcx"]
    pattern = r"abc"
    # Only 'abc' and 'abcx' should match (starts with 'abc')
    codeflash_output = regex_match(strings, pattern)


def test_pattern_with_escape_sequences():
    # Test pattern with escape sequences
    strings = ["tab\tchar", "tab char", "tab\t"]
    pattern = r"tab\t"
    # Should match 'tab\tchar' and 'tab\t'
    codeflash_output = regex_match(strings, pattern)


def test_pattern_with_grouping_and_repetition():
    # Pattern with groups and repetition
    strings = ["ababab", "ab", "abab", "aabb"]
    pattern = r"(ab)+"
    # All except 'aabb' should match
    codeflash_output = regex_match(strings, pattern)


def test_pattern_with_optional():
    # Pattern with optional '?'
    strings = ["color", "colour", "colr"]
    pattern = r"colou?r"
    # 'color' and 'colour' should match
    codeflash_output = regex_match(strings, pattern)


def test_pattern_with_unicode_flag():
    # No explicit flag is needed: str patterns are Unicode-aware by default ('naïve' and 'naïveté' should match)
    strings = ["naïve", "naive", "naïveté"]
    pattern = r"naïve"
    codeflash_output = regex_match(strings, pattern)


def test_pattern_with_word_boundary():
    # Pattern with \b word boundary
    strings = ["foo", "foobar", "barfoo", "bar foo"]
    pattern = r"\bfoo\b"
    # Only 'foo' should match
    codeflash_output = regex_match(strings, pattern)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import random
import re
import string

# imports
import pytest  # used for our unit tests
from src.algorithms.string import regex_match

# unit tests

# ------------------- Basic Test Cases -------------------


def test_basic_exact_match():
    # Should match strings that are exactly 'abc'
    strings = ["abc", "abcd", "ab", "abc"]
    pattern = r"^abc$"
    expected = ["abc", "abc"]
    codeflash_output = regex_match(strings, pattern)


def test_basic_startswith():
    # Should match strings starting with 'foo'
    strings = ["foobar", "foo", "barfoo", "bar", "foobaz"]
    pattern = r"^foo"
    expected = ["foobar", "foo", "foobaz"]
    codeflash_output = regex_match(strings, pattern)


def test_basic_digit():
    # Should match strings starting with a digit
    strings = ["1abc", "abc", "2def", "3", "a1", ""]
    pattern = r"^\d"
    expected = ["1abc", "2def", "3"]
    codeflash_output = regex_match(strings, pattern)


def test_basic_dot_any_character():
    # Should match any string starting with any character
    strings = ["abc", "def", "", "1", " "]
    pattern = r"^."
    expected = ["abc", "def", "1", " "]
    codeflash_output = regex_match(strings, pattern)


def test_basic_empty_pattern():
    # Empty pattern should match every string (matches at position 0)
    strings = ["abc", "", "def"]
    pattern = r""
    expected = ["abc", "", "def"]
    codeflash_output = regex_match(strings, pattern)


# ------------------- Edge Test Cases -------------------


def test_edge_empty_strings_list():
    # Empty list should always return empty list
    strings = []
    pattern = r".*"
    expected = []
    codeflash_output = regex_match(strings, pattern)


def test_edge_pattern_never_matches():
    # Pattern that cannot match any string
    strings = ["a", "b", "c"]
    pattern = r"^xyz"
    expected = []
    codeflash_output = regex_match(strings, pattern)


def test_edge_all_strings_empty():
    # All input strings are empty
    strings = ["", "", ""]
    pattern = r"^$"
    expected = ["", "", ""]
    codeflash_output = regex_match(strings, pattern)


def test_edge_special_characters():
    # Pattern with special regex characters
    strings = ["a.c", "abc", "a-c", "a+c", "a*c"]
    pattern = r"^a\.c$"
    expected = ["a.c"]
    codeflash_output = regex_match(strings, pattern)


def test_edge_unicode_strings():
    # Unicode and non-ASCII characters
    strings = ["café", "cafe", "CAFÉ", "café123"]
    pattern = r"^café$"
    expected = ["café"]
    codeflash_output = regex_match(strings, pattern)


def test_edge_multiline_strings():
    # Multiline string, pattern should only match start of string, not start of lines
    strings = ["abc\ndef", "abc", "def\nabc"]
    pattern = r"^abc"
    expected = ["abc\ndef", "abc"]
    codeflash_output = regex_match(strings, pattern)


def test_edge_anchors():
    # Patterns with start and end anchors; '$' also matches just before a trailing newline
    strings = ["test", "test1", "1test", "test\n", "test"]
    pattern = r"^test$"
    expected = ["test", "test\n", "test"]
    codeflash_output = regex_match(strings, pattern)


def test_edge_case_sensitivity():
    # Regex is case sensitive by default
    strings = ["Test", "test", "TEST"]
    pattern = r"^test$"
    expected = ["test"]
    codeflash_output = regex_match(strings, pattern)


def test_edge_match_vs_search():
    # re.match anchors only at the start, so 'abcx' matches but 'xabc' and 'xabcx' do not
    strings = ["abc", "xabc", "abcx", "xabcx"]
    pattern = r"abc"
    expected = ["abc", "abcx"]
    codeflash_output = regex_match(strings, pattern)


def test_edge_empty_pattern_with_empty_string():
    # Empty pattern matches empty string
    strings = [""]
    pattern = r""
    expected = [""]
    codeflash_output = regex_match(strings, pattern)


def test_edge_empty_pattern_with_nonempty_string():
    # Empty pattern matches all strings
    strings = ["a", "b", "c"]
    pattern = r""
    expected = ["a", "b", "c"]
    codeflash_output = regex_match(strings, pattern)


def test_edge_pattern_with_group():
    # Pattern with a group
    strings = ["foo", "foobar", "barfoo", "foofoo"]
    pattern = r"^(foo)+$"
    expected = ["foo", "foofoo"]
    codeflash_output = regex_match(strings, pattern)


def test_edge_pattern_with_optional():
    # Pattern with optional group
    strings = ["color", "colour", "colr"]
    pattern = r"^colou?r$"
    expected = ["color", "colour"]
    codeflash_output = regex_match(strings, pattern)


def test_edge_pattern_with_alternation():
    # Pattern with alternation
    strings = ["cat", "dog", "bat", "rat"]
    pattern = r"^(cat|dog)$"
    expected = ["cat", "dog"]
    codeflash_output = regex_match(strings, pattern)


def test_edge_pattern_with_escaped_backslash():
    # Pattern with escaped backslash; r"\n" and "\\n" are the same two-character string, so both match
    strings = [r"\n", "\\n", "n", r"\\n"]
    pattern = r"^\\n$"
    expected = [r"\n", "\\n"]
    codeflash_output = regex_match(strings, pattern)


# ------------------- Large Scale Test Cases -------------------


def test_large_scale_all_match():
    # All strings match the pattern
    strings = ["abc"] * 1000
    pattern = r"^abc$"
    expected = ["abc"] * 1000
    codeflash_output = regex_match(strings, pattern)


def test_large_scale_none_match():
    # No strings match the pattern
    strings = ["def"] * 1000
    pattern = r"^abc$"
    expected = []
    codeflash_output = regex_match(strings, pattern)


def test_large_scale_some_match():
    # Half the strings match the pattern
    strings = ["abc" if i % 2 == 0 else "def" for i in range(1000)]
    pattern = r"^abc$"
    expected = ["abc"] * 500
    codeflash_output = regex_match(strings, pattern)


def test_large_scale_long_strings():
    # Long strings, only some match
    base = "a" * 500
    strings = [base, base + "b", "b" + base, base]
    pattern = r"^a{500}$"
    expected = [base, base]
    codeflash_output = regex_match(strings, pattern)


def test_large_scale_randomized():
    # Random strings, only those starting with 'test' match
    random.seed(42)
    strings = [
        "test" + "".join(random.choices(string.ascii_lowercase, k=10))
        for _ in range(500)
    ]
    strings += [
        "foo" + "".join(random.choices(string.ascii_lowercase, k=10))
        for _ in range(500)
    ]
    pattern = r"^test"
    expected = [s for s in strings if s.startswith("test")]
    codeflash_output = regex_match(strings, pattern)


def test_large_scale_unicode():
    # Large list with unicode, only those starting with 'ü'
    strings = [
        "ü" + "".join(random.choices(string.ascii_lowercase, k=5)) for _ in range(500)
    ]
    strings += [
        "a" + "".join(random.choices(string.ascii_lowercase, k=5)) for _ in range(500)
    ]
    pattern = r"^ü"
    expected = [s for s in strings if s.startswith("ü")]
    codeflash_output = regex_match(strings, pattern)


def test_large_scale_empty_strings():
    # Large number of empty strings, pattern is '^$'
    strings = [""] * 1000
    pattern = r"^$"
    expected = [""] * 1000
    codeflash_output = regex_match(strings, pattern)


def test_large_scale_pattern_with_dot_star():
    # Pattern '.*' matches everything including empty string
    strings = ["abc", "", "def", "ghi"] * 250
    pattern = r".*"
    expected = strings.copy()
    codeflash_output = regex_match(strings, pattern)


# ------------------- Mutation Testing Guards -------------------


def test_mutation_guard_must_use_match_not_search():
    # If regex_match uses re.search instead of re.match, this test will fail
    strings = ["abc", "xabc", "abcx"]
    pattern = r"abc"
    # 'abc' and 'abcx' should match (both start with 'abc'), but not 'xabc'
    expected = ["abc", "abcx"]
    codeflash_output = regex_match(strings, pattern)


def test_mutation_guard_must_not_return_input():
    # If regex_match returns the input list instead of filtering, this will fail
    strings = ["abc", "def", "ghi"]
    pattern = r"^abc$"
    expected = ["abc"]
    codeflash_output = regex_match(strings, pattern)


def test_mutation_guard_must_not_return_empty():
    # If regex_match always returns empty, this will fail
    strings = ["abc"]
    pattern = r"^abc$"
    expected = ["abc"]
    codeflash_output = regex_match(strings, pattern)


def test_mutation_guard_must_not_return_all():
    # If regex_match always returns all, this will fail
    strings = ["abc", "def"]
    pattern = r"^abc$"
    expected = ["abc"]
    codeflash_output = regex_match(strings, pattern)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
🔎 Concolic Coverage Tests and Runtime
from src.algorithms.string import regex_match


def test_regex_match():
    regex_match(["", "\x00"], "\x00")

To edit these changes git checkout codeflash/optimize-regex_match-mivnsmwn and push.

@codeflash-ai codeflash-ai bot requested a review from KRRT7 December 7, 2025 11:48
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Dec 7, 2025
@KRRT7 KRRT7 closed this Dec 7, 2025
@codeflash-ai codeflash-ai bot deleted the codeflash/optimize-regex_match-mivnsmwn branch December 7, 2025 11:49