@codeflash-ai codeflash-ai bot commented Dec 17, 2025

📄 22% (0.22x) speedup for _highlight_between in pandas/io/formats/style.py

⏱️ Runtime: 53.9 milliseconds → 44.3 milliseconds (best of 10 runs)
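
As a quick check on the headline figure, the sketch below recomputes the speedup from the two runtimes. It assumes the speedup is defined as original time divided by optimized time, minus one; that definition is an assumption here, not something the report states.

```python
# Runtimes reported above, in milliseconds.
original_ms = 53.9
optimized_ms = 44.3

# Assumed definition: how much faster the optimized run is, relative to itself.
speedup = original_ms / optimized_ms - 1

print(f"{speedup:.1%} faster ({speedup:.2f}x)")  # 21.7% faster (0.22x)
```

Under that definition the result is about 21.7%, which rounds to the 22% (0.22x) shown in the title.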

📝 Explanation and details

The optimized code achieves a 22% speedup through several performance improvements in the pandas styling functions:

**Key Optimizations:**

1. **Smart Type-Specific Handling in `_validate_apply_axis_arg`**:
   - Added an explicit fast path for `np.ndarray` inputs to avoid redundant `np.asarray()` calls
   - Uses `astype(copy=False)` for dtype conversion when possible, avoiding unnecessary memory copies
   - Changed `dtype = {"dtype": dtype} if dtype else {}` to `dtype_kw = {"dtype": dtype} if dtype is not None else {}`, so the check is an explicit `None` test rather than a truthiness test

2. **Reduced Memory Allocations in `_highlight_between`**:
   - Validates and converts the bounds (`left_array`/`right_array`) only once, avoiding repeated validation calls
   - Converts pandas DataFrame/Series masks to NumPy arrays with `to_numpy(dtype=bool, copy=False)` so that all boolean operations run on plain NumPy arrays
   - Performs the final `np.where()` operation once, on pre-computed NumPy boolean masks instead of mixed pandas/NumPy types

3. **Minimized Branching Overhead**:
   - Streamlined the conditional logic to reduce the number of `isinstance` checks and branches
   - Early returns in `_validate_apply_axis_arg` skip unnecessary processing for common cases
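
The first group of changes can be illustrated with a minimal sketch. `to_ndarray` below is a hypothetical helper written for this note, not the code in `pandas/io/formats/style.py`; it only shows the general pattern of an `np.ndarray` fast path, `astype(copy=False)`, and an explicit `None` check on `dtype`.

```python
import numpy as np


def to_ndarray(arg, dtype=None):
    """Hypothetical helper for illustration; not the pandas implementation."""
    if isinstance(arg, np.ndarray):
        # Fast path: the input is already an ndarray, so np.asarray() is skipped.
        if dtype is not None:
            # copy=False hands back the input itself when the dtype already matches.
            return arg.astype(dtype, copy=False)
        return arg
    # Explicit None check: the keyword is passed only when a dtype was requested.
    dtype_kw = {"dtype": dtype} if dtype is not None else {}
    return np.asarray(arg, **dtype_kw)


a = np.array([1.0, 2.0])
same = to_ndarray(a)                   # no conversion needed
same_dtype = to_ndarray(a, float)      # dtype already float64, no copy
converted = to_ndarray(a, int)         # a real dtype change allocates a new array
```

In this sketch the first two calls return the input object itself; only a genuine dtype change allocates a new array.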

**Performance Impact**: In the generated tests below, the optimizations are most effective for:

- **Scalar bounds** (23-30% faster): benefits from reduced validation overhead
- **A single `None` bound** (42-56% faster): avoids unnecessary array conversions entirely. When both bounds are `None`, the call already takes only about 20-30μs and gains less than 10%
- **Large DataFrames** (15-22% faster): reduced memory allocations and more efficient NumPy operations scale well

The improvements target the most common styling operations. The 71 generated regression tests indicate that outputs and raised errors are unchanged; the error paths themselves measure 2-8% slower, a difference of under a microsecond per call.
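
For readers unfamiliar with the function, the behaviour that has to be preserved can be sketched in plain NumPy. `highlight_between_sketch` is a hypothetical stand-in written for illustration, based only on the semantics the generated tests exercise; it is not the pandas implementation and it omits the shape validation done by `_validate_apply_axis_arg`.

```python
import operator

import numpy as np

# (lower-bound comparison, upper-bound comparison) for each `inclusive` value.
_OPS = {
    "both": (operator.ge, operator.le),
    "neither": (operator.gt, operator.lt),
    "left": (operator.ge, operator.lt),
    "right": (operator.gt, operator.le),
}


def highlight_between_sketch(values, props, left=None, right=None, inclusive="both"):
    """Hypothetical stand-in for illustration; not the pandas implementation."""
    if inclusive not in _OPS:
        raise ValueError(f"invalid inclusive value: {inclusive!r}")
    lower_op, upper_op = _OPS[inclusive]
    values = np.asarray(values)
    # Start with everything selected, then narrow by each bound that is given.
    mask = np.ones(values.shape, dtype=bool)
    if left is not None:
        mask &= lower_op(values, left)
    if right is not None:
        mask &= upper_op(values, right)
    # A single np.where call on the final boolean mask.
    return np.where(mask, props, "")


print(highlight_between_sketch([[1, 2], [3, 4]], "H", left=2, right=3))
```

Comparisons against NaN evaluate to False, so NaN cells are left unstyled, which matches the expectations in the NaN tests below.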

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 71 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import numpy as np
import pandas as pd

# imports
import pytest
from pandas.io.formats.style import _highlight_between

# unit tests

# ----------- BASIC TEST CASES -----------


def test_basic_scalar_bounds_inclusive_both():
    # All values between 1 and 5, inclusive
    df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
    codeflash_output = _highlight_between(df, "x", left=2, right=5, inclusive="both")
    out = codeflash_output  # 873μs -> 707μs (23.4% faster)
    expected = np.array([["", "x", "x"], ["x", "x", ""]])


def test_basic_scalar_bounds_inclusive_neither():
    # Strict between
    df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
    codeflash_output = _highlight_between(df, "y", left=2, right=5, inclusive="neither")
    out = codeflash_output  # 866μs -> 696μs (24.4% faster)
    expected = np.array([["", "", "y"], ["y", "", ""]])


def test_basic_scalar_bounds_inclusive_left():
    # Left inclusive, right exclusive
    df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
    codeflash_output = _highlight_between(df, "z", left=2, right=5, inclusive="left")
    out = codeflash_output  # 859μs -> 685μs (25.5% faster)
    expected = np.array([["", "z", "z"], ["z", "", ""]])


def test_basic_scalar_bounds_inclusive_right():
    # Left exclusive, right inclusive
    df = pd.DataFrame([[1, 2, 3], [4, 5, 5]])
    codeflash_output = _highlight_between(df, "w", left=2, right=5, inclusive="right")
    out = codeflash_output  # 861μs -> 696μs (23.6% faster)
    expected = np.array([["", "", "w"], ["w", "w", "w"]])


def test_basic_sequence_bounds():
    # left and right as arrays
    df = pd.DataFrame([[1, 2], [3, 4]])
    left = [[0, 1], [2, 3]]
    right = [[2, 3], [4, 5]]
    codeflash_output = _highlight_between(
        df, "a", left=left, right=right, inclusive="both"
    )
    out = codeflash_output  # 947μs -> 802μs (18.0% faster)
    expected = np.array([["a", "a"], ["a", "a"]])


def test_basic_dataframe_bounds():
    # left and right as DataFrames
    df = pd.DataFrame([[1, 2], [3, 4]])
    left = pd.DataFrame([[0, 2], [3, 1]], index=df.index, columns=df.columns)
    right = pd.DataFrame([[1, 3], [5, 4]], index=df.index, columns=df.columns)
    codeflash_output = _highlight_between(
        df, "b", left=left, right=right, inclusive="both"
    )
    out = codeflash_output  # 984μs -> 841μs (16.9% faster)
    expected = np.array([["b", "b"], ["b", "b"]])


def test_basic_props_empty_string():
    # props is empty string, should fill with empty string where condition met
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "", left=2, right=3, inclusive="both")
    out = codeflash_output  # 874μs -> 685μs (27.7% faster)
    expected = np.array([["", ""], ["", ""]])


def test_basic_all_highlighted():
    # All values highlighted
    df = pd.DataFrame([[1, 1], [1, 1]])
    codeflash_output = _highlight_between(df, "c", left=1, right=1, inclusive="both")
    out = codeflash_output  # 880μs -> 684μs (28.6% faster)
    expected = np.full((2, 2), "c")


def test_basic_none_highlighted():
    # No values highlighted
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "d", left=10, right=20, inclusive="both")
    out = codeflash_output  # 864μs -> 664μs (30.1% faster)


def test_basic_left_none():
    # left is None, only right bound
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "e", left=None, right=2, inclusive="both")
    out = codeflash_output  # 606μs -> 388μs (56.2% faster)
    expected = np.array([["e", "e"], ["", ""]])


def test_basic_right_none():
    # right is None, only left bound
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "f", left=2, right=None, inclusive="both")
    out = codeflash_output  # 579μs -> 407μs (42.0% faster)
    expected = np.array([["", "f"], ["f", "f"]])


def test_basic_both_none():
    # Both bounds None, everything highlighted
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(
        df, "g", left=None, right=None, inclusive="both"
    )
    out = codeflash_output  # 23.3μs -> 21.3μs (9.58% faster)


def test_edge_empty_dataframe():
    # Empty DataFrame
    df = pd.DataFrame([])
    codeflash_output = _highlight_between(
        df, "x", left=None, right=None, inclusive="both"
    )
    out = codeflash_output  # 31.1μs -> 29.4μs (5.89% faster)


def test_edge_nan_values():
    # DataFrame with NaN
    df = pd.DataFrame([[1, np.nan], [np.nan, 4]])
    codeflash_output = _highlight_between(df, "i", left=1, right=4, inclusive="both")
    out = codeflash_output  # 905μs -> 736μs (23.0% faster)
    expected = np.array([["i", ""], ["", "i"]])


def test_edge_inf_values():
    # DataFrame with inf/-inf
    df = pd.DataFrame([[np.inf, -np.inf], [1, 2]])
    codeflash_output = _highlight_between(df, "j", left=0, right=3, inclusive="both")
    out = codeflash_output  # 864μs -> 663μs (30.4% faster)
    expected = np.array([["", ""], ["j", "j"]])


def test_edge_left_greater_than_right():
    # left > right, nothing should be highlighted
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "k", left=5, right=2, inclusive="both")
    out = codeflash_output  # 873μs -> 663μs (31.7% faster)


def test_edge_left_right_same():
    # left == right, only those values highlighted
    df = pd.DataFrame([[1, 2], [2, 1]])
    codeflash_output = _highlight_between(df, "l", left=2, right=2, inclusive="both")
    out = codeflash_output  # 875μs -> 691μs (26.5% faster)
    expected = np.array([["", "l"], ["l", ""]])


def test_edge_non_numeric_data():
    # DataFrame with strings, should work if bounds are strings
    df = pd.DataFrame([["a", "b"], ["c", "d"]])
    codeflash_output = _highlight_between(
        df, "m", left="b", right="d", inclusive="both"
    )
    out = codeflash_output  # 853μs -> 660μs (29.2% faster)
    expected = np.array([["", "m"], ["m", "m"]])


def test_edge_invalid_inclusive_value():
    # Invalid inclusive value
    df = pd.DataFrame([[1, 2], [3, 4]])
    with pytest.raises(ValueError):
        _highlight_between(
            df, "n", left=1, right=4, inclusive="invalid"
        )  # 3.95μs -> 4.04μs (2.18% slower)


def test_edge_shape_mismatch_left():
    # left shape mismatch
    df = pd.DataFrame([[1, 2], [3, 4]])
    left = [1, 2, 3]
    with pytest.raises(ValueError):
        _highlight_between(
            df, "o", left=left, right=4, inclusive="both"
        )  # 10.5μs -> 10.7μs (2.16% slower)


def test_edge_shape_mismatch_right():
    # right shape mismatch
    df = pd.DataFrame([[1, 2], [3, 4]])
    right = [1, 2, 3]
    with pytest.raises(ValueError):
        _highlight_between(
            df, "p", left=1, right=right, inclusive="both"
        )  # 10.7μs -> 11.0μs (2.98% slower)


def test_edge_props_special_characters():
    # props is a special string
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "<b>", left=2, right=3, inclusive="both")
    out = codeflash_output  # 884μs -> 748μs (18.1% faster)
    expected = np.array([["", "<b>"], ["<b>", ""]])


def test_edge_data_is_series():
    # data is Series
    s = pd.Series([1, 2, 3])
    codeflash_output = _highlight_between(s, "s", left=2, right=3, inclusive="both")
    out = codeflash_output  # 609μs -> 563μs (8.15% faster)
    expected = np.array(["", "s", "s"])


def test_edge_left_is_dataframe_right_is_scalar():
    # left is DataFrame, right is scalar
    df = pd.DataFrame([[1, 2], [3, 4]])
    left = pd.DataFrame([[1, 2], [3, 4]], index=df.index, columns=df.columns)
    codeflash_output = _highlight_between(df, "t", left=left, right=4, inclusive="both")
    out = codeflash_output  # 967μs -> 820μs (17.9% faster)
    expected = np.array([["t", "t"], ["t", "t"]])


def test_edge_left_is_scalar_right_is_dataframe():
    # left is scalar, right is DataFrame
    df = pd.DataFrame([[1, 2], [3, 4]])
    right = pd.DataFrame([[2, 3], [4, 5]], index=df.index, columns=df.columns)
    codeflash_output = _highlight_between(
        df, "u", left=1, right=right, inclusive="both"
    )
    out = codeflash_output  # 951μs -> 795μs (19.7% faster)
    expected = np.array([["u", "u"], ["u", "u"]])


def test_edge_string_bounds_with_nan():
    # DataFrame with strings and NaN
    df = pd.DataFrame([["a", np.nan], ["c", "d"]])
    codeflash_output = _highlight_between(
        df, "v", left="b", right="d", inclusive="both"
    )
    out = codeflash_output  # 831μs -> 695μs (19.5% faster)
    expected = np.array([["", ""], ["v", "v"]])


# ----------- LARGE SCALE TEST CASES -----------


def test_large_scale_100x100_all_highlighted():
    # 100x100 DataFrame, all highlighted
    df = pd.DataFrame(np.full((100, 100), 5))
    codeflash_output = _highlight_between(df, "X", left=5, right=5, inclusive="both")
    out = codeflash_output  # 1.01ms -> 862μs (16.9% faster)


def test_large_scale_100x100_none_highlighted():
    # 100x100 DataFrame, none highlighted
    df = pd.DataFrame(np.full((100, 100), 0))
    codeflash_output = _highlight_between(df, "Y", left=1, right=2, inclusive="both")
    out = codeflash_output  # 1.00ms -> 851μs (17.7% faster)


def test_large_scale_100x100_diagonal_highlighted():
    # 100x100, only diagonal highlighted
    arr = np.zeros((100, 100))
    np.fill_diagonal(arr, 1)
    df = pd.DataFrame(arr)
    codeflash_output = _highlight_between(df, "Z", left=1, right=1, inclusive="both")
    out = codeflash_output  # 1.01ms -> 860μs (17.8% faster)
    expected = np.full((100, 100), "")
    for i in range(100):
        expected[i, i] = "Z"


def test_large_scale_1000x1_random_data():
    # 1000x1, random data
    np.random.seed(0)
    df = pd.DataFrame(np.random.randint(0, 100, size=(1000, 1)))
    codeflash_output = _highlight_between(df, "B", left=10, right=20, inclusive="both")
    out = codeflash_output  # 897μs -> 730μs (22.9% faster)
    # Check that all highlighted values are in 10..20
    highlighted = df.values[(out == "B").flatten()]


def test_large_scale_500x2_nan_inf():
    # 500x2, with NaN and inf
    arr = np.full((500, 2), np.nan)
    arr[0, 0] = np.inf
    arr[-1, -1] = -np.inf
    arr[250, 0] = 5
    arr[250, 1] = 10
    df = pd.DataFrame(arr)
    codeflash_output = _highlight_between(df, "D", left=5, right=10, inclusive="both")
    out = codeflash_output  # 973μs -> 842μs (15.6% faster)
    # All others should be ""
    arr[250, 0] = arr[250, 1] = np.nan  # Remove highlights and check again
    codeflash_output = _highlight_between(df, "D", left=5, right=10, inclusive="both")
    out2 = codeflash_output  # 725μs -> 551μs (31.7% faster)


def test_large_scale_100x100_string_data():
    # 100x100, string data
    df = pd.DataFrame([["str" + str(i % 5) for i in range(100)] for _ in range(100)])
    codeflash_output = _highlight_between(
        df, "E", left="str1", right="str3", inclusive="both"
    )
    out = codeflash_output  # 1.85ms -> 1.69ms (9.48% faster)
    # Only columns where value is "str1", "str2", "str3" are highlighted
    for i in range(100):
        for j in range(100):
            if df.iloc[i, j] in ("str1", "str2", "str3"):
                pass
            else:
                pass


def test_large_scale_100x100_props_long_string():
    # 100x100, props is a long string
    df = pd.DataFrame(np.full((100, 100), 42))
    long_str = "long" * 25
    codeflash_output = _highlight_between(
        df, long_str, left=42, right=42, inclusive="both"
    )
    out = codeflash_output  # 1.76ms -> 1.62ms (8.93% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import numpy as np
import pandas as pd

# imports
import pytest
from pandas.io.formats.style import _highlight_between

# --------- UNIT TESTS ----------

# 1. Basic Test Cases


def test_basic_scalar_bounds_inclusive_both():
    # Test with simple DataFrame and scalar bounds, inclusive='both'
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "H", left=2, right=3, inclusive="both")
    result = codeflash_output  # 893μs -> 725μs (23.2% faster)
    expected = np.array([["", "H"], ["H", ""]])


def test_basic_scalar_bounds_inclusive_neither():
    # Test with simple DataFrame and scalar bounds, inclusive='neither'
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "X", left=2, right=3, inclusive="neither")
    result = codeflash_output  # 888μs -> 699μs (27.0% faster)
    expected = np.array([["", ""], ["", ""]])  # No value strictly between 2 and 3


def test_basic_scalar_bounds_inclusive_left():
    # Test with inclusive='left'
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "L", left=2, right=3, inclusive="left")
    result = codeflash_output  # 892μs -> 712μs (25.2% faster)
    expected = np.array([["", "L"], ["", ""]])


def test_basic_scalar_bounds_inclusive_right():
    # Test with inclusive='right'
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "R", left=2, right=3, inclusive="right")
    result = codeflash_output  # 880μs -> 712μs (23.6% faster)
    expected = np.array([["", ""], ["R", ""]])


def test_basic_sequence_bounds():
    # Test with sequence bounds
    df = pd.DataFrame([[1, 2], [3, 4]])
    left = [[0, 2], [2, 3]]
    right = [[2, 3], [4, 5]]
    codeflash_output = _highlight_between(
        df, "S", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 946μs -> 796μs (18.8% faster)
    expected = np.array([["S", "S"], ["S", "S"]])


def test_basic_ndarray_bounds():
    # Test with ndarray bounds
    df = pd.DataFrame([[1, 2], [3, 4]])
    left = np.array([[0, 2], [2, 3]])
    right = np.array([[2, 3], [4, 5]])
    codeflash_output = _highlight_between(
        df, "N", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 952μs -> 786μs (21.2% faster)
    expected = np.array([["N", "N"], ["N", "N"]])


def test_basic_dataframe_bounds():
    # Test with DataFrame bounds
    df = pd.DataFrame([[1, 2], [3, 4]])
    left = pd.DataFrame([[0, 2], [2, 3]], index=df.index, columns=df.columns)
    right = pd.DataFrame([[2, 3], [4, 5]], index=df.index, columns=df.columns)
    codeflash_output = _highlight_between(
        df, "D", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 1.01ms -> 861μs (16.8% faster)
    expected = np.array([["D", "D"], ["D", "D"]])


def test_basic_none_left_right():
    # Test with left=None, right=None (should always highlight)
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(
        df, "A", left=None, right=None, inclusive="both"
    )
    result = codeflash_output  # 24.5μs -> 22.7μs (7.99% faster)
    expected = np.full(df.shape, "A")


def test_basic_none_left_only():
    # Test with left=None, right=3
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "B", left=None, right=3, inclusive="both")
    result = codeflash_output  # 632μs -> 432μs (46.2% faster)
    expected = np.array([["B", "B"], ["B", ""]])


def test_basic_none_right_only():
    # Test with left=2, right=None
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "C", left=2, right=None, inclusive="both")
    result = codeflash_output  # 584μs -> 408μs (42.9% faster)
    expected = np.array([["", "C"], ["C", "C"]])


def test_basic_with_nan_values():
    # Test with NaN values
    df = pd.DataFrame([[1, np.nan], [3, 4]])
    codeflash_output = _highlight_between(df, "Z", left=1, right=4, inclusive="both")
    result = codeflash_output  # 1.10ms -> 971μs (13.2% faster)
    expected = np.array([["Z", ""], ["Z", "Z"]])


def test_basic_empty_dataframe():
    # Test with empty DataFrame
    df = pd.DataFrame([])
    codeflash_output = _highlight_between(df, "E", left=0, right=1, inclusive="both")
    result = codeflash_output  # 404μs -> 331μs (22.1% faster)
    expected = np.empty(df.shape, dtype=str)


def test_basic_with_string_props():
    # Test with props as string
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(
        df, "style", left=2, right=3, inclusive="both"
    )
    result = codeflash_output  # 877μs -> 698μs (25.6% faster)
    expected = np.array([["", "style"], ["style", ""]])


# 2. Edge Test Cases


def test_edge_invalid_inclusive_value():
    # Test with invalid inclusive value
    df = pd.DataFrame([[1, 2], [3, 4]])
    with pytest.raises(ValueError):
        _highlight_between(
            df, "X", left=2, right=3, inclusive="invalid"
        )  # 3.96μs -> 4.15μs (4.70% slower)


def test_edge_left_wrong_shape():
    # Test with left shape mismatch
    df = pd.DataFrame([[1, 2], [3, 4]])
    left = [1, 2, 3]  # shape mismatch
    with pytest.raises(ValueError):
        _highlight_between(
            df, "Y", left=left, right=4, inclusive="both"
        )  # 10.8μs -> 11.4μs (5.32% slower)


def test_edge_right_wrong_shape():
    # Test with right shape mismatch
    df = pd.DataFrame([[1, 2], [3, 4]])
    right = [1, 2, 3]  # shape mismatch
    with pytest.raises(ValueError):
        _highlight_between(
            df, "Y", left=1, right=right, inclusive="both"
        )  # 10.8μs -> 11.7μs (7.96% slower)


def test_edge_left_series_with_dataframe():
    # Test with left as Series and data as DataFrame (should raise)
    df = pd.DataFrame([[1, 2], [3, 4]])
    left = pd.Series([1, 2])
    with pytest.raises(ValueError):
        _highlight_between(
            df, "Q", left=left, right=4, inclusive="both"
        )  # 9.61μs -> 9.84μs (2.34% slower)


def test_edge_right_dataframe_with_series():
    # Test with right as DataFrame and data as Series (should raise)
    s = pd.Series([1, 2])
    right = pd.DataFrame([[1, 2], [3, 4]])
    with pytest.raises(ValueError):
        _highlight_between(
            s, "Q", left=0, right=right, inclusive="both"
        )  # 7.95μs -> 8.54μs (6.94% slower)


def test_edge_all_nan_bounds():
    # Test with all bounds as NaN
    df = pd.DataFrame([[1, 2], [3, 4]])
    left = np.full(df.shape, np.nan)
    right = np.full(df.shape, np.nan)
    codeflash_output = _highlight_between(
        df, "NAN", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 972μs -> 818μs (18.8% faster)
    expected = np.full(df.shape, "")


def test_edge_props_empty_string():
    # Test with empty string as props
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "", left=2, right=3, inclusive="both")
    result = codeflash_output  # 852μs -> 687μs (24.1% faster)
    expected = np.array([["", ""], ["", ""]])


def test_edge_props_special_characters():
    # Test with props as special characters
    df = pd.DataFrame([[2, 3], [4, 5]])
    codeflash_output = _highlight_between(df, "@!#", left=3, right=4, inclusive="both")
    result = codeflash_output  # 869μs -> 699μs (24.3% faster)
    expected = np.array([["", "@!#"], ["@!#", ""]])


def test_edge_data_with_inf():
    # Test with infinite values
    df = pd.DataFrame([[np.inf, -np.inf], [0, 1]])
    codeflash_output = _highlight_between(df, "I", left=-1, right=2, inclusive="both")
    result = codeflash_output  # 854μs -> 708μs (20.7% faster)
    expected = np.array([["", ""], ["I", "I"]])


def test_edge_data_with_boolean_values():
    # Test with boolean values
    df = pd.DataFrame([[True, False], [True, True]])
    codeflash_output = _highlight_between(
        df, "B", left=True, right=True, inclusive="both"
    )
    result = codeflash_output  # 946μs -> 768μs (23.2% faster)
    expected = np.array([["B", ""], ["B", "B"]])


def test_edge_data_with_datetime():
    # Test with datetime values
    dt1 = pd.Timestamp("2020-01-01")
    dt2 = pd.Timestamp("2020-01-02")
    dt3 = pd.Timestamp("2020-01-03")
    df = pd.DataFrame([[dt1, dt2], [dt3, dt1]])
    left = pd.Timestamp("2020-01-01")
    right = pd.Timestamp("2020-01-02")
    codeflash_output = _highlight_between(
        df, "T", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 903μs -> 732μs (23.4% faster)
    expected = np.array([["T", "T"], ["", "T"]])


def test_edge_data_with_period_dtype():
    # Test with pandas Period dtype
    df = pd.DataFrame(
        [[pd.Period("2020-01", freq="M"), pd.Period("2020-02", freq="M")]]
    )
    left = pd.Period("2020-01", freq="M")
    right = pd.Period("2020-02", freq="M")
    codeflash_output = _highlight_between(
        df, "P", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 1.13ms -> 1.02ms (11.4% faster)
    expected = np.array([["P", "P"]])


def test_edge_data_with_timedelta():
    # Test with timedelta values
    df = pd.DataFrame([[pd.Timedelta(days=1), pd.Timedelta(days=2)]])
    left = pd.Timedelta(days=1)
    right = pd.Timedelta(days=2)
    codeflash_output = _highlight_between(
        df, "TD", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 907μs -> 723μs (25.4% faster)
    expected = np.array([["TD", "TD"]])


def test_edge_data_with_none_values():
    # Test with None values in data
    df = pd.DataFrame([[None, 2], [3, None]])
    codeflash_output = _highlight_between(df, "N", left=2, right=3, inclusive="both")
    result = codeflash_output  # 869μs -> 706μs (23.2% faster)
    expected = np.array([["", "N"], ["N", ""]])


def test_edge_data_with_empty_rows():
    # Test with DataFrame with empty rows
    df = pd.DataFrame([], columns=["A", "B"])
    codeflash_output = _highlight_between(df, "X", left=0, right=1, inclusive="both")
    result = codeflash_output  # 857μs -> 625μs (37.2% faster)
    expected = np.empty(df.shape, dtype=str)


# 3. Large Scale Test Cases


def test_large_scale_100x10_random():
    # Test with large DataFrame of random values
    np.random.seed(123)
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 10)))
    left = 30
    right = 70
    codeflash_output = _highlight_between(
        df, "LARGE", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 881μs -> 740μs (19.1% faster)
    # Check count of highlighted cells
    highlighted_count = np.sum(result == "LARGE")
    # All values between 30 and 70 inclusive
    expected_count = np.sum((df.values >= 30) & (df.values <= 70))


def test_large_scale_500x2_sequence_bounds():
    # Test with sequence bounds for large DataFrame
    df = pd.DataFrame(np.arange(1000).reshape(500, 2))
    left = np.full(df.shape, 100)
    right = np.full(df.shape, 900)
    codeflash_output = _highlight_between(
        df, "SEQ", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 956μs -> 808μs (18.3% faster)
    # All values between 100 and 900 inclusive
    mask = (df.values >= 100) & (df.values <= 900)


def test_large_scale_1000x1_scalar_bounds():
    # Test with 1000 rows, 1 column, scalar bounds
    df = pd.DataFrame(np.arange(1000).reshape(1000, 1))
    left = 500
    right = 600
    codeflash_output = _highlight_between(
        df, "S", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 883μs -> 700μs (26.1% faster)
    mask = (df.values >= 500) & (df.values <= 600)


def test_large_scale_100x10_none_bounds():
    # Test with large DataFrame and no bounds (should always highlight)
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 10)))
    codeflash_output = _highlight_between(
        df, "ALL", left=None, right=None, inclusive="both"
    )
    result = codeflash_output  # 31.1μs -> 30.1μs (3.36% faster)


def test_large_scale_100x10_nan_bounds():
    # Test with large DataFrame and all bounds as NaN
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 10)))
    left = np.full(df.shape, np.nan)
    right = np.full(df.shape, np.nan)
    codeflash_output = _highlight_between(
        df, "NONE", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 984μs -> 825μs (19.3% faster)


def test_large_scale_100x10_mixed_bounds():
    # Test with mixed left/right bounds arrays
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 10)))
    left = np.random.randint(0, 50, size=df.shape)
    right = np.random.randint(50, 100, size=df.shape)
    codeflash_output = _highlight_between(
        df, "MIX", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 955μs -> 789μs (21.0% faster)
    mask = (df.values >= left) & (df.values <= right)


def test_large_scale_100x10_dataframe_bounds():
    # Test with DataFrame bounds
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 10)))
    left = pd.DataFrame(
        np.random.randint(0, 50, size=df.shape), index=df.index, columns=df.columns
    )
    right = pd.DataFrame(
        np.random.randint(50, 100, size=df.shape), index=df.index, columns=df.columns
    )
    codeflash_output = _highlight_between(
        df, "DF", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 1.02ms -> 852μs (20.0% faster)
    mask = (df.values >= left.values) & (df.values <= right.values)


def test_large_scale_100x10_with_nans_in_data():
    # Test with large DataFrame with NaNs in data
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 10)).astype(float))
    df.iloc[0:10, 0:5] = np.nan
    codeflash_output = _highlight_between(
        df, "NAN", left=10, right=90, inclusive="both"
    )
    result = codeflash_output  # 866μs -> 724μs (19.6% faster)
    mask = (df.values >= 10) & (df.values <= 90)
    # NaNs should not be highlighted
    mask = mask & ~np.isnan(df.values)


def test_large_scale_100x10_string_props():
    # Test with large DataFrame and string props
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 10)))
    codeflash_output = _highlight_between(
        df, "highlight", left=20, right=80, inclusive="both"
    )
    result = codeflash_output  # 898μs -> 735μs (22.1% faster)
    mask = (df.values >= 20) & (df.values <= 80)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
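
The comparison described in the comment above could look roughly like the following. This is a sketch under stated assumptions: `outputs_match` is a hypothetical helper, and the actual Codeflash harness is not shown in this PR and may work differently.

```python
import numpy as np


def outputs_match(original_fn, optimized_fn, *args, **kwargs):
    """Hypothetical equivalence check; not Codeflash's actual harness."""

    def run(fn):
        # Record either the return value or the type of the raised exception.
        try:
            return "ok", fn(*args, **kwargs)
        except Exception as exc:
            return "raised", type(exc)

    kind_a, value_a = run(original_fn)
    kind_b, value_b = run(optimized_fn)
    if kind_a != kind_b:
        return False
    if kind_a == "raised":
        # Both failed: require the same exception type.
        return value_a is value_b
    # Both succeeded: require identical shape and element values.
    return np.array_equal(np.asarray(value_a), np.asarray(value_b))


def reference(values):
    return np.array(["a" if v > 1 else "" for v in values])


def vectorized(values):
    return np.where(np.asarray(values) > 1, "a", "")


print(outputs_match(reference, vectorized, [1, 2, 3]))
```

Comparing exception types as well as return values matters here because several of the generated tests exercise only the `ValueError` paths.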

To edit these changes, `git checkout codeflash/optimize-_highlight_between-mja1pcb6` and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 December 17, 2025 13:26
@codeflash-ai codeflash-ai bot added the "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: Medium" (Optimization Quality according to Codeflash) labels Dec 17, 2025