⚡️ Speed up function `_highlight_between` by 22% #412

codeflash-ai · 2025-12-17T13:26:44Z

📄 22% (0.22x) speedup for `_highlight_between` in `pandas/io/formats/style.py`

⏱️ Runtime : 53.9 milliseconds → 44.3 milliseconds (best of 10 runs)

📝 Explanation and details

The optimized code achieves a 22% speedup (53.9 ms → 44.3 ms) through several performance improvements in the pandas styling functions:

Key Optimizations:

1. Smart type-specific handling in `_validate_apply_axis_arg`:
   - Added an explicit fast path for `np.ndarray` inputs to avoid redundant `np.asarray()` calls
   - Uses `astype(copy=False)` for dtype conversion when possible, reducing unnecessary memory copies
   - Changed `dtype = {"dtype": dtype} if dtype else {}` to `dtype_kw = {"dtype": dtype} if dtype is not None else {}` for cleaner null checking
2. Reduced memory allocations in `_highlight_between`:
   - Pre-validates and converts bounds (`left_array`/`right_array`) only once, avoiding repeated validation calls
   - Converts pandas DataFrame/Series masks to numpy arrays with `to_numpy(dtype=bool, copy=False)` so that all boolean operations work on numpy arrays
   - Performs the final `np.where()` operation only once, on pre-computed numpy boolean masks instead of mixed pandas/numpy types
3. Minimized branching overhead:
   - Streamlined the conditional logic to reduce the number of `isinstance` checks and branches
   - Early returns in `_validate_apply_axis_arg` skip unnecessary processing for common cases
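
The validation fast path described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the PR: `to_bound_array` is a hypothetical helper name, and the real `_validate_apply_axis_arg` also performs shape and type checks that are omitted here.

```python
import numpy as np


def to_bound_array(arg, dtype=None):
    """Hypothetical helper; NOT the pandas implementation."""
    # Explicit None check: pass dtype through only when the caller supplied one.
    dtype_kw = {"dtype": dtype} if dtype is not None else {}
    if isinstance(arg, np.ndarray):
        # Fast path: the input is already an ndarray, so skip np.asarray.
        if dtype is None:
            return arg
        # With copy=False, astype returns the input itself when the dtype
        # already matches, so no new buffer is allocated.
        return arg.astype(dtype, copy=False)
    # General path for lists and other array-likes.
    return np.asarray(arg, **dtype_kw)


a = np.array([[1.0, 2.0], [3.0, 4.0]])
same = to_bound_array(a)                     # returned unchanged
same_dtype = to_bound_array(a, np.float64)   # dtype matches, no copy
converted = to_bound_array([[1, 2], [3, 4]], np.float64)
```

Both `same` and `same_dtype` are the original array object, which is where the saving comes from when bounds are already numpy arrays.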

Performance Impact: The optimizations are particularly effective for test cases with:

- Scalar bounds (23-30% faster): benefits from reduced validation overhead
- One bound set to `None` (42-56% faster): the missing bound skips its array conversion entirely; with both bounds `None` the function returns early and the gain is only about 3-10%
- Large DataFrames (15-22% faster): reduced memory allocations and more efficient numpy operations scale well
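
To illustrate why a `None` bound is cheap, here is a simplified sketch of the mask-then-`np.where` pattern. It is an assumption-laden stand-in, not the PR's code: `highlight_between_sketch` is a hypothetical name, and it handles only scalar bounds with none of the axis or shape validation that pandas performs.

```python
import operator

import numpy as np
import pandas as pd


def highlight_between_sketch(data, props, left=None, right=None, inclusive="both"):
    """Hypothetical, simplified stand-in; NOT the pandas implementation."""
    if inclusive == "both":
        ops = (operator.ge, operator.le)
    elif inclusive == "neither":
        ops = (operator.gt, operator.lt)
    elif inclusive == "left":
        ops = (operator.ge, operator.lt)
    elif inclusive == "right":
        ops = (operator.gt, operator.le)
    else:
        raise ValueError(f"unsupported inclusive value: {inclusive!r}")

    values = data.to_numpy()
    mask = np.ones(values.shape, dtype=bool)
    # A bound that is None contributes no comparison and no conversion.
    if left is not None:
        mask &= np.asarray(ops[0](values, left), dtype=bool)
    if right is not None:
        mask &= np.asarray(ops[1](values, right), dtype=bool)
    # Single np.where call on a plain numpy boolean mask.
    return np.where(mask, props, "")


df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
both = highlight_between_sketch(df, "x", left=2, right=5)
right_only = highlight_between_sketch(df, "x", right=2)
```

Keeping every intermediate as a numpy boolean array means the final selection is one vectorized call regardless of how many bounds were supplied.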

The improvements target the most common styling operations while maintaining full backward compatibility and preserving all error handling behavior.

✅ Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 71 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime

import numpy as np
import pandas as pd

# imports
import pytest
from pandas.io.formats.style import _highlight_between

# unit tests

# ----------- BASIC TEST CASES -----------


def test_basic_scalar_bounds_inclusive_both():
    # Values from 2 to 5, inclusive on both ends, are highlighted
    df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
    codeflash_output = _highlight_between(df, "x", left=2, right=5, inclusive="both")
    out = codeflash_output  # 873μs -> 707μs (23.4% faster)
    expected = np.array([["", "x", "x"], ["x", "x", ""]])


def test_basic_scalar_bounds_inclusive_neither():
    # Strict between
    df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
    codeflash_output = _highlight_between(df, "y", left=2, right=5, inclusive="neither")
    out = codeflash_output  # 866μs -> 696μs (24.4% faster)
    expected = np.array([["", "", "y"], ["y", "", ""]])


def test_basic_scalar_bounds_inclusive_left():
    # Left inclusive, right exclusive
    df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
    codeflash_output = _highlight_between(df, "z", left=2, right=5, inclusive="left")
    out = codeflash_output  # 859μs -> 685μs (25.5% faster)
    expected = np.array([["", "z", "z"], ["z", "", ""]])


def test_basic_scalar_bounds_inclusive_right():
    # Left exclusive, right inclusive
    df = pd.DataFrame([[1, 2, 3], [4, 5, 5]])
    codeflash_output = _highlight_between(df, "w", left=2, right=5, inclusive="right")
    out = codeflash_output  # 861μs -> 696μs (23.6% faster)
    expected = np.array([["", "", "w"], ["w", "w", "w"]])


def test_basic_sequence_bounds():
    # left and right as arrays
    df = pd.DataFrame([[1, 2], [3, 4]])
    left = [[0, 1], [2, 3]]
    right = [[2, 3], [4, 5]]
    codeflash_output = _highlight_between(
        df, "a", left=left, right=right, inclusive="both"
    )
    out = codeflash_output  # 947μs -> 802μs (18.0% faster)
    expected = np.array([["a", "a"], ["a", "a"]])


def test_basic_dataframe_bounds():
    # left and right as DataFrames
    df = pd.DataFrame([[1, 2], [3, 4]])
    left = pd.DataFrame([[0, 2], [3, 1]], index=df.index, columns=df.columns)
    right = pd.DataFrame([[1, 3], [5, 4]], index=df.index, columns=df.columns)
    codeflash_output = _highlight_between(
        df, "b", left=left, right=right, inclusive="both"
    )
    out = codeflash_output  # 984μs -> 841μs (16.9% faster)
    expected = np.array([["b", "b"], ["b", "b"]])


def test_basic_props_empty_string():
    # props is empty string, should fill with empty string where condition met
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "", left=2, right=3, inclusive="both")
    out = codeflash_output  # 874μs -> 685μs (27.7% faster)
    expected = np.array([["", ""], ["", ""]])


def test_basic_all_highlighted():
    # All values highlighted
    df = pd.DataFrame([[1, 1], [1, 1]])
    codeflash_output = _highlight_between(df, "c", left=1, right=1, inclusive="both")
    out = codeflash_output  # 880μs -> 684μs (28.6% faster)
    expected = np.full((2, 2), "c")


def test_basic_none_highlighted():
    # No values highlighted
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "d", left=10, right=20, inclusive="both")
    out = codeflash_output  # 864μs -> 664μs (30.1% faster)


def test_basic_left_none():
    # left is None, only right bound
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "e", left=None, right=2, inclusive="both")
    out = codeflash_output  # 606μs -> 388μs (56.2% faster)
    expected = np.array([["e", "e"], ["", ""]])


def test_basic_right_none():
    # right is None, only left bound
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "f", left=2, right=None, inclusive="both")
    out = codeflash_output  # 579μs -> 407μs (42.0% faster)
    expected = np.array([["", "f"], ["f", "f"]])


def test_basic_both_none():
    # Both bounds None, everything highlighted
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(
        df, "g", left=None, right=None, inclusive="both"
    )
    out = codeflash_output  # 23.3μs -> 21.3μs (9.58% faster)


def test_edge_empty_dataframe():
    # Empty DataFrame
    df = pd.DataFrame([])
    codeflash_output = _highlight_between(
        df, "x", left=None, right=None, inclusive="both"
    )
    out = codeflash_output  # 31.1μs -> 29.4μs (5.89% faster)


def test_edge_nan_values():
    # DataFrame with NaN
    df = pd.DataFrame([[1, np.nan], [np.nan, 4]])
    codeflash_output = _highlight_between(df, "i", left=1, right=4, inclusive="both")
    out = codeflash_output  # 905μs -> 736μs (23.0% faster)
    expected = np.array([["i", ""], ["", "i"]])


def test_edge_inf_values():
    # DataFrame with inf/-inf
    df = pd.DataFrame([[np.inf, -np.inf], [1, 2]])
    codeflash_output = _highlight_between(df, "j", left=0, right=3, inclusive="both")
    out = codeflash_output  # 864μs -> 663μs (30.4% faster)
    expected = np.array([["", ""], ["j", "j"]])


def test_edge_left_greater_than_right():
    # left > right, nothing should be highlighted
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "k", left=5, right=2, inclusive="both")
    out = codeflash_output  # 873μs -> 663μs (31.7% faster)


def test_edge_left_right_same():
    # left == right, only those values highlighted
    df = pd.DataFrame([[1, 2], [2, 1]])
    codeflash_output = _highlight_between(df, "l", left=2, right=2, inclusive="both")
    out = codeflash_output  # 875μs -> 691μs (26.5% faster)
    expected = np.array([["", "l"], ["l", ""]])


def test_edge_non_numeric_data():
    # DataFrame with strings, should work if bounds are strings
    df = pd.DataFrame([["a", "b"], ["c", "d"]])
    codeflash_output = _highlight_between(
        df, "m", left="b", right="d", inclusive="both"
    )
    out = codeflash_output  # 853μs -> 660μs (29.2% faster)
    expected = np.array([["", "m"], ["m", "m"]])


def test_edge_invalid_inclusive_value():
    # Invalid inclusive value
    df = pd.DataFrame([[1, 2], [3, 4]])
    with pytest.raises(ValueError):
        _highlight_between(
            df, "n", left=1, right=4, inclusive="invalid"
        )  # 3.95μs -> 4.04μs (2.18% slower)


def test_edge_shape_mismatch_left():
    # left shape mismatch
    df = pd.DataFrame([[1, 2], [3, 4]])
    left = [1, 2, 3]
    with pytest.raises(ValueError):
        _highlight_between(
            df, "o", left=left, right=4, inclusive="both"
        )  # 10.5μs -> 10.7μs (2.16% slower)


def test_edge_shape_mismatch_right():
    # right shape mismatch
    df = pd.DataFrame([[1, 2], [3, 4]])
    right = [1, 2, 3]
    with pytest.raises(ValueError):
        _highlight_between(
            df, "p", left=1, right=right, inclusive="both"
        )  # 10.7μs -> 11.0μs (2.98% slower)


def test_edge_props_special_characters():
    # props is a special string
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "<b>", left=2, right=3, inclusive="both")
    out = codeflash_output  # 884μs -> 748μs (18.1% faster)
    expected = np.array([["", "<b>"], ["<b>", ""]])


def test_edge_data_is_series():
    # data is Series
    s = pd.Series([1, 2, 3])
    codeflash_output = _highlight_between(s, "s", left=2, right=3, inclusive="both")
    out = codeflash_output  # 609μs -> 563μs (8.15% faster)
    expected = np.array(["", "s", "s"])


def test_edge_left_is_dataframe_right_is_scalar():
    # left is DataFrame, right is scalar
    df = pd.DataFrame([[1, 2], [3, 4]])
    left = pd.DataFrame([[1, 2], [3, 4]], index=df.index, columns=df.columns)
    codeflash_output = _highlight_between(df, "t", left=left, right=4, inclusive="both")
    out = codeflash_output  # 967μs -> 820μs (17.9% faster)
    expected = np.array([["t", "t"], ["t", "t"]])


def test_edge_left_is_scalar_right_is_dataframe():
    # left is scalar, right is DataFrame
    df = pd.DataFrame([[1, 2], [3, 4]])
    right = pd.DataFrame([[2, 3], [4, 5]], index=df.index, columns=df.columns)
    codeflash_output = _highlight_between(
        df, "u", left=1, right=right, inclusive="both"
    )
    out = codeflash_output  # 951μs -> 795μs (19.7% faster)
    expected = np.array([["u", "u"], ["u", "u"]])


def test_edge_string_bounds_with_nan():
    # DataFrame with strings and NaN
    df = pd.DataFrame([["a", np.nan], ["c", "d"]])
    codeflash_output = _highlight_between(
        df, "v", left="b", right="d", inclusive="both"
    )
    out = codeflash_output  # 831μs -> 695μs (19.5% faster)
    expected = np.array([["", ""], ["v", "v"]])


# ----------- LARGE SCALE TEST CASES -----------


def test_large_scale_100x100_all_highlighted():
    # 100x100 DataFrame, all highlighted
    df = pd.DataFrame(np.full((100, 100), 5))
    codeflash_output = _highlight_between(df, "X", left=5, right=5, inclusive="both")
    out = codeflash_output  # 1.01ms -> 862μs (16.9% faster)


def test_large_scale_100x100_none_highlighted():
    # 100x100 DataFrame, none highlighted
    df = pd.DataFrame(np.full((100, 100), 0))
    codeflash_output = _highlight_between(df, "Y", left=1, right=2, inclusive="both")
    out = codeflash_output  # 1.00ms -> 851μs (17.7% faster)


def test_large_scale_100x100_diagonal_highlighted():
    # 100x100, only diagonal highlighted
    arr = np.zeros((100, 100))
    np.fill_diagonal(arr, 1)
    df = pd.DataFrame(arr)
    codeflash_output = _highlight_between(df, "Z", left=1, right=1, inclusive="both")
    out = codeflash_output  # 1.01ms -> 860μs (17.8% faster)
    expected = np.full((100, 100), "")
    for i in range(100):
        expected[i, i] = "Z"


def test_large_scale_1000x1_random_data():
    # 1000x1, random data
    np.random.seed(0)
    df = pd.DataFrame(np.random.randint(0, 100, size=(1000, 1)))
    codeflash_output = _highlight_between(df, "B", left=10, right=20, inclusive="both")
    out = codeflash_output  # 897μs -> 730μs (22.9% faster)
    # Check that all highlighted values are in 10..20
    highlighted = df.values[(out == "B").flatten()]


def test_large_scale_500x2_nan_inf():
    # 500x2, with NaN and inf
    arr = np.full((500, 2), np.nan)
    arr[0, 0] = np.inf
    arr[-1, -1] = -np.inf
    arr[250, 0] = 5
    arr[250, 1] = 10
    df = pd.DataFrame(arr)
    codeflash_output = _highlight_between(df, "D", left=5, right=10, inclusive="both")
    out = codeflash_output  # 973μs -> 842μs (15.6% faster)
    # All others should be ""
    arr[250, 0] = arr[250, 1] = np.nan  # Remove highlights and check again
    codeflash_output = _highlight_between(df, "D", left=5, right=10, inclusive="both")
    out2 = codeflash_output  # 725μs -> 551μs (31.7% faster)


def test_large_scale_100x100_string_data():
    # 100x100, string data
    df = pd.DataFrame([["str" + str(i % 5) for i in range(100)] for _ in range(100)])
    codeflash_output = _highlight_between(
        df, "E", left="str1", right="str3", inclusive="both"
    )
    out = codeflash_output  # 1.85ms -> 1.69ms (9.48% faster)
    # Only columns where value is "str1", "str2", "str3" are highlighted
    for i in range(100):
        for j in range(100):
            if df.iloc[i, j] in ("str1", "str2", "str3"):
                pass
            else:
                pass


def test_large_scale_100x100_props_long_string():
    # 100x100, props is a long string
    df = pd.DataFrame(np.full((100, 100), 42))
    long_str = "long" * 25
    codeflash_output = _highlight_between(
        df, long_str, left=42, right=42, inclusive="both"
    )
    out = codeflash_output  # 1.76ms -> 1.62ms (8.93% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

import numpy as np
import pandas as pd

# imports
import pytest
from pandas.io.formats.style import _highlight_between

# --------- UNIT TESTS ----------

# 1. Basic Test Cases


def test_basic_scalar_bounds_inclusive_both():
    # Test with simple DataFrame and scalar bounds, inclusive='both'
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "H", left=2, right=3, inclusive="both")
    result = codeflash_output  # 893μs -> 725μs (23.2% faster)
    expected = np.array([["", "H"], ["H", ""]])


def test_basic_scalar_bounds_inclusive_neither():
    # Test with simple DataFrame and scalar bounds, inclusive='neither'
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "X", left=2, right=3, inclusive="neither")
    result = codeflash_output  # 888μs -> 699μs (27.0% faster)
    expected = np.array([["", ""], ["", ""]])  # No value strictly between 2 and 3


def test_basic_scalar_bounds_inclusive_left():
    # Test with inclusive='left'
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "L", left=2, right=3, inclusive="left")
    result = codeflash_output  # 892μs -> 712μs (25.2% faster)
    expected = np.array([["", "L"], ["", ""]])


def test_basic_scalar_bounds_inclusive_right():
    # Test with inclusive='right'
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "R", left=2, right=3, inclusive="right")
    result = codeflash_output  # 880μs -> 712μs (23.6% faster)
    expected = np.array([["", ""], ["R", ""]])


def test_basic_sequence_bounds():
    # Test with sequence bounds
    df = pd.DataFrame([[1, 2], [3, 4]])
    left = [[0, 2], [2, 3]]
    right = [[2, 3], [4, 5]]
    codeflash_output = _highlight_between(
        df, "S", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 946μs -> 796μs (18.8% faster)
    expected = np.array([["S", "S"], ["S", "S"]])


def test_basic_ndarray_bounds():
    # Test with ndarray bounds
    df = pd.DataFrame([[1, 2], [3, 4]])
    left = np.array([[0, 2], [2, 3]])
    right = np.array([[2, 3], [4, 5]])
    codeflash_output = _highlight_between(
        df, "N", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 952μs -> 786μs (21.2% faster)
    expected = np.array([["N", "N"], ["N", "N"]])


def test_basic_dataframe_bounds():
    # Test with DataFrame bounds
    df = pd.DataFrame([[1, 2], [3, 4]])
    left = pd.DataFrame([[0, 2], [2, 3]], index=df.index, columns=df.columns)
    right = pd.DataFrame([[2, 3], [4, 5]], index=df.index, columns=df.columns)
    codeflash_output = _highlight_between(
        df, "D", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 1.01ms -> 861μs (16.8% faster)
    expected = np.array([["D", "D"], ["D", "D"]])


def test_basic_none_left_right():
    # Test with left=None, right=None (should always highlight)
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(
        df, "A", left=None, right=None, inclusive="both"
    )
    result = codeflash_output  # 24.5μs -> 22.7μs (7.99% faster)
    expected = np.full(df.shape, "A")


def test_basic_none_left_only():
    # Test with left=None, right=3
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "B", left=None, right=3, inclusive="both")
    result = codeflash_output  # 632μs -> 432μs (46.2% faster)
    expected = np.array([["B", "B"], ["B", ""]])


def test_basic_none_right_only():
    # Test with left=2, right=None
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "C", left=2, right=None, inclusive="both")
    result = codeflash_output  # 584μs -> 408μs (42.9% faster)
    expected = np.array([["", "C"], ["C", "C"]])


def test_basic_with_nan_values():
    # Test with NaN values
    df = pd.DataFrame([[1, np.nan], [3, 4]])
    codeflash_output = _highlight_between(df, "Z", left=1, right=4, inclusive="both")
    result = codeflash_output  # 1.10ms -> 971μs (13.2% faster)
    expected = np.array([["Z", ""], ["Z", "Z"]])


def test_basic_empty_dataframe():
    # Test with empty DataFrame
    df = pd.DataFrame([])
    codeflash_output = _highlight_between(df, "E", left=0, right=1, inclusive="both")
    result = codeflash_output  # 404μs -> 331μs (22.1% faster)
    expected = np.empty(df.shape, dtype=str)


def test_basic_with_string_props():
    # Test with props as string
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(
        df, "style", left=2, right=3, inclusive="both"
    )
    result = codeflash_output  # 877μs -> 698μs (25.6% faster)
    expected = np.array([["", "style"], ["style", ""]])


# 2. Edge Test Cases


def test_edge_invalid_inclusive_value():
    # Test with invalid inclusive value
    df = pd.DataFrame([[1, 2], [3, 4]])
    with pytest.raises(ValueError):
        _highlight_between(
            df, "X", left=2, right=3, inclusive="invalid"
        )  # 3.96μs -> 4.15μs (4.70% slower)


def test_edge_left_wrong_shape():
    # Test with left shape mismatch
    df = pd.DataFrame([[1, 2], [3, 4]])
    left = [1, 2, 3]  # shape mismatch
    with pytest.raises(ValueError):
        _highlight_between(
            df, "Y", left=left, right=4, inclusive="both"
        )  # 10.8μs -> 11.4μs (5.32% slower)


def test_edge_right_wrong_shape():
    # Test with right shape mismatch
    df = pd.DataFrame([[1, 2], [3, 4]])
    right = [1, 2, 3]  # shape mismatch
    with pytest.raises(ValueError):
        _highlight_between(
            df, "Y", left=1, right=right, inclusive="both"
        )  # 10.8μs -> 11.7μs (7.96% slower)


def test_edge_left_series_with_dataframe():
    # Test with left as Series and data as DataFrame (should raise)
    df = pd.DataFrame([[1, 2], [3, 4]])
    left = pd.Series([1, 2])
    with pytest.raises(ValueError):
        _highlight_between(
            df, "Q", left=left, right=4, inclusive="both"
        )  # 9.61μs -> 9.84μs (2.34% slower)


def test_edge_right_dataframe_with_series():
    # Test with right as DataFrame and data as Series (should raise)
    s = pd.Series([1, 2])
    right = pd.DataFrame([[1, 2], [3, 4]])
    with pytest.raises(ValueError):
        _highlight_between(
            s, "Q", left=0, right=right, inclusive="both"
        )  # 7.95μs -> 8.54μs (6.94% slower)


def test_edge_all_nan_bounds():
    # Test with all bounds as NaN
    df = pd.DataFrame([[1, 2], [3, 4]])
    left = np.full(df.shape, np.nan)
    right = np.full(df.shape, np.nan)
    codeflash_output = _highlight_between(
        df, "NAN", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 972μs -> 818μs (18.8% faster)
    expected = np.full(df.shape, "")


def test_edge_props_empty_string():
    # Test with empty string as props
    df = pd.DataFrame([[1, 2], [3, 4]])
    codeflash_output = _highlight_between(df, "", left=2, right=3, inclusive="both")
    result = codeflash_output  # 852μs -> 687μs (24.1% faster)
    expected = np.array([["", ""], ["", ""]])


def test_edge_props_special_characters():
    # Test with props as special characters
    df = pd.DataFrame([[2, 3], [4, 5]])
    codeflash_output = _highlight_between(df, "@!#", left=3, right=4, inclusive="both")
    result = codeflash_output  # 869μs -> 699μs (24.3% faster)
    expected = np.array([["", "@!#"], ["@!#", ""]])


def test_edge_data_with_inf():
    # Test with infinite values
    df = pd.DataFrame([[np.inf, -np.inf], [0, 1]])
    codeflash_output = _highlight_between(df, "I", left=-1, right=2, inclusive="both")
    result = codeflash_output  # 854μs -> 708μs (20.7% faster)
    expected = np.array([["", ""], ["I", "I"]])


def test_edge_data_with_boolean_values():
    # Test with boolean values
    df = pd.DataFrame([[True, False], [True, True]])
    codeflash_output = _highlight_between(
        df, "B", left=True, right=True, inclusive="both"
    )
    result = codeflash_output  # 946μs -> 768μs (23.2% faster)
    expected = np.array([["B", ""], ["B", "B"]])


def test_edge_data_with_datetime():
    # Test with datetime values
    dt1 = pd.Timestamp("2020-01-01")
    dt2 = pd.Timestamp("2020-01-02")
    dt3 = pd.Timestamp("2020-01-03")
    df = pd.DataFrame([[dt1, dt2], [dt3, dt1]])
    left = pd.Timestamp("2020-01-01")
    right = pd.Timestamp("2020-01-02")
    codeflash_output = _highlight_between(
        df, "T", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 903μs -> 732μs (23.4% faster)
    expected = np.array([["T", "T"], ["", "T"]])


def test_edge_data_with_period_dtype():
    # Test with pandas Period dtype
    df = pd.DataFrame(
        [[pd.Period("2020-01", freq="M"), pd.Period("2020-02", freq="M")]]
    )
    left = pd.Period("2020-01", freq="M")
    right = pd.Period("2020-02", freq="M")
    codeflash_output = _highlight_between(
        df, "P", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 1.13ms -> 1.02ms (11.4% faster)
    expected = np.array([["P", "P"]])


def test_edge_data_with_timedelta():
    # Test with timedelta values
    df = pd.DataFrame([[pd.Timedelta(days=1), pd.Timedelta(days=2)]])
    left = pd.Timedelta(days=1)
    right = pd.Timedelta(days=2)
    codeflash_output = _highlight_between(
        df, "TD", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 907μs -> 723μs (25.4% faster)
    expected = np.array([["TD", "TD"]])


def test_edge_data_with_none_values():
    # Test with None values in data
    df = pd.DataFrame([[None, 2], [3, None]])
    codeflash_output = _highlight_between(df, "N", left=2, right=3, inclusive="both")
    result = codeflash_output  # 869μs -> 706μs (23.2% faster)
    expected = np.array([["", "N"], ["N", ""]])


def test_edge_data_with_empty_rows():
    # Test with DataFrame with empty rows
    df = pd.DataFrame([], columns=["A", "B"])
    codeflash_output = _highlight_between(df, "X", left=0, right=1, inclusive="both")
    result = codeflash_output  # 857μs -> 625μs (37.2% faster)
    expected = np.empty(df.shape, dtype=str)


# 3. Large Scale Test Cases


def test_large_scale_100x10_random():
    # Test with large DataFrame of random values
    np.random.seed(123)
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 10)))
    left = 30
    right = 70
    codeflash_output = _highlight_between(
        df, "LARGE", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 881μs -> 740μs (19.1% faster)
    # Check count of highlighted cells
    highlighted_count = np.sum(result == "LARGE")
    # All values between 30 and 70 inclusive
    expected_count = np.sum((df.values >= 30) & (df.values <= 70))


def test_large_scale_500x2_sequence_bounds():
    # Test with sequence bounds for large DataFrame
    df = pd.DataFrame(np.arange(1000).reshape(500, 2))
    left = np.full(df.shape, 100)
    right = np.full(df.shape, 900)
    codeflash_output = _highlight_between(
        df, "SEQ", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 956μs -> 808μs (18.3% faster)
    # All values between 100 and 900 inclusive
    mask = (df.values >= 100) & (df.values <= 900)


def test_large_scale_1000x1_scalar_bounds():
    # Test with 1000 rows, 1 column, scalar bounds
    df = pd.DataFrame(np.arange(1000).reshape(1000, 1))
    left = 500
    right = 600
    codeflash_output = _highlight_between(
        df, "S", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 883μs -> 700μs (26.1% faster)
    mask = (df.values >= 500) & (df.values <= 600)


def test_large_scale_100x10_none_bounds():
    # Test with large DataFrame and no bounds (should always highlight)
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 10)))
    codeflash_output = _highlight_between(
        df, "ALL", left=None, right=None, inclusive="both"
    )
    result = codeflash_output  # 31.1μs -> 30.1μs (3.36% faster)


def test_large_scale_100x10_nan_bounds():
    # Test with large DataFrame and all bounds as NaN
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 10)))
    left = np.full(df.shape, np.nan)
    right = np.full(df.shape, np.nan)
    codeflash_output = _highlight_between(
        df, "NONE", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 984μs -> 825μs (19.3% faster)


def test_large_scale_100x10_mixed_bounds():
    # Test with mixed left/right bounds arrays
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 10)))
    left = np.random.randint(0, 50, size=df.shape)
    right = np.random.randint(50, 100, size=df.shape)
    codeflash_output = _highlight_between(
        df, "MIX", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 955μs -> 789μs (21.0% faster)
    mask = (df.values >= left) & (df.values <= right)


def test_large_scale_100x10_dataframe_bounds():
    # Test with DataFrame bounds
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 10)))
    left = pd.DataFrame(
        np.random.randint(0, 50, size=df.shape), index=df.index, columns=df.columns
    )
    right = pd.DataFrame(
        np.random.randint(50, 100, size=df.shape), index=df.index, columns=df.columns
    )
    codeflash_output = _highlight_between(
        df, "DF", left=left, right=right, inclusive="both"
    )
    result = codeflash_output  # 1.02ms -> 852μs (20.0% faster)
    mask = (df.values >= left.values) & (df.values <= right.values)


def test_large_scale_100x10_with_nans_in_data():
    # Test with large DataFrame with NaNs in data
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 10)).astype(float))
    df.iloc[0:10, 0:5] = np.nan
    codeflash_output = _highlight_between(
        df, "NAN", left=10, right=90, inclusive="both"
    )
    result = codeflash_output  # 866μs -> 724μs (19.6% faster)
    mask = (df.values >= 10) & (df.values <= 90)
    # NaNs should not be highlighted
    mask = mask & ~np.isnan(df.values)


def test_large_scale_100x10_string_props():
    # Test with large DataFrame and string props
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 10)))
    codeflash_output = _highlight_between(
        df, "highlight", left=20, right=80, inclusive="both"
    )
    result = codeflash_output  # 898μs -> 735μs (22.1% faster)
    mask = (df.values >= 20) & (df.values <= 80)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
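
The equivalence check mentioned in the comment above can be pictured with a small sketch. How Codeflash actually compares outputs is not shown on this page, so everything below is an assumption: `original` and `optimized` are hypothetical stand-ins for the two code paths, and `np.array_equal` is just one reasonable way to compare their results.

```python
import numpy as np


def original(values, low, high):
    # Stand-in for the pre-optimization code path (hypothetical).
    return np.where((values >= low) & (values <= high), "x", "")


def optimized(values, low, high):
    # Stand-in for the optimized path: reuse one boolean buffer (hypothetical).
    mask = np.greater_equal(values, low)
    np.logical_and(mask, np.less_equal(values, high), out=mask)
    return np.where(mask, "x", "")


def outputs_match(values, low, high):
    # Element-wise equality of shape and contents.
    return np.array_equal(original(values, low, high), optimized(values, low, high))


values = np.arange(12).reshape(3, 4)
match_normal = outputs_match(values, 3, 8)
match_inverted = outputs_match(values, 8, 3)  # left > right: nothing highlighted
```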

To edit these changes, run `git checkout codeflash/optimize-_highlight_between-mja1pcb6` and push.

codeflash-ai bot requested a review from mashraf-222 December 17, 2025 13:26

codeflash-ai bot added the labels "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: Medium" (Optimization Quality according to Codeflash) on Dec 17, 2025
