@codeflash-ai codeflash-ai bot commented Dec 17, 2025

📄 8% (0.08x) speedup for broadcast_shapes in keras/src/ops/operation_utils.py

⏱️ Runtime : 1.01 milliseconds → 935 microseconds (best of 113 runs)

📝 Explanation and details

The optimized code achieves an **8% speedup** by eliminating redundant memory allocations and improving loop efficiency through several key optimizations (a short sketch combining them follows the list below):

**Primary optimizations:**

1. **Eliminated unnecessary list conversions**: The original code immediately converted input shapes to lists (`list(shape1)`), but the optimized version keeps them as tuples until output generation, avoiding early memory allocations.

2. **Improved padding strategy**: Instead of creating lists with `[1] * diff + shape`, the optimized version uses tuple unpacking (`(*pad, *shape)`), which is significantly faster for shape extension operations.

3. **Pre-allocated output with exact size**: Rather than copying `shape1` and modifying it, the optimized version creates `[None] * len_` upfront and fills it directly, eliminating intermediate list operations.

4. **Loop variable localization**: By assigning `s1 = shape1`, `s2 = shape2`, and `out = output_shape` before the loop, the code avoids repeated variable lookups during the hot loop.

5. **Reduced redundant length calculations**: The optimized version calculates lengths once and reuses them, avoiding repeated `len()` calls.

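Taken together, these points suggest a structure roughly like the sketch below. This is an illustrative reconstruction of the described pattern, assuming a plausible branch order and error message; it is not the exact Keras source, and the `broadcast_shapes_sketch` name is made up for the example.

```python
# Illustrative sketch only: an assumed shape of the optimized function,
# reconstructed from the five points above, not the actual Keras code.
def broadcast_shapes_sketch(shape1, shape2):
    # (5) Compute lengths once and reuse them.
    len1, len2 = len(shape1), len(shape2)

    # (1) + (2) No list(shape) copies; pad the shorter shape with leading 1s
    # via tuple unpacking instead of list concatenation.
    if len1 > len2:
        shape2 = (*([1] * (len1 - len2)), *shape2)
    elif len2 > len1:
        shape1 = (*([1] * (len2 - len1)), *shape1)
    len_ = len1 if len1 > len2 else len2

    # (3) Pre-allocate the output at its exact final size.
    output_shape = [None] * len_

    # (4) Bind short local aliases before the hot loop.
    s1, s2, out = shape1, shape2, output_shape
    for i in range(len_):
        d1, d2 = s1[i], s2[i]
        if d1 == 1:
            out[i] = d2
        elif d1 is None:
            out[i] = None if d2 == 1 else d2
        elif d2 == 1 or d2 is None or d2 == d1:
            out[i] = d1
        else:
            raise ValueError(
                f"Cannot broadcast shape {tuple(shape1)} with {tuple(shape2)}."
            )
    return output_shape
```
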
**Performance impact analysis:**
The test results show consistent improvements across most cases, with particularly strong gains for:

- Different rank broadcasting (16-37% faster)
- Scalar broadcasting (32-37% faster)
- Large-scale operations (up to 22% faster for error cases)

**Hot path relevance:**
Based on the function references, `broadcast_shapes` is called in critical tensor operation paths like `_vectorize_parse_input_dimensions` and `take_along_axis`. These are fundamental operations that can be invoked thousands of times in ML workloads, making even an 8% improvement significant for overall training/inference performance.

The optimizations are particularly effective for common ML scenarios involving tensor broadcasting with different ranks and singleton dimensions, which are frequent in neural network operations.
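
For context, a few representative calls are shown below; the expected results follow standard broadcasting rules and match the parametrized expectations in the generated tests further down.

```python
from keras.src.ops.operation_utils import broadcast_shapes

# Rank promotion: the rank-1 shape is left-padded to match the rank-2 shape.
print(broadcast_shapes((3,), (2, 3)))       # [2, 3]

# Singleton dimensions expand against the other operand.
print(broadcast_shapes((1, 3), (5, 1)))     # [5, 3]

# An unknown (None) dimension broadcast against 1 stays unknown.
print(broadcast_shapes((None, 1), (1, 6)))  # [None, 6]
```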

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 87 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime
```python
import pytest  # used for our unit tests
from keras.src.ops.operation_utils import broadcast_shapes

# unit tests

# -------------------------------
# Basic Test Cases
# -------------------------------

def test_basic_equal_shapes():
    # Both shapes are equal
    codeflash_output = broadcast_shapes((3, 4), (3, 4)) # 2.63μs -> 2.32μs (13.2% faster)

def test_basic_broadcast_singleton_dim():
    # Broadcasting singleton dimension
    codeflash_output = broadcast_shapes((5, 3), (1, 3)) # 2.48μs -> 2.24μs (10.7% faster)
    codeflash_output = broadcast_shapes((1, 3), (5, 3)) # 1.07μs -> 892ns (20.0% faster)

def test_basic_broadcast_to_higher_rank():
    # Broadcasting lower rank to higher rank
    codeflash_output = broadcast_shapes((3,), (2, 3)) # 3.13μs -> 2.72μs (15.1% faster)
    codeflash_output = broadcast_shapes((2, 3), (3,)) # 1.70μs -> 1.23μs (37.4% faster)

def test_basic_broadcast_scalar():
    # Broadcasting a scalar (empty shape)
    codeflash_output = broadcast_shapes((), (2, 3)) # 2.49μs -> 2.45μs (1.88% faster)
    codeflash_output = broadcast_shapes((2, 3), ()) # 1.72μs -> 1.30μs (32.7% faster)

def test_basic_broadcast_multiple_singletons():
    # Broadcasting with multiple singleton dimensions
    codeflash_output = broadcast_shapes((1, 1, 5), (3, 4, 1)) # 2.20μs -> 2.12μs (3.78% faster)

# -------------------------------
# Edge Test Cases
# -------------------------------

def test_edge_empty_shapes():
    # Both shapes are empty (scalars)
    codeflash_output = broadcast_shapes((), ()) # 1.41μs -> 1.54μs (8.51% slower)

def test_edge_one_empty_one_nonempty():
    # One shape is empty, one is not
    codeflash_output = broadcast_shapes((), (7,)) # 2.61μs -> 2.41μs (8.00% faster)
    codeflash_output = broadcast_shapes((7,), ()) # 1.46μs -> 1.12μs (30.7% faster)

def test_edge_none_dimension():
    # None dimensions are handled
    codeflash_output = broadcast_shapes((None, 3), (5, 1)) # 2.32μs -> 2.14μs (8.50% faster)
    codeflash_output = broadcast_shapes((None, 1), (1, 4)) # 1.20μs -> 1.08μs (11.1% faster)
    codeflash_output = broadcast_shapes((None, 1), (1, 1)) # 664ns -> 614ns (8.14% faster)

def test_edge_none_and_singleton():
    # None and singleton dims
    codeflash_output = broadcast_shapes((None,), (1,)) # 1.95μs -> 1.91μs (1.88% faster)
    codeflash_output = broadcast_shapes((1,), (None,)) # 886ns -> 783ns (13.2% faster)

def test_edge_none_and_none():
    # Both shapes have None
    codeflash_output = broadcast_shapes((None,), (None,)) # 1.90μs -> 1.84μs (3.27% faster)
    codeflash_output = broadcast_shapes((None, 1), (None, 1)) # 1.30μs -> 1.11μs (17.2% faster)

def test_edge_incompatible_shapes():
    # Incompatible shapes should raise ValueError
    with pytest.raises(ValueError):
        broadcast_shapes((2, 3), (3, 4)) # 4.45μs -> 4.19μs (6.40% faster)
    with pytest.raises(ValueError):
        broadcast_shapes((1, 2, 3), (1, 3, 4)) # 2.85μs -> 2.58μs (10.4% faster)
    with pytest.raises(ValueError):
        broadcast_shapes((5,), (6,)) # 2.16μs -> 1.82μs (18.7% faster)
    with pytest.raises(ValueError):
        broadcast_shapes((None, 2), (3, 3)) # 2.42μs -> 2.13μs (13.4% faster)

def test_edge_mixed_types():
    # Accept both lists and tuples as input
    codeflash_output = broadcast_shapes([2, 3], (1, 3)) # 2.33μs -> 2.12μs (10.3% faster)
    codeflash_output = broadcast_shapes((1, 3), [2, 3]) # 1.22μs -> 1.14μs (7.21% faster)
    codeflash_output = broadcast_shapes([], []) # 794ns -> 760ns (4.47% faster)

def test_edge_negative_and_zero_dimensions():
    # Negative and zero dimensions are not allowed by broadcasting rules, but let's check behavior
    with pytest.raises(ValueError):
        broadcast_shapes((0, 3), (2, 3))
    with pytest.raises(ValueError):
        broadcast_shapes((2, -1), (2, 1))

def test_edge_large_singleton_broadcast():
    # Broadcasting large singleton to large dimension
    codeflash_output = broadcast_shapes((1, 100), (50, 1)) # 2.74μs -> 2.72μs (0.846% faster)

# -------------------------------
# Large Scale Test Cases
# -------------------------------

def test_large_broadcast_high_rank():
    # Broadcasting shapes with high rank (length 1000)
    shape1 = [1] * 1000
    shape2 = [2] * 1000
    # Expected: [2]*1000
    codeflash_output = broadcast_shapes(shape1, shape2) # 49.3μs -> 50.2μs (1.61% slower)

def test_large_broadcast_mixed_dims():
    # Broadcasting shapes with mixed dims, high rank
    shape1 = [1 if i % 2 == 0 else 10 for i in range(500)]
    shape2 = [20 if i % 2 == 0 else 1 for i in range(500)]
    expected = [20 if i % 2 == 0 else 10 for i in range(500)]
    codeflash_output = broadcast_shapes(shape1, shape2) # 34.0μs -> 29.9μs (13.7% faster)

def test_large_broadcast_incompatible():
    # Broadcasting shapes with incompatible dims, high rank
    shape1 = [2] * 999 + [3]
    shape2 = [2] * 999 + [4]
    with pytest.raises(ValueError):
        broadcast_shapes(shape1, shape2) # 168μs -> 137μs (22.6% faster)

def test_large_broadcast_none_dims():
    # Broadcasting shapes with None dims, high rank
    shape1 = [None] * 1000
    shape2 = [1] * 1000
    codeflash_output = broadcast_shapes(shape1, shape2) # 76.5μs -> 70.9μs (7.91% faster)

def test_large_broadcast_none_and_values():
    # Broadcasting shapes with None and values, high rank
    shape1 = [None if i % 2 == 0 else 1 for i in range(1000)]
    shape2 = [1 if i % 2 == 0 else 5 for i in range(1000)]
    expected = [None if i % 2 == 0 else 5 for i in range(1000)]
    codeflash_output = broadcast_shapes(shape1, shape2) # 63.8μs -> 60.6μs (5.25% faster)

def test_large_broadcast_scalar_and_large_shape():
    # Broadcasting scalar to large shape
    shape1 = ()
    shape2 = [3]*1000
    codeflash_output = broadcast_shapes(shape1, shape2) # 49.6μs -> 57.7μs (14.0% slower)

def test_large_broadcast_large_to_scalar():
    # Broadcasting large shape to scalar
    shape1 = [4]*1000
    shape2 = ()
    codeflash_output = broadcast_shapes(shape1, shape2) # 81.8μs -> 76.1μs (7.58% faster)

def test_large_broadcast_all_singleton():
    # Broadcasting all singleton dims
    shape1 = [1]*1000
    shape2 = [1]*1000
    codeflash_output = broadcast_shapes(shape1, shape2) # 49.4μs -> 50.0μs (1.19% slower)

def test_large_broadcast_all_none():
    # Broadcasting all None dims
    shape1 = [None]*1000
    shape2 = [None]*1000
    codeflash_output = broadcast_shapes(shape1, shape2) # 83.6μs -> 72.3μs (15.6% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
import pytest
from keras.src.ops.operation_utils import broadcast_shapes

# unit tests

# ----------------
# Basic Test Cases
# ----------------

def test_broadcast_shapes_equal_shapes():
    # Both shapes are the same
    codeflash_output = broadcast_shapes((2, 3), (2, 3)) # 3.13μs -> 2.89μs (8.35% faster)
    codeflash_output = broadcast_shapes([4, 5, 6], [4, 5, 6]) # 1.57μs -> 1.24μs (26.6% faster)

def test_broadcast_shapes_with_ones():
    # Broadcasting with ones
    codeflash_output = broadcast_shapes((1, 3), (5, 3)) # 2.35μs -> 2.13μs (10.7% faster)
    codeflash_output = broadcast_shapes((5, 3), (1, 3)) # 1.25μs -> 979ns (28.1% faster)
    codeflash_output = broadcast_shapes((1, 1), (7, 8)) # 729ns -> 604ns (20.7% faster)
    codeflash_output = broadcast_shapes((7, 8), (1, 1)) # 781ns -> 630ns (24.0% faster)

def test_broadcast_shapes_different_ranks():
    # Broadcasting with different ranks (e.g. (3,) and (2, 3))
    codeflash_output = broadcast_shapes((3,), (2, 3)) # 3.06μs -> 2.62μs (16.7% faster)
    codeflash_output = broadcast_shapes((2, 3), (3,)) # 1.64μs -> 1.21μs (34.9% faster)
    codeflash_output = broadcast_shapes((1,), (2, 3, 4)) # 1.23μs -> 1.22μs (0.572% faster)
    codeflash_output = broadcast_shapes((2, 3, 4), (1,)) # 1.26μs -> 1.01μs (25.4% faster)

def test_broadcast_shapes_scalar():
    # Broadcasting with scalar (empty shape)
    codeflash_output = broadcast_shapes((), (5, 6)) # 2.45μs -> 2.36μs (3.60% faster)
    codeflash_output = broadcast_shapes((5, 6), ()) # 1.63μs -> 1.20μs (36.5% faster)
    codeflash_output = broadcast_shapes((), ()) # 791ns -> 853ns (7.27% slower)

def test_broadcast_shapes_with_none():
    # Broadcasting with None dimensions
    codeflash_output = broadcast_shapes((None, 3), (2, 1)) # 2.11μs -> 2.02μs (4.76% faster)
    codeflash_output = broadcast_shapes((2, None), (1, 4)) # 1.20μs -> 921ns (30.6% faster)
    codeflash_output = broadcast_shapes((None, 1), (None, 5)) # 874ns -> 743ns (17.6% faster)
    codeflash_output = broadcast_shapes((None, 1), (3, 1)) # 696ns -> 578ns (20.4% faster)
    codeflash_output = broadcast_shapes((None, 1), (1, 1)) # 712ns -> 632ns (12.7% faster)
    codeflash_output = broadcast_shapes((1, None), (1, 7)) # 673ns -> 601ns (12.0% faster)

# ----------------
# Edge Test Cases
# ----------------

def test_broadcast_shapes_incompatible():
    # Shapes that cannot be broadcasted
    with pytest.raises(ValueError):
        broadcast_shapes((2, 3), (3, 2)) # 4.45μs -> 4.21μs (5.75% faster)
    with pytest.raises(ValueError):
        broadcast_shapes((4,), (5,)) # 2.52μs -> 2.21μs (13.6% faster)
    with pytest.raises(ValueError):
        broadcast_shapes((1, 2, 3), (2, 1, 4)) # 2.70μs -> 2.40μs (12.8% faster)
    with pytest.raises(ValueError):
        broadcast_shapes((None, 2), (3, 3)) # 2.33μs -> 2.10μs (10.7% faster)

def test_broadcast_shapes_empty_lists():
    # Both shapes are empty lists
    codeflash_output = broadcast_shapes([], []) # 1.42μs -> 1.52μs (7.03% slower)
    # One shape is empty
    codeflash_output = broadcast_shapes([], [5, 6]) # 2.14μs -> 1.97μs (8.63% faster)
    codeflash_output = broadcast_shapes([7, 8], []) # 1.50μs -> 1.27μs (17.7% faster)

def test_broadcast_shapes_all_ones():
    # All dimensions are 1
    codeflash_output = broadcast_shapes((1, 1, 1), (1, 1, 1)) # 2.00μs -> 1.86μs (7.70% faster)
    codeflash_output = broadcast_shapes((1, 1, 1), (1,)) # 1.58μs -> 1.48μs (7.18% faster)

def test_broadcast_shapes_all_none():
    # All dimensions are None
    codeflash_output = broadcast_shapes((None, None), (None, None)) # 2.02μs -> 1.93μs (4.45% faster)
    codeflash_output = broadcast_shapes((None, None), (1, 1)) # 1.23μs -> 1.01μs (21.7% faster)
    codeflash_output = broadcast_shapes((1, 1), (None, None)) # 784ns -> 605ns (29.6% faster)

def test_broadcast_shapes_mixed_none_and_one():
    # Mixture of None and 1
    codeflash_output = broadcast_shapes((None, 1), (1, None)) # 1.94μs -> 1.91μs (1.31% faster)
    codeflash_output = broadcast_shapes((None, 1), (1, 7)) # 1.03μs -> 754ns (36.3% faster)
    codeflash_output = broadcast_shapes((1, None), (5, 1)) # 844ns -> 705ns (19.7% faster)

def test_broadcast_shapes_with_zero():
    # Zero dimension (e.g. for empty arrays)
    codeflash_output = broadcast_shapes((0, 3), (1, 3)) # 2.12μs -> 2.01μs (5.62% faster)
    codeflash_output = broadcast_shapes((1, 0), (2, 1)) # 1.13μs -> 950ns (18.6% faster)
    with pytest.raises(ValueError):
        broadcast_shapes((0, 3), (2, 3)) # 3.38μs -> 2.98μs (13.2% faster)

def test_broadcast_shapes_non_tuple_list_inputs():
    # Accept both tuple and list inputs
    codeflash_output = broadcast_shapes([2, 3], (1, 3)) # 2.16μs -> 2.09μs (3.20% faster)
    codeflash_output = broadcast_shapes((1, 3), [2, 3]) # 1.23μs -> 1.07μs (15.0% faster)

# --------------------------
# Large Scale Test Cases
# --------------------------

def test_broadcast_shapes_large_rank():
    # Broadcasting large rank shapes (up to 1000 dims)
    shape1 = [1] * 1000
    shape2 = [2] * 1000
    # Should broadcast to all 2s
    codeflash_output = broadcast_shapes(shape1, shape2) # 49.0μs -> 49.5μs (1.09% slower)

    # Broadcasting with one shape shorter
    shape3 = [3] * 500
    shape4 = [1] * 500 + [3] * 500
    codeflash_output = broadcast_shapes(shape3, shape4) # 83.0μs -> 74.2μs (12.0% faster)

def test_broadcast_shapes_large_values():
    # Large values in dimensions
    shape1 = [1, 999999, 1]
    shape2 = [888888, 1, 777777]
    codeflash_output = broadcast_shapes(shape1, shape2) # 2.21μs -> 2.06μs (7.19% faster)

def test_broadcast_shapes_large_and_mixed_none():
    # Large shape with None in random positions
    shape1 = [None if i % 2 == 0 else 1 for i in range(100)]
    shape2 = [1 if i % 2 == 0 else 5 for i in range(100)]
    expected = [None if i % 2 == 0 else 5 for i in range(100)]
    codeflash_output = broadcast_shapes(shape1, shape2) # 8.21μs -> 7.44μs (10.3% faster)

def test_broadcast_shapes_performance():
    # Performance test: should not take too long for 1000 dims
    shape1 = [1] * 1000
    shape2 = [1] * 1000
    codeflash_output = broadcast_shapes(shape1, shape2) # 49.6μs -> 50.0μs (0.772% slower)

# --------------------------
# Additional Robustness Tests
# --------------------------

@pytest.mark.parametrize("s1,s2,expected", [
    # Basic broadcasting
    ((2, 3), (1, 3), [2, 3]),
    ((1, 3), (2, 3), [2, 3]),
    # Scalar
    ((), (4, 5), [4, 5]),
    ((4, 5), (), [4, 5]),
    # None and one
    ((None, 1), (1, 6), [None, 6]),
    # Large shape
    ([1]*500, [2]*500, [2]*500),
])
def test_broadcast_shapes_parametrized(s1, s2, expected):
    # Parametrized test for various cases
    codeflash_output = broadcast_shapes(s1, s2) # 40.3μs -> 40.4μs (0.027% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

To edit these changes, run `git checkout codeflash/optimize-broadcast_shapes-mjafiihs` and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 December 17, 2025 19:53
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels Dec 17, 2025