Conversation

@codeflash-ai codeflash-ai bot commented Dec 17, 2025

📄 20% (0.20x) speedup for compute_conv_output_shape in keras/src/ops/operation_utils.py

⏱️ Runtime : 998 microseconds → 834 microseconds (best of 132 runs)

📝 Explanation and details

The optimized code achieves a 19% speedup through several key memory and computation efficiency improvements:

Key Optimizations:

  1. Reduced NumPy array allocations: The original code created np.array(spatial_shape) early and mutated it for None values. The optimized version uses a mutable Python list (tmp_spatial_shape) for None handling, then creates NumPy arrays only when needed for vectorized math operations.

  2. More efficient array creation: Replaced np.array() calls with np.fromiter(), which is more efficient for creating arrays from iterables when the size is known, avoiding intermediate list creation.

  3. Eliminated redundant conversions: The original code converted the entire output_spatial_shape array to integers via list comprehension [int(i) for i in output_spatial_shape]. The optimized version uses .tolist() to convert NumPy arrays to Python lists once, then converts to integers in a single pass.

  4. Pre-computed dimensions: Cached len(input_shape) as ndim and len(spatial_shape) as spatial_ndim to avoid repeated len() calls.
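
As a rough, self-contained illustration of the patterns listed above (hypothetical variable names and a simplified "valid"-padding path only, not the actual Keras diff):

```python
import numpy as np

# Example inputs: one unknown spatial dim (None) and one known dim.
spatial_shape = (None, 512)
kernel_spatial_shape = (3, 3)
dilation_rate = (1, 1)
strides = 2

# (1) Handle None entries on a plain Python list instead of mutating an np.array.
none_mask = [d is None for d in spatial_shape]
tmp_spatial_shape = [-1 if d is None else d for d in spatial_shape]

# (4) Cache the length once instead of making repeated len() calls.
spatial_ndim = len(tmp_spatial_shape)

# (2) Build arrays with np.fromiter when the element count is known up front.
spatial = np.fromiter(tmp_spatial_shape, dtype=np.int64, count=spatial_ndim)
kernel = np.fromiter(kernel_spatial_shape, dtype=np.int64, count=spatial_ndim)
dilation = np.fromiter(dilation_rate, dtype=np.int64, count=spatial_ndim)

# Vectorized "valid"-padding arithmetic ...
out = (spatial - dilation * (kernel - 1) - 1) // strides + 1

# (3) ... then a single .tolist() pass back to Python ints, restoring None.
out = tuple(None if m else int(v) for m, v in zip(none_mask, out.tolist()))
print(out)  # (None, 255)
```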

Performance Impact:

The function is called from critical paths in Keras convolutional layers (base_conv.py, base_depthwise_conv.py, base_separable_conv.py) during shape computation, which happens frequently during model construction and inference. The optimizations are particularly effective for:

  • Edge cases with None dimensions: Up to 28.9% faster (e.g., test_large_2d_conv_channels_last_none_dim)
  • Complex scenarios: 17-25% improvements for multi-dimensional convolutions and cases with None spatial dimensions
  • Error cases: Significantly faster error detection (253% improvement in test_edge_negative_output_size)

The optimizations maintain identical behavior while reducing memory allocations and computational overhead, making convolution shape computation more efficient across all Keras convolutional layer types.
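
For context, the shape arithmetic being optimized follows the standard convolution output formulas, which the generated tests below also spell out in their comments. A minimal usage sketch (values chosen for illustration; the import path matches the one used in the tests):

```python
from keras.src.ops.operation_utils import compute_conv_output_shape

# "valid" padding:          out = floor((in - dilation * (kernel - 1) - 1) / stride) + 1
# "same"/"causal" padding:  out = floor((in - 1) / stride) + 1
print(compute_conv_output_shape((1, 28, 28, 1), 8, (3, 3), strides=1, padding="valid"))
# -> (1, 26, 26, 8)
print(compute_conv_output_shape((1, 28, 28, 1), 8, (3, 3), strides=2, padding="same"))
# -> (1, 14, 14, 8)
```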

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 43 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
```python
import numpy as np
# imports
import pytest
from keras.src.ops.operation_utils import compute_conv_output_shape

# unit tests

# ----------- BASIC TEST CASES -----------

def test_basic_1d_conv_valid():
    # 1D conv, batch size 8, length 32, 3 input channels, 16 filters, kernel=3, stride=1, valid padding
    codeflash_output = compute_conv_output_shape((8, 32, 3), 16, (3,), strides=1, padding="valid"); out = codeflash_output # 23.7μs -> 22.0μs (7.38% faster)

def test_basic_1d_conv_same():
    # 1D conv, batch size 4, length 10, 2 input channels, 5 filters, kernel=3, stride=1, same padding
    codeflash_output = compute_conv_output_shape((4, 10, 2), 5, (3,), strides=1, padding="same"); out = codeflash_output # 19.1μs -> 17.5μs (8.68% faster)

def test_basic_2d_conv_channels_last():
    # 2D conv, batch size 2, 28x28, 1 input channel, 8 filters, kernel=3x3, stride=1, valid padding
    codeflash_output = compute_conv_output_shape((2, 28, 28, 1), 8, (3, 3), strides=1, padding="valid"); out = codeflash_output # 23.6μs -> 21.0μs (12.4% faster)

def test_basic_2d_conv_channels_first():
    # 2D conv, batch size 2, 1 input channel, 32x32, 8 filters, kernel=5x5, stride=2, same padding
    codeflash_output = compute_conv_output_shape((2, 1, 32, 32), 8, (5, 5), strides=2, padding="same", data_format="channels_first"); out = codeflash_output # 18.8μs -> 17.7μs (5.77% faster)

def test_basic_3d_conv():
    # 3D conv, batch size 1, 10x10x10, 4 input channels, 6 filters, kernel=3x3x3, stride=1, valid padding
    codeflash_output = compute_conv_output_shape((1, 10, 10, 10, 4), 6, (3, 3, 3), strides=1, padding="valid"); out = codeflash_output # 23.8μs -> 20.4μs (17.0% faster)

def test_basic_dilation():
    # 2D conv, batch size 1, 10x10, 3 input channels, 7 filters, kernel=3x3, stride=1, dilation=2, valid padding
    codeflash_output = compute_conv_output_shape((1, 10, 10, 3), 7, (3, 3), strides=1, padding="valid", dilation_rate=2); out = codeflash_output # 21.9μs -> 20.1μs (9.27% faster)

def test_basic_tuple_strides_and_dilation():
    # 2D conv, batch size 1, 20x10, 3 input channels, 5 filters, kernel=3x3, stride=(2,1), dilation=(1,2), valid padding
    codeflash_output = compute_conv_output_shape((1, 20, 10, 3), 5, (3, 3), strides=(2, 1), padding="valid", dilation_rate=(1, 2)); out = codeflash_output # 21.8μs -> 19.8μs (9.83% faster)

# ----------- EDGE TEST CASES -----------

def test_edge_none_batch():
    # Batch size is None, should propagate None in output
    codeflash_output = compute_conv_output_shape((None, 28, 28, 3), 10, (3, 3), strides=1, padding="valid"); out = codeflash_output # 22.3μs -> 19.8μs (12.9% faster)

def test_edge_none_spatial():
    # Spatial dim is None, should propagate None in output
    codeflash_output = compute_conv_output_shape((4, None, 28, 3), 5, (3, 3), strides=1, padding="valid"); out = codeflash_output # 26.2μs -> 20.8μs (25.6% faster)

def test_edge_none_multiple_spatial():
    # Multiple spatial dims are None
    codeflash_output = compute_conv_output_shape((4, None, None, 3), 7, (3, 3), strides=1, padding="valid"); out = codeflash_output # 24.5μs -> 20.3μs (20.3% faster)

def test_edge_invalid_kernel_shape():
    # Kernel shape does not match input rank
    with pytest.raises(ValueError):
        compute_conv_output_shape((4, 28, 28, 3), 5, (3,), strides=1, padding="valid") # 3.87μs -> 3.95μs (2.02% slower)

def test_edge_invalid_dilation_length():
    # Dilation tuple length does not match spatial dims
    with pytest.raises(ValueError):
        compute_conv_output_shape((1, 10, 10, 3), 5, (3, 3), strides=1, padding="valid", dilation_rate=(1, 2, 3)) # 4.30μs -> 4.35μs (1.20% slower)

def test_edge_invalid_padding():
    # Padding is not valid or same or causal
    with pytest.raises(ValueError):
        compute_conv_output_shape((1, 10, 10, 3), 5, (3, 3), strides=1, padding="foo") # 10.6μs -> 9.74μs (8.91% faster)

def test_edge_negative_output_size():
    # Kernel and dilation too large for input
    with pytest.raises(ValueError):
        compute_conv_output_shape((1, 4, 4, 3), 5, (5, 5), strides=1, padding="valid", dilation_rate=2) # 102μs -> 29.0μs (253% faster)

def test_edge_large_stride_greater_than_input():
    # Stride greater than input size, valid padding
    codeflash_output = compute_conv_output_shape((1, 5, 5, 3), 2, (3, 3), strides=6, padding="valid"); out = codeflash_output # 37.8μs -> 32.7μs (15.6% faster)

def test_edge_large_stride_same_padding():
    # Stride greater than input size, same padding
    codeflash_output = compute_conv_output_shape((1, 5, 5, 3), 2, (3, 3), strides=6, padding="same"); out = codeflash_output # 21.2μs -> 19.7μs (7.39% faster)

# ----------- LARGE SCALE TEST CASES -----------

def test_large_1d_conv():
    # Large 1D input, batch size 1, length 1000, 8 input channels, 16 filters, kernel=5, stride=2, valid padding
    codeflash_output = compute_conv_output_shape((1, 1000, 8), 16, (5,), strides=2, padding="valid"); out = codeflash_output # 23.2μs -> 21.4μs (8.36% faster)

def test_large_2d_conv():
    # Large 2D input, batch size 2, 512x512, 3 input channels, 32 filters, kernel=7x7, stride=1, same padding
    codeflash_output = compute_conv_output_shape((2, 512, 512, 3), 32, (7, 7), strides=1, padding="same"); out = codeflash_output # 19.5μs -> 18.2μs (7.17% faster)

def test_large_3d_conv():
    # Large 3D input, batch size 1, 64x64x64, 4 input channels, 8 filters, kernel=3x3x3, stride=2, valid padding
    codeflash_output = compute_conv_output_shape((1, 64, 64, 64, 4), 8, (3, 3, 3), strides=2, padding="valid"); out = codeflash_output # 23.8μs -> 21.4μs (11.2% faster)

def test_large_channels_first():
    # Large input, channels_first format, batch size 4, 16 input channels, 128x128, 32 filters, kernel=3x3, stride=1, valid padding
    codeflash_output = compute_conv_output_shape((4, 16, 128, 128), 32, (3, 3), strides=1, padding="valid", data_format="channels_first"); out = codeflash_output # 22.8μs -> 20.5μs (11.2% faster)

def test_large_dilation():
    # Large input with large dilation
    codeflash_output = compute_conv_output_shape((2, 256, 256, 3), 16, (5, 5), strides=1, padding="valid", dilation_rate=4); out = codeflash_output # 22.0μs -> 19.4μs (13.4% faster)

def test_large_tuple_stride_and_dilation():
    # Large 2D input, tuple stride and dilation
    codeflash_output = compute_conv_output_shape((1, 512, 256, 3), 8, (7, 7), strides=(2,4), padding="valid", dilation_rate=(2,3)); out = codeflash_output # 22.3μs -> 20.2μs (10.1% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
import numpy as np
# imports
import pytest  # used for our unit tests
from keras.src.ops.operation_utils import compute_conv_output_shape

# unit tests

# ---------------------------
# Basic Test Cases
# ---------------------------

def test_basic_1d_conv_channels_last_valid():
    # 1D conv, channels_last, valid padding, stride 1
    # input: (batch, length, channels)
    input_shape = (2, 10, 3)
    filters = 5
    kernel_size = (3,)
    expected = (2, 8, 5)  # (10 - 3 + 1 = 8)
    codeflash_output = compute_conv_output_shape(input_shape, filters, kernel_size); result = codeflash_output # 21.7μs -> 20.3μs (6.87% faster)

def test_basic_1d_conv_channels_last_same():
    # 1D conv, channels_last, same padding, stride 1
    input_shape = (2, 10, 3)
    filters = 5
    kernel_size = (3,)
    expected = (2, 10, 5)
    codeflash_output = compute_conv_output_shape(input_shape, filters, kernel_size, padding="same"); result = codeflash_output # 18.4μs -> 17.7μs (4.17% faster)

def test_basic_2d_conv_channels_last_valid():
    # 2D conv, channels_last, valid padding, stride 1
    input_shape = (1, 28, 28, 1)
    filters = 32
    kernel_size = (3, 3)
    expected = (1, 26, 26, 32)  # (28-3+1=26)
    codeflash_output = compute_conv_output_shape(input_shape, filters, kernel_size); result = codeflash_output # 22.4μs -> 19.8μs (12.9% faster)

def test_basic_2d_conv_channels_first_valid():
    # 2D conv, channels_first, valid padding, stride 1
    input_shape = (1, 1, 28, 28)
    filters = 32
    kernel_size = (3, 3)
    expected = (1, 32, 26, 26)
    codeflash_output = compute_conv_output_shape(input_shape, filters, kernel_size, data_format="channels_first"); result = codeflash_output # 21.7μs -> 20.0μs (8.32% faster)

def test_basic_2d_conv_channels_last_stride_2():
    # 2D conv, channels_last, stride 2
    input_shape = (1, 28, 28, 1)
    filters = 32
    kernel_size = (3, 3)
    strides = 2
    expected = (1, 13, 13, 32)  # floor((28-3+1)/2) = floor(26/2) = 13
    codeflash_output = compute_conv_output_shape(input_shape, filters, kernel_size, strides=strides); result = codeflash_output # 21.5μs -> 20.4μs (5.17% faster)

def test_basic_3d_conv_channels_last_valid():
    # 3D conv, channels_last, valid padding, stride 1
    input_shape = (2, 16, 16, 16, 4)
    filters = 8
    kernel_size = (3, 3, 3)
    expected = (2, 14, 14, 14, 8)
    codeflash_output = compute_conv_output_shape(input_shape, filters, kernel_size); result = codeflash_output # 22.5μs -> 20.1μs (11.5% faster)

def test_basic_3d_conv_channels_first_same():
    # 3D conv, channels_first, same padding
    input_shape = (2, 4, 16, 16, 16)
    filters = 8
    kernel_size = (3, 3, 3)
    expected = (2, 8, 16, 16, 16)
    codeflash_output = compute_conv_output_shape(input_shape, filters, kernel_size, padding="same", data_format="channels_first"); result = codeflash_output # 19.6μs -> 17.8μs (10.3% faster)

# ---------------------------
# Edge Test Cases
# ---------------------------

def test_edge_kernel_size_equals_input():
    # Kernel size equals input spatial dimension, valid padding
    input_shape = (1, 5, 1)
    filters = 2
    kernel_size = (5,)
    expected = (1, 1, 2)
    codeflash_output = compute_conv_output_shape(input_shape, filters, kernel_size); result = codeflash_output # 21.4μs -> 20.7μs (3.21% faster)

def test_edge_stride_larger_than_input():
    # Stride larger than input spatial dimension, valid padding
    input_shape = (1, 4, 1)
    filters = 2
    kernel_size = (2,)
    strides = 5
    expected = (1, 1, 2)  # floor((4-2+1)/5)+1 = floor(3/5)+1 = 0+1=1
    codeflash_output = compute_conv_output_shape(input_shape, filters, kernel_size, strides=strides); result = codeflash_output # 36.0μs -> 32.4μs (11.2% faster)

def test_edge_invalid_padding():
    # Invalid padding value should raise
    input_shape = (1, 10, 1)
    filters = 2
    kernel_size = (3,)
    with pytest.raises(ValueError):
        compute_conv_output_shape(input_shape, filters, kernel_size, padding="foobar") # 12.8μs -> 11.7μs (9.48% faster)

def test_edge_invalid_dilation_length():
    # Dilation tuple of wrong length should raise
    input_shape = (1, 10, 1)
    filters = 2
    kernel_size = (3,)
    dilation_rate = (2, 3)  # only 1 spatial dim
    with pytest.raises(ValueError):
        compute_conv_output_shape(input_shape, filters, kernel_size, dilation_rate=dilation_rate) # 4.31μs -> 4.55μs (5.30% slower)

def test_edge_none_spatial_dim():
    # None spatial dimension should propagate to output
    input_shape = (1, None, 1)
    filters = 2
    kernel_size = (3,)
    expected = (1, None, 2)
    codeflash_output = compute_conv_output_shape(input_shape, filters, kernel_size); result = codeflash_output # 36.2μs -> 30.8μs (17.5% faster)

def test_edge_kernel_shape_vs_input_shape_mismatch():
    # Kernel shape length mismatch with input shape should raise
    input_shape = (1, 10, 1)
    filters = 2
    kernel_size = (3, 3)  # 2D kernel for 1D input
    with pytest.raises(ValueError):
        compute_conv_output_shape(input_shape, filters, kernel_size) # 3.85μs -> 3.90μs (1.21% slower)

def test_edge_causal_padding():
    # Causal padding should behave like same for shape calculation
    input_shape = (1, 10, 1)
    filters = 2
    kernel_size = (3,)
    expected = (1, 10, 2)
    codeflash_output = compute_conv_output_shape(input_shape, filters, kernel_size, padding="causal"); result = codeflash_output # 33.4μs -> 30.5μs (9.51% faster)

# ---------------------------
# Large Scale Test Cases
# ---------------------------

def test_large_1d_conv_channels_last():
    # Large 1D input, stride 1, valid padding
    input_shape = (8, 1000, 3)
    filters = 16
    kernel_size = (5,)
    expected = (8, 996, 16)  # 1000-5+1=996
    codeflash_output = compute_conv_output_shape(input_shape, filters, kernel_size); result = codeflash_output # 24.8μs -> 22.1μs (12.4% faster)

def test_large_2d_conv_channels_last():
    # Large 2D input, stride 2, same padding
    input_shape = (4, 512, 512, 3)
    filters = 32
    kernel_size = (3, 3)
    strides = 2
    expected = (4, 256, 256, 32)  # floor((512-1)/2)+1 = 256
    codeflash_output = compute_conv_output_shape(input_shape, filters, kernel_size, strides=strides, padding="same"); result = codeflash_output # 20.8μs -> 19.8μs (4.92% faster)

def test_large_3d_conv_channels_first():
    # Large 3D input, channels_first, valid padding, stride 1
    input_shape = (2, 4, 64, 64, 64)
    filters = 8
    kernel_size = (3, 3, 3)
    expected = (2, 8, 62, 62, 62)
    codeflash_output = compute_conv_output_shape(input_shape, filters, kernel_size, data_format="channels_first"); result = codeflash_output # 24.3μs -> 21.4μs (13.6% faster)

def test_large_dilation_and_stride():
    # Large input, large dilation and stride
    input_shape = (1, 1000, 3)
    filters = 4
    kernel_size = (5,)
    strides = 10
    dilation_rate = 3
    # effective kernel size = 3*(5-1)+1 = 13
    # output = floor((1000-13)/10)+1 = floor(987/10)+1 = 98+1=99
    expected = (1, 99, 4)
    codeflash_output = compute_conv_output_shape(input_shape, filters, kernel_size, strides=strides, dilation_rate=dilation_rate); result = codeflash_output # 22.0μs -> 20.0μs (9.87% faster)

def test_large_2d_conv_channels_last_none_dim():
    # Large 2D input with None spatial dimension
    input_shape = (8, None, 512, 3)
    filters = 32
    kernel_size = (3, 3)
    strides = 2
    expected = (8, None, 255, 32)  # None propagates, floor((512-3+1)/2)=255
    codeflash_output = compute_conv_output_shape(input_shape, filters, kernel_size, strides=strides); result = codeflash_output # 27.5μs -> 21.4μs (28.9% faster)

def test_large_2d_conv_channels_first_stride_tuple():
    # Large 2D input, channels_first, stride tuple
    input_shape = (4, 3, 512, 512)
    filters = 32
    kernel_size = (3, 3)
    strides = (2, 4)
    expected = (4, 32, 255, 127)  # floor((512-3+1)/2)=255, floor((512-3+1)/4)=127
    codeflash_output = compute_conv_output_shape(input_shape, filters, kernel_size, strides=strides, data_format="channels_first"); result = codeflash_output # 23.9μs -> 20.2μs (18.6% faster)

def test_large_3d_conv_channels_last_same():
    # Large 3D input, channels_last, same padding
    input_shape = (2, 100, 100, 100, 8)
    filters = 16
    kernel_size = (5, 5, 5)
    expected = (2, 100, 100, 100, 16)
    codeflash_output = compute_conv_output_shape(input_shape, filters, kernel_size, padding="same"); result = codeflash_output # 20.0μs -> 18.5μs (8.41% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

To edit these changes, run `git checkout codeflash/optimize-compute_conv_output_shape-mjag63o1` and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 December 17, 2025 20:11
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels Dec 17, 2025