Skip to content

Conversation

@codeflash-ai
Copy link

@codeflash-ai codeflash-ai bot commented Dec 17, 2025

📄 26% (0.26x) speedup for _depth in keras/src/applications/mobilenet_v3.py

⏱️ Runtime : 433 microseconds 345 microseconds (best of 250 runs)

📝 Explanation and details

The optimization improves performance by 25% through three key changes that reduce computational overhead:

1. Eliminates float division: The original code uses divisor / 2 which creates a float intermediate value. The optimized version uses divisor // 2 (integer division), keeping all arithmetic in the integer domain until the final float comparison.

2. Reduces function call overhead: Instead of using Python's max() function to ensure the result is at least min_value, the optimization uses a direct conditional check if new_v < min_value: new_v = min_value. This avoids the function call overhead of max().

3. Single type conversion: The original code calls int() within a complex expression, while the optimized version does one upfront conversion vd = int(v) and then uses integer arithmetic throughout the main calculation.

Performance impact in MobileNetV3: The function references show _depth() is called extensively during model construction - in _inverted_res_block(), _se_block(), and the main MobileNetV3() function. Since MobileNetV3 uses multiple inverted residual blocks (typically 11+ blocks), each calling _depth() multiple times for channel calculations, this 25% speedup compounds significantly during model initialization.

Test case performance: The optimization shows consistent 20-40% improvements across all test scenarios, with particularly strong gains (30-60%) on smaller input values that are common in neural network channel calculations. The optimization maintains identical behavior for edge cases like the 10% rounding rule and min_value constraints.

This optimization is especially valuable since _depth() is in the hot path of model construction where channel dimensions are calculated repeatedly.

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 🔘 None Found
🌀 Generated Regression Tests 993 Passed
⏪ Replay Tests 🔘 None Found
🔎 Concolic Coverage Tests 🔘 None Found
📊 Tests Coverage 100.0%
🌀 Generated Regression Tests and Runtime
import pytest  # used for our unit tests
from keras.src.applications.mobilenet_v3 import _depth

# unit tests

# -------------------------
# Basic Test Cases
# -------------------------

def test_basic_divisible_by_divisor():
    # v is already divisible by divisor
    codeflash_output = _depth(16, divisor=8) # 1.94μs -> 1.44μs (34.3% faster)
    codeflash_output = _depth(32, divisor=8) # 688ns -> 603ns (14.1% faster)
    codeflash_output = _depth(64, divisor=16) # 440ns -> 353ns (24.6% faster)

def test_basic_rounding_up():
    # v not divisible by divisor, should round to nearest
    codeflash_output = _depth(15, divisor=8) # 1.68μs -> 1.30μs (29.2% faster)
    codeflash_output = _depth(23, divisor=8) # 678ns -> 544ns (24.6% faster)
    codeflash_output = _depth(33, divisor=16) # 444ns -> 359ns (23.7% faster)

def test_basic_with_min_value():
    # min_value larger than calculated value
    codeflash_output = _depth(5, divisor=8, min_value=16) # 1.73μs -> 1.42μs (22.1% faster)
    codeflash_output = _depth(7, divisor=4, min_value=12) # 664ns -> 508ns (30.7% faster)
    # min_value smaller than calculated value
    codeflash_output = _depth(20, divisor=8, min_value=8) # 511ns -> 432ns (18.3% faster)

def test_basic_default_min_value():
    # min_value is None, should default to divisor
    codeflash_output = _depth(3, divisor=8) # 1.69μs -> 1.33μs (26.4% faster)
    codeflash_output = _depth(0, divisor=8) # 760ns -> 664ns (14.5% faster)

# -------------------------
# Edge Test Cases
# -------------------------

def test_edge_zero_and_negative():
    # v=0, should return min_value (which defaults to divisor)
    codeflash_output = _depth(0, divisor=8) # 1.65μs -> 1.29μs (27.7% faster)
    # negative v, should still return at least min_value
    codeflash_output = _depth(-5, divisor=8) # 1.03μs -> 852ns (21.1% faster)
    # negative v with larger min_value
    codeflash_output = _depth(-10, divisor=8, min_value=16) # 839ns -> 758ns (10.7% faster)

def test_edge_small_divisor():
    # divisor=1, should round to nearest integer >= min_value
    codeflash_output = _depth(3.7, divisor=1) # 1.74μs -> 1.39μs (24.9% faster)
    codeflash_output = _depth(2.2, divisor=1, min_value=3) # 876ns -> 847ns (3.42% faster)

def test_edge_large_min_value():
    # min_value much larger than v
    codeflash_output = _depth(5, divisor=8, min_value=64) # 1.69μs -> 1.33μs (26.2% faster)

def test_edge_round_down_more_than_10_percent():
    # If rounding down by more than 10%, should add divisor
    # e.g. v=23, divisor=8: (23+4)//8*8=27//8*8=3*8=24 (but 24<0.9*23? 24>20.7, so no)
    # Let's find a value where this triggers:
    # v=35, divisor=16: (35+8)//16*16=43//16*16=2*16=32, 32 < 0.9*35=31.5? No.
    # v=33, divisor=16: (33+8)//16*16=41//16*16=2*16=32, 32 < 0.9*33=29.7? No.
    # v=17, divisor=8: (17+4)//8*8=21//8*8=2*8=16, 16 < 0.9*17=15.3? No.
    # v=19, divisor=8: (19+4)//8*8=23//8*8=2*8=16, 16 < 0.9*19=17.1, so 16<17.1, so add divisor: 16+8=24
    codeflash_output = _depth(19, divisor=8) # 1.86μs -> 1.38μs (35.0% faster)

def test_edge_exactly_10_percent():
    # If rounding down is exactly 10%, should NOT add divisor
    # v = 20, divisor=8: (20+4)//8*8=24//8*8=3*8=24, 24 < 0.9*20=18, so no.
    # Try v=20, divisor=8: (20+4)//8*8=24//8*8=3*8=24, 24<18? No. Try v=18, divisor=8: (18+4)//8*8=22//8*8=2*8=16, 16<16.2, so 16<16.2, so add divisor: 16+8=24
    # Let's check v=20, divisor=8: (20+4)//8*8=24//8*8=3*8=24, 24<18? No.
    # Let's try v=10, divisor=8: (10+4)//8*8=14//8*8=1*8=8, 8<9, so 8<9, so add divisor: 8+8=16
    codeflash_output = _depth(10, divisor=8) # 1.82μs -> 1.38μs (31.9% faster)

def test_edge_float_inputs():
    # v is float, should still work
    codeflash_output = _depth(15.5, divisor=8) # 1.75μs -> 1.30μs (34.8% faster)
    codeflash_output = _depth(23.9, divisor=8) # 659ns -> 603ns (9.29% faster)
    codeflash_output = _depth(7.1, divisor=4, min_value=12) # 725ns -> 726ns (0.138% slower)

def test_edge_large_divisor():
    # divisor larger than v, should return min_value
    codeflash_output = _depth(5, divisor=32) # 1.76μs -> 1.30μs (35.3% faster)
    codeflash_output = _depth(30, divisor=32) # 747ns -> 670ns (11.5% faster)

def test_edge_min_value_zero():
    # min_value=0, should allow zero output if v is 0
    codeflash_output = _depth(0, divisor=8, min_value=0) # 1.76μs -> 1.35μs (30.7% faster)

def test_edge_min_value_less_than_divisor():
    # min_value < divisor, should use min_value if it's larger than calculated value
    codeflash_output = _depth(3, divisor=8, min_value=4) # 1.71μs -> 1.41μs (21.4% faster)

# -------------------------
# Large Scale Test Cases
# -------------------------

def test_large_scale_many_inputs():
    # Test a range of values to ensure consistency and performance
    for v in range(1, 1000, 7):
        codeflash_output = _depth(v, divisor=8); result = codeflash_output # 60.8μs -> 48.7μs (24.8% faster)

def test_large_scale_high_divisor():
    # Test with a high divisor and large v
    for v in range(100, 1000, 111):
        codeflash_output = _depth(v, divisor=64); result = codeflash_output # 5.38μs -> 4.35μs (23.8% faster)

def test_large_scale_high_min_value():
    # Test with a high min_value
    for v in range(1, 1000, 99):
        codeflash_output = _depth(v, divisor=8, min_value=256); result = codeflash_output # 6.53μs -> 5.38μs (21.4% faster)

def test_large_scale_float_inputs():
    # Test with float v over a range
    for v in [float(i) + 0.5 for i in range(1, 1000, 101)]:
        codeflash_output = _depth(v, divisor=16); result = codeflash_output # 5.74μs -> 5.07μs (13.2% faster)

def test_large_scale_edge_rounding():
    # Test that rounding does not go down by more than 10%
    for v in range(10, 1000, 37):
        codeflash_output = _depth(v, divisor=8); result = codeflash_output # 13.0μs -> 10.4μs (24.7% faster)
        if result < 0.9 * v:
            # This should never happen due to the function's logic
            raise AssertionError(f"Rounding down by more than 10% for v={v}, result={result}")
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pytest  # used for our unit tests
from keras.src.applications.mobilenet_v3 import _depth

# unit tests

# -------- BASIC TEST CASES --------

def test_basic_exact_multiple():
    # v is already a multiple of divisor
    codeflash_output = _depth(32, 8) # 1.50μs -> 1.11μs (34.8% faster)
    codeflash_output = _depth(64, 8) # 586ns -> 457ns (28.2% faster)
    codeflash_output = _depth(16, 4) # 373ns -> 283ns (31.8% faster)

def test_basic_rounding_up():
    # v is not a multiple, should round to nearest (down unless < 10% drop)
    codeflash_output = _depth(33, 8) # 1.43μs -> 1.00μs (42.7% faster)
    codeflash_output = _depth(35, 8) # 562ns -> 415ns (35.4% faster)
    codeflash_output = _depth(39, 8) # 372ns -> 277ns (34.3% faster)
    codeflash_output = _depth(41, 8) # 355ns -> 271ns (31.0% faster)

def test_basic_min_value_default():
    # min_value defaults to divisor
    codeflash_output = _depth(2, 8) # 1.48μs -> 1.08μs (37.3% faster)
    codeflash_output = _depth(7, 8) # 663ns -> 520ns (27.5% faster)
    codeflash_output = _depth(8, 8) # 367ns -> 279ns (31.5% faster)

def test_basic_min_value_explicit():
    # min_value is set explicitly
    codeflash_output = _depth(5, 8, 4) # 1.48μs -> 914ns (62.1% faster)
    codeflash_output = _depth(5, 8, 16) # 636ns -> 493ns (29.0% faster)
    codeflash_output = _depth(17, 8, 16) # 406ns -> 341ns (19.1% faster)
    codeflash_output = _depth(7, 8, 4) # 367ns -> 289ns (27.0% faster)

def test_basic_divisor_other():
    # divisor other than 8
    codeflash_output = _depth(22, 6) # 1.43μs -> 1.00μs (43.2% faster)
    codeflash_output = _depth(11, 5) # 543ns -> 448ns (21.2% faster)
    codeflash_output = _depth(13, 5) # 367ns -> 276ns (33.0% faster)
    codeflash_output = _depth(14, 7) # 346ns -> 274ns (26.3% faster)

# -------- EDGE TEST CASES --------

def test_edge_zero_and_negative():
    # v is zero or negative
    codeflash_output = _depth(0, 8) # 1.51μs -> 1.15μs (32.3% faster)
    codeflash_output = _depth(-5, 8) # 857ns -> 745ns (15.0% faster)
    codeflash_output = _depth(0, 8, 2) # 557ns -> 466ns (19.5% faster)
    codeflash_output = _depth(-10, 8, 2) # 522ns -> 469ns (11.3% faster)

def test_edge_small_v_large_divisor():
    # v is smaller than divisor
    codeflash_output = _depth(3, 8) # 1.45μs -> 1.07μs (35.5% faster)
    codeflash_output = _depth(3, 8, 2) # 737ns -> 610ns (20.8% faster)
    codeflash_output = _depth(1, 8, 16) # 406ns -> 303ns (34.0% faster)

def test_edge_min_value_greater_than_v():
    # min_value is greater than v
    codeflash_output = _depth(5, 8, 16) # 1.39μs -> 1.02μs (35.9% faster)
    codeflash_output = _depth(10, 8, 32) # 569ns -> 438ns (29.9% faster)

def test_edge_rounding_down_too_much():
    # Ensure that new_v < 0.9 * v triggers the adjustment
    # For v=23, divisor=8: int(23+4)//8*8 = 24, which is not < 0.9*23
    # For v=21, divisor=8: int(21+4)//8*8 = 24, which is not < 0.9*21
    # For v=15, divisor=8: int(15+4)//8*8 = 16, which is not < 0.9*15
    # For v=19, divisor=8: int(19+4)//8*8 = 24, but 24 is not < 0.9*19
    # For v=17, divisor=8: int(17+4)//8*8 = 16, 16 < 0.9*17=15.3, so should add 8 -> 24
    codeflash_output = _depth(17, 8) # 1.48μs -> 1.05μs (41.4% faster)
    codeflash_output = _depth(9, 8) # 789ns -> 641ns (23.1% faster)

def test_edge_large_min_value():
    # min_value much larger than v and divisor
    codeflash_output = _depth(2, 8, 128) # 1.36μs -> 1.02μs (32.9% faster)
    codeflash_output = _depth(100, 8, 128) # 644ns -> 495ns (30.1% faster)
    codeflash_output = _depth(130, 8, 128) # 433ns -> 372ns (16.4% faster)

def test_edge_large_divisor():
    # divisor larger than v
    codeflash_output = _depth(7, 16) # 1.34μs -> 1.10μs (21.9% faster)
    codeflash_output = _depth(15, 16) # 606ns -> 514ns (17.9% faster)
    codeflash_output = _depth(31, 32) # 362ns -> 272ns (33.1% faster)

def test_edge_min_value_not_multiple_of_divisor():
    # min_value not a multiple of divisor
    codeflash_output = _depth(5, 8, 7) # 1.43μs -> 975ns (46.7% faster)

def test_edge_float_inputs():
    # v as float, divisor as float
    codeflash_output = _depth(15.7, 8) # 1.56μs -> 1.10μs (40.9% faster)
    codeflash_output = _depth(15.7, 8.0) # 1.35μs -> 1.49μs (9.47% slower)
    codeflash_output = _depth(17.2, 8, 4) # 610ns -> 498ns (22.5% faster)
    codeflash_output = _depth(17.2, 8, 20) # 403ns -> 345ns (16.8% faster)

def test_edge_divisor_one():
    # divisor is 1, should round to nearest int >= min_value
    codeflash_output = _depth(5.7, 1) # 1.46μs -> 1.16μs (26.1% faster)
    codeflash_output = _depth(0.2, 1) # 759ns -> 727ns (4.40% faster)
    codeflash_output = _depth(0.2, 1, 3) # 451ns -> 406ns (11.1% faster)

def test_edge_min_value_zero():
    # min_value is zero
    codeflash_output = _depth(0, 8, 0) # 1.58μs -> 1.04μs (52.0% faster)
    codeflash_output = _depth(2, 8, 0) # 775ns -> 582ns (33.2% faster)

def test_edge_large_v_small_divisor():
    # Large v, small divisor
    codeflash_output = _depth(999, 2) # 1.59μs -> 1.20μs (33.1% faster)
    codeflash_output = _depth(1000, 2) # 661ns -> 530ns (24.7% faster)

# -------- LARGE SCALE TEST CASES --------

def test_large_scale_many_inputs():
    # Test a range of v values for consistency and monotonicity
    divisor = 8
    prev = None
    for v in range(1, 1000, 3):
        codeflash_output = _depth(v, divisor); result = codeflash_output # 121μs -> 94.7μs (27.9% faster)
        # Should be non-decreasing as v increases (monotonicity)
        if prev is not None:
            pass
        prev = result

def test_large_scale_varied_divisors():
    # Test a variety of divisors and min_values across a range of v
    for divisor in [2, 4, 8, 16, 32, 64, 100]:
        for min_value in [None, divisor, 2*divisor, 100]:
            for v in range(1, 1000, 97):
                codeflash_output = _depth(v, divisor, min_value); result = codeflash_output
                # Should always be >= min_value or divisor
                expected_min = divisor if min_value is None else min_value

def test_large_scale_min_value_greater_than_v():
    # All results must be at least min_value, even for large arrays
    min_value = 512
    for v in range(1, 500, 23):
        codeflash_output = _depth(v, 8, min_value); result = codeflash_output # 9.35μs -> 7.43μs (25.8% faster)

def test_large_scale_float_inputs():
    # Use floats for v and divisor
    for v in [i + 0.5 for i in range(1, 1000, 101)]:
        for divisor in [2.0, 8.0, 16.0]:
            codeflash_output = _depth(v, divisor); result = codeflash_output

def test_large_scale_edge_rounding_down():
    # Test that the "no more than 10% drop" rule is enforced for many values
    for v in range(10, 1000, 37):
        divisor = 8
        new_v = max(divisor, int(v + divisor / 2) // divisor * divisor)
        if new_v < 0.9 * v:
            codeflash_output = _depth(v, divisor)
        else:
            codeflash_output = _depth(v, divisor)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-_depth-mjac2cmm and push.

Codeflash Static Badge

The optimization improves performance by 25% through three key changes that reduce computational overhead:

**1. Eliminates float division:** The original code uses `divisor / 2` which creates a float intermediate value. The optimized version uses `divisor // 2` (integer division), keeping all arithmetic in the integer domain until the final float comparison.

**2. Reduces function call overhead:** Instead of using Python's `max()` function to ensure the result is at least `min_value`, the optimization uses a direct conditional check `if new_v < min_value: new_v = min_value`. This avoids the function call overhead of `max()`.

**3. Single type conversion:** The original code calls `int()` within a complex expression, while the optimized version does one upfront conversion `vd = int(v)` and then uses integer arithmetic throughout the main calculation.

**Performance impact in MobileNetV3:** The function references show `_depth()` is called extensively during model construction - in `_inverted_res_block()`, `_se_block()`, and the main `MobileNetV3()` function. Since MobileNetV3 uses multiple inverted residual blocks (typically 11+ blocks), each calling `_depth()` multiple times for channel calculations, this 25% speedup compounds significantly during model initialization.

**Test case performance:** The optimization shows consistent 20-40% improvements across all test scenarios, with particularly strong gains (30-60%) on smaller input values that are common in neural network channel calculations. The optimization maintains identical behavior for edge cases like the 10% rounding rule and min_value constraints.

This optimization is especially valuable since `_depth()` is in the hot path of model construction where channel dimensions are calculated repeatedly.
@codeflash-ai codeflash-ai bot requested a review from mashraf-222 December 17, 2025 18:16
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Dec 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant