Skip to content

Conversation

@codeflash-ai
Copy link

@codeflash-ai codeflash-ai bot commented Dec 17, 2025

📄 105% (1.05x) speedup for rouwenhorst in quantecon/markov/approximation.py

⏱️ Runtime : 444 milliseconds 217 milliseconds (best of 38 runs)

📝 Explanation and details

The optimization replaces the recursive Python implementation of the transition matrix construction with a Numba JIT-compiled version, delivering a 2.05x speedup (105% faster).

Key optimizations applied:

  1. Numba JIT compilation: The core _row_build_mat_numba function uses @njit(cache=True) to compile the recursive matrix construction to machine code, eliminating Python interpreter overhead for the computationally intensive loops.

  2. Explicit loop construction: Instead of NumPy's fancy indexing (e.g., p1[:n-1, :n-1] = p * new_mat), the optimized version uses explicit nested loops that Numba can optimize more effectively.

  3. Memory layout optimization: Pre-allocating arrays with specific dtype (np.float64) and using manual element assignment provides better cache locality and avoids temporary array creation.

Why this leads to speedup:

  • The original code spent 99.4% of execution time in the recursive row_build_mat function performing array slicing and broadcasting operations
  • Numba compiles these operations to optimized machine code, eliminating Python function call overhead and NumPy's intermediate array allocations
  • The recursive nature amplifies the benefit since the optimization applies at every recursion level

Performance benefits by test case:

  • Small n (2-4 states): 1-33% faster due to JIT compilation overhead being relatively small
  • Medium n (20-100 states): 136-186% faster as recursion depth increases
  • Large n (500 states): 102-106% faster, hitting the sweet spot where JIT overhead is amortized

Workload impact:
Based on the function references, rouwenhorst is used in Markov chain approximation tests and setup methods that run repeatedly. The optimization particularly benefits:

  • Monte Carlo simulations requiring many Markov chains
  • Parameter sensitivity analysis with varying state space sizes
  • Academic/research workflows processing multiple AR(1) approximations

The optimization maintains identical behavior including the recursion limit check, ensuring backward compatibility while dramatically improving performance for the computational bottleneck.

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 🔘 None Found
🌀 Generated Regression Tests 41 Passed
⏪ Replay Tests 🔘 None Found
🔎 Concolic Coverage Tests 🔘 None Found
📊 Tests Coverage 100.0%
🌀 Generated Regression Tests and Runtime
from math import sqrt

import numpy as np
# imports
import pytest
from quantecon.markov.approximation import rouwenhorst

# Minimal MarkovChain class for testing
class MarkovChain:
    def __init__(self, P, s):
        self.P = P
        self.s = s
from quantecon.markov.approximation import rouwenhorst

# unit tests

# -------- BASIC TEST CASES --------
def test_basic_two_states():
    # Test with n=2, rho=0, sigma=1, mu=0
    codeflash_output = rouwenhorst(2, 0, 1, 0); mc = codeflash_output # 35.3μs -> 34.5μs (2.30% faster)

def test_basic_three_states():
    # Test with n=3, rho=0.5, sigma=1, mu=0
    codeflash_output = rouwenhorst(3, 0.5, 1, 0); mc = codeflash_output # 40.0μs -> 33.1μs (20.8% faster)
    # Check that each row sums to 1
    for i in range(3):
        pass

def test_basic_nonzero_mu():
    # Test with nonzero mu
    codeflash_output = rouwenhorst(4, 0.8, 2, mu=5); mc = codeflash_output # 43.9μs -> 33.0μs (33.1% faster)
    # States should be shifted by mu/(1 - rho)
    shift = 5 / (1 - 0.8)
    # Check each row sums to 1
    for i in range(4):
        pass

def test_basic_rho_negative():
    # Test with negative rho
    codeflash_output = rouwenhorst(3, -0.5, 1, 0); mc = codeflash_output # 37.8μs -> 31.9μs (18.4% faster)
    # p=q=(1+rho)/2=0.25
    expected_P = np.array([
        [0.25, 0.75, 0.  ],
        [0.375, 0.25, 0.375],
        [0.  , 0.75, 0.25]
    ])

# -------- EDGE TEST CASES --------
def test_edge_n_equals_2():
    # Smallest valid n
    codeflash_output = rouwenhorst(2, 0.9, 0.5); mc = codeflash_output # 31.5μs -> 31.1μs (1.47% faster)
    # p=q=(1+0.9)/2=0.95
    expected_P = np.array([[0.95, 0.05], [0.05, 0.95]])
    # States
    y_sd = sqrt(0.5**2 / (1 - 0.9**2))
    psi = y_sd * sqrt(2 - 1)

def test_edge_rho_near_1():
    # rho very close to 1
    rho = 0.999
    sigma = 1
    n = 5
    codeflash_output = rouwenhorst(n, rho, sigma); mc = codeflash_output # 47.8μs -> 32.8μs (45.6% faster)
    # States should be spread far apart (since y_sd is large)
    y_sd = sqrt(sigma**2 / (1 - rho**2))
    psi = y_sd * sqrt(n - 1)

def test_edge_sigma_zero():
    # sigma=0, so all states should be the same
    codeflash_output = rouwenhorst(3, 0.5, 0); mc = codeflash_output # 55.8μs -> 47.7μs (17.0% faster)
    # Transition matrix still valid
    for i in range(3):
        pass

def test_edge_invalid_n():
    # n < 2 should raise ValueError
    with pytest.raises(ValueError):
        rouwenhorst(1, 0.5, 1) # 13.7μs -> 16.5μs (16.7% slower)
    with pytest.raises(ValueError):
        rouwenhorst(0, 0.5, 1) # 10.5μs -> 11.1μs (5.64% slower)
    with pytest.raises(ValueError):
        rouwenhorst(-5, 0.5, 1) # 4.08μs -> 4.08μs (0.000% faster)

def test_edge_large_negative_mu():
    # Large negative mu
    codeflash_output = rouwenhorst(3, 0.5, 1, mu=-1000); mc = codeflash_output # 53.8μs -> 46.2μs (16.4% faster)
    shift = -1000 / (1 - 0.5)

# -------- LARGE SCALE TEST CASES --------
def test_large_scale_n_50():
    # Test with n=50
    codeflash_output = rouwenhorst(50, 0.9, 2); mc = codeflash_output # 482μs -> 185μs (160% faster)
    # Each row sums to 1
    for i in range(50):
        pass

def test_large_scale_n_500():
    # Test with n=500
    codeflash_output = rouwenhorst(500, 0.95, 1); mc = codeflash_output # 212ms -> 105ms (102% faster)
    # Each row sums to 1
    for i in range(500):
        pass

def test_large_scale_extreme_sigma():
    # Large sigma
    codeflash_output = rouwenhorst(20, 0.5, 100); mc = codeflash_output # 175μs -> 74.5μs (136% faster)

def test_large_scale_extreme_rho():
    # rho very close to 1, n=100
    codeflash_output = rouwenhorst(100, 0.9999, 1); mc = codeflash_output # 2.46ms -> 1.01ms (143% faster)
import warnings

import numpy as np
# imports
import pytest  # used for our unit tests
from quantecon.markov.approximation import rouwenhorst

# Function to test (copied as per instructions)
class MarkovChain:
    """
    Minimal MarkovChain class for testing.
    Stores transition matrix and state values.
    """
    def __init__(self, P, s):
        self.P = P
        self.state_values = s
from quantecon.markov.approximation import rouwenhorst

# unit tests

# ---- Basic Test Cases ----

def test_basic_two_states():
    # Test for n=2, rho=0.5, sigma=1.0, mu=0.0
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = rouwenhorst(2, 0.5, 1.0); mc = codeflash_output # 94.4μs -> 47.3μs (99.5% faster)
    # Transition matrix should be valid probabilities
    for row in mc.P:
        pass

def test_basic_three_states():
    # Test for n=3, rho=0.0, sigma=2.0, mu=0.0
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = rouwenhorst(3, 0.0, 2.0); mc = codeflash_output # 47.5μs -> 35.4μs (34.2% faster)
    # Transition matrix should be valid probabilities
    for row in mc.P:
        pass

def test_basic_mu_shift():
    # Test mu shifting the mean of state values
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = rouwenhorst(2, 0.5, 1.0, mu=0.0); mc1 = codeflash_output # 34.3μs -> 31.8μs (7.72% faster)
        codeflash_output = rouwenhorst(2, 0.5, 1.0, mu=2.0); mc2 = codeflash_output # 26.5μs -> 24.8μs (6.89% faster)
    # The mean of state_values should shift by mu/(1-rho)
    mean1 = np.mean(mc1.state_values)
    mean2 = np.mean(mc2.state_values)
    expected_shift = 2.0 / (1 - 0.5)

def test_basic_rho_extremes():
    # Test for rho near 1 and -1
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = rouwenhorst(2, 0.99, 1.0); mc_pos = codeflash_output # 31.4μs -> 30.0μs (4.58% faster)
        codeflash_output = rouwenhorst(2, -0.99, 1.0); mc_neg = codeflash_output # 25.9μs -> 24.2μs (7.24% faster)

# ---- Edge Test Cases ----

def test_edge_n_equals_2():
    # Edge case: n=2 (minimum allowed)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = rouwenhorst(2, 0.5, 1.0); mc = codeflash_output # 30.5μs -> 29.6μs (3.10% faster)

def test_edge_n_too_small():
    # Edge case: n=1 (should raise ValueError)
    with pytest.raises(ValueError):
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            rouwenhorst(1, 0.5, 1.0)

def test_edge_sigma_zero():
    # Edge case: sigma=0 (no noise)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = rouwenhorst(3, 0.5, 0.0); mc = codeflash_output # 38.8μs -> 32.8μs (18.4% faster)

def test_edge_rho_zero():
    # Edge case: rho=0 (no autocorrelation)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = rouwenhorst(4, 0.0, 1.0); mc = codeflash_output # 42.5μs -> 31.2μs (36.0% faster)

def test_edge_rho_one_minus_epsilon():
    # Edge case: rho very close to 1, but not exactly 1
    rho = 1 - 1e-10
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = rouwenhorst(3, rho, 1.0); mc = codeflash_output # 36.6μs -> 30.2μs (21.1% faster)

def test_edge_rho_negative_one_plus_epsilon():
    # Edge case: rho very close to -1, but not exactly -1
    rho = -1 + 1e-10
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = rouwenhorst(3, rho, 1.0); mc = codeflash_output # 36.1μs -> 29.8μs (21.3% faster)

def test_edge_large_negative_mu():
    # Edge case: large negative mu
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = rouwenhorst(3, 0.5, 1.0, mu=-1000.0); mc = codeflash_output # 36.2μs -> 29.8μs (21.5% faster)

def test_edge_large_positive_mu():
    # Edge case: large positive mu
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = rouwenhorst(3, 0.5, 1.0, mu=1000.0); mc = codeflash_output # 35.5μs -> 29.8μs (19.5% faster)

def test_large_n_100():
    # Large n=100, moderate rho and sigma
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = rouwenhorst(100, 0.9, 1.0); mc = codeflash_output # 2.88ms -> 1.01ms (186% faster)
    # Probabilities should be valid
    for row in mc.P:
        pass

def test_large_n_500():
    # Large n=500, check performance and correctness
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = rouwenhorst(500, 0.8, 2.0); mc = codeflash_output # 216ms -> 105ms (106% faster)
    # Probabilities should be valid
    for row in mc.P:
        pass

def test_large_scale_state_values_monotonic():
    # For large n and rho=0, state values should be evenly spaced and monotonic
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = rouwenhorst(100, 0.0, 1.0); mc = codeflash_output # 2.96ms -> 1.06ms (179% faster)
    diffs = np.diff(mc.state_values)

def test_large_scale_state_values_shifted():
    # For large n and mu, mean should shift accordingly
    mu = 1000.0
    rho = 0.5
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = rouwenhorst(100, rho, 1.0, mu=mu); mc = codeflash_output # 2.39ms -> 948μs (152% faster)
    expected_mean = mu / (1 - rho)

# ---- Additional Robustness ----

@pytest.mark.parametrize("n,rho,sigma,mu", [
    (2, 0.1, 0.1, 0.0),
    (3, 0.9, 2.0, 1.0),
    (10, -0.5, 0.5, -1.0),
    (50, 0.7, 1.5, 10.0),
    (100, 0.0, 10.0, 0.0),
])
def test_parametrized_various_inputs(n, rho, sigma, mu):
    # Parametrized test for a variety of valid inputs
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        codeflash_output = rouwenhorst(n, rho, sigma, mu); mc = codeflash_output # 2.94ms -> 1.18ms (149% faster)
    # Probabilities valid
    for row in mc.P:
        pass

def test_warning_is_raised():
    # Ensure UserWarning is raised for API change
    with pytest.warns(UserWarning):
        rouwenhorst(2, 0.5, 1.0) # 44.0μs -> 41.1μs (6.89% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-rouwenhorst-mja1cn83 and push.

Codeflash Static Badge

The optimization replaces the recursive Python implementation of the transition matrix construction with a **Numba JIT-compiled version**, delivering a **2.05x speedup** (105% faster).

**Key optimizations applied:**

1. **Numba JIT compilation**: The core `_row_build_mat_numba` function uses `@njit(cache=True)` to compile the recursive matrix construction to machine code, eliminating Python interpreter overhead for the computationally intensive loops.

2. **Explicit loop construction**: Instead of NumPy's fancy indexing (e.g., `p1[:n-1, :n-1] = p * new_mat`), the optimized version uses explicit nested loops that Numba can optimize more effectively.

3. **Memory layout optimization**: Pre-allocating arrays with specific dtype (`np.float64`) and using manual element assignment provides better cache locality and avoids temporary array creation.

**Why this leads to speedup:**
- The original code spent 99.4% of execution time in the recursive `row_build_mat` function performing array slicing and broadcasting operations
- Numba compiles these operations to optimized machine code, eliminating Python function call overhead and NumPy's intermediate array allocations
- The recursive nature amplifies the benefit since the optimization applies at every recursion level

**Performance benefits by test case:**
- Small n (2-4 states): 1-33% faster due to JIT compilation overhead being relatively small
- Medium n (20-100 states): 136-186% faster as recursion depth increases 
- Large n (500 states): 102-106% faster, hitting the sweet spot where JIT overhead is amortized

**Workload impact:**
Based on the function references, `rouwenhorst` is used in Markov chain approximation tests and setup methods that run repeatedly. The optimization particularly benefits:
- Monte Carlo simulations requiring many Markov chains
- Parameter sensitivity analysis with varying state space sizes
- Academic/research workflows processing multiple AR(1) approximations

The optimization maintains identical behavior including the recursion limit check, ensuring backward compatibility while dramatically improving performance for the computational bottleneck.
@codeflash-ai codeflash-ai bot requested a review from aseembits93 December 17, 2025 13:16
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Dec 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants