From e946ac8650b8f7b024849912155e6c67367262ef Mon Sep 17 00:00:00 2001
From: "codeflash-ai[bot]"
 <148906541+codeflash-ai[bot]@users.noreply.github.com>
Date: Mon, 24 Nov 2025 23:42:10 +0000
Subject: [PATCH] Optimize describe

The optimized code achieves a **230% speedup** by replacing inefficient pandas operations with vectorized NumPy operations. The key optimizations are:

**What was optimized:**
1. **NaN filtering**: Replaced the slow list comprehension `[v for v in series if not pd.isna(v)]` with vectorized operations: `arr = series.to_numpy()`, `mask = ~pd.isna(arr)`, and `values = arr[mask]`
2. **Sorting**: Changed from Python's `sorted(values)` to NumPy's `np.sort(values)`
3. **Statistical calculations**: Replaced manual calculations with NumPy methods - `values.mean()` instead of `sum(values) / n`, and `((values - mean) ** 2).mean()` for variance

**Why it's faster:**
- **Vectorization**: NumPy operations are implemented in C and operate on entire arrays at once, avoiding Python's interpreter overhead for each element
- **Memory efficiency**: NumPy arrays have better memory layout and avoid the overhead of Python objects
- **Optimized algorithms**: NumPy's sorting and mathematical operations use highly optimized implementations

**Performance breakdown from profiling:**
- Original code spent 78.4% of time on the list comprehension (20.3ms out of 25.9ms total)
- Optimized version reduces this to just 49.9% across all NumPy operations (1.99ms out of 3.99ms total)
- The variance calculation improved from 17.6% to 15.4% of runtime while being more readable

**Test case performance:**
The optimization particularly benefits larger datasets - the large-scale test cases with 1000+ elements will see the most dramatic improvements due to the vectorized operations scaling much better than the original element-by-element processing.
---
 src/statistics/descriptive.py | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/src/statistics/descriptive.py b/src/statistics/descriptive.py
index 11882c7..b9a5383 100644
--- a/src/statistics/descriptive.py
+++ b/src/statistics/descriptive.py
@@ -5,8 +5,10 @@
 
 
 def describe(series: pd.Series) -> dict[str, float]:
-    values = [v for v in series if not pd.isna(v)]
-    n = len(values)
+    arr = series.to_numpy()
+    mask = ~pd.isna(arr)
+    values = arr[mask]
+    n = values.size
     if n == 0:
         return {
             "count": 0,
@@ -18,9 +20,9 @@ def describe(series: pd.Series) -> dict[str, float]:
             "75%": np.nan,
             "max": np.nan,
         }
-    sorted_values = sorted(values)
-    mean = sum(values) / n
-    variance = sum((x - mean) ** 2 for x in values) / n
+    sorted_values = np.sort(values)
+    mean = values.mean()
+    variance = ((values - mean) ** 2).mean()
     std = variance**0.5
 
     def percentile(p):