feat(llmobs): success criteria assessment for custom evals #14792
Bootstrap import analysis
Comparison of import times between this PR and base.

Summary
The average import time from this PR is: 242 ± 3 ms.
The average import time from base is: 246 ± 4 ms.
The import time difference between this PR and base is: -4.0 ± 0.2 ms.

Import time breakdown
The following import paths have shrunk:
Performance SLOs
Comparing candidate yunkim/evals-success-criteria (b9840bb) with baseline main (8235d03)

📈 Performance Regressions (3 suites)

📈 iastaspects - 118/118

| Benchmark | Time | Time SLO (margin) | Time vs baseline | Memory | Memory SLO (margin) | Memory vs baseline |
|---|---|---|---|---|---|---|
| ✅ add_aspect | ✅ 0.402µs | <10.000µs 📉 -96.0% | -0.2% | ✅ 38.044MB | <39.000MB -2.5% | +4.7% |
| ✅ add_inplace_aspect | ✅ 0.412µs | <10.000µs 📉 -95.9% | +1.4% | ✅ 38.044MB | <39.000MB -2.5% | +4.7% |
| ✅ add_inplace_noaspect | ✅ 0.317µs | <10.000µs 📉 -96.8% | -0.4% | ✅ 38.083MB | <39.000MB -2.4% | +4.8% |
| ✅ add_noaspect | ✅ 0.279µs | <10.000µs 📉 -97.2% | ~same | ✅ 38.063MB | <39.000MB -2.4% | +5.1% |
| ✅ bytearray_aspect | ✅ 1.315µs | <10.000µs 📉 -86.9% | ~same | ✅ 38.103MB | <39.000MB -2.3% | +5.0% |
| ✅ bytearray_extend_aspect | ✅ 1.458µs | <10.000µs 📉 -85.4% | +0.9% | ✅ 38.063MB | <39.000MB -2.4% | +4.8% |
| ✅ bytearray_extend_noaspect | ✅ 0.607µs | <10.000µs 📉 -93.9% | -1.4% | ✅ 37.965MB | <39.000MB -2.7% | +4.5% |
| ✅ bytearray_noaspect | ✅ 0.478µs | <10.000µs 📉 -95.2% | +0.4% | ✅ 38.044MB | <39.000MB -2.5% | +4.7% |
| ✅ bytes_aspect | ✅ 1.292µs | <10.000µs 📉 -87.1% | -0.2% | ✅ 38.083MB | <39.000MB -2.4% | +4.9% |
| ✅ bytes_noaspect | ✅ 0.492µs | <10.000µs 📉 -95.1% | -0.6% | ✅ 38.103MB | <39.000MB -2.3% | +5.0% |
| ✅ bytesio_aspect | ✅ 1.350µs | <10.000µs 📉 -86.5% | -0.7% | ✅ 38.063MB | <39.000MB -2.4% | +4.9% |
| ✅ bytesio_noaspect | ✅ 0.505µs | <10.000µs 📉 -95.0% | +1.4% | ✅ 38.122MB | <39.000MB -2.3% | +5.2% |
| ✅ capitalize_aspect | ✅ 0.738µs | <10.000µs 📉 -92.6% | +0.4% | ✅ 38.083MB | <39.000MB -2.4% | +4.7% |
| ✅ capitalize_noaspect | ✅ 0.435µs | <10.000µs 📉 -95.6% | -0.9% | ✅ 38.083MB | <39.000MB -2.4% | +4.9% |
| ✅ casefold_aspect | ✅ 0.736µs | <10.000µs 📉 -92.6% | -0.9% | ✅ 38.044MB | <39.000MB -2.5% | +4.9% |
| ✅ casefold_noaspect | ✅ 0.368µs | <10.000µs 📉 -96.3% | -0.4% | ✅ 38.103MB | <39.000MB -2.3% | +4.9% |
| ✅ decode_aspect | ✅ 0.723µs | <10.000µs 📉 -92.8% | ~same | ✅ 38.103MB | <39.000MB -2.3% | +5.0% |
| ✅ decode_noaspect | ✅ 0.416µs | <10.000µs 📉 -95.8% | -0.3% | ✅ 38.083MB | <39.000MB -2.4% | +5.0% |
| ✅ encode_aspect | ✅ 0.714µs | <10.000µs 📉 -92.9% | +1.4% | ✅ 38.083MB | <39.000MB -2.4% | +4.9% |
| ✅ encode_noaspect | ✅ 0.404µs | <10.000µs 📉 -96.0% | +1.1% | ✅ 38.083MB | <39.000MB -2.4% | +4.8% |
| ✅ format_aspect | ✅ 3.383µs | <10.000µs 📉 -66.2% | +0.1% | ✅ 38.024MB | <39.000MB -2.5% | +4.8% |
| ✅ format_map_aspect | ✅ 3.572µs | <10.000µs 📉 -64.3% | -1.1% | ✅ 38.044MB | <39.000MB -2.5% | +4.9% |
| ✅ format_map_noaspect | ✅ 0.780µs | <10.000µs 📉 -92.2% | +1.1% | ✅ 38.083MB | <39.000MB -2.4% | +4.9% |
| ✅ format_noaspect | ✅ 0.597µs | <10.000µs 📉 -94.0% | +0.5% | ✅ 38.083MB | <39.000MB -2.4% | +4.9% |
| ✅ index_aspect | ✅ 0.355µs | <10.000µs 📉 -96.4% | -0.9% | ✅ 38.103MB | <39.000MB -2.3% | +5.0% |
| ✅ index_noaspect | ✅ 0.279µs | <10.000µs 📉 -97.2% | -0.2% | ✅ 38.044MB | <39.000MB -2.5% | +4.9% |
| ✅ join_aspect | ✅ 1.397µs | <10.000µs 📉 -86.0% | +1.3% | ✅ 38.004MB | <39.000MB -2.6% | +4.6% |
| ✅ join_noaspect | ✅ 0.488µs | <10.000µs 📉 -95.1% | -0.8% | ✅ 38.044MB | <39.000MB -2.5% | +5.0% |
| ✅ ljust_aspect | ✅ 2.584µs | <20.000µs 📉 -87.1% | +4.7% | ✅ 38.063MB | <39.000MB -2.4% | +4.9% |
| ✅ ljust_noaspect | ✅ 0.405µs | <10.000µs 📉 -96.0% | +0.6% | ✅ 38.103MB | <39.000MB -2.3% | +5.1% |
| ✅ lower_aspect | ✅ 2.186µs | <10.000µs 📉 -78.1% | -1.1% | ✅ 38.122MB | <39.000MB -2.3% | +4.9% |
| ✅ lower_noaspect | ✅ 0.366µs | <10.000µs 📉 -96.3% | -0.2% | ✅ 38.103MB | <39.000MB -2.3% | +5.0% |
| ✅ lstrip_aspect | ✅ 2.606µs | <20.000µs 📉 -87.0% | 📈 +17.2% | ✅ 38.122MB | <39.000MB -2.3% | +4.9% |
| ✅ lstrip_noaspect | ✅ 0.384µs | <10.000µs 📉 -96.2% | +1.3% | ✅ 38.024MB | <39.000MB -2.5% | +4.6% |
| ✅ modulo_aspect | ✅ 0.995µs | <10.000µs 📉 -90.1% | -0.8% | ✅ 38.083MB | <39.000MB -2.4% | +4.8% |
| ✅ modulo_aspect_for_bytearray_bytearray | ✅ 1.538µs | <10.000µs 📉 -84.6% | +0.5% | ✅ 38.044MB | <39.000MB -2.5% | +4.7% |
| ✅ modulo_aspect_for_bytes | ✅ 0.981µs | <10.000µs 📉 -90.2% | +0.8% | ✅ 38.044MB | <39.000MB -2.5% | +4.7% |
| ✅ modulo_aspect_for_bytes_bytearray | ✅ 1.199µs | <10.000µs 📉 -88.0% | ~same | ✅ 38.103MB | <39.000MB -2.3% | +5.0% |
| ✅ modulo_noaspect | ✅ 0.631µs | <10.000µs 📉 -93.7% | +0.7% | ✅ 38.083MB | <39.000MB -2.4% | +4.9% |
| ✅ replace_aspect | ✅ 5.413µs | <10.000µs 📉 -45.9% | 📈 +13.4% | ✅ 38.044MB | <39.000MB -2.5% | +4.7% |
| ✅ replace_noaspect | ✅ 0.460µs | <10.000µs 📉 -95.4% | -0.8% | ✅ 38.063MB | <39.000MB -2.4% | +4.9% |
| ✅ repr_aspect | ✅ 0.909µs | <10.000µs 📉 -90.9% | -0.2% | ✅ 38.103MB | <39.000MB -2.3% | +4.9% |
| ✅ repr_noaspect | ✅ 0.421µs | <10.000µs 📉 -95.8% | +1.7% | ✅ 38.083MB | <39.000MB -2.4% | +4.8% |
| ✅ rstrip_aspect | ✅ 1.992µs | <20.000µs 📉 -90.0% | +3.9% | ✅ 38.024MB | <39.000MB -2.5% | +4.9% |
| ✅ rstrip_noaspect | ✅ 0.386µs | <10.000µs 📉 -96.1% | +1.0% | ✅ 38.083MB | <39.000MB -2.4% | +5.0% |
| ✅ slice_aspect | ✅ 0.498µs | <10.000µs 📉 -95.0% | +0.8% | ✅ 38.103MB | <39.000MB -2.3% | +4.8% |
| ✅ slice_noaspect | ✅ 0.450µs | <10.000µs 📉 -95.5% | ~same | ✅ 38.044MB | <39.000MB -2.5% | +4.6% |
| ✅ stringio_aspect | ✅ 1.697µs | <10.000µs 📉 -83.0% | 📈 +10.2% | ✅ 38.063MB | <39.000MB -2.4% | +5.0% |
| ✅ stringio_noaspect | ✅ 0.728µs | <10.000µs 📉 -92.7% | +1.8% | ✅ 38.063MB | <39.000MB -2.4% | +4.9% |
| ✅ strip_aspect | ✅ 2.551µs | <20.000µs 📉 -87.2% | 📈 +14.9% | ✅ 38.063MB | <39.000MB -2.4% | +4.7% |
| ✅ strip_noaspect | ✅ 0.384µs | <10.000µs 📉 -96.2% | +0.4% | ✅ 37.965MB | <39.000MB -2.7% | +4.4% |
| ✅ swapcase_aspect | ✅ 2.559µs | <10.000µs 📉 -74.4% | +5.4% | ✅ 38.083MB | <39.000MB -2.4% | +4.8% |
| ✅ swapcase_noaspect | ✅ 0.537µs | <10.000µs 📉 -94.6% | -0.1% | ✅ 38.044MB | <39.000MB -2.5% | +4.7% |
| ✅ title_aspect | ✅ 2.344µs | <10.000µs 📉 -76.6% | -1.1% | ✅ 38.083MB | <39.000MB -2.4% | +4.8% |
| ✅ title_noaspect | ✅ 0.498µs | <10.000µs 📉 -95.0% | -0.5% | ✅ 38.083MB | <39.000MB -2.4% | +4.9% |
| ✅ translate_aspect | ✅ 3.450µs | <10.000µs 📉 -65.5% | +7.4% | ✅ 38.044MB | <39.000MB -2.5% | +4.7% |
| ✅ translate_noaspect | ✅ 1.043µs | <10.000µs 📉 -89.6% | +0.2% | ✅ 38.103MB | <39.000MB -2.3% | +5.0% |
| ✅ upper_aspect | ✅ 2.356µs | <10.000µs 📉 -76.4% | +5.4% | ✅ 38.103MB | <39.000MB -2.3% | +5.2% |
| ✅ upper_noaspect | ✅ 0.375µs | <10.000µs 📉 -96.2% | -0.6% | ✅ 38.004MB | <39.000MB -2.6% | +4.7% |

📈 iastaspectsospath - 24/24

| Benchmark | Time | Time SLO (margin) | Time vs baseline | Memory | Memory SLO (margin) | Memory vs baseline |
|---|---|---|---|---|---|---|
| ✅ ospathbasename_aspect | ✅ 4.313µs | <10.000µs 📉 -56.9% | +0.3% | ✅ 37.729MB | <39.000MB -3.3% | +4.9% |
| ✅ ospathbasename_noaspect | ✅ 1.090µs | <10.000µs 📉 -89.1% | ~same | ✅ 37.650MB | <39.000MB -3.5% | +4.9% |
| ✅ ospathjoin_aspect | ✅ 6.926µs | <10.000µs 📉 -30.7% | 📈 +13.5% | ✅ 37.690MB | <39.000MB -3.4% | +4.8% |
| ✅ ospathjoin_noaspect | ✅ 2.322µs | <10.000µs 📉 -76.8% | +0.5% | ✅ 37.670MB | <39.000MB -3.4% | +4.9% |
| ✅ ospathnormcase_aspect | ✅ 3.521µs | <10.000µs 📉 -64.8% | +0.7% | ✅ 37.709MB | <39.000MB -3.3% | +5.0% |
| ✅ ospathnormcase_noaspect | ✅ 0.570µs | <10.000µs 📉 -94.3% | ~same | ✅ 37.670MB | <39.000MB -3.4% | +4.9% |
| ✅ ospathsplit_aspect | ✅ 4.892µs | <10.000µs 📉 -51.1% | -0.4% | ✅ 37.631MB | <39.000MB -3.5% | +4.8% |
| ✅ ospathsplit_noaspect | ✅ 1.607µs | <10.000µs 📉 -83.9% | +0.5% | ✅ 37.690MB | <39.000MB -3.4% | +4.7% |
| ✅ ospathsplitdrive_aspect | ✅ 3.668µs | <10.000µs 📉 -63.3% | +0.3% | ✅ 37.690MB | <39.000MB -3.4% | +4.8% |
| ✅ ospathsplitdrive_noaspect | ✅ 0.699µs | <10.000µs 📉 -93.0% | -0.8% | ✅ 37.709MB | <39.000MB -3.3% | +4.9% |
| ✅ ospathsplitext_aspect | ✅ 4.537µs | <10.000µs 📉 -54.6% | -1.5% | ✅ 37.709MB | <39.000MB -3.3% | +4.9% |
| ✅ ospathsplitext_noaspect | ✅ 1.391µs | <10.000µs 📉 -86.1% | -0.6% | ✅ 37.690MB | <39.000MB -3.4% | +5.0% |

📈 telemetryaddmetric - 30/30

| Benchmark | Time | Time SLO (margin) | Time vs baseline | Memory | Memory SLO (margin) | Memory vs baseline |
|---|---|---|---|---|---|---|
| ✅ 1-count-metric-1-times | ✅ 3.307µs | <20.000µs 📉 -83.5% | +6.4% | ✅ 32.126MB | <34.000MB -5.5% | +5.0% |
| ✅ 1-count-metrics-100-times | ✅ 213.621µs | <250.000µs 📉 -14.6% | +0.7% | ✅ 32.145MB | <34.000MB -5.5% | +5.0% |
| ✅ 1-distribution-metric-1-times | ✅ 3.201µs | <20.000µs 📉 -84.0% | +8.6% | ✅ 32.106MB | <34.000MB -5.6% | +4.8% |
| ✅ 1-distribution-metrics-100-times | ✅ 195.139µs | <220.000µs 📉 -11.3% | +1.4% | ✅ 32.067MB | <34.000MB -5.7% | +4.9% |
| ✅ 1-gauge-metric-1-times | ✅ 2.076µs | <20.000µs 📉 -89.6% | -1.4% | ✅ 32.126MB | <34.000MB -5.5% | +5.2% |
| ✅ 1-gauge-metrics-100-times | ✅ 126.386µs | <150.000µs 📉 -15.7% | +0.4% | ✅ 32.106MB | <34.000MB -5.6% | +4.9% |
| ✅ 1-rate-metric-1-times | ✅ 3.144µs | <20.000µs 📉 -84.3% | +1.2% | ✅ 32.145MB | <34.000MB -5.5% | +4.9% |
| ✅ 1-rate-metrics-100-times | ✅ 214.736µs | <250.000µs 📉 -14.1% | +1.0% | ✅ 32.145MB | <34.000MB -5.5% | +5.0% |
| ✅ 100-count-metrics-100-times | ✅ 21.761ms | <23.500ms -7.4% | +1.8% | ✅ 32.145MB | <34.000MB -5.5% | +4.9% |
| ✅ 100-distribution-metrics-100-times | ✅ 1.983ms | <2.250ms 📉 -11.9% | -0.2% | ✅ 32.165MB | <34.000MB -5.4% | +4.9% |
| ✅ 100-gauge-metrics-100-times | ✅ 1.295ms | <1.550ms 📉 -16.4% | +0.5% | ✅ 32.106MB | <34.000MB -5.6% | +4.7% |
| ✅ 100-rate-metrics-100-times | ✅ 2.217ms | <2.550ms 📉 -13.1% | +1.7% | ✅ 32.047MB | <34.000MB -5.7% | +4.8% |
| ✅ flush-1-metric | ✅ 4.687µs | <20.000µs 📉 -76.6% | 📈 +10.7% | ✅ 32.165MB | <34.000MB -5.4% | +5.2% |
| ✅ flush-100-metrics | ✅ 180.829µs | <250.000µs 📉 -27.7% | -0.4% | ✅ 32.106MB | <34.000MB -5.6% | +4.9% |
| ✅ flush-1000-metrics | ✅ 2.228ms | <2.500ms 📉 -10.9% | +0.8% | ✅ 32.873MB | <34.500MB -4.7% | +4.7% |

🟡 Near SLO Breach (5 suites)

🟡 djangosimple - 30/30

| Benchmark | Time | Time SLO (margin) | Time vs baseline | Memory | Memory SLO (margin) | Memory vs baseline |
|---|---|---|---|---|---|---|
| ✅ appsec | ✅ 20.478ms | <22.300ms -8.2% | +0.3% | ✅ 65.488MB | <67.000MB -2.3% | +4.9% |
| ✅ exception-replay-enabled | ✅ 1.348ms | <1.450ms -7.1% | -0.3% | ✅ 64.591MB | <67.000MB -3.6% | +4.8% |
| ✅ iast | ✅ 20.482ms | <22.250ms -7.9% | +0.2% | ✅ 65.510MB | <67.000MB -2.2% | +4.9% |
| ✅ profiler | ✅ 15.323ms | <16.550ms -7.4% | +0.4% | ✅ 53.730MB | <54.500MB 🟡 -1.4% | +4.8% |
| ✅ resource-renaming | ✅ 20.513ms | <21.750ms -5.7% | -0.3% | ✅ 65.379MB | <67.000MB -2.4% | +4.8% |
| ✅ span-code-origin | ✅ 26.149ms | <28.200ms -7.3% | -0.1% | ✅ 67.480MB | <69.500MB -2.9% | +4.6% |
| ✅ tracer | ✅ 20.497ms | <21.750ms -5.8% | -0.2% | ✅ 65.487MB | <67.000MB -2.3% | +4.9% |
| ✅ tracer-and-profiler | ✅ 21.992ms | <23.500ms -6.4% | -0.2% | ✅ 66.578MB | <67.500MB 🟡 -1.4% | +4.9% |
| ✅ tracer-dont-create-db-spans | ✅ 19.332ms | <21.500ms 📉 -10.1% | ~same | ✅ 65.478MB | <66.000MB 🟡 -0.8% | +4.9% |
| ✅ tracer-minimal | ✅ 16.552ms | <17.500ms -5.4% | -0.5% | ✅ 65.467MB | <66.000MB 🟡 -0.8% | +4.7% |
| ✅ tracer-native | ✅ 20.452ms | <21.750ms -6.0% | -0.3% | ✅ 71.334MB | <72.500MB 🟡 -1.6% | +4.8% |
| ✅ tracer-no-caches | ✅ 18.438ms | <19.650ms -6.2% | ~same | ✅ 65.454MB | <67.000MB -2.3% | +4.9% |
| ✅ tracer-no-databases | ✅ 18.733ms | <20.100ms -6.8% | -0.2% | ✅ 65.299MB | <67.000MB -2.5% | +4.6% |
| ✅ tracer-no-middleware | ✅ 20.183ms | <21.500ms -6.1% | -0.1% | ✅ 65.467MB | <67.000MB -2.3% | +4.7% |
| ✅ tracer-no-templates | ✅ 20.297ms | <22.000ms -7.7% | ~same | ✅ 65.461MB | <67.000MB -2.3% | +4.9% |

🟡 errortrackingdjangosimple - 6/6

| Benchmark | Time | Time SLO (margin) | Time vs baseline | Memory | Memory SLO (margin) | Memory vs baseline |
|---|---|---|---|---|---|---|
| ✅ errortracking-enabled-all | ✅ 18.288ms | <19.850ms -7.9% | +1.4% | ✅ 65.195MB | <66.500MB 🟡 -2.0% | +4.8% |
| ✅ errortracking-enabled-user | ✅ 18.262ms | <19.400ms -5.9% | +1.2% | ✅ 65.274MB | <66.500MB 🟡 -1.8% | +4.9% |
| ✅ tracer-enabled | ✅ 18.072ms | <19.450ms -7.1% | +0.3% | ✅ 65.235MB | <66.500MB 🟡 -1.9% | +4.9% |

🟡 flasksimple - 18/18

| Benchmark | Time | Time SLO (margin) | Time vs baseline | Memory | Memory SLO (margin) | Memory vs baseline |
|---|---|---|---|---|---|---|
| ✅ appsec-get | ✅ 4.569ms | <4.750ms -3.8% | -0.3% | ✅ 62.010MB | <65.000MB -4.6% | +5.0% |
| ✅ appsec-post | ✅ 6.585ms | <6.750ms -2.4% | +0.2% | ✅ 61.991MB | <65.000MB -4.6% | +4.8% |
| ✅ appsec-telemetry | ✅ 4.555ms | <4.750ms -4.1% | -0.9% | ✅ 62.049MB | <65.000MB -4.5% | +4.9% |
| ✅ debugger | ✅ 1.887ms | <2.000ms -5.7% | +1.3% | ✅ 45.554MB | <47.000MB -3.1% | +5.2% |
| ✅ iast-get | ✅ 1.861ms | <2.000ms -6.9% | -0.3% | ✅ 42.408MB | <49.000MB 📉 -13.5% | +4.9% |
| ✅ profiler | ✅ 1.906ms | <2.100ms -9.3% | -0.4% | ✅ 46.478MB | <47.000MB 🟡 -1.1% | +4.8% |
| ✅ resource-renaming | ✅ 3.381ms | <3.650ms -7.4% | -0.2% | ✅ 52.199MB | <53.500MB -2.4% | +4.8% |
| ✅ tracer | ✅ 3.363ms | <3.650ms -7.9% | -0.7% | ✅ 52.258MB | <53.500MB -2.3% | +4.9% |
| ✅ tracer-native | ✅ 3.370ms | <3.650ms -7.7% | -0.1% | ✅ 58.260MB | <60.000MB -2.9% | +4.8% |

🟡 otelspan - 22/22

| Benchmark | Time | Time SLO (margin) | Time vs baseline | Memory | Memory SLO (margin) | Memory vs baseline |
|---|---|---|---|---|---|---|
| ✅ add-event | ✅ 42.662ms | <47.150ms -9.5% | +0.7% | ✅ 44.463MB | <47.000MB -5.4% | +4.9% |
| ✅ add-metrics | ✅ 318.139ms | <344.800ms -7.7% | ~same | ✅ 596.460MB | <600.000MB 🟡 -0.6% | +5.0% |
| ✅ add-tags | ✅ 291.973ms | <314.000ms -7.0% | +1.9% | ✅ 597.205MB | <600.000MB 🟡 -0.5% | +4.9% |
| ✅ get-context | ✅ 80.926ms | <92.350ms 📉 -12.4% | +0.3% | ✅ 39.910MB | <46.500MB 📉 -14.2% | +4.9% |
| ✅ is-recording | ✅ 38.974ms | <44.500ms 📉 -12.4% | -0.2% | ✅ 43.931MB | <47.500MB -7.5% | +4.8% |
| ✅ record-exception | ✅ 58.884ms | <67.650ms 📉 -13.0% | ~same | ✅ 40.275MB | <47.000MB 📉 -14.3% | +4.7% |
| ✅ set-status | ✅ 45.646ms | <50.400ms -9.4% | +1.9% | ✅ 43.854MB | <47.000MB -6.7% | +4.6% |
| ✅ start | ✅ 38.246ms | <43.450ms 📉 -12.0% | +0.2% | ✅ 43.986MB | <47.000MB -6.4% | +5.0% |
| ✅ start-finish | ✅ 82.675ms | <88.000ms -6.1% | -0.2% | ✅ 34.544MB | <46.500MB 📉 -25.7% | +4.9% |
| ✅ start-finish-telemetry | ✅ 85.648ms | <89.000ms -3.8% | +1.7% | ✅ 34.524MB | <46.500MB 📉 -25.8% | +4.7% |
| ✅ update-name | ✅ 40.183ms | <45.150ms 📉 -11.0% | +0.3% | ✅ 44.210MB | <47.000MB -5.9% | +4.8% |

🟡 span - 26/26

| Benchmark | Time | Time SLO (margin) | Time vs baseline | Memory | Memory SLO (margin) | Memory vs baseline |
|---|---|---|---|---|---|---|
| ✅ add-event | ✅ 21.104ms | <22.500ms -6.2% | +2.6% | ✅ 50.354MB | <53.000MB -5.0% | +4.9% |
| ✅ add-metrics | ✅ 91.053ms | <93.500ms -2.6% | -0.3% | ✅ 660.808MB | <961.000MB 📉 -31.2% | +4.8% |
| ✅ add-tags | ✅ 149.171ms | <155.000ms -3.8% | +0.6% | ✅ 661.512MB | <962.500MB 📉 -31.3% | +4.8% |
| ✅ get-context | ✅ 19.425ms | <20.500ms -5.2% | -0.4% | ✅ 49.113MB | <53.000MB -7.3% | +4.8% |
| ✅ is-recording | ✅ 19.685ms | <20.500ms -4.0% | -0.2% | ✅ 49.189MB | <53.000MB -7.2% | +4.8% |
| ✅ record-exception | ✅ 38.221ms | <40.000ms -4.4% | -0.4% | ✅ 42.752MB | <53.000MB 📉 -19.3% | +4.9% |
| ✅ set-status | ✅ 21.354ms | <22.000ms -2.9% | +0.7% | ✅ 49.217MB | <53.000MB -7.1% | +5.1% |
| ✅ start | ✅ 19.202ms | <20.500ms -6.3% | -0.6% | ✅ 49.182MB | <53.000MB -7.2% | +4.7% |
| ✅ start-finish | ✅ 51.547ms | <52.500ms 🟡 -1.8% | ~same | ✅ 32.126MB | <34.000MB -5.5% | +4.5% |
| ✅ start-finish-telemetry | ✅ 52.822ms | <54.500ms -3.1% | ~same | ✅ 32.204MB | <34.000MB -5.3% | +5.2% |
| ✅ start-finish-traceid128 | ✅ 54.993ms | <56.000ms 🟡 -1.8% | +0.3% | ✅ 32.126MB | <34.000MB -5.5% | +4.9% |
| ✅ start-traceid128 | ✅ 19.726ms | <22.500ms 📉 -12.3% | -0.4% | ✅ 49.110MB | <53.000MB -7.3% | +5.0% |
| ✅ update-name | ✅ 20.267ms | <22.000ms -7.9% | +0.4% | ✅ 49.752MB | <53.000MB -6.1% | +4.7% |
lgtm if we're good with the variable name!
d9f1a40 to 8d1e5d0
## Description

(public change) Adds `reasoning` as an argument to `submit_evaluation_for()` and `submit_evaluation()`. This argument carries an explanation of the evaluation result (e.g., why was the span marked as toxic?).

(internal change, not user-facing) Also changes how the `assessment` field is stored on the evaluation object: #14792 added it inside a nested `success_criteria` object, and it is now a top-level field on the evaluation object. This is neither a breaking change (the field has not been officially released on our product backend) nor a user-facing one.

## Testing

## Risks

## Additional Notes
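To illustrate the storage change described in that follow-up, here is a minimal sketch in plain Python. It does not use the library's code: the helper name `hoist_assessment`, the `label` and `categorical_value` keys, the sample reasoning text, and the exact shape of the nested object are all assumptions made for illustration. Only the names `assessment`, `success_criteria`, and `reasoning` come from the text above.

```python
from typing import Any, Dict


def hoist_assessment(evaluation: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with a nested assessment moved to the top level.

    Hypothetical helper: assumes the old layout stored the value at
    evaluation["success_criteria"]["assessment"].
    """
    # Copy every field except the nested container.
    flattened = {k: v for k, v in evaluation.items() if k != "success_criteria"}
    nested = evaluation.get("success_criteria") or {}
    if "assessment" in nested:
        flattened["assessment"] = nested["assessment"]
    return flattened


# Assumed old layout: assessment nested under success_criteria.
old_style = {
    "label": "toxicity",
    "categorical_value": "toxic",
    "success_criteria": {"assessment": "fail"},
}

# Assumed new layout: assessment at the top level, plus a free-text reasoning.
new_style = hoist_assessment(old_style)
new_style["reasoning"] = "The response contains an insult."
```

The helper returns a new dictionary instead of mutating its input, so the old-style object stays available for comparison.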
Description
MLOB-4072
Adds `success_asssesment` as an argument to `submit_evaluation_for()`. This argument denotes whether the submitted evaluation is correct/valid (particularly for evaluations that use LLM-as-a-judge). Provided values must be `"pass"` or `"fail"`.
Testing
Risks
Additional Notes
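A minimal sketch of the constraint stated in the description, namely that provided values must be `"pass"` or `"fail"`. This is not the library's implementation: the helper name `validate_assessment`, the choice to treat `None` as "no assessment given", and the use of `ValueError` are assumptions for illustration.

```python
from typing import Optional

# The two values the description allows.
ALLOWED_ASSESSMENTS = frozenset({"pass", "fail"})


def validate_assessment(value: Optional[str]) -> Optional[str]:
    """Return the value unchanged if it is acceptable, else raise.

    Hypothetical helper: None is assumed to mean the caller did not
    supply an assessment.
    """
    if value is None:
        return None
    if value not in ALLOWED_ASSESSMENTS:
        raise ValueError(
            f"assessment must be 'pass' or 'fail', got {value!r}"
        )
    return value


# Usage: accepted values pass through, anything else is rejected.
checked = validate_assessment("pass")
try:
    validate_assessment("maybe")
    rejected = False
except ValueError:
    rejected = True
```

Validating at the call boundary keeps malformed values from being sent along with the evaluation; whether the real method raises or only logs a warning is not stated in this PR description.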