⚡️ Speed up function cudnn_ok by 67%
#218
Open
+15
−4
📄 67% (0.67x) speedup for `cudnn_ok` in `keras/src/backend/tensorflow/rnn.py`

⏱️ Runtime: 827 microseconds → 494 microseconds (best of 5 runs)

📝 Explanation and details
The optimization replaces repeated local imports with module-level caching, delivering a 67% speedup by eliminating expensive import overhead in hot path functions.
**Key optimization:** The original code imports `keras.src.activations` and `keras.src.ops` on every function call within `_do_gru_arguments_support_cudnn` and `_do_lstm_arguments_support_cudnn`. The optimized version introduces a cached import mechanism using module-level globals, so these modules are imported only once per process lifetime.

**Performance impact:** Line profiler data shows the import statements consumed 75% of execution time in the original functions. The optimized version reduces this to around 50% by eliminating redundant import lookups, with the cached approach showing ~3x faster per-hit times for the import operations.
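A minimal sketch of that cached-import pattern is shown below. The helper name `_cached_modules` and the simplified compatibility check are illustrative assumptions, not the exact PR diff; only the module-level globals and the `if _activations is None` lazy-initialization idiom (discussed under thread safety below) reflect what the optimization relies on.

```python
# Sketch of the cached-import pattern; the helper name and the simplified
# CUDNN check are illustrative, not the exact optimized keras code.

_activations = None
_ops = None


def _cached_modules():
    """Import keras.src.activations / keras.src.ops once per process, then reuse."""
    global _activations, _ops
    if _activations is None:  # lazy initialization: only the first call imports
        from keras.src import activations, ops

        _activations, _ops = activations, ops
    return _activations, _ops


def _do_gru_arguments_support_cudnn_sketch(activation, recurrent_activation):
    # Hypothetical, simplified check: the fused CUDNN kernels require the
    # default tanh/sigmoid activations.
    activations, _ = _cached_modules()
    return (
        activation is activations.tanh
        and recurrent_activation is activations.sigmoid
    )
```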
**Why this matters in practice:** The function references show `cudnn_ok` is called during GRU layer initialization, specifically when determining CUDNN compatibility. Since RNN layers are frequently instantiated during model construction and potentially during training loops, this import overhead accumulates significantly. The test results demonstrate consistent 47-73% speedups across various parameter combinations, with particularly strong gains (61-67%) in the large-scale parametric tests that simulate real-world usage patterns.

**Thread safety:** The optimization is safe because Python's import system is thread-safe and the global caching pattern using lazy initialization (`if _activations is None`) is a standard Python idiom for module-level caching. In the worst case, two threads race past the `None` check and both perform the import, but each receives the same module object from `sys.modules`, so the result is identical.

This optimization is especially beneficial for workloads that create multiple RNN layers or repeatedly check CUDNN compatibility, transforming what was previously an import-bound operation into a fast cached lookup.
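As a rough way to see the effect locally, a micro-benchmark comparing a per-call import against the cached lookup might look like the following sketch; its output is machine-dependent and unrelated to the 827 µs / 494 µs figures reported above.

```python
# Rough micro-benchmark sketch; output is machine-dependent and is not
# the PR's measurement.
import timeit


def per_call_import():
    from keras.src import activations  # re-resolved via sys.modules every call
    return activations


_cached = None


def cached_lookup():
    global _cached
    if _cached is None:  # import once, then use the fast global lookup
        from keras.src import activations

        _cached = activations
    return _cached


if __name__ == "__main__":
    n = 100_000
    print("per-call import:", timeit.timeit(per_call_import, number=n))
    print("cached lookup:  ", timeit.timeit(cached_lookup, number=n))
```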
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, run `git checkout codeflash/optimize-cudnn_ok-mjajtt4m` and push.