@kimm240 kimm240 commented Jul 29, 2025

This PR introduces operator fusion for the `conv2d` -> `reshape` -> `add` -> `relu` sequence commonly found in deep learning models (e.g., the convolution + bias + activation pattern in PyTorch). This optimization improves performance and efficiency by reducing kernel launch overhead and memory traffic.

1.  **Performance Improvement:**
    * **Reduced Kernel Launch Overhead:** Previously, `conv2d`, `reshape`, `add`, and `relu` each required a separate kernel call. Fusing these four operations into a single DNNL kernel (e.g., `dnnl_fused_conv2d_bias_relu`) removes most of that launch overhead; see `src/runtime/contrib/dnnl/dnnl.cc:154-158`, where all four operations are handled by a single `execute` call.
    * **Decreased Memory Bandwidth Consumption:** Intermediate results of the individual operations (e.g., `conv_out`, `bias_add`) previously had to be written back to memory and read again. Fusion keeps these intermediate values in registers or cache, cutting unnecessary memory accesses and thus memory bandwidth usage and overall execution time.

2.  **Increased Efficiency:**
    * **Leveraging Compiler Optimizations:** By using TVM's `FuseOpsByPattern` and `MergeCompositeFunctions` passes, this change produces a composite operation tailored to specific backends (such as DNNL). Common patterns from frontends like PyTorch are thus automatically recognized in the TVM graph and mapped to high-performance fused kernels provided by libraries like DNNL.
    * **Simplified IR Module:** The compiler's intermediate representation (IR) becomes less complex as multiple operation nodes are condensed into a single composite node, which streamlines subsequent optimization and code generation stages.

This fusion is achieved through a two-stage transformation within the TVM Relax framework:

1.  **Pattern Recognition and Composite Function Creation (`FuseConv2dReshapeAddRelu` Pass):**
    * The `FuseConv2dReshapeAddRelu` class, registered as a `tvm.transform.module_pass`, transforms the `IRModule`.
    * The `_conv2d_reshape_add_relu_pattern()` helper defines the specific sequence `conv2d` -> `reshape` (applied to the bias) -> `add` -> `relu` in TVM's declarative pattern language (DPL): the input tensors (`data`, `weight`, `bias`, `shape`) are matched with `wildcard()` and the operation sequence with `is_op()`. A minimal sketch of both stages follows this list.
    * The `relax.transform.FuseOpsByPattern` pass searches the input `IRModule` for this pattern. On a match, the operation sequence is encapsulated in a new Relax function with the attributes `{"Composite": "dnnl.conv2d_reshape_add_relu", "Primitive": True}`, marking it as a logical "composite" unit.

2.  **Composite Function Merging and Codegen Attribute Assignment (`MergeCompositeFunctions` Pass):**
    * Following the `FuseConv2dReshapeAddRelu` pass, the `MergeCompositeFunctions` pass is applied via `tvm.ir.transform.Sequential`.
    * This pass identifies functions marked with the `Composite` attribute and turns them into external functions bearing the `{"Codegen": "dnnl"}` attribute, which indicates that the composite operation should be offloaded to a specific TVM backend, here DNNL.
    * Consequently, during graph execution, the fused function with the `Codegen` attribute is mapped to and executed by a single optimized DNNL kernel, for instance `dnnl_fused_conv2d_bias_relu` (defined in `src/runtime/contrib/dnnl/dnnl.cc:199-207`).
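
For concreteness, here is a minimal sketch of what the two stages could look like, reusing the names quoted above (`_conv2d_reshape_add_relu_pattern`, `FuseConv2dReshapeAddRelu`, `"dnnl.conv2d_reshape_add_relu"`); the PR's actual pass may differ in detail:

```python
import tvm
from tvm import relax
from tvm.relax.dpl import is_op, wildcard


def _conv2d_reshape_add_relu_pattern():
    # Inputs are matched with wildcards: data/weight feed conv2d,
    # bias/shape feed the reshape that broadcasts the bias.
    data, weight = wildcard(), wildcard()
    bias, shape = wildcard(), wildcard()
    conv = is_op("relax.nn.conv2d")(data, weight)
    reshaped_bias = is_op("relax.reshape")(bias, shape)
    add = is_op("relax.add")(conv, reshaped_bias)
    return is_op("relax.nn.relu")(add)


@tvm.transform.module_pass(opt_level=0, name="FuseConv2dReshapeAddRelu")
class FuseConv2dReshapeAddRelu:
    def transform_module(self, mod, ctx):
        # Stage 1: wrap each match in a composite function whose
        # "Composite" attribute is "dnnl.conv2d_reshape_add_relu".
        return relax.transform.FuseOpsByPattern(
            [("dnnl.conv2d_reshape_add_relu", _conv2d_reshape_add_relu_pattern())]
        )(mod)


# Stage 2: merge the composites into an outer function tagged with
# {"Codegen": "dnnl"} so the whole region is offloaded to DNNL.
pipeline = tvm.ir.transform.Sequential(
    [FuseConv2dReshapeAddRelu(), relax.transform.MergeCompositeFunctions()]
)
```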

This implementation enables fusion of the `conv2d + reshape + add + relu` pattern, ensuring that the common convolution + bias + activation pattern originating from frontends like PyTorch is executed as a single, highly efficient DNNL kernel within TVM.


To verify this fusion, you can directly run the specific test case:

```bash
python tests/python/relax/test_conv2d_reshape_add_relu.py
```
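
Continuing the sketch above, a hypothetical end-to-end check could build a small module containing the pattern, apply the pipeline, and inspect the resulting function attributes (the module below is illustrative, not the PR's actual test case):

```python
import tvm
from tvm.script import relax as R


@tvm.script.ir_module
class Model:
    @R.function
    def main(
        data: R.Tensor((1, 3, 32, 32), "float32"),
        weight: R.Tensor((8, 3, 3, 3), "float32"),
        bias: R.Tensor((8,), "float32"),
    ) -> R.Tensor((1, 8, 30, 30), "float32"):
        # conv2d -> reshape (bias) -> add -> relu, the target sequence.
        with R.dataflow():
            conv = R.nn.conv2d(data, weight)
            reshaped_bias = R.reshape(bias, (1, 8, 1, 1))
            add = R.add(conv, reshaped_bias)
            out = R.nn.relu(add)
            R.output(out)
        return out


# `pipeline` is the Sequential from the sketch above. After it runs,
# one function should carry the {"Codegen": "dnnl"} attribute and call
# the composite body marked "dnnl.conv2d_reshape_add_relu".
mod = pipeline(Model)
for gvar, func in mod.functions.items():
    print(gvar.name_hint, func.attrs)
```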

@yongwww yongwww commented Aug 25, 2025

seems it is a dup of #18173

@yongwww yongwww closed this Aug 25, 2025