Currently, due to **D2H and stream sync**, the dynamicemb (`MP`) and dense (torchrec DP) parts run serially, so the [side stream](https://github.com/NVIDIA/recsys-examples/blob/main/examples/commons/modules/embedding.py#L397-L404) has no effect. See the timeline below:

<img width="908" height="661" alt="Image" src="https://github.com/user-attachments/assets/683c76cb-41c9-4323-95ac-48086f8b1d16" />
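
To make the reasoning concrete, here is a toy, pure-Python model of why a host-blocking sync cancels the benefit of a side stream. It is only a sketch, not the code in this repo: `Timeline`, `step`, the stream names, and the 4 ms / 3 ms durations are all made up for illustration, and launch overhead is ignored. The assumption is that the host has to wait for the embedding work on the side stream (for example because of a D2H copy or an explicit stream sync) before it can enqueue the dense work.

```python
from dataclasses import dataclass, field


@dataclass
class Timeline:
    """Toy model: one host thread launching async work on named GPU streams."""
    host: float = 0.0  # time at which the host is free to launch again
    streams: dict = field(default_factory=dict)  # stream name -> finish time of queued work

    def launch(self, stream: str, duration: float) -> float:
        # Asynchronous launch: work starts once the host has issued it and the
        # stream's earlier work is done. The host itself does not wait.
        start = max(self.host, self.streams.get(stream, 0.0))
        self.streams[stream] = start + duration
        return self.streams[stream]

    def host_sync(self, stream: str) -> None:
        # Host-blocking sync (assumed here to model a D2H copy or stream sync):
        # the host cannot issue anything until the stream has drained.
        self.host = max(self.host, self.streams.get(stream, 0.0))

    def finish(self) -> float:
        return max([self.host, *self.streams.values()])


def step(blocking_d2h: bool, emb_ms: float = 4.0, dense_ms: float = 3.0) -> float:
    """Return the modeled step time in ms (hypothetical durations)."""
    t = Timeline()
    t.launch("side", emb_ms)      # embedding (MP) work on the side stream
    if blocking_d2h:
        t.host_sync("side")       # host stalls until the embedding work is done
    t.launch("main", dense_ms)    # dense (DP) work on the main stream
    return t.finish()


serialized = step(blocking_d2h=True)    # host stall forces embedding + dense back to back
overlapped = step(blocking_d2h=False)   # both streams run concurrently
print(f"with blocking sync: {serialized} ms, without: {overlapped} ms")
```

In this model the step time with the blocking sync is the sum of the two durations, while without it the step time is the longer of the two, which is the overlap the side stream is meant to provide.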