Hello, I think there's an inconsistency between training and sampling:
During training, the noise prediction for the current frame uses the previous latent state z_{t-1} and the current noise level k_t.
During sampling, however, the noise prediction network receives z_t^{new} (a latent that already incorporates the current noisy frame) and the target noise level K_{m,t}, rather than the actual current noise level K_{m+1,t}.
Why?