The models were accidentally trained with MoE layers instead of dense layers. We need to verify that the MoE routers are functioning properly.
Background:
- Intended to use all dense layers
- A config change was lost, so the models were trained with 1 initial dense layer followed by MoE layers
- Training appeared to proceed normally
Tasks:
- Verify the MoE routers are functioning properly
Question: Did the accidental MoE help or hurt the results?
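One way to start verifying router health is to check expert utilization and routing entropy from the routers' logits: a healthy top-k router spreads tokens across experts, while a collapsed router sends nearly everything to one expert. The sketch below is a minimal, framework-agnostic check, assuming you can export router logits as a `(num_tokens, num_experts)` array; the function name, `top_k` default, and `collapse_threshold` are illustrative choices, not anything from the training setup.

```python
import numpy as np

def router_health(router_logits, top_k=1, collapse_threshold=0.5):
    """Summarize MoE router behavior from raw router logits.

    router_logits: array of shape (num_tokens, num_experts).
    Returns per-expert utilization, entropy of the mean routing
    distribution, and a flag for possible router collapse
    (one expert receiving more than collapse_threshold of tokens).
    """
    num_tokens, num_experts = router_logits.shape

    # Numerically stable softmax over the expert dimension.
    z = router_logits - router_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

    # Top-k expert assignment per token, then count tokens per expert.
    top = np.argsort(-probs, axis=-1)[:, :top_k]
    counts = np.bincount(top.ravel(), minlength=num_experts)
    utilization = counts / counts.sum()

    # Entropy of the mean routing distribution; the maximum
    # possible value is log(num_experts) for a uniform router.
    mean_probs = probs.mean(axis=0)
    entropy = -(mean_probs * np.log(mean_probs + 1e-9)).sum()

    collapsed = utilization.max() > collapse_threshold
    return {"utilization": utilization,
            "entropy": float(entropy),
            "collapsed": bool(collapsed)}
```

Running this per MoE layer on a held-out batch gives a quick signal: near-uniform utilization and entropy close to log(num_experts) suggests the routers trained sensibly despite the config mix-up, while a `collapsed` flag on any layer points to a dead router.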