Grouter: Decoupling Routing from Representation for Accelerated MoE Training
arXiv:2603.06626v1
Announce Type: new
Abstract: Traditional Mixture-of-Experts (MoE) training typically proceeds without any structural priors, effectively requiring the model to simultaneously train expert weights while …
Yuqi Xu, Rizhen Hu, Zihan Liu, Mou Sun, Kun Yuan