Introduction

TL;DR

  • MoE activates only a small subset of expert FFNs per token (conditional computation), scaling total capacity without proportional per-token compute.
  • In Transformers, the mainstream pattern is replacing the dense FFN/MLP with an MoE FFN (router + experts).
  • Production bottlenecks often come from routing imbalance, capacity overflow (drops), all-to-all communication, and memory bandwidth; serving requires observability and cluster tuning.

Why it matters: MoE is a combined model + distributed-systems problem, not just a modeling trick.


What MoE Is (Router + Experts)

flowchart LR
  X[Input tokens] --> R[Router / Gating]
  R -->|Top-k| D[Dispatch]
  D --> E1["Expert 1 (FFN)"]
  D --> E2["Expert 2 (FFN)"]
  D --> EN["Expert N (FFN)"]
  E1 --> C["Combine (weighted sum)"]
  E2 --> C
  EN --> C
  C --> Y[Output tokens]

MoE was popularized as a sparsely-gated expert layer enabling conditional computation at scale.

Why it matters: Router + dispatch/collect are the most common sources of operational issues (imbalance, drops, communication).
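
For concreteness, here is a minimal sketch of the router + dispatch + combine pattern above, assuming a PyTorch-style setup; the class and parameter names (MoELayer, num_experts, top_k) are illustrative and not taken from any particular framework.

# Minimal MoE FFN sketch: router scores -> top-k experts -> weighted combine.
# Illustrative only; real systems add capacity limits, load-balancing losses, and fused kernels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)   # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model), tokens already flattened across batch and sequence
        logits = self.router(x)                                      # (tokens, experts)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)        # each token picks top-k experts
        weights = F.softmax(weights, dim=-1)                         # renormalize over the selected experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):                    # dispatch: group tokens per expert
            for slot in range(self.top_k):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out                                                   # combine: weighted sum of expert outputs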


Where MoE Fits in a Transformer (FFN Replacement)

flowchart TB
  subgraph Dense[Dense Transformer Block]
    A[Self-Attention] --> N1[Add & Norm]
    N1 --> F[Dense FFN] --> N2[Add & Norm]
  end

  subgraph MoE[MoE Transformer Block]
    A2[Self-Attention] --> M1[Add & Norm]
    M1 --> MF["MoE FFN (Router + Experts)"] --> M2[Add & Norm]
  end

The Switch Transformer paper (the JMLR version) describes MoE adoption as replacing the dense FFN with a routed set of expert FFNs, combined with simplified (Top-1) routing.

Why it matters: Anchoring MoE at the FFN makes it easier to reason about compute, memory, and communication costs.
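
To make the FFN replacement concrete, the sketch below swaps the dense FFN sub-layer for the illustrative MoELayer from the previous sketch; the block layout is simplified (post-norm, no dropout) and is an assumption, not the exact Switch or Mixtral architecture.

# Sketch: the only structural change vs. a dense block is the FFN sub-layer.
import torch.nn as nn

class MoETransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = MoELayer(d_model, d_ff, num_experts, top_k)   # replaces the dense FFN
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                                        # x: (batch, seq, d_model)
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)                                    # Add & Norm
        b, s, d = x.shape
        y = self.ffn(x.reshape(b * s, d)).reshape(b, s, d)       # MoE FFN acts per token
        return self.norm2(x + y)                                 # Add & Norm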


Routing Choices: Top-1 vs Top-2

Top-1 (Switch-style)

Switch Transformer emphasizes Top-1 routing: each token is dispatched to a single expert, which simplifies training and reduces routing and communication costs.

flowchart LR
  T[Token] --> R[Router]
  R -->|Top-1| E[One Expert]
  E --> O[Output]

Why it matters: Top-1 is often the most operationally friendly baseline for serving.
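
As a sketch, Top-1 selection is just an argmax over the router probabilities; the function below is illustrative and omits the capacity factor and auxiliary load-balancing loss that Switch-style training adds.

# Top-1 (Switch-style) routing sketch: one expert per token.
import torch
import torch.nn.functional as F

def top1_route(router_logits: torch.Tensor):
    probs = F.softmax(router_logits, dim=-1)      # (tokens, experts)
    gate, expert_idx = probs.max(dim=-1)          # chosen expert and its gate value per token
    return expert_idx, gate                       # output = gate * expert(x)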

Top-2 (Mixtral-style)

Mistral’s Mixtral announcement and paper describe routing each token to two of the eight experts at every layer (Top-2), with the two expert outputs combined additively.

flowchart LR
  T[Token] --> R[Router]
  R -->|Top-2| E1[Expert A]
  R -->|Top-2| E2[Expert B]
  E1 --> C[Combine]
  E2 --> C
  C --> O[Output]

Why it matters: Top-2 can increase system cost (communication/memory), so routing decisions must consider serving constraints.
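
A corresponding Top-2 sketch: take the two highest router logits and renormalize over that pair, so each token's output is a weighted sum of two expert outputs. This is illustrative rather than Mixtral's actual implementation.

# Top-2 (Mixtral-style) routing sketch: two experts per token, weights renormalized over the pair.
import torch
import torch.nn.functional as F

def top2_route(router_logits: torch.Tensor):
    top2_logits, expert_idx = torch.topk(router_logits, k=2, dim=-1)  # (tokens, 2)
    weights = F.softmax(top2_logits, dim=-1)                          # softmax over the two picks
    return expert_idx, weights                                        # output = w0*e0(x) + w1*e1(x)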

Routing alternative: Expert Choice

Expert Choice routing inverts the usual direction: instead of tokens picking their top experts, each expert selects a fixed-size set of tokens, which balances load by construction.

Why it matters: If imbalance/drops dominate, routing algorithm changes can be more effective than only tuning capacity.
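
A rough sketch of the idea: each expert takes a fixed number of tokens (its capacity), so no expert can be overloaded. The normalization and capacity formula here are simplified illustrations; the exact formulation is in the Expert Choice paper.

# Expert Choice sketch: experts pick tokens, so per-expert load is fixed by construction.
import torch
import torch.nn.functional as F

def expert_choice_route(router_logits: torch.Tensor, capacity: int):
    # router_logits: (tokens, experts) token-expert affinity scores
    affinity = F.softmax(router_logits, dim=-1)                        # illustrative normalization
    weights, token_idx = torch.topk(affinity.t(), k=capacity, dim=-1)  # each expert selects `capacity` tokens
    return token_idx, weights                                          # (experts, capacity)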


Serving Mixtral: vLLM vs TensorRT-LLM (at a glance)

  • vLLM Mixtral docs describe tensor-parallel MoE with sharded experts and a fused MoE kernel.
  • TensorRT-LLM docs explain Expert Parallelism vs Tensor Parallelism, and Triton’s TensorRT-LLM backend lists TP/PP/EP support.
flowchart LR
  U[Client] --> GW[Gateway]
  GW --> RT[Runtime/Engine]
  RT -->|vLLM| V[vLLM: sharded experts + fused MoE]
  RT -->|TensorRT-LLM| T[TensorRT-LLM: EP/TP strategies]

Why it matters: In MoE, the serving engine’s MoE kernels and parallelism strategy often define real throughput/latency.
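
For a concrete starting point, the snippet below uses vLLM's offline Python API with tensor parallelism; the model ID and tensor_parallel_size value are deployment-specific assumptions, not recommendations.

# Minimal vLLM sketch for Mixtral with tensor parallelism (expert weights sharded across GPUs).
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,   # adjust to the number of GPUs available
)
outputs = llm.generate(
    ["Explain mixture-of-experts in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.7),
)
print(outputs[0].outputs[0].text)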


Monitoring Template (separate the failure modes)

  • routing imbalance: expert_token_fraction, router_entropy
  • overflow/drops: token_drop_rate, expert_overflow_tokens_total
  • performance/comm: TTFT p95, decode p95, tokens/sec, optional all-to-all latency p95

Why it matters: Without separate metrics for imbalance/drops/comm, MoE optimization becomes guesswork.
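
A sketch of how the imbalance and overflow signals above can be derived from per-step routing data; the function and counter names are illustrative, not tied to a specific serving stack.

# Derive imbalance / overflow signals from one routing step.
import torch
import torch.nn.functional as F

def routing_metrics(router_logits, expert_idx, dropped, num_experts):
    # router_logits: (tokens, experts); expert_idx: (tokens,) chosen expert; dropped: (tokens,) bool
    probs = F.softmax(router_logits, dim=-1)
    router_entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean().item()
    counts = torch.bincount(expert_idx, minlength=num_experts).float()
    return {
        "router_entropy": router_entropy,                           # low entropy -> router collapse risk
        "expert_token_fraction": (counts / counts.sum()).tolist(),  # skew -> imbalance
        "token_drop_rate": dropped.float().mean().item(),           # overflow / capacity drops
    }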


Kubernetes Checks (Topology / CPU / RDMA / NCCL)

  • Topology Manager coordinates NUMA-aware alignment of CPU and device allocations.
  • CPU Manager policies (e.g., the static policy and its options) can be tuned for latency-sensitive pods.
  • GPUDirect RDMA enables direct GPU-to-peer-device data exchange over PCIe.
  • NCCL exposes environment variables for tuning and configuration (a minimal example follows below).

Why it matters: EP/all-to-all traffic makes MoE especially sensitive to topology and communication configuration, often impacting p95 latency.
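
As a minimal example of the NCCL item above: these are real NCCL environment variables, but the values shown are illustrative and cluster-specific, and in Kubernetes they would normally be set in the pod spec rather than in code.

# NCCL knobs commonly exported before launching an EP/TP inference job.
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")           # surface transport/topology selection in logs
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")   # pin the control-plane network interface
os.environ.setdefault("NCCL_IB_HCA", "mlx5")          # select InfiniBand HCAs (if RDMA is available)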


Conclusion

  • MoE scales model capacity via sparse, token-level expert activation.
  • In Transformers, MoE is most commonly applied by replacing FFN/MLP blocks.
  • Top-1 vs Top-2 is a cost/complexity tradeoff; imbalance, drops, and communication dominate operations.
  • Serving success depends on engine support (vLLM/TensorRT-LLM), monitoring, and Kubernetes topology tuning.

Summary

  • MoE = Router + Experts with sparse activation per token.
  • FFN replacement is the mainstream MoE placement in Transformers.
  • Production bottlenecks are often imbalance/drops/communication, not just FLOPs.

#MoE #MixtureOfExperts #SwitchTransformer #Mixtral #vLLM #TensorRTLLM #ExpertParallelism #Kubernetes #NCCL #GPUDirectRDMA

References

  • Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (2017-01-23) - https://arxiv.org/abs/1701.06538
  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (2022) - https://jmlr.org/papers/v23/21-0998.html
  • Mixtral of experts (2023-12-11) - https://mistral.ai/news/mixtral-of-experts
  • Mixtral of Experts (2024-01-08) - https://arxiv.org/abs/2401.04088
  • Mixture of Experts Explained (2023-12-11) - https://huggingface.co/blog/moe
  • Mixture-of-Experts with Expert Choice Routing (2022-02-18) - https://arxiv.org/abs/2202.09368
  • DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training (2022) - https://proceedings.mlr.press/v162/rajbhandari22a.html
  • vLLM Mixtral docs (2025) - https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/mixtral/
  • Expert Parallelism in TensorRT-LLM (2025-09-15) - https://nvidia.github.io/TensorRT-LLM/advanced/expert-parallelism.html
  • Kubernetes Topology Manager (2025-10-21) - https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/
  • Kubernetes CPU Manager Policies (2025-10-17) - https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/
  • GPUDirect RDMA and GPUDirect Storage (latest) - https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-rdma.html
  • NCCL Environment Variables (2025) - https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html