Introduction
TL;DR
- MoE activates only a small subset of expert FFNs per token (conditional computation), scaling total capacity without proportional per-token compute.
- In Transformers, the mainstream pattern is replacing the dense FFN/MLP with an MoE FFN (router + experts).
- Production bottlenecks often come from routing imbalance, capacity overflow (drops), all-to-all communication, and memory bandwidth; serving requires observability and cluster tuning.
Why it matters: MoE is a combined model + distributed-systems problem, not just a modeling trick.
What MoE Is (Router + Experts)
MoE was popularized as a sparsely-gated expert layer enabling conditional computation at scale.
Why it matters: Router + dispatch/collect are the most common sources of operational issues (imbalance, drops, communication).
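The pattern is easiest to see in code. Below is a minimal sketch of a router + experts layer, assuming PyTorch; the class and parameter names (SimpleMoE, num_experts, top_k) are illustrative and not taken from any cited codebase. A real implementation dispatches tokens to experts in batched form rather than looping over experts.

```python
# Minimal sketch of a sparsely-gated MoE layer (router + experts), assuming PyTorch.
# Names (SimpleMoE, num_experts, top_k) are illustrative, not from any cited codebase.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); production systems dispatch per expert instead of looping.
        logits = self.router(x)                                   # (tokens, num_experts)
        weights, idx = torch.topk(F.softmax(logits, dim=-1), self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)     # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)   # tokens that picked expert e
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
        return out
```

The loop makes the conditional computation visible: each token only ever touches `top_k` of the expert FFNs, regardless of how many experts exist in total.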
Where MoE Fits in a Transformer (FFN Replacement)
The Switch Transformer paper (JMLR version) describes MoE adoption as replacing the dense FFN with an MoE layer and simplifying routing.
Why it matters: Anchoring MoE at the FFN makes it easier to reason about compute, memory, and communication costs.
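To make the FFN-replacement point concrete, here is a hedged sketch of a pre-norm Transformer block in which the dense MLP slot is occupied by the illustrative SimpleMoE module from the previous sketch; attention details are standard PyTorch and the block structure is an assumption, not a reproduction of any specific model.

```python
# Sketch: a pre-norm Transformer block whose dense FFN slot is replaced by an MoE FFN.
# Reuses the illustrative SimpleMoE defined in the previous sketch.
import torch.nn as nn

class MoETransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        # Dense baseline would be: Linear(d_model, d_ff) -> GELU -> Linear(d_ff, d_model).
        self.ffn = SimpleMoE(d_model, d_ff, num_experts=num_experts, top_k=top_k)

    def forward(self, x):
        # x: (batch, seq, d_model)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        b, s, d = x.shape
        # The MoE layer is token-wise, so flatten (batch, seq) into a token dimension.
        x = x + self.ffn(self.norm2(x).reshape(b * s, d)).reshape(b, s, d)
        return x
```

Because only the FFN slot changes, the attention cost is unchanged; the extra parameters, memory, and dispatch traffic all live in `self.ffn`.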
Routing Choices: Top-1 vs Top-2
Top-1 (Switch-style)
Switch emphasizes Top-1 routing to simplify training and reduce costs.
Why it matters: Top-1 is often the most operationally friendly baseline for serving.
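The operational story of Top-1 is tied to expert capacity: each expert can only accept a bounded number of tokens, and tokens beyond that bound are dropped. A hedged sketch of the dispatch math, with illustrative helper names; the capacity formula follows the Switch Transformer description (capacity factor times tokens per expert).

```python
# Sketch of Switch-style Top-1 routing with a capacity factor, assuming PyTorch.
# Shows where "capacity overflow" drops come from; function and variable names are illustrative.
import math
import torch
import torch.nn.functional as F

def top1_dispatch(router_logits: torch.Tensor, num_experts: int, capacity_factor: float = 1.25):
    # router_logits: (tokens, num_experts)
    tokens = router_logits.shape[0]
    probs = F.softmax(router_logits, dim=-1)
    gate, expert_idx = probs.max(dim=-1)                  # Top-1: a single expert per token
    # Expert capacity: capacity_factor * (tokens / num_experts), rounded up.
    capacity = math.ceil(capacity_factor * tokens / num_experts)
    # Position of each token within its chosen expert's buffer (running count per expert).
    one_hot = F.one_hot(expert_idx, num_experts)          # (tokens, num_experts)
    position = ((one_hot.cumsum(dim=0) - 1) * one_hot).sum(dim=-1)
    kept = position < capacity                            # tokens past capacity are dropped
    return expert_idx, gate, kept, capacity
```

The `kept` mask is exactly what the token_drop_rate metric in the monitoring template below should be measuring.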
Top-2 (Mixtral-style)
Mistral’s Mixtral announcement and paper describe routing each token to two of the eight experts at every layer (Top-2).
Why it matters: Top-2 can increase system cost (communication/memory), so routing decisions must consider serving constraints.
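A hedged sketch of the Top-2 gate: pick two experts per token, softmax over the two selected logits, and mix the two expert outputs. This mirrors the general Top-2 recipe described for Mixtral; the helper names and looping structure are illustrative rather than any production kernel.

```python
# Sketch of Top-2 gating: two experts per token, weights renormalized over the selected pair.
# Assumes PyTorch; names are illustrative.
import torch
import torch.nn.functional as F

def top2_gate(router_logits: torch.Tensor):
    # router_logits: (tokens, num_experts)
    top2_logits, top2_idx = router_logits.topk(2, dim=-1)   # (tokens, 2)
    top2_weights = F.softmax(top2_logits, dim=-1)           # normalize over the two chosen experts
    return top2_idx, top2_weights

def top2_combine(x, experts, top2_idx, top2_weights):
    # x: (tokens, d_model); experts: list of expert FFN callables
    out = torch.zeros_like(x)
    for slot in range(2):                                   # every token is dispatched twice
        for e, expert in enumerate(experts):
            sel = (top2_idx[:, slot] == e).nonzero(as_tuple=True)[0]
            if sel.numel():
                out[sel] += top2_weights[sel, slot, None] * expert(x[sel])
    return out
```

The `for slot in range(2)` loop is the cost story in miniature: every token is dispatched (and, under expert parallelism, communicated) twice.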
Routing alternative: Expert Choice
Expert Choice routing proposes experts selecting tokens to improve load balancing.
Why it matters: If imbalance/drops dominate, routing algorithm changes can be more effective than only tuning capacity.
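A hedged sketch of the Expert Choice idea: instead of each token picking its top experts, each expert picks its top-capacity tokens from the batch, which bounds per-expert load by construction. The function name and the fixed `capacity` argument are illustrative simplifications of the paper's formulation.

```python
# Sketch of Expert Choice routing: each expert selects its top-`capacity` tokens,
# so per-expert load is balanced by construction. Assumes PyTorch; names illustrative.
import torch
import torch.nn.functional as F

def expert_choice(router_logits: torch.Tensor, capacity: int):
    # router_logits: (tokens, num_experts); transpose so experts choose over tokens.
    scores = F.softmax(router_logits, dim=-1).t()        # (num_experts, tokens)
    gate, token_idx = scores.topk(capacity, dim=-1)      # each expert keeps exactly `capacity` tokens
    # token_idx[e] lists the tokens routed to expert e; gate[e] holds their mixing weights.
    # Note: a token may be chosen by several experts, or by none at all.
    return token_idx, gate
```

Because every expert processes exactly `capacity` tokens, imbalance and overflow drops disappear by design; the tradeoff is that some tokens may receive no expert at all.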
Serving Mixtral: vLLM vs TensorRT-LLM (at a glance)
- vLLM Mixtral docs describe tensor-parallel MoE with sharded experts and a fused MoE kernel.
- TensorRT-LLM docs explain Expert Parallelism vs Tensor Parallelism, and Triton’s TensorRT-LLM backend lists TP/PP/EP support.
Why it matters: In MoE, the serving engine’s MoE kernels and parallelism strategy often define real throughput/latency.
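For orientation, a hedged sketch of serving Mixtral through vLLM's offline LLM API with tensor parallelism; the model name, GPU count, and sampling settings are illustrative, and the vLLM Mixtral docs remain the authoritative source for flags and kernel details.

```python
# Hedged sketch: loading Mixtral in vLLM with tensor parallelism (offline API).
# Model name, GPU count, and sampling values are illustrative; see the vLLM Mixtral docs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # MoE checkpoint (8 experts, Top-2 routing)
    tensor_parallel_size=2,                        # experts are sharded across GPUs
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain mixture-of-experts in one sentence."], params)
print(outputs[0].outputs[0].text)
```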
Monitoring Template (separate the failure modes)
- routing imbalance: expert_token_fraction, router_entropy
- overflow/drops: token_drop_rate, expert_overflow_tokens_total
- performance/comm: TTFT p95, decode p95, tokens/sec, optional all-to-all latency p95
Why it matters: Without separate metrics for imbalance/drops/comm, MoE optimization becomes guesswork.
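A hedged sketch of how the template above could be wired up with prometheus_client; the metric and label names mirror the list, but the label scheme is an assumption rather than a standard schema.

```python
# Sketch: the monitoring template expressed as Prometheus metrics via prometheus_client.
# Metric and label names mirror the list above; they are illustrative, not a standard schema.
from prometheus_client import Counter, Gauge, Histogram

# Routing imbalance
expert_token_fraction = Gauge(
    "expert_token_fraction", "Fraction of tokens routed to each expert", ["layer", "expert"])
router_entropy = Gauge(
    "router_entropy", "Entropy of the router distribution per layer", ["layer"])

# Overflow / drops
token_drop_rate = Gauge(
    "token_drop_rate", "Share of tokens dropped due to capacity overflow", ["layer"])
expert_overflow_tokens_total = Counter(
    "expert_overflow_tokens_total", "Tokens dropped past expert capacity", ["layer", "expert"])

# Performance / communication (p95 is computed from histograms at query time)
ttft_seconds = Histogram("ttft_seconds", "Time to first token")
decode_latency_seconds = Histogram("decode_latency_seconds", "Per-token decode latency")
all_to_all_latency_seconds = Histogram("all_to_all_latency_seconds", "MoE all-to-all latency")
```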
Kubernetes Checks (Topology / CPU / RDMA / NCCL)
- Topology Manager coordinates locality optimizations.
- CPU Manager policies can be tuned (static policy and options).
- GPUDirect RDMA enables direct data exchange between GPU memory and peer PCIe devices (such as NICs), bypassing host memory.
- NCCL provides environment variables for tuning and configuration.
Why it matters: EP/all-to-all traffic makes MoE especially sensitive to topology and communication configuration, often impacting p95 latency.
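A hedged sketch of the NCCL-related knobs from the checklist, set as environment variables before initializing the communicator; the variable names are documented NCCL settings, but the specific values and interface names are deployment-specific assumptions.

```python
# Sketch: setting documented NCCL environment variables before torch.distributed init.
# Values and interface names are deployment-specific assumptions; see the NCCL env docs.
import os
import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "INFO")          # surface topology/transport decisions in logs
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # restrict bootstrap traffic to a known NIC
os.environ.setdefault("NCCL_IB_HCA", "mlx5")         # prefer specific InfiniBand/RoCE HCAs (if present)
os.environ.setdefault("NCCL_NET_GDR_LEVEL", "PIX")   # how aggressively to use GPUDirect RDMA

# Rank and world size are normally injected by the launcher (torchrun, mpirun, or a Kubernetes operator).
dist.init_process_group(backend="nccl")
```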
Conclusion
- MoE scales model capacity via sparse, token-level expert activation.
- In Transformers, MoE is most commonly applied by replacing FFN/MLP blocks.
- Top-1 vs Top-2 is a cost/complexity tradeoff; imbalance, drops, and communication dominate operations.
- Serving success depends on engine support (vLLM/TensorRT-LLM), monitoring, and Kubernetes topology tuning.
Summary
- MoE = Router + Experts with sparse activation per token.
- FFN replacement is the mainstream MoE placement in Transformers.
- Production bottlenecks are often imbalance/drops/communication, not just FLOPs.
Recommended Hashtags
#MoE #MixtureOfExperts #SwitchTransformer #Mixtral #vLLM #TensorRTLLM #ExpertParallelism #Kubernetes #NCCL #GPUDirectRDMA
References
- [Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (2017-01-23)](https://arxiv.org/abs/1701.06538)
- [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (2022)](https://jmlr.org/papers/v23/21-0998.html)
- [Mixtral of experts (2023-12-11)](https://mistral.ai/news/mixtral-of-experts)
- [Mixtral of Experts (2024-01-08)](https://arxiv.org/abs/2401.04088)
- [Mixture of Experts Explained (2023-12-11)](https://huggingface.co/blog/moe)
- [Mixture-of-Experts with Expert Choice Routing (2022-02-18)](https://arxiv.org/abs/2202.09368)
- [DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training (2022)](https://proceedings.mlr.press/v162/rajbhandari22a.html)
- [vLLM Mixtral docs (2025)](https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/mixtral/)
- [Expert Parallelism in TensorRT-LLM (2025-09-15)](https://nvidia.github.io/TensorRT-LLM/advanced/expert-parallelism.html)
- [Kubernetes Topology Manager (2025-10-21)](https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/)
- [Kubernetes CPU Manager Policies (2025-10-17)](https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/)
- [GPUDirect RDMA and GPUDirect Storage (latest)](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-rdma.html)
- [NCCL Environment Variables (2025)](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html)