Mixture of Experts
A model architecture that uses a routing mechanism to activate only a subset of specialized sub-networks (experts) for each input, increasing capacity without proportionally increasing compute.
What is Mixture of Experts?
Mixture of Experts (MoE) is a neural network architecture where the model contains multiple specialized sub-networks called "experts," but only activates a few of them for any given input. A learned routing mechanism (the "gate" or "router") examines each input token and decides which experts should process it.
In a standard dense transformer, every parameter participates in every computation. In an MoE transformer, each token passes through only 1-2 of potentially dozens of expert networks within each layer. This means the model can have many more total parameters (and thus more knowledge capacity) while keeping per-token compute cost manageable.
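The routing step described above can be sketched in a few lines. This is a toy illustration only, not any real model's implementation: the expert count, dimensions, and all weights are made up, and each "expert" is a single weight vector standing in for a full feed-forward block.

```python
import math
import random

random.seed(0)

NUM_EXPERTS = 4   # toy value; real models use 8-64 experts per layer
TOP_K = 2         # experts activated per token
DIM = 3           # toy hidden dimension

# Each "expert" is a stand-in for a feed-forward block: here, just an
# elementwise weight vector.
experts = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

# The router is a small linear layer: one scoring vector per expert.
router_w = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token):
    # 1. The router scores every expert for this token.
    logits = [sum(w * x for w, x in zip(wv, token)) for wv in router_w]
    probs = softmax(logits)
    # 2. Only the top-k experts are kept (sparse activation).
    top = sorted(range(NUM_EXPERTS), key=lambda i: probs[i], reverse=True)[:TOP_K]
    # 3. Gate weights are renormalized over the selected experts.
    norm = sum(probs[i] for i in top)
    gates = {i: probs[i] / norm for i in top}
    # 4. The output is the gate-weighted sum of the active experts' outputs;
    #    the other experts never run for this token.
    out = [0.0] * DIM
    for i, g in gates.items():
        for d in range(DIM):
            out[d] += g * experts[i][d] * token[d]
    return out, top

output, active = moe_forward([0.5, -1.0, 2.0])
print(f"active experts: {sorted(active)}")
```

Only `TOP_K` of the `NUM_EXPERTS` expert networks execute for each token, which is the entire source of the compute savings.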
For example, Mixtral 8x7B has 8 experts per MoE layer. Despite the name, these are not 8 independent 7B models: the experts replace only the feed-forward blocks, while the attention layers are shared, so the total parameter count is around 47 billion rather than 8 × 7 = 56 billion. Since only 2 experts are active per token, the compute per token is closer to that of a roughly 13B-parameter dense model. You get the knowledge of a much larger model with the speed of a smaller one.
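A quick back-of-the-envelope check makes the trade-off concrete. The figures below are published approximations for Mixtral 8x7B and vary slightly by source.

```python
# Approximate parameter counts for Mixtral 8x7B.
total_params = 46.7e9    # all 8 experts plus shared attention layers
active_params = 12.9e9   # 2 of 8 experts plus shared layers, per token

fraction = active_params / total_params
print(f"~{fraction:.0%} of parameters do work for any given token")
```

Under these numbers, under a third of the model's parameters participate in any single token's forward pass, while all of them contribute to the model's stored knowledge.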
Key characteristics of MoE models:
- Sparse activation — only a fraction of parameters are used per token
- Higher total capacity — more parameters mean more stored knowledge
- Efficient inference — compute per token stays low despite large parameter counts
- Larger memory footprint — all expert weights must be loaded even though only some are active
Why it matters for AI agents
MoE architectures represent a meaningful shift in the performance-cost trade-off for AI agents. For email processing agents that handle high volumes, the reduced per-token compute of MoE models translates directly to lower inference costs and faster response times.
An MoE model can process an email classification request with the quality of a large model but the speed and cost of a model several times smaller. This makes it practical to use more capable models for tasks where you'd otherwise default to a smaller, cheaper (and less accurate) option.
The expert specialization in MoE models also has interesting implications for email agents. Different experts may naturally specialize in different types of content. One expert might handle formal business language well, another might specialize in technical documentation, and a third might handle casual conversational email. The router activates the right experts based on the input, giving you implicitly specialized processing without explicit prompt engineering.
For self-hosted deployments, MoE models present a trade-off: they need enough memory to hold all expert weights (larger than an equivalently performing dense model), but they're faster per request because only a subset of parameters is active. If your email agent runs on hardware with sufficient RAM but you need fast throughput, MoE is an attractive option.
Open-source MoE models like Mixtral and DeepSeek have made this architecture accessible to agent developers. They offer strong performance at favorable cost points, especially for high-throughput applications where per-token speed matters.
Frequently asked questions
What is the difference between MoE and a regular transformer?
A regular (dense) transformer uses all its parameters for every token. An MoE transformer has multiple expert sub-networks per layer and uses a router to select which experts process each token. This lets MoE models have more total parameters (and knowledge) while using less compute per token than a dense model of equivalent quality.
Do MoE models have any downsides?
Yes. MoE models require more memory because all expert weights must be loaded, even though only some are active. They can also suffer from load balancing issues where some experts are overused and others are underused. Quantized MoE models partially address the memory issue but add complexity. For API-based usage, these are the provider's problem — you just see faster, cheaper inference.
Should I choose an MoE model for my email agent?
If you're self-hosting and need high throughput, MoE models offer excellent performance per compute dollar. If you're using API providers, MoE is often running behind the scenes already. The main practical question is whether the model performs well on your specific email tasks — run benchmarks on your actual workload rather than choosing architecture first.
What is the router in a Mixture of Experts model?
The router (or gate) is a small neural network that examines each input token and decides which expert sub-networks should process it. It outputs a probability distribution over experts, and the top-k experts (usually 1-2) are selected for that token. The router is trained alongside the experts during model training.
How many experts are typically active per token?
Most MoE models activate 1-2 experts per token out of 8-64 total experts per layer. For example, Mixtral activates 2 of 8 experts. This sparse activation is what makes MoE efficient — you get the knowledge capacity of all experts but only pay the compute cost of the active ones.
What are some popular MoE models?
Mixtral 8x7B and 8x22B by Mistral, DeepSeek-V2 and V3, and Grok by xAI are well-known MoE models. Many API providers also use MoE architectures internally without publicizing it, as the architecture enables better performance-per-cost ratios at scale.
Can MoE models be quantized?
Yes. MoE models can be quantized using the same techniques as dense models (GPTQ, GGUF, AWQ). Quantization is especially useful for MoE because it directly addresses their main drawback — high memory requirements. A 4-bit quantized MoE model can fit in a fraction of the memory while retaining most of its quality.
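The memory savings are easy to estimate from bytes per parameter. This sketch ignores the small overhead real quantized formats add for scales and metadata, and uses the approximate Mixtral 8x7B parameter count as the example.

```python
# Rough memory footprint of a ~47B-parameter MoE model at common precisions.
params = 46.7e9
bytes_per_param = {"fp16": 2.0, "int8": 1.0, "4-bit": 0.5}

for fmt, b in bytes_per_param.items():
    print(f"{fmt}: ~{params * b / 1e9:.0f} GB")
```

Going from fp16 to 4-bit cuts the footprint by about 4x, which is what makes large MoE models practical on a single high-memory machine.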
Why do MoE models use more memory than dense models?
All expert weights must be loaded into memory even though only a subset is active for any given token. Different tokens route to different experts, so every expert needs to be ready. Mixtral 8x7B, for example, must keep all ~47 billion parameters in memory even though per-token compute uses only the roughly 13 billion belonging to the two active experts and the shared layers.
How does MoE affect inference latency for email processing?
MoE typically reduces per-token latency compared to a dense model of equivalent quality because fewer parameters are active per computation. For email agents processing high volumes of messages, this translates to faster classification, extraction, and response generation without sacrificing output quality.
Is Mixture of Experts the same as an ensemble of models?
No. An ensemble runs multiple complete models and combines their outputs. MoE is a single model with specialized sub-networks inside it, where a learned router selects which sub-networks process each input. MoE is more efficient because only a fraction of the model runs per token, unlike ensembles where every model runs fully.