Distillation
A technique where a smaller 'student' model is trained to replicate the behavior of a larger 'teacher' model, producing a compact model that retains much of the original's capability.
What is distillation?
Knowledge distillation is a model compression technique where a large, capable model (the "teacher") is used to train a smaller model (the "student"). The student learns not from the original training data but from the teacher's outputs — its predictions, probability distributions, and intermediate representations.
The core insight is that a teacher model's output contains more information than hard labels alone. When a teacher classifies an email and assigns 70% probability to "support request," 20% to "sales inquiry," and 10% to "general feedback," those soft probabilities encode relationships between categories that a student model can learn from. Training on these soft targets transfers nuanced knowledge that training on one-hot labels would miss.
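To make the soft-target idea concrete, here is a minimal sketch of a distillation loss in pure Python: the student is trained to match the teacher's full (temperature-softened) probability distribution via cross-entropy, not just the top label. The logit values and function names here are illustrative, not from any particular library.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities; a higher temperature softens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's softened distribution.

    Minimizing this pushes the student toward the teacher's full probability
    distribution, transferring the inter-class relationships that one-hot
    labels would discard.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(p_teacher, p_student))

# Illustrative teacher logits that yield roughly the 70/20/10 split from the
# text over [support request, sales inquiry, general feedback]
teacher = [2.0, 0.75, 0.05]
aligned_student = [1.9, 0.8, 0.1]    # agrees with the teacher's ranking
confused_student = [0.1, 0.1, 2.0]   # puts its mass on the wrong class

assert distillation_loss(teacher, aligned_student) < distillation_loss(teacher, confused_student)
```

The temperature parameter matters in practice: softening both distributions exaggerates the small probabilities on non-top classes, which is exactly the "dark knowledge" the student is meant to absorb.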
Distillation typically works in one of several ways:
- Response distillation — the student is trained on the teacher's final outputs for a large set of inputs
- Feature distillation — the student learns to match the teacher's internal representations at intermediate layers
- Logit distillation — the student is trained on the teacher's full probability distribution over possible outputs
The result is a student model that's significantly smaller and faster than the teacher but retains a substantial portion of the teacher's capabilities. A well-distilled 7B parameter model can match or approach the performance of a 70B model on specific tasks.
Why it matters for AI agents
Distillation is how you get production-grade agent performance at production-grade costs. Large frontier models excel at complex reasoning, but running them for every email classification or response draft is expensive at scale. Distillation lets you train a small, cheap model that handles your specific tasks nearly as well.
For email agents, the distillation workflow is practical and proven. You start with a powerful teacher model processing your email workload. You collect its classifications, drafted responses, and decisions as training data. Then you distill a smaller model on that data. The resulting student model handles 80-90% of your volume at a fraction of the cost, while the teacher model handles only the cases the student is uncertain about.
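The data-collection step of that workflow can be sketched in a few lines. This is a hedged example, not a prescribed pipeline: `teacher_classify` is a hypothetical stand-in for whatever call you make to the teacher model, and the JSONL record shape is one common convention for student training data.

```python
import json

def collect_distillation_data(emails, teacher_classify, out_path="teacher_labels.jsonl"):
    """Run the teacher over an email workload and save its outputs as
    student training data, one JSON record per line.

    `teacher_classify` is a placeholder for your teacher-model call;
    here it is assumed to return a (label, confidence) pair.
    """
    with open(out_path, "w") as f:
        for email in emails:
            label, confidence = teacher_classify(email)
            record = {"input": email, "label": label, "confidence": confidence}
            f.write(json.dumps(record) + "\n")
```

Records that include the teacher's confidence (or full probability distribution) are more valuable than labels alone, for the soft-target reasons described above.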
This two-tier pattern — a distilled model for routine tasks and a frontier model for complex ones — is standard practice for high-volume email processing. A distilled model classifying incoming emails might cost $0.001 per message. The teacher model providing backup on edge cases might cost $0.05 per message. If the student handles 90% of traffic, your blended cost drops dramatically.
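The blended-cost arithmetic is worth making explicit. Using the illustrative figures above ($0.001 per student message, $0.05 per teacher message, 90% student coverage):

```python
def blended_cost(student_cost, teacher_cost, student_share):
    """Expected per-message cost when the student handles `student_share`
    of traffic and the teacher handles the remainder."""
    return student_share * student_cost + (1 - student_share) * teacher_cost

# Figures from the text: $0.001 student, $0.05 teacher, 90% student coverage
cost = blended_cost(0.001, 0.05, 0.90)  # ≈ $0.0059 per message, vs $0.05 teacher-only
```

That is roughly an 8x reduction against routing everything to the teacher, and the savings grow directly with the fraction of traffic the student can absorb.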
Distillation also reduces latency. Smaller models generate responses faster, which matters when users expect near-instant email triage or when agents need to process hundreds of messages per minute. A distilled model running on a single GPU can achieve throughput that would require expensive multi-GPU setups for the teacher model.
The open-source ecosystem has embraced distillation. Many popular open-source models were distilled from larger proprietary models, and tools like TRL and OpenRLHF make it straightforward to distill your own task-specific student models.
Frequently asked questions
How is distillation different from fine-tuning?
Fine-tuning adapts a model to a new task using labeled examples (input-output pairs). Distillation trains a smaller model to mimic a larger model's behavior, capturing the teacher's nuanced predictions and decision patterns rather than just hard labels. Distillation typically produces a different (smaller) model, while fine-tuning modifies an existing model. You can also combine them — fine-tune a small model using a teacher's outputs as training data.
How much smaller can a distilled model be?
It depends on the task. For focused tasks like email classification or entity extraction, a student model can be 5-10x smaller than the teacher with minimal quality loss. For open-ended generation tasks, the gap narrows — you may only achieve 2-3x compression before quality degrades noticeably. The key is that distillation works best when the target task is well-defined and narrow.
Can I distill from proprietary models like GPT-4 or Claude?
You can use their outputs as training data for a student model, and many open-source models were created this way. However, most API providers' terms of service restrict or prohibit using outputs to train competing models. Check the specific terms of the model you're using. For internal use cases (training a model for your own email agent), the practical risk is low, but the legal landscape is still evolving.
What is the two-tier model pattern for email agents?
Use a distilled student model for routine, high-volume tasks like email classification and simple responses, and route complex or uncertain cases to a larger teacher model. This pattern typically handles 80-90% of email volume at low cost while maintaining quality on edge cases. The blended cost is far lower than using the teacher model for everything.
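A minimal confidence-based router captures this pattern. The classifier functions here are hypothetical stand-ins (each assumed to return a (label, confidence) pair), and the 0.85 threshold is an arbitrary example you would tune against your own quality targets.

```python
def route(email, student_classify, teacher_classify, threshold=0.85):
    """Two-tier routing: trust the student when it is confident,
    otherwise escalate to the teacher.

    Both classifiers are placeholders assumed to return (label, confidence).
    Returns the chosen label and which tier produced it.
    """
    label, confidence = student_classify(email)
    if confidence >= threshold:
        return label, "student"
    label, _ = teacher_classify(email)
    return label, "teacher"
```

Logging which tier handled each message also gives you the escalation rate for free, which is the number to watch when deciding whether the student needs retraining.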
How much data do you need to distill a model for email tasks?
For focused tasks like email classification, 5,000-10,000 teacher-labeled examples are often sufficient. For more complex generation tasks like drafting customer replies, 20,000-50,000 examples produce better results. The data should cover the full range of scenarios the student model will encounter, including edge cases and error conditions.
Does distillation reduce hallucination?
Not inherently. A distilled model learns from the teacher's outputs, including any hallucinations. However, distillation on a narrow, well-defined task can reduce hallucination because the student model learns domain-specific patterns rather than relying on broad, potentially unreliable knowledge. Combining distillation with grounded retrieval (RAG) provides the strongest hallucination reduction.
What is the difference between distillation and quantization?
Distillation trains a new, smaller model to replicate a larger model's behavior. Quantization compresses an existing model by reducing the precision of its weights (e.g., from 32-bit to 4-bit). Distillation changes the model architecture. Quantization changes the number representation. Both reduce size and cost, and they can be combined for maximum compression.
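To illustrate the contrast, here is a toy sketch of symmetric per-tensor quantization: no new model is trained, the existing weights are simply mapped to low-precision integer codes plus a scale. This is a simplified illustration, not a production scheme (real quantizers use per-channel scales, calibration, and packed storage).

```python
def quantize_symmetric(weights, bits=4):
    """Map float weights to signed integer codes with `bits` bits
    (symmetric, per-tensor). Dequantization is w ≈ code * scale."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4-bit signed
    scale = (max(abs(w) for w in weights) or 1.0) / qmax  # guard against all-zero weights
    codes = [round(w / scale) for w in weights]
    return codes, scale

codes, scale = quantize_symmetric([0.5, -0.25, 1.0], bits=4)
recovered = [c * scale for c in codes]  # approximate originals, with rounding error
```

The student in distillation has a genuinely different (smaller) architecture; the quantized model above is the same architecture with coarser numbers. That is also why the two compose well.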
How do you measure whether distillation was successful?
Compare the student model against the teacher on your task-specific benchmark. Measure accuracy, response quality, hallucination rate, and latency. A successful distillation produces a student that matches 90-95% of the teacher's quality at a fraction of the cost and latency. If the quality gap is too large, you may need more training data or a larger student architecture.
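For a classification task, the simplest such benchmark number is the student's agreement rate with the teacher on a held-out set. A minimal sketch (metric name and shape are illustrative):

```python
def agreement_rate(student_preds, teacher_preds):
    """Fraction of held-out inputs where the student's label matches the
    teacher's — a simple proxy for distillation quality on classification.

    Note this measures fidelity to the teacher, not ground-truth accuracy;
    report both when human labels are available.
    """
    assert len(student_preds) == len(teacher_preds), "prediction lists must align"
    matches = sum(s == t for s, t in zip(student_preds, teacher_preds))
    return matches / len(student_preds)
```

Agreement alone can hide systematic errors (the student may agree on easy cases and diverge exactly where it matters), so pair it with per-category breakdowns and latency measurements before declaring the distillation successful.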
Can distillation help email agents respond faster?
Yes. Distilled models are smaller and generate tokens faster than their teacher models. For email agents where response latency matters — such as real-time inbox triage or auto-reply systems — a distilled model can reduce response time from seconds to milliseconds while maintaining acceptable quality for routine messages.
What open-source tools support model distillation?
Popular tools include Hugging Face TRL (Transformer Reinforcement Learning), OpenRLHF, and distillation features built into frameworks like PyTorch and TensorFlow. These tools provide training pipelines that handle the teacher-student training loop, soft-target loss functions, and evaluation against teacher baselines.