
Inference

The process of running input data through a trained AI model to get a prediction or output. Every API call to an LLM is an inference request.


What is inference?

Inference is the act of using a trained AI model to generate output. When you send a message to ChatGPT, Claude, or any LLM API, the model runs inference on your input and returns a response. Training teaches the model what to know. Inference is the model using what it learned.

In the context of LLMs, each inference request involves:

  1. Tokenizing the input text into tokens the model understands
  2. Processing those tokens through the model's neural network layers
  3. Generating output tokens one at a time (autoregressive generation)
  4. Returning the complete response
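The loop above can be sketched as toy code. The tokenizer and "model" here are simple stand-ins, not any real library; a real model would run tokens through neural network layers and sample from a probability distribution:

```python
def tokenize(text):
    # Stand-in tokenizer: one token per word (real tokenizers use subwords).
    return text.split()

def next_token(tokens):
    # Stand-in model: repeats the last token, stopping once the
    # sequence reaches 5 tokens. A real model predicts the next
    # token from the full sequence so far.
    return "<eos>" if len(tokens) >= 5 else tokens[-1]

def infer(prompt, max_tokens=10):
    tokens = tokenize(prompt)            # 1. tokenize the input
    output = []
    for _ in range(max_tokens):          # 3. generate one token at a time
        tok = next_token(tokens)         # 2. run tokens through the "model"
        if tok == "<eos>":
            break
        tokens.append(tok)               # autoregressive: output feeds back in
        output.append(tok)
    return " ".join(output)              # 4. return the complete response

print(infer("hello world"))  # → world world world
```

The key property to notice is step 3 feeding back into step 2: each generated token becomes part of the input for the next one, which is why output length drives both latency and cost.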

Why inference costs matter

Every API call to an LLM is billed based on inference — specifically, the number of input and output tokens processed. This is why context engineering matters: a bloated context means more input tokens per inference call, which means higher costs.

For AI agents that process email at scale, inference costs compound quickly. An agent handling 1,000 emails per day makes at least 1,000 inference calls. If each call includes unnecessary context, costs multiply.
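The compounding is easy to quantify. A rough sketch, using hypothetical prices ($3 per million input tokens, $15 per million output tokens; check your provider's actual rates):

```python
# Hypothetical prices; real rates vary by provider and model.
INPUT_PRICE = 3.00 / 1_000_000    # dollars per input token
OUTPUT_PRICE = 15.00 / 1_000_000  # dollars per output token

def daily_cost(calls_per_day, input_tokens, output_tokens):
    # Cost of one call times the number of calls per day.
    per_call = input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE
    return calls_per_day * per_call

# Same agent handling 1,000 emails/day, lean vs bloated context.
lean = daily_cost(1_000, input_tokens=2_000, output_tokens=500)
bloated = daily_cost(1_000, input_tokens=20_000, output_tokens=500)
print(f"lean: ${lean:.2f}/day, bloated: ${bloated:.2f}/day")
# → lean: $13.50/day, bloated: $67.50/day
```

At these assumed rates, a 10x larger context turns a $13.50/day agent into a $67.50/day agent with no change in output.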

Inference vs training

Training happens once (or periodically) and is extremely expensive. Inference happens on every request and is comparatively cheap per call — but adds up at volume. Most developers interact only with the inference side of AI models through APIs.

Frequently asked questions

What does inference mean in AI?

Inference is the process of running input through a trained AI model to produce output. Every time you send a prompt to an LLM and get a response, that's inference. It's the model applying what it learned during training to new data.

Why is inference important for AI costs?

LLM APIs charge per token processed during inference. The more tokens in your input (context) and output, the higher the cost. For AI agents processing thousands of requests, optimizing inference efficiency directly impacts operating costs.

What is the difference between training and inference?

Training is the process of teaching a model by feeding it data — it happens once and is very expensive. Inference is using the trained model to generate responses — it happens on every API call and costs much less per request, but adds up at scale.

What is inference latency and why does it matter?

Inference latency is the time between sending a request and receiving a response. For email agents, high latency can delay responses to customers and slow down automated workflows. Latency depends on model size, input length, output length, and server load.

How do you reduce inference costs for AI agents?

Reduce input token count by trimming unnecessary context. Use smaller models for simple tasks like classification. Cache responses for repeated queries. Batch similar requests when possible. Choose the right model size for each task rather than using the largest model for everything.
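Caching is the easiest of these to sketch. A minimal example using Python's standard-library `lru_cache`; `classify_email` is a hypothetical stand-in for a real inference call:

```python
import functools

@functools.lru_cache(maxsize=1024)
def classify_email(subject: str) -> str:
    # Stand-in for a real LLM API call. With the cache, a repeated
    # subject line returns instantly and costs zero extra inference.
    return "spam" if "winner" in subject.lower() else "inbox"

classify_email("You are a WINNER!")      # first call: runs "inference"
classify_email("You are a WINNER!")      # second call: served from cache
print(classify_email.cache_info().hits)  # → 1
```

This only pays off for exact repeats; fuzzy or semantic caching needs more machinery, but the principle is the same: never pay for the same inference twice.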

What is the difference between input tokens and output tokens in inference?

Input tokens are the tokens in your prompt (system instructions, context, user message). Output tokens are what the model generates in its response. Most APIs price output tokens higher than input tokens because generation is more computationally intensive than processing input.

What is batch inference?

Batch inference processes multiple requests together rather than one at a time. It can be more cost-effective and efficient for tasks that don't need real-time responses, like classifying a batch of emails overnight. Some API providers offer discounted pricing for batch inference jobs.
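The grouping step can be sketched in a few lines (the batch sizes and email names here are illustrative; how a batch job is actually submitted is provider-specific):

```python
def batches(items, size):
    # Group requests so each batch job carries `size` items,
    # instead of issuing one inference call per item.
    for i in range(0, len(items), size):
        yield items[i:i + size]

emails = [f"email-{n}" for n in range(7)]
jobs = list(batches(emails, size=3))
print(len(jobs))   # → 3 jobs instead of 7 individual calls
print(jobs[0])     # → ['email-0', 'email-1', 'email-2']
```

Each job is then submitted as one unit to the provider's batch endpoint, which typically trades latency (results arrive later) for a lower per-token price.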

How does context window size affect inference?

Larger context windows allow more input data per inference call but increase cost and latency proportionally. An agent that includes 100K tokens of context in every call pays significantly more than one that includes only the relevant 5K tokens. Context engineering is the practice of optimizing what goes into each call.

Can inference be run locally instead of through an API?

Yes. Open-source models can be run locally on GPUs, eliminating per-call API costs and reducing latency. Local inference gives you full control over the model and data privacy. The trade-off is the upfront cost of hardware and the operational overhead of managing model deployments.

What is streaming inference?

Streaming inference returns output tokens as they are generated rather than waiting for the complete response. This reduces perceived latency because the user or agent starts receiving data immediately. For email agents, streaming is useful when drafting long responses that benefit from progressive rendering.
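The consumption pattern looks like iterating over a generator. A minimal sketch, where `stream_tokens` is a stand-in for a real streaming API:

```python
def stream_tokens(text):
    # Stand-in for a streaming API: yields tokens as they are
    # "generated" instead of returning the full response at the end.
    for token in text.split():
        yield token

received = []
for token in stream_tokens("Thanks for reaching out, we will follow up soon"):
    received.append(token)  # the caller can render each token immediately
print(" ".join(received))
```

The total time to the last token is unchanged; what improves is time to the *first* token, which is what a user watching a draft appear actually perceives.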
