
Tokens

The basic units of text that language models read and generate, roughly equivalent to three-quarters of a word.


What is a Token?

A token is the fundamental unit of text that a language model processes. Before a model reads any text, a tokenizer breaks it down into tokens. These are not always full words. Common words like "the" or "hello" are single tokens. Longer or less common words get split into multiple tokens. The word "unbelievable" might become three tokens: "un", "believ", "able". Punctuation, spaces, and special characters are also tokens.

Different models use different tokenization schemes, but most modern English-language models average roughly 0.75 words per token (about 1.3 tokens per word), or about 4 characters per token. This means a 1,000-word document is approximately 1,300 tokens, though the exact count depends on vocabulary complexity and formatting.
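These rules of thumb can be captured in a couple of lines. This is a rough estimator only, using the approximate ratios above; for billing-accurate counts you would use the model's own tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters-per-token rule of thumb."""
    return max(1, round(len(text) / 4))

def estimate_tokens_from_words(word_count: int) -> int:
    """Rough token estimate using the ~0.75 words-per-token rule of thumb."""
    return round(word_count / 0.75)
```

For example, `estimate_tokens_from_words(1000)` gives 1333 tokens, in line with the ~1,300-token figure above.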

Tokens matter for three practical reasons. First, they determine cost. Most AI API providers charge per token for both input (what you send) and output (what the model generates). Second, they define the context window limit. A model with a 128K token context window can process about 96,000 words in a single request. Third, they affect speed. More tokens mean longer processing times because the model must attend to each token when generating output.
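Since providers price input and output tokens separately, the cost of a single request is a simple weighted sum. A minimal sketch, with per-million-token prices passed in as parameters (the prices in the usage example are hypothetical, not any provider's actual rates):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_million: float,
                 output_price_per_million: float) -> float:
    """Dollar cost of one request given per-million-token prices."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1e6
```

With illustrative prices of $3 per million input tokens and $15 per million output tokens, a request with 100,000 input tokens and 1,000 output tokens costs `request_cost(100_000, 1_000, 3.0, 15.0)`, i.e. about $0.315.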

Why It Matters for AI Agents

Token economics directly shape how AI agents are designed and operated. An email agent that processes hundreds of messages per day accumulates significant token usage. Every email body, system prompt, retrieved document, and generated reply consumes tokens that translate to API costs.

Smart agent design minimizes unnecessary token usage without sacrificing quality. This means writing concise system prompts, retrieving only the most relevant context via RAG rather than stuffing everything into the window, summarizing long email threads instead of including them verbatim, and setting reasonable output length limits for replies.
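One of these techniques, trimming long histories to a token budget, can be sketched in a few lines. This is an illustrative approach (keep the most recent messages that fit), using the rough 4-characters-per-token estimate rather than a real tokenizer:

```python
def trim_history(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages that fit within a rough token budget."""
    kept, used = [], 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = max(1, round(len(msg) / 4))  # ~4 chars per token
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order
```

A production agent might instead summarize the dropped older messages rather than discarding them outright, trading a small summarization cost for preserved context.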

For email infrastructure platforms like LobsterMail, understanding tokens helps agent builders predict costs and optimize throughput. A typical short business email might be 200 tokens. The system prompt for the agent might be 500 tokens. Retrieved context adds another 1,000 tokens. The generated reply might be 300 tokens. That is roughly 2,000 tokens per email processed. At scale, these numbers determine whether an agent workflow is economically viable.
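The per-email budget above can be written down explicitly, which makes it easy to project daily token volume. The figures mirror the rough numbers in the paragraph; real budgets will vary:

```python
# Hypothetical per-email token budget (mirrors the rough figures above).
EMAIL_BUDGET = {
    "email_body": 200,          # input
    "system_prompt": 500,       # input
    "retrieved_context": 1000,  # input
    "generated_reply": 300,     # output
}

def tokens_per_day(budget: dict, emails_per_day: int) -> int:
    """Total tokens consumed across all emails processed in a day."""
    return sum(budget.values()) * emails_per_day
```

The budget sums to 2,000 tokens per email, so at 10,000 emails a day the agent consumes 20 million tokens, the kind of number that decides economic viability.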

Token limits also influence multi-agent architectures. When agents communicate with each other, they pass messages that consume tokens on both sides. Designing compact, structured message formats between agents keeps inter-agent communication costs low. Structured output formats like JSON are often more token-efficient for machine-to-machine communication than natural language prose.
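The token savings from structured inter-agent messages are easy to see side by side. A hypothetical example comparing a compact JSON message against an equivalent natural-language request, using the rough 4-characters-per-token estimate:

```python
import json

# Compact JSON message between agents (field names are illustrative).
structured = json.dumps(
    {"intent": "schedule_reply", "thread_id": "t-123", "priority": "high"},
    separators=(",", ":"),  # no whitespace between fields
)

# The same request phrased as natural-language prose.
prose = ("Hello fellow agent, could you please schedule a reply for the "
         "email thread with identifier t-123? It should be treated as "
         "high priority. Thanks in advance.")

def rough_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token."""
    return max(1, round(len(text) / 4))
```

Here the JSON form comes out at well under half the estimated tokens of the prose version, and the gap compounds across every message the agents exchange.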

Frequently asked questions

How do I count tokens before sending a request?
Most AI providers offer tokenizer tools or libraries. OpenAI provides tiktoken, and Anthropic provides a token counting API. You can also estimate by dividing your word count by 0.75. For precise counts, use the provider's official tokenizer since different models tokenize text differently.
Why do tokens cost money?
Each token requires computational work during inference. The model must process every input token to understand context, and generating each output token requires a forward pass through the neural network. More tokens mean more GPU compute time, which providers pass on as per-token pricing.
How can AI agents reduce token costs?
Write concise system prompts, use RAG to retrieve only relevant context instead of including everything, summarize long conversation histories, set output length limits, cache repeated prompts where the API supports it, and use structured formats for inter-agent communication. Small optimizations compound at high volume.
What is the difference between input tokens and output tokens?
Input tokens are the text you send to the model (system prompt, user message, context). Output tokens are the text the model generates in response. Output tokens are typically 3-5x more expensive than input tokens because each output token requires a sequential forward pass through the model, while input tokens are processed in parallel.
How many tokens does a typical email use?
A short business email is roughly 100-300 tokens. A long email with formatting and signatures might be 500-1,000 tokens. When you add the system prompt (500-2,000 tokens) and any retrieved context (500-2,000 tokens), processing a single email typically costs 1,500-5,000 total tokens per inference call.
What is a context window in relation to tokens?
The context window is the maximum number of tokens a model can process in a single request, including both input and output. A 128K context window means the combined total of your prompt, context, and generated response cannot exceed roughly 128,000 tokens. Exceeding this limit causes the request to fail.
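A pre-flight check for this limit is straightforward: reserve room for the maximum output you intend to allow and verify the total fits. A minimal sketch, with the 128K window as a default:

```python
def fits_context(input_tokens: int, max_output_tokens: int,
                 window: int = 128_000) -> bool:
    """True if the prompt plus reserved output fits in the context window."""
    return input_tokens + max_output_tokens <= window
```

For example, 96,000 input tokens with 4,000 reserved output tokens fits comfortably, while 127,000 input tokens with the same reservation would be rejected before wasting an API call.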
Do different languages use different amounts of tokens?
Yes. English is generally the most token-efficient language because most tokenizers are optimized for English text. Other languages, especially those with non-Latin scripts like Chinese, Japanese, or Arabic, often require more tokens per word or concept, increasing both costs and context window consumption.
What is prompt caching and how does it save tokens?
Prompt caching lets you reuse the processed representation of a static prompt prefix across multiple requests, so you don't pay full input token costs each time. For agents with a consistent system prompt processing many requests, this can reduce input token costs significantly since the cached prefix is processed only once.
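The savings can be modeled with simple arithmetic. This sketch assumes a caching scheme where the static prefix is billed at full price on the first request and at a discounted rate on cache hits; the discount factor is a hypothetical parameter, since actual caching mechanics and pricing vary by provider:

```python
def input_cost(prefix_tokens: int, dynamic_tokens: int, requests: int,
               price_per_million: float, cache_read_discount=None) -> float:
    """Input-token cost over many requests sharing a static prompt prefix.

    If cache_read_discount is given (e.g. 0.1 for a hypothetical 90% discount
    on cached reads), the prefix is billed at full price once and at the
    discounted rate on every subsequent request.
    """
    dynamic = dynamic_tokens * requests * price_per_million / 1e6
    if cache_read_discount is None:
        prefix = prefix_tokens * requests * price_per_million / 1e6
    else:
        prefix = (prefix_tokens * price_per_million / 1e6
                  + prefix_tokens * (requests - 1)
                  * price_per_million * cache_read_discount / 1e6)
    return dynamic + prefix
```

With a 2,000-token system prompt, 500 dynamic tokens per request, and 100 requests, the cached variant costs a fraction of the uncached one, because the prefix dominates the input.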
How do tokens affect AI agent latency?
Output tokens are the primary latency driver because they're generated sequentially, one at a time. Input tokens affect latency less because they're processed in parallel. For email agents where response time matters, keeping output concise (shorter replies, structured JSON) directly reduces the time to complete each request.
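This asymmetry lends itself to a back-of-the-envelope latency model: input is processed in a fast parallel prefill pass, output is decoded one token at a time. The throughput numbers below are illustrative assumptions, not measured figures for any model:

```python
def estimated_latency(input_tokens: int, output_tokens: int,
                      prefill_tps: float = 5000.0,
                      decode_tps: float = 50.0) -> float:
    """Rough request latency in seconds: parallel prefill + sequential decode.

    prefill_tps and decode_tps are assumed throughputs (tokens/second).
    """
    return input_tokens / prefill_tps + output_tokens / decode_tps
```

Under these assumptions, a 2,000-token prompt with a 300-token reply spends about 0.4 s on prefill and 6 s on decoding, which is why trimming the reply length moves latency far more than trimming the prompt.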
Are tokens the same across different AI models?
No. Each model family uses its own tokenizer, so the same text may produce different token counts across models. A sentence that's 20 tokens in GPT-4 might be 22 tokens in Claude. Always use the specific model's tokenizer for accurate cost estimates rather than assuming cross-model equivalence.
