
Transformer

The neural network architecture behind all modern LLMs, using self-attention to process sequences of tokens in parallel rather than one at a time.


What is a transformer?#

The transformer is the neural network architecture that powers every major large language model — GPT, Claude, Llama, Gemini, and others. Introduced in the 2017 paper "Attention Is All You Need" by Google researchers, the transformer replaced older sequential architectures (RNNs, LSTMs) with a parallel approach based on self-attention.

Before transformers, language models processed text one token at a time, left to right. This sequential processing was slow and made it hard for models to connect distant words in a sentence. The transformer architecture processes all tokens simultaneously and uses attention mechanisms to let every token "attend to" every other token, regardless of distance.

A transformer consists of stacked layers, each containing:

  • Self-attention — computes how much each token should attend to every other token
  • Feed-forward network — processes each token's representation through a neural network
  • Normalization and residual connections — stabilize training and help gradients flow
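
The three components above can be sketched as a single toy layer. This is an illustrative NumPy sketch, not any production implementation: all weight shapes, the ReLU feed-forward, and the post-norm ordering are simplifying assumptions (real models add multi-head attention, causal masking, and learned norm parameters).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token's representation (no learned scale/shift here).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def self_attention(x, Wq, Wk, Wv):
    # Each token attends to every other token via scaled dot-product scores.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (seq, seq) relevance matrix
    return softmax(scores) @ v

def transformer_block(x, Wq, Wk, Wv, W1, W2):
    # Attention sub-layer with a residual connection and normalization...
    x = layer_norm(x + self_attention(x, Wq, Wk, Wv))
    # ...then a position-wise feed-forward network, also with a residual.
    ff = np.maximum(0, x @ W1) @ W2           # ReLU feed-forward
    return layer_norm(x + ff)

rng = np.random.default_rng(0)
seq, d, d_ff = 4, 8, 32
x = rng.standard_normal((seq, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
W1 = rng.standard_normal((d, d_ff)) * 0.1
W2 = rng.standard_normal((d_ff, d)) * 0.1
out = transformer_block(x, Wq, Wk, Wv, W1, W2)
print(out.shape)  # (4, 8) — same shape in, same shape out
```

Note that the output has the same shape as the input, which is what allows these blocks to be stacked dozens of times in a real model.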

The original architecture had both an encoder (processes input) and a decoder (generates output). Most modern LLMs use decoder-only architectures, which generate text autoregressively — one token at a time, using all previous tokens as context.
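
Autoregressive generation reduces to a simple loop: predict a token, append it to the context, repeat. The sketch below uses a hypothetical placeholder function in place of a real transformer forward pass, just to show the shape of the loop.

```python
def next_token(context):
    # Stand-in for a full transformer forward pass over the whole context.
    # Purely illustrative — not a real model.
    return (sum(context) + len(context)) % 50

def generate(prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Each step conditions on ALL previous tokens — the autoregressive loop.
        tokens.append(next_token(tokens))
    return tokens

out = generate([7, 3, 11], 5)
print(len(out))  # 8: the 3 prompt tokens plus 5 generated tokens
```

The key property is that every new token depends on the entire sequence so far, which is why decoder-only models can only produce output one token at a time.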

Transformers scale remarkably well. Increasing the number of parameters, training data, and compute generally produces better models. This scaling property is why the industry has invested billions in building ever-larger transformer-based models.

Why it matters for AI agents#

Every AI agent that uses an LLM is running on a transformer. Understanding the architecture explains many practical constraints that agent developers face daily.

The context window limit exists because the attention mechanism's memory and compute costs scale quadratically with sequence length. A 128K context window model needs 4x the attention compute of a 64K model. This directly limits how much email history an agent can process in a single call, and it is a key reason context engineering matters.
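
The quadratic relationship is easy to verify, assuming full self-attention with no sparsity tricks:

```python
def attention_cost(seq_len):
    # The attention score matrix has seq_len x seq_len entries, so
    # compute and memory grow with the square of the sequence length.
    return seq_len ** 2

ratio = attention_cost(128_000) / attention_cost(64_000)
print(ratio)  # 4.0 — doubling the context window quadruples attention cost
```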

Token-based pricing comes from the transformer architecture: input tokens are processed in parallel through the model's layers, and output tokens are generated one at a time. Input is cheaper because it's parallelized. Output is more expensive because each token requires a full forward pass through the model. For email agents, this means short classification responses are cheap, but long drafted replies are more expensive.
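
A quick cost comparison makes the asymmetry concrete. The per-token prices below are hypothetical round numbers chosen for illustration only; real rates vary widely by model and provider.

```python
# Hypothetical illustrative prices — NOT real rates for any provider.
INPUT_PRICE = 3.00 / 1_000_000    # $ per input token
OUTPUT_PRICE = 15.00 / 1_000_000  # $ per output token

def call_cost(input_tokens, output_tokens):
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# Classifying a long email thread: lots of input, a tiny output label.
classify = call_cost(20_000, 10)
# Drafting a reply to the same thread: same input, a long output.
draft = call_cost(20_000, 1_500)
print(classify, draft)
```

With these assumed rates, the two calls read identical input, yet the drafted reply costs noticeably more because of its output length alone — which is why trimming output is often the bigger cost lever.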

The autoregressive nature of transformers (generating one token at a time based on all previous tokens) is why streaming responses work. An email agent can start displaying a drafted reply before the full response is generated. It also explains why longer outputs take proportionally longer — each new token requires processing the entire preceding sequence.
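
Streaming falls directly out of this token-at-a-time generation: the model can hand each token to the caller as soon as it exists. A minimal sketch, using a hypothetical placeholder in place of a real model:

```python
def stream_tokens(prompt_tokens, max_new_tokens):
    # A generator that yields each token the moment it is produced —
    # the same pattern that lets an email agent render a draft while
    # the model is still generating. The "model" here is a stand-in.
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tok = (sum(context) + 1) % 100   # placeholder for a forward pass
        context.append(tok)
        yield tok                        # caller can display this immediately

first = next(stream_tokens([5, 9], 10))
print(first)  # the first token arrives before the other nine are computed
```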

For agent developers, the transformer architecture is an implementation detail you rarely interact with directly. But it shapes every constraint you work within: context limits, token costs, latency characteristics, and the fundamental capabilities and limitations of the models your agents rely on.

Frequently asked questions#

Why are transformers better than previous architectures?

Transformers process all tokens in parallel (during training) and use attention to connect any two positions regardless of distance. Previous architectures like RNNs processed tokens sequentially, making them slow to train and prone to "forgetting" early parts of long sequences. Transformers also scale better — larger models consistently produce better results.

Are all LLMs based on transformers?

Nearly all major LLMs today use the transformer architecture or close variants. Some researchers are exploring alternatives like state-space models (Mamba) and hybrid architectures that mix attention with other mechanisms. But as of now, transformers remain the dominant architecture for language models and the foundation of every commercially deployed LLM.

How does the transformer architecture affect agent costs?

Transformer inference costs are driven by two factors: the number of input tokens (processed in parallel) and the number of output tokens (generated sequentially). Input tokens are cheaper per token than output tokens. For email agents, this means loading long email threads as context is relatively affordable, but generating lengthy responses is where costs accumulate. Optimizing output length is often more impactful than trimming input context.

What is self-attention in transformers?

Self-attention is the mechanism that lets each token in a sequence compute a relevance score with every other token. This allows the model to understand relationships between distant words, like connecting a pronoun at the end of a paragraph to the noun it refers to at the beginning. It is the core innovation that makes transformers effective.
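
Those relevance scores can be shown in miniature. In this toy NumPy sketch (illustrative embeddings, no learned weights), each row of the resulting matrix holds one token's attention over every token in the sequence, and each row sums to 1:

```python
import numpy as np

x = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])                 # 3 tokens, 2-dim toy embeddings
scores = x @ x.T / np.sqrt(x.shape[-1])    # scaled dot-product relevance
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)  # softmax over each row
print(weights.sum(axis=-1))  # each token's attention weights sum to 1
```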

What does 'Attention Is All You Need' refer to?

It is the title of the 2017 Google research paper that introduced the transformer architecture. The paper demonstrated that attention mechanisms alone, without recurrence or convolution, could achieve state-of-the-art performance on language tasks. This paper is the foundation of all modern LLMs.

What is the difference between encoder and decoder transformers?

Encoder transformers process input bidirectionally and are used for understanding tasks like classification and entity extraction (e.g., BERT). Decoder transformers generate text autoregressively, one token at a time, and power most modern LLMs (e.g., GPT, Claude). Some models use both (e.g., the original T5 architecture).

Why do larger transformer models perform better?

Larger transformers have more parameters, allowing them to capture more nuanced patterns in language. Research has shown consistent scaling laws: performance improves predictably with more parameters, more training data, and more compute. This is why the AI industry has invested heavily in training increasingly large models.

Why does context window size matter for email agents?

The context window determines how much text an agent can process in a single request. Email agents often need to handle long threads, attached documents, and conversation history. A larger context window lets the agent process more of this information at once without truncation, leading to more informed and accurate responses.

What causes transformer inference latency?

Output generation is the main source of latency because each token is generated sequentially, with each new token requiring a full forward pass through the model. Input processing is faster because it runs in parallel. For email agents, this means short classification responses are fast, while longer drafted replies take proportionally longer.

Will transformers be replaced by other architectures?

Research into alternatives like state-space models (Mamba), RWKV, and hybrid architectures is active, with some showing promising efficiency gains for long sequences. However, transformers remain dominant for quality and have massive ecosystem investment. Any replacement would need to match transformer quality while offering significant practical advantages.
