
Retry, fallback, and dead letter patterns for agent email
Most agents retry blindly or fail silently. Here's the three-layer pattern that handles transient errors, urgent messages, and permanent failures.
Your agent sent a password reset request and received the verification email. Then it tried to forward the confirmation to your user. The send failed. The agent retried immediately. Failed again. Retried five more times in under two seconds. Now your sending reputation is damaged, the message is still undelivered, and nobody logged what happened.
Most agents handle delivery errors in one of two ways: retry blindly at machine speed, or give up silently. Both are destructive. Blind retries burn your domain reputation with receiving servers that interpret rapid-fire sends as spam. Silent failures mean lost messages that nobody notices until a user complains days later.
There's a three-layer pattern borrowed from message queue architecture that solves this: exponential backoff for transient errors, fallback channels for time-sensitive messages, and dead letter capture for permanent failures. If your agent sends any volume of email, it needs all three.
Why agents handle email failure differently than humans
When a human's email bounces, they read the error, wait, maybe fix the address, and try again later. The whole cycle takes minutes because a person is in the loop.
Agents operate at machine speed with no built-in patience. When a send returns an error, the default behavior in most agent frameworks is to retry the tool call immediately. An agent can fire 50 retries in the time it takes a human to read the error message, and receiving servers interpret that burst as a spam bot. Agents also don't distinguish between a 421 ("try again later") and a 550 ("permanently rejected"). One should trigger a delayed retry. The other should never be retried at all. And when retries are exhausted, most agents move on to their next task. The failed message disappears from context with no log, no alert, no recovery path.
The fix isn't teaching your agent to be more careful. It's giving it a structured error-handling pipeline with clear rules for each failure type.
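The first rule of that pipeline is classification. As a minimal sketch (the `Action` enum and `classify` function are names invented here for illustration):

```python
from enum import Enum

class Action(Enum):
    SUCCESS = "success"
    RETRY_WITH_BACKOFF = "retry"
    DEAD_LETTER = "dead_letter"

def classify(status_code: int) -> Action:
    """Map an SMTP status code to an error-handling action."""
    if status_code < 400:
        return Action.SUCCESS
    if status_code < 500:
        # 4xx: transient rejection, e.g. 421 "try again later"
        return Action.RETRY_WITH_BACKOFF
    # 5xx: permanent rejection, e.g. 550 -- never retry
    return Action.DEAD_LETTER
```

Everything that follows hangs off this three-way split: successes return, 4xx errors enter the backoff loop, and 5xx errors go to the dead letter queue.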
Layer 1: exponential backoff for transient errors
SMTP 4xx status codes are temporary rejections. The receiving server is saying "not right now." Common causes include rate limiting, greylisting (where the server deliberately rejects first-time senders to filter bots), and temporary outages.
The correct response is to wait and retry, with each wait period longer than the last. This is exponential backoff, the same pattern used in TCP congestion control and API rate limiting across distributed systems.
```python
import time
import random

def send_with_backoff(send_fn, message, max_attempts=5, base_delay=2.0):
    for attempt in range(max_attempts):
        result = send_fn(message)
        if result.status < 400:
            return result  # Success
        if result.status >= 500:
            # Permanent failure: route to dead letter, don't retry
            return route_to_dead_letter(message, result)
        # 4xx: transient error, back off and retry
        delay = base_delay * (2 ** attempt)
        jitter = delay * 0.2 * (2 * random.random() - 1)  # +/-20%
        time.sleep(delay + jitter)
    # All retries exhausted
    return route_to_dead_letter(message, result)
```

Start with a 1-2 second base delay. Double it on each attempt. Add random jitter of about ±20% to prevent thundering herds, where multiple agents retry at the exact same moment and overwhelm the server again. Cap total attempts at 5-7 and set a maximum delay ceiling around five minutes so your agent isn't sleeping for half an hour on attempt eight.
The jitter is the part people skip, and it matters more than you'd think. If your agent manages 10 inboxes and all of them hit a rate limit at the same time, synchronized retries will trigger the same rate limit again. Randomized jitter spreads the retries across a time window so each one has a better chance of succeeding.
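As a standalone helper, the capped, jittered schedule might look like this (the function name and the five-minute cap are assumptions drawn from the guidance above, not a fixed API):

```python
import random

def backoff_delay(attempt, base=2.0, cap=300.0):
    """Jittered, capped exponential delay for retry `attempt` (0-indexed)."""
    delay = min(base * (2 ** attempt), cap)           # ceiling: five minutes
    jitter = delay * 0.2 * (2 * random.random() - 1)  # +/-20% random offset
    return delay + jitter

# Uncapped doubling gives 2, 4, 8, 16, 32 seconds...; the cap stops
# runaway sleeps on late attempts while jitter desynchronizes retries.
```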
Layer 2: fallback channels for time-sensitive messages
Some emails can't survive a retry cycle. A verification code expires in 10 minutes. A meeting invite needs to arrive before the meeting starts. A password reset link has a 15-minute TTL.
For these, waiting and retrying isn't enough. You need a fallback path that activates when the primary send fails and the message has a time constraint.
```typescript
async function sendWithFallback(message: Message): Promise<SendResult> {
  const result = await primarySend(message);
  if (result.success) return result;

  if (message.priority === "time-sensitive" && result.isTransient) {
    // Skip the backoff queue, try an alternate path immediately
    const fallbackResult = await secondarySend(message);
    if (fallbackResult.success) return fallbackResult;
  }

  // Log for visibility even if both paths fail
  await alertQueue.push({
    message,
    primaryError: result.error,
    timestamp: Date.now(),
  });
  return result;
}
```
The fallback doesn't have to be another email provider. It could be a webhook notification to your application, a Slack message, or a log entry in your monitoring dashboard. The point is that time-sensitive failures should never vanish into a silent retry loop where nobody sees them until the window has already closed.
How you define "time-sensitive" depends on your use case. A simple approach: tag messages at creation time with a TTL. If the TTL is shorter than your backoff cycle would take, the fallback activates instead of the retry queue.
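That comparison can be sketched in a few lines, assuming a hypothetical `ttl_seconds` tag and the retry parameters used earlier (ignoring jitter for the estimate):

```python
def total_backoff_time(max_attempts=5, base_delay=2.0):
    """Worst-case seconds spent in the retry loop, ignoring jitter."""
    return sum(base_delay * (2 ** attempt) for attempt in range(max_attempts))

def should_use_fallback(ttl_seconds, max_attempts=5, base_delay=2.0):
    """Use the fallback if the message would expire before retries finish."""
    return ttl_seconds < total_backoff_time(max_attempts, base_delay)
```

With the defaults, the full backoff cycle takes about a minute, so a 30-second TTL routes to the fallback while a 10-minute verification code still gets the retry queue.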
Layer 3: dead letter queues for permanent failures
A 550 status code means the message was permanently rejected. The recipient address doesn't exist, your domain failed authentication, or the content triggered a policy block. Retrying the exact same message produces the exact same result and damages your sender reputation in the process.
If you've dealt with "550 denied by policy" errors before, you know the error code alone doesn't always tell the full story. Your dead letter queue should capture the complete SMTP response, not just the status number.
```python
from datetime import datetime

class DeadLetterQueue:
    def __init__(self, storage):
        self.storage = storage

    def capture(self, message, error, attempts):
        self.storage.insert({
            "message_id": message.id,
            "to": message.to,
            "subject": message.subject,
            "error_code": error.code,
            "error_detail": error.message,
            "attempts": attempts,
            "captured_at": datetime.utcnow(),
            "resolved": False,
        })

    def get_unresolved(self, limit=50):
        return self.storage.find(
            {"resolved": False},
            sort="captured_at",
            limit=limit,
        )
```
The dead letter queue serves three purposes. First, visibility: you can see which messages failed and why, instead of discovering problems when users complain a week later. Second, pattern analysis: clusters of DLQ entries with the same error code reveal systemic issues like missing DNS records or blocklist entries you didn't know about. Third, recovery: some "permanent" failures are actually fixable. A DNS record gets added, a blocklist entry gets removed, and the DLQ lets you re-attempt those sends after the root cause is resolved.
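The recovery pass can be sketched as a replay loop. The callbacks here (`send_fn` returning a bare SMTP status code, `mark_resolved` updating storage) are hypothetical stand-ins for your own send and persistence logic:

```python
def replay_dead_letters(entries, send_fn, mark_resolved):
    """Re-attempt failed DLQ entries after the root cause is fixed."""
    recovered = []
    for entry in entries:
        status = send_fn(entry)
        if status < 400:
            mark_resolved(entry["message_id"])  # keep the queue clean
            recovered.append(entry["message_id"])
        # Entries that still fail stay unresolved for the next pass
    return recovered
```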
Putting the three layers together
The decision flow is straightforward:
- Send the message.
- If it succeeds, done.
- If it's a 4xx and the message is time-sensitive, try the fallback channel.
- If it's a 4xx and not time-sensitive (or the fallback also failed), enter the exponential backoff loop.
- If it's a 5xx, route directly to the dead letter queue. Do not retry.
- If the backoff loop exhausts all attempts, route to the dead letter queue.
Every failed message ends up somewhere visible. Nothing disappears silently.
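As one possible sketch, the whole flow collapses into a single dispatcher. Every handler is passed in as a callback, and sends are assumed to return bare SMTP status codes for simplicity:

```python
def dispatch(message, send, fallback_send, backoff_send, dead_letter):
    """Route one message through the three-layer decision flow."""
    status = send(message)
    if status < 400:
        return ("delivered", status)                 # success: done
    if status >= 500:
        return ("dead_letter", dead_letter(message, status))  # 5xx: no retry
    if message.get("time_sensitive"):
        alt = fallback_send(message)                 # 4xx + urgent: fallback
        if alt < 400:
            return ("delivered_via_fallback", alt)
    return ("backoff", backoff_send(message))        # 4xx: backoff loop
```

Whatever path a message takes, it either delivers or lands in a queue you can inspect.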
What this looks like with managed infrastructure
Building all three layers from scratch is doable but tedious. You need retry scheduling, a persistence layer for the DLQ, monitoring, and error classification that correctly maps SMTP codes to "transient" vs. "permanent."
This is where managed email infrastructure earns its keep. LobsterMail's sending pipeline handles transient retry logic and error classification at the infrastructure level. Your agent gets back structured error responses it can act on, not raw SMTP codes it has to interpret, and you never have to build the transient retry layer from scratch.
Even with managed sending, you still want the fallback and dead letter layers in your own application code. No email provider can decide what "time-sensitive" means for your specific use case, and your DLQ should live in your system where you control the data and the recovery logic.
The agents that handle email reliably aren't the ones that never encounter errors. They're the ones with clear rules for what happens when errors occur: patience for the temporary stuff, urgency for the time-sensitive stuff, and a dead letter queue for everything that can't be saved on the first pass. Start with the backoff loop, add the DLQ, then layer in fallbacks as your agent's email patterns become clearer. If you're running into common agent email setup issues, fix those first, because no retry pattern compensates for broken authentication or a burned sending domain.
Frequently asked questions
What is a dead letter queue in email?
A dead letter queue (DLQ) is a storage area for email messages that permanently failed to send. Instead of discarding them or retrying forever, the DLQ captures the message and its error details so you can inspect, debug, and optionally re-send them after fixing the root cause.
How many times should an agent retry a failed email?
Five to seven attempts with exponential backoff is the standard recommendation. Start with a 1-2 second delay, double it each attempt, and add random jitter. Going beyond seven retries rarely helps and risks making the deliverability problem worse.
What is the difference between a 4xx and 5xx SMTP error?
A 4xx error is temporary: the server is saying "try again later." A 5xx error is permanent: the message was rejected and retrying won't change the outcome. Your agent should only retry on 4xx errors. 5xx errors should go straight to the dead letter queue.
Does exponential backoff improve email deliverability?
Yes. Receiving servers interpret rapid-fire retries as spam bot behavior. Exponential backoff spaces out retries so each attempt looks like normal sending activity. It also gives temporary issues (rate limits, greylisting) time to resolve on their own.
What is jitter in retry logic and why does it matter?
Jitter adds a small random offset to each retry delay. Without it, multiple agents that hit the same error at the same time will retry simultaneously, causing a "thundering herd" that overwhelms the server again. Jitter of ±20% of the delay spreads retries across a time window.
Should my agent retry on a 550 error?
No. A 550 error is a permanent rejection. Retrying the same message wastes resources and damages your sender reputation. Route 550 failures to a dead letter queue, identify the cause, fix it, and then re-send if appropriate.
How does LobsterMail handle email retries?
LobsterMail handles transient (4xx) retries at the infrastructure level with built-in exponential backoff. Permanent failures are returned as structured error responses your agent can route to its own dead letter queue or fallback logic. See the sending docs for details.
What is a thundering herd problem in email sending?
A thundering herd happens when many agents or processes retry failed sends at the exact same moment, creating a traffic spike that triggers the same rate limit or error they were trying to recover from. Adding random jitter to retry delays prevents this.
Do I need a dead letter queue if I use a managed email service?
Yes, for permanent failures. Managed services handle transient retries automatically, but 5xx rejections are returned to your application. Without a DLQ, those permanently failed messages vanish silently. Your DLQ captures them for inspection and potential recovery.
What is a fallback channel for email?
A fallback channel is an alternate delivery path for time-sensitive messages that can't wait through a full retry cycle. It could be a secondary email provider, a webhook to your app, or a notification system. It activates when the primary send fails and the message has a short TTL.
What is the best base delay for exponential backoff?
One to two seconds works well for email. Shorter delays (under 500ms) risk looking like automated spam. Longer initial delays (10+ seconds) add unnecessary latency to messages that might succeed on the second try. Double the base on each subsequent attempt.
Can I recover messages from a dead letter queue?
Yes, that's one of the main purposes. If a batch of messages failed because of a missing DNS record or a temporary blocklist entry, you can fix the root cause and replay the messages from the DLQ. Mark each entry as resolved after successful re-send to keep your queue clean.