
# How stateless email workers scale horizontally (and where they break)
Stateless email workers let agents scale horizontally, but deduplication, retry logic, and idempotency get tricky. Here's how to architect them right.
Your agent sends 50 emails per minute today. Next week it needs to send 5,000. The architecture you chose on day one determines whether scaling up is a config change or a full rewrite.
Stateless email workers are the standard answer to this scaling problem. Strip out per-message session data, let any worker instance pick up any job from the queue, and add more instances as volume grows. It sounds clean. In practice, email introduces a set of problems that generic "just add more workers" advice quietly ignores.
I've seen agent teams hit walls at 500 messages per hour because they assumed horizontal scale was automatic. It's not. Email has unique constraints around deduplication, rate limiting, and retry logic that require deliberate architecture decisions. This article covers how stateless email workers actually scale, where they fail, and what changes when the workers are serving autonomous agents instead of human-triggered workflows.
## Stateless vs stateful email workers
A stateless email worker holds no per-message session data between executions. Any instance can process any job from the queue, which means load balancing is simple round-robin and horizontal scale is near-linear. A stateful worker, by contrast, maintains session context (connection state, conversation threading, retry counters) across multiple operations on the same message or thread.
Here's how they compare across the dimensions that actually matter for email at volume:
| Dimension | Stateless email worker | Stateful email worker |
|---|---|---|
| Scaling model | Horizontal, add instances freely | Vertical, or horizontal with sticky sessions |
| Session storage | None (external queue + database) | In-memory or local store per instance |
| Load balancer type | Round-robin, random, least-connections | Session-affinity / sticky routing |
| Duplicate-send risk | Higher without idempotency keys | Lower (instance tracks own state) |
| Retry handling | External retry queue + dead-letter | Internal retry loop with backoff |
| Throughput ceiling | Near-linear with instance count | Bound by single-instance memory |
| Infrastructure cost | Low per instance, higher coordination | Higher per instance, simpler coordination |
The short version: stateless wins on raw scalability. Stateful wins on simplicity per individual message. Most high-volume email systems end up stateless because the throughput ceiling on stateful workers is too low for production loads.
## How stateless email workers scale horizontally
The basic pattern is a job queue (Redis, RabbitMQ, SQS, or a similar broker) sitting in front of N identical worker instances. Each worker pulls a job, processes it by sending or receiving an email, marks it complete, and moves on. No worker knows or cares what any other worker is doing.
Adding a second worker roughly doubles your throughput. Adding a tenth gives you close to 10x. The queue becomes the bottleneck, not the workers, and modern message brokers handle millions of messages per second. Asynchronous messaging is what makes this work: it decouples the "decide to send" step from the "actually send" step, so producers and consumers scale independently.
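The pattern above can be sketched in a few lines. This is a minimal illustration, not a production worker: `makeQueue` is a plain array standing in for the broker (Redis, RabbitMQ, SQS), and `deliver` is a placeholder for the real SMTP or API call.

```javascript
// Minimal sketch of the stateless pull loop. A plain array stands in for
// the broker, and deliver() is a placeholder for the real send call.
function makeQueue(jobs) {
  const pending = [...jobs];
  // The broker hands out one job at a time; undefined means empty.
  return { pull: () => pending.shift() };
}

function runWorker(queue, deliver) {
  const handled = [];
  let job;
  while ((job = queue.pull()) !== undefined) {
    deliver(job.payload); // send the email
    handled.push(job.id); // ack: with a real broker this removes the job
  }
  return handled;
}
```

Because no state lives inside the worker, you can run this same loop on as many instances as you like against one shared queue; each instance simply drains whatever jobs it manages to pull.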
But email isn't a generic job. Three problems show up reliably as you scale past a few hundred messages per minute.
### Deduplication under concurrent dequeue
When two workers pull the same job within milliseconds of each other (which happens more often than you'd expect at high concurrency), you get duplicate sends. The recipient receives two identical emails. For transactional messages, this is annoying. For agent-generated outreach, it's a credibility killer.
The fix is idempotency keys. Every email job gets a unique ID at enqueue time. Before sending, the worker checks a shared store to see if that ID has already been claimed. If yes, skip. If no, claim it atomically and proceed.
```javascript
// Atomic claim-before-send (ioredis-style SET arguments).
const claimed = await redis.set(
  `email:idem:${jobId}`,
  workerId,
  'EX', 3600, // TTL so claim keys don't accumulate forever
  'NX'        // succeed only if no worker has claimed this job yet
);
if (!claimed) {
  // Another worker already handled this job
  return ack(job);
}
await sendEmail(job.payload);
await ack(job);
```
The NX flag ensures only one worker wins the race. The EX flag sets a TTL so keys don't accumulate forever. This claim-before-send pattern is the foundation of effectively exactly-once email delivery in a horizontally scaled system.
### Retry storms without shared state
When a send fails (temporary SMTP error, rate limit hit, network blip), a stateless worker has no memory of previous attempts. Without external tracking, it retries immediately. Or worse, the job goes back in the queue and every available worker retries it at once.
The solution is a retry counter stored outside the worker. Each attempt increments the counter, and the worker calculates exponential backoff based on the attempt number:
```javascript
// Shared attempt counter: any worker can pick up the next retry.
const attempts = await redis.incr(`email:retry:${jobId}`);
await redis.expire(`email:retry:${jobId}`, 86400); // counters expire after a day
if (attempts > MAX_RETRIES) {
  await moveToDeadLetter(job); // exhausted: park the job for review
  return;
}
// Exponential backoff: 2s, 4s, 8s, ... capped at 5 minutes
const backoffMs = Math.min(1000 * Math.pow(2, attempts), 300000);
await requeueWithDelay(job, backoffMs);
```
After a configurable maximum (usually 5 to 8 attempts), the job moves to a dead-letter queue for review or alerting. Without this external state, scaling from 5 workers to 50 doesn't improve reliability. It amplifies failures.
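One refinement worth knowing: with a fixed exponential schedule, workers that failed at the same moment also retry at the same moment. "Full jitter" randomizes each delay up to the exponential cap so retries spread out. A minimal sketch, with an illustrative function name and defaults matching the schedule above:

```javascript
// "Full jitter" backoff: each retry waits a random delay up to the
// exponential cap, so workers that failed together don't retry together.
function backoffWithJitter(attempt, { baseMs = 1000, capMs = 300000, rand = Math.random } = {}) {
  const expMs = Math.min(capMs, baseMs * 2 ** attempt); // deterministic cap
  return Math.floor(rand() * expMs);                    // randomized actual wait
}
```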
### Throttle-aware sending
Email providers enforce rate limits. Gmail accepts roughly 2,000 messages per rolling 24-hour window per sender. If you have 20 stateless workers all sending through the same account, each worker needs to respect a shared rate limit, not just its own local counter.
This requires a centralized rate limiter (typically a Redis-backed sliding window or token bucket) that every worker checks before sending. Without it, scaling from 5 to 50 workers doesn't increase throughput. It just gets you rate-limited or blocked faster, and recovering from a provider-level block can take days.
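A token bucket is the simpler of the two to sketch. The version below keeps its state in memory for illustration only; in production, the refill-and-take step would run atomically in the shared store (for example, a Redis Lua script) so that every worker draws from the same budget. All names here are illustrative.

```javascript
// In-memory token bucket for illustration. In production the refill + take
// would run atomically in a shared store so all workers share one budget.
function makeTokenBucket({ capacity, refillPerSec, now = () => Date.now() }) {
  let tokens = capacity;
  let last = now();
  return {
    tryTake() {
      const t = now();
      // Refill proportionally to elapsed time, never above capacity.
      tokens = Math.min(capacity, tokens + ((t - last) / 1000) * refillPerSec);
      last = t;
      if (tokens >= 1) {
        tokens -= 1;
        return true; // under the shared limit: safe to send
      }
      return false;  // over the limit: requeue with a delay instead
    },
  };
}
```

Every worker calls `tryTake()` before sending; a `false` means the job goes back on the queue with a delay rather than burning through the provider's quota.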
## When stateful workers make more sense
Stateless isn't always the right call. If your agent manages long-running email conversations (think multi-turn threads where context evolves over hours or days), a stateful worker that holds conversation history in memory can make decisions faster and with less external I/O.
The trade-off is clear: you gain per-message efficiency but lose horizontal scalability. For agents handling fewer than a few hundred concurrent conversations, stateful workers are simpler to build and debug. Beyond that threshold, memory pressure on individual instances becomes the ceiling.
A hybrid approach works well in practice. Use stateless workers for the send/receive pipeline (high volume, embarrassingly parallel) and stateful workers for conversation management (lower volume, context-dependent). Most production agent email systems that I've seen end up with some version of this split.
## Agent-first email infrastructure
Traditional email infrastructure assumes a human configures SMTP credentials, sets up DNS records, warms up sending domains, and manages worker pools. Agents can do all of this, but the setup overhead eats time that should go toward the agent's actual task.
Agent-first platforms invert this model. Instead of the agent configuring infrastructure, the agent provisions its own inbox through a single call and starts sending immediately. The scaling, queue management, deduplication, and throttle logic happen at the platform level, not in your application code.
```javascript
import { LobsterMail } from '@lobsterkit/lobstermail';

const lm = await LobsterMail.create();
const inbox = await lm.createSmartInbox({ name: 'outreach-worker' });

await inbox.send({
  to: 'recipient@example.com',
  subject: 'Follow-up on your request',
  text: 'Here are the details you asked for...'
});
```
No queue setup. No idempotency layer. No rate limiter configuration. The agent focuses on what to send, and the platform handles how it gets delivered at scale. If your agent needs real-time delivery notifications instead of polling, LobsterMail also supports webhooks as a push-based alternative.
This doesn't mean you should never build your own stateless worker pool. If you need fine-grained control over delivery timing, custom retry logic, or integration with an existing job queue, owning your workers is the right move. But for most agent workloads where email is a capability rather than the core product, offloading the infrastructure removes a full class of scaling problems, along with the queue engineering that comes with it.
## Monitoring what matters
Regardless of which architecture you choose, you need visibility into three things. Queue depth over time tells you whether workers are keeping up with incoming jobs; a consistently growing queue means you need more instances or something is stuck. Processing latency, specifically the gap between P50 and P99, reveals whether a subset of jobs is hitting retries or slow SMTP servers while the rest process normally. And dead-letter queue volume shows you how often retry logic is exhausting itself, which usually points to systematic issues like expired credentials or blocklisted sender IPs.
Set alerts on all three. A stateless architecture makes it straightforward to auto-scale based on queue depth, but only if you're actually measuring it.
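The auto-scale decision itself reduces to "how many workers do we need to drain the current backlog within our target window." A sketch, with an illustrative function name and clamp values (not recommendations):

```javascript
// Derive a desired instance count from queue depth, targeting a drain time
// of roughly one minute. min/max clamp the result so a metrics glitch can't
// scale the pool to zero or to an absurd count. Illustrative only.
function desiredWorkers({ queueDepth, jobsPerWorkerPerMin, min = 1, max = 50 }) {
  const needed = Math.ceil(queueDepth / jobsPerWorkerPerMin);
  return Math.max(min, Math.min(max, needed));
}
```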
## Picking the right architecture for your volume
For agents sending fewer than 1,000 emails per day, the architecture barely matters. A single worker with a simple retry loop handles the load fine. The horizontal scaling question only gets real above that threshold.
Between 1,000 and 50,000 daily messages, stateless workers with a Redis-backed queue and idempotency layer give you linear scaling with manageable complexity. Above 50,000, you're looking at partitioned queues, sharded rate limiters, and probably a managed email platform handling the delivery layer so your workers don't have to.
The honest answer is that most agent teams don't need to build this infrastructure themselves. If email is one of ten things your agent does, spending two weeks on queue architecture is hard to justify. If email is your agent's core function, owning every layer gives you control that no platform can match. Pick based on where email sits in your priority stack, not on what sounds more impressive architecturally.
## Frequently asked questions
### What does stateless mean for an email worker agent?
A stateless email worker retains no per-message session data between job executions. Each job is self-contained: the worker reads the payload, processes it, and forgets everything. All persistent state (retry counts, delivery status, idempotency keys) lives in external stores like Redis or a database.
### How does a stateless email worker achieve horizontal scaling without coordination overhead?
By removing per-instance state, any worker can process any job from the queue. You add instances and throughput scales near-linearly. No sticky sessions or complex routing required. The message queue distributes work, and workers are fully interchangeable.
### What happens when two stateless worker instances dequeue the same email job simultaneously?
Without safeguards, both workers send the email, creating a duplicate delivery. The fix is an atomic claim using idempotency keys: each job has a unique ID, and workers use an atomic set operation (like Redis SET NX) to ensure only one instance processes each job.
### How do you implement idempotent email delivery across multiple worker replicas?
Assign every email job a unique idempotency key at enqueue time. Before sending, each worker attempts an atomic write to a shared store. If the key already exists, another worker claimed it first. The worker skips the send and acknowledges the job without delivering.
### What message queue or broker is best suited for stateless email agent workers?
Redis with Streams or BullMQ is the most common choice for small-to-medium scale. RabbitMQ handles more complex routing patterns. SQS works well as a managed option. All three support visibility timeouts and dead-letter queues, which are essential for email retry logic.
### Can stateless email workers handle retry logic without a shared state store?
Not reliably. Without external state, the worker has no way to know how many times a job has been attempted. You need an external counter (Redis key, database row, or message attribute) to track attempt count and calculate backoff delays between retries.
### What is the difference between stateless and stateful architecture for high-volume email sending?
Stateless workers scale horizontally by adding identical instances behind a queue. Stateful workers maintain in-memory session context, which limits them to vertical scaling or sticky-session routing. Stateless wins at high volume; stateful wins at lower volume with complex per-message logic.
### How do you monitor queue depth and worker lag when scaling stateless email agents horizontally?
Track queue depth over time (are jobs backing up?), processing latency at P50 and P99 (are some jobs abnormally slow?), and dead-letter queue volume (are retries exhausting?). Most queue systems expose these metrics via built-in dashboards or Prometheus-compatible exporters.
### What design pattern prevents duplicate email sends during horizontal scale-out events?
The claim-before-send pattern. Every worker attempts an atomic claim on the job's idempotency key before sending. Combined with at-least-once delivery from the queue and worker-level deduplication, this pattern achieves effectively exactly-once email delivery even during rapid scale-out.
### When should an email pipeline use stateful agents instead of stateless workers?
When the agent manages long-running, multi-turn email conversations where context changes over hours or days. Stateful agents avoid the I/O overhead of loading full conversation history from an external store on every message. Below a few hundred concurrent threads, stateful is often simpler to build.
### How does agent-first email infrastructure like LobsterMail differ from building your own SMTP worker pool?
Agent-first platforms let the agent self-provision an inbox through a single SDK call. Scaling, rate limiting, deduplication, and delivery happen at the platform level. You skip the queue engineering and focus your code on what to send rather than how to deliver it.
### How does a stateless email worker handle a crash without losing message state?
The queue provides the safety net. If a worker crashes before acknowledging a job, the queue's visibility timeout expires and the job becomes available to another worker. No message is lost as long as acknowledgment only happens after successful processing.


