
WebSocket connection drops keep killing your agent's email

WebSocket drops silently break agent email pipelines. Here's why they happen, how to detect them, and the architecture that makes them irrelevant.

9 min read
Ian Bussières, CTO & Co-founder

Your agent processed 340 emails on Monday. On Tuesday, it processed zero. The logs showed nothing wrong. The WebSocket connection had dropped sometime around 2 AM, the reconnection logic never fired, and your agent sat idle listening to a dead socket for nine hours before anyone noticed.

This is the most common failure mode in agent email pipelines. It's also the hardest to detect because nothing actually errors out.

I've spent months debugging this pattern across agent deployments. It always looks the same: a long-lived WebSocket connection dies without triggering onclose, the agent assumes it's still receiving events, and inbound emails pile up unprocessed on the server. If you're done babysitting socket connections, LobsterMail handles delivery at the infrastructure level, so a dropped socket never means a dropped email.

Why WebSocket connections drop: root causes at a glance

These are the seven most common reasons WebSocket connections fail in production:

  1. NAT gateway timeouts silently expire idle connections after 30-120 seconds of inactivity
  2. Missing application-level heartbeats leave both sides unaware the connection is dead
  3. Reverse proxies like nginx close WebSocket connections when the default proxy_read_timeout (60 seconds) expires
  4. TLS certificate expiry or mismatch triggers intermittent connection resets
  5. Server-side idle timeout thresholds close connections that haven't exchanged data recently
  6. Half-open TCP state leaves one side believing the connection is alive while the other has already torn it down
  7. No event queue behind the WebSocket means messages sent during a drop vanish permanently

For agent email pipelines, cause number seven is the one that actually costs you money. The first six are fixable with configuration tweaks. The seventh requires a different architecture entirely.

Silent drops vs clean closes

A normal WebSocket close involves a close frame (opcode 0x8). Both sides acknowledge the shutdown, and your onclose handler fires. Your code knows the connection ended and can reconnect within milliseconds.

A silent drop skips all of that. The TCP connection dies at the network layer without either endpoint sending a close frame. Your client's socket object still exists. readyState still returns OPEN. But nothing flows through the connection anymore. This happens when a NAT gateway expires the connection mapping, when a load balancer reclaims an idle slot, or when the remote server crashes without performing a graceful shutdown.

The practical difference is simple: a clean close triggers your reconnection logic. A silent drop triggers nothing. Your agent stops receiving email and neither the agent nor your monitoring has any indication that something went wrong.
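Because readyState can't be trusted on its own, liveness checks have to combine it with recent activity. Here's a minimal sketch of that idea; the socket shape, function name, and 40-second silence threshold are illustrative, not from any particular library:

```javascript
// readyState constant mirrors the WebSocket spec: 1 === OPEN.
const OPEN = 1;

// A socket is only "probably alive" if it reports OPEN *and* something
// (a message, a pong) has crossed the wire recently. The silence
// threshold is an assumption; tune it to your heartbeat interval.
function isProbablyAlive(socket, lastActivityAt, maxSilenceMs = 40_000, now = Date.now()) {
  if (socket.readyState !== OPEN) return false;
  return now - lastActivityAt <= maxSilenceMs;
}
```

A silently dropped socket still passes the first check; only the activity clock exposes it.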

Why agent email breaks differently

A chat application with 10,000 connected users can tolerate scattered WebSocket drops. Users notice the lag, refresh the page, and reconnect. The lost messages are usually just typing indicators or presence updates.

Agent email pipelines have a different failure profile. An agent processing inbound email typically has one connection, running unattended for days or weeks. There's no human watching the interface. There's no refresh button. When that single WebSocket drops silently, the agent continues running its main loop, checking a socket that will never produce another event.

Meanwhile, emails keep arriving at the mail server. They accumulate in a queue the agent can't see because it's listening to a dead connection. By the time someone discovers the problem (usually when a customer complains about a missed reply), hundreds of messages may be sitting in a server-side buffer that your agent never polled.

If your email delivery depends on WebSocket events, you've coupled email reliability to network reliability. Those two things have very different failure modes. We covered the tradeoffs between real-time push and periodic retrieval in our guide on webhooks vs polling: how your agent should receive emails.

WebSocket vs SSE vs polling for agent email

If WebSocket connections are this fragile for agent workloads, should you use a different transport? Here's how the three main approaches compare:

| Feature | WebSocket | SSE | Polling |
|---|---|---|---|
| Connection type | Full-duplex, persistent | Half-duplex, persistent | Repeated HTTP requests |
| Auto-reconnect | Manual (you build it) | Built into the protocol | Not applicable |
| Missed messages on drop | Lost unless queued server-side | Resumable via Last-Event-ID | None lost (each request gets current state) |
| Proxy compatibility | Requires Upgrade header support | Standard HTTP, works everywhere | Standard HTTP, works everywhere |
| Best fit | Bidirectional real-time streams | Server push with built-in recovery | Reliable, simple retrieval |

SSE handles reconnection more gracefully than WebSocket because the protocol includes a retry mechanism and a Last-Event-ID header. When an SSE connection drops, the client automatically reconnects and tells the server which event it last received. The server can then replay anything that was missed.
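The replay mechanism rests on the client remembering the last `id:` field it saw. A minimal sketch of that bookkeeping over the raw SSE wire format (the field names follow the SSE spec; the function names are illustrative):

```javascript
// SSE events arrive as blocks of "field: value" lines separated by a
// blank line. The client keeps the most recent "id:" value it has seen.
function lastEventId(rawStream) {
  let id = null;
  for (const line of rawStream.split('\n')) {
    if (line.startsWith('id:')) id = line.slice(3).trim();
  }
  return id;
}

// On reconnect, that value is sent back so the server can replay
// everything after it (sketch of the header the client would send).
function reconnectHeaders(id) {
  return id === null ? {} : { 'Last-Event-ID': id };
}
```

Browser EventSource does this automatically; the sketch just shows what the protocol is doing on your behalf.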

But all three approaches share the same fundamental weakness for email agents: they put the delivery guarantee on the client. If your agent is down, sleeping, or restarting during a deploy, events accumulate somewhere. That "somewhere" may not survive a restart, and your agent has no way to know what it missed.

Queue-backed delivery: the actual fix

The real solution isn't choosing between WebSocket, SSE, or polling. It's making the transport choice irrelevant to delivery guarantees.

A queue-backed email architecture persists every inbound email before attempting to notify your agent. The notification channel (WebSocket, webhook, polling, whatever you prefer) is just a signal that new mail exists. If the signal gets lost, the email doesn't. Your agent calls receive() when it's ready, and it gets everything that arrived since its last check.

This is how LobsterMail works. When an email hits your agent's inbox, it's stored immediately. Your agent retrieves messages through the SDK or a webhook callback. If your agent goes offline for an hour, the emails wait. If a webhook delivery fails, retries happen automatically. The storage layer and the notification layer are completely separate concerns with independent reliability guarantees.

import { LobsterMail } from '@lobsterkit/lobstermail';

const lm = await LobsterMail.create();
const inbox = await lm.createSmartInbox({ name: 'Support Agent' });

// Emails are persisted server-side. receive() returns everything
// since your last retrieval, regardless of connection state.
const emails = await inbox.receive();

for (const email of emails) {
  console.log(email.subject, email.from);
}

No WebSocket to manage. No heartbeat interval to tune. No reconnection logic to debug at 3 AM. No silent drops to lose sleep over. The SDK makes an HTTP request and gets back whatever emails have arrived. If you want real-time push on top of that, add a webhook, but the webhook is a notification convenience, not the delivery mechanism itself.

If you're stuck with WebSocket-based email delivery

If you're running infrastructure that requires WebSocket for email events and can't switch right now, three changes will catch most silent drops.

First, implement application-level heartbeats. Don't rely on TCP keepalive or WebSocket protocol-level pings alone. Send a custom ping message every 15 seconds from your client and expect a pong within 5 seconds. If you miss two consecutive pongs, tear down the connection and force a reconnect. Fifteen seconds is the right interval because most NAT gateways expire idle mappings between 30 and 120 seconds, and you need your heartbeat to arrive before that window closes.
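The ping/pong accounting above reduces to a small state machine. The class below is a sketch (the class and method names are illustrative, not a library API), and for simplicity it treats the 15-second beat interval itself as the pong deadline rather than running a separate 5-second timer; real code would call `beat()` from a `setInterval`:

```javascript
// Tracks outstanding pings; after two consecutive missed pongs it
// declares the connection dead so the caller can force a reconnect.
class HeartbeatMonitor {
  constructor(onDead, maxMissed = 2) {
    this.onDead = onDead;
    this.maxMissed = maxMissed;
    this.missed = 0;
    this.awaitingPong = false;
  }

  // Called every 15 seconds. If the previous ping never got a pong,
  // count a miss; at maxMissed, signal the caller to tear down.
  beat(sendPing) {
    if (this.awaitingPong) {
      this.missed += 1;
      if (this.missed >= this.maxMissed) {
        this.onDead();
        return;
      }
    }
    this.awaitingPong = true;
    sendPing();
  }

  // Called whenever a pong message arrives from the server.
  pongReceived() {
    this.awaitingPong = false;
    this.missed = 0;
  }
}
```

On `onDead`, call `socket.close()` yourself: forcing the close event is what lets your normal reconnection path handle silent drops too.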

Second, fix your reverse proxy configuration. If you're running nginx, set proxy_read_timeout to at least 3600 seconds for your WebSocket location block. The default 60-second value is the single most common cause of unexpected WebSocket termination in production.

location /ws {
    proxy_pass http://backend;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 3600s;
    proxy_send_timeout 3600s;
}

Third, add a reconciliation loop. Every five minutes, have your agent make a separate HTTP request to fetch any emails it might have missed. Compare the results against what came through the WebSocket. This catches every silent drop, including the ones your heartbeat logic misses.
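The comparison step is just a set difference over message IDs. A sketch, assuming each email carries an `id` field (the field and function names are illustrative):

```javascript
// Given the IDs that arrived over the WebSocket and the emails returned
// by the reconciliation fetch, return the emails the socket dropped.
function findMissedEmails(socketSeenIds, fetchedEmails) {
  const seen = new Set(socketSeenIds);
  return fetchedEmails.filter((email) => !seen.has(email.id));
}
```

A non-empty result means your socket missed messages; log it as a metric, not just a repair.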

Warning

If your reconciliation loop consistently catches emails the WebSocket missed, your socket connection isn't reliable enough to justify maintaining. You're running polling with extra complexity.

That reconciliation loop is really just polling with more steps. If you find yourself building one on top of your WebSocket, it's worth asking whether the WebSocket is earning its keep. And before shipping any of these changes, test agent email without hitting production so you can verify your reconnection logic handles edge cases cleanly.

The simpler path is letting infrastructure handle transport reliability for you, so you can spend your time on what your agent actually does with its email, not on keeping the connection alive.

Frequently asked questions

What causes a WebSocket connection to drop without triggering the onclose event?

A silent drop happens when the TCP connection dies at the network layer (NAT timeout, load balancer eviction, server crash) without either side sending a WebSocket close frame. Your client's readyState stays OPEN even though nothing can pass through the connection. Only application-level heartbeats can detect this state.

How many consecutive missed heartbeats should force a WebSocket reconnect attempt?

Two missed pongs is the standard threshold. With a 15-second ping interval and 5-second pong timeout, you'll detect a dead connection within about 40 seconds. Going below two risks false positives on congested networks.

What ping interval is recommended for WebSocket connections running behind a NAT gateway?

Use 15-20 seconds. Most NAT gateways expire idle TCP mappings between 30 and 120 seconds. A 15-second heartbeat keeps the mapping active while leaving headroom for network jitter.

How do I configure nginx proxy_read_timeout to stop silently killing idle WebSocket connections?

Set proxy_read_timeout 3600s in your WebSocket location block. The default 60-second value silently closes any WebSocket that hasn't received data in the last minute. You also need proxy_http_version 1.1, proxy_set_header Upgrade $http_upgrade, and proxy_set_header Connection "upgrade" in the same block.

Why does my email agent stop receiving new messages after a few minutes of low traffic?

Your agent's WebSocket connection likely dropped silently during the idle period. NAT timeouts or proxy read timeouts kill connections with no recent traffic. Switching to polling-based or queue-backed retrieval (like LobsterMail's receive() method) eliminates this failure mode entirely.

What is the difference between a WebSocket protocol-level ping frame and an application-level heartbeat message?

Protocol-level pings (opcode 0x9) are handled at the frame layer and may not traverse all proxies correctly. Application-level heartbeats are regular data messages (like {"type":"ping"}) that your code sends and monitors explicitly. Application-level heartbeats are more reliable because they exercise the full data path end to end.

How should an AI email agent replay missed events after successfully reconnecting a dropped WebSocket?

Send a timestamp-based cursor or Last-Event-ID with your reconnection request so the server knows where you left off. If your WebSocket server doesn't support cursor-based replay, you need a separate HTTP endpoint that returns missed events. LobsterMail avoids this problem by persisting all emails server-side and returning them through its receive() method regardless of connection history.

Is exponential backoff or fixed-interval retry safer for WebSocket reconnection inside an email agent?

Exponential backoff with jitter is safer. It prevents thundering herd problems when many agents reconnect simultaneously after an outage. Start at 1 second, cap at 60 seconds, and add random jitter of up to 30% to each delay interval.
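Those numbers translate directly into a delay function. A sketch using the constants from the answer above, with jitter added as up to 30% on top of the base delay:

```javascript
// Exponential backoff: 1s, 2s, 4s, ... capped at 60s, plus up to 30%
// random jitter so a fleet of agents doesn't stampede the server when
// it comes back after an outage.
function backoffDelay(attempt, baseMs = 1000, capMs = 60_000) {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  const jitter = Math.random() * 0.3 * exp;
  return exp + jitter;
}
```

Reset `attempt` to zero only after the new connection has stayed healthy for a while, not immediately on connect, or a flapping server produces a tight retry loop.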

Can expired or mismatched TLS certificates cause intermittent WebSocket drops in agent pipelines?

Yes. A certificate mismatch or near-expiry certificate can cause some connections to reset while others succeed, depending on the client's TLS implementation and caching behavior. Check certificate validity with openssl s_client -connect your-server:443 if you're seeing intermittent drops without a clear network cause.

What is a half-open TCP connection and why is it especially dangerous for long-lived agent connections?

A half-open connection exists when one side has closed the TCP session but the other never received the FIN packet due to a network failure or crash. The "alive" side keeps listening on a socket that will never produce data. For an unattended email agent, this means silently missing all inbound mail until a heartbeat or OS-level timeout detects the problem.

How does a queue-backed architecture prevent lost emails when the socket layer fails?

Every email is written to persistent storage the moment it arrives, before any notification is sent. The WebSocket, webhook, or polling layer only signals that new mail exists. If that signal fails, the email stays in the queue and your agent picks it up on the next receive() call.

What metrics should I track to detect WebSocket instability in a production email agent?

Track reconnection frequency, time between reconnections, heartbeat round-trip latency, and the delta between emails received via WebSocket versus emails found through reconciliation polling. A growing delta means your WebSocket is silently dropping messages.

How do I automatically reconnect a WebSocket in JavaScript?

Listen for the onclose event and call a reconnect function with exponential backoff. For silent drops that never trigger onclose, combine this with an application-level heartbeat that detects dead connections and manually calls socket.close() to force the event. Libraries like reconnecting-websocket handle the retry loop automatically.

Does LobsterMail use WebSocket for email delivery?

No. LobsterMail uses a queue-backed architecture where emails are persisted on arrival and retrieved via the SDK's receive() method or through webhooks. There's no persistent connection to manage or keep alive.

When comparing WebSocket to SSE for email agent delivery, which handles reconnection more gracefully?

SSE handles reconnection better because auto-reconnect and event replay via Last-Event-ID are built into the protocol. WebSocket requires you to implement all reconnection and replay logic yourself. For email agents running unattended, though, neither approach guarantees delivery if the agent process is down when events fire. A queue-backed model is more reliable for both cases.
