
Debugging webhook delivery failures in production agent workflows
Production webhook failures silently break agent email workflows. Diagnose endpoint errors, signature failures, and retry exhaustion fast.
Your agent has been ignoring emails for four hours. Not failing loudly — just quiet. No errors in your application logs. No alerts. The agent looks healthy from the outside, but it hasn't processed a single incoming message since yesterday afternoon.
The culprit is almost always a webhook delivery failure. The frustrating part is that it doesn't announce itself. The agent just stops reacting, and you find out when a user complains — or when you check the dashboard and notice zero events processed.
Here's a systematic approach to debugging webhook delivery failures in production, focused on the patterns that break LobsterMail-based agent workflows.
How the failure cascade works
When LobsterMail can't deliver a webhook, it retries with exponential backoff, up to 10 times. Any non-2xx response from your endpoint, or a timeout after 10 seconds, counts as a failure. After 10 consecutive failures, the webhook is automatically disabled.
Disabled means silent. Your agent's inbox keeps receiving email. LobsterMail keeps accepting it. Nothing gets forwarded to your handler. No more retries. No alerts. Just an inbox filling up with messages your agent will never see.
That 10-attempt limit sounds generous, but it burns through fast. If your endpoint is returning 500s after a bad deployment, you can hit 10 failures in under an hour. By the time you notice, the webhook has been dead for a while and a backlog of unprocessed messages has built up.
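To get a feel for how quickly the retry budget runs out, here is a rough sketch of the arithmetic. The base delay and doubling factor are assumptions chosen for illustration, since the actual LobsterMail schedule isn't stated here, and the sketch treats the limit as 10 delivery attempts with 9 waits between them.

```typescript
// Hypothetical schedule: the wait before retry n is baseMs * factor^(n - 1).
// baseMs and factor are illustrative assumptions, not documented LobsterMail values.
function retryDelaysMs(attempts: number, baseMs: number, factor: number): number[] {
  const delays: number[] = [];
  // The first attempt is the original delivery, so there are attempts - 1 waits.
  for (let retry = 0; retry < attempts - 1; retry++) {
    delays.push(baseMs * Math.pow(factor, retry));
  }
  return delays;
}

// Total time between the first failed attempt and the last one.
function totalWindowMs(attempts: number, baseMs: number, factor: number): number {
  return retryDelaysMs(attempts, baseMs, factor).reduce((sum, d) => sum + d, 0);
}

// With an assumed 5-second base delay that doubles each time:
const windowMs = totalWindowMs(10, 5000, 2);
console.log(`${(windowMs / 60000).toFixed(1)} minutes from first failure to last attempt`);
```

Under those assumed numbers the whole window is about 43 minutes, which is why a bad deploy late in the day can leave the webhook disabled before anyone looks at it.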
Step 1: Check if your webhook is disabled
Start here, not with your application logs. List your webhooks and look at the enabled flag:
const webhooks = await lm.listWebhooks();
console.log(webhooks.map(w => ({ id: w.id, enabled: w.enabled })));
If enabled is false, the webhook exhausted its retries. Before you re-enable it, figure out why it failed — otherwise you'll disable it again within hours. Re-enabling looks like this:
PATCH /v1/webhooks/:id
{ "enabled": true }
Don't skip the diagnosis step. Re-enabling a broken webhook just restarts the countdown.
Step 2: Reproduce the failure locally
Get your endpoint responding correctly before touching the webhook configuration. POST a test payload directly to your handler:
{
  "event": "email.received",
  "timestamp": "2026-02-17T12:00:00Z",
  "data": {
    "emailId": "em_abc123",
    "inboxId": "in_xyz789",
    "from": "sender@example.com",
    "subject": "Hello",
    "preview": "First 200 characters of the email..."
  }
}
A 5xx response means a server error in your handler. A 4xx means check your routing, authentication middleware, or body parsing configuration. If the request hangs and never returns, you have a timeout problem — and that's more common than people expect.
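These triage rules can be written down as a small function to run against the result of a test request. This is a sketch, not part of any LobsterMail SDK: `diagnoseProbe` and its hint strings are hypothetical, and the 10-second default simply mirrors the timeout described earlier.

```typescript
type ProbeResult = { delivered: boolean; hint: string };

// Applies the delivery rules to one test request.
// status is null when the request never returned; elapsedMs is wall-clock time.
function diagnoseProbe(status: number | null, elapsedMs: number, timeoutMs = 10000): ProbeResult {
  if (status === null || elapsedMs > timeoutMs) {
    return { delivered: false, hint: "timeout: acknowledge first, process asynchronously" };
  }
  if (status >= 200 && status < 300) {
    return { delivered: true, hint: "ok" };
  }
  if (status >= 500) {
    return { delivered: false, hint: "server error: check the handler and its logs" };
  }
  if (status >= 400) {
    return { delivered: false, hint: "client error: check routing, auth middleware, body parsing" };
  }
  // 1xx and 3xx are not 2xx, so they also count as failures.
  return { delivered: false, hint: "non-2xx response: counts as a failed delivery" };
}

console.log(diagnoseProbe(200, 120));   // delivered
console.log(diagnoseProbe(200, 12000)); // a 200 that arrives too late is still a timeout
```

The second call is the case that catches people out: the handler returned the right status, but only after the sender had already given up.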
The 10-second timeout is a hard limit. If your handler calls a database, an LLM, or any downstream API, you need to acknowledge the request immediately and process asynchronously:
app.post('/hooks/lobstermail', async (req, res) => {
  // Verify signature first, then acknowledge immediately
  res.status(200).json({ ok: true });
  // Processing happens in the background, so don't await here
  processEmailEvent(req.body).catch(console.error);
});
The response has to go out before any slow work begins. Awaiting the processing inside the handler is the most common cause of timeout failures I see.
Step 3: The raw body signature trap
Signature verification failures are the sneakiest class of webhook errors. Your code looks correct. The secret is right. The header is present. Still invalid.
The issue is almost always body parsing. HMAC-SHA256 is computed over the exact raw bytes that came over the wire. If your framework parses the JSON body before you get to it, you're signing a re-serialized object — and even a single whitespace difference breaks the signature.
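import { createHmac, timingSafeEqual } from 'node:crypto';

function verifyWebhook(body: string, signature: string, secret: string): boolean {
  // Compute the HMAC over the exact raw body, not a re-serialized object
  const expected = Buffer.from(createHmac('sha256', secret).update(body).digest('hex'));
  const received = Buffer.from(signature);
  // Constant-time comparison; timingSafeEqual throws if lengths differ, so check first
  return received.length === expected.length && timingSafeEqual(received, expected);
}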
In Express, you need the raw buffer, not the parsed object:
// This fails: body is already JSON-parsed by the global middleware
app.use(express.json());
app.post('/hooks/lobstermail', (req, res) => {
  verifyWebhook(JSON.stringify(req.body), sig, secret); // ❌ re-serialized
});

// This works: raw body captured before any parsing
app.post('/hooks/lobstermail',
  express.raw({ type: 'application/json' }),
  (req, res) => {
    verifyWebhook(req.body.toString(), sig, secret); // ✓ original bytes
  }
);
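To see the mismatch in isolation, the following sketch signs the same payload twice: once as the raw string, once after a parse and re-serialize round trip. The secret and payload are made-up values for the demonstration, and `sign` is a hypothetical helper, not a LobsterMail API.

```typescript
import { createHmac } from "node:crypto";

// Hypothetical secret and payload, used only to demonstrate the mismatch.
const secret = "example-secret";
const rawBody = '{ "event": "email.received",  "data": { "emailId": "em_abc123" } }';

function sign(body: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

// What the sender signed: the exact bytes on the wire.
const signatureFromSender = sign(rawBody);

// What a handler ends up signing after a global JSON parser has run.
const reserialized = JSON.stringify(JSON.parse(rawBody));

console.log(rawBody === reserialized);                   // false: the whitespace is gone
console.log(sign(reserialized) === signatureFromSender); // false: the digest no longer matches
console.log(sign(rawBody) === signatureFromSender);      // true
```

The parsed objects are identical in both cases, which is why this bug survives code review: nothing about the data looks wrong, only the bytes differ.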
Warning
The webhook secret is only returned once — at creation time. If you didn't save it when you created the webhook, you'll need to delete it and create a new one. Store it in your secrets manager immediately, not in a .env file that might get overwritten.
Step 4: Status code edge cases
LobsterMail counts anything non-2xx as a failure. 200 works. 201 works. 204 No Content is also a valid 2xx response, and I've seen handlers return it on that basis, but when in doubt, return 200 with { "ok": true }. It's unambiguous and easy to grep for in logs.
Also watch for middleware rewriting your status codes on the way out. Authentication layers returning 401 on malformed requests, rate limiters returning 429, and load balancers returning 502 during a rolling deploy all register as failures on LobsterMail's side, even if your handler logic ran fine.
If you're seeing failures you can't reproduce locally, suspect the infrastructure between LobsterMail and your handler before assuming the handler itself is broken.
Step 5: Making it stay fixed
Once the immediate failure is resolved, the handler architecture is what determines whether this happens again.
Keep the handler stateless and fast. Its only job: verify the signature, enqueue the job, return 200. All processing happens in a worker. This pattern makes the handler nearly indestructible — it can't time out, it can't fail because a downstream service is slow, and it can't block under load.
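A minimal sketch of that verify, enqueue, acknowledge shape, written as a framework-agnostic function so it can be tested without a server. The in-memory `queue`, the `handleDelivery` name, the hex-encoded signature, and the 401 on a bad signature are all assumptions for illustration; a real deployment would push to whatever queue the worker consumes.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

type Job = { rawBody: string };

// In-memory stand-in for a real queue (hypothetical; swap in your own broker).
const queue: Job[] = [];

// Assumes the signature is a hex-encoded HMAC-SHA256 of the raw body.
function isValidSignature(rawBody: string, signature: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signature, "hex");
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}

// Core of the handler: returns the status code to send back.
function handleDelivery(rawBody: string, signature: string, secret: string): number {
  if (!isValidSignature(rawBody, signature, secret)) {
    return 401; // the sender will count this as a failed delivery
  }
  queue.push({ rawBody }); // cheap and synchronous: no parsing, no downstream calls
  return 200;
}
```

Everything slow, including JSON parsing and any LLM or database call, lives in the worker that drains the queue, so the time to respond stays small regardless of load downstream.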
Set an alert on webhook re-enable events. If you're manually re-enabling the same webhook more than once, you have a systemic reliability problem, not a one-off incident. Retry exhaustion is a symptom, not the root cause.
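One way to detect those events, assuming you poll `lm.listWebhooks()` on a schedule and keep the previous snapshot, is to diff the `enabled` flags between polls. The function names and the in-memory counter below are hypothetical; only the `id` and `enabled` fields come from the listing call shown in Step 1.

```typescript
type WebhookState = { id: string; enabled: boolean };
type Transition = { id: string; kind: "disabled" | "re-enabled" };

// Compares two snapshots of the webhook list and reports state changes.
function diffWebhookStates(previous: WebhookState[], current: WebhookState[]): Transition[] {
  const before = new Map<string, boolean>();
  for (const w of previous) before.set(w.id, w.enabled);

  const transitions: Transition[] = [];
  for (const w of current) {
    const wasEnabled = before.get(w.id);
    if (wasEnabled === undefined || wasEnabled === w.enabled) continue; // new or unchanged
    transitions.push({ id: w.id, kind: w.enabled ? "re-enabled" : "disabled" });
  }
  return transitions;
}

// Counts re-enables per webhook so a repeat offender stands out.
const reEnableCounts = new Map<string, number>();

// Returns the ids of webhooks that have now been re-enabled more than once.
function recordTransitions(transitions: Transition[]): string[] {
  const repeatOffenders: string[] = [];
  for (const t of transitions) {
    if (t.kind !== "re-enabled") continue;
    const count = (reEnableCounts.get(t.id) ?? 0) + 1;
    reEnableCounts.set(t.id, count);
    if (count > 1) repeatOffenders.push(t.id);
  }
  return repeatOffenders;
}
```

A "disabled" transition is worth paging on immediately; a non-empty result from `recordTransitions` is the signal that the problem is systemic.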
Test handler changes in a sandbox before deploying to production. The agent email sandbox testing guide covers how to set up an isolated test environment for exactly this kind of work.
For more on the architectural tradeoffs between webhooks and polling — including cases where polling is genuinely the better choice — see webhooks vs polling for agent email.
Frequently asked questions
How do I know if my LobsterMail webhook was automatically disabled?
Call lm.listWebhooks() and check the enabled field on each webhook object. If it's false and you didn't disable it manually, retry exhaustion is the likely cause. Re-enable via PATCH /v1/webhooks/:id with { "enabled": true } after fixing the underlying issue.
Can I recover emails that arrived while my webhook was disabled?
Yes. Your inboxes continued receiving email even while the webhook was disabled — the messages are stored and accessible. You can fetch them via the API and process them manually, or trigger reprocessing from your own queue once the webhook is back online.
Why does LobsterMail disable the webhook instead of just pausing retries?
Auto-disabling protects your endpoint from continued hammering when something is clearly wrong. A webhook that has failed 10 consecutive times is almost certainly pointing at a broken endpoint, not experiencing transient errors. Re-enabling is a deliberate action that signals you've addressed the problem.
How long does LobsterMail wait between retry attempts?
Retries use exponential backoff, so the gap between attempts grows with each failure. The first retry is near-immediate; later ones are spaced further apart. Depending on how quickly the failures come back, 10 failures can represent anywhere from a few minutes to several hours of downtime before the webhook is disabled.
My HMAC signature verification keeps failing even though my secret is correct. What's wrong?
Almost certainly a body parsing issue. You're likely signing a re-serialized JSON object instead of the original raw bytes. See the raw body section above — use express.raw() or your framework's equivalent to capture the body before any JSON parsing happens.
What HTTP status codes does LobsterMail treat as successful delivery?
Any 2xx status code. 200 and 201 are the most common. Avoid 204 unless you've confirmed it's handled correctly — when in doubt, return 200 with a simple JSON body like { "ok": true }.
What happens to my webhook secret if I lose it?
The secret is only returned once, at webhook creation time. If you lose it, delete the webhook and create a new one to get a fresh secret. This is why you should store it in a secrets manager immediately after creation.
Can I register multiple webhook endpoints for the same event?
Yes. You can create multiple webhooks pointing to different URLs, all listening to email.received. Each will receive its own delivery attempts and has its own retry counter. This is useful for fan-out patterns or routing different inboxes to different handlers.
How do I test my webhook handler during local development?
Use a tunneling tool like ngrok to expose your local server, then register a webhook pointing at the tunnel URL. The agent email sandbox testing guide covers the full development-to-production testing workflow in more detail.
Should I still verify signatures in a development environment?
Yes. It's much easier to catch signature verification bugs in development — where you control the test payloads — than to discover them in production after a webhook has been auto-disabled. Treat signature verification as non-optional from day one.
What's the right way to handle duplicate webhook deliveries?
Implement idempotency using the emailId field in the payload. Store processed IDs and skip re-processing if you've already handled a given emailId. LobsterMail retries can deliver the same event more than once, for example when your endpoint processed the event but acknowledged it after the timeout or returned a non-2xx status.
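A minimal sketch of that check, assuming the payload shape shown in Step 2. The in-memory `Set` and the function names are illustrative only; with more than one worker process you would need a shared, persistent store instead.

```typescript
type EmailEvent = { event: string; data: { emailId: string } };

// In-memory record of handled events (illustrative; not shared across processes).
const processedEmailIds = new Set<string>();

// Returns true the first time an emailId is seen, false for any repeat.
function claimEmail(emailId: string): boolean {
  if (processedEmailIds.has(emailId)) return false;
  processedEmailIds.add(emailId);
  return true;
}

// Runs `work` at most once per emailId.
function handleEvent(evt: EmailEvent, work: (e: EmailEvent) => void): "processed" | "duplicate" {
  if (!claimEmail(evt.data.emailId)) return "duplicate";
  work(evt);
  return "processed";
}

const evt: EmailEvent = { event: "email.received", data: { emailId: "em_abc123" } };
console.log(handleEvent(evt, () => {})); // "processed"
console.log(handleEvent(evt, () => {})); // "duplicate"
```

Claiming the ID before doing the work means a failed job stays marked as handled; if you want failed work to be retried, release the claim when `work` throws.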
My handler returns 200 but deliveries still show as failed in LobsterMail. Why?
Check whether middleware between LobsterMail and your handler is overriding the response. Proxies, API gateways, and auth layers can rewrite status codes before the response leaves your infrastructure. Also confirm your handler is actually sending the response before the 10-second timeout — if processing takes too long, the connection closes regardless of what status you intended to return.


