
Integration testing AI agent email with mock inboxes

Mock inboxes keep your AI agent from emailing real users during tests. Compare Mailosaur, Mockmail, smolclaw, and LobsterMail for agent email testing.

Ian Bussières, CTO & Co-founder

Your AI agent sends a welcome email during a test run. Except it sends it to a real customer's inbox. The customer replies, confused. Your agent responds to the reply, now locked in an actual conversation that no human authorized.

This happens more often than anyone admits. Integration testing AI agent email flows against live inboxes is how you accidentally spam users, trigger bounce-rate penalties, and burn sender reputation on a domain you need for production. Mock inboxes exist to prevent exactly this.

If you're building agents that handle email and want to skip the horror stories below, LobsterMail's test-mode tokens let you sandbox everything from day one: your agent can send, receive, and thread emails without anything escaping to real recipients.

Why agent email testing is different#

Traditional email testing is straightforward. A QA engineer sends a message, checks if it arrives, verifies the formatting. Agent email testing introduces three problems that human-driven testing never had to deal with.

Agents act autonomously. A misconfigured test can trigger hundreds of sends before anyone notices. Agents also maintain state across multiple email exchanges: threading replies, extracting verification codes, following up on non-responses. Testing a single send-receive pair doesn't validate the full workflow. And agents interpret email content as instructions. If a test accidentally receives a real email containing "cancel all pending orders," an unprotected agent might try to comply.

We covered the sandbox approach in detail in our guide on testing agent email without hitting production. But the tooling question remains: which mock inbox service actually fits agent workflows?

What goes wrong without a mock inbox#

I keep seeing the same failure modes across agent deployments, so here's a catalog.

The most common disaster is accidental reply-all. An agent under test receives an email from a shared mailbox, interprets it as a task, and replies to every address on the thread. One team reported their agent sent 340 replies in under a minute to an internal company mailing list during what was supposed to be a unit test.

Verification code extraction causes subtler problems. An agent testing a signup flow actually creates a real account on a third-party service. Now there's an orphaned account tied to your domain, possibly with billing implications you won't discover for months.

Domain reputation damage hits harder than people expect. An agent in a test loop sends 2,000 messages from your production domain in an hour. Gmail and Outlook notice. Your domain's sender score drops. Production emails to real users start landing in spam. Recovering from that takes weeks of careful warm-up.

Inbox pollution rounds it out. Testing against a shared inbox means test emails mix with real ones. Agents can't distinguish between test data and real messages, leading to unpredictable behavior that passes in one run and fails in the next.

None of these are theoretical. They're the reason mock inboxes moved from "nice to have" to "required" in any serious agent email pipeline.

How to set up a mock inbox for AI agent integration testing#

Here's the process, regardless of which provider you choose:

  1. Choose a mock inbox provider with a REST API.
  2. Provision a fresh inbox programmatically before each test.
  3. Point your agent's email config at the test environment.
  4. Trigger the agent workflow under test.
  5. Assert against captured messages via the read API or webhook.
  6. Delete the inbox after assertions pass.
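
The steps above can be sketched as a small helper. This is a minimal sketch, not any provider's actual SDK: `MockInboxProvider`, `InMemoryProvider`, and `withFreshInbox` are hypothetical names, and the in-memory provider stands in for whichever REST API you choose so the example runs without network access.

```typescript
// Hypothetical message and provider shapes: stand-ins for any mock inbox REST API.
interface CapturedMessage {
  to: string;
  subject: string;
  body: string;
}

interface MockInboxProvider {
  createInbox(): Promise<string>; // step 2: returns the new inbox address
  listMessages(address: string): Promise<CapturedMessage[]>; // step 5
  deleteInbox(address: string): Promise<void>; // step 6
}

// In-memory stand-in so the sketch runs without network access.
class InMemoryProvider implements MockInboxProvider {
  private inboxes = new Map<string, CapturedMessage[]>();
  private counter = 0;

  async createInbox(): Promise<string> {
    const address = `test-${++this.counter}@mock.invalid`;
    this.inboxes.set(address, []);
    return address;
  }

  // Simulates delivery; a real provider captures mail sent by the agent.
  deliver(message: CapturedMessage): void {
    const inbox = this.inboxes.get(message.to);
    if (!inbox) throw new Error(`No such inbox: ${message.to}`);
    inbox.push(message);
  }

  async listMessages(address: string): Promise<CapturedMessage[]> {
    return [...(this.inboxes.get(address) ?? [])];
  }

  async deleteInbox(address: string): Promise<void> {
    this.inboxes.delete(address);
  }

  get openInboxCount(): number {
    return this.inboxes.size;
  }
}

// Provisions an inbox, runs the test body, and always tears the inbox down.
async function withFreshInbox<T>(
  provider: MockInboxProvider,
  testBody: (address: string) => Promise<T>,
): Promise<T> {
  const address = await provider.createInbox();
  try {
    return await testBody(address);
  } finally {
    await provider.deleteInbox(address);
  }
}
```

One design choice to note: the sketch deletes the inbox even when assertions fail, so failed runs don't leak inboxes. If you would rather inspect captured mail after a failure, skip the teardown in that case.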

The key difference from traditional email testing: steps 2 and 6 need to be fast enough for parallelized test suites. If provisioning takes 3 seconds per inbox, a suite with 200 email tests spends 600 seconds (10 minutes) on provisioning alone when run serially. Sub-second provisioning isn't optional at scale.
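
The arithmetic behind that bottleneck is worth making explicit. The sketch below assumes one inbox per test, that provisioning time simply adds up within a worker, and that tests are spread evenly across workers. `provisioningOverheadSeconds` is an illustrative helper, not part of any tool.

```typescript
// Estimates wall-clock seconds spent only on inbox provisioning.
// Assumes one inbox per test and tests spread evenly across workers.
function provisioningOverheadSeconds(
  testCount: number,
  secondsPerInbox: number,
  parallelWorkers = 1,
): number {
  if (parallelWorkers < 1) {
    throw new RangeError('parallelWorkers must be at least 1');
  }
  const testsPerWorker = Math.ceil(testCount / parallelWorkers);
  return testsPerWorker * secondsPerInbox;
}

console.log(provisioningOverheadSeconds(200, 3)); // 600 (10 minutes, serial)
console.log(provisioningOverheadSeconds(200, 3, 8)); // 75
```

Parallelism helps, but slow provisioning still costs over a minute per run in this example, on every run.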

Comparing mock inbox tools for agent email testing#

Here's how the main options stack up for AI agent integration testing specifically:

| Feature | Mailosaur | Mockmail | smolclaw | LobsterMail |
| --- | --- | --- | --- | --- |
| Agent-first design | No | No | Yes | Yes |
| Programmatic inbox creation | Yes | Yes | Yes | Yes |
| Sub-second provisioning | Yes | Yes | Yes | Yes |
| Webhook support | Yes | Limited | Yes | Yes |
| Injection protection | No | No | Partial | Yes (scoring) |
| Multi-turn thread testing | Manual | No | Yes | Yes |
| Free tier | 100 emails | Unlimited (self-host) | 500 emails | 1,000 emails/mo |
| CI/CD integration | Mature | Basic | Growing | SDK + API |

Mailosaur#

Mailosaur has been around since before AI agents were a category. It's designed for QA teams testing transactional email: password resets, welcome sequences, marketing templates. The API is mature, the documentation is solid, and most CI/CD platforms have existing recipes for it.

The gap: Mailosaur doesn't understand agents. There's no concept of injection protection, no built-in threading for multi-turn conversations, and no way to simulate realistic agent email patterns like autonomous verification code extraction. You can absolutely make it work, but you're bolting agent-specific logic onto a tool that was built for a different era of email testing.

Mockmail#

Mockmail is an open-source option you can self-host. The appeal is obvious: no usage limits, no vendor dependency, full control over the test environment. For teams with existing infrastructure expertise, it's a solid choice.

The trade-off is maintenance. You're running another service, handling updates, and building your own API wrappers for anything beyond basic send-and-capture. For a solo developer or small team, the time cost often exceeds what you'd pay for a hosted solution at the Builder tier ($9/mo) of a managed service.

smolclaw#

smolclaw was purpose-built for agent testing. It creates reproducible mock environments where agents interact with email and other services without touching anything real. The focus on safety and reproducibility makes it interesting for teams running complex multi-agent workflows where email is one of several services to mock.

It's newer and the ecosystem is still growing. Documentation is thinner than Mailosaur's, and community support is limited. But if your testing needs center on agent behavior rather than email formatting, smolclaw is closer to the actual problem.

AgentMail#

AgentMail is another agent-focused platform worth considering, especially if you're already deep into the LangChain or CrewAI ecosystem. Their SOC 2 Type II certification matters for enterprise compliance. However, their free tier is limited to 3 inboxes, and their testing story leans more toward production use than sandboxed integration testing.

LobsterMail#

LobsterMail isn't a dedicated mock inbox tool. It's production email infrastructure for agents that happens to have a clean test mode baked in. Every API token follows the lm_sk_test_* / lm_sk_live_* pattern. Test tokens give you the full API surface (provisioning, sending, receiving, webhooks) without any email actually leaving LobsterMail's servers.

This means your integration tests run against the same code path your agent uses in production. You eliminate an entire category of "works in test, fails in prod" bugs because the test and production environments share the same API behavior.

The injection scoring that LobsterMail applies to incoming emails also works in test mode. So if your agent processes inbound email, you can verify that it respects risk scores before going live. For more on how agents handle incoming mail in production, see our breakdown of webhooks vs polling: how your agent should receive emails.
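
As a rough illustration of what "respecting risk scores" could look like in agent code, here is a minimal sketch. The field name `injectionScore`, its 0 to 1 scale, and the 0.5 threshold are all assumptions made for this example; check the SDK's actual email object before relying on any of them.

```typescript
// Hypothetical shape: the real SDK's field name and score scale may differ.
interface InboundEmail {
  subject: string;
  body: string;
  injectionScore: number; // assumed range: 0 (benign) to 1 (likely injection)
}

type AgentAction = 'process' | 'quarantine';

// Decides whether the agent may treat the email body as input.
function decideAction(email: InboundEmail, threshold = 0.5): AgentAction {
  return email.injectionScore >= threshold ? 'quarantine' : 'process';
}
```

In a test, you would feed the agent one low-scoring and one high-scoring message and assert that only the first reaches the code path that acts on email content.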

Testing multi-turn email conversations#

This is where most mock inbox tools fall short. A single send-receive assertion is easy. Testing a three-message thread where your agent sends an initial email, receives a reply asking for clarification, responds with specifics, and then receives a confirmation requires the mock environment to maintain conversation state across multiple exchanges. Thread IDs, In-Reply-To headers, and References headers all need to stay consistent.
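
To make "stay consistent" concrete: by the usual reply convention, each reply's In-Reply-To carries the parent's Message-ID, and its References carries the parent's References followed by the parent's Message-ID. The sketch below checks that for a linear thread in which every message replies to the one before it. `ThreadMessage` and `findThreadingErrors` are illustrative names, and the header values are assumed to be already parsed out of the captured emails.

```typescript
// Minimal header view of a message; field names here are illustrative.
interface ThreadMessage {
  messageId: string; // e.g. "<a1@mock.invalid>"
  inReplyTo?: string; // Message-ID of the direct parent
  references: string[]; // Message-IDs of all ancestors, oldest first
}

// Returns a list of problems; an empty list means the thread is consistent.
// Assumes a linear thread: each message replies to the previous one.
function findThreadingErrors(thread: ThreadMessage[]): string[] {
  const errors: string[] = [];
  for (let i = 1; i < thread.length; i++) {
    const parent = thread[i - 1];
    const current = thread[i];
    if (current.inReplyTo !== parent.messageId) {
      errors.push(`message ${i}: In-Reply-To does not match parent Message-ID`);
    }
    const expected = [...parent.references, parent.messageId];
    const matches =
      expected.length === current.references.length &&
      expected.every((id, j) => id === current.references[j]);
    if (!matches) {
      errors.push(`message ${i}: References does not extend the parent chain`);
    }
  }
  return errors;
}
```

Asserting that the returned list is empty after a multi-turn exchange catches an agent that replies with a fresh message instead of continuing the thread.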

With LobsterMail's test mode, you can script the entire sequence using two test inboxes:

import { LobsterMail } from '@lobsterkit/lobstermail';

// Assumes the SDK is configured with a test token (lm_sk_test_*), so nothing
// leaves LobsterMail's servers. `expect` comes from your test runner.
const lm = await LobsterMail.create();
const agent = await lm.createSmartInbox({ name: 'test-agent' });
const customer = await lm.createSmartInbox({ name: 'test-customer' });

// Agent sends initial outreach
await agent.send({
  to: customer.address,
  subject: 'Your proposal',
  body: 'Here are the project details...',
});

// Customer inbox captures the outreach, then the simulated customer replies
const outreach = await customer.receive();
expect(outreach[0].subject).toBe('Your proposal');
await customer.send({
  to: agent.address,
  subject: 'Re: Your proposal',
  body: 'Can you clarify the timeline?',
});

// Agent processes the reply
const reply = await agent.receive();
expect(reply[0].subject).toContain('Re:');
expect(reply[0].body).toContain('timeline');

Both inboxes are real LobsterMail inboxes running on test tokens, so threading, headers, and delivery semantics all behave exactly as they would in production. This kind of stateful testing is difficult to replicate with tools that treat each email as an isolated event.

Wiring mock inboxes into CI/CD#

The goal: every pull request that touches email-related code should run integration tests against mock inboxes. No manual QA, no "I'll test it in staging."

For LobsterMail, your CI configuration looks like this:

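# .github/workflows/email-tests.yml
on: [pull_request]
env:
  LOBSTERMAIL_TOKEN: ${{ secrets.LOBSTERMAIL_TEST_TOKEN }}

jobs:
  email-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4 # pnpm version read from package.json "packageManager"
      - run: pnpm install --frozen-lockfile
      - name: Run email integration tests
        run: pnpm test:email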

The test token ensures nothing escapes to real inboxes, even if your tests have bugs. Each test provisions its own inbox, runs assertions, and tears it down. Parallel test runners each get isolated inboxes, so there's no cross-contamination between test cases.
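
Since that safety depends entirely on the token being a test token, a cheap extra guard is to check the prefix before any email test runs. This is a minimal sketch built on the lm_sk_test_* pattern described earlier; `assertTestToken` is a hypothetical helper, not part of the SDK.

```typescript
// Prefix taken from the lm_sk_test_* token pattern.
const TEST_TOKEN_PREFIX = 'lm_sk_test_';

// Throws unless the token is a test-mode token; returns it otherwise.
// Error messages deliberately never include the token value.
function assertTestToken(token: string | undefined): string {
  if (!token) {
    throw new Error('No email API token set; refusing to run email tests.');
  }
  if (!token.startsWith(TEST_TOKEN_PREFIX)) {
    throw new Error('Email tests require a test-mode token (lm_sk_test_*).');
  }
  return token;
}
```

Calling it once in your test runner's global setup, for example with `process.env.LOBSTERMAIL_TOKEN`, makes a misconfigured secret fail the suite immediately instead of sending live mail.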

One pattern I've found effective: tag your email integration tests separately from unit tests. Email tests involve network I/O and inbox provisioning, so they're inherently slower. Run them on PR creation and merge, not on every push. Your feedback loop stays fast without sacrificing coverage.

Picking the right tool#

If you're already using LobsterMail in production, use its test mode for integration testing. Same API, same behavior, zero risk of real delivery.

If you're starting a new agent project and haven't picked email infrastructure yet, the test-mode approach is worth considering from day one. Your test token is ready immediately. No credit card, no human signup. Your agent handles the setup itself.

If you need to test against email infrastructure you don't control (a client's Exchange server, a partner's custom mail gateway), Mailosaur's SMTP capture approach gives you more flexibility there. And if you're running complex multi-agent simulations where email is one of five or six services to mock, smolclaw's broader scope might be the better fit.

The worst option is no mock inbox at all. Pick any tool from this list, wire it into your test suite this week, and stop wondering whether your next test run will email a real customer.

Frequently asked questions

What is a mock inbox and why do AI agents need one during integration testing?

A mock inbox captures emails in an isolated environment so they never reach real recipients. AI agents need them because agents act autonomously and can send hundreds of messages during a test run before anyone intervenes. Without a mock inbox, every test is a live-fire exercise.

How does Mailosaur compare to Mockmail for AI agent email workflows?

Mailosaur is a hosted service with a mature API and broad CI/CD support. Mockmail is self-hosted and free but requires you to maintain the infrastructure. Neither was designed for agent workflows specifically, so both lack injection protection and native multi-turn thread simulation.

Can I use disposable test addresses to simulate agent signup and onboarding flows?

Yes. Most mock inbox providers let you create addresses on demand. With LobsterMail, createSmartInbox() using a test token gives you a working address that captures all inbound mail without forwarding anything externally.

What's the difference between a mock inbox and a sandbox email environment?

A mock inbox captures individual messages for assertion. A sandbox (like LobsterMail's test mode) replicates the full production API, including sending, threading, and webhook delivery, with a safety net that blocks external delivery. Sandboxes catch more integration bugs because they exercise the real code path.

How do I automatically extract OTP or verification codes from captured test emails?

Most mock inbox APIs return the full email body as structured data. Parse the HTML or plain-text body with a regex for common OTP formats (6-digit codes, alphanumeric tokens). LobsterMail's SDK returns parsed email objects you can query directly in test assertions.
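
For the common 6-digit case, a minimal sketch might look like the following. It assumes you already have the plain-text body as a string; `extractSixDigitCode` is an illustrative helper, and the pattern would need adjusting for alphanumeric tokens or other code lengths.

```typescript
// Extracts the first standalone 6-digit code from a plain-text email body.
// Returns null when no code is found.
function extractSixDigitCode(body: string): string | null {
  // Lookarounds reject 6-digit runs that sit inside a longer number.
  const match = body.match(/(?<!\d)\d{6}(?!\d)/);
  return match ? match[0] : null;
}

console.log(extractSixDigitCode('Your verification code is 482913.')); // 482913
console.log(extractSixDigitCode('Order 12345678 has shipped.')); // null
```

The lookarounds matter in practice: without them, order numbers, phone numbers, and timestamps in the same email can be mistaken for the code.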

How do I wire a mock inbox into a CI/CD pipeline so email assertions run on every build?

Store your test API token as a CI secret, provision fresh inboxes in your test setup, run your agent's email workflow, assert against captured messages, and tear down inboxes in cleanup. Most providers have REST APIs that work with GitHub Actions, GitLab CI, and CircleCI.

What risks come up when an AI agent sends emails to real users during staging?

You can damage your domain's sender reputation, accidentally create accounts on third-party services, violate data protection regulations by sending test content to real addresses, and confuse actual customers with automated messages they never expected.

Can mock inbox tools test inbound email handling, not just outbound sending?

Yes. LobsterMail and smolclaw both support injecting test emails into an inbox so your agent can process them. This lets you test the full receive-parse-respond cycle without relying on external senders.

How do I simulate multi-turn email threads for stateful agent conversation tests?

Create two test inboxes and have them exchange messages. Verify that threading headers (In-Reply-To, References) stay consistent across the conversation. LobsterMail's test mode handles threading natively, so the headers behave exactly as they would in production.

Which mock inbox tools have a free tier for early-stage agent development?

LobsterMail offers 1,000 emails per month free with no credit card. Mailosaur includes around 100 messages on its trial. Mockmail is free if you self-host. smolclaw offers 500 emails on its free tier.

Do mock email testing tools support webhooks for real-time test assertions?

Mailosaur, smolclaw, and LobsterMail all support webhooks in their test environments. This lets your test harness react to incoming emails as they arrive rather than polling. For more on the trade-offs, see webhooks vs polling.
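
Where receiving webhooks inside a CI job is awkward, polling with a deadline is the usual fallback. The sketch below is provider-agnostic: `waitForMessages` is a hypothetical helper, and `fetchMessages` stands in for whatever read call your provider exposes. The default timeout and interval are arbitrary starting points.

```typescript
// Polls `fetchMessages` until it returns at least one message,
// or throws once the timeout would be exceeded.
async function waitForMessages<T>(
  fetchMessages: () => Promise<T[]>,
  timeoutMs = 10_000,
  intervalMs = 250,
): Promise<T[]> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const messages = await fetchMessages();
    if (messages.length > 0) return messages;
    // Give up if the next poll would start after the deadline.
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`No message arrived within ${timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

An explicit timeout keeps a missing email from hanging the whole CI job; the test fails with a clear message instead.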

How many concurrent disposable inboxes do I need for a parallelized test suite?

One inbox per parallel test worker at a time is the safest approach. If you run 8 parallel workers with 25 email tests each, at most 8 inboxes exist concurrently, not 200, because each worker holds only one inbox at any moment, whether it creates and destroys one per test or reuses one across its tests. Either way, never share an inbox across workers.

What's the fastest way to provision and tear down a test inbox between test runs?

Use the provider's REST API or SDK in your test setup and teardown hooks. LobsterMail provisions inboxes in under 200ms. Avoid creating inboxes through a dashboard for automated testing.

Is LobsterMail free for integration testing?

Yes. The free tier includes 1,000 emails per month with no credit card required. Test tokens (lm_sk_test_*) are available immediately on signup and give you the full API surface.

What security considerations matter when choosing a shared mock inbox service?

Check whether test data is isolated per account, whether emails are encrypted at rest, and how long the provider retains captured emails. Avoid sending real customer data through shared mock services. Use synthetic test data instead.
