
crewai multi-crew email routing: how flows dispatch emails to specialized crews
How CrewAI's Flow system routes incoming emails to specialized crews using @router and @listen, plus the infrastructure decisions most tutorials skip.
Your agent gets 200 emails a day. Some are customer support tickets, some are purchase confirmations, some are newsletter noise your agent never subscribed to. A single CrewAI crew trying to handle all of them will either drown in prompt complexity or produce generic responses that miss the point entirely.
Multi-crew email routing solves this by splitting the workload. One crew triages incoming messages. Specialized crews handle each category with their own agents and tools. CrewAI's Flow system connects them with decorators that control when and where each email lands.
This guide covers how multi-crew routing works, how to wire it up, and the infrastructure decisions that most tutorials skip.
How CrewAI multi-crew email routing works
Multi-crew email routing uses CrewAI Flows to classify incoming emails and dispatch them to specialized crews. Here is the general sequence:
- Create an ingestion crew that receives raw email data.
- Use the @router decorator to classify each email by intent.
- Define specialized crews for each category with their own agents and tools.
- Connect crews with @listen decorators that trigger on specific router outputs.
- Pass email context between crews using a shared Pydantic state object.
- Add a response crew to draft and send replies from specialized crew outputs.
- Insert a human-in-the-loop approval step before outbound emails are sent.
Each crew operates independently. The Flow orchestrates the handoffs.
The beauty of this pattern is that each crew stays small and focused. Your support crew only needs tools for ticket lookup and knowledge base retrieval. Your billing crew only needs invoice and payment APIs. Neither crew carries the prompt weight of the other's context, which means faster execution and more accurate results.
The @router decorator: your email classifier
The @router decorator is what makes conditional branching possible. It sits on a method in your Flow class and returns a string label that determines which downstream crew receives the email.
```python
from crewai.flow.flow import Flow, listen, start, router
from pydantic import BaseModel

class EmailState(BaseModel):
    sender: str = ""
    subject: str = ""
    body: str = ""
    category: str = ""
    handled_by: str = ""
    response_draft: str = ""
    confidence: float = 0.0
    error_log: str = ""

class EmailRoutingFlow(Flow[EmailState]):
    # classification_crew, support_crew, and billing_crew are assumed to be
    # defined on the class (the classification crew setup appears below)

    @start()
    def ingest_email(self):
        return self.state

    @router(ingest_email)
    def classify(self):
        result = self.classification_crew.kickoff(
            inputs={"subject": self.state.subject, "body": self.state.body}
        )
        self.state.category = result.raw
        return result.raw  # "support", "billing", "spam", etc.

    @listen("support")
    def handle_support(self):
        self.state.handled_by = "support_crew"
        return self.support_crew.kickoff(inputs={"email": self.state})

    @listen("billing")
    def handle_billing(self):
        self.state.handled_by = "billing_crew"
        return self.billing_crew.kickoff(inputs={"email": self.state})

    @listen("spam")
    def handle_spam(self):
        self.state.handled_by = "spam_filter"
```
When classify() returns "support", only handle_support() fires. The @listen decorator accepts the exact string the router returns.
This is where the difference between @listen and @router trips people up. Use @listen(method_ref) when every downstream step should run after a method completes, regardless of output. Use @router when only one path should execute based on the classification result. In email routing, you almost always want @router, because a billing email shouldn't also trigger your support crew.
Building a reliable classification crew
The classification crew is the single point of failure in this entire system. If it mislabels a billing dispute as spam, that email vanishes into a black hole. Here's a minimal but effective setup:
```python
from crewai import Agent, Task, Crew

classifier_agent = Agent(
    role="Email Classifier",
    goal="Classify incoming emails into exactly one category",
    backstory=(
        "You are an email triage specialist. You read the subject and body, "
        "then return one label: support, billing, spam, or general."
    ),
    llm="gpt-4o-mini",
    verbose=False,
)

classification_task = Task(
    description=(
        "Classify this email. Subject: {subject}. Body: {body}. "
        "Return ONLY one word: support, billing, spam, or general."
    ),
    expected_output="A single classification label",
    agent=classifier_agent,
)

classification_crew = Crew(
    agents=[classifier_agent],
    tasks=[classification_task],
    verbose=False,
)
```
A few things to note about this setup. First, gpt-4o-mini is fast and cheap enough for classification. You don't need a large model to decide whether an email is a support ticket or spam. Save the heavier models for your response-drafting crews where output quality matters more.
Second, the task description asks for exactly one word. This is intentional. If your classifier returns "This email appears to be a support request," the router won't match any @listen handler. Constrain the output format tightly and add output validation if you're running in production.
Third, you can add a confidence score by asking the classifier to return structured output (a JSON object with category and confidence fields). Route low-confidence classifications to a human review queue instead of letting them flow through automatically.
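As a sketch, the router can gate on that confidence before dispatching. The JSON field names and the 0.8 threshold here are assumptions to tune against your own traffic; anything malformed or uncertain falls through to "unknown" for manual review:

```python
import json

CONFIDENCE_THRESHOLD = 0.8  # assumption: calibrate against your own corpus
VALID_LABELS = {"support", "billing", "spam", "general"}

def route_with_confidence(raw_output: str) -> str:
    """Parse the classifier's JSON output and route low-confidence or
    malformed results to human review instead of a specialized crew."""
    try:
        parsed = json.loads(raw_output)
        category = parsed.get("category", "unknown")
        confidence = float(parsed.get("confidence", 0.0))
    except (json.JSONDecodeError, TypeError, ValueError):
        return "unknown"  # unparseable output goes to manual review
    if category not in VALID_LABELS or confidence < CONFIDENCE_THRESHOLD:
        return "unknown"
    return category
```

Inside the flow, classify() would return `route_with_confidence(result.raw)` instead of the raw string, so the "unknown" handler catches everything the classifier couldn't confidently label.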
Keeping email context intact between crews
State management is where multi-crew flows get messy. Each crew runs independently, so you need a shared state object that carries email metadata, classification results, draft responses, and error context through the entire pipeline.
CrewAI Flows handle this with a typed Pydantic model (the EmailState class above). Every method in the flow reads from and writes to self.state. This works well for linear pipelines but gets complicated when you run crews in parallel or need to merge outputs from multiple branches.
A few patterns that hold up in production:
- Keep the state model flat. Nested objects make debugging harder when a crew writes to the wrong field.
- Log the state after each step. When a routing crew misclassifies an email, you need to trace exactly what the classifier saw and what label it returned.
- Add a handled_by field to track which crew processed each message. This is your audit trail when something goes wrong at 2 AM.
- Set default values for every field. A crew that reads an uninitialized field will fail silently or hallucinate data.
- Include an error_log field. When a downstream crew fails, write the exception details to state so your fallback handler has full context.
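The log-the-state-after-each-step pattern can be as small as a helper called at the end of every flow method. A minimal sketch (model_dump() assumes Pydantic v2; on v1 the equivalent is .dict()):

```python
import json
import logging

logger = logging.getLogger("email_flow")

def log_state(step_name: str, state) -> None:
    """Dump the full flow state after a step completes, so a misroute can
    be traced back to exactly what the classifier saw and returned."""
    # model_dump() is Pydantic v2; use state.dict() on Pydantic v1
    logger.info("step=%s state=%s", step_name, json.dumps(state.model_dump()))
```

Calling `log_state("classify", self.state)` at the end of each flow method gives you one structured line per step without touching crew logic.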
Handling state across parallel branches
If you're dispatching emails to multiple crews simultaneously (for example, running sentiment analysis in parallel with response drafting), you need to be careful about state mutations. Two crews writing to the same field will create a race condition.
The safest approach is giving each parallel branch its own state fields. Your sentiment crew writes to self.state.sentiment_score while your drafting crew writes to self.state.response_draft. A downstream merge step reads both fields and combines them.
```python
from crewai.flow.flow import and_

class EmailRoutingFlow(Flow[EmailState]):
    # ... ingest_email and classify steps as defined earlier ...
    # assumes a sentiment_score field was added to EmailState

    @listen("support")
    def analyze_sentiment(self):
        result = self.sentiment_crew.kickoff(inputs={"body": self.state.body})
        self.state.sentiment_score = result.raw

    @listen("support")
    def draft_support_response(self):
        result = self.support_crew.kickoff(inputs={"email": self.state})
        self.state.response_draft = result.raw

    @listen(and_(analyze_sentiment, draft_support_response))
    def finalize_response(self):
        # and_ waits for both branches to complete before merging their outputs
        if float(self.state.sentiment_score) < 0.3:
            self.state.response_draft = self.escalation_crew.kickoff(
                inputs={"draft": self.state.response_draft}
            ).raw
```
This pattern keeps state mutations isolated and predictable.
If you're coordinating emails between multiple agents (not just crews within one system), the architecture changes significantly. We wrote about that in multi-agent email: when agents need to talk to each other.
Error handling and fallback routes
Production email flows fail. LLM calls time out, classification crews return unexpected labels, and downstream crews throw exceptions. Without explicit error handling, a failed crew silently drops the email.
Wrap every crew kickoff in a try/except block and route failures to a fallback handler:
```python
import logging

class EmailRoutingFlow(Flow[EmailState]):
    # ... routing steps as defined earlier ...

    @listen("support")
    def handle_support(self):
        try:
            self.state.handled_by = "support_crew"
            result = self.support_crew.kickoff(inputs={"email": self.state})
            self.state.response_draft = result.raw
        except Exception as e:
            self.state.error_log = str(e)
            self.state.handled_by = "fallback"
            self.fallback_handler()

    def fallback_handler(self):
        # Log the failure, notify a human, queue for retry
        logging.error(f"Crew failed for email: {self.state.subject}")
        logging.error(f"Error: {self.state.error_log}")
        self.notification_service.alert(
            channel="email-failures",
            message=f"Failed to process: {self.state.subject}",
        )
```
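CrewAI doesn't retry Flow steps for you, so a thin wrapper around kickoff can absorb transient LLM timeouts before they ever reach the fallback path. A sketch with illustrative backoff parameters:

```python
import logging
import time

def kickoff_with_retry(crew, inputs, max_attempts=3, base_delay=2.0):
    """Retry a crew kickoff with exponential backoff before giving up.
    On the final failure, re-raise so the caller's fallback handler runs."""
    for attempt in range(1, max_attempts + 1):
        try:
            return crew.kickoff(inputs=inputs)
        except Exception as e:
            if attempt == max_attempts:
                raise  # exhausted retries; let the fallback handler take over
            delay = base_delay * (2 ** (attempt - 1))  # 2s, 4s, 8s, ...
            logging.warning(
                "kickoff failed (attempt %d/%d): %s; retrying in %.1fs",
                attempt, max_attempts, e, delay,
            )
            time.sleep(delay)
```

Swapping `self.support_crew.kickoff(...)` for `kickoff_with_retry(self.support_crew, ...)` inside the try block keeps the fallback route reserved for persistent failures.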
You should also handle the case where the classifier returns a label that doesn't match any @listen handler. Add a catch-all route for unknown categories:
```python
class EmailRoutingFlow(Flow[EmailState]):
    # ... other routes ...

    @listen("unknown")
    def handle_unknown(self):
        self.state.handled_by = "manual_review"
        # Forward to human review queue
```
And update your classification prompt to include "unknown" as a valid output when the email doesn't fit any defined category. This prevents the classifier from forcing a bad match just to produce output.
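A cheap defensive layer on top of the prompt change is normalizing whatever the classifier returns before the router emits it, so near-misses like "Support." still route and everything else falls through to "unknown". A minimal sketch:

```python
VALID_LABELS = {"support", "billing", "spam", "general"}

def normalize_label(raw: str) -> str:
    """Collapse classifier output onto a known label, or "unknown".
    Handles near-misses like "Support." or stray whitespace."""
    label = raw.strip().strip(".").strip().lower()
    return label if label in VALID_LABELS else "unknown"
```

In classify(), returning `normalize_label(result.raw)` instead of `result.raw` guarantees the router only ever emits labels that have a matching @listen handler.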
The infrastructure layer nobody talks about
Every CrewAI email tutorial starts the same way: connect to Gmail with OAuth, poll for new messages, pipe them into a Flow. This works for demos. It falls apart in production.
Polling vs. webhooks
Polling is inherently slow and wasteful. Checking Gmail every 30 seconds means your agent burns API calls doing nothing most of the time, then processes emails in batches with unpredictable latency. At 500+ emails per day, you'll hit rate limits and start missing time-sensitive messages.
Webhook-based delivery flips this model. Instead of your agent asking "any new mail?" every 30 seconds, the mail server pushes each email to your agent the moment it arrives. Latency drops from seconds-to-minutes down to milliseconds. Your agent processes emails one at a time in real time, and you eliminate the wasted API calls entirely.
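A framework-agnostic sketch of the ingestion handler: one webhook delivery becomes one flow run. The payload field names ("from", "subject", "body") are assumptions, so match them to your provider's actual schema. Passing the flow class in as a factory keeps the handler testable:

```python
from typing import Callable

def handle_inbound_webhook(payload: dict, flow_factory: Callable) -> dict:
    """Turn one webhook delivery into one flow run.
    Payload field names are assumed; adjust to your email provider."""
    inputs = {
        "sender": payload.get("from", ""),
        "subject": payload.get("subject", ""),
        "body": payload.get("body", ""),
    }
    flow = flow_factory()  # fresh flow per email, no state leaks between runs
    flow.kickoff(inputs=inputs)
    return {
        "category": flow.state.category,
        "handled_by": flow.state.handled_by,
    }
```

Mount this behind any HTTP framework's POST route and pass `flow_factory=EmailRoutingFlow`; the factory injection also lets you exercise the handler with a fake flow in tests.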
OAuth token fragility
OAuth token management adds another layer of fragility. Gmail tokens expire, need refreshing, and break silently. Your entire routing flow stops at 3 AM because a token expired and nobody was around to re-authorize. There's no clean way to auto-recover without storing refresh tokens securely, which most tutorials gloss over.
In practice, this means building a token refresh service, handling the edge case where a refresh token itself expires (Google revokes refresh tokens that haven't been used in six months), and monitoring for auth failures. That's a lot of infrastructure for something that has nothing to do with your email routing logic.
Deliverability is invisible until it isn't
If your routing flow generates and sends replies, those emails need to actually arrive. SPF records, DKIM signing, domain warmup, sender reputation: none of this is handled by CrewAI itself. It's infrastructure you have to build or source separately.
A freshly configured domain sending 50 emails on day one will land in spam. Domain warmup requires gradually increasing send volume over weeks, monitoring bounce rates, and maintaining consistent sending patterns. Your agent doesn't care about any of this; it just calls an SMTP function. But if you skip the deliverability work, every reply your agent crafts goes straight to junk folders.
The isolation problem
When your agent reads from and sends through your personal Gmail, one misconfigured crew can reply-all to your entire contact list. Agent email and human email should never share the same mailbox.
The fix is giving your agent its own email infrastructure. Dedicated inboxes with webhook-based delivery instead of polling, plus send limits that prevent a runaway crew from torching your domain reputation. We covered the simplest version of this in how to give your CrewAI agent an email address in 5 minutes.
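A send limit can be as simple as a rolling-window counter checked before every outbound send. This is a hypothetical guard you'd build yourself, not a LobsterMail API:

```python
import time

class SendLimiter:
    """Hard cap on outbound sends per rolling window. A runaway crew hits
    the cap and gets blocked instead of torching your domain reputation."""

    def __init__(self, max_sends: int, window_seconds: float = 86400.0):
        self.max_sends = max_sends
        self.window = window_seconds
        self.timestamps = []

    def allow(self, now=None) -> bool:
        now = time.time() if now is None else now
        # Drop sends that have aged out of the window
        self.timestamps = [t for t in self.timestamps if now - t < self.window]
        if len(self.timestamps) >= self.max_sends:
            return False
        self.timestamps.append(now)
        return True
```

Checking `limiter.allow()` immediately before the SMTP call means a misbehaving crew can fail loudly at the cap rather than blast your whole contact list.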
LobsterMail handles this for agent workflows. Your agent gets its own dedicated inbox, emails arrive via webhook the moment they land, and built-in injection scoring lets your classification crew filter malicious content before it reaches downstream agents. You point the webhook at your Flow's ingestion endpoint and the ingestion layer is done.
Testing without sending real emails
You don't need live email to validate routing logic. Create test fixtures with sample email data and feed them directly into your Flow:
```python
test_emails = [
    EmailState(
        sender="alice@example.com",
        subject="Can't log in",
        body="Password reset not working since yesterday morning",
    ),
    EmailState(
        sender="billing@vendor.com",
        subject="Invoice #4821",
        body="Payment due March 30 for services rendered",
    ),
    EmailState(
        sender="noreply@spammer.net",
        subject="You won a free iPad",
        body="Click here to claim your prize immediately",
    ),
    EmailState(
        sender="ops@partner.com",
        subject="API integration question",
        body="We need help connecting our webhook endpoint",
    ),
]

for email in test_emails:
    flow = EmailRoutingFlow()  # fresh flow per email so state doesn't leak
    flow.kickoff(inputs=email.model_dump())
    print(f"{email.subject} -> {flow.state.category} -> {flow.state.handled_by}")
```
This validates routing rules and crew outputs without touching a mail server. Once your classification logic is solid, connect the live ingestion layer and test with real volume.
Building a regression test suite
Beyond basic routing tests, build a regression suite that catches classifier drift. Every time you find a misrouted email in production, add it to your test fixtures with the correct expected label. Over time, this corpus becomes your ground truth for classification accuracy.
```python
import pytest

regression_cases = [
    ("Can't log in", "Password reset not working", "support"),
    ("Invoice #4821", "Payment due March 30", "billing"),
    ("You won a free iPad", "Click here to claim", "spam"),
    ("Meeting tomorrow", "Let's sync at 3pm", "general"),
]

@pytest.mark.parametrize("subject,body,expected", regression_cases)
def test_classification(subject, body, expected):
    flow = EmailRoutingFlow()
    flow.kickoff(inputs={"subject": subject, "body": body})
    assert flow.state.category == expected
```
Tip
Run your test suite every time you update the classification crew's prompt. Small prompt changes can shift routing behavior in unexpected ways.
When multi-crew routing is overkill
Not every email workflow needs four crews and a router. If your agent handles one type of email (verification codes, form submissions, support tickets from a single product), a single crew with well-defined agents is simpler and faster to debug.
Multi-crew routing pays off when:
- Emails arrive from diverse sources with different intents
- Each intent requires different tools or response formats
- You need an audit trail showing which pipeline handled each message
- Volume is high enough that a single crew's context window becomes a constraint
- Different email categories need different LLM models (cheap and fast for spam filtering, capable and thorough for support responses)
Start with one crew. Add routing when the complexity demands it, not before.
Putting it all together
A complete multi-crew email routing system has four layers. The ingestion layer receives emails via webhook and converts them into your Pydantic state model. The classification layer runs a lightweight crew that labels each email. The dispatch layer uses @router and @listen to send emails to the right specialized crew. The response layer drafts replies, runs them through approval (human or automated), and sends them from your agent's dedicated address.
Each layer can be developed and tested independently. Your classification crew doesn't care whether emails arrive via webhook or test fixture. Your specialized crews don't care how emails were classified. This modularity is what makes multi-crew routing maintainable as your email volume grows and your categories multiply.
The most common mistake is building all four layers at once. Start with classification. Get that working reliably with test fixtures. Then add one specialized crew. Validate it end-to-end. Then connect live email ingestion. Each layer you add should work before you build the next one.
Frequently asked questions
What is multi-crew email routing in CrewAI and when should you use it?
It's a pattern where CrewAI Flows classify incoming emails and dispatch them to specialized crews based on intent. Use it when your agent handles diverse email types that need different tools, prompts, or response formats. For single-purpose email handling (like extracting verification codes), a single crew is simpler.
How does the @router decorator decide which crew handles an incoming email?
The @router method runs your classification logic (which can be a full crew with its own agents) and returns a string label. Only the @listen handler matching that exact string will fire. You can classify using LLM-based reasoning, regex matching, sender domain lookups, or any combination.
What is the difference between @listen and @router in a CrewAI flow?
@listen(method_ref) triggers whenever the referenced method completes, regardless of its output. @router returns a label and only activates the @listen handler that matches. Use @listen for unconditional sequential steps; use @router for conditional branching.
How do you prevent an email from being dropped if a routing crew times out?
Wrap crew kickoffs in try/except blocks and route failures to a fallback handler that logs the full state for reprocessing. CrewAI doesn't include built-in retry logic for Flow steps, so you'll need to implement retries yourself or use an external task queue like Celery.
Can you run email classification and response-generation crews in parallel?
Yes. Use multiple @listen decorators on the same router output to trigger parallel crews. Merging their outputs requires a downstream step that waits for all branches. CrewAI's or_ and and_ operators help coordinate parallel branches within a Flow.
What Gmail or IMAP permissions does a CrewAI email agent require?
For Gmail, you'll typically need gmail.readonly and gmail.send OAuth scopes. Full inbox management (labels, deletion) requires gmail.modify. Tokens expire periodically and need refreshing. An alternative is giving your agent a dedicated inbox that doesn't depend on human OAuth sessions.
How do you add a human-in-the-loop approval step before sending a reply?
Insert a step between the response-drafting crew and the sending step that pauses the flow and surfaces the draft for review. A common approach is posting the draft to Slack or a dashboard via webhook, then resuming the flow when a human approves.
What is the difference between a CrewAI agent and a crew?
An agent is a single AI entity with a defined role, goal, and backstory. A crew is a team of agents that collaborate on a set of tasks using a defined process (sequential or hierarchical). In multi-crew routing, each branch uses a separate crew with agents specialized for that email category.
What email volume makes polling impractical compared to webhooks?
Polling Gmail typically breaks down around 200-500 emails per day due to API rate limits and accumulated latency. Beyond that threshold, webhook-based delivery (where emails push to your agent in real time) is more reliable and eliminates wasted polling cycles.
How does CrewAI multi-crew routing compare to LangGraph for email workflows?
CrewAI Flows use @router and @listen decorators, which means less boilerplate for straightforward routing. LangGraph uses explicit graph definitions with nodes and edges, giving finer control over complex state machines. For simple email classification and dispatch, CrewAI is faster to prototype. For flows with many conditional branches and cycles, LangGraph's graph model can be easier to reason about.
What infrastructure do you need outside CrewAI to send emails reliably?
CrewAI handles orchestration but not email delivery. You need SMTP sending capability, DNS authentication records (SPF, DKIM, DMARC), and domain warmup to avoid spam folders. LobsterMail handles this automatically so your agent can send from a verified address without manual DNS configuration.
Can CrewAI route emails based on sender domain or subject keywords?
Yes. The @router method has full access to the email state object, so you can route on any field: sender domain, subject-line patterns, body content, attachment types, or a combination. You can use simple conditionals or a full LLM-based classification crew.
What are the deliverability risks of letting an AI agent send emails autonomously?
Agents can send too many emails too fast, trigger spam filters with repetitive content, or damage sender reputation by emailing invalid addresses. Mitigate this with send rate limits, content variation, proper bounce handling, and authentication records. Using a dedicated agent email address (separate from your human accounts) contains the blast radius if something goes wrong.
How do you log which crew handled each email for auditing?
Add a handled_by field to your Pydantic state model and set it in each @listen handler. After the flow completes, write the full state (including category, handler, timestamps) to a database or logging service. This gives you a complete audit trail of every routing decision.


