The numbers always come up first in these conversations: 50,000+ customer interactions per day, 180 live agents replaced by one deployment, 15 service flows, 99.97% uptime over 30 days. They're real. They're audited monthly. But they're also the boring part of the story. The interesting part is what we got wrong before we got it right.
This essay walks through the architecture of a production service agent in regulated FinTech. Names anonymized at the client's request. Numbers and patterns are real.
The system in one diagram
```
Channel (web / app / SMS)
        ↓
Identity + intent classifier (fine-tuned 7B)
        ↓
[tier-1 self-serve]    ← cards, FAQs, status pages
[tier-2 GenAI agent]   ← persona-driven, tool-equipped (continues below)
[tier-3 human]         ← escalation queue
        ↓  (tier-2 path)
Persona Router (5 personas: billing, card, KYC, dispute-triage, general)
        ↓
RAG (scoped corpus per persona) + Tool Calls (transactions, account, status)
        ↓
LLM Response (frontier model + few-shot for tone)
        ↓
Compliance Filter (PII redaction, profanity, regulated-claim guards)
        ↓
Response delivery + Telemetry (OpenTelemetry trace)
```
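In code, the spine of that diagram is short. The sketch below is illustrative only: every helper hanging off `deps` (classify, route_persona, compliance_filter, and so on) is a hypothetical stand-in for a production component, not its actual interface.

```python
# Minimal sketch of the request spine. Every helper on `deps` is a hypothetical
# stand-in for a production component, not its real API.
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.7          # below this, clarify or escalate (mistake #3 below)

@dataclass
class Intent:
    label: str                  # e.g. "billing.refund_status"
    confidence: float
    tier: int                   # 1 = self-serve, 2 = GenAI agent, 3 = human

def handle_request(channel: str, text: str, deps) -> str:
    intent: Intent = deps.classify(text)              # fine-tuned 7B classifier

    if intent.confidence < CONFIDENCE_FLOOR:
        return deps.clarify(text)                     # clarification turn or human

    if intent.tier == 1:
        return deps.self_serve(intent)                # cards, FAQs, status pages
    if intent.tier == 3:
        return deps.escalate(text, intent)            # human escalation queue

    # Tier 2: the persona-driven, tool-equipped GenAI path
    persona = deps.route_persona(intent)              # one of the five personas
    context = deps.retrieve(persona, text)            # scoped RAG corpus
    draft   = deps.llm(persona, context, text)        # frontier model + few-shot tone
    safe    = deps.compliance_filter(draft)           # PII, profanity, regulated claims
    deps.trace(channel, persona.name, intent.label)   # OpenTelemetry span attributes
    return safe
```

The rest of this essay is about the decisions hiding inside those few lines: how the tiers split traffic, how personas are routed, and what happens when the classifier isn't sure.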
What "50K conversations a day" actually means
Distribution matters. At peak (Friday evening across the GCC), the system handles 8.3 requests per second. The arrival pattern is bursty — Eid week sees 4x volume. The architecture has to support that without elastic-scale surprises, because every model call is a billable event.
Tier-1 self-serve catches 38% of incoming traffic. The GenAI tier-2 agent handles 47%. Tier-3 humans handle 15%. The economics work because the GenAI tier handles the meaty middle — the 47% that previously consumed your most expensive support seats.
Five things we got wrong
1. We started with one persona
The initial design routed everything through a single "service agent" persona. Predictably, the model overgeneralized: it answered KYC questions in the tone of a billing agent and cited card policies in account-recovery flows. Splitting into five distinct personas (billing, card, KYC, dispute-triage, general) lifted accuracy by 9 points and CSAT by 0.4 on the 5-point scale.
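Mechanically, the split is a registry plus a router: each intent family gets its own system prompt, retrieval scope, and tool allowlist. A hedged sketch of what that configuration might look like; the prompts, corpus names, and tools below are placeholders, not the client's real setup.

```python
# Illustrative persona registry; prompts, corpus names, and tool lists are placeholders.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Persona:
    name: str
    system_prompt: str
    corpus: str                          # the only collection this persona may retrieve from
    tools: tuple = field(default_factory=tuple)

PERSONAS = {
    "billing": Persona("billing", "You are a billing specialist...", "billing_docs",
                       ("get_invoice", "get_payment_status")),
    "card":    Persona("card", "You are a card-services specialist...", "card_docs",
                       ("get_card_status", "freeze_card")),
    "kyc":     Persona("kyc", "You are a KYC and onboarding specialist...", "kyc_docs",
                       ("get_kyc_status",)),
    "dispute": Persona("dispute-triage", "You triage transaction disputes...", "dispute_docs",
                       ("get_transaction", "open_dispute")),
    "general": Persona("general", "You are a general service agent...", "general_docs"),
}

def route_persona(intent_label: str) -> Persona:
    # "billing.refund_status" -> "billing"; anything unrecognized falls back to general.
    family = intent_label.split(".", 1)[0]
    return PERSONAS.get(family, PERSONAS["general"])
```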
2. We used the same RAG corpus for every persona
Initially the agent retrieved from "all policy documents." It would cite a corporate-card document while answering a personal-card question. The fix was scoped corpora per persona — the billing persona only retrieves from billing docs, the KYC persona only from KYC docs. Cross-persona queries get a routing pass first.
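The part worth copying is that the scope is a hard filter on the retrieval call, not a polite instruction in the prompt. A minimal sketch, assuming a vector store whose search call accepts a metadata filter; the store API here is generic, not any specific library.

```python
# Hedged sketch: per-persona retrieval scoping enforced at query time.
# `store` stands in for whatever vector index is deployed; the filter syntax is illustrative.

def retrieve(store, persona, query: str, k: int = 6):
    # The corpus restriction is a hard filter, so the billing persona physically
    # cannot pull a corporate-card policy chunk into its context.
    return store.search(
        query=query,
        top_k=k,
        filter={"corpus": persona.corpus},   # e.g. "billing_docs", "kyc_docs"
    )
```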
3. We didn't gate on intent classification confidence
We initially routed queries to the GenAI tier even when the intent classifier's confidence was below 0.7, which meant ambiguous queries got confidently wrong answers. Now any query scored below 0.7 routes to a clarification turn ("Can you tell me more about what you need?") or to a human. Those confidently wrong answers dropped 80%.
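The gate itself is only a few lines; the 0.7 threshold is the one from production, and everything else in this sketch is a placeholder.

```python
# Hedged sketch of the confidence gate; 0.7 is the production threshold,
# the helper names on `deps` are placeholders.
CONFIDENCE_FLOOR = 0.7

def gate(intent, text: str, already_clarified: bool, deps) -> str:
    if intent.confidence >= CONFIDENCE_FLOOR:
        return deps.answer(intent, text)          # normal persona pipeline

    if not already_clarified:
        # First ambiguous turn: ask, don't guess.
        return "Can you tell me more about what you need?"

    # Still below the floor after one clarification: hand off to a human
    # rather than answer confidently wrong.
    return deps.escalate(text, reason="low_intent_confidence")
```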
4. We logged tokens but not cost
For three months we had no idea what the system was costing per persona, per channel, or per tenant segment. When we instrumented cost-per-conversation, we discovered that the SMS channel was 3.2x more expensive than web, because retries on truncated responses were happening twice as often there. We tightened the response-length policy on SMS and cut monthly model spend by 27%.
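The instrumentation is cheap once every model call is tagged at emit time with the dimensions you want to slice by. A sketch with illustrative per-token prices and attribute names; the real values depend on your model contract and telemetry schema, and `metrics` stands in for whatever metrics client you run.

```python
# Hedged sketch: attach cost to the same trace attributes already emitted per call.
# Prices and attribute names are illustrative placeholders, not the deployment's rates.
PRICE_PER_1K_TOKENS = {"input": 0.003, "output": 0.015}   # USD, placeholder values

def record_model_call(metrics, *, conversation_id: str, persona: str, channel: str,
                      input_tokens: int, output_tokens: int, retried: bool) -> float:
    cost = (
        input_tokens / 1000 * PRICE_PER_1K_TOKENS["input"]
        + output_tokens / 1000 * PRICE_PER_1K_TOKENS["output"]
    )
    metrics.emit("model_call_cost_usd", cost, attributes={
        "conversation.id": conversation_id,
        "persona": persona,        # billing / card / kyc / dispute-triage / general
        "channel": channel,        # web / app / sms -- the split that exposed the 3.2x gap
        "retry": retried,          # truncation retries were the SMS culprit
    })
    return cost
```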
5. We built no replay infrastructure
When the agent regressed after a model swap, we had no way to test the new model against the old conversations to see what changed. We built a replay harness in week 14 — and immediately wished we'd built it in week 1. Now every model upgrade is preceded by a full replay against the last 30 days of production traffic. We catch regressions before they ship.
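The harness doesn't need to be elaborate: replay logged production turns through the candidate model with the same persona and retrieved context, then score the new answer against what actually shipped. A minimal sketch; the judge function and the 0.9 threshold are placeholders for whatever scoring you trust.

```python
# Minimal sketch of the replay loop: logged production turns vs. a candidate model.
# `judge` and the 0.9 threshold are placeholders, not the production harness.

def replay(logged_turns, candidate_llm, judge, threshold: float = 0.9):
    """logged_turns: records of (persona, retrieved context, user text, shipped answer)."""
    total = 0
    regressions = []
    for turn in logged_turns:                       # e.g. the last 30 days of traffic
        total += 1
        new_answer = candidate_llm(turn.persona, turn.context, turn.user_text)
        score = judge(turn.user_text, turn.shipped_answer, new_answer)
        if score < threshold:
            regressions.append((turn, new_answer, score))
    regression_rate = len(regressions) / max(total, 1)
    return regression_rate, regressions             # gate the model upgrade on this rate
```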
The economics
Conservatively: 180 live agents at fully-loaded GCC cost (salary, benefits, supervision, real estate, training, attrition) ran the client about $14M annually. The deployed agent system costs roughly $1.8M annually all-in (model spend, infrastructure, our retainer, the human escalation team of 22 specialists). Net annual savings: ~$12M. Payback: under 4 months.
More importantly, the human team is now better. Specialists handle higher-value escalations, develop deeper expertise, and earn a CSAT (4.8/5) that surpasses what the old live-agent floor scored (3.9/5). The model didn't replace humans; it routed humans to where they're worth $250 an hour instead of $25.
What we'd tell anyone shipping a service agent
- Personas, not one. Specialize the agent's tone, retrieval, and tools per intent class.
- Triage before LLM. The fastest, cheapest, safest tier of every conversation is the one that doesn't reach the model.
- Cost-per-conversation as a first-class metric. If you're not tracking it, you're losing money on the SMS channel and don't know it.
- Replay infrastructure on day one. You will swap models. The replay harness is what makes that safe.
- Human escalation is a feature, not a fallback. The 15% you escalate is your CSAT signal — defend it.
This pattern is repeatable. We've now deployed variants of it in four different regulated environments. If you're staring at a customer-service P&L and trying to figure out where GenAI fits, book 30 minutes and we'll walk you through the architecture against your specific volumes.