How AI Voice Agents Take PCI-Compliant Payments

Your AI voice agent is impressive. It handles intent recognition, sentiment analysis, conversational routing, customer lookup against your CRM, scheduling, and escalation — all in real time, all without a human agent. Then the customer says: "Yes, I'd like to pay now."

And your stack hits a wall.

Taking a card payment during a live AI voice call isn't a product problem or a UX problem. It's an infrastructure and compliance problem. This guide explains exactly what the architecture looks like, why the naive approaches don't work, and what a correct integration actually involves.

The Problem: AI Can Do Everything Except Take a Payment

The moment a customer agrees to pay, your voice agent needs to capture a card number (typically 16 digits), an expiry date, and a 3- or 4-digit security code (CVV). Under PCI DSS, the card number is cardholder data and the CVV is sensitive authentication data. And PCI DSS has a very clear rule: any system that stores, processes, or transmits that data is in scope for full compliance.

Here's what that means in practice for a contact-center-as-a-service (CCaaS) platform:

If card data enters your AI model — even as audio that gets transcribed — your entire voice infrastructure is in PCI scope. That includes your ASR pipeline, your LLM inference layer, your call recording system, your data lake, your transcription storage, your model training pipelines, and every network segment they touch.

PCI DSS Level 1 certification for that kind of footprint costs roughly $500,000 in the first year. Ongoing annual costs run $200,000 or more, plus quarterly vulnerability scans, annual penetration testing, and a Qualified Security Assessor (QSA) who will not be cheap or fast.

That's the compliance burden of getting this wrong. Most CCaaS companies — even well-funded ones — cannot absorb it, nor should they. The goal is to keep cardholder data out of your platform entirely.

The Architecture: Secure Payment Handoff

The correct architecture is a clean handoff: your AI agent orchestrates the conversation, a separate payment layer handles all card data capture, and your platform never sees, stores, or processes cardholder data. Here's the step-by-step flow:

Intent recognition: The AI agent identifies payment intent from the customer's speech ("I'd like to pay my bill" / "Can I settle this now?").

Amount confirmation: The agent confirms the payment amount with the customer and explains that they'll be prompted to enter their card details via their keypad.

Session initiation: Your platform calls the payment layer with POST /payment-session, passing the amount, the currency, and the payment service provider (PSP) configuration of your end-customer, the business collecting the payment. The payment layer returns a session ID and signals that it's ready to capture.

Audio stream split: This is the critical step. The audio stream bifurcates. The payment layer takes control of the DTMF capture channel. The main call audio (the AI agent's voice and the conversation) is either paused or continues on a separate path that is explicitly isolated from the card capture channel.

Card data entry: The payment layer plays a secure prompt to the caller ("Please enter your 16-digit card number followed by the hash key"). The caller enters digits via their phone keypad.

DTMF capture in isolation: The DTMF tones are captured exclusively by the payment layer. They do not enter your CCaaS platform. They do not reach your AI model. They do not appear in call recordings. The main audio stream receives masked flat tones where the keypad presses would otherwise appear.

Tokenisation and authorisation: The payment layer tokenises the card data and sends it to your end-customer's PSP for authorisation. This entire operation happens within the payment layer's PCI DSS Level 1 certified environment.

Result returned: The PSP returns an auth result to the payment layer. The payment layer fires a payment_completed webhook to your platform with success/failure, a transaction ID, and a masked card reference (e.g., ****4242). No card data.

Conversation resumes: Your AI agent picks up with: "Your payment of £150 has been processed. Your reference number is TXN-9821. Is there anything else I can help you with?"

The entire card capture happens in a sandboxed environment your platform never touches. Your PCI scope is limited to the API calls between your platform and the payment layer — which is a dramatically smaller, more defensible surface area.
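On the platform side, the flow above can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the payment layer's actual design: `CallState`, the event names, `advance`, and `agent_may_listen` are all hypothetical names chosen for this example.

```python
from enum import Enum, auto


class CallState(Enum):
    CONVERSATION = auto()      # AI agent is talking to the caller
    AMOUNT_CONFIRMED = auto()  # caller agreed to pay a confirmed amount
    SESSION_OPEN = auto()      # payment layer returned a session ID
    CARD_CAPTURE = auto()      # payment layer owns the DTMF channel
    RESULT_RECEIVED = auto()   # webhook with success/failure arrived


# Allowed transitions, keyed by (current state, event name).
# Event names are hypothetical labels for the steps described above.
TRANSITIONS = {
    (CallState.CONVERSATION, "payment_intent_confirmed"): CallState.AMOUNT_CONFIRMED,
    (CallState.AMOUNT_CONFIRMED, "session_created"): CallState.SESSION_OPEN,
    (CallState.SESSION_OPEN, "audio_handed_off"): CallState.CARD_CAPTURE,
    (CallState.CARD_CAPTURE, "payment_completed"): CallState.RESULT_RECEIVED,
    (CallState.RESULT_RECEIVED, "agent_resumed"): CallState.CONVERSATION,
}


def advance(state: CallState, event: str) -> CallState:
    """Return the next state, or raise if the event is not valid here."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state.name}") from None


def agent_may_listen(state: CallState) -> bool:
    """The AI pipeline must not consume caller audio during card capture."""
    return state is not CallState.CARD_CAPTURE
```

Making the transitions explicit means an out-of-order event, such as a webhook arriving for a call that never opened a session, fails loudly instead of silently resuming the agent.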

DTMF vs Speech Recognition for Card Capture

When engineers first think about AI voice agent payments, the obvious question is: "Why can't the AI just listen to the customer read out their card number?" It understands speech. It can transcribe numbers.

It can, technically. But compliance architecture doesn't care what the AI can do — it cares what data flows where.

If a customer reads their card number aloud and your ASR pipeline transcribes it, that audio and that transcript both contain cardholder data. Your entire ASR infrastructure is now in PCI scope. Your call recording system is in scope. Your transcription storage is in scope. Your model training data — if you're using call audio for fine-tuning — is in scope.

DTMF (Dual-Tone Multi-Frequency) keypad entry solves this at the architecture level:

Channel isolation: DTMF tones can be captured on a separate audio path that is entirely managed by the payment layer. The main call audio stream never carries card data.
Tone masking: Standard practice is to replace DTMF tones in the main audio stream with flat replacement tones (sometimes called "beeping"). Call recordings contain no card data — they contain silence or flat tones during the card entry window.
No transcription: There's no speech-to-text step for card data. The payment layer decodes the tones directly. No LLM, no ASR, no transcript.
Caller familiarity: Customers are used to entering card details via keypad. It's the standard IVR flow they've been using for 20 years, so it adds little UX friction.
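Channel isolation and tone masking can be illustrated with a toy event splitter. The event model here, tuples of `("audio", payload)` or `("dtmf", digit)`, is an assumption made for the sketch. Real telephony stacks deliver DTMF in other forms, and `split_stream`, `SplitResult`, and `MASK` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple

# A media event is ("audio", payload) or ("dtmf", digit). Assumed model only.
MediaEvent = Tuple[str, str]

# Placeholder written to the main stream; it carries no digit information.
MASK: MediaEvent = ("audio", "<flat-tone>")


@dataclass
class SplitResult:
    main_stream: List[MediaEvent] = field(default_factory=list)  # recorded, seen by the AI
    payment_digits: List[str] = field(default_factory=list)      # seen only by the payment layer


def split_stream(events: Iterable[MediaEvent], capturing: bool) -> SplitResult:
    """Route keypad digits to the payment side while a capture window is open."""
    result = SplitResult()
    for kind, payload in events:
        if capturing and kind == "dtmf":
            result.payment_digits.append(payload)
            result.main_stream.append(MASK)  # same position, no digit content
        else:
            result.main_stream.append((kind, payload))
    return result
```

The main stream keeps one masked event per key press, so recordings preserve timing without preserving digits.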

Speech capture of card numbers is a compliance anti-pattern. DTMF capture is the industry-standard, compliance-correct approach. Any architecture that routes spoken card numbers through your AI pipeline is building a very expensive PCI scope problem.
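To see why DTMF needs no speech-to-text step, it helps to look at how a tone becomes a digit. Each key is the sum of one row frequency and one column frequency, and a decoder only has to find which pair is strongest. The sketch below uses the Goertzel recurrence for that. It is a teaching example, not a production detector: it omits the energy thresholds, twist checks, and timing validation that real decoders apply, and the function names are made up for this example.

```python
import math

# Standard DTMF keypad: each key is one row tone plus one column tone (Hz).
ROW_HZ = (697, 770, 852, 941)
COL_HZ = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")


def tone(key: str, sample_rate: int = 8000, duration_s: float = 0.05) -> list:
    """Synthesize the two-frequency tone for one key (used as test input)."""
    row = next(i for i, keys in enumerate(KEYPAD) if key in keys)
    col = KEYPAD[row].index(key)
    n = int(sample_rate * duration_s)
    return [
        math.sin(2 * math.pi * ROW_HZ[row] * t / sample_rate)
        + math.sin(2 * math.pi * COL_HZ[col] * t / sample_rate)
        for t in range(n)
    ]


def goertzel_power(samples: list, freq_hz: float, sample_rate: int) -> float:
    """Signal power at a single frequency, via the Goertzel recurrence."""
    coeff = 2 * math.cos(2 * math.pi * freq_hz / sample_rate)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2


def decode_key(samples: list, sample_rate: int = 8000) -> str:
    """Pick the strongest row and column frequency and map them to a key."""
    row = max(range(4), key=lambda i: goertzel_power(samples, ROW_HZ[i], sample_rate))
    col = max(range(4), key=lambda i: goertzel_power(samples, COL_HZ[i], sample_rate))
    return KEYPAD[row][col]
```

Nothing in this path produces text that a transcript store or model could retain, which is the property the handoff architecture relies on.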

PCI Scope: What Changes and What Doesn't

This is worth being precise about, because "PCI compliance" gets hand-waved in a lot of vendor conversations.

Without a payment handoff architecture:

If your agents — human or AI — hear or process card numbers, PCI DSS scope expands to include:

All call recording infrastructure
All transcription services and storage
All ASR pipelines
All AI model inference infrastructure
All data warehouses or lakes that receive call data
All networks connecting these systems
All personnel with access to those systems

That's essentially your entire platform. PCI DSS Level 1 certification for a footprint that size is not a checkbox exercise — it's a multi-year program with dedicated compliance staff.

With a payment handoff architecture:

Your PCI scope shrinks to:

The API connection between your platform and the payment layer (TLS in transit — table stakes)
The payment layer itself (which carries its own PCI DSS Level 1 certification)

You don't handle card data. You don't store it. You don't transmit it. You send a payment session request and receive a success/failure webhook. Your QSA scope is minimal. Your compliance burden is minimal.
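One way to keep that claim honest is a defensive check at the boundary: reject any inbound payload that appears to contain a card number. The sketch below is a heuristic under stated assumptions. The 13 to 19 digit range is a common pattern for detecting card numbers rather than a PCI DSS requirement, it will also flag other long digit runs such as millisecond timestamps, and `contains_pan_like` and `assert_card_free` are hypothetical helper names.

```python
import json
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
# Heuristic only: chosen for illustration, not mandated by any standard.
PAN_LIKE = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)")


def contains_pan_like(payload: dict) -> bool:
    """True if the serialized payload contains a card-number-shaped digit run."""
    return PAN_LIKE.search(json.dumps(payload)) is not None


def assert_card_free(payload: dict) -> dict:
    """Refuse to pass a suspicious payload further into the platform."""
    if contains_pan_like(payload):
        raise ValueError("payload contains a PAN-like digit sequence")
    return payload
```

A masked reference such as `****4242` and an integer amount pass this check, while a full card number in any field would not.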

The payment layer — Shuttle, in this context — carries the PCI DSS Level 1 certification. That's the certification that covers the card capture, tokenisation, vault, and PSP routing. You inherit the compliance posture without the certification cost.

Multi-PSP: Why Your Customers' Gateway Matters

Here's a practical problem that most "just add Stripe" thinking ignores: your enterprise CCaaS customers already have PSP relationships.

An insurance company processing 50,000 premium collections a month has a negotiated rate with their acquirer. A utility company has a direct integration with a specific gateway. A debt collection agency is contractually required to process through a particular payment provider. None of them want to move off their existing PSP to use whatever you've embedded.

A correct payment layer needs to be PSP-agnostic. When a payment session is initiated for a given end-customer, the payment layer routes to that customer's configured PSP — not to a single hardcoded gateway.

This is why "add Stripe" doesn't solve the problem for CCaaS operators. Stripe is a single gateway. Your enterprise customers need their own gateway. The payment infrastructure needs to support multi-tenancy at the PSP level: each customer of your platform routes through their own PSP, using their own merchant credentials, with their own settlement.

Shuttle supports 40+ PSPs out of the box. When you initiate a payment session, you pass the end-customer's PSP configuration. The payment layer handles the routing. You never need to build a new PSP integration for a new customer.
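On your side, PSP-level multi-tenancy mostly comes down to looking up the right configuration before opening a session. The sketch below assumes a simple in-memory registry. The `PspConfig` fields, tenant IDs, and gateway names are placeholders, and the real shape of a PSP configuration is defined by the payment layer's API, not by this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PspConfig:
    psp_name: str          # the gateway this tenant already contracts with
    merchant_account: str  # reference to the tenant's own merchant credentials


# Hypothetical tenant registry; in practice this would live in a config store.
TENANT_PSP = {
    "insurer-001": PspConfig("gateway_a", "merch_ins_001"),
    "utility-002": PspConfig("gateway_b", "merch_util_002"),
}


def psp_config_for(tenant_id: str) -> PspConfig:
    """Look up the tenant's own PSP; never fall back to a shared default."""
    try:
        return TENANT_PSP[tenant_id]
    except KeyError:
        raise LookupError(f"no PSP configured for tenant {tenant_id!r}") from None
```

Failing on an unknown tenant, rather than defaulting to a shared gateway, avoids settling one customer's payments through another customer's merchant account.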

Build vs Buy

Let's be direct about what building this in-house actually requires:

Build:

DTMF capture with audio stream isolation (non-trivial telephony engineering)
PCI DSS Level 1 certification: ~$500K in year one, $200K+ annually thereafter
Tokenisation vault design, implementation, and auditing
PSP integrations: each one is 2-4 weeks of engineering, plus ongoing maintenance as PSP APIs change
Ongoing quarterly vulnerability scans, annual penetration tests, key rotation schedules
A dedicated compliance function or expensive external QSA relationship
Timeline to first production payment: 12-18 months minimum

Buy (integrate a payment layer):

Single API integration: a few weeks of engineering
PCI compliance for card capture carried by the payment layer, leaving only the API connection in your scope
40+ PSP integrations available on day one
Compliance, auditing, pen testing, key rotation: the payment layer's problem
Timeline to first production payment: weeks

For a CCaaS company under 500 people — and most CCaaS companies are — this calculus is not close. The build path is a multi-year distraction from your core product. The buy path lets you ship a payments feature, close enterprise deals that require payment capabilities, and let your engineering team stay focused on the AI and conversation capabilities that actually differentiate your product.
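The figures in the build list compound quickly. As a rough worked example using only the numbers quoted above (about $500K in year one, at least $200K per year after that, and 2 to 4 weeks per PSP integration), and ignoring staffing, maintenance, and the telephony work itself:

```python
YEAR_ONE_USD = 500_000
ANNUAL_USD = 200_000  # lower bound: the text says "$200K+"


def certification_cost(years: int) -> int:
    """Cumulative lower-bound certification cost over `years` (years >= 1)."""
    if years < 1:
        raise ValueError("years must be at least 1")
    return YEAR_ONE_USD + ANNUAL_USD * (years - 1)


def psp_engineering_weeks(psp_count: int) -> tuple:
    """(low, high) engineer-weeks at 2 to 4 weeks per PSP integration."""
    return 2 * psp_count, 4 * psp_count


print(certification_cost(3))      # 900000
print(psp_engineering_weeks(40))  # (80, 160)
```

Matching the 40 PSP integrations on the buy side would therefore take on the order of 80 to 160 engineer-weeks before any maintenance.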

What the Integration Actually Looks Like

Stripped to its essentials, the integration is one API call, an audio handoff, and a webhook:

```
POST /payment-session
Body:     { amount, currency, merchant_id, psp_config }
Response: { session_id, dtmf_ready: true }

[Audio handoff: DTMF capture handled by payment layer]

Webhook received: POST /your-webhook-endpoint
Body: {
  event: "payment_completed",
  session_id: "sess_abc123",
  status: "success",
  transaction_id: "txn_xyz789",
  masked_card: "****4242",
  amount: 15000,
  currency: "GBP"
}

[AI agent resumes conversation using status from webhook]
```
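The session request could be assembled along these lines. Treat this as a sketch: the field names come from the outline above, but the base URL is a placeholder, the bearer-token header is an assumed authentication scheme, and the integer minor-unit amount is inferred from the `amount: 15000` example. Check the payment layer's API reference for the real values.

```python
import json
import urllib.request

BASE_URL = "https://payment-layer.example"  # placeholder, not a real endpoint


def build_session_request(amount_minor: int, currency: str, merchant_id: str,
                          psp_config: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST /payment-session request."""
    body = {
        "amount": amount_minor,  # assumed to be minor units, e.g. pence
        "currency": currency,
        "merchant_id": merchant_id,
        "psp_config": psp_config,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/payment-session",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # auth scheme is an assumption
        },
        method="POST",
    )


def parse_session_response(raw: bytes) -> str:
    """Return the session_id once the payment layer reports it is ready."""
    data = json.loads(raw)
    if not data.get("dtmf_ready"):
        raise RuntimeError("payment layer is not ready to capture DTMF")
    return data["session_id"]
```

Sending the request is then a matter of passing it to `urllib.request.urlopen` or an HTTP client of your choice. The request body contains the amount and routing details only, never card data.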

No card data flows through your system at any point. The session ID ties the payment to the conversation. The webhook fires within seconds of the PSP authorisation. Your AI agent reads the status and continues the call.
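A handler for that webhook might look like the following sketch. It assumes the payload shape shown in the example body and that amounts are integer minor units with two decimal places, which holds for GBP but not for every currency. `resume_utterance` is a hypothetical helper, and webhook authentication is left out because the mechanism is not specified here.

```python
from decimal import Decimal


def resume_utterance(event: dict, expected_session_id: str) -> str:
    """Turn a payment_completed webhook body into the agent's next line."""
    if event.get("event") != "payment_completed":
        raise ValueError("unexpected event type")
    if event.get("session_id") != expected_session_id:
        raise ValueError("webhook does not belong to this call")

    if event.get("status") != "success":
        return "I'm sorry, that payment didn't go through. Would you like to try again?"

    # Assumption: two-decimal minor units (e.g. 15000 pence is 150.00 GBP).
    amount = Decimal(event["amount"]) / 100
    return (
        f"Your payment of {amount:.2f} {event['currency']} has been processed. "
        f"Your reference number is {event['transaction_id']}."
    )
```

Checking the session ID before speaking ties the result to the right call, so a delayed webhook from an earlier session cannot confirm a payment that this caller never made.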

The same integration works for human agents via an agent-assist interface. Same API, same DTMF flow, same PCI boundary. You build the integration once and it serves both your AI and human agent channels.

Summary

AI voice agents are fully capable of handling payments, but the architecture has to be right. The LLM must never hear or process card data. Speech recognition of card numbers creates a compliance catastrophe. DTMF capture with a dedicated payment layer keeps card data entirely out of your platform.

The architecture is:

AI agent handles conversation and identifies payment intent
Platform initiates a payment session via API
Payment layer takes control of card capture via DTMF
Card data never enters your platform, your recordings, or your AI pipeline
Payment layer handles tokenisation and PSP routing
Webhook returns success/failure — AI agent resumes the call

PCI scope stays with the payment layer. Your engineering team stays focused on your product. Your enterprise customers use their existing PSPs.

Talk to us

See how Shuttle can power payments for your platform — multi-PSP, multi-channel, white-label.

