AI Voice Agents and Payments: How PolyAI Captures Payments in Conversation

By Shuttle Team, February 16, 2026

AI Voice Agents Are Creating a New Payment Channel

AI voice agents are no longer a novelty. They're production infrastructure. Companies in insurance, financial services, utilities, and healthcare are deploying AI agents that handle entire phone conversations, from greeting to resolution, without a human involved.

What changed in the last 18 months isn't the quality of the conversation. It's the scope. These agents aren't just answering FAQs or routing calls. They're renewing policies. Collecting outstanding balances. Processing upgrade orders. Booking appointments with a deposit. Taking a card number and completing a transaction before the customer hangs up.

This is a net-new payment channel. It didn't exist three years ago. AI voice agents aren't replacing web checkout or point-of-sale terminals. They're creating payment surfaces in places where payment capture was previously impossible without a human agent.

Think about what that means for industries where the phone is still the primary channel. Insurance renewals. Utility payments. Debt collection. Appointment scheduling. These are high-volume, high-value transaction environments where customers call in, and where — until now — either a human agent processed the payment or the customer was told to "visit our website."

The AI voice agent eliminates that friction. The customer calls in, the AI handles the conversation, and payment happens inside that same call. No transfer. No website redirect. No callback.

But enabling this creates a hard technical problem. The AI can talk. It can understand. It can reason about what the customer owes and offer options. What it cannot do — and must never do — is handle a credit card number.

The Payment Problem for AI Voice Agents

AI voice agents are built on large language models and speech-to-text engines. They excel at understanding natural language and generating fluid responses. They were not built to handle payment card data.

Here's the problem, broken down:

The agent can't "see" a card number. Even if a customer reads their card details aloud, routing that audio through a speech-to-text model and into the AI agent's context window creates an immediate PCI compliance violation. Card data has been processed by an uncertified system. Every model log, every debug trace, every piece of training data is now in scope.

DTMF capture needs integration. The standard approach for card entry over the phone is DTMF (dual-tone multi-frequency) signalling: the customer types their card number on their phone keypad. But DTMF capture has to be integrated into the voice flow at the telephony level. The tones must be intercepted before they reach the AI's audio stream. This isn't a simple API call. It's real-time modification of the audio stream within a PCI-certified environment.

PCI compliance is non-negotiable. PCI DSS applies to any system that stores, processes, or transmits card data, and Level 1 is its most demanding validation tier. If the AI platform is handling payment capture, the entire platform is in PCI scope: every server, every log, every employee with access. No AI voice platform wants that.

Enterprise customers mandate their PSP. This is the one most people miss. PolyAI doesn't process payments for itself. It processes payments on behalf of its enterprise customers — insurance companies, utilities, financial institutions. Each of those enterprises has existing PSP relationships. They use Worldpay, or Adyen, or a regional acquirer mandated by their compliance team. The AI voice platform doesn't get to choose Stripe because it's easier.

Tokenisation for repeat transactions. If a customer pays once through the AI agent, the business wants to store a token for next time. Not the card number — a token that can be reused across future calls, future channels, and future transactions. That token needs to be PSP-compatible and stored in a PCI-certified vault.

Multi-currency for international deployments. Enterprise AI voice agents are deployed globally. A UK utility collects in GBP. An American insurer collects in USD. A European telco collects in EUR. The payment layer needs to handle currency, regional PSP routing, and local compliance requirements.

None of these problems are solved by the AI itself. They require dedicated payment infrastructure that operates alongside the voice agent — integrated tightly enough to feel seamless, but isolated completely from the conversational AI environment.
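To make the PSP-mandate and multi-currency points concrete, here is a minimal Python sketch of how a voice platform might look up an enterprise's payment settings before asking for a payment session. Everything in it is an assumption made for illustration: the TenantPaymentConfig fields, the tenant names, and the shape of the request are hypothetical and do not describe Shuttle's or PolyAI's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantPaymentConfig:
    """Hypothetical per-enterprise settings; field names are illustrative."""
    psp: str        # gateway the enterprise mandates, e.g. "worldpay"
    currency: str   # ISO 4217 code the enterprise collects in
    region: str     # where the enterprise requires processing to happen


# Illustrative tenants only; these are not real customer configurations.
TENANTS = {
    "uk-utility": TenantPaymentConfig(psp="worldpay", currency="GBP", region="uk"),
    "us-insurer": TenantPaymentConfig(psp="adyen", currency="USD", region="us"),
    "eu-telco": TenantPaymentConfig(psp="regional-acquirer", currency="EUR", region="eu"),
}


def build_session_request(tenant_id, amount_minor):
    """Build the (hypothetical) session request the voice platform would send.

    The voice platform never picks the gateway itself: it looks up what the
    enterprise configured and passes that through unchanged.
    """
    if amount_minor <= 0:
        raise ValueError("amount_minor must be a positive integer")
    try:
        cfg = TENANTS[tenant_id]
    except KeyError:
        raise LookupError(f"no payment configuration for tenant {tenant_id!r}") from None
    return {
        "tenant": tenant_id,
        "amount_minor": amount_minor,  # integer minor units (pence, cents)
        "currency": cfg.currency,
        "psp": cfg.psp,
        "region": cfg.region,
    }
```

The design point is that the gateway is data, not code: adding an enterprise with a different PSP means adding a configuration entry, not a new integration in the voice platform.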

How PolyAI Solved It

PolyAI builds enterprise-grade AI voice agents. Their customers include some of the most regulated industries in the world — insurance carriers, financial services firms, utility providers. These aren't small businesses experimenting with AI. They're large enterprises deploying voice agents at scale, handling thousands of calls per day.

For many of these enterprises, the AI voice agent isn't just handling enquiries. It's handling transactions. An insurance customer calls to renew a policy — the agent needs to collect payment. A utility customer calls about an overdue bill — the agent needs to process the balance. A patient calls to book a procedure — the agent needs to take a deposit.

Each of these scenarios requires the AI to capture a payment mid-conversation. And each of PolyAI's enterprise customers has their own PSP, their own compliance requirements, and their own audit expectations.

PolyAI chose Shuttle as their payment infrastructure layer.

"Shuttle let us treat legacy payment providers as a modern SaaS service. It enabled us to support the gateways our customers required and fully automate high-value transactions across regulated industries." — Nathan Liu, PolyAI

That quote captures two things. First, "treat legacy payment providers as a modern SaaS service" — PolyAI's enterprise customers don't all use modern API-first PSPs. Some use legacy gateways with decades-old integration specs. Shuttle abstracts those differences away behind a single, modern API. PolyAI integrates once and supports all of them.

Second, "support the gateways our customers required" — PolyAI doesn't dictate which PSP an enterprise uses. The enterprise keeps its existing payment relationships. Shuttle routes through whichever gateway the enterprise mandates. This is the architecture that makes enterprise deals close.

The Architecture: How It Actually Works

Here is the technical flow for a payment captured within a PolyAI AI voice agent conversation, powered by Shuttle:

1. The AI agent handles the conversation. PolyAI's voice agent is on the call. It has identified the customer, confirmed the amount owed, and determined that payment should happen now. The AI is running on PolyAI's infrastructure — large language models, speech synthesis, real-time audio processing.

2. The AI agent triggers Shuttle's payment session. When payment is needed, PolyAI's system makes an API call to Shuttle: "Start a payment session. Amount: $247.50. Currency: USD. PSP: the enterprise customer's configured gateway." Shuttle returns a session identifier and the voice flow is prepared for secure capture.

3. The voice flow transitions to secure capture mode. The AI agent tells the customer: "I can take your payment now. Please enter your card number using your phone keypad." At the telephony level, the audio stream is now being intercepted by Shuttle's PCI-certified capture environment.

4. The customer enters card details via DTMF. The customer types their card number, expiry date, and CVV on their phone keypad. The DTMF tones are captured within Shuttle's PCI DSS Level 1 environment. Critically, the tones are stripped from the audio stream — they never reach PolyAI's AI models, never appear in call recordings, never enter any log outside the PCI boundary.

5. Shuttle tokenises the card and routes to the enterprise's PSP. Shuttle validates the card (BIN check, Luhn validation), tokenises it, and routes the transaction to whichever PSP the enterprise customer has configured — Worldpay, Adyen, Stripe, a regional acquirer, or any of 40+ supported gateways. The token is stored in Shuttle's PCI-certified vault for future use.

6. The transaction result is returned to the AI agent. Shuttle sends the result back to PolyAI via API: approved, declined, or requiring additional action. No card data crosses this boundary. The AI agent receives only the transaction outcome.

7. The AI agent confirms payment and continues the conversation. The voice agent says: "Your payment of $247.50 has been processed successfully. You'll receive a confirmation email shortly. Is there anything else I can help with?" The call continues naturally. The customer may not even register that a system transition occurred.

8. No card data touches PolyAI's or the enterprise's infrastructure. At no point did card data enter PolyAI's AI models, PolyAI's servers, the enterprise's systems, the call recording, or any environment outside Shuttle's PCI-certified boundary. PolyAI's PCI scope: zero. The enterprise's PCI scope for this channel: zero.

Total elapsed time from "please enter your card number" to "payment confirmed": seconds. Human involvement: none.
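The boundary described in steps 4 to 6 can be sketched in a few lines of Python. This is an illustration only, not Shuttle's implementation: SecureCaptureEnvironment, PaymentOutcome, and capture_and_pay are hypothetical names, the in-memory dictionary stands in for a certified vault, and any card number that passes the Luhn check is simply treated as approved instead of being routed to a PSP.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


def luhn_valid(pan):
    """Return True if the digit string passes the Luhn checksum."""
    if not pan.isdigit():
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, ch in enumerate(reversed(pan)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


@dataclass(frozen=True)
class PaymentOutcome:
    """The only thing that crosses back to the voice agent: no card data."""
    session_id: str
    status: str              # "approved" or "declined"
    token: Optional[str]     # opaque reference; None when nothing was stored


class SecureCaptureEnvironment:
    """Stand-in for the PCI-certified side. Card digits exist only in here."""

    def __init__(self):
        self._vault = {}  # token -> card number, never exposed to callers

    def capture_and_pay(self, session_id, keyed_digits):
        # Reject anything that is not a plausible card number.
        if not (13 <= len(keyed_digits) <= 19) or not luhn_valid(keyed_digits):
            return PaymentOutcome(session_id, "declined", None)
        token = "tok_" + secrets.token_hex(8)
        self._vault[token] = keyed_digits
        # A real system would now route to the enterprise's PSP; this sketch
        # treats a well-formed card number as approved.
        return PaymentOutcome(session_id, "approved", token)
```

The voice agent's side of the conversation only ever holds a PaymentOutcome, so its logs and prompts can include the status and the token without containing a card number.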

Why This Can't Be Built with Standard Payment APIs

Standard payment APIs were designed for the web. Stripe Checkout renders a payment form in a browser. Adyen Drop-In embeds a card capture widget in a web page. PayPal opens a redirect flow. These all assume one thing: the customer is looking at a screen.

AI voice agents don't have a screen.

There's no browser to render a checkout form. There's no webpage to embed a widget. The customer is on a phone call. The "interface" is audio and keypad tones.

Building payment capture for AI voice agents requires:

DTMF capture within a PCI-certified environment. The keypad tones have to be intercepted at the telephony layer, decoded, validated, and processed — all within a PCI DSS Level 1 boundary. Standard payment APIs don't provide this. They expect card data to arrive via HTTPS POST from a browser, not as audio frequency tones from a phone call.

Voice-native payment flows. The payment sequence needs to integrate with the conversational flow — prompting for card number, then expiry, then CVV, handling re-entry if a digit is missed, confirming the amount before processing. This is a fundamentally different UX pattern from a web form.

Real-time integration with the conversational AI engine. The payment layer needs to signal back to the AI agent in real time: "card captured," "processing," "approved," "declined." The agent needs these signals to continue the conversation naturally. Latency kills the experience — if the AI goes silent for five seconds while waiting for a payment result, the customer thinks the call dropped.

PSP-neutral routing. The AI voice platform doesn't get to mandate a PSP. Its enterprise customers bring their own. The payment layer must route to whichever gateway the enterprise has configured, through a single integration that the AI platform maintains once.

Tone masking. DTMF tones must be stripped from the audio stream in real time so they never reach the AI model, call recording systems, or analytics platforms. This is a real-time audio processing requirement that sits at the intersection of telephony engineering and payment security.

None of this exists in Stripe's API. Or Adyen's. Or any standard payment gateway. It's specialised infrastructure built for a specific problem: capturing payments inside a voice conversation.
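For readers who want to see what DTMF capture and tone masking involve at the signal level, here is a minimal pure-Python sketch. It uses the Goertzel algorithm, a common way to measure power at a handful of known frequencies, together with the standard DTMF keypad frequencies. The 8 kHz sample rate, the 20 ms frame size, the detection threshold, and the one-frame-per-keypress simplification are all assumptions chosen for the example; this is not a description of how Shuttle's capture environment works.

```python
import math

SAMPLE_RATE = 8000                   # Hz; assumed narrowband telephony audio
LOW_FREQS = (697, 770, 852, 941)     # keypad row tones
HIGH_FREQS = (1209, 1336, 1477)      # keypad column tones (12-key pad)
KEYPAD = ("123", "456", "789", "*0#")


def goertzel_power(samples, freq):
    """Signal power at one frequency, via the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / SAMPLE_RATE)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2


def detect_digit(frame, threshold=400.0):
    """Return the keypad symbol in this frame, or None if no tone pair is present."""
    low = max(LOW_FREQS, key=lambda f: goertzel_power(frame, f))
    high = max(HIGH_FREQS, key=lambda f: goertzel_power(frame, f))
    if goertzel_power(frame, low) < threshold or goertzel_power(frame, high) < threshold:
        return None
    return KEYPAD[LOW_FREQS.index(low)][HIGH_FREQS.index(high)]


def dtmf_tone(symbol, n_samples=160, amplitude=0.5):
    """Synthesize one 20 ms frame of the tone pair for a keypad symbol (test helper)."""
    row = next(i for i, keys in enumerate(KEYPAD) if symbol in keys)
    col = KEYPAD[row].index(symbol)
    f_low, f_high = LOW_FREQS[row], HIGH_FREQS[col]
    return [
        amplitude * (math.sin(2 * math.pi * f_low * n / SAMPLE_RATE)
                     + math.sin(2 * math.pi * f_high * n / SAMPLE_RATE))
        for n in range(n_samples)
    ]


def capture_and_mask(frames):
    """Split audio into digits (for the secure side) and masked audio (for everyone else).

    Simplification: each key press is assumed to occupy exactly one frame.
    """
    digits, masked = [], []
    for frame in frames:
        symbol = detect_digit(frame)
        if symbol is None:
            masked.append(frame)                # ordinary audio passes through
        else:
            digits.append(symbol)
            masked.append([0.0] * len(frame))   # silence replaces the tone
    return "".join(digits), masked
```

Only the masked frames would be forwarded to the AI model, the call recorder, and analytics, while the decoded digits stay on the certified side. A production detector would also need to handle key presses that span several frames, speech that resembles tones, and varying signal levels.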

The Regulated Industry Factor

PolyAI's customers aren't e-commerce companies. They're regulated enterprises.

Insurance carriers operating under FCA oversight. Financial services firms subject to PSD2 and state-level regulations. Utility companies with Ofgem compliance requirements. Healthcare providers operating under HIPAA.

These aren't companies that move fast and figure out compliance later. They have legal teams that review every vendor. They have compliance officers who audit every data flow. They have specific, non-negotiable requirements about where card data goes, which PSP processes it, and who has access to transaction records.

For the payment layer inside an AI voice agent, this means:

PSP mandates are real. An insurance carrier doesn't switch PSPs because an AI vendor prefers Stripe. They use Worldpay, or a specific acquirer, because their compliance team approved it and their existing reconciliation systems depend on it. The payment infrastructure must support that PSP — not ask the enterprise to change.

Audit trails matter. Regulated enterprises need to demonstrate exactly how card data was handled during an AI voice agent call. They need to show auditors that card data never entered the AI's environment, never appeared in call recordings, and was processed through a PCI DSS Level 1 certified system. The payment layer must provide this documentation.

Data residency and sovereignty. European enterprises may require that card data is processed within the EU. UK financial services firms may require UK-based processing. The payment layer needs to support these geographic constraints.

Compliance certifications compound. PCI DSS Level 1 is the baseline. But regulated enterprises also ask about ISO 27001, SOC 2, and sector-specific certifications. The payment layer carries these certifications so the AI platform and the enterprise don't have to.

Shuttle holds PCI DSS Level 1, ISO 27001, and SOC 2 certifications. For PolyAI's enterprise customers, this means the payment component of their AI voice agent deployment arrives pre-certified. No additional compliance burden. No new audit scope. The payment infrastructure is already approved.
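One way an AI platform might support the audit-trail point on its own side is to scan transcripts and logs for anything that looks like a card number. The Python sketch below is a deliberately conservative illustration, not a compliance tool: the pattern simply flags any run of 13 to 19 digits (optionally separated by spaces or hyphens), so it will produce false positives, and the function names are hypothetical. It supplements a formal PCI assessment and does not replace one.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens,
# and not embedded in a longer run of digits.
PAN_LIKE = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)")


def find_pan_like(text):
    """Return every substring of text that looks like a card number."""
    return [m.group(0) for m in PAN_LIKE.finditer(text)]


def redact(text):
    """Replace anything that looks like a card number with a fixed marker."""
    return PAN_LIKE.sub("[REDACTED]", text)
```

Run over the AI environment's transcripts, an empty result from find_pan_like is supporting evidence that keyed digits stayed behind the payment boundary, and redact offers a last line of defence if a customer reads a number aloud anyway.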

The Market Context: Agentic Commerce in 2026

The payments industry has woken up to AI agents. The announcements are coming fast.

Stripe launched its Agentic Commerce Suite, a set of tools for AI agents to discover, negotiate, and pay for digital services. Its support for the x402 protocol enables machine-to-machine payments. The focus is agent-to-agent commerce: software systems paying each other for APIs, data, and compute.

Google announced AP2, the Agent Payments Protocol, with over 60 partners including Adyen, American Express, Mastercard, and PayPal. AP2 is designed to let AI agents initiate payments on behalf of consumers — a protocol-level standard for how agents request authorisation.

Worldline connected AI agents to its payment ecosystem via MCP (Model Context Protocol) servers, creating bridges between large language models and payment APIs.

Visa completed the first voice-enabled agentic payment transaction, demonstrating a cardholder using an AI agent to pay real estate service charges.

These announcements share a common focus: enabling AI agents to initiate and authorise payments. They're solving the protocol problem — how an AI agent expresses "I want to pay" in a way a payment system understands.

What they don't address is the infrastructure problem for voice.

Stripe's Agentic Commerce Suite assumes a digital/API context. Google's AP2 is a web-first protocol. Visa's demonstration was a proof of concept. None of them solve DTMF capture within a PCI boundary. None of them address PSP-neutral routing for enterprise customers who mandate their gateway. None of them provide the voice-native payment flow that production AI voice agents require.

Voice is the harder problem. Web-based AI agents can render a checkout form. Chat agents can send a link. Voice agents operate in an audio-only environment where the customer's phone keypad is the input device and compliance requires that tones never reach the AI.

It's also the bigger opportunity. In insurance, financial services, utilities, and healthcare — industries with trillions of dollars in annual transaction volume — the phone is still the primary customer channel. AI voice agents are automating those calls. The payment infrastructure that powers them is critical.

What AI Voice Agent Platforms Need

If you're building an AI voice agent platform that needs to process payments — or you're an enterprise evaluating AI voice agents for transaction-heavy workflows — here's what the payment infrastructure must provide:

PSP-neutral architecture. Your enterprise customers will mandate their PSP. The payment layer must support 40+ gateways through a single integration, routing each transaction to the correct provider based on the enterprise's configuration. This is not optional — it's the requirement that makes or breaks enterprise deals.

PCI DSS Level 1 certification within the voice flow. Not PCI compliance for a web form that gets sent via SMS as a fallback. PCI compliance for the actual DTMF capture that happens while the customer is on the call with the AI agent. The entire capture, tokenisation, and processing chain must sit within a Level 1 certified environment.

DTMF capture with tone masking. Real-time interception of keypad tones at the telephony layer, with simultaneous stripping of those tones from the audio stream. The AI model, call recording, and analytics systems must never see the raw DTMF data. This is the technical mechanism that keeps the AI platform out of PCI scope.

API-triggered payment sessions. The AI agent needs to initiate a payment session programmatically — specifying amount, currency, PSP, and any metadata — and receive a real-time result. The API must be fast enough that the conversational flow isn't disrupted. Sub-second response times for session creation. Transaction results returned within the normal payment processing window.

Tokenisation for repeat payments. First-time card capture should produce a reusable token. That token should work for future transactions across channels — if the customer calls back, if a payment link is sent, if a web portal is used later. One capture, multiple uses.

Multi-currency support. Enterprise deployments span geographies. The payment layer must handle currency conversion, regional PSP routing, and local regulatory requirements without requiring the AI platform to build separate integrations per market.

Compliance documentation for regulated customers. Enterprise compliance teams will audit the payment flow. The payment layer must provide clear documentation of data flows, certifications (PCI DSS Level 1, ISO 27001, SOC 2), and architecture diagrams showing the separation between AI and payment environments.

White-label operation. The payment flow should be invisible. The enterprise's customers should experience the payment as part of the enterprise's service. No third-party branding. No redirects to external platforms. The payment layer sits quietly underneath, powering the transaction while the enterprise owns the customer experience.
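The real-time requirement in this list has a practical consequence for the agent's code: while a payment result is pending, the agent should not go silent. Here is a minimal asyncio sketch of that idea, assuming the payment call is wrapped in an asyncio task and that say is whatever coroutine the platform uses to speak a line. Both names, the filler wording, and the timings are hypothetical.

```python
import asyncio


async def await_result_with_fillers(payment_task, say, filler_after=2.0,
                                    filler="Thanks, I'm still processing your payment."):
    """Wait for the payment result, speaking a filler line whenever the wait runs long.

    payment_task must be an asyncio Task or Future so it can be awaited repeatedly.
    """
    while True:
        try:
            # shield() keeps the payment running when the timeout fires.
            return await asyncio.wait_for(asyncio.shield(payment_task),
                                          timeout=filler_after)
        except asyncio.TimeoutError:
            await say(filler)


async def demo():
    spoken = []

    async def say(line):
        spoken.append(line)          # a real agent would synthesize speech here

    async def fake_payment():
        await asyncio.sleep(0.05)    # stands in for the PSP round trip
        return "approved"

    task = asyncio.create_task(fake_payment())
    status = await await_result_with_fillers(task, say, filler_after=0.02)
    return status, spoken
```

Calling asyncio.run(demo()) returns the payment status together with the filler lines that were spoken while waiting. The shield matters: without it, the timeout would cancel the payment request itself, not just the wait.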

Building the Payment Layer for AI Voice

AI voice agents represent a fundamental shift in how payments happen. Not a new checkout button. Not a new form factor. A new channel — one where transactions occur inside conversations, where the input device is a phone keypad, and where enterprise compliance requirements shape every architectural decision.

The platforms building these voice agents — PolyAI and others — need payment infrastructure that was designed for this environment. Not web checkout repurposed for voice. Not a single-PSP integration that locks out enterprise customers. Purpose-built infrastructure that handles DTMF capture, PCI compliance, multi-PSP routing, and real-time AI integration as a single, unified layer.

That's what Shuttle provides. A single integration point for AI voice platforms to support 40+ payment gateways, with PCI DSS Level 1 compliance, DTMF capture and tone masking, tokenisation, multi-currency, and white-label operation — all designed to sit inside the conversational flow without the customer or the AI ever touching card data.

For the chat side of this equation (how AI agents capture payments on websites, WhatsApp, and messaging platforms), see Chat Agent Payments: How AI Closes Sales Without a Human Handoff. For the broader architecture covering both channels, see How AI Agents Process Payments: The Infrastructure Guide.

If your AI voice agent platform needs payment infrastructure — or your enterprise is deploying AI voice agents that handle transactions — [book a call with Shuttle](https://shuttleglobal.com/contact).
