The Voice AI Boom Is Real
The voice AI market is valued at $47.5 billion and accelerating. That figure isn't a projection built on wishful extrapolation — it reflects what's already deployed in enterprise contact centres today, and what's being fast-tracked for the next 18 months.
Every major CCaaS vendor is executing the same playbook. Genesys has rebuilt its routing engine around generative AI, positioning AI-native agents as a core product rather than an add-on. NICE has rolled out Enlighten AI across its CXone platform, replacing scripted IVR flows with conversational models. Amazon Connect launched its Q in Connect capability, letting contact centre operators deploy large language models on top of existing telephony infrastructure without a full platform replacement.
The fundraising signals are just as clear. PolyAI raised $86 million at a $750 million valuation, making it one of the best-capitalised AI voice companies in the market. Their thesis is straightforward: voice AI is the new front door to customer experience. The phone is not dead — it's being rebuilt from scratch.
Analysts agree. Gartner estimates that by 2027, AI voice agents will handle a majority of routine customer service contacts that previously required a human agent. The transition from IVR to generative voice AI is not incremental; it is a wholesale replacement of the old call routing model with something that can hold a real conversation.
The channel is validated. Enterprises are committed. The investment has been made.
And yet, when a customer says "I'd like to make a payment" — most voice AI hits a wall.
The Wall
Picture the scenario. A customer calls their insurance company to renew a policy. The AI voice agent handles the call fluently. It retrieves the customer's account, confirms the renewal terms, answers a question about the coverage details, and moves naturally toward close.
Then: "Great. Would you like to pay by card today?"
This is where the architecture breaks.
The customer says yes. The AI agent has no secure mechanism to capture card data over the live voice channel. It cannot request card digits without exposing them to systems that are not PCI-certified. It cannot pause the conversation and hand off to a DTMF capture environment without either disrupting the call flow or exposing sensitive audio to its own speech-to-text pipeline.
The most common outcomes in production today:
The agent hands off to a human agent to take the payment — defeating the automation case entirely
The agent sends a payment link via SMS and asks the customer to pay separately — killing the in-call conversion
The platform instructs the agent to decline payment capture entirely and redirect customers to a website — turning a potential end-to-end automated transaction into a multi-step, multi-session process
The agent processes payment through an insecure channel, unknowingly putting the platform out of PCI compliance
None of these are acceptable at scale. None of them reflect the promise of AI voice as a customer experience channel.
The channel has been rebuilt. The payment infrastructure hasn't.
Why This Is an Infrastructure Problem, Not a Software Problem
There is a tempting assumption among voice AI product teams: that PCI compliance for voice payments is a software configuration question. Add a library, enable a flag, toggle a setting in the conversation flow — and the payment works.
This is wrong, and it's wrong in ways that matter.
PCI DSS scope is architectural, not configurational. PCI DSS Level 1 — the certification required for any platform processing significant card volume — governs how cardholder data is stored, transmitted, and processed. An AI voice agent that processes card audio through its speech-to-text pipeline (even momentarily, even without storing it) is in scope. An AI model that receives card digits as part of a conversation transcript is in scope. Scope is determined by whether card data touches a system, not by whether the system is "trying" to handle card data.
DTMF capture requires purpose-built infrastructure. The standard for PCI-compliant voice payments is DTMF (Dual-Tone Multi-Frequency) capture — the customer enters card digits via their phone's keypad rather than reading them aloud. The DTMF tones must be captured by a PCI-certified environment that is architecturally separated from the AI voice agent. The AI agent's audio stream is paused during card entry. Card data never reaches the speech-to-text engine or the language model. Confirmation is returned to the agent without card data crossing the boundary.
This requires the voice platform and the payment layer to integrate at the telephony layer — not just at the API layer. It is a non-trivial integration that requires a payment partner with voice-specific infrastructure.
Secure card capture over live voice is not the same as online card capture. Web-based PCI compliance (SAQ A, hosted checkout pages) is relatively well-understood. Voice PCI compliance is different. The card capture environment must integrate with the voice channel's telephony infrastructure, handle DTMF tone detection, manage real-time call state, route the captured data to the PSP, and return a payment confirmation — all within the window of an active phone call. Latency, call drops, and DTMF detection errors are operational realities that don't exist in online checkout. The infrastructure must be built for voice, not adapted from web.
Fraud detection and identity verification over voice require separate consideration. A web checkout has device fingerprinting, IP analysis, and behavioural signals to inform fraud scoring. A voice call has caller ID and the customer's spoken responses. Voice-specific fraud signals — caller ID spoofing, synthetic voice detection, call velocity — need to be integrated at the payment layer. This is not a capability that ships with a generic payment API.
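As a sketch of what combining those voice-specific signals looks like, the following toy scoring rule folds caller ID verification, a synthetic-voice score, and call velocity into a single pre-authorisation decision. The signal names, weights, and thresholds are illustrative, not any particular vendor's API:

```python
from dataclasses import dataclass

@dataclass
class VoiceCallSignals:
    """Illustrative voice-specific fraud signals for one inbound call."""
    caller_id_verified: bool      # e.g. STIR/SHAKEN attestation passed
    synthetic_voice_score: float  # 0.0 (likely human) .. 1.0 (likely synthetic)
    calls_last_hour: int          # velocity from this caller ID

def risk_decision(s: VoiceCallSignals) -> str:
    """Toy scoring rule: returns 'allow', 'review', or 'block'."""
    score = 0.0
    if not s.caller_id_verified:
        score += 0.4
    score += s.synthetic_voice_score * 0.4
    if s.calls_last_hour > 5:
        score += 0.3
    if score >= 0.7:
        return "block"
    if score >= 0.4:
        return "review"
    return "allow"

print(risk_decision(VoiceCallSignals(True, 0.1, 1)))    # allow
print(risk_decision(VoiceCallSignals(False, 0.9, 10)))  # block
```

The point is where this logic lives: it needs the payment layer's view of the transaction and the telephony layer's view of the call, which is exactly why it does not ship with a generic payment API.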
The software problem — getting the AI agent to understand that a payment is needed and to manage the conversation flow around it — is largely solved. The infrastructure problems underneath it are not.
The CCaaS Gap: Incumbents and Startups Both Face It
The infrastructure gap exists across the full CCaaS market, but for different reasons depending on whether you're talking about the incumbents or the AI-native entrants.
The incumbents (Genesys, NICE, Amazon Connect) built their payment flows for human agents.
The existing payment infrastructure in enterprise CCaaS was designed around agent-assisted payment capture: a human agent stays on the line while the customer enters card details via DTMF, or reads card numbers to an IVR after the agent has handed off the sensitive portion of the call. PCI compliance was maintained because the human agent was removed from the card capture environment — the call was transferred to a separate, certified IVR system.
When AI voice agents replace human agents, this architecture doesn't simply carry over. The AI agent is not a drop-in replacement for the human in the existing flow. It needs its own integration to the PCI-certified capture environment, its own handling of the call state transition, and its own mechanism for receiving payment confirmation and continuing the conversation. Most CCaaS platforms have not rebuilt this flow for AI-native agents yet — they've shipped the AI conversation capabilities and left the payment integration as a gap for the customer to solve.
The AI-native startups are inheriting the same gap without the institutional knowledge.
Companies like PolyAI, Parloa, Bland.ai, and the growing field of AI voice platforms have solved the hard problem: building conversational AI that can handle enterprise customer service at scale. Most have not yet built production-grade voice payment infrastructure, or are still in the middle of building it.
The challenge is predictable: voice payment infrastructure is not their core competency, and it shouldn't be. Building PCI DSS Level 1 infrastructure requires dedicated compliance investment, annual audits, specialised engineering, and ongoing operational overhead. For a voice AI company whose core product is conversational AI, turning a significant portion of engineering capacity toward PCI certification is a serious detour.
The result is that many AI voice platforms ship without payment capability, or with a payment capability that works for simple use cases but fails the compliance bar required by enterprise CCaaS customers in regulated industries.
Voice AI without embedded payment capture is a lobby without a checkout counter. You can greet customers, answer their questions, and move them through a conversation — but you cannot close the transaction.
What Payment-Ready Voice AI Actually Looks Like
The architecture for production-grade AI voice payments is not theoretical. It is deployed and handling real enterprise transactions today.
The separation model is foundational. The AI voice agent handles the conversation. The payment layer handles card data. These two systems are architecturally separated, with a clearly defined handoff: the AI agent triggers a payment request (amount, currency, merchant context), the payment layer captures card data via DTMF, processes the transaction through the merchant's PSP, and returns a payment confirmation. Card data never enters the AI agent's environment.
DTMF capture is the proven standard. When payment is required, the voice channel temporarily routes the customer's keypad input to the PCI-certified payment environment. The AI agent's speech-to-text processing is suspended. The customer enters card digits via keypad. The payment layer confirms the transaction. The call returns to the AI agent with a payment reference. The agent confirms payment and continues the conversation. From the customer's perspective, this is a brief, natural pause — equivalent to typing card details on a website.
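The separation model and the DTMF handoff described in the last two paragraphs can be sketched as a small call-state sequence. Everything here is illustrative stand-in code, not a real platform API; the point it demonstrates is that only the payment request crosses into the payment layer, and only an opaque reference crosses back:

```python
from dataclasses import dataclass

@dataclass
class PaymentRequest:
    amount_minor: int   # e.g. 4999 = 49.99 in minor currency units
    currency: str
    merchant_ref: str

@dataclass
class PaymentResult:
    status: str         # "approved" | "declined" | "timeout"
    reference: str      # opaque payment reference -- never card data

class PciCaptureStub:
    """Stands in for the PCI-certified DTMF capture environment.
    The real environment captures keypad tones and charges the PSP;
    the agent side never sees what happens inside it."""
    def capture_and_charge(self, req: PaymentRequest) -> PaymentResult:
        return PaymentResult(status="approved", reference="pay_ref_001")

class CallSession:
    """Minimal stand-in for a live telephony session."""
    def __init__(self):
        self.transcribing = True
        self.pci = PciCaptureStub()
    def suspend_transcription(self):
        self.transcribing = False   # STT/LLM stop receiving audio
    def resume_transcription(self):
        self.transcribing = True

def take_payment(call: CallSession, req: PaymentRequest) -> PaymentResult:
    call.suspend_transcription()
    try:
        # Keypad input is routed to the certified environment; only an
        # opaque reference ever crosses back to the AI agent.
        return call.pci.capture_and_charge(req)
    finally:
        call.resume_transcription()

call = CallSession()
result = take_payment(call, PaymentRequest(4999, "GBP", "policy-renewal-123"))
print(result.status, result.reference, call.transcribing)
# approved pay_ref_001 True
```

Note the `finally`: transcription resumes whether the payment approves, declines, or errors, so the conversation can always continue.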
PSP-neutral infrastructure is a hard requirement at enterprise scale. An AI voice agent platform serving multiple enterprise customers will encounter a different PSP at each account. One customer uses Worldpay. Another uses Adyen. A third has a regional acquirer with bespoke API requirements. The payment layer must support each PSP through a single integration — the AI platform cannot maintain individual PSP integrations for each customer. PSP-neutral architecture, with 40+ providers available through a single API, is the only operationally viable approach.
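In engineering terms, PSP-neutrality is an adapter layer: one interface on the platform side, one adapter per PSP behind it, selected by account configuration. A minimal sketch (the provider names are real, but the interfaces and return values are hypothetical):

```python
from abc import ABC, abstractmethod

class PspAdapter(ABC):
    """One adapter per PSP; the platform codes against this interface."""
    @abstractmethod
    def charge(self, amount_minor: int, currency: str, token: str) -> str: ...

class WorldpayAdapter(PspAdapter):
    def charge(self, amount_minor, currency, token):
        # A real implementation would call Worldpay's API here.
        return f"worldpay:{token}:{amount_minor}{currency}"

class AdyenAdapter(PspAdapter):
    def charge(self, amount_minor, currency, token):
        # A real implementation would call Adyen's API here.
        return f"adyen:{token}:{amount_minor}{currency}"

# Each enterprise account is configured with its own PSP; the voice
# platform integrates once, against the payment layer's single API.
ACCOUNT_PSP = {
    "acme-insurance": WorldpayAdapter(),
    "globex-utilities": AdyenAdapter(),
}

def charge_for_account(account: str, amount_minor: int,
                       currency: str, token: str) -> str:
    return ACCOUNT_PSP[account].charge(amount_minor, currency, token)

print(charge_for_account("acme-insurance", 4999, "GBP", "tok_abc"))
# worldpay:tok_abc:4999GBP
```

The voice platform writes `charge_for_account` once; adding a customer on a new PSP is a configuration change in the payment layer, not a new integration on the platform.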
Real-time confirmation is part of the product. The AI agent's conversation flow depends on knowing whether the payment succeeded immediately. A successful payment triggers a confirmation message and next steps. A decline triggers a retry flow or alternative payment option. A timeout triggers a graceful fallback. This requires the payment layer to return a webhook confirmation within the window of the live call — typically under two seconds for a good customer experience. Infrastructure that processes payments asynchronously or with significant latency will break the voice conversation.
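The three outcomes above reduce to a small branch in the agent's flow, with a hard deadline on the confirmation wait. A sketch, using a queue to stand in for the payment layer's webhook delivery (the deadline value and utterances are illustrative):

```python
import queue
import threading

CONFIRMATION_DEADLINE_S = 2.0   # keep the live call responsive

def await_confirmation(confirmations: "queue.Queue[str]") -> str:
    """Wait for the payment layer's confirmation; treat silence as timeout."""
    try:
        return confirmations.get(timeout=CONFIRMATION_DEADLINE_S)
    except queue.Empty:
        return "timeout"

def next_utterance(status: str) -> str:
    """Map the payment outcome to what the agent says next."""
    return {
        "approved": "Your payment went through. Shall I email the receipt?",
        "declined": "That card was declined. Would you like to try another?",
        "timeout": "I couldn't confirm that in time. I'll send you a secure payment link instead.",
    }[status]

# Simulate the payment layer posting a confirmation 0.5s into the wait.
q: "queue.Queue[str]" = queue.Queue()
threading.Timer(0.5, q.put, args=("approved",)).start()
print(next_utterance(await_confirmation(q)))
# Your payment went through. Shall I email the receipt?
```

An asynchronous payment backend that settles in minutes can still be correct infrastructure for e-commerce; on a live call it forces the timeout branch every time, which is why confirmation latency is a product requirement, not an operational detail.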
Compliance is a platform-level requirement, not a customer problem. Enterprise CCaaS customers in regulated industries — insurance, financial services, healthcare, utilities — cannot deploy AI voice agents that handle payments unless the payment flow meets PCI DSS Level 1 requirements. This is not an optional certification tier. Platforms that cannot demonstrate Level 1 compliance, alongside ISO 27001 and SOC 2, will not get through enterprise procurement. The compliance must be in the payment layer, and it must be demonstrable at the point of sale.
For CCaaS Providers and Voice AI Companies: What to Look for in a Payment Partner
If you are building AI voice capability and payment is a required flow for your target customers, these are the non-negotiable criteria for a payment infrastructure partner:
PCI DSS Level 1 certification — specific to voice. Many payment providers carry PCI certification for online transactions. Voice payment certification is different and less common. Confirm the provider's certification explicitly covers DTMF capture over telephony channels. Ask for their Attestation of Compliance (AOC) and confirm the scope includes voice payment flows.
Native DTMF integration, not a workaround. Some providers attempt to handle voice payments by routing customers to a separate IVR system mid-call. This breaks the customer experience and defeats the purpose of AI-native voice. The DTMF capture should integrate with your voice infrastructure at the telephony layer, not require a call transfer to a third-party system.
PSP breadth. The payment partner needs to support the PSPs your enterprise customers already use. A partner with one or two PSPs will create blockers in enterprise sales. Confirm the provider's PSP coverage map before committing to an integration.
Latency and reliability at enterprise scale. Voice payments fail in ways that web payments don't. Call drops, DTMF detection failures, and latency-induced timeouts create customer experience problems that are hard to recover from on a live call. Ask for SLA commitments on payment confirmation latency and uptime, and reference production deployments in enterprise CCaaS environments.
Multi-channel coverage. Payment requirements in CCaaS don't stop at voice. Chat agents need payment links. SMS follow-ups need trackable checkout pages. Agents handling escalations need to transfer payment context across channels. A payment partner whose coverage is limited to voice will require additional integrations as your platform grows.
Commercial model. Per-transaction pricing is standard for voice payments, but the commercial model varies significantly. Confirm that the provider's pricing scales reasonably at enterprise volumes, and whether revenue-share mechanisms exist for platforms embedding payments as a product feature for their own customers.
The Gap Is Closing — But Slowly
The voice AI market has moved faster than the payment infrastructure supporting it. That asymmetry is temporary — enterprise deployments in regulated industries will force the payment capability question, and platforms that cannot answer it will lose deals to those that can.
The good news is that the infrastructure model is proven. AI voice agents processing PCI-compliant payments through PSP-neutral payment layers is not a whiteboard concept. PolyAI processes payments at enterprise scale today. The architecture described in this guide — DTMF capture, payment layer separation, PSP-neutral routing, real-time confirmation — is production infrastructure, not a roadmap.
The gap is not "can AI voice agents take payments." The answer is yes.
The gap is between the voice AI platforms that have integrated production-grade payment infrastructure and those that have not. For CCaaS providers and voice AI companies evaluating where to direct engineering resources, the build-vs-partner decision on voice payments is clear: building PCI DSS Level 1 voice payment infrastructure in-house typically takes 12 to 18 months and consumes compliance resources that most voice AI companies do not have on staff. Integrating with a purpose-built payment layer takes weeks.
The voice AI lobby has been built. The checkout counter is the next required addition.
Related Reading
AI Payment Security: How AI Agents Handle Card Data Without Breaking PCI — the detailed architecture for keeping card data out of AI voice models
Agentic Payments in 2026: The Infrastructure Guide for Platforms — the broader infrastructure landscape for AI agent payments across voice, chat, and machine-to-machine
AI Voice Payments for Insurance — how regulated industries are deploying voice AI payments in production
Chat Agent Payments: How AI Chat Agents Take Payments Mid-Conversation — extending the payment infrastructure model to AI chat
CCaaS Payments: Embedded Payments for Contact Centre Platforms — Shuttle's approach to the CCaaS vertical
Agentic Payments for Platforms — the platform-level case for embedding agentic payment infrastructure
Ready to close the payment gap in your voice AI platform?
Shuttle powers PCI-compliant voice payments for AI voice agents and CCaaS platforms — DTMF capture, PSP-neutral processing, real-time confirmation, 40+ PSPs through a single integration.
Book a Demo | See How Platforms Use Shuttle