How little audio does it take to clone someone's voice?

Some current commercial voice cloning tools can produce a convincing synthetic voice from as little as three seconds of audio. Higher-quality clones typically use 30 to 60 seconds, which captures more of a speaker's natural variation, inflection, and accent. That audio can come from any source: a conference recording, a podcast, a YouTube interview, a company video, or a voicemail greeting. Once cloned, the voice can be used to generate arbitrary speech in real time or as pre-recorded audio.

Can deepfake video work on a live video call?

Yes. Real-time face-swapping tools can run on a standard GPU and replace a speaker's face on a live video call with sufficient quality to deceive a recipient who is not looking for signs of manipulation. The Arup fraud in February 2024, where a finance employee was deceived into transferring $25 million, involved a live multi-person video call in which every other participant had been deepfaked. The employee believed they were speaking with real colleagues because the faces, voices, and mannerisms matched the people they recognised.

What is a vishing attack?

Vishing is voice phishing: a phone-based social engineering attack that impersonates a trusted person or organisation to manipulate the target into revealing credentials, authorising transactions, or providing access. AI voice cloning has substantially raised the sophistication of vishing by allowing attackers to impersonate a specific named individual whose voice is known to the target, rather than relying on an anonymous caller claiming to be from a bank or IT department.

How do businesses verify identity without relying solely on voice or video?

The most effective approach combines a secondary channel with a shared secret. For any sensitive request arriving by phone or video call, require a confirmation through a different channel (email from the corporate account, an approval in a workflow system, or a call back to a pre-registered number) and verify a code word or phrase that was agreed in advance and is not publicly available. This means a fraudster who controls one channel cannot complete the verification without also controlling the second.

Deepfakes and Identity Fraud: How AI Voice and Video Manipulation Is Targeting Businesses

Key Takeaways

AI voice cloning tools can replicate a person's voice from as little as three seconds of audio, enabling real-time impersonation over phone calls with very little technical barrier for attackers.
The February 2024 Arup incident, in which a finance employee transferred HK$200 million (approximately $25M) after a deepfaked video call, demonstrated that real-time video impersonation is now an operational fraud technique.
AI voice deepfakes are being used in vishing attacks targeting finance teams, C-suite assistants, and IT helpdesks to authorise transactions, reset credentials, and bypass access controls.
Identity verification systems that rely on live video or voice liveness checks are increasingly vulnerable as the quality of real-time deepfake tools improves.
Effective defences combine a secondary verification channel, a shared code word protocol for sensitive requests, and staff training focused on specific deepfake scenarios rather than generic phishing awareness.
Organisations onboarding customers or employees remotely need liveness detection technology that goes beyond video-based checks and incorporates behavioural and document signals.

The $25 million video call

In February 2024, an employee in Arup's Hong Kong finance department received an email purportedly from the company's UK chief financial officer requesting a confidential transfer. The employee was sceptical. To allay their concerns, they were invited to a video call with the CFO and several other senior colleagues.

Every person on that call had been deepfaked. The faces, voices, and mannerisms matched the real individuals closely enough that the employee was convinced. Over several transactions, HK$200 million, approximately $25 million, was transferred to accounts controlled by the fraudsters. The fraud was not discovered until the employee contacted the UK head office directly.

The Arup incident was not an isolated event; it was the public face of a technique that had already been deployed in dozens of undisclosed fraud cases. It removed any remaining doubt that real-time synthetic video had crossed the threshold from research demonstration to operational criminal tool.

Important context

Deepfake fraud involving voice or video does not require the attacker to be a nation-state actor or a technically sophisticated group. Tools capable of real-time voice cloning and face-swapping are available commercially and in open-source repositories, often for under $50 per month in subscription costs.

How voice cloning works

Voice cloning uses deep learning models, typically a variant of a text-to-speech architecture, trained on a sample of a target speaker's audio. Given a short recording, the model learns the acoustic characteristics of that person's voice: timbre, rhythm, pitch range, and speaking style. It can then synthesise new speech in that voice from any text input.

Three seconds of audio is sufficient for some commercial tools to produce a recognisable clone. Thirty seconds produces a markedly better result. The audio sample can come from any public source: a LinkedIn video, a company presentation, a podcast interview, a press conference recording, or a voicemail greeting left on a public business line.

The cloned voice can be used in two ways. Pre-recorded mode generates audio clips that are inserted into a voicemail or audio message. Real-time mode processes text input and converts it to the target's voice with low enough latency to hold a live phone conversation, with the attacker typing what they want the voice to say and the synthesis playing through the call.

3 seconds

Minimum audio sample needed for a working voice clone with current commercial tools

$25M

Transferred in the Arup deepfake fraud, February 2024

Deepfake fraud attempts in financial services are rising year-on-year, per sector threat intelligence

Attack types in the corporate environment

Vishing: AI voice impersonation over the phone

An attacker clones a CEO, CFO, or IT manager's voice and calls an employee with a time-pressured request: authorise a wire transfer, provide a temporary password, or grant system access. The call appears to come from a known person. The voice sounds right. The caller knows details about the organisation from public sources. Finance teams, IT helpdesks, and executive assistants are the primary targets because they have the authority to act quickly on verbal requests.

Deepfaked video calls for wire fraud authorisation

Attackers use face-swap software to replace their face with a target's in a live video call. Combined with a cloned voice, the resulting call appears to show a real colleague speaking in real time. This technique was used in the Arup case and is becoming more common in large wire fraud attempts, particularly where email confirmation alone is not sufficient to release funds and a call is used as secondary verification.

KYC bypass with synthetic identity documents

Know Your Customer processes that rely on video liveness checks are increasingly targeted by attacks that combine AI-generated document images with real-time face-swapping to present a synthetic identity that matches a fabricated or stolen document. Fraudsters use these techniques to open bank accounts, apply for credit, or onboard under false identities at financial institutions, fintech platforms, and exchanges.

Credential reset via impersonated helpdesk calls

An attacker with a cloned voice of a known employee calls the IT helpdesk claiming to be locked out of an account. They provide the employee's name, department, and enough contextual detail gathered from public sources or prior reconnaissance to pass verbal verification. The helpdesk resets credentials or adds an attacker-controlled recovery method, granting access to corporate systems without any malware being deployed.

The Arup employee was right to be sceptical of the emailed request and to want more than an email before acting. The attack succeeded precisely because that verification step was anticipated and defeated with a staged video call.

Why standard verification fails

Organisations typically rely on three layers of informal verification for sensitive verbal requests: they recognise the voice, they can see the person on video, and the caller knows things only that person would know. All three can now be defeated by a prepared attacker.

Recognising a voice by ear is fallible at the best of times, and phone audio compression introduces artefacts that mask the differences between a real voice and a good clone. Video verification, as Arup demonstrated, is no longer reliable when face-swap software can run on consumer hardware in real time. The contextual details that sound like insider knowledge are frequently available from LinkedIn profiles, company websites, annual reports, and public event recordings. Caller ID, which often lends a call extra credibility, is trivially spoofed.

This is not a failure of individual judgement. Employees trained to be sceptical of email phishing have no equivalent training for voice or video impersonation. Their instinct to verify via a second channel, such as requesting a video call, is correct, but attackers prepare for exactly that response.

Controls that actually work

Secondary channel and code word protocol

Any sensitive request arriving by phone or video call, whether for a payment, credential reset, or system access, must be confirmed through a completely separate channel. Use a different medium: if the request came by phone, confirm by email to the corporate address. If it came by video call, call back on a pre-registered direct number. Combine this with a shared verbal code word agreed in advance out of band, one that is not written in any email or document that could be accessed in a breach. A caller who cannot produce the code word does not receive the action, regardless of how convincing the voice or face appears.
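
The rule above can be sketched as a simple decision gate. This is a minimal illustration, not a definitive implementation: the channel names, the `SensitiveRequest` and `Confirmation` types, and the `may_proceed` function are all hypothetical, and the sketch assumes the employee records only whether the code word was given correctly, so the word itself is never stored.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical channel names. Each confirmation channel is one the employee
# initiates, never one the caller offers.
TRUSTED_CONFIRMATION_CHANNELS = {
    "corporate_email",
    "workflow_approval",
    "callback_registered_number",
}


@dataclass(frozen=True)
class SensitiveRequest:
    claimed_requester: str  # who the caller says they are
    inbound_channel: str    # how the request arrived, e.g. "phone" or "video_call"
    action: str             # e.g. "payment", "credential_reset", "system_access"


@dataclass(frozen=True)
class Confirmation:
    channel: str               # channel the employee used to confirm
    code_word_confirmed: bool  # recorded by the employee; the word is not stored


def may_proceed(request: SensitiveRequest,
                confirmation: Optional[Confirmation]) -> bool:
    """Allow the action only with a separate-channel confirmation AND the code word."""
    if confirmation is None:
        return False
    on_trusted_channel = confirmation.channel in TRUSTED_CONFIRMATION_CHANNELS
    on_separate_channel = confirmation.channel != request.inbound_channel
    return (on_trusted_channel
            and on_separate_channel
            and confirmation.code_word_confirmed)
```

In this sketch a request that arrives by video call and is "confirmed" on the same video call is refused, however convincing the caller appears, as is any confirmation that lacks the code word.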

Payment authorisation above a threshold

Remove verbal authorisation as a valid trigger for any transfer above a defined threshold. All payments above that limit require two approvals through the corporate payment system, not a phone call. This is a process change, not a technology one, but it directly eliminates the mechanism that fraud calls rely on.
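
As a rough sketch of how such a rule might be enforced in a payment workflow: the threshold value, the `Payment` type, and the `can_release` function below are illustrative assumptions. Excluding the requester's own approval is an extra assumption that the text above does not state.

```python
from dataclasses import dataclass, field

APPROVAL_THRESHOLD = 10_000  # hypothetical limit, in whole currency units
REQUIRED_APPROVALS = 2


@dataclass
class Payment:
    amount: int
    requested_by: str
    # Approvals recorded in the corporate payment system. A phone or video
    # call never adds an entry here.
    approvers: set = field(default_factory=set)


def can_release(payment: Payment) -> bool:
    """Payments above the threshold need two approvals from people other than the requester."""
    if payment.amount <= APPROVAL_THRESHOLD:
        return True  # below the limit, the normal payment process applies
    independent_approvers = payment.approvers - {payment.requested_by}
    return len(independent_approvers) >= REQUIRED_APPROVALS
```

The design point is that the only input is what has been recorded in the payment system, so a convincing call cannot by itself move a payment forward.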

Staff training on AI voice and video fraud specifically

Generic phishing awareness training does not prepare staff for voice or video impersonation. Run targeted scenarios: play a voice clone of a senior executive making a fraudulent request and have employees practise the verification protocol. The experience of hearing a familiar voice be wrong is a more effective training intervention than reading about the concept. Focus training on the roles most targeted: finance teams, IT helpdesks, and executive assistants.

Upgrade KYC liveness detection

Organisations that onboard customers remotely using video liveness checks should audit their KYC provider's approach to deepfake detection. Liveness checks that rely on a single video signal, such as facial movement, are increasingly insufficient. Look for providers that combine passive and active liveness, run document authenticity checks that go beyond image analysis, and use device fingerprinting and session anomaly detection as supplementary signals.
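
One way to picture the layering described above is a decision function in which the core checks must all pass and the supplementary signals can only escalate to review. The field names and the three-way outcome are hypothetical. Real providers usually return scores rather than booleans, so treat this as a sketch of the structure only.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    APPROVE = "approve"
    MANUAL_REVIEW = "manual_review"
    REJECT = "reject"


@dataclass(frozen=True)
class OnboardingSignals:
    passive_liveness_passed: bool
    active_liveness_passed: bool
    document_authentic: bool
    device_fingerprint_suspicious: bool  # supplementary signal
    session_anomaly_detected: bool       # supplementary signal


def decide(signals: OnboardingSignals) -> Decision:
    """Require every core check; let supplementary signals escalate to review."""
    core_checks_passed = (signals.passive_liveness_passed
                          and signals.active_liveness_passed
                          and signals.document_authentic)
    if not core_checks_passed:
        return Decision.REJECT
    if signals.device_fingerprint_suspicious or signals.session_anomaly_detected:
        return Decision.MANUAL_REVIEW
    return Decision.APPROVE
```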

Caller ID validation at the network layer

Work with your telephony provider to implement STIR/SHAKEN, a framework in which the originating carrier cryptographically signs each call and attests to how confident it is that the caller is entitled to use the displayed number. While not foolproof, it raises the bar for number spoofing on calls originating within compliant networks and provides a signal that can be factored into call handling procedures.
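
SHAKEN defines three attestation levels: A (full), B (partial), and C (gateway). A call handling procedure could treat them along the following lines. How a telephony provider exposes the level and the signature check result varies, so the function inputs and the handling categories here are assumptions for illustration.

```python
from enum import Enum
from typing import Optional


class Attestation(Enum):
    FULL = "A"      # carrier knows the customer and their right to use the number
    PARTIAL = "B"   # carrier knows the customer but not the number's authorisation
    GATEWAY = "C"   # carrier only knows where the call entered its network


class Handling(Enum):
    NUMBER_ATTESTED = "number_attested"
    TREAT_WITH_CAUTION = "treat_with_caution"
    UNVERIFIED = "unverified"


def call_handling_hint(attestation: Optional[Attestation],
                       signature_valid: bool) -> Handling:
    """Map an attestation result to a hint for whoever answers the call."""
    if attestation is None or not signature_valid:
        return Handling.UNVERIFIED
    if attestation is Attestation.FULL:
        # Confirms the number only, not who is speaking. Sensitive requests
        # still need secondary-channel confirmation.
        return Handling.NUMBER_ATTESTED
    return Handling.TREAT_WITH_CAUTION
```

Even the strongest result says nothing about the voice on the line, which is why it is a supplementary signal and not a substitute for the verification protocol.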

Executive and high-value contact audio watermarking

For organisations with high executive exposure, consider audio watermarking for sensitive recordings. Some detection services can embed imperceptible markers in authorised audio of executives, so that any recording purporting to be that executive's voice can be checked against the watermark database. This is most relevant for financial institutions and large enterprises where executive impersonation is a specific and recurring threat.