- AI voice cloning tools can replicate a person's voice from as little as three seconds of audio, enabling real-time impersonation over phone calls with little technical barrier for attackers.
- The February 2024 Arup incident, in which a finance employee transferred HK$200 million (approximately US$25 million) after a deepfaked video call, demonstrated that real-time video impersonation is now operationally viable.
- AI voice deepfakes are being used in vishing attacks targeting finance teams, C-suite assistants, and IT helpdesks to authorise transactions, reset credentials, and bypass access controls.
- Identity verification systems that rely on live video or voice liveness checks are increasingly vulnerable as the quality of real-time deepfake tools improves.
- Effective defences combine a secondary verification channel, a shared code word protocol for sensitive requests, and staff training focused on specific deepfake scenarios rather than generic phishing awareness.
- Organisations onboarding customers or employees remotely need liveness detection technology that goes beyond video-based checks and incorporates behavioural and document signals.
The $25 million video call
In February 2024, an employee in Arup's Hong Kong finance department received an email purportedly from the company's UK chief financial officer requesting a confidential transfer. The employee was sceptical. To allay their concerns, they were invited to a video call with the CFO and several other senior colleagues.
Every other person on that call had been deepfaked. The faces, voices, and mannerisms matched the real individuals closely enough that the employee was convinced. Over several transactions, HK$200 million, approximately US$25 million, was transferred to accounts controlled by the fraudsters. The fraud was not discovered until the employee contacted the UK head office directly.
The Arup incident was not an isolated event; it was the public face of a technique that had already been deployed in dozens of undisclosed fraud cases. It removed any remaining doubt that real-time synthetic video had crossed the threshold from research demonstration to operational criminal tool.
Deepfake fraud involving voice or video does not require the attacker to be a nation-state actor or a technically sophisticated group. Tools capable of real-time voice cloning and face-swapping are available commercially and in open-source repositories, often for under $50 per month in subscription costs.
How voice cloning works
Voice cloning uses deep learning models, typically a variant of a text-to-speech architecture, trained on a sample of a target speaker's audio. Given a short recording, the model learns the acoustic characteristics of that person's voice: timbre, rhythm, pitch range, and speaking style. It can then synthesise new speech in that voice from any text input.
Three seconds of audio is sufficient for some commercial tools to produce a recognisable clone. Thirty seconds produces a markedly better result. The audio sample can come from any public source: a LinkedIn video, a company presentation, a podcast interview, a press conference recording, or a voicemail greeting left on a public business line.
The cloned voice can be used in two ways. Pre-recorded mode generates audio clips that are inserted into a voicemail or audio message. Real-time mode processes text input and converts it to the target's voice with low enough latency to hold a live phone conversation, with the attacker typing what they want the voice to say and the synthesis playing through the call.
Attack types in the corporate environment
The Arup employee did everything right by asking for a video call to verify the request. The attack succeeded precisely because that verification step was anticipated and defeated.
Why standard verification fails
Organisations typically rely on three layers of informal verification for sensitive verbal requests: they recognise the voice, they can see the person on video, and the caller knows things only that person would know. All three can now be defeated by a prepared attacker.
Voice recognition is fallible at the best of times; phone audio quality introduces compression artefacts that mask differences between a real voice and a good clone. Video verification, as Arup demonstrated, is no longer reliable when face-swap software can run on consumer hardware in real time. The contextual details that sound like insider knowledge are frequently available from LinkedIn profiles, company websites, annual reports, and public event recordings. Caller ID, which staff often treat as a further reassurance, is trivially spoofed.
This is not a failure of individual judgement. Employees trained to be sceptical of email phishing often have no equivalent training for voice or video impersonation. Their instinct to verify via a second channel, such as requesting a video call, is correct, but attackers who prepare for exactly that response can exploit it.
Controls that actually work
Secondary channel and code word protocol
Any sensitive request arriving by phone or video call, whether for a payment, credential reset, or system access, must be confirmed through a completely separate channel. Use a different medium: if the request came by phone, confirm by email to the corporate address. If it came by video call, call back on a pre-registered direct number. Combine this with a shared verbal code word agreed in advance out of band, one that is not written in any email or document that could be accessed in a breach. A caller who cannot produce the code word does not receive the action, regardless of how convincing the voice or face appears.
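The decision rule in this protocol can be written down as a short check. The following is a minimal sketch, not a definitive implementation: the `Channel` values, the `SensitiveRequest` fields, and the `may_proceed` function are hypothetical names chosen for illustration, and the code word itself is deliberately not stored anywhere in the code. The employee handling the request records only whether the caller produced it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Channel(Enum):
    PHONE = "phone"
    VIDEO = "video"
    EMAIL = "email"


@dataclass(frozen=True)
class SensitiveRequest:
    inbound: Channel                  # channel the request arrived on
    confirmed_via: Optional[Channel]  # separate channel used to confirm, if any
    used_registered_contact: bool     # confirmation went to a pre-registered number or address
    code_word_given: bool             # caller produced the agreed code word


def may_proceed(req: SensitiveRequest) -> bool:
    """Return True only if every step of the protocol passed."""
    # The confirmation must exist and must use a different medium.
    if req.confirmed_via is None or req.confirmed_via == req.inbound:
        return False
    # Calling back a number supplied by the caller proves nothing.
    if not req.used_registered_contact:
        return False
    # No code word, no action, however convincing the voice or face.
    return req.code_word_given
```

The checks are combined with a strict AND on purpose: a convincing call that passes two of the three steps is still refused.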
Payment authorisation above a threshold
Remove verbal authorisation as a valid trigger for any transfer above a defined threshold. All payments above that limit require two approvals through the corporate payment system, not a phone call. This is a process change, not a technology one, but it directly eliminates the mechanism that fraud calls rely on.
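As an illustration of the rule, here is a minimal sketch of a dual-approval gate. The threshold value, the `Payment` class, and the rule that an initiator cannot approve their own payment are assumptions made for the example, not features of any particular payment system. Payments at or below the threshold are treated as out of scope and simply pass.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Set

# Hypothetical values; each organisation would set its own.
DUAL_APPROVAL_THRESHOLD = Decimal("10000")
REQUIRED_APPROVALS = 2


@dataclass
class Payment:
    amount: Decimal
    initiator: str
    approvers: Set[str] = field(default_factory=set)

    def approve(self, user_id: str) -> None:
        """Record an approval made inside the payment system."""
        # Assumption: the person who raised the payment cannot approve it.
        if user_id == self.initiator:
            raise ValueError("initiator cannot approve their own payment")
        # A set means the same approver is never counted twice.
        self.approvers.add(user_id)

    def releasable(self) -> bool:
        """Above the threshold, require two distinct recorded approvals."""
        if self.amount <= DUAL_APPROVAL_THRESHOLD:
            return True
        return len(self.approvers) >= REQUIRED_APPROVALS
```

Note that nothing in the model accepts a phone or video call as input. Approval exists only as a recorded action by a named user, which is the point of the control.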
Staff training on AI voice and video fraud specifically
Generic phishing awareness training does not prepare staff for voice or video impersonation. Run targeted scenarios: play a voice clone of a senior executive making a fraudulent request and have employees practise the verification protocol. Hearing a familiar voice make a fraudulent request is a more effective training intervention than reading about the concept. Focus training on the roles most targeted: finance teams, IT helpdesks, and executive assistants.
Upgrade KYC liveness detection
Organisations that onboard customers remotely using video liveness checks should audit their KYC provider's approach to deepfake detection. Liveness checks that rely on facial movement alone are increasingly insufficient. Look for providers that combine passive and active liveness, run document authenticity checks that go beyond image analysis, and use behavioural signals such as device fingerprinting and session anomaly detection as supplementary evidence.
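One way to picture how these signals might fit together is a layered decision in which no single check is enough to approve an applicant. This is a sketch under stated assumptions: the field names, the three outcomes, and the rule that behavioural anomalies send a case to manual review are all hypothetical, and real KYC providers expose their own scores and thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OnboardingSignals:
    passive_liveness_ok: bool  # analysis of the capture with no user interaction
    active_liveness_ok: bool   # user completed a prompted challenge
    document_authentic: bool   # document checks beyond image analysis
    device_known_bad: bool     # device fingerprint matches a deny list
    session_anomalous: bool    # session anomaly detection raised a flag


def onboarding_decision(s: OnboardingSignals) -> str:
    """Return 'reject', 'review', or 'approve'."""
    # Liveness and document checks are all mandatory.
    if not (s.passive_liveness_ok and s.active_liveness_ok and s.document_authentic):
        return "reject"
    # Behavioural signals are supplementary: they escalate rather than reject.
    if s.device_known_bad or s.session_anomalous:
        return "review"
    return "approve"
```

Treating the behavioural signals as escalation triggers rather than hard failures keeps false rejections down while still putting a human in front of suspicious sessions.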
Caller ID validation at the network layer
Work with your telephony provider to implement STIR/SHAKEN, under which the originating carrier cryptographically signs each call and attests to how confident it is that the caller is entitled to use the number shown. While not foolproof, it raises the bar for number spoofing on calls originating within compliant networks and provides a signal that can be factored into call handling procedures.
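To show how the attestation signal might feed call handling, the sketch below reads the `attest` claim from a SHAKEN PASSporT (the signed token carried in the SIP Identity header) and maps it to a handling tier. The levels A (full), B (partial), and C (gateway) come from the SHAKEN framework; the tier names and the `handling_for` function are hypothetical. Important assumption: this code does not verify the token's signature, which a real deployment must do against the carrier's certificate, usually in the telephony provider's infrastructure.

```python
import base64
import json
from typing import Optional

# Hypothetical handling tiers keyed by SHAKEN attestation level.
HANDLING = {
    "A": "normal",   # full attestation: carrier vouches for caller and number
    "B": "caution",  # partial attestation: caller known, number not verified
    "C": "verify",   # gateway attestation: carrier only knows the entry point
}


def decode_passport_payload(token: str) -> dict:
    """Decode the payload of a compact JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload_b64 = parts[1]
    # base64url in JWTs drops padding, so restore it before decoding.
    padding = "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64 + padding))


def handling_for(token: Optional[str]) -> str:
    """Map a call's attestation level to a handling tier."""
    # Calls with no token, or an unknown level, get the strictest tier.
    if token is None:
        return "verify"
    attest = decode_passport_payload(token).get("attest")
    return HANDLING.get(attest, "verify")
```

Even a call with full attestation only shows that the number was not spoofed. It says nothing about who is speaking, so it should adjust call handling rather than replace the secondary channel check for sensitive requests.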
Executive and high-value contact audio watermarking
For organisations with high executive exposure, consider audio watermarking for sensitive recordings. Some detection services can embed imperceptible markers in authorised audio of executives, so that any recording of an executive's voice can be cross-checked against the watermark database. This is most relevant for financial institutions and large enterprises where executive impersonation is a specific and recurring threat.