Using Voice AI for ADR


Disclaimer: Per just about every major production's protocols - do not upload any production audio to cloud-based Voice AI services without explicit permission from the proper chain of command. Additionally, actors are starting to bake anti-AI clauses into their contracts, so just because you may be able to get near-perfect audio for ADR in some instances - that doesn't mean it's legally usable in your edit.

~~~

The buzzword "AI" is ever-present in the news. On one hand there is concern and controversy over misuse and replacing the need for warm human bodies; on the other, excitement for pushing the bounds of what is possible. One thing is for sure: AI is here to stay - and it's improving very quickly.

Perhaps most notoriously known for its use by the internet's darling griefers, ElevenLabs is without a doubt the standout deepfake technology for Voice AI at the moment.

Before you delve into ElevenLabs (a paid service), you might want to test the waters with the next-best alternative - the free, open source tortoise-tts (webUI here, GitHub here). You feed Tortoise clean audio of the person-to-be-cloned; an oh-so-exploitable Richard Nixon voice recording is pre-loaded for you in the Replicant WebUI. Feeding the dialog of a person you wish to voice clone into a Voice AI is called "voice training", and the more you voice train (the more dialog you feed into the AI), the better the results. While I only dabbled with the Replicant WebUI (and not the CLI Tortoise app), I found the results to be mixed at best - but you could definitely get some decent, usable Temp ADR out of it from time to time.

However, ElevenLabs offers the real pro solution (for a very reasonable monthly fee). Unlike Tortoise, ElevenLabs severely limits the size of voice training audio files - but the results are of much higher quality. Below you can see the WebUI:

You'll want to crank down the Voice Settings "Stability" and "Clarity + Similarity Enhancement" to make the voice tone direction take effect. Clicking "Generate" triggers a random seed, giving you different results every click. You may find that dialog editing with production audio and multiple Voice AI outputs will give you the best result. Leaning on a single Voice AI output 'take' can often feel stiff and robotic. EQ your Voice AI output to help match production audio.
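If you would rather script several takes than click "Generate" repeatedly, here is a rough sketch of how that might look. It is not something I tested for this post. It assumes ElevenLabs' public REST API accepts a POST to /v1/text-to-speech/{voice_id} with an "xi-api-key" header and a "voice_settings" object whose "stability" and "similarity_boost" fields correspond to the two WebUI sliders; check the current API docs before relying on any of that. The function names, the default slider values, and the voice ID are placeholders of mine. The same disclaimer applies: this sends data to a cloud service.

```python
import json
import urllib.request

# Assumed base URL of the ElevenLabs REST API; verify against current docs.
API_BASE = "https://api.elevenlabs.io/v1"


def build_tts_request(api_key, voice_id, text, stability=0.3, similarity_boost=0.3):
    """Build (but do not send) a text-to-speech request with lowered sliders.

    The 0.3 defaults are arbitrary "cranked down" placeholders, not
    recommended values.
    """
    payload = {
        "text": text,
        "voice_settings": {
            "stability": stability,
            "similarity_boost": similarity_boost,
        },
    }
    return urllib.request.Request(
        url=f"{API_BASE}/text-to-speech/{voice_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def fetch_takes(api_key, voice_id, text, count=3):
    """Send the same request several times and collect the audio bytes.

    Each call is expected to come back slightly different, like repeated
    clicks on "Generate". This function performs network requests.
    """
    takes = []
    for _ in range(count):
        request = build_tts_request(api_key, voice_id, text)
        with urllib.request.urlopen(request) as response:
            takes.append(response.read())
    return takes
```

Writing each returned take to its own file would give you a stack of alternates to cut between in your dialog edit.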

Voice training uploads on ElevenLabs are limited to 10MB, so you'll want to use mono MP3 to get the best bang for your buck. A well-encoded MP3 (such as one made with the LAME encoder) at 96kbps should do the trick and give you enough running time for a thorough voice training. Additionally, cleaning up dirty audio with iZotope RX will help, but ultimately you'll want to feed your AI nice clean audio whenever possible.
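To put rough numbers on that 10MB limit, here is a small back-of-the-envelope sketch. It assumes 10MB means 10,000,000 bytes and a constant bitrate, and it ignores headers and tags, so treat the results as estimates. The ffmpeg argument list is one common way to get a LAME-encoded mono MP3 (it assumes an ffmpeg build that includes libmp3lame); the function names and file names are placeholders.

```python
def max_duration_seconds(size_limit_bytes, bitrate_kbps):
    """Approximate running time that fits in a size limit at a constant bitrate."""
    return size_limit_bytes * 8 / (bitrate_kbps * 1000)


def ffmpeg_mono_mp3_args(src, dst, bitrate_kbps=96):
    """Argument list for encoding src to a mono MP3 via ffmpeg's libmp3lame.

    Run it with subprocess.run(args, check=True) if ffmpeg is installed.
    """
    return [
        "ffmpeg",
        "-i", src,                    # input file (e.g. a cleaned-up WAV)
        "-ac", "1",                   # downmix to one channel (mono)
        "-c:a", "libmp3lame",         # use the LAME MP3 encoder
        "-b:a", f"{bitrate_kbps}k",   # target audio bitrate
        dst,
    ]


LIMIT_BYTES = 10 * 1000 * 1000  # assumption: 10MB counted as 10,000,000 bytes

# 96 kbps mono MP3
mp3_seconds = max_duration_seconds(LIMIT_BYTES, 96)

# Uncompressed 16-bit / 48 kHz mono WAV is 48000 * 16 = 768 kbps
wav_seconds = max_duration_seconds(LIMIT_BYTES, 768)

print(f"96 kbps MP3: about {mp3_seconds / 60:.1f} minutes")
print(f"16-bit 48 kHz mono WAV: about {wav_seconds / 60:.1f} minutes")
print(" ".join(ffmpeg_mono_mp3_args("clean_dialog.wav", "training.mp3")))
```

Under these assumptions you get roughly 14 minutes of training audio as a 96kbps MP3, versus under 2 minutes as an uncompressed mono WAV. Mono matters because at a fixed bitrate all of the bits go to a single channel instead of being split across two.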