Voice cloning reaches the next level

Analysis from Steve Ahern.

 

Ai voice cloning is entering the next level of operational sophistication.

At SMPTE’s MetExpo conference yesterday I saw a demonstration of the latest voice cloning tools. What struck me most was hearing Australian commentators calling a football game in Australian English, with the commentary then switching to Chinese and Portuguese.

Before the demo I thought, ‘what’s so special about that, there are plenty of speech-to-speech tools.’

When I saw it though, I was impressed. I heard the Chinese commentary spoken in the same male voice as the Aussie commentary, and the voice spoke Chinese with an Australian accent. It was the same with the Portuguese example. But that’s not all: the translated voices also reflected the emotion of the original commentary, rising and falling, getting louder and softer as the play got more exciting.

With this level of sophistication, the personalities of the commentators will carry through to the translated feeds.

Discussing the pros and cons of using Ai, session moderator Ken Kerschbaumer from Sports Video Group and speakers Adrian Britton from Ai-Media and Robin Herin from Ateme addressed the first worry of anyone discussing Ai: job losses. They made the point that sophisticated speech-to-speech technology is likely to offer more options for the audience and more chances to earn revenue, rather than taking jobs. “Broadcasters currently have to hire a translator who doesn’t know much about sport, or a sports commentator who is not fluent in the other language, or they choose not to supply a translated feed at all. Now the best commentator can be used in multiple languages, allowing the broadcaster to supply the best feed to new audience segments.”

The implications for radio are also obvious. Playout systems like RCS already have the ability to split feeds. Add language translation, and it becomes possible to target additional language segments with the same personalities. In Australia there is not much demand for this variation, but across Asia, South America and other regions where many languages are spoken, there is a new business opportunity.

Security was another issue discussed by the panel. The technology can either be situated in the cloud, which brings SaaS benefits such as less rack space and managed software updates, or on an air-gapped server that is not connected to the internet. Broadcasters who are concerned about hacking will likely choose the on-premises option, while those with less to lose if they are hacked will choose off-prem. The one caveat with the on-prem air-gapped option is that the translation server will still need to go online from time to time to update its LLM (Large Language Model) data, to ensure the best translations.

“There are cyber security attacks all the time. There is nothing as secure as running it in a box in your isolated workflow. But of course that needs more hardware and updates. This level of security is important for some clients, such as parliaments or the United Nations.”

The latest technology for tv sport broadcasts is a plug and play box that sits between the studio and the output, but for radio it may sit within the playout system and work with pre-recorded voice tracks inside the system.

Latency (i.e. the delay between real time and the translated broadcast output) was the other issue addressed by the panel. In Cambodia last week at the Asia Media Summit I discussed latency with some of the translation software suppliers, who were happy that they have got latency for their products down to 0.8 of a second. But they acknowledged that the faster the translation, the less accurate it may be. During the MetExpo panel, Britton and Herin were more realistic about the latency, saying the best results are currently achieved with a delay of 4 to 5 seconds. A live tv broadcast can be delayed by a few seconds to achieve the best results, but for radio, where near real time is a goal, the technology may still not be fast enough.

The complexity of translation was also discussed. Humans paraphrase and anticipate words and sentence structure automatically in their brains, while Ai, currently, translates all content in full and sequentially as it processes a sentence.

Britton explained: “Some languages rearrange the order of words in a sentence. In those languages the entire sentence needs to be ingested in full before translation can occur, because the sentence structure needs to be rearranged before broadcasting.”
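
The buffering Britton describes can be pictured with a minimal Python sketch. It is an illustration only, not how Ai-Media or Ateme build their products: the function name `buffer_sentences` and the rule of ending a sentence at `.`, `!` or `?` are assumptions made for the example. Words arrive one at a time from a live transcript, and nothing is released for translation until a whole sentence has been collected.

```python
# Assumption for this sketch: a sentence ends at one of these characters.
SENTENCE_END = (".", "!", "?")

def buffer_sentences(words):
    """Collect streamed words and yield only complete sentences.

    A translator that must reorder words cannot start until the
    whole sentence is available, so each sentence is held back
    until its closing punctuation arrives.
    """
    pending = []
    for word in words:
        pending.append(word)
        if word.endswith(SENTENCE_END):
            yield " ".join(pending)
            pending = []
    # Flush whatever is left when the stream stops mid-sentence.
    if pending:
        yield " ".join(pending)

live_words = "He shoots! What a goal. And".split()
for sentence in buffer_sentences(live_words):
    print(sentence)
```

The wait for the end of each sentence is one source of the delay discussed above: the longer the commentator's sentence, the longer the translated feed has to hold back.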

The other complexity is unusual words, or words specific to a particular sport. Britton demonstrated that by saying SMPTE, which is not a real English word and is usually translated wrongly as ‘simple’ or ‘sample.’ But if you know in advance that there are unusual words, you can pre-train the Ai to recognise them by adding them to a library of specific words in the LLM database.
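
One simple way to picture a library of special words is a glossary that shields known terms from the translation step. The Python sketch below uses placeholder substitution, which is a different and much cruder technique than the pre-training Britton described, and every name in it (`GLOSSARY`, `protect_terms`, `restore_terms`, `fake_translate`) is hypothetical. The `fake_translate` function is only a stand-in that reverses word order; it is not a real translation engine.

```python
import re

# Hypothetical glossary of terms that must survive translation unchanged.
GLOSSARY = ["SMPTE"]

def protect_terms(text, glossary):
    """Swap each glossary term for a numbered placeholder."""
    mapping = {}
    for i, term in enumerate(glossary):
        placeholder = f"__TERM{i}__"
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        text, count = pattern.subn(placeholder, text)
        if count:
            mapping[placeholder] = term
    return text, mapping

def restore_terms(text, mapping):
    """Put the original terms back after translation."""
    for placeholder, term in mapping.items():
        text = text.replace(placeholder, term)
    return text

def fake_translate(text):
    """Stand-in for a translation engine: just reverses the word order."""
    return " ".join(reversed(text.split()))

masked, mapping = protect_terms("Welcome to the smpte conference", GLOSSARY)
translated = fake_translate(masked)
print(restore_terms(translated, mapping))
```

The design point is that the list of special words is prepared before the broadcast, which matches the panel's advice that unusual words need to be known in advance.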

Britton’s company, Ai-Media, works with various open LLMs to keep its translation data updated, while Herin’s company, Ateme, works with Lingopal for its translation data.

It wouldn’t be a SMPTE conference without a discussion of standards. ISO/IEC 42001 is the world’s first AI management system standard, providing guidance for this rapidly changing field of technology. It addresses the unique challenges AI poses, such as ethical considerations, transparency, and continuous learning. In Australia there are also guidelines for Ai transparency set by Government agencies.

Download the Australian guidelines at www.industry.gov.au/sites/default/files/2024-09/voluntary-ai-safety-standard.pdf and www.digital.gov.au/sites/default/files/documents/2024-08/Standard%20for%20AI%20transparency%20statements%20v1.1.pdf

The EU has developed a learning chart for teaching about the use of Ai, which is also useful for conceptualising the various elements of Ai and prompting thinking about how humans should use it.

 

 

Contacts:

Adrian Britton

Robin Herin

Ken Kerschbaumer

 

 

