How to Recognize Speech Generated by Artificial Intelligence

This post is part of Lifehacker’s Artificial Intelligence Debunked series. We explore six different types of AI-generated media and highlight the common features, by-products, and telltale quirks that will help you distinguish artificial content from human-made content.

In recent years, artificial intelligence technologies have made it possible to clone someone’s voice and make that “person” say whatever you want. You don’t even have to be an expert to do it: a quick Google search, and your words can be spoken by anyone from President Biden to SpongeBob. It’s exciting, fun, and scary.

AI voice technology can be used for good: for example, Apple’s Personal Voice feature lets you create a version of your own voice for text-to-speech, aimed at people who are losing the ability to speak for themselves. It’s remarkable that we can preserve people’s voices, so that instead of a generic TTS voice, their words actually sound like their own.

Of course, there is the other side of the coin: the potential for rampant disinformation. When technology makes it this easy to make someone “say” anything, how can you be sure that what you hear online was actually said?

How AI Voice Generators Work

Like text and image generators, AI voice generators are based on models trained on huge data sets. In this case, the models are trained on samples of people speaking. For example, OpenAI’s Whisper speech model was trained on 680,000 hours of data. From that data, a model learns to reproduce not only the words themselves, but also other elements of speech, such as tone and tempo.

However, once a model is trained, it doesn’t need nearly as much data to reproduce a voice. Feed it only five minutes of recordings and you might not be impressed with the results, but some generators can produce a recognizable voice even from that limited sample. Give it more data, and it will reproduce your voice more accurately.

As the technology develops, it becomes increasingly difficult to spot a fake right away. But there are some notable quirks and flaws common to most AI voices, and knowing them is key to determining whether a recording is real or fake.

Listen for strange pronunciation and tempo

AI models mimic the sound of the human voice quite well, to the point that it is sometimes difficult to tell the difference. However, they still have difficulty replicating the way we speak.

If you have doubts, listen carefully to the intonation of the speaker’s “voice”: an AI bot may occasionally mispronounce a word in a way most people wouldn’t. Yes, people mispronounce things all the time, but listen for mistakes that stand out. For example, “collages” might come out as anything from “co-LAH-jez” to “co-LAY-ges.” You can hear errors exactly like these in Microsoft’s VALL-E 2 model if you click the first section under Audio Samples and listen to the Smart Cats example.

Tempo can also give it away. While AI is getting better at reproducing normal speech rates, it still makes strange pauses between words or skips over them in unnatural ways. An AI model may blow right past the gap between two sentences, which is an immediate giveaway. (Even a person who can’t stop talking doesn’t sound that robotic.) When I tested ElevenLabs’ free generator, one of the results had no gap at all between my first sentence, “Hey, how are you doing?” and my second sentence, “I’m thinking about going to the movies tonight.” To be fair, most attempts did include a pause, but keep an ear out for these jumps when judging whether a piece of audio is genuine.

On the flip side, the model may take too long to move on to the next word or sentence. While AI is getting better at replicating natural pauses and breathing (yes, some generators now insert “breaths” before speech), you’ll also hear strange pauses between words, as if the bot thinks that’s how people tend to speak. It would be one thing if this imitated someone pausing to think of the next word, but it doesn’t sound like that. It sounds like a robot.
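If you want to quantify those suspicious gaps rather than just listen for them, the idea is simple: scan the waveform for stretches where the amplitude stays near zero for longer than a natural pause would last. Here’s a minimal sketch in Python with NumPy, run on a synthetic signal; the function name, threshold, and pause length are illustrative choices, not part of any real detection tool.

```python
import numpy as np

def find_long_pauses(samples, sample_rate, threshold=0.01, min_pause_s=1.0):
    """Return (start_s, end_s) spans where the signal stays below
    `threshold` amplitude for at least `min_pause_s` seconds."""
    quiet = np.abs(samples) < threshold
    pauses, start = [], None
    for i, q in enumerate(quiet):
        if q and start is None:
            start = i                      # a quiet run begins
        elif not q and start is not None:
            if (i - start) / sample_rate >= min_pause_s:
                pauses.append((start / sample_rate, i / sample_rate))
            start = None                   # the quiet run ended
    if start is not None and (len(quiet) - start) / sample_rate >= min_pause_s:
        pauses.append((start / sample_rate, len(quiet) / sample_rate))
    return pauses

# Toy "recording": 1 s of tone, an unnatural 2 s of dead silence, 1 s of tone.
rate = 8000
t = np.linspace(0, 1, rate, endpoint=False)
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
audio = np.concatenate([tone, np.zeros(2 * rate), tone])

pauses = find_long_pauses(audio, rate)
print(pauses)  # one pause of roughly 2 seconds starting at the 1 s mark
```

Real speech has background noise and breath sounds, so in practice you’d tune the threshold per recording; the point is just that gap lengths are measurable, not a matter of opinion.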

You can hear those pauses in the fake audio recording of President Biden that someone circulated during the primaries earlier this year. In the message, the fake Biden tries to convince voters not to show up for the primary, saying: “Voting this Tuesday will only enable Republicans in their quest to elect… Donald Trump… again.”

Listen for minimal emotion and variation in the voice

Likewise, AI voices tend to sound a little flat. Many of them are convincing, but if you listen closely, you’ll hear fewer shifts in tone than you’d expect from most human speakers.
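That “flatness” can also be put in rough numbers: a monotone voice shows less variation in pitch from moment to moment than an expressive one. The sketch below is a crude illustration of the idea, assuming nothing about any real detector: it uses the per-frame zero-crossing rate as a cheap stand-in for pitch (real pitch trackers are far more sophisticated) and compares a constant-pitch tone with one whose pitch wobbles.

```python
import numpy as np

def zcr_per_frame(samples, frame_len):
    """Crude per-frame pitch proxy: the zero-crossing rate of each frame.
    Higher pitch -> more sign changes per frame."""
    n = len(samples) // frame_len
    frames = samples[: n * frame_len].reshape(n, frame_len)
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

rate = 8000
t = np.linspace(0, 2, 2 * rate, endpoint=False)

monotone = np.sin(2 * np.pi * 150 * t)               # fixed 150 Hz "pitch"
freq = 150 + 60 * np.sin(2 * np.pi * 0.5 * t)        # pitch wobbles 90-210 Hz
expressive = np.sin(2 * np.pi * np.cumsum(freq) / rate)

# Standard deviation of the pitch proxy across 50 ms frames:
flat_mono = np.std(zcr_per_frame(monotone, 400))
flat_expr = np.std(zcr_per_frame(expressive, 400))
print(flat_mono < flat_expr)  # True: the monotone signal varies far less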

It’s funny, too, because these models can reproduce the sound of someone’s voice very accurately, yet often miss the mark when imitating the speaker’s rhythms and emotions. Check out the celebrity examples in the PlayHT generator: if you listen to the Danny DeVito sample, it’s obviously imitating DeVito’s voice. But it doesn’t capture the ups and downs of his particular way of speaking. It sounds flat. There is some variation here: the bot saying “Oh Danny, you’re Italian” sounds quite realistic. But soon after, the line “I was on the leaning tower of Pisa” doesn’t fit him at all. The last word of the clip, “sandwich,” sounds especially off. The Zach Galifianakis recording further down the page has a similar problem: there are some convincing uses of “um” that make it sound casual, but most of the sample lacks emotion or inflection.

Again, things are moving quickly here. Companies like OpenAI are training their models to be more expressive and responsive in voice conversations. GPT-4o’s Advanced Voice Mode is probably the closest anyone has come to a convincing AI voice, especially one that can hold “conversations” in real time. Still, there are flaws you can notice if you listen carefully. In the video below, listen to the bot say “opposite, adjacent, and hypotenuse” (especially “hypotenuse”). Here GPT-4o pauses, the realistic variation drops away, and the voice turns a bit more robotic as it strings these unusual words together.

It’s very subtle: the bigger clues are probably the pauses it inserts between words, like the one before “opposite.” The way it slows down on these words is probably a tell, too, but it’s impressive how natural the model sounds otherwise.

Is a celebrity or politician saying something funny or provocative?

Detecting an AI voice isn’t just about spotting flaws in the output, especially when it comes to recordings of “celebrities.” AI-generated speech from people with power and influence tends to be one of two things: silly or provocative. Maybe someone on the internet wants a video of a celebrity saying something funny, or a bad actor wants to convince you that a politician said something that makes you angry.

Most people who see a video of Trump, Biden, and Obama playing video games together don’t actually think it’s real: it’s an obvious joke. But it’s not hard to imagine someone who wants to disrupt an election creating a fake recording of a candidate, laying it over a video, and uploading it to TikTok or Instagram. Elon Musk shared one such video on X, featuring a fake recording of Kamala Harris, without disclosing that it was made with artificial intelligence.

That’s not to discount the actual content: if a candidate says something that might call their fitness for office into question, it’s worth paying attention to. But as we enter what is sure to be a divisive election season, being skeptical of these kinds of recordings will matter more than ever.

Part of the solution here is to check the source of the audio: who posted it? A media organization, or just some random Instagram account? If the recording is real, many news outlets will likely pick it up quickly. If an influencer shares something that conveniently aligns with their point of view without citing a proper source, pause before sharing it yourself.

You can try an AI voice detector (but know the limitations)

There are tools that advertise themselves as “AI voice detectors,” claiming to tell whether an audio recording was created with machine learning. PlayHT has one such detector, and ElevenLabs has a detector built specifically to find audio created with the company’s own tools.

However, as with all AI media detectors, take these tools with a grain of salt. AI audio detectors use AI to look for signs of generated audio, such as missing frequencies, absent breathing, and robotic timbre (some of which you may be able to hear yourself). But these models are only good at detecting what they know: if they encounter audio with variables they weren’t trained on, such as poor recording quality or heavy background noise, it can stump them.
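To make one of those cues concrete: “missing frequencies” refers to the fact that some generators render little or no energy in the upper part of the spectrum, whereas a real microphone capture contains broadband content. Below is a minimal sketch of that single check, not how any commercial detector actually works; the 8 kHz cutoff and the synthetic signals are illustrative assumptions.

```python
import numpy as np

def high_band_energy_ratio(samples, sample_rate, cutoff_hz=8000):
    """Fraction of spectral energy above `cutoff_hz` -- a crude proxy
    for the 'missing frequencies' cue."""
    spectrum = np.abs(np.fft.rfft(samples)) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1 / sample_rate)
    return spectrum[freqs >= cutoff_hz].sum() / spectrum.sum()

rng = np.random.default_rng(0)
rate = 44100
natural = rng.standard_normal(rate)  # broadband, like a real mic capture

# Simulate a generator that renders nothing above ~8 kHz:
# zero out the high-frequency bins and transform back.
spec = np.fft.rfft(natural)
freqs = np.fft.rfftfreq(rate, d=1 / rate)
spec[freqs >= 8000] = 0
synthetic = np.fft.irfft(spec, n=rate)

ratio_natural = high_band_energy_ratio(natural, rate)
ratio_synth = high_band_energy_ratio(synthetic, rate)
print(ratio_natural > 0.5)   # True: plenty of energy above 8 kHz
print(ratio_synth < 0.01)    # True: the "generated" audio has almost none
```

This also shows why such detectors are fragile: re-encoding real audio at a low bitrate strips high frequencies too, so this cue alone would flag a genuine but heavily compressed recording.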

Another problem? These tools are trained on the technology available now, not the AI audio models that are coming out next. They might be able to detect the examples in this article, but if someone made a fake recording of Tim Walz with a brand-new model tomorrow, they might not catch it.

Earlier this year, NPR tested three AI detection tools and found that two of them, AI or Not and AI Voice Detector, were wrong about half the time. The third, Pindrop Security, correctly identified 81 of the 84 sample clips submitted, which is impressive.

If you have a recording that you’re not sure about, you can try one of these tools. Just understand the limitations of the programs you use.
