November 10, 2013, 6:57 PM — In early October, CNN revealed that veteran voice actor Susan Bennett was the voice behind Siri until Apple changed it in iOS 7. Her utterances, she revealed in an interview, were being used by the tech giant (and its likely voice synthesis partner Nuance) to generate the digital assistant's own words.
Of course, even a company as technologically sophisticated as Apple is unlikely to have figured out a way to clone Ms. Bennett and place tiny copies of her inside every iPad and iPhone. Which makes for a question more fascinating than that of Siri's identity: How exactly is a person's voice transformed into a software program that can synthesize any text thrown at it?
My voice is my passport
In Sneakers, a much underrated movie that seems oddly appropriate in today's era of government spying on its own citizens, Robert Redford's ragtag team of hackers manages to bypass a sophisticated voice-based security system by splicing together individual words taped from an unsuspecting employee.
The process of giving voice to iOS's digital assistant may not be all that different, although it is far more thorough. "For a large and dynamic synthesis application, the voice talent (one or more actors) will be needed in the recording studio for anywhere from several weeks to a number of months," says veteran voice actor Scott Reyns, who is based in San Francisco. "They'll end up reading from thousands to tens of thousands of sentences so that a good amount of coverage is recorded for phrasing and intonation."
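The Sneakers trick can be sketched in a few lines of code. This is only a toy, assuming each recorded word is stored as a short list of audio samples; the names `splice`, `coverage`, and `recordings` are made up for illustration, and nothing here reflects how Apple or Nuance actually assemble Siri's speech. It does show why coverage matters: any word missing from the recordings simply cannot be spoken.

```python
from typing import Dict, List

# Hypothetical stand-in for recorded audio: a clip is just a list of samples.
Clip = List[float]


def splice(text: str, recordings: Dict[str, Clip], gap: int = 1) -> Clip:
    """Join the recorded clip for each word, with `gap` silent samples between words."""
    output: Clip = []
    for i, word in enumerate(text.lower().split()):
        if word not in recordings:
            # No recording means the word cannot be spoken at all.
            raise KeyError(f"no recording for {word!r}")
        if i > 0:
            output.extend([0.0] * gap)
        output.extend(recordings[word])
    return output


def coverage(sentences: List[str], recordings: Dict[str, Clip]) -> float:
    """Fraction of the distinct words in `sentences` that have a recording."""
    words = {w for s in sentences for w in s.lower().split()}
    if not words:
        return 1.0
    return sum(1 for w in words if w in recordings) / len(words)


# Made-up sample values; real clips would hold thousands of samples each.
recordings = {
    "my": [0.1, 0.2],
    "voice": [0.3],
    "is": [0.4],
    "passport": [0.5, 0.6],
}

spliced = splice("My voice is my passport", recordings)
share = coverage(["my voice is my passport", "verify me"], recordings)
```

Here `share` comes out below 1.0 because "verify" and "me" were never recorded, which is the gap that weeks of studio sessions are meant to close.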
As you can imagine, the complexity of this process varies from language to language. In English, the wrong intonation (say, failing to inflect a question) results in a voice that sounds unnatural but doesn't necessarily alter the meaning of the words that are spoken.
That's not always the case, according to Arash Zafarnia, director for consulting firm Handsome, also based in San Francisco. "Compare that to Chinese, where tone and intonations are essential to distinguishing words that have the same vowels and consonants," he says, and you end up with a whole new level of difficulty. For this reason, consistency is key in obtaining a good voice sample. "The same words and phrases have to be repeated dozens of times. The voice of the actor should not change at all--it must stay consistent through all the period of recordings in order to produce the best result possible," says Zafarnia.