At first glance, the challenge might seem like an easy one. After all, Internet search engines do these sorts of searches millions of times a day. But it is not so easy, Ferrucci said.
"There is a misconception that [the computer] is just looking the answer up somewhere. I wish it were that easy," he said. Google and other Internet search services return only documents that may contain the answers, not the answers themselves. And databases hold material that can be accessed only through precisely worded queries.
"The reality is that you have to interpret the question and relate the question to the millions of different ways that the answer might be expressed," Ferrucci said.
The software orchestrating the process of returning an answer is called DeepQA. It combines capabilities in natural language processing, machine learning and information retrieval.
When given a question, the software initially analyzes it, identifying any names, dates, geographic locations or other entities. It also examines the phrase structure and the grammar of the question for hints of what the question is asking.
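As a rough illustration of this first analysis step, the sketch below pulls years and capitalized, name-like phrases out of a clue. It is not DeepQA's code: the function name `analyze_question` and both regular expressions are assumptions made for this example, and a real system would rely on far richer language processing.

```python
import re

# Hypothetical toy patterns, chosen for illustration only.
YEAR = re.compile(r"\b(?:1\d{3}|20\d{2})\b")
NAME = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b")


def analyze_question(text):
    """Return the years and capitalized name-like phrases found in a clue."""
    years = YEAR.findall(text)
    # Skip a match at position 0: that word is probably capitalized only
    # because it starts the sentence.
    names = [m.group() for m in NAME.finditer(text) if m.start() != 0]
    return {"years": years, "names": names}


if __name__ == "__main__":
    clue = "This city hosted the World Fair in 1889, near the Seine River"
    print(analyze_question(clue))
```

Even this crude pass shows why the step matters: the extracted entities become the terms that later searches are built around.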
Sometimes the question is an obvious one, and a query to a specific database will do the trick. Most times, however, the question will kick off five or 10 searches across different data sources, each an interpretation of what the question might be.
For this challenge, IBM has amassed an immense amount of reference material, including multiple encyclopedias, millions of news stories, novels, plays and other digital books. Some of the material is in structured databases; other material resides in unstructured text files.
The process is iterative. One set of results may call for a new round of searches. "So, now you might have hundreds of processes, each generating additional candidate answers. Imagine that fan-out," Ferrucci said. The end result may be 10,000 sets of possible questions and their corresponding answers.
Of course, Jeopardy requires only a single answer, preferably the right one. So once all the possible answers are collected, the system uses about 100 algorithms to rate each one, assessing it from different perspectives: Does the answer match the approximate time frame that the question hints at? Is it in the right geographic region? Does the grammatical form of the answer match what is required by the question? A categorical check is done: If the question is looking for a kind of liquid, is the answer a kind of liquid?
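The idea of rating each candidate from several perspectives can be sketched as follows. This is a simplified illustration under stated assumptions, not IBM's method: the `Candidate` and `Clue` fields, the three scorer functions, the 10-year window and the plain average used to combine scores are all hypothetical stand-ins for the roughly 100 algorithms the article mentions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    year: int       # year associated with the candidate answer
    region: str     # geographic region associated with it
    category: str   # what kind of thing it is, e.g. "liquid"


@dataclass
class Clue:
    year_hint: int
    region_hint: str
    category_hint: str


def time_score(c, q):
    # 1.0 if the candidate falls within 10 years of the hinted time frame.
    return 1.0 if abs(c.year - q.year_hint) <= 10 else 0.0


def region_score(c, q):
    return 1.0 if c.region == q.region_hint else 0.0


def category_score(c, q):
    # The categorical check: is the answer the kind of thing asked for?
    return 1.0 if c.category == q.category_hint else 0.0


SCORERS = [time_score, region_score, category_score]


def score(c, q):
    """Combine the individual scores with a plain average (an assumption)."""
    return sum(s(c, q) for s in SCORERS) / len(SCORERS)


def rank(candidates, q):
    """Order candidates from highest to lowest combined score."""
    return sorted(candidates, key=lambda c: score(c, q), reverse=True)
```

A candidate that passes every check scores 1.0 here, while one that matches only the time frame scores about 0.33, so it sinks in the ranking.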
If the answer with the highest score meets a preliminary threshold of confidence, that answer will be submitted.
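That final decision might look like the minimal sketch below. The function name `choose_answer`, the dictionary of scores and the default threshold of 0.5 are assumptions for illustration; the article does not say what threshold value the system uses.

```python
def choose_answer(scored, threshold=0.5):
    """Return the top-scoring answer if it clears the threshold, else None.

    `scored` maps each candidate answer to its confidence score.
    The 0.5 default is an arbitrary placeholder.
    """
    if not scored:
        return None
    best_answer, best_score = max(scored.items(), key=lambda kv: kv[1])
    return best_answer if best_score >= threshold else None


if __name__ == "__main__":
    print(choose_answer({"mercury": 0.9, "granite": 0.3}))
    print(choose_answer({"granite": 0.3}))
```

Returning `None` models the case in which no candidate is trusted enough to submit.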