The company used a number of algorithms developed in academia for finding hidden matches in DNA. But the engineers at Ancestry.com had to parallelize the algorithms to run them across a multinode Hadoop deployment. Using traditional scale-up architectures, it would take Ancestry.com up to four weeks to compare 120,000 sets of DNA.
Also at the conference, Vaclav Petricek, director of machine learning at eHarmony, described how the online dating service uses Hadoop to make better matches among its customers.
Like Ancestry.com's DNA service, the fundamental problem eHarmony tackles is a massively parallel one. The service wants to find a set of potential suitors for each member of the service, which involves doing many comparisons across a large number of factors, while slimming down the result sets to manageable proportions.
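As a rough illustration of why this kind of workload parallelizes well, the sketch below scores each member against every other member and keeps only a short list of the best candidates. It is a toy, not eHarmony's method: the `MEMBERS` data, the `similarity` function and the `top_matches` and `match_all` helpers are all hypothetical, and a local thread pool stands in for the independent tasks a Hadoop cluster would run.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

# Hypothetical toy data: member id -> a few personality scores.
MEMBERS = {
    "a": (0.9, 0.1, 0.5),
    "b": (0.8, 0.2, 0.4),
    "c": (0.1, 0.9, 0.5),
    "d": (0.85, 0.15, 0.55),
}

def similarity(x, y):
    """Illustrative score only: negative squared distance (higher is closer)."""
    return -sum((p - q) ** 2 for p, q in zip(x, y))

def top_matches(member_id, k=2):
    """Compare one member with everyone else and keep only the k best."""
    me = MEMBERS[member_id]
    scored = (
        (similarity(me, traits), other)
        for other, traits in MEMBERS.items()
        if other != member_id
    )
    return member_id, [other for _, other in heapq.nlargest(k, scored)]

def match_all(k=2):
    # No member's task depends on another's result, so the tasks
    # can be handed to workers in any order.
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(lambda m: top_matches(m, k), MEMBERS))

if __name__ == "__main__":
    for member, matches in match_all().items():
        print(member, matches)
```

Because each call to `top_matches` reads shared data but writes nothing shared, the same per-member function could in principle be distributed across many machines without coordination between tasks.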
"We want to give people enough options to keep them engaged, but we don't want to overwhelm them," Petricek said. "Because this is an embarrassingly parallel problem, you can run this on Hadoop in parallel."
EHarmony customers fill out an extensive questionnaire, which helps estimate each user's personality across 29 different dimensions.
The system first uses algorithms to predict how happy two potential matches would be if they were married, drawing on scientific studies that describe the personality traits of people in both happy and "distressed" marriages, Petricek said. If their personality types indicate they would be happy in a marriage together, the two are considered for pairing.
This is only the first step, however. EHarmony must also predict how attracted two potential matches would be to one another.
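The two steps can be pictured as a filter followed by a ranking. The sketch below is only an assumption about the general shape of such a pipeline: the `shortlist` function, the `compat` and `attract` fields, the 0.7 threshold and the sample scores are all invented for illustration and do not come from eHarmony.

```python
def shortlist(candidates, min_compat=0.7, limit=2):
    """Stage 1: keep pairs above a compatibility threshold.
    Stage 2: rank the survivors by predicted attraction."""
    compatible = [c for c in candidates if c["compat"] >= min_compat]
    ranked = sorted(compatible, key=lambda c: c["attract"], reverse=True)
    return [c["id"] for c in ranked[:limit]]

# Hypothetical precomputed scores for one member's candidates.
CANDIDATES = [
    {"id": "p1", "compat": 0.90, "attract": 0.20},
    {"id": "p2", "compat": 0.50, "attract": 0.95},
    {"id": "p3", "compat": 0.80, "attract": 0.70},
    {"id": "p4", "compat": 0.75, "attract": 0.40},
]

if __name__ == "__main__":
    print(shortlist(CANDIDATES))
```

In this toy data, candidate `p2` has the highest attraction score but is dropped in the first stage because the compatibility score falls below the threshold.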
"There is no guarantee that people who have compatible personalities would be interested in each other," Petricek said.
Gauging attractiveness between two people is where big-data-style machine learning comes in. The service keeps track of a wide range of additional variables about its members, from the types of devices they use to interact with eHarmony to whether each individual is single or divorced. The company also tracks the flow of messages among its members, charting which exchanges led to successful matches and looking for indicators among all the known variables as to why those matches were successful.
For instance, one fairly predictable variable has been distance. The farther apart two people are geographically, the less likely they are to pick one another from a list of candidates. Another variable is the difference in height between the two members of a potential heterosexual couple. On average, the two people are most likely to communicate if the male is 4 to 8 inches taller than the female. The company does not know ahead of time which factors will prove to be predictors of compatibility, so Hadoop churns through all the combinations of all the variables looking for clues.
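One simple way to picture that search is to score every single variable, and every pair of variables, against the outcome and then sort by strength of association. The article does not say which statistics eHarmony uses, so the sketch below assumes a plain Pearson correlation; the `PAIRS` records, the `FEATURES` list and the `rank_signals` helper are hypothetical.

```python
from itertools import combinations
from math import sqrt

# Hypothetical records: one row per candidate pair, plus whether they communicated.
PAIRS = [
    {"distance_km": 5, "height_diff_in": 6, "same_device": 1, "communicated": 1},
    {"distance_km": 12, "height_diff_in": 5, "same_device": 0, "communicated": 1},
    {"distance_km": 30, "height_diff_in": 2, "same_device": 1, "communicated": 1},
    {"distance_km": 200, "height_diff_in": 7, "same_device": 0, "communicated": 0},
    {"distance_km": 450, "height_diff_in": 1, "same_device": 1, "communicated": 0},
    {"distance_km": 800, "height_diff_in": 10, "same_device": 0, "communicated": 0},
]
FEATURES = ["distance_km", "height_diff_in", "same_device"]

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def rank_signals(rows, features, outcome="communicated"):
    """Score each feature and each product of two features against the outcome."""
    ys = [r[outcome] for r in rows]
    candidates = {(f,): [r[f] for r in rows] for f in features}
    for f, g in combinations(features, 2):
        candidates[(f, g)] = [r[f] * r[g] for r in rows]
    scored = [(abs(pearson(xs, ys)), names) for names, xs in candidates.items()]
    return sorted(scored, reverse=True)

if __name__ == "__main__":
    for score, names in rank_signals(PAIRS, FEATURES):
        print(f"{score:.2f}", " x ".join(names))
```

A linear correlation would miss a "sweet spot" effect such as the 4-to-8-inch height difference, which is one reason a real system would likely try many transformations and combinations rather than a single measure.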