Performing human-subjects experiments on Amazon Mechanical Turk offers many benefits, including very low experiment costs, quick turn-around rates, and relatively simple approvals from human subjects boards. But you have to be careful to avoid bias and error; we describe some techniques below. Feel free to add your insights in the comments.
The basics and benefits of Mechanical Turk
Amazon Mechanical Turk, or MTurk for short, is an online crowdsourcing marketplace that lets you outsource tasks to other users and compensate them through a built-in payment scheme. The people who perform MTurk tasks are often referred to as "Turkers"; the tasks are called Human Intelligence Tasks (HITs) and typically pay just a few cents each.
Low cost. Turkers seldom earn more than $5-10 an hour. It may appear that such low earnings potential would severely skew the demographics, but that doesn't actually seem to be the case. In particular, a large portion of users are not from developing countries, because there are only two ways to cash out: spend the earnings on products sold by Amazon, or transfer earnings to a U.S. bank account. [Panos Ipeirotis provides an excellent overview of MTurk demographics.]
Anonymity. The MTurk terms of service do not allow you to ask for identifying information, use the platform to commit crimes (e.g., clickfraud), or use it to violate the terms of service of other providers. Since the service doesn't offer "verified" user profiles, users can lie about their demographic group to qualify for a study. But given the built-in anonymity of the system, they may also be more honest in answering stigmatizing questions. We have evidence that this is often the case.
Speed and access to subjects. It's possible to get several hundred subjects in just a few days — though of course the number depends on the task, how much you're willing to pay, and how many restrictive qualifications you place on participants.
#1 How to hide what your study is about and avoid other biases
Whatever your motive for hiding your study (protecting an organization, obscuring an idea, or, most commonly, avoiding selection bias), the solution is simple. First, run a first-phase study that doesn't introduce any bias but lets you select subjects for follow-up studies. Then, as a second step, select subjects from this pool of turkers and ask them to visit a website to perform the "real" task. This way, you don't have to worry about a biased selection, such as passionate baseball players signing up for a study of average baseball knowledge. And if you only want people from a certain demographic, say, males between the ages of 30 and 35, then screen for those in the first phase.
In many situations, subjects may be biased by the knowledge that they are participating in a study, or by knowing the goals of a study. They may think that in order to please you – the experimenter – they should respond in a particular way. They may lie in order to hide an embarrassing truth about themselves. They may pay more attention to some aspects than they normally would have, because they know that this is what they are tested on. [See "Phishing IQ Tests Measure Fear, Not Ability" in Financial Cryptography and Data Security (2008) for more about this bias.]
For example: if you show people a website and ask whether it is a phishing site, they are far more likely to scrutinize the site, detect an aberration, and respond "yes" than they would be if they came across the site in a real-life situation. A large body of recent research (including mine) has examined how to perform such studies. In the particular case of phishing, the resulting technique is referred to as a "phishing stint" or as a "naturalistic phishing experiment". [For addressing ethical aspects of such experiments, see "Designing and Conducting Phishing Experiments" in IEEE Technology and Society magazine's Special Issue on Usability and Security (2007).]
To perform any such naturalistic study, you need to convey a different task to your subject than what you are observing – essentially deceive them – to see how they react when faced with the situation of interest. You may, for example, say that you are studying the common reaction to online e-commerce sites, and ask them to rate how helpful various sites are, adding an additional free-text input field where they can add other observations. You first show them three or four perfectly legitimate websites, asking them to rate and describe them; then you show them a phishing site and do the same. Will they tell you that this is a site run by fraudsters? If they do, they noticed signs of fraud without you prompting them.
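As a rough illustration of how the free-text answers from such a study might be tallied, here is a minimal Python sketch. The keyword list and function names are assumptions made for this example; keyword matching is crude, so treat the result as a first pass before reading the comments by hand.

```python
import re

# Hypothetical keyword list; adjust it to the wording your subjects actually use.
FRAUD_PATTERN = re.compile(
    r"\b(phish\w*|scam\w*|fraud\w*|fake|spoof\w*)\b", re.IGNORECASE
)

def mentions_fraud(comment: str) -> bool:
    """Return True if a free-text comment raises fraud without being prompted."""
    return bool(FRAUD_PATTERN.search(comment))

def unprompted_detection_rate(comments: list[str]) -> float:
    """Fraction of comments about the phishing site that mention fraud."""
    if not comments:
        return 0.0
    return sum(mentions_fraud(c) for c in comments) / len(comments)
```

Running the same tally on the comments for the legitimate sites gives a baseline for how often subjects cry fraud when there is none.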
You can also perform much more invasive studies where you actually attempt to defraud subjects, just to see what portion of them fall for it. But this has to be done with extreme care, or you'll become a criminal! Your IRB will offer you plenty of advice if you decide to try an experiment of this type; be sure to read up on some ways in which it has been done successfully before submitting your application. This area is full of pitfalls, and deserves a separate explanation. [See "Social Phishing" in the Communications of the ACM (2007) for an example.]
#2 How to avoid cheaters
People may supply arbitrary information (to save time, hide personal information, increase their chances of getting paid, or simply because they're lazy) or even lie (to preserve self-image or to intentionally destroy a study). Here are some techniques for detecting and deterring these kinds of behavior:
- Don't lead. Instead of asking "Are you male and between 30 and 35 years of age?", ask for gender and age group. You will get a lot of responses that don't help you towards the second phase of the study, but by obscuring what you're looking for it's difficult for people to lie to "qualify". Furthermore, if you care about the age group but not exact age, then only ask for the age range.
- Make it easier for users to opt out than opt in. When recruiting males of age 30-35, you can have a question 3 "If you responded 'female' to question 1, then skip to question 4; otherwise describe how your shaving lotion smells in 30 words or less." This makes it easier for "lazy cheaters" and liars to opt out. You don't really care what their shaving lotion smells like; you just want to make it more "expensive" to claim to be male than female, since you supposedly want to recruit only males in the follow-up study. Male cheaters will claim to be female; female cheaters will probably abstain from cheating.
- Avoid technical or brand names early on. Let's say you want to find BlackBerry phone owners; don't ask "Do you own a BlackBerry?" Instead, ask some collection of questions that allow you to narrow down who has the kind of phone you are interested in and finally ask for the brand and style of their phone. The reason is that many people may be confused about brand names — in a recent study of mine, several subjects claimed to own "Nokia Blackberries". I don't want these users to enroll in my follow-up study.
- Ask error-detecting questions. In addition to asking people what kind of phone they have, for example, you can ask whether their phone has some specific features: "Can you download applications from the Apple app store for your phone?" An answer that contradicts the claimed phone brand flags a confused or careless respondent. Some of these questions may have nothing to do with what you really care about, and may simply be included to obfuscate the purpose of the other questions.
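The screening techniques above could be applied to first-phase responses along the following lines. This is a minimal Python sketch, not a prescribed method: the record fields, the target demographic, and the single consistency rule (a BlackBerry owner should not report using Apple's app store) are all assumptions chosen to mirror the examples in the list.

```python
from dataclasses import dataclass

@dataclass
class ScreeningResponse:
    worker_id: str
    gender: str                 # asked neutrally, e.g. "male" or "female"
    age_group: str              # asked as a range, e.g. "30-35"
    phone_brand: str            # asked last, after the feature questions
    uses_apple_app_store: bool  # error-detecting question

def is_consistent(r: ScreeningResponse) -> bool:
    """Hypothetical rule: a BlackBerry owner should not use Apple's app store."""
    return not (r.phone_brand.lower() == "blackberry" and r.uses_apple_app_store)

def select_for_follow_up(responses, gender="male", age_group="30-35"):
    """Return worker IDs that match the hidden target group and answer consistently."""
    return [
        r.worker_id
        for r in responses
        if r.gender == gender and r.age_group == age_group and is_consistent(r)
    ]
```

The target group appears only in the selection code, never in the questionnaire, which is what makes it hard for a respondent to lie in order to qualify.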
Once you have selected subjects and asked them to participate in a follow-up study, you can ask additional "error-detecting" questions (or even repeat some of the questions from the first phase). This improves your chances of catching cheaters, especially lazy liars or liars who filled in the first-phase form in an arbitrary manner. They won't know how to answer in a consistent manner.
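A cross-phase consistency check of this kind might look like the sketch below. The data layout (a dictionary of answers per worker ID) and the function names are assumptions for illustration; answers are compared after trimming whitespace and ignoring case.

```python
def _normalize(value):
    """Compare answers case-insensitively and ignore surrounding whitespace."""
    return None if value is None else str(value).strip().lower()

def inconsistent_workers(phase1, phase2, repeated_questions):
    """Return IDs of workers whose repeated answers differ between the two phases.

    phase1 and phase2 map worker_id -> {question: answer}. A worker who appears
    in phase2 but has no phase1 record is flagged too, since nothing can be verified.
    """
    flagged = []
    for worker_id, second in phase2.items():
        first = phase1.get(worker_id, {})
        if any(
            _normalize(first.get(q)) != _normalize(second.get(q))
            for q in repeated_questions
        ):
            flagged.append(worker_id)
    return flagged
```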
#3 How to determine what to pay, and when
One of the most compelling benefits of MTurk is how inexpensive it is to carry out experiments on it. Some researchers may be tempted to pay no more than what is needed to get the work done. I am against that: I believe that if you pay peanuts, you get peanuts. If you are very clearly trying to minimize your payments, subjects will respond by minimizing their effort or avoiding the HIT altogether. An average HIT requiring a minute of the user's time may pay 5-10 cents, which corresponds to an hourly wage of $3-6. But why pay minimum wages if paying four times as much is still an incredible deal for you? I would pay about 25 cents for a minute's effort.
To determine the best price, I've performed simple experiments where I ask people to answer the same question at different prices. When you pay a bit more, the results often improve, and the higher price also signals to subjects the level of effort you expect.
But paying well also introduces problems. If you pay more than others, you may skew your subject distribution by getting people who focus excessively on the payment. They, in turn, may be willing to rationalize a bit more than you want. My approach is to first use a screening study (like the ones I describe above) where I don't offer to pay above the norm. Then, I pay the users the two cents they expected, plus an immediate bonus of another two cents (which doesn't cost a lot, but gets people's attention). Finally, I offer a follow-up study in which I pay quite respectably, say, 60 cents for a two-minute effort. That is very inexpensive as far as subject reimbursement goes, but still means an hourly wage of $18, significantly above the hourly wage for average MTurk tasks.
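The wage figures in this section all come from the same conversion, shown here as a small Python helper. The function name is made up for this example; the payments and durations are the ones quoted above.

```python
def hourly_wage_dollars(payment_cents: int, minutes: float) -> float:
    """Convert a per-task payment into the hourly wage it implies."""
    return payment_cents / minutes * 60 / 100

# Figures quoted in the text:
typical_low = hourly_wage_dollars(5, 1)    # 5 cents for one minute
typical_high = hourly_wage_dollars(10, 1)  # 10 cents for one minute
suggested = hourly_wage_dollars(25, 1)     # 25 cents for one minute
follow_up = hourly_wage_dollars(60, 2)     # 60 cents for two minutes
```

These evaluate to $3, $6, $15 and $18 an hour respectively. Time a few pilot subjects before trusting your own estimate of how long a task takes, since the implied wage is only as good as that estimate.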
It's also a good idea to set realistic expectations for when a subject is to be paid. This is particularly true when your technique involves bonus payments. Some people get skittish and wonder if they will be paid, and if so, when. You don't want a few hundred inquiries. Tell them that it may take a few days, because you pay in batches.
#4 How to perform multi-stage and complex experiments
MTurk is not optimized for following up with a subject after a few months. It does allow you to assign predicates to each user who performs a task, and later offer HITs only to users who have (or who don't have) predicates of your choice. That's a bit complicated, though, and leads to much lower opt-in rates than directly contacting the desired subjects. Here's what you can do:
- Post a HIT where you ask turkers to perform some task, which allows you to collect demographics or other information for future subject calls. The task might appear independent of your study, and primarily serve to classify potential subjects so that you later can determine whom to ask to participate.
- Use the "send an email message" feature to ask subjects to participate in a follow-up study. Because MTurk acts as an anonymous proxy, subjects receive an email showing your email address, but their contact information is never disclosed to you. You can also send an email by clicking on "Worker ID" in the list of completed tasks.
- Include a URL in your email. This URL could be personalized for each subject, so once you know a given URL was accessed (and task completed), you would know whom to pay. Alternatively, you could ask the subject to enter a "payment code" on the site. You need to give them this piece of information in the email, and it has to be unique so that you can pay them after the task is performed. I like using the temporary user pseudonym as an identifier. This is a tag that is specific to this user and to this HIT. You can put together the email by cutting and pasting text and worker IDs which you can obtain from the MTurk site.
- Pay a "bonus" payment once a follow-up task has been completed. How? Look up the pseudonym you received from the list of completed HITs from the task described in step 1 above, click on the worker ID, and press "Give bonus payment". Regular payments that are not part of a multi-phase study are paid using the normal MTurk payment interface, which is straightforward and allows payment of several subjects at the same time. [To simplify your life and increase efficiency, you can write a script that handles subject interaction and payments. However, the script has to be able to parse error messages from Amazon, which aren't uncommon.]
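One way to produce the personalized URL or payment code from the steps above is sketched below in Python. Note that this swaps in a derived code (an HMAC of the worker ID and HIT ID) in place of the raw pseudonym, so that the identifier shown on the external site reveals nothing by itself. The secret key, the base URL, and all function names are assumptions for this example, not part of MTurk.

```python
import hashlib
import hmac
from urllib.parse import urlencode

SECRET_KEY = b"replace-with-a-private-random-key"  # assumption: known only to you
BASE_URL = "https://example.org/followup"          # hypothetical study URL

def payment_code(worker_id: str, hit_id: str) -> str:
    """Derive a short code that is specific to this worker and this HIT."""
    message = f"{worker_id}:{hit_id}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()[:12]

def personalized_url(worker_id: str, hit_id: str) -> str:
    """Build the link to paste into the email for one subject."""
    return f"{BASE_URL}?{urlencode({'code': payment_code(worker_id, hit_id)})}"

def worker_for_code(code: str, assignments):
    """Map a code reported by the survey site back to the worker to pay.

    assignments is an iterable of (worker_id, hit_id) pairs from phase one.
    """
    for worker_id, hit_id in assignments:
        if hmac.compare_digest(payment_code(worker_id, hit_id), code):
            return worker_id
    return None
```

Because the code is recomputed from the worker ID and HIT ID, there is no separate lookup table to maintain: when a completed survey reports a code, `worker_for_code` tells you who earned the bonus.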
For constructing and deploying more complicated surveys, MTurk offers a programmable tool. But it isn't easy to do, and doesn't offer an easy visualization of results. Instead, I set up a survey on SurveyMonkey and link to that in my recruiting message.
It's also possible to ask your subjects to visit a URL you specify, perform a task there, and report an observation back on another site, such as MTurk. Or you can ask subjects to enter their observations on the visited site itself.
For example: you can ask subjects to visit three websites and rank them. One of them is yours or your client's; the others belong to competitors. Which one is nicest, and why? Similarly, you may have constructed a site that dispenses user advice, such as the SecurityCartoon one I run, and want to know whether typical Internet users understand a particular piece of advice. You can ask them to read the advice (include a URL for them to visit); then ask them to tell you, in their words, what it means. Or ask them to judge five fictional situations and tell you which ones are safe. If they understood the advice, that should impact their selection. Or ask them simply whether they learnt anything new, and whether they would suggest to friends of theirs concerned with the associated issue to read the advice.
#5 How to test correctness and perfect your study
Using MTurk may seem to complicate matters in terms of getting truthful answers, including demographic information. As described above, there are many ways to identify "casual liars". But how about subjects who take great pains to lie in a way that cannot be detected? How can we test their presence?
First of all, turkers enjoy anonymity: by the nature of the underlying medium, they are pseudonymous to experimenters, on an already-anonymous Internet. Even though many dispute the concept of anonymity on the Internet, and there are methods to identify pseudonymous records, turkers feel that they're anonymous. Given these assumptions, it's possible that they won't bother lying if there is no apparent advantage associated with doing so. It's also possible that those inclined to lie don't make the effort to construct particularly good lies. After all, generating a meaningful lie far exceeds task effort and remuneration.
I conducted two parallel studies: one on MTurk, and one proprietary (and much more expensive) study I helped a client perform through an established, independent survey company. The results were statistically indistinguishable from each other.
While I'm not claiming that survey results would always be the same in both traditional and MTurk mediums, you can test whether your method is likely to mimic traditional survey results. You could include components in your study whose results you can compare to representative results. If they don't match, it's likely that the rest of your results may not represent what you would have obtained using a traditional approach. You can then go back and look at how you ask questions, how you recruit, what demographics you drew from. Maybe one of these aspects introduced a bias. If you find a problem, just re-run the study with new participants.
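For a yes/no calibration question, one simple way to compare your MTurk sample against a representative reference is a two-proportion z-test, sketched below with only the Python standard library. The choice of test and the function name are assumptions for this illustration; they are not the method used in the studies described above.

```python
import math

def two_proportion_z_test(yes_a: int, n_a: int, yes_b: int, n_b: int):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value). A small p_value suggests the two samples
    answer the calibration question differently.
    """
    p_a, p_b = yes_a / n_a, yes_b / n_b
    pooled = (yes_a + yes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # everyone gave the same answer in both samples
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value
```

Keep in mind that failing to find a difference in a small sample is weak evidence of agreement, so use enough subjects for the calibration question to be meaningful.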
In fact, it's hard to design studies, and easy to make mistakes. How should you approach this? Expect mistakes. You can run a small (10-15 participant, one day) study to review responses and identify problems in the design. Fix the flaws, then run a new version and observe those results. When you're satisfied that things work, you can run your real study with a much larger number of subjects. If you're concerned that your early experiments may introduce bias into the results of your real experiment, then use an MTurk feature to effectively "ban" early participants from later experiments, by assigning them credentials that later disqualify their participation.
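If you also keep your own records, the same exclusion can be double-checked on your side before you invite or pay anyone. A minimal Python sketch, with hypothetical worker IDs and function name:

```python
def eligible_workers(candidates, pilot_participants):
    """Drop anyone who already took part in a pilot run, preserving order."""
    banned = set(pilot_participants)
    return [worker_id for worker_id in candidates if worker_id not in banned]

# Example with made-up worker IDs:
pilot_ids = ["W2", "W5"]
main_study_pool = eligible_workers(["W1", "W2", "W3", "W5"], pilot_ids)
```

This local check complements, rather than replaces, the platform-side credential, which is what actually keeps early participants from seeing the later HIT.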
Iterate until you get it right. The beauty of MTurk is the low cost of testing each iteration.
Republished with permission from PARC.