It's not as if it hasn't been done before, but researchers at Stanford quantified ways to crack the CAPTCHA test many web sites use to make users prove they're human, and recommended ways to make sure the test continues to work.
CAPTCHA is a utility invented by a group of Carnegie Mellon students in 2000 to foil automated form fillers by forcing users to type out text presented as images of words warped in ways that are difficult for spambots or other programs to read.
Though there are dozens or hundreds of variations using different types of images, text or performance requirements – one particularly vicious one requiring that users solve a mathematical formula before using a site – most sites use variations of the warped-text scheme as their test.
Using a variety of image- and text-recognition algorithms and analyzing the patterns in which various CAPTCHA schemes divide text into segments, Stanford security researchers attacked the CAPTCHA tests used by Wikipedia, Authorize, Baidu, Blizzard, CNN, Digg, eBay, Google, Reddit, Slashdot and other popular sites.
By identifying and coding consistencies among various schemes, they eventually developed an automated cracking tool called Decaptcha.
The tool, which is also available for download, successfully cracked 13 of the 15 most commonly used text-image schemes, though only 25 percent of the time, according to the paper. One of the researchers, Elie Bursztein, posted links to both the paper describing the research and the tool itself on a personal site.
Predictability is the enemy
The success rate may be low, but reliably being able to crack 13 of the 15 most common ways used to present CAPTCHA tests – not just 13 of the 15 most popular sites using CAPTCHA – pretty thoroughly skewers it as a reliable 'bot preventative.
An earlier paper from Bursztein and a slightly larger mix of other Stanford security wonks actually reinforced the reputation of CAPTCHA as a good 'bot preventative by showing even humans had trouble decoding the images, let alone 'bots.
They tested CAPTCHA against a hacking service specifically designed to break it, but found spammers would be less effective using hacking tools than Amazon's Mechanical Turk – a service that enlists humans who are paid a small fee to perform small tasks that are impossible for computers.
On average, the Image Bypass service succeeded in fooling the test 84 percent of the time, compared to 87 percent success from Mechanical Turk.
There are so many CAPTCHA hacks available there are clearinghouses that compete on the completeness and low cost of their particular collection.
There are also competing APIs for particularly popular tools, leading services that charge as little as $2 per 1,000 successful fake CAPTCHAs, and debates comparing hackers' favorite services.
CAPTCHA is beaten but not broken
Of all the sites they tested, only Google and Recaptcha consistently resisted being cracked, according to the report.
The authors – Elie Bursztein, Matthieu Martin and John C. Mitchell – may have broken CAPTCHA and defined ways others could do so as well, but their results with Google and Recaptcha led them to recommend that web sites use CAPTCHA more wisely rather than throwing it out.
Most sites use fairly generic images as CAPTCHA tests, the paper found. The more generic the image – in the length of the segments of text, form and size of the characters presented and other elements – the easier a CAPTCHA test is to break.
Using specific techniques to make CAPTCHA images harder for 'bots to decode can make the mini-test far more successful both now and in the future, the paper found.
Variation in design of the image is the key. Specifically, successful techniques include varying the length of segments, changing colors within the segments, drawing lines through certain characters, making them appear to collapse or be crushed, and periodically changing the whole CAPTCHA character scheme a site uses to confuse 'bots that have already beaten it.
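For a rough idea of what that variation could look like in practice, here is a minimal Python sketch that randomizes the design parameters of each CAPTCHA before anything is drawn. It is not taken from the paper or from Decaptcha; the names (CaptchaSpec, make_spec), the palette, the length limits and the 30 percent strike-through rate are all assumptions made up for illustration, and the step that actually renders the image is left out.

```python
import random
import string
from dataclasses import dataclass

# Hypothetical values for illustration only; none of them come from the paper.
PALETTE = ["#1b3a57", "#7a1f1f", "#2f5d2f", "#5a3d7a"]
ALPHABET = string.ascii_uppercase + string.digits


@dataclass
class CaptchaSpec:
    text: str            # characters the user must type
    colors: list[str]    # one color per character
    struck: list[bool]   # True where a line is drawn through the character
    scheme: int          # which overall character scheme is in use


def make_spec(rng: random.Random, epoch: int, n_schemes: int = 4,
              min_len: int = 5, max_len: int = 9) -> CaptchaSpec:
    """Pick randomized design parameters for one CAPTCHA (no rendering)."""
    length = rng.randint(min_len, max_len)               # vary the text length
    text = "".join(rng.choice(ALPHABET) for _ in range(length))
    colors = [rng.choice(PALETTE) for _ in text]         # change colors within the text
    struck = [rng.random() < 0.3 for _ in text]          # lines through some characters
    scheme = epoch % n_schemes                           # rotate the whole scheme over time
    return CaptchaSpec(text, colors, struck, scheme)


if __name__ == "__main__":
    # A seeded generator keeps the sketch repeatable; a live site would want
    # an unpredictable source such as random.SystemRandom().
    rng = random.Random(0)
    for epoch in range(3):
        print(make_spec(rng, epoch))
```

The epoch argument stands in for whatever schedule a site might use to swap schemes, such as a week number. A 'bot tuned to one combination of length, color and scheme would then face a different combination on the next attempt.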
It won't fix the unenviable silliness of CAPTCHA's actual name (Completely Automated Public Turing Test To Tell Computers and Humans Apart) and won't beat every spammer or hack available already.
It will, however, filter out the bulk of the junk if the images and schemes change often enough to keep the 'bots guessing.
The only question from a production perspective is whether it's cheaper for a specific site to delete a lot of spam, or spend a lot of time customizing the CAPTCHA scheme designed to filter it out.
That, like a lot of things, comes down to skills and money – specifically, how much money the business side is willing to spend to hire people with the right skills to let other people in and shut only automation out.
Read more of Kevin Fogarty's CoreIT blog and follow the latest IT news at ITworld. Follow Kevin on Twitter at @KevinFogarty.