Web puzzles don't protect against bots, but humans have spent 819 million unpaid hours solving them
Updated Google promotes its reCAPTCHA service as a security mechanism for websites, but researchers affiliated with the University of California, Irvine, argue it's harvesting information while extracting human labor worth billions.
The term CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart," and, as Google explains, it refers to a challenge-response authentication scheme that presents people with a puzzle or question that a computer cannot solve.
Such tests have been used for nearly two decades to combat fraud and other forms of online automated abuse. CAPTCHA puzzles – which may involve text, image, audio, or behavioral challenges such as clicking checkboxes – are ubiquitous online.
Google acquired the reCAPTCHA service in 2009, two years after its debut.
The search giant has revised the service since then: reCAPTCHA v2 arrived in 2014 and reCAPTCHA v3 in 2018, shortly after the shutdown of v1. Though v3 is the latest version, v2 is still used by almost three million websites.
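For context on what the service actually does for a website: the front end embeds Google's widget, which produces a token, and the site's back end then confirms that token against Google's siteverify endpoint. Below is a minimal sketch in Python – the endpoint and its secret/response parameters are part of Google's publicly documented API, while the function name and surrounding logic are illustrative only.

```python
# Minimal sketch of server-side reCAPTCHA token verification.
# The siteverify endpoint and the "secret"/"response" parameters are Google's
# documented API; the helper name and error handling here are illustrative.
import requests

VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"

def verify_recaptcha(secret_key: str, client_token: str) -> bool:
    """Send the token produced by the browser widget back to Google for checking."""
    resp = requests.post(
        VERIFY_URL,
        data={"secret": secret_key, "response": client_token},
        timeout=10,
    )
    result = resp.json()
    # For v2, "success" indicates the challenge was passed; v3 responses also
    # carry a behavioral "score" (see below).
    return result.get("success", False)
```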
The utility of reCAPTCHA challenges appears to be significantly diminished in an era when AI models can answer CAPTCHA questions almost as well as humans.
Show me the money
UC Irvine academics contend CAPTCHAs should be binned.
In a paper [PDF] titled "Dazed & Confused: A Large-Scale Real-World User Study of reCAPTCHAv2," authors Andrew Searles, Renascence Tarafder Prapty, and Gene Tsudik argue that the service should be abandoned because it's disliked by users, costly in terms of time and datacenter resources, and vulnerable to bots – contrary to its intended purpose.
"I believe reCAPTCHA's true purpose is to harvest user information and labor from websites," asserted Andrew Searles, who just completed his PhD and was the paper's lead author, in an email to The Register.
"If you believe that reCAPTCHA is securing your website, you have been deceived. Additionally, this false sense of security has come with an immense cost of human time and privacy."
The paper, released in November 2023, notes that even back in 2016 researchers were able to defeat reCAPTCHA v2 image challenges 70 percent of the time. The reCAPTCHA v2 checkbox challenge is even more vulnerable – the researchers claim it can be defeated 100 percent of the time.
reCAPTCHA v3 has fared no better. In 2019, researchers devised a reinforcement learning attack that breaks reCAPTCHA v3's behavior-based challenges 97 percent of the time.
"Version 3 is better than v2 since it is purely behavioral," noted Gene Tsudik, professor of computer science at the University of California, Irvine. "But, like v2, is not a true CAPTCHA – meaning it's not 'public' and it's not a Turing Test. It is a behavioral analytics-based method that assigns scores to user behavior. Thus it's privacy-invasive, since we (the public) don't know how it works. It's essentially a 'black box.'
"These systems were beaten before they were ever introduced on the global scale," argued Searles. "Image selection problems were solved by computers in 2009 (yet added by Google in 2014). reCATPCHA third-party cookies for behavioral detection introduced the 'click-jacking' vulnerability, making it easier to automatically bypass them."
You are the product
The authors' research findings are based on a study of users conducted over 13 months in 2022 and 2023. Some 9,141 reCAPTCHA v2 sessions were captured from unwitting participants and analyzed, in conjunction with a survey completed by 108 individuals.
Respondents gave the reCAPTCHA v2 checkbox puzzle 78.51 out of 100 on the System Usability Scale, while the image puzzle rated only 58.90. "Results demonstrate that 40 percent of participants found the image version to be annoying (or very annoying), while <10 percent found the checkbox version annoying," the paper explains.
But when examined in aggregate, reCAPTCHA interactions impose a significant cost – some of which Google captures.
"In terms of cost, we estimate that – during over 13 years of its deployment – 819 million hours of human time has been spent on reCAPTCHA, which corresponds to at least $6.1 billion USD in wages," the authors state in their paper.
"Traffic resulting from reCAPTCHA consumed 134 petabytes of bandwidth, which translates into about 7.5 million kWhs of energy, corresponding to 7.5 million pounds of CO2. In addition, Google has potentially profited $888 billion from cookies [created by reCAPTCHA sessions] and $8.75–32.3 billion per each sale of their total labeled data set."
Asked whether the costs Google shifts to reCAPTCHA users in the form of time and effort are unreasonable or exploitive, Searles pointed to the original white paper on CAPTCHAs by Luis von Ahn, Manuel Blum, and John Langford – which includes a section titled "Stealing cycles from humans."
"This basically [summarizes] how CAPTCHAs create an exploitative economy of function where nefarious bots can conscript humans to complete challenges for them," Searles explained. "It is unreasonable to make someone solve a security challenge when there is no gained security."
That cost should be borne by Google rather than website users, Searles argued. "If a service claims to detect bots then it should detect bots – especially if it's a paid service."
As the paper points out, image-labeling challenges have been around since 2004 and by 2010 there were attacks that could beat them 100 percent of the time. Despite this, Google introduced reCAPTCHA v2 with a fall-back image recognition security challenge that had been proven to be insecure four years earlier.
This makes no sense, the authors argue, from a security perspective. But it does make sense if the goal is obtaining image labeling data – the results of users identifying CAPTCHA images – which Google happens to sell as a cloud service.
"The conclusion can be extended that the true purpose of reCAPTCHA v2 is a free image-labeling labor and tracking cookie farm for advertising and data profit masquerading as a security service," the paper declares.
"I think that there is absolutely NO space for hard AI problems to exist in computer security," suggested Searles. "This has been an experiment that has enhanced some computational ability but there is no realistic or measurable security achieved from using such technology."
Google did not respond to a request for comment. ®
Updated to add at 1830 UTC
In a statement provided to The Register after this story was filed, a Google spokesperson said:
reCAPTCHA user data is not used for any other purpose than to improve the reCAPTCHA service, which the terms of service make clear.
Further, a majority of our user base have moved to reCAPTCHA v3, which improves fraud detection with invisible scoring. Even if a site were still on the previous generation of the product, reCAPTCHA v2 visual challenge images are all pre-labeled and user input plays no role in image labeling.
Asked to respond to Google’s comment, Searles addressed several points below.
Regarding the internet titan's assertion that "reCAPTCHA user data is not used for any other purpose than to improve the reCAPTCHA service, which the terms of service make clear."
“Could they prove this with a public audit of all their records?” Searles asked. “While they may claim this to be the case now, this is not the claim of the white paper [PDF]. The ‘re’ in ‘reCAPTCHA’ stands for reusing the data from CAPTCHAs to train ML models.
“Also, legally, this is a very vague statement: You could consider that selling reCAPTCHA user data to be an improvement of the service because you can take that money and reinvest it into reCAPTCHA and it would be considered an improvement. Note how they do not claim that they don’t sell user data.”
Regarding, "Further, a majority of our user base have moved to reCAPTCHA v3, which improves fraud detection with invisible scoring."
“Trivially bypassed in 2019, reCAPTCHA v3 offers zero provable claims surrounding its security,” said Searles. “Invisible scoring, aka a black box, is a ridiculous claim and has nothing to do with Turing tests or CAPTCHAs.”
Regarding, "Even if a site were still on the previous generation of the product - reCAPTCHA v2 visual challenge images are all pre-labeled and user input plays no role in image labeling."
“Would they publicly release all data from all historic reCAPTCHA solutions to prove such a claim?” Searles asked.
“Notably they claim in 2014 that they add it based on a ‘classic computer vision problem of image labeling,’ when this computer vision problem was solved in 2010 with 100 percent accuracy. Earlier in this blog they claim to be phasing out distorted text because of its ability to be solved by computers at 99 percent accuracy.
"There is either an extreme degree of incompetence or a massive contradiction. ‘Let's replace defeated technology with more defeated technology because it's more secure!’ It’s pretty obvious that they used it to train machine learning models since this is the purpose of reCAPTCHA.”