Reddit hopes robots.txt tweak will do the trick in scaring off AI training data scrapers

Pay up or go away, pretty please?


For many, Reddit has become the go-to repository of community and crowdsourced knowledge, a fact that has no doubt made it a prime target for AI startups desperate for training data.

This week, Reddit announced it would be introducing measures to prevent unauthorized scraping by such organizations. These efforts will include an updated robots.txt — a file found on most websites that provides directions to web crawlers on what they can and can't index — "in the coming weeks." If you're curious, you can find Reddit's current robots.txt here.

It should be noted that robots.txt can't force scrapers to do anything; the file's contents are more like guidelines or firm requests. Web crawlers can simply ignore them, so Reddit says it will continue to rate-limit and/or block rogue bots, presumably including those that flout robots.txt, from accessing the site.
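To illustrate just how voluntary the arrangement is, here's a minimal sketch of how a well-behaved crawler consults robots.txt before fetching anything, using Python's standard urllib.robotparser module. The user agent name and target URL are hypothetical examples, not anything from Reddit's actual rules.

```python
# A minimal sketch, assuming a compliant crawler that checks robots.txt
# before fetching. The user agent name and target URL are hypothetical.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.reddit.com/robots.txt")
rp.read()  # download and parse the site's directives

target = "https://www.reddit.com/r/technology/"
if rp.can_fetch("ExampleTrainingBot", target):
    print("robots.txt permits crawling", target)
else:
    # Nothing technically stops a scraper at this point; compliance is
    # voluntary, which is why Reddit also plans to rate-limit and block.
    print("robots.txt disallows crawling", target)
```

A scraper that skips this check entirely is precisely the sort of rogue bot Reddit says it will throttle or block.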

Indeed, crawlers that ignore robots.txt risk being blocked outright by site administrators, wherever that's feasible.

These measures, vague as they are at the moment, appear to be targeted specifically at those accessing Reddit for commercial gain. The site says that "Good faith actors — like researchers and organizations such as the Internet Archive — will continue to have access to Reddit content for non-commercial use."

The announcement comes just weeks after Reddit unveiled a fresh public content policy, which it spun as a way to more transparently communicate how user data is used and protect user privacy.

"We see more and more commercial entities using unauthorized access or misusing authorized access to collect public data in bulk, including Reddit public content," the site said.

It seems Reddit execs would much rather interested parties pay it for curated access to its crowdsourced hive mind of knowledge, opinion, trolling, and karma farming, as the announcement ends with a sales pitch for its data access plans.

As we've previously discussed, training large language models like GPT-4, Gemini, or Claude requires a prodigious amount of data. Meta's relatively small Llama 3 8B model, for instance, was trained on some 15 trillion tokens.
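For a back-of-envelope sense of scale, the arithmetic on those figures looks like this; the roughly four characters per token is an assumed rule of thumb, not a figure from Meta.

```python
# Back-of-envelope arithmetic on the figures quoted above.
# The ~4 characters per token average is an assumption.
params = 8e9      # Llama 3 8B: roughly 8 billion parameters
tokens = 15e12    # trained on some 15 trillion tokens

print(f"{tokens / params:,.0f} training tokens per parameter")        # ~1,875
print(f"~{tokens * 4 / 1e12:,.0f} TB of raw text at ~4 chars/token")  # ~60 TB
```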

Because of this, supplying the training data used to build these models has become a lucrative business. Last month Scale AI — which sells AI data services including pre-labeled datasets — saw its valuation soar to nearly $14 billion amid a $1 billion funding round led by Nvidia, Amazon, and Meta.

Meanwhile, this week also saw the formation of an AI data trade group called the Dataset Providers Alliance. The group's members include Rightsify, vAIsual, Pixta AI, Datarade, Global Copyright Exchange, Calliope Networks, and Ado.

Naturally, Reddit is keen to cash in on this demand, having already announced an agreement to sell API access to Google in a deal reportedly worth $60 million a year. The Front Page of the Internet last month reached a similar agreement with OpenAI, though the terms of the deal weren't disclosed.

How useful Reddit's data actually is has, however, been called into question in recent weeks, after Google started citing obvious troll posts in its AI-generated answers. In one case, the search engine suggested adding "non-toxic glue" to pizza sauce to stop the cheese sliding off.

The Register reached out to Reddit for comment on its efforts to block rogue web scrapers and on its future plans. ®
