AI Robs the Internet, but the Internet Fights Back

AI is not magic. Tools that generate essays or hyper-realistic videos from simple user input can only do so because they have been trained on huge data sets. That data has to come from somewhere, and that somewhere is usually content created and written by people on the internet.

The internet has proven to be quite a large source of data and information. As of last year, there were 149 zettabytes of data online. That's 149 million petabytes, 149 billion terabytes, or 149 trillion gigabytes, otherwise known as a lot. This collection of text, image, video, and audio data is irresistible to AI companies, who need more data than ever to continue to grow and improve their models.

So, AI bots scour the web, collecting any data they can to improve their neural networks. Some outlets, seeing business potential, have signed deals to sell their data to AI companies, including Reddit, the Associated Press, and Vox Media. But AI companies don't necessarily ask permission before scraping data from the web, and many other organizations have taken the opposite approach, filing lawsuits against companies like OpenAI, Google, and Anthropic. (Disclosure: Lifehacker's parent company, Ziff Davis, filed a lawsuit against OpenAI in April, alleging that it infringed Ziff Davis's copyrights when training and operating its AI systems.)

These lawsuits probably aren't slowing down the AI vacuum cleaners. In fact, the machines are desperate for more data: Last year, researchers projected that AI models will run out of fresh data to sustain their current growth rate. Some forecasts put the runway at 2028, which, if true, gives AI companies just a few years to scrape what remains of the web. While they'll look to other sources, like licensing deals or synthetic data (data created by AI), they need the internet more than ever.

If you have any kind of presence on the internet at all, there's a good chance your data has been vacuumed up by these AI bots. It's unsettling, but it's also what fuels the chatbots many of us have started using over the past two and a half years.

The Internet will not give up without a fight

But just because things are a little dire for the internet as a whole doesn’t mean it’s giving up entirely. On the contrary, there’s real pushback against this kind of practice, especially when it’s directed at the little guy.

In true David and Goliath style, one developer has taken on the challenge of building a tool that lets site owners block AI bots from scraping their sites for training data. The tool, Anubis, launched earlier this year and has been downloaded more than 200,000 times.

Anubis is the creation of Xe Iaso, a developer from Ottawa, Canada. As reported by 404 Media, Iaso launched Anubis after discovering an Amazon bot clicking on every link on her Git server. Rather than take the Git server down entirely, she experimented with a few different tactics before finding a way to block these bots for good: an "uncaptcha," as Iaso calls it.


Here’s how it works: When Anubis runs on your site, it checks whether a new visitor is actually human by forcing the browser to run cryptographic math with JavaScript. According to 404 Media, most browsers released since 2022 can pass this test, as they have built-in tools for running this type of JavaScript. Bots, on the other hand, usually have to be specially coded to run this cryptographic math, which would be too expensive to implement across scraper fleets en masse. So Iaso came up with a clever check: a test that real browsers pass in their digital sleep, while bots are blocked because their operators can’t afford the computing power required to pass it at scale.
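Anubis’s actual implementation differs (it runs as a challenge page in the browser’s JavaScript), but the underlying proof-of-work idea can be sketched in a few lines. This is a minimal illustration, not Anubis’s code; the `DIFFICULTY` value and function names here are assumptions for the example:

```python
import hashlib
import secrets

DIFFICULTY = 2  # leading zero hex digits required; real tools tune this


def make_challenge() -> str:
    """Server side: hand each visitor a random challenge string."""
    return secrets.token_hex(16)


def solve(challenge: str) -> int:
    """Client side: brute-force a nonce until the hash meets the target.

    A real browser does this in JavaScript in a blink; a scraper fleet
    would have to pay this cost on every single request.
    """
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith("0" * DIFFICULTY):
            return nonce
        nonce += 1


def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single cheap hash confirms the work was done."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)


challenge = make_challenge()
nonce = solve(challenge)
print(verify(challenge, nonce))  # True
```

The asymmetry is the whole trick: solving costs many hash attempts, while verifying costs exactly one, so the burden lands on whoever is making millions of requests.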

This isn’t something the average web surfer needs to think about. Instead, Anubis is designed for people who run their own websites and servers. The tool is completely free and open source, and is in constant development. Iaso told 404 Media that while she doesn’t have the resources to work on Anubis full-time, she plans to update the tool with new features. These include a challenge that doesn’t put as much strain on the end user’s CPU, as well as a test that doesn’t rely on JavaScript, since some users disable JavaScript as a privacy measure.

If you’re interested in running Anubis on your own server, you can find detailed instructions on Iaso’s GitHub page. You can also check your own browser to make sure you’re not a bot.

Iaso isn’t alone in cracking down on AI crawlers. Cloudflare, for example, will block AI crawlers by default starting this month, and will also allow customers to charge AI companies that want to scrape data from their sites. Perhaps as it becomes easier to stop AI companies from openly scraping the web, these companies will scale back their efforts, or at least offer site owners more in exchange for their data.

I hope to come across more sites that load with the Anubis splash screen. If I click a link and see a message that says “Making sure you’re not a bot,” I’ll know the site has successfully blocked these AI crawlers. For a while, the AI machine seemed unstoppable. Now it seems there’s something we can do to at least bring it under control.
