Artificial Intelligence Companies Are Running Out of Internet
AI is running out of internet to consume. While you and I log onto the web to enjoy ourselves (or perhaps not), learn, and communicate, AI companies are using that same content to train their large language models (LLMs) and expand their capabilities. This is how ChatGPT knows not only factual information, but also how to put responses together: much of what it “knows” comes from a huge corpus of Internet content.
But companies that rely on the Internet to train their LLMs face a problem: the Internet is finite, and AI companies want their models to keep growing, and quickly. As the Wall Street Journal reports, companies like OpenAI and Google are confronting this reality. By some industry estimates, they will exhaust the Internet’s supply of usable training data in about two years, as high-quality data grows scarce and some companies wall their data off from AI.
AI needs a lot of data
Don’t underestimate the amount of data these companies need now, let alone in the future. Epoch researcher Pablo Villalobos told the Wall Street Journal that OpenAI trained GPT-4 on about 12 trillion tokens, which are words and parts of words broken down so that an LLM can process them. (OpenAI says one token works out to about 0.75 words, so 12 trillion tokens is roughly nine trillion words.) Villalobos estimates that GPT-5, OpenAI’s next big model, would need between 60 and 100 trillion tokens to keep up with expected growth; by that same conversion, that’s between 45 and 75 trillion words. The kicker? Villalobos says that even after exhausting all the high-quality data available on the Internet, you would still come up 10 to 20 trillion tokens short, or even more.
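To make those conversions concrete, here is the arithmetic as a few lines of Python. Keep in mind that 0.75 words per token is OpenAI’s rough rule of thumb for English text, not an exact ratio:

```python
# Back-of-the-envelope token-to-word conversion, using OpenAI's rough
# rule of thumb of ~0.75 words per token (an approximation, not exact).
WORDS_PER_TOKEN = 0.75

def tokens_to_words(trillions_of_tokens: float) -> float:
    """Convert a token count (in trillions) to an approximate word count (in trillions)."""
    return trillions_of_tokens * WORDS_PER_TOKEN

print(f"GPT-4 training set: ~{tokens_to_words(12):.0f} trillion words")        # ~9
print(f"GPT-5 estimate: ~{tokens_to_words(60):.0f}-{tokens_to_words(100):.0f} trillion words")  # ~45-75
```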
Still, Villalobos doesn’t expect the data shortage to really hit until around 2028. Others aren’t as optimistic, though, especially the AI companies themselves: they see the writing on the wall and are already looking for alternatives to Internet data for training their models.
AI’s data problem
Of course, there are several problems tangled together here. First is the aforementioned lack of data: you can’t train an LLM without data, and giant models like GPT and Gemini require a lot of it. Second is the quality of that data. Companies can’t simply scrape every conceivable corner of the Internet, because there’s a ton of junk out there. OpenAI doesn’t want to feed misinformation and poorly written content into GPT, since its goal is an LLM that can accurately respond to user queries. (Of course, we’ve already seen plenty of examples of AI spewing misinformation anyway.) Filtering out the junk leaves companies with even less usable data than the raw numbers suggest.
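What does that filtering actually look like? Here is a toy Python sketch of the kind of heuristic quality checks that open training pipelines (such as the one behind the C4 dataset) apply to raw web text. Every threshold below is an illustrative assumption, not any company’s real criterion:

```python
# Toy quality filter for web text, loosely inspired by heuristics used in
# open datasets like C4. All thresholds here are illustrative assumptions.

def looks_high_quality(text: str) -> bool:
    words = text.split()
    if len(words) < 50:                       # too short to be a useful document
        return False
    if sum(w.isalpha() for w in words) / len(words) < 0.8:
        return False                          # too much markup, numbers, or junk
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if sum(ln.strip().endswith((".", "!", "?")) for ln in lines) / len(lines) < 0.5:
        return False                          # mostly fragments, not real sentences
    return True

junk = "Buy now!!! Click here click here click here"
decent = ("This is a short example sentence. " * 12).strip()
print(looks_high_quality(junk), looks_high_quality(decent))   # False True
```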
Finally, there’s the ethics of collecting all this data in the first place. Whether you know it or not, AI companies have likely collected your data and used it to train their LLMs. These companies, of course, aren’t concerned with your privacy: they just want your data, and if they’re allowed to take it, they will. It’s big business, too: Reddit sells your content to AI companies, in case you didn’t know. Some parties are fighting back (the New York Times is suing OpenAI over this), but until there are real user protections, your public Internet data is headed for the LLM nearest you.
So where do companies look for new data? OpenAI is leading the way. For GPT-5, the company is reportedly considering training on transcriptions of publicly available videos, such as those on YouTube, generated with its Whisper transcription model. (It’s possible the company has already used the videos themselves for Sora, its AI video generator.) OpenAI is also working on smaller models for specific niches, as well as a system for paying data providers based on how good their data is.
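We don’t know what OpenAI’s internal pipeline looks like, but the open-source whisper Python package makes the basic idea easy to sketch. Assume the audio (here a hypothetical lecture.mp3) has already been downloaded, since pulling audio off YouTube raises its own terms-of-service questions:

```python
# Sketch of video-to-training-text transcription with OpenAI's open-source
# Whisper model. Assumes "lecture.mp3" already exists locally; this
# illustrates the idea, not OpenAI's actual pipeline.
import whisper

model = whisper.load_model("base")          # small, fast checkpoint
result = model.transcribe("lecture.mp3")    # returns a dict with "text", "segments", ...
transcript = result["text"]

# The transcript could then be cleaned, filtered, and folded into a corpus.
print(transcript[:200])
```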
Is synthetic data the answer?
But perhaps the most controversial next step, one that several companies are considering, is using synthetic data to train models. Synthetic data is information generated from an existing data set: the idea is to produce a new dataset that resembles the original but is entirely new. In theory, it can stand in for the original data, shielding its contents while giving the LLM a similar corpus to train on.
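As a toy illustration of the principle, here is a tiny bigram model fit on a seed corpus and then sampled to produce a “new” dataset that mimics the original. Real pipelines use full LLMs as the generator; this is only a sketch of the shape of the process:

```python
# Toy synthetic-data generator: fit a bigram model on a seed corpus, then
# sample it to produce new text that mimics (but doesn't copy) the original.
import random
from collections import defaultdict

def fit_bigrams(corpus: list[str]) -> dict[str, list[str]]:
    model = defaultdict(list)
    for doc in corpus:
        words = doc.split()
        for a, b in zip(words, words[1:]):
            model[a].append(b)              # record every observed successor
    return model

def sample(model: dict[str, list[str]], start: str, length: int = 20) -> str:
    out = [start]
    for _ in range(length):
        successors = model.get(out[-1])
        if not successors:
            break
        out.append(random.choice(successors))
    return " ".join(out)

seed_corpus = [
    "the model learns patterns from the data",
    "the data teaches the model new patterns",
]
bigrams = fit_bigrams(seed_corpus)
synthetic_corpus = [sample(bigrams, "the") for _ in range(3)]
print(synthetic_corpus)
```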
In practice, however, training an LLM on synthetic data can lead to “model collapse.” Because synthetic data only recycles the patterns already present in the original data set, a model trained on those same patterns over and over can’t grow, and may even forget important parts of the original distribution. Over time, your AI models start returning the same results, because they lack the diverse training data needed to support unique responses. That would hollow out something like ChatGPT and defeat the purpose of using synthetic data in the first place.
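You can even watch a miniature version of this happen by reusing the fit_bigrams and sample helpers from the sketch above: retrain the toy model on its own output a few times and count how many distinct words survive each generation.

```python
# Toy model-collapse loop (continues the previous sketch; fit_bigrams and
# sample are defined there). Each generation trains only on the previous
# generation's synthetic output.
corpus = seed_corpus
for generation in range(1, 6):
    model = fit_bigrams(corpus)
    corpus = [sample(model, "the") for _ in range(50)]
    vocab = {w for doc in corpus for w in doc.split()}
    print(f"generation {generation}: {len(vocab)} distinct words")
# Vocabulary can only shrink here: sampling never invents unseen words, and
# bigrams that happen not to be sampled vanish from the next generation.
```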
Still, AI companies remain somewhat optimistic about synthetic data. Both Anthropic and OpenAI see a place for it in their training sets. These are capable companies, so if they can find a way to fold synthetic data into their models without burning the house down, good for them. Honestly, it would be nice to know that my Facebook posts from 2010 aren’t propping up the AI revolution.