Study: Chatbot training likely to run out of human writing

Traffic passes a Microsoft data center along Interstate 35 in West Des Moines, Iowa. Microsoft is a commercial partner of ChatGPT developer OpenAI. Charlie Neibergall/AP 2023

By Matt O’Brien, Associated Press

Artificial intelligence systems like ChatGPT could soon run out of what keeps making them smarter: the tens of trillions of words people have written and shared online.

A new study released Thursday by research group Epoch AI projects that tech companies will exhaust the supply of publicly available training data for AI language models by the turn of the decade — sometime between 2026 and 2032.

Comparing it to a “literal gold rush” that depletes finite natural resources, Tamay Besiroglu, an author of the study, said the AI field might face challenges in maintaining its current pace of progress once it drains the reserves of human-generated writing.

In the short term, tech companies like ChatGPT-maker OpenAI and Google are racing to secure and sometimes pay for high-quality data sources to train their AI large language models — for instance, by signing deals to tap into the steady flow of sentences coming out of Reddit forums and news media outlets.

In the longer term, there won’t be enough new blogs, news articles and social media commentary to sustain the trajectory of AI development, putting pressure on companies to tap into sensitive data now considered private, such as emails or text messages, or to rely on less-reliable “synthetic data” spit out by the chatbots themselves.

“There is a serious bottleneck here,” Besiroglu said. “If you start hitting those constraints about how much data you have, then you can’t really scale up your models efficiently anymore. And scaling up models has been probably the most important way of expanding their capabilities and improving the quality of their output.”

The researchers first made their projections two years ago — shortly before ChatGPT’s debut — in a working paper that forecast a more imminent 2026 cutoff of high-quality text data. Much has changed since then, including new techniques that enabled AI researchers to make better use of the data they already have and sometimes “overtrain” on the same sources multiple times.

But there are limits, and after further research, Epoch foresees running out of public text data in the next two to eight years.

The team’s latest study is peer-reviewed and due to be presented at this summer’s International Conference on Machine Learning in Vienna, Austria. Epoch is a nonprofit institute hosted by San Francisco-based Rethink Priorities and funded by proponents of effective altruism — a philanthropic movement that has poured money into mitigating AI’s worst-case risks.

Besiroglu said AI researchers realized more than a decade ago that aggressively expanding two key ingredients — computing power and vast stores of internet data — could significantly improve the performance of AI systems.

The amount of text data fed into AI language models has been growing about 2.5 times a year, while computing has grown about 4 times a year, according to the Epoch study. Facebook parent Meta Platforms recently claimed that the largest version of its upcoming Llama 3 model, which has not yet been released, has been trained on up to 15 trillion tokens, each of which can represent a piece of a word.

But how much it’s worth worrying about the data bottleneck is debatable.

“It’s important to keep in mind that we don’t necessarily need to train larger and larger models,” said Nicolas Papernot, an assistant professor of computer engineering at the University of Toronto and researcher at the nonprofit Vector Institute for Artificial Intelligence. He was not involved in the Epoch study.

Training generative AI systems on the same outputs they’re producing is “like what happens when you photocopy a piece of paper and then you photocopy the photocopy. You lose some of the information,” he said. Not only that, but Papernot’s research has found that it can further encode the mistakes, bias and unfairness that are already baked into the information ecosystem.