By: Sarah Marquart

Alexander “Sasha” Rush, associate professor of computer science at Cornell Tech, and his colleagues from Hugging Face earned an Outstanding Main Track Runner-Up award at the December 2023 Conference on Neural Information Processing Systems (NeurIPS).

Their award-winning paper, “Scaling Data-Constrained Language Models,” was among six recognized by the awards committee out of a record 13,321 submissions. The team’s research delves into the science of scaling large language models (LLMs), with a particular focus on the impact of training dataset size.

The authors explain that if the training of LLMs, the technology behind AI chatbots such as ChatGPT, continues to scale indefinitely, we will quickly reach the point where there isn’t enough existing data to support further learning. High-quality English-language data could be exhausted as soon as this year, with low-quality data following as early as 2030, according to an October 2022 study the authors cite.

Anticipating these challenges, Rush and his colleagues explored optimal strategies for scaling large language models in data-limited settings. They focused on solutions that strike a balance between performance and cost, taking into account factors such as computational resources and environmental impact.

Their award-winning research revealed that there are indeed limits on the scaling horizon and suggested the need for more effective utilization of available data. The authors are optimistic that their findings will pave the way for understanding how models gain their capabilities using existing data.

“Large language models are powered by data, and they get better because of high-quality human-written text,” says Rush. “It’s critical to remember that the work of writers, from journalists to Stack Overflow experts, forms the basis of what we call generative AI.”