
AI has hit another wall. And broke it.

Scaling Laws and the Race Against Time

You may have come across phrases like these in recent months:

"AI has hit an insurmountable wall. It won't grow any further!"
"Are we reaching the scale limits of artificial intelligence?"
"Current AI scalability laws show diminishing returns"

Yet, time and again, something happens that makes us rethink this stance.


The discussion about limits is as old as AI itself. This theme was also central in my 2020 novel Glimpse, where I speculated on the ways AI might find to sustain itself, possibly unsettling some. But has AI truly reached the limit? Hmm...




Warning: this post contains acronyms, technical jargon, and other content that might be intimidating for more sensitive minds. Nonetheless, I encourage you to continue: I have tried to write in the simplest manner possible and created a glossary of the unusual terms at the end of the article.


The Issue

Until a few months ago, it seemed (with reasonable certainty) that it was no longer possible to coax better reasoning capabilities out of a Large Language Model (LLM) simply by feeding it more data, building bigger models, or throwing in more computational power. This scaling law, applied to Generative AI, had previously held great promise, but it seemed we had "run out of data", and extra computational power alone was no longer improving model performance.

For the past 50 years, technological progress has followed Moore's Law: "Every two years, the transistor density doubles (at the same cost)". In practice, this means computing performance doubles every two years: a steady, predictable rhythm of growth that has enabled enormous advances.

The Scaling Laws introduce the idea that, thanks to the enhancements and optimizations we'll discuss shortly, this doubling, especially for LLMs, becomes something closer to "performance doubles every 6 months instead of every 2 years" (at that pace, performance grows roughly sixteen-fold over the two years Moore's Law needs for a single doubling). We've realized that brute-force processing power and ever larger amounts of training data are no longer the only things that determine performance improvements.


What Are Scaling Laws?

Scaling laws are empirical rules (based on experimental observations rather than theoretical principles) indicating that a system's performance improves as more resources are deployed. In Generative AI models, they help predict how increasing parameters, data, or computational power impacts a model’s accuracy, reasoning skills, and efficiency: its performance.



Jensen Huang talking about Scaling Laws in January 2025


The first scaling law is associated with the PRE-TRAINING of a model and describes how its performance increases with the volume of training data, the computational power used, and the number of parameters the model ultimately possesses. It's here that OpenAI (and its competitors) are thought to have hit a wall: a point was reached beyond which further improving performance by increasing these ingredients seemed impossible. For those interested in a deeper dive: here you'll find the original reasoning.
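To give a flavour of what such an empirical curve looks like, here is a minimal Python sketch of a power-law loss formula in the style of the pre-training scaling-law papers. The constants and exponents below are made-up assumptions for illustration, not the published values.

# Illustrative sketch of a pre-training scaling law: loss falls as a
# power law in model parameters N and training tokens D.
# All constants here are assumed for illustration, not published values.

def estimated_loss(n_params: float, n_tokens: float) -> float:
    E, A, B = 1.7, 400.0, 400.0       # assumed irreducible loss and scale constants
    alpha, beta = 0.34, 0.28          # assumed exponents
    return E + A / n_params**alpha + B / n_tokens**beta

for n, d in [(1e9, 2e10), (1e10, 2e11), (1e11, 2e12)]:
    print(f"{n:.0e} params, {d:.0e} tokens -> loss ~ {estimated_loss(n, d):.2f}")

Run it and the estimated loss keeps dropping as parameters and tokens grow, which is exactly the behaviour the first scaling law describes; the worry was that the gains per extra order of magnitude were getting smaller and smaller.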


The second law, "POST-TRAINING SCALING", manifests after training: a phase in which a model's performance can be enhanced AFTER training, without investing in ever larger models, through refinement and optimization alone.

It's much like a postgraduate master's: after an intensive training phase, the student is given the tools to tackle the world in practice, without having to go back and earn a degree in yet another discipline.

For the more demanding reader: in the post-training phase a model undergoes fine-tuning, Reinforcement Learning with Human Feedback (RLHF), and other optimization techniques such as pruning and model compression. Here, DeepSeek made waves by showing how reworking the post-training phase can yield leaner (hence cheaper to produce and operate) yet still powerful models.
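As a tiny taste of what one of these compression techniques does, here is a minimal Python sketch of magnitude pruning: zeroing out the smallest weights of a layer. The random matrix and the 50% ratio are purely illustrative assumptions; real pipelines are far more sophisticated.

import numpy as np

# Minimal sketch of magnitude pruning: zero out the weights with the
# smallest absolute value, making the layer cheaper to store and run.
# The random "layer" and the 50% pruning ratio are illustrative assumptions.

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 4))          # pretend this is one trained layer

ratio = 0.5                                # prune half of the weights
threshold = np.quantile(np.abs(weights), ratio)
pruned = np.where(np.abs(weights) < threshold, 0.0, weights)

print(f"zeroed {np.mean(pruned == 0):.0%} of the weights")

The result is a sparser, lighter layer; the real craft lies in deciding how much to prune and where, without hurting the model's quality too much.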


The third law emerges while the model is being used: essentially, when a prompt is given and, with sufficient computational resources available, we try to express ourselves clearly enough to get better results.

Here, we're talking about "TEST-TIME SCALING," where the effort is to have the model "reason" through highly accurate and specific prompts.


To explain, it's worth mentioning an antecedent: on January 28, 2022, this paper described how to get the most out of an LLM, well before the wider public had come to know GPT-3 and before ChatGPT with GPT-3.5 was born. The technique is called COT: Chain-of-Thought prompting. It lets a model work through a chain of thoughts before giving a solution and, along with the other prompting techniques richly described on this excellent site, emerged as one of the most promising ways to get better performance out of a model, by making it spend substantially more time preparing a response.
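To make the idea concrete, here is a minimal sketch in Python of the difference between a direct prompt and a Chain-of-Thought prompt; the worked example mirrors the famous one from the original paper, and send_to_model is just a placeholder, not a real API.

# Minimal sketch of Chain-of-Thought prompting: the few-shot example shows
# the model the intermediate steps, so it spells out its own steps before
# answering. send_to_model is a placeholder, not a real API call.

direct_prompt = "Q: A train travels 60 km in 45 minutes. What is its speed in km/h?\nA:"

cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.\n"
    "Q: A train travels 60 km in 45 minutes. What is its speed in km/h?\n"
    "A: Let's think step by step."
)

def send_to_model(prompt: str) -> str:
    return "(model response would appear here)"   # placeholder only

print(send_to_model(cot_prompt))

With the direct prompt the model jumps straight to an answer; with the CoT prompt it is nudged to write out the intermediate steps first, which is precisely what costs more tokens and more time.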


Then, in September 2024, OpenAI released the O1 models: a new series of LLMs that, while sharing the same architecture as the GPT models, autonomously activate a new phase, the "reasoning" phase (in quotes so as not to offend purists who consider reasoning possible only in the human domain), incorporating within the model itself a COT-style response triggered by the user's prompt.

This mechanism significantly improved the performance of these thinking-capable models without requiring the user to be an expert in prompt engineering, beyond a minimum of technique.


The introduction of O1-Preview first, then O1 and O1-Pro in December, along with new models like O3-mini and O3-mini-high, in parallel with the arrival of DeepSeek, the Gemini Flash Thinking models, Grok 3, and an ever-growing list of others, makes it evident that we now have a new class of models that reason automatically before providing an answer. As I narrated here.

We're facing a further acceleration in the TEST-TIME phase, breaking yet another barrier in yet another wall.


Minor issue: prolonging a model's reasoning process incurs higher computational and energy costs, reflected in a noticeable (sometimes substantial) increase in the number of output tokens, which are more expensive. Determining when it's worth using a "thinking model" is thus becoming a complex endeavour.
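A back-of-the-envelope example of the cost difference, with purely hypothetical prices and token counts (not any provider's actual price list):

# Back-of-the-envelope cost of a "thinking" answer vs. a direct one.
# Prices and token counts are hypothetical assumptions for illustration.

price_per_1k_output_tokens = 0.06    # assumed price in dollars

direct_answer_tokens = 300
thinking_answer_tokens = 300 + 4000  # same answer plus hidden reasoning tokens

direct_cost = direct_answer_tokens / 1000 * price_per_1k_output_tokens
thinking_cost = thinking_answer_tokens / 1000 * price_per_1k_output_tokens

print(f"direct:   ${direct_cost:.3f}")
print(f"thinking: ${thinking_cost:.3f}  ({thinking_cost / direct_cost:.0f}x more)")

With these assumed numbers the thinking answer costs over ten times as much, which is why deciding when to switch reasoning on is not a detail.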


OpenAI recently announced that the much-anticipated GPT-5 will be an LLM that "autonomously decides" when it needs to think before responding and when to answer straight away, as before. This is creating a bit of a headache for those who need to craft decent prompts.


So what...

I'm not particularly keen on the notion that we must accelerate at all costs.


I firmly believe that even if Generative AI ceased accelerating, we would still reap enormous benefits from everything released over the last two years.


We're faced with years of work merely to uncover what can truly be achieved with these models, and we're only human: we need time to assimilate innovations.


Yet, I don't foresee this slowdown occurring. At this stage, we'll invariably find ways to keep moving forward.


See you soon!

Massimiliano


P.S. If you enjoyed this article, please share it. Discover more on LinkedIn or on maxturazzini.com



 


Here's the Glossary

(kindly provided by GPT-4o)

Generative AI: A type of artificial intelligence that creates original content (texts, images, music, videos) based on what it has learned from large volumes of data.

LLM (Large Language Model): An AI model trained on massive text datasets to generate coherent, contextual responses. The larger the model, the more data and computational power required.

Scaling Laws: Empirical rules describing how increased resources (data, parameters, computation) enhance an AI model's performance. They divide into three phases: pre-training, post-training, and test-time.

Pre-training Scaling: The initial phase where the model is trained with massive data volumes and computational power to learn patterns and relationships between words.

Post-training Scaling: Optimizations following pre-training, including fine-tuning, RLHF (human feedback), pruning (reducing unnecessary parameters), and other techniques to improve efficiency and performance.

Test-time Scaling: Strategies to improve performance while the model is in use, through prompting techniques, hardware advancements (GPU, TPU), quantization, and dynamic resource management.

Fine-tuning: Tailoring a pre-trained AI model for specific tasks or industries without retraining it from scratch.

RLHF (Reinforcement Learning with Human Feedback): A technique where humans evaluate AI responses, helping the model identify the most useful and accurate ones.

Prompt: The textual request made to the AI model. A well-crafted prompt can significantly enhance response quality.

Prompt Engineering: The craft of writing optimized prompts to obtain better responses from AI models.

COT (Chain-of-Thought) Prompting: A prompting technique enabling the model to think step by step before responding, enhancing response accuracy and complexity.

Token: The base text unit processed by an LLM. It can be a word, word piece, or character. The more tokens used, the costlier the computation.

Quantization: A technique that reduces a model's numerical precision, decreasing memory consumption and improving efficiency without sacrificing much performance.

Pruning: Selectively removing parts of a model to make it leaner and more efficient by eliminating less useful parameters.

Dynamic Resource Allocation: Strategies applied by AI models to determine how much computational power to use based on task complexity. GPT-5 will autonomously regulate the necessary "reasoning" level, according to OpenAI.

Adaptive Attention: A method where the AI focuses its resources on the key parts of the text to enhance response quality without computational waste.

O1, O1-Pro, O3-mini, DeepSeek, Gemini Flash Thinking: New generations of AI models optimized for advanced reasoning.

Test-time Optimizations: A set of techniques to make AI faster and more efficient during use, including advanced hardware, model compression, and intelligent memory management.


 
 
 
