System 3 thinking

I’ve been thinking for a while that there’s a piece missing from LLMs. There are hints that this hole might soon be filled, and it could drive the next leg up in AI capabilities.

Many people have observed that LLMs, for all their abilities, seem to lack “spark”. The new reasoning models are remarkably good at a certain kind of knowledge-based problem solving, based on chaining together obscure facts, but they don’t seem to show the novel creative insights that characterize top human solutions. It’s somewhat reminiscent of the Deep Blue era in computer chess: the models approach problems in a grind-it-out kind of way. Humans sometimes do this too, but also have some other mode which the models seem to lack.

Will this just fall out of further scaling? Or do we need some new ideas? While I am very bullish on scaling, I also think ideas are going to matter.

Modes of cognition

In humans, it’s pretty well accepted that there are two distinct modes of cognition: System 1 and System 2.

System 1 is basically associative memory. I say “Let’s go hiking” and you say “I’ll get my boots”. It’s fast, effortless, and good enough most of the time. The Transformer architecture is an extremely good fit for this, and LLMs are arguably superhuman at it already. However, System 1 has well-known limitations and failure modes. Humans use other thinking styles in harder situations.

System 2 is conscious step-by-step thinking. This is what we use when we’re doing long division, planning a holiday, or listing pros and cons of a difficult decision. As of a few months ago[1], LLMs also have this capability. Models like OpenAI’s O1 and its successors use extended chain-of-thought reasoning, and they are already approaching best-in-the-world level at competitive programming and mathematics questions. We are still only at the bottom of the scaling law for this approach, and a lot more progress seems essentially guaranteed throughout 2025 and beyond. How well the RL training approach will work for other problem types is a big open question. However, there is certainly some transfer to other tasks, as evidenced by progress on problems like ARC[2].

So is that it? Do we just scale this up and arrive at AGI? I think we are still missing a trick, and it may be required for human level performance on the hardest problems.

Introspectively, humans seem to use at least one additional mode of thinking, which I’ll call System 3. I have in mind the problems you bash your head off for hours or days, and then the answer pops into your head while you’re walking the dog or doing the dishes[3]. It gives us our “aha!” moments and creative leaps. It’s not a style of thinking we need very often, and I am not even certain it’s a universal part of adult life. However, it seems to be the source of many of the greatest human insights.

Mechanistically, what is System 3 and how does it work? The clearest fact about it is that it relies on some kind of unconscious processing. Whereas System 2 operates in token space, via internal monologue, visualization, symbol manipulation, and logical steps, System 3 relies on something else that we don’t have direct conscious access to. In the literature it’s referred to as the incubation effect, and neuroimaging evidence suggests it uses different brain regions.

Mechanizing insight

What might this map onto in an AI context? Here are a few candidates:

Long search

The most prosaic answer is that there’s nothing fundamentally different about it. It’s just regular chain-of-thought, except that the brain runs it for an extended period, in some kind of “background process” outside of conscious awareness.

Comment: Without doubt there is a lot to be gained from longer and better CoT. However, I am skeptical that everything will fall out of simply pushing System 2 hard enough.

Latent space reasoning

A variation would be chain-of-thought, but occurring directly in the high-dimensional internal representation space of the model / the brain. The research on latent space reasoning pushes in this direction[4].

Comment: This is worth exploring and likely plays some role, but I would bet against it being the key to “System 3”. At first glance, it seems powerful to remain in the latent space of the model for a longer time. Casting back to token space loses nuance, and intuitively might make it harder to come up with new ideas in a Sapir-Whorf kind of way. However, Andrej Karpathy makes the very nice analogy to analog / digital computing, where the cast to tokens is seen as a digitization step. You are in a sense losing information, but it’s actually a huge win because it enables coherence over a much longer timescale. Trying to push things longer in analog space runs into a noise wall. This is all hand-waving intuition, but seems pretty plausible to me. That’s not to say that the analog thinking style doesn’t have a place; it’s just less likely to be productive over longer timescales in particular, which is a characteristic of System 3.
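To make that digitization point concrete, here is a toy sketch of the two loops – a tiny recurrent stand-in rather than Coconut’s actual architecture, with illustrative names throughout. One loop collapses the state to a discrete token at every step; the other feeds the continuous hidden state straight back in.

    # Toy illustration only: a tiny recurrent "LM" showing where the token
    # bottleneck sits. This is a hand-wavy sketch, not any real system's API.
    import torch
    import torch.nn as nn

    class TinyLM(nn.Module):
        def __init__(self, vocab=100, dim=32):
            super().__init__()
            self.embed = nn.Embedding(vocab, dim)
            self.core = nn.GRU(dim, dim, batch_first=True)
            self.head = nn.Linear(dim, vocab)

        def step(self, x_emb, h=None):
            out, h = self.core(x_emb, h)
            return out, h

    model = TinyLM()
    prompt = torch.tensor([[1]])  # a single prompt token, batch size 1

    # Token-space chain of thought: every step "digitizes" the state back
    # to a single discrete symbol before continuing.
    x, h = model.embed(prompt), None
    for _ in range(5):
        out, h = model.step(x, h)
        next_tok = model.head(out[:, -1]).argmax(-1, keepdim=True)
        x = model.embed(next_tok)       # information collapsed to one token

    # Latent-space variant: skip the digitization and feed the continuous
    # hidden state back in as the next "input embedding".
    x, h = model.embed(prompt), None
    for _ in range(5):
        out, h = model.step(x, h)
        x = out[:, -1:, :]              # stay in the model's latent space

The second loop keeps far more information per step, which is the intuition behind latent reasoning; Karpathy’s point is that the first loop’s lossy cast to tokens is also exactly what keeps a long run coherent.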

Some completely new idea

Maybe System 3 works on a new mechanism we are currently unaware of.

Comment: I doubt that “System 3” relies on a truly separate mechanism. It is not something we seem to use very often in day-to-day life; an entirely distinct system that’s rarely used is a difficult thing to evolve. It seems more likely that these “aha!” moments are a rare consequence of something that we use all the time.

Continual learning & test time training

One of the most significant differences between current AI models and the brain is that models are trained once and then frozen, whereas the brain’s weights are continually being adjusted. Maybe gradient descent at test time holds the answer? In this view, “aha!” moments would map onto grokking, which we commonly observe in models at train time but which is currently impossible at test time, because gradient descent isn’t running then.

Comment: This is the answer that seems most natural to me. It ties in with Geoff Hinton’s ideas about the brain’s “three timescales”, and how AI has been missing one of them. It requires no new machinery, just a relaxation of the artificial train/test separation we currently impose. That’s not to say it’s easy to implement – there are good practical reasons for freezing models in deployment. More on that below.
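As a rough sketch of what relaxing that train/test separation could look like in the simplest possible setting (everything here – the model, the loss, the data stream – is a placeholder, not a recipe): the served model just keeps taking small gradient steps on the queries it sees.

    # Hand-wavy sketch of relaxing the train/test split: the deployed model
    # keeps taking small gradient steps on the data it sees at serve time.
    # The model, loss, and data stream are all toy placeholders.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    def serve(query: torch.Tensor) -> torch.Tensor:
        # 1. Answer the query: the usual forward-only inference step.
        with torch.no_grad():
            answer = model(query)

        # 2. Then nudge the weights with whatever learning signal is available
        #    (here, a toy self-supervised reconstruction loss on the query).
        loss = nn.functional.mse_loss(model(query), query)
        opt.zero_grad()
        loss.backward()
        opt.step()              # the deployed weights are no longer frozen

        return answer

    for _ in range(10):         # a stream of incoming queries
        _ = serve(torch.randn(1, 16))

If grokking-style phase changes happen at all in this regime, they would come from that ongoing weight movement rather than from running a longer chain of thought.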

Test time RL

If we’re going to do gradient descent at test time, why not do RL as well? Some recent work does this, albeit in a simple proof-of-concept way[5]. (Update: A paper with more general results came out just after I hit publish.)

Comment: If one learning mechanism helps at test time, others likely help too.

Is test time training practical?

Test time training is a fairly simple idea, but it faces some steep practical hurdles. Inference requires only a forward pass; training also requires a backward pass[6]. If you’re going to start needing backward passes to serve your model, that’s a dramatic change. For large production deployments, inference often runs on specialized hardware which assumes no backward pass. Also, the whole framework of model performance and safety testing relies on having specific releases that are frozen once deployed. If you’re updating the model in production, you might encounter catastrophic forgetting or some other form of performance loss. A lot of core processes would need to be rethought.

For those reasons, I would be surprised if we see wholesale continual learning any time soon. However, there are half-way houses that might get you some of the benefits with less work.

Most existing TTT approaches (e.g. this paper[7]) don’t do true continual learning. You just train a temporary copy of the model for a few steps on the specific problem at hand, and then discard the copy once you have the answer. This sidesteps issues around safety or performance drift. However, it’s still pretty expensive. Maybe you could make it cheap by training something lightweight like a LoRA adapter, as sketched below?
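Here is a minimal sketch of that cheaper variant, assuming we wrap a frozen layer with a small low-rank adapter, train only the adapter for a handful of steps on the problem at hand, and then reset it. The layer, objective, and hyperparameters are all illustrative, not any particular paper’s recipe.

    # Illustrative per-problem LoRA-style adapter: the base weights stay
    # frozen, only the tiny low-rank matrices are trained, and they are
    # reset once the answer has been produced.
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank: int = 4):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False           # base model stays frozen
            self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
            self.B = nn.Parameter(torch.zeros(rank, base.out_features))

        def forward(self, x):
            return self.base(x) + x @ self.A @ self.B   # low-rank correction

    base_layer = nn.Linear(32, 32)                # stand-in for a frozen model
    adapted = LoRALinear(base_layer)

    def solve_with_temporary_adapter(problem_batch: torch.Tensor) -> torch.Tensor:
        opt = torch.optim.Adam([adapted.A, adapted.B], lr=1e-2)
        for _ in range(20):                       # a few test-time steps
            loss = adapted(problem_batch).pow(2).mean()   # toy stand-in objective
            opt.zero_grad()
            loss.backward()
            opt.step()
        answer = adapted(problem_batch).detach()
        # Reset the adapter so the next problem starts from the frozen base.
        adapted.A.data.normal_(std=0.01)
        adapted.B.data.zero_()
        return answer

    _ = solve_with_temporary_adapter(torch.randn(8, 32))

The appeal is that the per-problem state is a pair of small matrices rather than a full copy of the model, which is what makes the discard-after-answering pattern cheap.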

Test time RL should also be doable this way, and may actually offer implementation advantages over pure CoT scaling. This paper makes the case quite nicely:

This approach (TTRL) allows models to develop problem-specific expertise dynamically, adapting their capabilities to the exact challenges they encounter. Just as humans often need to study and practice similar problems before tackling a particularly challenging question, TTRL enables models to systematically explore and learn from related problems before attempting the target task.

The dominant approach to test-time compute scaling today concentrates on increasing output token length, allowing models to engage in more extensive step-by-step reasoning. However, this approach faces fundamental challenges with extremely difficult problems that require extensive exploration. The sequential nature of token generation creates memory constraints and potential bottlenecks, as each token must be generated and processed in sequence while maintaining the entire chain of reasoning in context.

TTRL is inherently more parallelizable than sequential token generation – variant problems can be generated and solved independently across multiple compute units, with their insights aggregated to improve performance on the original problem. This parallelizability presents a key advantage over traditional test-time scaling approaches.
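Reading the quoted recipe alongside the rollback procedure in footnote 5, the outer loop has a fairly simple shape. The skeleton below is a hypothetical stand-in rather than the paper’s implementation: each helper is a trivial placeholder for the real component (variant generation, per-variant practice, aggregation, rollback), and only the structure is meant seriously.

    # Skeleton of a TTRL-style outer loop. All helper functions are
    # hypothetical placeholders for the real components.
    import copy
    from concurrent.futures import ThreadPoolExecutor

    def generate_variants(question: str, n: int = 4) -> list[str]:
        # Placeholder: a real system would ask the model for related problems.
        return [f"{question} (variant {i})" for i in range(n)]

    def practice_on(weights: dict, variant: str) -> dict:
        # Placeholder for an RL / fine-tuning run on one variant problem.
        update = dict(weights)
        update["practised_on"] = update.get("practised_on", []) + [variant]
        return update

    def aggregate(base: dict, updates: list[dict]) -> dict:
        # Placeholder for merging per-variant updates (e.g. averaging deltas).
        merged = dict(base)
        merged["practised_on"] = [v for u in updates for v in u["practised_on"]]
        return merged

    def answer_with(weights: dict, question: str) -> str:
        runs = len(weights.get("practised_on", []))
        return f"answer to {question!r} after {runs} practice runs"

    base_weights = {"params": "frozen base model"}

    def ttrl_answer(question: str) -> str:
        variants = generate_variants(question)
        # The per-variant work is independent, hence easy to parallelize.
        with ThreadPoolExecutor() as pool:
            updates = list(pool.map(
                lambda v: practice_on(copy.deepcopy(base_weights), v), variants))
        tuned = aggregate(base_weights, updates)
        result = answer_with(tuned, question)
        # Rollback: the tuned weights are simply dropped; base_weights is untouched.
        return result

    print(ttrl_answer("Evaluate the integral of x * exp(-x**2)"))

The structural point is the one the quote makes: the per-variant practice runs are embarrassingly parallel, unlike a single long chain of thought that has to be generated token by token.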

Despite the technical challenges, I think we’re going to see some flavour of this in future AI models, one way or another.

Footnotes

  1. Hard to believe, but O1 was announced only last September, six months ago at the time of writing. Already it seems like part of the furniture.
  2. Some of the ARC training set was actually seen by O3 in training. However, I am pretty confident that performance relies heavily on transfer from all the non-ARC training.
  3. In a few famous examples the answers even arrived in dreams, such as the structure of benzene, or some of Ramanujan’s formulas.
  4. See, for example, Meta’s Coconut system, or this paper (or the tweet summary).
  5. Quick summary of the basics from the paper: “For each question at test-time, there are two steps. First, we generate a tree of variants for the test question at hand. Second, we perform the reinforcement learning protocol on a base model. The resulting model should then have significantly improved mathematical integration capabilities tuned to the test question at hand. The test question is then answered using the tuned model, and finally the model is rolled back to its original parameters for the next test question.”
  6. At least as currently implemented. The brain doesn’t seem to use backprop, and there are plenty of proposals for training methods that don’t involve it. However, that would be a big departure, to put it mildly.
  7. Tweet summary here. Noam Brown was positive on this direction, and the lead author has since joined OpenAI, so I wouldn’t be surprised if we hear more about it.
