The Transformer vs Post-Transformer Debate Explained

Arrrreee yooooou reeeaaaaddy to ruuuuuumble?!?!?

So, a week or more ago, the AI startup Pathway staged a very 2026 kind of AI event in San Francisco: four leading AI researchers entered a boxing ring, and the crowd was asked to decide whether the Transformer architecture (of the "Attention is All You Need" paper fame) still rules modern AI.

The cast of this event alone made the bit worth taking seriously. The contenders included Łukasz Kaiser, co-inventor of the Transformer; Llion Jones, another co-inventor now arguing from the Post-Transformer corner; Mathias Lechner of Liquid AI; and Adrian Kosowski of Pathway, who has been pushing the BDH architecture, short for Dragon Hatchling. Don't worry, we'll explain all this below.

Yes, there were boxing gloves. Yes, there was a clapometer. AI research has finally achieved its final form: NeurIPS meets WrestleMania, but everyone is arguing about KV cache.

Now, no spoilers, but the crowd eventually gave Team Transformers the trophy (booo! Boring!). The more useful takeaway was the map of the fight itself. The panel surfaced the real architectural question facing AI labs: can the next jump come from better scaling of the existing Transformer stack, or does AI need a new way to handle memory, reasoning, continual learning, and hardware?

First up: WTF is a Transformer? A.K.A Why Should I Care?
The fight in one sentence
Here's the TL;DR of the full debate:
Why the Transformer still has the belt
The Post-Transformer case starts with waste
Lechner's both-and argument may be the most realistic one
The hardware lottery is the real referee
The intelligence question became a benchmark question
Latent reasoning raises a safety question too
The research behind the Post-Transformer camp
Where the Post-Transformer bets actually differ
So who is actually closest to passing the Post-Transformer test?
The paper trail, if you want to go deeper
Our take: the next architecture has to beat the slope
The frame and who was arguing what
Opening case: Łukasz Kaiser for Transformers
Opening case: Adrian Kosowski for Post-Transformers and BDH
Opening case: Mathias Lechner for architecture pluralism
Opening case: Llion Jones against architectural complacency
Rebuttals: what each side pushed back on
Quick punch round: what is intelligence?
Quick punch round: reasoning beyond language
Quick punch round: scaling laws and the Bitter Lesson
Real-world deployment and benchmarks
Closing arguments
Audience Q&A: hardware lottery and the 10x bar
Audience Q&A: continual learning, dynamic weights, and long context
Audience Q&A: fine-tuning, latent reasoning, and safety
The crowd vote and final beat
Actionable takeaways from the debate

First up: WTF is a Transformer? A.K.A Why Should I Care?

The Transformer is the neural-network architecture behind most modern large language models (called LLMs for short), and it still rules modern AI to this day.

Its core move is self-attention: each token, meaning a small chunk of text rather than necessarily a whole word, compares itself with other tokens in the context so the model can decide what should matter.

It's a neat trick, but for reasons we'll discuss below, it comes with many trade-offs.

Now, as for what a Post-Transformer is, you can think of it as any alternative architecture that reduces, replaces, or reorganizes the attention-heavy Transformer stack. The Dragon Hatchling architecture from Pathway, for example, is a proposed large language model design based on locally connected, brain-inspired units rather than standard Transformer blocks.

And what's a KV cache, you ask? A KV cache, or key-value cache, is the stored keys and values a Transformer reuses so it does not recalculate the whole past conversation every time it predicts the next token.
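To make the saving concrete, here is a toy sketch in plain Python. The `ToyKVCache` class and its made-up "projections" are purely illustrative assumptions, not any real library's API; the point is only to count how much work gets repeated when nothing is cached.

```python
class ToyKVCache:
    """Illustrative cache: stores one key and one value per token seen."""

    def __init__(self):
        self.keys = []
        self.values = []
        self.projections_computed = 0

    def append(self, token_vector):
        # Stand-ins for the learned key and value projections (assumed, not real).
        key = [2.0 * x for x in token_vector]
        value = [x + 1.0 for x in token_vector]
        self.keys.append(key)
        self.values.append(value)
        self.projections_computed += 1


def decode_with_cache(token_vectors):
    """Project each token once and keep the result for later steps."""
    cache = ToyKVCache()
    for vec in token_vectors:
        cache.append(vec)
    return cache.projections_computed


def decode_without_cache(token_vectors):
    """Rebuild keys and values for the whole prefix at every step."""
    computed = 0
    for step in range(1, len(token_vectors) + 1):
        scratch = ToyKVCache()
        for vec in token_vectors[:step]:
            scratch.append(vec)
        computed += scratch.projections_computed
    return computed


tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
print(decode_with_cache(tokens))     # 4 projections
print(decode_without_cache(tokens))  # 10 projections (1 + 2 + 3 + 4)
```

The catch is that the cache itself grows with every token, which is why KV cache memory keeps coming up in the rest of this debate.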

So this whole panel surfaced the real architectural question facing AI labs: will the next jump in AI model capability come from:

Better scaling of the existing Transformer stack, meaning capability keeps improving predictably as you add model size, data, and compute... or...
Does AI need a new way to handle memory, reasoning, continual learning, and hardware?

Of those, continual learning is the tricky part: it requires updating model weights (the numbers that determine how the model should respond to a request) from new experience after deployment without forgetting old skills, learning unsafe behavior, or needing a full retraining run. It's a hard problem, especially in a Transformer-first world where the cost of attention scales quadratically with context length, so you can't just keep piling new experience into the context.

What does that mean, quadratically? Well, quadratic scaling means the attention cost grows roughly with the square of the sequence length, so doubling the context can require about four times as many pairwise attention comparisons.
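If you'd rather see the arithmetic, here is a tiny sketch. It assumes plain full self-attention, where every token is scored against every token; the function name is ours, and real systems use tricks (causal masking, sparse or windowed attention) that change the constant or the shape.

```python
def pairwise_comparisons(sequence_length):
    """Count token-to-token scores in full self-attention."""
    return sequence_length * sequence_length


for n in (1_000, 2_000, 4_000):
    print(n, pairwise_comparisons(n))
# 1000 1000000
# 2000 4000000
# 4000 16000000
```

Each doubling of the context multiplies the comparison count by four.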

The fight in one sentence

So the simplest read of the match-up is this: Transformers are the reigning champion because they scale, run well on modern hardware, and keep absorbing good ideas from other camps.

Post-Transformers are the challenger because the current stack still looks inefficient, static, and strangely dependent on spelling out reasoning one token at a time.

The four opening arguments landed in clean lanes:

Kaiser argued that Transformers still win because they work. He framed the architecture as a simple memory system: write keys, store values, retrieve what matches. AI loves simplicity.
Kosowski argued that AI has reached a moment similar to web search before PageRank, Google's original web-ranking breakthrough that judged pages partly by the quality of other pages linking to them. We have working examples of intelligence, but we have not found the deeper process behind them. Personally, we REALLY like this metaphor and it makes sense.
Lechner argued for architectural pluralism. The model you want on a Raspberry Pi, a tiny low-power single-board computer, differs from the model you want on a massive NVIDIA stack, meaning racks of GPU servers built for heavy parallel computation, so builders need more blocks.
Jones argued that the Transformer may be a very successful local minimum, which is a good solution that traps you because every nearby change looks worse even though a better solution may exist farther away (this is factually where we are now IMO). It is elegant, powerful, and possibly blocking people from looking far enough away.

The trick is that all four can be true at once. That is why the debate matters.

Here's the TL;DR of the full debate:

(06:37) Łukasz Kaiser made the Transformer case in the simplest possible terms: it works. He described attention as a kind of memory system, where the model writes keys and values, then retrieves the most relevant value when a new question comes in. His argument was that this simple mechanism keeps scaling, works well on current hardware, and can absorb many “fixes” around it, like chain-of-thought reasoning, longer context, and memory. His standard for replacement was strict: show a better scaling curve, ideally on a private holdout benchmark, and he would concede.
(10:10) Adrian Kosowski’s counter was that the field has many examples of intelligence now, so the next step is finding the common theme behind them. He argued that Transformers have real weaknesses around continual learning, long-term memory, and reasoning in latent space, meaning reasoning inside compressed internal representations rather than spelling every step out in text. His Post-Transformer case was less “Transformers are bad” and more “reasoning may need a more direct architecture than language-token chains.”
(14:40) Mathias Lechner took the most practical middle ground: Transformers and Post-Transformers. His point was that real systems are built around hardware, use cases, speed, memory, and deployment constraints. Liquid’s bet, as he framed it, is to use whatever works: attention mechanisms, state-space models, RNN-like ideas, convolutions, and future architectures. For edge devices, biomedical signals, proteins, and other non-text data, he argued that the best architecture may depend heavily on the data and hardware.
(17:41) Llion Jones made the strongest philosophical case against architectural complacency. He said the real question is whether the Transformer is the final word, or whether the industry is stuck in a local minimum because the current architecture is so successful. His view: the human brain is proof that something more data-efficient exists, and the next breakthrough will look obvious only after it arrives. Jones eventually conceded the Bitter Lesson point (44:36): a challenger has to scale. Interesting architecture loses if more data and compute keep beating it.
(58:48) The best shared insight was the hardware lottery: Transformers won partly because they fit modern chips so well. Any successor has to overcome that advantage. The audience vote gave the night to Team Transformer, but even Kaiser agreed on the real test: bring a curve that bends down faster, and the belt can change hands.
(1:15:47) The safety wrinkle: visible chains of thought are already incomplete because Transformers have hidden activations above every token.

Why this matters: The debate turned “Post-Transformer” from a vibe into a testable claim. A challenger has to show a better curve: same performance for less cost, better long-context learning, stronger reasoning without long word-by-word traces, or safer adaptation from new experience. You hear that, all you would-be transformer killers? Now y'all have a benchmark, it's time to hill climb!

Once a challenger has a strong enough proof point, the burden of proof shifts to the big labs: if a lesser-funded upstart can scale more efficiently and perform better on all of the metrics above, the labs will have to show that their Transformer-based language models are still worth investing in at such a massive scale.

Our take: Transformers keep the belt because they have scale, hardware, benchmarks, and tooling on their side. Post-Transformers matter because those advantages also hide the trade-offs: huge inference cost, brittle long-horizon memory, limited continual learning, and reasoning that often has to talk itself through the problem. The first successor may look worse at first. The one worth watching is the one whose curve bends faster. This is the new AI race. To quote legendary AI researcher Ilya Sutskever, the age of research is upon us...

Why the Transformer still has the belt

The Transformer became the foundation of modern AI because it made sequence modeling, the job of reading ordered data like words, audio frames, code tokens, or DNA bases, parallel enough to scale.

The original Attention Is All You Need paper replaced recurrence, where a model processes a sequence one step after another, and convolution, where a model slides learned filters over nearby chunks, with attention, where each token can weigh the relevance of other tokens directly. It then showed strong machine-translation results, meaning one-language-to-another conversion, with far less training time than prior systems.

That parallelism became the economic engine of the AI boom. Parallelism means many calculations can run at the same time instead of waiting in a long line. It meant bigger models could use bigger datasets on bigger accelerators, the GPUs and TPUs specialized for doing huge numbers of matrix operations quickly.

Once that loop started working, labs had a brutally effective recipe: more data, more compute, better training, bigger context, stronger post-training, meaning the instruction-tuning, preference-tuning, and other finishing steps after the base model learns from raw data, and better tools around the model.

Kaiser made the strongest pro-Transformer case by avoiding mysticism. His claim was practical: the thing works, and many would-be contenders still do not (at least, in our view, at scale).

His library analogy is a clean way to explain attention:

Imagine every new token as a card in a library catalog.
The model stores a key, stores a value, then later compares a query to those keys and pulls the relevant values back into the calculation.
These are learned vectors, or lists of numbers: the query is roughly "what am I looking for," the key is roughly "what does this stored item match," and the value is "what information should I bring back if it matches."
A Transformer keeps extending that memory as the context grows (this is also why, practically speaking, the context can't grow past a certain size, and it's one of the Transformer's biggest limitations).
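The catalog lookup above can be sketched in a few lines of plain Python. This is a minimal, single-query version of scaled dot-product attention; the vectors are hand-picked for illustration, whereas in a real model the queries, keys, and values come from learned projections.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to one."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Score the query against every key, then blend the values by weight."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended


# Three "catalog cards": each has a key (what it matches) and a value (what it returns).
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

# The query points in the same direction as the first key, so that card dominates.
weights, blended = attend([4.0, 0.0], keys, values)
print(weights)
print(blended)
```

Note that the lookup is soft: every card contributes a little, and the best match contributes the most.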

All that explains why context windows matter so much:

A context window is the amount of text, images, code, or other input the model can hold in its working area at once.
A longer context gives the model more cards to search.
A better attention mechanism improves how it searches.
And a better system around the model can compress, retrieve, or route information before the model even sees it (retrieval here means fetching relevant outside material, like a document or database row, and inserting it into the model's context instead of hoping the model memorized it during training).

From that view, many so-called limitations look like engineering work:

Long reasoning can be handled with chain-of-thought style computation, where the model writes or internally uses intermediate steps instead of jumping straight to the final answer.
Memory can be extended with retrieval, longer context, and external tools (though this is kind of a hacky workaround for a truly bigger context window, long-term memory, or continual learning).
Specialized behavior can be added through fine-tuning, agents, and scaffolding. Fine-tuning adjusts the model's weights on targeted examples; agents are systems that let a model plan and take actions over multiple steps; scaffolding is the prompting, tools, memory, and control logic wrapped around the model.
Efficiency can improve through better kernels, hardware, and model compression. Kernels are the low-level GPU routines that do the math; model compression shrinks or speeds a model through techniques like quantization, pruning, or distillation while trying to preserve capability.

All that is basically the incumbent's advantage. The Transformer stack is no longer one architecture sitting alone in a paper. It is an ecosystem of chips, kernels, serving systems, data pipelines, benchmark habits, developer tools, and company roadmaps, all built to make Transformers as powerful as they can possibly be... which is both a boon and a crutch.

The Post-Transformer case starts with waste

The Post-Transformer side did not need to prove that Transformers are useless. That would be silly. Every major frontier model still owes a huge debt to the Transformer lineage. Unless you're Gary Marcus, you don't seriously believe they don't have value in their current form.

The sharper claim is that the current stack may be solving hard problems in a clumsy way.

Jones put the critique plainly:

Humans do not need to read the entire internet several times to become useful thinkers (or put another way, our human brains don't compare every data point we see to every other data point we see in order to think; we use intellectual shortcuts).
Children learn from small amounts of data, build models of the world, and adapt through experience.
Modern AI models still lean on huge training corpora, meaning the giant collections of text, code, images, audio, or video used during training, huge compute budgets, and inference-time scaffolding, the extra prompts, tools, search loops, verifiers, and planners used while the model is answering.

That critique hits hardest in reasoning. Today's reasoning models often work by producing long chains of text, checking intermediate steps, calling tools, and feeding outputs back into the next step. This is powerful. It also looks like a workaround for an architecture that lacks native, compact reasoning dynamics.

In normal-person terms: the model may need to talk itself through the problem because it does not have a built-in inner scratchpad that can run many steps silently and reliably.

Kosowski's response was to separate learning from reasoning. During learning, backpropagation through layers works well. Backpropagation is the training procedure that measures an error, sends that error backward through the network, and nudges the weights in the direction that would have reduced it. In this case, layers refers to the stacked stages of computation inside the model.
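For intuition, here is backpropagation on the smallest network we could think of: two scalar "layers" chained together. Everything here (the function name, starting weights, learning rate) is an illustrative assumption; real training does the same thing over billions of weights with automatic differentiation.

```python
def train_two_layer(x, target, w1=0.5, w2=0.5, lr=0.05, steps=200):
    """Fit output = w2 * (w1 * x) to a target with hand-written backprop."""
    losses = []
    for _ in range(steps):
        # Forward pass: run the input through both layers.
        hidden = w1 * x
        output = w2 * hidden
        error = output - target
        losses.append(error ** 2)

        # Backward pass: send the error back through each layer (chain rule).
        grad_output = 2.0 * error
        grad_w2 = grad_output * hidden
        grad_hidden = grad_output * w2
        grad_w1 = grad_hidden * x

        # Nudge each weight in the direction that would have reduced the error.
        w1 -= lr * grad_w1
        w2 -= lr * grad_w2
    return w1, w2, losses


w1, w2, losses = train_two_layer(x=1.0, target=2.0)
print(losses[0], losses[-1])
print(w1 * w2)  # should end up close to the target of 2.0
```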

During reasoning, the system may need to run a long algorithm over many steps. For Transformers, those steps often unfold as language. His Post-Transformer bet is that reasoning can happen more directly in latent state, the model's hidden numerical workspace before anything becomes visible text.

That is where Pathway's BDH paper enters the debate:

The paper describes Dragon Hatchling as a biologically inspired LLM architecture based on locally interacting neuron particles, small computational units connected in a graph so nearby units update one another.
It also uses a GPU-friendly state-space formulation, meaning the model keeps and updates a compact internal state over the sequence in a way that can still be expressed as efficient accelerator math.
The authors claim BDH shows Transformer-like scaling laws (the empirical curves that predict how performance changes as model size, data, or compute changes) on language and translation tasks from 10M to 1B parameters (parameters = the learned numbers inside the model).
The paper also claims sparse (meaning only a small fraction are active at once) positive activations and interpretability of state; that means researchers can inspect the model's internal memory-like variables, not only its individual neurons or final outputs.

The official BDH repo frames it as a bridge between Transformers and brain-like models. Its key pitch is state: a model that can carry more of the right information forward, update that information during use, and reason without verbalizing every step.

That word, state, kept coming back. If you don't know, state is the information a system carries from one step to the next, like the running total on a calculator after you press plus. Here's a way to think about it:

A Transformer can attend to prior tokens in context.
A state-space or recurrent system carries an internal state forward.
State-space models describe how a hidden state changes over time as new inputs arrive; recurrent systems feed information from the previous step into the current step.
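Here's a toy contrast between the two kinds of memory. Both functions are our own illustrative stand-ins (a fixed-decay running average and a plain list), not a real state-space model or a real Transformer.

```python
def run_recurrent(inputs, decay=0.9):
    """Carry one number forward: old state fades, new input blends in."""
    state = 0.0
    for x in inputs:
        state = decay * state + (1.0 - decay) * x
    return state  # same size no matter how long the sequence was


def run_context(inputs):
    """Keep every past item around so later steps can look back at it."""
    memory = []
    for x in inputs:
        memory.append(x)
    return memory  # grows by one entry per input


short = [1.0] * 10
long = [1.0] * 10_000

print(run_recurrent(short), run_recurrent(long))        # two single numbers
print(len(run_context(short)), len(run_context(long)))  # 10 10000
```

The trade-off is visible even here: the recurrent state is cheap but lossy, while the context memory is exact but keeps growing.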

So, the dream is to get the best parts of both: Transformer-level scaling, richer internal memory, and hardware-friendly computation.

Lechner's both-and argument may be the most realistic one

Lechner's opening brought the debate a bit back down to Earth: his company, Liquid AI, wants the right building block for the task, the hardware, and the deployment target.

That matters because real AI systems rarely live in benchmark land. They run on phones, laptops, browsers, robots, medical devices, factory equipment, and cloud clusters with very different budgets.

For a cloud coding agent, meaning a model-driven system running on remote servers that can write, edit, and test code, the user may accept latency (the delay between asking and receiving a response) if the answer is better.

For an edge robot (meaning a robot running computation locally on or near the device rather than in a distant data center), latency, power, and reliability matter more than squeezing another point out of a benchmark. For genomics or biomedical signals, where the sequence might be DNA bases, protein signals, ECG traces, or sensor time series rather than words, sequence structure may reward architectures that look different from text-first Transformers.

This is the underrated practical case for Post-Transformers. The next architecture does not need to replace every Transformer everywhere.

It can win valuable niches first, like:

Long-sequence tasks where quadratic attention gets expensive.
Edge deployments where power and memory budgets dominate because you're running the model on a local device, sensor, robot, laptop, phone, or nearby gateway instead of sending every request to the cloud.
Scientific domains where data is scarce or shaped differently from text.
Reasoning workloads where compact latent computation beats long token traces. Latent computation means work done inside hidden vectors before it is written out as words.
Continual learning systems that update safely during use.

That is why models like Mamba became part of this broader conversation. Mamba is a selective state-space model: it carries a compact hidden state forward, and the update rules can depend on the current input so the model can choose what to remember or forget. Its authors reported fast inference (inference being the act of running the trained model), linear scaling in sequence length, meaning cost grows roughly in proportion to the sequence rather than with every token comparing to every other token, and strong results across language, audio, and genomics. Though TBH, from what I understand, most in the industry are less enthusiastic about Mamba than other potential alternatives.
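To show what "selective" means, here is a deliberately tiny gated recurrence. To be loud about it: this is not the Mamba algorithm. In Mamba the gating comes from learned projections of the input; here the `importance` scores are hand-supplied stand-ins so you can see the remember-or-forget behavior.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def selective_scan(inputs, importance):
    """One pass over the sequence with an input-dependent write gate."""
    state = 0.0
    for x, score in zip(inputs, importance):
        gate = sigmoid(score)  # near 1: overwrite with x; near 0: keep old state
        state = (1.0 - gate) * state + gate * x
    return state


# The first input is flagged as important; the filler after it is not.
inputs = [7.0, 1.0, 1.0, 1.0]
importance = [10.0, -10.0, -10.0, -10.0]

print(selective_scan(inputs, importance))  # stays very close to 7.0
```

The work is one update per token, which is where the linear scaling comes from.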

The broader point is bigger than one model. The Post-Transformer camp is really a bundle of bets: state-space models, recurrent variants, linear attention, gated architectures, liquid networks, biologically inspired systems, and hybrids that blur the lines.

Linear attention tries to redesign attention so the cost grows more gently with sequence length. Gated architectures use learned switches to decide which information flows through. Liquid networks change their dynamics with the input over time. Hybrids mix pieces from multiple families instead of pretending one block should do everything. The goal is to find the best mix of trade-offs that are economically valuable, not just intellectually satisfying.

The hardware lottery is the real referee

The best exchange came when the audience asked how AI can move beyond Transformers while the hardware world is optimized around them.

Jones called this the hardware lottery. The phrase comes from researcher Sara Hooker, whose 2020 essay describes how research ideas can win because they fit available hardware and software, rather than because they are inherently superior.

The Transformer benefited from this lottery because it mapped well to parallel matrix multiplication, the giant grid-of-numbers multiplication that GPUs are especially good at doing in bulk. That made it a perfect companion for GPUs and TPUs.

In case you forgot, GPUs are graphics processors repurposed for parallel AI math; TPUs are Google's tensor processing units, custom chips designed for neural-network workloads. Once the world invested billions into that path, alternative architectures had to fight uphill.

Kaiser's counterpunch was better than a simple defense. He pointed out that early Transformer serving (running trained models for real users rather than training them, also called inference) had its own hardware mismatch, including trouble with softmax on early TPUs. Softmax is the operation that turns raw scores into probabilities that add up to one, which attention uses to decide how much weight to place on each token (these are attention weights, computed fresh for each input, and not the same thing as the learned model weights). The Transformer also had to prove itself before hardware and systems caught up.

His bar for Post-Transformers was clear: the challenger may start 50x slower on today's hardware, but it needs a better scaling slope. The slope is how quickly the curve improves as you add more compute, data, or sequence length. A constant disadvantage can be engineered away. A worse slope cannot.
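A quick sketch shows why slope beats a constant handicap. We assume a simple power-law curve, loss = (effective compute)^(-exponent), and the exponents below are made up for illustration, not measured from any real model. Only the 50x slowdown comes from Kaiser's framing.

```python
def loss_at(compute, exponent, slowdown=1.0):
    """Toy power-law scaling curve; a slowdown wastes a constant factor of compute."""
    effective = compute / slowdown
    return effective ** (-exponent)


INCUMBENT_EXP = 0.05   # assumed slope for the incumbent
CHALLENGER_EXP = 0.07  # assumed steeper slope for the challenger

for compute in (1e3, 1e6, 1e9):
    incumbent = loss_at(compute, INCUMBENT_EXP)
    better_slope = loss_at(compute, CHALLENGER_EXP, slowdown=50.0)
    same_slope = loss_at(compute, INCUMBENT_EXP, slowdown=50.0)
    print(f"{compute:.0e}  incumbent={incumbent:.3f}  "
          f"50x slower, better slope={better_slope:.3f}  "
          f"50x slower, same slope={same_slope:.3f}")
```

Under these assumptions, the challenger with the steeper slope starts behind and overtakes the incumbent somewhere below a million units of compute, while the 50x-slower model with the same slope never catches up without engineering the constant away.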

That might be the cleanest test in the whole debate. A new architecture earns attention when it shows one of these curves:

Better performance at the same scale, data, and training budget.
Similar performance with much lower inference cost, meaning the price, time, energy, and hardware needed to run the model after training.
Better long-context learning as the task horizon grows. The task horizon is how many steps or how much time the model must reason across before the job is done.
Better reasoning reliability without long token-by-token traces.
Better adaptation from small amounts of new experience.

The hardware lottery cuts both ways. It can trap research in the fast lane it already knows. It can also force new ideas to be dramatically better before the ecosystem rebuilds around them.

The intelligence question became a benchmark question

The panel spent a surprising amount of time on definitions of intelligence, and that detour mattered.

At the intelligence round, Kaiser leaned toward the engineer's view: define intelligence by what the system can do. Kosowski described intelligence as a process, a way of solving things. Jones described intelligence as compression, meaning a system has learned structure when it can represent the same messy world with fewer bits while still predicting it well. Put another way: you can use abstractions to understand things efficiently without needing more brain power. Or, as the popular saying goes, the test of a good writer is how they can say a lot with as few words as possible. I'm working on it... lol!

Anyway, those definitions point toward different tests:

If intelligence is performance, Transformers look great.
If intelligence is process, the current stack feels incomplete.
If intelligence is compression, then perplexity, a measure of how surprised a model is by the next token, remains a surprisingly useful proxy (or indirect measure). Lower perplexity means the model assigned higher probability to the actual next token, so a system that predicts the world well has probably learned useful structure about the world, which is a proxy for its intelligence.
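Perplexity is easy to compute once you have the probability the model gave to each token that actually came next. A minimal sketch, with made-up probabilities standing in for a real model's outputs:

```python
import math


def perplexity(probs_of_actual_tokens):
    """exp of the average negative log-probability of the true next tokens."""
    avg_nll = -sum(math.log(p) for p in probs_of_actual_tokens) / len(probs_of_actual_tokens)
    return math.exp(avg_nll)


confident = [0.9, 0.8, 0.95]  # model usually expected the right token
confused = [0.2, 0.1, 0.3]    # model was usually surprised
coin_flip_4 = [0.25] * 4      # as unsure as a uniform guess among 4 options

print(perplexity(confident))
print(perplexity(confused))
print(perplexity(coin_flip_4))  # about 4: "choosing among 4 equally likely tokens"
```

That last line is the handy intuition: a perplexity of N feels like guessing uniformly among N options.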

That is why the benchmark segment was so revealing. Benchmarks are standardized tests for models, like math exams, coding tasks, or long-context retrieval tests. Kaiser argued that many labs already rely heavily on private holdout sets and perplexity, because public benchmarks get gamed or contaminated once released. Holdout sets are test examples kept away from training. Contaminated means the model may have seen the answers, or close copies, during training or evaluation prep. Jones agreed that a renewed focus on perplexity would be useful.

The open problem is that perplexity alone may miss the thing Post-Transformers care about most: learning over time.

Needle-in-a-haystack tests, for example, mostly measure retrieval. They ask whether the model can find a fact placed somewhere in a long context. That is useful, but it is far from human-like learning. A stronger test would force a model to infer a new rule, build hypotheses, revise them after evidence, and use that self-generated history later.
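For the curious, a needle-in-a-haystack test is structurally very simple, which is part of the critique. The sketch below builds the haystack and scores retrieval; `toy_model` is a hypothetical stand-in (a string search) where a real harness would call an actual model.

```python
def build_haystack(filler, needle, num_sentences, needle_position):
    """Bury one needle sentence among repeated filler sentences."""
    sentences = [filler] * num_sentences
    sentences.insert(needle_position, needle)
    return " ".join(sentences)


def toy_model(context, question_keyword):
    """Hypothetical stand-in for a model call: return the first matching sentence."""
    for sentence in context.split(". "):
        if question_keyword in sentence:
            return sentence.rstrip(".")
    return ""


def retrieval_accuracy(positions, num_sentences=50):
    """Fraction of needle positions where the answer comes back correctly."""
    needle = "The secret code is 4815."
    filler = "The sky was grey that day."
    hits = 0
    for pos in positions:
        context = build_haystack(filler, needle, num_sentences, pos)
        if "4815" in toy_model(context, "secret code"):
            hits += 1
    return hits / len(positions)


print(retrieval_accuracy([0, 25, 50]))
```

Notice that a plain string search aces it. Nothing here requires inferring a rule or revising a hypothesis, which is exactly the gap described above.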

Kosowski described the distinction well near the end. There is context you paste into a model, and there is context produced by the model's own life path: actions taken, reactions observed, lemmas built, mistakes revised, and skills internalized. Oh, and in case you didn't know and because I'm in an explanatory mood and always wondered this myself: lemmas are smaller helper facts or intermediate rules that make a larger proof or solution easier later.

That is the benchmark the field does not really have yet. It is also the benchmark that could make the Post-Transformer case much easier to evaluate.

Latent reasoning raises a safety question too

The final audience question asked whether latent-space reasoning creates safety problems. Latent space is the model's hidden numerical representation of meaning, the internal coordinate system it uses before producing words. If a model reasons in text, at least we can inspect the words. If it reasons in hidden vectors, lists of numbers that encode internal features, maybe we lose that window.

Kaiser's answer was a useful warning. Text chains are not the whole thought process. A Transformer already has layers of activations above every token, and we barely understand what happens inside them. Activations are the intermediate numbers each layer produces as it transforms an input into the next representation. The visible words may look faithful today, but the hidden computation can drift away from the explanation we see.

Jones offered the optimistic counter-narrative. A Post-Transformer architecture inspired by the brain could become more interpretable, meaning easier for researchers to inspect and explain, if it is designed around understandable state and dynamics (how the internal state changes over time as new input arrives) from the beginning.

That is a real fork in the road. Latent reasoning could make models faster, more compact, and more capable. It could also make their reasoning harder to audit unless interpretability is built into the architecture, training process, and benchmarks (which is possible, but not necessarily "bitter-lesson pilled").

The research behind the Post-Transformer camp

So I wanted to do a little compare and contrast of the so-called Post-Transformer folks, because the boxing-ring version of the debate made it sound like one fight: Transformer or Post-Transformer. The actual paper trail shows three different bets hiding inside that label.

As it turns out, Liquid AI, Sakana AI, and Pathway are all exploring alternatives to the default Transformer stack, but they are aiming at different weak spots:

Liquid is chasing practical edge deployment.
Sakana is chasing evolutionary and collective intelligence (which feels most promising to me personally, for reasons I'll get into below).
Pathway is making the most direct claim that a new architecture can bridge Transformers and brain-like models.

Let's break down all three of those and then compare and contrast them, shall we?

Liquid AI is betting on edge-first hybrids

Liquid neural networks began as a continuous-time modeling idea. Continuous-time means the model treats change as flowing over time, like a physical system, instead of only updating in fixed layer-by-layer jumps. Just saying, but most living intelligences probably do this as we have to predict reality in order to survive it, and in order to predict what happens next, you have to have a good sense of time changing.

The original Liquid Time-constant Networks paper treated neurons as dynamical systems whose behavior changes with the input. A dynamical system is something whose current state determines how it evolves next.
Closed-form Continuous-time Neural Models then made that family easier to run by replacing slow numerical solvers, which approximate equations step by step, with a closed-form approximation, meaning a direct formula-like computation that avoids most of that step-by-step solving.
Liquid-S4 pushed the same intuition into state-space sequence models.
And Liquid AI's newest work, the LFM2 technical report, describes a hardware-in-the-loop architecture search under edge latency and memory constraints.
- Hardware-in-the-loop means candidate model designs are tested against the real speed and memory limits of the target hardware during the search, rather than optimized only on paper.
- The result is a compact hybrid: mostly gated short convolution blocks, which use learned switches around local sliding-window filters, plus a small number of grouped-query attention blocks, which share key-value heads across groups of query heads to reduce memory and compute compared with full multi-head attention. If you barely understand that, same.
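The grouped-query part is easier to feel with numbers. The sketch below estimates KV cache size from head counts; the configuration is an invented example, not LFM2's actual layout, and the formula ignores implementation details like padding or paging.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Rough KV cache size: keys plus values, per layer, per KV head, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed toy config: 16 attention layers, 32 query heads, 64-dim heads, 4096 tokens.
full_multi_head = kv_cache_bytes(num_layers=16, num_kv_heads=32, head_dim=64, seq_len=4096)

# Grouped-query attention: the same 32 query heads share only 8 key-value heads.
grouped_query = kv_cache_bytes(num_layers=16, num_kv_heads=8, head_dim=64, seq_len=4096)

print(full_multi_head / 2**20, "MiB")  # 512.0 MiB
print(grouped_query / 2**20, "MiB")    # 128.0 MiB
```

Sharing key-value heads four ways cuts the cache by four, and using only a few attention blocks in the first place (as the hybrid described above does) shrinks the layer count in that formula too.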

What you do need to understand is that this supports Lechner's argument from the debate. Liquid is not trying to win a purity contest. It is asking which architecture works for the device, latency budget, and task in front of it. Respect.

The recent release cadence makes that strategy clearer:

LFM2 launched as an on-device foundation model family, meaning general-purpose base models meant to run locally on devices, with 350M, 700M, and 1.2B open-weight variants.
- Open-weight in this context means the trained model weights are released for others to download and run, though the exact license and training data openness can vary.
LFM2-24B-A2B scaled the same architecture into a sparse mixture-of-experts model with 24B total parameters and roughly 2B active parameters. Mixture-of-experts models contain multiple expert subnetworks and route each token through only some of them; sparse means most experts are inactive for any given token, so active parameters are the learned numbers actually used on that pass.
LFM2.5-350M kept the focus on small models, extended pretraining from 10T to 28T tokens, and added RL improvements.
- If you don't know, pretraining is the broad first training stage on large datasets, what we think of as passing a bunch of training data to the model to "train on"; RL means reinforcement learning, where the system is optimized using specific (and often "verifiable") rewards rather than only next-token prediction.
LFM2.5-VL-450M moved that edge-first philosophy into structured visual intelligence. A VL model is a vision-language model, so the model works across images and text; in this case, it can extract organized information from visuals, like reading charts, forms, layouts, or object relationships.
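The "total versus active parameters" idea from the mixture-of-experts item above can be sketched in a few lines. This is a generic top-k routing toy, not Liquid's actual router: the expert functions, the router scores, and `k=2` are all made up for illustration.

```python
import math

def top_k_route(router_scores, k):
    """Return the indices of the k highest-scoring experts and their softmax weights."""
    order = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    chosen = order[:k]
    exps = [math.exp(router_scores[i]) for i in chosen]
    total = sum(exps)
    return chosen, [e / total for e in exps]

def moe_layer(x, experts, router_scores, k=2):
    """Run only the chosen experts and mix their outputs; every other expert stays idle."""
    chosen, weights = top_k_route(router_scores, k)
    output = sum(w * experts[i](x) for i, w in zip(chosen, weights))
    return output, chosen

# Eight toy "experts", each just a scalar multiply (hypothetical stand-ins for subnetworks).
experts = [lambda x, m=m: m * x for m in range(1, 9)]
scores = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 0.2]  # made-up router scores for one token
output, active = moe_layer(1.0, experts, scores, k=2)
print(active, output, len(active) / len(experts))
```

Only two of the eight experts do any work for this token, so the layer "owns" eight experts' worth of parameters but pays for two on each pass.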

The important nuance to remember here: Liquid's current frontier is not a clean break from attention. It is a carefully chosen hybrid. That may be exactly what a real Post-Transformer transition looks like at first: less manifesto, more architecture search. We don't really understand how today's closed-source frontier labs work under the hood, so for all we know, they might be running their own kind of hybrid systems.

Sakana AI is betting on evolution, collectives, and time

Sakana AI's work is not necessarily one model family but a collection of unique ideas (it is, as best we can tell, one of the truest pure-research labs out there). Llion Jones's company is building around a broader idea: nature-inspired AI systems that search, merge, coordinate, and improve in ways that look less like one frozen model answering prompts.

The clearest architecture-level example is Continuous Thought Machines. CTMs add an internal time dimension, meaning the model can run multiple internal update steps before returning an output, let neurons process histories, meaning each unit can use a short record of prior signals rather than only its latest input, and use synchronization between neuron dynamics as the representation. Synchronization in this context means groups of units falling into coordinated timing patterns; representation means the internal code the model uses to stand for the thing it has understood. In plain English: the model gets room to think internally before producing an answer, rather than turning every step into visible language. Here's our full explainer.
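Here is a deliberately tiny toy of the three ingredients described above: internal ticks, per-unit histories, and synchronization as a readout. It is not Sakana's CTM. It leaves out the learned weights and the connections between units entirely, and every name in it (`run_internal_ticks`, `history_len`, and so on) is hypothetical.

```python
import math

def run_internal_ticks(inputs, n_ticks=8, history_len=3):
    """Toy loop: each unit updates from its own short history for n_ticks before any output."""
    n = len(inputs)
    histories = [[0.0] * history_len for _ in range(n)]
    traces = [[] for _ in range(n)]  # every unit's activation at every internal tick
    for _ in range(n_ticks):
        for i in range(n):
            # The unit sees its input plus an average of its own recent activations.
            pre = inputs[i] + sum(histories[i]) / history_len
            act = math.tanh(pre)
            histories[i] = histories[i][1:] + [act]
            traces[i].append(act)
    return traces

def synchronization(traces):
    """Pairwise average product of activation traces: high when two units move together."""
    n, ticks = len(traces), len(traces[0])
    return [[sum(a * b for a, b in zip(traces[i], traces[j])) / ticks for j in range(n)]
            for i in range(n)]

traces = run_internal_ticks([0.5, 0.5, -0.5])
sync = synchronization(traces)
print(sync[0][1] > 0, sync[0][2] < 0)
```

Units 0 and 1 get the same input and move in lockstep (positive synchronization), while unit 2 moves the opposite way (negative). The point of the toy is that the synchronization table, rather than any single activation, is what a downstream layer would read.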

Then there is Sakana's self-improvement track. Darwin Gödel Machine uses foundation models, which are broad base models that can be adapted to many tasks, to rewrite and evaluate its own code, preserving a lineage of successful variants rather than committing to one path. The reported gains were striking: SWE-bench, a benchmark where agents fix real GitHub issues in software repositories, improved from 20.0% to 50.0%, while Polyglot, a multi-language coding benchmark, improved from 14.2% to 30.7%.
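The "preserve a lineage rather than commit to one path" idea can be sketched as an archive-based search loop. This is a toy under loud assumptions: a "variant" is just a number, `evaluate` is a made-up scoring function standing in for a benchmark run, and parents are picked uniformly at random, which is not necessarily how the Darwin Gödel Machine samples them.

```python
import random

def evaluate(variant):
    """Hypothetical score in [0, 1]; a stand-in for running a coding benchmark."""
    return max(0.0, 1.0 - abs(variant - 0.8))

def evolve(generations=200, seed=0):
    rng = random.Random(seed)
    # Each entry records the variant, its score, and the index of its parent (the lineage).
    archive = [{"variant": 0.0, "score": evaluate(0.0), "parent": None}]
    for _ in range(generations):
        parent_idx = rng.randrange(len(archive))  # any archived variant may be a parent
        # Small random nudge as a stand-in for "the agent rewrites its own code".
        child = archive[parent_idx]["variant"] + rng.uniform(-0.1, 0.1)
        archive.append({"variant": child, "score": evaluate(child), "parent": parent_idx})
    return archive

archive = evolve()
best = max(archive, key=lambda entry: entry["score"])
print(len(archive), best["score"] >= archive[0]["score"])
```

Nothing is ever deleted from the archive, so a mediocre variant can still become the ancestor of a later winner. That is the design choice that separates this from plain hill climbing, which only ever keeps the current best.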

Sakana's 2026 work widened that loop from models improving code to models coordinating other models:

Fugu packages Sakana's multi-agent orchestration into a commercial API beta (an early developer interface for using it in your own software). Multi-agent orchestration means assigning subtasks to several model-driven agents and coordinating their communication, checks, and handoffs.
Conductor trains a model to manage other models in natural language, deciding when to use a coder, planner, verifier, or direct answer. In this context, a verifier is the checking agent that tries to catch errors before the final output.
Trinity uses a lightweight evolved coordinator to route work across LLMs, with fewer than 20K learnable parameters in the coordinator head (which is the small output module that decides the next routing action, and learnable parameters are the numbers updated during training).
Sparser, Faster, Lighter Transformer Language Models, built with NVIDIA, shows Sakana still cares about making Transformers cheaper while it explores farther-out ideas. Very practical of you, Llion!

The reason I find this idea of creating collective intelligences through orchestration so compelling is that we humans are made up of many individual cells, all coordinating and working together with their own pre-programmed processes from each cell's individual genes.

Yes, we have a brain (the orchestrator, if you will), but also, each individual cell has its own micro functionality inside us that works independently (at least as far as I know) from our "conscious, intelligent mind"... and that, in and of itself, is its own form of intelligence, as simple as it might be. Therefore, could each of us ultimately be our own individual form of collective intelligence? And therefore, would the most nature-inspired architecture be one that takes that collective intelligence idea seriously? I would think so, but I'm neither a neuroscientist nor an AI researcher, so I defer to what they both come up with!

Finally, the most famous Sakana project is still The AI Scientist. The v2 system generated ideas, wrote code, ran experiments, and produced papers with less human templating than the original version. Sakana later said related work was published in Nature, which makes the project hard to ignore.

Sakana's own writeup lists limitations around rigor, hallucinations, and complex code. An independent evaluation of the earlier AI Scientist system found failed experiments, weak novelty assessment, and hallucinated results. That does not make the research unimportant. It makes it a better example of the real frontier: automated discovery loops, systems that generate hypotheses, run tests, read results, and propose the next experiment, are becoming possible before they are fully trustworthy.

You can kinda think of the AI Scientist as a precursor to (or twin of) Andrej Karpathy's Autoresearch; or you can think of Autoresearch as the simplest possible version of this same idea.

Pathway's BDH is the most direct challenger

Pathway's Dragon Hatchling paper is the closest thing in this debate to a direct architecture fight. It proposes a scale-free graph of locally interacting “neuron particles,” meaning small computational units connected in a brain-like pattern with a few highly connected hubs and many smaller nodes. Its working memory uses Hebbian-style plasticity, the “neurons that fire together wire together” idea: connections strengthen when connected units are active together. Working memory means information kept available during the current task, not permanently stored knowledge.
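The "fire together, wire together" rule is simple enough to show directly. This is the plain textbook Hebbian update, not BDH's actual equations; the function name and learning rate are assumptions for illustration.

```python
def hebbian_update(weights, pre, post, lr=0.1):
    """Strengthen weights[i][j] when input unit j and output unit i are active together."""
    return [[w + lr * post[i] * pre[j] for j, w in enumerate(row)]
            for i, row in enumerate(weights)]

weights = [[0.0, 0.0], [0.0, 0.0]]
pre = [1.0, 0.0]    # only the first input unit fires
post = [0.0, 1.0]   # only the second output unit fires
weights = hebbian_update(weights, pre, post)
print(weights)  # [[0.0, 0.0], [0.1, 0.0]]
```

Only the connection between the two co-active units changes. Because the update happens while the model is processing the current input, the weights themselves act as a scratchpad, which is the sense in which this counts as working memory rather than stored knowledge.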

BDH is presented as formulated in a GPU-friendly way for language modeling. Scale-free means the connection graph has many small-degree nodes and a few highly connected hubs, a pattern seen in some biological and web-like networks. Language modeling is the training task of predicting text, usually by predicting the next token from prior context.

The headline claim is ambitious: BDH reportedly shows Transformer-like performance and scaling laws on language and translation tasks from 10M to 1B parameters. The paper also argues that BDH brings brain-like ingredients into the LLM world, including working memory through Hebbian-style synaptic plasticity, sparse positive activations, and more interpretable internal state.

The PageRank analogy from Kosowski makes sense in this context: BDH is not merely a faster attention variant, it is an attempt to state an underlying process for intelligence, then implement it in a form that can still run on modern hardware.

That is also why BDH carries the biggest burden of proof. If the paper's scaling story holds up at larger sizes, it becomes a serious Post-Transformer candidate. If it only looks elegant at smaller scales, the bitter lesson (Richard Sutton's argument that general methods that scale with compute tend to beat clever hand-designed structure) gets another data point with which to flex on the industry.

Btw, if you're interested in learning more about BDH, one of our most popular podcast episodes was our interview with Pathway's Zuzanna Stamirowska; check it out below!

Where the Post-Transformer bets actually differ

So, how do all these methods compare and contrast with each other? Let's put it all together and come up with a nice shorthand for each, shall we?

1. Pathway's BDH

Where it overlaps: BDH still wants Transformer-like language performance and scaling. It is less “burn everything down” and more “keep sequence modeling, but rebuild the memory and state underneath.”
Where it differs: It makes state, sparse positive activations, and interpretability central. State means the information the model carries forward while processing a task.
Best fit for your test: Same-scale performance and compact reasoning. The paper claims GPT-2-like performance on language and translation tasks from 10M to 1B parameters, using the same training data.
Main caveat: BDH carries the biggest burden of proof. It needs independent replication and results beyond the 1B-parameter range before the field treats it as a real replacement path.

2. Liquid AI's deployment-first hybrid bet, Liquid neural networks and Liquid-S4 time-dynamics bets

Liquid’s LFM2 is the most practical answer to the debate. It combines gated short convolutions, meaning local sliding-window filters with learned switches, with a small number of grouped-query attention blocks, which reduce memory and compute by sharing key-value heads across groups.
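The memory saving from sharing key-value heads is easy to check with back-of-envelope arithmetic. The configuration below is hypothetical and is not LFM2's published layout; it only shows how the key-value cache shrinks when fewer KV heads are stored.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Keys plus values stored for every attention layer, KV head, and cached token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model: 16 attention layers, 64-dim heads, 32K tokens of context, 16-bit values.
full_mha = kv_cache_bytes(layers=16, kv_heads=32, head_dim=64, seq_len=32_768)
grouped = kv_cache_bytes(layers=16, kv_heads=8, head_dim=64, seq_len=32_768)
print(full_mha // 2**20, grouped // 2**20)  # 4096 1024 (MiB)
```

Cutting 32 KV heads to 8 cuts the cache to a quarter. In a hybrid where only a few blocks use attention at all, `layers` counts just those blocks, so the saving compounds.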

Where it overlaps: Liquid shares Mamba’s state-space lineage and shares the Transformer camp’s willingness to use attention when it helps.
Where it differs: Liquid starts from real deployment constraints: phones, laptops, CPUs, memory limits, and latency. It is architecture search with hardware in the loop, meaning candidate designs are tested against real device limits during the search.
Best fit for your test: Lower inference cost. LFM2 reports up to 2x faster prefill and decode on CPUs versus similar-size models, with open weights and deployment packages for edge use.
Main caveat: Liquid looks like the most useful near-term Post-Transformer transition, but it is a hybrid. It proves that alternatives can reshape the stack, rather than proving attention disappears.

The older Liquid line matters because it explains Lechner’s worldview. Liquid Time-constant Networks model neurons as continuous-time dynamical systems, meaning their behavior evolves like a physical process over time. Closed-form Continuous-time Neural Models made those systems much faster by avoiding slow numerical solvers. Liquid-S4 then pushed Liquid dynamics into long-range sequence modeling.

Where it overlaps: It shares the state-space and biological-dynamics instinct with Mamba, CTM, and BDH.
Where it differs: Liquid’s original bet is time itself: the model’s internal dynamics should adapt as the input changes, instead of treating every step like the same fixed layer recipe.
Best fit for your test: Long-context and non-text signals. Liquid-S4 reports strong Long-Range Arena results, 87.32% average performance, plus 96.78% on raw Speech Commands with fewer parameters than S4.
Main caveat: This line is strongest for time series, signals, and edge robotics-style data. It is less proven as a direct frontier language-model replacement.

3. Sakana's Continuous Thought Machines, DGM, Conductor, and Trinity search-and-orchestration bets

Sakana’s Continuous Thought Machine, or CTM, adds internal time to the model. Neurons process short histories of incoming signals, then use synchronization, meaning coordinated timing patterns, as part of the model’s internal representation.

Where it overlaps: CTM shares BDH and Liquid’s brain-inspired direction: more internal dynamics, more state, and less dependence on writing everything as text.
Where it differs: CTM is the most direct answer to Llion Jones’s complaint that Transformers reason through language. It gives the model room to think internally before answering.
Best fit for your test: Reasoning reliability without long token-by-token traces. The paper reports strong results on tasks like ImageNet-1K classification, 2D mazes, sorting, parity, question-answering, and reinforcement-learning tasks.
Main caveat: CTM was presented as a research step, not a state-of-the-art land grab. Its promise is architectural direction, not proof of frontier-scale dominance.

Sakana’s broader work often changes the system around the model instead of replacing the model block itself. Darwin Gödel Machine uses foundation models to rewrite and test its own coding-agent code. AI Scientist-v2 runs an automated research loop. Conductor learns to coordinate multiple models in natural language. Trinity evolves a lightweight coordinator that routes work among models.

Where it overlaps: These projects share the Bitter Lesson spirit: use search, learning, feedback, and evaluation loops to create better systems.
Where it differs: This is system architecture, not base-model architecture. The intelligence comes from orchestration, meaning one controller routing work among other models, plus self-improvement loops that test and keep successful changes.
Best fit for your test: Adaptation from small amounts of new experience. DGM improved SWE-bench from 20.0% to 50.0% and Polyglot from 14.2% to 30.7% by preserving successful variants and exploring new ones.
Main caveat: The strongest Sakana systems still ride on existing foundation models. They may be the path to discovering a new architecture, rather than the architecture itself.

Sakana and NVIDIA’s sparse Transformer work shows the Transformer side can absorb the efficiency critique. Sparsity means using only a small subset of weights or neurons at once, instead of activating everything.
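A minimal sketch of what "only a small subset at once" buys you in a feedforward layer. It assumes the sparsity lives in the hidden activations (units that a ReLU zeroes out), which is this sketch's assumption rather than a description of the paper's method. The weights are toy values.

```python
def relu(vector):
    return [max(0.0, x) for x in vector]

def sparse_ffn(x, w_in, w_out):
    """Feedforward layer that skips every hidden unit whose ReLU activation is zero."""
    hidden = relu([sum(w * xi for w, xi in zip(row, x)) for row in w_in])
    active = [j for j, h in enumerate(hidden) if h > 0.0]
    # Only rows of w_out that belong to active hidden units are ever touched.
    out = [sum(hidden[j] * w_out[j][k] for j in active) for k in range(len(w_out[0]))]
    return out, len(active) / len(hidden)

x = [1.0, -1.0]
w_in = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]   # 4 hidden units
w_out = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
out, active_fraction = sparse_ffn(x, w_in, w_out)
print(out, active_fraction)  # [1.0, 2.0] 0.25
```

In plain Python the skip is free. On a GPU it is not, because hardware prefers dense, regular work, and that is why the custom kernels mentioned in the list that follows matter as much as the sparsity itself.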

Where it overlaps: It targets the same cost problem as Liquid and Mamba.
Where it differs: It preserves the Transformer and makes the expensive feedforward layers sparser, then uses CUDA kernels, low-level GPU routines, to make sparse computation actually run fast.
Best fit for your test: Lower inference cost inside the existing stack. The paper reports over 99% sparsity with negligible downstream impact, plus throughput, energy, and memory benefits that increase with scale.
Main caveat: This is the reason Post-Transformer arguments are hard. The incumbent can keep absorbing good ideas without giving up the architecture label.

So who is actually closest to passing the Post-Transformer test?

If we apply the test from the debate, then a new architecture earns attention only when it bends at least one of these five curves:

Better performance at the same scale, data, and training budget.
Similar performance with lower inference cost.
Better long-context learning as the task horizon grows.
More reliable reasoning without long token-by-token traces.
Or better adaptation from small amounts of new experience.

Obviously, none of the architectures we just mentioned clears all five yet (or we'd be having a much different conversation!). But here's where it appears they all stand based on what we know about them right now (data is limited here, so this is mostly a vibe check; we'd be curious to hear more from the labs themselves and their own tests when available):

BDH / Dragon Hatchling is the closest direct Post-Transformer architecture challenger. It best fits the same-scale performance test because the paper claims Transformer-like scaling from 10M to 1B parameters on language and translation tasks with the same training data. It also aims directly at state, working memory, latent reasoning, and interpretability. The gap is scale proof: BDH needs larger independent results before it can claim the belt.
Liquid / LFM2 has the clearest path to product impact. It best fits the lower-inference-cost test because LFM2 is built around real deployment constraints: latency, memory, power, edge devices, and non-text data. Its older liquid and state-space lineage also fits long-range sequence problems better than plain attention in some domains. The trade-off is that Liquid looks more like a practical hybrid deployment recipe than a clean new theory of intelligence. Not a big problem if you're just looking for raw performance I guess?
Sakana CTM is the strongest internal-reasoning bet. It best fits the “reasoning without long text traces” test because Continuous Thought Machines add internal time, letting the model keep computing before it turns that computation into words. That maps directly to the debate’s latent-reasoning question. The caveat is maturity: CTM is a promising research direction, not a proven backbone for modern systems; we're curious if Sakana is scaling this or testing it in new ways... let us know if so!
Sakana’s Darwin Gödel Machine, AI Scientist-v2, Conductor, and Trinity are the strongest search-and-adaptation story. They best fit the small-experience adaptation test because they improve through loops: propose, test, evaluate, coordinate, verify, preserve what works, and try again. That is also the most bitter-lesson-aligned path here because it scales through search and learning. The trade-off is category: these are system architectures, not direct Transformer replacements. This also means we could potentially deploy them in existing ecosystems, though.
Sakana's Sparse Transformers are the best near-term efficiency improvement inside the current paradigm. They fit the lower-cost curve by making Transformer feedforward layers much cheaper, with the paper reporting over 99% sparsity and negligible downstream-performance loss. That could matter commercially, but it strengthens the Transformer stack rather than replacing it. Call this a Transformer+ model?

Judged by the debate’s own test, the Post-Transformer camp has partial answers, but not a clean winner yet. BDH best fits the architecture fight. Liquid best fits real-world deployment. Sakana best fits the bitter lesson at the system level because its work scales through search, coordination, and learning. The missing proof is one architecture that clears all five curves at once. The likely near-term future is hybrid: Transformer systems absorb state, sparsity, routing, internal-time computation, and self-improvement loops until the label “Transformer” tells us less than the actual system design underneath it.

The paper trail, if you want to go deeper

Here are the papers and official project write-ups for the latest company releases of all the key contenders if you want to read more about any of these transformer-alternative ideas:

Pathway / BDH: Dragon Hatchling arXiv and PDF.
Liquid AI: LFM2 Technical Report, PDF, Liquid Time-constant Networks, Closed-form Continuous-time Neural Models, and Liquid-S4.
Sakana AI architecture work: Continuous Thought Machines, PDF, Darwin Gödel Machine, and PDF.
Sakana AI agent and orchestration work: AI Scientist-v2, PDF, Conductor, PDF, Trinity, PDF, Sparser, Faster, Lighter Transformer Language Models, and PDF.

Our take: the next architecture has to beat the slope

The crowd gave Transformers the win, and that was probably right... for now. The Transformer stack still has the strongest evidence behind it, the largest ecosystem, and the clearest path to more capability (scale is all you need.... or is it data? Hell, it's probably both).

The Post-Transformer side still made the more interesting future-facing argument. The current stack has obvious pressure points: inference cost, long-horizon memory, continual learning, native reasoning, data efficiency issues, and more.

The likely future is messy. Transformers at the big labs may simply absorb Post-Transformer ideas. Post-Transformer systems may start as specialist systems until someone really big makes a massive bet on them at scale (and at this point, there still doesn't seem to be any momentum behind doing that... which seems foolish, given all the trade-offs that we just discussed).

In the end, hybrid models may make the naming debate feel quaint. Lechner's blurry-border argument may age the best: a Transformer with a tiny KV cache and a recurrent model with a massive state start to look like two ends of the same design space. Will everything eventually just converge?

The credible counter-narrative is the bitter lesson. Richard Sutton's bitter lesson says general methods that scale with compute tend to beat clever human-designed structure over time. That worldview favors the Transformer until a challenger proves better scaling, not merely better elegance.

For Richard's part, he just posted a compressed version of the bitter lesson on X: do not organize the field around human knowledge; organize it around methods that can create knowledge as compute grows, especially search and learning.

In normal-person terms, that roughly translates to: the winning AI strategy is usually the one that gets better when you give it more machines, more time, and more experiments, rather than the one that starts with the most elegant human theory of how intelligence should work.

That is the cleanest way to judge the Post-Transformer camp. BDH, Liquid, Sakana, and future alternatives do not win because they feel more brain-like, more biologically plausible, or more philosophically satisfying. They win only if their learning curves keep improving when the system gets bigger, when their memory gets longer, when their search gets deeper, or when their self-improvement loop gets more chances to run.

It also keeps the Transformer side honest. The bitter lesson does not crown attention as king forever. It crowns the best method that converts more computation into more capability with the least hand-built human scaffolding.

If that remains dense Transformers plus bigger clusters, Kaiser wins by Sutton's own rulebook. If it becomes state, recurrence, evolutionary search, or BDH-style local interaction at scale, the Post-Transformer side wins.

We want to end with this key distinction: researchers clearly can come up with something better (meaning more "bitter-lesson pilled", if you will) than what we have today. Clearly it is possible.

Even if the human researchers use AI to help them do it. Even if the researchers are ultimately the AIs themselves. Believing the current system is perfect as it stands is frankly too absolute a point of view for us (just like believing that today's Transformer-based architectures are useless or somehow a dead end is also too absolute a view, and clearly not true).

Perhaps many better architectures have already been created, and have just not been tested at scale yet. So the real question is whether a pioneering researcher can show a curve that bends harder, on a task that matters, under constraints that real systems actually face, in such a way that it bends the will of an entire industry along with it.

If, or as we believe, when that happens, the hardware vendors, benchmark designers, frontier labs, and product teams will start to vote with their roadmaps. Much like Peter Steinberger changed the entire AI industry around his vision for agents with OpenClaw, it's very possible someone will be able to change the trajectory of these megayachts before they run aground on the many lurking icebergs that threaten the modern AI industry with uncontrollable memory, energy, and cost to run these models the way users actually want to use them (in everything, everywhere, all the time... so long as the costs are trivial and the results are good).

Now, for the full debate and all the key moments listed out with time codes, see below.

The frame and who was arguing what

(00:03) The event opens by framing the debate as the start of the Post-Transformer era, with the Wall Street Journal’s prediction used as the setup for the whole fight.
(01:08) Zuzanna Stamirowska and Dexter Horthy frame the question as deeper than model taste: the math behind AI now shapes trillion-dollar markets and the future of the field.
(01:33) Team Post-Transformer is introduced as the camp arguing for dynamical systems, latent reasoning, continual learning, and whatever comes after attention as we know it.
(01:49) Llion Jones is introduced as the rare challenger who helped build the original Transformer and now argues that something after it is needed.
(02:34) Mathias Lechner is introduced as the liquid neural networks and dynamical systems voice, representing Liquid AI’s architecture-first pragmatism.
(03:18) Adrian Kosowski is introduced as the BDH architecture voice, arguing for a more direct route to reasoning.
(03:51) Łukasz Kaiser is introduced as the reigning Transformer defender, a co-inventor of the architecture that changed modern AI.
(04:40) The format is set: opening cases, rebuttals, quick-punch rounds, audience questions, closing statements, and a crowd vote.

Opening case: Łukasz Kaiser for Transformers

(06:37) Łukasz starts by noting that inventors often become their own harshest critics, comparing his earlier skepticism about Transformers to Geoffrey Hinton’s skepticism about backpropagation.
(07:15) His core case is empirical: an extremely simple next-token machine now chats, writes code, clicks around computers, and is starting to do real work.
(07:39) He lands the simplest Transformer argument: it works, and many rival systems still do not work as well in practice.
(08:02) He asks the audience to think of the Transformer as memory rather than attention, which reframes it as a system for writing and retrieving information.
(08:18) His library-card analogy explains keys and values: a librarian writes a searchable key on a card, then stores the value that tells you where to find the relevant book or page.
(08:51) In Transformer terms, new information creates a key and a value; later queries retrieve the most similar key and return the corresponding value, using a soft differentiable version of that lookup.
(09:16) He argues that this is a fundamental form of memory because the model can keep concatenating information as context grows.
(09:32) He treats many modern additions as patches around a strong core: chain-of-thought for reasoning, compaction for long context, and mixture-of-experts when more parameters are needed.
(09:47) His opener ends with the positive Transformer case: the architecture is simple, beautiful, and still works.
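Kaiser's library-card picture maps onto a few lines of code. This is a minimal single-query sketch of scaled dot-product attention as a soft lookup; the keys, values, and query are toy numbers chosen so that one card clearly matches.

```python
import math

def soft_lookup(query, keys, values):
    """Soft dictionary lookup: weight every stored value by how well its key matches the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    result = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return result, weights

keys = [[10.0, 0.0], [0.0, 10.0]]    # two "library cards"
values = [[1.0, 0.0], [0.0, 1.0]]    # what each card points to
result, weights = soft_lookup([10.0, 0.0], keys, values)
print(weights[0] > 0.99, [round(r, 3) for r in result])
```

The query matches the first card almost perfectly, so nearly all the weight lands on its value. Adding memory is just appending another key and value to the lists, which is the "keep concatenating as context grows" point, and also why the cache gets bigger with every token.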

Opening case: Adrian Kosowski for Post-Transformers and BDH

(10:10) Adrian says he is making the case for Post-Transformers, but even more for intelligence itself.
(10:20) He defines intelligence as the ability to solve hard problems, especially problems the system has not seen before.
(10:49) He argues that humans are no longer the only apparent intelligent problem-solvers; humans and Transformers now both solve millions of hard problems every day.
(11:14) His Post-Transformer thesis is that having multiple examples of intelligence should push the field to search for the common theme behind intelligence.
(11:52) He names the Transformer’s weaknesses as continual learning, long-term memory, and reasoning in latent space without complex hacks.
(12:29) He uses web search as an analogy: the 1990s had many search engines, then PageRank plus MapReduce gave the field a unifying mathematical theme.
(13:21) His prediction is that intelligence still needs its PageRank moment: one clear theme, equation, or process that makes intelligence scale more directly.
(13:42) He positions BDH as an early clue: latent reasoning in high-dimensional spaces, combining state-space ideas with sequence processing.
(14:14) His final frame is that Post-Transformers are the search for the hidden motif of intelligence, rather than a claim that Transformers are useless.

Opening case: Mathias Lechner for architecture pluralism

(14:40) Mathias rejects the binary framing. Liquid AI’s answer is Transformers and Post-Transformers.
(15:07) He says model design should start from hardware, use cases, capabilities, and speed constraints rather than architecture ideology.
(15:24) He cites Liquid’s GPT-3-level language model running on a Raspberry Pi at about 40 tokens per second as evidence that different requirements call for different designs.
(15:54) He welcomes every useful building block: new attention mechanisms, DeepSeek-style multi-head latent attention, compressed attention, state-space models, Mamba, gated linear attention, and convolutions.
(16:20) His broader point is that research is dynamic. Requirements change, hardware changes, and agents may accelerate discovery of new layers and model variants.
(16:55) He contrasts Raspberry Pi deployment with a future NVIDIA Vera Rubin stack drawing megawatts of power, making hardware context central to model design.
(17:13) Liquid’s practical stance is to draw from all available architectures and build the best model for the current constraints.

Opening case: Llion Jones against architectural complacency

(17:41) Llion opens by joking that if Łukasz disagrees with you, that is usually a sign you may be wrong.
(18:18) He concedes that if he were inside OpenAI, he might also defend Transformers, because the leading company has economic reasons to keep improving the thing that works.
(19:17) He argues that startups have a different job: take long bets on what comes after the current dominant architecture.
(19:30) He points out that OpenAI itself once played that challenger role by noticing that Transformers scaled better than other approaches.
(20:02) He admits that scaling Transformers might reach artificial superintelligence, but says his intuition is that there is a better route.
(20:18) He calls Transformers elegant but brute force because they require immense amounts of data and compute.
(20:36) His proof-of-concept for a better architecture is the human brain: humans do not read the entire internet several times to become intelligent.
(21:07) He argues that current reasoning gains feel hacky because a program is wrapped around the Transformer to feed outputs back into it.
(21:30) His challenge is direct: if Transformers were truly powerful enough, they should learn to reason natively instead of relying on external scaffolding.
(21:55) His biggest warning is that Transformer success may be keeping researchers trapped in a local minimum.
(22:25) His prediction is that once a better architecture appears, even Transformer defenders will move to it because performance will force the switch.

Rebuttals: what each side pushed back on

(23:35) Łukasz says the opposing arguments are strong, but compares them to earlier critiques of backpropagation: brute-force methods can keep winning even when they feel biologically ugly.
(24:05) He pushes back on the small-model-on-small-hardware dream by saying he personally picks the best model at the highest thinking budget every time he uses AI.
(24:28) He says recurrent architectures are beautiful, but the Transformer can already be viewed as a recurrent system with a very simple memory.
(24:57) He reports that a small GRU ran about 50x slower than a much larger Transformer on current NVIDIA hardware because recurrence is sequential.
(25:27) He acknowledges hardware shaped the race: without Transformers, hardware might have evolved to run loops faster, but parallel hardware is easier to build than deeply sequential hardware.
(25:51) He concedes the human brain’s data efficiency and generalization are far beyond what current ML explains.
(26:02) His counter is that current machine learning methods and hardware still do not provide a clear replacement path.
(26:27) He closes the rebuttal with a twist: the Transformer itself may be the system that discovers its replacement.
(26:36) Adrian responds that backpropagation is excellent for learning, but reasoning is different because it unrolls many steps over time.
(27:26) His technical point is that reasoning needs long algorithmic paths, while pushing gradients back through those same long paths during learning creates problems.
(28:14) He argues that architecture can shift the depth-versus-recursion compromise to reduce the need for long chains of thought.
(28:23) He says Transformers think in language and memorize thoughts in language, while Post-Transformer systems should reason in latent thought more directly.
(28:59) Adrian clarifies that Team Post-Transformer is not the RNN club. He says better state-propagating designs can still use matrix multiplications and GEMMs effectively.
(29:30) His compact diagnosis of old RNNs: they have too little state, while the brain has a lot of state.
(29:50) Mathias says the border between Transformers and RNNs can become blurry: a Transformer with a tiny KV cache and an RNN with a massive state start to look philosophically similar.
(30:43) He points to fast weights as another kind of memory storage, especially when operating under a fixed budget.
(31:23) His forecast is that autonomous AI agents and the current Transformer-driven progress may help discover the Transformer’s own replacement.
(32:03) Llion says the immediate return to RNN talk proves his point: the field is still trapped in the current paradigm.
(32:13) He argues that researchers misunderstand the Transformer breakthrough when they think the next leap will come from shuffling the same components again.
(32:49) His view of the real Transformer breakthrough is hardware fit: token processing became thousands of times faster, and that optimization cannot simply be repeated.
(33:28) He urges researchers to question hidden assumptions about neural networks, including whether future systems can even be trained by backpropagation.
(34:11) He describes the field’s cognitive dissonance: everyone knows breakthroughs happen, but everyone is still surprised when they arrive.
(34:24) His memorable forecast: Łukasz may be right until the day a new architecture breaks through, and then he will be wrong forever.
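
To make the 24:28 and 29:50 points a bit more concrete, here is a minimal Python sketch (ours, not something shown at the debate) of the two kinds of memory being argued about. The function names, the toy 2-dimensional tokens, and the 0.5 decay are all illustrative assumptions. The attention step carries a KV cache that grows by one entry per token, while the toy recurrence carries a state whose size never changes.

```python
import math

def attention_step(cache, q, k, v):
    """One decoding step: append (k, v) to the cache, then attend over all of it."""
    cache = cache + [(k, v)]                      # the state grows by one entry per token
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / scale for key, _ in cache]
    peak = max(scores)                            # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    out = [sum(w * val[i] for w, (_, val) in zip(weights, cache)) / total
           for i in range(len(v))]
    return cache, out

def rnn_step(state, x, decay=0.5):
    """One step of a toy fixed-size recurrence (a stand-in, not a real GRU)."""
    return [decay * s + (1.0 - decay) * xi for s, xi in zip(state, x)]

# Hypothetical toy sequence: three 2-dimensional token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache, state = [], [0.0, 0.0]
for x in tokens:
    cache, out = attention_step(cache, q=x, k=x, v=x)
    state = rnn_step(state, x)

kv_floats = sum(len(k) + len(v) for k, v in cache)  # memory that scales with sequence length
print(len(cache), kv_floats, len(state))
```

Both loops have the same shape: state in, state out. What differs is how much gets carried forward, which is roughly the blurry border Mathias describes.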

Quick punch round: what is intelligence?

(35:36) The first quick-punch round asks the speakers to define the nature of intelligence.
(35:50) Łukasz says the reinforcement-learning definition, the ability to act in the world and get desired outcomes, is useful but not satisfying as a research definition.
(36:30) He says researchers feel there is something deeper to intelligence, but engineers still need to define it by observable system behavior.
(37:09) Adrian argues that intelligence is not a thing or a product. It is a process, like the difference between a Toyota car and the Toyota production process.
(37:37) He frames intelligence as an algorithmic or dynamical process that solves things, rather than a static object.
(37:55) He returns to PageRank as a tiny chunk of intelligence: a process for indexing information effectively.
(38:44) He says Transformers and humans are both embodiments of intelligence, but the underlying process remains unquantified.
(39:14) Llion says the current AI field has landed on a strange but useful proxy for intelligence: how well a system predicts the next word on the internet.
(39:23) He defines intelligence as compression: the better you compress the internet, the more intelligent the model appears.
(39:54) Mathias says intelligence is hard to quantify and may include cultural or subjective components.

Quick punch round: reasoning beyond language

(40:25) The moderators ask whether Transformers fundamentally deal with language, which opens the reasoning-beyond-language debate.
(40:33) Łukasz clarifies that Transformers are sequence models, not inherently language-only models.
(40:44) He says the name Transformer came from the goal of transforming any data into any other data.
(41:08) Llion says language embodies intelligence, which explains why language models work so well, but some human mental processes are not grounded in language.
(41:26) His Post-Transformer requirement is a system that can support those non-language mental processes.
(41:45) Łukasz counters that sequences can represent words, images, proteins, or other modalities; using words is a pretraining choice, not an architectural limit.
(42:13) Adrian asks whether discovery, especially discovering things never said before, is harder to express in language because language must become verbose before the idea exists on paper.
(42:40) Łukasz answers by citing a recent GPT-5.5 solution to an Erdős problem that had been open for about 60 years, arguing that language-token systems can still discover new solutions.
(42:57) He adds that vectors at 8-bit precision are almost like words in another system, so the Transformer may care less about the representation than people assume.
(43:21) He says the more important issue may be speed: children reach impressive conclusions quickly and do not spend hours on chains of thought.
(43:53) His closing thought in this round is that the brain may be even more parallel and faster than Transformers, or perhaps something like an even better Transformer.

Quick punch round: scaling laws and the Bitter Lesson

(44:16) The second quick-punch round turns to scaling laws and whether scale remains the master key.
(44:36) Llion invokes the Bitter Lesson: in AI history, using 10x more compute and data often beats more interesting hand-designed changes.
(44:59) He says the Transformer’s success came from scalability, and any Post-Transformer must demonstrate the same property.
(45:15) Mathias agrees that Liquid sees clear scaling laws across different architectures and across roughly two orders of magnitude in model size.
(45:39) He adds that text-space reasoning exists largely because the data is available, even though it may not be the most efficient way to think.
(45:55) Llion says some architectures may have better scaling laws but worse adaptation to today’s hardware.
(46:19) Łukasz says he is firmly in the scaling camp and that the Transformer helped start that camp.
(46:42) His concession condition is explicit: show him a curve that improves faster than the Transformer scaling curve, and he may have to concede.
(47:13) Llion worries that hyperscalers and the full internet make brute-force scaling feel good enough, even though a more efficient method would be preferable.
(47:43) Łukasz warns about a possible dead end: an architecture that is data-efficient at small scale but does not beat Transformers when scaled up.
(48:17) Adrian says Transformer scaling couples data, model size, and compute, while future architectures may uncouple those scaling dimensions.
(48:35) He uses children and chess grandmasters as examples of systems where limited data can be paired with large amounts of internal compute.
(48:58) His practical point is that limited-data domains still matter, especially in science and enterprise, where more training data may be unavailable.
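
For readers wondering what "show me a curve" (46:42) looks like in practice, here is a minimal sketch, assuming the loss follows a pure power law in compute, loss = a * compute^(-b). That form is a simplification (real fits usually add an irreducible-loss term), and the data points below are made up for illustration, not results from any lab. The slope b is the number both camps are arguing about.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - b * log(compute). Returns (a, b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope                              # report b as a positive exponent

# Hypothetical measurements generated from loss = 10 * compute**-0.1.
compute = [1e18, 1e19, 1e20, 1e21]
loss = [10.0 * c ** -0.1 for c in compute]

a, b = fit_power_law(compute, loss)
print(f"a = {a:.3f}, b = {b:.3f}")
```

A challenger architecture would need a larger b on the same axes, measured over enough orders of magnitude that the trend is believable.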

Real-world deployment and benchmarks

(49:15) The third quick-punch round moves from theory to real-world deployments.
(49:24) Mathias says many important problems are not text problems, including proteins, genetic sequences, and biomedical signals.
(49:45) He says Liquid has seen cases where recurrent or other architectures beat Transformers for certain non-text modalities after substantial testing.
(50:08) His deployment lesson is that speed and hardware matter, and model architectures will co-evolve with hardware over the next months and years.
(50:34) The moderators ask whether benchmarks show real progress or mislead the field.
(50:58) Llion says benchmarks are the best method available, but they are far from ideal and are too easy to game.
(51:22) He clarifies that benchmark optimization is not necessarily cheating, but strong benchmark results may fail to transfer to real capability elsewhere.
(51:49) Łukasz argues from Transformer history: BLEU was the standard translation metric, but perplexity became more useful and outlasted it.
(52:03) He says perplexity, a measure of how well a model predicts the next word, turned out to correlate well with model quality and stayed useful after older metrics became less meaningful.
(52:17) He says OpenAI and other labs often benchmark models by measuring perplexity on internal codebases.
(52:26) He stresses that good evaluation requires a holdout dataset that is nowhere on the internet and never released publicly.
(52:41) He links perplexity back to compression and says it is hard to imagine a dramatically better general benchmark, though reasoning adds small complications.
(53:13) Llion agrees with the perplexity point and says he would like people to push perplexity again.
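
Since perplexity carries so much of this section, here is a minimal sketch of the standard definition: the exponential of the average negative log-probability the model assigns to each actual next token. The probability lists are hypothetical stand-ins for a model's outputs on a held-out text. The bits-per-token helper shows the link to compression that Llion raised at 39:23.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the true tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

def bits_per_token(token_probs):
    """Average ideal code length in bits; perplexity equals 2 ** bits_per_token."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)

# Hypothetical probabilities a model gave to the true next token at four positions.
guessing = [0.25, 0.25, 0.25, 0.25]   # as unsure as a uniform pick among 4 options
confident = [0.9, 0.8, 0.95, 0.85]

print(perplexity(guessing), bits_per_token(guessing))
print(perplexity(confident), bits_per_token(confident))
```

Lower is better on both numbers. A model that predicts held-out text well needs fewer bits to encode it, which is why perplexity and compression keep showing up in the same sentence.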

Closing arguments

(54:11) Łukasz proposes a private company that maintains held-out text and code datasets, measures perplexity through an API, and charges labs a small fee.
(54:42) He says the benchmark should optimize not merely for the best agent today, but for the best scaling curve and steepest slope.
(55:05) His final verdict is that the Transformer still wins for now, but agreeing on a meaningful metric would be progress.
(55:25) Adrian agrees that metrics matter, then makes explicit that Transformers are optimized for hardware-efficient pretraining.
(55:47) He says the inference and reasoning era raises a new question: is the Transformer also optimal for hardware while reasoning?
(56:18) He argues that chain of thought can express any reasoning, but that does not mean it uses hardware efficiently.
(56:35) His forecast for the next breakthrough is more efficient, more compact reasoning with less hardware use.
(56:52) Mathias repeats Liquid’s hedge: keep improving Transformers because they work, while also researching Post-Transformer approaches.
(57:24) Llion says the Transformer is best for now, but the brain and current model gaps keep his belief in something better intact.
(57:53) His final prediction is that breakthroughs do arrive, and when one arrives in AI architecture, the field will enter a Post-Transformer world.

Audience Q&A: hardware lottery and the 10x bar

(58:48) The first audience question asks how AI can move beyond Transformers if hardware still forces every idea through the same bottleneck.
(59:33) Llion names this the hardware lottery: Transformers were designed to exploit TPUs and massive matrix multiplications, which has left the field even more stuck in the current local minimum.
(59:55) He argues that the first Post-Transformer iteration will obviously fail to match today’s best Transformer, and the field needs to accept interesting results before state-of-the-art results.
(1:00:15) He says reviewers often need reminding that a paper can be interesting before it is state of the art.
(1:00:44) Łukasz counters that Transformers also had hardware friction at first: early TPUs lacked the exponential operation needed for softmax, so attention had to offload activations to the CPU.
(1:01:13) His point is that Transformers had to clear a hardware hurdle before hardware companies changed course around them.
(1:01:36) He says a Post-Transformer architecture must clear a much higher bar than a 2x improvement, probably closer to 10x.
(1:01:43) He calls that 10x bar a blessing because it forces researchers to think bigger than small tweaks.
(1:02:02) He notes that a modern laptop can be as fast as the 8-GPU box used to develop the original Transformer, which changes what research is possible.
(1:02:18) His actionable standard is practical: a model can be 50x slower at first if it has a better scaling slope.
(1:02:40) He says agents can now write CUDA or Triton kernels, which can make painfully slow GPU code faster and narrow the hardware barrier.
(1:03:12) He rejects hardware mismatch as a complete excuse. Researchers can still show a curve that bends the right way.
(1:03:33) Llion agrees but says many researchers need permission to take that leap.
(1:03:48) Łukasz joins the permission-giving: do not be scared of being 50x slower and do not be scared of being less accurate than Transformers at first.
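
The "50x slower is fine if the slope is better" standard can be put into rough numbers. Below is a back-of-the-envelope sketch under loud assumptions: both architectures follow a pure power law in compute, and the slowdown is modeled as the newcomer getting only 1/50 of the effective compute per dollar. The exponents 0.10 and 0.15 and the constant 10 are invented for illustration; only the 50x figure comes from the debate.

```python
def crossover_compute(a_old, b_old, a_new, b_new, slowdown):
    """Compute budget at which the slower, steeper curve catches the incumbent.

    Incumbent:  loss = a_old * C ** -b_old
    Challenger: loss = a_new * (C / slowdown) ** -b_new   (assumed handicap model)
    """
    if b_new <= b_old:
        raise ValueError("the challenger needs a steeper slope to ever catch up")
    return (a_new * slowdown ** b_new / a_old) ** (1.0 / (b_new - b_old))

# Hypothetical numbers, in arbitrary compute units.
a_old, b_old = 10.0, 0.10
a_new, b_new = 10.0, 0.15
slowdown = 50.0

def loss_old(c):
    return a_old * c ** -b_old

def loss_new(c):
    return a_new * (c / slowdown) ** -b_new

c_star = crossover_compute(a_old, b_old, a_new, b_new, slowdown)
print(f"crossover at about {c_star:,.0f} compute units")
```

With these made-up values the challenger loses everywhere below the crossover and wins everywhere above it. Whether the crossover lands at a budget anyone can afford is the practical question.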

Audience Q&A: continual learning, dynamic weights, and long context

(1:04:00) The next audience question reframes intelligence as the ability to keep learning and changing neural connections, like biological brains do.
(1:05:01) The audience member contrasts biological neural networks, which update constantly, with Transformers that are trained, frozen, and then served at inference.
(1:05:39) Adrian says that if you want to mimic human learning, look at in-context learning in Transformers rather than backpropagation.
(1:06:00) He defines the useful analogy: give a Transformer a new riddle or problem plus examples in context, then observe how it adapts.
(1:06:17) His ideal version of intelligence is in-context learning extended toward infinity: an infinite session that forgets nothing and picks up skills over time.
(1:06:40) Llion says the point helps Team Post-Transformer because standard neural networks were built around static weights.
(1:07:03) He says 2026 is supposed to be the continual learning year, but adding dynamic weights on top of static-weight systems feels hacky.
(1:07:18) His preferred future system is designed for dynamic weights from the ground up.
(1:07:42) Łukasz pushes back by saying Transformer forward-pass activations can do something remarkably similar to gradient descent after pretraining.
(1:08:21) He says evaluation should run on very long contexts, not sentence-level fragments, because context changes what models can do.
(1:08:56) He notes that 1M tokens is already a lot of context, so the open question becomes whether learning should happen through gradient updates or activation updates.
(1:10:10) Llion agrees that the Transformer is not going anywhere because it is too successful and useful, even in a Post-Transformer world.
(1:10:48) He accepts Łukasz’s point that attention weights are dynamic weights, meaning the Transformer may already contain something like continual learning.
(1:11:01) Adrian says the painful word is maybe. The field lacks benchmarks that directly measure how good a model’s in-context learning algorithm really is.
(1:11:43) He says many long-context benchmarks are needle-in-a-haystack retrieval tests, not tests of learning.
(1:12:15) Łukasz says Transformer in-context learning is underappreciated and gives the example of text-form tabular data enabling surprisingly good time-series prediction.
(1:12:51) Adrian draws a key distinction: context inserted into the model is different from context earned through the model’s own life path.
(1:13:22) He says human-like context comes from actions taken, reactions received, thoughts formed, and chain-of-thought internalized over time.
(1:13:46) His ultimate benchmark is an architecture that can use millions of tokens of lived context to reason, update beliefs, build hypotheses, prove lemmas, and check itself.
(1:14:08) Mathias predicts the future will blur the extremes, with Transformer and RNN-like ideas merging rather than remaining separate camps.
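
The "fast weights" (30:43) and "dynamic weights" (1:10:48) ideas can be sketched in a few lines. This is a simplified illustration, assuming unnormalized linear attention rather than the softmax attention standard Transformers use: each new key-value pair is written into a fixed-size matrix as an outer product, and a query reads from it with a matrix product. The function names and toy vectors are hypothetical.

```python
def write(memory, key, value):
    """Fast-weight update: add the outer product of key and value to the memory."""
    return [[m + k * v for m, v in zip(row, value)]
            for row, k in zip(memory, key)]

def read(memory, query):
    """Read from memory: multiply the query vector into the memory matrix."""
    cols = len(memory[0])
    return [sum(q * row[j] for q, row in zip(query, memory)) for j in range(cols)]

# A 3x2 memory: 3-dimensional keys, 2-dimensional values. Its size never changes.
memory = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
memory = write(memory, key=[1.0, 0.0, 0.0], value=[5.0, 7.0])
memory = write(memory, key=[0.0, 1.0, 0.0], value=[2.0, 3.0])

recalled = read(memory, query=[1.0, 0.0, 0.0])
print(recalled)
```

No gradient step happens here, yet the matrix changes with every token the model sees. With orthogonal keys, as in this toy case, recall is exact; with overlapping keys the stored values interfere, which is one reason a fixed state budget is a real design tradeoff.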

Audience Q&A: fine-tuning, latent reasoning, and safety

(1:15:03) The final audience question asks how fine-tuning compares with context learning, and whether latent-space reasoning creates safety risks.
(1:15:47) Łukasz says the field may over-rely on the fact that chains of thought are written in text and currently seem fairly faithful.
(1:16:03) He points out that beneath each token are dozens of layers and thousands of floating-point activations, and researchers do not fully know what happens there.
(1:16:31) He warns that the same visible words could eventually correspond to very different internal thoughts.
(1:16:47) His safety takeaway is that researchers should not be complacent and need work that keeps chains of thought faithful.
(1:16:55) Llion offers a hopeful counter: a Post-Transformer system closer to how the brain works might turn out to be more interpretable and safer.

The crowd vote and final beat

(1:17:30) The moderators switch from scientific precision to a crowd-noise vote, with four trophies ready either way.
(1:18:08) Team Transformer gets its noise-meter moment first.
(1:18:33) Team Post-Transformer gets its noise-meter moment second.
(1:19:00) The crowd vote is presented as close, but Team Transformer takes the championship.
(1:19:16) The trophy handoff lands the final thesis: Transformers win tonight, and Team Post-Transformer’s job is to prove them wrong.

Actionable takeaways from the debate

(15:07) For builders: choose architecture by workload, hardware, modality, speed, and deployment constraints, not by loyalty to a research camp.
(21:30) For researchers: a real successor should make reasoning feel native, rather than relying mainly on external scaffolding around language outputs.
(28:14) For architecture teams: reducing long chain-of-thought dependence is one promising target, especially if latent reasoning can do the same work more compactly.
(32:49) For investors: hardware fit can matter as much as algorithmic elegance. The winning architecture must ride the available compute substrate or justify new hardware.
(46:42) For anyone claiming Transformer replacement: bring a better scaling curve, not merely a better demo.
(48:58) For enterprise and science teams: limited-data domains are where non-Transformer architectures may find early openings.
(52:26) For benchmark designers: private holdout sets matter because releasing the benchmark eventually corrupts it.
(54:11) For evaluation startups: Łukasz basically described a business, a trusted private perplexity benchmark for code and text, accessed through an API.
(56:35) For product teams: the next architectural win may show up as cheaper, more compact reasoning rather than a splashy chatbot behavior change.
(1:02:18) For early-stage architecture work: being 50x slower can still be acceptable if the scaling slope is better.
(1:03:48) For reviewers and lab leads: do not kill every non-SOTA idea early. Interesting can precede state of the art.
(1:11:01) For benchmark creators: the field needs tests of in-context learning as learning, not just long-context retrieval.
(1:15:47) For safety teams: visible text chains of thought are not the whole model. The activations beneath the tokens are already a latent reasoning space.

Lastly, can I just say: I simply love stuff like this. It's kooky, it's weird, it's forward-looking, and it's acknowledging the industry's flaws while pushing it to be better. We know big tech CEOs can't do stuff like this, but the people involved here are doing (and have done) the really important kind of work that moves the needle on AI progress outside the big labs, and it's cool to see people debate technical stuff in public and have fun doing it.

IMO, the big labs are too secretive about what they're building to be this interesting, and the big tech giants are too focused on investor sentiment and pushing the current paradigm to be this innovative.

More like this plz.

What the Transformer vs. Post-Transformer debate revealed about AI's next architecture