2026 is the year the magic stops and the engineering begins.
For the last two years, AI got noticeably better every few months. That phase is ending.
Text reasoning is plateauing because the two engines driving progress are hitting limits.
Pre-training is tapping out. We are running out of high-quality web data to feed the models.
Post-training is hitting a wall. RL is expensive and only truly scales for tasks with clear right answers, like math or code, not the messy real world.
So we cannot expect the same massive jumps in performance we saw in the past few years. Not even close.
Since AGI is likely 5–10 years away, we have to ask: Why is AI not driving GDP at the pace expected?
The missing keys
It comes down to two gaps: on-the-job learning and generalization.
When Trump announced tariffs in April, trading logic changed overnight. A human analyst read the news, talked to a few people, and instantly updated their mental model. They generalized from a few data points and adapted.
An AI can’t do that. It is frozen in its training past. It cannot “learn on the job” or generalize to a totally new regime without massive retraining or engineering.
The long-tail problem
This is the core issue. Most real-world jobs are full of these "long-tail" scenarios. They are context-heavy and constantly changing.
Because you cannot pre-train for every possible future scenario, general-purpose models fail to capture the specific nuances needed to do the job. That is why we aren’t seeing the GDP explosion yet.
Fix the context problem
This is where better engineering comes in. If we can’t make the model smarter, we have to make it better informed.
You cannot have reliability if the model is guessing about your specific business context. That is exactly why we are building retrieval engines. Once we start reliably feeding the right facts at the right time to our agents, they stop guessing and start executing.
The goal isn't to replace humans. It's to stop burning creative minds on administrative tasks. Let the agents execute the process so the humans can create the value.
That's exactly what we're building here at EaseFlows AI.
AI’s 2026 Pivot (The Model Layer): After Scaling Peaks, Before AGI Arrives
AI didn’t stall. Scaling did.
That changes the entire playbook for 2026.
It’s my longest write-up so far, but it’s still a compression of the best minds in the field (Ilya, Demis, Richard Sutton, and more).
If you read it, you’ll walk away with sharper instincts for what is real, what is marketing, and what matters in 2026.
No press-release numbers. Just the constraints that actually matter.
TL;DR: The Strategic Pivot of 2026
The initial "gold rush" at the foundation model layer has hit a predictable plateau. We are currently sitting in the gap between two S-curves. The first curve, LLM scaling, has leveled off. The second curve, true AGI, is still in its infancy and years away from its vertical climb. In this long interim, the playbook changes. The era of "bigger is always better" is ending. The era of systems engineering and algorithmic discovery has begun.
As we enter 2026, the signal for decision makers is unmistakable. Do not wait for a god-like intelligence to arrive and fix everything (you could wait, of course; it might only take a decade). General-purpose model gains are slowing considerably, and the real alpha is now found downstream in integration, workflow design, and adoption.
What this article covers
Reality Check: AI’s Morning After. Why the gap between benchmarks and business value widened in 2025, and why the "just scale it" era is over.
Scaling Slowdown. A deep dive into the two engines that are stalling (pretraining data limits and RL constraints) and the jagged intelligence that results.
Demystifying LLMs. A practical look at what models actually do (retrieval and recombination) versus what we imagine they do (human-like reasoning).
The AGI Gap. Three core capabilities current models still lack: continual learning, generalisation, and physical grounding.
Execution Playbook. A brief note on how to build reliable systems today by playing to LLM strengths and engineering around their weaknesses.
Paths to AGI. The two competing visions for the future (systems integration vs. algorithmic discovery) and why neither will result in a winner-take-all monopoly.
The Winner’s Curse. Why foundation models will commoditise, and why the real durable value is shifting downstream to data and context.
Reality Check: AI’s Morning After (2026)
Over the last year, I kept hearing two reactions to AI: "wow, this is incredible" and "okay… is that it?" In my experience, the first reaction comes more from engineers, the second from everyone who has to live with the output.
My personal summary of the last year is one word: disappointment. Not because there was no progress; benchmarks still climbed. The disappointment is the gap between the dream and the delivered reality, a gap that even many people building these systems did not fully expect.
For a while, the industry behaved like it was in an endless boom phase. Keep pouring in data and compute, keep scaling, keep getting lift. But when a system reaches the upper part of its S-curve, progress stops compounding on the same trajectory.
Another way to say it is this. The LLM S-curve is already near the top, so the next burst of GDP growth will not come from squeezing the same curve harder. It will come from finding the next S-curve, then riding it through diffusion.
We are early on that next curve. The foundations are here, but the productivity payoff depends on complements: integration, workflow redesign, data plumbing, and new operating habits. That buildout takes time, which is why waiting for a near-term "GDP boom" on top of today’s models is a longer wait than most narratives admit.
It is like strength training. Early gains are fast. Later gains still come, but they are incremental, and you need smarter programming, not just more volume.
Scaling Slowdown: Pretraining and RL Hit Limits
Why did the last year leave so many people disappointed? Because two core growth engines are running out of easy upside.
1) Pretraining is close to exhausting its "easy gains"
At its core, pretraining is about injecting a vast amount of human knowledge into a model. The constraint is data. Not just any data, but high-quality, complete, semantically rich data: the kind that captures the full path from zero to one, including the intermediate reasoning steps.
Software is the cleanest example. Commits, pull request reviews, and test outcomes let you trace the work step by step. Most industries do not produce anything with that level of structured completeness.
The problem now is that public web corpora have been heavily mined. Getting more high-quality text is hard. What is left is mostly private, behind corporate firewalls, and only useful if it is well governed and semantically clean.
At this stage, trying to "make up for it" with clever algorithmic tweaks in pretraining is increasingly a lot of effort for modest return. You mostly have to wait for more good data to accumulate, which takes time.
So the golden era of "just add more data and parameters and performance goes up" is fading. The next leap requires a breakthrough in learning principles, not just larger versions of the same model. High-quality text has been squeezed, and the marginal gains from simply making models bigger are shrinking fast.
It is like optimising a supply chain after the biggest bottlenecks have been fixed. You can still improve, but it becomes incremental work, not a step change.
2) Post-training, meaning reinforcement learning, scales less predictably
If pretraining fills the model with knowledge, post-training teaches it how to retrieve the relevant pieces more reliably when it faces a task.
Unfortunately, reinforcement learning delivers less predictable returns than pretraining, and the results are highly sensitive to reward design, environment coverage, and rollout length.
The fundamental constraint is structural. If pretraining is limited by data, RL is limited by the training environment and by the quality of the learning signal you can generate inside that environment.
Even if you throw more resources at it, the most reliable gains tend to concentrate in domains with clear, verifiable feedback, because the learning signal is harder to game and easier to propagate through long trajectories. Coding is again the obvious case: code compiles or it does not; tests pass or they do not. That is why AI coding became such a major focus.
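To make "verifiable feedback" concrete, here is a minimal sketch of what a reward signal can look like in the coding domain, assuming a git repo and a pytest suite. It is an illustration of the idea, not any lab's actual RL setup.

```python
import subprocess

def code_reward(candidate_patch: str, repo_dir: str) -> float:
    """Toy verifiable reward: apply a model-generated patch, run the tests,
    and return 1.0 only if everything passes. Illustrative sketch only."""
    applied = subprocess.run(
        ["git", "apply", "-"], input=candidate_patch,
        cwd=repo_dir, text=True, capture_output=True,
    )
    if applied.returncode != 0:
        return 0.0  # the patch does not even apply: unambiguous failure
    tests = subprocess.run(["pytest", "-q"], cwd=repo_dir, capture_output=True)
    return 1.0 if tests.returncode == 0 else 0.0
```

There is no equivalent one-liner for grading a management decision, which is exactly the asymmetry described above.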
Reinforcement learning shines in finite task spaces with crisp rules. With enough training, the upper bound can exceed human performance, like modern systems in Go. The floor can also rise dramatically.
But what about management decisions? What impact did a leadership call have, and relative to what alternative? Attribution is murky. Outcomes are delayed. The reward signal is fuzzy. That ambiguity makes it extremely hard to use RL to improve an LLM's performance as a "manager".
So the conclusion is harsh: under the current regime, text-based reasoning will see incremental gains but is unlikely to deliver a step-change in general reliability without a fundamental breakthrough.
General capability has limited room left to climb. What will continue to improve materially is specialised reliability: within a company with a specific business context, high-quality private data, or a workflow that can be turned into a simulation with verifiable rewards, models can become significantly better at that narrow slice. Outside those conditions, general-purpose models are unlikely to move up materially in real-world reliability.
Jagged intelligence, real economics
That is the gap people feel. Even when models brush against today’s ceiling, they still feel far from the kind of "intelligence" many imagined. They can score at a PhD level on benchmarks, then turn around and make basic mistakes. The result is an awkward mix: impressive and unreliable at the same time.
The industry sometimes calls this jagged intelligence. The label is useful. The word "intelligence" still risks overselling what is really happening.
This also helps explain why the real GDP impact has been modest relative to the biggest narratives. If trillion-dollar "AI economies" were already arriving in practice, major model companies would not still look like they do today: meaningful revenue, but still struggling to close the gap to sustainable profitability.
In 2026, the foundation layer is unlikely to keep steamrolling forward at any cost, the way it did in the earlier years. The easy optimism has cooled.
During the hype phase, people made big predictions: AGI next year, broad labour replacement, exponential growth everywhere. Now, in the first quarter of 2026, the mood is more like the first serious debrief after a major rollout. The questions get more concrete: which claims are actually feasible, how would they work, and what would they cost?
If I had to pick a keyword for AI in 2026, it would be: back to reality.
Demystifying LLMs: What They Actually Are
We have been leaning a bit negative, so let’s pull it back. If LLMs are not that "intelligent", where did the last thousand days of scalp-tingling "aha" moments come from? Why did the world decide intelligence had "emerged"?
The cleanest answer is information asymmetry. An LLM is not a thinking brain. It is closer to an extremely compressed database, plus a strong retrieval and recombination engine.
A small but telling example: Andrej Karpathy has been building an open-source project called NanoChat. You would expect today’s coding assistants to be pure leverage for someone at his level. Instead, he found they often failed to help, and sometimes got in the way, because his design choices were unconventional. The model kept nudging him back toward the most common patterns on the internet.
That points to a basic truth: LLMs are strongest on problems they have effectively "seen". Most engineers love them because much of real-world code is routine work that has been written thousands of times. The training data is full of it, so the model can produce it with high confidence.
This also reframes the supposed "reasoning leap". After reinforcement learning (RL), models often get better at hard math and logic benchmarks. It is tempting to say, "it learned to think", and then label it a "reasoning model". A more accurate interpretation is simpler: many solution patterns were already in the base model, and RL mostly improves the odds of pulling out the right one at the right time. It is a retrieval boost, not a new thinking mechanism.
So why does it feel like "emergent intelligence" to humans?
Because our own knowledge is narrow. A question can sit on top of many intertwined concepts. To a person, those concepts often feel separate. To a model trained on massive corpora, the statistical links are obvious. When it surfaces connections you did not know existed and strings them together smoothly, it can look like insight.
Under the hood, it is more like assembling a puzzle than inventing a new piece.
People often push back here: "If it’s just retrieving, how can it solve math contest problems it has never seen? Are the answers somehow stored inside the model?"
The key clarification is that it is often retrieving rules, not memorised text. The problems may be new, but the building blocks are old. The model searches and recombines those blocks into a solution.
Think of it as LEGO. The bricks are standard; the arrangement can be novel.
A simple way to see the limits is the nonsense test: prepend an irrelevant sentence to a formal logic problem, something like "cats can sleep 20 hours a day", and performance often drops. Humans can ignore noise. LLMs are more likely to absorb it into the probability mix and derail.
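The test is easy to try yourself. In the sketch below, the problem and the distractor are invented, and ask_model stands in for whatever model client you actually use.

```python
# Sketch of the "nonsense test": the same logic problem, with and without
# an irrelevant distractor sentence prepended.
PROBLEM = (
    "All bloops are razzies. All razzies are lazzies. "
    "Are all bloops lazzies? Answer yes or no."
)
DISTRACTOR = "Cats can sleep 20 hours a day. "

def nonsense_test(ask_model):
    clean = ask_model(PROBLEM)
    noisy = ask_model(DISTRACTOR + PROBLEM)
    # A robust reasoner should give the same answer both times.
    return clean, noisy
```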
That is the foundation. LLMs can assemble fragments of the known world into remarkably coherent output, and that alone has economic value. But the "aha" moments mostly come from the scale of information and pattern recombination, not from grounded understanding. When you push beyond what the model has seen, or into a world that keeps changing, the system becomes less stable, because it does not truly know what it is saying.
The AGI Gap: What’s Still Missing?
So we are back to the core question. If an LLM is essentially a powerful index rather than a true agent, where is the gap between it and what people imagine as AGI, meaning human-like thinking? Why does this "intelligence" often feel, frankly, underwhelming?
The short answer is three missing capabilities: continual learning, generalisation, and physical understanding.
A concrete example: in April last year, Trump abruptly announced a new tariff framework. What does a human trader do? They read a few reports, talk to a few peers, update their mental model, and adjust decisions immediately. They do not need ten thousand prior tariff cases. A few high-signal inputs are enough to incorporate a new policy variable into their decision logic.
That demonstrates two abilities at once:
Small-sample generalisation: learning a new regime from a handful of signals.
Continual learning: updating the model in real time, without a shutdown and retraining cycle.
Today’s AI largely cannot do this. It is frozen at a training-time snapshot. It cannot reliably learn while doing, and when the regime changes, it typically cannot adapt without heavy retraining or engineering.
This is a big part of why human labour stays valuable. Humans do not need to build a full training loop for every small change in the job. Real work is full of non-standard, long-tail situations where rules, context, and risk tradeoffs shift constantly.
In most roles, you handle dozens of judgment calls every day. The tasks vary across people, and even for the same person, they change day to day. You cannot automate an entire job simply by predefining a fixed skill list for a model.
Why not? This goes back to what an LLM is doing: it is selecting the most statistically likely next step.
Here is a simplified picture. In 99% of training data, word A is followed by word B, so the model learns that high-probability pattern. But in a specific physical context (Context X), word A should be followed by word C. The model often fails to adjust because it does not understand the causal mechanism that makes Context X special. It is pulled by statistical habit and still outputs B.
That is the difference between probability fitting and mechanism understanding.
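A toy frequency model makes that pull of statistical habit visible. The counts below are invented purely for illustration; real models are vastly richer, but the failure mode has the same shape.

```python
from collections import Counter

# Invented corpus statistics: word A is followed by B 99% of the time.
next_word_counts = {"A": Counter({"B": 990, "C": 10})}

def predict_next(word, context=None):
    # A pure probability-fitter has no causal model of why Context X is
    # special, so the context argument changes nothing about its answer.
    return next_word_counts[word].most_common(1)[0][0]

print(predict_next("A"))               # -> "B"
print(predict_next("A", context="X"))  # -> still "B", even though X needs "C"
```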
Missing Capability 3: Physical Understanding
Consider a chaotic edge case: a wobbly plastic table, a wine glass filled with boiling water on the edge, and children running nearby. A human brain reacts with immediate physiological tension. You run a millisecond simulation of the causal chain: the floor vibrates, the unstable table shakes, the top-heavy glass tips, and the boiling water scalds a child. This is not memory retrieval. It is a physics engine combined with a value system that prioritises safety.
Current AI lacks this world model. It sees the objects but misses the dynamic interplay. The combination of "boiling water" and "wine glass" is a statistical outlier that its probability patterns handle poorly. Even if it suggests moving the glass, it does so only because the tokens "edge" and "push back" frequently co-occur in its training data. It mimics the advice without understanding the peril. It operates on statistical likelihood, not physical ground truth.
A useful framing from cognitive science is a two-floor building:
Floor one: association. See A, guess B.
Floor two: intervention. Simulate what happens next if you do not act.
Modern LLMs mostly live on the first floor. Physical understanding lives on the second.
Operational Impact: Why These Gaps Matter
Stack these gaps together, and you see why naive scaling will peter out. We can climb the current ladder higher, but it does not lead to the next floor. When you ask an LLM to handle a novel situation outside its training distribution, or a task that requires physical interaction, it either freezes or starts hallucinating. That brittleness is not in the same league as human robustness, the ability to see something new and still get the core dynamics mostly right.
And it also suggests a practical boundary: until these gaps shrink, AI will struggle to move beyond screens and into messy physical work at scale.
Multimodal Note: Progress Beyond Text
I have been emphasising that text reasoning is nearing a plateau. That does not mean other modalities are equally capped. Today’s multimodal models, especially for images and video, still have room to improve across generations.
Two points matter here:
Video is structurally harder than text. Time, space, and semantics are entangled, which raises training and inference costs and demands more complex model design. This makes multimodal scaling materially more expensive than pure text scaling. Google has been one of the most aggressive players here, supported by in-house TPU hardware and a highly optimised software stack. Others still trail.
Scaling does not remove the core limitation. Transformers are still fundamentally extracting "useful patterns" from a data distribution. Even if multimodal models reach the performance level people label as "GPT-5 class", improvements will still depend heavily on data quality and coverage. Scale alone does not solve continual learning, generalisation, or physical understanding.
Execution Playbook: Making AI Work in Business
If the real choke points are continual learning, generalisation, and physical understanding, then "making it work" comes down to two things: pick the right problems first, then use engineering to push uncertainty into a controllable range.
Track 1: Play to LLM Strengths
We need to work with what LLMs are structurally good at. They perform best on tasks with clear rules, clear acceptance criteria, and a workflow that can be decomposed into steps. They are not a great fit for long-tail decisions where the rules, context, and risk preferences change constantly. Avoid placing critical functions in areas where LLMs are structurally weak.
Take financial analysis. The valuable part is often rapid absorption of new information, judgment in non-standard situations, balancing risk, and making the call. That space is full of exceptions. It is hard to expect today’s models to replace humans end-to-end. A more realistic approach is to break the chain apart: let AI handle the standardised sub-tasks, then keep final judgment and accountability with humans.
This maps to a practical principle: the biggest productivity gains right now usually do not come from making AI do the "smartest" work. They come from removing low-grade cognitive labor, repetitive tasks, and work with little creative value. The goal is reliability over flexibility, not letting the model freestyle in an open world.
That is why I think most "AI agents" are mostly narrative. Outside unusually well-instrumented domains like coding, letting a model explore an open environment tends to produce shaky reliability and unattractive economics.
The more controllable approach is boring in the best way: map the business process, then design an auditable agentic workflow. Tell the system exactly what step one is, what the input is, what "done" means, and what the fallback is when it fails. Then step two, with its own acceptance criteria. Break the complex into many clean, straightforward micro-tasks. Each step is small and checkable. That is how you automate a meaningful chunk of a process that used to require many people watching it.
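Here is one minimal way to express that shape in code. The Step fields and the escalate-to-a-human fallback are hypothetical choices, not a prescribed framework; the point is only that every step carries its own acceptance check and failure path.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]       # the action, possibly LLM-backed
    accept: Callable[[dict], bool]    # explicit "done" criterion for this step
    fallback: Callable[[dict], dict]  # what happens when the check fails

def run_workflow(steps: list[Step], state: dict) -> dict:
    for step in steps:
        result = step.run(state)
        if step.accept(result):
            state = result                # small, checkable, auditable
        else:
            state = step.fallback(state)  # e.g. retry, or escalate to a human
    return state
```

Because each step is explicit, the record of which checks passed and which fallbacks fired becomes the audit trail.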
A simple example: I use Zoho for bookkeeping. It can take a photo of a receipt and turn it into structured data. Before AI, a human had to do this. It is low-skill, repetitive work with no creative value, exactly the kind of work AI should absorb.
In IT, the advantage is obvious: AI is strong at writing code in a way most industries do not have an equivalent for. Let it write code. But the chain from requirements to architecture is still full of long-tail issues, and humans still need to guardrail it. Developers should be freed from pure code production and spend more time on architecture and product design. Everyone becomes more of an architect.
In plain terms, what LLMs enable is similar to what assembly lines did for physical labor. Assembly lines automated repeatable linear tasks and pushed human effort toward skilled work. LLMs can automate more cognitive steps, including parts that used to feel non-linear, and push white-collar effort toward work that actually requires judgment and creativity. The key is not expecting the model itself to be the creative engine.
A simple division of labour works well: routine goes to AI, long-tail stays with humans. Long-tail cases have sparse data, unique context, and ambiguous rules. They require judgment, tradeoffs, and sometimes the willingness to break the rules when reality demands it. Over time, human value concentrates on handling exceptions, making final calls, and driving execution inside complex organisational constraints.
And there is one more layer: humans are not only the "firefighters". Humans are also the ones who turn firefighting into standard operating procedure. After solving a long-tail case, do not stop at the fix. Capture it as process, templates, few-shot examples, validation rules, and knowledge entries, so the next similar case becomes routine. The loop is: human solves long-tail, system learns, long-tail becomes routine, then AI takes over. That is how the routine expands, the long tail shrinks, and the efficiency gap opens.
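One lightweight way to close that loop is to give every solved exception a structured home. The fields below are hypothetical; what matters is that the fix is captured in a form that prompts and validators can reuse.

```python
from dataclasses import dataclass

@dataclass
class PlaybookEntry:
    """Hypothetical record of a solved long-tail case, kept so the next
    similar case can be handled as routine instead of firefighting."""
    trigger: str             # how to recognise the situation next time
    resolution: list[str]    # the steps the human actually took
    few_shot_example: str    # input/output pair to add to the relevant prompt
    validation_rule: str     # check that catches the failure mode earlier

playbook: list[PlaybookEntry] = []

def capture(entry: PlaybookEntry) -> None:
    playbook.append(entry)   # human solves once, the system remembers
```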
Track 2: Engineer the Bottlenecks
After you choose the right problems, the second lever is engineering. Two main lines matter: model fine-tuning and context engineering.
Fine-tuning is only worth it when your data is high quality, your supervision signal is stable, and you can run it as a sustainable training loop rather than a one-off project.
Over time, fine-tuning will make more sense as open-source models get closer to closed models. When open models are far behind, fine-tuning an open model with private data can still underperform an untuned closed model.
The other line is context. Context engineering can partially compensate for weak continual learning and generalisation. If a model does not understand your business context, it has to guess, and the output becomes noise rather than signal.
It is like hiring a Harvard PhD, then not onboarding them. No internal wiki access, no documentation permissions, no map of which resources exist, and then expecting them to make high-stakes calls on day one. They may be brilliant, but they will still be lost.
That is why I keep saying "AI search" is not just a search box. Through this work, we have accumulated foundational modules tailored to our business. Over time, that makes it cheaper to solve context for more scenarios, shifting AI from "guessing answers" toward "executing tasks".
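Mechanically, the loop is simple to sketch, even though getting retrieval quality right is the hard part. In the snippet below, retrieve and llm are placeholders for whatever search stack and model you actually run.

```python
def answer_with_context(question, retrieve, llm):
    """Minimal context-engineering loop: fetch the relevant internal facts
    first, then ask the model to execute against them instead of guessing."""
    docs = retrieve(question, top_k=5)   # e.g. policies, tickets, wiki pages
    context = "\n\n".join(d["text"] for d in docs)
    prompt = (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```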
Alright, that is drifting into the application layer. We can save those implementation details for a separate session. Coming back to the theme here, the real question at the foundation layer is: when the return curve on language models flattens, what are people actually researching next?
Paths to AGI & ASI: Swiss Army Knife vs Yogi Master
To set the tone, borrow a line from Ilya Sutskever (former chief scientist at OpenAI): we are moving from an era of scaling into an era of research. The earlier playbook was straightforward: add more data, more compute, more parameters, and expect general capability to keep rising. More people now accept a tougher reality: scale alone no longer explains where the next step-change is supposed to come from.
There is also a basic category error that keeps showing up. Many people talk as if "LLM" equals "AI". But an LLM is fundamentally a next-token prediction machine. It can be extremely useful, but it is still far from human-like intelligence, and it is not a complete answer to AGI. So as the industry shifts from "make language models bigger" to "find the next breakthrough", the question becomes: if AGI needs continual learning, generalisation, and physical understanding, what kind of "brain" are we actually trying to build?
This is where the field splits into two camps:
Systems integration: AGI comes from combining multiple best-in-class capability modules into one coordinated system.
Algorithm-first: modular integration is secondary; the real unlock is finding a more fundamental learning algorithm.
That is the frame: Swiss Army Knife vs Yogi Master.
DeepMind’s Path: Engineer the System (Swiss Army Knife)
DeepMind’s Demis Hassabis is a clear representative of the systems camp. The philosophy is pragmatic: if a single model cannot reliably do everything, stop forcing it. Build specialised components, make each excellent, then integrate them into a working stack.
A simple mental model is a division of labour that mirrors how humans separate fast intuition from slow deliberation:
Gemini (intuition, System 1): a multimodal model that produces plausible candidate answers quickly. Think of it as strong priors, fast and broad, sometimes wrong for the same reason System 1 is sometimes wrong.
Genie (world model): a simulator that can "play out" consequences, especially for physical or interactive situations. The aim is not more fluent text, it is better ability to preview what happens next.
Alpha (planning, System 2): a planner that searches and evaluates options, closer to AlphaGo-style lookahead than quick guessing.
In short: Gemini proposes, Genie simulates, Alpha decides.
If those modules become strong and integrate cleanly, you get a system with breadth of knowledge, better grounding in physical consequences, and more deliberate planning.
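To make that division of labour concrete, here is a deliberately hypothetical sketch of the loop. The function names mirror the pattern, not any published Gemini, Genie, or Alpha interface.

```python
def decide(situation, propose, simulate, score, n_candidates=8):
    """Hypothetical propose-simulate-decide loop, not DeepMind's design.
    propose  ~ fast System-1 model suggesting candidate actions
    simulate ~ world model playing out each candidate's consequences
    score    ~ planner / value function ranking the predicted outcomes"""
    candidates = [propose(situation) for _ in range(n_candidates)]
    outcomes = [simulate(situation, action) for action in candidates]
    best_action, _ = max(zip(candidates, outcomes),
                         key=lambda pair: score(pair[1]))
    return best_action
```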
Demis has suggested a rough timeline of 5 to 10 years. He has also acknowledged it likely still needs one or two breakthroughs on the scale of the Transformer. Still, he argues scaling remains foundational, especially in multimodal models.
Google is also one of the few organisations positioned to keep scaling multimodal aggressively. Video is vastly more expensive than text, and Google has two structural advantages: in-house TPU hardware and the balance sheet to keep paying the bill.
The obvious risk is integration fragility. Anyone who has dealt with complex software systems knows the pattern: more modules means more interfaces, and more interfaces means more ways to break. You can end up with a stack that is brilliant in parts but brittle as a whole.
Ilya’s Path: Find the Learning Algorithm (Yogi Master)
Ilya Sutskever’s stance is close to the mirror image. He argues the main bottleneck is not engineering. It is discovery. We have not found the fundamental learning algorithm that delivers human-like sample efficiency, continual learning, and robust generalisation.
His ideal AGI is not an internet encyclopedia. It is a "super-smart 15-year-old": not born knowing everything, but able to learn extremely fast from limited input, then generalise immediately.
A counterintuitive implication follows: more human data is not always better. Human experience contains bias and noise, and training heavily on it can bake those limits into the system.
A concrete illustration is the difference between AlphaGo and AlphaZero. AlphaGo became world-class in Go partly by learning from large collections of expert human games. AlphaZero started without human game records. It was given only the rules of Go, then it learned by playing against itself at scale. The result was a system that quickly surpassed the earlier approach, in part because it was not constrained by human habits.
The takeaway is not "human data is bad". It is that if you want a system that goes beyond the frontier of human practice, imitating humans too closely can become a ceiling.
So Ilya is looking for a learning mechanism that can discover structure from raw experience through efficient search and compression, rather than memorising human conclusions more thoroughly.
He also points to something missing in today’s training setups: a dynamic internal value function. His "emotion" argument is not that feelings equal intelligence. It is that human learning constantly reweights what matters. Anxiety makes near-misses feel extremely costly, which pushes caution. Curiosity makes exploration feel worthwhile, which reduces the perceived penalty of failure. Modern model training is closer to a fixed ruler, applied relentlessly, regardless of context.
The upside of this camp is obvious. If you find the algorithm, you get a real step-change, not a marginal improvement.
The risk is equally obvious. Ilya has talked in ranges like 5 to 20 years, which is another way of saying: the search could succeed quickly, take a long time, or fail entirely.
The Core Tradeoff: Engineering vs Algorithmic Discovery
The split between these two camps is, at its core, a collision between design thinking and discovery thinking.
Demis’s view is that AGI is a problem you can decompose and solve through engineering. Like building an aircraft, you need wings, an engine, and control systems. Each component has to be pushed toward excellence, then integrated into a working whole.
Ilya’s view is that AGI is not something you "invent" so much as something you "discover". Like discovering gravity, the learning algorithm is already out there in principle; we just have not found it yet.
Still, whichever camp you prefer, the bottlenecks are the same: continual learning and generalisation are the hard constraints right now. The disagreement is about the path.
Demis would say: build Genie first, use it as a simulated training environment, then let Gemini keep learning through repeated trial and error.
Ilya would say: unless you find the underlying mechanism that lets humans see something new and still get the essence mostly right, stacking modules will not get you there.
My View: Discovery Beats Design
So who is right?
Let me waste a little of your time with my personal view. I am not a core researcher, so this opinion is not particularly important, but I still want to put it on the table.
I suspect Ilya’s path is closer to the real solution.
Here is a counterintuitive question to explain why:
What is more complex, learning to ride a bicycle, or solving a high school physics problem?
Most people’s intuition says the physics problem. It looks harder, it requires formal education, and it punishes you when you get it wrong. But from an information-processing perspective, riding a bicycle is the truly hard problem. Your brain is unconsciously controlling a dynamic system with countless real-time, noisy variables: balance, centre of mass, speed, friction, wind resistance, and micro-adjustments you cannot even name. A physics problem, by contrast, lives in a clean world with clear boundaries and consistent logic.
That contrast points to a deeper question: why can the human brain handle "hard problems" that still trip up supercomputers, yet struggle with "hard questions" that are tidy and logical?
Part of the answer sits in two different ways of creating things: design versus discovery, careful construction versus natural emergence.
Most of human progress, from stone tools to microchips, is based on design, because we can see the causal connections between parts. We have blueprints. We can explain why each transistor goes where it goes. But when the problem is truly complex, like life or intelligence, those causal links are no longer visible. You cannot manufacture intelligence by designing every neuron connection, in the same way you cannot manufacture a tornado by designing the motion of every molecule in the air.
In a strange way, LLMs are one of the first major achievements built through a discovery-style approach.
When the Transformer architecture was created, the team did not "design" every detail of intelligence. They did something closer to meta-design: define a small set of core rules (attention) and a growth environment (massive data and prediction objectives), then step back and let the system grow inside that space.
Even the people who built trillion-parameter models cannot tell you what each parameter is doing.
In truly complex systems, the most important capabilities often come from emergence. They grow naturally inside a well-structured space, rather than being precisely designed part by part.
So my view is that for the hardest problems, the job is to find the most stable meta-rules: set boundary conditions, incentives, and constraints that are fundamental and durable, then let the system grow into them. You are not drawing the path of every blood vessel. You are creating an environment where blood vessels can form.
This way of thinking is not just about AI. It applies to many hard things. Take entrepreneurship. Every startup journey is unique, and other people’s success paths do not copy cleanly onto yours. The practical "meta-design" is to internalise first principles, generate many ideas guided by those principles, run small fast experiments, and let the market tell you what works and what does not. What scales may not be what you originally had the most faith in. But that is not a bug. That is how discovery works.
Anyway, back to the point. If LLMs themselves were created through discovery thinking, then for something even more demanding than LLMs, I do not think the answer is to retreat back into pure design thinking.
The Winner’s Curse: Why AGI Probably Won’t Be Winner-Take-All
Either way, no matter which camp wins, I do not expect a winner-take-all outcome. If anything, I expect a winner’s curse.
The reason is that an existence proof flattens the technical barrier overnight.
In an interview, Ilya used the atomic bomb as a sharp metaphor. Once the world learned that the first bomb had detonated, the difficulty of building one effectively dropped by orders of magnitude. Not because everyone suddenly had the full blueprint, but because the hardest part, proving it was possible, was now public. The possibility itself became the biggest leak.
You can see the same dynamic in AI. When a frontier lab demonstrates a new capability, it often gets replicated quickly. The point is not that copying is trivial; it is that once the direction is proven, competitors can stop debating and start converging.
So if one company builds something that looks like real AGI, or even just a genuinely breakthrough level of reasoning, that signal alone will snap competitors into focus. They will concentrate talent and capital on the now-validated path. The first-mover moat is often more fragile than people want to believe.
This creates a specific industry trap: innovation is extremely expensive, but imitation can be relatively cheap. Frontier labs pay the full cost of exploring the unknown and absorbing dead ends. Once a path works, followers can reproduce similar outcomes at much lower exploratory cost, then compete aggressively on price. The technical "winner" can end up stuck in a commercially awkward position: high burn, limited monopoly upside.
It also makes you wonder what venture capital is really buying when it funds pure research at a massive scale: return on investment, or a ticket to say you participated in history. In traditional ROI terms, the math becomes hard the moment the capability gets commoditised.
That is why I expect base models to commoditise quickly. To avoid a pure price war, companies will move toward specialisation, focusing on different verticals and workflows.
So who wins?
If the model layer is easy to replicate and easy to price down, the durable commercial value tends to escape downstream to what is harder to copy. The real advantage becomes data plus context.
What is genuinely difficult to replicate is:
private enterprise data and its usability (data liquidity)
deep understanding of business processes (context engineering)
the scaffolding that embeds AI into real workflows (systems, guardrails, and handoffs that make it reliable)
If you control those, you can swap underlying models as they get cheaper, without losing the compounding advantage. You are not married to any single model supplier.
So rather than betting at the foundation layer on an uncertain discovery, there is a more controllable path at the application layer: use design to create reliable value.
The real dividend is not who built the bomb. It is who built the power grid.
Next time, we can talk about what it actually means to build that power grid in 2026.