Brave New Backtest
My last two articles on AI and trading research got more engagement than almost anything I’ve written.
“More of the Disease, Faster” argued that LLMs can’t answer the critical question: who pays you and why?
“AI Will Create Millions of Quants” went deeper on the why: AI makes beautiful backtests trivially easy to produce, which means more false discoveries, more overfitting dressed up as research, and more people confusing a good-looking equity curve with a real edge.
Both pieces argued from experience. I’ve been doing this long enough to know what works and what doesn’t, and I was pretty confident in the arguments. But the most telling thing about the comments wasn’t the disagreement. It was how many people clearly don’t understand that backtesting, statistics, and pattern matching aren’t research.
Their LLMs don’t understand this either. And you can’t outsource the thinking part of trading research to a machine that doesn’t understand what research is.
Two recent papers suggest this isn’t a temporary limitation that better models will fix. It’s structural.
A feature of the system, rather than a bug.
And they provide the scientific backing for what I’ve been saying.
My earlier articles argued that the training data is the problem. That’s true. But it’s only Problem 1. Here are Problems 2 and 3.
Problem 1: The Training Data (Quick Recap)
For anyone who missed the first two articles, the short version:
LLMs are trained on the internet. The internet’s trading content is overwhelmingly noise. “Don’t fight the trend.” “Use RSI for entries.” “Paper trade for six months.” “Validate with out-of-sample data.” “Cointegration matters for pairs trading.”
The dominant paradigm online is that backtesting and statistics IS research, that finding a pattern in historical data IS finding an edge.
The critical question, “who pays you and why?”, barely exists in the training data.
Mechanism-based thinking, structural edges, understanding participant constraints… this stuff lives in the tail of the distribution. The vast majority of trading content online is conventional wisdom dressed up as insight.
So the LLM doesn’t know that backtesting isn’t research because the internet doesn’t know it either. It reproduces the dominant paradigm with extraordinary confidence. Which is bad.
But some people reasonably argued: fix the training data, and you fix the problem. Train the model on better material. Use RAG. Curate your knowledge base.
Two recent papers suggest that’s only partly true. Even with perfect training data, two more problems remain. And these are architectural, meaning they’re baked into how these models work.
Problem 2: The Forgetful Machine
A paper called “Unable to Forget” (Wang & Sun, 2025) tested something deceptively simple: can LLMs track a value that changes over time?
Imagine a patient’s blood pressure being recorded throughout a hospital visit. BP at triage: 120. Ten minutes later: 128. At discharge: 125. Ask the model: what’s the current blood pressure?
Simple, right? The answer is right there at the end of the context. The model was explicitly told to retrieve the most recent value.
Across 35+ models (GPT, Claude, Llama, Gemini, DeepSeek, the lot), accuracy declined log-linearly toward zero as the number of prior updates increased. The more historical values the model had seen for the same variable, the worse it got at retrieving the current one.
And this wasn’t a gentle, noisy, or hard-to-discern decline. It was consistent, relentless, and universal.
The researchers call this “proactive interference,” borrowed from cognitive science. Earlier values compete with the current value in the model’s retrieval process. Old information interferes with new information. And the interference gets worse, continuously, as context accumulates.
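The shape of the paper’s probe is easy to sketch. Below is a toy generator for a PI-style test case — the prompt template is mine for illustration, not the paper’s exact format. One key gets updated n times, and the ground truth is always the final value, the thing models increasingly fail to retrieve as n grows:

```python
import random

def make_interference_probe(key="blood pressure", n_updates=10, seed=0):
    """Build a toy proactive-interference probe: n sequential updates
    to one key, then a query for the most recent value."""
    rng = random.Random(seed)
    values = [rng.randint(90, 180) for _ in range(n_updates)]
    updates = [f"Update: {key} is now {v}." for v in values]
    prompt = "\n".join(updates) + f"\nQuestion: what is the current {key}?"
    # Ground truth is always the last value written.
    return prompt, values[-1]

prompt, answer = make_interference_probe(n_updates=3, seed=42)
```

Scoring a model against `answer` while sweeping `n_updates` upward is the shape of the experiment: per the paper, accuracy falls log-linearly as the update count grows.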
Three things make this particularly frightening:
Prompt engineering doesn’t fix it. They tried telling the model to “forget” old values. They tried “focus on the most recent update.” They tried meta-prompts asking the model to self-assess what to prioritise. Marginal improvement at best. In some cases, the “forget” instruction actually made things worse, anchoring errors around the point where the instruction was inserted.
Bigger context windows don’t help. A bigger window just gives the model more room to accumulate more interference.
It’s universal. Every model tested showed the same pattern. From tiny open-source models to the biggest proprietary ones. The curve shape was the same. Bigger models declined more slowly, but they all declined.
Now think about what trading research requires.
Some trading concepts are evergreen: the theory of edge, the importance of mechanism, portfolio construction principles. An LLM can learn those from training data, and it does a reasonable job of reproducing them, when it isn’t drowning them in conventional wisdom, per Problem 1.
But even when explicitly told to use a specific database of curated material, it will still throw in the conventional wisdom with all the conviction in the world (more on this below).
Applying those concepts, though, requires tracking an ever-changing environment. Market regimes shift. Carry changes direction. Volatility spikes and mean-reverts. Correlations break down. Borrow costs change. Liquidity dries up and returns. New players enter. Regulations get updated and replaced.
A model that can’t reliably track “the current value of blood pressure” in a simple key-value test certainly can’t track the evolving state of the many variables that drive markets.
Knowing the theory of edge is one thing. Applying it in a market that adapts and evolves is where the dynamic tracking matters. And the architecture fails at exactly that task.
Crucially, this isn’t a training data problem. You could train the model on perfect data and it would still fail at tracking sequential updates. The architecture can’t handle it.
I’ve long felt in my bones that the conviction with which LLMs speak about trading is truly misplaced. And now there’s some published research that says so too.
Problem 3: The Artificial Hivemind
A second paper, “Artificial Hivemind” (Jiang et al., NeurIPS 2025), measured something equally concerning: when you ask LLMs open-ended questions, how diverse are their answers?
They tested 25+ models across 100 open-ended queries, generating 50 responses per model per query. The results are shocking… but I’d wager they align with your recent experience of online content.
Intra-model repetition: A single model gives you the same answer over and over. In 79% of queries, the pairwise similarity between responses from the same model exceeded 0.8. Ask it the same question fifty times and you get essentially the same answer fifty times, even with aggressive sampling parameters designed to maximise diversity.
Inter-model homogeneity: Different models, built by different organisations, with different architectures and different training data, produce strikingly similar outputs. Average pairwise similarity between responses from different models ranged from 0.71 to 0.82. DeepSeek-V3 and GPT-4o hit 0.81 similarity. These are supposed to be independent systems.
The paper’s most vivid example asked 25 different models to “write a metaphor about time.” Fifty responses each. 1,250 total responses from independent systems. They cluster into just two metaphors: “time is a river” and “time is a weaver.” That’s it.
Twenty-five different models, all converging on the same two ideas.
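To make “pairwise similarity” concrete: the paper scores responses with embedding-based similarity, but even a crude bag-of-words cosine (a stand-in I’m using here, not the paper’s metric) shows how a collapsed set of responses scores high:

```python
from collections import Counter
import math

def cosine_bow(a, b):
    """Crude bag-of-words cosine similarity (a stand-in for the
    paper's embedding-based similarity metric)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def mean_pairwise_similarity(responses):
    """Average similarity over all unordered pairs of responses."""
    pairs = [(i, j) for i in range(len(responses))
             for j in range(i + 1, len(responses))]
    return sum(cosine_bow(responses[i], responses[j])
               for i, j in pairs) / len(pairs)

# Three near-identical "time is a river" responses score well above
# what genuinely diverse answers would.
collapsed = [
    "time is a river flowing ever onward",
    "time is a river flowing ever forward",
    "time is a river that flows onward",
]
```

Run fifty samples per model through a function like `mean_pairwise_similarity` and you have the homogeneity measurement in miniature.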
And if you’ve been feeling a growing disdain for all that similarly soulless LLM-generated content popping up in your feed, now you know the mechanism. What we consume is collapsing to the mode of anti-creativity.
“Time is a river” dominates because it’s the most common metaphor about time in the training data.
And that’s the mechanism that matters for trading.
Mode collapse means convergence toward the most represented patterns in the training data. For trading, the mode of internet content is conventional wisdom: “use a stop loss,” “paper trade first,” “backtest with moving averages,” “validate with out-of-sample data,” “cointegration matters for pairs trading.” All the usual bollocks.
The actually useful insights (mechanism-based thinking, structural edges, “who pays you and why?”) live in the tail of the distribution of online content.
Mode collapse systematically suppresses the tail and amplifies the centre. So it’s not just that the training data is bad (Problem 1). The architecture preferentially surfaces the bad stuff because that’s what convergence toward the mode means.
“Use a stop loss” is the “time is a river” of trading content. It’s the modal output. The rare, useful insights get pushed out by convergence.
I experienced this first-hand.
I built a RAG system on our TLQ Bootcamp material, which is high-quality, mechanism-first content, and explicitly told the LLM to respond using only data from the RAG database. It still mixed in conventional wisdom that was clearly not in the database: generic stop loss advice, paper trading suggestions. Both problems at once: it couldn’t suppress its prior training (proactive interference from Problem 2), and it defaulted to the most common trading advice (mode collapse from Problem 3).
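The constraint I gave the model was of this shape — a minimal sketch of a “passages-only” prompt (the template here is illustrative, not my actual system):

```python
def build_strict_rag_prompt(question, passages):
    """Assemble an 'answer only from these passages' prompt -- the kind
    of constraint described above. (Template is illustrative only.)"""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using ONLY the numbered passages below. If the answer is "
        "not in the passages, say you don't know. Do not add advice from "
        "outside them.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

Even with the instruction this explicit, the generic advice leaked in. Which is the point: the constraint lives in the prompt, but the mode lives in the weights.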
The Stepford Quants
Remember the backtest cycle of doom from Article 2? AI makes it trivially easy to produce beautiful backtests that prove nothing. The unsuspecting trader ends up with a complicated algorithm for fitting to past noise and learns nothing useful in the process.
Mode collapse implies an added layer of insidiousness: everyone using LLMs for trading research converges on roughly the same conventional-wisdom-flavoured strategies.
It’s the perfect tool to tell you what already fits into your preconceived frameworks.
The implication is that millions of AI quants aren’t just running the cycle of doom independently. They’re running the same doom cycle. And the “strategies” being discovered aren’t real edges in the first place. They’re the modal output of bad training data, amplified by architectural convergence.
The Stepford Quants: pleasant, productive, and identical.
LLMs Are Brilliant Coders but Terrible Traders
This is the part that makes the whole thing click.
I use AI for coding every day. It’s extraordinary. It writes boilerplate, handles data wrangling, produces charts, writes tests. I’ve said this in both previous articles, and I’ll keep saying it: AI is the best research assistant I’ve ever had.
Why does it work so well for code?
Because mode collapse in coding is convergence toward best practices.
The training data (Stack Overflow, GitHub, documentation) is self-correcting: bad code gets downvoted, good patterns get reinforced. Software engineering has well-established paradigms and right answers (or at least well-established good answers). When the model converges on “use a dictionary for O(1) lookups” or “handle this edge case with a try-except block,” that convergence is helpful.
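That modal coding advice is also simply correct, which is easy to verify. A quick timing sketch (absolute numbers will vary by machine):

```python
import timeit

n = 100_000
haystack_list = list(range(n))
haystack_dict = dict.fromkeys(haystack_list)

# Worst case for the list: the sought key is at the end, so every
# membership test scans all n elements. The dict test is a hash lookup.
t_list = timeit.timeit(lambda: (n - 1) in haystack_list, number=100)
t_dict = timeit.timeit(lambda: (n - 1) in haystack_dict, number=100)
```

The dict wins by orders of magnitude — exactly the kind of well-established right answer that makes convergence toward the mode genuinely useful in coding.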
Mode collapse in trading is convergence toward conventional wisdom, which is mostly wrong. The LLM surfaces “validate with out-of-sample data” and “use cointegration tests to find pairs trading opportunities” because that’s what most trading content says. The actually useful insights (mechanism-based thinking, structural edges) live in the tail and get suppressed by convergence.
It’s the same architectural property. But it’s beneficial in one domain, lethal in another.
Similarly, proactive interference doesn’t matter much in coding. You’re working on a specific, well-defined task. The context is relatively static: here’s the codebase, here’s what I want to change, here are the tests. But for tracking an evolving, complex state, where the whole point is that yesterday’s values are different from today’s and the model needs to know which is current, interference is devastating.
This explains exactly why the appropriate use of LLMs in trading is to implement a good idea (coding task, well-suited) rather than to find the idea in the first place (trading research, structurally ill-suited).
The Bug Is Not a Bug
There are at least three layers of limitation, and only one is even theoretically fixable:
Problem 1 (training data): Fixable in principle. Better data, better RAG, curated knowledge bases. But fixing it doesn’t solve Problems 2 and 3.
Problem 2 (proactive interference): Architectural. Tested across 35+ models from every major organisation. Bigger models decline more slowly, but they all decline. Prompt engineering doesn’t fix it. Bigger context windows don’t fix it.
Problem 3 (mode collapse): Architectural. Tested across 25+ models. Different organisations, different architectures, different training data, same convergence. Model ensembles don’t help because different models produce the same outputs anyway.
The “just wait for GPT-6” argument fails because Problems 2 and 3 are universal across every model tested, spanning all architectures and all organisations. These aren’t bugs that get fixed with the next release. They’re properties of how these systems work.
And, look, I could be wrong. Maybe some architectural breakthrough changes this. But these papers tested the full spectrum of current models, and the patterns were universal. I wouldn’t bet on it, short of a completely new and different architectural paradigm.
This reinforces the Edge Alchemy framework. The human-driven theory of edge (“who pays you and why?”) is the step that LLMs cannot do, for multiple independent reasons. The training data doesn’t teach it (Problem 1). The architecture can’t track evolving state (Problem 2). And it converges on the most common answers, which are wrong (Problem 3).
Here’s a workflow that stands a chance of working: human generates insight (theory of edge), then AI implements it (coding, data wrangling), then human evaluates results.
AI for the implementation, humans for the thinking.
Go Easy on the Soma
Brave New World is my favourite novel. Huxley’s genius was showing that the most dangerous dystopia isn’t the one where people are oppressed. It’s the one where everyone’s perfectly content. Nobody questions anything because the system feels good from the inside.
That’s AI-assisted trading research right now.
It feels productive. You’re generating strategies, running backtests, getting clean code in minutes.
The dystopia is that you’re converging on conventional wisdom while feeling like you’re doing cutting-edge work. Work that moves your life forward. The machine tells you what you want to hear, confirms what you already believe, and does it with such confidence and speed that you never stop to ask whether any of it is real.
AI is the best research assistant ever built. I use it every day and it makes me faster at everything except the one thing that actually matters: understanding why an edge exists and whether it will persist.
That’s three independent reasons to keep doing the hard work of understanding why edges exist. You only learn this stuff by reading, talking to other people, and, primarily, by doing. There’s no shortcut, no matter how good the AI gets.
I didn’t need research papers to tell me this, but it’s nice to have the receipts.
References:
Wang, C. & Sun, J.V. (2025). “Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length.” arXiv:2506.08184v3.
Jiang, L. et al. (2025). “Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond).” 39th Conference on Neural Information Processing Systems (NeurIPS 2025).

