Wall Street Keeps Testing AI Traders, But Most Are Still Underperforming
Recent trading competitions suggest large language models are still unreliable portfolio managers, according to Bloomberg.
Tests involving models from OpenAI, Anthropic, Google, and xAI have often delivered underwhelming results: many lost money, traded excessively, and behaved erratically, with models given identical prompts making wildly different decisions. In several cases, models appeared unable to stick to a coherent strategy for more than a few trading sessions.
Bloomberg writes that one of the clearest examples came from Alpha Arena, a competition created by startup Nof1. Eight models were each given $10,000 and asked to trade U.S. tech stocks over a two-week period using different strategies, including defensive approaches and leveraged bets. Across four competitions, the models collectively lost roughly a third of their capital, and only six of 32 outcomes ended in profit.
The gap in behavior was striking: xAI’s Grok 4.20 made just 158 trades in one contest, while Alibaba Group’s Qwen executed 1,418 under the exact same prompt.
The experiments reflect growing interest in whether generative AI can eventually outperform traditional fund managers. Wall Street firms including JPMorgan Chase and Balyasny Asset Management already rely on AI for research, fraud detection, and internal analysis, but they have largely stopped short of handing over actual investment decisions. As Nof1 founder Jay Azhang put it, current models still struggle with basics like “position sizing, timing, signal weighting and overtrading.”
That broader pattern has shown up elsewhere too. Research blog Flat Circle tracked 11 public AI trading competitions and found that while every event produced at least one profitable model, only two produced a profitable median return, meaning that in a typical event most models lost money. Azhang was even more blunt about the state of autonomous trading: giving an LLM money and letting it invest independently “isn’t a thing yet.”
Some firms are still betting that the technology will improve with better tools and tighter guardrails. Intelligent Alpha, for example, runs an AI-driven fund that combines LLMs with earnings transcripts, analyst forecasts, corporate filings, macroeconomic indicators, and web searches to make predictions. In late 2025, OpenAI’s ChatGPT correctly predicted the direction of earnings estimate revisions 68% of the time — its strongest showing so far.
Evaluating these systems remains difficult because traditional backtesting methods can be misleading: models may already have embedded knowledge of past market events, creating look-ahead bias. That has pushed more firms toward live-market experiments, where results so far suggest AI may be useful as an assistant — but not a replacement — for human traders.
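The look-ahead problem is easy to see in miniature. The sketch below is not from the article and uses invented random data: a "strategy" that is allowed to peek at the same day's return looks consistently profitable even on pure noise, while the causal version of the same rule (acting only on yesterday's return) has no edge at all.

```python
import random

# Toy illustration of look-ahead bias (invented data, not from the article).
random.seed(0)
returns = [random.gauss(0, 0.01) for _ in range(1000)]  # random daily returns

# Biased backtest: the position for day t uses day t's own return (peeking),
# so the strategy is "right" every single day by construction.
biased_pnl = sum(r if r > 0 else -r for r in returns)

# Causal backtest: the position for day t uses only day t-1's return.
causal_pnl = sum(
    returns[t] if returns[t - 1] > 0 else -returns[t]
    for t in range(1, len(returns))
)

print(f"with look-ahead:    {biased_pnl:+.3f}")  # large and positive
print(f"without look-ahead: {causal_pnl:+.3f}")  # near zero on random data
```

An LLM whose training data covers a historical period can smuggle in the same advantage implicitly, which is why live-market experiments like Alpha Arena are preferred over backtests for evaluating these models.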