Hi everyone,
A couple of weeks back, I ran an experiment where [I fed 48 years of Buffett's shareholder letters to Claude Opus 4.6](https://www.reddit.com/r/ClaudeAI/comments/1rhbhoq/i_fed_opus_46_all_48_of_warren_buffetts/) and had it pick stocks blind (it matched 6 out of 10 Berkshire holdings without knowing what it was looking at). That experiment got a lot of great feedback, and one of the most common requests was to test AI on real Reddit stock advice instead of just Buffett's principles.
I used Claude Code to build a multi-agent pipeline that grabs investing recommendations from the r/ValueInvesting subreddit for February 2025, strips out popularity signals, and has Claude sub-agents score each recommendation blind on reasoning quality alone. Then I built three portfolios (10 stocks each):
* **The Crowd**: top 10 stocks ranked by total upvotes across all mentions
* **Claude's Picks**: top 10 stocks ranked by reasoning quality score
* **The Underdogs**: bottom 10 stocks by upvotes (min 5 upvotes), to test whether the crowd was right to ignore them
I tracked their real returns over a full year, Feb 2025 to Feb 2026.
The part I found most interesting was that on data completely outside Opus's training window (Sep 2025 onward), Claude's picks returned +5.2% while the most upvoted stocks returned -10.8% (S&P 500: +2.4%).
If you prefer to watch the full experiment, I uploaded it to my channel: [https://www.youtube.com/watch?v=tr-k9jMS_Vc](https://www.youtube.com/watch?v=tr-k9jMS_Vc) (free).
**The Setup**
I used Claude Code to scrape every single post from [r/ValueInvesting](https://www.reddit.com/r/ValueInvesting/) for the month of February 2025 and filter down to posts and comments where someone was recommending, analyzing, or debating a specific stock. This gave me 1,100+ qualifying threads, 6,000+ comments, and 547 individual stock recommendations across 238 unique tickers.
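To give a concrete feel for the filtering step: before the sub-agents ever see a post, something has to decide whether a given post or comment is actually recommending a stock. Here's a minimal sketch of that kind of heuristic; the ticker regex, signal words, and stop-list are my illustrative guesses, not the pipeline's actual logic (in the real run, Claude makes this call on the full text):

```python
import re

# Ticker-shaped token: 1-5 uppercase letters on word boundaries
TICKER = re.compile(r"\b[A-Z]{1,5}\b")
# Words that suggest the author is recommending/analyzing, not just mentioning
SIGNAL_WORDS = {"buy", "undervalued", "thesis", "dcf", "moat", "long", "position"}
# Common all-caps words that are NOT tickers (a real run needs a fuller list)
NOT_TICKERS = {"I", "A", "DD", "DCF", "CEO", "EPS", "USA", "THE"}

def looks_like_recommendation(text: str) -> bool:
    """Keep items that mention a ticker-shaped token AND a recommendation signal."""
    words = set(text.lower().split())
    tickers = {t for t in TICKER.findall(text) if t not in NOT_TICKERS}
    return bool(tickers) and bool(words & SIGNAL_WORDS)

posts = [
    "OSCR looks undervalued here, my DCF says 40% upside",
    "lol what a day in the market",
    "Opened a long position in INTC after the earnings call",
]
kept = [p for p in posts if looks_like_recommendation(p)]
print(len(kept))  # 2
```

A regex pass like this is cheap but noisy; the point of using sub-agents instead is exactly to catch recommendations that don't hit obvious keywords.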
I then had Opus score every single one on five dimensions: thesis clarity, risk acknowledgment, data quality, specificity, and original thinking.
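To make the scoring concrete: each recommendation ends up with five dimension scores that get collapsed into one composite for ranking. A minimal sketch, assuming the simplest option, an unweighted mean on a 0-10 scale (both the scale and the equal weighting are illustrative assumptions, not confirmed details of the run):

```python
DIMENSIONS = ("thesis_clarity", "risk_acknowledgment", "data_quality",
              "specificity", "original_thinking")

def composite_score(scores: dict[str, float]) -> float:
    """Unweighted mean across the five dimensions (assumed 0-10 scale)."""
    missing = set(DIMENSIONS) - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)

# Hypothetical scores for one recommendation
rec = {"thesis_clarity": 8, "risk_acknowledgment": 6, "data_quality": 9,
       "specificity": 7, "original_thinking": 5}
print(composite_score(rec))  # 7.0
```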
From there I built the three portfolios of **The Crowd**, **Claude's Picks**, **The Underdogs**.
All portfolios were equal-weight, bought on March 3, 2025 (first trading day of March). They had the same entry, same exit, with no cherry-picking.
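For clarity on what "equal-weight" means in the return numbers below: each portfolio's return is just the average of its ten stocks' simple buy-and-hold returns between the shared entry and exit dates. A minimal sketch (tickers and prices are made up):

```python
def equal_weight_return(entry: dict[str, float], exit_: dict[str, float]) -> float:
    """Simple return of an equal-weight buy-and-hold portfolio.
    entry/exit_ map ticker -> price; the same tickers appear in both."""
    rets = [(exit_[t] - entry[t]) / entry[t] for t in entry]
    return sum(rets) / len(rets)

entry = {"AAA": 100.0, "BBB": 50.0}   # prices on the shared entry date
exit_ = {"AAA": 120.0, "BBB": 45.0}   # prices on the shared exit date
print(f"{equal_weight_return(entry, exit_):+.1%}")  # +5.0%
```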
Here was my Claude Code setup:
```
reddit-stock-analysis/
├── orchestrator                 # Main controller - runs full pipeline per month
├── skills/
│   ├── scrape-subreddit         # Pulls all posts + comments for a given month via Reddit API
│   ├── filter-recommendations   # Identifies posts where someone recommends/analyzes a stock
│   ├── extract-tickers          # Maps mentions → ticker symbols, deduplicates
│   ├── strip-popularity         # Removes upvote counts, awards, author karma
│   ├── build-portfolios         # Constructs Crowd (by upvotes) vs AI (by score) vs Underdog
│   └── track-returns            # Looks up actual price returns for each portfolio
└── sub-agents/
    └── (spawned per recommendation)  # Blind scoring - no popularity signals, just the post text
        ├── thesis-clarity       # Is there a structured argument for why this stock?
        ├── risk-acknowledgment  # Does the post address what could go wrong?
        ├── data-quality         # Real financials (P/E, margins, debt) or just vibes?
        ├── specificity          # Concrete targets, timeframes, catalysts?
        └── original-thinking    # Independent analysis or echoing the crowd?
```
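The strip-popularity step is what makes the blind scoring meaningful: sub-agents never see how the crowd voted. A minimal sketch of the idea, with hypothetical field names for a scraped post (the real skill operates on whatever fields the scrape produces):

```python
# Fields that leak popularity signal (hypothetical names for a scraped post)
POPULARITY_FIELDS = {"score", "ups", "upvote_ratio", "total_awards_received",
                     "author_karma", "num_comments", "gilded"}

def strip_popularity(post: dict) -> dict:
    """Return a copy of the post with all popularity signals removed,
    so a scoring sub-agent sees only the text itself."""
    return {k: v for k, v in post.items() if k not in POPULARITY_FIELDS}

post = {"title": "Why INTC is cheap", "selftext": "DCF below...",
        "score": 843, "upvote_ratio": 0.95, "author_karma": 120_000}
blind = strip_popularity(post)
print(sorted(blind))  # ['selftext', 'title']
```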
**The Blind Test (Sep 2025 – Feb 2026)**
Before I share the main backtest, I want to start with the result I think matters more.
One fair criticism that keeps coming up in these experiments is that the AI might have seen these stock prices during training. The model I used has a training cutoff of August 2025, so the February recommendations do fall within that window. Even though the AI was only scoring argument quality (not predicting prices), it could theoretically recognize which stocks were being discussed.
So I reran the entire experiment on September 2025 recommendations, which fall completely outside the model's training data. That run covered 800+ threads, 10,500+ comments, and 2,200 scored recommendations, and the model could not have seen any of these discussions or the subsequent price movement in training.
* **AI**: +5.2%
* **S&P 500**: +2.4%
* **Crowd**: -10.8%
On data the AI couldn't possibly have seen, it still beat the market, and the crowd portfolio went negative. I think this is the cleanest result from the experiment because there's no way to argue the AI was cheating.
**The Full Backtest (Feb 2025 – Feb 2026)**
Now here's the full year backtest on the February data:
* **The Crowd**: +39.8% (+20.3% vs S&P)
* **AI's Picks**: +37.0% (+17.5% vs S&P)
* **S&P 500**: +19.5%
* **Underdogs**: +10.4% (-9.1% vs S&P)
The crowd actually won by about 3 percentage points, and both beat the S&P. But when I looked at the individual stocks, the story got a lot more interesting: the AI's portfolio had 9 out of 10 winners, with the worst performer being OSCR at -12%.
Both portfolios ended up in a similar place over the full year, but the crowd swung from +39.8% to -10.8% across the two periods, which feels quite inconsistent, while the Opus-filtered picks gained both times.
**What I took away from this**
I don't think the takeaway is necessarily that "Opus picks better stocks." It's more that Opus appears to be better at distinguishing solid analysis from analysis that just sounds good. It could serve as a useful tool for filtering the advice posts here down to the ones that do real due diligence. The most popular advice and the best-reasoned advice had almost nothing to do with each other.
If this was interesting to you, the full walkthrough, including all the data, is here: [https://www.youtube.com/watch?v=tr-k9jMS_Vc](https://www.youtube.com/watch?v=tr-k9jMS_Vc) (free).
Thank you so much if you did end up reading this far. I'd love to hear if you've been experimenting similarly with Claude; let me know :-).
--- TOP COMMENTS ---
Have you calculated the statistical significance of the result?
What does the distribution of outcomes look like for a random strategy?
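One way to approach both questions would be to bootstrap random 10-stock portfolios from the same 238-ticker universe and see where the AI portfolio lands in that distribution. A sketch with synthetic per-ticker returns (the real ones would come from the track-returns data; the return distribution parameters here are placeholders):

```python
import random
import statistics

random.seed(0)

# Synthetic one-year returns standing in for the 238-ticker universe
universe = {f"T{i}": random.gauss(0.15, 0.40) for i in range(238)}

def random_portfolio_return(returns: dict[str, float], k: int = 10) -> float:
    """Equal-weight return of k tickers sampled uniformly without replacement."""
    picks = random.sample(list(returns), k)
    return sum(returns[t] for t in picks) / k

# Distribution of 10,000 random 10-stock portfolios
sims = [random_portfolio_return(universe) for _ in range(10_000)]
ai_return = 0.370  # AI portfolio's full-year result from the post
pctile = sum(r < ai_return for r in sims) / len(sims)
print(f"median random: {statistics.median(sims):+.1%}, AI percentile: {pctile:.0%}")
```

If the AI's +37.0% sits far out in the right tail of that distribution, the result is unlikely to be luck; if random portfolios frequently match it, the edge is weaker than the headline number suggests.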
---
Methodology:
∙ How were ties handled in the scoring? When multiple sub-agents scored the same recommendation, what was the aggregation method — average, weighted, majority?
∙ Did any single stock dominate either portfolio’s returns? With 10-stock equal-weight portfolios, one outlier (positive or negative) can tell most of the story.
∙ What happened to the Underdogs portfolio in the Sep 2025 blind test? That comparison feels like the missing piece.
On the scoring dimensions:
∙ Were the five dimensions weighted equally? “Original thinking” and “data quality” feel like they should carry more weight than “specificity” if the goal is finding genuinely good analysis.
∙ Did high-scoring posts cluster around any particular sectors, or was it spread across the market?
On replication:
∙ Has anyone tried running the same pipeline on a different subreddit (r/stocks, r/investing) to see if the gap between crowd and reasoning-quality picks holds?
∙ What does the score distribution look like? Were most recommendations clustered in the middle, or was there a clear separation between high and low scorers?
The question I’m most curious about:
∙ Of the posts that scored high on reasoning quality but got almost no upvotes — what did they have in common stylistically? My guess is they were longer, more hedged, and less exciting to read. That would really nail down why popularity and quality diverge.
What did the data actually show on that last one?