TL;DR
A fresh study builds five prompt‑guided LLM agents—each emulating a legendary investor (Buffett, Graham, Greenblatt, Piotroski, Altman)—and backtests them on NASDAQ‑100 stocks from Q4 2023 to Q2 2025. Each agent follows a deterministic pipeline: collect metrics → score → construct a weighted portfolio. The Buffett agent tops the pack with ~42% CAGR, beating the NASDAQ‑100 and S&P 500 benchmarks in the window tested. The result isn’t “LLMs discovered alpha,” but rather: prompts can reliably translate qualitative philosophies into reproducible, quantitative rules. The real opportunity for practitioners is governed agent design—measurable, auditable prompts tied to tools—plus robust validation far beyond a single bullish regime.
What the authors actually did
- Persona‑encoded prompts. Each agent is assigned a tight role (“You are Warren Buffett…”) plus codified beliefs and tone. The prompts also specify which tools to call (e.g., valuation, leverage, F‑score) and exact tie‑breakers.
- Deterministic reasoning pipeline. Every step is fixed: winsorize → scale → score → apply penalties/bonuses → clip → rank → normalize weights. Given identical inputs, outputs are identical (a minimal sketch follows this list).
- Quarterly rebalance. Portfolios are recomputed each quarter, aligned to the reporting cadence of fundamentals, with minimal transaction cost assumptions.
- Standardized output. Each agent returns a table with Ticker, Score, Weight (%), Reason—ready for backtesting, logging, and comparison.
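To make the deterministic pipeline concrete, here is a minimal Python sketch. The metric names, winsorization bounds, fixed weights, and penalty rule are illustrative assumptions, not the paper's exact specification:

```python
import pandas as pd

def score_portfolio(metrics: pd.DataFrame, top_n: int = 10) -> pd.DataFrame:
    """Deterministic pipeline: winsorize -> scale -> score -> penalize -> clip -> rank -> weights.

    `metrics` has one row per ticker; all column names are assumptions.
    """
    df = metrics.copy()
    # 1) Winsorize each metric at the 5th/95th percentiles to tame outliers.
    for col in ["roic", "fcf_yield", "debt_to_equity"]:
        lo, hi = df[col].quantile([0.05, 0.95])
        df[col] = df[col].clip(lo, hi)
    # 2) Min-max scale to [0, 1] so metrics are comparable.
    for col in ["roic", "fcf_yield"]:
        df[col + "_s"] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
    # 3) Score with fixed weights (assumed values, not the paper's).
    df["score"] = 0.6 * df["roic_s"] + 0.4 * df["fcf_yield_s"]
    # 4) Penalty: flat deduction for high leverage (assumed rule).
    df.loc[df["debt_to_equity"] > 1.0, "score"] -= 0.2
    # 5) Clip, then rank; ties broken deterministically by ticker.
    df["score"] = df["score"].clip(0.0, 1.0)
    df = df.sort_values(["score", "ticker"], ascending=[False, True]).head(top_n)
    # 6) Normalize scores into portfolio weights that sum to 100%.
    df["weight_pct"] = 100 * df["score"] / df["score"].sum()
    return df[["ticker", "score", "weight_pct"]].reset_index(drop=True)
```

Given the same `metrics` table, this returns an identical portfolio every run, which is what makes the backtest auditable.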
Why this matters for operators
This isn’t another “LLM reads news and trades” demo. It’s a design pattern for building auditable financial agents:
1) Lock the philosophy into the system prompt, 2) constrain the data and tools, 3) enforce a deterministic planner, 4) produce machine‑checkable outputs (see the schema sketch below). That’s governance you can take to a model risk committee.
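For step 4, a machine‑checkable output can be as simple as a typed record that refuses to construct on bad data. A minimal sketch, with field names following the paper's Ticker/Score/Weight/Reason table and bounds that are my assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PortfolioRow:
    ticker: str
    score: float       # expected in [0, 1] after clipping (assumed bound)
    weight_pct: float  # expected in [0, 100]
    reason: str        # short rationale string filled in by the LLM

    def __post_init__(self):
        # Fail loudly at construction time: bad rows never reach the backtester.
        if not self.ticker.isupper():
            raise ValueError(f"ticker must be upper-case: {self.ticker!r}")
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score out of range: {self.score}")
        if not 0.0 <= self.weight_pct <= 100.0:
            raise ValueError(f"weight out of range: {self.weight_pct}")

def validate_table(rows: list[PortfolioRow]) -> None:
    """Whole-table check: weights must sum to 100% within tolerance."""
    total = sum(r.weight_pct for r in rows)
    if abs(total - 100.0) > 1e-6:
        raise ValueError(f"weights sum to {total}, expected 100")
```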
Quick scoreboard (study window)
| Agent (style) | CAGR | Ann. Sharpe | Max Drawdown |
|---|---|---|---|
| Buffett (quality at fair price) | ~42% | ~1.40 | ~−22% |
| Piotroski (F‑score value) | ~31% | ~1.14 | ~−23% |
| Graham (margin‑of‑safety value) | ~29% | ~1.01 | ~−24% |
| Altman (distress‑risk screens) | ~26% | ~0.96 | ~−22% |
| Greenblatt (Magic Formula) | ~19% | ~0.87 | ~−21% |
| NASDAQ‑100 (benchmark) | ~29% | ~1.32 | ~−23% |
| S&P 500 (benchmark) | ~26% | ~1.47 | ~−19% |
Interpretation: The Buffett‑style pipeline concentrates into durable cash generators (Big Tech included), which—over this specific window—delivered higher return without worse drawdowns. The Piotroski agent pivots more often, mirroring its checklist ethos. Greenblatt underperforms here, reminding us that simple rules can be regime‑dependent.
Where the paper convinces
- Prompt ≠ vibes. The prompts are procedural specs: they define exact metrics, scaling rules, penalties, and tie‑breakers. That’s closer to a policy file than a pep talk.
- Reproducibility by construction. Determinism + standardized tables + quarterly rebalance = auditability. It’s not just “the model said so.”
- Distinct behaviors emerge. Even with the same LLM backbone, prompts produce meaningfully different sector tilts, turnover, and concentration—traceable to each philosophy.
Where I’m cautious
- Short horizon; hot regime. Q4’23–Q2’25 is a Big‑Tech‑led bull market, so quality and cash‑flow prompts should shine. To claim durability, we need multi‑cycle, multi‑region tests and rolling out‑of‑sample (OOS) validation.
- Feature leakage via tool design. If tools lean on contemporaneous estimates or noisy TTM reconstructions, subtle look‑ahead can creep in. Every metric needs timestamp discipline.
- Costs and capacity. 0.01% per quarterly turnover is gentle. Realistic slippage, borrow, liquidity limits, and tax frictions could compress the edge—especially for higher‑churn agents (see the cost sketch after this list).
- Benchmark choice. NASDAQ‑100 is growth‑heavy. I’d add value and quality factor proxies (e.g., HML, QMJ) and equal‑weight variants to isolate the agents’ true value add.
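On the cost point, a slightly more realistic per‑rebalance cost model is sketched below; every number here is a placeholder assumption, not a calibrated estimate:

```python
def rebalance_cost(turnover: float,
                   commission_bps: float = 1.0,
                   slippage_bps: float = 5.0,
                   impact_coeff: float = 10.0,
                   adv_fraction: float = 0.02) -> float:
    """Estimated cost as a fraction of NAV for one rebalance.

    turnover     : one-way turnover as a fraction of NAV (e.g., 0.25 = 25%)
    impact model : square-root market impact in bps, scaled by trade size
                   relative to average daily volume (all figures assumed).
    """
    linear_bps = commission_bps + slippage_bps
    impact_bps = impact_coeff * (adv_fraction ** 0.5)
    return turnover * (linear_bps + impact_bps) / 1e4
```

Under these placeholder numbers, 25% one‑way turnover costs roughly 0.25 × (6 + 1.4) bps ≈ 1.9 bps per quarter—already about twice the study's 0.01% assumption, before taxes and borrow.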
Operator’s toolkit: turn this into a governed agent
Architecture sketch
- Data contracts. Immutable, versioned fundamentals (IS/BS/CF) and prices, with as‑of timestamps and point‑in‑time shares outstanding.
- Tool layer. Pure functions (e.g., `metric_interest_coverage`, `metric_fcf_yield`) that take a single‑quarter snapshot and return schema‑fixed tables. No side effects (see the sketch after this list).
- Prompt template. Philosophy → metrics allowed → scaling & penalties → tie‑breakers → output schema. Treat it as source‑controlled policy.
- Deterministic planner. Enforce the exact pipeline in code; the LLM fills in the reason strings and validates the output schema, not the math.
- Backtest harness. Quarterly rebalance, transaction logs, cash drag, borrow constraints, and compliance checks.
- Governance. Hash every input and output; store in an immutable ledger (object store + audit DB). Generate a one‑page fact sheet per rebalance.
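A tool‑layer function under these constraints might look like the following sketch; the name matches the document's example, but the input schema is my assumption, not the paper's actual tool API:

```python
import pandas as pd

def metric_fcf_yield(snapshot: pd.DataFrame) -> pd.DataFrame:
    """Pure function: single-quarter, point-in-time snapshot in, fixed schema out.

    Expects columns (assumed): ticker, ocf_ttm, capex_ttm, market_cap, as_of.
    No I/O, no mutation, no hidden state -- same input, same output.
    """
    out = pd.DataFrame({
        "ticker": snapshot["ticker"],
        "as_of": snapshot["as_of"],  # carry the timestamp through for audit trails
        "fcf_yield": (snapshot["ocf_ttm"] - snapshot["capex_ttm"])
                     / snapshot["market_cap"],
    })
    return out.reset_index(drop=True)
```

Because the function is pure and carries `as_of` through, the timestamp‑discipline concern from the cautions section becomes checkable per metric.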
Minimal audit checklist
| Control | What good looks like |
|---|---|
| Data timing | All fundamentals are point‑in‑time (PIT); TTM uses only past quarters. |
| Determinism | Same inputs ⇒ byte‑identical portfolio table (sketch below). |
| Cost model | Commissions, slippage, and taxes documented and justified. |
| Capacity | Per‑name ADV, % of volume, and market‑impact caps enforced. |
| Benchmarking | Include factor benchmarks and equal‑weight comparisons. |
| Drift alerts | Thresholds for weight drift and style drift; human sign‑off on exceptions. |
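To operationalize the determinism row (and the hashing step from the governance bullet), a minimal sketch; the ledger itself is assumed to be whatever append‑only store you already run:

```python
import hashlib
import json

def table_hash(rows: list[dict]) -> str:
    """Canonical SHA-256 of a portfolio table: sorted keys, compact separators."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def assert_deterministic(build_portfolio, inputs, n_runs: int = 3) -> str:
    """Re-run the pipeline and demand byte-identical output every time."""
    hashes = {table_hash(build_portfolio(inputs)) for _ in range(n_runs)}
    if len(hashes) != 1:
        raise AssertionError(f"non-deterministic pipeline: {hashes}")
    return hashes.pop()  # store this digest in the audit ledger
```

In practice you would round floats to a fixed precision before hashing so formatting noise cannot break the byte‑identity check.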
How to pilot this at Cognaptus
1. Start with two archetypes: “Buffett‑quality” and “Piotroski‑value.” They’re philosophically orthogonal.
2. Lock the math outside the LLM. Use the LLM for schema validation, reason strings, and exception messaging—not for numeric scoring.
3. OOS bake‑off. Run 2014–2020 and 2020–2023 as distinct regimes across US large‑cap, US small‑mid, Europe ex‑UK, and APAC. Demand stability.
4. Stress the pipes. Randomize microstructure costs; simulate stale/late filings; force missing‑data fallbacks. Your agent is only as robust as its worst data day.
5. Ensemble the philosophies. Combine agents by risk parity on their active returns vs a broad benchmark, with drawdown‑aware throttles.
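For step 5, a minimal inverse‑volatility (“naive risk parity”) combiner over the agents' active returns, with a placeholder drawdown throttle; the lookback, limit, and halving rule are my assumptions:

```python
import pandas as pd

def ensemble_weights(active_returns: pd.DataFrame,
                     lookback: int = 12,
                     dd_limit: float = -0.15) -> pd.Series:
    """Inverse-volatility weights across agents.

    active_returns: one column per agent, rows are periodic active returns
                    vs a broad benchmark (assumed layout).
    """
    # Risk parity in its simplest form: weight each agent by 1 / recent vol.
    recent = active_returns.tail(lookback)
    inv_vol = 1.0 / recent.std()
    weights = inv_vol / inv_vol.sum()
    # Drawdown-aware throttle: halve any agent whose cumulative active
    # return currently breaches the drawdown limit (assumed rule).
    cum = (1 + active_returns).cumprod()
    drawdown = cum / cum.cummax() - 1
    breached = drawdown.iloc[-1] < dd_limit
    weights[breached] *= 0.5
    return weights / weights.sum()
```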
What I’d test next
- Causal tilt attribution. Decompose returns into quality, value, profitability, and concentration; verify the prompt drove the tilt you think it did.
- Counterfactual prompts. Swap one line (e.g., the Buffett penalty on high P/B) and measure sensitivity; this quantifies prompt brittleness (see the sketch after this list).
- RL‑over‑prompts. Fine‑tune weights inside the prompt (penalty sizes, tie‑breakers) using nested cross‑validation, not the LLM parameters.
- Human‑in‑the‑loop guardrails. Add vetoes: litigation flags, accounting restatements, or ESG exclusions—post‑prompt safety layer.
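For the counterfactual‑prompt test, one way to quantify brittleness is to sweep a single encoded rule and measure how much the portfolio changes. A sketch, where the `pb_penalty` parameter and the overlap metric are hypothetical and weights are assumed to be in percent:

```python
def portfolio_overlap(a: dict[str, float], b: dict[str, float]) -> float:
    """Overlap of two weight dicts: sum of per-ticker min weights (1.0 = identical)."""
    tickers = set(a) | set(b)
    return sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in tickers) / 100.0

def sensitivity_sweep(build_portfolio, inputs,
                      penalties=(0.0, 0.1, 0.2, 0.4)) -> dict[float, float]:
    """Rebuild the portfolio under each penalty size; report overlap vs baseline."""
    baseline = build_portfolio(inputs, pb_penalty=penalties[0])
    results = {}
    for p in penalties[1:]:
        variant = build_portfolio(inputs, pb_penalty=p)
        results[p] = portfolio_overlap(baseline, variant)
    return results  # low overlap at small penalty changes signals brittleness
```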
Bottom line
The paper’s core contribution is a governance blueprint: with the right guardrails, prompts can become investment policies—auditable, repeatable, and attributable. The alpha here isn’t mystical LLM intuition; it’s disciplined system design. If we harden the data timelines, broaden regimes, and pressure‑test costs, prompt‑guided agents can graduate from lab demo to shop floor.
Cognaptus: Automate the Present, Incubate the Future