Investment firms love a house style. Conservative value. Quality growth. Distressed credit. Low-volatility income. The style is supposed to mean something more durable than a portfolio manager’s breakfast mood.

The uncomfortable part is that many “styles” still live in a fog of analyst judgement, committee memory, spreadsheet folklore, and the occasional sacred quote from an investor whose annual letters have been read with the reverence normally reserved for scripture. Everyone claims discipline. Fewer can show exactly how that discipline becomes position weights.

The paper “GuruAgents: Emulating Wise Investors with Prompt-Guided LLM Agents” walks straight into that fog and asks a useful question: can a large language model be prompted into behaving like a systematic version of Benjamin Graham, Edward Altman, Joel Greenblatt, Joseph Piotroski, or Warren Buffett? [1]

The tempting headline is obvious. The Buffett agent achieves a 42.2% CAGR in a backtest on NASDAQ-100 constituents from Q4 2023 to Q2 2025. At this point, the internet would normally declare that Buffett has been reduced to a system prompt, capitalism has been solved, and the next fund prospectus should just be a YAML file.

That is not the interesting reading.

The stronger reading is more operational: the paper shows how a qualitative investment philosophy can be converted into a reproducible agent workflow made of persona instructions, financial tools, deterministic scoring, portfolio construction rules, and standardized rationales. The possible product is not “LLM replaces portfolio manager”. The possible product is “investment doctrine becomes inspectable machinery”.

That difference matters. One is cosplay alpha. The other is governance.

The guru is not the model; the guru is the workflow

The paper’s core mechanism has three layers.

First, each GuruAgent receives a role-based persona. The prompt does not merely say “pick good stocks”. It gives the agent an investor identity, embeds selected beliefs and quotations, and sets tone. Graham is prudent and skeptical. Buffett is patient and business-like. Altman is diagnostic. Greenblatt is rules-driven. Piotroski is checklist-oriented.

Second, the agent receives a defined tool menu. These tools calculate financial metrics: liquidity ratios, leverage, return on equity, profit margin, valuation multiples, free-cash-flow yield, Altman Z-score variants, Greenblatt earnings yield and return on capital, Piotroski F-score signals, and related components. The agent is not free to hallucinate its own financial dataset over a nice cup of vibes. It must call specified tools and use their outputs.

Third, the agent follows a deterministic reasoning pipeline. It collects metrics, scores firms under investor-specific rules, converts scores into portfolio weights, applies tie-breakers, and outputs a table with four fields: ticker, score, weight, and reason.

That table is the crucial interface. It makes the result backtestable. It also makes the agent’s behaviour comparable across philosophies. Without that hard output schema, the system would be a charming financial essay generator, which is useful only if the investment committee has run out of bedtime stories.
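To make that interface concrete, here is a minimal Python sketch of the last two pipeline steps: scores in, four-field table out. The score-proportional weighting, the positive-score eligibility rule, the alphabetical tie-breaker, and every name in it (`Row`, `build_table`, the placeholder tickers) are illustrative assumptions, not the paper's actual rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One line of the output table: ticker, score, weight, reason."""
    ticker: str
    score: float
    weight: float
    reason: str


def build_table(scores: dict[str, float], reasons: dict[str, str], top_n: int = 3) -> list[Row]:
    """Turn per-firm scores into a weight table (hypothetical rules throughout)."""
    # Assumed eligibility rule: only firms with a positive score can be held.
    eligible = {t: s for t, s in scores.items() if s > 0}
    # Sort by score, highest first; ties are broken alphabetically by ticker (assumed).
    ranked = sorted(eligible.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    total = sum(s for _, s in ranked)
    if total == 0:
        return []  # nothing eligible, so no portfolio
    # Assumed weighting: proportional to score, so weights sum to 1.
    return [Row(t, s, s / total, reasons.get(t, "")) for t, s in ranked]


table = build_table(
    scores={"AAA": 6.0, "BBB": 3.0, "CCC": 3.0, "DDD": -1.0},
    reasons={"AAA": "high ROE, low leverage", "BBB": "adequate liquidity", "CCC": "cheap on earnings"},
    top_n=2,
)
for row in table:
    print(row.ticker, row.score, round(row.weight, 4), row.reason)
```

The point of the sketch is the shape, not the rule: once every agent emits the same four fields, a backtester can consume any of them without caring which guru wrote the reasons.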

The appendix matters here. It is not decorative. It is implementation detail. It shows that each “guru” is not only a persona but a scoring apparatus.

| Agent | What the prompt turns into machinery | Operational behaviour implied |
| --- | --- | --- |
| Graham | Liquidity, low leverage, profitability, working capital, margin-of-safety language, valuation support | Broad value-quality scoring with conservative penalties and bonuses |
| Altman | Z, Z′, and Z″ distress classifications with Safe/Grey/Distress zones | Balance-sheet diagnostic selection, prioritising firms further into the Safe zone |
| Greenblatt | Earnings yield and return on capital, with exclusions for invalid or negative economics | Simple rank-based selection, periodically refreshed |
| Piotroski | Nine accounting signals across profitability, leverage/liquidity, and operating efficiency | Checklist-driven fundamental improvement strategy, naturally higher turnover |
| Buffett | ROE, interest coverage, profit margin, valuation, free cash flow, ROCE, balance-sheet penalties, quality bonuses | Concentrated quality-at-a-fair-price behaviour, especially where the universe rewards dominant cash-generative firms |
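The Altman row is the easiest one to show as code. The sketch below uses the textbook coefficients and cutoffs of the original Z-score for public manufacturing firms (Safe above 2.99, Distress below 1.81). Whether the paper's tools use exactly these values, and how they handle the Z′ and Z″ variants, is not stated in this article, so treat the function names and the example figures as assumptions.

```python
def altman_z(
    working_capital: float,
    retained_earnings: float,
    ebit: float,
    market_equity: float,
    sales: float,
    total_assets: float,
    total_liabilities: float,
) -> float:
    """Original Altman Z-score for public manufacturing firms."""
    if total_assets <= 0 or total_liabilities <= 0:
        raise ValueError("total assets and total liabilities must be positive")
    return (
        1.2 * working_capital / total_assets
        + 1.4 * retained_earnings / total_assets
        + 3.3 * ebit / total_assets
        + 0.6 * market_equity / total_liabilities
        + 1.0 * sales / total_assets
    )


def zone(z: float) -> str:
    """Map a Z-score to the three classic zones."""
    if z > 2.99:
        return "Safe"
    if z < 1.81:
        return "Distress"
    return "Grey"


# Made-up balance-sheet figures, in the same currency unit.
z = altman_z(
    working_capital=20, retained_earnings=30, ebit=15,
    market_equity=120, sales=90, total_assets=100, total_liabilities=50,
)
print(round(z, 3), zone(z))
```

A selection rule that prioritises firms "further into the Safe zone" could then rank by `z` among the firms where `zone(z) == "Safe"`, although that ranking step is a guess at the mechanics rather than a quotation of them.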

This is why the paper is best read mechanism-first. The benchmark table is interesting, but the mechanism explains why the table exists at all.

The backtest is the main evidence, not a divine revelation

The experiment uses GPT-4o as the underlying model, implemented through LangChain and LangGraph. The investment universe is NASDAQ-100 constituents from Q4 2023 to Q2 2025. The authors choose that horizon because it lies beyond GPT-4o’s knowledge cutoff, reducing the chance that the model is simply recalling known historical outcomes.

Each portfolio is rebalanced quarterly, matching the rhythm of financial reporting. Transaction costs are assumed at 0.01% of gross quarterly turnover. The agents are compared against the NASDAQ-100 and S&P 500 using return, volatility, Sharpe ratio, maximum drawdown, VaR, and CVaR.
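The cost rule and two of the headline metrics fit in a few lines. This is a sketch under stated assumptions: gross turnover is taken as the sum of absolute weight changes at rebalance, returns are quarterly, and all function names are hypothetical. The paper's exact conventions (for example, the return frequency used for Sharpe, VaR, and CVaR) are not reproduced here.

```python
import math

COST_RATE = 0.0001  # 0.01% of gross quarterly turnover, as stated in the setup


def gross_turnover(old: dict[str, float], new: dict[str, float]) -> float:
    """Sum of absolute weight changes across all tickers (assumed definition)."""
    tickers = set(old) | set(new)
    return sum(abs(new.get(t, 0.0) - old.get(t, 0.0)) for t in tickers)


def net_return(gross_return: float, old: dict[str, float], new: dict[str, float]) -> float:
    """Quarterly return after deducting the turnover-based cost."""
    return gross_return - COST_RATE * gross_turnover(old, new)


def cagr(period_returns: list[float], periods_per_year: int = 4) -> float:
    """Compound annual growth rate from a list of per-period returns."""
    if not period_returns:
        raise ValueError("need at least one return")
    growth = math.prod(1.0 + r for r in period_returns)
    years = len(period_returns) / periods_per_year
    return growth ** (1.0 / years) - 1.0


def max_drawdown(period_returns: list[float]) -> float:
    """Largest peak-to-trough loss of the cumulative value path (a negative number)."""
    value = peak = 1.0
    worst = 0.0
    for r in period_returns:
        value *= 1.0 + r
        peak = max(peak, value)
        worst = min(worst, value / peak - 1.0)
    return worst


old_w = {"AAA": 0.5, "BBB": 0.5}
new_w = {"AAA": 0.25, "CCC": 0.75}
quarterly = [0.10, -0.05, 0.08, 0.02]  # made-up returns
print(gross_turnover(old_w, new_w), net_return(0.05, old_w, new_w))
print(cagr(quarterly), max_drawdown(quarterly))
```

At this cost rate even a complete portfolio replacement costs a few basis points per quarter, which is why the assumption deserves scrutiny before it travels to less liquid markets.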

That gives the paper its main evidence: Figure 1 and Table 1. Figure 1 shows cumulative returns. Table 1 gives summary performance metrics. These are not ablations. They are the primary empirical result.

The results are worth reporting carefully:

| Strategy | CAGR | Annualised Sharpe | Maximum drawdown | Interpretation |
| --- | --- | --- | --- | --- |
| Warren Buffett agent | 42.23% | 1.3991 | -22.34% | Best CAGR among all tested agents and benchmarks; strong but not best Sharpe |
| Joseph Piotroski agent | 30.93% | 1.1432 | -23.07% | Beats both benchmarks on CAGR, but not on risk-adjusted performance |
| Benjamin Graham agent | 28.74% | 1.0132 | -23.89% | Beats S&P 500 CAGR, slightly trails NASDAQ-100 CAGR |
| Edward Altman agent | 25.74% | 0.9598 | -21.71% | Slightly below S&P 500 CAGR and below NASDAQ-100 |
| Joel Greenblatt agent | 19.38% | 0.8652 | -20.74% | Lowest CAGR and weakest agent-level risk-adjusted performance despite lower volatility |
| NASDAQ-100 | 29.36% | 1.3151 | -22.77% | Strong benchmark; beaten by Buffett and Piotroski on CAGR |
| S&P 500 | 26.31% | 1.4728 | -18.76% | Lower CAGR than Buffett and Piotroski, but highest annualised Sharpe and lower drawdown |

This table is exactly where careless readers can get themselves into trouble. Buffett wins the CAGR contest, yes. But the S&P 500 has the highest annualised Sharpe ratio in the reported table (1.47 against Buffett’s 1.40). Buffett’s drawdown is slightly better than the NASDAQ-100’s (-22.34% against -22.77%), but worse than the S&P 500’s (-18.76%). Piotroski beats both benchmarks on CAGR but not on Sharpe.

So the result is not “LLM Buffett dominates investing”. The result is narrower and more useful: a prompt-engineered investment workflow can generate meaningfully different portfolio behaviour under the same model backbone, and those differences can produce materially different backtest outcomes.

The paper’s best evidence is behavioural as much as financial.

Prompt design changes the portfolio, not just the explanation

Figure 2 is best read as behaviour diagnostics. It is not a robustness test. It does not prove that the prompt caused every unit of excess return. What it does show is that the agents produce visibly different portfolio weight patterns.

The Buffett agent repeatedly allocates to dominant firms such as AAPL, MSFT, and NVDA. That is consistent with a prompt emphasising high-quality businesses, durable advantages, stable cash flows, conservative balance sheets, and long-term ownership.

The Piotroski agent turns over more often. That makes sense: the F-score is a signal checklist based on recent accounting improvement. If the checklist changes quarter by quarter, the portfolio should change too. The agent is not being fickle. It is obeying the machinery it was given.

Greenblatt sits somewhere between those behaviours. Its Magic Formula logic ranks companies by earnings yield and return on capital, so periodic reshuffling is part of the design.
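The ranking logic can be sketched as follows: rank every eligible firm on each metric, add the two ranks, and take the lowest combined ranks. The exclusion of firms with non-positive values mirrors the "invalid or negative economics" filter described for this agent, but the function name, the tie-breaker, and the sample numbers are assumptions.

```python
def magic_formula(firms: dict[str, tuple[float, float]], top_n: int) -> list[str]:
    """Select tickers by combined rank of earnings yield and return on capital.

    `firms` maps ticker -> (earnings_yield, return_on_capital).
    """
    # Drop firms with non-positive earnings yield or return on capital.
    valid = {t: v for t, v in firms.items() if v[0] > 0 and v[1] > 0}
    # Rank 1 is best on each metric (highest value).
    by_yield = sorted(valid, key=lambda t: -valid[t][0])
    by_roc = sorted(valid, key=lambda t: -valid[t][1])
    yield_rank = {t: i for i, t in enumerate(by_yield, start=1)}
    roc_rank = {t: i for i, t in enumerate(by_roc, start=1)}
    # Lowest combined rank wins; ties fall back to ticker order (assumed).
    combined = sorted(valid, key=lambda t: (yield_rank[t] + roc_rank[t], t))
    return combined[:top_n]


sample = {
    "AAA": (0.12, 0.30),
    "BBB": (0.08, 0.50),
    "CCC": (0.10, 0.10),
    "DDD": (-0.02, 0.40),  # negative earnings yield, excluded
}
picks = magic_formula(sample, top_n=2)
print(picks)
```

Since ranks are relative, one firm's improved quarter can push another out of the portfolio even if the latter's own numbers did not change, so some reshuffling at each rebalance is built in.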

This is the paper’s strongest practical insight. Prompt engineering is not only a way to change the prose around a decision. It can change the structure of the decision itself: which metrics are used, which names qualify, how concentrated the portfolio becomes, how often positions are replaced, and which sectors are implicitly favoured.

That should make financial firms both interested and nervous. Interested, because investment doctrine can become more scalable. Nervous, because the prompt is now part of the investment policy. A bad prompt is not just bad wording. It is bad governance with nicer typography.

The paper shows philosophy-to-policy translation

The paper positions GuruAgents against prior LLM-in-finance work that treats models mainly as information processors: extracting signals from news, parsing analyst reports, generating views, or supporting optimisation. GuruAgents instead use the LLM as a policy interpreter operating inside a bounded workflow.

That distinction is important.

A normal quantitative strategy starts with explicit variables and rules. A discretionary strategy starts with judgement, philosophy, and experience. GuruAgents try to bridge the two by converting doctrine into an executable prompt-and-tool system.

For a business, this suggests a practical architecture:

  1. Define the investment philosophy in plain language.
  2. Convert it into a role, allowed evidence, scoring policy, and risk boundaries.
  3. Bind the agent to approved financial tools.
  4. Force structured output.
  5. Backtest behaviour.
  6. Audit whether decisions match the intended doctrine.
  7. Only then discuss performance.
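Step 4 is the one that turns an essay into an input. A hedged sketch of what "force structured output" might mean in practice: refuse anything that does not parse into the four-field table with sane weights. The JSON encoding, the `validate_rows` name, and the specific checks are assumptions for illustration, not the paper's implementation.

```python
import json

REQUIRED_FIELDS = {"ticker", "score", "weight", "reason"}


def validate_rows(raw_json: str, allowed_tickers: set[str], tol: float = 1e-6) -> list[dict]:
    """Parse agent output; raise ValueError unless it is a well-formed weight table."""
    rows = json.loads(raw_json)  # malformed JSON raises JSONDecodeError, a ValueError
    if not isinstance(rows, list) or not rows:
        raise ValueError("expected a non-empty JSON list of rows")
    seen: set[str] = set()
    for row in rows:
        if not isinstance(row, dict) or set(row) != REQUIRED_FIELDS:
            raise ValueError(f"each row needs exactly these fields: {sorted(REQUIRED_FIELDS)}")
        ticker = row["ticker"]
        if not isinstance(ticker, str) or ticker not in allowed_tickers:
            raise ValueError(f"ticker outside the approved universe: {ticker!r}")
        if ticker in seen:
            raise ValueError(f"duplicate ticker: {ticker}")
        if not isinstance(row["weight"], (int, float)) or row["weight"] < 0:
            raise ValueError(f"weight must be a non-negative number for {ticker}")
        if not str(row["reason"]).strip():
            raise ValueError(f"missing rationale for {ticker}")
        seen.add(ticker)
    if abs(sum(r["weight"] for r in rows) - 1.0) > tol:
        raise ValueError("weights must sum to 1")
    return rows


universe = {"AAA", "BBB", "CCC"}
good = json.dumps([
    {"ticker": "AAA", "score": 7, "weight": 0.6, "reason": "high ROE"},
    {"ticker": "BBB", "score": 5, "weight": 0.4, "reason": "low leverage"},
])
bad = json.dumps([{"ticker": "ZZZ", "score": 9, "weight": 1.0, "reason": "vibes"}])

print(len(validate_rows(good, universe)))
try:
    validate_rows(bad, universe)
except ValueError as err:
    print("rejected:", err)
```

A rejected output is a useful event in its own right: logging how often and why the agent fails validation is a cheap first audit signal for steps 5 and 6.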

The boring middle steps are where the value lives. Naturally, they are also the steps least likely to appear in a LinkedIn demo.

A CIO does not need an agent that says, “As Warren Buffett, I believe in moats.” A CIO needs an agent that can show which variables operationalise “moat”, how missing data is handled, how valuation penalties are applied, how weights are normalised, and why one stock displaced another at rebalance.

The paper gets closer to that version of agentic finance than the usual chatbot-for-stock-picks spectacle.

The useful product is an auditable investment style engine

For financial institutions, the most relevant application is not a public “Buffett bot”. That would be cute, legally irritating, and probably not where durable value sits.

The more serious application is internal style codification.

A firm could use this kind of architecture to encode its own investment doctrines: quality income, defensive credit, emerging-market value, capital-light compounders, small-cap quality, post-restructuring balance-sheet repair, or any other style that currently lives half in research notes and half in partner instinct.

The business value comes from repeatability.

| Business use case | What GuruAgents-style architecture enables | What still needs human control |
| --- | --- | --- |
| Investment committee discipline | Converts qualitative doctrine into repeatable scoring and portfolio proposals | Approval of doctrine, constraints, exceptions, and risk limits |
| Analyst training | Shows junior analysts how a style translates into metrics and trade-offs | Teaching when metrics fail or become misleading |
| Model governance | Creates inspectable prompts, tool calls, scores, weights, and rationales | Validation, monitoring, compliance review, and escalation paths |
| Product differentiation | Makes a house view operational rather than merely branded | Ensuring the style remains economically meaningful |
| Strategy research | Rapidly prototypes systematic versions of discretionary beliefs | Out-of-sample testing, transaction cost realism, capacity analysis |

This is not just for asset managers. Wealth platforms, research teams, family offices, robo-advisers, and fintech infrastructure providers could all use similar systems. The agent does not need to be a celebrity investor. It can be a risk committee, a credit policy, a sector analyst template, or a client suitability doctrine.

In that sense, “Buffett as a system prompt” is the entertaining wrapper. The enterprise object is something drier and more valuable: an auditable decision policy that can be versioned, tested, compared, and improved.

The absent ablation is the quiet problem

The paper argues that prompt engineering drives the observed behavioural differences. That claim is plausible because all agents use the same GPT-4o backbone and differ in persona, tools, scoring design, and portfolio rules.

But plausibility is not the same as causal isolation.

The paper does not provide a full ablation study. It does not separately remove the persona while keeping the tools. It does not keep the persona while changing scoring rules. It does not test whether Buffett-like concentration comes mostly from the prompt language, the metrics selected, the weighting formula, the NASDAQ-100 universe, or simply the fact that large-cap technology had a generous period.

That does not make the paper weak. It tells us how to read it.

The backtest is the main evidence that the complete GuruAgent packages behave differently and can perform differently. Figure 2 supports the behavioural interpretation. The appendix provides implementation detail. The paper’s future-work note on philosophical-alignment metrics is not a side comment; it points directly at the evaluation gap.

For business readers, this matters because “the agent follows our philosophy” is not something to assert after glancing at a rationale sentence. It needs measurement.

Possible alignment tests would include:

  • comparing agent decisions against historical committee decisions;
  • measuring factor exposures against the intended style;
  • testing portfolio stability under small prompt changes;
  • checking whether rationales cite the same variables that actually drive scores;
  • using counterfactual company examples to see whether the agent violates doctrine;
  • running prompt ablations to separate persona effects from metric-rule effects.
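Two of those tests are simple enough to sketch: a counterfactual check that the scorer never rewards what the doctrine forbids, and a check that a rationale names the variables that actually drove the score. Everything here is hypothetical, including `toy_score`, which stands in for whatever scoring step a real agent exposes, and the naive substring matching.

```python
from typing import Callable, Mapping

Metrics = Mapping[str, float]


def toy_score(m: Metrics) -> float:
    """Stand-in scorer (hypothetical): rewards ROE, penalises leverage."""
    return 10.0 * m["roe"] - 2.0 * m["debt_to_equity"]


def violates_doctrine(
    score_fn: Callable[[Metrics], float],
    base: Metrics,
    field: str,
    bump: float,
    higher_is_better: bool,
) -> bool:
    """True if increasing `field` by `bump` moves the score the forbidden way."""
    changed = {**base, field: base[field] + bump}
    delta = score_fn(changed) - score_fn(base)
    return delta < 0 if higher_is_better else delta > 0


def rationale_cites_drivers(reason: str, drivers: list[str]) -> bool:
    """True if every driver name appears in the rationale (case-insensitive)."""
    text = reason.lower()
    return all(d.lower() in text for d in drivers)


company = {"roe": 0.20, "debt_to_equity": 0.5}
# Doctrine says more leverage must never raise the score.
leverage_ok = not violates_doctrine(toy_score, company, "debt_to_equity", 0.5, higher_is_better=False)
cited = rationale_cites_drivers("High ROE with modest leverage", ["roe", "leverage"])
print(leverage_ok, cited)
```

With a real agent the scorer would be a call into the workflow rather than a two-term formula, and the counterfactual companies would be chosen to sit on the doctrine's edges, where violations are most likely to hide.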

Without that layer, firms may end up with a system that sounds like the house style but trades like a caffeinated index tilt.

The backtest boundary is narrow by design

The backtest window is short: Q4 2023 to Q2 2025. The universe is NASDAQ-100. That universe is large-cap, technology-heavy, and unusually friendly to certain quality-growth and mega-cap concentration patterns during the tested period.

That boundary changes interpretation.

The Buffett agent’s 42.2% CAGR is impressive inside this setting. It is not proof of durable alpha across regimes, asset classes, geographies, liquidity conditions, or valuation cycles. It also does not prove that the system would survive live trading, higher transaction costs, capacity constraints, tax effects, delayed fundamentals, changing constituent membership, or stricter compliance rules.

The transaction cost assumption is also light: 0.01% of gross quarterly turnover. That may be reasonable for liquid mega-cap equities, but it should not be lazily carried into small caps, emerging markets, crypto, credit, or private markets. Slippage has a personality. It becomes unpleasant when ignored.

There is another subtle boundary. The paper chooses a post-knowledge-cutoff period to reduce memorisation risk. That is sensible. But a post-cutoff backtest still does not equal a live prospective trial. The data pipeline, rebalance dates, accounting availability, and universe construction all matter. For deployment, the boring question is not “did it beat the benchmark?” It is “could this exact process have been run at the time with the data actually available?”
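That question has a mechanical core: at each rebalance date, the process may only see filings that had already been published. A minimal sketch, with a hypothetical `Filing` record and made-up dates:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Filing:
    ticker: str
    period_end: date   # fiscal period the numbers describe
    published: date    # date the numbers became public
    roe: float


def latest_available(filings: list[Filing], ticker: str, as_of: date) -> Optional[Filing]:
    """Most recent filing for `ticker` that was already public on `as_of`."""
    visible = [f for f in filings if f.ticker == ticker and f.published <= as_of]
    return max(visible, key=lambda f: f.period_end, default=None)


filings = [
    Filing("AAA", date(2024, 3, 31), date(2024, 5, 2), roe=0.18),
    Filing("AAA", date(2024, 6, 30), date(2024, 8, 1), roe=0.21),
]

# Rebalancing on 1 July: the June quarter has ended but is not yet reported.
seen = latest_available(filings, "AAA", as_of=date(2024, 7, 1))
print(seen.period_end if seen else None)
```

Filtering on `period_end` instead of `published` would quietly hand the agent numbers nobody had at the time, which is the look-ahead leak a post-cutoff window does not protect against on its own.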

That is where research becomes operations.

What Cognaptus infers for business use

The paper directly shows that five GPT-4o-based agents, built with distinct investor-inspired prompts, financial tools, and deterministic portfolio rules, generate different portfolios and different backtest results over a defined NASDAQ-100 period. It also shows that the Buffett and Piotroski agents beat the listed benchmarks on CAGR, while risk-adjusted comparisons are more mixed.

Cognaptus infers a broader business lesson: the near-term value of LLMs in systematic investing may come less from autonomous genius and more from policy formalisation. LLM agents can help translate qualitative doctrines into structured decision procedures, especially when constrained by tools, schemas, scoring rules, and audit trails.

What remains uncertain is the expensive part: whether these systems produce robust out-of-sample alpha, whether they truly align with the intended philosophy, whether they remain stable under prompt variation, and whether their outputs survive the operational frictions of live investment management.

That uncertainty does not make the approach unimportant. It makes it implementable only with adult supervision. A tragic requirement, but finance has survived worse.

Promptfolios are governance objects

The word “promptfolio”, short for prompt portfolio, sounds like a gimmick until you realise what it could become.

A promptfolio is not a portfolio chosen by a chatbot. It is a portfolio generated by a documented decision contract: persona, permitted evidence, tools, scoring logic, constraints, tie-breakers, output schema, and rebalance policy. In other words, it is a governance object.
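One way to treat that contract as a governance object is to make it an immutable record with a content fingerprint, so that any change to the prompt or rules yields a new, traceable version. The field names follow the list above; the class, the hashing scheme, and the example values are assumptions rather than anything the paper specifies.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class DecisionContract:
    persona: str
    permitted_evidence: tuple[str, ...]
    tools: tuple[str, ...]
    scoring_logic: str
    constraints: str
    tie_breaker: str
    output_schema: tuple[str, ...]
    rebalance_policy: str

    def fingerprint(self) -> str:
        """Short, stable hash of the full contract content."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


v1 = DecisionContract(
    persona="Patient, business-like quality investor",
    permitted_evidence=("quarterly fundamentals",),
    tools=("roe", "interest_coverage", "fcf_yield"),
    scoring_logic="quality points minus balance-sheet penalties",
    constraints="long-only, fully invested",
    tie_breaker="higher ROE first",
    output_schema=("ticker", "score", "weight", "reason"),
    rebalance_policy="quarterly",
)
# Changing a single rule produces a different version identifier.
v2 = replace(v1, tie_breaker="lower leverage first")
print(v1.fingerprint(), v2.fingerprint())
```

Stamping each backtest and each live proposal with the fingerprint lets a reviewer tie a portfolio back to the exact wording and rules that produced it.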

That framing is more useful than asking whether the Buffett agent is “really” Buffett. It is not. Buffett is not hiding inside GPT-4o, waiting for a quarterly fundamentals table. The system is a stylised approximation of a philosophy, converted into portfolio machinery.

The business question is whether that machinery is useful.

For research teams, it can accelerate strategy prototyping. For investment committees, it can expose hidden assumptions. For compliance teams, it can make decisions easier to inspect. For product teams, it can turn vague style labels into repeatable workflows. For portfolio managers, it can function as a disciplined second reader, provided nobody lets it drive alone after one attractive backtest.

The paper’s best contribution is therefore not the Buffett number. It is the demonstration that prompt engineering, when tied to tools and deterministic outputs, can become a serious interface between human investment doctrine and systematic portfolio construction.

That is less glamorous than “AI beats the market”.

It is also far more likely to matter.

Cognaptus: Automate the Present, Incubate the Future.


  1. Yejin Kim, Youngbin Lee, Juhyeong Kim, and Yongjae Lee, “GuruAgents: Emulating Wise Investors with Prompt-Guided LLM Agents,” arXiv:2510.01664, 2025, https://arxiv.org/abs/2510.01664