Three’s Company: When LLMs Argue Their Way to Alpha

TL;DR for operators

Portfolio teams do not need another chatbot that confidently explains why yesterday’s price move was “driven by sentiment.” They need a system that can split research work into specialised roles, force disagreement into the open, log the reasoning trail, and turn messy inputs into a decision that a human can inspect before money moves.

That is the practical value of AlphaAgents, a role-based multi-agent LLM framework for equity stock selection.¹ The paper builds three specialist agents: a Fundamental Agent for 10-K/10-Q and financial disclosure analysis, a Sentiment Agent for Bloomberg-style news and analyst-rating information, and a Valuation Agent for price, volume, volatility, and return analytics. The agents first produce stock analysis, then participate in a structured round-robin debate until they reach consensus on whether a stock should enter an equally weighted portfolio.

The headline backtest is interesting, but it should not be read as proof of durable alpha. The experiment uses 15 technology stocks, January 2024 data, a February 1, 2024 portfolio start date, and a four-month test window. It focuses on stock selection only. It does not perform portfolio optimisation, transaction-cost modelling, liquidity analysis, compliance gating, tax-aware allocation, or regime stress testing. The sentiment agent is also excluded from the single-agent portfolio comparison because news coverage is insufficient in some cases. In other words: useful prototype, not a CIO in a browser tab.

The most important business lesson is architectural. AlphaAgents looks less like a fully autonomous asset manager and more like an investment-committee operating model: assign specialist research roles, give each role constrained tools, force cross-examination, log the debate, and leave room for human override. That is where near-term ROI lives: faster analyst coverage, better research traceability, more consistent risk framing, and a cleaner handoff into existing portfolio-construction systems.

The risk-tolerance result is also quietly useful. Risk-averse prompting changes behaviour in a recognisable direction: fewer, more conservative selections, lower upside in a bullish tech period, and somewhat reduced downside. Risk-seeking prompting, however, is nearly indistinguishable from risk-neutral prompting and is dropped from final evaluation. Apparently, telling an LLM to be adventurous is not the same as giving it a risk model. Shocking, but only if one has spent too much time near pitch decks.

The useful object is the committee, not the oracle

The obvious but wrong reading of this paper is: “LLM agents can pick stocks.”

The more useful reading is: “LLM agents can imitate parts of the institutional research process when their roles, tools, data, and interaction protocol are constrained.”

That distinction matters. A stock-picking oracle is judged almost entirely on out-of-sample returns. A research committee is judged on whether it collects relevant evidence, separates short-term and long-term signals, identifies disagreement, produces reviewable reasoning, and gives a portfolio manager a better decision package than raw documents and scattered analyst notes.

AlphaAgents is designed around that second model.

The system has three specialist agents:

| Agent | Data and tools | Intended research function | Operational analogy |
| --- | --- | --- | --- |
| Fundamental Agent | 10-K/10-Q data, report-pull tool, financial-report RAG | Analyse financial disclosures, cash flow, margins, operational progress, and risks | Fundamental analyst |
| Sentiment Agent | Bloomberg-style news body and summarisation tool | Summarise recent news, analyst ratings, disclosures, and likely sentiment effects | News and event analyst |
| Valuation Agent | Historical open, high, low, close, volume; volatility and return calculator | Assess recent price/volume behaviour, volatility, and return profile | Quant/valuation analyst |

The design choice is simple but important: each agent gets access only to the data relevant to its role. That is not just tidy engineering. It reduces the temptation for one broad LLM prompt to mush every signal into a vague “balanced view.” A fundamental signal should not silently become a momentum signal. A news signal should not masquerade as balance-sheet analysis. A volatility signal should not be dressed up as management quality. The separation creates friction, and in investment research, friction is often where useful judgement begins.
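
The role-to-tool separation can be sketched as a small allow-list. This is a minimal illustration, not the paper's implementation: the tool names, the `AgentSpec` class, and the `authorise` helper are hypothetical labels chosen to mirror the table above.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class AgentSpec:
    """One specialist role and the only tools it is allowed to call."""
    name: str
    allowed_tools: FrozenSet[str]


# Tool names are illustrative labels, not identifiers taken from the paper.
AGENTS = {
    "fundamental": AgentSpec("fundamental", frozenset({"pull_report", "financial_report_rag"})),
    "sentiment": AgentSpec("sentiment", frozenset({"summarise_news"})),
    "valuation": AgentSpec("valuation", frozenset({"return_volatility_calculator"})),
}


def authorise(agent: str, tool: str) -> None:
    """Raise PermissionError if an agent requests a tool outside its role."""
    spec = AGENTS[agent]
    if tool not in spec.allowed_tools:
        raise PermissionError(f"{spec.name} agent may not call {tool}")


authorise("valuation", "return_volatility_calculator")  # within role, passes silently
```

The point of the allow-list is that a sentiment agent asking for the volatility calculator fails loudly instead of quietly blending signals.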

AlphaAgents turns disagreement into an explicit workflow

The paper uses Microsoft AutoGen and AutoGen Studio to coordinate the agents, with GPT-4o as the selected model after experimentation. The workflow has two related modes: collaboration and debate.

In collaboration mode, a group-chat assistant coordinates the specialist agents and consolidates their inputs into a stock analysis report. The group-chat prompt explicitly requires every agent to speak at least twice before the final report is produced. This is a small implementation detail with large governance implications. It prevents the first plausible answer from becoming the final answer simply because it arrived early and sounded adult.

In debate mode, each agent receives the user query plus the peer analyses. The discussion proceeds in a round-robin format until the agents reach consensus. The coordinator is instructed that no single agent can decide for the group and that all agents must be invoked before termination.

That gives the system a structure closer to an investment committee than a single chatbot:

Stock question
   ↓
Role-specific evidence collection
   ↓
Specialist recommendation: BUY or SELL
   ↓
Round-robin debate across agents
   ↓
Consensus inclusion decision
   ↓
Logged reasoning trail for review or override
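
The flow above can be approximated in a few lines. The sketch below does not use the AutoGen API; the agents are hypothetical stub functions standing in for LLM calls. The `min_rounds` and `max_rounds` parameters are assumptions: `min_rounds=2` borrows the "speak at least twice" rule from collaboration mode, and the round cap exists only so the loop terminates.

```python
from typing import Callable, Dict, List, Optional, Tuple

Turn = Tuple[str, str]                    # (agent name, "BUY" or "SELL")
Agent = Callable[[str, List[Turn]], str]  # hypothetical stand-in for an LLM agent


def round_robin_debate(ticker: str, agents: Dict[str, Agent],
                       min_rounds: int = 2, max_rounds: int = 10
                       ) -> Tuple[Optional[str], List[Turn]]:
    """Poll every agent in a fixed order until all of them give the same vote."""
    transcript: List[Turn] = []           # full log, kept for review or override
    for round_no in range(1, max_rounds + 1):
        for name, agent in agents.items():  # every agent is invoked in every round
            transcript.append((name, agent(ticker, list(transcript))))
        votes = {vote for _, vote in transcript[-len(agents):]}
        if round_no >= min_rounds and len(votes) == 1:
            return votes.pop(), transcript
    return None, transcript               # no consensus: hand over to a human


def fundamental(ticker: str, history: List[Turn]) -> str:
    return "SELL"                         # stub: always cautious


def sentiment(ticker: str, history: List[Turn]) -> str:
    return "SELL"                         # stub: always cautious


def valuation(ticker: str, history: List[Turn]) -> str:
    # Stub: opens with BUY, then concedes on its second turn.
    spoken_before = any(name == "valuation" for name, _ in history)
    return "SELL" if spoken_before else "BUY"


decision, transcript = round_robin_debate(
    "COMPANY_Z",
    {"fundamental": fundamental, "sentiment": sentiment, "valuation": valuation},
)
```

Returning `None` when the round cap is hit is a deliberate choice in this sketch: a debate that fails to converge is treated as a case for human review, not something to settle by majority vote.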

The paper also emphasises that all discussion histories are logged as supplementary output, allowing users to review and override the agents’ conclusions. This is one of the more business-relevant pieces of the architecture. Investment organisations do not merely need an answer; they need a decision record. Why was the stock included? Which agent objected? Was the concern about volatility, operating losses, insider selling, insufficient news, or valuation momentum? Could a human disagree with the final recommendation and still understand how the system got there?

That audit trail is the difference between “AI said buy” and “the model produced a reviewable investment memo.” The former is a compliance department’s sleep paralysis demon. The latter can at least enter a workflow.

The tools are doing more work than the agent branding

It is tempting to focus on the theatrical part: three agents arguing. The more practical engineering lesson is that AlphaAgents gives agents tools that narrow what they are allowed to do.

The Valuation Agent does not merely “think about price.” It receives a computational tool to calculate annualised cumulative return and annualised volatility. The annualised return calculation assumes 252 trading days, and annualised volatility is computed by scaling daily volatility by $\sqrt{252}$. That matters because numerical finance is one of the places where LLMs can appear competent while quietly fumbling arithmetic. Tool use reduces that risk.
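
A minimal sketch of such a calculator, using only the Python standard library, looks like this. The 252-day convention and the $\sqrt{252}$ scaling come from the description above. The compounding form of the annualised return and the use of the sample standard deviation are assumptions, and the prices are made up.

```python
import math
import statistics

TRADING_DAYS = 252  # trading-day convention stated in the article


def daily_returns(closes):
    """Simple daily returns from a list of closing prices."""
    return [curr / prev - 1 for prev, curr in zip(closes, closes[1:])]


def annualised_return(closes):
    """Compound the window's cumulative return to a 252-day year (assumed form)."""
    n_days = len(closes) - 1
    cumulative = closes[-1] / closes[0] - 1
    return (1 + cumulative) ** (TRADING_DAYS / n_days) - 1


def annualised_volatility(closes):
    """Sample standard deviation of daily returns, scaled by sqrt(252)."""
    return statistics.stdev(daily_returns(closes)) * math.sqrt(TRADING_DAYS)


closes = [100.0, 101.0, 99.5, 102.0, 103.5]  # made-up prices for illustration
print(round(annualised_return(closes), 4), round(annualised_volatility(closes), 4))
```

Handing the arithmetic to a deterministic function means the agent only has to interpret the numbers, not produce them.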

The Sentiment Agent receives a summarisation tool for news items. The prompt asks for a concise summary and recommendation on whether to invest, with reflection-style prompting that encourages the model to reason, critique, and refine before summarising. That is not the same as a full event-driven alpha model, but it is a sensible way to turn unstructured news into a digestible investment signal.

The Fundamental Agent gets both report-pull capability and a financial-report RAG tool. The RAG tool is tailored to financial statement analysis and is used repeatedly to answer questions about cash flow, income, operations, gross margin, concerns, and progress toward stated objectives. The paper also describes using retrieval and relevance monitoring for the RAG-heavy agents through Arize Phoenix, including common faithfulness and relevance metrics.

The business interpretation is straightforward: the system is not valuable because it anthropomorphises three little analysts. It is valuable because it constrains each analyst-shaped process to a defined evidence base and toolset.

| Design element | What the paper directly implements | Cognaptus interpretation | Boundary |
| --- | --- | --- | --- |
| Role prompting | Fundamental, sentiment, and valuation personas | Useful division of research labour | Persona is not expertise unless data and tools support it |
| Tool assignment | RAG, summarisation, volatility/return calculators | Reduces unsupported reasoning and numerical hallucination | Tool correctness and retrieval quality still need monitoring |
| Debate protocol | Round-robin group chat to consensus | Makes disagreement explicit before final recommendation | Consensus can still converge on a wrong answer |
| Logged discussions | Reviewable debate history | Supports audit, override, and governance | Logging is not the same as compliance approval |
| Risk prompts | Risk-neutral and risk-averse variants | Lightweight personalisation of decision framing | Prompted risk tolerance is not a formal utility function |

This is the right level of ambition for near-term deployment. Use LLM agents to generate better research packets, not to replace the entire investment stack in one heroic sprint.

Risk tolerance works—until the prompt runs out of meaning

One of the paper’s more interesting tests is not whether the agents can produce a stock report. Many systems can produce plausible reports. The sharper question is whether agents behave differently when given different investor risk profiles.

The authors test risk-averse, risk-neutral, and risk-seeking profiles. Risk-averse and risk-neutral behaviour are evaluated in the final experiments. Risk-seeking outputs, however, are reported as nearly indistinguishable from risk-neutral outputs and are excluded from final evaluation.

That is a useful negative result. It suggests that prompt-level risk conditioning may distinguish conservative from ordinary behaviour more easily than it distinguishes ordinary from aggressive behaviour. A risk-averse instruction gives the agent an obvious behavioural move: avoid volatility, favour stability, narrow the portfolio. A risk-seeking instruction is less well-defined unless the system has a formal objective function, leverage constraint, drawdown tolerance, or expected-utility model. Without that machinery, “take more risk” often collapses into “say the same thing with more enthusiasm.” Finance has enough of that already.

In the paper’s examples, risk-averse prompts lead valuation agents to issue more cautious recommendations when volatility is high, while risk-neutral agents are more willing to emphasise momentum and upside potential. At the portfolio level, risk-averse agents select fewer stocks and exclude several names present in the risk-neutral set. The multi-agent risk-averse strategy further consolidates selections to overlapping picks where conservative valuation and fundamental signals agree.

That is a meaningful pattern, but it should not be overinterpreted. The paper shows prompt-conditioned behavioural differentiation, not validated client suitability. A wealth platform cannot simply map a questionnaire to “risk_averse_prompt.txt” and declare fiduciary victory. The prompt is a behavioural nudge, not a risk engine.
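
One way to see what a risk engine adds is the textbook mean-variance objective. It is not taken from the paper and is shown here only as an illustration. For portfolio weights $w$, expected returns $\mu$, return covariance $\Sigma$, and a risk-aversion parameter $\lambda$:

$$
U(w) = w^\top \mu - \frac{\lambda}{2}\, w^\top \Sigma\, w
$$

Here risk attitude is a number: $\lambda > 0$ is risk-averse, $\lambda = 0$ is risk-neutral, and $\lambda < 0$ is risk-seeking. With $\lambda < 0$ the objective rewards variance, so without leverage or position limits it has no finite maximum. "Risk-seeking" only becomes a well-defined instruction once those constraints are written down, which is exactly the machinery a prompt does not supply.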

The backtest is evidence, not absolution

The paper evaluates AlphaAgents downstream through stock selection and backtesting. The setup is deliberately narrow:

| Test component | Paper setup | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- | --- |
| Stock universe | 15 randomly selected technology-sector stocks | Main evidence environment | Feasibility of agent-based stock selection in a constrained universe | Generalisation across sectors, regimes, geographies, or asset classes |
| Input period | January 2024 data and news | Decision information window | Agents can use a defined historical information set | Robustness to longer histories or noisy real-time feeds |
| Portfolio start | February 1, 2024 | Backtest anchor | Clear post-decision evaluation period | Live execution realism |
| Holding/evaluation window | Four months | Downstream performance check | Short-window comparison of selected portfolios | Durable alpha or cycle-tested performance |
| Weighting | Equal weights based on recommendations | Simplifies evaluation of selection | Isolates inclusion decisions | Portfolio construction skill |
| Comparators | Benchmark, valuation-agent portfolio, fundamental-agent portfolio; sentiment agent omitted from the single-agent comparison because of insufficient news coverage | Main comparison | Multi-agent combination versus selected single-agent views | Full comparison against professional quant strategies, sector ETFs, or transaction-cost-adjusted strategies |
| Metrics | Cumulative return and rolling Sharpe ratio | Performance interpretation | Relative return/risk behaviour in window | Statistical significance or investability |

In the risk-neutral setting, the paper reports that the multi-agent portfolio outperforms the benchmark and the single-agent frameworks during the test window, both in cumulative return and rolling Sharpe ratio. The authors interpret this as evidence that combining short-horizon valuation and sentiment signals with longer-horizon fundamental analysis creates a better-balanced selection process.
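
For readers who want the two metrics pinned down, here is a minimal standard-library sketch. Several details are assumptions, since the description above does not specify them: equal weights are rebalanced daily, the risk-free rate defaults to zero, and the rolling window length is left as a parameter.

```python
import math
import statistics


def equal_weight_returns(returns_by_ticker):
    """Average per-ticker daily returns for each day (daily-rebalanced equal weights)."""
    series = list(returns_by_ticker.values())
    return [sum(day) / len(day) for day in zip(*series)]


def cumulative_return(returns):
    """Compound daily returns into a total return over the window."""
    total = 1.0
    for r in returns:
        total *= 1 + r
    return total - 1


def rolling_sharpe(returns, window, risk_free_daily=0.0, periods=252):
    """Annualised Sharpe ratio over each trailing window of daily returns."""
    out = []
    for end in range(window, len(returns) + 1):
        excess = [r - risk_free_daily for r in returns[end - window:end]]
        sd = statistics.stdev(excess)
        # A window with zero volatility has no defined Sharpe ratio.
        out.append(math.sqrt(periods) * statistics.mean(excess) / sd if sd > 0 else float("nan"))
    return out


# Made-up daily returns for two selected stocks.
selected = {"A": [0.01, -0.02, 0.03], "B": [0.03, 0.00, -0.01]}
portfolio = equal_weight_returns(selected)
```

With only a four-month window, a rolling Sharpe series has few independent observations, which is one reason to read it as a description of the window and not as a test of significance.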

That interpretation is plausible. It is also bounded. The backtest is small, sector-specific, and short. The paper does not report a large panel of repeated trials, transaction costs, turnover effects, liquidity constraints, execution assumptions, or statistical confidence intervals. The chart evidence supports a directional claim: the multi-agent process did better than the selected alternatives in this particular test window. It does not support the larger claim that multi-agent debate reliably generates alpha across market regimes.

The risk-averse results are more instructive than flattering. In the risk-averse setting, agent-selected portfolios underperform the benchmark during a strong technology-stock period. That is not necessarily a failure. If volatile tech names rally, a conservative portfolio should miss some upside. The paper notes that risk-averse portfolios reduce exposure to more volatile technology stocks, limiting downside while also constraining gains.

The multi-agent risk-averse portfolio appears relatively stronger than the valuation-only and fundamental-only approaches, especially early in the test period, with slightly lower volatility and reduced drawdowns. That is exactly the sort of thing a conservative committee process should do: not win the rally, but avoid being the worst version of cautious.

The Zscaler example shows the system’s actual job

The paper includes a sample report and debate around Zscaler, anonymised in parts as “Company Z” but clearly illustrated through the agent screenshots and discussion. The example is useful because it shows the system handling mixed evidence rather than simply ranking stocks by one metric.

The positive case includes January outperformance versus the S&P 500, recognised market leadership in SaaS Security Posture Management, governance continuity from annual meeting outcomes, revenue growth, customer expansion, and progress on corporate initiatives.

The negative case includes high volatility, lower relative trading volume, insider selling, trust-sale concerns, negative operating margin, and the fact that cash flow, net income, and operating profitability paint a less comfortable picture than price momentum alone.

The conclusion is not a single clean slogan. Higher-risk investors may see an attractive entry point, while conservative investors may prefer alternatives or wait for greater price stability. The debate example later shows agents moving toward a sell recommendation for a risk-neutral investor after weighing financial concerns and insider-selling signals against market leadership and bullish price action.

This is where the mechanism matters. The system’s job is not to make every answer “more bullish” or “more quantitative.” Its job is to force contradictory evidence into the same room. Price strength and operating losses can both be true. Insider selling and market leadership can both be true. Revenue growth and negative margins can both be true. A single-agent report may mention all of this, but a structured multi-agent debate gives each type of evidence a representative with permission to object.

That is not alpha by itself. It is a better meeting.

Where this fits in a real investment workflow

AlphaAgents is easy to misplace. It is not a complete portfolio optimiser. The paper itself states that portfolio diversification and optimisation are outside the current scope and that equal weighting is used based on agent recommendations. It also says future work could adjust stock weights according to LLM confidence, where stronger buy signals receive higher allocations.

That future direction is sensible but incomplete. In a real investment stack, confidence-weighted LLM outputs would still need to pass through conventional portfolio machinery: expected-return estimation, covariance modelling, factor exposure constraints, liquidity limits, risk budgets, tax constraints, compliance rules, and human accountability.

The near-term use case is therefore upstream of execution.

AlphaAgents-style systems are most useful as:

Coverage extenders. A team can produce first-pass research packets across more names than human analysts can review from scratch.
Signal synthesis layers. Fundamental, sentiment, and price/volume views can be surfaced separately before being consolidated.
Committee simulators. The system can pressure-test a recommendation by asking each specialist agent to challenge the others.
Audit-log generators. The debate transcript becomes evidence for why the recommendation was made and where uncertainty remains.
Inputs to existing models. Agent recommendations can feed into mean-variance optimisation, Black-Litterman-style views, scenario analysis, or discretionary portfolio review, as the paper itself suggests.

The economic case is not “replace analysts.” That claim is lazy, and lazy claims have already enjoyed a full bull market. The better case is “compress analyst preparation time while improving traceability.” If a human analyst spends less time collecting scattered documents and more time interrogating the system’s assumptions, the workflow has value even before anyone claims standalone alpha.

What remains uncertain before this becomes production finance

The paper is refreshingly clear that AlphaAgents is not yet a full portfolio optimisation engine. The remaining uncertainty is not cosmetic; it changes how the system should be used.

First, the experimental universe is narrow. Fifteen technology stocks over four months can show feasibility, not broad market validity. Technology-sector behaviour in early 2024 is not a universal stress test. A framework that performs well in a concentrated growth-stock setting may behave differently across banks, utilities, cyclicals, small caps, emerging markets, or crisis periods.

Second, the backtest evaluates selection more than construction. Equal weighting avoids the harder problem of sizing positions. But sizing is where much of portfolio management actually happens. A system that can say “include this stock” still has not answered “how much, relative to what risk, under which constraints, and with what expected tracking error?”

Third, risk tolerance is prompt-based. The risk-averse setting creates conservative behaviour, but the risk-seeking setting fails to separate cleanly from risk-neutral behaviour. For consumer wealth, pension funds, insurance portfolios, or regulated advisory workflows, risk tolerance needs more than a persona label. It needs explicit objectives, constraints, suitability checks, loss capacity, and scenario behaviour.

Fourth, evaluation of reasoning remains difficult. The paper uses RAG faithfulness and relevance checks, monitors tool use, applies human review for debate coherence, and logs discussion history. Those are necessary controls. They are not proof that the final investment judgement is correct. A debate can be coherent and still wrong. Anyone who has attended an investment committee already knows this; the machine version is not exempt.

Fifth, sentiment coverage is incomplete. The sentiment agent is excluded from the single-agent portfolio comparison because news coverage is insufficient in some cases. This matters because news availability is not random. Larger, more covered companies may receive richer sentiment analysis, while less covered companies may be underrepresented or analysed with weaker context.

Finally, there is no transaction-cost or execution layer. In a short-window backtest, turnover, bid-ask spread, market impact, and implementation timing can matter. The paper’s current setup is appropriate for a prototype study, but not for investable performance claims.

The operator’s takeaway: build the meeting before automating the trade

AlphaAgents is valuable because it chooses the right intermediate target. It does not pretend that an LLM can become a complete portfolio manager by being given a brokerage API and a motivational system prompt. Instead, it shows how LLM agents can be organised into a structured research process: specialist roles, constrained tools, debate, consensus, logs, and human override.

For businesses building agentic finance tools, that is the lesson worth keeping.

Do not start with “autonomous portfolio construction.” Start with the analyst packet. Start with the disagreement map. Start with the audit trail. Start with risk-profile-sensitive recommendation framing. Then connect the output to existing portfolio models and governance processes.

The paper’s backtest gives a useful proof of concept: in a small technology-stock experiment, the risk-neutral multi-agent framework performs better than the selected single-agent alternatives and benchmark during the evaluation period, while risk-averse variants behave more conservatively and sacrifice upside. Good. Interesting. Not enough to retire the investment committee.

But perhaps enough to make the committee less chaotic.

And in finance, “less chaotic” is already a premium product.

Cognaptus: Automate the Present, Incubate the Future.

¹ Tianjiao Zhao, Jingrao Lyu, Stokes Jones, Harrison Garber, Stefano Pasquali, and Dhagash Mehta, “AlphaAgents: Large Language Model based Multi-Agents for Equity Portfolio Constructions,” arXiv:2508.11152, 2025. https://arxiv.org/abs/2508.11152
