Opening — The Era of AI Interns Is Over

Most LLM trading systems look impressive in architecture diagrams and suspiciously simple in prompts.

“Be a fundamental analyst.” “Analyze the 10-K.” “Construct a portfolio.”

In other words: Good luck.

The paper “Toward Expert Investment Teams: A Multi-Agent LLM System with Fine-Grained Trading Tasks” (arXiv:2602.23330) asks a deceptively sharp question:

What if the problem isn’t the model — but the job description?

Instead of giving LLM agents vague role labels, the authors decompose real institutional workflows into granular, procedural tasks. The result? Statistically significant improvements in Sharpe ratios, improved semantic signal propagation, and measurable portfolio-level diversification benefits.

For anyone building AI-driven trading systems — or any enterprise multi-agent architecture — this is not a cosmetic tweak. It is a structural redesign.


Background — From Role-Play to Real Workflows

Most multi-agent financial systems follow a familiar template:

| Layer | Typical Design | Problem |
|---|---|---|
| Analyst Agents | “Fundamental”, “Technical”, “News” | High-level instructions, no procedural guidance |
| Manager Agent | Aggregates scores | Limited traceability of reasoning |
| Output | Portfolio weights | Black-box decision chain |

This approach mirrors org charts — not operating manuals.

The paper identifies two structural weaknesses in coarse-grained prompting:

  1. Performance degradation — vague instructions dilute reasoning precision.
  2. Interpretability gaps — intermediate reasoning becomes unobservable or unstable.

The authors hypothesize something elegant and practical:

If human analysts follow standard operating procedures, AI agents should too.

So instead of “analyze fundamentals,” they encode domain-standard evaluation frameworks:

  • Technical indicators: RoC, MACD, RSI, Stochastic
  • Financial ratios: ROE, ROA, FCF margin, D/E
  • Sector benchmarking logic
  • Macro regime scoring
  • Explicit score ranges with calibration anchors (0–100)
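To make the "protocol" concrete, here is a minimal, self-contained sketch of how standard indicators might be computed and mapped onto a calibrated 0–100 score. The indicator formulas (RoC, Wilder-style RSI) are standard; the scoring anchors and weights are purely illustrative, not the paper's actual rubric.

```python
# Hypothetical sketch: compute RoC and RSI from a price series and map them
# onto a calibrated 0-100 score range. Anchors/weights are illustrative only.

def rate_of_change(prices, n=12):
    """RoC over n periods, in percent."""
    return (prices[-1] - prices[-1 - n]) / prices[-1 - n] * 100

def rsi(prices, n=14):
    """Classic RSI over the last n price changes (simple-average form)."""
    deltas = [b - a for a, b in zip(prices[-n - 1:-1], prices[-n:])]
    gains = sum(d for d in deltas if d > 0) / n
    losses = -sum(d for d in deltas if d < 0) / n
    if losses == 0:
        return 100.0
    rs = gains / losses
    return 100 - 100 / (1 + rs)

def technical_score(prices):
    """Illustrative 0-100 score: average of RSI and a clipped RoC mapping."""
    roc_component = max(0.0, min(100.0, 50 + 5 * rate_of_change(prices)))
    return round(0.5 * rsi(prices) + 0.5 * roc_component, 1)

prices = [100 + 0.5 * i for i in range(30)]  # synthetic uptrend
score = technical_score(prices)              # high score on a steady uptrend
```

The point is not the specific formula: it is that the agent receives pre-computed, deterministic inputs with an explicit scale, rather than raw prices and a vibe.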

In short: from persona to protocol.


Implementation — Building a Hierarchical Investment Team

The system is structured in three levels.

Level 1 — Specialist Analysts

| Agent | Task Type | Output |
|---|---|---|
| Technical | Indicator-based scoring | 0–100 score + rationale |
| Quantitative | Financial metric evaluation | 0–100 score + rationale |
| Qualitative | Governance & strategic analysis | 1–5 sub-scores |
| News | Event & sentiment assessment | 1–5 outlook scores |
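A structured output contract is what makes these scores comparable across agents. A minimal sketch of such a contract, with hypothetical field names and our own validation logic (the paper does not publish its schema):

```python
# Hypothetical output schema for a specialist analyst agent.
# Field names and the validation rule are our assumptions, not the paper's.
from dataclasses import dataclass

@dataclass
class AnalystOutput:
    agent: str        # "technical", "quantitative", "qualitative", "news"
    score: float      # 0-100 for technical/quant, 1-5 for qual/news
    rationale: str    # free-text justification, kept for the audit trail

    def validate(self):
        lo, hi = (0, 100) if self.agent in ("technical", "quantitative") else (1, 5)
        if not lo <= self.score <= hi:
            raise ValueError(f"{self.agent} score {self.score} outside [{lo}, {hi}]")
        return self

out = AnalystOutput("technical", 72.0, "RSI and MACD both bullish").validate()
```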

Crucially, the experiment compares:

  • Fine-grained setting: Pre-computed metrics with structured evaluation rules.
  • Coarse-grained setting: Raw data with high-level instructions.

Same model (GPT-4o). Same market (TOPIX 100). Same period (Sep 2023 – Nov 2025). Leakage-controlled backtest.

Only the task decomposition changes.

That isolation is important.


Level 2 — Sector and Macro Filters

The Sector Agent:

  • Re-evaluates stock scores against sector averages.
  • Adjusts conviction based on relative metrics.

The Macro Agent:

  • Scores market regime across 5 dimensions:

    • Market Direction
    • Risk
    • Growth
    • Rates
    • Inflation
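
How the five dimension scores combine into a single regime view is not spelled out above, so the sketch below uses an equal-weight average as a placeholder; the dimension names match the list, everything else is an assumption.

```python
# Hypothetical aggregation of the Macro Agent's five regime dimensions.
# Equal weighting is a placeholder, not the paper's combination rule.
REGIME_DIMS = ["market_direction", "risk", "growth", "rates", "inflation"]

def macro_regime_score(dim_scores):
    """Equal-weight mean of the five dimension scores (each assumed 1-5)."""
    missing = set(REGIME_DIMS) - set(dim_scores)
    if missing:
        raise KeyError(f"missing regime dimensions: {sorted(missing)}")
    return sum(dim_scores[d] for d in REGIME_DIMS) / len(REGIME_DIMS)

score = macro_regime_score({
    "market_direction": 4, "risk": 2, "growth": 3, "rates": 2, "inflation": 3,
})  # -> 2.8
```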

This creates structured upward information flow rather than free-form aggregation.


Level 3 — Portfolio Manager

The PM Agent integrates:

$$ \text{Final Score}_i = f(\text{Sector View}_i, \text{Macro Regime}) $$

Stocks are ranked cross-sectionally and used to construct a market-neutral long-short portfolio.

Rebalanced monthly.

Institutionally realistic.

No prompt theater.
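
The cross-sectional ranking step can be sketched in a few lines: long the top-k stocks by final score, short the bottom-k, equal weights, zero net exposure. Portfolio size and weighting here are illustrative, not the paper's exact scheme.

```python
# Sketch of market-neutral long-short construction from cross-sectional ranks.
# Equal weights and the choice of k are illustrative assumptions.

def long_short_weights(scores, k):
    """scores: {ticker: final score}. Returns {ticker: weight} with net ~0."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    longs, shorts = ranked[:k], ranked[-k:]
    w = 1.0 / k
    weights = {t: w for t in longs}
    weights.update({t: -w for t in shorts})
    return weights

scores = {"A": 82, "B": 71, "C": 55, "D": 40, "E": 28, "F": 15}
w = long_short_weights(scores, 2)
assert abs(sum(w.values())) < 1e-9  # market-neutral: net exposure is zero
```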


Findings — When Granularity Pays

1. Sharpe Ratio Improvement

Across portfolio sizes (10–50 stocks), fine-grained agents significantly outperformed coarse-grained ones in 4 out of 5 configurations.

| Portfolio Size | Δ Sharpe (Fine − Coarse) |
|---|---|
| 10 | Not significant |
| 20 | +0.19 **** |
| 30 | +0.08 * |
| 40 | +0.17 **** |
| 50 | +0.26 **** |

(Mann–Whitney significance as reported in the paper; more asterisks denote stronger significance levels.)
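For readers unfamiliar with the test: the Mann–Whitney U statistic counts, across all pairs, how often a fine-grained Sharpe exceeds a coarse-grained one. A pure-Python sketch of just the statistic (a real analysis would use something like `scipy.stats.mannwhitneyu`, which also returns a p-value); the sample numbers are hypothetical:

```python
# Mann-Whitney U statistic: count of (xi > yj) pairs, counting ties as 0.5.
# Illustrative only; no p-value or tie correction here.

def mann_whitney_u(x, y):
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1
            elif xi == yj:
                u += 0.5
    return u

fine = [0.9, 1.1, 1.3, 1.2]    # hypothetical Sharpe samples across runs
coarse = [0.7, 0.8, 1.0, 0.9]
u = mann_whitney_u(fine, coarse)  # near the max of 16 -> fine dominates
```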

Noise dominates very small portfolios. Signal dominates broader baskets.

That alone is interesting.

But the deeper insight is more structural.


2. Ablation — Technical Signals Drive Edge

Removing agents reveals signal concentration:

| Setting (Fine-Grained) | Effect of Removing Agent |
|---|---|
| Remove Technical | Sharpe drops sharply |
| Remove Quant | Sharpe improves slightly |
| Remove Qual | Mixed |
| Remove News | Mild improvement |
| Remove Macro | Mild improvement |

The Technical Agent is the primary alpha driver.

More importantly:

Fine-grained structuring increases semantic similarity between Technical Agent output and Sector Agent reasoning.

In plain English:

The system actually listens to its technical analyst when the instructions are structured.

Without structure, signals leak.


3. Semantic Propagation — Measurable Information Flow

Using embedding-based cosine similarity, the authors quantify upward information transmission.

| Agent → Sector Similarity | Fine | Coarse | Diff |
|---|---|---|---|
| Technical | 0.419 | 0.397 | +0.022 |
| Quantitative | ~0.476 | ~0.477 | ≈0 |
| Qualitative | ~0.514 | ~0.514 | ≈0 |
Only technical signals gain transmission strength under fine-grained design.

This aligns perfectly with performance results.

Architecture matters.


4. Portfolio Optimization — Real Deployment Test

They then combine:

  • TOPIX 100 index
  • Risk-parity composite of 6 LLM agent strategies

Observed correlation ≈ 0.4.

Blended portfolio (50/50):

| Portfolio | Sharpe (Gross) | Sharpe (Net 10bps) |
|---|---|---|
| Index Only | 1.68 | 1.68 |
| Agents Only | 1.22 | 0.95 |
| 50–50 Blend | 2.11 | 1.91 |

Even naïve blending improves risk-adjusted performance.
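
The arithmetic behind that lift is worth seeing once. With a correlation well below 1, a blend's volatility shrinks faster than its return, so the combined Sharpe can exceed both components. A sketch with hypothetical inputs (the paper's exact vols and returns are not reproduced here):

```python
# Illustrative diversification arithmetic: Sharpe of a w/(1-w) blend of two
# strategies, given their Sharpes, vols, and correlation. Inputs hypothetical.
import math

def blend_sharpe(s1, s2, vol1, vol2, rho, w=0.5):
    """Sharpe of w*strat1 + (1-w)*strat2 (mean return = Sharpe * vol)."""
    mu = w * s1 * vol1 + (1 - w) * s2 * vol2
    var = (w * vol1) ** 2 + ((1 - w) * vol2) ** 2 \
        + 2 * w * (1 - w) * rho * vol1 * vol2
    return mu / math.sqrt(var)

# Two strategies with Sharpe 1.7 and 1.2, equal 10% vol, correlation 0.4:
s = blend_sharpe(1.7, 1.2, 0.10, 0.10, 0.4)
assert s > max(1.7, 1.2)  # the 50/50 blend beats both components
```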

This is how institutional validation works: staged integration, not YOLO deployment.


Why This Matters Beyond Trading

The core takeaway isn’t “LLMs can trade.”

It’s this:

Performance improvement comes from structured task engineering — not just model capability.

This has broader implications for any AI multi-agent system:

1. Task Decomposition > Role Labels

Don’t build:

  • “Compliance Agent”
  • “Risk Agent”

Build:

  • “Liquidity Stress Evaluator”
  • “Regulatory Threshold Validator”
  • “Counterparty Concentration Scorer”

Specificity scales. Vagueness drifts.


2. Interpretability as a Design Variable

The paper shows that:

  • Text output vocabulary changes under structured prompting.
  • Semantic similarity can measure information adoption.

This means agent audit trails can be quantified, not just inspected.

For regulated finance, that is not optional.


3. Feature Engineering Is Not Dead

Ironically, LLM-based trading performance depends heavily on traditional feature design.

Structured technical indicators. Structured financial ratios. Structured macro dimensions.

LLMs amplify signal — they do not invent structure.

Anyone promising fully autonomous discovery is selling narrative, not process.


Limitations — And the Right Skepticism

The study acknowledges:

  • Limited time horizon (post-knowledge-cutoff window).
  • Japanese market only.
  • Possible linguistic bias effects.

These are valid constraints.

But the structural insight — task granularity improves signal transmission — is model-agnostic and domain-general.

That is the real contribution.


Conclusion — From AI Interns to AI Analysts

We are moving from:

“Act like an analyst.”

To:

“Follow this institutional procedure.”

Multi-agent LLM systems do not become powerful because we give them titles.

They become powerful because we give them process.

In finance — where noise masquerades as intelligence daily — that distinction is expensive.

Very expensive.

The future of agentic systems will not be defined by larger models.

It will be defined by better workflows.

And that is a much more human problem.


Cognaptus: Automate the Present, Incubate the Future.