Wall Street’s New Intern: How LLMs Are Redefining Financial Intelligence

TL;DR for operators

The paper is best read as a menu, not a victory lap. It surveys how recent research has plugged large language models into financial investment workflows across four design patterns: LLM-based pipelines, hybrid LLM-quant systems, fine-tuned financial models, and agent-based architectures.¹ That taxonomy is more useful than another breathless “AI beats Wall Street” headline, the genre where rigor usually goes to die in a nice suit.

The practical question is not “Which LLM should we buy?” It is “What job should the model perform inside the investment process?” A model that retrieves 10-K context has a different role from one that converts news into sentiment factors. A model that routes work among specialised agents is different again from one fine-tuned on financial newsflow. Treating these as one category called “AI investing” is how procurement committees accidentally fund science fiction.

For an investment operator, the useful reading is:

Architecture pattern	Best operational use	What it improves	Main boundary
LLM pipelines	First-pass research, report parsing, market narrative synthesis	Analyst coverage and document throughput	Output quality depends heavily on prompt design, retrieval quality, and source discipline
Hybrid LLM-quant systems	Turning language into features for forecasting, optimisation, or risk models	Integration with existing quant infrastructure	Backtests may not survive costs, slippage, leakage, or regime shifts
Fine-tuning and adaptation	Repeated finance-specific tasks such as sentiment classification, filings analysis, and financial QA	Domain consistency and lower per-task friction	Requires curated data and still may fail on novel market conditions
Agent-based systems	Multi-step investment workflows with specialist roles	Coordination, traceability, and division of labour	Agent “debate” is not the same as institutional risk control

The paper does not prove that LLMs can reliably produce tradable alpha. It shows that the field is converging on a more sensible design principle: put LLMs where finance is language-heavy, judgment-heavy, or workflow-heavy, then connect them carefully to numerical models, risk systems, and human review.

The intern is useful because it reads before lunch

A financial analyst’s day contains an absurd amount of text pretending to be data. Earnings calls, annual reports, filings, broker notes, macro commentary, news, social sentiment, regulatory language, and management excuses with unusually polished grammar all arrive before the spreadsheet has even warmed up.

Traditional financial models are very good at structured inputs: prices, volumes, ratios, volatility surfaces, accounting fields, and factor exposures. They are less naturally suited to the messy narrative layer that often explains why those numbers are moving, or at least why people think they are moving. That is the gap LLMs enter.

The surveyed paper starts from that gap. It frames LLMs as tools for combining structured and unstructured financial information across stock selection, risk assessment, sentiment analysis, trading, portfolio optimisation, and forecasting. The paper then organises recent research into four families: pipelines, hybrid methods, fine-tuning approaches, and agents.

That structure matters. It prevents the usual mistake of treating “LLM for finance” as one technology. It is not. It is a set of integration choices.

A language model can be:

a reader of filings;
a summariser of market news;
a generator of structured sentiment scores;
a router inside a mixture-of-experts model;
a front-end to an optimiser;
a fine-tuned classifier;
a simulated trader;
or a coordinator of other agents.

Those are not interchangeable jobs. A screwdriver and a risk committee are both “tools,” technically. One should still avoid confusing them.

The comparison that matters is architectural, not model-branded

Much of the public discussion around financial LLMs collapses into brand comparison: GPT-4 versus Llama, Gemini versus Claude, open models versus closed models. That is a shallow axis. The deeper question is where the language model sits in the investment workflow.

The paper’s four categories form a practical comparison framework:

Question an investment team is asking	Better-matched architecture	Why
“Can we process more filings, news, and reports without adding analysts?”	LLM pipeline	The model turns fragmented text sources into structured intermediate analysis
“Can text improve an existing forecasting or portfolio model?”	Hybrid LLM-quant system	The LLM creates features, signals, or rationales while classical models handle optimisation or prediction
“Can we make the model consistently understand our financial task?”	Fine-tuning or adaptation	Domain training improves repeatable task behaviour and reduces prompt fragility
“Can we automate a multi-step research process with checks and roles?”	Agent-based system	Multiple agents separate fundamentals, sentiment, technicals, risk, and decision approval
“Can we personalise financial guidance around user risk preferences?”	Human-agent or multi-agent system	Agents can combine market analysis with profile-sensitive decision rules, though governance becomes harder

This is the heart of the article: the paper is most valuable not as a literature list but as a buyer’s guide for system design.

LLM pipelines are the research-assistant layer

The first family in the survey covers LLM-based frameworks and pipelines. These systems use LLMs as structured research machinery: collect data, summarise it, extract relevant concepts, reason over them, and produce signals or explanations.

MarketSenseAI is one example. It uses GPT-4 with chain-of-thought and in-context learning to synthesise market trends, news, fundamentals, price dynamics, and macroeconomic conditions. The framework breaks analysis into modules: news summarisation, fundamentals analysis, price dynamics, macro context, and final decision generation. The survey reports that its empirical evaluation on S&P 100 stocks over 15 months showed excess alpha between 10% and 30%, though how well that figure holds up operationally depends on the original study’s assumptions.

GPT-InvestAR works differently. It focuses on corporate annual reports. It retrieves relevant sections, embeds documents, stores them in ChromaDB, asks GPT-3.5-Turbo targeted questions about company health, and then uses the responses as features for a downstream regression model. In business terms, the LLM is not “the investor.” It is a filings-to-features machine.
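The "filings-to-features" idea can be sketched in a few lines. This is a minimal illustration, not GPT-InvestAR's actual code: the question keys, the 0-100 score scale, the collapse into one composite score, and every number below are assumptions made up for the example, and the LLM call itself is left out entirely.

```python
from statistics import mean

# Hypothetical question keys; the original study's prompts are not reproduced here.
# Scores are assumed to be oriented so that higher means healthier.
QUESTIONS = ["revenue_outlook", "balance_sheet_strength", "legal_position"]


def answers_to_features(llm_answers: dict[str, float]) -> list[float]:
    """Map per-question LLM scores (assumed 0-100) to a fixed-order vector in [0, 1]."""
    return [llm_answers[q] / 100.0 for q in QUESTIONS]


def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """One-feature ordinary least squares; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    var_x = sum((x - x_bar) ** 2 for x in xs)
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, y_bar - slope * x_bar


# Made-up (LLM answers, subsequent return) pairs, purely for illustration.
history = [
    ({"revenue_outlook": 30, "balance_sheet_strength": 20, "legal_position": 10}, -0.02),
    ({"revenue_outlook": 50, "balance_sheet_strength": 40, "legal_position": 30}, 0.02),
    ({"revenue_outlook": 70, "balance_sheet_strength": 60, "legal_position": 50}, 0.06),
    ({"revenue_outlook": 90, "balance_sheet_strength": 80, "legal_position": 70}, 0.10),
]

# Simplification: average the feature vector into one composite score per report.
composite = [mean(answers_to_features(answers)) for answers, _ in history]
returns = [ret for _, ret in history]
slope, intercept = fit_line(composite, returns)
```

The point of the sketch is the division of labour: the language model only produces numbers, and a plain regression that can be inspected, tested, and thrown away decides what those numbers are worth.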

Ploutos adds interpretability to stock movement prediction by combining textual and numerical inputs, then generating rationales for expected movement. LLMoE uses an LLM as a router inside a mixture-of-experts architecture, selecting expert models based on multimodal inputs such as numerical stock data and textual news. A Korean market pipeline reviewed in the paper converts qualitative investor-facing reports into numerical factors for prediction over the KOSPI200.
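The router idea is easier to see in code. The sketch below is an assumption-heavy toy, not LLMoE's implementation: the two experts, the keyword rule standing in for the LLM call, and all names are hypothetical.

```python
from typing import Callable, Sequence

Expert = Callable[[Sequence[float]], float]


def momentum_expert(prices: Sequence[float]) -> float:
    """Hypothetical expert: assume the last price move repeats."""
    return prices[-1] + (prices[-1] - prices[-2])


def reversion_expert(prices: Sequence[float]) -> float:
    """Hypothetical expert: assume the price drifts back to its recent average."""
    return sum(prices) / len(prices)


EXPERTS: dict[str, Expert] = {
    "momentum": momentum_expert,
    "reversion": reversion_expert,
}


def stub_router(news: str, prices: Sequence[float]) -> str:
    """Stand-in for the LLM router. A real system would prompt a model with the
    news and price context and parse an expert name from its reply."""
    shock_words = ("guidance", "lawsuit", "acquisition")
    return "momentum" if any(w in news.lower() for w in shock_words) else "reversion"


def predict(news: str, prices: Sequence[float], router=stub_router) -> tuple[str, float]:
    """Route to exactly one expert and return (expert name, forecast)."""
    name = router(news, prices)
    if name not in EXPERTS:
        # Never trust a free-text routing decision without validating it.
        raise ValueError(f"router returned unknown expert: {name!r}")
    return name, EXPERTS[name](prices)
```

Note where the numbers come from: the router only chooses, and the forecast is produced by an ordinary numerical function. Validating the router's output against a fixed set of expert names is the cheap guard that keeps a creative reply from becoming a creative trade.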

The common mechanism is not magic reasoning. It is decomposition. These systems split financial research into parts that LLMs can handle: summarisation, retrieval, textual interpretation, signal generation, and explanation.

That makes pipelines attractive for research teams with coverage problems. A small team can widen monitoring across more companies, sectors, and news streams. The likely ROI is not “instant alpha.” It is more first-pass analysis per analyst-hour, faster triage, and better organisation of narrative information.

The risk is equally clear. A bad pipeline can produce beautiful nonsense at industrial speed. If the retrieval layer pulls irrelevant documents, if prompts encourage overconfident recommendations, or if evaluation ignores transaction costs and leakage, the system will still look productive. So does a photocopier during a compliance breach.

Hybrid systems keep the spreadsheet in the room

The second category, hybrid integration, is often the most credible for near-term institutional use because it does not ask the LLM to replace the entire investment process. Instead, the LLM becomes one component inside a more familiar quantitative structure.

The paper reviews a ChatGPT-based portfolio selection method where the model selects a stock universe from the S&P 500, then traditional portfolio construction methods build equally weighted, GPT-weighted, minimum-variance, maximum-return, and maximum-Sharpe portfolios. The interesting point is not that ChatGPT becomes a portfolio manager. It is that the model contributes a candidate universe, while optimisation still handles allocation.
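A rough sketch of that split, under loud assumptions: the tickers and return histories are invented, the LLM's universe pick is hard-coded, and inverse-variance weighting is swapped in as a simplified stand-in for minimum-variance optimisation (the two coincide only if asset returns are uncorrelated). It is not the reviewed study's method.

```python
from statistics import variance


def equal_weights(tickers: list[str]) -> dict[str, float]:
    """Equally weighted portfolio over the selected universe."""
    return {t: 1.0 / len(tickers) for t in tickers}


def inverse_variance_weights(history: dict[str, list[float]]) -> dict[str, float]:
    """Weights proportional to 1/variance. This equals the minimum-variance
    portfolio only under the assumption of zero correlation between assets."""
    inverse = {t: 1.0 / variance(rets) for t, rets in history.items()}
    total = sum(inverse.values())
    return {t: v / total for t, v in inverse.items()}


# Step 1 (language model): a hypothetical universe picked by the LLM.
llm_universe = ["AAA", "BBB"]

# Step 2 (classical): made-up return histories for a wider candidate set.
return_history = {
    "AAA": [0.01, -0.01, 0.01, -0.01],
    "BBB": [0.02, -0.02, 0.02, -0.02],
    "CCC": [0.05, -0.04, 0.03, -0.06],
}
selected = {t: return_history[t] for t in llm_universe}

ew = equal_weights(llm_universe)
ivw = inverse_variance_weights(selected)
```

The model's influence ends at `llm_universe`. Everything after that line is arithmetic that a risk team can audit without reading a single prompt.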

Other hybrid studies follow the same pattern. One approach combines news sentiment, moving averages, options volume, and VIX signals to predict market trends. Another uses an LLM branch to generate financial summaries, FinBERT to vectorise them, a Linear Transformer for price history, and a CNN for candlestick chart patterns. MuSA combines FinBERT sentiment, directional-change analysis, and deep reinforcement learning for adaptive portfolio weights. SEP uses summarisation, explanation, and prediction modules with reinforcement learning to improve explainable stock prediction from noisy social text.

Hybridisation is finance’s natural immune response to AI exuberance. It says: let the LLM read, classify, and explain; let numerical models optimise, constrain, and test.

That division is healthy. Markets are not essays. A financial system must care about drawdowns, exposure limits, liquidity, execution cost, turnover, tax effects, benchmark drift, and the difference between 55% directional accuracy and a profitable strategy. A language model can help create or interpret signals, but the portfolio still needs mathematics, risk budgets, and execution discipline.
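The gap between directional accuracy and profit is simple arithmetic. The sketch below assumes a stylised trade with a fixed average win, a fixed average loss, and a flat round-trip cost, all in basis points; the specific inputs are invented for illustration and do not come from any reviewed study.

```python
def expected_net_bps(hit_rate: float, avg_win_bps: float,
                     avg_loss_bps: float, cost_bps: float) -> float:
    """Expected profit per trade after costs, in basis points."""
    return hit_rate * avg_win_bps - (1.0 - hit_rate) * avg_loss_bps - cost_bps


def breakeven_hit_rate(avg_win_bps: float, avg_loss_bps: float,
                       cost_bps: float) -> float:
    """Hit rate at which the expected net profit is exactly zero.
    Solves p * win - (1 - p) * loss - cost = 0 for p."""
    return (avg_loss_bps + cost_bps) / (avg_win_bps + avg_loss_bps)


# Illustrative inputs: 55% accuracy, symmetric 100 bp wins and losses, 15 bp cost.
edge = expected_net_bps(0.55, 100.0, 100.0, 15.0)
needed = breakeven_hit_rate(100.0, 100.0, 15.0)
print(f"expected edge: {edge:.2f} bp, break-even hit rate: {needed:.3f}")
```

With these made-up inputs the 55% model loses about 5 basis points per trade, and it would need a hit rate of roughly 57.5% just to break even. The model is right more often than wrong and still pays for the privilege.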

The commercial implication is straightforward: hybrid systems are easier to justify inside existing investment organisations. They can be inserted into current research, factor, portfolio, and risk pipelines without demanding a religious conversion to “autonomous AI trading.” Sensible people do not automate capital allocation because a chatbot sounds confident. They automate bounded subroutines, then make them earn trust one validation report at a time.

Fine-tuning pays when the same financial judgment repeats

Fine-tuning and adaptation approaches address a different problem: general-purpose LLMs often understand finance just well enough to be dangerous. They can describe EBITDA, sentiment, and portfolio diversification fluently, but fluency is not calibration. Finance-specific work needs consistency across repeated tasks.

The paper reviews several adaptation strategies: LoRA, QLoRA, PEFT, instruction tuning, finance-specific sentiment models, time-series-specialised architectures, and retrieval-augmented fine-tuned systems.

The key business distinction is between occasional reasoning and repeated production tasks. If a team asks a model one-off questions about a new market event, fine-tuning may be unnecessary. Retrieval and prompting may be enough. But if the team repeatedly classifies financial headlines, extracts risk factors from filings, generates stock trend labels, or answers domain-specific financial QA, adaptation becomes more attractive.

Examples in the survey show the range:

Adaptation example	What is being adapted	Operational interpretation
Fine-tuned Llama2 and GPT-3.5 for investment research	Model behaviour on reports, news, and time-series-related prompts	Better task alignment for investment queries, though complex novel questions remain difficult
Fine-tuned LLMs on financial newsflow	Text representations for return forecasting	Better mapping from news language to predictive embeddings
Finance-specific LLaMA-2 sentiment models	Sentiment classification over financial documents	More domain-aware sentiment signals for return prediction
Distilled RoBERTa headline emotion model	Emotional tone and intensity extraction	Lightweight text features for next-day direction models
QLoRA-enhanced earnings report prediction	Efficient adaptation to earnings-report inputs	Lower compute burden for task-specific prediction
Stock-Chain / StockGPT	RAG plus LoRA plus financial CoT examples	Better financial QA and stock trend analysis with domain retrieval
FinLlama	Fine-tuned financial sentiment model	Sentiment-driven trading pipeline
StockTime	Time-series representation integrated with LLM latent space	Lower-overhead use of LLM architecture for stock prediction
SAPPO	LLM-derived sentiment added to PPO portfolio optimisation	Sentiment-aware reinforcement learning for allocation

A few reported numbers are attention-grabbing. The survey notes Stock-Chain reporting an annualised return of 30.8% and 55.63% accuracy in stock trend forecasting. It also reports SAPPO achieving a 30.2% annualised return and a Sharpe ratio of 1.90 on a three-stock portfolio over the evaluated period. FinLlama is described as producing stronger cumulative returns, Sharpe ratio, and volatility outcomes than FinBERT and lexicon baselines in a long-short portfolio setting.

These numbers should be read as signals of research interest, not direct procurement evidence. They come from different studies, datasets, universes, time periods, and backtesting designs. The paper surveys them; it does not normalise them into one benchmark. That distinction is small only to people who have never seen a backtest glow in the dark.

The useful conclusion is narrower and stronger: fine-tuning is most valuable when the firm has a stable task, quality data, and a measurable evaluation loop. Without those, fine-tuning becomes expensive prompt theatre.
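A "measurable evaluation loop" can start very small. The sketch below compares two systems on held-out labels by accuracy and by run-to-run agreement. The labels and the predictions are made-up toy values standing in for real model outputs; they are not results from the survey or from any model.

```python
def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Share of held-out examples predicted correctly."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def agreement_rate(runs: list[list[str]]) -> float:
    """Share of examples on which every repeated run gives the same label."""
    per_example = list(zip(*runs))
    return sum(len(set(preds)) == 1 for preds in per_example) / len(per_example)


def summarise(runs: list[list[str]], labels: list[str]) -> dict[str, float]:
    """Mean accuracy across runs plus agreement between runs."""
    mean_acc = sum(accuracy(run, labels) for run in runs) / len(runs)
    return {"mean_accuracy": mean_acc, "agreement": agreement_rate(runs)}


# Hypothetical held-out labels for four headlines.
labels = ["positive", "negative", "neutral", "negative"]

# Toy outputs from three repeated runs of each system (invented for illustration).
prompted_runs = [
    ["positive", "negative", "neutral", "neutral"],
    ["positive", "negative", "positive", "negative"],
    ["positive", "negative", "neutral", "negative"],
]
adapted_runs = [["positive", "negative", "neutral", "negative"]] * 3

prompted = summarise(prompted_runs, labels)
adapted = summarise(adapted_runs, labels)
```

Agreement is tracked separately from accuracy on purpose. A production classifier that gives different answers to the same headline on different days is a governance problem even when its average accuracy looks respectable.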

Agent systems turn LLMs into a workflow, not a trader with a name badge

The agent-based category is the most fashionable and the easiest to misunderstand. An agent system is not automatically smarter because it has more little personas arguing in a Slack channel made of tokens. The value comes from role separation, tool use, memory, critique, escalation, and structured decision protocols.

The survey reviews several agent designs. One framework compares single-agent, dual-agent, and multi-agent configurations for financial report analysis, using tools such as stock data retrieval, forum post retrieval, and RAG over large documents. It reports that single-agent systems can be effective for fundamental analysis, while multi-agent structures perform better for risk assessment. The ensemble approach selects agent structures by task and achieves 66.7% investment decision accuracy with an average deviation of 2.35% in stock price prediction.

Alpha-GPT 2.0 introduces a human-in-the-loop framework for quantitative investment, with specialised agents for alpha mining, alpha modelling, and alpha analysis. TradingAgents models a simulated trading firm with roles such as Fundamentals Analyst, Sentiment Analyst, News Analyst, Technical Analyst, Researcher, Trader, Risk Manager, and Fund Manager. FINCON, MarketSenseAI 2.0, StockAgent, FinArena, TwinMarket, and Agent Trading Arena expand this pattern into decision systems, market simulations, behavioural modelling, or human-agent collaboration.

The agent architecture is appealing because finance itself is already agentic. A real investment process separates research, portfolio management, risk, compliance, execution, and client communication. Turning that organisational structure into software is more plausible than asking one monolithic model to be analyst, trader, economist, lawyer, therapist, and scapegoat.

But the danger is anthropomorphic governance. Naming an agent “Risk Manager” does not make it a risk manager. A true risk function has authority, independence, limits, auditability, escalation rights, and consequences. An LLM role description has a prompt. Those are not the same object.

The business case for agents is strongest where workflows are multi-step and require traceable intermediate reasoning:

collecting company-specific information;
separating fundamentals from sentiment;
comparing bullish and bearish interpretations;
checking recommendations against risk tolerance;
generating analyst memos;
routing uncertain cases to humans;
personalising outputs for different investor profiles.
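Two items on that list, checking recommendations against limits and routing uncertain cases to humans, are worth sketching because they should not live inside a prompt. The example below is a minimal illustration with hypothetical thresholds, field names, and an in-memory audit log; it does not reproduce any framework from the survey.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    """Output of upstream analyst agents (all fields are hypothetical)."""
    ticker: str
    target_weight: float   # proposed portfolio weight, e.g. 0.03 for 3%
    confidence: float      # self-reported by the agent, 0 to 1
    rationale: str


# Hard limits live in code and configuration, not in a role description.
MAX_POSITION_WEIGHT = 0.05
MIN_CONFIDENCE = 0.70

audit_log: list[tuple[str, str]] = []


def review(proposal: Proposal) -> str:
    """Deterministic gate applied after the agents have finished talking."""
    if proposal.target_weight > MAX_POSITION_WEIGHT:
        decision = "rejected: exceeds position limit"
    elif proposal.confidence < MIN_CONFIDENCE:
        decision = "escalated: needs human review"
    else:
        decision = "passed: queued for human sign-off"
    audit_log.append((proposal.ticker, decision))  # every decision leaves a trace
    return decision
```

Nothing here is approved autonomously: the best outcome a proposal can reach is a place in a human's queue. That is the difference between an agent called "Risk Manager" and a limit that actually binds.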

The business case is weakest where the system is asked to autonomously trade capital under live uncertainty without strong validation. That is not innovation. That is a compliance incident with better UI.

The evidence is a map, not a universal alpha claim

The paper’s evidence should be interpreted carefully because the survey itself does not run a new unified benchmark. It summarises and categorises recent research. That is useful, but it changes what the reader can infer.

Material in the paper	Likely purpose	What it supports	What it does not prove
Tables 1–4 grouping studies by architecture	Main evidence for taxonomy	The field is organising around pipelines, hybrid systems, fine-tuning, and agents	That one category universally dominates the others
Reported returns, Sharpe ratios, accuracies, and prediction gains from reviewed studies	Comparison with prior work inside each cited study	Some LLM-enhanced systems outperform selected baselines in specific settings	Reliable live alpha across markets, costs, regimes, or institutions
Descriptions of RAG, CoT, ICL, LoRA, PEFT, RLHF, and MoE	Implementation background	These methods are recurring tools in financial LLM systems	That every method is necessary for every use case
Agent simulations such as StockAgent, TwinMarket, and Agent Trading Arena	Exploratory extension	LLM agents can model trading behaviour, social dynamics, and decision processes	That simulated agent behaviour is equivalent to real investor behaviour
Future directions on continuous learning, financial architectures, benchmarks, and human-AI collaboration	Research agenda	Current systems still lack standardisation, adaptation, and governance maturity	That these gaps are already solved by current commercial deployments

This matters because financial AI papers often mix three different claims:

The model predicts a label.
The prediction improves a backtest.
The strategy can survive live deployment.

Those are not the same claim. A model can improve directional accuracy and still lose money after costs. A strategy can beat a benchmark in one market period and fail under liquidity constraints. A system can generate convincing explanations that are not causally useful. Finance has many ways to punish statistical innocence.

The survey’s strength is that it shows breadth. Its limitation is that breadth creates comparability problems. The reviewed studies differ by market, asset universe, time period, model, task, data source, evaluation metric, and trading assumption. A roughly 30% annualised return in one study, a 55.63% trend accuracy in another, and a lower RMSE in a third cannot be stacked into a league table without doing violence to methodology.

The correct reading is architectural: these systems suggest where LLMs can add value, not how much alpha they will produce in your portfolio next quarter.

The operational checklist: choose by bottleneck, not fashion

For firms considering LLM integration, the paper implies a practical selection rule: start with the bottleneck.

Bottleneck	Use this pattern	Example implementation	Validation question
Analysts cannot process enough filings and news	RAG-backed LLM pipeline	Retrieve relevant filings, news, and fundamentals; produce structured research notes	Does the system retrieve the right source material and cite it correctly?
Narrative data is not entering quantitative models	Hybrid LLM-quant	Convert text into sentiment, event, or risk features for forecasting models	Does the feature add out-of-sample value after costs and controls?
The same financial language task repeats daily	Fine-tuning / PEFT / LoRA	Fine-tune a classifier or QA model on internal labelled financial data	Does fine-tuning improve consistency over prompting on held-out cases?
Investment workflows require multiple perspectives	Agent architecture	Separate fundamentals, news, sentiment, technical, risk, and final decision roles	Do agents reduce errors or merely produce longer rationales?
Outputs must be personalised to investor risk preferences	Human-agent collaboration	Combine market analysis with suitability and risk-profile constraints	Are recommendations compliant, explainable, and auditable?
Market regimes change quickly	Continuous adaptation and monitoring	Online updates, memory, feedback loops, and periodic recalibration	Does adaptation improve robustness without introducing leakage or drift?

The most boring column is the most important one: validation. In finance, a system that cannot be validated is not “emergent.” It is unusable.
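As one concrete case, the first validation question in the table (does the system cite its sources correctly?) can be partly automated. The sketch below assumes research notes carry explicit (quote, source id) pairs and that the retrieved passages are kept alongside them; both the data shape and the sample text are hypothetical.

```python
def check_citations(claims: list[tuple[str, str]],
                    retrieved: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means every quote was found
    verbatim (case-insensitive) in the source it cites."""
    problems = []
    for quote, source_id in claims:
        if source_id not in retrieved:
            problems.append(f"unknown source: {source_id}")
        elif quote.lower() not in retrieved[source_id].lower():
            problems.append(f"quote not found in {source_id}")
    return problems


# Invented passages and claims, for illustration only.
retrieved = {
    "10K-2023-item1A": "Our business depends on a limited number of suppliers.",
    "news-0412": "The company announced a share buyback programme.",
}
claims = [
    ("depends on a limited number of suppliers", "10K-2023-item1A"),
    ("record quarterly revenue", "news-0412"),       # not in the source text
    ("new credit facility", "press-release-9"),      # source was never retrieved
]

issues = check_citations(claims, retrieved)
```

A verbatim-match check is deliberately crude and will miss paraphrased misreadings, but it catches the cheapest failure mode: a confident quote that appears in no document the system actually read.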

The real business value is workflow leverage before trading autonomy

The business value of LLMs in finance is likely to appear first in decision support, not fully autonomous trading. That is not a disappointment. It is how serious systems usually enter serious industries.

The paper points to several practical value pools.

First, document throughput. Filings, earnings transcripts, regulatory updates, analyst reports, and news can be retrieved, summarised, compared, and converted into structured signals faster than manual coverage alone. This supports broader watchlists and faster reaction.

Second, signal enrichment. LLMs can turn qualitative information into inputs that traditional models can test: sentiment scores, event tags, risk categories, management tone, strategic themes, or extracted financial rationales.

Third, interpretability. Systems like Ploutos and SEP are concerned not only with predicting movement but also with explaining why. In investment organisations, explanation is not decoration. It helps with review, challenge, compliance, and post-mortems.

Fourth, workflow orchestration. Agent systems can structure the research process into specialist roles, making it easier to separate evidence gathering, argument formation, risk critique, and final recommendation.

Fifth, human-AI collaboration. Alpha-GPT 2.0 and FinArena point toward systems where humans refine the model’s output rather than simply receive it. That is probably where many professional firms will land: not AI replacing analysts, but analysts supervising a larger machine-assisted research surface.

The ROI logic is therefore less glamorous than “AI portfolio manager.” It is closer to:

fewer hours spent searching documents;
faster preparation of investment memos;
wider coverage with the same team;
more consistent sentiment and event extraction;
better documentation of reasoning;
earlier identification of risk-relevant narrative shifts.

That is not as cinematic as a robot hedge fund. It is also much more plausible.

The boundary: markets punish models that forget time, cost, and governance

The paper’s future directions are revealing because they identify the problems that remain unresolved.

The first is adaptation. Financial markets change. A model trained or prompted on one regime may fail in another. The paper calls for adaptive fine-tuning, continuous learning, online learning, reinforcement learning, and memory-augmented architectures. The practical issue is that adaptation must be controlled. A continuously learning model can become more current, or it can quietly learn noise with excellent enthusiasm.

The second is financial architecture. General-purpose LLMs are not naturally designed for time-series reasoning, multimodal financial fusion, or strict numerical consistency. Several reviewed studies try to bridge this with time-series modules, visual chart encoders, RAG, and domain-specific training. This suggests that the future of financial LLMs may be less about one giant model and more about specialised systems around a model core.

The third is benchmarking. The survey explicitly calls for standardised financial benchmarks and evaluation metrics that capture both predictive accuracy and investment outcomes. This is crucial. A forecasting model should not be judged only by accuracy if its errors cluster during drawdowns. A portfolio system should not be judged only by return if it achieves that return with fragile leverage, turnover, or concentration. A financial assistant should not be judged only by eloquence if it cannot cite sources or respect suitability constraints.
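Two of the outcome measures mentioned above are easy to compute alongside accuracy and return. The sketch below shows maximum drawdown and one-way turnover using common textbook definitions; the survey does not prescribe these formulas, and the inputs are invented.

```python
def max_drawdown(returns: list[float]) -> float:
    """Largest peak-to-trough loss of the compounded equity curve, as a fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def one_way_turnover(old: dict[str, float], new: dict[str, float]) -> float:
    """Half the sum of absolute weight changes between two rebalances."""
    tickers = set(old) | set(new)
    return 0.5 * sum(abs(new.get(t, 0.0) - old.get(t, 0.0)) for t in tickers)


# Made-up period returns and portfolio weights.
period_returns = [0.10, -0.20, 0.05, -0.10, 0.30]
mdd = max_drawdown(period_returns)

turnover = one_way_turnover({"AAA": 0.5, "BBB": 0.5}, {"AAA": 0.7, "BBB": 0.3})
```

In this toy series the strategy ends above where it started, yet it sat about 24% below its peak along the way. A benchmark that reports only the final return would never mention that.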

The fourth is human-AI collaboration. The paper treats this as a future direction, but for business users it is already the operating requirement. Model outputs need review rights, escalation paths, audit trails, and accountability. In regulated finance, “the agent decided” is not a defence. It is an invitation for someone senior to have a very bad afternoon.

The strategic takeaway: build the desk, not the oracle

The strongest lesson from this survey is not that LLMs are ready to run Wall Street. It is that financial intelligence is being decomposed into machine-assistable components.

Reading documents can be assisted. Extracting sentiment can be assisted. Converting qualitative events into structured features can be assisted. Generating first-pass investment memos can be assisted. Simulating trading behaviour can be explored. Coordinating analyst-like workflows can be partially automated.

But capital allocation remains a higher bar. It requires validation, governance, risk control, cost modelling, compliance, and live evidence. The paper’s reviewed studies are promising, but promise is not a mandate to connect the model to the order management system and hope the Sharpe ratio has manners.

The right metaphor is not “Wall Street’s new genius.” It is “Wall Street’s new intern”: tireless, fast, unusually literate, sometimes insightful, occasionally overconfident, and absolutely in need of supervision.

Used properly, that intern can make the desk faster and better informed. Used badly, it can generate polished mistakes at institutional scale. Finance, being finance, will probably try both.

References

Cognaptus: Automate the Present, Incubate the Future.

1. Sedigheh Mahdavi, Jiating (Kristin) Chen, Pradeep Kumar Joshi, Lina Huertas Guativa, and Upmanyu Singh, “Integrating Large Language Models in Financial Investments and Market Analysis: A Survey,” arXiv:2507.01990, 2025.
