The Lion Roars in Crypto: How Multi-Agent LLMs Are Taming Market Chaos

TL;DR for operators

MountainLion is best understood as a crypto research operating system, not a mystical trading lion that eats volatility for breakfast. The paper introduces a multi-modal, multi-agent LLM framework that combines technical analysis, news retrieval, on-chain signals, chart interpretation, price forecasting, GraphRAG-style semantic reasoning, and user feedback into a structured investment-reporting pipeline.¹

The practical contribution is workflow compression. Instead of asking an analyst to manually reconcile RSI, MACD, ETF flows, whale movements, liquidation events, regulatory news, social sentiment, and investment horizon, MountainLion decomposes the work across specialised agents and recombines the outputs into an interpretable report.

What the paper directly shows is that this architecture can enrich technical-only recommendations with contextual signals across short-, medium-, and long-term horizons. The case study compares a baseline report built from technical indicators against LLM-refined recommendations produced through agents such as ChatGPT-4o, DeepSeek V3, and Grok-3. The refinement adds on-chain wallet accumulation, liquidation-volume signals, ETF inflows, macro policy context, regulatory developments, institutional adoption, and declining exchange reserves.

What Cognaptus infers for business use is narrower but useful: this is a serious template for AI-assisted research desks, crypto advisory dashboards, investor education tools, and platform-side recommendation layers. It is not yet enough to claim production-grade autonomous trading. The paper contains forecasting tables and case-study comparisons, but it does not present a full live-trading audit with transaction costs, slippage, execution latency, drawdowns, compliance controls, or regime-specific stress tests. Tiny details, naturally.

The most interesting business lesson is that LLMs may be more valuable in crypto as explanation integrators than as naked price predictors. The machine does not need to be a perfect oracle to be useful. It only needs to reduce the cost of turning fragmented market noise into a coherent, evidence-linked decision brief.

The system is an analyst factory, not a magic trading oracle

Crypto is not short of signals. It is short of disciplined synthesis.

A trader can watch candlesticks, funding rates, whale transfers, exchange reserves, ETF flows, Telegram rumours, X threads, Reddit sentiment, regulatory headlines, and macro data all at once. This is usually called being “informed”, although after the sixth dashboard it starts to resemble a hostage situation.

MountainLion’s argument is that the hard part is not merely prediction. It is coordination. Crypto decision-making requires fast interpretation across heterogeneous inputs: structured market data, textual news, visual charts, sentiment, on-chain behaviour, and user-specific investment horizons. Traditional deep learning and reinforcement learning pipelines can encode some of this into latent vectors, but they often struggle to explain why a recommendation changed. Static rule engines have the opposite problem: they can be explainable, but they are brittle when the market narrative mutates.

MountainLion responds with a layered architecture. The paper describes four principal layers:

| Layer | What it does | Operator translation |
| --- | --- | --- |
| User interface | Provides multilingual dashboard, chat, alerts, and visual analytics | The place where investors and analysts ask questions and inspect outputs |
| Core business logic | Orchestrates report generation, forecasting, and recommendations | The workflow layer that turns raw requests into research tasks |
| AI engine | Runs LLM agents, retrieval, GraphRAG, and ML forecasting models | The reasoning and prediction layer |
| Database layer | Stores exchange data, news, embeddings, reports, and predictions | The memory and audit substrate |

This matters because the system is not presented as one giant prompt glued to a chart. It is a modular pipeline. That distinction is not cosmetic. In production finance, modularity determines whether a system can be monitored, debugged, refreshed, audited, and partially replaced without detonating the whole stack. The boring engineering nouns are doing real work here.

The paper’s central design move is to split the analyst function into specialised agents and then recombine their outputs. That is where the “multi-agent” label has substance.

Four agents turn market clutter into a reportable decision chain

MountainLion’s report-generation pipeline is built around four specialised agents:

| Agent | Input focus | Output role |
| --- | --- | --- |
| Technical Analysis Agent | Historical price and volume data; RSI, MACD, Bollinger Bands, support and resistance | Produces chart-based market structure |
| Market Dynamics Agent | News, regulation, capital flows, sentiment, social and KOL signals | Adds external market context |
| Trading Recommendation Agent | Technical outputs, market context, broader data | Converts analysis into horizon-specific strategy |
| Semantic Analysis Agent | Combined agent outputs | Improves coherence, logical consistency, and investor-facing readability |

This is the mechanism-first core of the paper. The system does not simply ask an LLM, “Should I buy Bitcoin?” That would be less an architecture than a cry for help. Instead, it decomposes a user request into subproblems, routes each subproblem to an agent, validates retrieved information by relevance, recency, and source credibility, and then synthesises a report.
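
The decomposition described above can be sketched as a small orchestration function. This is an illustration, not the paper's code: the function names, the `AgentOutput` shape, and `generate_report` are all assumptions, and each agent is a stub where a real system would call an LLM or a data service.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AgentOutput:
    agent: str
    summary: str


# Hypothetical stubs: a real agent would call an LLM, a market-data API, or both.
def technical_agent(token: str, horizon: str) -> AgentOutput:
    return AgentOutput("technical", f"{token}: placeholder RSI/MACD read for {horizon} horizon")


def market_dynamics_agent(token: str, horizon: str) -> AgentOutput:
    return AgentOutput("market_dynamics", f"{token}: placeholder news and capital-flow context")


def recommendation_agent(token: str, horizon: str, upstream: List[AgentOutput]) -> AgentOutput:
    # Consumes the upstream analyses rather than raw data.
    basis = "; ".join(o.summary for o in upstream)
    return AgentOutput("recommendation", f"{horizon} view on {token}, based on: {basis}")


def semantic_agent(outputs: List[AgentOutput]) -> str:
    # Stand-in for the refinement step: emit one labelled line per agent, in order.
    return "\n".join(f"[{o.agent}] {o.summary}" for o in outputs)


def generate_report(token: str, horizon: str) -> str:
    upstream = [technical_agent(token, horizon), market_dynamics_agent(token, horizon)]
    recommendation = recommendation_agent(token, horizon, upstream)
    return semantic_agent(upstream + [recommendation])
```

The point of the shape is that each stage has a named input and output, so a single agent can be swapped or inspected without touching the others.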

The retrieval stage is important. Each agent formulates retrieval queries and evaluates the resulting signals. The paper describes a composite scoring logic using relevance, recency, and credibility weights. This is not a guarantee of truth, but it is a necessary guardrail. In crypto, stale information can be actively harmful. A one-week-old rumour can be dead, priced in, or replaced by a completely different regulatory panic wearing the same hat.
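
A composite score of this kind can be sketched in a few lines. The paper names the three ingredients; the specific weights, the exponential recency decay, and the 24-hour half-life below are assumptions chosen for illustration, not values reported by the authors.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    text: str
    relevance: float    # 0..1, e.g. similarity between the item and the query
    age_hours: float    # time since publication
    credibility: float  # 0..1, prior on source quality


def recency_score(age_hours: float, half_life_hours: float = 24.0) -> float:
    """Exponential decay: 1.0 when fresh, 0.5 after one half-life (assumed form)."""
    return 0.5 ** (age_hours / half_life_hours)


def composite_score(
    s: Signal,
    w_relevance: float = 0.5,   # assumed weight
    w_recency: float = 0.3,     # assumed weight
    w_credibility: float = 0.2, # assumed weight
) -> float:
    return (
        w_relevance * s.relevance
        + w_recency * recency_score(s.age_hours)
        + w_credibility * s.credibility
    )


def rank(signals):
    """Return signals ordered from highest to lowest composite score."""
    return sorted(signals, key=composite_score, reverse=True)
```

With equal relevance and credibility, a one-hour-old item outranks a one-week-old item, which is the behaviour the stale-rumour problem calls for.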

The semantic refinement stage is also more than copy-editing. It performs the conversion from “many partial analyses” into “one usable investment narrative”. That is where LLMs have an obvious comparative advantage. They are good at turning multiple textual and semi-structured inputs into a coherent explanation. They are less obviously good at discovering durable alpha from noisy price series. MountainLion sensibly uses both LLM and classical ML components rather than pretending one model family deserves a crown.

The appendix adds operational details that matter for deployment. Agent outputs are cached with time-to-live logic so the system does not recompute everything every time a user asks a nearby question. Freshness checks decide whether cached outputs can be reused or whether an agent must recompute. This is an implementation detail, but in real-time research systems it is also a cost and latency control mechanism. A beautiful AI analyst that takes too long to answer is just a very expensive screensaver.
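
The caching idea is simple enough to sketch. The class below is a minimal assumption-laden version: the `TTLCache` name, the single global time-to-live, and the injectable clock are illustrative choices, not details taken from the paper.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Reuse an agent output while it is fresh; recompute once it has expired."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so freshness logic can be tested without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # freshness check passed: reuse cached output
        value = compute()    # missing or stale: run the agent again
        self._store[key] = (now, value)
        return value
```

In practice the time-to-live would probably differ by agent: a technical read on a one-minute chart goes stale far faster than a regulatory summary.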

Forecasting is a fusion problem, not a language-model victory lap

The forecasting module is deliberately dual-track. One path uses an LLM-based predictor that processes OHLCV data and sentiment embeddings from news. The other path uses machine-learning models such as ridge regression, decision trees, polynomial regression, or ensemble tree methods applied to engineered price features.

The final forecast is built as a weighted fusion:

$$ \hat{Y}_{final} = \alpha \hat{Y}_{LLM} + (1 - \alpha)\hat{Y}_{ML} $$

The key is that $\alpha$ is not meant to be static. The paper describes adaptive updating based on rolling historical accuracy. If the LLM-driven component underperforms, the system can reduce its weight and lean more heavily on the ML predictor. If textual sentiment and macro narrative become informative, the LLM path can matter more.
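
One way to make the weight adaptive is inverse-error weighting over a rolling window. The paper says the weight follows rolling historical accuracy but the update rule below is an assumption, as are the class name, the window length, and the use of absolute error.

```python
from collections import deque


class AdaptiveFusion:
    """Blend LLM and ML forecasts; shift weight towards whichever erred less recently."""

    def __init__(self, window: int = 20, alpha: float = 0.5):
        self.alpha = alpha                   # weight on the LLM forecast
        self.err_llm = deque(maxlen=window)  # rolling absolute errors, LLM track
        self.err_ml = deque(maxlen=window)   # rolling absolute errors, ML track

    def predict(self, y_llm: float, y_ml: float) -> float:
        return self.alpha * y_llm + (1 - self.alpha) * y_ml

    def update(self, y_llm: float, y_ml: float, y_true: float) -> None:
        self.err_llm.append(abs(y_llm - y_true))
        self.err_ml.append(abs(y_ml - y_true))
        mae_llm = sum(self.err_llm) / len(self.err_llm)
        mae_ml = sum(self.err_ml) / len(self.err_ml)
        total = mae_llm + mae_ml
        if total > 0:
            # Assumed rule: the LLM weight is the ML track's share of total error,
            # so a larger LLM error pushes alpha down.
            self.alpha = mae_ml / total
```

The `window` parameter is where the trade-off lives: short windows react quickly and overfit recent conditions, long windows are stable and slow to notice a regime change.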

This is the right instinct. Financial markets are regime machines. Sometimes price structure dominates. Sometimes news dominates. Sometimes everything is noise and everyone pretends their dashboard predicted it afterwards. A fixed model weighting is almost always too confident.

The paper evaluates forecasts with two conceptual metrics: absolute accuracy, which measures price-level error, and directional correctness, or win rate, which measures whether the model predicted the direction of movement. Directional correctness is especially relevant in trading because a model can be imprecise on price level yet useful on direction. However, the quantitative table reported in the appendix mainly gives cross-validation scores and test MSE across tokens. That table is useful, but it is not a full trading-performance study.
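
The difference between the two metrics is easiest to see in code. This is a generic sketch: mean absolute error stands in for the paper's absolute-accuracy measure, and the function names and the sign-matching definition of win rate are assumptions.

```python
from typing import Sequence


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average price-level error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def _sign(x: float) -> int:
    return (x > 0) - (x < 0)


def directional_win_rate(
    previous: Sequence[float],
    actual: Sequence[float],
    predicted: Sequence[float],
) -> float:
    """Share of periods where the predicted move from the previous price
    has the same sign as the realised move."""
    hits = sum(
        _sign(a - prev) == _sign(p - prev)
        for prev, a, p in zip(previous, actual, predicted)
    )
    return hits / len(actual)
```

For example, with previous prices of 100 throughout, realised prices `[105, 95, 102, 98]`, and forecasts `[101, 99, 99, 97]`, the mean absolute error is 3.0 while the win rate is 0.75: the forecasts are off on level but right on direction three times out of four.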

The reported forecasting table includes the following token-level results:

| Token | Alpha | CV score | Test MSE | Paper’s evaluation note |
| --- | --- | --- | --- | --- |
| ADA | 1 | -0.000496 | 0.000396 | Excellent fit, highly stable |
| BTC | 1 | -1,997,859.43 | 3,211,419.56 | Large error, unstable trend |
| ARB | 1 | -0.000421 | 0.000199 | Very good |
| SOL | 1 | -0.000459 | 0.000159 | Good model, potentially high volatility |
| XRP | 0.1 | -0.000221 | 0.001122 | Medium fit, moderate noise |
| DOGE | 1 | -0.000362 | 0.0000425 | Very good |
| TRX | 0.01 | -0.00000987 | 0.00000640 | Best performer |
| ETH | 1 | -2,169,147.17 | 3,016,065.13 | Large error, unstable |
| MATIC | 1 | -0.000432 | 0.000341 | Stable, medium confidence |
| BNB | 1 | -945.57 | 180.95 | High deviation, unstable in 7-day window |

There are two readings of this table.

The generous reading is that the framework adapts differently across assets. Smaller or more locally patterned tokens such as TRX, DOGE, ARB, ADA, SOL, and MATIC appear more stable under the reported setup, while BTC and ETH show large errors because their large-cap dynamics are structurally harder to model with the same simple framework.

The stricter reading is that the evidence is uneven. The table gives model-fit diagnostics, not realised trading returns. It does not tell us how the system performs after bid-ask spreads, market impact, funding costs, slippage, transaction fees, execution delay, risk limits, or position sizing. Nor does it provide a broad regime-by-regime breakdown. The lion can forecast some cages better than others. That is still useful. It is not yet a safari licence.
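
One further caution on reading the raw numbers: MSE is scale-dependent, so values for different tokens are only comparable if the targets share a scale. The summary here does not establish the units or any normalisation behind the table, so the snippet below only makes the arithmetic explicit by converting a few reported test MSE values to RMSE, which is at least in the same units as the target.

```python
import math

# Test MSE values copied from the reported table; units are not stated there.
test_mse = {
    "BTC": 3_211_419.56,
    "ETH": 3_016_065.13,
    "BNB": 180.95,
    "ADA": 0.000396,
    "TRX": 0.00000640,
}

# RMSE is the square root of MSE, expressed in the target's own units.
rmse = {token: math.sqrt(value) for token, value in test_mse.items()}

for token, value in rmse.items():
    print(f"{token}: RMSE = {value:.5g}")
```

The BTC figure works out to an RMSE of roughly 1,792 and BNB to roughly 13.5, in whatever unit the target was measured. If those targets were raw prices, the errors would look less dramatic relative to price level than the MSE column suggests, but that is a conditional reading, not something the table confirms.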

GraphRAG makes news useful only after the system decides what “useful” means

The news recommendation module is where MountainLion becomes more than a chart annotator.

The system ingests financial news from APIs, RSS feeds, and curated portals. It then performs semantic annotation: sentiment classification, named entity recognition, entity-event extraction, and token association. From there, it constructs a dynamic knowledge graph where nodes represent news items or entities, and edges encode co-occurrence or contextual relationships.

This graph serves a practical purpose. Crypto news rarely arrives as clean, isolated facts. A regulatory headline may affect a token through an exchange, a jurisdiction, a market-maker, an ETF vehicle, or a broader risk narrative. A simple keyword search will retrieve “BTC” mentions. A graph can connect BTC to ETF inflows, Coinbase custody, regulatory clarity, institutional allocation, and macro liquidity narratives. Whether that connection is actually causal is another matter, but the retrieval surface becomes richer.
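
A co-occurrence graph of this kind needs very little machinery. The sketch below assumes entity extraction has already happened upstream and that each news item arrives as a list of entity strings; the function names and the plain dictionary representation are illustrative, not the paper's implementation.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List


def build_cooccurrence_graph(news_items: Iterable[List[str]]) -> Dict[str, Dict[str, int]]:
    """Edge weight = number of news items in which two entities appear together."""
    graph: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for entities in news_items:
        # Deduplicate within an item so one article adds at most 1 to any edge.
        for a, b in combinations(sorted(set(entities)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def neighbours(graph: Dict[str, Dict[str, int]], entity: str, min_weight: int = 1) -> List[str]:
    """Entities linked to `entity`, strongest connection first (ties broken by name)."""
    edges = graph.get(entity, {})
    ranked = sorted(edges.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, weight in ranked if weight >= min_weight]
```

Given items `[["BTC", "ETF", "Coinbase"], ["BTC", "ETF"], ["ETH", "SEC"]]`, `neighbours(graph, "BTC")` returns `["ETF", "Coinbase"]`. Co-occurrence is a retrieval aid here, not a causal claim.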

MountainLion also interprets user intent. The paper describes a user preference vector involving asset category, risk tolerance, and time horizon. That vector shapes retrieval prompts and recommendation generation. A 24-hour trader and a 12-month allocator should not receive the same answer even if both ask about Bitcoin. One cares about funding-rate spikes and exchange inflows. The other cares about ETF adoption, macro rates, institutional allocation, and supply structure.

The prompt appendix makes this horizon split explicit. Short-term prompts focus on signals such as RSI or MACD divergence, whale transfers above $10 million, funding-rate spikes, unusual Twitter or Reddit activity, exchange inflow or outflow alerts, and breaking exchange-related news. Mid-term prompts focus on ETF flows, smart-money positioning, sentiment shifts, KOL perspectives, regulatory news, and network activity. Long-term prompts focus on the FOMC rate path, inflation, recession risk, oil, gold, the dollar, spot ETF adoption, institutional allocation, whale accumulation, and BTC correlations with Nasdaq, DXY, or gold.

This is a strong operational idea: define the signal menu by investment horizon. It prevents the common crypto-research error of using a 24-hour signal to justify a 12-month thesis, or using a 12-month thesis to rationalise a bad 24-hour entry. Both errors are popular because they make people feel strategic while losing money.
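
The horizon split can be expressed as plain configuration. The signal lists below are abridged from the prompt descriptions above; the `UserPreference` fields follow the preference vector the paper describes, while the class name, the prompt wording, and `build_retrieval_prompt` are assumptions.

```python
from dataclasses import dataclass

# Signal menu keyed by horizon, abridged from the horizon-specific prompt descriptions.
SIGNAL_MENU = {
    "short": [
        "RSI or MACD divergence",
        "whale transfers above $10 million",
        "funding-rate spikes",
        "exchange inflow or outflow alerts",
    ],
    "mid": [
        "ETF flows",
        "smart-money positioning",
        "regulatory news",
        "network activity",
    ],
    "long": [
        "FOMC rate path",
        "spot ETF adoption",
        "institutional allocation",
        "BTC correlations with Nasdaq, DXY, or gold",
    ],
}


@dataclass
class UserPreference:
    asset: str
    risk_tolerance: str  # e.g. "low", "moderate", "high"
    horizon: str         # one of the SIGNAL_MENU keys


def build_retrieval_prompt(pref: UserPreference) -> str:
    """Build a retrieval prompt restricted to the signals for the user's horizon."""
    signals = SIGNAL_MENU[pref.horizon]  # raises KeyError on an unknown horizon
    return (
        f"For {pref.asset} ({pref.horizon}-term, {pref.risk_tolerance} risk tolerance), "
        f"retrieve recent evidence on: " + "; ".join(signals)
    )
```

Keeping the menu as data rather than burying it in prompt text makes the horizon boundary auditable: a short-term prompt cannot quietly pick up a long-term signal.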

The evidence shows better context, not audited alpha

The paper’s strongest case-study evidence compares a technical-only baseline against LLM-enhanced investment recommendations. The baseline relies on support, resistance, and standard chart logic. The enhanced versions incorporate real-time market signals, news narratives, macro conditions, and on-chain data. Three LLM-based agents are evaluated in this report-refinement task: ChatGPT-4o, DeepSeek V3, and Grok-3.

The pattern is consistent across horizons:

| Horizon | Baseline weakness | MountainLion-style enhancement | What the result supports |
| --- | --- | --- | --- |
| Short term: 1–4 weeks | Technical triggers with limited justification | Adds wallets holding 1+ BTC, liquidation-volume changes, whale or flow indicators | More credible short-term explanation, though signals may decay quickly |
| Medium term: 1–6 months | Lacks macro and policy context | Adds ETF inflows, regulatory clarity, easing signals, capital-flow logic | Strongest case for integrated research value |
| Long term: 6+ months | Allocation ratios without enough fundamental rationale | Adds institutional adoption and declining exchange reserves | Stronger thesis framing, but less dependent on short-term signals |

The authors argue that the medium-term horizon benefits most. That makes sense. Short-term crypto moves are often too noisy and execution-sensitive; long-term theses are often dominated by structural adoption, liquidity cycles, and allocation logic. The middle horizon is where narratives, flows, policy, and technical positioning can plausibly interact. It is also the horizon where a research-reporting system can add value without pretending to be a high-frequency execution engine.

Appendix E makes this point more directly. Short-term refinements add useful on-chain and liquidation context, but the paper acknowledges that these signals have limited persistence. Long-term refinements improve rationale, but remain largely allocation-oriented. Medium-term recommendations benefit most because the system links price action to external drivers such as ETF flows, policy conditions, and capital allocation.

This is a subtle but important conclusion. The paper is not equally strong across all trading use cases. Its most credible operating zone is not “trade everything all the time”. It is medium-horizon, evidence-enriched investment analysis.

The appendix metrics should be read as support, not proof

The paper contains several appendices that expand the system’s implementation and demonstrations. They are useful, but they should be classified properly.

| Paper component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Section 4 case study | Main evidence and comparison with technical-only baseline | LLM agents enrich investment reports with macro, on-chain, and market-context signals | Live profitability or risk-adjusted alpha |
| Forecasting Table 1 | Quantitative support for the forecasting module | Token-level model fit varies; some assets appear stable while BTC/ETH are difficult | Full trading-system validation |
| Appendix A demonstration | Implementation walk-through and exploratory extension | End-to-end dashboard, API flow, chat, preset analysis, and user-facing outputs | Independent benchmarked deployment performance |
| Appendix B formulations | Implementation detail | Agent functions, validation scoring, prompt construction, caching policy | That the chosen scoring weights are optimal |
| Appendix C forecasting details | Implementation detail and model logic | Dual-track fusion, rolling adaptation, absolute accuracy, directional correctness concepts | Robustness across market regimes |
| Appendix D news recommendation | Implementation detail | News structuring, semantic enrichment, graph construction, user intent, feedback adaptation | That graph reasoning identifies true causal drivers |
| Appendix F prompts | Implementation detail | How horizon-specific retrieval is operationalised | That prompt design alone controls hallucination or overfitting |

Appendix A reports several attractive numbers: over 40% improvement in relevant information retrieval efficiency versus manual processes, detection of whale activity involving $10 million transactions with confidence above 0.85, 28% improvement in investment decision accuracy versus traditional single-dimension analysis, 15% improvement in short-term price prediction accuracy versus baseline models, and a 35% increase in user engagement duration.

Those numbers should not be ignored, but neither should they be swallowed whole. The appendix does not provide enough detail on experimental design, sample size, user population, baseline construction, statistical uncertainty, or independent replication. For a product roadmap, they are encouraging. For investment committee evidence, they are preliminary.

The right interpretation is that MountainLion demonstrates a coherent architecture with plausible workflow benefits and some reported performance gains. It does not yet establish that the system should be allowed to trade client funds autonomously while everyone goes for coffee. The coffee can wait.

The business value is cheaper synthesis before smarter execution

For a trading desk, the obvious fantasy is autonomous alpha. The more realistic first product is analyst augmentation.

MountainLion’s strongest business pathway is to reduce the cost and time of preparing market briefs. A human analyst can spend hours reconciling price action, macro news, policy signals, ETF flows, whale movements, and sentiment. A system like MountainLion can pre-structure that work, flag the relevant evidence, generate a draft view by horizon, and make the assumptions explicit enough for human review.

That has several immediate use cases:

| Business setting | Practical use | Why MountainLion-style architecture fits |
| --- | --- | --- |
| Crypto research desk | Daily or weekly token outlooks | Multi-agent decomposition maps naturally onto analyst sub-tasks |
| Brokerage or exchange platform | Investor-facing insight cards | Semantic agent turns technical signals into readable explanations |
| Wealth or advisory platform | Horizon-specific crypto allocation notes | User intent modelling can tailor outputs by risk and time horizon |
| Risk team | Narrative risk monitoring | GraphRAG can connect news, entities, regulation, and token exposure |
| Education product | Explain why a signal matters | Multi-modal reports lower cognitive load for non-technical users |

The ROI is not only better predictions. It is lower analyst cost per brief, faster reaction to information, more consistent report structure, and improved explainability. These are less glamorous than “AI beats the market”, but they are easier to monetise and easier to govern.

There is also a compliance angle. In regulated advisory contexts, a recommendation needs an audit trail: what data was used, what assumptions were made, what risk tolerance was inferred, and why the output changed. MountainLion’s modular design is more audit-friendly than a monolithic chatbot because each component has a defined role. That does not automatically make it compliant. It simply gives compliance teams something to inspect, which is a refreshing improvement over “the model felt bullish”.
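
What an inspectable audit trail might look like can be sketched as a single record per recommendation. The field names below simply mirror the four questions in the paragraph above; the `AuditRecord` class, the JSON-lines format, and the example values in the usage note are assumptions, not anything specified by the paper or by a regulator.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class AuditRecord:
    token: str
    horizon: str
    inferred_risk_tolerance: str  # what risk tolerance was inferred
    data_sources: List[str]       # what data was used
    assumptions: List[str]        # what assumptions were made
    recommendation: str
    change_reason: str            # why the output changed since the last report
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(record: AuditRecord) -> str:
    """Serialise one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)
```

A record such as `AuditRecord("BTC", "mid", "moderate", ["ETF flow feed"], ["flows persist"], "hold", "ETF inflows slowed")` then becomes one line in a log that a reviewer can filter by token, horizon, or date.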

The deployment problem is not the demo; it is the control loop

The paper’s reflection and feedback components are among its most interesting features. MountainLion logs user engagement signals such as clicks, dwell time, and explicit ratings, then uses those signals to adapt retrieval and summarisation strategies. It also updates forecasting weights based on rolling accuracy.

This creates a useful control loop. The system can learn which sources, prompts, signals, and model tracks perform better over time. It can adapt when market conditions change. It can personalise output to user interests.

It also creates a governance problem. User engagement is not the same as investment quality. Retail investors may click sensational bearish headlines, overreact to whale transfers, or prefer confidently written nonsense over calibrated uncertainty. If the system optimises too aggressively for engagement, it may become a beautifully personalised volatility amplifier.

The same applies to rolling forecast accuracy. A short validation window may overfit recent conditions. A long window may lag regime changes. Weight adaptation is necessary, but it must be monitored. In crypto, yesterday’s best predictor often becomes tomorrow’s expensive superstition.

Operators building from this architecture should treat feedback as a signal, not a supervisor. Human review, risk controls, source-quality filters, model-performance dashboards, and post-trade analysis are still necessary. The LLM may roar, but someone still needs to own the risk book.

Boundaries: where the lion still needs a leash

MountainLion is promising, but its claims sit inside several boundaries.

First, the paper’s empirical evidence is heavier on report quality and architectural demonstration than on live trading validation. It shows enriched recommendations, token-level forecasting metrics, and case-study improvements. It does not show audited live PnL, Sharpe ratios, maximum drawdown, capacity analysis, fee-adjusted performance, or stress tests across market regimes.

Second, the system depends on retrieval quality. RAG and GraphRAG reduce hallucination risk only if the retrieved sources are timely, credible, and correctly interpreted. A graph can organise misinformation as efficiently as information. Structure is not truth.

Third, multimodal integration is operationally difficult. Charts, OHLCV sequences, social sentiment, news text, and on-chain flows update at different speeds and with different reliability profiles. Aligning timestamps, confidence levels, and signal decay is not a minor plumbing issue. In crypto, plumbing is often where the bodies are buried.

Fourth, user personalisation introduces suitability risk. If the system infers risk appetite or investment horizon incorrectly, the recommendation may be coherent but inappropriate. A polished explanation does not rescue a mismatched strategy.

Fifth, interpretability can become persuasion. A fluent report that links ETF inflows, whale accumulation, and policy easing may feel more credible than a technical-only note. It may also hide weak causal links behind elegant prose. This is the dark art of financial writing, now available with API calls.

The practical conclusion is not to avoid systems like MountainLion. It is to deploy them in the right order: research assistant first, recommendation engine second, execution agent only after serious validation.

The operating lesson: automate judgement support before automating judgement

MountainLion matters because it points to a realistic future for financial AI. The winning near-term product is unlikely to be a fully autonomous crypto trader that simply prints money. Markets have a rude habit of noticing when too many people bring the same magic printer.

The more credible future is a multi-agent research layer that continuously assembles evidence, separates horizons, explains recommendations, adapts to new information, and gives human operators a better starting point. That alone is valuable. In volatile markets, the first bottleneck is often not execution. It is sense-making.

The paper’s mechanism-first contribution is therefore clear: MountainLion shows how an LLM system can coordinate specialised agents, retrieval, graph reasoning, forecasting fusion, report refinement, and feedback adaptation into a single decision-support architecture. Its evidence is strongest where the task is synthesis: transforming technical-only reports into richer, time-aligned investment narratives. Its evidence is weaker where the claim would require full trading validation.

That distinction is not a criticism. It is the product strategy.

Build the system to make analysts faster, recommendations clearer, and market reasoning more traceable. Then test whether that improved reasoning survives contact with execution costs, liquidity, regulation, and panic. Preferably before calling it an autonomous trader. The lion may roar, but in finance, the leash is called risk management.

Cognaptus: Automate the Present, Incubate the Future.

¹ Siyi Wu et al., “MountainLion: A Multi-Modal LLM-Based Agent System for Interpretable and Adaptive Financial Trading,” arXiv:2507.20474, 2025, https://arxiv.org/abs/2507.20474.
