TL;DR for operators
A new paper uses CrewAI to build a multi-agent workflow for crypto portfolio construction, then compares three allocation logics: equal weighting, static mean-variance optimisation, and 30-day rolling Sharpe maximisation across ten major crypto assets from 2020 to 2025.[1]
The headline result is not that “AI agents beat crypto markets.” Please put that sentence down before it hurts someone. The useful result is narrower and better: in a volatile asset class, a rolling allocation strategy outperformed a fixed one on risk-adjusted metrics, while the agentic architecture turned the research process into a modular, inspectable pipeline.
The paper reports that the rolling strategy, Crew B, achieved a Sharpe ratio of 1.00 in-sample versus 0.60 for the static optimised strategy, and maintained a Sharpe ratio of 0.72 out-of-sample while the static allocation deteriorated to 0.36. The exact numbers should be treated as backtest evidence, not deployment proof. The study ignores transaction costs, spreads, custody frictions, tax effects, and tail-risk measures such as CVaR.
For asset managers, exchanges, robo-advisers, and crypto wealth platforms, the practical takeaway is this: agentic AI may be most useful not as a magic portfolio brain, but as a structured operating layer around data ingestion, cleaning, optimisation, risk reporting, review, and audit trails. The alpha, if any, still has to come from the allocation logic. The agents mostly make the machinery easier to run, inspect, and extend.
The real comparison is not human versus AI
Crypto portfolio papers often invite the wrong reading. Add “multi-agent system” and “CrewAI” to the title, and readers naturally expect a small committee of digital hedge-fund goblins arguing over Solana exposure while one of them reads Reddit sentiment in a dramatic voice.
That is not what this paper does.
The system is agentic in workflow structure, not in the sense of free-form market judgement. The authors build a Multi-Agent System using CrewAI, where different agents handle specific parts of the investment pipeline: loading data, cleaning it, splitting train and test samples, computing portfolio metrics, optimising weights, running rolling optimisation, checking files, and producing final reports. The agents are wrappers around a conventional quantitative process.
That distinction matters. The comparison is not “AI trader versus old-fashioned portfolio manager.” It is closer to this:
| Layer | What is being compared | What the paper actually tests |
|---|---|---|
| Allocation rule | Equal weighting vs static optimisation vs rolling optimisation | Whether adaptive rebalancing improves portfolio metrics |
| System design | Monolithic script vs multi-agent workflow | Whether the process can be modularised and reported through specialised agents |
| Business interpretation | Autonomous alpha engine vs auditable research pipeline | Whether agentic orchestration makes portfolio automation more operationally usable |
| Deployment readiness | Backtest vs live trading | The paper stays in historical simulation |
This is why the article has to be comparison-based. The paper’s value is not one big claim. It is a stack of contrasts: static versus adaptive, equal-weight versus optimised, portfolio logic versus agentic packaging, backtest improvement versus deployable trading system.
The interesting part is where those contrasts do not line up neatly. The performance gain comes from rolling optimisation. The operational novelty comes from the CrewAI architecture. Confusing the two would be a tidy little way to overclaim the paper.
Three portfolios walk into a backtest
The study uses daily closing prices for ten major cryptocurrencies from 20 August 2020 to 13 March 2025: BTC, ETH, BNB, SOL, XRP, ADA, DOGE, AVAX, DOT, and SHIB. The dataset contains 1,667 daily observations, split 80/20 into 1,333 training rows and 334 testing rows.
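The row counts can be checked with a few lines of arithmetic. This is a minimal sketch, assuming one observation per calendar day and a split taken in date order with the training size rounded down; the helper name `chronological_split` is illustrative and does not come from the paper.

```python
from datetime import date, timedelta

def chronological_split(rows, train_fraction=0.8):
    """Split an ordered sequence into (train, test) without shuffling."""
    cut = int(len(rows) * train_fraction)  # round the training size down
    return rows[:cut], rows[cut:]

# Assumption: one row per calendar day, both endpoints included.
start, end = date(2020, 8, 20), date(2025, 3, 13)
days = [start + timedelta(days=i) for i in range((end - start).days + 1)]

train, test = chronological_split(days)
print(len(days), len(train), len(test))  # 1667 1333 334
```

Keeping the split in date order matters for any backtest: shuffling rows before splitting would leak future information into the training sample.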
The asset universe is intentionally liquid and mainstream by crypto standards. That makes the analysis more realistic for institutional-facing products than a basket of microscopic tokens with liquidity measured in vibes. It also means the results should not be generalised to long-tail altcoins, new listings, illiquid DeFi tokens, or meme assets whose market depth evaporates when someone important changes their profile picture.
The paper evaluates three allocation approaches:
| Strategy | Allocation logic | Rebalancing | Role in the study |
|---|---|---|---|
| Equal weighting | 10% in each asset | None | Baseline |
| Crew A static optimisation | Mean-variance optimisation to maximise Sharpe ratio, with no short-selling and full investment | None after initial optimisation | Conventional optimised benchmark |
| Crew B rolling optimisation | 30-day rolling Sharpe maximisation, with no short-selling and full investment | Every 30 days | Adaptive strategy |
Crew A is essentially the “optimise once, then hold” approach. It uses the training data to find weights that improve risk-adjusted return, but once those weights are set, the portfolio does not adapt to changing market conditions.
Crew B changes one crucial thing: it recalculates weights using the most recent 30-day window, then applies those weights for the next period. This gives the strategy a way to respond to shifting return, volatility, and correlation patterns.
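The refit-then-hold loop can be sketched in plain Python. This is not the paper's implementation: the random search below is a crude stand-in for a proper constrained optimiser, the risk-free rate is assumed to be zero, the data is synthetic, and every function name is hypothetical.

```python
import random
import statistics

def portfolio_returns(weights, rows):
    """Daily portfolio returns for fixed weights; each row holds per-asset returns."""
    return [sum(w * r for w, r in zip(weights, row)) for row in rows]

def sharpe(weights, rows):
    """Mean over standard deviation of daily returns (risk-free rate assumed zero)."""
    rets = portfolio_returns(weights, rows)
    sd = statistics.pstdev(rets)
    return statistics.fmean(rets) / sd if sd > 0 else 0.0

def max_sharpe_weights(rows, rng, n_candidates=2000):
    """Random search over long-only, fully invested weights.

    Candidates are normalised exponential draws, so every weight is >= 0
    and the weights sum to 1 (no short-selling, full investment).
    """
    n_assets = len(rows[0])
    best = [1.0 / n_assets] * n_assets  # start from equal weighting
    best_score = sharpe(best, rows)
    for _ in range(n_candidates):
        draws = [rng.expovariate(1.0) for _ in range(n_assets)]
        total = sum(draws)
        candidate = [d / total for d in draws]
        score = sharpe(candidate, rows)
        if score > best_score:
            best, best_score = candidate, score
    return best

def rolling_schedule(rows, window=30, seed=0):
    """Refit on the trailing `window` days, then hold for the next `window` days."""
    rng = random.Random(seed)
    schedule = []
    for t in range(window, len(rows), window):
        weights = max_sharpe_weights(rows[t - window:t], rng)
        schedule.append((t, weights))  # weights apply to rows[t:t + window]
    return schedule

# Synthetic daily returns for three hypothetical assets, for illustration only.
rng = random.Random(42)
data = [[rng.gauss(0.001, 0.03) for _ in range(3)] for _ in range(120)]
for start_day, w in rolling_schedule(data):
    print(start_day, [round(x, 3) for x in w])
```

The important design point is that each set of weights is fitted only on data available before the period it is applied to. Setting `window` to the full training length and rebalancing once would reduce this to the Crew A pattern.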
In crypto, that difference is not a minor implementation detail. It is the main plot.
Static optimisation concentrates; rolling optimisation adapts
The weight table in the paper is more revealing than the generic “adaptive strategies work” phrasing around it.
Crew A’s static optimiser produces a highly concentrated portfolio: 64.90% in SOL, 30.99% in SHIB, 4.11% in DOGE, and zero allocation to the other seven assets. That is technically an optimisation result under the given assumptions. It is also the sort of result that should make a risk committee ask whether the optimiser has discovered genius or merely learned to stare lovingly at recent winners.
Crew B’s rolling optimiser is more distributed. Its reported weights, which appear to be averages over the rolling windows, cover all ten assets, with AVAX at 18.48%, SHIB at 13.75%, SOL at 12.71%, BNB at 12.13%, BTC at 9.55%, ETH at 8.81%, and smaller allocations to XRP, DOGE, ADA, and DOT.
| Asset | Equal weight | Crew A static | Crew B rolling |
|---|---|---|---|
| BTC | 0.1000 | 0.0000 | 0.0955 |
| ETH | 0.1000 | 0.0000 | 0.0881 |
| BNB | 0.1000 | 0.0000 | 0.1213 |
| SOL | 0.1000 | 0.6490 | 0.1271 |
| XRP | 0.1000 | 0.0000 | 0.0772 |
| DOGE | 0.1000 | 0.0411 | 0.0775 |
| ADA | 0.1000 | 0.0000 | 0.0663 |
| AVAX | 0.1000 | 0.0000 | 0.1848 |
| SHIB | 0.1000 | 0.3099 | 0.1375 |
| DOT | 0.1000 | 0.0000 | 0.0248 |
This is not just a technical curiosity. It explains why static optimisation can look elegant in a spreadsheet and fragile in a market. A single mean-variance optimisation over a historical sample can easily amplify the period’s strongest realised winners. In a regime-shifting asset class, that may become a beautiful backtest with an expiry date.
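One way to put a number on that concentration is the inverse Herfindahl index, often read as the "effective number of assets". The metric is chosen here for illustration and is not one the paper reports; the weights are the ones in the table above.

```python
def effective_assets(weights):
    """Inverse Herfindahl index: 1 / sum of squared weights."""
    return 1.0 / sum(w * w for w in weights)

equal_weight = [0.10] * 10
crew_a = [0.6490, 0.3099, 0.0411] + [0.0] * 7  # SOL, SHIB, DOGE, then seven zeros
crew_b = [0.0955, 0.0881, 0.1213, 0.1271, 0.0772,
          0.0775, 0.0663, 0.1848, 0.1375, 0.0248]

for name, w in [("Equal weight", equal_weight),
                ("Crew A static", crew_a),
                ("Crew B rolling", crew_b)]:
    print(f"{name}: {effective_assets(w):.2f} effective assets")
```

On these figures the static portfolio behaves like roughly 1.9 equally weighted assets, against roughly 8.5 for the rolling weights and 10 for the baseline.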
Rolling optimisation, by contrast, does not need to believe that one training-period allocation is timeless. It keeps updating its view of recent risk and return. That is also where the paper’s empirical result comes from.
The numbers favour adaptivity, not magic
The in-sample comparison is straightforward. The paper reports that Crew A improves the Sharpe ratio from 0.48 for equal weighting to 0.60 for static optimised weights. Crew B raises the Sharpe ratio to 1.00, with lower volatility and higher expected return than the static approach: 10.2% volatility versus 12.4%, and 9.9% expected return versus 7.5%.
A later comparative table reports different, more heavily rounded metrics: equal weighting has an expected return of 8%, volatility of 15%, Sharpe ratio of 0.53, Sortino ratio of 0.75, and maximum drawdown of -20%; Crew A has expected return of 10%, volatility of 12%, Sharpe of 0.83, Sortino of 1.10, and maximum drawdown of -15%; Crew B has expected return of 10%, volatility of 10%, Sharpe of 1.00, Sortino of 1.30, and maximum drawdown of -15%.
The inconsistency between the narrative figures and the later report table is not fatal, but it is worth noticing, and it is not uniformly small: Crew A’s in-sample Sharpe ratio is 0.60 in the narrative and 0.83 in the table. It suggests the article should lean on the directional conclusion rather than pretending every reported metric forms a perfectly polished institutional factsheet. Academic PDFs are sometimes built like rockets; sometimes they are built like spreadsheets wearing a lab coat.
The core pattern remains clear:
| Metric | Equal weight | Crew A static | Crew B rolling | Interpretation |
|---|---|---|---|---|
| Expected return | 8% | 10% | 10% | Optimisation improves return over naive allocation |
| Volatility | 15% | 12% | 10% | Rolling optimisation reduces realised risk in the reported table |
| Sharpe ratio | 0.53 | 0.83 | 1.00 | Risk-adjusted performance improves most under Crew B |
| Sortino ratio | 0.75 | 1.10 | 1.30 | Downside-adjusted performance also favours Crew B |
| Max drawdown | -20% | -15% | -15% | Both optimised strategies improve drawdown in-sample |
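The Sharpe row of this table can be reproduced from its own return and volatility rows if the risk-free rate is taken as zero. That zero rate is an assumption made here because it makes the arithmetic line up; the figures quoted above do not say which rate the paper uses.

```python
def sharpe_ratio(expected_return, volatility, risk_free=0.0):
    """Excess return per unit of volatility."""
    return (expected_return - risk_free) / volatility

# (expected return, volatility, reported Sharpe) from the comparative table.
table = {
    "Equal weight": (0.08, 0.15, 0.53),
    "Crew A static": (0.10, 0.12, 0.83),
    "Crew B rolling": (0.10, 0.10, 1.00),
}

for name, (ret, vol, reported) in table.items():
    implied = sharpe_ratio(ret, vol)
    print(f"{name}: implied {implied:.2f}, reported {reported:.2f}")
```

A quick check like this is a cheap way to confirm that a reported table is at least internally consistent before building on it.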
Out-of-sample, the paper reports that Crew A’s Sharpe ratio drops to 0.36, while Crew B maintains 0.72. The agent-generated report for Crew B gives test-set expected return of 8%, volatility of 11%, Sharpe of 0.72, Sortino of 1.1, maximum drawdown of -18%, moderately low liquidity risk, and no detected regime change.
That is the strongest evidence in the paper: not that Crew B wins in-sample, but that it degrades less severely when moved to unseen data. In finance, every overfitted strategy looks clever until the calendar turns the page. Crew B still deteriorates, but it deteriorates less.
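"Deteriorates less" can be made concrete by looking at how much of the in-sample Sharpe ratio each strategy keeps. The sketch below uses the narrative figures quoted above (0.60 and 1.00 in-sample, 0.36 and 0.72 out-of-sample); the retention ratio itself is an illustrative summary, not a statistic from the paper.

```python
in_sample = {"Crew A static": 0.60, "Crew B rolling": 1.00}
out_of_sample = {"Crew A static": 0.36, "Crew B rolling": 0.72}

def retention(before, after):
    """Share of the in-sample Sharpe ratio that survives out-of-sample."""
    return after / before

for name in in_sample:
    kept = retention(in_sample[name], out_of_sample[name])
    print(f"{name}: keeps {kept:.0%}, loses {1 - kept:.0%}")
```

On these numbers Crew A keeps about 60% of its in-sample Sharpe ratio and Crew B about 72%. Using the later table's 0.83 for Crew A instead would make its drop steeper, not shallower.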
The CrewAI layer is operational plumbing with strategic consequences
The most useful business reading is not “use CrewAI and get better returns.” The paper does not isolate CrewAI as the cause of improved portfolio performance. Crew B wins because it uses rolling optimisation. CrewAI is the orchestration layer that structures the process.
That still matters.
In a financial organisation, a portfolio model is rarely just a formula. It is a workflow. Someone must ingest data, clean it, check anomalies, split samples, compute metrics, produce reports, preserve artefacts, compare against benchmarks, and explain the decision path to stakeholders who may not enjoy being told “the Python notebook said so.”
The paper’s architecture maps these tasks into separate agents. That gives the system three operational advantages:
| Agentic feature | Operational consequence | Business relevance |
|---|---|---|
| Modularity | Each component can be upgraded without rebuilding the whole pipeline | Easier experimentation with new assets, cost models, benchmarks, or risk modules |
| Auditability | Intermediate outputs can be logged and reviewed | Better fit for supervised investment processes and compliance review |
| Scalability | New agents can be added for additional data sources, checks, or reports | More flexible research-to-production workflow |
This is where “agentic AI” earns its keep. Not by having mystical market intuition, but by making the investment process less monolithic. A data-cleaning agent can be improved. A risk-reporting agent can be expanded. A benchmark-comparison agent can be added. A file-checking agent can flag incoherent outputs. The workflow becomes inspectable.
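The shape of such a workflow can be shown without any agent framework at all. The sketch below is plain Python, not the CrewAI API and not the paper's code: each stage is a named function, and the runner records every intermediate output so it can be reviewed later. All names and the sample prices are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Stage:
    name: str
    run: Callable[[Any], Any]

@dataclass
class Pipeline:
    stages: list
    audit_log: list = field(default_factory=list)

    def execute(self, payload=None):
        """Run stages in order, logging each intermediate output."""
        for stage in self.stages:
            payload = stage.run(payload)
            self.audit_log.append((stage.name, payload))
        return payload

def load_prices(_):
    return [100.0, None, 102.0, 101.0, 105.0]  # hypothetical closes with a gap

def clean_prices(prices):
    return [p for p in prices if p is not None]  # drop missing observations

def compute_returns(prices):
    return [b / a - 1 for a, b in zip(prices, prices[1:])]

def report(returns):
    return {"n_returns": len(returns), "mean_return": sum(returns) / len(returns)}

pipeline = Pipeline([
    Stage("load", load_prices),
    Stage("clean", clean_prices),
    Stage("returns", compute_returns),
    Stage("report", report),
])
result = pipeline.execute()
for name, output in pipeline.audit_log:
    print(name, "->", output)
```

Swapping `clean_prices` for a stricter cleaner, or appending a new checking stage, changes one entry in the list and leaves the rest untouched. That is the modularity and auditability argument in miniature.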
For operators, that is often more valuable than a slightly shinier optimiser. Real organisations do not only need models that work. They need models that can be maintained, explained, reviewed, monitored, and blamed in an orderly fashion when something inevitably goes sideways.
What the paper directly shows
The paper directly shows four things.
First, it demonstrates that a CrewAI-style multi-agent workflow can be used to automate a crypto portfolio research pipeline from data handling to final reporting. The agents do not merely chat; they are attached to defined tools and roles in the workflow.
Second, it shows that for the selected ten-asset crypto universe and the specific 2020–2025 sample, rolling 30-day Sharpe maximisation produces better reported risk-adjusted performance than static optimisation and equal weighting.
Third, it shows that the rolling approach maintains stronger out-of-sample Sharpe performance than the static approach, even though performance still deteriorates from the training period.
Fourth, it illustrates that the outputs of agentic portfolio systems can be structured into readable reports, including strategy comparisons, train/test performance, liquidity-risk notes, and regime-change flags.
That is enough to be interesting. It is not enough to be a trading product.
What Cognaptus infers for business use
The business inference is that agentic systems may become a useful operating layer for investment research and portfolio tooling, especially in volatile and data-rich markets such as crypto.
A crypto wealth app could use this pattern to generate explainable allocation reports for users. An exchange could package portfolio analytics as a premium tool. A robo-adviser could use agents to separate suitability intake, strategy construction, risk scoring, and reporting. An asset manager could use this architecture internally for strategy prototyping, where each module leaves a reviewable trace.
But the deployment pathway should look more like an engineering roadmap than a product launch poster.
| Stage | What to build | What must be validated |
|---|---|---|
| Research prototype | Replicate the paper’s pipeline and metrics | Data integrity, metric consistency, reproducibility |
| Backtest extension | Add fees, spreads, slippage, and larger asset universes | Whether performance survives realistic frictions |
| Risk upgrade | Add CVaR, stress tests, liquidity constraints, and drawdown controls | Whether the strategy behaves under tail events |
| Paper trading | Run forward tests without capital | Whether live signals match backtest assumptions |
| Controlled deployment | Use small allocations, strict monitoring, and human approval | Whether operational risk is manageable |
The likely first commercial value is not autonomous trading. It is analyst productivity: faster strategy iteration, cleaner reports, better audit trails, and more disciplined comparison between allocation rules.
In plain business language: the system may help teams ask better portfolio questions faster. It does not absolve them from answering those questions properly.
The out-of-sample result is encouraging, but not a licence to trade
The out-of-sample Sharpe of 0.72 for Crew B is the paper’s most practically relevant performance number. It indicates that the rolling strategy did not collapse when evaluated on the held-out portion of the sample.
Still, several boundaries materially affect interpretation.
Transaction costs are ignored. For a strategy that rebalances every 30 days, this is not a decorative limitation. Crypto markets can impose exchange fees, bid-ask spreads, market impact, transfer costs, and custody-related frictions. The more frequently a strategy updates, the more these frictions matter.
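A first-pass cost model only needs the weight changes at each rebalance. The sketch below charges a proportional rate on turnover, defined here as the sum of absolute weight changes (buys plus sells). The 10 basis point rate is a hypothetical placeholder, and the one-shot move from equal weighting to the Crew A weights is only an illustration, not a trade the paper describes.

```python
def turnover(old, new):
    """Sum of absolute weight changes (buys plus sells) across all assets."""
    assets = set(old) | set(new)
    return sum(abs(new.get(a, 0.0) - old.get(a, 0.0)) for a in assets)

def rebalance_cost(old, new, cost_rate):
    """Cost of one rebalance as a fraction of portfolio value."""
    return cost_rate * turnover(old, new)

ASSETS = ["BTC", "ETH", "BNB", "SOL", "XRP", "DOGE", "ADA", "AVAX", "SHIB", "DOT"]
equal_weight = {a: 0.10 for a in ASSETS}
crew_a = {"SOL": 0.6490, "SHIB": 0.3099, "DOGE": 0.0411}  # other assets are zero

COST_RATE = 0.001  # hypothetical 10 bps per unit traded

traded = turnover(equal_weight, crew_a)
print(round(traded, 4))  # 1.5178
print(round(rebalance_cost(equal_weight, crew_a, COST_RATE), 6))
```

For the rolling strategy the same charge would recur at every 30-day rebalance, so the per-window turnover is the number worth measuring before trusting the reported Sharpe ratios.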
Tail risk is not modelled through CVaR or comparable extreme-loss measures. Maximum drawdown is reported, but crypto losses can be discontinuous, liquidity-dependent, and exchange-specific. A portfolio can look controlled under daily close-to-close data while still being exposed to intraday gaps, depegging events, trading halts, liquidation cascades, or sudden liquidity withdrawal.
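For reference, a historical CVaR is simple to compute once a return series exists. The sketch below averages the worst share of daily returns and sets it beside a maximum drawdown calculation; the return series is invented for illustration, and the tail-size rounding rule is one simple convention among several.

```python
def historical_cvar(returns, alpha=0.90):
    """Average of the worst (1 - alpha) share of returns (at least one observation)."""
    ordered = sorted(returns)
    n_tail = max(1, round(len(ordered) * (1 - alpha)))
    tail = ordered[:n_tail]
    return sum(tail) / len(tail)

def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded value path."""
    value, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        value *= 1 + r
        peak = max(peak, value)
        worst = min(worst, value / peak - 1)
    return worst

# Hypothetical series: calm days, one crash day, and one smaller loss.
returns = [0.01, -0.01] * 5 + [-0.30] + [0.01, -0.01] * 4 + [-0.05]

print(f"CVaR (90%): {historical_cvar(returns):.3f}")
print(f"Max drawdown: {max_drawdown(returns):.3f}")
```

The two measures answer different questions: drawdown describes the worst cumulative path, while CVaR describes how bad the average bad day is. Neither sees anything that happens between daily closes.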
The 30-day window is heuristic. It may work well in this sample, but the paper does not establish that 30 days is optimal, stable, or robust across regimes. Shorter windows may react faster but overfit noise. Longer windows may be more stable but slower to adapt. There is no free lunch, only different flavours of indigestion.
The universe is limited to ten major coins. That improves liquidity realism but narrows generalisation. A strategy built on top-market-cap assets may behave differently when applied to mid-cap tokens, DeFi governance tokens, stablecoin pairs, derivatives, or newly listed assets.
Finally, the paper does not run live forward testing. Historical train/test splits are useful, but markets are annoyingly aware of the passage of time. A deployable system would need paper trading, monitoring, model governance, and post-trade attribution before anyone should attach real money and a confident facial expression.
The agent-generated reports are useful, but should not be mistaken for validation
One subtle part of the paper is that the agents produce final reports recommending the optimised or rolling strategies. These reports are readable, structured, and operationally convenient. That is a good feature.
But a generated report is not independent evidence. It summarises the metrics produced by the system; it does not validate the assumptions behind them. A reporting agent can make the workflow more transparent, but it cannot magically solve transaction-cost omission, regime fragility, or tail-risk blindness.
This is where financial AI products often get into trouble. A neatly formatted explanation creates a feeling of control. Sometimes that feeling is deserved. Sometimes it is just PowerPoint wearing a seatbelt.
The right role for these reports is audit support. They should help humans inspect what was done, compare alternatives, and identify weak points. They should not become ceremonial approvals that convert backtest outputs into investment decisions.
The correct takeaway: adaptivity drives performance; agents organise the machine
The cleanest reading of the paper is this:
| Question | Answer |
|---|---|
| What improves portfolio metrics? | Rolling optimisation over recent data |
| What does CrewAI contribute? | Modular orchestration, reporting, and auditability |
| What does the paper not prove? | That LLM agents can autonomously trade crypto profitably in live markets |
| Where is the business opportunity? | Research automation, explainable portfolio tooling, and auditable strategy workflows |
| What must be tested next? | Costs, slippage, tail risk, window sensitivity, broader universes, and forward performance |
That distinction is the article’s main correction to the obvious hype reading. The agents are not the alpha fairy. The allocation rule does the financial work. The multi-agent architecture makes the process easier to assemble, inspect, and extend.
For financial firms, that is still meaningful. The hardest part of portfolio automation is not always choosing a formula. It is building a system around the formula that survives data messiness, governance demands, model revisions, reporting needs, and the cheerful chaos of live markets.
Conclusion: this is a workflow paper disguised as a portfolio paper
The paper’s strongest result is that adaptive allocation beats static allocation in the tested crypto setting. Its more durable business contribution is showing how that allocation process can be decomposed into an agentic workflow.
That workflow framing is the part worth keeping. A CrewAI portfolio system can separate data preparation, optimisation, evaluation, benchmark comparison, report generation, and consistency checks. For volatile markets, this modularity is not just software elegance. It is a way to keep research pipelines flexible without turning every change into a full system rewrite.
But the paper should not be read as evidence that agentic AI has solved crypto portfolio management. It has not. The backtest is limited, frictions are absent, tail risk is thinly handled, and the 30-day rolling window needs sensitivity testing.
So the practical conclusion is sober and useful: agentic AI can make portfolio research more operationally scalable and auditable, while adaptive optimisation can improve risk-adjusted performance in the tested setting. Those are two different claims. Keeping them separate is how one avoids both underestimating the paper and accidentally selling snake oil with a GitHub repo.
Cognaptus: Automate the Present, Incubate the Future.
[1] Antonino Castelli, Paolo Giudici, and Alessandro Piergallini, “Building crypto portfolios with agentic AI,” arXiv:2507.20468, 2025, https://arxiv.org/pdf/2507.20468.