Mirror, Signal, Trade: How Self‑Reflective Agent Teams Outperform in Backtests

The Takeaway

A new paper proposes TradingGroup, a five‑agent, self‑reflective trading team with a dynamic risk module and an automated data‑synthesis pipeline. In backtests on five US stocks, the framework beats rule‑based, ML, RL, and prior LLM agents. The differentiator isn’t a fancier model; it’s the workflow design: agents learn from their own trajectories, and the system continuously distills those trajectories into fine‑tuning data.

What’s actually new here?

Most “LLM trader” projects look similar: sentiment, fundamentals, a forecaster, and a decider. TradingGroup’s edge comes from three design choices:

Self‑reflection that’s grounded in outcomes. Instead of generic memory or RAG, agents pull labeled wins and losses from recent trades/predictions and prepend an “experience summary” before reasoning. That forces situational lessons (“why did we misread that pullback?”) into the next decision.
Style‑aware risk as a first‑class citizen. A Style‑Preference Agent picks aggressive / balanced / conservative and pipes coefficients to a risk manager that sets dynamic take‑profit / stop‑loss from short‑horizon volatility. It also controls allocation (e.g., conservative buys at 50% size).
A data factory for post‑training. Every agent’s prompt, CoT, action, PnL deltas, and evaluation labels flow into a distillation dataset. A small LoRA pass on Qwen3‑8B (≈0.53% trainable params, int8) then lifts the whole team’s performance, which is evidence that workflow → better data → better model is the right loop.

The agents & the gates (in plain English)

News‑Sentiment. Fetch, re‑rank (small reranker), deduplicate, then summarize into a market sentiment score.

Financial‑Report. Hybrid retrieval on filings (dense+sparse), re‑rank to isolate price‑linked passages, then summarize “what matters” (guidance, margins, risks).

Stock‑Forecasting. Classic features (RSI‑14, deviation from 20‑SMA, distance to 20‑day high/low, HV‑10, a simplified ATR‑20 proxy) combine with a hybrid gate:

If RSI is hot and price hasn’t exceeded an ATR‑scaled breakout threshold, force “sideways” to avoid chasing tops.
Else let the LLM’s up/down probs decide—but only if at least one bullish/bearish technical condition also fires.
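
A minimal sketch of that gate, to make the two rules concrete. The breakout threshold max(1%, 0.5 × ATR20%) and the RSI>70 cutoff come from the implementation table in this post; the `GateInputs` fields, the probability cutoff `tau`, and the "default to sideways when nothing confirms" fallback are assumptions for illustration, not the paper's code.

```python
from dataclasses import dataclass


@dataclass
class GateInputs:
    rsi14: float            # RSI over 14 bars, 0..100
    close: float            # latest close
    high_20: float          # prior 20-bar high
    atr20_pct: float        # ATR-20 as a fraction of price (0.02 == 2%)
    p_up: float             # LLM probability of an up move
    p_down: float           # LLM probability of a down move
    bullish_confirm: bool   # at least one bullish technical condition fired
    bearish_confirm: bool   # at least one bearish technical condition fired


def gate_trend(x, rsi_hot=70.0, tau=0.6, min_breakout=0.01, atr_mult=0.5):
    """Return 'up', 'down', or 'sideways' (tau is a hypothetical cutoff)."""
    breakout_threshold = max(min_breakout, atr_mult * x.atr20_pct)
    breakout_level = x.high_20 * (1.0 + breakout_threshold)

    # No-chase clamp: hot RSI without a confirmed breakout is forced sideways.
    if x.rsi14 > rsi_hot and x.close < breakout_level:
        return "sideways"

    # Soft pass: the LLM view counts only with technical confirmation.
    if x.p_up > tau and x.bullish_confirm:
        return "up"
    if x.p_down > tau and x.bearish_confirm:
        return "down"

    # Assumed fallback: no confirmed view means no directional call.
    return "sideways"
```

With `tau` above 0.5 the up and down branches cannot both fire, so the order of the two checks does not matter.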

Style‑Preference. Looks at recent performance and current account state to choose a style; this directly sets position sizing and risk thresholds.

Trading‑Decision. Aggregates everything, prepends an experience memo (good/bad cases), and issues Buy/Hold/Sell with rationale.
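
To show what "prepends an experience memo" can look like in practice, here is a small sketch of a reflection store. The names `ReflectionStore` and `summarize_last_k_cases` follow the implementation table in this post; the `Case` fields, the memo wording, and the selection rule (largest wins and largest losses in the rolling window, rather than true similarity-based analogs) are simplifying assumptions.

```python
import json
from collections import deque
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Case:
    inputs: str       # serialized features / context shown to the agent
    cot: str          # the agent's reasoning for this decision
    action: str       # e.g. "buy", "hold", "sell"
    pnl_delta: float  # realized PnL change as a fraction (0.01 == 1%)
    label: str        # "good" or "bad", assigned after the outcome is known


class ReflectionStore:
    def __init__(self, path=None, k=20):
        # path is optional: with None the store is in-memory only.
        self.path = Path(path) if path is not None else None
        self.recent = deque(maxlen=k)  # rolling window of the last k cases

    def write(self, case):
        self.recent.append(case)
        if self.path is not None:
            with self.path.open("a", encoding="utf-8") as f:
                f.write(json.dumps(asdict(case)) + "\n")

    def summarize_last_k_cases(self, n_good=2, n_bad=2):
        cases = list(self.recent)
        good = sorted((c for c in cases if c.label == "good"),
                      key=lambda c: c.pnl_delta, reverse=True)[:n_good]
        bad = sorted((c for c in cases if c.label == "bad"),
                     key=lambda c: c.pnl_delta)[:n_bad]
        lines = ["Experience summary (recent labeled cases):"]
        for tag, group in (("GOOD", good), ("BAD", bad)):
            for c in group:
                lines.append(f"- {tag} {c.action} ({c.pnl_delta:+.2%}): {c.cot}")
        return "\n".join(lines)


def build_prompt(store, context):
    """Prepend the experience memo to the decision context."""
    return store.summarize_last_k_cases() + "\n\n" + context
```

Swapping the magnitude-based pick for a similarity search over `inputs` would bring this closer to the "analogs" idea, at the cost of an embedding step.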

Evidence that the plumbing matters

The paper’s headline is not a magic indicator. It’s that self‑reflection + dynamic risk + curated distillation improves both returns and risk. In ablations, enabling the risk manager generally reduces drawdowns; adding reflection and better retrieval improves returns without exploding volatility. A LoRA‑tuned Qwen3‑Trader‑8B (trained on the system’s own dataset) then outperforms the base model across all five tickers.

Strategic read: Your model family matters less than whether your agents write clean, label‑rich logs and use them to improve tomorrow’s prompts.

What Cognaptus should adopt (and how)

Below is a direct mapping from TradingGroup’s ideas to our stack (strategyr + tradesimr + agentr + okxr/binxr).

TradingGroup idea	Why it works	Cognaptus implementation sketch
Outcome‑grounded self‑reflection for forecaster & decider	Prevents repeating recent failure modes; makes reasoning situational	In `agentr`, add `ReflectionStore` that writes: (inputs, CoT, action, PnL_delta, label). Before each decision, inject `summarize_last_k_cases()` with 2 good/2 bad analogs.
Hybrid gates around the LLM	Keeps “reasoning” inside risk‑aware bounds	In `strategyr`, expose `gate_breakout_threshold = max(1%, 0.5 * ATR20%)`; clamp trend=sideways when RSI>70 but the breakout is still below the threshold; require at least one confirmatory pattern before accepting LLM up/down.
Style‑aware risk tied to account state	Turns “risk appetite” into math (size, SL/TP)	In `tradesimr`, add `style_profile` → sets cash allocation (1.0/1.0/0.5) and `TP/SL = m_style * σ10d` (unannualized). Trigger hard exits when |PnL| crosses the threshold.
Data‑synthesis → PEFT	Small, frequent fine‑tunes on your own trajectories	Nightly job writes JSONL of agent runs; keep only high‑quality (e.g., `reward_a>0` or `w_hit>0`). Fine‑tune a compact model (e.g., 7–8B) with LoRA int8.
News/file reranking	Cuts noise; focuses on price‑relevant snippets	Use a tiny cross‑encoder (or even heuristic priors) to select “market‑moving” items; fall back to filings when news sparse.
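
The style row in the table reduces to a few lines of arithmetic. The sketch below assumes the 1.0/1.0/0.5 allocation from the table and TP/SL as a multiple of unannualized 10-bar return volatility; the multiplier values in `STYLE_PROFILES` and the function names are placeholders, not numbers from the paper.

```python
import statistics

# Sizes follow the table (1.0/1.0/0.5); m_tp and m_sl are hypothetical values.
STYLE_PROFILES = {
    "aggressive":   {"size": 1.0, "m_tp": 3.0, "m_sl": 2.0},
    "balanced":     {"size": 1.0, "m_tp": 2.0, "m_sl": 1.5},
    "conservative": {"size": 0.5, "m_tp": 1.5, "m_sl": 1.0},
}


def risk_levels(closes, style, window=10):
    """Position size plus TP/SL distances as fractions of entry price."""
    if len(closes) < window + 1:
        raise ValueError("need at least window + 1 closes")
    recent = closes[-(window + 1):]
    returns = [b / a - 1.0 for a, b in zip(recent, recent[1:])]
    sigma = statistics.stdev(returns)  # unannualized short-horizon volatility
    profile = STYLE_PROFILES[style]
    return {
        "size": profile["size"],
        "take_profit": profile["m_tp"] * sigma,
        "stop_loss": profile["m_sl"] * sigma,
    }


def should_exit(pnl_pct, levels):
    """Hard exit when open PnL crosses either the TP or the SL threshold."""
    return pnl_pct >= levels["take_profit"] or pnl_pct <= -levels["stop_loss"]
```

Because both thresholds scale with `sigma`, the same style widens its stops in volatile regimes and tightens them in quiet ones.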

Pragmatic recipe (crypto‑perps, 4H/15m)

Indicators (vectorized): RSI14, ATR20% (true ATR or the paper’s close‑to‑close proxy), dev20SMA%, distance_to_20d_high/low%, HV10% (unannualized for intraday).
Gating rules:
- No‑chase clamp: if RSI14>70 and price has not cleared the 20d high by at least breakout_threshold = max(1%, 0.5*ATR20%) → force sideways.
- Soft pass: Allow LLM uptrend only if p_up>τ and either (a) new 20‑bar high or (b) dev20SMA within a healthy pullback band (e.g., 0 to −3%) with RSI<40. Mirror for downtrends.
Style & risk: map aggressive/balanced/conservative to size = 1.0/1.0/0.5 and m_tp, m_sl multipliers on σ(returns) over last N bars.
Reflection windows: keep rolling k=20 decisions; label hits for forecaster (w_hit using volatility‑scaled bands) and rewards for decider (reward_a = r_eq - β·r_bm - γ·cost). Surface two “what changed my mind?” bullets before each new action.
Data discipline: store prompts, CoT, features, decisions, fills, PnL, labels every bar; nightly filter to high‑quality SFT samples.
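
The labeling and nightly filter from the recipe can be sketched as below. The reward formula is the one given above (reward_a = r_eq - β·r_bm - γ·cost); the record schema, the default β and γ, and the assumption that `w_hit` has already been computed upstream are all illustrative choices.

```python
import json


def reward_a(r_eq, r_bm, cost, beta=1.0, gamma=1.0):
    """Decider reward: equity return minus scaled benchmark return and cost."""
    return r_eq - beta * r_bm - gamma * cost


def filter_sft_samples(records, beta=1.0, gamma=1.0):
    """Keep decider runs with reward_a > 0 and forecaster runs with w_hit > 0."""
    kept = []
    for rec in records:
        agent = rec.get("agent")
        if agent == "decider":
            score = reward_a(rec["r_eq"], rec["r_bm"], rec["cost"], beta, gamma)
            if score > 0:
                kept.append({**rec, "reward_a": score})
        elif agent == "forecaster" and rec.get("w_hit", 0.0) > 0:
            kept.append(rec)
    return kept


def to_jsonl(records):
    """Serialize the kept samples as one JSON object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)
```

Setting β to 1 means a long position that merely tracks the benchmark earns no reward once costs are subtracted, which keeps "riding the market" out of the fine-tuning set.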

Where I’m skeptical (and how we hedge it)

Dataset & horizon: Results are on daily equities in a defined window; crypto perps with 15m/4H regimes are noisier and regime‑shift faster. Mitigation: shorten reflection windows; weight by recency and volatility regimes.
News coverage gaps: Two symbols had little or no news; the framework still performs—suggesting gates carry the weight. That’s fine, but don’t over‑index on sentiment plumbing.
“Simplified ATR”: using close‑to‑close stdev is fast but misses gap/true‑range dynamics; perps need true ATR on high/low for TP/SL realism.
PEFT on synthetic labels: Great for alignment, risky for leakage/overfitting to your simulator. Keep a frozen test period and evaluate in live paper trading every week.
Baseline fairness: Some baselines are under‑tuned (common in papers). Judge the delta from your current bot, not absolute numbers.
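
The "simplified ATR" concern is easy to see side by side. The sketch below compares a true-range ATR (as a fraction of price) with a close-to-close return stdev over the same bars. It uses a simple mean of true ranges rather than Wilder smoothing, and the function names are hypothetical; the proxy follows this post's description, not necessarily the paper's exact formula.

```python
import statistics


def true_range(high, low, prev_close):
    """Largest of the bar range and the two gaps against the prior close."""
    return max(high - low, abs(high - prev_close), abs(low - prev_close))


def atr_pct(highs, lows, closes, n=20):
    """Simple-mean ATR over the last n bars, as a fraction of the last close."""
    if not (len(highs) == len(lows) == len(closes)):
        raise ValueError("highs, lows, closes must have equal length")
    if len(closes) < n + 1:
        raise ValueError("need at least n + 1 bars")
    start = len(closes) - n
    ranges = [true_range(highs[i], lows[i], closes[i - 1])
              for i in range(start, len(closes))]
    return (sum(ranges) / n) / closes[-1]


def close_to_close_proxy(closes, n=20):
    """Stdev of the last n simple close-to-close returns."""
    if len(closes) < n + 1:
        raise ValueError("need at least n + 1 closes")
    recent = closes[-(n + 1):]
    returns = [b / a - 1.0 for a, b in zip(recent, recent[1:])]
    return statistics.stdev(returns)
```

A bar that wicks 2% in both directions but closes flat contributes nothing to the proxy and a full 4% to true ATR, which is exactly the case where a stop placed off the proxy gets hit intrabar.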

Two‑week build plan (Cognaptus)

Week 1

Implement gate_* helpers and style‑aware risk in strategyr & tradesimr.
Add ReflectionStore with JSONL writes; wire summarize_last_k_cases() into prompts.
Ship a minimal Forecaster → Decider loop for one perp (e.g., ETH‑USDT‑SWAP, 15m).

Week 2

Start nightly data‑synthesis → PEFT (LoRA) on a small local model; keep only positive‑reward/hit samples.
Add a micro news/file reranker (fallback to purely technical when sparse).
Run paper‑trading with the new style+risk discipline and compare against your current agent.

Bottom line

TradingGroup’s contribution is less about inventing a new indicator and more about closing the loop: agents make decisions, label their own outcomes, reflect, and train on their best trajectories. Do that well, and even a modest base model becomes a sharper trader.

Cognaptus: Automate the Present, Incubate the Future.
