
Hypotheses, Not Hunches: What an AI Data Scientist Gets Right

Most “AI for analytics” pitches still orbit model metrics. The more interesting question for executives is: What should we do next, and why? A recent paper proposes an AI Data Scientist—a team of six LLM “subagents” that march from raw tables to clear, time‑boxed recommendations. The twist isn’t just automation; it’s hypothesis‑first reasoning. Instead of blindly optimizing AUC, the system forms crisp, testable claims (e.g., “active members are less likely to churn”), statistically validates them, and only then engineers features and trains models. The output is not merely predictions—it’s an action plan with KPIs, timelines, and rationale. ...
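
The hypothesis-first step can be sketched as a plain two-proportion z-test (a minimal stdlib sketch; the churn counts below are invented for illustration, not from the paper): only a claim that survives the test graduates to feature engineering.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p_value

# Hypothesis: "active members are less likely to churn."
# Invented counts: 120/1000 active members churned vs 240/1000 inactive.
z, p = two_proportion_z(120, 1000, 240, 1000)
validated = p < 0.05  # only then engineer features and train models
```

The point is the ordering: the statistical gate comes before any modeling, so every downstream feature traces back to a validated claim.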

August 26, 2025 · 5 min · Zelina

Mirror, Signal, Trade: How Self‑Reflective Agent Teams Outperform in Backtests

The Takeaway
A new paper proposes TradingGroup, a five‑agent, self‑reflective trading team with a dynamic risk module and an automated data‑synthesis pipeline. In backtests on five US stocks, the framework beats rule‑based, ML, RL, and prior LLM agents. The differentiator isn’t a fancier model; it’s the workflow design: agents learn from their own trajectories, and the system continuously distills those trajectories into fine‑tuning data.

What’s actually new here?
Most “LLM trader” projects look similar: sentiment, fundamentals, a forecaster, and a decider. TradingGroup’s edge comes from three design choices: ...
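
The distillation step can be sketched in a few lines (names are illustrative, not TradingGroup's actual API): record each agent trajectory, score it against realized outcomes, and keep only the steps from trajectories that worked.

```python
# Self-reflection-to-fine-tuning sketch: profitable trajectories become
# (prompt, completion) training pairs; losing ones are discarded.
def distill(trajectories, min_return=0.0):
    """Keep (state, action) pairs from trajectories that ended profitably."""
    dataset = []
    for traj in trajectories:
        if traj["realized_return"] > min_return:
            dataset.extend(
                {"prompt": step["state"], "completion": step["action"]}
                for step in traj["steps"]
            )
    return dataset

trajs = [
    {"realized_return": 0.04,
     "steps": [{"state": "bullish sentiment, low vol", "action": "buy"}]},
    {"realized_return": -0.02,
     "steps": [{"state": "mixed signals", "action": "buy"}]},
]
finetune_data = distill(trajs)  # only the profitable trajectory survives
```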

August 26, 2025 · 5 min · Zelina

Stop at 30k: How Hermes 4 Turns Long Chains of Thought into Shorter Time‑to‑Value

TL;DR
Hermes 4 is an open‑weight “hybrid reasoner” that marries huge synthetic reasoning corpora with carefully engineered post‑training and evaluation. The headline for operators isn’t just benchmark wins—it’s control: control of format, schema, and especially when the model stops thinking. That last bit matters for latency, cost, and reliability.

Why this matters for business readers
If you’re piloting agentic or “think‑step” LLMs, two pains dominate:
- Unbounded reasoning length → blow‑ups in latency and context costs.
- Messy outputs → brittle downstream integrations.

Hermes 4 addresses both with: (a) rejection‑sampled, verifier‑backed reasoning traces to raise answer quality, and (b) explicit output‑format and schema adherence training plus length‑control fine‑tuning to bound variance. That combo is exactly what production teams need. ...
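
The "stop thinking" idea can be sketched as a decode loop that force-closes the reasoning span at a token budget (a hedged sketch: `generate_token` is a stand-in for a real per-token decoder, not Hermes 4's actual interface).

```python
# Length-control sketch: decode until the model closes its own thinking span,
# or until the budget is hit, in which case we force the close tag ourselves.
def decode_with_budget(generate_token, budget=30_000, stop="</think>"):
    tokens = []
    while len(tokens) < budget:
        tok = generate_token(tokens)
        tokens.append(tok)
        if tok == stop:          # model stopped thinking on its own
            return tokens, False
    tokens.append(stop)          # budget exhausted: force the close tag
    return tokens, True

# Toy decoder that never stops on its own, with a tiny budget:
tokens, truncated = decode_with_budget(lambda ts: "step", budget=5)
```

This is why bounded reasoning translates directly into bounded latency and cost: the worst case is the budget, not the model's mood.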

August 26, 2025 · 4 min · Zelina

Words + Returns: Teaching Embeddings to Invest in Themes

How do you turn a fuzzy idea like “AI + chips” into a living, breathing portfolio that adapts as markets move? A new framework called THEME proposes a crisp answer: train stock embeddings that understand both the meaning of a theme and the momentum around it, then retrieve candidates that are simultaneously on‑theme and investment‑suitable. Unlike static ETF lists or naive keyword screens, THEME learns a domain‑tuned embedding space in two steps: first, align companies to the language of themes; second, nudge those semantics with a lightweight temporal adapter that “listens” to recent returns. The result is a retrieval engine that feeds a dynamic portfolio constructor—and in backtests, it beats strong LLM/embedding baselines and even average thematic ETFs on risk‑adjusted returns. ...
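
The retrieval step can be caricatured in a few lines (an illustrative sketch, not THEME's real model: embeddings, tickers, and the blending weight are invented): rank stocks by a blend of semantic similarity to the theme and a simple momentum signal, the way a temporal adapter might nudge the semantic score.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieve(theme_vec, stocks, alpha=0.7, k=2):
    """Rank by alpha * semantic similarity + (1 - alpha) * recent return."""
    scored = [
        (alpha * cosine(theme_vec, s["emb"]) + (1 - alpha) * s["ret_1m"], s["ticker"])
        for s in stocks
    ]
    return [t for _, t in sorted(scored, reverse=True)[:k]]

stocks = [
    {"ticker": "CHIPCO", "emb": [0.9, 0.1], "ret_1m": 0.08},
    {"ticker": "AICORP", "emb": [0.8, 0.3], "ret_1m": 0.02},
    {"ticker": "OILCO",  "emb": [0.1, 0.9], "ret_1m": 0.15},
]
top = retrieve([1.0, 0.0], stocks)  # query vector for the "AI + chips" theme
```

Off-theme momentum alone (OILCO) can't buy its way into the basket; the semantic term dominates, which is the "on-theme and investment-suitable" constraint in miniature.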

August 26, 2025 · 5 min · Zelina

MoA vs. Moat: Agentic LLMs for Drug Competitor Mapping Cut Diligence Time 20×

The punchline
Competitive analysis for drug assets isn’t a tidy table—it’s a scavenger hunt across press releases, registries, investor decks, and alias-riddled drug names. A new paper shows that scaffolded, web-native LLM agents can reliably enumerate true competitors for a given indication, then filter hallucinations with an LLM-as-judge, beating popular “deep research” tools and cutting analyst turnaround from ~2.5 days to ~3 hours. This matters now: the EU’s Joint Clinical Assessments (JCA) regime makes comparator choice visible and consequential; missing a relevant competitor can ripple into pricing, market access, and trial design. In short: MoA (mechanism of action) meets moat (defensible advantage)—and the moat is built from recall. ...
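
The enumerate-then-judge pattern can be sketched as follows (everything here is a stub: `judge` stands in for an LLM-as-judge call, and the drug names are invented). Recall comes from unioning many noisy sources; precision comes from the judge filtering afterward.

```python
# Union-then-filter sketch: cast a wide net across sources, then let the
# judge strike hallucinated candidates before the list reaches an analyst.
def map_competitors(sources, judge):
    candidates = set()
    for found in sources:           # each source yields candidate drug names
        candidates.update(found)    # union maximizes recall
    return sorted(c for c in candidates if judge(c))  # judge restores precision

sources = [
    ["drug-a", "drug-b"],             # e.g., press releases
    ["drug-b", "drug-hallucinated"],  # e.g., web search
]
verified = map_competitors(sources, judge=lambda c: "hallucinated" not in c)
```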

August 25, 2025 · 5 min · Zelina

Preference Chains of Command: Making LLM Agents Pick Like People

The gist
Most “LLM agents for cities” sound magical until you ask them a basic planning question—which mode would this person actually take at 8am in Cambridge? This paper’s answer is refreshingly concrete: put a belief–desire–intention (BDI) graph around the agent, retrieve analogous people and contexts (Graph RAG), score paths through that graph to get prior choice probabilities, then let the LLM remodel those priors with current conditions (weather, time, place). The authors call this a Preference Chain. ...
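
The prior-then-remodel step can be sketched numerically (a minimal sketch with invented weights: the plain multiplier stands in for the LLM's context-aware remodeling): path scores become a prior over modes, and current conditions reshape it.

```python
# Preference Chain sketch: normalize BDI path scores into priors, then apply
# context multipliers and renormalize to get posterior choice probabilities.
def priors_from_paths(path_scores):
    total = sum(path_scores.values())
    return {mode: s / total for mode, s in path_scores.items()}

def remodel(priors, context_weights):
    adjusted = {m: p * context_weights.get(m, 1.0) for m, p in priors.items()}
    z = sum(adjusted.values())
    return {m: v / z for m, v in adjusted.items()}

priors = priors_from_paths({"bike": 3.0, "car": 1.0, "transit": 1.0})
# A rainy 8am in Cambridge: cycling gets down-weighted by the context pass.
posterior = remodel(priors, {"bike": 0.3})
```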

August 25, 2025 · 5 min · Zelina

Put It on the GLARE: How Agentic Reasoning Makes Legal AI Actually Think

Legal judgment prediction (LJP) is one of those problems that exposes the difference between looking smart and being useful. Most models memorize patterns; judges demand reasons. Today’s paper introduces GLARE—an agentic framework that forces the model to widen its hypothesis space, learn from real precedent logic, and fetch targeted legal knowledge only when it needs it. The result isn’t just higher accuracy; it’s a more auditable chain of reasoning.

TL;DR
- What it is: GLARE, an agentic legal reasoning engine for LJP.
- Why it matters: It turns “guess the label” into compare-and-justify—exactly how lawyers reason.
- How it works: Three modules—Charge Expansion (CEM), Precedents Reasoning Demonstrations (PRD), and Legal Search–Augmented Reasoning (LSAR)—cooperate in a loop.
- Proof: Gains of +7.7 F1 (charges) and +11.5 F1 (articles) over direct reasoning; +1.5 to +3.1 F1 over strong precedent‑RAG; double‑digit gains on difficult, long‑tail charges.
- So what: If you’re deploying LLMs into legal ops or compliance, agentic structure > bigger base model.

Why “agentic” beats bigger
The usual upgrades—bigger models, more RAG, longer context—don’t address the core failure mode in LJP: premature closure on a familiar charge and surface‑level precedent matching. GLARE enforces a discipline: ...
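
The three-module loop can be caricatured as follows (a hedged sketch: every function here is a stand-in, not GLARE's real interface): widen the candidate charges, score them against precedent logic, and fetch legal knowledge only when confidence is low.

```python
# Agentic LJP loop sketch: expand hypotheses (CEM), rank with precedent
# reasoning (PRD), and retrieve targeted legal text (LSAR) until confident.
def glare_loop(facts, expand, score_with_precedents, legal_search,
               threshold=0.8, max_rounds=3):
    candidates = expand(facts)                 # CEM: widen the hypothesis space
    evidence = []
    for _ in range(max_rounds):
        ranked = score_with_precedents(facts, candidates, evidence)  # PRD
        best, conf = ranked[0]
        if conf >= threshold:
            return best, conf
        evidence.append(legal_search(best))    # LSAR: targeted retrieval
    return best, conf

# Stubbed run: confidence rises once retrieved evidence enters the loop.
verdict, conf = glare_loop(
    "defendant moved funds between accounts he controlled",
    expand=lambda f: ["fraud", "embezzlement"],
    score_with_precedents=lambda f, cs, ev: (
        [("embezzlement", 0.9)] if ev else [("fraud", 0.5)]),
    legal_search=lambda c: f"statute text for {c}",
)
```

The structure, not the stub logic, is the point: premature closure is blocked because low-confidence answers trigger retrieval instead of a final label.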

August 25, 2025 · 4 min · Zelina

ReAct Without the Chaos: AgentScope 1.0 Turns Tools into Strategy

Thesis: AgentScope 1.0 is less a toolkit and more a discipline for agentic software. By pinning everything to ReAct loops, unifying “message–model–memory–tool,” and adding group-wise tool provisioning, it addresses the real failure mode of agents in production: tool sprawl without control. The evaluation/Studio/runtime trio then turns prototypes into shippable services.

What’s actually new (and why it matters)
1) A crisp core: Message → Model → Memory → Tool
Most frameworks blur these into ad‑hoc objects; AgentScope forces a clean, composable boundary: ...
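
The message–model–memory–tool split can be sketched as a minimal ReAct loop (a sketch in the spirit of the design, not AgentScope's actual API): the model either calls a registered tool or emits a final answer, and every step lands in memory.

```python
# Minimal ReAct loop: Model reasons over Memory, Tools act, observations
# return to Memory, until the model emits a final answer.
def react(model, tools, task, max_steps=5):
    memory = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        thought = model(memory)                      # Model: reason over memory
        memory.append({"role": "assistant", "content": thought})
        if thought["type"] == "final":
            return thought["answer"], memory
        result = tools[thought["tool"]](**thought["args"])  # Tool: act
        memory.append({"role": "tool", "content": result})  # Memory: observe
    raise RuntimeError("step budget exhausted")

# Toy model: call the calculator once, then answer with its observation.
def toy_model(memory):
    if memory[-1]["role"] == "tool":
        return {"type": "final", "answer": memory[-1]["content"]}
    return {"type": "tool", "tool": "add", "args": {"a": 2, "b": 3}}

answer, memory = react(toy_model, {"add": lambda a, b: a + b}, "what is 2+3?")
```

Keeping the four roles separate is what makes group-wise tool provisioning possible: the `tools` dict can be swapped per agent group without touching model or memory.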

August 25, 2025 · 4 min · Zelina

Spin Doctors: Why RL Fine‑Tuning Mostly Rotates, Not Reinvents

The short of it
Reinforcement‑learning fine‑tuning (RL‑FT) often looks like magic: you SFT a model until it aces your dataset, panic when it forgets math or coding edge cases, then run PPO and—voilà—generalization returns. A new paper argues the mechanism isn’t mystical at all: RL‑FT mostly rotates a model’s learned directions back toward broadly useful features, rather than unlocking novel capabilities. In practical terms, cheap surgical resets (shallow layers or top‑rank components) can recover much of that OOD skill without running an expensive RL pipeline. ...
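
The shallow-layer variant of a surgical reset can be sketched as plain checkpoint surgery (an illustrative sketch with made-up layer naming, not the paper's code): copy the shallowest layers back from an earlier checkpoint while keeping the fine-tuned deeper layers.

```python
# Surgical reset sketch: restore layers 0..k from a reference checkpoint,
# keep the rest of the fine-tuned weights untouched.
def surgical_reset(finetuned, reference, reset_layers):
    """Return weights where layers in `reset_layers` come from `reference`."""
    return {
        name: (reference[name]
               if any(name.startswith(f"layers.{i}.") for i in reset_layers)
               else weights)
        for name, weights in finetuned.items()
    }

# Toy "checkpoints" (strings stand in for weight tensors):
finetuned = {"layers.0.attn": "drifted", "layers.1.attn": "drifted",
             "layers.7.attn": "specialized"}
reference = {"layers.0.attn": "original", "layers.1.attn": "original",
             "layers.7.attn": "original"}
merged = surgical_reset(finetuned, reference, reset_layers=[0, 1])
```

If the paper's rotation story holds, this kind of cheap merge recovers much of the OOD behavior that an expensive RL pass would otherwise restore.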

August 25, 2025 · 5 min · Zelina

Charting a Better Bedside: When Agentic RL Teaches RAG to Diagnose

Why this paper matters: Retrieval‑augmented generation (RAG) has been the default answer to “how do we make LLMs factual?” But clinical work is not a single hop to a single document; it’s a workflow—observe, hypothesize, retrieve, cross‑check, and only then decide. Deep‑DxSearch reframes RAG as a sequential policy, trained end‑to‑end with reinforcement learning (RL) so the model learns when to reason internally and when to consult guidelines, match similar patients, or search broader knowledge—before committing to a diagnosis. That design change is the story. ...
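
The "RAG as a sequential policy" framing can be sketched as an episode loop (the action names mirror the ones the summary lists; the policy here is a stub, not the trained RL policy): at each step the agent picks an action, and only `diagnose` ends the episode.

```python
# Sequential-policy sketch: the agent interleaves internal reasoning and
# retrieval actions, committing to a diagnosis only when the policy says so.
ACTIONS = ["reason", "lookup_guidelines", "match_patients", "search", "diagnose"]

def run_episode(policy, case, max_steps=8):
    history = []
    for _ in range(max_steps):
        action = policy(case, history)
        assert action in ACTIONS
        if action == "diagnose":
            return history + [action]
        history.append(action)      # each reasoning/retrieval step is logged
    return history + ["diagnose"]   # forced commitment at the step budget

# Stub policy: reason internally, check guidelines, then commit.
def stub_policy(case, history):
    plan = ["reason", "lookup_guidelines", "diagnose"]
    return plan[len(history)]

trace = run_episode(stub_policy, case="fever, rash, recent travel")
```

End-to-end RL training would replace `stub_policy` with a learned one, which is exactly the design change the summary highlights: when to retrieve becomes a trained decision, not a fixed pipeline stage.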

August 24, 2025 · 5 min · Zelina