
Quants With a Plan: Agentic Workflows That Outtrade AutoML

If AutoML is a fast car, financial institutions need a train with tracks—a workflow that knows where it’s going, logs every switch, and won’t derail when markets regime-shift. A new framework called TS-Agent proposes exactly that: a structured, auditable, LLM-driven agent that plans model development for financial time series instead of blindly searching. Unlike generic AutoML, TS-Agent formalizes modeling as a multi-stage decision process—Model Pre-selection → Code Refinement → Fine-tuning—and anchors each step in domain-curated knowledge banks and reflective feedback from real runs. The result is not just higher accuracy; it’s traceability and consistency that pass governance sniff tests. ...
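To make the staged pipeline concrete, here is a minimal sketch of a logged, multi-stage decision loop in the spirit of TS-Agent. The `AuditLog` and `run_stage` helpers, the prior-score knowledge bank, and the candidate model names are illustrative assumptions, not the paper's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, stage: str, decision: str, rationale: str) -> None:
        # Every "switch on the track" is logged for later governance review.
        self.entries.append({"stage": stage, "decision": decision, "rationale": rationale})

def run_stage(stage: str, candidates: list, priors: dict, log: AuditLog) -> str:
    # Stand-in for the LLM planner consulting a curated knowledge bank:
    # rank candidates by prior score and record the decision with a rationale.
    choice = max(candidates, key=lambda c: priors.get(c, 0.0))
    log.record(stage, choice, f"prior score {priors.get(choice, 0.0):.2f}")
    return choice

log = AuditLog()
model = run_stage("model_pre_selection", ["PatchTST", "DeepAR", "N-BEATS"],
                  {"PatchTST": 0.9, "DeepAR": 0.7, "N-BEATS": 0.6}, log)
# Code Refinement and Fine-tuning would repeat the same pattern, feeding
# validation metrics back in as reflective feedback before the next decision.
print(model, log.entries)
```

The point of the sketch is the audit trail: every stage decision carries a rationale, which is what lets the workflow pass the governance checks the teaser describes.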

August 20, 2025 · 5 min · Zelina

Keys to the Kingdom: How LLMs Can Audit Crypto Logic Before It Breaks

We’ve gotten good at spotting API misuse in crypto code (think “don’t use ECB,” “don’t hardcode IVs”). But many production failures don’t come from the obvious API call—they’re born in the logic that surrounds it: the parameter checks, corner-case math, and brittle “optimizations.” That’s where CryptoScope steps in: an LLM-powered framework that reads crypto code like a human auditor, guided by a domain corpus and structured prompts, to uncover logic-level vulnerabilities without executing the code. ...
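As a rough illustration of the structured-prompt idea, the sketch below pairs the code under audit with retrieved domain guidance and asks for logic-level findings. The template wording, the `build_audit_prompt` helper, and the toy corpus are assumptions for illustration, not CryptoScope's actual prompts:

```python
AUDIT_TEMPLATE = """You are a cryptography auditor. Do not flag API choice alone.
Check the logic around the call: parameter validation, corner-case math,
and brittle "optimizations".

Domain guidance:
{guidance}

Code under review:
{code}

Report each suspected logic flaw as: location, flaw, exploit scenario."""

def build_audit_prompt(code: str, corpus: dict, topic: str) -> str:
    # Stand-in for corpus retrieval; a real system would select guidance
    # relevant to the constructs found in the code.
    return AUDIT_TEMPLATE.format(guidance=corpus.get(topic, "n/a"), code=code)

corpus = {"key-length": "Reject keys shorter than the cipher's native key size."}
snippet = 'if (keylen >= 16) { use_key(key, keylen); }  /* cipher expects 32 bytes */'
print(build_audit_prompt(snippet, corpus, "key-length"))
```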

August 18, 2025 · 4 min · Zelina

Knows the Facts, Misses the Plot: LLMs’ Knowledge–Reasoning Split in Clinical NLI

The gist

A new clinical natural language inference (NLI) benchmark isolates what models know from how they reason—and the results are stark. State‑of‑the‑art LLMs ace targeted fact checks (≈92% accuracy) but crater on the actual reasoning tasks (≈25% accuracy). The collapse is most extreme in compositional grounding (≈4% accuracy), where a claim depends on multiple interacting clinical constraints (e.g., drug × dose × diagnosis × schedule). Scaling yielded fluent prose, not reliable inference. ...

August 18, 2025 · 4 min · Zelina

Skip or Split? How LLMs Can Make Old-School Planners Run Circles Around Complexity

TL;DR

Classical planners crack under scale. You can rescue them with LLMs in two ways: (1) Inspire the next action, or (2) Predict an intermediate state and split the search. On diverse benchmarks (Blocks, Logistics, Depot, Mystery), the Predict route generally solves more cases with fewer LLM calls, except when domain semantics are opaque. For enterprise automation, this points to a practical recipe: decompose → predict key waypoints → verify with a trusted solver—and only fall back to “inspire” when your domain model is thin. ...
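A minimal sketch of that recipe, assuming stand-in `classical_solve` and `llm_predict_waypoint` callables (not any specific planner's API): the LLM only proposes midpoints, while the trusted solver produces, and thereby verifies, every plan fragment.

```python
from typing import Callable, Optional

def predict_and_split(start, goal, classical_solve: Callable,
                      llm_predict_waypoint: Callable, depth: int = 2) -> Optional[list]:
    # Try the trusted solver first; it is the verifier of record.
    plan = classical_solve(start, goal)
    if plan is not None or depth == 0:
        return plan
    # One LLM call proposes a midpoint state, splitting one hard search
    # into two easier ones (the "Predict" strategy).
    waypoint = llm_predict_waypoint(start, goal)
    first = predict_and_split(start, waypoint, classical_solve, llm_predict_waypoint, depth - 1)
    second = predict_and_split(waypoint, goal, classical_solve, llm_predict_waypoint, depth - 1)
    if first is None or second is None:
        return None  # fall back, e.g. to the "Inspire" strategy
    return first + second  # each half was produced (hence checked) by the solver
```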

August 18, 2025 · 5 min · Zelina

Fast & Curious: How ‘Speed-First’ LLM Architectures Change the Build vs. Buy Math

Executive takeaway: Efficient LLM architectures aren’t just academic; they reset the economics of AI products by cutting context costs, shrinking GPUs per QPS, and opening new form factors—from phone-side agents to ultra-cheap serverless endpoints. The winning strategy is hybrid by default, KV-light, and latency-budgeted.

Why this matters now

If you ship with AI, your margins live and die by three levers: sequence length, active parameters per token, and memory traffic. Classical Transformers lose on all three. The latest wave of “speed-first” designs offers a menu of swaps that trade negligible accuracy for step-change gains in throughput, tail latency, and $ per million tokens. This survey gives us a clean taxonomy and—more importantly—the design intent behind each family: compress the compute (linear & sparse sequence modeling), route the compute (MoE), restructure the compute (efficient full attention), and rethink the decoder (diffusion LLMs). ...
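To see why "KV-light" is one of the levers, here is a back-of-envelope KV-cache calculation. The model shape below (roughly a 7B-class Transformer) and the grouped-query-attention comparison are illustrative assumptions, not figures from the survey:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    # 2x for keys and values; fp16/bf16 assumed (2 bytes per element).
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

full_mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=8192, batch=8)
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=8192, batch=8)
print(f"MHA: {full_mha / 2**30:.0f} GiB  GQA (8 kv heads): {gqa / 2**30:.0f} GiB")
# MHA: 32 GiB vs GQA: 8 GiB -- the memory that caps batch size, QPS, and tail latency.
```

Cutting kv_heads from 32 to 8 quarters the cache, which is exactly the kind of negligible-accuracy swap that changes $ per million tokens.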

August 16, 2025 · 5 min · Zelina

Forecast: Mostly Context with a Chance of Routing

Large language models can forecast surprisingly well when you hand them the right context. But naïve prompts leave money on the table. Today’s paper introduces four plug‑and‑play strategies—ReDP, CorDP, IC‑DP, RouteDP—that lift accuracy, interpretability, and cost‑efficiency without training new models. Here’s what that means for teams running demand, risk, or ops forecasts.

Why this matters for business readers

Most production forecasts are numeric workhorses (ARIMA/ETS/TS foundation models), while contextual facts—weather advisories, policy changes, promos, strikes—arrive as text. LLMs can read that text and adjust the forecast, but simply stuffing history + context into a prompt (“direct prompting”) is often fragile. The four strategies below are operational patterns you can drop into existing stacks without re‑architecting. ...
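A minimal sketch of the routing pattern behind RouteDP, assuming stand-in callables for the numeric model, the LLM adjuster, and a difficulty scorer (all illustrative, not the paper's implementation): spend LLM budget only on series where context is likely to move the number.

```python
def route_forecast(history, context_text, numeric_model, llm_adjust,
                   difficulty_score, threshold: float = 0.5):
    base = numeric_model(history)          # cheap numeric workhorse (ARIMA/ETS/TSFM)
    if not context_text:
        return base
    if difficulty_score(history, context_text) < threshold:
        return base                        # context unlikely to change the forecast
    return llm_adjust(base, history, context_text)  # LLM revises the numeric baseline

# Always taking the llm_adjust branch corresponds to a CorDP-style pattern;
# direct prompting would instead ask the LLM for the forecast from scratch.
```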

August 16, 2025 · 5 min · Zelina

Train Long, Think Short: How Curriculum Learning Makes LLMs Think Smarter, Not Longer

When it comes to reasoning, bigger isn’t always better. Large language models (LLMs) often produce unnecessarily long chains of thought, burning through tokens — and budgets — even for simple problems. While fixed token limits during training can force brevity, they also rob models of the chance to first explore and then compress their reasoning. A new study, Train Long, Think Short, proposes a smarter path: curriculum learning for length control. Instead of a one-size-fits-all cap, the model starts with a generous token budget, learns robust reasoning strategies, and then gradually adapts to shorter limits over time. The result is a model that solves complex tasks with fewer tokens, without losing accuracy. ...
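A minimal sketch of what such a length curriculum could look like, with an exponentially decaying token cap and a reward that penalizes overruns. The decay constants, floor, and reward shaping are illustrative assumptions, not the paper's exact recipe:

```python
def budget_at_step(step: int, start: int = 2048, floor: int = 256,
                   decay: float = 0.999) -> int:
    # Generous cap early (explore), exponentially tightened later (compress).
    return max(floor, int(start * decay ** step))

def reward(correct: bool, tokens_used: int, budget: int) -> float:
    if not correct:
        return 0.0
    # Full reward within budget; linearly penalize overruns.
    return 1.0 if tokens_used <= budget else max(0.0, 1.0 - (tokens_used - budget) / budget)

for step in (0, 1000, 3000):
    print(step, budget_at_step(step))  # 2048 -> ~750 -> 256 (floor reached)
```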

August 13, 2025 · 2 min · Zelina

Textual Gradients and Workflow Evolution: How AdaptFlow Reinvents Meta-Learning for AI Agents

From Static Scripts to Living Workflows

The AI agent world has a scaling problem: most automated workflow builders generate one static orchestration per domain. Great in benchmarks, brittle in the wild. AdaptFlow — a meta-learning framework from Microsoft and Peking University — proposes a fix: treat workflow design like model training, but swap numerical gradients for natural language feedback. This small shift has a big implication: instead of re-engineering from scratch for each use case, you start from a meta-learned workflow skeleton and adapt it on the fly for each subtask. ...
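The "textual gradient" loop is easy to caricature in a few lines. In this sketch, `llm_critique` and `llm_revise` are assumed stand-ins for AdaptFlow's components, not its actual interfaces: the critique plays the role of the gradient, and the revision plays the role of the update step.

```python
def meta_train(skeleton: str, subtasks: list, llm_critique, llm_revise,
               epochs: int = 3) -> str:
    workflow = skeleton
    for _ in range(epochs):
        for task in subtasks:
            feedback = llm_critique(workflow, task)    # the "textual gradient"
            workflow = llm_revise(workflow, feedback)  # the update step
    return workflow  # meta-learned skeleton; adapted per subtask at deployment
```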

August 12, 2025 · 3 min · Zelina

Fair or Foul? How LLMs ‘Appraise’ Emotions

Most AI conversations equate “emotional intelligence” with sentiment labels. Humans don’t work that way. We appraise situations—Was it fair? Could I control it? How much effort will this take?—and then feel. This study puts that lens on large language models and asks a sharper question: Do LLMs reason about emotions through cognitive appraisals, and are those appraisals human‑plausible?

What CoRE Actually Measures (and Why It’s Different)

CoRE (Cognitive Reasoning for Emotions) evaluates seven LLMs across: ...

August 11, 2025 · 4 min · Zelina

From Ballots to Budgets: Can LLMs Be Trusted as Social Planners?

When you think of AI in public decision-making, you might picture chatbots handling service requests or predictive models flagging infrastructure risks. But what if we let large language models (LLMs) actually allocate resources—acting as digital social planners? That’s exactly what this new study tested, using Participatory Budgeting (PB) both as a practical decision-making task and as a dynamic benchmark for LLM reasoning.

Why Participatory Budgeting Is the Perfect Testbed

PB is more than a budgeting exercise. Citizens propose and vote on projects—parks, public toilets, community centers—and decision-makers choose a subset to fund within a fixed budget. It’s a constrained optimization problem with a human twist: budgets, diverse preferences, and sometimes mutually exclusive projects. ...
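That optimization core is a knapsack with exclusivity constraints. A minimal sketch of the baseline an LLM planner would be judged against, with made-up projects, costs, and vote counts (brute force for clarity; a real PB instance would need an ILP solver):

```python
from itertools import combinations

# {project: (cost, votes)} -- all figures are invented for illustration.
projects = {"park": (120_000, 340), "toilets": (60_000, 410),
            "center": (200_000, 520), "mini_park": (40_000, 150)}
budget = 260_000
exclusive = {("park", "mini_park")}  # can't fund both variants of the same site

def feasible(subset) -> bool:
    within_budget = sum(projects[p][0] for p in subset) <= budget
    no_conflict = all(not (a in subset and b in subset) for a, b in exclusive)
    return within_budget and no_conflict

best = max((s for r in range(len(projects) + 1)
            for s in combinations(projects, r) if feasible(s)),
           key=lambda s: sum(projects[p][1] for p in s))
print(best)  # ('toilets', 'center'): the vote-maximizing feasible bundle
```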

August 11, 2025 · 3 min · Zelina