LLM | Cognaptus

Spin Doctors: Why RL Fine‑Tuning Mostly Rotates, Not Reinvents

TL;DR for operators: If your fine-tuned model gets better on the training task while quietly becoming worse outside it, the problem may not be that the model “lost intelligence”. It may have rotated its useful internal directions away from broadly generalizable behaviour. The paper behind this article studies SFT followed by PPO-style RL on two open LLMs using a controlled arithmetic benchmark, then inspects the weight matrices through singular-value decomposition.[1] The pattern is clean enough to be operationally interesting: OOD performance peaks early during SFT, falls as SFT continues, and can be substantially restored by RL when the SFT checkpoint is only moderately degraded. But if SFT pushes the model too far into a specialized regime, RL is no longer a reliable rescue crew. Apparently even reinforcement learning has limits. Who knew. ...

Forecast: Mostly Context with a Chance of Routing

TL;DR for operators: Most forecasting teams already have decent numerical forecasters. Their problem is not that ARIMA, ETS, Lag-Llama, Chronos, or internal demand models suddenly forgot how Tuesdays work. The problem is that many important forecast shocks arrive as text: heat-wave notices, maintenance schedules, holiday effects, price caps, promotions, policy changes, store closures, one-off events, and all the other messy little business facts that refuse to fit politely into a clean covariate table. ...

From Zero to Reasoning Hero: How R-Zero Teaches Itself Without Human Data

TL;DR for operators: R-Zero is a self-evolving training framework for reasoning LLMs that starts with one base model, splits it into two roles, and lets them co-train: a Challenger generates difficult questions, while a Solver learns to answer them.[1] The useful business takeaway is not “models no longer need data.” That is the sort of sentence that should be handled with tongs. R-Zero removes the need for external task datasets and human labels in its training loop, but it still depends on engineered reward signals, majority-vote pseudo-labels, answer-format discipline, filtering, and objective correctness checks. “Zero data” here means zero external tasks and labels, not zero structure. ...
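As a rough illustration of the majority-vote pseudo-labels mentioned above, here is a minimal sketch. It is not R-Zero's implementation: the function name `majority_vote_label`, the `min_agreement` threshold, and the string normalization are all assumptions made for this example.

```python
from collections import Counter


def majority_vote_label(answers, min_agreement=0.5):
    """Return (label, agreement) for a list of sampled answers.

    Hypothetical helper: the most common normalized answer becomes the
    pseudo-label. If its share of the votes is below `min_agreement`
    (an assumed threshold), the label is None so the item can be filtered.
    """
    if not answers:
        return None, 0.0
    # Normalize so trivially different strings count as the same answer.
    normalized = [a.strip().lower() for a in answers]
    label, count = Counter(normalized).most_common(1)[0]
    agreement = count / len(normalized)
    if agreement < min_agreement:
        return None, agreement
    return label, agreement
```

The agreement share doubles as a crude confidence signal: items where the samples disagree get no label and can be dropped, which is one simple way to picture the "filtering" step listed alongside the pseudo-labels.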

Thinking Without Talking: How SynAdapt Lets LLMs Reason in Silence

TL;DR for operators: SynAdapt is not a paper about making models “think secretly” because mystery sells better on conference posters. It is a paper about inference budgeting: when a model should spend tokens explaining its reasoning, and when it can compress that reasoning into latent vectors and move on. The method trains a large language model to use synthetic continuous chain-of-thought (CCoT) as a dense internal reasoning representation instead of generating long natural-language reasoning traces. For easier problems, the model answers using this latent representation directly. For harder problems, a difficulty classifier detects that silent reasoning is likely insufficient and routes the question back to discrete chain-of-thought, with a prompt that keeps the re-thinking concise.[1] ...
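The routing decision described above can be pictured with a small sketch. This is an assumption-laden illustration, not SynAdapt's code: the `difficulty_score` input, the `threshold` value, and the mode names `latent_ccot` and `discrete_cot` are hypothetical stand-ins for whatever the difficulty classifier actually produces.

```python
def route_question(difficulty_score, threshold=0.5):
    """Choose a reasoning mode from a difficulty score in [0, 1].

    Hypothetical routing rule: scores at or above `threshold` fall back
    to explicit (discrete) chain-of-thought; lower scores are answered
    from the latent representation.
    """
    if not 0.0 <= difficulty_score <= 1.0:
        raise ValueError("difficulty_score must be in [0, 1]")
    if difficulty_score >= threshold:
        return "discrete_cot"
    return "latent_ccot"
```

Under this sketch, lowering the threshold sends more questions to explicit reasoning, which spends more tokens in exchange for less reliance on the latent path.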

Echoes in the Algorithm: How GPT-4o's Stories Flatten Global Culture

TL;DR for operators: The paper does not merely say that GPT-generated stories contain national clichés. That would be mildly interesting, in the way that discovering a tourist brochure likes sunsets is mildly interesting. The sharper finding is structural. When Rettberg and Wigers prompted gpt-4o-mini to write 1,500-word “potential” stories for 236 demonyms, the model produced surface diversity (olive trees, fjords, forests, trains, village elders, festivals) but repeatedly returned to the same basic narrative machine: someone comes back to a small town or village, discovers that community or tradition has weakened, organises a symbolic event, and restores harmony.[1] ...

From Chaos to Care: Structuring LLMs with Clinical Guidelines

TL;DR for operators: Patient records are not just long documents. They are timelines with consequences. CliCARE, the framework proposed in the paper, attacks that problem by turning longitudinal cancer EHRs into patient-specific temporal knowledge graphs, then aligning those patient trajectories with clinical guideline knowledge graphs before asking an LLM to generate a clinical summary and recommendation.[1] That sounds architectural because it is. The useful lesson is not that “AI can help doctors,” a phrase now so overused it should probably be placed in quarantine. The lesson is that clinical AI improves when the model is given a structured representation of disease progression and a normative map of what should happen next. ...

Factor Factory: How LLMs Are Reinventing Sparse Portfolio Optimization

TL;DR for operators: Portfolio teams do not usually fail because they have no models. They fail because the models age, the signals decay, and the process of discovering new sparse selection logic is slow, expensive, and wonderfully allergic to market regime shifts. The paper behind EFS (Evolutionary Factor Search) proposes a useful change in framing: stop asking the LLM to “pick stocks” and ask it to generate executable alpha-factor formulas that can be backtested, filtered, evolved, and used to rank assets under sparse portfolio constraints.[1] That distinction matters. The LLM is not the portfolio manager. It is the factor-factory intern with suspicious stamina. The backtest loop is still the adult in the room. ...

The Sentiment Edge: How FinDPO Trains LLMs to Think Like Traders

TL;DR for operators: News is only useful when it survives the journey from headline to position sizing. FinDPO, proposed by Giorgos Iacovides, Wuyang Zhou, and Danilo Mandic, is a finance-specific Llama-3-8B-Instruct sentiment model trained with Direct Preference Optimization rather than ordinary supervised fine-tuning.[1] The paper’s headline result is not merely that FinDPO scores well on sentiment benchmarks. Plenty of models win benchmarks, then politely disappear when transaction costs arrive. ...

Plug Me In: Why LLMs with Tools Beat LLMs with Size

TL;DR for operators: The Athena paper is useful because it makes a simple operational point that many AI buying committees still manage to avoid: a bigger language model is not the same thing as a better workflow.[1] An LLM can explain, infer, and format. It is still a poor substitute for a calculator, a live database, a calendar API, a search service, or a domain-specific computation engine. This is not a moral failure. It is just architecture. ...

Beyond the Pull Request: What ChatGPT Teaches Us About Productivity

TL;DR for operators: Most companies still ask the wrong first question about LLMs in software development: “Do they make developers write code faster?” That question is not useless. It is just too small. A recent paper by Sardar Bonabi, Sarah Bana, Vijay Gurbaxani, and Tingting Nian uses Italy’s temporary 2023 ChatGPT ban as a natural experiment to examine what happened to public GitHub activity when Italian developers abruptly lost access to ChatGPT, compared with similar developers in France and Portugal.[1] The study covers 88,022 open-source software developers and looks at a 16-week window: eight weeks before the ban, four weeks during it, and four weeks after access was restored. ...