LLMs | Cognaptus

Bias, Baked In: Why Pretraining, Not Fine-Tuning, Shapes LLM Behavior

TL;DR for operators: Fine-tuning is not a washing machine. It may polish, redirect, or occasionally muffle a model’s behavioural tendencies, but this paper suggests that many cognitive-bias patterns are already substantially shaped before instruction tuning begins. The study separates three possible sources of observed bias in large language models: the pretrained backbone, the instruction dataset, and random variation during fine-tuning. Its main finding is that models’ bias profiles cluster more strongly by pretrained model identity than by the instruction data used later. In plainer operational language: the base model carries a behavioural signature that survives downstream training. ...

LLMs Meet Logic: SymbolicThought Turns AI Relationship Guesswork into Graphs

TL;DR for operators: SymbolicThought is a useful reminder that relationship extraction is not a vibes problem. It is a graph problem wearing a language-model costume. The paper proposes a human-in-the-loop system for extracting character relationships from narrative text. The pipeline lets an LLM propose characters and relations, then applies symbolic rules to infer missing edges, detect contradictions, retrieve supporting evidence, and ask humans to confirm or correct what matters. That is the important mechanism: the LLM is not trusted as a final judge. It is treated as a noisy extractor inside a controlled annotation workflow. ...
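To make the "infer missing edges, detect contradictions" step concrete, here is a minimal sketch of what symbolic post-processing over LLM-proposed relations could look like. It is not the paper's implementation: the rule tables (`SYMMETRIC`, `INVERSE`, `EXCLUSIVE`), the relation names, and the function names are all hypothetical, chosen only for illustration.

```python
# Hypothetical rule tables; the paper's actual rule set is not given in this teaser.
SYMMETRIC = {"sibling_of", "spouse_of"}                       # (a, r, b) implies (b, r, a)
INVERSE = {"parent_of": "child_of", "child_of": "parent_of"}  # (a, r, b) implies (b, inv(r), a)
EXCLUSIVE = [("parent_of", "child_of")]                       # cannot both hold for the same ordered pair


def infer_edges(edges):
    """Return the proposed edges plus edges implied by symmetric and inverse rules."""
    closed = set(edges)
    for a, rel, b in edges:
        if rel in SYMMETRIC:
            closed.add((b, rel, a))
        if rel in INVERSE:
            closed.add((b, INVERSE[rel], a))
    return closed


def find_contradictions(edges):
    """Return sorted (a, b, rel1, rel2) tuples where mutually exclusive relations co-occur."""
    relations_by_pair = {}
    for a, rel, b in edges:
        relations_by_pair.setdefault((a, b), set()).add(rel)
    conflicts = []
    for (a, b), rels in relations_by_pair.items():
        for r1, r2 in EXCLUSIVE:
            if r1 in rels and r2 in rels:
                conflicts.append((a, b, r1, r2))
    return sorted(conflicts)


# Example: the (noisy) extractor claims Anna and Ben are each other's parent.
proposed = {
    ("Anna", "parent_of", "Ben"),
    ("Ben", "sibling_of", "Cara"),
    ("Ben", "parent_of", "Anna"),
}
closed = infer_edges(proposed)
conflicts = find_contradictions(closed)
```

In this toy run the closure adds the implied edge `("Cara", "sibling_of", "Ben")`, and the contradiction check flags the Anna/Ben pair. In a human-in-the-loop workflow, those flagged pairs are the ones worth sending to an annotator.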

Humans in the Loop, Not Just the Dataset

TL;DR for operators: AI-assisted monitoring does not become trustworthy because a human occasionally clicks “wrong label.” It becomes useful when the whole product is designed to capture, validate, resolve, and redeploy human judgement. The paper behind this article studies an open-source Telegram monitoring tool being developed with civil society organisations, using conspiracy-theory classification as the working scenario. Its practical contribution is a workflow: Telegram posts are classified, CSO users review labels during their normal monitoring work, their feedback is stored with metadata, and that accumulated feedback becomes a gold-standard dataset for model evaluation and refinement. ...

Delta Force: How Weak Models are Secretly the Best Teachers

TL;DR for operators: Training budget is usually where elegant AI strategy goes to die. The paper behind this article argues that preference tuning does not always need a superior teacher response. It may only need a useful contrast. A model can improve by learning that one weak answer is better than an even weaker one, even when neither answer is as good as what the model can already produce. ...

The Phantom Menace in Your Knowledge Base

TL;DR for operators: The paper’s core warning is simple: a RAG system may not be reading the same document your employee just approved. A PDF, HTML page, or DOCX file can look clean to a human reviewer while carrying hidden text, altered Unicode, poisoned fonts, or layout tricks that a document loader still extracts. ...
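As a rough illustration of one narrow slice of this problem, the sketch below scans already-extracted text for invisible Unicode format characters (general category `Cf`, which includes zero-width spaces and bidirectional overrides). It is an assumption-laden example, not the paper's detection method: the function name `find_invisible_chars` is hypothetical, and the check does nothing about poisoned fonts, hidden layout layers, or visually similar lookalike characters.

```python
import unicodedata


def find_invisible_chars(text):
    """Return (index, character name) for each Unicode format (Cf) character in text."""
    hits = []
    for index, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            # Fall back to the code point if the character has no name.
            hits.append((index, unicodedata.name(ch, f"U+{ord(ch):04X}")))
    return hits


# Example: a zero-width space hidden inside text that looks clean on screen.
suspicious = "pay\u200binvoice"
report = find_invisible_chars(suspicious)
```

A check like this would sit after the document loader and before indexing, so that flagged chunks can be quarantined or reviewed rather than silently embedded.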

Talk is Flight: How RALLY Bridges Language and Learning in UAV Swarms

TL;DR for operators: RALLY is not a chatbot with propellers. It is a hybrid control framework for UAV swarms where the LLM supplies structured semantic reasoning and the reinforcement-learning layer decides how agents should divide responsibility. The practical insight is the separation of labour. A drone swarm does not only need to know where to fly; it needs to agree who should lead, who should coordinate, who should follow, and when those roles should change. RALLY handles that by combining two-stage LLM consensus with RMIX, a role-value mixing network trained to assign Commander, Coordinator, and Executor roles under partial observability and limited communication. ...

Mind the Gap: Fixing the Flaws in Agentic Benchmarking

TL;DR for operators: Agent benchmark scores are starting to function like procurement documents. They appear in model cards, vendor decks, research claims, and internal build-versus-buy decisions. The awkward finding in this paper is that some of those scores do not measure what buyers think they measure. Zhu et al. introduce the Agentic Benchmark Checklist, or ABC, to audit whether an agentic benchmark has valid tasks, valid outcome grading, and adequate reporting. Applying it to ten widely used agentic benchmarks, they find task-validity flaws in seven, outcome-validity flaws in seven, and reporting limitations in all ten. ...

The Reasoning Gymnasium: How Zero-Sum Games Shape Smarter LLMs

TL;DR for operators: SPIRAL is not interesting because it teaches language models to play TicTacToe, Kuhn Poker, and negotiation games. That would be charming, but not exactly a boardroom emergency. Its real contribution is showing that adaptive competitive pressure can train reasoning behaviours that transfer beyond the game environment. The paper’s central lesson is mechanism-first: self-play creates a moving curriculum. The model does not merely imitate expert trajectories or exploit a fixed opponent. It faces a continuously improving version of itself, so yesterday’s shortcut becomes today’s liability. That pressure appears to produce reusable reasoning patterns: case-by-case analysis, expected value calculation, and pattern recognition. ...

When Text Doesn’t Help: Rethinking Multimodality in Forecasting

TL;DR for operators: Text does not automatically make forecasts smarter. It often just makes the pipeline heavier. A new AWS study benchmarks multimodal time-series forecasting across 16 datasets and 7 domains, comparing time-series-only models, alignment-based multimodal models, and direct LLM prompting. The uncomfortable result is that multimodality is not a universal upgrade. Strong unimodal models still win on a substantial share of the benchmark, and the paper’s statistical tests do not support a blanket claim that adding text reliably improves accuracy. ...

Playing with Strangers: A New Benchmark for Ad-Hoc Human-AI Teamwork

TL;DR for operators: Teamwork is the awkward part of agentic AI. It is easy to show a model completing a task when the environment is clean, the instructions are explicit, and the other “teammates” behave exactly as expected. Real deployments are less polite. Humans omit context, follow local conventions, adapt unevenly, and occasionally do something that looks wrong only because the system has misunderstood the room. ...