LLMs | Cognaptus

When Solvers Guess Smarter: Teaching SMT to Think in Functions

Timeouts are where formal verification quietly loses its glamour. A team writes a specification. A solver receives the formula. Everyone expects the machine to answer a clean question: is this system safe, satisfiable, contradictory, or not? Then the solver thinks. And thinks. And returns nothing useful before the clock runs out. ...

When Prompts Learn Themselves: The Death of Task Cues

A database column named CURRENT_BAL_AMT is annoying. A column named gbstk is worse. Somewhere inside an enterprise data warehouse, these names are perfectly normal. Somewhere outside the original engineering team, they are tiny locked doors. The usual solution is not glamorous. Someone asks a data engineer. The data engineer asks an older data engineer. A wiki page is found, partly wrong, last updated during an earlier economic cycle. Eventually, “current balance amount” or “overall processing status of sales document” appears in a data catalog, a semantic layer, a search index, or a text-to-SQL system. Humanity advances by one abbreviation. ...

EverMemOS: When Memory Stops Being a Junk Drawer

Memory sounds simple until the assistant has to remember two incompatible things at once. A customer loves craft beer. The same customer is temporarily taking antibiotics. A flat memory system retrieves “likes IPA” and recommends a variety pack, because apparently “memory” means grabbing the loudest sticky note from a drawer and pretending it is wisdom. A more useful assistant retrieves the preference, the medical constraint, the timing, and the relation among them. It recommends a mocktail and quietly avoids turning personalization into negligence. ...

Crossing the Line: Teaching Pedestrian Models to Reason, Not Memorize

Crosswalks look simple from a spreadsheet. A pedestrian either crosses at the intersection or crosses mid-block. The model sees age group, gender, lane count, lighting, weather, signal timing, maybe a bus stop nearby, and then predicts the choice. Very civilized. Very tabular. Very likely to fail when the same logic is moved to a different road. ...

SAGA, Not Sci‑Fi: When LLMs Start Doing Science

Science usually fails in a boring way. Not with explosions. Not with a robot dramatically discovering penicillin 2.0 while violins swell in the background. More often, a research workflow fails because somebody optimized the wrong thing a little too efficiently. A molecule scores well but is chemically ugly. A nanobody looks good under one predictor but fails to bind. A DNA enhancer activates the target cell line but also lights up the wrong tissue. A separation process reaches high purity by adding pointless unit operations, because the reward function forgot to punish industrial nonsense. The optimizer did its job. Unfortunately, the job description was incomplete. ...

Guardrails Over Gigabytes: Making LLM Coding Agents Behave

The coding agent did not fail quietly. That was the point. A coding agent writes a patch. The patch looks plausible. The imports are clean enough. The function names sound like they belong in the repository. The explanation is fluent, naturally. Fluency is what these systems do best. Then the build breaks. ...

When Policies Read Each Other: Teaching Agents to Cooperate by Reading the Code

A workflow breaks in a familiar way. The planning agent assumes the procurement agent will wait. The procurement agent assumes the planning agent has already revised the forecast. The compliance agent flags the output after both have acted. Everyone had access to the same dashboard. Nobody had access to the thing that actually mattered: the other agent’s decision policy. ...

When Bigger Isn’t Smarter: Stress‑Testing LLMs in the ICU

A hospital does not buy “intelligence.” It buys a workflow. That distinction sounds obvious until an AI vendor arrives with a model that has billions of parameters, a clinical pretraining story, and the gentle implication that smaller models are now museum pieces. In the ICU, however, the useful question is not whether the model can talk like a doctor. It is whether it can detect tomorrow’s clinical deterioration from messy notes better than simpler systems that cost less, run faster, and attract fewer infrastructure headaches. ...

LLMs, Gotta Think ’Em All: When Pokémon Battles Become a Serious AI Benchmark

Game AI usually has a familiar job: lose convincingly. Not too quickly, because that feels insulting. Not too brutally, because that feels like homework wearing a boss battle costume. Good game AI sits in the narrow emotional band between “I can beat this” and “I need to think.” The old solution was scripted behavior, heuristics, difficulty sliders, or reinforcement learning trained until the agent stopped embarrassing itself. The newer temptation is simpler: give the game state to an LLM and ask it to play. ...

Stepwise Think-Critique: Teaching LLMs to Doubt Themselves (Productively)

The useful part of doubt is timing. Doubt is not useful after the invoice is paid, the client report is sent, or the model has already produced a confident wrong answer with twelve decorative paragraphs of reasoning. At that point, “let us verify” becomes less like quality control and more like archaeology. ...