Code Generation

Design Patterns Are Not Prompt Decorations

TL;DR for operators: A software team can tell an LLM to “use Singleton,” and the model may indeed wrap the code in something that looks satisfyingly architectural. Congratulations: the code has learned to wear a blazer. The useful question is whether that blazer still has pockets. In the paper examined here, Kjellberg, Fotrousi, and Staron test 13 LLMs on 164 Java HumanEval-X coding tasks, asking them to generate code that follows the Singleton design pattern while still passing task tests.[1] They compare four strategies: direct instruction, binary automated feedback, predicate-specific automated feedback, and predicate-specific feedback with few-shot Singleton examples. ...
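To make the gap between binary and predicate-specific feedback concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the excerpt does not list the paper's predicates, so the three checks below (`private_constructor`, `static_instance_field`, `static_accessor`), the class name `Solution`, and the crude regex matching are all hypothetical stand-ins, not the authors' tooling.

```python
import re

# Hypothetical Singleton predicates. The paper's actual predicate set is not
# given in this excerpt, so these three regex checks are illustrative only.
PREDICATES = {
    "private_constructor": r"private\s+Solution\s*\(",
    "static_instance_field": r"private\s+static\s+(?:final\s+)?Solution\s+\w+",
    "static_accessor": r"public\s+static\s+(?:synchronized\s+)?Solution\s+getInstance\s*\(",
}


def check_predicates(java_source):
    """Return {predicate_name: bool} from crude regex checks on Java source text."""
    return {
        name: re.search(pattern, java_source) is not None
        for name, pattern in PREDICATES.items()
    }


def binary_feedback(results):
    """Binary feedback: a single pass/fail verdict with no detail."""
    if all(results.values()):
        return "Singleton check: PASS"
    return "Singleton check: FAIL"


def predicate_feedback(results):
    """Predicate-specific feedback: name each property that failed."""
    failed = sorted(name for name, ok in results.items() if not ok)
    if not failed:
        return "Singleton check: PASS"
    return "Singleton check: FAIL (" + ", ".join(failed) + ")"


# A candidate that looks like a Singleton but leaves the constructor public.
candidate = """
class Solution {
    private static Solution instance;
    public Solution() {}
    public static Solution getInstance() {
        if (instance == null) { instance = new Solution(); }
        return instance;
    }
}
"""

results = check_predicates(candidate)
print(binary_feedback(results))
print(predicate_feedback(results))
```

The binary message tells the model only that something is wrong. The predicate-specific message tells it which property to repair, which is the distinction the four strategies turn on. A real checker would parse the Java rather than pattern-match it.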

The Code Agent Wasn’t Self-Correcting. The Test Harness Was.

TL;DR for operators: Code agents do not become reliable because they are asked politely to “fix the bug.” They become more useful when they are placed inside a loop that can run their output, return structured failure evidence, and decide how many further attempts are worth buying. That is the practical point of Zhang and Kothari’s paper, Unlocking LLM Code Correction with Iterative Feedback Loops.[1] The authors evaluate four LLMs across Python and Java using LeetCode problems, then move from ordinary one-shot performance to an automated correction loop: generate code, execute it, feed back compiler/runtime/testcase information, and repeat up to ten iterations. ...
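The generate, execute, feed back, repeat loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' harness: `run_candidate`, `correction_loop`, and `fake_model` are hypothetical names, the "model" is a stub that fixes its answer once it sees any feedback, and the only feedback returned is the stderr of a failed run.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

MAX_ITERATIONS = 10  # the excerpt describes up to ten iterations


def run_candidate(code, test_code, timeout=10):
    """Run candidate code plus its tests in a subprocess.

    Returns (passed, feedback); feedback is the captured stderr on failure.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(code + "\n\n" + test_code + "\n")
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False, "timeout after %d seconds" % timeout
    if proc.returncode == 0:
        return True, ""
    return False, proc.stderr.strip() or "non-zero exit with empty stderr"


def correction_loop(generate, task, test_code, max_iterations=MAX_ITERATIONS):
    """Call generate(task, feedback) until the tests pass or the budget runs out.

    Returns (code, attempts_used), or (None, max_iterations) if nothing passed.
    """
    feedback = ""
    for attempt in range(1, max_iterations + 1):
        code = generate(task, feedback)
        passed, feedback = run_candidate(code, test_code)
        if passed:
            return code, attempt
    return None, max_iterations


def fake_model(task, feedback):
    """Stand-in for an LLM call: wrong at first, corrected once feedback arrives."""
    if not feedback:
        return "def add(a, b):\n    return a - b"
    return "def add(a, b):\n    return a + b"


code, attempts = correction_loop(
    fake_model, "Write add(a, b).", "assert add(2, 3) == 5"
)
print(attempts)
```

The iteration cap is the operator's budget knob: each extra pass costs another model call and another execution, so the loop stops as soon as the tests pass.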

Compile Once, Train Later: Offline RL Moves Code-Model Verification Upstream

Code assistants have a small accounting problem. Not the glamorous kind involving model capability, agentic workflows, or yet another dashboard with a glowing neural blob. The ordinary kind: every time a model proposes code during reinforcement learning, someone, or something, has to run it, test it, score it, and feed that score back into training. ...

Tokens, Watts, and Waste: The Hidden Energy Bill of LLM Inference

Tokens are small. That is why they are dangerous. A developer asks an assistant to generate a function, explain a repository, or reason through a failing test. The screen fills with text. Some of it is useful. Some of it is decoration. Some of it is a polite little parade of examples, test cases, alternative implementations, or whitespace that will be thrown away by the next parser in the pipeline. ...

When Three Examples Beat a Thousand GPUs

A GPU bill is usually treated as a hardware problem. Buy faster accelerators, shorten training runs, negotiate a better cloud contract. Less often asked is whether the expensive part of the pipeline began with a badly calibrated prompt. An LLM generating neural-network architectures can create thousands of candidates before training begins. If the prompt provides too little context, the model may repeatedly produce shallow variations of the same familiar design. Add more examples, and it may combine useful ideas across architectural families. Add still more, and the output can become worse, incomplete, or invalid. ...

When Sketches Start Running: Generative Digital Twins Come Alive

Factory sketches are usually where industrial simulation begins, not where it runs. An engineer draws the line, marks the queue, places a processor, adds a conveyor, then disappears into the less glamorous work: configuring objects, assigning arrival distributions, wiring routes, and writing platform-specific logic. The sketch is the easy part. The executable twin is the expensive part. ...

When Small Models Learn From Their Mistakes: Arithmetic Reasoning Without Fine-Tuning

Numbers are where language models usually stop sounding impressive. Ask a model to summarize a financial report and it may produce a fluent paragraph with just enough confidence to make everyone in the meeting relax. Ask it to calculate a percentage change from a table, preserve the correct scale, and return a verifiable number, and the poetry ends. Suddenly the model must select the right values, understand the wording, apply the right operation, avoid sign mistakes, avoid scale mistakes, and not hallucinate a formula because the word “change” appeared nearby. ...

Agents Without Prompts: When LLMs Finally Learn to Check Their Own Homework

Instructions are usually treated as the beginning of an AI workflow. A user, developer, or system designer writes a prompt. The model produces an output. Then, if the output looks wrong, someone writes another prompt telling the model how to check it, another prompt telling it how to repair it, and eventually a small mountain of prompt glue accumulates around what was supposed to be an automated system. ...

When Agents Compare Notes: How Shared Memory Quietly Rewires Software Development

Software teams already know the problem. One developer discovers the weird edge case. Another developer repeats the same mistake three weeks later. A third person writes a Slack explanation that disappears into the corporate sedimentary layer, next to the launch checklist from 2019 and that one blessed Docker command nobody can find anymore. ...

Parallel Minds: How OMPILOT Redefines Code Translation for Shared Memory AI

Backlogs are where technical debt goes to become architecture. Somewhere inside a simulation company, an engineering team knows that a large body of C++ could run faster if more of it used shared-memory parallelism. The CPUs are already multicore. The workload already begs for concurrency. The obstacle is not theory. It is the miserable little detail that correct OpenMP is easy to write incorrectly. ...