TL;DR for operators
A software team can tell an LLM to “use Singleton,” and the model may indeed wrap the code in something that looks satisfyingly architectural. Congratulations: the code has learned to wear a blazer.
The useful question is whether that blazer still has pockets. In the paper examined here, Kjellberg, Fotrousi, and Staron test 13 LLMs on 164 Java HumanEval-X coding tasks, asking them to generate code that follows the Singleton design pattern while still passing task tests.[1] They compare four strategies: direct instruction, binary automated feedback, predicate-specific automated feedback, and predicate-specific feedback with few-shot Singleton examples.
The operational result is quite practical. Automated feedback can push many models toward very high or perfect Singleton predicate compliance. Direct instruction alone is often surprisingly effective. Predicate-specific feedback improves structural alignment for several models. But functional correctness does not automatically follow structural compliance. Some models become better at Singleton and worse at solving the task. That is not a philosophical paradox. It is Tuesday in AI-assisted software development.
For business use, the paper supports a modest but valuable governance pattern: pair architectural prompts with automated structural checks and regression tests. Do not accept a generated implementation because it names the right pattern. Do not assume few-shot examples are free improvement. And certainly do not treat successful Singleton wrapping in benchmark Java tasks as evidence that the model understands when Singleton should be used in a real system. The study deliberately uses tasks where Singleton is redundant, so it tests steerability, not architectural judgment.
The familiar mistake: asking for architecture and getting costume jewelry
Software architecture is full of words that sound precise until they are handed to a generative model: Factory, Observer, Strategy, Singleton, Adapter. Human developers use these terms as compact references to shared design constraints. A model may use them as tokens associated with familiar-looking code shapes.
That distinction matters. When an LLM generates code, a pattern request is not a proof obligation unless the development process makes it one. “Use the Singleton pattern” may trigger a private constructor, a static instance, and a public accessor. It may also trigger a decorative scaffold around broken task logic. The paper’s best contribution is to make this difference measurable.
The authors choose Singleton for a sensible reason. Singleton is not the most beloved pattern in software engineering, and many architects would prefer it not be sprinkled over code like parmesan. But it is structurally detectable. A class either has something close to the expected private constructor, static instance mechanism, and global access point, or it does not. That makes Singleton useful as an experimental probe: can LLMs be steered toward a higher-level design structure while still solving the requested coding task?
The answer is: often yes, but not in the way procurement slides would prefer.
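For readers who want the shape in front of them, here is a minimal sketch of the three structural elements named above. The class name `Calculator` and its `add` method are hypothetical stand-ins for whatever task logic a benchmark problem asks for; this is one common lazy-initialized variant, not the form the paper prescribes.

```java
// Hypothetical example class: the name and the add method are illustrative only.
final class Calculator {
    // Instance mechanism: a private static field that holds the single instance.
    private static Calculator instance;

    // Private constructor: code outside this class cannot call "new Calculator()".
    private Calculator() {
    }

    // Global access point: a public static accessor that creates the instance on first use.
    // "synchronized" keeps the lazy creation safe when several threads call it at once.
    public static synchronized Calculator getInstance() {
        if (instance == null) {
            instance = new Calculator();
        }
        return instance;
    }

    // The task logic lives in ordinary instance methods.
    public int add(int a, int b) {
        return a + b;
    }
}
```

Callers write `Calculator.getInstance().add(2, 3)`; every call to `getInstance()` returns the same object.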
What the experiment actually compares
The study is comparison-based at its core. It is not just “LLMs can generate design patterns.” That would be the greeting-card version. The paper compares guidance strategies across models and measures two separate outcomes: Singleton adherence and functional correctness.
The authors use 164 Java tasks from HumanEval-X. Importantly, these tasks do not require Singleton in their canonical solutions. That design choice means no task naturally calls for Singleton, so any Singleton structure in the output can be attributed to the guidance rather than to the problem. It also means the experiment asks models to impose a redundant design structure on otherwise ordinary coding tasks. That is a feature for steerability research and a boundary for business interpretation.
The experiment uses 13 models from several families, including GPT, Llama, Qwen, Gemma, Mistral, and DeepSeek variants. Each model is evaluated under a baseline and four guidance strategies. Functional correctness is measured using pass@1 over the dataset’s tests. Singleton alignment is measured using a “Singleton Score,” based on whether generated classes satisfy three regex-detected predicates:
| Singleton predicate | What it checks | Why it matters |
|---|---|---|
| Private constructor | The class has a private constructor and no public constructor | Prevents direct external instantiation |
| Instance mechanism | The class includes a private static instance field | Provides the single reusable instance |
| Global access point | The class includes a public static accessor method | Allows code to retrieve that instance globally |
This is not a full semantic verifier. It is a structural checker for a particular interpretation of Singleton. That is enough for the paper’s immediate purpose: testing whether guidance strategies move models toward detectable architectural form.
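A checker of this kind is small enough to sketch. The version below is an assumption-laden illustration: the paper's actual regular expressions are not reproduced here, so the patterns, the `SingletonChecker` class, and the `failedChecks` method are all hypothetical. It only shows how three regex predicates can be turned into a list of failures.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical checker: the patterns are illustrative, not the paper's own.
final class SingletonChecker {
    // The three structural predicates from the table above.
    enum Check { PRIVATE_CONSTRUCTOR, INSTANCE_FIELD, GLOBAL_ACCESSOR }

    // Returns the checks that the class named className fails in the given source text.
    static Set<Check> failedChecks(String source, String className) {
        String name = Pattern.quote(className);
        // Matches: private ClassName(
        Pattern privateCtor = Pattern.compile("\\bprivate\\s+" + name + "\\s*\\(");
        // Matches: public ClassName(
        Pattern publicCtor = Pattern.compile("\\bpublic\\s+" + name + "\\s*\\(");
        // Matches: private static [final|volatile] ClassName fieldName followed by = or ;
        Pattern instanceField = Pattern.compile(
                "\\bprivate\\s+static\\s+(?:(?:final|volatile)\\s+)?" + name + "\\s+\\w+\\s*[=;]");
        // Matches: public static [synchronized] ClassName methodName(
        Pattern accessor = Pattern.compile(
                "\\bpublic\\s+static\\s+(?:synchronized\\s+)?" + name + "\\s+\\w+\\s*\\(");

        Set<Check> failed = EnumSet.noneOf(Check.class);
        boolean hasPrivateCtor = privateCtor.matcher(source).find();
        boolean hasPublicCtor = publicCtor.matcher(source).find();
        // The predicate needs a private constructor and no public one.
        if (!hasPrivateCtor || hasPublicCtor) {
            failed.add(Check.PRIVATE_CONSTRUCTOR);
        }
        if (!instanceField.matcher(source).find()) {
            failed.add(Check.INSTANCE_FIELD);
        }
        if (!accessor.matcher(source).find()) {
            failed.add(Check.GLOBAL_ACCESSOR);
        }
        return failed;
    }
}
```

The sketch also shows why such a checker is structural rather than semantic: a different modifier order, or a valid variant such as an enum-based Singleton, would be reported as a failure even though a human reviewer might accept it.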
The four strategies are not interchangeable knobs
The paper’s comparison is easiest to understand as a ladder of guidance. Each rung gives the model more external structure, but “more structure” does not always mean “better code.”
| Strategy | Likely experimental purpose | What it tests | What it does not prove |
|---|---|---|---|
| Baseline | Main reference point | How well each model solves Java tasks without being asked for Singleton | Whether the model understands design patterns |
| Direct instruction | Main evidence for prompt-only steering | Whether the model can use its internal knowledge of Singleton when asked | Whether the pattern is implemented appropriately |
| Binary feedback | Main evidence for iterative correction | Whether models can repair structure after being told the output is not Singleton | Which missing predicate caused failure |
| Predicate-specific feedback | Main evidence plus diagnostic ablation against binary feedback | Whether explicit missing-property feedback improves convergence | Whether the resulting design is semantically superior |
| Predicate-specific feedback plus few-shot examples | Exploratory extension / sensitivity test | Whether correct examples add value beyond explicit feedback | That few-shot prompting is generally helpful |
This table matters because otherwise the experiments blur into a familiar AI ritual: prompt, iterate, add examples, admire bar chart. The actual result is sharper. The system’s behavior depends on the interaction among model capability, structural feedback, and task correctness. Few-shot prompting, which often arrives in AI discussions as a magic garnish, is not reliably helpful here.
Baseline: strong coding models are not spontaneous architects
In the baseline condition, models are simply asked to solve the Java tasks. No model spontaneously generates Singleton-style classes for these problems. The average Singleton Score is 0.0 across models.
That is unsurprising but useful. The dataset tasks do not require Singleton, so the models do not naturally impose it. The baseline also gives the functional reference: GPT-5 mini has the highest pass rate at 83.5%, followed by GPT-OSS at 75.6% and Gemma3 27B at 72.0%. At the lower end, GPT-4o Mini has a baseline pass rate of 4.3%, Mistral 7B has 11.0%, and Code Llama 70B has 18.9%.
This spread is a reminder that the design-pattern question is layered on top of a basic coding-capability question. A strategy that improves Singleton adherence on a weak baseline may still produce code that fails. A strategy that slightly reduces a strong model’s pass rate may still be operationally acceptable if architecture conformance is mandatory. The paper does not collapse these into one score, which is good. Collapsing architecture and functionality into a single leaderboard would be tidy and mostly useless.
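The baseline percentages can be sanity-checked against the dataset size. The sketch below assumes one generated solution per task, so pass@1 is simply passed tasks divided by 164. The task counts are back-computed here from the reported rates; they are not quoted from the paper, and the `PassRate` class is a hypothetical helper.

```java
import java.util.Locale;

// Hypothetical helper for checking reported pass rates against task counts.
final class PassRate {
    // pass@1 as a percentage, assuming a single attempt per task.
    static double passAt1Percent(int passedTasks, int totalTasks) {
        return 100.0 * passedTasks / totalTasks;
    }

    public static void main(String[] args) {
        int totalTasks = 164;
        // Back-computed counts (an assumption), ordered from highest to lowest baseline.
        int[] passedCounts = {137, 124, 118, 31, 18, 7};
        for (int passed : passedCounts) {
            String rate = String.format(Locale.ROOT, "%.1f%%", passAt1Percent(passed, totalTasks));
            System.out.println(passed + "/" + totalTasks + " = " + rate);
        }
    }
}
```

Under that assumption the six counts reproduce 83.5%, 75.6%, 72.0%, 18.9%, 11.0%, and 4.3%. It also gives a sense of granularity: one task is worth about 0.61 percentage points, so a swing of 20 points means more than 30 tasks changed outcome.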
Direct instruction works better than cynics expect, worse than governance requires
The first strategy is simple: tell the model that the primary class should follow the Singleton design pattern. The model gets no explicit definition of Singleton and no predicate list. It must rely on what it has learned.
The Singleton results are strong for many models. Llama3.3 70B and Gemma3 27B reach a Singleton Score of 100. GPT-4o Mini, GPT-OSS, GPT-5 mini, Qwen3 Coder, Qwen3 32B, and Qwen3 8B all score between roughly 98.6 and 99.8. Even Mistral 7B reaches 91.8. In other words, many models can produce the surface structure of Singleton when directly asked.
The functionality results are less polite. Some models improve. Llama3.3 rises by 34.1 percentage points over baseline. Qwen3 Coder rises by 32.9 points. Qwen3 32B rises by 27.4 points. GPT-4o Mini rises by 55.5 points from its very low baseline. But other models degrade sharply. DeepSeek Coder v2 drops by 26.2 points, Gemma3 4B drops by 21.3 points, and Code Llama drops by 9.1 points.
The practical interpretation is not “direct instruction is good” or “direct instruction is dangerous.” It is narrower and more useful: direct instruction can activate learned architectural templates, but the cost of fitting task logic into that template depends heavily on the model.
The paper’s discussion of Mistral and Code Llama makes the point concrete. Mistral fulfills Singleton predicates at a high rate under instruction, yet achieves only a 9.8% test pass rate; many generated solutions contain compiler errors, and a manual evaluation attributes 55% of those compiler errors to references to unavailable external libraries. Code Llama performs poorly on both Singleton predicates and test passing; the authors report that 76% of its compiler errors come from failing to generate code and instead returning textual output.
That is the first business lesson. A model can obey the architectural part of a request while failing the software part. This is not “alignment.” It is compliance theater with syntax highlighting.
Binary feedback: the cheap loop often does real work
The second strategy adds an automated feedback loop. If the generated code does not satisfy all three Singleton predicates, the system tells the model that the code does not include a correctly formatted Singleton class and asks it to correct the code. It does not say which predicate failed.
This is the paper’s simplest governance-style intervention: generate, check, reject, retry. No grand agent framework. No orchestration novella. Just a structural checker attached to an iterative loop.
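The loop is simple enough to sketch in a few lines. Everything named here is an assumption for illustration: `model` stands in for an LLM call, `isSingleton` for the structural checker, and `maxRetries` for an iteration cap whose actual value in the paper is not stated in this article. The feedback wording paraphrases the description above rather than quoting the paper's prompt.

```java
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical generate-check-retry loop with binary (pass/fail) feedback.
final class BinaryFeedbackLoop {
    // model: maps a prompt to generated code (stand-in for an LLM call).
    // isSingleton: returns true only when all structural predicates hold.
    // maxRetries: assumed cap on correction rounds after the first attempt.
    static String generate(String taskPrompt, UnaryOperator<String> model,
                           Predicate<String> isSingleton, int maxRetries) {
        String code = model.apply(taskPrompt);
        for (int attempt = 0; attempt < maxRetries && !isSingleton.test(code); attempt++) {
            // Binary feedback: say that the check failed, not which predicate failed.
            String retryPrompt = taskPrompt
                    + "\n\nPrevious attempt:\n" + code
                    + "\n\nThe code does not include a correctly formatted Singleton class."
                    + " Correct the code.";
            code = model.apply(retryPrompt);
        }
        // The caller should still re-check structure and run the task tests on the result.
        return code;
    }
}
```

Note that the loop returns the last attempt even if it never passes the check, so the final structural verdict and the functional tests remain the caller's job.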
The results show why simple loops deserve more respect. Several models reach perfect Singleton Scores under binary feedback: Llama3.3, Gemma3 27B, GPT-OSS, GPT-4o Mini, and GPT-5 mini. Qwen3-based models also perform strongly, with Singleton implementation rates above 97%. Qwen3 8B reaches 99.2. Mistral improves to 95.3. Lower-performing models improve as well, though not enough to become reliable: DeepSeek Coder rises to 54.5, DeepSeek R1 to 72.2, and Code Llama to 29.5.
Functionality is again model-specific. Seven models improve their test pass rates compared with baseline. Llama3.3 improves by 28.7 percentage points, reaching 60.5%. Qwen3 8B improves by 17.7 points, reaching 58.6%. GPT-4o Mini improves by 61.0 points from its low baseline. But DeepSeek Coder and Gemma3 4B still decline significantly, by 9.8 and 21.3 percentage points respectively.
One observation deserves more attention than a surface summary would give it: the paper notes that DeepSeek Coder increases Singleton implementation while losing functional correctness. The authors describe this as the model losing context and implementing the pattern correctly for the wrong functionality.
That is the governance hazard in miniature. Structural validation can prove that a pattern exists. It cannot prove that the task survived the refactor.
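A contrived illustration makes the hazard concrete. The class below is hypothetical and not taken from the paper's outputs: it assumes a task that asks for the sum of a list, satisfies all three structural predicates, and still solves the wrong problem.

```java
import java.util.List;

// Hypothetical generated output: structurally a Singleton, functionally the wrong task.
// Assumed task for illustration: return the sum of the numbers in the list.
final class SumSolver {
    // Instance mechanism (eagerly initialized).
    private static final SumSolver INSTANCE = new SumSolver();

    // Private constructor.
    private SumSolver() {
    }

    // Global access point.
    public static SumSolver getInstance() {
        return INSTANCE;
    }

    // Bug: returns the largest element instead of the sum.
    public int solve(List<Integer> numbers) {
        int best = Integer.MIN_VALUE;
        for (int n : numbers) {
            best = Math.max(best, n);
        }
        return best;
    }
}
```

A structural check passes this class without complaint. Only a task-level test, such as expecting 6 for the list 1, 2, 3 and getting 3, reveals the problem.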
Predicate-specific feedback: better diagnosis, not automatic salvation
The third strategy gives the model more information. Instead of saying only “this is not a correctly formatted Singleton class,” the system reports which Singleton checks failed. This is tool-supported self-improvement with explicit predicate-level feedback.
As an engineering pattern, this is more attractive than binary feedback because it converts rejection into diagnosis. The checker is no longer just a gate; it becomes a teacher with a very small vocabulary.
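The step from gate to teacher is mostly a matter of message construction. The sketch below is an assumption about how such feedback might be assembled; the `PredicateFeedback` class, the `Check` enum, and the message wording are hypothetical and not the paper's prompt text. The requirement strings reuse the predicate descriptions from the table earlier in this article.

```java
import java.util.Set;
import java.util.StringJoiner;

// Hypothetical builder for predicate-specific feedback messages.
final class PredicateFeedback {
    enum Check {
        PRIVATE_CONSTRUCTOR("a private constructor and no public constructor"),
        INSTANCE_FIELD("a private static instance field"),
        GLOBAL_ACCESSOR("a public static accessor method");

        final String requirement;

        Check(String requirement) {
            this.requirement = requirement;
        }
    }

    // Turns the set of failed checks into one feedback sentence for the model.
    // Returns an empty string when nothing failed, meaning no retry is needed.
    static String message(Set<Check> failed) {
        if (failed.isEmpty()) {
            return "";
        }
        StringJoiner joiner = new StringJoiner("; ", "The class is missing: ",
                ". Correct the code without changing the task logic.");
        for (Check check : failed) {
            joiner.add(check.requirement);
        }
        return joiner.toString();
    }
}
```

Passing an `EnumSet` keeps the listed requirements in a stable order, which makes the feedback reproducible across runs and easier to audit.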
The Singleton gains are broad. The number of models reaching a perfect Singleton Score increases to seven: Gemma3 27B, GPT-OSS, Llama3.3, Qwen3 Coder, Qwen3 32B, GPT-4o Mini, and GPT-5 mini. Qwen3 8B and Mistral exceed 97.0. DeepSeek Coder improves substantially to 80.1. Code Llama improves to 36.4. Which predicates get missed follows much the same pattern as in the binary-feedback case, but more predicates are fulfilled overall.
This is the clearest evidence that predicate-level feedback is useful for structural conformance. It gives the model enough information to repair missing architectural elements rather than guessing what went wrong.
But the functionality story remains uneven. Llama3.3, Qwen3 Coder, GPT-4o Mini, and Qwen3 32B show significant improvements over baseline. DeepSeek R1 benefits more here than under direct instruction. Yet Gemma3 4B and DeepSeek Coder still show significant negative effects against baseline.
This is exactly where business teams should resist the pretty dashboard. A structural checker can make the generated code look more architecturally compliant. It does not eliminate the need for task-level tests. In the paper’s setup, functionality is still evaluated separately through HumanEval-X test cases. In production, that separation should become a rule, not a courtesy.
Few-shot examples are not free improvement
The fourth strategy adds two examples of correctly implemented Singleton classes alongside predicate-specific feedback. If one were writing a generic prompt-engineering brochure, this would be the obvious winner: instruction plus feedback plus examples. Very wholesome. Very workshop-friendly.
The results are less obedient.
Six models reach perfect Singleton implementation under this strategy, and three more score 95.3 or above. DeepSeek R1 and Code Llama achieve their best Singleton Scores here, at 85.4 and 43.3 respectively. But Gemma3 4B collapses to a Singleton Score of 40.7, with a particularly sharp problem on the private constructor predicate, present in only 20.7% of generated code.
Functionality also becomes less attractive. Only five models significantly improve pass rate under this strategy, while four perform significantly worse than baseline. The authors explicitly state that adding Singleton examples does not improve the ability to implement Singleton compared with relying on predicate feedback alone, and for some models the test pass rate is significantly lower than under predicate-specific feedback alone (the tool-supported self-improvement strategy).
This is the paper’s most useful anti-cliché result. Few-shot prompting is not a universal additive. Examples can help some models and confuse others. They can shift attention toward the example format, interfere with task-specific logic, or simply fail to add information beyond what the predicate feedback already provides.
For operators, this means prompt recipes should be treated like deployable configuration, not folklore. If adding examples changes behavior, measure both structure and functionality. If it does not help, remove them. The cheapest prompt is the one you did not stuff with motivational furniture.
The real comparison is not prompt versus tool; it is structure versus behavior
The most tempting reading of the paper is that automated feedback improves design-pattern adherence. True, but incomplete. The more important reading is that the paper exposes two axes that AI coding workflows often blend together:
| Axis | What the paper measures | What a team might mistakenly infer |
|---|---|---|
| Structural conformance | Whether the code satisfies detectable Singleton predicates | The design is architecturally sound |
| Functional correctness | Whether generated code passes task tests | The design constraint did no harm |
| Model sensitivity | Different models respond differently to the same strategy | One prompt policy can be standardized across all models |
| Example sensitivity | Few-shot examples sometimes reduce performance | More context is always better |
| Checker scope | Regex-based predicate detection | Full semantic verification of a design pattern |
The paper’s evidence supports structural steerability. It does not support architectural understanding in the stronger sense. The models are not being asked whether Singleton is appropriate. They are being asked to impose Singleton on tasks that do not require it. That distinction is not a footnote-level limitation; it is central to how the result should be used.
A business process should therefore frame this kind of guidance as constrained generation, not autonomous design. The model is not acting as an architect deciding whether a pattern belongs. It is acting as a code generator being forced through a pattern-shaped aperture. Sometimes the output still works. Sometimes it comes out dented.
What software teams can take from this
The paper suggests a useful operating model for AI-assisted development, especially in teams trying to standardize code conventions, secure coding idioms, or architectural templates.
First, separate pattern instruction from pattern verification. A prompt is not a control. It is an intention. The control is the checker that tests whether required properties exist.
Second, separate pattern verification from functional testing. The Singleton checker can confirm a private constructor, static instance mechanism, and global accessor. It cannot confirm that the algorithm still solves the task. A code review pipeline needs both structural checks and regression tests.
Third, tune guidance by model. The same strategy produces sharply different outcomes across models. Stronger generalist or coder models may benefit from direct instruction or simple binary feedback. Weaker or more brittle models may increase conformance while losing functionality. “We tested the prompt on one model” is not a model governance policy. It is a diary entry.
Fourth, treat examples as an empirical variable. Few-shot prompting did not reliably improve results over predicate-level feedback. In some cases, it damaged pass rates. Teams should version prompts and examples the same way they version code: with evaluation, rollback, and some humility, ideally before the incident review.
Fifth, use structural feedback where the property is detectable. Singleton is convenient because its predicates can be checked by a script. This lesson transfers most cleanly to conventions with machine-checkable signatures: required constructors, access modifiers, interface usage, security wrappers, naming constraints, dependency boundaries, or prohibited imports. It transfers less cleanly to fuzzy architectural qualities like modularity, maintainability, domain fit, or “clean design,” a phrase that has launched many meetings and resolved very few of them.
A practical workflow implied by the paper
The paper does not propose an enterprise SDLC framework, but its experiment points toward a disciplined workflow:
| Stage | Operator action | Tooling implication |
|---|---|---|
| Specify | Ask the model for the coding task and required design constraint | Prompts should state both task and structural requirement |
| Generate | Produce candidate code | Capture model, prompt, and strategy metadata |
| Check structure | Run automated predicate or static-analysis checks | Reject outputs that fail detectable architectural constraints |
| Give feedback | Provide binary or predicate-level correction | Prefer predicate-level feedback when the missing property is known |
| Test behavior | Run unit, regression, and compile checks | Never replace tests with pattern conformance |
| Review appropriateness | Human or policy layer decides whether the pattern belongs | Especially important because this paper does not test pattern selection |
| Promote | Accept only if both structure and behavior pass | Store evaluation traces for audit and prompt tuning |
This is not glamorous. That is the point. The operational future of AI coding is probably less “autonomous software engineer” and more “fast generator surrounded by annoyingly competent gates.” The gates are where the value becomes durable.
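The last rows of the table reduce to a small decision function. This is a sketch under stated assumptions: `PromotionGate`, its `Verdict` values, and the `patternApproved` flag (standing in for the human or policy review) are hypothetical names, and the two checks are passed in as predicates so any structural checker or test runner can be plugged in.

```java
import java.util.function.Predicate;

// Hypothetical gate: promote only when structure, behavior, and review all pass.
final class PromotionGate {
    enum Verdict { NEEDS_STRUCTURE_FEEDBACK, FAILS_TESTS, NEEDS_REVIEW, PROMOTE }

    // structureCheck: true when the required structural predicates hold.
    // behaviorCheck: true when compile, unit, and regression checks pass.
    // patternApproved: outcome of the human or policy decision that the pattern belongs.
    static Verdict evaluate(String code, Predicate<String> structureCheck,
                            Predicate<String> behaviorCheck, boolean patternApproved) {
        if (!structureCheck.test(code)) {
            return Verdict.NEEDS_STRUCTURE_FEEDBACK;
        }
        if (!behaviorCheck.test(code)) {
            return Verdict.FAILS_TESTS;
        }
        if (!patternApproved) {
            return Verdict.NEEDS_REVIEW;
        }
        return Verdict.PROMOTE;
    }
}
```

Keeping the verdicts distinct matters for the audit trail: a candidate sent back for structural feedback and a candidate that failed its tests call for different next steps.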
The limitations are not generic; they define the usable boundary
The study’s limitations are unusually important because they prevent overclaiming.
The dataset has 164 Java tasks. That is large enough to compare strategies in a controlled way, but not enough to generalize across software systems, languages, frameworks, and codebases. The results are about Java generation under a benchmark setting.
The tasks do not require Singleton. This is deliberate, because the authors want to test whether models can be guided to implement the pattern without accidental natural occurrence. But it also means the paper does not show that models know when Singleton is appropriate. It shows they can often be made to implement it.
The Singleton checker is regex-based and tied to a specific definition of Singleton. That makes the metric operationally tractable, but it does not capture all valid Singleton variants or semantic properties. A real codebase may implement lazy initialization, thread-safe variants, dependency injection alternatives, or language-specific idioms that a simple checker could misread.
The prompts are not optimized per model. This supports fair comparison across models, but it means a production team might get better results with model-specific prompt tuning. It also means the paper’s ranking of strategies should not be treated as permanent. Models and prompting interfaces change; the underlying lesson is the need to evaluate the loop, not memorize this month’s ordering.
The functionality tests come from HumanEval-X. Passing those tests is useful evidence, but tests are incomplete by nature. In production, generated code can pass local unit tests and still violate performance, concurrency, maintainability, security, or integration requirements. Shocking, yes. Software remains software.
The decision rule: use LLMs for pattern-constrained generation, not pattern judgment
The strongest business interpretation is not that LLMs are ready to become architecture engines. It is that LLMs can be useful inside a constrained generation workflow where architectural rules are externally specified and automatically checked.
That distinction matters for investment. Buying or building AI coding tools should not focus only on model access. The useful layer is the harness: static analysis, pattern recognizers, test execution, compile checks, prompt/version management, and review gates. A model alone can generate a plausible Singleton. A governed workflow can decide whether the Singleton is present, whether the code still works, and whether the pattern should have been requested in the first place.
The paper’s comparison also suggests that evaluation should be two-dimensional. If a vendor shows “architecture compliance,” ask for functional preservation. If they show pass rates, ask whether the requested design constraints were satisfied. If they show both, ask how results vary by model and prompt strategy. This will mildly irritate them, which is often how one knows a useful procurement question has been found.
The conclusion: the prompt is not the architecture
Kjellberg, Fotrousi, and Staron show that LLM-generated Java code can be steered toward the Singleton pattern using simple instructions and automated feedback. The effect is real. It is also uneven. Some models reach near-perfect structural conformance while maintaining or improving functionality. Others learn the costume and forget the play.
The best lesson is therefore not “prompt better.” It is “build the checking loop.” Design-pattern prompts should become enforceable constraints, not aesthetic suggestions. Automated feedback can correct structural misses, especially when it tells the model which predicate failed. But functional tests remain non-negotiable, and human judgment is still required to decide whether the pattern belongs.
Singleton, in this paper, is a laboratory animal. The larger species is architecture-aware code generation. We are not yet watching models make reliable design decisions. We are watching them respond to externally imposed structure. That is useful. It is also smaller than the headline some people will try to sell.
Cognaptus: Automate the Present, Incubate the Future.
[1] Viktor Kjellberg, Farnaz Fotrousi, and Miroslaw Staron, “Strategies for Guiding LLMs to Use Software Design Patterns: A Case of Singleton,” arXiv:2605.26898, 2026. https://arxiv.org/abs/2605.26898