Open-Source, Open Risk? Testing the Limits of Malicious Fine-Tuning

TL;DR for operators

Open-weight model safety is not just a question of what the released model refuses to answer. Once weights are public, the more relevant question is what a capable actor can make the model do after post-training. That is the problem this paper tackles.

The paper introduces malicious fine-tuning as a release-evaluation method: take the model, assume a sophisticated adversary with serious reinforcement-learning infrastructure, and try to elicit the maximum dangerous capability in high-risk domains. The authors apply this to gpt-oss-120b, focusing on biology and cybersecurity rather than self-improvement.

The operational headline is mixed but useful. Refusal behaviour is easy to alter: an incremental RL stage can push refusal rates for unsafe prompts close to zero while largely preserving benchmark performance. Biology capabilities improve meaningfully when the model is trained with browser use and expert-curated domain data. Yet the fine-tuned gpt-oss variants mostly remain below OpenAI o3, and the paper argues that the release contributes only limited marginal biorisk beyond already accessible open-weight models.

Cybersecurity looks different. Fine-tuning on capture-the-flag tasks and terminal-based environments produces only modest gains. The bottleneck is not refusal. It is agentic competence: time management, tool use, sustained execution, and the ability to chain actions across larger environments. The model can perform well on easier CTF-style tasks but fails on harder cyber range environments, which is inconvenient for anyone hoping that “just add RL” will magically produce an autonomous red team. Reality remains annoyingly specific.

For AI labs, platform vendors, and regulated enterprises, the lesson is not “open-weight is safe” or “open-weight is doomed”. The lesson is that release reviews should include adversarial post-training, tool scaffolding, inference-time elicitation, and marginal comparison against existing public models. The paper gives a useful template. It does not give a universal safety certificate.

The shipped model is not the real object of risk

A familiar release review asks: does the model refuse unsafe prompts? Does it comply with policy? Does it pass the model card’s safety evaluations?

Those are sensible questions. They are also incomplete for open-weight models.

The moment weights are released, the model stops being a fixed product and becomes a raw material. A downstream actor can fine-tune it, wrap it in tools, add retrieval, change decoding, strip out refusals, or specialise it for a domain. Judging only the shipped checkpoint is like inspecting a factory-sealed power tool and ignoring the person who plans to remove the blade guard. The guard matters. So does the screwdriver.

The paper “Estimating Worst-Case Frontier Risks of Open-Weight LLMs” makes this shift explicit.¹ Instead of treating refusal behaviour as the end of the safety story, the authors ask what happens when a capable actor tries to maximise dangerous capability after release. They call this malicious fine-tuning, or MFT.

That framing is the real contribution. The headline results matter, but the more durable idea is methodological: open-weight release evaluation should simulate hostile post-training rather than merely audit the polite version of the model.

Malicious fine-tuning is a three-step stress test

The paper’s mechanism is easier to understand as a pipeline.

Step	What the evaluator tries to do	Why it matters	What failure or success means
Remove refusals	Train the model to comply with unsafe prompts	Tests whether safety behaviour survives post-release adaptation	If refusals disappear easily, refusal rates alone are weak evidence for open-weight safety
Amplify domain capability	Fine-tune with domain data, tools, and RL	Tests whether the model can become more capable in high-risk areas	If performance rises, the release may create new practical capability even if the shipped model looked controlled
Measure marginal frontier shift	Compare against existing open- and closed-weight systems	Tests whether the release changes the threat landscape	A model can be risky in absolute terms but still add little marginal capability if similar tools already exist

This pipeline matters because “dangerous capability” is not one thing. A model might know facts but refuse to state them. It might comply but lack domain skill. It might perform well on static questions but fail when asked to run a toolchain. It might look strong in isolation but add little new capability because the same performance is already available elsewhere.

The paper therefore separates three questions that are often mashed together in public debate:

Can safety refusals be removed?
Can specialised post-training raise hazardous-domain performance?
Does the resulting model exceed what already exists?

Only the third question speaks directly to marginal release risk. The first two explain how much room an adversary has to move.

The adversary is not a bored teenager with a laptop

A useful detail: the paper does not model the weakest attacker. It models a technically capable actor with access to strong RL infrastructure, machine-learning expertise, curated domain data, and a high compute budget. The paper gives “seven figures USD in GPU hours” as an example of the assumed compute scale.

That choice matters. If the study had used only cheap supervised fine-tuning, negative results would be easy to dismiss: perhaps the evaluation merely failed to try hard enough. Instead, the authors use OpenAI’s internal RL stack and the highest reasoning-effort setting for gpt-oss. The intent is not to predict what every user will do. It is to estimate what a sophisticated actor might plausibly extract without training a frontier model from scratch.

This also keeps the conclusion from becoming too broad. The paper does not show that no one can ever elicit more capability. It shows that under the tested high-effort setup, the measured uplift did not clearly push gpt-oss past the relevant frontier in the domains studied.

That distinction is not lawyerly hedging. It is the difference between a release decision and a law of physics.

Anti-refusal training works, which is exactly why refusal rates are insufficient

The first experiment is deliberately blunt. The authors create a “helpful-only” version of gpt-oss through incremental RL that rewards compliance with unsafe prompts. With mild hyperparameter tuning, they report that this preserves general benchmark capability, such as GPQA performance, while pushing refusal rates for unsafe prompts near 0%.

This is not the most surprising result in the paper. It is the result that makes the rest of the paper necessary.

If refusals can be removed while preserving reasoning capability, then refusal behaviour is not a stable property of an open-weight release. It is a shipped configuration. That configuration may still reduce casual misuse, and casual misuse is not irrelevant. But for serious release-risk analysis, the model after anti-refusal training is closer to the object of concern.

This is the misconception the paper quietly dismantles: an open-weight model is not safe because its official release says “no”. Once the weights are public, “no” is a training target, not a constitutional principle.

Biology improves when the bottleneck is compliance plus access to knowledge

The biology experiments are the paper’s main evidence for meaningful capability uplift.

The authors evaluate gpt-oss on several biology-related preparedness benchmarks. These include tasks intended to measure tacit biological knowledge, protocol error identification, long-form biorisk reasoning, and troubleshooting. The paper also introduces TroubleshootingBench, a new benchmark built from real-world lab protocols contributed by PhD experts.

The training setup combines several ingredients:

incremental RL on top of a near-final gpt-oss-120b checkpoint;
browser use during training, with interleaved reasoning, browsing calls, and browser responses;
open-source biology datasets such as GPQA biology, WMDP biology and chemistry, LAB-Bench cloning scenarios, and BioLP Bench;
internal biology datasets, including tacit-knowledge material and chemistry datasets;
synthetic protocol-debugging data generated by using o3 to insert errors into protocols;
anti-refusal data.

The result is not a uniform leap. It is more diagnostic than that.

On Gryphon Free Response, the released gpt-oss model scores 0.0 because it refuses to comply. After anti-refusal and domain training, performance rises sharply, with the stronger gpt-oss variants reaching the range shown for other advanced systems. This is the cleanest example of refusal removal plus domain elicitation changing measured capability.

On Biorisk Tacit Knowledge and Troubleshooting, the gpt-oss variants improve from 68.0 for the base model to 75.7 with anti-refusal, browsing, and extra biology training. OpenAI o3 with anti-refusal and browsing is reported at 76.9, so the fine-tuned gpt-oss model becomes competitive but does not clearly exceed o3.

On ProtocolQA Open Ended, gpt-oss with anti-refusal, browsing, and extra biology reaches 28.2, while o3 is reported at 39.2. The uplift exists, but the gap remains.

On TroubleshootingBench, the strongest gpt-oss variant reaches 25.7, while o3 reaches 32.2. The paper emphasises that models still fall short of expert human baselines on debugging protocols, even with targeted data.

That pattern is important. The model becomes more capable at answering textual biology questions and handling long-form biorisk-style evaluations. It remains weaker where the task demands grounded experimental troubleshooting and tacit lab judgement. Apparently, wet lab reality has not yet agreed to become a multiple-choice exam. Rude, but helpful.
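The three o3 comparisons quoted in this section can be laid side by side. The sketch below is only arithmetic on the figures reported above. The dictionary layout and the name gap_to_o3 are illustrative choices, not anything from the paper, and the calculation ignores confidence intervals.

```python
# Scores as quoted in this article: (fine-tuned gpt-oss, o3).
# The structure and names here are illustrative assumptions.
REPORTED = {
    "Biorisk Tacit Knowledge and Troubleshooting": (75.7, 76.9),
    "ProtocolQA Open Ended": (28.2, 39.2),
    "TroubleshootingBench": (25.7, 32.2),
}


def gap_to_o3(scores):
    """Return o3 minus gpt-oss for each benchmark, rounded to one decimal."""
    return {name: round(o3 - oss, 1) for name, (oss, o3) in scores.items()}


GAPS = gap_to_o3(REPORTED)
for name, gap in GAPS.items():
    print(f"{name}: o3 leads by {gap} points")
```

Read this way, the gap is narrowest on the knowledge-heavy benchmark and widest on the open-ended protocol task, which matches the pattern described above.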

The external biology checks support the direction, not a clean victory lap

The SecureBio evaluations serve a different purpose from the internal benchmarks. They are external checks, not a second thesis.

The authors evaluate gpt-oss variants on SecureBio’s Human Pathogen Capabilities Test, Molecular Biology Capabilities Test, Virology Troubleshooting, and World-Class Biology. Across these tests, the paper reports that worst-case fine-tuned gpt-oss performs similarly to o3 and about 3–5 points better than DeepSeek R1-0528 with browsing on average.

The individual numbers are mixed:

SecureBio evaluation	What it tests at a high level	Strongest gpt-oss pattern	Comparator pattern
VCT	Text-only virology troubleshooting	gpt-oss variants sit around the high 30s to low 40s	o3 is 44.8
HPCT	Knowledge for working with selected pathogens	gpt-oss variants reach the low-to-mid 50s	o3 is 49.0
MBCT	Molecular biology methods	gpt-oss variants sit around 48	o3 is 47.5
WCB	Rare, high-end biology expertise	gpt-oss reaches 49.6 with browsing and 47.9 with extra biology training	o3 is 54.8

The business interpretation should be boringly precise: external evaluation supports the claim that biology capability can be elicited upward and can approach advanced closed-model performance on some benchmarks. It does not show that gpt-oss becomes uniformly superior, nor that the release crosses a high-risk threshold on its own.

That is still operationally significant. If a release review only looked at refusal behaviour, it would miss the uplift. If it only looked at the best improved benchmark, it would overstate the danger. The paper’s value is in forcing both errors into the open.

Bio downsampling did not create an easy safety margin

The paper also tests whether pre-training data filtering created a durable safety gap.

gpt-oss had bio-related pre-training data downsampled by about a factor of two. One might expect this to make it materially weaker in biology. To probe that, the authors compare gpt-oss with o4-mini after similar anti-refusal post-training. They find that gpt-oss is not obviously off trend: its biology performance remains similar to what one would expect from its general capability level.

This is a robustness-style result. It does not prove filtering is useless. It suggests that, in this case, downsampling bio-related data did not create an obvious, easily measurable deficit in downstream biology evaluations.

For governance teams, the uncomfortable lesson is that data filtering should not be treated as a visible moat unless the evaluation shows the moat. It may still reduce certain knowledge pathways. But the paper’s evidence does not support a simple story where less bio pre-training automatically means much lower elicitable biorisk.

Consensus sampling is not a magic escalator

The paper also tests whether more inference-time compute helps on biology multiple-choice tasks through consensus@k: ask multiple agents independently and choose the most common answer.

This is a sensitivity test. Its purpose is to check whether capability is hiding behind sampling variance.

The result: increasing consensus does not further improve performance on Biorisk Tacit Knowledge and Troubleshooting. In other words, simply asking more copies of the model and voting does not unlock a new tier of biology capability on those tasks.

That matters because many operational risk scenarios are not single-shot. Real users retry. Attackers sample. Developers wrap models in orchestration. If consensus had produced large gains, release-risk analysis would need to treat inference budget as a major amplifier. In this case, the tested consensus method is not the lever.

That does not mean all inference-time strategies are harmless. It means this particular strategy did not move the measured biology frontier.
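For readers who have not met consensus@k, a minimal sketch of the voting step follows. It assumes multiple-choice answers represented as strings. The function names, the tie-breaking rule (first answer seen wins), and the toy data are all illustrative assumptions, not the paper's implementation.

```python
from collections import Counter


def consensus_at_k(answers):
    """Return the most common answer among k independent samples.

    Ties are broken in favour of the answer that appears first
    (an assumption made for this sketch).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    top = max(counts.values())
    for answer in answers:
        if counts[answer] == top:
            return answer


def consensus_accuracy(samples_per_question, gold):
    """Fraction of questions where the voted answer matches the gold label."""
    if len(samples_per_question) != len(gold):
        raise ValueError("one gold label is needed per question")
    hits = sum(
        consensus_at_k(samples) == label
        for samples, label in zip(samples_per_question, gold)
    )
    return hits / len(gold)


# Toy data, invented for illustration only: three questions, three samples each.
toy_samples = [["A", "A", "B"], ["C", "D", "D"], ["B", "B", "B"]]
toy_gold = ["A", "C", "B"]
toy_accuracy = consensus_accuracy(toy_samples, toy_gold)
print(toy_accuracy)
```

The second toy question shows why voting can plateau: when most samples agree on the same wrong answer, adding more voters reinforces the error rather than correcting it.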

Cyber risk fails for a different reason: the model cannot reliably execute

The cybersecurity section is the necessary contrast. In biology, refusal and access to knowledge are major bottlenecks. In cybersecurity, the bottleneck is more agentic.

The authors train gpt-oss in a dockerised terminal environment on 264 capture-the-flag challenges covering categories such as cryptography, reverse engineering, binary exploitation, forensics, and web exploitation. The model receives terminal tools and browser access, with safeguards to block cheating via online writeups or challenge flags. The final evaluation includes held-out professional CTFs and five cyber range environments designed to test longer action chains in emulated networks.

The results are much less dramatic than biology.

On CTFs, gpt-oss performs well on easier high-school-level tasks and improves modestly in some settings. But all gpt-oss variants are matched or dominated by o3. In the figure reported by the paper, the cyber-trained gpt-oss variant reaches 24.8 pass@12 on professional CTFs, compared with 27.7 for o3. The improvement over the base model is small and within confidence intervals.

On cyber ranges, the result is harsher. Without hints, all models achieve 0% accuracy across the light and medium environments. With hints, performance remains near zero: the reported light-with-hints scores are 0.8, 0.0, 1.7, 0.0, and 1.7 across the tested variants. The medium-with-hints environment remains 0.0 across the board.

That is not a cyber-doom result. It is an execution-bottleneck result.

The paper’s error analysis is more useful than the headline score. The authors report that even after end-to-end fine-tuning with a browser tool, the model chose not to browse 74% of the time, and browsing helped it make progress only 4% of the time. Common failures included poor time management, weak tool handling, giving up too early, failing to use hints properly, and abandoning promising lines of attack.

This is what separates “can answer cyber questions” from “can autonomously conduct cyber operations”. The latter requires sustained state, tool discipline, environmental feedback, and multi-step control. The model’s failure mode is not that it refuses to be naughty. It is that it cannot reliably keep its operational act together.

The cyber ablations mostly close off easy objections

The paper then tries additional cyber elicitation methods. These are ablations and sensitivity tests, not the main argument.

Method	Likely purpose	Result	Interpretation
Supervised fine-tuning on successful CTF traces	Test whether imitation of correct solutions boosts performance	No substantial improvement	The gap is not solved by showing the model more solved examples
Rewritten SFT data	Reduce off-policy mismatch between the data-generating model and gpt-oss	No substantial improvement	Style mismatch was not the main bottleneck
Internal best-of-k sampling	Let the model choose among multiple possible next actions	Minor gains only	Local action selection helps little if the agentic loop remains weak
pass@k scaling	Estimate whether repeated attempts can brute-force success	An estimated 367 trials to reach 75% pass@k on professional CTFs	Retry helps in sandboxed settings but is less attractive for stealthy real-world operations

The pass@k result is especially useful. The paper estimates that reaching 75% pass@k accuracy on professional CTFs would require 367 trials. That may be feasible in sandboxed search over open-source code. It is less attractive when operating on live systems where repeated failed attempts create noise, alerts, and evidence. Attackers enjoy automation. They do not enjoy glowing in every SOC dashboard like a Christmas tree.

The general point: cyber capability did not scale cleanly through the tested fine-tuning and inference tricks. That makes the result more informative than a single negative benchmark.
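To make the pass@k reasoning concrete, here is a minimal sketch. pass_at_k is the unbiased estimator commonly used in code-generation evaluations; whether the paper computes its figure this way is not stated here, so treat that as an assumption. trials_needed is a deliberately simplified single-task model that assumes independent attempts with a fixed success probability. It is not how the paper's 367-trial estimate was derived, and both function names are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n sampled attempts, c of which succeeded."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)


def trials_needed(p_single, target):
    """Smallest k such that 1 - (1 - p_single)**k >= target.

    Simplifying assumption: attempts are independent and share one
    per-attempt success probability on a single task.
    """
    if not (0.0 < p_single <= 1.0) or not (0.0 < target < 1.0):
        raise ValueError("need 0 < p_single <= 1 and 0 < target < 1")
    k = 1
    while 1.0 - (1.0 - p_single) ** k < target:
        k += 1
    return k


print(pass_at_k(n=4, c=1, k=2))   # one success in four attempts, pick two
print(trials_needed(0.1, 0.75))   # hypothetical 10% per-attempt success rate
```

Even under this generous independence assumption, low per-attempt success rates translate into many retries, which is the operational cost the paragraph above describes.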

The marginal-risk argument is the business core

The paper repeatedly distinguishes absolute risk from marginal risk.

Absolute risk asks how capable the model is in dangerous domains. Marginal risk asks how much additional capability the release adds relative to what is already available.

That distinction is crucial for open-weight governance. A model may be uncomfortable in absolute terms but still not meaningfully advance the public frontier. Conversely, a modest-looking release could matter if it fills a missing capability gap, is easier to fine-tune, runs cheaply, or combines unusually well with tools.

The paper’s marginal-risk conclusion is cautiously reassuring for gpt-oss. Fine-tuned gpt-oss improves in biology and may add some net-new biorisk capability, but the authors argue it does not significantly advance the frontier beyond existing open-weight models. In cybersecurity, the tested variants remain below o3 and far below the level needed for robust autonomous operations.
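The absolute-versus-marginal distinction reduces to a small calculation. The sketch below assumes a single benchmark score per model. RiskReading, risk_reading, the model names, and the numbers in the usage example are all hypothetical, invented purely for illustration, and are not figures from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskReading:
    absolute: float  # score of the adversarially tuned candidate
    marginal: float  # candidate minus the best already-available model


def risk_reading(candidate_score, available_scores):
    """Compare a tuned candidate against models that are already accessible."""
    if not available_scores:
        raise ValueError("need at least one already-available comparator")
    best_available = max(available_scores.values())
    return RiskReading(
        absolute=candidate_score,
        marginal=round(candidate_score - best_available, 1),
    )


# Hypothetical scores, for illustration only.
reading = risk_reading(70.0, {"open_model_a": 68.0, "open_model_b": 71.5})
print(reading)
```

A negative or near-zero marginal value with a high absolute value is the situation the paper describes for gpt-oss: capable in absolute terms, but not clearly ahead of what is already available. A real review would repeat this per benchmark and carry uncertainty intervals alongside the point estimates.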

For business readers, the key is not whether this exact conclusion applies to every future model. It does not. The key is the evaluation pattern.

A serious release decision should ask:

Governance question	Paper evidence that motivates it	Business meaning
What happens after refusals are removed?	Anti-refusal RL can push unsafe refusal rates near zero	Refusal metrics are a first screen, not a release-safety proof
What happens after domain-specific post-training?	Biology improves under RL with browsing and expert-curated data	Evaluate specialised uplift, not just general model capability
Does tool access change the risk?	Browsing helps biology more than cyber in this setup	Tool scaffolding is domain-dependent and must be tested directly
Does the model beat existing public alternatives?	gpt-oss is often near or only marginally above open-weight baselines	Marginal public harm depends on the existing model ecosystem
Are failures caused by knowledge or execution?	Cyber failures are largely agentic/tool-use failures	Mitigations and monitoring should target the actual bottleneck

This is the paper’s most transferable contribution. It turns open-weight release safety from a moral shouting match into an adversarial product-evaluation workflow. Still unpleasant. Much more useful.

What Cognaptus infers for operators

The paper directly shows that malicious fine-tuning is feasible as an evaluation method and that it changes the interpretation of release safety. It directly shows meaningful biology uplift and limited cyber uplift under the tested methods. It directly shows that gpt-oss, after MFT, remains mostly below o3 on internal evaluations and does not clearly create a large new public frontier shift.

Cognaptus infers three practical lessons.

First, model-card safety should include post-release mutability. For closed models, the provider controls most post-training and access pathways. For open-weight models, downstream adaptation is part of the product surface. Evaluating only the released checkpoint is therefore under-specified.

Second, domain-specific risk teams should separate refusal, knowledge, and agency. Biology uplift in this paper comes from removing refusals, adding browsing, and training on relevant data. Cyber weakness comes from agentic execution failures. Treating both under one generic “AI misuse” label hides the operational bottlenecks.

Third, marginal-risk analysis should be ecosystem-aware. A release does not exist in a vacuum. If existing open models already approach the same benchmark performance, the marginal effect differs from a release that creates a new public capability level. This is not an excuse to ignore absolute risk. It is a way to avoid pretending every model is the first model.

The boundaries are not decorative

The limitations are material.

The study does not maliciously fine-tune every comparator model. This likely advantages gpt-oss in some comparisons, because gpt-oss receives extra elicitation while other open models are mostly evaluated as available. The authors acknowledge that this may underestimate the true worst-case capability of those other models. That matters when interpreting marginal risk.

The evaluations are also proxies. Many biology benchmarks are benign or semi-benign measures of bottleneck skills rather than direct demonstrations of real-world harmful capability. Cyber ranges are more realistic than static CTFs but still simpler than enterprise systems. Proxy evaluations are necessary here, but they are not reality wearing a lab coat.

The training data is incomplete. Frontier capability elicitation depends heavily on data coverage and expert curation. The paper’s datasets are serious, but they do not cover every relevant biological or cyber skill. In cyber, for example, CTFs and sandbox ranges do not fully represent real-world vulnerability discovery, persistence, stealth, or organisational intrusion paths.

The scaffolds are also relatively simple. More hierarchical agents, stronger state management, domain-specific tools, multi-agent delegation, judge-based best-of-k, or ensembling could improve results. The paper tests some elicitation methods, but not the entire design space. The entire design space is where optimistic safety claims usually go to die quietly.

Finally, the paper focuses on gpt-oss-120b and the specific release context around it. The result should not be converted into a general claim that future open-weight releases are safe. The authors themselves warn that if capabilities continue to scale, even smaller open models could eventually reach high-risk capability levels.

The release-review template is more important than the release verdict

The best way to read this paper is not as a verdict on open source. It is a template for adversarial release evaluation.

That template starts with a simple correction: the model you release is not necessarily the model the world will use. Open weights allow downstream actors to alter refusals, specialise domains, add tools, and retry at scale. Any serious safety analysis has to evaluate that modified model, not just the one that appears in the press release looking freshly aligned and very well behaved.

For gpt-oss, the paper’s findings are cautiously bounded. Biology risk capability can be elicited upward, but the resulting model does not clearly leap beyond the existing frontier. Cybersecurity capability remains constrained by agentic execution failures, not by a lack of willingness to answer. The marginal release risk appears small under the tested conditions, though not zero and not permanently settled.

For operators, the conclusion is sharper than the usual “more research needed” fog machine. If your organisation releases, adopts, or regulates open-weight models, test the adversarially tuned version. Compare it against the public ecosystem. Separate refusal from capability. Separate benchmark fluency from operational agency. And do not confuse a model’s shipped manners with its post-release potential.

Open-source models do not merely ship. They mutate. Release governance has to grow up accordingly.

Cognaptus: Automate the Present, Incubate the Future.

¹ Eric Wallace, Olivia Watkins, Miles Wang, Kai Chen, and Chris Koch, “Estimating Worst-Case Frontier Risks of Open-Weight LLMs,” arXiv:2508.03153, 2025, https://arxiv.org/abs/2508.03153.
