Guardians of the Chain: How Smart-LLaMA-DPO Turns Code into Clarity

TL;DR for operators

Smart-LLaMA-DPO is not interesting because it puts another LLM badge on smart contract auditing. We have enough badges. It is interesting because it shows a credible mechanism for making an LLM behave more like a useful junior security analyst: read the contract, identify whether the vulnerability is real, locate the issue, and explain the reasoning in a way a developer can act on.

The paper’s practical lesson is simple but expensive: better AI security tools come from better domain alignment. Smart-LLaMA-DPO starts with LLaMA-3.1-8B, continues pre-training it on a large smart contract corpus, fine-tunes it on expert-reviewed vulnerability examples, then applies Direct Preference Optimization (DPO) so the model learns to prefer deeper, clearer, more context-aware explanations over technically correct but thin ones.¹

The reported results are strong. Against previous best baselines, the model improves F1 by 7.51 percentage points on reentrancy, 1.54 on timestamp dependency, 11.07 on integer overflow/underflow, and 6.06 on delegatecall. On machine-unauditable vulnerabilities, it reaches 90.7% accuracy and 83.4% F1, compared with iAudit’s 76.7% accuracy and 66.2% F1. Human evaluators also rate its explanations more positively than the strongest explanation baseline across correctness, thoroughness, and clarity.

The business meaning is not “fire the auditors.” Please do not build your security programme around a leaderboard and vibes. The more realistic value is audit acceleration: faster first-pass triage, better vulnerability explanations, fewer false positives in obvious cases, and more consistent remediation guidance. The boundary is equally clear: the system is trained and tested on Solidity, selected vulnerability families, and carefully curated data. It is a powerful assistant pattern, not a universal proof of safe autonomous auditing.

Smart contract security is a reasoning problem pretending to be a pattern problem

Smart contract auditing has an awkward feature: the code is short enough to look manageable and dangerous enough to ruin people’s week. A vulnerable contract may not fail because of one obviously suspicious token. It may fail because an external call happens before a balance update, because a timestamp influences a critical decision, because a proxy pattern mutates storage in the caller’s context, or because an oracle price is treated as reality instead of market gossip with an API.

That is why generic vulnerability detection is such a frustrating benchmark. A tool can recognise a suspicious construct and still misunderstand the exploit path. It can flag block.timestamp without asking whether the timestamp affects anything important. It can see an external call and shout “reentrancy” while missing the access-control context. This is security theatre with syntax highlighting.
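A toy model makes the ordering problem concrete. The sketch below is plain Python rather than Solidity, and nothing in it comes from the paper: `Vault`, `attack`, and the deposit figures are hypothetical names and values chosen for illustration. It shows why "external call before balance update" is a claim about sequence, not about the mere presence of a call.

```python
class Vault:
    """Toy stand-in for a contract holding deposits (hypothetical, not Solidity)."""

    def __init__(self, deposits, update_before_call):
        self.balances = dict(deposits)
        self.funds = sum(deposits.values())
        self.update_before_call = update_before_call

    def withdraw(self, user, on_receive):
        amount = self.balances.get(user, 0)
        if amount == 0 or self.funds < amount:
            return
        if self.update_before_call:
            self.balances[user] = 0   # effect first
            self.funds -= amount
            on_receive(amount)        # interaction last
        else:
            self.funds -= amount
            on_receive(amount)        # interaction first: the callee can re-enter
            self.balances[user] = 0   # effect arrives too late


def attack(vault, attacker, max_depth=10):
    """Re-enter withdraw from the receive callback; return the total paid out."""
    received = []

    def on_receive(amount):
        received.append(amount)
        if len(received) < max_depth:
            vault.withdraw(attacker, on_receive)

    vault.withdraw(attacker, on_receive)
    return sum(received)


deposits = {"attacker": 1, "victim": 4}
drained_call_first = attack(Vault(deposits, update_before_call=False), "attacker")
drained_update_first = attack(Vault(deposits, update_before_call=True), "attacker")
```

With these made-up deposits, the call-first variant pays the attacker the vault's entire 5 units, while the update-first variant pays out only the attacker's own 1. The same external call appears in both branches, so a tool that keys on the call alone cannot tell them apart.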

Smart-LLaMA-DPO is built around a more useful premise: smart contract vulnerability detection and explanation should be learned together. The detector should not merely produce a label and then bolt on a justification afterwards. The explanation is part of the security judgement. If the explanation gets the execution order wrong, the label is not enough to trust the tool.

The authors frame this through a concrete failure mode. A general LLaMA-3.1-8B-Instruct model misidentifies a reentrancy vulnerability because it misunderstands the role of external calls. A version improved through continual pre-training and supervised fine-tuning performs better on the label, but still gives a partially wrong explanation about operation order. That gap is the paper’s real target. It is not just asking whether the model can answer “vulnerable or safe.” It asks whether the model can explain the answer without quietly mangling the code path.

The mechanism: three stages, three different jobs

The paper’s method has three main training stages: continual pre-training, supervised fine-tuning, and Direct Preference Optimization. These are easy to list and easy to misunderstand. They are not three decorations on the same cake. Each stage repairs a different weakness.

Stage	What it teaches the model	Failure mode it targets	Operational consequence
Continual pre-training	Solidity syntax, contract patterns, security-critical functions, and domain semantics	Generic models treating smart contracts like ordinary code snippets	Better recognition of contract-specific structures before audit training begins
Supervised fine-tuning	Vulnerability labels, locations, and expert-reviewed explanations	Models detecting without knowing how to explain or localise	A unified detect-and-explain behaviour rather than a separate explanation afterthought
Direct Preference Optimization	Preference for higher-quality explanations over weaker but plausible ones	Thin, partially correct, or context-insensitive explanations	Better remediation guidance and fewer misleading rationales

Continual pre-training is the domain literacy layer. The authors use a corpus derived from Ethereum smart contracts: 186,397 unique smart contract instances, totalling 501.62 million tokens. They add 100,000 general-domain instances across code, mathematics, English, and Chinese text, adding 118.94 million tokens, for 620.56 million tokens in total. That mixture is sensible. A contract auditor model needs Solidity fluency, but a model that forgets general reasoning and language is just a specialised autocomplete engine wearing a hard hat.

Supervised fine-tuning is the task layer. The SFT dataset covers reentrancy, timestamp dependency, integer overflow/underflow, delegatecall, and machine-unauditable vulnerabilities. The final SFT data includes 3,390 reentrancy instances, 1,167 timestamp dependency instances, 1,013 overflow/underflow instances, 698 delegatecall instances, and 1,281 machine-unauditable instances, totalling 8.90 million tokens. Crucially, the data is not merely labelled. It includes explanations and vulnerability locations, with expert review.
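The per-category counts above imply a total and a class distribution that the text does not spell out. The sketch below only does that arithmetic; the dictionary keys and variable names are hypothetical labels, and the derived figures are computed from the article's numbers rather than quoted from the paper.

```python
# Instance counts and token total as stated in the text above.
SFT_INSTANCES = {
    "reentrancy": 3390,
    "timestamp_dependency": 1167,
    "overflow_underflow": 1013,
    "delegatecall": 698,
    "machine_unauditable": 1281,
}
SFT_TOKENS = 8.90e6


def category_shares(counts):
    """Return each category's share of all instances, in percent."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


total_instances = sum(SFT_INSTANCES.values())
avg_tokens_per_instance = SFT_TOKENS / total_instances
shares = category_shares(SFT_INSTANCES)

for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:22s} {share:5.1f}%")
```

The counts sum to 7,549 instances, with reentrancy making up roughly 45% of them and delegatecall under 10%. That imbalance is worth keeping in mind when comparing per-category results later.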

DPO is the judgement layer. The DPO dataset contains paired outputs: a preferred explanation and a rejected explanation. The rejected examples are not necessarily nonsense. They are intentionally suboptimal: correct enough to be tempting, but shallow, incomplete, or less useful. For reentrancy, a weaker explanation may identify an external call without analysing execution order or inheritance-based risk. For timestamp dependency, it may flag timestamp use without checking whether it affects state transitions or critical randomness. For delegatecall, it may notice the call but ignore storage layout risk in proxy patterns.

That detail matters. In real audit work, the bad answer is not always ridiculous. The dangerous answer is often half-right.
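For readers who want the objective behind the "judgement layer", here is a minimal sketch of the standard DPO loss for a single preference pair. The formula is the general published DPO objective; the log-probabilities and `beta=0.1` are made-up illustration values, not the paper's hyperparameters, which this article does not report.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Illustrative values only: the policy has not moved from the reference yet.
loss_at_start = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# The policy now favours the preferred explanation over the rejected one.
loss_after_shift = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The loss falls as the policy raises the preferred explanation's likelihood relative to the reference while lowering the rejected one's. When the rejected answer is half-right rather than absurd, the two responses share most of their content, so the signal concentrates on the parts where they differ.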

The dataset is the product, inconveniently

The paper is nominally about a model, but the durable asset is the data construction process. Smart-LLaMA-DPO uses Qwen2.5-72B-Instruct and Mistral-Large-Instruct-2407-123B to generate initial explanations. Llama-3.1-70B-Instruct scores those explanations on correctness, thoroughness, and clarity using a weighted composite score: correctness at 0.6, thoroughness at 0.3, and clarity at 0.1. High-scoring explanations then go to human experts for review and correction.
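The weighted composite is easy to state in code. In the sketch below the three weights come from the text above; the 1-to-4 score range, the function name, and the two example explanations are assumptions added for illustration.

```python
# Weights as stated in the text; everything else here is illustrative.
WEIGHTS = {"correctness": 0.6, "thoroughness": 0.3, "clarity": 0.1}


def composite_score(scores):
    """Weighted sum of per-dimension scores for one candidate explanation."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


# Hypothetical judge scores on an assumed 1-to-4 scale.
fluent_but_wrong = {"correctness": 1, "thoroughness": 2, "clarity": 4}
plain_but_right = {"correctness": 4, "thoroughness": 3, "clarity": 2}

ranked = sorted(
    [("fluent_but_wrong", composite_score(fluent_but_wrong)),
     ("plain_but_right", composite_score(plain_but_right))],
    key=lambda item: item[1],
    reverse=True,
)
```

Because correctness carries 0.6 of the weight, the plain but correct explanation outranks the polished wrong one by a wide margin (about 3.5 against 1.6 with these made-up scores). That is the behaviour you want from a filter that decides what reaches the human reviewers.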

The human review pipeline is unusually important here. The authors use eight senior PhD students specialising in blockchain, split into balanced groups with both industry-audit and academic experience. They also add an external review stage with two senior blockchain and smart contract developers. For machine-unauditable vulnerabilities, explanations are checked against audit reports. Disagreements can trigger third-expert arbitration.

That is not cheap. It is also the point.

Most organisations discussing AI security tooling still want the model to be the magic. The paper suggests the opposite. The expensive part is encoding domain judgement into data: what counts as a real vulnerability, what explanation is sufficient, what remediation-relevant context cannot be omitted, and what merely sounds smart.

This is where the “LLMs will replace auditors” reading collapses. The model improves because experts structure the training signal. The auditors do not disappear. They move upstream into the system design, annotation, preference construction, evaluation, and escalation workflow. A less glamorous pitch, admittedly, but also less likely to get a protocol exploited by a chatbot with confidence issues.

The main evidence: strong detection gains across selected vulnerability families

The main benchmark compares Smart-LLaMA-DPO against rule-based tools, neural network methods, pre-trained model approaches, and LLM-based baselines. The vulnerability families are not random: reentrancy, timestamp dependency, integer overflow/underflow, and delegatecall are common and security-relevant Solidity categories, while machine-unauditable vulnerabilities cover cases that often require contextual reasoning.

The four traditional vulnerability results are the cleanest part of the evidence.

Vulnerability type	Smart-LLaMA-DPO accuracy	Smart-LLaMA-DPO F1	Previous best baseline	F1 gain vs previous best
Reentrancy	94.47%	88.50%	DMT	+7.51 percentage points
Timestamp dependency	95.54%	96.43%	DMT	+1.54 percentage points
Overflow/underflow	94.65%	88.29%	DMT	+11.07 percentage points
Delegatecall	94.12%	84.85%	PSCVFinder	+6.06 percentage points

The magnitude varies. Timestamp dependency is already high for the strongest baseline, so the gain is modest. Overflow/underflow shows a larger jump, and the ablation results later explain why: domain pre-training seems especially important for understanding Solidity arithmetic, type behaviour, and version-specific protections.
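Because the gains are quoted in percentage points, the previous best F1 for each category can be recovered by subtraction. The sketch below does that arithmetic. The baseline figures it produces are implied by the table above rather than quoted from the paper, and the function names are hypothetical.

```python
# F1 scores and gains (percentage points) as given in the table above.
TABLE = {
    "reentrancy": {"f1": 88.50, "gain_pp": 7.51},
    "timestamp_dependency": {"f1": 96.43, "gain_pp": 1.54},
    "overflow_underflow": {"f1": 88.29, "gain_pp": 11.07},
    "delegatecall": {"f1": 84.85, "gain_pp": 6.06},
}


def implied_baseline_f1(row):
    """Previous best F1 implied by the reported F1 and its gain."""
    return round(row["f1"] - row["gain_pp"], 2)


def relative_gain_percent(row):
    """Gain expressed relative to the implied baseline, in percent."""
    baseline = row["f1"] - row["gain_pp"]
    return 100 * row["gain_pp"] / baseline


for name, row in TABLE.items():
    print(f"{name:22s} baseline={implied_baseline_f1(row):6.2f} "
          f"relative gain={relative_gain_percent(row):5.1f}%")
```

Read this way, the timestamp baseline was already close to 95% F1, which leaves little headroom, while the overflow/underflow baseline sat near 77%. A percentage-point gain and a relative gain are different quantities, and vendor claims should say which one they mean.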

The machine-unauditable (MU) results are more strategically interesting. These include price oracle manipulation, erroneous accounting, ID uniqueness violations, inconsistent state updates, privilege escalation, atomicity violations, and contract implementation-specific vulnerabilities. On the aggregate MU task, Smart-LLaMA-DPO reaches 90.7% accuracy and 83.4% F1. iAudit, the strongest comparison baseline in the paper's results table, reaches 76.7% accuracy and 66.2% F1.

That result should be read carefully. It does not prove the model can catch every novel logic exploit in the wild. It does show that preference-aligned, domain-trained models can perform well on vulnerability classes that are less amenable to simple pattern matching. That is the part security teams should care about.

The ablation tells us what each component is really doing

The ablation study is not a side quest. It is the mechanism check.

The authors compare the full model against variants without DPO, without continual pre-training, without both, and with chain-of-thought prompting. The purpose is to separate three effects: domain knowledge, preference alignment, and prompted reasoning.

Test	Likely purpose	What it supports	What it does not prove
Remove DPO	Ablation	Preference learning materially improves several vulnerability categories, especially reentrancy and machine-unauditable vulnerabilities	That DPO alone is sufficient without strong data or domain pre-training
Remove CPT	Ablation	Domain pre-training is critical for categories requiring Solidity semantics, especially overflow/underflow	That more pre-training data always improves security reasoning
Remove both CPT and DPO	Ablation	The full pipeline is not just ordinary SFT wearing a nicer name	That the architecture is optimal
Add CoT prompting	Sensitivity / exploratory extension	CoT helps machine-unauditable vulnerabilities but can hurt simpler pattern-like tasks	That chain-of-thought should be exposed or relied on in production

The category-specific results are revealing.

For reentrancy, removing DPO drops F1 from 88.50% to 73.47%. Removing CPT drops it to 77.78%. That suggests the issue is not only knowing Solidity; it is learning fine distinctions between strong and weak explanations of execution order. Reentrancy is often a story about sequence. DPO teaches the model which sequence stories are actually useful.

For overflow/underflow, removing CPT is brutal: F1 falls from 88.29% to 53.01%. Removing DPO is much less damaging, with F1 at 81.75%. That points to a different mechanism. Arithmetic vulnerabilities depend heavily on Solidity version behaviour, type conversions, SafeMath usage, and unchecked blocks. Domain literacy matters more than preference nuance.

For machine-unauditable vulnerabilities, removing DPO lowers F1 from 83.41% to 71.22%, while removing CPT lowers it to 80.60%. This is exactly the pattern one would expect when the problem is not just recognising syntax but weighing contextual reasoning quality. Price oracle manipulation is not solved by spotting the word “oracle.” Sadly, computers do not become risk managers by keyword search. Some humans have tried; results remain mixed.
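The three ablation comparisons above can be laid side by side. The sketch below uses only the F1 figures quoted in the preceding paragraphs; the dictionary keys and function names are hypothetical.

```python
# F1 scores (percent) quoted in the text: full model, without DPO, without CPT.
ABLATION_F1 = {
    "reentrancy": {"full": 88.50, "no_dpo": 73.47, "no_cpt": 77.78},
    "overflow_underflow": {"full": 88.29, "no_dpo": 81.75, "no_cpt": 53.01},
    "machine_unauditable": {"full": 83.41, "no_dpo": 71.22, "no_cpt": 80.60},
}


def ablation_drops(scores):
    """F1 lost (percentage points) when each component is removed."""
    return {
        variant: round(scores["full"] - value, 2)
        for variant, value in scores.items()
        if variant != "full"
    }


def most_damaging_removal(scores):
    """Name of the ablation that costs the most F1 for this category."""
    drops = ablation_drops(scores)
    return max(drops, key=drops.get)


summary = {
    category: (most_damaging_removal(scores), ablation_drops(scores))
    for category, scores in ABLATION_F1.items()
}
```

Removing DPO costs the most for reentrancy and machine-unauditable vulnerabilities (about 15 and 12 points), while removing CPT costs the most for overflow/underflow (about 35 points). These are single-component removals, so the drops should not be read as additive contributions.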

The chain-of-thought result is also nicely inconvenient. Adding CoT improves machine-unauditable F1 from 83.41% to 85.19%, but slightly hurts timestamp dependency and overflow/underflow. That is useful operationally. It suggests longer reasoning prompts may help complex logic analysis but can overcomplicate simpler checks. In a production audit assistant, CoT-style workflows should probably be conditional, not universal.
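One way to make CoT conditional is to route on the suspected vulnerability category. The sketch below is an assumption about how a production assistant might do that, not something the paper describes: the prompt wording, category names, and `build_prompt` are all hypothetical. Only the machine-unauditable category is routed to CoT, because that is the only category for which the text above reports a CoT gain.

```python
# Categories where step-by-step prompting is enabled (hypothetical policy).
COT_CATEGORIES = frozenset({"machine_unauditable"})

BASE_INSTRUCTION = (
    "Audit the following Solidity contract. State whether it is vulnerable, "
    "where, and why."
)
COT_INSTRUCTION = (
    " Reason step by step about state changes, external calls, and price "
    "or accounting assumptions before giving the verdict."
)


def build_prompt(contract_source, suspected_category):
    """Attach the CoT instruction only for categories in COT_CATEGORIES."""
    instruction = BASE_INSTRUCTION
    if suspected_category in COT_CATEGORIES:
        instruction += COT_INSTRUCTION
    return f"{instruction}\n\n{contract_source}"


direct_prompt = build_prompt("contract A {}", "timestamp_dependency")
cot_prompt = build_prompt("contract A {}", "machine_unauditable")
```

The routing set is a deliberate, reviewable piece of configuration. Expanding it should follow measured results per category, not a general belief that longer reasoning is always better.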

Explanation quality is not decoration; it is part of the safety claim

The paper evaluates explanation quality using both LLM-based evaluation and human evaluation on correctness, thoroughness, and clarity. The evaluation sample contains 1,061 examples, with roughly 200 samples per vulnerability type.

The automated evaluator is Llama-3.1-70B-Instruct, chosen after comparison on 40 samples where it matched human expert judgement 90% of the time, the same as GPT-4 Turbo and ahead of GPT-4o, Qwen2.5-72B-Instruct, and GPT-3.5-Turbo in that validation. That does not make LLM-as-judge perfect. It makes it a screened instrument, which is better than asking a random model to grade its cousin’s homework.

The human evaluation is more important. Five experienced smart contract security experts each spend eight hours assessing one vulnerability category, for 40 hours total. They score explanations on the same 4-point scale used in the automated evaluation.

Combined positive human ratings (scores of 3 or 4) for Smart-LLaMA-DPO and iAudit are:

Dimension	Smart-LLaMA-DPO positive human rating	iAudit positive human rating
Correctness	81.15%	70.22%
Thoroughness	83.88%	72.76%
Clarity	94.63%	83.98%

The clarity score is especially high, but correctness and thoroughness are the more meaningful dimensions for security work. A beautifully written wrong answer is still wrong, just with better typography.
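The "positive rating" metric in the table is simply the share of scores at 3 or above on the 4-point scale. A minimal sketch follows, with a made-up list of ratings; the function name and sample data are hypothetical and are not drawn from the paper's evaluation set.

```python
def positive_rate(ratings, threshold=3):
    """Percentage of ratings at or above the threshold on a 1-to-4 scale."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    if any(r not in (1, 2, 3, 4) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 4")
    positives = sum(1 for r in ratings if r >= threshold)
    return 100 * positives / len(ratings)


# Hypothetical correctness ratings for eight explanations.
sample_ratings = [4, 3, 2, 4, 1, 3, 3, 4]
sample_positive_rate = positive_rate(sample_ratings)
```

Collapsing a 4-point scale into positive and negative hides the difference between a 3 and a 4, so teams reproducing this kind of evaluation may want to keep the full distribution alongside the headline percentage.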

The explanation result matters because enterprise buyers should not evaluate audit assistants only on detection metrics. A tool that flags issues but cannot explain them wastes senior reviewer time. A tool that explains them badly creates remediation risk. The best version of this technology is not an oracle. It is a disciplined explainer that helps humans move faster without hiding the uncertainty.

The case study shows contextual judgement, not omniscience

The paper’s case study compares Smart-LLaMA-DPO with GPT-4o and iAudit on practical examples. Its purpose is not to prove the full benchmark again. It illustrates how the model handles contextual distinctions that can trip other systems.

In one FundManager contract, GPT-4o identifies a reentrancy risk because a refund function updates balance after an external call. iAudit also flags reentrancy but reportedly misunderstands the balance update sequence. Smart-LLaMA-DPO gives a more nuanced reading: the pattern is non-standard, but the onlyOwner modifier changes the practical exploitability.

In the LockYourLove contract, GPT-4o and iAudit warn about possible block.timestamp manipulation. Smart-LLaMA-DPO recognises that timestamp use is only for logging and does not affect critical operations or contract logic. That is the difference between “I saw a risky token” and “I understood what this token does here.”

In a LendingContract example, GPT-4o and Smart-LLaMA-DPO identify price oracle manipulation as the root cause: reliance on the Uniswap pool’s instant price. iAudit focuses on secondary access-control issues. The paper uses this to argue that iAudit’s separated detection-and-explanation design can create inconsistencies.

This is the practical centre of the article. Smart contract auditing often turns on whether a theoretical vulnerability is actually exploitable in context. Smart-LLaMA-DPO appears better at that distinction across the paper’s tested cases. That is valuable. It is not the same as proving the model can safely make final audit decisions.

Business value: cheaper diagnosis, not cheaper responsibility

For Web3 audit firms, exchanges, protocol teams, and infrastructure providers, the immediate business value is not replacing audit teams. It is compressing the early stages of review.

A useful Smart-LLaMA-DPO-style assistant could sit in four places:

Workflow point	What the system can plausibly improve	Business value	Human role remains
Pre-audit triage	Identify likely vulnerability categories and suspicious regions	Faster intake and prioritisation	Confirm scope and severity
Developer remediation	Generate clearer explanations tied to code locations	Fewer back-and-forth cycles	Validate fix correctness
Internal security review	Catch common and contextual Solidity risks before external audit	Lower audit friction and earlier defect removal	Decide release readiness
Training and knowledge transfer	Explain why a pattern is risky or safe in context	Better junior developer learning	Maintain security standards

The most credible ROI is reviewer leverage. Senior smart contract auditors are expensive because they combine code reading, exploit reasoning, and economic context. If an assistant can pre-sort obvious cases, draft explanations, and identify likely exploit paths, experts spend more time on judgement and less on clerical diagnosis.

There is also a product lesson beyond blockchain. Many enterprise risk problems have the same shape: domain-specific code or rules, high cost of false confidence, and explanations that must be actionable. SQL security, Bash automation, infrastructure-as-code, data pipeline validation, and robotics scripts all have similar “execution matters” properties. The authors explicitly suggest that the method could extend to other domain-specific languages by adapting the corpus and annotation guidelines.

Cognaptus inference: the transferable asset is not the Solidity model itself. It is the pipeline pattern.

1. Collect domain-specific execution artefacts.
2. Build expert-reviewed labels, locations, and explanations.
3. Construct preference pairs where weak explanations are plausible but incomplete.
4. Train the model to prefer the explanation a domain expert would actually use.
5. Evaluate not only labels, but explanation quality and workflow usefulness.
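The preference-pair step in the pipeline above is mostly a data-modelling exercise. The sketch below shows one plausible record shape, assuming the prompt/chosen/rejected layout that preference-training tools commonly accept. The class, field names, and example text are hypothetical and do not reproduce the paper's actual data format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One training example: a prompt plus a preferred and a weaker answer."""

    prompt: str
    chosen: str
    rejected: str

    def __post_init__(self):
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        if self.chosen.strip() == self.rejected.strip():
            raise ValueError("chosen and rejected must differ")


def to_jsonl(pairs):
    """Serialise pairs as one JSON object per line."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in pairs)


# Illustrative content only; the rejected answer is plausible but thin.
example = PreferencePair(
    prompt="Does withdraw() in this contract allow reentrancy? <source here>",
    chosen=("Yes. The external call runs before the balance is zeroed, so a "
            "fallback function can call withdraw() again and be paid twice."),
    rejected="Yes. The function makes an external call, which is risky.",
)
jsonl_text = to_jsonl([example])
```

The validation is intentionally minimal. In practice, the hard part is the annotation guideline that decides which answer counts as chosen, and that lives with the domain experts, not in the schema.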

That is the part organisations should copy. Preferably before procurement asks whether the same result can be achieved by “just prompting GPT-4o harder.” Sometimes, yes. Here, the ablations suggest no.

Boundaries: where the paper stops short

The paper’s own threats to validity are refreshingly relevant.

First, Smart-LLaMA-DPO occasionally produces redundant output. The authors address this by truncating generated text to retain the initial portion. That is not catastrophic, but it matters. In security work, verbosity can hide uncertainty and waste reviewer attention. Post-processing and output discipline are not cosmetic features; they affect usability.
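The article does not say how the truncation is implemented, so the sketch below is one assumed approach, not the authors' method: keep sentences from the start of the output and stop at the first exact repeat or at a length cap. The function name, the sentence-splitting rule, and the cap of 12 are all hypothetical.

```python
import re


def truncate_redundant(text, max_sentences=12):
    """Keep the initial portion of text, stopping at the first repeated sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    seen = set()
    kept = []
    for sentence in sentences:
        # Normalise whitespace and case so trivial variations count as repeats.
        key = re.sub(r"\s+", " ", sentence).strip().lower()
        if not key:
            continue
        if key in seen or len(kept) >= max_sentences:
            break
        seen.add(key)
        kept.append(sentence)
    return " ".join(kept)


raw_output = (
    "The call precedes the update. An attacker can re-enter. "
    "The call precedes the update. An attacker can re-enter."
)
cleaned_output = truncate_redundant(raw_output)
```

A rule like this only catches verbatim loops. Paraphrased repetition would need a similarity measure, and any truncation should be logged so reviewers can see what was cut.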

Second, the system depends heavily on annotated data. The data pipeline uses multiple LLMs, expert review, rewritten preference pairs, and external validation. That creates quality, but also cost. An organisation trying to reproduce the approach for a new domain cannot simply download a base model and sprinkle “DPO” over the problem. The annotation standard is part of the technology.

Third, the vulnerability scope is limited. The model covers four major traditional vulnerability types and seven machine-unauditable categories. That is useful coverage, not universal coverage. A production deployment would need escalation paths for unfamiliar vulnerability classes, protocol-specific economic risks, cross-chain interactions, governance attacks, and design-level failures that are not visible from isolated contract code.

Fourth, the system is Solidity-focused. The authors argue the method could extend to Bash, SQL, SysML, or other domain-specific languages. That is plausible as an architectural claim, but not demonstrated by the experiments. Each new domain would require its own corpus, taxonomy, annotation guide, preference data, and evaluation protocol. Tedious? Yes. Also known as engineering.

Finally, the benchmark does not eliminate the need for adversarial testing. A model trained on expert preferences can still be brittle against novel exploit styles, intentionally obfuscated code, incomplete project context, or economically subtle attacks. Security products should treat Smart-LLaMA-DPO-like models as evidence-generating assistants, not final authorities.

The real lesson: preference training turns expertise into behaviour

The paper’s contribution is not that LLaMA-3.1-8B can be fine-tuned for smart contract security. That would be a useful but unsurprising result. The sharper contribution is that expert preference data can improve the model’s security judgement where labels alone are too blunt.

A vulnerability label says whether the answer is right. A preferred explanation says what kind of right answer is useful.

That distinction matters in business. Enterprises do not merely need models that output classifications. They need systems that produce decisions people can inspect, challenge, and turn into action. Smart-LLaMA-DPO points toward that future in a narrow but serious domain: Solidity smart contract auditing.

The model is not a replacement for auditors. It is a demonstration that the boring middle layer — curated data, expert review, preference pairs, ablations, explanation evaluation — is where AI security tools become operationally credible. Less glamour, more plumbing. As usual, the plumbing is what keeps the building from smelling terrible.

For protocol teams, the immediate takeaway is to use tools like this as audit accelerators, not audit substitutes. For AI builders, the lesson is broader: if you want domain models that behave well, stop worshipping parameter count and start designing the training signal. The chain does not need a louder guardian. It needs one that understands why the alarm is ringing.

Cognaptus: Automate the Present, Incubate the Future.

¹ Lei Yu et al., “Smart-LLaMA-DPO: Reinforced Large Language Model for Explainable Smart Contract Vulnerability Detection,” arXiv:2506.18245, 2025.
