Steering by the Token: How GrAInS Turns Attribution into Alignment

TL;DR for operators

GrAInS is not “fine-tuning, but cheaper.” That framing misses the point and commits the usual business sin of turning a mechanism into a procurement slogan.

The paper’s useful claim is more specific: token-level attribution can be converted into an inference-time steering signal. Instead of retraining model weights, GrAInS identifies which text or image tokens most strongly push the model toward preferred or dispreferred outputs, builds layer-wise steering vectors from those activation shifts, and applies normalized edits during inference.¹

For operators, this matters because many deployment failures are local and behavioural: hallucinated visual details, toxic completions, weak contextual faithfulness, or overconfident false answers. A full fine-tune may be too slow, too costly, too broad, or simply too annoying. Prompting may be too shallow. GrAInS points to a middle path: behaviour-specific activation steering built from small steering sets, roughly 50 examples per objective in the paper’s experiments.

The evidence is encouraging but not magical. On Llama-3.1-8B, GrAInS improves TruthfulQA accuracy from 34.15 to 47.37. On LLaVA-1.6-7B, it reduces MMHal-Bench hallucination rate from 0.624 to 0.514. On SPA-VL, it improves LLaVA-1.6-7B preference win rate from 40.24% to 48.35%. It also tends to preserve general capability better than blunter steering baselines, with small drops on MMLU and MMMU where some competing methods fall off a small cliff and then pretend the view is nice.

The operational boundary is equally important. GrAInS requires access to gradients, hidden activations, and model internals. It is therefore not a black-box API trick. The paper shows benchmark gains on mostly 7B–12B open models; it does not yet prove production latency, monitoring stability, regulatory sufficiency, or robustness under adversarial business traffic.

The mistake is treating steering as one big shove

A customer-support model gives the wrong refund policy. A visual inspection assistant invents an object that is not in the image. A document QA system answers fluently while ignoring the supplied context. The standard operating theatre is familiar: add a prompt rule, bolt on a filter, fine-tune on a correction set, or ask the model more sternly to “be accurate,” because apparently politeness was the missing architecture.

The problem is that many steering methods treat the model as if one global behavioural direction were enough. Find a vector for “truthful,” “safe,” or “non-toxic,” push the hidden states in that direction, and hope the collateral damage stays polite. Sometimes that works. Sometimes it also dulls reasoning, distorts otherwise-correct answers, or overcorrects cases that were not broken.

GrAInS starts from a different question: which input tokens actually caused the model to prefer the bad behaviour?

That is the small hinge in the paper. The method does not merely ask whether a response is desirable or undesirable. It asks which pieces of the input carry the most influence over that preference, then uses the model’s own internal response to those pieces to construct the intervention.

In a language model, those pieces are text tokens. In a vision-language model, they may be text tokens or visual patch tokens. That distinction matters because multimodal failures are rarely evenly distributed across modalities. A hallucinated answer may be driven by a visual patch, by a misleading question phrase, or by the model’s learned bias in connecting the two. Steering all tokens equally is operationally convenient. It is also a wonderful way to move the wrong furniture.

GrAInS turns diagnosis into intervention

The paper’s mechanism has three stages.

1. Compute token-level attribution with contrastive Integrated Gradients.
2. Convert attribution-driven activation changes into layer-wise steering vectors.
3. Apply those vectors at inference time with normalization.

The high-level flow is simple:

Input + preferred/rejected outputs → contrastive attribution over tokens → positive and negative influential token sets → masked contrastive inputs → activation deltas by layer → PCA steering vectors → normalized inference-time edit

The important word is “contrastive.” GrAInS does not just ask, “Which tokens support this answer?” It compares a preferred output against a dispreferred one. In the toxicity setting, the preferred output might be non-toxic while the rejected output is toxic. In the hallucination setting, the preferred output is grounded while the rejected output invents something. The attribution objective is therefore aligned with a behavioural contrast, not a naked likelihood score.

This is useful because likelihood alone can be a poor proxy for alignment. A model may confidently assign probability to a bad completion because it is fluent, familiar, or statistically tempting. The business translation is blunt: plausible nonsense is still nonsense, just wearing a better suit.

Integrated Gradients then estimates each token’s contribution to the contrastive preference. Tokens with positive attribution push the model toward the preferred output. Tokens with negative attribution push it toward the rejected output. The paper uses these signed scores to select top influential positive and negative tokens.

The authors justify Integrated Gradients over vanilla gradients because vanilla gradients can suffer from saturation: when a model is already confident, the immediate gradient may look small even if the input feature is important. Integrated Gradients accumulates gradient information along a path from a baseline to the actual input, which tends to produce more stable attribution. The paper later tests this choice in an ablation, rather than leaving it as interpretability folklore.
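To make the saturation argument concrete, here is a minimal sketch of contrastive Integrated Gradients on a toy scoring function. Everything in it is an assumption made for illustration: the `score` function is a saturating stand-in for "preferred minus rejected", the zero baseline and the step count are arbitrary choices, and the gradient is written by hand. A real setup would use autograd through the actual model, and the paper's exact objective may differ.

```python
import numpy as np

def score(x, w_pos, w_neg):
    # Toy contrastive objective (assumption): a saturating "preferred" score
    # minus a saturating "rejected" score over sum-pooled token embeddings.
    pooled = x.sum(axis=0)
    return np.tanh(pooled @ w_pos) - np.tanh(pooled @ w_neg)

def score_grad(x, w_pos, w_neg):
    # Hand-written gradient of `score` with respect to every token embedding.
    pooled = x.sum(axis=0)
    g = (1 - np.tanh(pooled @ w_pos) ** 2) * w_pos \
        - (1 - np.tanh(pooled @ w_neg) ** 2) * w_neg
    return np.broadcast_to(g, x.shape)

def integrated_gradients(x, w_pos, w_neg, baseline=None, steps=256):
    # Average the gradient along a straight path from baseline to input
    # (midpoint Riemann sum), then scale by the input-baseline difference.
    baseline = np.zeros_like(x) if baseline is None else baseline
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean(
        [score_grad(baseline + a * (x - baseline), w_pos, w_neg) for a in alphas],
        axis=0,
    )
    return ((x - baseline) * avg_grad).sum(axis=1)  # one signed score per token

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))              # 5 hypothetical tokens, 8-dim embeddings
w_pos = 0.3 * rng.normal(size=8)
w_neg = 0.3 * rng.normal(size=8)

attr = integrated_gradients(x, w_pos, w_neg)
gap = score(x, w_pos, w_neg) - score(np.zeros_like(x), w_pos, w_neg)
print(attr, attr.sum(), gap)
```

The per-token scores should sum to roughly the gap between the score at the input and the score at the baseline. That completeness property is what a single local gradient lacks when the model is already saturated.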

Positive and negative tokens are not symmetric decoration

One subtle strength of GrAInS is that it keeps both sides of attribution.

A simpler steering design might identify “bad” tokens and suppress them. Another might identify “good” tokens and amplify them. GrAInS instead separates positive and negative attribution, masks each group, observes how hidden states shift, and then constructs a contrastive steering direction from those shifts.
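The split-then-mask step can be sketched as follows. This is an illustration only: the attribution scores are made up, the `[MASK]` placeholder and the helper names `split_by_attribution` and `mask_tokens` are hypothetical, and the article does not specify how the paper masks tokens internally.

```python
import numpy as np

def split_by_attribution(scores, k):
    # Return indices of the k most positive and k most negative tokens,
    # keeping only tokens whose sign actually matches the group.
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)                      # ascending
    neg = [int(i) for i in order[:k] if scores[i] < 0]
    pos = [int(i) for i in order[::-1][:k] if scores[i] > 0]
    return pos, neg

def mask_tokens(tokens, indices, mask_token="[MASK]"):
    # Replace the selected positions with a placeholder (an assumed choice).
    hidden = set(indices)
    return [mask_token if i in hidden else t for i, t in enumerate(tokens)]

tokens = ["Is", "the", "red", "car", "parked", "outside", "?"]
scores = [0.02, -0.01, -0.40, 0.35, 0.10, -0.05, 0.00]   # made-up values

pos, neg = split_by_attribution(scores, k=2)
masked_pos = mask_tokens(tokens, pos)   # input with positive tokens hidden
masked_neg = mask_tokens(tokens, neg)   # input with negative tokens hidden
print(masked_pos)
print(masked_neg)
```

Keeping the two masked inputs separate is the point: each one yields its own activation shift, so the "good" and "bad" influences are never averaged into a single blur.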

That matters because model behaviour is not a single dial. The representation of “avoid hallucination” is not necessarily just the inverse of “produce grounded answer.” In practical terms, reducing undesirable behaviour and strengthening desirable behaviour may touch different regions of the representation space.

The paper’s Figure 3 supports this intuition through a token ablation test. Removing highly negative tokens substantially increases the model’s preference for aligned outputs; removing highly positive tokens has the opposite effect. This test is not the main performance result. Its purpose is diagnostic: it checks whether the signed attributions are behaviourally meaningful before building steering vectors from them.

That distinction matters. Not every chart in a paper is trying to win the argument. Some charts are checking whether the machinery deserves to exist.

PCA makes local failures reusable

Once GrAInS has selected positive and negative token groups, it creates modified inputs by masking them separately. It then compares hidden activations from these modified inputs against the original input.

For each transformer layer, the method extracts activation shifts associated with positive and negative attribution. Across many examples, those shifts will be noisy. A token that matters in one TruthfulQA question will not look identical to a token that matters in another. So the method applies PCA to extract a stable principal direction from the activation deltas.

This is the part where a local diagnosis becomes a reusable control vector.
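A minimal sketch of that extraction step, assuming the activation deltas for one layer and one attribution polarity have already been stacked into a matrix. The data below is synthetic, the helper name `top_principal_direction` is hypothetical, and the sign convention (point along the mean delta) is an assumption rather than something the article attributes to the paper.

```python
import numpy as np

def top_principal_direction(deltas):
    # Standard PCA via SVD on mean-centred deltas; returns a unit vector.
    deltas = np.asarray(deltas, dtype=float)
    mean = deltas.mean(axis=0)
    _, _, vt = np.linalg.svd(deltas - mean, full_matrices=False)
    direction = vt[0]
    # PCA leaves the sign arbitrary; orient it along the mean shift (assumption).
    if direction @ mean < 0:
        direction = -direction
    return direction

# Synthetic deltas: 50 examples sharing one direction at varying strength,
# plus small isotropic noise. Not data from the paper.
rng = np.random.default_rng(0)
true_dir = np.zeros(16)
true_dir[0] = 1.0
strengths = rng.normal(2.0, 1.0, size=(50, 1))
deltas = strengths * true_dir + 0.1 * rng.normal(size=(50, 16))

v = top_principal_direction(deltas)
print(round(float(v @ true_dir), 3))
```

In a layer-wise setup the same function would simply be called once per layer and per polarity. The noisy per-example shifts collapse into one direction per call, which is what makes the vector reusable.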

The paper uses 50 samples per dataset to construct these vectors. For LLM steering, computing Integrated Gradients, extracting hidden states, and constructing vectors on 50 TruthfulQA samples takes about 96 seconds on an RTX A6000-48G GPU. For VLM steering on 50 SPA-VL samples, the corresponding runtime is about 302 seconds. These are reported as fixed setup costs, not per-query costs.

For an operator, the appeal is obvious: build a steering vector for a behaviour, then apply it at inference without retraining the model. The caveat is equally obvious, though slightly less marketable: you need internal access. If your model provider gives you only a remote text-generation API, GrAInS is not something you casually staple on during lunch.

Normalization is the quiet anti-amputation device

Activation steering can be crude. Add too much in the wrong direction and the model may become safer in the same way an unplugged laptop becomes cyber-secure.

GrAInS tries to avoid that by normalizing the edited hidden activations to preserve their original representational scale. The steering vector is added, controlled by a strength parameter, and then the resulting activation is rescaled. The design goal is not only to move behaviour but to avoid distorting the activation magnitude so much that downstream computation becomes brittle.
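The add-then-rescale idea can be sketched in a few lines. This assumes the rescaling restores each hidden vector's original L2 norm, which is one natural reading of "preserve their original representational scale"; the function name `steer` and the toy inputs are hypothetical.

```python
import numpy as np

def steer(hidden, direction, strength):
    # Shift each hidden vector along `direction`, then rescale it back to
    # its original L2 norm so downstream layers see a familiar magnitude.
    hidden = np.asarray(hidden, dtype=float)
    shifted = hidden + strength * direction
    orig_norm = np.linalg.norm(hidden, axis=-1, keepdims=True)
    new_norm = np.linalg.norm(shifted, axis=-1, keepdims=True)
    return shifted * orig_norm / np.maximum(new_norm, 1e-12)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 16))        # 4 hypothetical token activations
direction = np.zeros(16)
direction[0] = 1.0                        # stand-in unit steering vector

steered = steer(hidden, direction, strength=4.0)
print(np.linalg.norm(hidden, axis=-1))
print(np.linalg.norm(steered, axis=-1))
```

The norms match before and after, while every vector's projection onto the steering direction goes up. The edit changes where the activation points, not how loud it is.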

This helps explain why the paper cares about MMLU and MMMU. Safety, truthfulness, and hallucination metrics are the target behaviours. General reasoning benchmarks are the collateral-damage check. The authors argue that GrAInS produces targeted behavioural shifts without broad capability loss.

That is a stronger claim than “the benchmark went up.” It says the intervention is selective enough to improve the target behaviour while preserving much of the base model’s general competence.

The main results are behaviour correction, not model redemption

The paper evaluates GrAInS on language-only models and vision-language models.

For LLMs, the tested models are Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct. The tasks cover truthfulness, toxicity, and contextual faithfulness: TruthfulQA, Toxigen, and FaithEval. The baselines include LoRA and inference-time steering methods such as ICV, NL-ITI, and CAA.

LLM task	Model	Base	GrAInS	Practical interpretation
TruthfulQA accuracy	Llama-3.1-8B	34.15	47.37	Large gain on truthfulness-oriented multiple-choice QA
Toxigen accuracy	Llama-3.1-8B	51.19	60.98	Better avoidance of toxic outputs in the evaluated setup
FaithEval accuracy	Llama-3.1-8B	68.00	70.94	Smaller but still positive gain on contextual faithfulness
TruthfulQA accuracy	Qwen2.5-7B	51.41	59.85	Improvement transfers across another 7B-class model
Toxigen accuracy	Qwen2.5-7B	55.04	62.12	Positive toxicity-control result across architecture
FaithEval accuracy	Qwen2.5-7B	59.89	64.77	Stronger contextual faithfulness gain than on Llama

These are main evidence results. They are not ablations and not merely implementation checks. They support the core claim that attribution-guided steering can outperform both LoRA and several steering baselines on the evaluated behaviours.

For VLMs, the tested models are LLaVA-1.6-7B, Qwen2.5-VL-7B-Instruct, and Gemma-3-12B-IT. The tasks are MMHal-Bench for hallucination and SPA-VL for visual safety preference alignment.

VLM task	Model	Base	GrAInS	Practical interpretation
MMHal-Bench hallucination rate	LLaVA-1.6-7B	0.624	0.514	Lower is better; meaningful reduction in hallucinated visual responses
MMHal-Bench hallucination rate	Qwen2.5-VL-7B	0.523	0.473	Improvement, though LoRA is slightly lower at 0.461 in the table
MMHal-Bench hallucination rate	Gemma-3-12B-IT	0.468	0.442	Best result among listed methods
SPA-VL chosen > rejected	LLaVA-1.6-7B	40.24	48.35	Higher probability assigned to preferred safe responses
SPA-VL chosen > rejected	Qwen2.5-VL-7B	53.21	58.90	Strong preference-alignment gain
SPA-VL chosen > rejected	Gemma-3-12B-IT	49.32	53.51	Positive but smaller gain

The VLM results are especially relevant because they test the paper’s multimodal premise. If visual and textual tokens contribute unevenly, then a joint attribution method should beat vision-only or text-only variants. The paper later checks exactly that.

There is one nuance worth preserving. On Qwen2.5-VL-7B for MMHal-Bench, LoRA reports a hallucination rate of 0.461, slightly better than GrAInS at 0.473. So the clean statement is not “GrAInS beats every baseline in every cell.” The stronger and more accurate statement is that it performs consistently well across models and tasks while also preserving general capability better than broader steering baselines. Accuracy is more useful when it survives contact with the table.

The preservation tests are the real procurement filter

A steering method that improves safety by degrading general reasoning is not alignment. It is behavioural amputation with nicer branding.

The paper therefore evaluates general capabilities using MMLU for LLMs and MMMU for VLMs. These are robustness or preservation tests, not the primary target benchmarks. Their purpose is to detect whether steering damages broader competence.

Capability check	Model	Base	GrAInS	Notable contrast
MMLU accuracy	Llama-3.1-8B	69.27	69.15	CAA drops to 51.49
MMLU accuracy	Qwen2.5-7B	74.58	74.29	CAA drops to 62.91
MMMU accuracy	LLaVA-1.6-7B	35.81	34.92	Small drop
MMMU accuracy	Qwen2.5-VL-7B	58.64	58.13	CAA drops to 41.51

This is where GrAInS looks operationally interesting. Plenty of interventions can change behaviour. Fewer can do it without leaving fingerprints all over unrelated capabilities.
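The contrast is easier to see as point drops from the base model. The snippet below only does arithmetic on the figures in the table above; the LLaVA-1.6-7B row is left out because the table gives no CAA number for it, and the helper name `capability_drops` is hypothetical.

```python
# (model, benchmark, base, GrAInS, CAA), copied from the table above.
rows = [
    ("Llama-3.1-8B", "MMLU", 69.27, 69.15, 51.49),
    ("Qwen2.5-7B", "MMLU", 74.58, 74.29, 62.91),
    ("Qwen2.5-VL-7B", "MMMU", 58.64, 58.13, 41.51),
]

def capability_drops(rows):
    # Points lost relative to the base model, rounded to two decimals.
    return {
        model: {"GrAInS": round(base - grains, 2), "CAA": round(base - caa, 2)}
        for model, _bench, base, grains, caa in rows
    }

drops = capability_drops(rows)
for model, d in drops.items():
    print(f"{model}: GrAInS -{d['GrAInS']:.2f}, CAA -{d['CAA']:.2f}")
```

On these three rows GrAInS loses well under one point each time, while CAA loses between roughly 11 and 18.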

The paper also reports BLEU accuracy on generation quality checks. GrAInS performs strongly there too: 47.91% on TruthfulQA generation for Llama-3.1-8B and 54.09% for Qwen2.5-7B, both above the corresponding base models. For VLMs on SPA-VL generation, it reaches 46.79% on LLaVA-1.6-7B and 53.02% on Qwen2.5-VL-7B.

The business meaning is not that BLEU is suddenly the soul of truth. It is that the intervention does not merely win multiple-choice safety tests while making the model linguistically worse. Again: useful, not divine.

The ablations explain why the mechanism works

The paper’s ablation section is important because it tests the design choices behind the method. These are not separate product claims. They are mechanism checks.

Test	Likely purpose	What it supports	What it does not prove
Integrated Gradients vs vanilla gradients, SmoothGrad, random token selection	Ablation	Token attribution quality matters; IG gives the best average result in the tested LLM setting	IG will always be best for every model, domain, or latency budget
Joint vision-text attribution vs vision-only and text-only	Ablation	Multimodal steering benefits from selecting influential tokens across modalities	Every VLM failure requires both modalities
Preference-based objective vs likelihood objective	Ablation	Preference contrast gives stronger attribution signals for alignment-style tasks	Explicit preference data is always available or always reliable
Steering strength analysis	Sensitivity test	There is an optimal intervention magnitude and overcorrection can occur	One universal steering strength works across architectures
Token count analysis	Sensitivity test	A small number of highly influential tokens works best; too many tokens can dilute attribution	The same top-k value generalizes to all use cases

The Integrated Gradients ablation is particularly clean. On Llama-3.1-8B across TruthfulQA, Toxigen, and FaithEval, the average score rises from 52.81 with random selection to 55.36 with vanilla gradients, 58.17 with SmoothGrad, and 59.75 with Integrated Gradients. That progression supports the central premise: steering improves when the selected tokens are not arbitrary.

The modality ablation is equally relevant for VLM deployment. On SPA-VL, the full joint method averages 53.63 across LLaVA-1.6-7B and Qwen2.5-VL-7B, compared with 51.38 for vision-only and 50.36 for text-only. That does not mean every example needs both modalities. It means the method benefits from not deciding in advance which modality deserves blame.

A small mercy, really. The model gets to be wrong in a context-sensitive way.

The business value is targeted correction, not cheaper training alone

The obvious business pitch is cost: no fine-tuning, small steering sets, reusable vectors. That is true but incomplete.

The deeper value is diagnostic control. GrAInS creates a pipeline where the same attribution signal that identifies influential tokens also guides the intervention. This has three operational consequences.

First, it supports behaviour-specific remediation. A team can imagine separate steering objectives for toxicity suppression, visual hallucination reduction, or contextual faithfulness. Each vector is tied to a behavioural contrast, not a vague aspiration like “be better.” This is already an improvement over many enterprise AI roadmaps.

Second, it offers a more inspectable intervention path. The selected tokens and attribution polarity give engineers something to audit before and after steering. That does not make the method fully interpretable in the strong mechanistic sense, but it is more transparent than applying a global vector and hoping no one asks what it means.

Third, it reduces the cost of experimentation. The paper’s setup costs are measured in seconds to minutes on an A6000-class GPU for steering-vector construction, rather than the longer cycle of fine-tuning. For product teams, that means steering variants could be tested as part of model behaviour QA, provided they have infrastructure access.

A practical deployment framework would split each claim into three buckets: what the paper directly shows, what operators can reasonably infer from it, and where the boundary sits.

Category	What the paper directly shows	Cognaptus inference for operators	Boundary
Behaviour improvement	Benchmark gains on truthfulness, toxicity, faithfulness, hallucination, and visual safety preference tasks	GrAInS can be tested as a targeted behaviour patch for known failure modes	Benchmarks are not production traffic
Capability preservation	Small MMLU/MMMU degradation relative to broader steering baselines	Targeted steering may reduce collateral damage compared with blunt activation edits	Preservation must be re-tested for each domain and model
Cost profile	Steering vectors built from 50 examples with reported setup runtimes of 96 seconds for TruthfulQA and 302 seconds for SPA-VL	Faster experimentation than fine-tuning in accessible open-model environments	Requires gradient and activation access
Interpretability link	Attribution identifies influential positive and negative tokens used to construct vectors	Better debugging loop than opaque prompt tweaks	Attribution is still an approximation, not legal evidence
Multimodal handling	Joint text-vision attribution beats modality-specific variants on SPA-VL	Useful for VLM workflows where failures may come from image patches, text prompts, or their interaction	Does not solve all grounding or perception errors

This is where the paper is most useful for businesses deploying AI in constrained workflows: support QA, compliance review, document-grounded assistance, safety moderation, and visual question answering. In those settings, the organisation often knows the failure class before it knows the fix. GrAInS offers a way to turn a labelled behavioural contrast into an internal steering mechanism.

Where the method is not yet an enterprise control plane

GrAInS should not be oversold as a complete alignment layer.

The first boundary is access. The method needs gradients, hidden states, and the ability to intervene in activations at inference time. Many commercial model APIs do not expose that. For enterprises using hosted black-box systems, GrAInS is more relevant as a design pattern or vendor capability than as a direct implementation.

The second boundary is tuning. The appendix shows that steering strength matters. For Llama-3.1-8B on TruthfulQA, performance improves up to a moderate steering strength before degrading slightly. Token count also matters: using too many attributed tokens can dilute the signal. Operationally, this means GrAInS is not a set-and-forget compliance switch. It needs validation curves.
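One way an operator might turn a validation curve into a decision is sketched below: sweep the steering strength, discard any setting that costs more general capability than the team will tolerate, and keep the best remaining target score. The sweep numbers are placeholders invented for illustration and are not results from the paper; the selection rule and the name `pick_strength` are likewise assumptions.

```python
def pick_strength(sweep, base_capability, max_drop):
    # sweep: list of (strength, target_score, capability_score) tuples.
    # Keep settings whose capability loss stays within budget, then take
    # the one with the highest target-behaviour score.
    admissible = [row for row in sweep if base_capability - row[2] <= max_drop]
    if not admissible:
        return None
    return max(admissible, key=lambda row: row[1])[0]

# Placeholder numbers for illustration only; NOT results from the paper.
sweep = [
    (0.0, 40.0, 70.0),
    (2.0, 44.0, 69.9),
    (4.0, 47.0, 69.6),
    (8.0, 46.0, 66.0),
]

best = pick_strength(sweep, base_capability=70.0, max_drop=0.5)
strict = pick_strength(sweep, base_capability=70.0, max_drop=0.05)
print(best, strict)
```

The same loop works for token count. The useful discipline is that the capability budget is set before anyone looks at the target metric.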

The third boundary is data. The method benefits from preferred and dispreferred response pairs. The paper shows that a likelihood-based objective can still work when explicit preferences are unavailable, but the preference-based loss performs better. In business environments, preference data may be noisy, political, or under-specified. “Preferred” is not always a fact. Sometimes it is just the loudest stakeholder with a spreadsheet.

The fourth boundary is scope. The paper evaluates mostly 7B–12B open models on established benchmarks. It does not demonstrate long-horizon agent behaviour, tool-use workflows, adversarial prompt resistance, domain-specific legal reliability, or monitoring under distribution shift. Those are not minor details. They are where many AI deployments go to acquire their expensive personalities.

The article’s real lesson: interpretability becomes more valuable when it acts

The most interesting part of GrAInS is not that it uses Integrated Gradients. Nor is it that it beats LoRA or steering baselines in several benchmark settings. The interesting part is the conversion step: attribution is no longer just an explanation after the model misbehaves. It becomes the raw material for intervention.

That is a useful shift. Interpretability work often gets trapped in dashboards, saliency maps, and post-hoc narratives that help humans feel informed while the model continues doing whatever it was going to do anyway. GrAInS moves attribution into the control loop. Diagnose the influential tokens. Measure their activation effects. Build a steering direction. Apply it carefully.

For operators, the takeaway is not “replace fine-tuning.” It is “add a finer instrument to the behaviour-control stack.” Prompting still matters. Fine-tuning still matters. Evaluation still matters. Guardrails still matter. But between prompt-level persuasion and weight-level surgery, there is room for inference-time interventions that are targeted, inspectable, and reversible.

That room is where GrAInS belongs.

Not a moral compass. Not a silver bullet. More like a scalpel with a telemetry feed. In this industry, that is already an upgrade.

Cognaptus: Automate the Present, Incubate the Future.

¹ Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, and Mohit Bansal, “GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs,” arXiv:2507.18043, 2025, https://arxiv.org/abs/2507.18043.