Fine-Tuning Without Fine-Tuning: How Fints Reinvents Personalization at Inference Time

Memory is a useful product feature until it becomes a junk drawer.

That is the quiet problem behind many “personalized” AI systems. A user has a history. The system retrieves some of it. The model receives a longer prompt. The output becomes, in theory, more personal. In practice, the assistant often behaves like someone who read your old emails in a hurry and decided this was the same as knowing you.

The paper behind Fints takes a different route. It does not ask the model to read more user context. It does not train one adapter per user. It does not generate several candidate answers and hire a reward model to pick the most flattering one. It changes the model’s internal activations at inference time, using user-history-derived steering vectors selected for the specific query being answered.¹

That sounds like a small architectural trick. It is not. It changes where personalization lives.

In the old design, personalization sits in the prompt, in a fine-tuned adapter, or in a ranking layer after generation. In Fints, personalization becomes a temporary, query-specific intervention inside the forward pass. The model’s weights remain frozen. The user’s history becomes a dictionary of steering directions. At runtime, the system selects the relevant directions, aggregates them, and injects them into the model’s hidden states.

The business translation is simple enough: Fints is not selling “better memory.” It is proposing a different runtime stack for personalization.

Naturally, the devil is in the activation hook. The devil has been busy.

Personalization usually breaks in exactly the places businesses need it most

The paper groups existing LLM personalization methods into three families.

Prompt-based methods retrieve user profiles or past interactions and place them into the context window. These are flexible and training-free, but they are constrained by context length and by the model’s ability to actually use the retrieved evidence. More context is not the same as better personalization. It is often just a larger pile of instructions for the model to half-follow.

Parametric adaptation methods, such as LoRA-style personalization, move the adaptation into model parameters. One variant trains a shared adapter across users; another trains user-specific adapters. This can capture deeper behavioural patterns, but it creates operational friction. User-specific adapters require training data, storage, loading logic, update cycles, and the assumption that preferences change more slowly than actual humans do. A bold assumption, certainly.

Reward-model approaches personalize by scoring multiple candidate outputs and selecting the best one. They can work in stationary settings, but they add inference cost and become awkward in large action spaces, such as web-agent function calling, where the number of possible actions and parameters grows quickly.

Fints targets the regimes where those approaches are least comfortable:

| Failure regime | Why it matters in products | Why conventional methods struggle |
| --- | --- | --- |
| Sparse user history | New users, low-frequency users, enterprise users with limited logged interactions | Fine-tuning lacks enough data; prompting has little signal to retrieve |
| Fast-changing preferences | Assistants, content tools, shopping agents, workflow agents, and learning systems where user intent shifts by session | Offline adapters update slowly; reward models assume a more stable preference surface |
| Heterogeneous user behaviour | The same user behaves differently across tasks, moods, domains, or contexts | A single user adapter can blur multiple behavioural modes together |
| Long interaction histories | Mature users generate too much history to fit cleanly into context | Retrieval and summarisation become lossy filters |

The misconception worth killing early is that personalization is mainly a choice between “put more history in the prompt” and “fine-tune a personal model.” Fints makes the choice less binary. It still needs user logs. It still needs a steering-vector preparation stage. It still needs access to model internals. But it does not require the model to carry personalization as either text baggage or permanent parameter updates.

Its central move is more surgical: infer a relevant behavioural direction, then push the model’s internal representation in that direction for this request.

Fints turns user history into steering directions, not prompt stuffing

The Fints pipeline has two stages: offline preparation and online inference.

Offline, the system builds a steering-vector dictionary for each user. For a historical user sample, it creates a positive prompt using relevant user context and a negative prompt using irrelevant context sampled from other users. Both are fed through the frozen LLM in teacher-forcing mode. Fints then compares the internal activations produced by the positive and negative versions.

The difference between those activations becomes a steering vector.

That vector is not a fine-tuned parameter. It is not a retrieved paragraph. It is a direction in representation space: roughly, “when the model sees this kind of task with this kind of personal context, move the hidden state this way rather than that way.”
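The contrastive step can be sketched in a few lines. This is a minimal illustration, not the paper’s implementation: the function names are hypothetical, the activations are toy numbers standing in for hooked hidden states, and averaging over token positions is an assumed pooling choice that the paper may handle differently.

```python
from typing import List, Sequence

Vector = List[float]


def mean_activation(token_activations: Sequence[Sequence[float]]) -> Vector:
    """Average per-token activations into one vector (assumed pooling choice)."""
    n = len(token_activations)
    dim = len(token_activations[0])
    return [sum(tok[d] for tok in token_activations) / n for d in range(dim)]


def steering_vector(pos_acts: Sequence[Sequence[float]],
                    neg_acts: Sequence[Sequence[float]]) -> Vector:
    """Steering direction = pooled positive activation minus pooled negative one."""
    pos = mean_activation(pos_acts)
    neg = mean_activation(neg_acts)
    return [p - q for p, q in zip(pos, neg)]


# Toy activations (hypothetical values), two tokens with three hidden dimensions.
positive = [[1.0, 2.0, 0.0], [3.0, 2.0, 2.0]]  # prompt with relevant user context
negative = [[0.0, 1.0, 1.0], [2.0, 1.0, 1.0]]  # prompt with context from other users

v = steering_vector(positive, negative)
print(v)
```

In a real pipeline the two activation lists would come from forward passes of the frozen model over the positive and negative prompts, and one such vector would be stored per historical sample.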

The paper grounds this in the broader idea of activation steering: if certain model behaviours correspond to directions in latent space, then adding a vector to an intermediate activation can bias generation toward a target behaviour without changing the model’s weights.

The basic intervention is conceptually straightforward:

$$ h' = h + \alpha v $$

where $h$ is an intermediate activation, $v$ is the steering vector, and $\alpha$ controls steering strength. Fints extends this idea from a single generic steering direction to a user- and instance-specific personalization pipeline.
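As a sanity check on the formula, here is a minimal sketch of the additive intervention on plain Python lists. The function name `steer` and the numbers are illustrative assumptions; a real system would apply the same operation to tensors inside the forward pass.

```python
from typing import List


def steer(h: List[float], v: List[float], alpha: float) -> List[float]:
    """Return h' = h + alpha * v, element by element."""
    if len(h) != len(v):
        raise ValueError("activation and steering vector must have the same size")
    return [hi + alpha * vi for hi, vi in zip(h, v)]


h = [0.5, -1.0, 2.0]   # toy intermediate activation
v = [1.0, 0.0, -1.0]   # toy steering vector

steered = steer(h, v, alpha=0.5)
unchanged = steer(h, v, alpha=0.0)  # alpha = 0 switches steering off
print(steered, unchanged)
```

Setting `alpha` to zero recovers the unmodified activation, which is a convenient way to compare steered and unsteered behaviour on the same request.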

This distinction matters. A static steering vector says, “make the model more formal” or “make the model more positive.” Fints says, “for this user and this query, select the historical signals most relevant to the current request, then steer the model accordingly.”

That is personalization as runtime routing, not as permanent modification.

Fine-grained hooking separates attention and MLP signals

The first technical contribution is fine-grained hooking. Instead of extracting one whole-layer activation signal, Fints separately hooks activations from the attention block and the MLP block.

This is more than implementation detail. In transformer blocks, attention and MLP components play different roles. Attention is heavily involved in routing information across tokens and contexts. MLP layers are often associated with feature transformation and stored patterns. If user preference signals are distributed unevenly across these components, treating the whole layer as one undifferentiated object can blur useful information.

Fints therefore stores two kinds of steering signal: one from attention and one from MLP. During inference, it can apply a “Pulse” after attention and a “Re-Pulse” after the MLP. The naming is not subtle. The mechanism is.
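To make the two injection points concrete, the sketch below runs a toy residual block with optional steering after each sub-layer. Everything here is an assumption for illustration: the stand-in `attn` and `mlp` functions, the choice to add the vector to the sub-layer output before the residual sum, and a single shared `alpha`. In a real serving stack this would more likely be done with framework-level forward hooks on the actual attention and MLP modules.

```python
from typing import Callable, List, Optional

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def scale(a: Vector, s: float) -> Vector:
    return [x * s for x in a]


def steered_block(x: Vector,
                  attn: Callable[[Vector], Vector],
                  mlp: Callable[[Vector], Vector],
                  v_attn: Optional[Vector] = None,
                  v_mlp: Optional[Vector] = None,
                  alpha: float = 1.0) -> Vector:
    """One residual block with optional steering after attention and after the MLP."""
    attn_out = attn(x)
    if v_attn is not None:
        attn_out = add(attn_out, scale(v_attn, alpha))  # "Pulse"
    x = add(x, attn_out)

    mlp_out = mlp(x)
    if v_mlp is not None:
        mlp_out = add(mlp_out, scale(v_mlp, alpha))     # "Re-Pulse"
    return add(x, mlp_out)


# Toy stand-ins for the real sub-layers (hypothetical).
def toy_attn(x: Vector) -> Vector:
    return scale(x, 0.5)


def toy_mlp(x: Vector) -> Vector:
    return [max(0.0, xi) for xi in x]


x0 = [1.0, -2.0]
baseline = steered_block(x0, toy_attn, toy_mlp)
attn_only = steered_block(x0, toy_attn, toy_mlp, v_attn=[1.0, 1.0])
both = steered_block(x0, toy_attn, toy_mlp, v_attn=[1.0, 1.0], v_mlp=[0.0, 1.0])
print(baseline, attn_only, both)
```

Keeping the two vectors as separate optional arguments mirrors the variants compared in the paper: attention-only, MLP-only, or both.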

The paper’s main results support the usefulness of this design, but with a nuance that a clean marketing summary would probably sand off. Fine-grained attention-plus-MLP steering performs best on the two content-generation tasks: headline generation and abstract writing. It reaches ROUGE-1 / ROUGE-L of 0.1816 / 0.1666 on headline generation and 0.3990 / 0.2306 on abstract writing. Those are the strongest content-generation results in the reported table.

On PersonalWAB, the personalized web function-calling benchmark, the best reported accuracy among Fints variants comes from attention-only steering at 0.8588, while attention-plus-MLP scores 0.8522. That does not invalidate the fine-grained idea. It tells us the useful steering channel may depend on the task. Style-heavy generation and action-selection benchmarks may not need the same intervention pattern.

This is exactly the kind of result business readers should care about. “Fints wins” is less useful than “the right steering site depends on what the product is doing.”

Input-aware aggregation is the part that makes it instance-tailored

The second technical contribution is input-aware aggregation.

A user’s history can contain multiple behavioural modes. The same person might write concise headlines, verbose technical summaries, polite customer messages, and chaotic shopping instructions. Training one adapter on all of it risks averaging those modes into a single mushy identity. Very human, but not always useful.

Fints avoids that by selecting steering vectors at inference time. For a new query, it computes semantic similarity between the query and the keys in the user’s steering-vector dictionary. It retrieves the top-$K$ most relevant steering vectors and aggregates them. The paper tests mean aggregation and attentive aggregation, with the latter assigning stronger weights to more relevant historical samples.
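A rough sketch of the retrieve-then-aggregate step follows. It assumes cosine similarity over key embeddings and a softmax over similarity scores for the attentive variant; the paper’s exact similarity function and weighting scheme may differ, and all names and numbers here are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Entry = Tuple[Vector, Vector]  # (key embedding, steering vector)


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query_key: Vector, dictionary: List[Entry], k: int) -> List[Tuple[float, Vector]]:
    """Return the k (similarity, steering vector) pairs most similar to the query."""
    scored = [(cosine(query_key, key), vec) for key, vec in dictionary]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def mean_aggregate(selected: List[Tuple[float, Vector]]) -> Vector:
    """Unweighted average of the selected steering vectors."""
    dim = len(selected[0][1])
    return [sum(vec[d] for _, vec in selected) / len(selected) for d in range(dim)]


def attentive_aggregate(selected: List[Tuple[float, Vector]],
                        temperature: float = 1.0) -> Vector:
    """Softmax-weighted average: more similar samples get larger weights (assumed form)."""
    top = max(score for score, _ in selected)
    exps = [math.exp((score - top) / temperature) for score, _ in selected]
    total = sum(exps)
    dim = len(selected[0][1])
    return [sum((e / total) * vec[d] for e, (_, vec) in zip(exps, selected))
            for d in range(dim)]


# Toy per-user dictionary: three historical samples.
user_dictionary: List[Entry] = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([0.0, 1.0], [0.0, 1.0]),
    ([1.0, 1.0], [0.5, 0.5]),
]

selected = top_k([1.0, 0.0], user_dictionary, k=2)
mean_vec = mean_aggregate(selected)
att_vec = attentive_aggregate(selected)
print(mean_vec, att_vec)
```

With this toy data the attentive result leans further toward the best-matching sample than the plain mean does, which is the intended effect of relevance weighting.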

This is the mechanism that justifies calling Fints “instance-tailored.” Personalization is not only per user; it is per user, per query.

The distinction is operationally important. A per-user adapter says, “This is Oliver’s model.” A Fints-style runtime says, “For this query, these parts of Oliver’s history are relevant; steer using those.” The second version is more compatible with users whose preferences change across contexts, which is most users, because apparently humans remain inconveniently non-stationary.

The paper’s main table shows input-aware aggregation outperforming naive aggregation across the three benchmark settings. On headline generation, ROUGE-1 improves from 0.1737 to 0.1768. On abstract writing, ROUGE-1 improves from 0.3946 to 0.3982. On PersonalWAB, accuracy improves from 0.8458 to 0.8544.

These are not enormous absolute jumps. They are evidence that relevance-weighted historical selection matters. In personalization systems, that is often the difference between “using memory” and “using the right memory.”

The main evidence says Fints is competitive, but not magically dominant in every subcase

The paper evaluates Fints on three tasks:

News headline generation, used as short personalized content generation.
Abstract writing, used as long personalized content generation.
PersonalWAB, used as personalized web function calling.

The setup uses 200 users for each dataset, with earlier interactions as training or steering material and later interactions as test data. The metrics are ROUGE for generation and accuracy for web function calling. Reward-model baselines are excluded because they require generating multiple answers, which would introduce extra token cost and make comparison less clean.

Here is the compact business reading of the main evidence:

| Result area | What the paper directly shows | Business interpretation | Boundary |
| --- | --- | --- | --- |
| Main benchmark comparison | Fints variants outperform prompt-based and PEFT baselines on the reported tasks | Activation steering can be a viable personalization layer, not just a research toy | Benchmarks use ROUGE and accuracy proxies, not live user satisfaction |
| Fine-grained hooking | Attention-plus-MLP is strongest on content generation; attention-only is strongest on PersonalWAB accuracy | Different product tasks may require different steering placements | No universal “best hook” should be assumed |
| Input-aware aggregation | Relevant-vector aggregation beats naive aggregation | Query-specific personalization is more useful than treating user history as one blob | Similarity retrieval quality becomes a production dependency |
| Sparse data | Fints remains strong when only a small fraction of user data is used | Useful for cold-start and low-history users | Tested under controlled dataset subsampling, not messy live onboarding |
| Heterogeneous distribution | Fints performs well across sampled sub-populations | Useful when the same user has multiple behavioural modes | Sub-populations are constructed from benchmark clusters, not real-time preference drift |
| Latency | Fints adds overhead but avoids per-user checkpoint loading | Potentially attractive where adapter management is costly | Latency tested on 200 samples per dataset on a single NVIDIA H100 |

The headline numbers are respectable. On headline generation, the strongest non-Fints ROUGE-1 baseline is OPPU at 0.1779, while attention-plus-MLP Fints reaches 0.1816. On abstract writing, Fints reaches 0.3990 ROUGE-1 and 0.2306 ROUGE-L, ahead of the reported prompt and PEFT baselines. On PersonalWAB, the best Fints accuracy is 0.8588, ahead of the strongest non-Fints results: around 0.8415 from in-context learning with $k=3$ and 0.8402 from PER-PCS.

The magnitude should be interpreted with care. These are benchmark gains, not a guarantee that a customer-support bot will suddenly become beloved by finance teams. ROUGE measures overlap with reference text. Accuracy on a web-agent benchmark measures correct function calling. Both are useful proxies. Neither is the same as retention, conversion, trust, or reduced escalation rate.

Still, the pattern is meaningful: Fints is not just saving training cost while losing quality. In these tests, it often improves quality while avoiding user-specific fine-tuning.

The robustness tests are sensitivity evidence, not a second thesis

The paper’s later experiments are best read as robustness and sensitivity tests.

The heterogeneous-distribution experiment examines whether methods hold up when user behaviour is not uniform. The authors identify sub-populations in the data, visualize clusters using t-SNE, sample points near cluster centres, and evaluate methods across those sub-populations. The purpose is not to prove every possible form of preference drift. It is to test whether personalization methods degrade when user data contains distinguishable behavioural modes.

Fints performs well across the sampled sub-populations. The whole-layer Fints variant is particularly strong in the reported heterogeneous headline-generation results, reaching ROUGE-1 values of 0.1762, 0.2047, and 0.1766 across the three sub-populations. That is better than the listed LoRA-style baselines in those subgroups.

The interpretation is not “Fints solves preference drift.” It is more precise: because Fints selects and injects steering signals at inference time, it is less locked into a single offline user representation. That makes it structurally better suited to heterogeneous histories than a user adapter trained as one blended object.

The data-efficiency experiment is another sensitivity test. The authors reduce the amount of personalized data used by each method. In the sparsest condition, 5% of user data corresponds to an average of 5.82 samples. Under that setting, Fints variants maintain ROUGE-1 scores around 0.392 to 0.393 on the abstract-writing setup, while OPPU reports 0.3764 and PER-PCS reports 0.3600.

The paper describes this as evidence that Fints remains effective with fewer than ten personalized logs. That is a strong practical signal. But it is not the same as saying no onboarding problem remains. The user still needs some history. The system still needs to build contrastive prompts. The steering dictionary still needs to be useful. Cold start is reduced, not abolished.

The latency result is promising, but it is not a full production cost model

The overhead test samples 200 datapoints from each dataset and compares direct inference with Fints on a single NVIDIA H100.

The reported latency increases are:

| Dataset | Direct inference | With Fints | Approximate relative overhead |
| --- | --- | --- | --- |
| Headline generation | 63.23s | 72.49s | ~14.6% |
| Abstract writing | 670.34s | 712.36s | ~6.3% |
| PersonalWAB | 1161.88s | 1343.89s | ~15.7% |
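The relative-overhead column follows directly from the two timing columns. The short check below recomputes it from the reported numbers; the variable and function names are illustrative only.

```python
# Reported timings in seconds: (direct inference, with Fints).
latency_seconds = {
    "Headline generation": (63.23, 72.49),
    "Abstract writing": (670.34, 712.36),
    "PersonalWAB": (1161.88, 1343.89),
}


def relative_overhead(direct: float, steered: float) -> float:
    """Extra time as a fraction of the direct-inference time."""
    return (steered - direct) / direct


overheads = {
    name: round(100 * relative_overhead(direct, steered), 1)
    for name, (direct, steered) in latency_seconds.items()
}

for name, pct in overheads.items():
    print(f"{name}: +{pct}%")
```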

The paper attributes overhead mainly to retrieving relevant steering vectors and inserting the aggregated steering signal during the forward pass. That is believable: Fints avoids gradient updates and per-user checkpoint loading, but it does add retrieval and activation-intervention logic.

For business use, this result should be read as an implementation-detail test with strategic implications. It suggests Fints may be cheaper and more flexible than maintaining many user-specific adapters. It does not yet prove that a production deployment would be cheap at massive scale. Retrieval index size, user-history growth, batching, model-serving architecture, privacy constraints, and GPU memory behaviour all matter.

The best business conclusion is therefore restrained: Fints moves cost from training and checkpoint management into inference-time retrieval and activation control. That trade-off may be highly attractive in dynamic personalization settings. It is not free. Nothing useful ever is.

Where this could matter commercially

Fints is most relevant where three conditions appear together: users have some behavioural history, preferences shift by context, and personalization needs to happen quickly.

That points to several product categories.

AI assistants are the obvious case. A personal or enterprise assistant may need to adapt tone, format, tool choices, and decision preferences without retraining a personal model. Prompt memory can help, but it becomes brittle when the history is large or contradictory. A Fints-like system could select the relevant behavioural traces for each request and steer the model internally.

Content tools are another natural fit. Headline generation and abstract writing are not random benchmark choices. They test whether a model can adapt to style and structure. Marketing, editorial, research, and professional-writing systems often need exactly that: not just “write well,” but “write in the way this user, brand, analyst, or department tends to write.”

Web agents are the more operationally interesting case. PersonalWAB tests personalized function calling and parameter formation. In business environments, web and workflow agents need to infer not only what action is valid, but which action is appropriate for a user’s preferences. A procurement assistant, travel agent, CRM assistant, or internal workflow bot may need to learn user-specific defaults without turning every preference into a hard-coded rule.

The possible operational pathway looks like this:

| Step | Technical action | Business meaning |
| --- | --- | --- |
| Collect history | Store user interactions, outcomes, and relevant context | Build the raw material for personalization |
| Build steering dictionary | Generate contrastive prompts and extract activation differences | Convert behavioural history into reusable internal steering signals |
| Retrieve per query | Select top-$K$ relevant steering vectors | Avoid treating the user as one static profile |
| Inject at inference | Apply Pulse and Re-Pulse steering inside the frozen model | Personalize without retraining or adapter swapping |
| Monitor quality | Track task success, user corrections, latency, and safety failures | Decide whether steering improves actual product outcomes |

This is where Cognaptus would separate the paper’s result from the business inference.

The paper directly shows improved benchmark performance against selected prompt and PEFT baselines under controlled settings. The business inference is that inference-time steering could reduce the operational burden of personalization for assistants, content systems, and agents. What remains uncertain is whether the same gains survive real user feedback loops, privacy constraints, adversarial inputs, multi-session drift, and messy enterprise integration.

A less glamorous sentence, yes. Also a more useful one.

The main implementation risk is that personalization moves inside the model

Fints avoids one class of complexity and introduces another.

It avoids per-user fine-tuning, user-specific checkpoint storage, and slow retraining cycles. That is attractive. But it requires access to intermediate activations. It requires choosing injection layers and steering strengths. It requires maintaining user-level steering dictionaries. It requires a serving stack that can intervene during the forward pass.

This makes Fints easier to imagine for teams controlling their own open-weight model infrastructure than for teams relying only on black-box hosted APIs. If the model provider does not expose activation hooks, the core mechanism is unavailable. One can imitate the idea with prompting or external memory, but that is no longer Fints.

There are also privacy implications. Fints does not store raw user history in the prompt at inference time, which may reduce some exposure risks. But the steering dictionary is still derived from user data. It can encode behavioural traces. It must be governed, expired, audited, and secured like any other personalization asset. “Not in the prompt” does not mean “not sensitive.” A small technical reminder from the Department of Obvious Things That Still Get Ignored.
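One way to treat the steering dictionary as a governed asset is to store each entry with a timestamp and enforce retention and deletion explicitly. The sketch below is an assumption about how a team might do that, not something the paper specifies; the `SteeringEntry` fields, the in-memory store, and the retention rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SteeringEntry:
    key: List[float]      # embedding used for retrieval
    v_attn: List[float]   # attention steering vector
    v_mlp: List[float]    # MLP steering vector
    created_at: float     # Unix timestamp in seconds


Store = Dict[str, List[SteeringEntry]]  # user id -> entries


def purge_expired(store: Store, now: float, max_age_seconds: float) -> int:
    """Drop entries older than the retention window; return how many were removed."""
    removed = 0
    for user_id in list(store):
        kept = [e for e in store[user_id] if now - e.created_at <= max_age_seconds]
        removed += len(store[user_id]) - len(kept)
        if kept:
            store[user_id] = kept
        else:
            del store[user_id]
    return removed


def delete_user(store: Store, user_id: str) -> bool:
    """Remove everything derived from one user, e.g. on a deletion request."""
    return store.pop(user_id, None) is not None


store: Store = {
    "u1": [SteeringEntry([1.0], [0.1], [0.2], created_at=0.0),
           SteeringEntry([0.5], [0.3], [0.4], created_at=100.0)],
    "u2": [SteeringEntry([0.2], [0.5], [0.6], created_at=0.0)],
}

removed = purge_expired(store, now=150.0, max_age_seconds=100.0)
remaining_users = sorted(store)
deleted = delete_user(store, "u1")
print(removed, remaining_users, deleted)
```

A production version would also need audit logging and access control, which this sketch deliberately leaves out.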

Safety is another open question. Steering changes model behaviour internally. That is the point. But any system that can steer toward preferred style or action choices can also steer in undesirable directions if the steering dictionary is polluted, mis-retrieved, or maliciously influenced. The paper evaluates task performance, not a full safety envelope.

Finally, benchmark metrics are limited. ROUGE can reward textual overlap without capturing whether the output feels genuinely aligned with a user’s preferences. Function-calling accuracy is useful, but production agents also need recovery behaviour, permission handling, explainability, and trust calibration.

Fints is therefore not a finished personalization product. It is a serious mechanism for one.

The strategic shift is runtime personalization

The most important contribution of Fints is not that it beats a few baselines. It is that it proposes a different place to put personalization.

Prompt-based personalization treats the model as a reader of user memory. Fine-tuning treats the model as something to be modified for the user. Reward models treat personalization as a selection problem after generation. Fints treats personalization as an inference-time steering problem inside the model’s computation.

That distinction matters because product environments are dynamic. Users have thin histories, changing intentions, inconsistent styles, and domain-specific preferences. They do not politely wait for retraining cycles. They also do not fit neatly into a single adapter. Inconvenient, but commercially relevant.

Fints offers a practical middle path: keep the base model frozen, prepare user-derived steering vectors offline, select the relevant vectors per query, and inject them at runtime. The paper’s evidence suggests that this can improve performance in content generation and personalized web function calling, especially under sparse and heterogeneous user-history conditions.

The remaining question is not whether inference-time personalization is interesting. It plainly is.

The better question is whether product teams can build the surrounding machinery: privacy-safe user logs, reliable vector preparation, efficient retrieval, activation-level serving, monitoring, and user-facing controls. Without that, Fints is just a clever paper with nice tables. With it, personalization may stop being a retraining problem and start becoming a runtime orchestration problem.

Which, frankly, is where it probably belonged all along.

Cognaptus: Automate the Present, Incubate the Future.

¹ Kounianhua Du, Jianxing Liu, Kangning Zhang, Wenxiang Jiao, Yuan Lu, Jiarui Jin, Weiwen Liu, Yong Yu, and Weinan Zhang, “Fints: Efficient Inference-Time Personalization for LLMs with Fine-Grained Instance-Tailored Steering,” arXiv:2510.27206, submitted October 31, 2025. https://arxiv.org/abs/2510.27206
