Small Models, Big Skills: When Agent Frameworks Meet Industrial Reality

Compliance has a wonderful way of killing beautiful demos.

In a demo, the agent calls a frontier model, loads a tool, reads a document, writes a decision, and everyone nods at the future. In a regulated company, the same workflow meets a less poetic checklist: where did the data go, who pays for the GPU time, can this run inside our perimeter, and why did the model spend twenty seconds “thinking” about a binary classification task?

That is the useful context for the paper Agent Skill Framework: Perspectives on the Potential of Small Language Models in Industrial Environments.¹ The paper asks whether the Agent Skill paradigm—popular in large-model agent systems—can also help small open-source language models perform industrial tasks where public APIs are expensive, restricted, or simply not acceptable.

The answer is not “small models are the future.” Nor is it “just use the biggest model and expense it to the innovation budget.” The answer is more operational: Agent Skills can help, but only when the model is already capable enough to route itself through the skill library. A framework is not a brain transplant. It is a leverage mechanism. Leverage works badly when the thing being leveraged is not there.

The real comparison is not small versus large, but three ways to feed context

The paper is most useful when read as a deployment comparison, not as a generic benchmark. It compares three ways of giving a model task knowledge.

Method	What the model receives	Practical interpretation	Main risk
Direct Instruction (DI)	A minimal task prompt	Cheapest baseline; no skill machinery	The model must already know enough
Full-Skill Instruction (FSI)	The whole temporary skill repository	“Dump all instructions into context”	Longer context, more distraction, higher cost
Agent Skill Instruction (ASI)	Skill descriptions first, then relevant skill details on demand	Progressive disclosure	Requires reliable skill routing

This is the central managerial choice. DI is cheap but under-informed. FSI is comprehensive but blunt. ASI is elegant, but elegance comes with a gatekeeper: the model must first identify which skill to read.
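To make the three strategies concrete, here is a minimal sketch of how each prompt might be assembled. The `Skill` dataclass, the function names, and the prompt wording are illustrative assumptions; the paper's actual templates are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    description: str  # short summary, shown up front under ASI
    body: str         # full instructions, loaded on demand under ASI


def build_di_prompt(task: str) -> str:
    """Direct Instruction: the task prompt alone."""
    return task


def build_fsi_prompt(task: str, skills: list[Skill]) -> str:
    """Full-Skill Instruction: every skill body goes into context."""
    bodies = "\n\n".join(f"## {s.name}\n{s.body}" for s in skills)
    return f"{bodies}\n\n{task}"


def build_asi_menu(task: str, skills: list[Skill]) -> str:
    """Agent Skill Instruction, step 1: show descriptions only."""
    menu = "\n".join(f"- {s.name}: {s.description}" for s in skills)
    return f"Available skills:\n{menu}\n\nTask: {task}\nReply with one skill name."


def build_asi_prompt(task: str, skills: list[Skill], chosen: str) -> str:
    """Agent Skill Instruction, step 2: load only the chosen skill body."""
    by_name = {s.name: s for s in skills}
    if chosen not in by_name:
        raise KeyError(f"router picked unknown skill: {chosen!r}")
    skill = by_name[chosen]
    return f"## {skill.name}\n{skill.body}\n\n{task}"
```

The gatekeeper is visible in the signature of `build_asi_prompt`: it cannot run until something has produced `chosen`, and under ASI that something is the model itself.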

The paper formalizes Agent Skills as a progressive-disclosure process under a partially observable Markov decision process. In plainer business language, the agent is uncertain about the user’s true task and the relevant hidden context. It has three options: act immediately, reveal more information, or select and execute a skill. The system is therefore making an economic decision under uncertainty: is another piece of context worth the attention and compute cost?

That framing matters because it prevents a common misunderstanding. Agent Skills are not just prettier prompt folders. They are a control mechanism for bounded attention. The model should not read everything. It should read what changes the decision.
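One illustrative way to write that trade-off down is a value-of-information rule. The symbols here are assumptions made for this article, not the paper's notation: $b$ is the agent's current belief about the hidden task, $U$ is task utility (for example, accuracy), $c(s)$ is the cost of reading skill $s$ in tokens or VRAM-time, and $\lambda$ converts cost into utility units.

```latex
\text{read skill } s
\quad \text{iff} \quad
\mathbb{E}\left[\, U \mid b,\ \text{read } s \,\right]
- \mathbb{E}\left[\, U \mid b,\ \text{act now} \,\right]
> \lambda \, c(s)
```

Read this way, Full-Skill Instruction pays $c(s)$ for every skill regardless of expected gain, while Direct Instruction never pays it at all.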

The uncomfortable part: deciding what to read is itself an intelligence task.

The benchmarks separate easy tasks from industrially annoying ones

The evaluation uses three datasets, and the differences among them are more important than the dataset names.

Dataset	Task type	Why it matters
IMDB	Binary sentiment classification	Simple baseline; most competent models already do well
FiNER	Financial XBRL tag selection with 139 labels	Domain-heavy classification requiring financial terminology and reasoning
InsurBench	Proprietary insurance-claim email-thread decision task	Long, noisy, private industrial data with operational ambiguity

IMDB is useful mostly as a sanity check. If a model needs an elaborate skill framework to classify short movie reviews, the model is not being helped; it is being escorted.

FiNER is more revealing because the model must choose among specialized financial tags. This is where domain instructions and structured context can actually change output quality.

InsurBench is the most business-relevant part. It uses real insurance-claim email histories, including long threads, extracted claim details, communication mismatches, and multilingual interference. The task is to decide whether the insurance company should continue engagement, take further action, or close the case. This is exactly the kind of messy internal workflow where a company wants local deployment and also wants the model to stop hallucinating with confidence, a corporate habit already sufficiently supplied by humans.

The paper also evaluates cost using average generation time and average VRAM-time, measured as GPU memory multiplied by time. That is a better industrial metric than pure latency or FLOPS because VRAM occupancy determines how many jobs can run concurrently. A model that is fast but monopolizes memory can still be expensive in production.
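As a rough sketch of the metric, VRAM-time can be computed by integrating sampled GPU memory over the duration of a job. The sampling format, the trapezoid rule, and both function names are assumptions made here; the paper's exact measurement procedure is not described in this article.

```python
def vram_time_gb_min(samples: list[tuple[float, float]]) -> float:
    """Integrate GPU memory over time with the trapezoid rule.

    samples: (elapsed_seconds, vram_gb) pairs, sorted by time.
    Returns the area under the memory curve in GB-minutes.
    """
    total_gb_seconds = 0.0
    for (t0, m0), (t1, m1) in zip(samples, samples[1:]):
        if t1 < t0:
            raise ValueError("samples must be sorted by time")
        total_gb_seconds += 0.5 * (m0 + m1) * (t1 - t0)
    return total_gb_seconds / 60.0


def concurrent_jobs(gpu_vram_gb: float, job_vram_gb: float) -> int:
    """Simplified estimate of how many jobs fit on one GPU at once.

    Ignores fragmentation and framework overhead.
    """
    if job_vram_gb <= 0:
        raise ValueError("job_vram_gb must be positive")
    return int(gpu_vram_gb // job_vram_gb)
```

A job that holds 40 GB for 30 seconds costs `vram_time_gb_min([(0, 40), (30, 40)])`, which is 20.0 GB-min, and only two such jobs fit on an 80 GB card no matter how quickly each one finishes.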

Main evidence: Agent Skills help most when the task is hard enough to need them

The main performance table compares DI, FSI, and ASI across models from Gemma-3-270M to Qwen3-80B variants, plus GPT-4o-mini as a closed-source reference for public datasets.

The pattern is sharpest on FiNER. Qwen3-80B-Instruct improves from 0.198 classification accuracy under Direct Instruction to 0.654 under Agent Skill Instruction. Qwen3-80B-Coder improves from 0.309 to 0.657. Qwen3-30B-Instruct improves from 0.184 to 0.564. These are not cosmetic gains. They show that for specialized financial tagging, structured skill access can move a model from “barely useful” toward “potentially deployable after further validation.”

InsurBench tells a more cautious story. Qwen3-80B-Instruct improves from 0.498 under DI to 0.620 under ASI. Qwen3-80B-Coder improves from 0.498 to 0.660. Gemma-3-12B improves from 0.500 to 0.575. But Qwen3-30B-Instruct drops from 0.500 to 0.450 in classification accuracy under ASI, despite strong skill-selection accuracy. That result is worth keeping, not smoothing away.

Why? Because it shows that routing success and task execution quality are different bottlenecks. A model can select the right skill and still fail to apply it well to a long, messy email thread. In business terms: finding the correct operating manual is not the same as operating the machine.

On IMDB, the gains are modest or inconsistent because the task is already easy. Some models perform well with DI, and ASI does not create much room for improvement. This is not a weakness of the paper. It is a useful warning: adding an agent framework to a simple task may mostly add ceremony.
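The accuracy figures quoted above for FiNER and InsurBench can be tabulated in a few lines, which makes the one regression easy to spot. The numbers are the ones cited in this article; the dictionary layout and helper names are assumptions for illustration.

```python
# (dataset, model) -> (DI accuracy, ASI accuracy), as quoted in the text
RESULTS = {
    ("FiNER", "Qwen3-80B-Instruct"): (0.198, 0.654),
    ("FiNER", "Qwen3-80B-Coder"): (0.309, 0.657),
    ("FiNER", "Qwen3-30B-Instruct"): (0.184, 0.564),
    ("InsurBench", "Qwen3-80B-Instruct"): (0.498, 0.620),
    ("InsurBench", "Qwen3-80B-Coder"): (0.498, 0.660),
    ("InsurBench", "Gemma-3-12B"): (0.500, 0.575),
    ("InsurBench", "Qwen3-30B-Instruct"): (0.500, 0.450),
}


def asi_gain(di: float, asi: float) -> float:
    """Absolute accuracy change when moving from DI to ASI."""
    return round(asi - di, 3)


def regressions(results: dict) -> list:
    """Settings where ASI scored below DI."""
    return sorted(key for key, (di, asi) in results.items() if asi < di)


if __name__ == "__main__":
    for (dataset, model), (di, asi) in RESULTS.items():
        print(f"{dataset:<11} {model:<20} {asi_gain(di, asi):+.3f}")
    print("regressions:", regressions(RESULTS))
```

The largest quoted gain is +0.456 (Qwen3-80B-Instruct on FiNER), and the only regression is Qwen3-30B-Instruct on InsurBench at -0.050.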

Evidence	Likely purpose	What it supports	What it does not prove
Main DI / FSI / ASI table	Main evidence	ASI can improve hard domain tasks, especially FiNER and some InsurBench settings	ASI is always better than DI
IMDB results	Sanity check	Easy tasks do not benefit much from skill machinery	Agent Skills are unnecessary in all settings
InsurBench results	Industrial relevance test	Private, long-context workflows may benefit from structured skills	General performance across all insurance or regulated workflows
VRAM-time metric	Deployment-cost evidence	GPU memory residency matters for production economics	Full total cost of ownership

The paper’s strongest claim is therefore conditional: Agent Skills are useful when the task requires domain context and the model can reliably select and use the right skill. That sentence is less glamorous than “small agents are coming,” but it is much more deployable.

The failure mode starts before execution: tiny models cannot route

The most important misconception to remove is that Agent Skills make tiny models enterprise-ready. They do not.

The paper reports that tiny models struggle even when the temporary skill repository contains only a handful of distractor skills. Gemma-3-270M effectively fails the skill-selection objective in several settings. Gemma-3-4B performs better, but still falls short on InsurBench, where its skill-selection accuracy under ASI is 0.780. That is not a rounding error. In an industrial workflow, a 22% wrong-routing rate is not “lightweight.” It is a ticket generator.

The authors then test skill-selection robustness as the skill hub grows, which is closer to a real agent platform. A production agent may not have five skills; it may have dozens or hundreds. The paper’s larger-hub experiment shows rapid degradation for tiny models, while models above roughly the 12B scale remain much more robust. Code-specialized variants appear especially strong in skill selection.
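A hub-size sweep of this kind can be sketched as follows. This is not the paper's harness: the `Router` callable stands in for a model call, and the sampling scheme (one gold skill plus randomly drawn distractors) is an assumption.

```python
import random
from typing import Callable

# A router takes (task text, skill names in the hub) and returns one name.
Router = Callable[[str, list[str]], str]


def routing_accuracy_by_hub_size(
    router: Router,
    cases: list[tuple[str, str]],   # (task text, gold skill name)
    distractor_pool: list[str],
    hub_sizes: list[int],
    seed: int = 0,
) -> dict[int, float]:
    """Measure skill-selection accuracy as the hub grows."""
    rng = random.Random(seed)
    accuracy = {}
    for size in hub_sizes:
        if size < 1:
            raise ValueError("hub size must be at least 1")
        hits = 0
        for task, gold in cases:
            pool = [d for d in distractor_pool if d != gold]
            hub = rng.sample(pool, min(size - 1, len(pool))) + [gold]
            rng.shuffle(hub)  # avoid leaking the gold skill by position
            hits += router(task, hub) == gold
        accuracy[size] = hits / len(cases)
    return accuracy
```

Running the same cases at sizes such as 5, 20, and 100 gives a degradation curve per model, which is more informative for platform planning than a single accuracy figure at one hub size.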

This result should change how companies discuss “small model agents.” The first question should not be: “Can the small model answer the task?” The first question should be: “Can the small model reliably find the right procedure before answering?”

A useful internal deployment checklist would separate the two:

Capability	Test question	Failure symptom
Skill routing	Does the model choose the correct skill among distractors?	Wrong tool, wrong policy, wrong document family
Skill execution	Does the model apply the selected skill correctly?	Correct manual, bad decision
Context discipline	Does the model avoid loading irrelevant context?	Slow, expensive, distracted outputs
Dependency detection	Does the model notice cross-skill references?	Broken multi-step workflows

That last point is where the paper becomes especially relevant to real agent platforms.
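The first two rows of the checklist can be scored separately from the same evaluation log. The sketch below assumes each trial records the gold and chosen skill alongside the gold and predicted label; the `Trial` fields and the metric names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    gold_skill: str
    chosen_skill: str
    gold_label: str
    predicted_label: str


def split_metrics(trials: list[Trial]) -> dict[str, float]:
    """Report routing, end-to-end, and execution-given-routing accuracy."""
    if not trials:
        raise ValueError("no trials to score")
    routed = [t for t in trials if t.chosen_skill == t.gold_skill]
    correct = [t for t in trials if t.predicted_label == t.gold_label]
    routed_and_correct = [t for t in routed if t.predicted_label == t.gold_label]
    return {
        "routing_accuracy": len(routed) / len(trials),
        "task_accuracy": len(correct) / len(trials),
        # How often the decision is right once the right manual is open.
        "execution_given_routing": (
            len(routed_and_correct) / len(routed) if routed else 0.0
        ),
    }
```

A model with high `routing_accuracy` but low `execution_given_routing` matches the "correct manual, bad decision" symptom; the reverse pattern points at the router.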

Progressive disclosure breaks when the model misses dependencies

The appendix reports a negative result that should probably receive more attention than many benchmark numbers. The authors attempted to evaluate progressive disclosure with cross-skill references, where one skill description may refer to another skill. Small models often failed to detect those references, preventing the system from triggering intra-skill calls. Even GPT-4o-mini reportedly showed low hit rates in this setting. The authors excluded intra-skill invocation from the main experiments because the selection rates were too low for meaningful comparison.

This is not a minor implementation detail. It identifies a structural boundary.

A simple skill hub is a flat menu. A realistic skill system is closer to a dependency graph. A claim-handling skill may refer to a compliance skill. A financial-tagging skill may refer to an accounting-standard skill. A customer-support skill may refer to refund-policy and escalation-protocol skills. If the model cannot detect those dependencies, progressive disclosure stops being progressive. It becomes selectively blind.

The paper frames Agent Skills as a POMDP-like process where the agent can acquire information before acting. But cross-skill references make the information-acquisition policy more demanding. The model must not only ask, “Which skill matches the user request?” It must also ask, “Does the selected skill imply another hidden dependency?”

That is harder. It is also much closer to industrial reality. Naturally, reality arrives carrying a dependency graph and no apology.
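To illustrate what the dependency graph looks like in code, here is a small resolver that walks cross-skill references breadth-first. The `[[skill:name]]` reference syntax and both function names are hypothetical. Resolving references deterministically in the harness, rather than relying on the model to notice them, is a design option suggested here and not something the paper evaluates.

```python
import re
from collections import deque

# Hypothetical reference syntax inside a skill body: [[skill:some-name]]
REF_PATTERN = re.compile(r"\[\[skill:([a-z0-9-]+)\]\]")


def referenced_skills(text: str) -> list[str]:
    """Return skill names referenced in a skill body, in order of appearance."""
    return REF_PATTERN.findall(text)


def resolve_dependencies(root: str, skill_bodies: dict[str, str]) -> list[str]:
    """Breadth-first closure of skills reachable from `root`.

    Cycles are tolerated; a reference to a missing skill raises KeyError.
    """
    order: list[str] = []
    seen = {root}
    queue = deque([root])
    while queue:
        name = queue.popleft()
        if name not in skill_bodies:
            raise KeyError(f"skill {name!r} is referenced but not in the hub")
        order.append(name)
        for ref in referenced_skills(skill_bodies[name]):
            if ref not in seen:
                seen.add(ref)
                queue.append(ref)
    return order
```

With a claim-handling skill that references compliance and escalation skills, `resolve_dependencies("claim-handling", bodies)` returns all three names, so the harness can decide how much of the closure to disclose.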

Code models are not just for code when the workflow is procedural

One of the more interesting findings is the performance of Qwen3-80B-Coder. Holding model size constant among Qwen3-80B variants, the code-specialized model performs strongly under ASI and is especially attractive on VRAM-time efficiency.

On InsurBench, Qwen3-80B-Coder under ASI reaches 0.660 classification accuracy and 0.658 F1, outperforming Qwen3-80B-Instruct and Qwen3-80B-Thinking on classification accuracy. On FiNER, Qwen3-80B-Coder reaches 0.657 classification accuracy, close to Qwen3-80B-Instruct at 0.654 but below the Thinking variant at 0.717. However, the Thinking variant is far more expensive in VRAM-time. On InsurBench ASI, Qwen3-80B-Thinking consumes 181.003 GB-min per item, while Qwen3-80B-Coder consumes 10.975 GB-min.

That gap is not subtle. It is the difference between “interesting benchmark winner” and “please explain this invoice.”
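The size of that gap is easy to work out from the two InsurBench ASI figures quoted above. The fixed budget of 1,000 GB-min is a made-up number chosen only to make the comparison tangible.

```python
# VRAM-time per item on InsurBench under ASI, as quoted in the text (GB-min)
THINKING_GB_MIN = 181.003
CODER_GB_MIN = 10.975


def cost_ratio(expensive: float, cheap: float) -> float:
    """How many times more VRAM-time the expensive model uses per item."""
    return expensive / cheap


def items_within_budget(budget_gb_min: float, per_item_gb_min: float) -> int:
    """Whole items that fit in a fixed VRAM-time budget."""
    return int(budget_gb_min // per_item_gb_min)


if __name__ == "__main__":
    print(round(cost_ratio(THINKING_GB_MIN, CODER_GB_MIN), 2))  # 16.49
    print(items_within_budget(1000, THINKING_GB_MIN))           # 5
    print(items_within_budget(1000, CODER_GB_MIN))              # 91
```

The Thinking variant uses roughly 16.5 times the VRAM-time per item, so the same hypothetical budget covers 5 items instead of 91.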

The likely explanation is not that insurance claims are secretly Python. It is that Agent Skill workflows are procedural. They involve reading structured instructions, matching task patterns, following steps, respecting output formats, and managing tool-like behavior. Code-specialized models may be better aligned with procedural decomposition and instruction-following under structured contexts.

This remains an inference, not a proven causal mechanism. The paper itself notes that the underlying reason for the accuracy and VRAM efficiency of code-oriented models remains unclear. But for deployment teams, the practical implication is still valuable: do not evaluate only chat-optimized or reasoning-branded models. Include code-specialized models in non-code agent benchmarks, especially when the workflow is rule-heavy and procedure-driven.

Chat history is a remedy with a memory bill attached

The paper includes a post-hoc exploration of ASI with chat history on InsurBench. This is best read as an exploratory extension, not as the main thesis.

The result is asymmetric. Small models benefit more from chat history. Gemma-3-4B improves from 0.525 classification accuracy under ASI to 0.660 under ASI with history (ASIH). Gemma-3-270M improves from 0.415 to 0.525. For Gemma-3-12B, the gain is small: 0.575 to 0.585. Qwen3-30B improves from 0.450 to 0.500. But Qwen3-80B-Instruct drops from 0.620 to 0.535 under ASIH, while its VRAM-time rises from 5.321 to 10.035 GB-min.

That is the kind of result enterprise teams should love, because it refuses to be a universal recipe.

Chat history is not “good.” It is a context intervention. It can help smaller models reconstruct task state, but it can also increase cost and possibly distract larger models that already have enough capacity to process the core task. History should therefore be enabled by model class and workflow type, not as a religious setting.

Model class	Likely use of chat history	Practical recommendation
Very small SLMs	Helps compensate for weak context reconstruction	Test history carefully, but expect gains
Mid-sized SLMs	May help in long operational threads	Use if the gain exceeds VRAM-time cost
Large SLMs	May add cost without improving decisions	Default off unless evaluation proves otherwise

The important point is not that history helps. The important point is that history has a price, and sometimes the price buys confusion.
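One way to turn "enable history by model class" into a rule is to compare the accuracy gain against the extra VRAM-time. The function below is a sketch under that assumption; the name `history_verdict` and the `min_gain_per_gb_min` threshold are hypothetical and would need to be set from a team's own cost model.

```python
def history_verdict(
    acc_without: float,
    acc_with: float,
    vram_without_gb_min: float,
    vram_with_gb_min: float,
    min_gain_per_gb_min: float = 0.0,
) -> str:
    """Return "on" if chat history earns its VRAM-time cost, else "off"."""
    gain = acc_with - acc_without
    extra_cost = vram_with_gb_min - vram_without_gb_min
    if gain <= 0:
        return "off"  # history did not improve decisions
    if extra_cost <= 0:
        return "on"   # better and no more expensive
    return "on" if gain / extra_cost >= min_gain_per_gb_min else "off"


if __name__ == "__main__":
    # Qwen3-80B-Instruct on InsurBench, figures quoted in the text
    print(history_verdict(0.620, 0.535, 5.321, 10.035))  # off
```

For Qwen3-80B-Instruct the verdict is "off" before cost is even considered, because accuracy falls. For the smaller Gemma models the accuracy side is favorable, but the answer depends on VRAM-time figures not quoted in this article.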

The synonym test is exploratory, but not trivial

The paper also tests replacing the word “Skill” with alternatives such as “Capability,” “Expertise,” “Proficiency,” and “Know-how.” On Qwen3-80B-Instruct with InsurBench, the changes have limited but visible effects. “Expertise” performs best among ASI variants on classification accuracy and skill accuracy, while “Know-how” reduces VRAM-time with some performance trade-off.

This should not be overinterpreted. It does not prove that everyone should rename their skill folders tomorrow. Please do not start a governance meeting about whether “Expertise.md” is more enterprise-ready than “SKILL.md.” There are better uses of human civilization.

But the test does show that model behavior is sensitive to semantic framing. The labels used inside agent frameworks are not neutral. They interact with model priors. For teams designing internal agent platforms, naming conventions, schema descriptions, and system-prompt terminology should be benchmarked, not merely debated.
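Benchmarking terminology is cheap if the label is a template parameter rather than a hard-coded word. The sketch below uses the five labels named in the text; the template wording and function names are assumptions, and no scores are included because they must come from an actual evaluation run.

```python
LABELS = ["Skill", "Capability", "Expertise", "Proficiency", "Know-how"]

# Hypothetical system-prompt fragment with the label left as a parameter.
TEMPLATE = (
    "Below is the {label} library. Read the descriptions, pick the one "
    "{label} entry that fits the task, and answer with its name."
)


def render_variants(
    template: str = TEMPLATE, labels: list[str] = LABELS
) -> dict[str, str]:
    """Render one prompt per label so each variant can be evaluated."""
    return {label: template.format(label=label) for label in labels}


def best_label(scores: dict[str, float]) -> str:
    """Pick the label with the highest measured score."""
    if not scores:
        raise ValueError("no scores to compare")
    return max(scores, key=scores.get)
```

Each rendered prompt is then run through the same evaluation set, and `best_label` is applied to the measured accuracies, ideally alongside VRAM-time so that a cheaper label with a small accuracy cost stays visible.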

The broader lesson is that context engineering is not only about how much context is supplied. It is also about the representational shape of that context.

What the paper directly shows, and what businesses should infer

The paper directly shows three things.

First, Agent Skill Instruction can substantially improve performance on harder domain tasks compared with Direct Instruction, especially in FiNER and in several InsurBench settings.

Second, very small models are unreliable skill routers. This matters because routing is the entry point to the whole architecture.

Third, cost efficiency should be measured with memory-time, not just output quality. The code-specialized Qwen3-80B variant is particularly strong when accuracy and VRAM-time are considered together.

From these results, Cognaptus would infer a practical deployment pathway:

1. Use Direct Instruction as the baseline. If the task is already solved, do not add an agent framework for aesthetic reasons.
2. Add Agent Skills when the task requires domain procedures, policy logic, or structured decision rules.
3. Test skill routing separately from final task accuracy.
4. Avoid sub-4B models as skill routers unless the skill set is tiny, stable, and low-risk.
5. Include 12B–30B and code-specialized larger models in evaluation, not only general instruct models.
6. Track VRAM-time per completed task as a production cost metric.
7. Treat chat history, synonym choices, and skill descriptions as benchmarkable design parameters.
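The pathway can be expressed as a simple gate over evaluation results. This is a sketch, not a policy: the `EvalSummary` fields, the returned strings, and every threshold passed to `recommend` are hypothetical and should come from the workflow's own risk and cost requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSummary:
    di_accuracy: float           # Direct Instruction baseline
    asi_accuracy: float          # end-to-end accuracy with Agent Skills
    routing_accuracy: float      # skill selection, measured separately
    vram_gb_min_per_task: float  # production cost metric


def recommend(
    summary: EvalSummary,
    target_accuracy: float,
    min_routing_accuracy: float,
    vram_budget_gb_min: float,
) -> str:
    """Walk the checks in order and return the first decisive outcome."""
    if summary.di_accuracy >= target_accuracy:
        return "direct-instruction"  # already solved, skip the framework
    if summary.routing_accuracy < min_routing_accuracy:
        return "reject: unreliable routing"
    if summary.asi_accuracy < target_accuracy:
        return "reject: execution below target"
    if summary.vram_gb_min_per_task > vram_budget_gb_min:
        return "reject: over VRAM-time budget"
    return "agent-skills"
```

The ordering matters: the baseline check comes first so that an agent framework is never adopted for a task that Direct Instruction already handles, and routing is checked before end-to-end accuracy so that a routing failure is reported as such.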

That is the difference between “we use agents” and “we have an agent system that survives procurement.”

Boundaries: the paper is useful, but not a universal enterprise map

The evidence is strongest for classification, tagging, and decision tasks. The study uses IMDB, FiNER, and InsurBench; it does not establish performance across open-ended planning, multi-tool execution, customer-facing dialogue, coding over large repositories, or long-horizon autonomous workflows.

The InsurBench dataset is valuable because it is private and industrial, reducing the risk that the benchmark was already absorbed during pretraining. But it is still one proprietary insurance workflow. We should not casually generalize it to all regulated industries.

The paper also excludes recursive intra-skill disclosure from the main experiments because smaller models were too weak at detecting cross-skill references. That exclusion is methodologically reasonable, but it means the reported ASI results are closer to a simplified skill-routing setup than to a fully mature enterprise skill graph.

Finally, the code-model result is operationally interesting but causally unresolved. The paper shows that code-specialized variants are efficient and strong in this setup. It does not fully explain whether this comes from procedural instruction-following, training data, architecture, decoding behavior, or other factors.

These boundaries do not weaken the article’s main business value. They make the deployment lesson cleaner: benchmark the workflow you actually intend to run, including the skill library size, dependency structure, history policy, and memory budget.

The decision rule: skills amplify capable small models, not incapable tiny ones

The paper’s quiet contribution is that it shifts the enterprise AI conversation away from a false binary.

The choice is not “frontier API” versus “tiny local model.” The more realistic choice is among deployment architectures:

Deployment choice	When it makes sense	When it fails
Frontier API	Low data sensitivity, high quality requirement, flexible budget	Compliance or cost constraints block usage
Tiny local SLM	Simple, narrow, low-risk tasks	Skill routing, domain reasoning, long context
Mid-sized local SLM with Agent Skills	Regulated workflows with structured procedures	Complex execution beyond model capacity
Code-specialized large open model with Agent Skills	Procedural agent workflows where VRAM-time matters	If hardware budget cannot support residency
Full skill context loading	Small skill set, stable prompt, low ambiguity	Large skill hubs and context distraction

The correct conclusion is not that small models are suddenly enterprise-ready. The correct conclusion is that some small and mid-sized models become more useful when the architecture respects their limits.

Agent Skills reduce unnecessary context. They organize domain procedures. They make model behavior more modular. They can improve cost-performance. But they also introduce a routing problem, and routing is not free. It consumes intelligence before the task even begins.

That is the industrial reality: local AI is not achieved by shrinking the model and praying harder. It is achieved by matching model capacity, context structure, routing complexity, and GPU economics.

Small models can have big skills. But only if they can find them.

Cognaptus: Automate the Present, Incubate the Future.

¹ Yangjie Xu, Lujun Li, Lama Sleem, Niccolo’ Gentile, Yewei Song, Yiqun Wang, Siming Ji, Wenbo Wu, and Radu State, “Agent Skill Framework: Perspectives on the Potential of Small Language Models in Industrial Environments,” arXiv:2602.16653, 2026.
