Opening — Why this matters now
Instruction-following is the quiet backbone of modern AI products. From copilots to autonomous agents, everything hinges on whether a model can do exactly what it was told—not approximately, not creatively, but precisely.
And yet, anyone who has deployed LLMs in production knows the uncomfortable truth: they don’t “follow instructions” in any consistent, reliable sense.
The paper under review dismantles a widely held assumption—that instruction-following is a unified capability learned through instruction tuning. Instead, it suggests something far less elegant and far more operationally relevant: LLMs coordinate multiple weakly coupled skills rather than executing a single coherent mechanism.
If that sounds messy, it is. And if you build AI systems, it matters more than you think.
Background — The myth of a unified instruction engine
The prevailing narrative in AI research has been convenient: instruction tuning gives models a general-purpose “rule-following” ability. Some even proposed a latent “instruction-following dimension” inside model representations.
Conceptually, that implies a clean architecture:
| Hypothesis | Interpretation | Engineering Implication |
|---|---|---|
| Universal mechanism | One shared representation for all instructions | Easier to scale, generalize, and control |
| Compositional skills | Multiple task-specific capabilities coordinated dynamically | Harder to control, requires orchestration |
Most benchmarks—IFEval, FollowBench, SysBench—measure whether models comply, not how they achieve compliance. That gap is where this paper positions itself.
Instead of asking “Does the model follow instructions?”, it asks:
What internal structure enables (or fails) instruction-following?
Analysis — What the paper actually does
The authors build a diagnostic framework that feels closer to systems debugging than traditional NLP evaluation.
1. Decomposing instruction-following
They split the problem into two components:
- Task-specific skills — e.g., counting words, detecting sentiment, formatting JSON
- Constraint satisfaction — the act of adhering to the instruction
The key question: is constraint satisfaction a reusable, general signal—or just an emergent side effect of task execution?
2. Experimental design (surprisingly practical)
They test across 9 tasks, covering:
- Structural (word/character count)
- Lexical (include/exclude words)
- Semantic (topic, sentiment)
- Stylistic (register, toxicity)
As shown in the task table on page 3, tasks are deliberately designed so that outputs can be fluent but wrong—forcing the model to demonstrate actual compliance, not just language ability.
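The "fluent but wrong" distinction is easy to make concrete: compliance with these constraint types can be checked programmatically, independent of fluency. A minimal sketch (the function names and the example strings are mine, not the paper's):

```python
import json

def check_word_count(text: str, max_words: int) -> bool:
    """Structural constraint: output must stay within a word budget."""
    return len(text.split()) <= max_words

def check_lexical(text: str, must_include: list[str], must_exclude: list[str]) -> bool:
    """Lexical constraint: required terms present, banned terms absent."""
    lower = text.lower()
    return (all(w.lower() in lower for w in must_include)
            and not any(w.lower() in lower for w in must_exclude))

def check_json_format(text: str) -> bool:
    """Formatting constraint: output must parse as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# A perfectly fluent answer can still fail every check:
reply = "Certainly! Here are five great reasons to visit Lisbon today."
print(check_word_count(reply, 5))           # fluent, but over the word budget
print(check_lexical(reply, ["Porto"], []))  # fluent, but missing a required term
```

Language ability gets the model past a human skim; it does not get it past checks like these, which is exactly the gap the task design exploits.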
3. Four analytical lenses
| Method | Purpose | What it tests |
|---|---|---|
| Specialist vs General Probes | Representation sharing | Is there a universal signal? |
| Cross-task Transfer | Skill reuse | Do capabilities generalize across tasks? |
| Causal Ablation (INLP) | Dependency structure | Do tasks rely on shared internal info? |
| Temporal Analysis | Execution timing | Is instruction-following planned or monitored? |
This is less “benchmarking” and more “reverse-engineering cognition.”
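The specialist-vs-general probe comparison can be sketched in a few lines. This is my reconstruction of the general idea, not the authors' code: the "hidden states" below are synthetic stand-ins in which each task's compliance signal lives on its own axis (a deliberately modular assumption), and the probe is a simple least-squares linear classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_hidden_states(n=400, dim=16, signal_axis=0):
    """Stand-in for residual-stream activations: compliant outputs (y=+1)
    carry a signal on one task-specific axis; labels are +/-1."""
    X = rng.normal(size=(n, dim))
    y = rng.choice([-1.0, 1.0], size=n)
    X[y > 0, signal_axis] += 3.0
    return X, y

def with_bias(X):
    return np.hstack([X, np.ones((len(X), 1))])

def fit_linear_probe(X, y):
    """Least-squares linear probe (a cheap stand-in for logistic probing)."""
    w, *_ = np.linalg.lstsq(with_bias(X), y, rcond=None)
    return w

def probe_accuracy(w, X, y):
    return float(np.mean(np.sign(with_bias(X) @ w) == y))

# Hypothetical setup: each task writes its signal to a different axis.
tasks = {"word_count": 0, "sentiment": 1, "topic": 2}
train = {t: fake_hidden_states(signal_axis=a) for t, a in tasks.items()}
test = {t: fake_hidden_states(signal_axis=a) for t, a in tasks.items()}

# Specialist probes: one per task, evaluated on the same task.
specialist_acc = {t: probe_accuracy(fit_linear_probe(*train[t]), *test[t]) for t in tasks}

# General probe: one classifier trained on all tasks pooled.
X_all = np.vstack([X for X, _ in train.values()])
y_all = np.concatenate([y for _, y in train.values()])
w_gen = fit_linear_probe(X_all, y_all)
general_acc = {t: probe_accuracy(w_gen, *test[t]) for t in tasks}

print("specialist:", specialist_acc)
print("general:   ", general_acc)
```

Under this toy geometry the pooled probe dilutes its weight across task-specific axes and loses accuracy, which is the signature the paper reports: if the skills were unified on a shared axis, the general probe would match the specialists instead.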
Findings — The uncomfortable truth (with structure)
1. No universal mechanism
Across all models tested (Llama, Gemma, Qwen), general probes consistently underperform task-specific ones.
| Observation | Implication |
|---|---|
| General probe < Specialist probe | No shared representation across tasks |
| Weak cross-task transfer | Skills are localized, not general |
| Sparse ablation dependencies | No central “instruction module” |
This alone invalidates the clean “instruction dimension” hypothesis.
2. Instruction-following = skill composition
Cross-task transfer (see heatmap on page 6) reveals clusters:
- Topic ↔ Sentiment → semantic overlap
- Term exclusion ↔ Toxicity → filtering logic
- Structural tasks → independent group
This suggests a modular skill graph, not a unified system.
3. Tasks emerge at different layers
From the layer-wise accuracy curves on page 5:
| Task Type | Emergence Layer | Interpretation |
|---|---|---|
| Structural | Early layers | Low-level pattern recognition |
| Lexical | Mid layers | Token-level control |
| Semantic/Stylistic | Late layers | High-level abstraction |
This is effectively a hierarchical pipeline inside the model, whether intended or not.
4. No planning—only monitoring
Temporal analysis (see Figure 4 on page 7) shows:
- No signal before generation starts
- Rapid activation during generation
- Stable monitoring throughout output
- Spike at EOS (final verification)
In plain terms:
LLMs don’t plan to follow instructions—they check themselves while generating.
Subtle difference. Massive implication.
5. Model differences are real
| Model | Behavior |
|---|---|
| Llama | More constraint-specific signals |
| Gemma | Mixed behavior |
| Qwen | Heavy reliance on general language features |
This means "instruction-following ability" is not just a scaling issue; it depends on architecture and training recipe.
Implications — What this means for builders and investors
1. Stop treating instruction-following as a feature
It’s not a capability you can “turn on.” It’s an emergent coordination problem.
For product design, this means:
- Prompt engineering ≠ control
- Fine-tuning ≠ guarantee
- Evaluation must be task-specific
2. Reliability comes from orchestration, not model size
If instruction-following is compositional, then robustness comes from:
- Task decomposition
- Multi-agent systems
- External constraint validators
Not just bigger models.
3. Monitoring beats planning (architecturally)
Since models don’t plan compliance, systems must:
- Add runtime checks (validators, guardrails)
- Use iterative refinement loops
- Treat outputs as drafts, not final answers
This aligns with the rise of agent frameworks and tool-augmented workflows.
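A minimal shape for that "draft, validate, retry" loop, assuming any prompt-to-text model call plugged in as `generate` (the function names, retry message, and toy model below are mine, for illustration only):

```python
from typing import Callable

def generate_with_validation(
    generate: Callable[[str], str],          # any LLM call: prompt -> text
    prompt: str,
    validators: list[Callable[[str], bool]],
    max_attempts: int = 3,
) -> str:
    """Treat each output as a draft: accept only when every validator
    passes, otherwise feed the failure back to the model and retry."""
    draft = generate(prompt)
    for _ in range(max_attempts - 1):
        if all(v(draft) for v in validators):
            return draft
        draft = generate(prompt + "\nYour previous draft violated a constraint. Try again.")
    return draft  # best effort after max_attempts

# Usage with a toy "model" that improves on its second try:
attempts = iter(["way too many words here indeed", "short enough"])
result = generate_with_validation(
    lambda p: next(attempts),
    "Reply in at most 3 words.",
    [lambda t: len(t.split()) <= 3],
)
print(result)  # "short enough"
```

The validators do the planning the model never did: compliance is enforced at the system boundary, not assumed inside the weights.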
4. Investment angle — where value actually accrues
If the model is not the “instruction engine,” then value shifts to:
| Layer | Opportunity |
|---|---|
| Evaluation tools | Measuring constraint adherence |
| Orchestration frameworks | Managing multi-skill execution |
| Monitoring systems | Detecting failures in real-time |
| Domain-specific tuning | Aligning skills to tasks |
In other words: the control layer, not the model layer.
Conclusion — Less intelligence, more coordination
This paper reframes instruction-following from a cognitive illusion into a systems problem.
LLMs don’t possess a universal rule-following faculty. They assemble responses by coordinating fragmented capabilities—some early, some late, some shared, most not.
That may sound like a limitation.
In reality, it’s a roadmap.
Because once you accept that instruction-following is not a monolith, you stop trying to fix the model—and start designing the system around it.
Cognaptus: Automate the Present, Incubate the Future.