Opening — Why this matters now
The industry has spent the last three years worshipping at a single altar: scale. Bigger models, larger datasets, longer context windows. The implicit assumption is simple—intelligence is a function of size.
This paper challenges that assumption with quiet confidence.
Instead of building a larger model, it asks a more inconvenient question: what if the intelligence we need already exists—just fragmented across different models?
The result is not another fine-tuned giant, but something structurally different: a graph of frozen models that communicate internally, rather than at the output level.
And the performance gains suggest this is not just an architectural curiosity—it may be a more capital-efficient path forward.
Background — Context and prior art
Most multi-model systems today operate at the surface level:
| Approach | Mechanism | Limitation |
|---|---|---|
| Ensemble methods | Combine output probabilities | No access to internal reasoning |
| Routing systems | Send query to best model | No collaboration |
| Multi-agent systems | Exchange natural language | High latency, shallow integration |
These methods treat models as black boxes.
But prior research has hinted at something deeper: latent spaces across independently trained LLMs are geometrically compatible.
Translation: different models may “think” in slightly different coordinate systems—but they describe the same underlying structure.
If that is true, then communication doesn’t need to happen in text. It can happen directly in representation space.
That is the leap this paper makes.
Analysis — What the paper actually builds
The architecture is deceptively simple.
1. Multi-model encoding (Layer 1)
Three small frozen models process the same input—but with different perspectives:
- One focuses on factual content
- One on reasoning structure
- One on language framing
Each produces a hidden representation.
These are projected into a shared latent space and averaged:
| Component | Role |
|---|---|
| Llama-3.2-1B | Factual encoding |
| Qwen2.5-1.5B | Reasoning encoding |
| Gemma-2-2B | Linguistic encoding |
This creates a unified signal: z₁.
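The project-and-average step can be sketched in PyTorch. The encoder widths below are the three models' standard hidden sizes; the shared dimension, class name, and batch size are assumptions for illustration, not values from the paper:

```python
import torch
import torch.nn as nn

# Hidden widths of the three frozen encoders
# (Llama-3.2-1B: 2048, Qwen2.5-1.5B: 1536, Gemma-2-2B: 2304).
ENC_DIMS = [2048, 1536, 2304]
SHARED_DIM = 1024  # assumed width of the shared latent space

class SharedProjector(nn.Module):
    """Project each encoder's hidden state into a shared latent space,
    then average the projections into a single signal z1 (a sketch,
    not the authors' exact code)."""
    def __init__(self, enc_dims, shared_dim):
        super().__init__()
        self.projs = nn.ModuleList([nn.Linear(d, shared_dim) for d in enc_dims])

    def forward(self, hidden_states):
        # hidden_states: one (batch, enc_dim) tensor per frozen encoder
        projected = [p(h) for p, h in zip(self.projs, hidden_states)]
        return torch.stack(projected).mean(dim=0)  # (batch, shared_dim)

hs = [torch.randn(2, d) for d in ENC_DIMS]      # stand-in encoder outputs
z1 = SharedProjector(ENC_DIMS, SHARED_DIM)(hs)  # the unified signal z1
```

Only the `nn.Linear` projections here would be trainable; the encoders that produce `hidden_states` stay frozen.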
2. Cross-model injection (Layer 2)
Instead of decoding this signal, the system injects it into two larger models mid-computation.
| Component | Function |
|---|---|
| Phi-3-mini | Structured refinement |
| Mistral-7B | General reasoning refinement |
The injected signal modifies their internal representations via the residual stream.
This is crucial: the models are not queried—they are steered internally.
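One plausible way to implement this kind of mid-computation steering is a forward hook that adds the projected signal into a block's residual stream. The toy block, dimensions, and function names below are assumptions; the paper's exact injection mechanism may differ:

```python
import torch
import torch.nn as nn

def make_injection_hook(z, proj):
    """Forward hook that adds a projected shared signal z into a block's
    output, i.e. into the residual stream. `proj` maps z from the shared
    latent width into the host model's hidden width."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + proj(z).unsqueeze(1)  # broadcast over sequence
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# Toy stand-in for one transformer block of a frozen host model.
block = nn.Linear(64, 64)
proj = nn.Linear(32, 64)   # shared latent (32) -> host hidden width (64)
z = torch.randn(2, 32)     # z1 from the encoder layer

handle = block.register_forward_hook(make_injection_hook(z, proj))
x = torch.randn(2, 5, 64)  # (batch, seq, hidden)
out = block(x)             # output now carries the injected signal
handle.remove()
```

The host block's weights never change; the hook only perturbs its activations, which is what "steered, not queried" means in practice.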
3. Output aggregation via attention
A lightweight cross-attention module decides how to combine the outputs.
No explicit routing rules are provided.
The system learns which model to trust—implicitly.
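A minimal version of such an aggregator is a single learned query cross-attending over the refiners' outputs, so the attention weights act as implicit routing. The dimensions, head count, and module name are assumptions:

```python
import torch
import torch.nn as nn

class OutputAggregator(nn.Module):
    """Lightweight cross-attention that learns how much to trust each
    refiner model's output (a sketch under assumed dimensions)."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, dim))

    def forward(self, refiner_outputs):
        # refiner_outputs: (batch, n_models, dim), one row per refiner
        q = self.query.expand(refiner_outputs.size(0), -1, -1)
        fused, weights = self.attn(q, refiner_outputs, refiner_outputs)
        # weights: (batch, n_models) -- the learned, implicit routing
        return fused.squeeze(1), weights.squeeze(1)

agg = OutputAggregator(dim=64)
outs = torch.randn(2, 2, 64)  # two refiners (e.g. Phi-3-mini, Mistral-7B)
fused, w = agg(outs)
```

The "strong bias toward Phi-3-mini" reported later would show up here as `w` consistently placing more mass on that refiner's row, with no routing labels ever provided.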
4. Training philosophy
- Total parameters: ~12B (frozen)
- Trainable: 17.6M (~0.15%)
Only projection layers and the output node are trained.
This is not fine-tuning. It is interfacing.
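The frozen-versus-trainable split is easy to verify mechanically: freeze the backbone's parameters and count what still requires gradients. The layer sizes below are toy stand-ins, not the paper's 12B/17.6M figures:

```python
import torch.nn as nn

# Toy stand-ins: a frozen backbone vs. a small trainable interface layer.
backbone = nn.Linear(1000, 1000)   # stands in for the ~12B frozen params
for p in backbone.parameters():
    p.requires_grad = False
interface = nn.Linear(1000, 10)    # stands in for the 17.6M trainable params

total = sum(p.numel() for m in (backbone, interface) for p in m.parameters())
trainable = sum(p.numel() for m in (backbone, interface)
                for p in m.parameters() if p.requires_grad)
print(f"trainable fraction: {trainable / total:.2%}")
```

An optimizer built only over parameters with `requires_grad=True` then touches nothing but the projection layers and the output node.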
Findings — Results with visualization
Benchmark performance
| Model | MMLU | ARC-Challenge | OpenBookQA |
|---|---|---|---|
| Best single model | 66.0% | 75.9% | 76.6% |
| Learned head (baseline) | 60.5% | 78.2% | 77.6% |
| Frozen LLM Graph | 67.2% | 87.3% | 82.8% |
Performance gains
| Comparison | MMLU | ARC | OBQA |
|---|---|---|---|
| vs single model | +1.2pp | +11.4pp | +6.2pp |
| vs learned head | +6.7pp | +9.1pp | +5.2pp |
Two observations matter more than the raw numbers:
- The gains are largest on structured reasoning (ARC) → suggests cross-model reasoning composition is real, not cosmetic.
- The system consistently beats parameter-matched classifiers → the advantage comes from communication, not just extra parameters.
Deeper Mechanism — What’s actually happening
1. Gradient flow across frozen models is viable
A common assumption is that frozen models block learning.
Not here.
The paper shows:
- Gradient signal retained: ~13% across model boundaries
- Stable training without skip connections
This reframes frozen models as differentiable modules, not static assets.
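This is standard autograd behavior, easy to demonstrate: freezing stops weight updates, not backpropagation, so gradients still flow *through* a frozen module to trainable layers upstream of it. A minimal illustration:

```python
import torch
import torch.nn as nn

# A frozen module still passes gradients through to trainable layers
# feeding into it: freezing stops weight updates, not backprop.
frozen = nn.Linear(8, 8)
for p in frozen.parameters():
    p.requires_grad = False
adapter = nn.Linear(8, 8)  # trainable projection upstream of the frozen model

x = torch.randn(4, 8)
loss = frozen(adapter(x)).sum()
loss.backward()

assert frozen.weight.grad is None       # frozen weights receive no gradient
assert adapter.weight.grad is not None  # but gradient flowed through to the adapter
```

The interesting empirical claim is not that this works at all, but that enough signal (~13%) survives several frozen 1B-7B models stacked in series to train the interfaces stably.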
2. Emergent routing (without supervision)
The output layer learns to prefer one model over another.
Specifically:
- Strong bias toward Phi-3-mini
- Weaker but persistent contribution from Mistral-7B
No routing labels were provided.
The system discovers which model is more useful.
This is a subtle but important shift:
Routing is no longer a design choice—it becomes a learned behavior.
3. No specialization at the projection level
Interestingly, the first-layer projections converge to similar behaviors.
Implication:
- Diversity comes from model heterogeneity, not learned adapters
- The system relies on pretrained differences, not new specialization
This is both a strength and a limitation.
Implications — What this means for business and AI systems
1. The economics of AI may shift
Training frontier models is expensive.
This approach suggests an alternative:
| Strategy | Cost Structure | Scalability |
|---|---|---|
| Train larger models | High compute + data | Linear scaling cost |
| Fine-tune models | Moderate cost | Task-specific |
| Compose frozen models (this paper) | Low incremental cost | Combinatorial scaling |
The last option is particularly attractive for:
- Enterprises with access to multiple models
- Vertical AI applications
- Rapid prototyping environments
2. Models become components, not products
This architecture treats LLMs as:
- Modular
- Interoperable
- Replaceable
That aligns with a broader trend toward AI system engineering, not model worship.
3. Latent space becomes the real interface layer
APIs today operate in text.
This paper implies a future where:
- Models communicate in vector space
- Translation layers replace prompt engineering
In other words, the API layer moves one level deeper.
4. Competitive advantage shifts to orchestration
If models are commoditized and frozen:
- Value shifts to how they are connected
- Architecture becomes the differentiator
This is uncomfortable for model providers—but attractive for system builders.
Conclusion — The quiet end of monolithic models
This paper does not claim to replace large models.
It does something more disruptive: it makes them optional.
By demonstrating that:
- Frozen models can communicate
- Gradients can flow across boundaries
- Performance improves through composition
…it reframes the future of AI systems.
Not as bigger models.
But as networks of models that learn to collaborate.
And once collaboration becomes differentiable, the system—not the model—becomes the unit of intelligence.
Cognaptus: Automate the Present, Incubate the Future.