Opening — Why this matters now

The industry has spent the last three years worshipping at a single altar: scale. Bigger models, larger datasets, longer context windows. The implicit assumption is simple—intelligence is a function of size.

This paper challenges that assumption with quiet confidence.

Instead of building a larger model, it asks a more inconvenient question: what if the intelligence we need already exists—just fragmented across different models?

The result is not another fine-tuned giant, but something structurally different: a graph of frozen models that communicate internally, rather than at the output level.

And the performance gains suggest this is not just an architectural curiosity—it may be a more capital-efficient path forward.


Background — Context and prior art

Most multi-model systems today operate at the surface level:

| Approach | Mechanism | Limitation |
|---|---|---|
| Ensemble methods | Combine output probabilities | No access to internal reasoning |
| Routing systems | Send query to best model | No collaboration |
| Multi-agent systems | Exchange natural language | High latency, shallow integration |

These methods treat models as black boxes.

But prior research has hinted at something deeper: latent spaces across independently trained LLMs are geometrically compatible.

Translation: different models may “think” in slightly different coordinate systems—but they describe the same underlying structure.

If that is true, then communication doesn’t need to happen in text. It can happen directly in representation space.

That is the leap this paper makes.


Analysis — What the paper actually builds

The architecture is deceptively simple.

1. Multi-model encoding (Layer 1)

Three small frozen models process the same input—but with different perspectives:

  • One focuses on factual content
  • One on reasoning structure
  • One on language framing

Each produces a hidden representation.

These are projected into a shared latent space and averaged:

| Component | Role |
|---|---|
| Llama-3.2-1B | Factual encoding |
| Qwen2.5-1.5B | Reasoning encoding |
| Gemma-2-2B | Linguistic encoding |

This creates a unified signal: z₁.
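The fusion step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the shared dimension of 1024 is an assumption, and the per-model hidden widths are the nominal sizes of those model families.

```python
import numpy as np

# Assumed hidden widths for the three frozen encoders (nominal model sizes):
HIDDEN_SIZES = {"llama": 2048, "qwen": 1536, "gemma": 2304}
SHARED_DIM = 1024  # hypothetical shared latent width

rng = np.random.default_rng(0)

# One trainable projection matrix per frozen encoder, mapping into the shared space.
projections = {name: rng.standard_normal((d, SHARED_DIM)) / np.sqrt(d)
               for name, d in HIDDEN_SIZES.items()}

def fuse(hidden_states: dict) -> np.ndarray:
    """Project each encoder's hidden state into the shared space, then average."""
    projected = [hidden_states[name] @ projections[name] for name in HIDDEN_SIZES]
    return np.mean(projected, axis=0)  # the unified signal z1

# Example: one hidden vector per frozen encoder (random stand-ins here).
hiddens = {name: rng.standard_normal(d) for name, d in HIDDEN_SIZES.items()}
z1 = fuse(hiddens)
print(z1.shape)  # (1024,)
```

Only the projection matrices here would be trainable; the encoders that produce the hidden states stay frozen.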

2. Cross-model injection (Layer 2)

Instead of decoding this signal, the system injects it into two larger models mid-computation.

| Component | Function |
|---|---|
| Phi-3-mini | Structured refinement |
| Mistral-7B | General reasoning refinement |

The injected signal modifies their internal representations via the residual stream.

This is crucial: the models are not queried—they are steered internally.
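The injection pattern described above can be sketched as an additive update to the residual stream. The projection `W_inject`, the scale `alpha`, and the widths are hypothetical; only the shape of the operation follows the paper's description.

```python
import numpy as np

rng = np.random.default_rng(1)
SHARED_DIM, MODEL_DIM = 1024, 3072  # assumed shared and host-model widths

# Trainable up-projection from the shared space into the host model's width.
W_inject = rng.standard_normal((SHARED_DIM, MODEL_DIM)) / np.sqrt(SHARED_DIM)
alpha = 0.1  # hypothetical injection strength

def inject(residual: np.ndarray, z1: np.ndarray) -> np.ndarray:
    """Add the projected shared signal into the host model's residual stream."""
    return residual + alpha * (z1 @ W_inject)

residual = rng.standard_normal(MODEL_DIM)  # hidden state at some mid-layer
z1 = rng.standard_normal(SHARED_DIM)       # unified signal from Layer 1
steered = inject(residual, z1)
print(steered.shape)  # (3072,)
```

The host model then continues its forward pass from the steered hidden state, so the injected signal shapes its computation rather than its prompt.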

3. Output aggregation via attention

A lightweight cross-attention module decides how to combine the outputs.

No explicit routing rules are provided.

The system learns which model to trust—implicitly.
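A toy version of this implicit routing, with hypothetical dimensions and random weights standing in for the trained cross-attention module: softmax-normalized scores act as soft trust weights over the two models' outputs.

```python
import numpy as np

rng = np.random.default_rng(2)
DIM = 512  # assumed output-representation width

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

# Trainable query and per-model keys (random stand-ins for learned weights).
query = rng.standard_normal(DIM)
keys = rng.standard_normal((2, DIM))     # one key per upstream model
outputs = rng.standard_normal((2, DIM))  # e.g. Phi-3-mini and Mistral-7B outputs

# Attention scores decide, implicitly, how much weight each model's output gets.
weights = softmax(keys @ query / np.sqrt(DIM))
combined = weights @ outputs
print(combined.shape)  # (512,)
```

Because the weights come out of a softmax, they always sum to one: routing becomes a soft, learned mixture rather than a hard rule.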

4. Training philosophy

  • Total parameters: ~12B (frozen)
  • Trainable: 17.6M (~0.15%)

Only projection layers and the output node are trained.

This is not fine-tuning. It is interfacing.
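A quick arithmetic check of the trainable fraction quoted above:

```python
# Back-of-envelope check of the trainable fraction reported in the paper.
frozen_params = 12e9       # ~12B frozen parameters across the five models
trainable_params = 17.6e6  # projection layers + output node

fraction = trainable_params / (frozen_params + trainable_params)
print(f"{fraction:.4%}")  # roughly 0.15%
```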


Findings — Results with visualization

Benchmark performance

| Model | MMLU | ARC-Challenge | OpenBookQA |
|---|---|---|---|
| Best single model | 66.0% | 75.9% | 76.6% |
| Learned head (baseline) | 60.5% | 78.2% | 77.6% |
| Frozen LLM Graph | 67.2% | 87.3% | 82.8% |

Performance gains

| Comparison | MMLU | ARC | OBQA |
|---|---|---|---|
| vs best single model | +1.2pp | +11.4pp | +6.2pp |
| vs learned head | +6.7pp | +9.1pp | +5.2pp |

Two observations matter more than the raw numbers:

  1. The gains are largest on structured reasoning (ARC), which suggests that cross-model reasoning composition is real, not cosmetic.

  2. The graph consistently beats a parameter-matched learned head, so the advantage comes from communication, not just extra trainable parameters.


Deeper Mechanism — What’s actually happening

1. Gradient flow across frozen models is viable

A common assumption is that frozen models block learning.

Not here.

The paper shows:

  • Gradient signal retained: ~13% across model boundaries
  • Stable training without skip connections

This reframes frozen models as differentiable modules, not static assets.
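The point can be shown with a hand-derived chain rule: a frozen weight matrix is differentiated *through* even though it is never updated. Everything below is a toy illustration with tiny random matrices, not the paper's actual layers.

```python
import numpy as np

rng = np.random.default_rng(3)
W_proj = rng.standard_normal((4, 4)) * 0.1    # trainable projection
W_frozen = rng.standard_normal((4, 4)) * 0.1  # frozen model weights (never updated)

x = rng.standard_normal(4)
target = rng.standard_normal(4)

# Forward pass: trainable layer, then frozen layer.
h = x @ W_proj
y = h @ W_frozen
loss = 0.5 * np.sum((y - target) ** 2)

# Backprop by hand: the frozen weights transmit the gradient without receiving one.
grad_y = y - target
grad_h = grad_y @ W_frozen.T        # gradient flows across the frozen boundary
grad_W_proj = np.outer(x, grad_h)   # only this gradient is applied

W_proj -= 0.1 * grad_W_proj         # update the trainable parameters only
print(grad_W_proj.shape)  # (4, 4)
```

The frozen matrix acts exactly like any other differentiable function in the chain; what "frozen" removes is the update, not the gradient path.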

2. Emergent routing (without supervision)

The output layer learns to prefer one model over another.

Specifically:

  • Strong bias toward Phi-3-mini
  • Weaker but persistent contribution from Mistral-7B

No routing labels were provided.

The system discovers which model is more useful.

This is a subtle but important shift:

Routing is no longer a design choice—it becomes a learned behavior.

3. No specialization at the projection level

Interestingly, the first-layer projections converge to similar behaviors.

Implication:

  • Diversity comes from model heterogeneity, not learned adapters
  • The system relies on pretrained differences, not new specialization

This is both a strength and a limitation.


Implications — What this means for business and AI systems

1. The economics of AI may shift

Training frontier models is expensive.

This approach suggests an alternative:

| Strategy | Cost structure | Scalability |
|---|---|---|
| Train larger models | High compute + data | Linear scaling cost |
| Fine-tune models | Moderate cost | Task-specific |
| Compose frozen models (this paper) | Low incremental cost | Combinatorial scaling |

The last option is particularly attractive for:

  • Enterprises with access to multiple models
  • Vertical AI applications
  • Rapid prototyping environments

2. Models become components, not products

This architecture treats LLMs as:

  • Modular
  • Interoperable
  • Replaceable

That aligns with a broader trend toward AI system engineering, not model worship.

3. Latent space becomes the real interface layer

APIs today operate in text.

This paper implies a future where:

  • Models communicate in vector space
  • Translation layers replace prompt engineering

In other words, the API layer moves one level deeper.

4. Competitive advantage shifts to orchestration

If models are commoditized and frozen:

  • Value shifts to how they are connected
  • Architecture becomes the differentiator

This is uncomfortable for model providers—but attractive for system builders.


Conclusion — The quiet end of monolithic models

This paper does not claim to replace large models.

It does something more disruptive: it makes them optional.

By demonstrating that:

  • Frozen models can communicate
  • Gradients can flow across boundaries
  • Performance improves through composition

…it reframes the future of AI systems.

Not as bigger models.

But as networks of models that learn to collaborate.

And once collaboration becomes differentiable, the system—not the model—becomes the unit of intelligence.


Cognaptus: Automate the Present, Incubate the Future.