Opening — Why this matters now
The industry has spent the last three years worshipping at a single altar: scale. Bigger models, larger datasets, longer context windows. The implicit assumption is simple—intelligence is a function of size.
This paper challenges that assumption with quiet confidence.
Instead of building a larger model, it asks a more inconvenient question: what if the intelligence we need already exists—just fragmented across different models?
The result is not another fine-tuned giant, but something structurally different: a graph of frozen models that communicate internally, rather than at the output level.
And the performance gains suggest this is not just an architectural curiosity—it may be a more capital-efficient path forward.
Background — Context and prior art
Most multi-model systems today operate at the surface level:
| Approach | Mechanism | Limitation |
|---|---|---|
| Ensemble methods | Combine output probabilities | No access to internal reasoning |
| Routing systems | Send query to best model | No collaboration |
| Multi-agent systems | Exchange natural language | High latency, shallow integration |
These methods treat models as black boxes.
But prior research has hinted at something deeper: latent spaces across independently trained LLMs are geometrically compatible.
Translation: different models may “think” in slightly different coordinate systems—but they describe the same underlying structure.
If that is true, then communication doesn’t need to happen in text. It can happen directly in representation space.
That is the leap this paper makes.
Analysis — What the paper actually builds
The architecture is deceptively simple.
1. Multi-model encoding (Layer 1)
Three small frozen models process the same input—but with different perspectives:
- One focuses on factual content
- One on reasoning structure
- One on language framing
Each produces a hidden representation.
These are projected into a shared latent space and averaged:
| Component | Role |
|---|---|
| Llama-3.2-1B | Factual encoding |
| Qwen2.5-1.5B | Reasoning encoding |
| Gemma-2-2B | Linguistic encoding |
This creates a unified signal: z₁.
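The project-and-average step can be sketched in PyTorch. The encoder widths below are the three models' standard hidden sizes; the shared dimension, class name, and batch size are assumptions for illustration, not values from the paper:

```python
import torch
import torch.nn as nn

# Hidden widths of the three frozen encoders
# (Llama-3.2-1B: 2048, Qwen2.5-1.5B: 1536, Gemma-2-2B: 2304).
ENC_DIMS = [2048, 1536, 2304]
SHARED_DIM = 1024  # assumed width of the shared latent space

class SharedProjector(nn.Module):
    """Project each encoder's hidden state into a shared latent space,
    then average the projections into a single signal z1 (a sketch,
    not the authors' exact code)."""
    def __init__(self, enc_dims, shared_dim):
        super().__init__()
        self.projs = nn.ModuleList([nn.Linear(d, shared_dim) for d in enc_dims])

    def forward(self, hidden_states):
        # hidden_states: one (batch, enc_dim) tensor per frozen encoder
        projected = [p(h) for p, h in zip(self.projs, hidden_states)]
        return torch.stack(projected).mean(dim=0)  # (batch, shared_dim)

hs = [torch.randn(2, d) for d in ENC_DIMS]      # stand-in encoder outputs
z1 = SharedProjector(ENC_DIMS, SHARED_DIM)(hs)  # the unified signal z1
```

Only the `nn.Linear` projections here would be trainable; the encoders that produce `hidden_states` stay frozen.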
2. Cross-model injection (Layer 2)
Instead of decoding this signal, the system injects it into two larger models mid-computation.
| Component | Function |
|---|---|
| Phi-3-mini | Structured refinement |
| Mistral-7B | General reasoning refinement |
The injected signal modifies their internal representations via the residual stream.
This is crucial: the models are not queried—they are steered internally.
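One plausible way to implement this kind of mid-computation steering is a forward hook that adds the projected signal into a block's residual stream. The toy block, dimensions, and function names below are assumptions; the paper's exact injection mechanism may differ:

```python
import torch
import torch.nn as nn

def make_injection_hook(z, proj):
    """Forward hook that adds a projected shared signal z into a block's
    output, i.e. into the residual stream. `proj` maps z from the shared
    latent width into the host model's hidden width."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + proj(z).unsqueeze(1)  # broadcast over sequence
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# Toy stand-in for one transformer block of a frozen host model.
block = nn.Linear(64, 64)
proj = nn.Linear(32, 64)   # shared latent (32) -> host hidden width (64)
z = torch.randn(2, 32)     # z1 from the encoder layer

handle = block.register_forward_hook(make_injection_hook(z, proj))
x = torch.randn(2, 5, 64)  # (batch, seq, hidden)
out = block(x)             # output now carries the injected signal
handle.remove()
```

The host block's weights never change; the hook only perturbs its activations, which is what "steered, not queried" means in practice.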
3. Output aggregation via attention
A lightweight cross-attention module decides how to combine the outputs.
No explicit routing rules are provided.
The system learns which model to trust—implicitly.
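A minimal version of such an aggregator is a single learned query cross-attending over the refiners' outputs, so the attention weights act as implicit routing. The dimensions, head count, and module name are assumptions:

```python
import torch
import torch.nn as nn

class OutputAggregator(nn.Module):
    """Lightweight cross-attention that learns how much to trust each
    refiner model's output (a sketch under assumed dimensions)."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, dim))

    def forward(self, refiner_outputs):
        # refiner_outputs: (batch, n_models, dim), one row per refiner
        q = self.query.expand(refiner_outputs.size(0), -1, -1)
        fused, weights = self.attn(q, refiner_outputs, refiner_outputs)
        # weights: (batch, n_models) -- the learned, implicit routing
        return fused.squeeze(1), weights.squeeze(1)

agg = OutputAggregator(dim=64)
outs = torch.randn(2, 2, 64)  # two refiners (e.g. Phi-3-mini, Mistral-7B)
fused, w = agg(outs)
```

The "strong bias toward Phi-3-mini" reported later would show up here as `w` consistently placing more mass on that refiner's row, with no routing labels ever provided.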
4. Training philosophy
- Total parameters: ~12B (frozen)
- Trainable: 17.6M (~0.15%)
Only projection layers and the output node are trained.
This is not fine-tuning. It is interfacing.
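The frozen-versus-trainable split is easy to verify mechanically: freeze the backbone's parameters and count what still requires gradients. The layer sizes below are toy stand-ins, not the paper's 12B/17.6M figures:

```python
import torch.nn as nn

# Toy stand-ins: a frozen backbone vs. a small trainable interface layer.
backbone = nn.Linear(1000, 1000)   # stands in for the ~12B frozen params
for p in backbone.parameters():
    p.requires_grad = False
interface = nn.Linear(1000, 10)    # stands in for the 17.6M trainable params

total = sum(p.numel() for m in (backbone, interface) for p in m.parameters())
trainable = sum(p.numel() for m in (backbone, interface)
                for p in m.parameters() if p.requires_grad)
print(f"trainable fraction: {trainable / total:.2%}")
```

An optimizer built only over parameters with `requires_grad=True` then touches nothing but the projection layers and the output node.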
Findings — Results with visualization
Benchmark performance
| Model | MMLU | ARC-Challenge | OpenBookQA |
|---|---|---|---|
| Best single model | 66.0% | 75.9% | 76.6% |
| Learned head (baseline) | 60.5% | 78.2% | 77.6% |
| Frozen LLM Graph | 67.2% | 87.3% | 82.8% |
Performance gains
| Comparison | MMLU | ARC | OBQA |
|---|---|---|---|
| vs single model | +1.2pp | +11.4pp | +6.2pp |
| vs learned head | +6.7pp | +9.1pp | +5.2pp |
Two observations matter more than the raw numbers:
- The gains are largest on structured reasoning (ARC) → suggests cross-model reasoning composition is real, not cosmetic.
- The system consistently beats parameter-matched classifiers → the advantage comes from communication, not just extra parameters.
Deeper Mechanism — What’s actually happening
1. Gradient flow across frozen models is viable
A common assumption is that frozen models block learning.
Not here.
The paper shows:
- Gradient signal retained: ~13% across model boundaries
- Stable training without skip connections
This reframes frozen models as differentiable modules, not static assets.
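This is standard autograd behavior, easy to demonstrate: freezing stops weight updates, not backpropagation, so gradients still flow *through* a frozen module to trainable layers upstream of it. A minimal illustration:

```python
import torch
import torch.nn as nn

# A frozen module still passes gradients through to trainable layers
# feeding into it: freezing stops weight updates, not backprop.
frozen = nn.Linear(8, 8)
for p in frozen.parameters():
    p.requires_grad = False
adapter = nn.Linear(8, 8)  # trainable projection upstream of the frozen model

x = torch.randn(4, 8)
loss = frozen(adapter(x)).sum()
loss.backward()

assert frozen.weight.grad is None       # frozen weights receive no gradient
assert adapter.weight.grad is not None  # but gradient flowed through to the adapter
```

The interesting empirical claim is not that this works at all, but that enough signal (~13%) survives several frozen 1B-7B models stacked in series to train the interfaces stably.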
2. Emergent routing (without supervision)
The output layer learns to prefer one model over another.
Specifically:
- Strong bias toward Phi-3-mini
- Weaker but persistent contribution from Mistral-7B
No routing labels were provided.
The system discovers which model is more useful.
This is a subtle but important shift:
Routing is no longer a design choice—it becomes a learned behavior.
3. No specialization at the projection level
Interestingly, the first-layer projections converge to similar behaviors.
Implication:
- Diversity comes from model heterogeneity, not learned adapters
- The system relies on pretrained differences, not new specialization
This is both a strength and a limitation.
Implications — What this means for business and AI systems
1. The economics of AI may shift
Training frontier models is expensive.
This approach suggests an alternative:
| Strategy | Cost Structure | Scalability |
|---|---|---|
| Train larger models | High compute + data | Linear scaling cost |
| Fine-tune models | Moderate cost | Task-specific |
| Compose frozen models (this paper) | Low incremental cost | Combinatorial scaling |
The last option is particularly attractive for:
- Enterprises with access to multiple models
- Vertical AI applications
- Rapid prototyping environments
2. Models become components, not products
This architecture treats LLMs as:
- Modular
- Interoperable
- Replaceable
That aligns with a broader trend toward AI system engineering, not model worship.
3. Latent space becomes the real interface layer
APIs today operate in text.
This paper implies a future where:
- Models communicate in vector space
- Translation layers replace prompt engineering
In other words, the API layer moves one level deeper.
4. Competitive advantage shifts to orchestration
If models are commoditized and frozen:
- Value shifts to how they are connected
- Architecture becomes the differentiator
This is uncomfortable for model providers—but attractive for system builders.
Conclusion — The quiet end of monolithic models
This paper does not claim to replace large models.
It does something more disruptive: it makes them optional.
By demonstrating that:
- Frozen models can communicate
- Gradients can flow across boundaries
- Performance improves through composition
…it reframes the future of AI systems.
Not as bigger models.
But as networks of models that learn to collaborate.
And once collaboration becomes differentiable, the system—not the model—becomes the unit of intelligence.
Cognaptus: Automate the Present, Incubate the Future.