Opening — Bigger Context, Same Blind Spots

For the past year, the industry narrative has been simple: give models more context, and the problem goes away.

128K tokens became 1M. Then 2M. The promise was intoxicating — “the whole repository fits.” Retrieval bottlenecks? Solved. File localization? Obsolete. Just feed the model everything and let attention do the rest.

The paper The Navigation Paradox in Large-Context Agentic Coding challenges that assumption directly.

Its core claim is uncomfortable but strategically important:

Larger context windows do not remove the navigation problem. They simply relocate it.

The bottleneck shifts from capacity to salience. The model doesn’t fail because it cannot read the relevant file. It fails because it never realizes that file is architecturally relevant.

For teams deploying repository-scale coding agents, this distinction is not academic. It is operational risk.


Background — Retrieval vs. Navigation

Most repository-level agent systems today rely on retrieval-based localization:

  • BM25 ranking over code chunks
  • Embedding similarity
  • Hybrid keyword + semantic search

This works well when dependencies are semantic — when the task description shares vocabulary with the required files.

But real software systems are not flat documents. They are graphs.

Dependencies arise from:

  • IMPORTS
  • INHERITS
  • INSTANTIATES
  • Configuration wiring
  • Dependency injection

These relationships are structural, not lexical.
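To make the lexical/structural distinction concrete, here is a minimal sketch of extracting IMPORTS edges from Python source with the standard `ast` module. The module names are hypothetical, and this covers only the simplest edge type — INHERITS, INSTANTIATES, and configuration wiring require deeper analysis.

```python
import ast

def import_edges(module_name: str, source: str) -> list[tuple[str, str]]:
    """Extract (module, imported_module) IMPORTS edges from Python source.

    These edges are structural: they exist whether or not the task
    description shares any vocabulary with the imported files.
    """
    edges = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                edges.append((module_name, alias.name))
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Record the source module; whether the imported name is a
            # submodule or a symbol is left unresolved in this sketch.
            edges.append((module_name, node.module))
    return edges

src = "import os\nfrom app.core import security\n"
print(import_edges("app.api.users", src))
# [('app.api.users', 'os'), ('app.api.users', 'app.core')]
```

Note that `security` never appears in the edge list as text a keyword search would match against a bug report — the connection lives in the graph, not the vocabulary.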

The paper formalizes this mismatch as the Navigation Paradox:

| Problem Type | Retrieval Works? | Structural Graph Helps? |
|---|---|---|
| Semantic (G1) | ✅ Yes | ❌ Minimal benefit |
| Structural (G2) | ⚠️ Partial | ⚠️ Conditional |
| Hidden (G3) | ❌ No | ✅ Major benefit |

The important insight: retrieval answers the question “Which files look similar to this query?”

Graph navigation answers a different question entirely:

“Given this file, what other files are architecturally connected?”

Those are not interchangeable problems.
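The graph question can be answered with a plain breadth-first traversal over a dependency adjacency map. The file names and edges below are invented for illustration; a real system would populate them from parsed imports, DI wiring, and so on.

```python
from collections import deque

# Hypothetical dependency edges: file -> files it depends on.
DEPS = {
    "api/articles.py": ["services/feed.py"],
    "services/feed.py": ["core/config.py"],
    "core/config.py": [],
}

def connected_files(start: str, depth: int = 2) -> set[str]:
    """Given this file, which files are architecturally reachable
    within `depth` dependency hops?"""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue
        for nxt in DEPS.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return seen - {start}

print(sorted(connected_files("api/articles.py")))
# ['core/config.py', 'services/feed.py']
```

No similarity score anywhere in that traversal — which is exactly why it finds files that share no vocabulary with the task.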


Analysis — CodeCompass and Controlled Benchmarking

To test this hypothesis, the authors built CodeCompass, an MCP-based graph navigation server backed by Neo4j.

Experimental Setup

  • 30 repository-level tasks
  • FastAPI RealWorld example app (~3,500 LOC)
  • 258 completed trials
  • Three conditions:

| Condition | Description | Tooling |
|---|---|---|
| A | Vanilla Claude Code | Built-in tools only |
| B | BM25 prepended rankings | Retrieval augmentation |
| C | Graph navigation via MCP | Structural traversal |

Tasks were partitioned into three groups:

  • G1 — Semantic (keyword discoverable)
  • G2 — Structural (reachable via import chains)
  • G3 — Hidden (architecturally connected but lexically invisible)

The metric used was Architectural Coverage Score (ACS):

$$ \mathrm{ACS} = \frac{|\mathrm{files}_{\mathrm{accessed}} \cap \mathrm{files}_{\mathrm{required}}|}{|\mathrm{files}_{\mathrm{required}}|} $$

This measures navigational completeness — not code correctness, but discovery fidelity.
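The metric is a straightforward set computation. A minimal sketch, with invented file names as sample data:

```python
def acs(files_accessed: set[str], files_required: set[str]) -> float:
    """Architectural Coverage Score: the fraction of required files
    the agent actually touched during a trial."""
    if not files_required:
        return 1.0  # vacuously complete
    return len(files_accessed & files_required) / len(files_required)

required = {"models/user.py", "core/security.py", "api/deps.py"}
accessed = {"models/user.py", "api/deps.py", "main.py"}
print(acs(accessed, required))  # 2 of 3 required files -> 0.666...
```

Extra files the agent reads (here `main.py`) do not hurt the score; only missing required files do, which is what makes ACS a discovery metric rather than an efficiency one.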

Core Results

| Condition | G1 | G2 | G3 | Overall ACS |
|---|---|---|---|---|
| Vanilla | 90.0% | 79.7% | 76.2% | 82.0% |
| BM25 | 100.0% | 85.1% | 78.2% | 87.1% |
| Graph | 88.9% | 76.4% | 99.4% | 88.3% |

The headline number is stark:

On hidden-dependency tasks (G3), graph navigation improves coverage by +23.2 percentage points over the vanilla baseline, and by +21.2 over BM25.

BM25 provides almost zero lift on G3, because retrieval cannot rank what vocabulary does not expose.


The Adoption Paradox — Tools Work, Agents Don’t Always Use Them

The most revealing finding isn’t performance. It’s behavior.

In Condition C:

  • MCP tool adoption rate: 42%
  • ACS when tool used: 99.5%
  • ACS when ignored: 80.2%

The graph is effective.

But the model frequently chooses not to call it.

Even more surprising:

  • G2 structural tasks: 0% graph adoption
  • G3 hidden tasks (after improved prompt engineering): 100% adoption

This suggests models invoke structural tools only when they “sense” difficulty.

In other words, tool effectiveness is not the limiting factor.

Tool adoption discipline is.

For production systems, this implies something non-obvious:

Structural workflow enforcement may be more important than tool design.

Optional tools create optional rigor.


Strategic Implications — Infrastructure > Window Size

1. Context Expansion Has Diminishing Returns

Beyond a certain point, adding tokens does not improve architectural discovery.

The failure mode becomes navigational, not computational.

2. Retrieval Is Not Broken — It’s Misapplied

BM25 dominates semantic tasks. It is cheap, fast, and effective.

But expecting retrieval to surface architecturally hidden dependencies is like using search to map a dependency injection tree.

Wrong abstraction layer.

3. Structural Graphs Become First-Class Infrastructure

Graph construction is not a one-time preprocessing step.

The paper emphasizes a deployment assumption: graph quality requires human validation and maintenance.

A stale graph is worse than no graph.

This introduces a governance dimension:

  • Who owns the architectural graph?
  • Who validates edges?
  • How is drift detected?

Graph-backed coding agents imply graph-backed process discipline.
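One concrete answer to the drift question is a node-level reconciliation check between the graph and the repository on disk. This is a simplified sketch under an assumed representation (file paths as graph nodes); real edge-level validation would require re-parsing sources.

```python
def graph_drift(graph_files: set[str], repo_files: set[str]) -> dict[str, list[str]]:
    """Detect drift between an architectural graph and the repo on disk.

    graph_files: file nodes currently recorded in the graph.
    repo_files:  files currently present in the repository.
    """
    return {
        "stale": sorted(graph_files - repo_files),     # nodes for deleted/moved files
        "unmapped": sorted(repo_files - graph_files),  # files the graph has never seen
    }

report = graph_drift(
    graph_files={"api/users.py", "core/old_auth.py"},
    repo_files={"api/users.py", "core/auth.py"},
)
print(report)
# {'stale': ['core/old_auth.py'], 'unmapped': ['core/auth.py']}
```

A check like this can run in CI, turning "a stale graph is worse than no graph" from a warning into an enforced invariant.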

4. Agent Behavior Control Is a Research Frontier

Prompt engineering improved G3 adoption from 85.7% to 100%.

But relying on prompt formatting to enforce tool usage is brittle.

A more robust architecture might:

  • Force an initial dependency-mapping tool call
  • Separate planning and execution agents
  • Enforce structured navigation checkpoints

In enterprise deployments, this becomes workflow engineering — not model tuning.
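A minimal sketch of what such enforcement could look like: a gate that blocks edits until a dependency-mapping call has been recorded. The tool name `map_dependencies` and the class are illustrative, not the paper's CodeCompass API.

```python
class NavigationGate:
    """Workflow guard: file edits are blocked until a dependency-mapping
    tool call has been observed in the agent's trace.

    Illustrative only; tool names are assumptions, not a real API.
    """

    def __init__(self) -> None:
        self._mapped = False

    def record_tool_call(self, tool: str) -> None:
        if tool == "map_dependencies":
            self._mapped = True

    def allow_edit(self, path: str) -> bool:
        # Optional tools create optional rigor; this makes the
        # structural step mandatory rather than model-discretionary.
        return self._mapped

gate = NavigationGate()
print(gate.allow_edit("api/users.py"))   # False: no mapping yet
gate.record_tool_call("map_dependencies")
print(gate.allow_edit("api/users.py"))   # True: structural step done
```

The point is that the constraint lives in the orchestration layer, where it is deterministic, rather than in the prompt, where adoption measured here was 42%.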


Broader Reflection — The Navigation Paradox Beyond Code

The Navigation Paradox likely generalizes.

As LLMs are deployed in:

  • Legal document analysis
  • Financial risk mapping
  • Compliance review
  • Multi-agent orchestration

The problem shifts from:

“Can the model see everything?”

To:

“Does the model know where to look?”

Context scale does not guarantee structural awareness.

Navigation infrastructure may quietly become the differentiator between demo-grade agents and production-grade systems.


Conclusion — Bigger Isn’t Smarter

This study makes a subtle but powerful argument.

Large context windows reduce one bottleneck.

They expose another.

Graph-structured dependency navigation does not replace retrieval. It complements it — but only when consistently invoked.

The most important takeaway for operators is pragmatic:

  • Invest in navigational infrastructure.
  • Enforce structural workflows.
  • Treat graph maintenance as core engineering.

Because as repositories grow, attention does not become omniscient.

It becomes selective.

And selection without structure is just guesswork at scale.

Cognaptus: Automate the Present, Incubate the Future.