Opening — Why this matters now
Large models can write poetry, generate code, and debate philosophy. Yet show them a bouncing ball in a physics simulator and ask, “Why did that happen?”—and things get awkward.
The problem is not intelligence in the abstract. It is interface. Language models operate in a world of tokens. Physics simulators operate in a world of state vectors and time steps. Somewhere between $(x_t, y_t, v_t)$ and “the ball bounced off the wall,” meaning gets lost.
The paper behind this article proposes a quietly radical idea: don’t ask LLMs to interpret raw trajectories. Teach the system to extract high-level patterns first—collisions, support relationships, rolling transitions—and only then let language models reason. It is less about making LLMs smarter, and more about giving them better abstractions.
That distinction matters for anyone building AI agents that interact with environments, from robotics to simulation-driven training pipelines.
Background — Where Physics Meets Language (and Fails)
Physical reasoning benchmarks such as PHYRE and related suites have repeatedly shown that foundation models struggle with low-level dynamics. Video-LMs and multimodal systems can describe scenes, but extracting causal structure from frame-by-frame state changes remains brittle.
The issue is structural:
| Layer | Representation | Strength | Weakness |
|---|---|---|---|
| Simulation | Continuous state traces $\tau = \{x_1, x_2, \ldots, x_N\}$ | Precise dynamics | Opaque to language models |
| Language Models | Tokens & semantic abstractions | Flexible reasoning | No access to structured event grounding |
| Human reasoning | Events (“bounce”, “stack formed”) | Interpretable & causal | Requires abstraction |
Previous approaches tried to push LLMs directly into the simulator loop, or train vision-language models to internalize physics implicitly. Results were mixed.
This work takes a different route: extract event-level patterns explicitly, and treat them as an interface layer between physics and language.
Method — Discovering Patterns from Simulation Traces
1. From Trajectories to Annotated Simulation Traces
Let a trajectory be:
$$ \tau = \{x_1, x_2, \ldots, x_N\} $$
Each $x_i$ is a full simulation state at time step $i$.
Instead of reasoning directly over $\tau$, the system constructs an annotation matrix $A \in \{0,1\}^{N \times |P|}$ where each column corresponds to a pattern detector $p_j$.
If pattern $p_j$ (e.g., “elastic collision between A and B”) is active at time step $i$, then:
$$ A_{ij} = 1, $$
and $A_{ij} = 0$ otherwise.
This transforms a dense physical trace into a sparse, event-driven abstraction: an Annotated Simulation Trace (AST).
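To make the construction concrete, here is a minimal sketch of AST annotation. It assumes each detector is a boolean function over a single simulation state (the paper's detectors may also inspect windows of states), and the state layout and the `at_rest` detector are illustrative assumptions, not the paper's exact code:

```python
import numpy as np

def annotate(trajectory, detectors):
    """Build the annotation matrix A in {0,1}^(N x |P|).

    trajectory: list of N simulation states
    detectors:  list of |P| callables, each state -> bool
    """
    N, P = len(trajectory), len(detectors)
    A = np.zeros((N, P), dtype=np.uint8)
    for i, state in enumerate(trajectory):
        for j, detect in enumerate(detectors):
            A[i, j] = 1 if detect(state) else 0
    return A

# Hypothetical detector: "global rest state" fires when every
# object in the scene is (near) motionless.
def at_rest(state, eps=1e-3):
    return all(abs(v) < eps for v in state["velocities"])

# Toy two-step trajectory: moving, then settled.
tau = [
    {"velocities": [0.0, 2.1]},
    {"velocities": [0.0, 0.0]},
]
print(annotate(tau, [at_rest]))  # [[0], [1]]
```

The payoff is the sparsity: a trajectory with thousands of states typically compresses to a handful of pattern onsets and offsets, which is exactly the granularity a language model can reason over.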
2. Learning Pattern Detectors via Evolutionary Synthesis
The clever part is that the patterns are not hand-coded. Instead, the system uses an evolutionary program-search framework inspired by FunSearch.
Each candidate detector program $f_j$ is evaluated by a composite fitness score:
$$ \nu = \rho + \eta - \lambda - \psi $$
Where:
- $\rho$ measures correlation between geometry differences and pattern differences
- $\eta$ rewards novelty vs. existing library
- $\lambda$ penalizes code length
- $\psi$ penalizes computational cost
Rather than optimizing for labeled correctness, the search optimizes structural usefulness. Patterns that meaningfully differentiate trajectories survive.
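The sketch below shows how such a composite score might be wired up. The specific estimators are assumptions for illustration (Pearson correlation for $\rho$, one-minus-max-overlap novelty for $\eta$, and normalized proxies for code length $\lambda$ and runtime $\psi$), not the paper's exact definitions:

```python
import numpy as np

def fitness(pat_diffs, geo_diffs, candidate_col, library_cols,
            source_len, runtime_s, len_scale=500.0, time_scale=1.0):
    # rho: do pattern-level differences track geometric differences
    # across pairs of trajectories?
    rho = float(np.corrcoef(geo_diffs, pat_diffs)[0, 1])

    # eta: novelty vs. the existing library, measured here as
    # 1 minus the best per-timestep agreement with any library column.
    overlaps = [np.mean(candidate_col == col) for col in library_cols] or [0.0]
    eta = 1.0 - max(overlaps)

    # lambda: penalize long programs; psi: penalize slow ones.
    lam = source_len / len_scale
    psi = runtime_s / time_scale

    return rho + eta - lam - psi

# Toy usage with made-up measurements.
geo = np.array([0.1, 0.9, 0.4])          # geometry differences per pair
pat = np.array([0.0, 1.0, 0.5])          # pattern differences per pair
cand = np.array([0, 1, 1, 0])            # candidate's annotation column
lib = [np.array([0, 0, 1, 0])]           # existing library columns
print(fitness(pat, geo, cand, lib, source_len=120, runtime_s=0.02))
```

Note the design choice this encodes: a detector earns fitness by carving trajectories into distinguishable classes, not by matching any labeled ground truth.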
This yields both guided patterns (seeded by short natural language hints) and self-discovered patterns (emergent structural motifs).
Examples include:
- Post-collision induced motion
- Support handoff beneath object
- Sliding-to-rolling transition
- Pivoted sweep rotation
- Global rest state
These are not pixels. They are concepts.
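For a flavor of what the evolved programs might look like, here is a hypothetical "sliding-to-rolling transition" detector operating on a window of two states. The field names (`vx`, `omega`, `radius`) and the slip-velocity test are assumptions about the simulator's state, chosen to mirror the physics: rolling without slipping means $v \approx \omega r$ at the contact point:

```python
def sliding_to_rolling(prev_state, state, slip_tol=0.05):
    def slip(s):
        # Slip speed at the contact point of a disc: |v - omega * r|.
        return abs(s["vx"] - s["omega"] * s["radius"])
    # Fires at the step where slipping stops and pure rolling begins.
    return slip(prev_state) > slip_tol and slip(state) <= slip_tol

before = {"vx": 2.0, "omega": 1.0, "radius": 0.5}  # slip = 1.5, sliding
now    = {"vx": 1.0, "omega": 2.0, "radius": 0.5}  # slip = 0.0, rolling
print(sliding_to_rolling(before, now))  # True
```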
Applications — Why Abstraction Changes Everything
Once trajectories are converted into ASTs, LLMs operate on structured event streams instead of raw dynamics.
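What the LLM actually sees is a serialization of that matrix. A minimal sketch, assuming a simple onset-only text format of my own invention rather than the paper's exact prompt layout:

```python
import numpy as np

def ast_to_events(A, pattern_names):
    """Render pattern onsets as one line each: 't=<i>: <pattern> begins'."""
    lines = []
    N, P = A.shape
    for i in range(N):
        for j in range(P):
            # Emit a line only where a pattern switches on.
            if A[i, j] and (i == 0 or not A[i - 1, j]):
                lines.append(f"t={i}: {pattern_names[j]} begins")
    return "\n".join(lines)

A = np.array([[0, 1], [1, 1], [1, 0]])
print(ast_to_events(A, ["collision(A,B)", "support(B,ground)"]))
# t=0: support(B,ground) begins
# t=1: collision(A,B) begins
```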
The paper demonstrates four major applications:
| Task | Without Patterns | With AST Interface |
|---|---|---|
| Q&A on physical rollouts | Error-prone | Improved accuracy |
| Summarization | Surface-level description | Event-based reasoning |
| Puzzle solving (PHYRE) | Weak planning | Better interpretation |
| Reward program synthesis | Brittle heuristics | Near-human average rewards |
Most striking is reward synthesis. Given a natural language goal such as:
“Make the green and blue objects touch after the lever launches the ball.”
The system synthesizes a formal DSL reward program composed of detected patterns and temporal predicates.
This means goals become executable programs grounded in event logic—not just text.
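The article does not spell out the paper's DSL, so the following is a hypothetical rendering of the idea: a temporal predicate (`after`) composed over detected patterns and evaluated directly against an AST. The pattern names are placeholders:

```python
import numpy as np

def after(A, names, trigger, condition):
    """1.0 if `condition` holds at some step after `trigger` first fires."""
    t_col = A[:, names.index(trigger)]
    c_col = A[:, names.index(condition)]
    if not t_col.any():
        return 0.0
    t0 = int(t_col.argmax())  # first step where the trigger is active
    return 1.0 if c_col[t0 + 1:].any() else 0.0

names = ["lever_launch(ball)", "touching(green,blue)"]
A = np.array([[0, 0], [1, 0], [0, 0], [0, 1]])
print(after(A, names, "lever_launch(ball)", "touching(green,blue)"))  # 1.0
```

Because the reward is a program over explicit events, a failing rollout can be debugged predicate by predicate instead of guessing why a scalar score came out low.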
Findings — Performance and Trade-offs
The evaluation uses:
- 2D rigid-body tasks from PHYRE (25 templates × 100 variants = 2500 tasks)
- A custom Q&A benchmark over rollouts
- Reward optimization experiments
Observed Improvements
| Dimension | Effect of Pattern Library |
|---|---|
| Interpretability | High — events are explicit |
| Generalization (within domain) | Strong in 2D rigid-body |
| LM reasoning reliability | Improved |
| Computational overhead | Increased due to detection layer |
Reward programs generated from ASTs approached human-level average rewards, though a gap remains.
The limitation is clear: experiments are confined to 2D rigid-body simulations. Scaling to richer environments may increase annotation density and long-context burdens.
Abstraction simplifies reasoning—but also adds structural complexity.
Implications — A Blueprint for Agentic AI
This work is not just about bouncing balls.
It outlines a broader architectural principle for agentic systems:
Insert an interpretable event abstraction layer between raw environment signals and language reasoning.
For business and engineering teams building AI systems, this has practical implications:
- Robotics & Simulation Training — Instead of feeding raw telemetry to LLMs, extract structured event logs first.
- Explainable Agents — Event libraries provide audit trails for decisions.
- Reward Engineering — Natural language goals can compile into executable, inspectable reward programs.
- AI Governance — Pattern libraries function as constraint layers, making behaviors verifiable.
In short, this is an architecture play.
LLMs do not need to learn Newton’s laws implicitly if the system already encodes “bounce,” “support,” and “handoff.”
The future of physical AI may depend less on bigger models, and more on smarter interfaces.
Conclusion — Patterns as the Missing Interface
Language models are excellent narrators. Physics engines are excellent calculators.
Between them lies a translation problem.
This paper proposes that the bridge is not more tokens or more frames—but event abstraction.
Teach the system to recognize patterns. Then let language reason over them.
It is almost mundane. And precisely because of that, it feels scalable.
If agentic AI is going to interact meaningfully with the physical world, it will need interpretable intermediate structures. Pattern libraries are one compelling candidate.
And sometimes, progress is simply giving the model the right vocabulary.
Cognaptus: Automate the Present, Incubate the Future.