Opening — Why this matters now
Large models can write poetry, generate code, and debate philosophy. Yet show them a bouncing ball in a physics simulator and ask, “Why did that happen?”—and things get awkward.
The problem is not intelligence in the abstract. It is interface. Language models operate in a world of tokens. Physics simulators operate in a world of state vectors and time steps. Somewhere between $(x_t, y_t, v_t)$ and “the ball bounced off the wall,” meaning gets lost.
The paper behind this article proposes a quietly radical idea: don’t ask LLMs to interpret raw trajectories. Teach the system to extract high-level patterns first—collisions, support relationships, rolling transitions—and only then let language models reason. It is less about making LLMs smarter, and more about giving them better abstractions.
That distinction matters for anyone building AI agents that interact with environments, from robotics to simulation-driven training pipelines.
Background — Where Physics Meets Language (and Fails)
Physical reasoning benchmarks such as PHYRE and related suites have repeatedly shown that foundation models struggle with low-level dynamics. Video-LMs and multimodal systems can describe scenes, but extracting causal structure from frame-by-frame state changes remains brittle.
The issue is structural:
| Layer | Representation | Strength | Weakness |
|---|---|---|---|
| Simulation | Continuous state traces $\tau = \{x_1, x_2, \ldots, x_N\}$ | Precise dynamics | Opaque to language models |
| Language Models | Tokens & semantic abstractions | Flexible reasoning | No access to structured event grounding |
| Human reasoning | Events (“bounce”, “stack formed”) | Interpretable & causal | Requires abstraction |
Previous approaches tried to push LLMs directly into the simulator loop, or train vision-language models to internalize physics implicitly. Results were mixed.
This work takes a different route: extract event-level patterns explicitly, and treat them as an interface layer between physics and language.
Method — Discovering Patterns from Simulation Traces
1. From Trajectories to Annotated Simulation Traces
Let a trajectory be:
$$ \tau = \{x_1, x_2, \ldots, x_N\} $$
Each $x_i$ is a full simulation state at time step $i$.
Instead of reasoning directly over $\tau$, the system constructs an annotation matrix $A \in \{0,1\}^{N \times |P|}$ where each column corresponds to a pattern detector $p_j$.
If pattern $p_j$ (e.g., “elastic collision between A and B”) is active at time step $i$, then:
$$ A_{ij} = 1, $$
and $A_{ij} = 0$ otherwise.
This transforms a dense physical trace into a sparse, event-driven abstraction: an Annotated Simulation Trace (AST).
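To make the construction concrete, here is a minimal sketch of AST annotation. It assumes each detector is a boolean function over a single simulation state (the paper's detectors may also inspect windows of states), and the state layout and the `at_rest` detector are illustrative assumptions, not the paper's exact code:

```python
import numpy as np

def annotate(trajectory, detectors):
    """Build the annotation matrix A in {0,1}^(N x |P|).

    trajectory: list of N simulation states
    detectors:  list of |P| callables, each state -> bool
    """
    N, P = len(trajectory), len(detectors)
    A = np.zeros((N, P), dtype=np.uint8)
    for i, state in enumerate(trajectory):
        for j, detect in enumerate(detectors):
            A[i, j] = 1 if detect(state) else 0
    return A

# Hypothetical detector: "global rest state" fires when every
# object in the scene is (near) motionless.
def at_rest(state, eps=1e-3):
    return all(abs(v) < eps for v in state["velocities"])

# Toy two-step trajectory: moving, then settled.
tau = [
    {"velocities": [0.0, 2.1]},
    {"velocities": [0.0, 0.0]},
]
print(annotate(tau, [at_rest]))  # [[0], [1]]
```

The payoff is the sparsity: a trajectory with thousands of states typically compresses to a handful of pattern onsets and offsets, which is exactly the granularity a language model can reason over.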
2. Learning Pattern Detectors via Evolutionary Synthesis
The clever part is that the patterns are not hand-coded. Instead, the system uses an evolutionary program-search framework inspired by FunSearch.
Each candidate detector program $f_j$ is evaluated by a composite fitness score:
$$ \nu = \rho + \eta - \lambda - \psi $$
Where:
- $\rho$ measures correlation between geometry differences and pattern differences
- $\eta$ rewards novelty vs. existing library
- $\lambda$ penalizes code length
- $\psi$ penalizes computational cost
Rather than optimizing for labeled correctness, the search optimizes structural usefulness. Patterns that meaningfully differentiate trajectories survive.
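The sketch below shows how such a composite score might be wired up. The specific estimators are assumptions for illustration (Pearson correlation for $\rho$, one-minus-max-overlap novelty for $\eta$, and normalized proxies for code length $\lambda$ and runtime $\psi$), not the paper's exact definitions:

```python
import numpy as np

def fitness(pat_diffs, geo_diffs, candidate_col, library_cols,
            source_len, runtime_s, len_scale=500.0, time_scale=1.0):
    # rho: do pattern-level differences track geometric differences
    # across pairs of trajectories?
    rho = float(np.corrcoef(geo_diffs, pat_diffs)[0, 1])

    # eta: novelty vs. the existing library, measured here as
    # 1 minus the best per-timestep agreement with any library column.
    overlaps = [np.mean(candidate_col == col) for col in library_cols] or [0.0]
    eta = 1.0 - max(overlaps)

    # lambda: penalize long programs; psi: penalize slow ones.
    lam = source_len / len_scale
    psi = runtime_s / time_scale

    return rho + eta - lam - psi

# Toy usage with made-up measurements.
geo = np.array([0.1, 0.9, 0.4])          # geometry differences per pair
pat = np.array([0.0, 1.0, 0.5])          # pattern differences per pair
cand = np.array([0, 1, 1, 0])            # candidate's annotation column
lib = [np.array([0, 0, 1, 0])]           # existing library columns
print(fitness(pat, geo, cand, lib, source_len=120, runtime_s=0.02))
```

Note the design choice this encodes: a detector earns fitness by carving trajectories into distinguishable classes, not by matching any labeled ground truth.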
This yields both guided patterns (seeded by short natural language hints) and self-discovered patterns (emergent structural motifs).
Examples include:
- Post-collision induced motion
- Support handoff beneath object
- Sliding-to-rolling transition
- Pivoted sweep rotation
- Global rest state
These are not pixels. They are concepts.
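For a flavor of what the evolved programs might look like, here is a hypothetical "sliding-to-rolling transition" detector operating on a window of two states. The field names (`vx`, `omega`, `radius`) and the slip-velocity test are assumptions about the simulator's state, chosen to mirror the physics: rolling without slipping means $v \approx \omega r$ at the contact point:

```python
def sliding_to_rolling(prev_state, state, slip_tol=0.05):
    def slip(s):
        # Slip speed at the contact point of a disc: |v - omega * r|.
        return abs(s["vx"] - s["omega"] * s["radius"])
    # Fires at the step where slipping stops and pure rolling begins.
    return slip(prev_state) > slip_tol and slip(state) <= slip_tol

before = {"vx": 2.0, "omega": 1.0, "radius": 0.5}  # slip = 1.5, sliding
now    = {"vx": 1.0, "omega": 2.0, "radius": 0.5}  # slip = 0.0, rolling
print(sliding_to_rolling(before, now))  # True
```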
Applications — Why Abstraction Changes Everything
Once trajectories are converted into ASTs, LLMs operate on structured event streams instead of raw dynamics.
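What the LLM actually sees is a serialization of that matrix. A minimal sketch, assuming a simple onset-only text format of my own invention rather than the paper's exact prompt layout:

```python
import numpy as np

def ast_to_events(A, pattern_names):
    """Render pattern onsets as one line each: 't=<i>: <pattern> begins'."""
    lines = []
    N, P = A.shape
    for i in range(N):
        for j in range(P):
            # Emit a line only where a pattern switches on.
            if A[i, j] and (i == 0 or not A[i - 1, j]):
                lines.append(f"t={i}: {pattern_names[j]} begins")
    return "\n".join(lines)

A = np.array([[0, 1], [1, 1], [1, 0]])
print(ast_to_events(A, ["collision(A,B)", "support(B,ground)"]))
# t=0: support(B,ground) begins
# t=1: collision(A,B) begins
```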
The paper demonstrates four major applications:
| Task | Without Patterns | With AST Interface |
|---|---|---|
| Q&A on physical rollouts | Error-prone | Improved accuracy |
| Summarization | Surface-level description | Event-based reasoning |
| Puzzle solving (PHYRE) | Weak planning | Better interpretation |
| Reward program synthesis | Brittle heuristics | Near-human average rewards |
Most striking is reward synthesis. Given a natural language goal such as:
“Make the green and blue objects touch after the lever launches the ball.”
The system synthesizes a formal DSL reward program composed of detected patterns and temporal predicates.
This means goals become executable programs grounded in event logic—not just text.
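The article does not spell out the paper's DSL, so the following is a hypothetical rendering of the idea: a temporal predicate (`after`) composed over detected patterns and evaluated directly against an AST. The pattern names are placeholders:

```python
import numpy as np

def after(A, names, trigger, condition):
    """1.0 if `condition` holds at some step after `trigger` first fires."""
    t_col = A[:, names.index(trigger)]
    c_col = A[:, names.index(condition)]
    if not t_col.any():
        return 0.0
    t0 = int(t_col.argmax())  # first step where the trigger is active
    return 1.0 if c_col[t0 + 1:].any() else 0.0

names = ["lever_launch(ball)", "touching(green,blue)"]
A = np.array([[0, 0], [1, 0], [0, 0], [0, 1]])
print(after(A, names, "lever_launch(ball)", "touching(green,blue)"))  # 1.0
```

Because the reward is a program over explicit events, a failing rollout can be debugged predicate by predicate instead of guessing why a scalar score came out low.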
Findings — Performance and Trade-offs
The evaluation uses:
- 2D rigid-body tasks from PHYRE (25 templates × 100 variants = 2500 tasks)
- A custom Q&A benchmark over rollouts
- Reward optimization experiments
Observed Improvements
| Dimension | Effect of Pattern Library |
|---|---|
| Interpretability | High — events are explicit |
| Generalization (within domain) | Strong in 2D rigid-body |
| LM reasoning reliability | Improved |
| Computational overhead | Increased due to detection layer |
Reward programs generated from ASTs approached human-level average rewards, though a gap remains.
The limitation is clear: experiments are confined to 2D rigid-body simulations. Scaling to richer environments may increase annotation density and long-context burdens.
Abstraction simplifies reasoning—but also adds structural complexity.
Implications — A Blueprint for Agentic AI
This work is not just about bouncing balls.
It outlines a broader architectural principle for agentic systems:
Insert an interpretable event abstraction layer between raw environment signals and language reasoning.
For business and engineering teams building AI systems, this has practical implications:
- Robotics & Simulation Training — Instead of feeding raw telemetry to LLMs, extract structured event logs first.
- Explainable Agents — Event libraries provide audit trails for decisions.
- Reward Engineering — Natural language goals can compile into executable, inspectable reward programs.
- AI Governance — Pattern libraries function as constraint layers, making behaviors verifiable.
In short, this is an architecture play.
LLMs do not need to learn Newton’s laws implicitly if the system already encodes “bounce,” “support,” and “handoff.”
The future of physical AI may depend less on bigger models, and more on smarter interfaces.
Conclusion — Patterns as the Missing Interface
Language models are excellent narrators. Physics engines are excellent calculators.
Between them lies a translation problem.
This paper proposes that the bridge is not more tokens or more frames—but event abstraction.
Teach the system to recognize patterns. Then let language reason over them.
It is almost mundane. And precisely because of that, it feels scalable.
If agentic AI is going to interact meaningfully with the physical world, it will need interpretable intermediate structures. Pattern libraries are one compelling candidate.
And sometimes, progress is simply giving the model the right vocabulary.
Cognaptus: Automate the Present, Incubate the Future.