From Pixels to Patterns: Teaching LLMs to Read Physics

Logs are useful until they become a landfill.

Every serious automation system eventually produces the same awkward artifact: a long trace of what happened. A machine moved here. A sensor changed there. An object collided, rolled, paused, reversed, bounced, touched something else, and then the system reached—or failed to reach—the desired state. In principle, this trace contains the answer. In practice, it is the kind of answer that makes a language model stare at 5,000 tokens of coordinates and politely hallucinate a story.

That is the problem behind “Discovering High Level Patterns from Simulation Traces,” a paper by Sean Memery and Kartic Subr.¹ The paper is not mainly about making LLMs “better at physics” in the vague conference-demo sense. It is about a more useful question: how do we convert low-level simulation evidence into the kind of event-level structure that an LLM can reason over without fine-tuning?

The paper’s answer is mechanism-first. Do not feed the model more pixels, more frames, or more numerical state. First learn a library of detector programs that recognize reusable physical patterns—collision, support loss, rolling contact, cushion rebound, ball pocketed, spring pull, and similar events. Then use those detectors to turn raw simulation traces into sparse annotated simulation traces. Only after that should the language model summarize, choose actions, or synthesize reward programs.

In other words, the model is not asked to “understand physics” from scratch. It is handed a compressed event ledger. A modest intervention. Also, apparently, a necessary one.

The real problem is not missing data; it is unreadable evidence

The obvious instinct is to give the LLM richer state information. More frames. More screenshots. More coordinates. More simulation rollouts. This instinct is understandable, and mostly wrong.

The paper starts from the observation that current LLMs and vision-language models still struggle with specific physical systems. They may describe a scene plausibly, but physics reasoning requires exact temporal relationships: which object moved first, whether contact happened before or after a bounce, whether support was lost, whether a ball touched a cushion before being potted. These are not decorative details. They are the task.

A raw simulation trace is rich but badly aligned with language-model reasoning. It may contain object positions, velocities, identities, states, and time steps. That is excellent for a simulator. It is less excellent for a model trying to answer, “Did the red ball cause the green object to touch the blue object?” or “How should I write a reward function for a shot that rebounds off two cushions before potting the eight ball?”

The paper’s central correction is simple:

Reader belief	Correction	Why it matters
More raw trace data should help the LLM reason.	Raw traces are too fine-grained and verbose for reliable language reasoning.	Extra context can become noise rather than evidence.
A vision-language model should infer physical events from frames.	Many relevant events are temporal, relational, and sparse.	The model needs event structure, not just visual exposure.
Reward design can be generated directly from natural language.	The reward must be grounded in verifiable events inside the trace.	Otherwise the reward program is fluent but weakly connected to the simulator.
Human experts must hand-code all useful abstractions.	The paper learns detector programs from simulation traces, optionally guided by human or LLM labels.	This opens a path to scalable trace annotation, though not yet a universal one.

The replacement idea is not mystical. It is closer to a log parser with scientific manners: convert the trace into named, reusable, time-indexed events before asking an LLM to reason.

The pipeline turns frames into detectors, detectors into annotations, and annotations into reasoning

The paper’s mechanism has three major steps.

First, it learns high-level pattern detector programs. These are executable functions that inspect simulation traces and output whether a pattern is active at particular time steps. A “bounce” detector, for example, may look for velocity reversal. A “ball pocketed” detector may identify a billiard ball entering a pocket. The detectors are not neural hidden states. They are code.
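
To make “they are code” concrete, here is a minimal sketch of what a bounce detector could look like in Python. It is an illustration, not one of the paper's learned programs: the name `detect_bounce`, the `min_speed` threshold, and the use of a single vertical-velocity signal are all assumptions.

```python
from typing import List, Sequence


def detect_bounce(vy: Sequence[float], min_speed: float = 0.1) -> List[bool]:
    """Flag time steps where vertical velocity flips from downward to upward.

    Hypothetical detector: `min_speed` filters out jitter around zero.
    """
    active = [False] * len(vy)
    for t in range(1, len(vy)):
        falling_before = vy[t - 1] < -min_speed
        rising_now = vy[t] > min_speed
        active[t] = falling_before and rising_now
    return active


# A ball falls for three steps, rebounds, slows, and starts falling again.
vy_trace = [-1.0, -2.0, -3.0, 2.4, 1.5, 0.2, -0.8]
bounce = detect_bounce(vy_trace)
print(bounce)  # [False, False, False, True, False, False, False]
```

The output is one boolean per time step, which is the shape the rest of the pipeline needs: a named pattern and the steps at which it was active.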

Second, the detector library converts raw traces into annotated simulation traces, or ASTs. Instead of representing a rollout as a long sequence of low-level states, the system represents it as a sparse matrix of pattern activations. Each row is a time step. Each column is a learned pattern. Each cell says whether that pattern was active at that step.
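
The time-by-pattern matrix can be sketched in a few lines of Python. Everything named here is hypothetical: the two toy detectors, the `annotate` and `sparse_events` helpers, and the one-field state dictionary stand in for whatever the real simulator and library provide.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]
Trace = List[State]
Detector = Callable[[Trace], List[bool]]


def bounce(trace: Trace) -> List[bool]:
    # Hypothetical detector: vertical velocity flips from negative to positive.
    return [t > 0 and trace[t - 1]["vy"] < 0 < trace[t]["vy"] for t in range(len(trace))]


def at_rest(trace: Trace) -> List[bool]:
    # Hypothetical detector: no vertical motion at this step.
    return [abs(state["vy"]) < 1e-6 for state in trace]


def annotate(trace: Trace, library: Dict[str, Detector]) -> List[List[bool]]:
    """Return a matrix with one row per time step and one column per pattern."""
    columns = [detector(trace) for detector in library.values()]
    return [list(row) for row in zip(*columns)]


def sparse_events(matrix: List[List[bool]], names: List[str]) -> List[Tuple[int, str]]:
    """Keep only the active cells, as (time_step, pattern_name) pairs."""
    return [(t, names[j]) for t, row in enumerate(matrix) for j, on in enumerate(row) if on]


library: Dict[str, Detector] = {"bounce": bounce, "at_rest": at_rest}
trace: Trace = [{"vy": v} for v in (-1.0, -2.0, 1.5, 0.5, 0.0, 0.0)]

matrix = annotate(trace, library)
events = sparse_events(matrix, list(library))
print(events)  # [(2, 'bounce'), (4, 'at_rest'), (5, 'at_rest')]
```

Six time steps collapse to three events. On a real rollout with thousands of steps, that ratio is the point.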

Third, those annotated traces are given to language models for downstream tasks: summarization, action selection in physics benchmarks, and reward-program synthesis from natural-language goals.

A simplified version looks like this:

Raw simulator trace
        ↓
Evolutionary program synthesis learns detector code
        ↓
Detector library recognizes reusable physical events
        ↓
Annotated Simulation Trace: sparse pattern activations over time
        ↓
LLM uses annotations for summaries, action choice, and reward programs

The important move is the middle layer. The paper does not merely ask an LLM to describe a video. It builds an interface between simulator evidence and language reasoning.

That interface is deliberately interpretable. A detector is a program. Its output can be inspected. Its label is readable. Its activation can be traced back to a region of the simulation. This is not the same as saying every detector is perfect. The paper explicitly notes that detections can be noisy in timing or position. But imperfect structured evidence can still beat fluent guesswork. This is one of those inconvenient truths that ruins many AI product decks.

Program synthesis is used as a discovery engine, not as a decorative coding trick

The paper adapts FunSearch-style evolutionary program synthesis to learn pattern detectors. The input is a set of simulation traces, candidate pattern labels, and a skeleton detector function. The output is a library of detector programs.

The candidate labels can come from humans or from an LLM. Human labels are domain-relevant phrases such as “bounce,” “spring pull,” “support contact,” or “cushion rebound.” LLM labels are suggested after the model sees example traces and is prompted to propose relevant reusable patterns. The paper tests both.

The scoring function is doing the real work. A candidate detector is rewarded when its pattern activations preserve meaningful differences between simulation traces. If two raw traces are similar, their annotations should also look similar. If two raw traces differ, their annotations should reflect that difference. The method also rewards novelty relative to the existing library, while penalizing slow, long, or degenerate programs.

That gives the system a practical bias: learn detectors that are concise, executable, non-redundant, and useful for distinguishing simulation behavior.
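
A rough sketch of such a score, in Python, may help. This is not the paper's formula, which is not reproduced here. It is one plausible reading of the description: correlate pairwise distances between raw traces with pairwise distances between their annotations, add a novelty term against the existing library, and subtract a length penalty. The function names, the weights, and the use of Pearson correlation are all assumptions.

```python
from itertools import combinations
from typing import List, Sequence


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    # For 0/1 activations this is the fraction of time steps that disagree.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant (degenerate) detector preserves no structure
    return cov / (vx * vy) ** 0.5


def score_detector(raw: List[List[float]], acts: List[List[int]],
                   library_acts: List[List[List[int]]], code_len: int,
                   novelty_weight: float = 0.5, length_penalty: float = 0.001) -> float:
    """Hypothetical score: structure preservation + novelty - length penalty."""
    pairs = list(combinations(range(len(raw)), 2))
    raw_dist = [mean_abs_diff(raw[i], raw[j]) for i, j in pairs]
    ann_dist = [mean_abs_diff(acts[i], acts[j]) for i, j in pairs]
    structure = pearson(raw_dist, ann_dist)
    if library_acts:
        # Novelty is the disagreement with the most similar existing detector.
        novelty = min(
            sum(mean_abs_diff(a, b) for a, b in zip(acts, other)) / len(acts)
            for other in library_acts
        )
    else:
        novelty = 1.0
    return structure + novelty_weight * novelty - length_penalty * code_len


raw = [[0, 0, 0, 0], [0, 0, 0, 0.1], [5, 5, 5, 5]]       # two similar traces, one different
informative = [[0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1]]  # separates the odd trace out
constant = [[0, 0, 0, 0]] * 3                             # never fires

good = score_detector(raw, informative, [], code_len=120)
flat = score_detector(raw, constant, [], code_len=120)
duplicate = score_detector(raw, informative, [informative], code_len=120)
print(good > flat, duplicate < good)  # True True
```

The detector that never fires scores lowest on structure, and an exact copy of an existing detector loses its novelty bonus. Both behaviors match the bias the text describes, whatever the exact terms in the real objective.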

The paper then robustifies detectors through an ensemble process. Multiple discovery runs generate candidate code. The candidates are clustered by activation behavior. Reliability weights are tuned so that the resulting pattern library produces useful summaries and discriminative annotations. This is an implementation detail with practical consequences: single detector programs can work, but ensembles reduce noise and improve stability.
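
The reliability-weighting step can be illustrated with a weighted vote in Python. This is a simplified sketch under stated assumptions: the clustering stage is skipped, the weights are supplied by hand rather than tuned, and `ensemble_activation` and its `threshold` are hypothetical names.

```python
from typing import List, Sequence


def ensemble_activation(member_outputs: Sequence[Sequence[bool]],
                        weights: Sequence[float],
                        threshold: float = 0.5) -> List[bool]:
    """Combine several detectors for one pattern by weighted vote.

    Assumes positive weights. A time step is active when the weighted
    share of members that fired reaches `threshold`.
    """
    total = sum(weights)
    combined = []
    for votes in zip(*member_outputs):
        support = sum(w for w, fired in zip(weights, votes) if fired) / total
        combined.append(support >= threshold)
    return combined


# Three candidate programs for the same label, with hand-picked reliabilities.
members = [
    [False, True, True, False],   # most reliable
    [False, True, False, False],
    [True, False, True, False],   # least reliable, fires spuriously at t=0
]
weights = [0.5, 0.3, 0.2]

combined = ensemble_activation(members, weights)
print(combined)  # [False, True, True, False]
```

The spurious activation at step 0 is outvoted, while the two steps with broad support survive. That is the noise reduction the ensemble is meant to buy.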

The useful business analogy is not “let the LLM code everything.” It is more specific: let program synthesis search for reusable event extractors, then treat those extractors as a structured interface layer. That is closer to automated feature engineering than to magic.

The main evidence: annotated traces improve physics benchmark performance

The paper evaluates the approach across three physics environments from the DeepPHY benchmark: PHYRE, I-PHYRE, and PoolTool. These are not identical tasks. PHYRE involves placing a red ball so green and blue objects touch. I-PHYRE involves removing objects to cause a red object to fall out of the scene. PoolTool involves billiards-style shot selection.

The key comparison is between the DeepPHY baseline and models using learned pattern annotations. The model used in the experiments is Qwen3.6 35B A3B with llama.cpp as the inference backend. The model is given the initial image plus annotated patterns rather than video rollout frames.

The reported success rates are:

Method	PHYRE success	I-PHYRE success	PoolTool success
DeepPHY baseline	13.42%	40.03%	45.81%
Human-label pattern library	21.94%	54.53%	80.67%
LLM-label pattern library	22.42%	45.29%	80.36%

This is the paper’s main evidence for the practical usefulness of annotated traces. The results are not subtle in PoolTool: success rises from 45.81% to roughly 80% with either human-label or LLM-label libraries. I-PHYRE also improves, though human labels do better than LLM labels. PHYRE improves more modestly and remains hard.

That unevenness matters. The paper is not showing that a dozen learned patterns solve physical reasoning in general. It is showing that a sparse event layer can make downstream language-model reasoning more effective, and that the payoff depends on environment structure and pattern quality.

PoolTool is naturally event-friendly: cue strike, collision, rebound, pocketed ball. The game already thinks in event language. PHYRE is more visually and spatially diverse, so a small pattern library has less coverage. The result is not a failure. It is a warning label: trace compression only works when the extracted patterns preserve the causal distinctions the task needs.
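
The size of each gain is easy to compute from the reported success rates. The short Python snippet below does only that arithmetic; the numbers come from the results table, and the `gains` helper is a hypothetical name.

```python
baseline = {"PHYRE": 13.42, "I-PHYRE": 40.03, "PoolTool": 45.81}
libraries = {
    "human-label": {"PHYRE": 21.94, "I-PHYRE": 54.53, "PoolTool": 80.67},
    "LLM-label": {"PHYRE": 22.42, "I-PHYRE": 45.29, "PoolTool": 80.36},
}


def gains(results, reference):
    """Return (absolute gain in points, relative gain in percent) per environment."""
    return {
        env: (
            round(results[env] - reference[env], 2),
            round(100 * (results[env] - reference[env]) / reference[env], 1),
        )
        for env in reference
    }


human = gains(libraries["human-label"], baseline)
llm = gains(libraries["LLM-label"], baseline)

for name, table in (("human-label", human), ("LLM-label", llm)):
    for env, (points, relative) in table.items():
        print(f"{name:12s} {env:9s} +{points:.2f} points ({relative:+.1f}% relative)")
```

With human labels the absolute gains are about 8.5 points on PHYRE, 14.5 on I-PHYRE, and 34.9 on PoolTool. In relative terms PHYRE looks large, around 63%, but only because its baseline is so low. In absolute terms it remains the hardest environment.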

Human labels help, but LLM labels are not useless decoration

One of the more interesting findings is that human-provided labels are not always clearly better than LLM-suggested labels. In PHYRE, LLM-label performance is slightly higher than human-label performance. In PoolTool, they are nearly identical. In I-PHYRE, human labels perform better.

This should be interpreted carefully. It does not prove that domain expertise is obsolete. Please do not fire the physicist and replace them with a prompt template; that would be the kind of “efficiency” that later becomes a legal department line item.

A better interpretation is that labels play two roles. They guide discovery by naming the kinds of patterns worth detecting, and they make the resulting annotations readable to the LLM. LLMs can propose usable labels when the environment has obvious event structure. Human expertise becomes more valuable when the causal mechanics are subtler or when pattern choice strongly affects task success.

For business systems, this distinction is useful. If an operation has obvious event categories—handoff, delay, queue jump, failed scan, temperature breach, refund reversal—LLM-suggested labels may be enough to bootstrap an event library. If the system depends on domain-specific failure modes—chemical process anomalies, credit-risk triggers, compliance exceptions, equipment fatigue—expert labels are not optional. They are the map.

The library-size test is an ablation, not a second thesis

The paper also studies how performance changes with the number of patterns in the library. This is best read as an ablation: it tests whether the pattern library itself contributes to performance and whether more patterns generally help.

The result is directionally sensible. Larger libraries generally improve downstream success rates. In I-PHYRE and PoolTool, around 12 patterns are enough to reach approximately 50%–75% success in the reduced benchmark setting. PHYRE shows diminishing returns around 10 patterns and remains below 20% in that analysis.

The practical lesson is not “always add more patterns.” The practical lesson is that pattern coverage and pattern relevance are different things.

More labels can help when they add distinct evidence. More labels are less helpful when the hard part is finding the right abstraction. A library full of weakly relevant events is just a polite spreadsheet of noise. We have all seen those.

For implementation, the paper’s library-size analysis suggests a staged approach:

Stage	What to test	Stop adding patterns when…
Bootstrap	Can a small label set improve summaries or decisions?	Performance does not beat raw trace or image baselines.
Expansion	Do added labels cover missing causal events?	New labels are redundant with existing detectors.
Calibration	Are detectors stable across traces?	Timing or localization noise harms downstream use.
Deployment	Do annotations improve real decisions, not just explanations?	Users like the summaries but outcomes do not improve.

This is where the paper’s mechanism becomes more business-relevant than its benchmark score. It gives a way to ask whether the event layer is worth maintaining.
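
One of the stopping rules above, redundancy with existing detectors, is simple to check mechanically. The sketch below uses Jaccard overlap between activation sets. The metric, the 0.8 threshold, and the function names are assumptions for illustration, not something the paper prescribes.

```python
from typing import Dict, Sequence


def jaccard(a: Sequence[int], b: Sequence[int]) -> float:
    """Overlap between the sets of time steps at which two detectors fire."""
    steps_a = {t for t, on in enumerate(a) if on}
    steps_b = {t for t, on in enumerate(b) if on}
    if not steps_a and not steps_b:
        return 1.0  # two detectors that never fire are indistinguishable
    return len(steps_a & steps_b) / len(steps_a | steps_b)


def is_redundant(candidate: Sequence[int],
                 library: Dict[str, Sequence[int]],
                 threshold: float = 0.8) -> bool:
    """True when the candidate fires almost exactly where an existing pattern does."""
    return any(jaccard(candidate, existing) >= threshold for existing in library.values())


# Activations of two existing patterns over the same six-step trace.
library = {
    "collision": [0, 1, 0, 0, 1, 0],
    "rolling":   [0, 0, 1, 1, 0, 0],
}

impact_is_redundant = is_redundant([0, 1, 0, 0, 1, 0], library)    # same steps as "collision"
pocketed_is_redundant = is_redundant([0, 0, 0, 0, 0, 1], library)  # fires somewhere new
print(impact_is_redundant, pocketed_is_redundant)  # True False
```

In practice the comparison would run over many traces rather than one, but the question stays the same: does the new label fire anywhere the old ones do not?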

The summarization results show compression as a performance feature

The paper evaluates generated summaries in two ways: computational efficiency and human judgment.

For efficiency, the annotated-trace method reduces both generation time and token count compared with the DeepPHY baseline:

Environment	DeepPHY baseline seconds / tokens	Human-label seconds / tokens	LLM-label seconds / tokens
PHYRE	85.87 / 5505	45.84 / 1754	44.83 / 1720
I-PHYRE	74.18 / 5505	38.11 / 984	37.71 / 972
PoolTool	75.72 / 5505	47.97 / 964	43.20 / 926

This is not merely a cost-reduction table. It is evidence that the representation is better aligned with the task. The model receives less input but produces more useful summaries. That is the dream version of compression: fewer tokens, more signal, less interpretive theater.
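
The savings are worth stating as percentages. The Python snippet below computes them from the table's figures; the `reduction` helper and the dictionary layout are just a convenient way to hold the numbers.

```python
# (seconds, tokens) per environment, copied from the efficiency table.
rows = {
    "PHYRE":    {"baseline": (85.87, 5505), "human": (45.84, 1754), "llm": (44.83, 1720)},
    "I-PHYRE":  {"baseline": (74.18, 5505), "human": (38.11, 984),  "llm": (37.71, 972)},
    "PoolTool": {"baseline": (75.72, 5505), "human": (47.97, 964),  "llm": (43.20, 926)},
}


def reduction(before: float, after: float) -> float:
    """Percentage reduction relative to the baseline value."""
    return round(100 * (before - after) / before, 1)


savings = {}
for env, row in rows.items():
    base_seconds, base_tokens = row["baseline"]
    for label in ("human", "llm"):
        seconds, tokens = row[label]
        savings[(env, label)] = (reduction(base_seconds, seconds),
                                 reduction(base_tokens, tokens))

for (env, label), (time_cut, token_cut) in savings.items():
    print(f"{env:9s} {label:6s} time -{time_cut}%  tokens -{token_cut}%")
```

Token counts fall by roughly 68% to 83%, while generation time falls by roughly 37% to 49%. Time shrinks less than tokens do, so token count is clearly not the only cost driver.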

For quality, the paper reports a Prolific study with 100 participants. Participants watched rollout videos and rated summaries generated from image-only context, human-label annotations, or LLM-label annotations on a 1–7 Likert scale. The paper reports that summaries based on learned pattern annotations were rated more accurate than image-only summaries.

The study’s likely purpose is not to prove that the detectors are physically perfect. It supports a narrower claim: annotated traces provide salient event information that helps language models describe simulation behavior in a way humans judge as more accurate.

That distinction matters. Summarization is not the same as control. A summary can be useful while still missing a rare failure mode. But in business workflows, better summaries are often the first layer of value: incident reports, audit trails, operator explanations, customer-service diagnostics, and post-event reviews all need compressed evidence.

Reward synthesis is the application test: natural language becomes executable goals

The most operationally interesting part of the paper is reward program synthesis.

Given a natural-language goal, the system asks an LLM to synthesize a reward program in a custom domain-specific language. The program operates over annotated simulation traces, not raw frames. It can check whether patterns occurred, whether one event happened after another, whether a count threshold was met, or whether an object ended near a target location.

A PoolTool-style goal might be: pot the 9-ball in the lower-left pocket without touching a cushion. The corresponding reward program can combine pattern predicates, temporal ordering, and spatial terms. The paper then optimizes actions using that reward program.
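
For the no-cushion goal, a reward program in this spirit might look like the Python sketch below. It is not the paper's DSL, whose syntax is not shown here. The predicate names (`occurred`, `never`, `before`), the pattern names, and the equal-weight averaging are all assumptions, and the spatial term is left out for brevity.

```python
from typing import Callable, Dict, List

# Annotated trace: pattern name -> time steps at which the pattern fired.
AST = Dict[str, List[int]]
Predicate = Callable[[AST], float]


def occurred(pattern: str) -> Predicate:
    return lambda ast: 1.0 if ast.get(pattern) else 0.0


def never(pattern: str) -> Predicate:
    return lambda ast: 0.0 if ast.get(pattern) else 1.0


def before(first: str, second: str) -> Predicate:
    def check(ast: AST) -> float:
        t_first, t_second = ast.get(first), ast.get(second)
        ordered = bool(t_first) and bool(t_second) and min(t_first) < min(t_second)
        return 1.0 if ordered else 0.0
    return check


def reward(ast: AST, terms: List[Predicate]) -> float:
    """Average of satisfied terms, so partial progress earns partial credit."""
    return sum(term(ast) for term in terms) / len(terms)


# Hypothetical program for: pot the 9-ball without touching a cushion.
terms = [
    occurred("cue_hits_nine"),
    occurred("nine_ball_pocketed"),
    before("cue_hits_nine", "nine_ball_pocketed"),
    never("cushion_contact"),
]

clean_pot = {"cue_hits_nine": [12], "nine_ball_pocketed": [40]}
pot_via_cushion = {"cue_hits_nine": [12], "cushion_contact": [20], "nine_ball_pocketed": [40]}
near_miss = {"cue_hits_nine": [12]}

scores = [reward(r, terms) for r in (clean_pot, pot_via_cushion, near_miss)]
print(scores)  # [1.0, 0.75, 0.5]
```

A binary reward would score the last two rollouts identically at zero. Here the optimizer can tell that potting via a cushion is closer to the goal than a near miss.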

This section should be read as an application test. It asks whether learned annotations can ground natural-language goals well enough to support optimization. The comparison is against sparse binary rewards, where the optimizer only receives success or failure. Dense synthesized reward programs give partial credit, making optimization more sample-efficient.

The paper reports that synthesized reward programs outperform sparse binary rewards across tested optimization budgets on 10 natural-language PoolTool goals and 100 held-out scenes. The mechanism is intuitive: a dense reward can tell the optimizer it is getting closer, while a binary reward says “no” until the very end. Anyone who has managed junior staff, trained a model, or tried to learn golf understands the problem.
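
A toy example shows the difference between the two signals. The Python sketch below runs the same deterministic hill climber on a one-dimensional action under a dense and a sparse reward. The optimizer, the target value, and the tolerance are invented for illustration and are not the paper's setup.

```python
from typing import Callable

TARGET = 0.73      # hypothetical ideal action, e.g. a normalized cue angle
TOLERANCE = 0.01   # how close counts as success


def sparse_reward(x: float) -> float:
    return 1.0 if abs(x - TARGET) < TOLERANCE else 0.0


def dense_reward(x: float) -> float:
    return -abs(x - TARGET)  # closer is better, everywhere


def hill_climb(reward: Callable[[float], float], x0: float,
               step: float = 0.25, budget: int = 60, min_step: float = 1e-4) -> float:
    """Move to a neighbor only if it strictly improves; otherwise halve the step."""
    x, best, evals = x0, reward(x0), 1
    while evals < budget and step > min_step:
        improved = False
        for candidate in (x - step, x + step):
            r = reward(candidate)
            evals += 1
            if r > best:
                x, best, improved = candidate, r, True
                break
        if not improved:
            step /= 2
    return x


dense_x = hill_climb(dense_reward, x0=0.0)
sparse_x = hill_climb(sparse_reward, x0=0.0)

print(abs(dense_x - TARGET) < TOLERANCE)   # True: the gradient-like signal leads to the target
print(abs(sparse_x - TARGET) < TOLERANCE)  # False: every neighbor scores 0, so the search never moves
```

Under the sparse reward, the climber shrinks its step around the starting point and gives up. Under the dense reward, the same procedure walks to the target. Real optimizers are smarter than this one, but the asymmetry in information is the same.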

But the boundary is equally important. The reward programs are synthesized within a designed DSL and evaluated in physics simulators where success can be checked by hand-coded verifiers. This is not a general proof that any business goal in English can become a safe executable objective. It is a proof-of-concept for a constrained loop:

Natural-language goal
        ↓
DSL reward program over known event patterns
        ↓
Simulation-based optimization
        ↓
Hand-coded verification of success

That loop is promising precisely because it is constrained.

The single-program test shows robustness, not a free lunch

In its discussion, the paper includes an ablation that removes the ensemble optimization and uses the best single program for each label. The results remain viable:

Method	PHYRE success	I-PHYRE success	PoolTool success
Human-label single-program library	19.61%	51.32%	76.13%
LLM-label single-program library	21.52%	39.64%	79.23%

Compared with the ensemble results, the single-program libraries score lower in every environment and for both label sources, by roughly 1 to 6 percentage points, and can show higher variability, but they do not collapse. This is useful evidence. It suggests that the core benefit comes from the learned event abstraction, while the ensemble step improves stability and noise reduction.

For a business reader, this matters because the full research system may be expensive to reproduce. A lighter system using single detectors could still deliver value in narrower settings. But the paper also gives a reason not to oversimplify too quickly: production systems care about tail behavior, and detector instability is exactly where expensive mistakes hide.

A reasonable deployment pattern would be to start with single detectors for prototyping, then add ensembles or reliability weighting for high-stakes workflows.

What this means for business automation: trace compression before reasoning

The business relevance of this paper is not limited to videogames or billiards. Its deeper lesson is about AI interfaces for dynamic systems.

Many business processes produce traces:

warehouse movement logs;
call-center conversation timelines;
transaction-monitoring sequences;
manufacturing sensor streams;
insurance claim histories;
software incident logs;
robotic or RPA execution traces;
customer onboarding workflows;
compliance review trails.

The common failure pattern is the same. Teams collect detailed logs, then ask an LLM to “summarize what happened.” Sometimes it works. Sometimes the model produces an impressively coherent explanation of the wrong causal chain. The problem is not that the model lacks enough context. The problem is that the context has not been translated into the right abstraction layer.

The paper suggests a more disciplined architecture:

Technical contribution in the paper	Operational equivalent	ROI relevance
Learn detector programs from simulation traces.	Learn or define event extractors from operational logs.	Reduces manual review and brittle rule writing.
Convert traces into sparse annotated simulation traces.	Convert noisy logs into compact event timelines.	Cuts token cost and improves interpretability.
Use labels from humans or LLMs.	Combine expert taxonomies with LLM-assisted event discovery.	Speeds up domain modeling without removing expert oversight.
Use annotations for summaries and decisions.	Feed structured event histories to LLM copilots or agents.	Improves explanation quality and action recommendations.
Synthesize reward programs from natural language.	Translate business objectives into constrained evaluators.	Supports optimization, simulation, and policy testing.
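
As a sketch of the second row, here is what a hand-written event extractor over an operational log could look like in Python. The cold-chain scenario, the `temp_c` field, the 8 °C limit, and the two-reading minimum are all invented for illustration. The paper works on physics simulators, not sensor logs.

```python
from typing import Dict, List, Tuple

LogRow = Dict[str, float]


def temperature_breach(rows: List[LogRow], limit_c: float = 8.0,
                       min_steps: int = 2) -> List[Tuple[int, str]]:
    """Emit one event per sustained run of readings above the limit.

    Hypothetical detector: a run must last at least `min_steps` readings,
    and the event is stamped with the index where the run started.
    """
    events = []
    run = 0
    for i, row in enumerate(rows):
        run = run + 1 if row["temp_c"] > limit_c else 0
        if run == min_steps:
            events.append((i - min_steps + 1, "temperature_breach"))
    return events


readings = [4.0, 9.1, 5.0, 8.5, 9.0, 9.4, 6.0]
log = [{"temp_c": value} for value in readings]

timeline = temperature_breach(log)
print(timeline)  # [(3, 'temperature_breach')]
```

Seven readings become one event. The isolated spike at index 1 is ignored and the sustained excursion starting at index 3 is reported once, which is the kind of compact timeline an LLM can reason over.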

The Cognaptus inference is straightforward: in automation-heavy businesses, the next useful layer may not be a larger chatbot. It may be a trace-to-pattern layer that sits between systems of record and language models.

That layer would not be glamorous. It would look like event schemas, detector libraries, validation dashboards, and boring test suites. Naturally, this means it might actually work.

What the paper directly shows, and what we should not pretend it shows

The paper directly shows that learned pattern annotations can improve LLM performance on selected physics benchmark tasks, reduce summarization token/time costs, improve human-rated summary accuracy, and support reward-program optimization in constrained simulation settings.

It also shows that LLM-suggested labels can sometimes work comparably to human-specified labels, though not uniformly. It shows that library size generally matters, but pattern relevance matters too. It shows that single-program detectors are viable, while ensembles improve robustness.

What Cognaptus infers for business use is broader but still bounded: operational traces should often be compressed into event-level representations before being passed to LLMs. This is especially relevant when decisions depend on temporal order, causal chains, and sparse key events.

What remains uncertain is also clear.

First, the environments are physics simulators with well-defined state structure. Business systems are often messier: missing fields, ambiguous timestamps, inconsistent identifiers, and human behavior that refuses to conserve momentum.

Second, the pattern libraries are small. The paper notes that scaling to larger libraries could increase computation and token costs. In business settings, pattern taxonomies can become political museums: every team wants its own category, no one wants to delete old ones, and suddenly the “compact event layer” is 900 labels long.

Third, detector noise matters. The paper acknowledges that learned patterns can trigger imprecisely in timing or position. For low-stakes summarization, that may be acceptable. For safety, compliance, or financial execution, it requires validation and monitoring.

Fourth, reward synthesis is constrained by the DSL. This is good for safety but limits generality. The method works because the reward language is structured around known events and allowed predicates. Remove that structure and the system becomes another fluent wish generator.

The managerial lesson: build the interface before blaming the model

A lazy reading of the paper says: LLMs need better physics. A better reading says: LLMs need better interfaces to evidence.

That distinction is important for business leaders. When an AI assistant fails to reason over a process, the immediate response is often to upgrade the model, expand the prompt, add screenshots, or fine-tune. Sometimes that helps. Often it just gives the model a larger swamp to wander through.

This paper points to a different diagnostic question: did we translate the system’s raw trace into the right set of reusable, verifiable events?

If not, the model is being asked to infer the event schema, detect the events, order them, reason over them, and produce a decision—all inside one prompt. That is not intelligence. That is poor architecture with a nice UI.

For Cognaptus-style automation, the pattern-discovery idea is valuable because it separates responsibilities:

simulators and systems generate low-level traces;
detector programs extract event structure;
annotated traces provide compact evidence;
LLMs handle language, explanation, and flexible goal expression;
optimizers or rule engines execute against verifiable objectives.

This is not anti-LLM. It is pro-division-of-labor. The model gets to do what it is good at. The code gets to do what code is good at. Everyone is happier, except perhaps the person who wanted one prompt to replace the architecture diagram.

Conclusion: pixels are not patterns, and traces are not explanations

The paper’s contribution is not that it discovers every physical law from data. It does something more immediately useful: it shows how to learn interpretable event detectors that turn low-level simulation traces into compact annotated evidence for language models.

That shift—from pixels to patterns, from traces to annotations, from natural-language goals to executable reward programs—is the core idea. The benchmark gains matter because they validate the interface. The summarization results matter because they show compression can improve both cost and quality. The reward-program experiment matters because it connects language to optimization through verifiable event structure.

The business lesson is equally plain. Before asking an LLM to reason over a dynamic system, build the event layer. Do not mistake raw data access for understanding. And do not be surprised when a model given 5,000 tokens of coordinates writes a beautiful story about the wrong bounce.

Patterns first. Reasoning second. Hype never.

Cognaptus: Automate the Present, Incubate the Future.

¹ Sean Memery and Kartic Subr, “Discovering High Level Patterns from Simulation Traces,” arXiv:2602.10009v2, 2026, https://arxiv.org/abs/2602.10009.
