Opening — Why this matters now

Large Language Models are increasingly invited into optimization workflows. They write solvers, generate heuristics, and occasionally bluff their way through mathematical reasoning. But a more uncomfortable question has remained largely unanswered: do LLMs actually understand optimization problems—or are they just eloquent impostors?

This paper tackles that question head‑on. Instead of judging LLMs by what they say, it examines what they encode. And the results are quietly provocative.

Background — From hand‑crafted features to black‑box intuition

Algorithm selection in combinatorial optimization has always been a feature‑engineering problem in disguise. We manually extract instance characteristics—graph density, constraint tightness, job durations—then hope a classifier learns which solver will behave best. Frameworks like Instance Space Analysis (ISA) formalized this process, but at the cost of deep domain expertise and bespoke tooling.

LLMs promise a different path: universal models that ingest raw instances and implicitly infer structure. Prior work focused on whether LLMs could solve or formulate optimization problems. Much less attention has been paid to whether they internally represent the problem structure in a way that is useful.

That gap is where this study operates.

Analysis — Three ways to interrogate an LLM

The authors design a clean, almost surgical methodology built around three questions:

  1. Can LLMs explicitly extract instance features when asked?
  2. Do LLM hidden layers implicitly encode those features—even when the model fails to verbalize them?
  3. Can those latent representations replace traditional features for per‑instance algorithm selection?

To answer them, the paper combines direct querying and probing.

Direct querying: asking nicely

LLMs are prompted to extract numeric features—node counts, graph density, average processing times—from four classic combinatorial problems:

  • Bin Packing (BPP)
  • Graph Coloring (GCP)
  • Job Shop Scheduling (JSP)
  • Knapsack (KP)

Instances are presented in three forms: standard input files, code‑like MiniZinc descriptions, and natural language.
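To make the setup concrete, here is a minimal sketch of what such a direct query might look like for a small Knapsack instance stated in natural language. The prompt wording, feature names, and the `query_llm` helper are illustrative stand-ins, not the paper's exact prompts or API.

```python
# Direct querying, sketched: ask the model to read off numeric features of a
# Knapsack instance given in natural language. Everything here is illustrative.

KP_INSTANCE = (
    "Knapsack instance: capacity 50. "
    "Items as (value, weight): (60, 10), (100, 20), (120, 30)."
)

PROMPT = f"""You are given a combinatorial optimization instance.

{KP_INSTANCE}

Report the following features as JSON:
- num_items: number of items
- capacity: knapsack capacity
- avg_weight: average item weight
- mean_value_weight_ratio: mean of the value/weight ratios
"""

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever chat-completion API is used."""
    raise NotImplementedError

# num_items and capacity are stated literally and tend to come back correct;
# avg_weight and the ratio mean require arithmetic, which is where accuracy drops.
answer = query_llm(PROMPT)
```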

The verdict is predictable but important: LLMs excel at features that are explicitly stated or trivially extractable, especially in natural language. The moment arithmetic or aggregation is required, performance collapses.

In short: LLMs can read, but they struggle to compute.

Probing: listening to what the model doesn’t say

The more interesting experiment freezes the LLM and ignores its output entirely. Instead, the authors extract the final hidden‑layer activations and train lightweight probes to predict instance features.
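A minimal sketch of that extraction step, using Hugging Face `transformers` with a frozen decoder-only model. The model name is a placeholder; the paper's choice of models and layers may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "meta-llama/Llama-3.1-8B"  # placeholder; any causal LM works the same way

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()  # the LLM stays frozen; its generated text is never used

@torch.no_grad()
def final_layer_activations(instance_text: str) -> torch.Tensor:
    """Token-level activations from the final hidden layer, shape (seq_len, d_model)."""
    inputs = tokenizer(instance_text, return_tensors="pt")
    outputs = model(**inputs, output_hidden_states=True)
    return outputs.hidden_states[-1].squeeze(0)  # last layer, batch dim dropped
```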

Three probe types are tested:

| Probe | Purpose |
|---|---|
| Linear regression | Tests linear accessibility |
| Shallow MLP | Captures mild non-linear structure |
| LightGBM | Captures rich non-linear dependencies |

Pooling strategies (mean, max, last‑token) convert token‑level activations into fixed‑length embeddings; a sketch of pooling and probe training follows.
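Putting the pieces together, a compact sketch of pooling and the three probe types might look like the following. The synthetic arrays stand in for real activations and feature targets, and the probe hyperparameters are illustrative rather than the paper's settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from lightgbm import LGBMRegressor

def pool(acts: np.ndarray, strategy: str = "max") -> np.ndarray:
    """Collapse token-level activations (seq_len, d_model) into one fixed vector."""
    if strategy == "mean":
        return acts.mean(axis=0)
    if strategy == "max":
        return acts.max(axis=0)
    if strategy == "last":
        return acts[-1]  # last-token activation
    raise ValueError(f"unknown pooling strategy: {strategy}")

# Stand-in data: in the real pipeline, `activations` would come from the frozen
# LLM (previous sketch) and `y` would be a feature value computed offline,
# e.g. graph density for a GCP instance.
rng = np.random.default_rng(0)
activations = [rng.normal(size=(rng.integers(50, 200), 512)) for _ in range(200)]
y = rng.normal(size=200)

X = np.stack([pool(a, strategy="max") for a in activations])  # (n_instances, d_model)

probes = {
    "linear": LinearRegression(),                                  # linear accessibility
    "mlp": MLPRegressor(hidden_layer_sizes=(128,), max_iter=500),  # mild non-linearity
    "lightgbm": LGBMRegressor(),                                   # rich non-linear structure
}
for name, probe in probes.items():
    probe.fit(X, y)  # lightweight probe; the LLM itself is never updated
```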

Here the story flips.

For complex features—those that LLMs failed to explicitly compute—probing significantly outperforms direct querying. The information is there, just not easily verbalized.

LLMs, it seems, know more than they can explain.

Findings — When embeddings rival human expertise

The most consequential experiment uses these latent embeddings for per‑instance algorithm selection.

Probes are trained to predict which solver performs best for each instance (see the sketch after the list), using only:

  • LLM hidden‑layer representations, or
  • Traditional ISA feature sets
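A sketch of that comparison, with placeholder data: both representations feed the same off‑the‑shelf classifier, and the only question is which input predicts the best solver more accurately. The classifier choice and array shapes are assumptions, not the paper's exact setup.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

# Stand-in inputs: X_llm holds pooled hidden-state embeddings, X_isa holds
# traditional ISA feature vectors, and y_best is the index of the best-performing
# solver per instance, determined by offline solver runs.
rng = np.random.default_rng(0)
n_instances = 500
X_llm = rng.normal(size=(n_instances, 512))
X_isa = rng.normal(size=(n_instances, 20))
y_best = rng.integers(0, 3, size=n_instances)  # e.g. three candidate solvers

for name, X in [("LLM embeddings", X_llm), ("ISA features", X_isa)]:
    scores = cross_val_score(LGBMClassifier(), X, y_best, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```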

The result is uncomfortable for feature engineers:

| Feature Source | Accuracy | Practical Difference |
|---|---|---|
| ISA features | Slightly higher | Statistically detectable, practically negligible |
| LLM embeddings | Nearly identical | Within ±0.01 accuracy |

Across problems, LLM‑derived embeddings match ISA‑based representations to within a few hundredths of accuracy. For some problems, they are effectively indistinguishable.

Max pooling offers a small but consistent edge. Representation format—standard, code‑like, or natural language—matters far less than expected.

The implication is subtle but profound: LLMs function as general‑purpose instance feature extractors, without ever being told what a “feature” is.

Implications — What this means for optimization practice

This does not mean handcrafted features are obsolete. They remain:

  • Cheaper to compute
  • More interpretable
  • Easier to audit

But the trade‑off is now explicit:

| Hand‑crafted features | LLM embeddings |
|---|---|
| Interpretable | Opaque |
| Domain‑specific | Domain‑agnostic |
| Cheap | Computationally expensive |
| Manual effort | Minimal design effort |

For exploratory analysis, rapid prototyping, or domains lacking mature feature sets, LLM embeddings offer a compelling alternative.

More provocatively, this work suggests a future where algorithm selection pipelines skip feature engineering entirely—feeding raw instances directly into frozen foundation models.

Conclusion — Less talk, more structure

LLMs are not reliable calculators. They are not consistent optimizers. And they are certainly not transparent reasoners.

But they do internalize meaningful structural information about combinatorial problems—enough to rival decades of handcrafted instance analysis when it comes to algorithm selection.

The real value of LLMs in optimization may lie not in their answers, but in their silence: the geometry of their hidden states.

Cognaptus: Automate the Present, Incubate the Future.