Opening — Why this matters now
Large Language Models are increasingly invited into optimization workflows. They write solvers, generate heuristics, and occasionally bluff their way through mathematical reasoning. But a more uncomfortable question has remained largely unanswered: do LLMs actually understand optimization problems—or are they just eloquent impostors?
This paper tackles that question head‑on. Instead of judging LLMs by what they say, it examines what they encode. And the results are quietly provocative.
Background — From hand‑crafted features to black‑box intuition
Algorithm selection in combinatorial optimization has always been a feature‑engineering problem in disguise. We manually extract instance characteristics—graph density, constraint tightness, job durations—then hope a classifier learns which solver will behave best. Frameworks like Instance Space Analysis (ISA) formalized this process, but at the cost of deep domain expertise and bespoke tooling.
LLMs promise a different path: universal models that ingest raw instances and implicitly infer structure. Prior work focused on whether LLMs could solve or formulate optimization problems. Much less attention has been paid to whether they internally represent the problem structure in a way that is useful.
That gap is where this study operates.
Analysis — Three ways to interrogate an LLM
The authors design a clean, almost surgical methodology built around three questions:
- Can LLMs explicitly extract instance features when asked?
- Do LLM hidden layers implicitly encode those features—even when the model fails to verbalize them?
- Can those latent representations replace traditional features for per‑instance algorithm selection?
To answer them, the paper combines direct querying and probing.
Direct querying: asking nicely
LLMs are prompted to extract numeric features—node counts, graph density, average processing times—from four classic combinatorial problems:
- Bin Packing (BPP)
- Graph Coloring (GCP)
- Job Shop Scheduling (JSP)
- Knapsack (KP)
Instances are presented in three forms: standard input files, code‑like MiniZinc descriptions, and natural language.
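As a rough illustration of what such a direct query might look like (the example instance, the prompt wording, and the `query_llm` helper below are hypothetical, not the paper's actual setup):

```python
# Hedged sketch of direct feature querying for a small Bin Packing instance.
# The instance, prompt wording, and query_llm() helper are illustrative only.

BPP_INSTANCE = """\
capacity: 100
item sizes: 12 37 45 8 60 22 91 15
"""

PROMPT = f"""You are given a Bin Packing instance.

{BPP_INSTANCE}
Report the following features as JSON:
- n_items: number of items
- capacity: bin capacity
- mean_item_size: average item size
"""

def query_llm(prompt: str) -> str:
    """Placeholder for whichever chat-completion API is in use."""
    raise NotImplementedError

# response = query_llm(PROMPT)
# Explicitly stated values (capacity) tend to come back correct;
# aggregates such as mean_item_size are where accuracy drops off.
```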
The verdict is predictable but important: LLMs excel at features that are explicitly stated or trivially extractable, especially when instances are presented in natural language. The moment arithmetic or aggregation is required, performance collapses.
In short: LLMs can read, but they struggle to compute.
Probing: listening to what the model doesn’t say
The more interesting experiment freezes the LLM and ignores its output entirely. Instead, the authors extract the final hidden‑layer activations and train lightweight probes to predict instance features.
Three probe types are tested:
| Probe | Purpose |
|---|---|
| Linear regression | Tests linear accessibility |
| Shallow MLP | Captures mild non‑linear structure |
| LightGBM | Captures rich non‑linear dependencies |
Pooling strategies (mean, max, last‑token) are applied to convert token‑level activations into fixed embeddings.
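The extraction step itself needs only a few lines against any open-weight model. The sketch below assumes a Hugging Face model loaded with `transformers`; the specific model name and the absence of prompt formatting are placeholders, not the paper's configuration:

```python
# Minimal sketch: pull frozen final-layer hidden states and pool them into a
# fixed-size instance embedding. The model choice is an assumption.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B"  # placeholder; swap in the model you use

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()  # the LLM stays frozen; only the probes are trained

@torch.no_grad()
def embed_instance(instance_text: str, pooling: str = "mean") -> torch.Tensor:
    """Return one fixed-length vector per instance from the last hidden layer."""
    inputs = tokenizer(instance_text, return_tensors="pt")
    hidden = model(**inputs).hidden_states[-1].squeeze(0)  # (n_tokens, dim)
    if pooling == "mean":
        return hidden.mean(dim=0)
    if pooling == "max":
        return hidden.max(dim=0).values
    if pooling == "last":
        return hidden[-1]
    raise ValueError(f"unknown pooling strategy: {pooling}")
```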
Here the story flips.
For complex features—those that LLMs failed to explicitly compute—probing significantly outperforms direct querying. The information is there, just not easily verbalized.
LLMs, it seems, know more than they can explain.
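The probes themselves are standard off-the-shelf regressors. A minimal sketch, assuming pooled embeddings and ground-truth feature values saved to disk (file names and hyperparameters are illustrative, not the paper's exact protocol):

```python
# Sketch: train the three probe types to predict one instance feature
# (e.g. graph density) from pooled embeddings. File names are assumptions.
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

X = np.load("pooled_embeddings.npy")   # (n_instances, hidden_dim)
y = np.load("graph_density.npy")       # ground-truth feature values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

probes = {
    "linear": LinearRegression(),
    "mlp": MLPRegressor(hidden_layer_sizes=(128,), max_iter=1000),
    "lightgbm": LGBMRegressor(n_estimators=300),
}
for name, probe in probes.items():
    probe.fit(X_tr, y_tr)
    print(f"{name}: R^2 = {r2_score(y_te, probe.predict(X_te)):.3f}")
```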
Findings — When embeddings rival human expertise
The most consequential experiment uses these latent embeddings for per‑instance algorithm selection.
Probes are trained to predict which solver performs best for each instance, using only one of two feature sources (a minimal sketch follows the list):
- LLM hidden‑layer representations, or
- Traditional ISA feature sets
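Framed as code, the comparison is a straightforward classification benchmark. The file names and the classifier below are assumptions for illustration, not the paper's exact evaluation protocol:

```python
# Sketch: per-instance algorithm selection as classification, comparing the two
# feature sources on the same labels. File names are assumptions.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

llm_X = np.load("llm_embeddings.npy")  # (n_instances, hidden_dim)
isa_X = np.load("isa_features.npy")    # (n_instances, n_hand_crafted_features)
y = np.load("best_solver.npy")         # index of the best-performing solver

for name, X in [("LLM embeddings", llm_X), ("ISA features", isa_X)]:
    acc = cross_val_score(LGBMClassifier(n_estimators=300), X, y,
                          cv=5, scoring="accuracy").mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```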
The result is uncomfortable for feature engineers:
| Feature Source | Accuracy | Practical Difference |
|---|---|---|
| ISA features | Slightly higher | Statistically detectable, practically negligible |
| LLM embeddings | Nearly identical | Within ±0.01 accuracy |
Across problems, LLM‑derived embeddings match ISA‑based representations to within a few hundredths of accuracy. For some problems, they are effectively indistinguishable.
Max pooling offers a small but consistent edge. Representation format—standard, code‑like, or natural language—matters far less than expected.
The implication is subtle but profound: LLMs function as general‑purpose instance feature extractors, without ever being told what a “feature” is.
Implications — What this means for optimization practice
This does not mean handcrafted features are obsolete. They remain:
- Cheaper to compute
- More interpretable
- Easier to audit
But the trade‑off is now explicit:
| Hand‑crafted features | LLM embeddings |
|---|---|
| Interpretable | Opaque |
| Domain‑specific | Domain‑agnostic |
| Cheap | Computationally expensive |
| Manual effort | Minimal design effort |
For exploratory analysis, rapid prototyping, or domains lacking mature feature sets, LLM embeddings offer a compelling alternative.
More provocatively, this work suggests a future where algorithm selection pipelines skip feature engineering entirely—feeding raw instances directly into frozen foundation models.
Conclusion — Less talk, more structure
LLMs are not reliable calculators. They are not consistent optimizers. And they are certainly not transparent reasoners.
But they do internalize meaningful structural information about combinatorial problems—enough to rival decades of handcrafted instance analysis when it comes to algorithm selection.
The real value of LLMs in optimization may lie not in their answers, but in their silence: the geometry of their hidden states.
Cognaptus: Automate the Present, Incubate the Future.