When LLMs Stop Talking and Start Choosing Algorithms

Warehouse.

That is a useful place to begin, because combinatorial optimization only sounds abstract until someone has to decide which trucks leave first, which jobs enter which machines, which items fit into which containers, or which solver should be trusted before the deadline starts laughing.

In those systems, the hardest question is often not “What is the answer?” It is “Which method should we use for this particular instance?” One algorithm works beautifully on one family of cases and then quietly embarrasses itself on another. This is not a personality flaw. It is the normal condition of optimization.

The paper “Behavior and Representation in Large Language Models for Combinatorial Optimization: From Feature Extraction to Algorithm Selection” asks a timely question: can large language models help with that selection problem, not by solving the optimization task directly, but by representing the instance well enough that another model can choose the right algorithm?¹

That distinction matters. Most business discussions about LLMs in optimization still orbit around the glamorous version of the story: ask the model to formulate a problem, generate a heuristic, write solver code, or reason its way to a better solution. Charming. Occasionally useful. Also a rich habitat for confident nonsense.

This paper is more interesting because it moves the LLM away from the microphone. It studies what happens when the model is not mainly judged by its generated answer, but by the information contained in its hidden representations. The result is not “LLMs are now optimization experts.” The result is subtler and more useful: LLMs may be weak calculators but surprisingly serviceable instance feature extractors.

That is where the business relevance begins.

The useful comparison is not “LLM versus optimizer”

The obvious summary would say the paper tests LLMs on combinatorial optimization and finds that hidden representations can support algorithm selection. True, but too flat.

The stronger reading is comparative. The paper is built around several contrasts, and each contrast answers a different operational question.

Comparison	What the paper tests	Business question underneath
Direct querying vs. probing	Whether the LLM can say the right feature value, versus whether its hidden states encode enough information for a probe to recover it	Should we ask the model for facts, or use it as a representation engine?
Natural language vs. code-like vs. standard formats	Whether input representation affects feature extraction and algorithm-selection performance	Do raw solver formats need translation before LLM processing?
Handcrafted features vs. ISA features vs. LLM embeddings	Whether automated latent features can rival human-designed optimization descriptors	Can feature engineering effort be reduced without losing selection accuracy?
Interpretability vs. computational cost	Whether opaque LLM embeddings are worth using when manual features are cheaper and auditable	Where does the ROI actually sit?

The paper’s answer is not uniform across these comparisons. That is the point. If an executive takeaway fits into a single sentence, it will probably be wrong in at least two departments.

Algorithm selection is feature engineering wearing a suit

Combinatorial optimization problems are not just “hard.” They are unevenly hard. A solver that performs well on one graph-coloring instance may not dominate another. A heuristic that packs one bin-packing distribution efficiently may be unimpressive elsewhere. Job-shop scheduling and knapsack problems have their own little ecosystems of difficulty.

Traditional per-instance algorithm selection handles this by extracting features from each instance: size, density, ratios, extrema, averages, structural descriptors, and other handcrafted signals. Instance Space Analysis, or ISA, formalizes this idea by mapping instances into a feature space where solver performance can be studied.

This is sensible. It is also labor-intensive.

For each problem domain, someone has to decide which descriptors matter, write the extraction logic, test the descriptors, and maintain the pipeline. In a stable domain, that effort can be justified. In a messy business environment where the problem formulation keeps changing, the handcrafted-feature pipeline becomes another system that needs a babysitter. Naturally, it calls itself “infrastructure.”

The paper asks whether an LLM can reduce that manual burden by acting as a general-purpose extractor of instance structure. Not by magic. Not by “understanding” in a philosophical sense. By turning raw instance representations into embeddings that can be used by downstream classifiers.

The experimental design separates talking from representing

The authors study four benchmark combinatorial optimization problems:

Problem	Instances	Portfolio size	Structure type
Bin Packing Problem (BPP)	8,815	2 algorithms	packing / assignment
Graph Coloring Problem (GCP)	8,278	2 algorithms	graph
Job-Shop Scheduling Problem (JSP)	4,467	12 algorithms	scheduling / permutation
Knapsack Problem (KP)	5,300	3 algorithms	selection / capacity

Each instance is represented in three formats:

standard, meaning a conventional problem-file representation;
code-like, based on structured modeling-style descriptions;
natural language, where the instance is described textually.
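
To make the three formats concrete, here is a minimal sketch that renders one small knapsack instance three ways. The class, the function names, and the exact layouts are assumptions made for illustration; the paper's actual templates may look quite different.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class KnapsackInstance:
    """Hypothetical container for a knapsack instance (not from the paper)."""
    capacity: int
    weights: Tuple[int, ...]
    profits: Tuple[int, ...]


def to_standard(inst: KnapsackInstance) -> str:
    # Assumed solver-style layout: "n capacity" header, then "profit weight" rows.
    header = f"{len(inst.weights)} {inst.capacity}"
    rows = [f"{p} {w}" for p, w in zip(inst.profits, inst.weights)]
    return "\n".join([header, *rows])


def to_code_like(inst: KnapsackInstance) -> str:
    # Assumed modeling-style layout with named assignments.
    return "\n".join([
        f"n_items = {len(inst.weights)}",
        f"capacity = {inst.capacity}",
        f"weights = {list(inst.weights)}",
        f"profits = {list(inst.profits)}",
    ])


def to_natural_language(inst: KnapsackInstance) -> str:
    # Assumed textual description; every value is stated in words.
    items = "; ".join(
        f"item {i} weighs {w} and is worth {p}"
        for i, (w, p) in enumerate(zip(inst.weights, inst.profits), start=1)
    )
    return (
        f"A knapsack of capacity {inst.capacity} must be filled "
        f"from {len(inst.weights)} items: {items}."
    )


example = KnapsackInstance(capacity=10, weights=(4, 5, 3), profits=(7, 8, 4))
for render in (to_standard, to_code_like, to_natural_language):
    print(render(example), end="\n\n")
```

The information content is identical in all three outputs. Only the surface form changes, which is exactly the variable the representation comparison isolates.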

The model used is Llama-3.2-3B-Instruct. The authors choose it partly because its long context window allows full instances to be processed without truncation. That detail matters: if the model had lost part of the instance, the experiment would become a study of context failure rather than representation.

The paper then runs three main experimental components.

First, direct querying: the LLM is prompted to extract numeric features from instances, returning a constrained JSON object. This tests observable behavior.

Second, feature probing: the model is frozen, final-layer hidden activations are extracted, and regressors are trained to predict instance features from those activations. This tests whether features are implicitly encoded.

Third, algorithm-selection probing: classifiers are trained on LLM-derived embeddings to predict the best-performing algorithm for each instance. This tests whether those representations are practically useful for solver selection.

The scale is not toy-sized. The full experiment covers 26,860 instances, each under three representations. The authors report 691,617 GPU-intensive LLM executions, plus more than 1.8 million regression experiments and over 322,000 classification experiments. The direct-querying phase alone took about two weeks on a DGX Station with four 80GB A100 GPUs; probing took about one week.

This is already one business lesson: “feature extraction without feature engineering” is not automatically “cheap.” The bill just moves from expert labor to compute, opacity, and pipeline complexity. Wonderful. A different invoice.

Direct querying works until arithmetic enters the room

The direct-querying test is the most intuitive part of the paper. Give the model an instance. Ask for a feature. Check against ground truth.
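
The check-against-ground-truth step can be sketched in a few lines. This is an assumption-laden illustration: the response strings are made up, the feature names are hypothetical, and the paper's exact-match definition (rounding, tolerance) may differ from the strict comparison used here.

```python
import json
from typing import Dict, Sequence


def exact_match_rate(
    responses: Sequence[str],
    truths: Sequence[Dict[str, float]],
    tol: float = 0.0,
) -> float:
    """Share of (instance, feature) pairs where the parsed value matches the truth."""
    hits, total = 0, 0
    for raw, truth in zip(responses, truths):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            parsed = {}
        if not isinstance(parsed, dict):
            parsed = {}
        for name, expected in truth.items():
            total += 1
            value = parsed.get(name)
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if is_number and abs(value - expected) <= tol:
                hits += 1
    return hits / total if total else 0.0


# Hypothetical model outputs: valid JSON with a wrong number, a fully
# correct answer, and an unparseable reply.
responses = [
    '{"n_items": 3, "mean_weight": 4.0}',
    '{"n_items": 5, "mean_weight": 2.5}',
    "not json",
]
truths = [
    {"n_items": 3, "mean_weight": 4.5},
    {"n_items": 5, "mean_weight": 2.5},
    {"n_items": 2, "mean_weight": 1.0},
]
rate = exact_match_rate(responses, truths)
print(rate)
```

Note that the first response is perfectly valid JSON and still loses a point on `mean_weight`. Well-formed output and correct output are scored separately for a reason.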

For directly extractable features, the model often performs well, especially when the instance is expressed in natural language or code-like form. For example, in the paper’s Table 4, exact-match rates for directly extractable features are near-perfect in several settings: BPP reaches 1.00 in natural language, GCP reaches 1.00 in both code-like and natural language formats, JSP reaches 1.00 in code-like format, and KP reaches 0.99 in natural language.

Then the difficulty rises.

Features are grouped by extraction complexity:

directly extractable features, such as explicitly stated counts;
low-effort features, such as simple extrema or counts;
high-effort features, such as averages, ratios, density, or aggregation-heavy values.
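
A small sketch shows why the tiers differ in difficulty. It computes a handful of graph features deterministically and tags each with a tier. The tier assignment and the feature names are illustrative assumptions based on the descriptions above, not the paper's exact feature list.

```python
from typing import Dict, List, Tuple


def graph_features(n_nodes: int, edges: List[Tuple[int, int]]) -> Dict[str, float]:
    """Deterministic features for an undirected simple graph given as an edge list."""
    if n_nodes < 2:
        raise ValueError("need at least two nodes to define density")
    degree = [0] * n_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n_edges = len(edges)
    return {
        # Directly extractable: usually stated in the instance header.
        "n_nodes": n_nodes,
        "n_edges": n_edges,
        # Low-effort: one scan for an extremum.
        "max_degree": max(degree),
        "min_degree": min(degree),
        # High-effort: aggregation plus arithmetic.
        "avg_degree": 2 * n_edges / n_nodes,
        "density": 2 * n_edges / (n_nodes * (n_nodes - 1)),
    }


# A 4-cycle with one chord.
features = graph_features(4, [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)])
print(features)
```

For a deterministic extractor, all three tiers cost a few lines. For a model reading tokens, the last tier means aggregating over the whole edge list and then dividing, and the reported results show where that goes wrong.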

The model’s performance deteriorates sharply when it must compute or aggregate. For high-effort features, exact-match rates are often near zero. BPP high-effort features show 0.00 exact match across all three representations. KP high-effort features also show 0.00 exact match across all three representations. JSP does somewhat better in places, but still remains weak: high-effort exact-match rates range from 0.01 under standard representation to 0.14 under natural language.

The representation format also matters. Natural language and code-like inputs generally help direct querying, while standard formats perform markedly worse. KP is the most dramatic example: for directly extractable KP features, natural language reaches 0.99 exact match and code-like reaches 0.97, while standard representation reaches only 0.06.

The interpretation is not mysterious. LLMs are trained to process textual and semi-structured patterns. They can retrieve visible facts from friendly representations. They are less reliable when the task becomes procedural arithmetic over long structured inputs.

So the first correction is simple:

Reader belief	Correction	Why it matters
If the LLM understands the instance, it should output the feature correctly	Output accuracy depends heavily on whether the feature is explicitly visible or requires computation	Direct prompting is not a dependable substitute for feature extractors
Standard solver formats are “structured,” so they should be easier	For this model, standard formats are often harder than natural language or code-like text	Preprocessing and representation design still matter
JSON constraints solve reliability	Structured decoding helps output validity, not numerical reasoning	Format control is not reasoning control

This is where many LLM-for-optimization demos become misleading. A clean JSON answer can still contain a bad number. A schema guarantees shape, not truth. The model is behaving politely while being wrong. Very enterprise.

Probing asks a better question: what is encoded, not what is said

The more interesting part begins when the authors stop asking the LLM to speak.

In feature probing, the model processes the instance, but no generated answer is used. Instead, the final-layer hidden activations are extracted. Because each instance produces token-level activations, the authors need a fixed-length representation. They test three pooling strategies:

mean pooling, which averages token activations;
max pooling, which takes maximum activation values across tokens;
last-token pooling, which uses the final token representation.
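
The three pooling strategies can be sketched in plain Python on a toy activation matrix. The values are invented, and real activations would have thousands of dimensions and typically live in tensors rather than lists; this only illustrates the arithmetic.

```python
from typing import List

Activations = List[List[float]]  # one row per token, one column per hidden dimension


def mean_pool(acts: Activations) -> List[float]:
    # Average each hidden dimension over all tokens.
    n_tokens = len(acts)
    return [sum(column) / n_tokens for column in zip(*acts)]


def max_pool(acts: Activations) -> List[float]:
    # Keep the largest value each hidden dimension takes across tokens.
    return [max(column) for column in zip(*acts)]


def last_token_pool(acts: Activations) -> List[float]:
    # Use the final token's activation vector as the instance embedding.
    return list(acts[-1])


# Toy example: 3 tokens, 2 hidden dimensions.
acts = [[1.0, -2.0], [3.0, 0.0], [2.0, 4.0]]
print(mean_pool(acts), max_pool(acts), last_token_pool(acts))
```

Whatever the instance length, each function returns a vector with one entry per hidden dimension, which is what lets a downstream model treat instances of different sizes uniformly.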

Then regressors try to recover the target feature values from those embeddings.

This experiment has a different purpose from direct querying. It is not asking whether the LLM can produce the right answer. It is asking whether the information needed for the answer leaves a usable trace in the model’s internal representation.

The answer is partially yes.

The authors report that direct querying remains stronger for simple extraction features, especially in natural language. But for high-computation features, probing often has an advantage: the hidden representations contain enough structural information that a supervised probe can recover complex properties more effectively than the model can explicitly generate them.

This should not be over-romanticized. It does not mean the LLM “knows” the average degree of a graph in the human sense and is merely too shy to say it. It means the activation space preserves information correlated with that feature, and a trained decoder can exploit it.
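
The idea of a trained decoder can be shown with a deliberately tiny probe. The sketch below fits a linear regressor by gradient descent on synthetic two-dimensional "embeddings" whose target is an exact linear function of them, then reports mean absolute error on held-out points. Everything here is an assumption for illustration: the data is invented, the embeddings are not real activations, and the paper's regressors are not necessarily linear.

```python
from typing import List, Sequence, Tuple


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def fit_linear_probe(
    X: List[List[float]], y: List[float], lr: float = 0.1, epochs: int = 2000
) -> Tuple[List[float], float]:
    """Full-batch gradient descent on mean squared error."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def mean_absolute_error(w, b, X, y) -> float:
    return sum(abs(predict(w, b, xi) - yi) for xi, yi in zip(X, y)) / len(X)


# Synthetic stand-ins for pooled embeddings and a feature that is linearly
# readable from them (toy data, not from the paper).
train_X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
train_y = [2.0 * a - c + 0.5 for a, c in train_X]
test_X = [[2.0, 2.0], [3.0, 1.0]]
test_y = [2.0 * a - c + 0.5 for a, c in test_X]

w, b = fit_linear_probe(train_X, train_y)
probe_mae = mean_absolute_error(w, b, test_X, test_y)
print(w, b, probe_mae)
```

The frozen model never changes in this setup. Only the small probe is trained, so a low held-out error says something about what the embedding carries, not about what the model can articulate.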

That distinction is important for business use. If a company wants a trustworthy numeric feature, it should not replace deterministic extraction with an LLM answer. But if it wants a broad latent descriptor for downstream prediction, the hidden state may be useful even when the generated response is not.

The operational shift looks like this:

Bad use:
Raw instance -> LLM prompt -> "Please compute the feature" -> trust answer

Better candidate use:
Raw instance -> frozen LLM -> hidden activations -> pooled embedding -> trained predictor / selector

The first design treats the LLM as an analyst. The second treats it as a sensor. Sensors can be noisy, but at least we stop asking them to write executive summaries.

Algorithm selection is where the paper becomes commercially interesting

Feature probing is intellectually useful. Algorithm selection is operationally useful.

The authors train classifiers to predict which algorithm performs best for each instance. The labels are set-aware: if multiple algorithms tie as winners, predicting any winning algorithm counts as correct. That is a reasonable choice, because in real solver portfolios the goal is not to satisfy a metaphysical preference for one algorithm; the goal is to avoid choosing a loser.

The baselines are also well chosen:

most frequent winner, a naive classifier that always predicts the most common best algorithm;
handcrafted feature classifiers, using the features from the earlier experiments;
ISA-based classifiers, using richer features from prior Instance Space Analysis studies;
LLM-derived representation classifiers, using pooled hidden activations.
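
Set-aware scoring and the most-frequent-winner baseline are simple enough to sketch directly. The algorithm names and winner sets below are invented, and how the paper counts tied winners when building its baseline is an assumption here (each tied algorithm gets one count).

```python
from collections import Counter
from typing import List, Sequence, Set


def set_aware_accuracy(predictions: Sequence[str], winner_sets: Sequence[Set[str]]) -> float:
    """A prediction is correct if it names any algorithm in the instance's winner set."""
    hits = sum(pred in winners for pred, winners in zip(predictions, winner_sets))
    return hits / len(predictions)


def most_frequent_winner(train_winner_sets: Sequence[Set[str]]) -> str:
    """Algorithm that appears in the most winner sets (assumed tie handling)."""
    counts = Counter(algo for winners in train_winner_sets for algo in winners)
    return counts.most_common(1)[0][0]


# Hypothetical winner sets for four instances; the second one is a tie.
winner_sets: List[Set[str]] = [{"A"}, {"A", "B"}, {"B"}, {"A"}]

baseline_choice = most_frequent_winner(winner_sets)
baseline_accuracy = set_aware_accuracy([baseline_choice] * len(winner_sets), winner_sets)
print(baseline_choice, baseline_accuracy)
```

Any learned selector has to beat `baseline_accuracy` under the same scoring rule before its features deserve credit, which is why the naive baseline sits first in the list.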

The classifier comparison matters before the feature comparison. LightGBM and Logistic Regression clearly outperform the most-frequent-winner baseline. In the mixed model, Logistic Regression improves accuracy by 0.149 over the baseline, while LightGBM improves it by 0.192. The MLP is not competitive enough for the later analyses, so the authors exclude it from the main comparison.

That exclusion is not a side note. It tells us that the value is not “use neural networks everywhere because neural networks are fashionable.” The best-performing pipeline here is more mundane and more plausible: frozen LLM embeddings plus a strong tabular classifier.

For businesses, that is good news. A LightGBM-based selector is easier to test, monitor, retrain, and explain at the performance level than an end-to-end black-box “agent” that insists it has a plan.

LLM embeddings come surprisingly close to ISA features

The central result is the comparison between ISA-based features and LLM-derived embeddings for algorithm selection.

The authors first establish that ISA features have a modest advantage over the simpler handcrafted feature set. In the mixed model, ISA representation adds 0.031 accuracy over the handcrafted baseline, while LightGBM adds 0.091 over Logistic Regression in that comparison. The tests are nuanced: one pairwise comparison does not detect a statistically significant difference, but the equivalence test does not support treating ISA and handcrafted features as interchangeable. The practical message is that richer ISA features remain a meaningful baseline.

Then comes the sharper test: ISA features versus LLM-derived embeddings, using LightGBM.

The LLM setup uses the standard representation with max pooling, chosen after examining representation and pooling effects. The mixed model reports an LLM intercept accuracy of 0.891. ISA adds 0.008. That effect is statistically detectable, but practically tiny. The equivalence test supports the conclusion that ISA and LLM-derived representations are equivalent for this algorithm-selection setting.

That is the paper’s most important result.

Not because LLM embeddings beat ISA. They do not, at least not in the clean “victory slide” sense. ISA remains slightly higher. But the difference is 0.008, less than one hundredth of accuracy, and the authors conclude that the two are practically indistinguishable under the tested conditions.

A sober interpretation:

Result	What it supports	What it does not prove
LLM embeddings achieve accuracy close to ISA features	Frozen LLM representations can act as useful instance descriptors	LLMs can replace all domain-specific optimization knowledge
ISA adds only 0.008 accuracy over the LLM setup in the final LightGBM comparison	The practical gap is small in these benchmarks	The gap will remain small in real industrial instances
Max pooling gives a statistically significant but modest advantage over mean pooling	Pooling choice matters, but not dramatically	There is a universal best pooling method for all optimization tasks
KP appears more challenging for the LLM representation	Problem structure affects reliability	The method fails on knapsack-like problems in general

The last row matters. The paper does not give permission to deploy this blindly across every routing, scheduling, packing, and resource-allocation system. It shows that the approach is plausible enough to deserve pilot-stage testing.

That is a useful result. Less glamorous than “LLM solves optimization.” Much less embarrassing, too.

Representation format matters less for selection than for direct extraction

One of the more interesting contrasts is that representation format matters strongly for direct querying but less consistently for algorithm selection.

When the model is asked to output explicit feature values, natural language and code-like representations generally help. Standard formats often perform poorly. That makes intuitive sense: a standard instance file may be easy for a solver and unpleasant for a language model.

But in the algorithm-selection probing experiment, the effects are weaker. In the full factorial mixed model, code-like representation has a small negative coefficient relative to standard representation, and natural language has a near-zero positive coefficient. Neither effect is significant in the reported coefficient table. The interaction between representation and pooling is also not significant.

Pooling has a clearer, though still modest, effect. Max pooling improves accuracy by 0.011 over mean pooling, while the effect of last-token pooling is not significant.

This is operationally useful. It suggests that for algorithm selection, the LLM embedding may capture enough signal even from standard representations, provided the downstream classifier is appropriate. That does not mean input formatting is irrelevant. It means formatting may be less decisive for latent selection performance than for explicit feature extraction.

So there are two different engineering lessons:

If the goal is asking the model to return features, representation design is critical.
If the goal is using hidden activations for algorithm selection, representation design still matters, but classifier choice and pooling may dominate.

Conflating those two tasks is how teams build fragile demos and then act surprised when the production system develops a sense of humor.

The appendix is not decoration; it tells us what kind of evidence this is

The appendix tables are useful because they clarify which results are main evidence and which are robustness or configuration checks.

Test or table	Likely purpose	What it supports	What it does not support
Direct-querying metrics by problem and representation	Main evidence for explicit feature extraction	LLMs retrieve visible/simple features better than computed ones	LLMs are reliable numerical reasoners
Feature-probing MAE tables	Main evidence for latent feature encoding	Hidden states partially encode structural and numerical information	Probes recover all features accurately
Classifier-effect mixed model	Implementation and model-selection evidence	LightGBM and Logistic Regression are stronger selector models than naive baselines	The LLM representation alone causes all gains
ISA vs. handcrafted comparison	Baseline calibration	ISA features are a stronger reference than simple handcrafted features	Handcrafted features are useless
Representation and pooling factorial test	Sensitivity / configuration analysis	Max pooling has a small advantage; representation effects are inconsistent	One representation-pooling setup is universally optimal
ISA vs. LLM LightGBM comparison	Main downstream evidence	LLM embeddings are practically comparable to ISA features in tested settings	LLM embeddings are cheaper, interpretable, or universally general

This matters because the paper’s business implication depends on the downstream comparison, not merely on feature probing. It is one thing to say hidden states contain some recoverable information. It is another to say that this information is useful enough to select solvers nearly as well as ISA descriptors.

The paper supports the second claim under specific benchmark conditions. That is stronger than a curiosity, weaker than a deployment guarantee, and exactly the kind of result worth paying attention to.

The business value is solver routing, not solver replacement

The practical pathway is not to let an LLM “solve” your scheduling problem by chatting at it.

A more credible architecture would look like this:

Instance arrives
   ↓
Represent instance in a consistent format
   ↓
Frozen LLM processes the instance
   ↓
Hidden activations are pooled into an embedding
   ↓
A trained classifier predicts the best solver / heuristic / configuration
   ↓
The selected algorithm solves the instance
   ↓
Outcome is logged for retraining and monitoring
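
The flow above can be sketched as a single routing function. All names are hypothetical, and the embedding function, selector, and solvers are toy stand-ins: a real system would plug in a frozen LLM with pooling, a trained classifier, and actual solver calls.

```python
from typing import Callable, Dict, List

Embedding = List[float]


def route_instance(
    instance_text: str,
    embed: Callable[[str], Embedding],
    select: Callable[[Embedding], str],
    solvers: Dict[str, Callable[[str], str]],
    log: List[dict],
    default_solver: str,
) -> str:
    """Embed the instance, pick a solver, run it, and log the outcome."""
    embedding = embed(instance_text)
    choice = select(embedding)
    if choice not in solvers:
        # Fall back rather than fail if the selector names an unknown solver.
        choice = default_solver
    result = solvers[choice](instance_text)
    log.append({"solver": choice, "embedding": embedding, "result": result})
    return result


# Toy stand-ins (assumptions, not real components).
def toy_embed(text: str) -> Embedding:
    return [float(len(text.split()))]  # placeholder for frozen LLM + pooling


def toy_select(embedding: Embedding) -> str:
    return "greedy" if embedding[0] < 5 else "local_search"  # placeholder classifier


solvers = {
    "greedy": lambda text: "greedy solution",
    "local_search": lambda text: "local-search solution",
}

log: List[dict] = []
result = route_instance("3 10\n4 5", toy_embed, toy_select, solvers, log, "greedy")
print(result, log[0]["solver"])
```

Keeping the embedder and selector as injected callables means either can be swapped or retrained without touching the solvers, and the log gives monitoring something to watch.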

This turns the LLM into a feature layer inside an algorithm portfolio system. It does not replace the solver. It helps decide which solver deserves the first shot.

For business applications, that maps naturally to several domains:

Domain	Possible use	Why the paper is relevant
Logistics and routing-adjacent planning	Select among heuristics for different daily instance profiles	Instance structure changes across days, regions, and demand patterns
Job scheduling and manufacturing	Route scheduling cases to different dispatching or metaheuristic methods	Solver performance depends heavily on job-machine structure
Packing and allocation	Select packing heuristics based on capacity, item distribution, or constraint shape	Manual feature design can become domain-specific quickly
Internal optimization platforms	Build a portfolio selector that sits above existing solvers	LLM embeddings may reduce feature-engineering workload for new problem families

The inference is not that every company should immediately add an LLM activation extractor to its optimization stack. The inference is that LLM embeddings are now credible enough to test as automated descriptors when handcrafted feature engineering is expensive, incomplete, or slow to adapt.

That is a narrower claim. It is also a better one.

The ROI trade-off is human feature design versus compute and opacity

The paper’s conclusion is refreshingly balanced: LLM embeddings reduce manual feature design, but they are computationally expensive and less interpretable.

That trade-off should not be hidden under a layer of AI enthusiasm. The authors’ own experimental setup makes the cost visible. Direct querying took about two weeks on four A100 GPUs; probing took about one week. Production systems may not repeat the entire research pipeline, but extraction still requires running large models over potentially long instances. That is not free, unless the accounting department has also been replaced by a hallucinating agent.

The comparison looks like this:

Approach	Strength	Weakness	Best-fit business use
Handcrafted features	Cheap, interpretable, auditable	Requires domain expertise and manual maintenance	Stable, high-volume production systems with mature domain knowledge
ISA features	Stronger structured characterization	Still domain-specific and feature-engineered	Research-grade or high-value optimization portfolios
LLM embeddings	Low manual feature design, broadly applicable representation	Opaque, compute-heavy, less directly auditable	Pilot projects, new domains, exploratory portfolio selection
Direct LLM feature answers	Easy interface, natural workflow	Weak for computation and aggregation	Low-risk assistance, not trusted numeric extraction

The decision is not “LLM or no LLM.” The decision is where the LLM belongs.

If interpretability and deterministic auditing dominate, handcrafted features remain attractive. If a team is exploring new optimization families and lacks mature feature sets, LLM embeddings become more interesting. If the system must run in real time at large scale, compute cost may dominate. If the system runs expensive planning batches where solver choice strongly affects quality or runtime, a heavier selector may be justified.

This is the kind of boring trade-off that makes technology real.

What the paper directly shows, and what Cognaptus infers

It is worth separating the evidence from the business interpretation.

Layer	Statement
Directly shown by the paper	Llama-3.2-3B-Instruct can directly extract simple features from natural-language and code-like COP instances, but struggles with features requiring computation or aggregation.
Directly shown by the paper	Hidden-layer representations partially encode structural and numerical instance information recoverable through supervised probes.
Directly shown by the paper	LLM-derived embeddings, paired with LightGBM, achieve algorithm-selection accuracy practically comparable to ISA-based features across the tested benchmark settings.
Cognaptus inference	LLM embeddings may be useful as automated instance descriptors in algorithm-portfolio systems, especially where feature engineering is expensive or immature.
Still uncertain	Whether the method generalizes to larger real-world instances, other LLM families, richer solver portfolios, changing business constraints, and production latency requirements.

That middle layer, automated instance descriptors, is the business wedge. Not “AI replaces operations researchers.” Not “the model understands optimization.” Not “just prompt harder.” The wedge is narrower: use a frozen foundation model to convert messy structured instances into a representation that helps a downstream selector choose among proven algorithms.

That is less cinematic than an autonomous optimization agent. It is also less likely to set the warehouse on fire.

Boundaries: useful result, controlled setting

The paper’s limitations are not incidental. They define where the result can and cannot travel.

First, the study uses a single LLM: Llama-3.2-3B-Instruct. Other models may encode different structural information, handle standard formats differently, or impose different compute costs.

Second, the benchmark set covers four classic combinatorial optimization problems. That is broad enough to be meaningful, but not broad enough to represent the full mess of industrial optimization: hybrid constraints, noisy data, soft business rules, changing objectives, incomplete inputs, and stakeholders who “just need one small exception” every morning.

Third, the algorithm-selection task is offline and benchmark-based. The paper does not test an end-to-end production optimizer where instance distributions drift, solver runtimes matter alongside objective values, and monitoring has to detect when the selector starts making expensive mistakes.

Fourth, interpretability remains unresolved. ISA and handcrafted features are understandable. LLM embeddings are not. A classifier may perform well while offering little explanation of why one solver is preferred. In regulated or high-stakes planning contexts, that can matter as much as accuracy.

Fifth, compute cost is real. The authors deliberately avoided commercial LLMs for reproducibility and cost control, but the reported hardware and runtime still show that hidden-state extraction is not a casual spreadsheet operation.

These boundaries do not weaken the paper. They prevent misuse. A result that says “pilot this carefully” is more valuable than a result that pretends deployment is just a slide transition away.

The quiet future of LLMs in optimization may be less verbal

The most useful lesson from this paper is that LLM value in optimization may not come from letting the model talk more.

Direct answers are brittle when arithmetic enters the room. Prompting can extract visible information, especially from friendly representations, but it does not turn the model into a dependable feature calculator. The more promising path is quieter: process the instance, extract the representation, pool the activations, train a selector, and let proven algorithms do the solving.

That reframes the role of the LLM. It becomes part of the diagnostic layer, not the decision sovereign. It helps characterize the problem instance; it does not replace optimization expertise. It may reduce feature-engineering effort; it does not eliminate the need for validation, monitoring, and cost discipline.

For businesses dealing with logistics, scheduling, packing, allocation, and other optimization-heavy workflows, that is the useful takeaway. Do not ask whether an LLM can “solve optimization.” Ask whether its hidden representation can help your system choose the right tool for the instance in front of it.

Less talking. More routing.

A surprisingly mature idea, despite the industry’s best efforts to make everything theatrical.

Cognaptus: Automate the Present, Incubate the Future.

¹ Francesca Da Ros, Luca Di Gaspero, and Kevin Roitero, “Behavior and Representation in Large Language Models for Combinatorial Optimization: From Feature Extraction to Algorithm Selection,” arXiv:2512.13374, 2025. https://arxiv.org/abs/2512.13374
