Ports, But Make Them Agentic: When LLMs Start Running the Yard

Ports are already full of automation. Cranes move containers, AGVs follow routes, software coordinates flows, dashboards blink reassuringly at managers who are paid to pretend that blinking equals control.

Then one terminal changes its layout, closes a road, adds a vehicle restriction, or introduces a new safety corridor. Suddenly the “automated” dispatching system needs engineers, operations researchers, domain experts, test scripts, model reformulation, solver debugging, and several meetings where everyone discovers that “just adjust the rule” was not, in fact, just.

That is the real problem behind PortAgent, an LLM-driven vehicle dispatching agent for automated container terminals.¹ The paper is not mainly about whether an LLM can write a bit of Python. We have passed that novelty stage, mercifully. The sharper contribution is architectural: it treats the transfer of a Vehicle Dispatching System, or VDS, as a controlled agentic workflow.

The distinction matters. A chatbot answers. An agentic workflow retrieves knowledge, builds a model, writes code, executes it, reads failure signals, and tries again. In industrial settings, that loop is where the money hides.

The expensive part is not dispatching once; it is transferring dispatching

Vehicle Dispatching Systems coordinate AGV fleets inside automated container terminals. Their job is simple to state and painful to implement: get vehicles to the right places at the right time without creating queues, crane idling, road conflicts, or safety violations.

A VDS that works in one terminal does not automatically work in another. Terminal layouts differ. Road networks differ. Vehicle fleets differ. Operational requirements differ. Even a small local rule — “this vehicle cannot use that segment” or “dangerous goods must pass through this corridor” — can force changes in the underlying optimization model.

The paper separates the transfer bottleneck into three familiar categories:

Bottleneck	What it means in practice	Why ordinary automation struggles
Specialist dependency	Engineers and OR scientists must reinterpret terminal-specific constraints	The dispatching logic is coupled to local operational knowledge
Data requirement	Data-driven systems may need terminal-specific training or retraining data	New terminals may not have enough clean, labeled operational data
Deployment friction	Model reformulation, coding, execution, debugging, and validation repeat manually	The handoffs are slow, and misunderstandings accumulate

This is why “we already automated the port” can be a slightly comic sentence. The runtime system may be automated, but the deployment and adaptation process often remains artisanal. Very expensive artisanal software, naturally.

PortAgent attacks that adaptation layer.

PortAgent works because it is a workflow, not a heroic prompt

The paper’s main design choice is a Virtual Expert Team, or VET. The name sounds more theatrical than necessary, but the mechanism is sensible. Instead of asking a single LLM prompt to understand the port, formulate the optimization model, write solver code, execute it, and debug everything in one long reasoning chain, PortAgent decomposes the task into four role-prompted experts that share a single foundation LLM:

Virtual expert	Operational role	Business translation
Knowledge Retriever	Retrieves modeling primitives and code examples from a curated knowledge base	Keeps the agent grounded in domain patterns rather than free-associating
Modeler	Converts structured terminal inputs and natural-language requirements into a modeling scheme	Turns operational language into optimization logic
Coder	Translates the model into executable Python code using Pyomo and Gurobi	Produces deployable artifacts, not just explanations
Debugger	Performs static analysis, sandbox execution, error diagnosis, and correction feedback	Replaces part of the manual run-fail-fix loop

The important point is that these are not four magical personalities arguing in a Slack channel. They are task boundaries. Each boundary shortens the reasoning chain and reduces the chance that the LLM will smear together operational interpretation, mathematical formulation, and implementation details.

That is a practical lesson for industrial AI systems: the unit of design is not the prompt. The unit of design is the workflow.

PortAgent also structures the input environment into three JSON components: network topology, resource/task configuration, and operational requirements. This matters more than it may first appear. Natural language alone is too elastic for optimization. JSON alone is too brittle for messy operational rules. PortAgent uses both: structured data for the stable environment, language for local constraints.

That hybrid design is quietly important. Industrial agents rarely fail because nobody can describe the ideal system. They fail because half the information is structured, half is buried in operational phrasing, and the two halves do not meet politely.
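To make the hybrid input concrete, here is a minimal Python sketch of what the three components might look like. The paper only states that the input is split into network topology, resource/task configuration, and operational requirements; the field names (`nodes`, `arcs`, `travel_time`, `tasks`, `requirements`) and the `build_agent_input` helper are illustrative assumptions, not the paper's schema.

```python
import json

# Hypothetical schema: the three-part split comes from the paper,
# the field names and values below are invented for illustration.
network_topology = {
    "nodes": ["N1", "N2", "N3", "N4"],
    "arcs": [
        {"from": "N1", "to": "N2", "travel_time": 30},
        {"from": "N2", "to": "N1", "travel_time": 30},
        {"from": "N2", "to": "N3", "travel_time": 45},
        {"from": "N3", "to": "N4", "travel_time": 20},
    ],
}

resource_task_config = {
    "vehicles": [{"id": "AGV-01"}, {"id": "AGV-02"}],
    "tasks": [
        {"vehicle": "AGV-01", "origin": "N1", "destination": "N4"},
        {"vehicle": "AGV-02", "origin": "N2", "destination": "N3"},
    ],
}

# The local rule stays in natural language; everything stable is structured.
operational_requirements = {
    "requirements": ["AGV-02 may not use the road from N1 to N2."]
}


def build_agent_input(topology, config, requirements):
    """Bundle the three components into one JSON document for the agent."""
    payload = {
        "network_topology": topology,
        "resource_task_configuration": config,
        "operational_requirements": requirements,
    }
    return json.dumps(payload, indent=2)


agent_input = build_agent_input(
    network_topology, resource_task_config, operational_requirements
)
```

The design point is the split itself: the graph and fleet are machine-checkable, while the one sentence of operational phrasing is the only part the LLM has to interpret.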

Retrieval is not decoration; it supplies the modeling grammar

The Knowledge Retriever uses a curated knowledge base with two components.

First, it stores modeling primitives: variable definitions, canonical constraints such as flow balance, and objective functions such as minimizing travel time. Second, it stores code exemplars: Python scripts that demonstrate the overall workflow of data loading, preprocessing, model construction, solver execution, and result formatting.

This is where a likely misconception appears. The point is not “give the LLM more examples.” That is the lazy version of RAG, and like most lazy versions of things, it works until it doesn’t.

PortAgent’s evidence suggests something more specific: the example should teach the fundamental workflow, not drown the model in scenario-specific details. In the paper’s tests, a single classic MAPP example performs better than three examples and better than examples tailored to specific scenarios.

That is counterintuitive only if one thinks context is always nutrition. Sometimes context is cholesterol.

The paper’s quantity test compares 0-shot, 1-shot, and 3-shot configurations under engineer-level prompts. The 1-shot setting reaches the best reported combination: a 100% Code Executability Rate (CER), a 93.33% Solver Success Rate (SSR), and the lowest computation time. The 3-shot setting performs worse, which the authors attribute to contextual noise from multiple slightly different examples.

The type test is even more useful. A classic dispatching example achieves 100% CER and 93.33% SSR, with computation time around 101 seconds. Scenario-specific examples are not consistently better. One road-closure example configuration falls to 67% CER and 67% SSR and takes about 300 seconds. The lesson is not that specialization is bad. The lesson is that premature specialization can teach the model the wrong abstraction level.

For business deployment, this is the difference between building a knowledge base of reusable operating principles and building a junk drawer of past cases. The first helps transfer. The second may merely look impressive during procurement.

The self-correction loop is where autonomy becomes measurable

PortAgent’s Debugger is not a decorative “critic” step. It performs concrete checks.

The workflow first applies static analysis using Python’s Abstract Syntax Tree to catch structural and syntactic issues before execution. Then it runs the generated script in a sandbox. If execution fails, the system captures error feedback, diagnoses the likely cause, and sends correction instructions back to the Modeler and Coder.

That makes the self-correction loop operational rather than literary:

1. Generate a model and code.
2. Check structure.
3. Execute in a sandbox.
4. Capture syntax or runtime errors.
5. Diagnose the root cause.
6. Regenerate with correction instructions.
7. Stop when a valid solution is produced or the iteration limit is reached.

The evaluation uses a maximum of three iterations. This is important because it prevents the result from becoming “the agent eventually works if allowed to thrash forever.” Controlled iteration makes the performance claim more interpretable.
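The loop described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the paper's implementation: `generate` stands in for the Modeler and Coder (here a scripted stub rather than an LLM call), the "sandbox" is just a child Python process with a timeout, and the diagnosis step is reduced to passing the raw error text back as feedback. All function names are hypothetical.

```python
import ast
import subprocess
import sys
from typing import Callable, Optional, Tuple

MAX_ITERATIONS = 3  # the evaluation caps the loop at three iterations


def static_check(source: str) -> Optional[str]:
    """Return an error message if the source does not parse, else None."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"SyntaxError at line {exc.lineno}: {exc.msg}"
    return None


def run_isolated(source: str, timeout_s: int = 60) -> Optional[str]:
    """Run the script in a child interpreter; return error text on failure.

    A child process is only a stand-in for a real sandbox (assumption).
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return f"Timed out after {timeout_s} seconds"
    if proc.returncode == 0:
        return None
    return proc.stderr.strip() or f"Exited with code {proc.returncode}"


def self_correct(
    generate: Callable[[Optional[str]], str],
) -> Tuple[Optional[str], int]:
    """Call generate(feedback) until a script passes both checks or the cap is hit."""
    feedback = None
    for iteration in range(1, MAX_ITERATIONS + 1):
        source = generate(feedback)
        # Static check first; only parseable code gets executed.
        feedback = static_check(source) or run_isolated(source)
        if feedback is None:
            return source, iteration
    return None, MAX_ITERATIONS


# Demo with a scripted stand-in for the LLM (hypothetical): the first draft
# has a syntax error, the second fails at runtime, the third runs cleanly.
drafts = iter(["print('unclosed", "raise ValueError('bad bound')", "print('ok')"])
feedback_log = []


def scripted_generate(feedback):
    feedback_log.append(feedback)
    return next(drafts)


final_source, iterations_used = self_correct(scripted_generate)
```

Note what this loop can and cannot see: it reacts to parse errors and non-zero exit codes, and nothing else. A script that runs cleanly ends the loop regardless of whether it models the right problem.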

The ablation study shows why this loop matters. With the full PortAgent, the paper reports 100% CER and 93.3% SSR. Without RAG, CER drops to 40.0% and SSR to 26.7%. Without self-correction, both CER and SSR fall to 33.33%.

Configuration	CER	SSR	Likely purpose of the test	What it supports
Full PortAgent	100%	93.3%	Main architecture evidence	Retrieval plus correction can support reliable transfer in the tested setting
Without RAG	40.0%	26.7%	Ablation	Domain grounding is not optional
Without self-correction	33.33%	33.33%	Ablation	First-pass generation is not reliable enough for deployment

This is one of the paper’s strongest business-relevant findings. RAG and self-correction are not interchangeable. RAG gives the agent the modeling grammar. Self-correction turns generation into an executable process. Remove either one and the system becomes much less useful.

The wider implication is blunt: if an enterprise “agent” cannot execute, inspect, and repair its own intermediate artifacts, it is not yet a serious automation layer. It is a chat interface wearing a hard hat.

The main test is a controlled transfer problem, not a live port deployment

The evaluation focuses on Multi-AGV Path Planning, or MAPP. The test problem is formulated on a directed graph representing the terminal road network. AGVs have origin-destination tasks, and the objective is to minimize total travel time subject to flow-balance constraints and scenario-specific operational requirements.
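For readers who want the shape of the problem in symbols, a generic arc-flow formulation consistent with that description is sketched below. The notation is an assumption made for illustration and may differ from the paper's: $N$ is the node set, $A$ the set of directed arcs, $K$ the set of AGV tasks, $t_{ij}$ the travel time on arc $(i,j)$, $o_k$ and $d_k$ the origin and destination of task $k$, and $x^{k}_{ij}$ a binary variable equal to 1 if task $k$ uses arc $(i,j)$.

```latex
\begin{aligned}
\min_{x} \quad & \sum_{k \in K} \sum_{(i,j) \in A} t_{ij}\, x^{k}_{ij} \\
\text{s.t.} \quad
& \sum_{j:(i,j) \in A} x^{k}_{ij} \;-\; \sum_{j:(j,i) \in A} x^{k}_{ji}
  = \begin{cases}
      1  & \text{if } i = o_k \\
      -1 & \text{if } i = d_k \\
      0  & \text{otherwise}
    \end{cases}
  && \forall i \in N,\; k \in K \\
& x^{k}_{ij} \in \{0, 1\} && \forall (i,j) \in A,\; k \in K
\end{aligned}
```

In this notation, scenario-specific requirements become extra restrictions on $x$. Closing a bidirectional segment between $u$ and $v$, for example, would read $x^{k}_{uv} = x^{k}_{vu} = 0$ for all $k \in K$: two constraints per task, one for each direction.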

The authors test three scenario types:

Scenario	Operational meaning	Why it matters
Road closure	A bidirectional road segment becomes unavailable	Tests whether the agent can modify topology constraints
Forbidden roads for specific trucks	A particular AGV cannot traverse restricted segments	Tests vehicle-specific path compatibility
Designated routes for dangerous goods	A task must include a mandatory safe subpath	Tests whether the agent can handle subpath constraints

The testbed uses 30 AGVs and 20 nodes. For each of the three scenario types, the authors generate five random instances, creating 15 base scenarios. Each scenario is then described in three language styles: technician-level, engineer-level, and scientist-level. This creates 45 test instances.

The benchmark is a specialist-driven method: an OR expert manually translates each instance into a mathematical formulation and implements it using Python and Gurobi. PortAgent is then assessed using two correctness metrics:

Metric	Meaning
Code Executability Rate (CER)	Share of generated Python scripts that run without runtime errors
Solver Success Rate (SSR)	Share of generated scripts whose objective value matches the ground truth within tolerance

This distinction is critical. Executable code is not the same as correct code. Anyone who has shipped software has learned this, usually in production and with sadness.

PortAgent achieves 100% CER across the 45 instances. That means every generated script executed successfully. It solves 42 of 45 instances correctly, for an overall SSR of 93.33%. Across scenario types, SSR ranges from 86.67% to 100%.
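The arithmetic behind those two headline numbers is simple enough to write down. The sketch below assumes a hypothetical `RunResult` record and a relative tolerance of `1e-6`; the paper says "within tolerance" without this article reporting the value, so treat the tolerance as a placeholder. The objective values are invented and serve only to reproduce the 45/45 and 42/45 counts.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunResult:
    executed: bool               # script ran without a runtime error
    objective: Optional[float]   # objective value reported by the script
    reference: float             # specialist-coded ground-truth objective


def cer(results: List[RunResult]) -> float:
    """Code Executability Rate, in percent."""
    return 100.0 * sum(r.executed for r in results) / len(results)


def ssr(results: List[RunResult], rel_tol: float = 1e-6) -> float:
    """Solver Success Rate, in percent (rel_tol is an assumed placeholder)."""
    solved = sum(
        1
        for r in results
        if r.executed
        and r.objective is not None
        and math.isclose(r.objective, r.reference, rel_tol=rel_tol)
    )
    return 100.0 * solved / len(results)


# Placeholder data: 42 runs match the reference, 3 run cleanly but do not.
results = [RunResult(True, 100.0, 100.0)] * 42 + [RunResult(True, 90.0, 100.0)] * 3

cer_pct = cer(results)            # 45 of 45 executed
ssr_pct = round(ssr(results), 2)  # 42 of 45 matched
```

The three non-matching runs count toward CER and against SSR, which is exactly why the two metrics are reported separately.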

Those are strong results for the tested setting. They do not prove general port autonomy. They do show that a structured LLM agent can transfer a dispatching formulation across controlled unseen variants with high reliability and short deployment time.

The failures reveal the real boundary: semantics, not syntax

The most useful part of the result is not the 100% executability. It is the three failures.

All three unsuccessful cases are categorized as semantic misinterpretations. The generated code ran. The solver produced output. The problem was that the mathematical meaning was wrong.

Two failures involved road closure: the model did not correctly enforce the bidirectional nature of the closure. One failure involved dangerous goods: the model constrained the entire path to the designated route instead of treating the designated route as a mandatory subpath.
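Both misreadings are easy to state in code, which is part of why they are easy to make. The sketch below is an illustration of the two semantic gaps, not the paper's generated code; every function name is hypothetical, and paths are plain lists of node labels.

```python
def path_arcs(path):
    """Directed arcs traversed by a path given as a list of nodes."""
    return list(zip(path, path[1:]))


def closure_arcs(u, v):
    """A closed bidirectional segment removes both directed arcs."""
    return {(u, v), (v, u)}


def respects_closure(path, u, v):
    """Intended reading: neither direction of the segment may be used."""
    return closure_arcs(u, v).isdisjoint(path_arcs(path))


def naive_respects_closure(path, u, v):
    """Misreading: only the direction named in the requirement is blocked."""
    return (u, v) not in path_arcs(path)


def contains_subpath(path, subpath):
    """Intended reading: the designated route appears as a contiguous run."""
    n = len(subpath)
    return any(path[i:i + n] == subpath for i in range(len(path) - n + 1))


def equals_route(path, route):
    """Misreading: the entire path must be the designated route."""
    return path == route


# Road closure on segment C-D: this path drives it in the D -> C direction.
reverse_path = ["A", "D", "C"]
closure_ok_intended = respects_closure(reverse_path, "C", "D")
closure_ok_naive = naive_respects_closure(reverse_path, "C", "D")

# Dangerous goods: the task must pass through B -> C -> D on its way from A to E.
full_path = ["A", "B", "C", "D", "E"]
designated = ["B", "C", "D"]
subpath_ok_intended = contains_subpath(full_path, designated)
subpath_ok_naive = equals_route(full_path, designated)
```

In the closure case the naive check accepts a path the terminal would reject; in the dangerous-goods case the naive check rejects a path the terminal would accept. Neither version raises an exception.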

This is the right kind of failure analysis because it identifies the boundary of the architecture. PortAgent can catch syntax errors and many runtime errors. It can iterate through code-level failures. But if the system misunderstands the operational requirement while still producing valid code, execution alone may not catch the mistake.

That is the uncomfortable part for business users. The agent can be wrong in the exact way enterprise systems often become dangerous: silently, formally, and with a perfectly clean output file.

The paper attributes these failures to ambiguity in natural-language inputs and probabilistic variation in LLM output. That diagnosis is plausible, but the practical conclusion is more concrete: high-stakes agentic optimization needs semantic validation, not only code validation.

A deployment-grade version would likely need requirement confirmation, constraint-level inspection, test-case generation, and audit logs that compare intended rules against generated model constraints. Otherwise the system may optimize the wrong problem very efficiently, which is a classic management achievement.
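One small piece of that list, comparing intended rules against generated constraints, can be sketched as a set comparison. This assumes a reviewer (or a confirmation step) has turned the requirement into an explicit set of forbidden arcs, and that the arcs the generated model actually blocks can be extracted somehow; both are assumptions, and `audit_forbidden_arcs` is a hypothetical helper, not something the paper provides.

```python
def audit_forbidden_arcs(intended, generated):
    """Compare the arcs a rule should block with the arcs the model blocks.

    intended:  set of (from, to) arcs derived from the confirmed requirement
    generated: set of (from, to) arcs the generated model forbids
    """
    missing = intended - generated      # rule not fully enforced
    unexpected = generated - intended   # model blocks more than was asked
    return {
        "missing": sorted(missing),
        "unexpected": sorted(unexpected),
        "passed": not missing and not unexpected,
    }


# Example: a bidirectional closure of segment C-D, enforced in one direction only.
intended_arcs = {("C", "D"), ("D", "C")}
generated_arcs = {("C", "D")}

audit_report = audit_forbidden_arcs(intended_arcs, generated_arcs)
```

Here the report flags the missing reverse arc even though the underlying script would have executed without complaint. Storing such reports alongside each generated model is one plausible form the audit log could take.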

The expertise test supports specialist reduction, not expert disappearance

One of the paper’s more commercially attractive claims is that PortAgent reduces dependence on port operations specialists. The evaluation tests this by rewriting the same scenarios in three styles:

Input level	Example character	What the test checks
Technician-level	Informal operational language	Can ordinary users describe the issue without OR formalism?
Engineer-level	Clear operational terminology	Can operations staff use precise but non-mathematical descriptions?
Scientist-level	Formal mathematical specification	Does formal input still help?

The statistical tests show no significant differences across expertise levels for CER, SSR, iterations, or computation time at the 0.05 significance level. The reported p-values include 1.0000 for CER, 0.3425 for SSR, 0.1125 for iterations, and 0.0846 for computation time.

That supports the paper’s argument that PortAgent can reduce the need for specialists in the critical path. But the wording matters. It does not mean experts become useless. It means the system may lower the required expertise level for generating and testing a candidate dispatching model.

The distinction is not cosmetic. In a real terminal, expertise may shift from “manually formulate every scenario” to “design the knowledge base, review ambiguous constraints, approve deployment boundaries, and investigate semantic failures.” That is still expertise. It is just moved upstream and into governance.

This is often how serious AI automation works. It does not delete the expert. It changes where the expert becomes economically valuable.

The speed result is impressive, but it measures workflow compression

The paper reports average end-to-end deployment time of about 83 seconds for PortAgent, compared with “several hours to several days” for the traditional specialist-driven method.

That is a dramatic compression. Still, the interpretation should be precise. The measured time covers the agent’s process from receiving the prompt to producing the final solution in the experimental setting. It is not a full enterprise rollout time including integration testing, operational sign-off, safety review, incident response planning, and the meeting where someone asks whether the dashboard can be blue.

So the business meaning is not “ports can deploy new dispatching logic in 83 seconds.” The better interpretation is: the model-formulation-code-debug loop, which normally consumes substantial expert labor, can be compressed into a short automated cycle for controlled problem classes.

That is still valuable. In fact, it is more credible because it is narrower.

Paper result	Direct meaning	Business inference	Boundary
100% CER	All generated scripts executed	The agent can produce runnable optimization artifacts	Runnable does not guarantee semantically correct
93.33% overall SSR	42 of 45 instances matched ground truth	The workflow transfers well in controlled MAPP variants	Tested on a representative but limited setup
One classic example works best	Clean workflow guidance beats noisy context	Curated examples matter more than example volume	May differ for broader VDS types
About 83 seconds average time	Fast automated formulation and debugging	Specialist workflows can be compressed	Not equivalent to full production deployment
Failures are semantic	Code checks miss meaning errors	Audit must inspect constraints, not just execution	Requires governance and human review for high-stakes use

The business value is controlled transfer, not “AI runs the port”

For Cognaptus readers, the most useful lesson is not that container terminals are suddenly autonomous. They are not. The useful lesson is that PortAgent offers a template for automating expert-heavy transfer workflows.

Many industrial and operational systems share the same pattern:

1. A base optimization or decision system works in one environment.
2. A new environment introduces local constraints.
3. Human experts translate those constraints into model changes.
4. Engineers implement the changes.
5. The system fails, gets debugged, and eventually works.
6. The whole process repeats at the next site.

That pattern appears in warehouses, fleet routing, airline scheduling, energy dispatch, manufacturing planning, and financial operations infrastructure. PortAgent suggests that an LLM agent can sit in the transfer layer: not replacing the solver, not replacing structured data, not replacing operational governance, but accelerating the translation from local requirements to executable models.

A practical enterprise architecture would look less like “ask the AI to optimize the port” and more like this:

Operational change
        ↓
Structured environment input + natural-language requirement
        ↓
Retrieved modeling primitives and clean workflow examples
        ↓
Model formulation
        ↓
Executable solver code
        ↓
Sandbox execution and self-correction
        ↓
Semantic review and deployment approval

Notice the last line. The paper’s own failure analysis earns it a place there.

The mistake would be to market this as full autonomy. The opportunity is to use it as controlled autonomy: automate the slow mechanical parts of expert deployment while reserving explicit review for semantic correctness, safety, and operational intent.

Where the paper is strong, and where a buyer should still ask questions

The strongest part of the paper is its mechanism-evidence alignment. The paper makes three claims for the architecture: it reduces specialist dependency, reduces data needs, and reduces deployment time. The evaluation maps reasonably well onto those claims: expertise-level tests, few-shot tests, deployment-time comparison, and ablations for RAG and self-correction.

The ablation study is especially useful because it prevents the architecture from becoming a list of fashionable components. RAG is tested. Self-correction is tested. Both matter.

The boundary is scope. The experiments are on MAPP scenarios in a representative network, not on a full messy terminal deployment with live disruptions, multiple interacting subsystems, shifting equipment availability, and safety-critical human procedures. The benchmark ground truth is specialist-coded Gurobi solutions, which is appropriate for the test, but still keeps the environment controlled.

A buyer or operator should therefore ask several questions before treating this as deployable infrastructure:

Question	Why it matters
How are natural-language requirements confirmed before code generation?	Most failures came from semantic misinterpretation
Can the generated model expose constraints in a reviewable form?	Audit needs to inspect logic, not just code
Does the agent generate adversarial or edge-case tests?	Execution success may miss wrong assumptions
How is the knowledge base curated and versioned?	One clean example worked best, but only if it stays clean
What happens when the target problem moves beyond MAPP?	Transferability across VDS types remains a broader question
Who approves deployment in safety-relevant scenarios?	Automation changes accountability; it does not erase it

These are not objections to the paper. They are the next layer of engineering seriousness.

Less context, better control

The charmingly annoying result in this paper is that more examples can make the agent worse. That finding deserves attention beyond ports.

Enterprise AI teams often behave as if the solution to unreliable output is to shovel more documents into the context window. More policies. More past tickets. More examples. More internal wiki pages. More everything. The model then has the cognitive experience of being trapped in a filing cabinet.

PortAgent points to a better principle: give the agent the cleanest transferable structure, not the largest pile of precedent. For code-generating optimization agents, the most useful example may be the one that teaches the workflow without contaminating the target logic.

That principle generalizes. In agentic business automation, knowledge bases should be designed as operational instruments, not storage museums. Retrieval should support the next action. Examples should teach reusable structure. Debugging should feed back into the workflow. And semantic ambiguity should be treated as a first-class risk.

Ports are a good setting for this lesson because they are physical, expensive, and unforgiving. A bad dispatching rule does not merely produce an ugly spreadsheet. It can create congestion, idling equipment, safety exposure, and operational delay. Industrial AI has less tolerance for theatrical intelligence.

PortAgent is promising precisely because it does not rely on theater. It decomposes the work, retrieves domain knowledge, generates executable optimization code, and tests its own output. The remaining problem — semantic misunderstanding — is not small. But at least it is the right remaining problem.

The future of agentic operations will not be one giant model “running the yard.” It will be structured agents taking over specific loops that used to require expert handoffs: interpret, model, code, execute, debug, review. Less glamorous than the slogan. Much more useful.

Cognaptus: Automate the Present, Incubate the Future.

¹ Jia Hu, Junqi Li, Weimeng Lin, Peng Jia, Yuxiong Ji, and Jintao Lai, “PortAgent: LLM-driven Vehicle Dispatching Agent for Port Terminals,” arXiv:2512.14417, 2025.
