Breaking Rules, Not Systems: How Penalties Make Autonomous Agents Behave

Emergency is a terrible product requirement.

It sounds simple in a meeting: “The agent should follow policy, except when the situation is urgent.” Wonderful. Very human. Also almost useless.

A delivery robot should not enter a restricted zone. Unless the package is critical medicine. A warehouse agent should not skip safety checks. Unless a fire alarm requires rerouting. A self-driving system should obey traffic norms. Unless an emergency trip makes delay costly. But “unless urgent” does not tell the agent which rule can bend, which rule must hold, and which shortcut turns the system from flexible into reckless.

That is the practical problem behind Vineel Tummala and Daniela Inclezan’s paper, Autonomous Agents and Policy Compliance: A Framework for Reasoning About Penalties.¹ The paper does not ask whether autonomous agents should be compliant or non-compliant. That binary is too clean for the real world, and therefore suspicious. Instead, it asks a sharper question: when an agent must violate some policy to achieve a high-stakes goal, can it reason about the kind and severity of violation?

The authors’ answer is a logic-programming framework that extends the Authorization and Obligation Policy Language, $\mathscr{AOPL}$, with explicit penalties. The result, $\mathscr{AOPL}$-$\mathscr{P}$, lets policy rules carry penalty values, translates those rules into Answer Set Programming, and uses planning modules to select plans based on priorities such as cumulative penalty and execution time.

In plain business language: this is not “let the AI break rules.” It is closer to “make the AI show the invoice before it breaks one.”

The traffic case exposes the missing middle layer

The paper’s motivating case is a simplified traffic domain. A self-driving agent moves across a grid of locations. Roads have speed limits. Some places have stop signs, red lights, stopped school buses, pedestrians, and “Do not enter” signs. Policies express permissions and obligations: do not enter prohibited streets, do not roll through stop signs, stop for pedestrians, do not cross on red, and so on.

The older framework that this paper builds on already allowed different behavior modes. A “Normal” agent would avoid non-compliant actions. A “Risky” agent, useful in emergency situations, could choose actions that violate policy if they helped minimize plan length.

That sounds reasonable until the agent sees several short routes.

One route may ignore pedestrians. Another may enter a “Do not enter” street. Another may speed modestly. In the older risky mode, these plans can look equivalent if they have the same length. The agent knows only that they are short. It does not know that one shortcut is socially tolerable, another is legally costly, and another could injure someone. A planner that treats all violations as equal has technically noticed policy. It just has the moral resolution of a spreadsheet cell.

The paper adds the missing middle layer: penalties.

Some violations receive low penalties, some receive higher penalties, and some human-harm-related situations can be converted into hard constraints. For example, the traffic domain starts with a 1-to-3 penalty scale for ordinary infractions, but the authors later refine the treatment of pedestrians and stopped school buses by assigning a much higher penalty and using a constraint so the agent must comply when non-compliance could harm humans.

This distinction matters. A penalty system is not the same as a permission system. It does not say, “Everything is allowed if you pay enough.” It says, “The planner can rank undesirable options, and some options can still be ruled out entirely.”
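A minimal Python sketch of that distinction, not the paper's ASP encoding. The rule labels, the penalty values, and the `HARD_RULES` set are illustrative assumptions; the only idea taken from the text is that hard rules filter plans out while priced rules merely rank them.

```python
from dataclasses import dataclass

# Hypothetical rule labels and values; the paper's own may differ.
PENALTY = {"speed_limit": 1, "do_not_enter": 3}
HARD_RULES = {"stop_for_pedestrians"}  # never priced, only enforced


@dataclass(frozen=True)
class Plan:
    name: str
    violated: tuple  # labels of the rules this plan violates


def admissible(plan):
    """A plan is admissible only if it violates no hard rule."""
    return not any(rule in HARD_RULES for rule in plan.violated)


def total_penalty(plan):
    """Sum the penalties of the priced rules a plan violates."""
    return sum(PENALTY[rule] for rule in plan.violated)


def least_bad(plans):
    """Drop inadmissible plans first, then rank the rest by penalty."""
    candidates = [p for p in plans if admissible(p)]
    return min(candidates, key=total_penalty) if candidates else None


plans = [
    Plan("ignore_pedestrians", ("stop_for_pedestrians",)),
    Plan("wrong_way_street", ("do_not_enter",)),
    Plan("modest_speeding", ("speed_limit",)),
]
print(least_bad(plans).name)  # modest_speeding
```

Note that the pedestrian plan never reaches the ranking step at all. No penalty value, however small the alternatives' scores, can buy it back.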

That is the useful part.

What the paper actually builds

The technical contribution has three layers.

First, the authors adapt $\mathscr{AOPL}$ so that both strict and defeasible policy rules can be labeled. Labeling matters because the system must later identify which exact rule was violated. Without labels, the agent may know it did something wrong, but not which policy generated the penalty. That is not explainability; that is a shrug in formal notation.

Second, they extend the language with penalty statements. A penalty statement associates a rule label with a numerical penalty, optionally conditioned on static facts. This allows one policy rule to have multiple gravity levels. Speeding is the clean example: exceeding a speed limit slightly may incur one penalty; exceeding it by much more may incur a larger one.
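To make "multiple gravity levels" concrete, here is a small Python sketch. It is not $\mathscr{AOPL}$-$\mathscr{P}$ syntax: the `PenaltyStatement` class, the `speed_limit` label, and the 10 and 20 mph thresholds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PenaltyStatement:
    rule_label: str                    # which policy rule this prices
    penalty: int                       # numerical severity
    condition: Callable[[int], bool]   # test on mph over the limit


# One rule label, three gravity levels (thresholds are hypothetical).
STATEMENTS = [
    PenaltyStatement("speed_limit", 1, lambda excess: 0 < excess <= 10),
    PenaltyStatement("speed_limit", 2, lambda excess: 10 < excess <= 20),
    PenaltyStatement("speed_limit", 3, lambda excess: excess > 20),
]


def penalty_for(rule_label, speed, limit):
    """Return the penalty of the statement whose condition matches, else 0."""
    excess = speed - limit
    return sum(
        s.penalty
        for s in STATEMENTS
        if s.rule_label == rule_label and s.condition(excess)
    )


for speed in (25, 30, 40, 55):
    print(speed, penalty_for("speed_limit", speed, limit=25))
```

The conditions are written to be mutually exclusive, so at most one statement fires for any given speed.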

Third, they translate the extended language into Answer Set Programming. ASP is a good fit because the problem is not merely prediction. It is structured search under constraints: given a dynamic domain, policies, penalties, an initial state, and a goal, find a plan that satisfies the chosen behavior mode.

The framework therefore has several components:

Component	What it does	Why it matters operationally
Dynamic-domain encoding	Describes actions, states, preconditions, and effects	The agent can reason about what can physically happen
Policy and penalty encoding	Represents permissions, obligations, priorities, and penalties	Rules become inspectable decision inputs
Policy-reasoning module	Determines which policy rules apply at each time step	The agent penalizes only relevant violations
Planning problem	Specifies initial state, goal, observations, and emergency status	The same policy base can support different missions
Ranking module	Selects plans using priorities such as penalty and time	“Best” becomes configurable rather than hard-coded

The important design choice is that penalties are attached to policy rules, not just to actions. If an action violates two rules, the framework can count both infractions. The paper explicitly contrasts this with the older framework, where a multi-rule violation could still be counted as just one non-compliant action.
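The gap between the two ways of counting is easy to see in a toy example. The plan, the rule labels, and the penalty values below are made up for illustration; only the per-action versus per-rule contrast comes from the paper.

```python
# Hypothetical penalty table keyed by rule label.
PENALTY = {"speed_limit": 1, "stop_sign": 2, "do_not_enter": 3}

# Each step: (action, labels of the rules that action violates).
plan = [
    ("drive_a_b", []),
    ("drive_b_c", ["speed_limit", "stop_sign"]),  # one action, two rules
    ("drive_c_d", ["do_not_enter"]),
]


def noncompliant_actions(plan):
    """Per-action view: a step counts once, however many rules it breaks."""
    return sum(1 for _, violated in plan if violated)


def cumulative_penalty(plan):
    """Per-rule view: every violated rule contributes its own penalty."""
    return sum(PENALTY[rule] for _, violated in plan for rule in violated)


print(noncompliant_actions(plan))  # 2
print(cumulative_penalty(plan))    # 6
```

The per-action count cannot tell the second step apart from the third. The per-rule sum can, and it also keeps the labels needed to say why.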

That is not a minor bookkeeping fix. In enterprise terms, it is the difference between “this workflow step was non-compliant” and “this step violated privacy, authorization, and escalation policy at the same time.” Same click. Very different audit conversation.

Emergency mode is not “ignore policy”

The paper introduces two behavior modes around penalty and time.

In non-emergency mode, the planner prioritizes low cumulative penalty over short execution time. In emergency mode, it prioritizes time over penalty. The same planning system can therefore behave differently depending on context.

A useful example in the paper compares two plans in the traffic domain. In emergency mode, the agent chooses a plan with cumulative penalty 15 and cumulative time 30. In non-emergency mode, it chooses a plan with cumulative penalty 0 and cumulative time 67. The agent is not simply “ethical” in one mode and “reckless” in the other. It is optimizing different priorities.
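One way to read the two modes is as a lexicographic ordering with the criteria swapped. The Python sketch below uses the penalty and time figures quoted above; the plan names and the `choose` function are assumptions, and tuple comparison stands in for the optimization the paper performs in ASP.

```python
from typing import NamedTuple


class Plan(NamedTuple):
    name: str      # hypothetical label for readability
    penalty: int   # cumulative penalty
    time: int      # cumulative execution time


def choose(plans, emergency):
    """Pick the best plan under the declared behavior mode."""
    if emergency:
        # Time first; penalty only breaks ties.
        return min(plans, key=lambda p: (p.time, p.penalty))
    # Penalty first; time only breaks ties.
    return min(plans, key=lambda p: (p.penalty, p.time))


plans = [Plan("fast", penalty=15, time=30), Plan("clean", penalty=0, time=67)]

print(choose(plans, emergency=True).name)   # fast
print(choose(plans, emergency=False).name)  # clean
```

The candidate set and the scoring are identical in both calls. Only the order of the sort key changes, which is what makes the behavior mode a configuration choice rather than a separate planner.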

This is the conceptual upgrade.

A policy-aware agent should not be forced into two childish modes:

obey every rule no matter what;
ignore every rule when the goal is urgent.

The paper’s replacement is more adult, and therefore less convenient: encode penalties, encode priorities, then let the planner search for the least bad plan under the declared behavior mode.

For business systems, this maps neatly to exception handling. A customer-support agent may be allowed to skip a low-level script step when a priority customer has an outage. It should not be allowed to expose another customer’s data. A procurement agent may accelerate approval for emergency inventory. It should not bypass conflict-of-interest checks. The difference is not “rules versus no rules.” The difference is which rules are priced, which rules are binding, and who gets to set that boundary.

The experiments support plan quality more than raw speed

The evidence comes from two domains: a Rooms Domain inherited from prior work, and the new Traffic Norms Domain.

The Rooms Domain has nine rooms, doors, locks, keys, badges, fires, contamination, and policies such as using a key before a badge, not entering contaminated rooms, and avoiding active-fire rooms unless equipped with protective gear. Penalties from 1 to 3 are added according to severity.

In this domain, the proposed framework is faster than the older Harders-Inclezan framework. Across the listed scenarios, its runs complete in under 0.5 seconds, while the earlier framework takes roughly 2 to 4 seconds. The authors attribute this to their framework using simpler metrics than the older one, which considered more complex measures such as the percentage of strongly compliant elementary actions.

That is good, but it is not the whole story.

The Traffic Norms Domain is more revealing because the agent must choose not only a route but also driving speeds. Here the proposed framework is slower than the older framework. The older framework often finishes in well under 2 seconds, while the proposed framework takes several seconds and, in larger speed-choice scenarios, more than 10 seconds. The authors explain why: their planner goes through more optimization cycles, typically 3–16 compared with 2–6 in the older framework, because it is refining speed choices and penalty trade-offs.

So the clean summary is:

Experiment	Likely purpose	What it supports	What it does not prove
Rooms Domain comparison	Main evidence for feasibility and speed in a simpler dynamic-policy setting	Penalty-aware planning can be computationally efficient in some structured domains	That the method is always faster
Traffic Norms comparison	Main evidence for plan-quality differences	Penalties and time metrics produce safer and more realistic choices than length-only risky planning	That the method scales cheaply to richer continuous worlds
Varying number of speed values	Sensitivity test	Runtime rises as the action-choice space expands	That the runtime profile is solved
Improved applicability encoding	Implementation refinement	Filtering physically executable actions improves Traffic runtime substantially	That the framework is production-ready

The Traffic results are the most business-relevant part. In several scenarios, the older risky-mode agent finds shorter plans because it does not stop for pedestrians or stopped school buses. The proposed emergency-mode agent may choose a longer plan because it still obeys human-safety constraints. In Scenario 5, for example, the proposed framework generates a plan of length 3, while the older risky framework generates a plan of length 2. The difference is not mathematical elegance. The proposed agent stops for crossing pedestrians. The older risky agent proceeds.

That one extra step is the entire point.

Speed choice is where the paper becomes practical

A subtle but important result is that the two frameworks often choose similar paths but different speeds. The older framework may choose strange speeds because it optimizes plan length rather than execution time or penalty severity. The paper notes that the older Normal agent may drive at 5 mph where the speed limit is 25 mph. In real traffic, that is not harmless compliance. It may provoke unsafe behavior from other drivers.

The proposed framework incorporates execution time. A non-emergency agent therefore tends to choose speeds closer to the limit, while still respecting policy priorities. An emergency agent may accept some penalty for speed, but larger speed violations carry larger penalties.
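A short Python sketch shows why plan length alone cannot see the problem. The one-mile segments and the minutes-based unit are assumptions for illustration; the 5 mph and 25 mph figures are the ones mentioned above, and the paper's own time metric may be defined differently.

```python
def execution_time_minutes(segments):
    """Total travel time for a list of (miles, mph) segments."""
    return sum(60 * miles / mph for miles, mph in segments)


# Same route, same number of steps, different speeds.
crawl = [(1.0, 5), (1.0, 5)]        # 5 mph in a 25 mph zone
at_limit = [(1.0, 25), (1.0, 25)]   # driving at the limit

print(len(crawl) == len(at_limit))       # True: equal plan length
print(execution_time_minutes(crawl))     # 24.0
print(execution_time_minutes(at_limit))  # 4.8
```

A length-only planner sees two identical plans. A planner that also minimizes execution time prefers the second, and neither plan incurs a speeding penalty.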

This matters because many enterprise-agent failures will not look like “the agent broke the law.” They will look like “the agent technically followed the workflow but made an absurd operational choice.” A system can obey a rule and still behave badly. Anyone who has dealt with automated customer service already knows this; the chatbot was compliant all the way down.

By adding time and penalty metrics, the paper shows how policy reasoning can become less brittle. The agent can distinguish among actions that are all formally possible but operationally different.

The translator is more important than it looks

The paper also implements a Python translator from $\mathscr{AOPL}$-$\mathscr{P}$ policies into ASP. This may sound like engineering plumbing. It is not glamorous. Naturally, that means it is important.

A policy framework that requires every rule to be hand-coded in solver syntax will not survive contact with real organizations. The translator allows policies and penalties to be written in the higher-level policy language and then converted into ASP predicates such as rules, types, heads, bodies, preferences, and penalties.

The translator has to solve practical issues that are easy to underestimate. Rule labels may contain variables. Arithmetic comparisons, such as speed exceeding a limit by a threshold, must be represented safely for the solver. The translator must determine whether a rule is strict or defeasible, parse rule bodies, handle preferences, and generate appropriate ASP structure.
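To give a feel for the shape of such a compilation step, here is a deliberately small Python sketch. It is not the authors' translator: the `PolicyRule` class, the example rule, and the emitted fact names (`rule`, `type`, `head`, `body`, `penalty`) are assumptions loosely modeled on the predicate categories listed above. It also skips the hard parts, such as variables in labels and arithmetic comparisons.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PolicyRule:
    label: str                       # e.g. "r_speed" (hypothetical)
    kind: str                        # "strict" or "defeasible"
    head: str                        # what the rule concludes
    body: List[str] = field(default_factory=list)  # conditions
    penalty: Optional[int] = None    # None means no penalty statement


def to_asp(rule):
    """Emit ASP-style facts describing one labeled policy rule."""
    if rule.kind not in ("strict", "defeasible"):
        raise ValueError(f"unknown rule kind: {rule.kind}")
    facts = [
        f"rule({rule.label}).",
        f"type({rule.label}, {rule.kind}).",
        f"head({rule.label}, {rule.head}).",
    ]
    facts += [f"body({rule.label}, {literal})." for literal in rule.body]
    if rule.penalty is not None:
        facts.append(f"penalty({rule.label}, {rule.penalty}).")
    return facts


example = PolicyRule(
    label="r_speed",
    kind="defeasible",
    head="not_permitted(speeding)",
    body=["over_limit"],
    penalty=2,
)
print("\n".join(to_asp(example)))
```

Even at this size the design point is visible: the policy author edits the structured rule, and the solver-facing text is generated rather than hand-maintained.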

For enterprise use, this is the lesson: governance logic needs a compilation path.

Policies should not live only as prose in PDFs. They also should not be buried directly in low-level solver code. The workable middle is a policy-specification layer that domain experts can inspect, paired with a translation layer that execution systems can use. Yes, this adds complexity. The alternative is pretending compliance can be solved by a system prompt. A bold strategy, historically.

The business pathway is risk-priced orchestration

The paper directly shows a formal framework for penalty-aware planning in two small dynamic domains. Cognaptus’ business inference is broader but bounded: this kind of architecture points toward risk-priced orchestration for enterprise agents.

Imagine an operational agent choosing among workflows:

Agent situation	Ordinary rule	Possible exception	Penalty-aware interpretation
Customer support outage	Follow standard escalation sequence	Skip one approval step for a critical incident	Low penalty if logged and bounded
Warehouse routing	Avoid restricted aisles	Enter restricted zone during safety evacuation	Allowed only under emergency mode and physical constraints
Finance automation	Require multi-party approval	Accelerate low-value transaction correction	Penalty depends on amount, role, and audit status
Data assistant	Never access restricted data	Use restricted field for legally required reporting	Some rules remain hard constraints; others require authorization and explanation

This is not legal advice in code form. Nor is it a universal compliance brain. The practical value is narrower: when policy can be formalized, and when exception classes can be assigned severity, an autonomous system can make auditable trade-offs instead of hiding behind a vague “agent decided” story.

The output is also explainable in the right way. The agent can identify which rule applied, which rule was violated, what penalty was incurred, and why the selected plan ranked better under the current behavior mode. For internal governance teams, that is more useful than a generic natural-language justification generated after the fact.

The business question becomes less mystical:

Can we define the organization’s exception logic clearly enough that an agent can reason with it before acting?

Many firms cannot. That is not an AI limitation. That is an operations mirror.

The penalty scale is governance, not engineering

The paper uses simple penalty values, often on a 1-to-3 scale, for illustration. It also recognizes that penalty design is not merely a technical matter. The authors explicitly note future work involving ethics experts and alternative ethical perspectives beyond utilitarian scoring.

This is the section executives should not skip.

Once penalties exist, someone has to decide them. How bad is skipping an approval? How bad is entering a restricted area? How bad is speeding slightly in an emergency? How should the system treat human harm, privacy leakage, regulatory exposure, customer loss, or reputational damage?

These are not values that emerge from ASP. They are governance choices encoded into ASP.

A bad penalty scheme can make a system worse by giving false precision to poor judgment. A rule with a low penalty may become functionally optional. A rule with a high penalty may block necessary action. A hard constraint may be wise in one domain and dangerous in another. The paper’s pedestrian and school-bus refinement is a useful example: human-harm-related rules are moved beyond ordinary 1-to-3 penalties and can be enforced with constraints.

For enterprise deployment, the lesson is blunt: do not let engineers invent the penalty table alone. Compliance, operations, risk, ethics, and domain experts all belong in the loop. Otherwise the organization has not automated governance. It has automated the loudest person’s assumptions.

The boundary: formal policies, small domains, and expensive search

The paper is careful about scope, and the business interpretation should be too.

First, the framework assumes policies can be formalized. That is plausible for many operational rules, but not for all business judgment. “Treat customers fairly” is not automatically an $\mathscr{AOPL}$-$\mathscr{P}$ rule. It must be decomposed into conditions, actions, obligations, permissions, priorities, and penalties. This is hard work. It is also where most of the value lives.

Second, the paper assumes categorical, unambiguous policies. Non-categorical policies, where rules can produce unresolved ambiguity, are left for future work. In an enterprise, ambiguous policies are not rare edge cases. They are often called “Tuesday.”

Third, scalability remains open. The Traffic Norms experiments already show that adding more speed values increases computation time. An improved encoding that considers only physically executable actions reduces runtime, especially in the harder traffic scenarios, but this is still a research framework tested in simplified domains. A real city, hospital, factory, bank, or logistics network would require heavier engineering.

Fourth, penalties do not solve moral conflict by themselves. The authors discuss hard constraints for preventing human harm and note that trolley-problem-like cases may require different treatment. That is the correct level of humility. Penalties are a way to represent trade-offs. They are not a machine for manufacturing moral truth.

The real takeaway is not “agents should break rules”

The tempting headline is that autonomous agents need permission to break rules. That is only half right, and the wrong half gets people excited.

The better reading is this: autonomous agents need structured ways to reason about exceptions before action, not decorative explanations after action. Penalties make non-compliance legible. Priorities make emergency behavior configurable. Hard constraints keep some boundaries from becoming negotiable. Translation into ASP makes the system executable rather than merely aspirational.

For businesses building agentic workflows, this paper points to a practical governance pattern:

specify policies in a formal language;
attach severity to violations;
separate priced exceptions from hard constraints;
define behavior modes for emergency and non-emergency contexts;
require the planner to return not only an action but an audit trail of violated rules and incurred penalties.
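The last item in that pattern, returning an audit trail alongside the plan, can be sketched in a few lines of Python. The record layout, the rule labels, and the penalty values are assumptions for illustration, not the paper's output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Violation:
    step: int      # 1-based position in the plan
    action: str
    rule: str      # label of the violated rule
    penalty: int


def audit(plan, penalties, mode):
    """Bundle a plan with the violations and penalties it incurs."""
    trail = [
        Violation(step, action, rule, penalties[rule])
        for step, (action, violated) in enumerate(plan, start=1)
        for rule in violated
    ]
    return {
        "mode": mode,
        "actions": [action for action, _ in plan],
        "violations": trail,
        "cumulative_penalty": sum(v.penalty for v in trail),
    }


penalties = {"speed_limit": 1, "do_not_enter": 3}   # hypothetical values
plan = [("drive_a_b", ["speed_limit"]), ("drive_b_c", ["do_not_enter"])]

report = audit(plan, penalties, mode="emergency")
for v in report["violations"]:
    print(v.step, v.action, v.rule, v.penalty)
print(report["cumulative_penalty"])  # 4
```

The point of the structure is that the record is produced from the same data the planner ranked on, rather than written up afterward.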

That pattern will not fit every AI system. It is too structured for open-ended creative tasks and too early for messy real-world legal compliance. But for operational agents acting inside bounded domains, it is exactly the kind of boring machinery that prevents “autonomy” from becoming a euphemism for unreviewed improvisation.

In other words, the goal is not to make agents obedient little clerks. It is to make them competent adults: aware of rules, aware of consequences, and aware that some lines do not move just because the clock is ticking.

That is a much better foundation for enterprise automation than blind compliance or heroic rule-breaking. The bar is low. Happily, the framework clears it.

Cognaptus: Automate the Present, Incubate the Future.

¹ Vineel Tummala and Daniela Inclezan, “Autonomous Agents and Policy Compliance: A Framework for Reasoning About Penalties,” arXiv:2512.03931, https://arxiv.org/abs/2512.03931.
