The Policy Has to Work Somewhere: RL for Scale, Trust, and Other Inconveniences

Deployment is where elegant AI systems go to meet bandwidth caps, slow devices, noisy user preferences, and privacy policies written by committees with very strong coffee.

That is the useful lens for reading Guangchen Lan’s dissertation, Reinforcement Learning for Scalable and Trustworthy Intelligent Systems.¹ It is tempting to describe the work as a collection of four reinforcement-learning methods: one for synchronous federated RL, one for asynchronous federated RL, one for preference optimization, and one for contextual privacy. Technically, that is true. Editorially, it is the least interesting way to read it.

The stronger reading is this: reinforcement learning is treated as a policy-optimization toolkit for two deployment constraints that enterprise AI keeps rediscovering, usually after the pilot demo has already looked impressive.

First, intelligent systems need to scale. Agents may be distributed across devices, simulators, robots, edge environments, or organizational data silos. They cannot always send heavy model information back to a central server. They cannot always wait for the slowest participant. Apparently, “just centralize everything” is not a strategy when reality owns a network cable.

Second, intelligent systems need to be trusted. LLM agents are not merely generating polite paragraphs anymore. They rank responses, act on user requests, retrieve private records, and decide what information should leave one context and enter another. That means optimization is no longer just about reward maximization. It is also about preference calibration and context-sensitive disclosure.

The dissertation is therefore not only about making RL faster, and not only about making LLMs safer. Its real argument is more practical: once policies are deployed, they run into constraints. Some constraints are infrastructural. Some are behavioral. Both can be attacked with policy optimization, but not with the same policy trick.

The paper is best read as four bottlenecks, not four chapters

A chapter-by-chapter summary would make the dissertation look like four adjacent papers. That would be accurate in the way a warehouse inventory list is accurate: technically complete, cognitively unhelpful.

A better structure is to group the contributions by the deployment bottleneck they target.

Deployment constraint	Bottleneck	Method	What the paper directly shows	Business interpretation	Boundary
Distributed scale	Synchronous federated natural policy gradient requires expensive matrix communication	FedNPG-ADMM	Per-agent communication drops from $O(d^2)$ to $O(d)$ while preserving convergence guarantees and comparable MuJoCo performance	Curvature-aware RL becomes less allergic to federated and edge settings	Evidence is theory plus standard continuous-control benchmarks, not a production fleet
Distributed scale	Synchronous federated RL waits for slow agents	AFedPG	Asynchronous policy-gradient training with delay-adaptive lookahead achieves linear sample-complexity speedup and lower global training time in heterogeneous settings	Useful when agents, simulators, or devices run at uneven speeds	Does not solve adversarial, unreliable, or malicious-worker settings
Trustworthy behavior	Preference optimization can overfit pairwise winner-loser labels and ignore reward magnitude	MaPPO	Prior reward knowledge improves DPO-style optimization across several LLM families and preference methods	Alignment pipelines should care not only who won, but by how much	Depends on the quality of the prior reward signal and automatic evaluation
Trustworthy behavior	Privacy is context-dependent, not merely a static access-control label	CI-RL	Contextual-integrity reasoning plus RL reduces inappropriate disclosure while preserving task utility on synthetic and external benchmarks	Enterprise agents need disclosure judgment, not just retrieval permission	Synthetic training data and keyword rewards need domain-specific validation

This framing matters because the methods are not one integrated production architecture. FedNPG-ADMM does not automatically produce a privacy-safe LLM agent. CI-RL does not make distributed robot learning cheap. The dissertation is unified by a design stance, not by a single deployed system: policies should be optimized under the constraints that actually appear when systems leave the paper.

Bottleneck one: second-order RL is powerful, but matrices are rude

Natural policy gradient methods are attractive because they use curvature information to make policy updates more stable and efficient than plain first-order gradients. In a centralized setting, that is already mathematically involved. In a federated setting, it becomes an operations problem.

Standard federated natural policy gradient asks each agent to send not only a gradient vector $g_i \in \mathbb{R}^d$, but also a second-order matrix $H_i \in \mathbb{R}^{d \times d}$. That is the moment where “distributed intelligence” starts looking suspiciously like “distributed bandwidth consumption.”

The paper’s first contribution, FedNPG-ADMM, avoids transmitting the full matrix. Instead of approximating every local Hessian-like object and shipping it upward, the method uses ADMM to approximate the global natural-gradient update direction directly. Agents transmit vector quantities rather than matrix quantities. The headline complexity change is simple:

$$ O(d^2) \rightarrow O(d) $$

That is not a cosmetic improvement. When $d$ grows, matrix communication becomes the kind of cost that makes otherwise elegant algorithms quietly disappear from practical engineering meetings.
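A back-of-the-envelope sketch makes the gap concrete. It counts only the floats one agent uploads per round, assuming the matrix-based scheme sends $g_i$ plus a dense $H_i$ and the vector-based scheme sends a fixed number of $d$-dimensional vectors. The function names and the `vectors_per_round` parameter are illustrative assumptions, not the dissertation's own accounting.

```python
def matrix_payload_floats(d: int) -> int:
    """Floats per agent per round when uploading g_i (d) and a dense H_i (d x d)."""
    return d + d * d


def vector_payload_floats(d: int, vectors_per_round: int = 1) -> int:
    """Floats per agent per round when uploading only d-dimensional vectors.

    vectors_per_round is a hypothetical knob; the real count depends on the method.
    """
    return vectors_per_round * d


def payload_ratio(d: int, vectors_per_round: int = 1) -> float:
    """How many times larger the matrix-based upload is than the vector-based one."""
    return matrix_payload_floats(d) / vector_payload_floats(d, vectors_per_round)


for d in (100, 10_000, 1_000_000):
    print(f"d={d}: matrix upload is {payload_ratio(d):.0f}x the vector upload")
```

Under these assumptions the ratio is $d + 1$ for one vector per round, so the advantage grows with policy size. The paper's measured overhead depends on details this sketch ignores.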

The theoretical result keeps convergence to a stationary point in the same rate class as standard federated natural policy gradient, while reducing communication. In the paper’s comparison table, FedNPG and FedNPG-ADMM both retain the per-agent sample-complexity benefit of federation, but FedNPG-ADMM changes total communication from an $O(d^2)$ dependence to an $O(d)$ dependence.

The experiments serve mainly as evidence that the communication-saving approximation does not destroy learning behavior. On MuJoCo tasks such as Swimmer-v4 and Hopper-v4, FedNPG-ADMM produces reward curves broadly comparable to standard FedNPG. In Swimmer-v4, for example, the reported final rewards move from 109.4 with one agent to 128.5 with eight agents under FedNPG-ADMM, while standard FedNPG moves from 111.9 to 124.8 over the same agent counts. In Hopper-v4, both methods improve substantially as the number of agents increases, with eight-agent FedNPG-ADMM reaching 2719 compared with 2736 for FedNPG.

The reward differences are not the main story. The main story is that the method preserves the learning pattern while reducing communication overhead by orders of magnitude. The paper reports communication savings of roughly four orders of magnitude in Swimmer and around six in Humanoid compared with standard FedNPG.

That is the business-relevant part. If a company is training policies across edge devices, simulators, robotics sites, or distributed operational environments, the constraint is not always “can the algorithm learn?” Often it is “can the algorithm learn without turning the network into a memorial service?”

The boundary is equally important. The experiments are standard continuous-control benchmarks, not a messy enterprise deployment with unstable devices, confidential data regimes, and incentive-misaligned participants. Also, federated learning is not automatically privacy. FedNPG-ADMM does not send raw trajectories, but the paper does not claim formal privacy guarantees. Less communication is useful. It is not encryption wearing a lab coat.

Bottleneck two: waiting for the slowest agent is a tax disguised as discipline

Synchronous federated learning has a familiar problem: every round waits for participating agents to finish. In homogeneous simulation environments, this is irritating. In real distributed environments, it is structural.

Some agents run on slower hardware. Some collect data more slowly. Some environments are harder to simulate. Some clients live behind questionable network conditions because apparently the universe enjoys benchmarking distributed algorithms against hotel Wi-Fi.
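The straggler tax is easy to see in a toy model. The sketch below is an idealization with made-up agent speeds: synchronous rounds cost the slowest agent's time, while the asynchronous estimate assumes every agent delivers updates at its own pace and the server never waits. It ignores communication time and says nothing about staleness, which is the harder part.

```python
from typing import Sequence


def synchronous_time(agent_seconds: Sequence[float], total_updates: int) -> float:
    """Every round waits for the slowest agent and yields one update per agent."""
    n_agents = len(agent_seconds)
    rounds = -(-total_updates // n_agents)  # ceiling division
    return rounds * max(agent_seconds)


def asynchronous_time(agent_seconds: Sequence[float], total_updates: int) -> float:
    """Idealized estimate: agent i delivers one update every t_i seconds, nobody waits."""
    updates_per_second = sum(1.0 / t for t in agent_seconds)
    return total_updates / updates_per_second


# Three fast agents and one straggler; the numbers are hypothetical.
speeds = [1.0, 1.0, 1.0, 5.0]
print("sync :", synchronous_time(speeds, 400))
print("async:", asynchronous_time(speeds, 400))
```

With these numbers the synchronous schedule pays for the five-second agent in every round, while the asynchronous estimate is dominated by the three fast agents.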

The dissertation’s second contribution, AFedPG, targets this straggler problem. The key is not merely “make federated RL asynchronous.” That slogan is too easy. In supervised federated learning, stale gradients are already a problem, but the data distribution is often treated as fixed. In reinforcement learning, the data distribution depends on the policy that generated it. If an agent collects trajectories under an older policy and sends them later, the server is not just receiving late information. It is receiving information from a different behavioral world.

AFedPG introduces a delay-adaptive lookahead mechanism. When a delayed agent update arrives, the algorithm constructs a lookahead policy parameter that compensates for the lag. The simplified intuition is:

$$ \tilde{\theta}_k = \theta_k + \frac{1-\alpha_{k-\delta_k}}{\alpha_{k-\delta_k}} (\theta_k-\theta_{k-1}) $$

Here, $\delta_k$ reflects delay. The point is not to add generic momentum because momentum is fashionable and therefore must be present. The point is to cancel RL-specific error terms caused by stale policy data. This is a mechanism, not decoration.
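As a minimal sketch, the simplified formula above can be written directly in code. The function name and the `alpha_delayed` argument (standing for $\alpha_{k-\delta_k}$) are illustrative; how the step sizes are scheduled and how the lookahead point is used in the full algorithm are not specified here and are left out.

```python
from typing import List, Sequence


def lookahead_parameters(
    theta_k: Sequence[float],
    theta_prev: Sequence[float],
    alpha_delayed: float,
) -> List[float]:
    """Compute theta_k + ((1 - a) / a) * (theta_k - theta_{k-1}) elementwise.

    alpha_delayed plays the role of alpha_{k - delta_k}; restricting it to
    (0, 1] is an assumption of this sketch.
    """
    if not 0.0 < alpha_delayed <= 1.0:
        raise ValueError("alpha_delayed must lie in (0, 1]")
    scale = (1.0 - alpha_delayed) / alpha_delayed
    return [t + scale * (t - p) for t, p in zip(theta_k, theta_prev)]


# With alpha = 0.5 the extrapolation weight is 1: step once more along the last move.
print(lookahead_parameters([1.0, 2.0], [0.0, 2.0], alpha_delayed=0.5))
```

When `alpha_delayed` equals 1 the extrapolation weight is zero and the lookahead point is just $\theta_k$; smaller values push the lookahead further along the most recent update direction.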

The evidence has three layers.

First, the theoretical analysis gives a sample-complexity speedup that is linear in the number of agents. AFedPG retains this benefit of federated training while improving global time complexity under heterogeneous agent speeds. In plain business language: it uses more workers without being held hostage by the slowest one.

Second, the MuJoCo experiments compare AFedPG against single-agent policy gradient, synchronous FedPG, vanilla asynchronous FedPG, and A3C-style baselines. Across Swimmer, Hopper, Walker2D, and Humanoid, AFedPG converges faster and more stably than single-agent PG as agent count increases. In global time comparisons, AFedPG reduces time consumption relative to synchronous FedPG under heterogeneous timing.

Third, the comparison with vanilla asynchronous FedPG functions as an ablation. This matters. If vanilla asynchrony performed just as well, the delay-adaptive lookahead would be an elegant theoretical accessory. Instead, the vanilla asynchronous baseline performs poorly in the reported experiments. That suggests the lookahead correction is doing real work.

For enterprise use, this contribution is most relevant wherever learning is distributed across uneven sources: robotics fleets, digital twins, edge simulators, logistics environments, or any multi-agent system where data generation is tied to active behavior. The return on investment is not just “more agents learn faster.” It is “the system stops wasting global training time waiting for the least convenient participant.”

But again, the boundary is not small. AFedPG handles heterogeneity and staleness, not hostile participants. It does not solve poisoned updates, unreliable agents, or governance problems around who is allowed to contribute data. It also focuses on first-order policy-gradient settings. The practical lesson is not that asynchronous RL is solved. The lesson is that in RL, asynchrony needs policy-aware correction, not just a queue.

Bottleneck three: preference optimization should know whether the loser barely lost

The dissertation then shifts from distributed RL to LLM alignment. This looks like a topic change. It is less of a change than it first appears.

The first two methods ask: how can policies be optimized when training infrastructure is constrained?

The third asks: how can policies be optimized when preference data is under-informative?

Direct Preference Optimization, or DPO, is popular because it turns preference alignment into a supervised-looking objective. Given a prompt, one response is chosen and another is rejected. The model is trained to prefer the chosen response over the rejected one. Convenient. Elegant. Slightly too confident.

The problem is that pairwise preference labels are relative. They say $y_w$ beat $y_l$. They do not necessarily say the winner was excellent, the loser was terrible, or the gap was large. A response can win by an inch. DPO may still treat it like a moral referendum.

The paper highlights a “squeezing effect”: DPO can increase the relative gap between chosen and rejected responses while reducing the log probability of both. In one illustrative example, the chosen response’s log probability falls from -14.3 to -121.5, while the rejected response falls from -43.4 to -443.2. The gap expands dramatically, but the chosen response itself becomes much less likely. Technically, the preference margin improved. Practically, this is the kind of victory one should not put in a quarterly report.
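The arithmetic behind that example is worth spelling out. The snippet below uses only the four log probabilities quoted above; the variable and function names are illustrative.

```python
# Log probabilities from the illustrative example, before and after training.
before = {"chosen": -14.3, "rejected": -43.4}
after = {"chosen": -121.5, "rejected": -443.2}


def margin(log_probs: dict) -> float:
    """Gap between chosen and rejected log probability."""
    return log_probs["chosen"] - log_probs["rejected"]


print("margin before:", round(margin(before), 1))
print("margin after :", round(margin(after), 1))
print("change in chosen log prob:", round(after["chosen"] - before["chosen"], 1))
```

The margin grows from about 29.1 to about 321.7, while the chosen response's own log probability drops by 107.2. A metric that tracks only the margin would report success.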

MaPPO, or Maximum a Posteriori Preference Optimization, adds prior reward knowledge to the preference objective. Instead of treating every chosen-rejected pair as equally decisive, it weights the rejected term according to the reward gap. If the winner is clearly superior, the method behaves more like DPO. If the winner only narrowly wins, the update is softened.

The objective modifies the DPO-style log-ratio comparison by introducing a reward-gap factor $\Delta r$:

$$ \mathcal{L}_{\text{MaP}}(\theta) = \mathbb{E} \left[ -\log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \Delta r \, \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right] $$

When the prior reward gap is maximal, MaPPO reduces to the DPO case. So the method is not replacing DPO with a philosophical essay about nuance. It is making DPO sensitive to information that many alignment pipelines already have: reward scores, verifier outputs, or some prior estimate of response quality.
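A per-pair version of the objective above can be sketched in a few lines. This is an illustration, not the authors' implementation: it assumes $\Delta r$ is scaled to $[0, 1]$ so that the maximal gap is 1, takes sequence log probabilities as plain floats, and uses a hypothetical default for $\beta$.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def map_style_pair_loss(
    logp_w: float, ref_logp_w: float,
    logp_l: float, ref_logp_l: float,
    delta_r: float, beta: float = 0.1,
) -> float:
    """Loss for one (chosen, rejected) pair with the rejected term scaled by delta_r.

    delta_r in [0, 1] is an assumed normalization of the prior reward gap.
    """
    chosen_term = beta * (logp_w - ref_logp_w)
    rejected_term = beta * (logp_l - ref_logp_l)
    return -log_sigmoid(chosen_term - delta_r * rejected_term)


def dpo_pair_loss(logp_w, ref_logp_w, logp_l, ref_logp_l, beta=0.1):
    """Standard DPO pair loss, recovered by setting delta_r = 1."""
    return map_style_pair_loss(logp_w, ref_logp_w, logp_l, ref_logp_l, 1.0, beta)


pair = (-10.0, -12.0, -20.0, -15.0)  # hypothetical log probabilities
print("delta_r=1.0:", map_style_pair_loss(*pair, delta_r=1.0))
print("delta_r=0.2:", map_style_pair_loss(*pair, delta_r=0.2))
```

Because the rejected log-ratio is multiplied by `delta_r`, a narrow win shrinks how much the loss rewards pushing the rejected response down, which is the softening described above.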

The main evidence comes from experiments across Qwen, Mistral, and Llama model families on instruction-following and preference benchmarks. The reported gains are not uniform across every metric, but several are large enough to deserve attention. On Qwen2.5-7B, DPO plus MaPPO improves AlpacaEval 2.0 from 32.01 to 38.24 and Arena-Hard from 45.5 to 59.2. On Mistral-7B, online I-DPO plus MaPPO improves AlpacaEval 2.0 from 17.11 to 33.28 and MT-Bench from 6.92 to 7.59.

The paper also tests MaPPO as a plug-in modification to several DPO variants, including SimPO, IPO, CPO, and I-DPO. That table is best read as an implementation compatibility extension, not as proof that MaPPO dominates every alignment method in every setting. The pattern is useful: the prior-weighting idea can be inserted into multiple preference objectives with no additional hyperparameter. The exact performance still depends on model, benchmark, and reward source.

The academic benchmark tests are also important, but they should be interpreted correctly. They are not the main proof that MaPPO aligns better. They are closer to robustness checks against the familiar alignment tax: improving preference behavior while accidentally damaging general capability. On IFEval, GPQA, MMLU, HellaSwag, TruthfulQA, and GSM8K, MaPPO generally maintains or improves performance relative to DPO-style baselines. For example, on Llama-3-8B, DPO plus MaPPO improves TruthfulQA from 51.5 to 58.2 and GSM8K from 75.5 to 79.5.

For businesses tuning LLMs, the implication is direct: pairwise preference data should not be treated as if every comparison carries the same strength. If a reward model, human rubric, or domain verifier can estimate magnitude, that signal can reduce overcorrection. This is especially relevant for customer support, financial analysis, legal drafting assistance, healthcare triage support, and other areas where “slightly better” and “clearly better” should not trigger identical updates.

The uncertainty is equally direct. MaPPO is only as useful as the prior reward signal is meaningful. If the prior reward model is biased, shallow, or miscalibrated, the method can faithfully incorporate bad judgment. Alignment math does not launder poor labels. It only applies them more efficiently.

Bottleneck four: privacy fails when agents cannot tell helpful from inappropriate

The final contribution, CI-RL, addresses a different kind of trust problem. It is not about whether an answer is preferred. It is about whether information should be disclosed in a given context.

This is where the paper becomes especially relevant for enterprise agents. In a simple access-control system, data is either allowed or not allowed. That works reasonably well for databases. It works less well for agentic workflows where the same piece of information may be appropriate in one context and inappropriate in another.

If an assistant is booking a spa appointment, sharing a customer’s name and treatment preference may be necessary. Sharing insurance details is not. If an agent is emailing a contractor, sharing delivery address may be appropriate. Sharing internal budget notes is not. The problem is not “never share private information.” The problem is “share what the task requires, and do not smuggle the rest of the user’s life into the output.”

CI-RL builds on contextual integrity: the idea that privacy norms depend on sender, receiver, subject, attribute type, and transmission principle. The method first uses a contextual-integrity chain-of-thought prompt, CI-CoT, that asks the model to reason about which pieces of information are necessary, helpful, optional, or inappropriate. Then it uses reinforcement learning to train models toward better disclosure behavior.

The training setup is intentionally structured. Each example contains a user task, required information that should be shared, and restricted information that should not be shared. The reward is rule-based: the model is rewarded for including required information and penalized for including restricted information, with a format penalty when the output does not follow the expected reasoning-answer structure.

That reward is simple. It is also auditable. In enterprise privacy work, that is not a trivial virtue. A beautiful black-box privacy reward model may be less attractive than a crude rule that compliance teams can actually inspect without summoning a priest.
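A reward of this shape might look like the sketch below. Everything specific in it is an assumption made for illustration: the `<think>`/`<answer>` tags, case-insensitive substring matching, scoring only the answer portion, and the numeric weights. The source describes the reward only at the level of required keywords, restricted keywords, and a format penalty.

```python
import re
from typing import Sequence

# Hypothetical reasoning-answer format; the real template may differ.
OUTPUT_PATTERN = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def disclosure_reward(
    output: str,
    required: Sequence[str],
    restricted: Sequence[str],
    format_penalty: float = -1.0,
) -> float:
    """Rule-based reward: share what is required, withhold what is restricted."""
    match = OUTPUT_PATTERN.fullmatch(output.strip())
    if match is None:
        return format_penalty  # output ignored the expected structure
    answer = match.group(2).lower()  # only the final answer is scored here
    required_hits = sum(k.lower() in answer for k in required)
    restricted_hits = sum(k.lower() in answer for k in restricted)
    required_score = required_hits / len(required) if required else 1.0
    restricted_score = restricted_hits / len(restricted) if restricted else 0.0
    return required_score - restricted_score


good = "<think>Name and treatment are needed.</think><answer>Booking for Alex: deep tissue massage.</answer>"
leaky = "<think>Share everything.</think><answer>Alex, deep tissue massage, insurance ID 123.</answer>"
print(disclosure_reward(good, ["Alex", "deep tissue"], ["insurance"]))
print(disclosure_reward(leaky, ["Alex", "deep tissue"], ["insurance"]))
```

The appeal is that every term can be read off by a reviewer. The cost is the one noted later in the article: keyword matching is coarse and will miss paraphrased disclosures.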

The synthetic test results are the main evidence. The paper evaluates three metrics:

Metric	Meaning
Integrity	The response excludes all restricted information
Utility	The response includes all required information
Complete	The response both includes all required information and excludes all restricted information
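The three definitions in the table translate into a short scoring routine. This sketch assumes keyword-style matching by case-insensitive substring and reports percentages; the `Example` type and the sample data are hypothetical, and the paper's evaluation code may match differently.

```python
from typing import Dict, NamedTuple, Sequence


class Example(NamedTuple):
    response: str
    required: Sequence[str]
    restricted: Sequence[str]


def score(examples: Sequence[Example]) -> Dict[str, float]:
    """Percentage of responses meeting each criterion from the table."""
    integrity = utility = complete = 0
    for ex in examples:
        text = ex.response.lower()
        no_leak = not any(k.lower() in text for k in ex.restricted)
        all_required = all(k.lower() in text for k in ex.required)
        integrity += no_leak
        utility += all_required
        complete += no_leak and all_required
    n = len(examples)
    return {
        "integrity": 100.0 * integrity / n,
        "utility": 100.0 * utility / n,
        "complete": 100.0 * complete / n,
    }


data = [
    Example("Alex, deep tissue", ["Alex", "deep tissue"], ["insurance"]),
    Example("Alex, deep tissue, insurance on file", ["Alex", "deep tissue"], ["insurance"]),
    Example("A customer wants a booking", ["Alex", "deep tissue"], ["insurance"]),
    Example("Someone with insurance called", ["Alex", "deep tissue"], ["insurance"]),
]
print(score(data))
```

Complete can never exceed the smaller of integrity and utility, which is why it is the hardest of the three numbers to move.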

Across several open-weight models, CI-RL improves the complete score substantially. For Mistral-7B-Instruct, integrity rises from 38.8 to 89.1, utility from 67.3 to 82.8, and complete correctness from 24.5 to 73.4. For Qwen2.5-7B-Instruct, integrity rises from 46.9 to 75.0 and complete correctness from 29.7 to 48.4. The paper notes that Qwen2.5-7B after CI-RL surpasses the baseline Qwen2.5-14B on integrity and complete correctness, which is a useful reminder that bigger models are not automatically better at respecting context. Shocking news: parameter count is not a privacy policy.

The comparison between standard instruction-tuned LLMs and distilled reasoning models is best read as an ablation and sanity check. Reasoning-model branding does not guarantee better contextual privacy. Some distilled reasoning models underperform their instruction-tuned counterparts, even after CI-RL. The authors suggest this may be because those distilled models are optimized for code and scientific reasoning rather than social-context reasoning. That interpretation is plausible, but the broader lesson is safer: do not assume a model trained to solve math problems will automatically understand disclosure norms.

The PrivacyLens evaluation functions as an external transfer test. It uses longer, more natural conversations and tool-use settings, judged by GPT-4o under leakage and helpfulness metrics. On open-weight models, CI-RL generally reduces adjusted leakage while preserving or improving helpfulness. For example, Qwen-14B-Instruct moves from a leakage rate of 52.9 to 33.9 and adjusted leakage from 51.2 to 34.4, while helpfulness shifts only from 2.37 to 2.30. Mistral-7B’s raw leakage rate is not uniformly better than CI-CoT alone, but CI-RL improves adjusted leakage and helpfulness relative to the prompted baseline. That distinction matters: raw leakage reduction without helpfulness can be achieved by refusing everything, a strategy beloved by no one except perhaps the risk committee’s sleep schedule.

For business use, CI-RL points toward a practical design pattern for enterprise agents:

1. Define what information is required for a task.
2. Define what information is restricted in that context.
3. Train or prompt the model to reason over the difference.
4. Evaluate both leakage and utility, not leakage alone.
5. Combine the model-level method with system-level access controls.

The last point is essential. CI-RL is not a replacement for access control, retrieval boundaries, data masking, audit logs, or policy engines. It is a behavioral layer. It helps the model decide what to say when information is already in context. That is valuable, but it should not be confused with preventing inappropriate retrieval in the first place.

The evidence table: what each test is doing

The dissertation contains theory, benchmark experiments, ablations, compatibility checks, and transfer tests. These are not interchangeable. A benchmark table is not always a deployment claim. An ablation is not a product roadmap. This distinction saves readers from the ancient academic-to-business translation error: treating every number as equally operational.

Evidence used in the paper	Likely purpose	What it supports	What it does not prove
FedNPG-ADMM vs FedNPG on MuJoCo rewards	Main evidence	Communication-efficient approximation can preserve comparable learning performance	Performance in real federated robotics, edge devices, or confidential enterprise environments
FedNPG-ADMM communication overhead comparison	Main evidence for scalability	Vector communication can cut overhead by orders of magnitude	End-to-end training cost under production networking and infrastructure constraints
Random partial-agent selection in FedNPG-ADMM	Robustness / sensitivity test	The method tolerates reduced participation in the tested setup with limited reward degradation	Full partial-participation theory or unreliable-client robustness
AFedPG vs synchronous FedPG in global time	Main evidence	Asynchrony plus delay-aware correction reduces wall-clock training time under heterogeneous speeds	Robustness to adversarial agents, poisoned updates, or device failures
AFedPG vs vanilla asynchronous FedPG	Ablation	Delay-adaptive lookahead is important, not optional decoration	That the specific correction is universally optimal
MaPPO on AlpacaEval, Arena-Hard, and MT-Bench	Main evidence	Prior reward knowledge can improve preference optimization across model families	Definitive human preference superiority in all domains
MaPPO plugged into SimPO, IPO, CPO, and I-DPO	Implementation extension / compatibility test	The prior-weighting idea can generalize across preference objectives	That every DPO variant will improve on every metric
Academic benchmarks after MaPPO	Robustness check	Preference tuning does not obviously destroy general ability and may improve it	Full safety, factuality, or domain-specific reliability
CI-RL synthetic contextual-integrity test	Main evidence	RL can improve the required-versus-restricted disclosure tradeoff	Real-world privacy compliance across messy enterprise workflows
LLM vs distilled reasoning-model comparison	Ablation / exploratory test	“Reasoning model” status does not guarantee contextual privacy	A complete ranking of model families for privacy-sensitive agents
PrivacyLens transfer evaluation	External benchmark / robustness test	CI-CoT and CI-RL can reduce leakage beyond the synthetic setting	Production readiness without policy customization and human evaluation

This table is more than housekeeping. It tells business readers where the claims are strongest. FedNPG-ADMM has a clean theoretical and communication story. AFedPG has a strong mechanism story around stragglers. MaPPO has broad empirical support across preference settings, but relies on reward priors. CI-RL has promising privacy-agent evidence, but still depends heavily on synthetic construction and rule-based rewards.

The enterprise reading: deployability is the product

The most useful business takeaway is not “reinforcement learning is back.” That sentence has been said often enough to qualify as a minor weather pattern.

The better takeaway is that RL becomes business-relevant when it is used to optimize behavior under constraints that ordinary supervised learning does not naturally handle.

For distributed systems, the constraints are infrastructural. Can the system learn without transmitting enormous matrices? Can it learn without waiting for the slowest participant? Can it use many agents while respecting the fact that agents live in different compute and data regimes?

For LLM agents, the constraints are behavioral. Can the model learn from preference signals without overreacting to thin pairwise labels? Can it disclose enough information to complete a task without leaking information that belongs to another context?

These are not abstract concerns. They map directly to operational categories.

Enterprise setting	Relevant bottleneck	Paper contribution that speaks to it	Practical question to ask
Edge robotics or distributed control	Communication cost	FedNPG-ADMM	Are we blocked by the size of policy-update messages, not just model quality?
Multi-site simulation or fleet learning	Straggler delays	AFedPG	Are slower environments dominating global training time?
LLM preference tuning	Pairwise-label overcorrection	MaPPO	Do we know how large the preference gap is, or only who won?
Enterprise RAG and workflow agents	Contextual disclosure	CI-RL	Can the agent distinguish required information from inappropriate leakage?
Privacy-sensitive assistants	Leakage-utility tradeoff	CI-CoT and CI-RL	Are we measuring helpfulness and leakage together?

Cognaptus’ inference is that deployability should be treated as an optimization target, not an afterthought. Too many AI projects treat training, alignment, privacy, and infrastructure as separate departments. The paper’s category-level contribution is to show that these are all policy constraints. They differ in form, but they share a practical structure: a policy must act well while satisfying the conditions of its environment.

That does not mean one RL framework should run the entire enterprise stack. Please do not build that slide. It means the design conversation should begin with the actual bottleneck.

If bandwidth is the bottleneck, do not celebrate a method that improves reward while doubling communication. If stragglers dominate training time, do not buy more GPUs before asking whether the synchronization protocol is the problem. If preference tuning behaves strangely, do not assume the pairwise labels are sufficiently informative. If privacy leakage appears in an agent workflow, do not measure safety only by refusal rate.

Where not to overread the dissertation

The dissertation is ambitious, but it is not claiming to solve enterprise AI deployment end to end.

First, the federated RL contributions are supported by theory and MuJoCo-style experiments. That is meaningful, but it is not the same as running across thousands of unstable devices, legacy systems, and compliance boundaries. The paper explicitly opens useful directions around partial participation and asynchronous extensions, but those are not already production guarantees.

Second, federated learning should not be confused with privacy preservation. FedNPG-ADMM reduces communication and avoids raw trajectory sharing, but formal privacy would require additional mechanisms such as differential privacy, secure aggregation, or other system-level protections.

Third, AFedPG handles stale updates from heterogeneous agents. It does not handle malicious workers, corrupted environments, or incentive conflicts. If the deployment environment includes adversarial participants, this method addresses only part of the problem.

Fourth, MaPPO depends on the quality of prior reward knowledge. In business domains, reward models can inherit organizational bias, weak rubrics, outdated policies, or evaluator inconsistency. Adding prior reward knowledge is useful when the prior is actually knowledge. When it is noise with a badge, the math will not save the meeting.

Fifth, CI-RL’s contextual privacy results are promising but still early. The training data is synthetic and semi-structured. The reward is keyword-based. That makes the method transparent, but also coarse. Real enterprise policies often involve exceptions, jurisdictional differences, relationship-specific rules, and evolving norms. A contextual-integrity agent for healthcare, finance, legal services, or internal HR workflows would need domain-specific policy extraction, human review, and continuous evaluation.

Finally, the four contributions are not one plug-and-play system. They are better understood as a map of methods around deployability. That is still valuable. A map is useful precisely because it does not pretend every road leads to the same parking lot.

The quiet message: RL is becoming a constraint language

The fashionable version of AI progress is model size, benchmark rank, and increasingly theatrical product demos. The less glamorous version is more important: systems must learn under bandwidth limits, timing heterogeneity, ambiguous preferences, and contextual privacy norms.

This dissertation belongs to the second category. Its value is not that it announces a single grand architecture. Its value is that it treats reinforcement learning as a language for constraints.

FedNPG-ADMM says: if second-order RL is too expensive to communicate, optimize the update direction differently.

AFedPG says: if distributed agents run at uneven speeds, do not let the slowest participant define time for everyone.

MaPPO says: if preference labels lose magnitude, put prior reward knowledge back into the objective.

CI-RL says: if privacy depends on context, train the model to reason about disclosure rather than merely reciting safety slogans.

For enterprises, this is the useful mental model. Do not ask whether RL is “the future.” That question is large, vague, and therefore perfect for conference panels. Ask where your policy is failing to deploy: communication, synchronization, preference calibration, or contextual disclosure.

Then choose the method that attacks that constraint.

That is less glamorous than saying agents will transform everything. It is also more likely to survive contact with an actual system.

Cognaptus: Automate the Present, Incubate the Future.

¹ Guangchen Lan, “Reinforcement Learning for Scalable and Trustworthy Intelligent Systems,” arXiv:2605.08378, 2026.
