TL;DR for operators

Most enterprise AI failures do not come from the model being “too small”. They come from the system around the model being too vague.

A model gives an answer. The workflow accepts it. Nobody knows whether the reasoning path was valid, whether the data path was stale, whether the tool should have been called, or whether the whole process should be redesigned after repeated mistakes. Then someone asks why the AI confidently did something expensive. Excellent. We have automated the intern, but forgot to hire the supervisor.

Two recent papers point to a more useful direction. One paper, Learning to Be A Doctor, treats a medical diagnostic agent as an evolvable graph: nodes can be added, prompts modified, branches inserted, loops constrained, and tool use reorganised based on diagnostic errors.1 The other, Kimina-Prover Preview, trains a formal theorem prover so that reasoning itself becomes part of the learned policy, guided by Lean 4 compiler verification and reinforcement learning.2

The shared lesson is not “AI doctors are ready” or “math bots will run the fund”. Please do not give the machine a Bloomberg terminal and a white coat just yet.

The useful lesson is narrower and more important: high-stakes AI systems need structured reasoning spaces, feedback signals, and revision mechanisms. Some of that structure lives outside the model, in the workflow. Some of it can be trained into the model, when the domain provides strong enough verification. Businesses that understand the difference will build systems that improve. Businesses that ignore it will build expensive autocomplete with a dashboard.

Why this matters now

The current AI deployment pattern is painfully familiar. A company starts with a promising demo, wraps it in a workflow, adds a few tools, and calls it an agent. The early results look magical because the demo cases were friendly. Then real operations arrive: missing data, contradictory documents, unusual customers, edge-case regulations, silent tool failures, changing market regimes, and human users who insist on asking questions the workflow designer did not anticipate. Rude, but realistic.

In quantitative investment, this problem appears every day. A signal pipeline that worked in one volatility regime fails in another. A portfolio optimiser quietly inherits stale assumptions. A crypto strategy treats a news spike as sentiment when it is actually liquidity panic wearing a fake moustache. A research assistant produces a plausible market narrative but cannot distinguish a causal argument from a bedtime story with tickers.

The problem is not just prediction. It is process.

A serious AI system needs to answer four questions:

Operating question Why it matters
What reasoning path did the system follow? Without traceability, decisions cannot be audited or improved.
Which workflow components contributed to the answer? Without observability, failures cannot be localised.
What feedback tells the system it was wrong? Without feedback, “learning” becomes decorative theatre.
What is allowed to change after failure? Without bounded adaptation, self-improvement becomes self-vandalism.

The two papers in this cluster sit at different layers of that stack. The medical-agent paper works on the outer layer: the architecture around the model. Kimina-Prover works on the inner layer: the model’s reasoning behaviour under formal verification. Together, they suggest a practical design principle for vertical AI: make reasoning optimisable, but keep the optimisation bounded, measured, and inspectable.

The outer layer: workflows that can evolve

Learning to Be A Doctor starts from a simple observation: many medical agent systems are manually designed. A team creates a workflow, assigns roles, defines prompts, links tools, and hopes the same design works across diverse diagnostic cases. That is already optimistic in ordinary software. In medicine, it is bordering on comedy with legal consequences.

The paper proposes an automated framework for designing medical multi-agent workflows. It represents the agent system as a graph. Nodes can be basic LLM nodes or tool nodes. Edges define execution order. The search space operates at three levels:

Level What changes Business translation
Node level Add nodes, remove nodes, modify node prompts Add a new analyst, retire a redundant check, rewrite an instruction
Structural level Add branches, loops, and parallel paths Route difficult cases differently, retry under constraints, compare independent views
Framework level Compose patterns resembling CoT, Reflexion, Round Table, and other collaboration styles Select the operating pattern appropriate to the task, rather than hard-coding one fashionable ritual

The key is not that the system “has agents”. Everyone has agents now, at least according to pitch decks. The key is that the workflow can change based on observed failure.

The system classifies errors into image-understanding errors and diagnostic errors. It then generates structural or prompt-level suggestions. Structural changes are validated to avoid broken graphs, duplicate edges, and infinite loops. Prompt changes are easier to apply because they do not alter execution topology. This distinction matters: changing what a node says is not the same as changing how the institution works.

The experiments use dermatology image classification datasets and compare evolved workflows against direct input-output prompting, chain-of-thought prompting, and a round-table multi-agent baseline. Across GPT-4o, GPT-4o-mini, and Claude 3.5 Sonnet, the evolved workflow reports stronger top-k diagnostic accuracy than the baselines. On the Skin Conditions dataset, for example, the GPT-4o version reports 90.83% Top-1 accuracy for the evolved workflow versus 50.83% for direct input-output prompting. On Skin Concepts, the corresponding Top-1 numbers are 29.28% versus 20.27%, a much more modest result and a useful reminder that hard datasets remain hard.

The paper also exposes the cost. The workflow becomes more complex. The reported final testing workflow contains 11 nodes and 5 branches, compared with training-phase averages of 8.7 nodes and 4 branches. Test-time execution also consumes hundreds of thousands of tokens. In other words, adaptation is not free. The machine learns to be more careful, and then sends you the bill.

For business systems, that trade-off is familiar. A risk committee improves governance by adding more checks, but every check adds time and cost. The useful question is not whether more structure is good. It is where structure pays for itself.

The inner layer: reasoning that can be trained against verification

Kimina-Prover approaches the same general reliability problem from the opposite direction. Instead of redesigning the workflow around a model, it trains the model to reason more effectively inside a formal environment.

The domain is Lean 4 theorem proving. That matters because Lean provides something most business environments do not: a hard verifier. A generated proof either compiles and closes the theorem, or it does not. The feedback signal is harsh, binary, and wonderfully indifferent to vibes.

Kimina-Prover is built from Qwen2.5-72B and trained through a pipeline involving autoformalization, supervised cold start, reinforcement learning, and a specific “formal reasoning pattern”. During training, the model learns to mix informal mathematical reasoning with Lean snippets before producing the final formal proof. The Lean compiler then verifies the generated proof and supplies the reward signal for reinforcement learning.

That setup changes the nature of search.

Earlier theorem provers often relied heavily on external search algorithms such as best-first search or Monte Carlo tree search. Kimina-Prover still samples multiple attempts at inference time, but the paper’s central claim is that more of the exploration is internalised into the model’s own reasoning process. The model learns not only to emit proof steps, but to plan, decompose, reflect, and refine.

The reported benchmark results are strong. On miniF2F-test, Kimina-Prover Preview reports 52.94% pass@1, 68.85% pass@32, 77.87% pass@1024, and 80.74% pass@8192. The paper also reports clear performance scaling with model size across 1.5B, 7B, and 72B variants, which the authors highlight as unusual for neural theorem proving.

The deeper business lesson is not the benchmark score. The lesson is the reward architecture.

Kimina-Prover works because the system can ask a precise question: did this proof pass formal verification? That is a very different situation from asking whether a portfolio commentary is “good”, whether a sales email is “persuasive”, or whether a loan recommendation is “reasonable”. Those domains have feedback, but not always clean feedback. Sometimes the feedback arrives late. Sometimes it is noisy. Sometimes the market rewards a bad process and punishes a good one, because markets enjoy character assassination.

So the business interpretation must be careful. Kimina-Prover does not prove that any company can train a self-improving reasoning model merely by adding reinforcement learning. It shows what becomes possible when a domain has a strong verifier and the training pipeline is built around it.

These papers are not making the same claim in two domains. They occupy different parts of a logic chain.

Step Paper contribution Larger implication
1. Fixed workflows fail under complex reasoning demands The medical-agent paper shows static diagnostic workflows can be replaced by evolvable graph architectures. AI reliability requires process adaptation, not just better prompting.
2. Reasoning paths need feedback The medical system uses diagnostic errors to modify workflow structure and prompts. Error analysis becomes an operating input, not a post-mortem document nobody reads.
3. Some domains allow hard verification Kimina-Prover uses Lean compiler feedback as a reward signal. When verification is strong, reasoning quality can be trained more directly.
4. Search can move inward or outward Medical diagnosis searches over workflows; theorem proving trains search-like behaviour into the model. The right architecture depends on where feedback is strongest and safest.
5. Business systems need layered control Quant and enterprise systems rarely have Lean-style certainty. Use external workflows, audits, simulations, expert review, and partial verifiers to approximate disciplined feedback.

The relationship is complementary. One paper says: design the process so it can improve. The other says: train the reasoning so it can improve when verification is available. The combined message is stronger than either paper alone: AI systems become more useful when they are built as feedback-shaped reasoning machines, not one-shot answer engines.

What this means for quant AI

The existing Cognaptus angle for this article is quantitative AI, and that is still the right business lens. Quant systems already combine models, tools, data pipelines, evaluation metrics, and governance procedures. They are agentic workflows before the word “agent” gets added to the investor deck in twelve-point font.

But neither of these papers tests a trading system. That boundary matters. We are borrowing design lessons, not pretending dermatology and Lean proofs secretly backtested Bitcoin.

A quant AI system inspired by these papers would have two layers.

First, it would have an adaptive workflow layer. A research agent might contain nodes for data validation, regime detection, factor selection, risk modelling, portfolio construction, execution constraints, and audit reporting. If the system repeatedly fails under high-volatility conditions, it should not merely produce a more apologetic explanation. It should be able to propose a bounded workflow modification: add a volatility-regime branch, insert a data-quality gate, require independent confirmation from a second signal family, or route the case to human review.

Second, it would have a reasoning-evaluation layer. In formal theorem proving, Lean verifies the proof. In quant finance, there is no equivalent universal judge. But there are partial judges:

Quant component Possible feedback signal Limitation
Data ingestion Missing values, stale timestamps, schema violations, vendor disagreement Passing data checks does not mean the signal is useful.
Signal generation Walk-forward performance, turnover, decay, stability across regimes Backtests can overfit with heroic enthusiasm.
Risk modelling Stress tests, VaR breaches, drawdown behaviour, exposure drift Risk models fail exactly when reality gets creative.
Portfolio construction Constraint satisfaction, concentration limits, liquidity fit Constraint compliance is not investment quality.
Execution Slippage, fill quality, latency, market impact Good execution cannot rescue a bad thesis.
Research explanation Citation checks, calculation reproducibility, logic consistency Plausibility is not causality.

This is where many AI-finance pitches go wrong. They imply that a model can “learn from the market” as if the market is a friendly teacher with neat answer keys. It is not. The market is a noisy adversarial examiner that sometimes rewards nonsense before publicly humiliating it.

So the lesson from Kimina-Prover must be adapted carefully. Use hard verification where available: code compilation, accounting identities, policy constraints, mathematical consistency, portfolio limits. Use softer evaluation where necessary: expert review, simulation, scenario analysis, consensus across models, and delayed outcome tracking. Do not confuse the two.

A practical architecture for operators

A business-ready vertical AI system should separate four functions that are often lazily bundled together.

Layer Purpose Failure if ignored
Reasoning layer Produces plans, explanations, hypotheses, and decisions The system becomes an answer generator with no inspectable logic.
Workflow layer Routes tasks through tools, checks, branches, loops, and human gates The process becomes brittle and hard to adapt.
Feedback layer Measures correctness, usefulness, constraint compliance, and failure modes The system cannot improve except by anecdote.
Governance layer Defines what may change automatically and what requires approval Self-improvement becomes uncontrolled operational drift.

The medical-agent paper is strongest on the workflow layer. Kimina-Prover is strongest on the reasoning and feedback layers. Business deployment needs all four.

For a quant research platform, this might look like:

  1. A reasoning model generates hypotheses, factor explanations, and strategy variants.
  2. A graph workflow controls data retrieval, cleaning, modelling, backtesting, risk checks, and report generation.
  3. A feedback engine records which hypotheses survive out-of-sample tests, which fail basic data checks, and which trigger policy violations.
  4. A governance controller decides which workflow changes can be applied automatically, which require analyst approval, and which are forbidden.

This is not glamorous. It sounds suspiciously like operations engineering. That is because useful AI often becomes operations engineering once the demo music stops.

The important tension: verification is unevenly distributed

The biggest difference between the two papers is the quality of feedback.

In Lean theorem proving, correctness is formally verifiable. That allows reinforcement learning to use a binary reward signal: proof accepted or proof rejected. In medical diagnosis, feedback is more contextual. The paper uses diagnostic labels and evaluation metrics, but real clinical deployment would involve uncertainty, patient history, comorbidities, specialist judgement, and safety protocols. In finance, feedback is even messier: returns are delayed, confounded, non-stationary, and occasionally cruel for sport.

This creates a hierarchy of confidence:

Domain type Verification quality AI design implication
Formal domains: theorem proving, code compilation, symbolic constraints Strong and immediate Train or optimise directly against verifiers where feasible.
Rule-heavy domains: compliance, accounting checks, policy workflows Partial but useful Combine hard rule checks with human review.
Empirical domains: medicine, credit, operations, finance Noisy and delayed Use structured workflows, audits, validation sets, simulations, and expert escalation.
Open-ended judgement: strategy, negotiation, leadership advice Weak and subjective Treat AI as decision support, not an autonomous judge.

This is why the article’s central claim is not “let AI self-improve”. That phrase is too broad to be useful and too cheerful to be trusted.

The better claim is: let AI systems improve only inside bounded search spaces, against explicit feedback, with visible change logs and human-governed limits.

What the papers show, and what they do not

The papers show that structured adaptation can improve performance in demanding reasoning settings. The medical-agent paper shows that workflow evolution can outperform static prompting baselines in dermatology image diagnosis benchmarks. Kimina-Prover shows that a formal reasoning model trained with Lean verification and reinforcement learning can achieve strong theorem-proving performance and benefit from larger model scale and test-time sampling.

Those are the paper claims.

The business interpretation is broader but more conditional: similar design principles can help enterprise AI systems become more reliable, especially where workflows are complex and feedback can be captured. That does not mean a medical workflow can be dropped into a hedge fund, or that Lean-style proof verification exists for every business decision. It means operators should stop treating the prompt as the product. The product is the entire adaptive reasoning system.

Operator checklist

Before deploying a vertical AI agent into a serious workflow, ask:

  • What is the system allowed to change: prompts, nodes, tools, routing logic, thresholds, or model weights?
  • Which changes require human approval?
  • What feedback signal tells the system it failed?
  • Is that feedback immediate, delayed, noisy, or formally verifiable?
  • Can every workflow modification be logged and rolled back?
  • Can the system distinguish data errors from reasoning errors?
  • Are evaluation metrics separated from business incentives?
  • Does added complexity improve outcomes enough to justify cost and latency?

If those questions sound bureaucratic, good. Bureaucracy is what we call safety engineering when it wears a cheaper suit.

Final thought: formation beats improvisation

The next generation of AI agents will not win because they improvise more fluently. They will win because their reasoning is formed: structured, tested, revised, constrained, and connected to feedback.

For quant AI, that means moving beyond the fantasy of a single omniscient model reading markets like tea leaves. The stronger architecture is layered: a reasoning model trained or prompted to expose its logic, a workflow graph that can adapt under constraints, a feedback layer that distinguishes useful signals from lucky noise, and a governance layer that prevents the system from “learning” itself into a regulatory incident.

Fine-tuning matters. Fine-structure matters more than people admit. The model may be the brain, but the workflow is the institution. And in high-stakes domains, institutions beat clever interns.

Cognaptus: Automate the Present, Incubate the Future.


  1. Yangyang Zhuang, Wenjia Jiang, Jiayu Zhang, Ze Yang, Joey Tianyi Zhou, and Chi Zhang, “Learning to Be A Doctor: Searching for Effective Medical Agent Architectures,” arXiv:2504.11301, 2025, https://arxiv.org/abs/2504.11301↩︎

  2. Numina & Kimi Team, “Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning,” arXiv:2504.11354, 2025, https://arxiv.org/abs/2504.11354↩︎