TL;DR for operators
A recent case study from If P&C Insurance is useful because it does something most AI automation stories conveniently skip: it follows the work after the model is deployed.[1] The company used an LLM to identify specialised claim parts in insurance claims, a task that had depended on human claim handlers and specialist knowledge. In offline evaluation, the fifth model iteration built around GPT-4o-0806 reached 81% recall in English, above the company’s 70% human baseline. That sounds like the usual “AI beats humans” headline. Mercifully, the paper is more interesting than that.
In production, over five months, the AI scanned 3743 claims and identified 1034 claim parts. Claim handlers identified only 68. The authors report this as a 1420% scaling of the identification process. That is the bottleneck turning into bottlenectar: suddenly the process produces far more “things worth looking at”.
But the nectar still needs someone to bottle it. Downstream investigators created only 23 investigation cases from AI-identified claim parts and 21 from human-identified ones. Among the 26 investigated claim parts in the combined analysis, AI missed 3 that humans caught, while humans missed 5 that AI caught. The uncomfortable business result is that AI did not automatically create end-to-end efficiency. It shifted the constraint.
The operational lesson is direct: do not evaluate AI workflow projects only at the model layer. Evaluate where the volume goes. A model can remove the visible bottleneck while exposing the next one, which is not failure but is also not ROI. It is a process redesign bill arriving with better packaging.
The real story starts after the model works
Insurance claims are an unusually honest setting for enterprise AI because the work is both textual and consequential. A claim description and its notes contain the clues. The business needs to identify certain claim parts that require special handling. The paper cannot disclose the specific business terms, which is mildly annoying but understandable. Insurance companies do enjoy their confidentiality, almost as much as they enjoy forms.
The original process relied on claim handlers. That was workable when volume was low or when the knowledge needed was widely distributed. It became less workable when identifying these claim parts required specialised expertise across many process variations. The insurer operates across Nordic and Baltic markets, three business areas, and 77 lines of business, but this case focuses on one specific line of business. The business goal was not to replace the whole claims operation. It was narrower: automate the identification of claim parts that might deserve further investigation.
That scope matters. The system did not directly approve, reject, or settle claims. It generated leads for investigators. Humans stayed in the loop. This is exactly the type of deployment that many organisations will find plausible: automate the detection layer, preserve human judgement at the decision layer, and hope the organisation becomes faster without accidentally setting itself on fire.
The paper’s contribution is that it tracks whether this actually happened.
The model target was recall, because missing cases mattered more than extra work
The AI component followed a structured development path using CRISP-DM: business understanding, data understanding, data preparation, modelling, evaluation, and deployment. The team used manually labelled examples from claim part investigators as ground truth. The model consumed claim descriptions and notes, with personally identifiable information masked before inference. The implementation used Azure OpenAI in Sweden to meet internal data residency and privacy requirements.
The authors tested multiple approaches. Early proof-of-concept experiments with Microsoft Phi-3 and Mistral Large did not perform adequately on Finnish-language inputs and prompts. The production direction eventually settled on GPT-4o-0806, with chain-of-thought-style prompting, few-shot examples, and increasingly structured output controls.
The evaluation metric was recall. This was not a decorative metric choice. In this workflow, a false negative means a relevant claim part is missed, which can have financial consequences. A false positive creates additional validation work, but investigators can filter it. In normal benchmark theatre, people often worship F1-score because it gives the pleasant feeling of balance. In this business setting, balance was not the priority. Catching the relevant cases was.
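To make the metric choice concrete, here is a minimal sketch of how precision, recall, and F1 fall out of confusion counts. The counts are invented for illustration (the paper does not publish its confusion matrices), and the names `Confusion`, `cautious`, and `eager` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion counts for a binary 'relevant claim part' classifier."""
    tp: int  # relevant claim parts the model flagged
    fp: int  # irrelevant claim parts the model flagged (extra validation work)
    fn: int  # relevant claim parts the model missed (the costly error here)

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def f1(self) -> float:
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r)


# Hypothetical counts, NOT from the paper: each model faces 100 relevant claim parts.
cautious = Confusion(tp=60, fp=5, fn=40)   # few false alarms, many misses
eager = Confusion(tp=80, fp=20, fn=20)     # more false alarms, fewer misses

for name, m in [("cautious", cautious), ("eager", eager)]:
    print(f"{name}: precision={m.precision:.2f} recall={m.recall:.2f} f1={m.f1:.2f}")
```

In this toy comparison the cautious model wins on precision and the eager model wins on recall. When a miss costs more than a false alarm, as the paper argues for this workflow, the eager model is the better fit even though it generates more validation work.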
The model iterations tell a useful implementation story:
| Model iteration | Language | Accuracy | Precision | Recall | F1-score | Interpretation |
|---|---|---|---|---|---|---|
| v1 | English | 0.76 | 0.83 | 0.67 | 0.74 | Below the 70% human recall baseline |
| v1 | Finnish | 0.77 | 0.86 | 0.67 | 0.75 | Also below baseline |
| v2 | English | 0.77 | 0.92 | 0.61 | 0.73 | Higher precision, weaker recall |
| v2 | Finnish | 0.76 | 0.85 | 0.64 | 0.73 | Still below baseline |
| v3 | English | 0.70 | 0.80 | 0.56 | 0.66 | Poor recall for the business goal |
| v3 | Finnish | 0.77 | 0.81 | 0.72 | 0.76 | First clear pass over baseline |
| v4 | English | 0.71 | 0.79 | 0.61 | 0.69 | Below baseline |
| v4 | Finnish | 0.74 | 0.76 | 0.72 | 0.74 | Above baseline |
| v5 | English | 0.80 | 0.81 | 0.81 | 0.81 | Best reported recall |
| v5 | Finnish | 0.77 | 0.83 | 0.69 | 0.76 | Just below baseline |
Two details are worth not missing. First, the best result was not simply “use a bigger model”. The paper links the improvement to richer instructions, metadata considerations such as insurance event timing, specialised prompts for specific claim parts, and OpenAI Structured Outputs to force strict schema adherence. In other words, part of the performance came from turning a language model into a disciplined component in a business system.
Second, the English version of the fifth iteration achieved the strongest recall, even though the workflow involved Finnish and English processing. The paper says the English version translated inputs and combined them with English instructions. This is an implementation finding, not a universal law about Finnish, English, or multilingual LLMs. Please do not convert one insurance workflow into a LinkedIn prophecy about language strategy. We have enough of those already.
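On the first point, schema discipline can be illustrated without touching any vendor API. The sketch below is not the paper's implementation and does not call OpenAI Structured Outputs, which constrains the output at generation time; it only shows the kind of strict contract a downstream system might enforce on a model response. The field names, the label set, and `parse_prediction` are all hypothetical, since the real claim-part categories are confidential.

```python
import json
from dataclasses import dataclass

# Hypothetical label set: the paper does not disclose the real categories.
ALLOWED_LABELS = {"category_a", "category_b", "none"}
REQUIRED_KEYS = {"claim_id", "label", "rationale"}


@dataclass(frozen=True)
class ClaimPartPrediction:
    claim_id: str
    label: str
    rationale: str


def parse_prediction(raw: str) -> ClaimPartPrediction:
    """Parse a model response, rejecting anything outside the agreed schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON text
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        raise ValueError(f"unexpected shape or keys: {data!r}")
    if not all(isinstance(data[key], str) for key in REQUIRED_KEYS):
        raise ValueError("all fields must be strings")
    if data["label"] not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {data['label']!r}")
    return ClaimPartPrediction(**data)


ok = parse_prediction(
    '{"claim_id": "C-1", "label": "category_a", "rationale": "event timing"}'
)
print(ok.label)

try:
    parse_prediction("The claim probably contains a relevant part.")
except ValueError as exc:
    print("rejected:", type(exc).__name__)
```

A check like this catches malformed or free-text responses, but it says nothing about whether the chosen label is right. Schema compliance and semantic correctness are separate problems.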
The production deployment changed the volume, not the whole system
The model was integrated and deployed in production on 25 September 2024. The evaluation covers five months in which the AI and claim handlers operated in parallel. That parallel design is important: claim handlers continued to work as before, allowing the authors to compare the traditional and AI-enhanced variants in live operation rather than relying only on offline model scores.
The headline operational numbers are stark:
| Process measure over five months | Human claim handlers | AI workflow | Business interpretation |
|---|---|---|---|
| Registered claims in the analysed line of business | 3743 | 3743 scanned | Same claim population |
| Claim parts identified | 68 | 1034 | AI massively expanded detection volume |
| Share of claims with identified claim parts | 1.82% | 27.62% | Business experts considered the AI ratio closer to expectation |
| Investigation cases created from identified claim parts | 21 | 23 | Downstream investigation did not scale proportionally |
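The shares in the table, and the scaling figure derived from them, can be reproduced from the three reported counts. The sketch below assumes the 1420% figure is a relative increase of the AI share over the human share; that reading is an assumption, although it agrees with the reported number up to rounding.

```python
claims_scanned = 3743
human_identified = 68
ai_identified = 1034

human_share = human_identified / claims_scanned  # share of claims flagged by handlers
ai_share = ai_identified / claims_scanned        # share of claims flagged by the AI

ratio = ai_share / human_share                   # AI share as a multiple of the human share
relative_increase = (ratio - 1) * 100            # percentage increase over the human share

print(f"human share: {human_share:.2%}")
print(f"AI share:    {ai_share:.2%}")
print(f"ratio:       {ratio:.1f}x")
print(f"increase:    {relative_increase:.1f}%")
```

Put differently, the AI flagged roughly 15 times as many claim parts as the handlers did, which is the same fact as a 1420% increase expressed less dramatically.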
The most tempting reading is that the AI made the process 1420% better. The paper is more precise: comparing the AI-identified claim part ratio with the human-identified ratio, the identification process scaled by 1420%. That is not the same as saying the whole claims management process became 1420% more efficient.
This distinction is the article’s main meal. Identification scaled. Investigation did not. The system produced many more candidate claim parts, but the number of investigator-created cases remained nearly flat. Stakeholder discussion traced the constraint to limited investigator capacity. The number of investigators was sufficient for the traditional process because investigators did not examine every identified item; they selected cases based on business criteria. Once AI increased the pool of identified claim parts, the limiting factor moved downstream.
That is not an anti-AI finding. It is a useful AI finding. The automation exposed the organisation’s true capacity boundary.
Object-centric process mining did the boring thing that mattered
The paper’s second major contribution is the use of object-centric process mining, or OCPM, to evaluate the transition. This sounds like a specialist detail, and technically it is. Operationally, it answers a simple question: what actually happened to the work after AI entered the process?
Traditional process mining often flattens reality into one case type. That can be useful, but business processes rarely behave so politely. In this case, the relevant objects included customers, claims, claim notes, claim parts, AI models, and employees, with employees further distinguished as claim handlers and claim part investigators. The activities included registering a claim, creating a note, reporting a claim part, scanning a claim, predicting a claim part, and creating a claim part investigation.
A flat event log can easily obscure these relationships. It may show that an investigation happened, but not whether it followed a human report, an AI prediction, both, or neither in a temporally clean way. The authors used OCEL 2.0 because it preserves multiple object types and their relationships, allowing drill-down, unfolding, filtering, and reuse of the same extracted event log for multiple questions.
This matters because the evaluation questions were relational, not just numerical:
| Evaluation question | Why a simple count is not enough |
|---|---|
| How did claim handlers identify claim parts? | Human activities had to be separated from AI scan and prediction events |
| How did AI identify claim parts? | Investigation cases had to be attributed carefully because some were created before AI scanned the claim |
| Which investigated cases did AI miss but humans caught? | The analysis needed to separate human-only from AI-involved paths |
| Which investigated cases did humans miss but AI caught? | The analysis needed to identify cases where AI contributed without human reporting |
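The attribution questions in the table can be sketched with a toy object-centric log. This is a simplified in-memory analogy, not the OCEL 2.0 format or the authors' tooling: the activity and object type names follow the ones listed earlier, while the `Event` class, the identifiers, the timestamps, and the `attribute` function are hypothetical.

```python
from dataclasses import dataclass

REPORT = "Report claim part"
PREDICT = "Predict claim part"
INVESTIGATE = "Create claim part investigation"


@dataclass
class Event:
    time: int       # simplified timestamp; larger means later
    activity: str
    objects: dict   # object type -> list of related object ids


# Invented events for two claims; ids and ordering are illustrative only.
events = [
    Event(1, "Register claim", {"Claim": ["c1"], "Customer": ["u1"]}),
    Event(2, REPORT, {"Claim": ["c1"], "ClaimPart": ["p1"], "ClaimHandler": ["h1"]}),
    Event(3, "Scan claim", {"Claim": ["c1"], "AIModel": ["m1"]}),
    Event(4, PREDICT, {"Claim": ["c1"], "ClaimPart": ["p1"], "AIModel": ["m1"]}),
    Event(5, INVESTIGATE, {"Claim": ["c1"], "ClaimPart": ["p1"], "Investigator": ["i1"]}),
    Event(6, "Register claim", {"Claim": ["c2"], "Customer": ["u2"]}),
    Event(7, INVESTIGATE, {"Claim": ["c2"], "ClaimPart": ["p2"], "Investigator": ["i1"]}),
    Event(8, "Scan claim", {"Claim": ["c2"], "AIModel": ["m1"]}),
    Event(9, PREDICT, {"Claim": ["c2"], "ClaimPart": ["p2"], "AIModel": ["m1"]}),
]


def attribute(events, claim_part_id):
    """Say who flagged a claim part BEFORE its first investigation was created."""
    related = [e for e in events if claim_part_id in e.objects.get("ClaimPart", [])]
    investigation_times = [e.time for e in related if e.activity == INVESTIGATE]
    if not investigation_times:
        return "not investigated"
    cutoff = min(investigation_times)
    earlier = {e.activity for e in related if e.time < cutoff}
    human, ai = REPORT in earlier, PREDICT in earlier
    if human and ai:
        return "both"
    if human:
        return "human only"
    if ai:
        return "AI only"
    return "neither"


for part in ("p1", "p2"):
    print(part, "->", attribute(events, part))
```

Claim part `p2` is the awkward case from the table: the AI did predict it, but only after the investigation already existed, so a plain count of "AI-predicted and investigated" would credit the AI with a case it did not trigger.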
This is the kind of evaluation many AI programmes lack. They run an offline accuracy test, deploy the model, and then treat operational impact as a matter of faith, dashboards, and quarterly vibes. Here, the authors asked whether the process path changed.
The answer was yes, but not in the naïve way.
The uncomfortable finding: the better detector created a downstream queue
The paper reports that among the 26 investigated claim parts examined in the combined analysis, AI missed 3 cases that claim handlers caught. Conversely, AI identified 5 cases that claim handlers missed. AI and claim handlers overlapped on 18 investigated cases.
Those numbers are small, so they should not be inflated into a grand theory of human-machine complementarity. Their purpose is more specific: they show that the AI and human paths were not identical. The AI added detection coverage. Humans still contributed cases that AI missed. The process was not a clean replacement story; it was a messy coexistence story, otherwise known as actual enterprise adoption.
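The overlap arithmetic is easy to check with sets. The identifiers below are placeholders, because the paper reports only the counts; the point is that 18 shared, 3 human-only, and 5 AI-only cases are consistent with the 21 and 23 investigation cases reported for the human and AI paths.

```python
# Placeholder identifiers: only the counts (18, 3, 5) come from the paper.
both = {f"case-{i}" for i in range(18)}
human_only = {f"case-{i}" for i in range(18, 21)}
ai_only = {f"case-{i}" for i in range(21, 26)}

human_path = both | human_only   # investigated cases with a human report
ai_path = both | ai_only         # investigated cases with an AI prediction

print("human path:", len(human_path))
print("AI path:", len(ai_path))
print("combined:", len(human_path | ai_path))
print("AI missed, humans caught:", len(human_path - ai_path))
print("humans missed, AI caught:", len(ai_path - human_path))
```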
The larger finding comes from the mismatch between identification and investigation. The AI identified 1034 claim parts. Investigators created 23 cases from AI-identified claim parts. Human claim handlers identified 68 claim parts. Investigators created 21 cases from those. The identification layer changed dramatically; the investigation layer barely moved.
That is the operational warning. If a company automates the intake or detection step without changing downstream capacity, it may create a more accurate backlog. That is still progress, but it is not the same as realised value.
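A rough way to see the mismatch is to compute how much of each identification stream turned into investigation cases. The counts come from the reported figures; the function name `conversion_rate` is hypothetical, and the resulting percentages are derived here rather than quoted from the paper.

```python
def conversion_rate(identified: int, investigated: int) -> float:
    """Share of identified claim parts that became investigation cases."""
    return investigated / identified


human_rate = conversion_rate(identified=68, investigated=21)
ai_rate = conversion_rate(identified=1034, investigated=23)

print(f"human path: {human_rate:.1%} of identified claim parts investigated")
print(f"AI path:    {ai_rate:.1%} of identified claim parts investigated")
```

The low AI-path rate should not be read as a precision estimate. Investigators select cases by business criteria and within limited capacity, and 18 of the investigated cases sit on both paths, so the figure mostly describes how little of the new volume the downstream step could absorb.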
The business response in the paper is sensible: stakeholders recognised the need to employ more investigators or otherwise scale the claim part management process. The company also saw enough value to plan extension of AI usage to other lines of business. This is the right interpretation. The model did not fail because it created more work. It succeeded at its assigned task and revealed the next management problem.
AI often does this. It makes hidden scarcity visible. Then someone has to manage it. Cruel, but educational.
The paper’s evidence is a production case, not a universal benchmark
The case gives useful evidence, but it is not a broad causal benchmark. It covers one company, one insurance sub-process, one line of business, and confidential claim-part categories. The five-month production window is meaningful, but not long enough to answer every question about seasonality, drift, investigator adaptation, or long-term financial outcomes.
The study also uses stakeholder validation and source-system checks for the process findings. That is appropriate for a production case, but it is not the same as a randomised controlled trial. The paper shows a credible operational mechanism: model improves detection, process mining shows volume movement, downstream investigation becomes the constraint. It does not prove that every insurer, every claim category, or every LLM workflow will behave the same way.
The model evaluation table should be read similarly. The 81% recall result is important because it exceeded the internal human baseline and unlocked deployment confidence. It does not mean GPT-4o-0806 will reach 81% recall on other insurance tasks. The performance depended on labelled data, prompt iteration, metadata, structured outputs, privacy-safe deployment, and business-specific definitions that the paper cannot disclose.
For operators, the exportable lesson is not the number. The exportable lesson is the evaluation pattern.
What business leaders should copy, and what they should not
The case suggests a practical adoption sequence:
| Step | What the paper directly shows | Cognaptus inference for operators | Boundary |
|---|---|---|---|
| Pick a narrow bottleneck | Claim-part identification was selected because manual expertise limited scalability | Start with a task where missed detections have measurable cost and human review can absorb false positives | Not every knowledge task has this risk profile |
| Optimise for the business loss function | Recall was prioritised because false negatives mattered more than false positives | Choose metrics based on operational consequences, not benchmark aesthetics | Precision may dominate in other workflows |
| Make outputs system-readable | Structured Outputs helped remove unwanted output variation | LLM automation needs schema discipline, not just fluent text | Schema compliance does not guarantee semantic correctness |
| Run humans and AI in parallel | Claim handlers continued normal work during the five-month evaluation | Parallel operation can reveal overlap, misses, and workload shifts | Parallel runs cost time and require clean logging |
| Mine the process, not just the model | OCPM showed how work moved across handlers, AI, and investigators | Model monitoring should be paired with process monitoring | OCPM can be technically and visually complex |
| Redesign downstream capacity | AI shifted the bottleneck to investigators | Treat AI detection as demand generation for the next function | Hiring more people is only one possible response |
The point is not that every company needs OCPM from day one. The point is that every serious AI automation programme needs an answer to the same class of question: after the model acts, which human, queue, control, system, or policy receives the extra volume?
If nobody knows, the project is not being evaluated. It is being admired.
Visualisation is part of the system, not a reporting afterthought
One of the paper’s quieter but important findings is about process mining usability. The authors found object-centric directly-follows graphs and object-centric Petri nets useful for analysis, but too complex for stakeholders. Even after simplification, business users struggled to interpret the full object-centric models.
So the team used object-centric methods for the analytical work, then flattened filtered logs around the Claim object type and presented traditional directly-follows graphs to stakeholders. This is a practical compromise: use the richer model to avoid wrong analysis, then translate the result into a simpler view for decision-making.
That distinction deserves attention. Technical teams often assume that if an analysis is correct, it is done. It is not. If the business cannot interpret the output, the analysis has not entered the organisation. It is just sitting there, highly accurate and socially unemployed.
The paper also notes that open-source process mining tools lacked the flexibility needed for stakeholder engagement in this context. That is not the central scientific result, but it is operationally relevant. If process mining is going to become part of AI governance, organisations need tools that support drill-down for analysts and readable simplification for managers. Otherwise, the gap between “we measured the process” and “the business changed the process” remains pleasantly wide.
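The flattening step described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not the authors' pipeline or any process mining library: events are plain tuples, timestamps are integers, and `flatten` and `directly_follows` are hypothetical helper names.

```python
from collections import Counter, defaultdict

# (timestamp, activity, {object type: [object ids]}); invented events.
events = [
    (1, "Register claim", {"Claim": ["c1"]}),
    (2, "Create note", {"Claim": ["c1"], "ClaimNote": ["n1"]}),
    (3, "Scan claim", {"Claim": ["c1"], "AIModel": ["m1"]}),
    (4, "Predict claim part", {"Claim": ["c1"], "ClaimPart": ["p1"]}),
    (5, "Register claim", {"Claim": ["c2"]}),
    (6, "Scan claim", {"Claim": ["c2"], "AIModel": ["m1"]}),
    (7, "Create claim part investigation", {"Claim": ["c1"], "ClaimPart": ["p1"]}),
]


def flatten(events, object_type):
    """Project the log onto one object type: one time-ordered trace per object."""
    traces = defaultdict(list)
    for _, activity, objects in sorted(events, key=lambda event: event[0]):
        for object_id in objects.get(object_type, []):
            traces[object_id].append(activity)
    return dict(traces)


def directly_follows(traces):
    """Count how often one activity is immediately followed by another."""
    dfg = Counter()
    for trace in traces.values():
        dfg.update(zip(trace, trace[1:]))
    return dfg


claim_traces = flatten(events, "Claim")
for (source, target), count in directly_follows(claim_traces).items():
    print(f"{source} -> {target}: {count}")
```

Flattening on `Claim` gives stakeholders one readable path per claim, at the cost of hiding which note, claim part, or model each step touched. That is why it works as a presentation layer on top of the object-centric analysis rather than as a substitute for it.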
The business value is not cheaper labour; it is better process visibility
A lazy reading says the insurer automated claim handlers. A better reading says the insurer created a higher-throughput detection layer and then used process mining to discover the next constraint.
That changes the ROI conversation. Labour substitution may happen in some settings, but this case is more about capability expansion. The AI identified many more candidate claim parts than humans did. Business experts considered the AI-identified ratio plausible and valuable. Yet the organisation still needed investigators, rules, prioritisation, and downstream scaling to convert detection into realised impact.
So the value stack looks like this:
- Detection capacity: the LLM scans all claims in the target line of business and identifies far more claim parts than the manual process.
- Consistency: structured prompts and schema-constrained outputs reduce variation in the automated component.
- Coverage insight: parallel operation reveals where humans and AI overlap or miss different investigated cases.
- Process visibility: OCPM shows that the next bottleneck sits in claim part investigation.
- Redesign agenda: management now has evidence for workforce adjustment, downstream automation, triage rules, or process redesign.
Only the first two belong mostly to the model. The rest belong to operations. This is where many AI projects become less glamorous and more useful.
Boundaries that should affect deployment decisions
The paper’s limitations are not fatal, but they are operationally material.
First, the claim-part categories are confidential. That prevents external readers from judging task difficulty, class distribution, or semantic ambiguity in detail. We can evaluate the reported method and numbers, but not fully inspect the domain.
Second, the study covers one specific line of business. The insurer has many process variations across 77 lines of business, and the company plans expansion, but the paper does not prove that transfer will be painless. Each line may require new prompts, labels, privacy checks, stakeholder validation, and process mining configuration.
Third, the five-month window captures early production dynamics. It shows a real transition period, which is valuable. It does not settle long-term questions about model drift, investigator learning, seasonal claim patterns, or whether downstream scaling eventually produces financial gains.
Fourth, the analysis depends on event data quality. The authors had to extract from multiple systems, use a temporal data warehouse, pull claim-part information from an operational system via a KPI, and filter claim notes because OCEL 2.0 could not distinguish expired related notes in the required way. That is a healthy reminder that “just mine the process” usually means “first survive the data plumbing”.
Fifth, the stakeholder-facing process models required simplification. The richest analytical representation was not the most usable business representation. That is a deployment constraint, not a cosmetic issue.
The operator’s takeaway: automate the step, then audit the system
This paper is valuable because it resists the standard enterprise AI fairytale. It does not stop at “the model works”. It asks what happened to the business process when the model entered production.
The answer is useful precisely because it is inconvenient. The AI scaled claim-part identification from 68 human-identified claim parts to 1034 AI-identified claim parts over five months. It exceeded the internal human recall baseline in the best reported model iteration. It also exposed the next bottleneck: investigators could not absorb the newly expanded detection volume without further process change.
That is what mature AI adoption looks like. Not a single heroic model. Not a dashboard announcing transformation because the demo was smooth. A narrow automation improves one part of the workflow, measurement shows the side effects, and management has to redesign the surrounding system.
The practical lesson is simple enough to be dangerous: if AI makes one step roughly 15 times as large, the next step had better know.
Otherwise the bottleneck has not disappeared. It has just moved desks.
Cognaptus: Automate the Present, Incubate the Future.
[1] Shahrzad Khayatbashi, Viktor Sjölind, Anders Granåker, and Amin Jalali, “AI-Enhanced Business Process Automation: A Case Study in the Insurance Domain Using Object-Centric Process Mining,” arXiv:2504.17295, 2025, https://arxiv.org/abs/2504.17295.