Judge, Jury, and GPT: Bringing Courtroom Rigor to Business Automation

In the high-stakes world of business process automation (BPA), it’s not enough for AI agents to just complete tasks—they need to complete them correctly, consistently, and transparently. At Cognaptus, we believe in treating automation with the same scrutiny you’d expect from a court of law. That’s why we’re introducing CognaptusJudge, our novel framework for evaluating business automation, inspired by cutting-edge research in LLM-powered web agents. ⚖️ Inspired by Online-Mind2Web Earlier this year, a research team from OSU and UC Berkeley published a benchmark titled An Illusion of Progress? Assessing the Current State of Web Agents (arXiv:2504.01382). Their findings? Many agents previously hailed as top performers were failing nearly 70% of tasks when evaluated under more realistic, human-aligned conditions. ...

April 4, 2025 · 3 min

Rules of Engagement: Why LLMs Need Logic to Plan

Rules of Engagement: Why LLMs Need Logic to Plan When it comes to language generation, large language models (LLMs) like GPT-4o are top of the class. But ask them to reason through a complex plan — such as reorganizing a logistics network or optimizing staff scheduling — and their performance becomes unreliable. That’s the central finding from ACPBench Hard (Kokel et al., 2025), a new benchmark from IBM Research that tests unrestrained reasoning about action, change, and planning. ...

April 2, 2025 · 4 min