TL;DR for operators
If your agent stack is becoming expensive because every “reflection” step means another model call, this paper is worth reading. Its proposal, Introspection of Thought (INoT), tries to compress an external multi-agent debate loop into one structured prompt. The LLM is not literally running multiple agents. It is being instructed, through a hybrid Python-and-natural-language prompt called PromptCode, to simulate two internal debaters that reason, critique, rebut, revise, and then return an answer.[1]
The paper reports stronger benchmark performance across code, math, and QA tasks than seven prompting baselines, including Chain-of-Thought, Tree-of-Thought, and Iteration-of-Thought variants. The headline table gives INoT an average score of 81.3 against the strongest baseline average of 76.4. The authors also claim a 58.3% average token-cost reduction compared with the best-performing baseline. That combination, better reasoning with fewer tokens, is exactly the kind of thing operators like to hear, which is also why it deserves careful inspection rather than applause by reflex.
The practical lesson is not “agents are becoming introspective.” Please do not put that on a strategy slide. The lesson is narrower and more useful: some agent-like reasoning behaviours may be moved from orchestration infrastructure into prompt-side control. If that generalises, teams can get some of the benefits of debate, critique, and self-correction without paying for a full external loop every time.
The boundary is equally important. The experiments are benchmark tests, not production deployments. The PromptCode is not executed by an interpreter. The multimodal results are framed as versatility evidence, not proof of broad visual reliability. And the paper contains some numeric inconsistencies between tables, prose, and contribution statements. In other words: interesting mechanism, promising numbers, still not a procurement-grade miracle. Tragic, I know.
The expensive part of agent reasoning is the conversation around the answer
Many agent systems improve output by not trusting the first answer. They ask the model to reason, check, revise, compare alternatives, or consult another role. That can work. It can also turn one question into a small committee meeting held inside your API bill.
External reasoning frameworks often follow a simple pattern:
Task -> LLM output -> critique -> revised output -> comparison -> final answer
When that loop is implemented outside the model, each stage usually requires additional input and output tokens. Multi-agent frameworks make the pattern more elaborate: one agent proposes, another critiques, another judges, and sometimes the whole group keeps talking until the answer looks respectable enough to leave the building.
INoT attacks that cost centre. Instead of managing the debate externally, it places the reasoning procedure inside a single prompt. The paper calls this “Introspection of Thought” because the reflection is supposed to happen within the model rather than through repeated external interaction.
That distinction is the centre of the paper. INoT is not primarily a new model, a fine-tuning method, or a tool-calling architecture. It is a prompt-engineering framework designed to make one LLM behave as though it were internally coordinating a debate.
PromptCode is code-shaped instruction, not executable software
The paper introduces “PromptCode,” described as an LLM-readable code format. It is written as a mixture of Python-like structure and natural language, wrapped in XML-style prompt sections. The authors argue that code-like structure reduces ambiguity compared with ordinary prose prompts, while natural language keeps the format flexible enough for abstract reasoning tasks.
That sounds more formal than it really is. PromptCode is not code in the normal software-engineering sense. There is no interpreter executing a loop. There is no runtime enforcing state. There is no actual Agent_A object with memory isolation from Agent_B. The prompt asks the model to behave as if those things exist.
This matters because the most likely reader misconception is also the most commercially dangerous one: treating INoT as a genuine multi-agent runtime. It is better understood as a single-model simulation of multi-agent deliberation.
The structure has three main pieces:
| INoT component | What it does | Operational interpretation |
|---|---|---|
| PromptCode Definition Module | Tells the model it is a “PromptCode Executor” and should follow structured reasoning logic | A role-and-format priming layer |
| Image Augment Module | Adds systematic visual-analysis instructions when an image is present | A multimodal checklist, not an image-processing tool |
| Reasoning Module | Simulates two agents that reason, critique, rebut, adjust, and converge | A compressed debate scaffold inside one prompt |
The paper’s reasoning logic assumes two agents, Agent_A and Agent_B. Each produces an answer and thought process. They then present arguments, critique each other, rebut critiques, adjust their reasoning, and check agreement. The intended output is the final result without explanation.
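To make the shape concrete, here is a minimal sketch of how a PromptCode-style prompt might be assembled as an ordinary string. The tag names, the `build_inot_prompt` helper, the `max_rounds` parameter, and all wording inside the template are assumptions for illustration; this is not the paper's actual prompt text.

```python
# Hypothetical sketch of a PromptCode-style prompt; NOT the paper's actual text.
# Tag names, pseudo-method names, and wording are illustrative assumptions.

REASONING_TEMPLATE = """<PromptCode>
You are a PromptCode Executor. Follow the logic below step by step.
</PromptCode>
<Reasoning>
MaxRounds = {max_rounds}
Counter = 0
agreement = False
result_A, thought_A = Agent_A.reason(task)
result_B, thought_B = Agent_B.reason(task)
while not agreement and Counter < MaxRounds:
    # each agent critiques the other, rebuts, then adjusts its reasoning
    critique_A = Agent_A.critique(thought_B)
    critique_B = Agent_B.critique(thought_A)
    result_A, thought_A = Agent_A.adjust(Agent_A.rebut(critique_B))
    result_B, thought_B = Agent_B.adjust(Agent_B.rebut(critique_A))
    agreement = (result_A == result_B)
    Counter += 1
Output the final result only, without explanation.
</Reasoning>
<Task>
{task}
</Task>"""


def build_inot_prompt(task: str, max_rounds: int = 3) -> str:
    """Return one prompt string; nothing in it is run by an interpreter."""
    return REASONING_TEMPLATE.format(task=task, max_rounds=max_rounds)


prompt = build_inot_prompt("Write a function that reverses a linked list.")
print(prompt)
```

The only Python that actually runs here is the string formatting. The `while` loop and the `Agent_A` calls are text the model reads, which is the whole point: the structure is a suggestion with good posture, not a runtime.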
The clever move is not that the model suddenly has two minds. It does not. The clever move is that the prompt gives the model a structured procedure for generating internal disagreement before committing to an answer. That is useful because many LLM failures are premature-commitment failures: the model grabs a plausible path, keeps walking, and only later discovers the bridge was decorative.
The mechanism is compressed debate
A normal external debate setup might look like this:
Agent A call -> Agent B call -> critique call -> rebuttal call -> judge call -> final call
INoT tries to collapse the pattern:
Task + PromptCode -> one LLM call -> simulated debate -> final answer
The operational implication is straightforward. External orchestration gives you more control and observability, but it costs tokens and latency. Prompt-side introspection gives you less control, but possibly enough of the same reasoning benefit at a lower cost.
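The difference in call count can be sketched with a stub. In the snippet below, `fake_llm` is a hypothetical stand-in that only records prompts; it is not a real model client, and the prompt strings are placeholders rather than anything from the paper.

```python
# Illustrative only: `fake_llm` is a stub, not a real model client.
from typing import Callable, List

calls: List[str] = []


def fake_llm(prompt: str) -> str:
    """Record the prompt and return a dummy response."""
    calls.append(prompt)
    return f"response-{len(calls)}"


def external_debate(task: str, llm: Callable[[str], str]) -> str:
    """Multi-call pattern: every stage is a separate model call."""
    a = llm(f"Agent A, answer: {task}")
    b = llm(f"Agent B, answer: {task}")
    critique = llm(f"Critique these answers: {a} | {b}")
    rebuttal = llm(f"Rebut this critique: {critique}")
    verdict = llm(f"Judge the debate: {rebuttal}")
    return llm(f"State the final answer given: {verdict}")


def introspective_call(task: str, llm: Callable[[str], str]) -> str:
    """Single-call pattern: the debate is described inside one prompt."""
    return llm(f"<Reasoning>simulate two debaters</Reasoning><Task>{task}</Task>")


external_debate("2 + 2?", fake_llm)
external_calls = len(calls)

calls.clear()
introspective_call("2 + 2?", fake_llm)
single_calls = len(calls)

print(external_calls, single_calls)
```

Call count is only a proxy. A single introspective call can still emit a long output, so the real comparison is total tokens and latency on your own workload.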
That trade-off is familiar in automation. You can build a workflow engine with explicit stages, logs, retries, and validators. Or you can write a stronger instruction that makes the model perform several steps internally. The first is more auditable. The second is cheaper and easier to deploy. The right choice depends on how expensive mistakes are.
INoT sits on the cheaper side of that trade-off. It says: before building a multi-agent apparatus with all the dignity of a committee and all the cost of a committee, try putting the committee inside the prompt.
The main benchmark evidence supports the mechanism, but not equally across tasks
The paper evaluates INoT on six benchmarks across code, math, and QA tasks. For text tasks, it uses DeepSeek-V2.5 as the common base model when comparing prompting frameworks. The baselines include IO, CoT, SCCOT, LogiCoT, ToT, GIoT, and AIoT.
The main results table reports the following:
| Method | HumanEval | MBPP | MATH | GSM8K | HotpotQA | SQuAD | Average |
|---|---|---|---|---|---|---|---|
| IO | 88.2 | 72.2 | 48.2 | 90.0 | 65.2 | 84.2 | 74.7 |
| CoT | 89.3 | 72.2 | 49.2 | 89.2 | 63.7 | 85.3 | 74.8 |
| SCCOT | 89.7 | 73.3 | 48.9 | 90.8 | 65.1 | 86.7 | 75.8 |
| LogiCoT | 89.7 | 73.7 | 48.7 | 90.2 | 64.3 | 85.2 | 75.3 |
| ToT | 90.6 | 74.2 | 50.2 | 90.4 | 66.5 | 86.2 | 76.4 |
| GIoT | 90.2 | 72.2 | 47.5 | 90.4 | 63.3 | 85.1 | 74.8 |
| AIoT | 90.4 | 72.8 | 47.2 | 91.2 | 64.3 | 86.3 | 75.4 |
| INoT | 95.9 | 84.3 | 53.4 | 93.3 | 72.2 | 88.8 | 81.3 |
The strongest overall baseline is ToT at 76.4. INoT reaches 81.3, a 4.9-point absolute gain over that strongest baseline. The paper also says INoT exceeds baselines by an average margin of 7.95%. That figure appears to be a relative improvement over the mean of the seven baseline averages (about 75.3), not an absolute point gain over the strongest baseline.
The gains are not uniform. Measured against the best baseline on each benchmark, they are largest on MBPP (+10.1 points) and HotpotQA (+5.7), with HumanEval close behind (+5.3); more modest on MATH (+3.2) and GSM8K (+2.1); and still visible on SQuAD (+2.1). That pattern is plausible. A structured critique loop should help tasks where the model benefits from checking assumptions, decomposing constraints, or catching inconsistencies. It may help less where the bottleneck is raw knowledge, exact symbolic manipulation, or benchmark-specific evaluation quirks.
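The two readings of the headline number can be checked directly from the Average column of the table. This is a quick arithmetic sketch using the rounded averages as printed; the variable names are my own.

```python
from statistics import mean

# "Average" column from the main results table (rounded values as printed).
baseline_avg = {
    "IO": 74.7, "CoT": 74.8, "SCCOT": 75.8, "LogiCoT": 75.3,
    "ToT": 76.4, "GIoT": 74.8, "AIoT": 75.4,
}
inot_avg = 81.3

best_name = max(baseline_avg, key=baseline_avg.get)
abs_gain = inot_avg - baseline_avg[best_name]        # points over the best baseline
mean_baseline = mean(baseline_avg.values())          # mean of the seven baselines
rel_gain_pct = (inot_avg / mean_baseline - 1) * 100  # relative gain over that mean

print(best_name, round(abs_gain, 1), round(mean_baseline, 2), round(rel_gain_pct, 2))
# prints: ToT 4.9 75.31 7.95
```

The 7.95% only falls out when the rounded averages are used and the comparison is against the mean baseline, which is why it should not be quoted as a gain over the strongest competitor.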
The code results are especially interesting. HumanEval rises from a best baseline of 90.6 to 95.9, and MBPP from 74.2 to 84.3. That suggests the internal debate scaffold may be useful when generated code benefits from adversarial checking: “Does this handle edge cases?” “Does the function satisfy the prompt?” “Is there an off-by-one lurking in the bushes, as usual?”
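For anyone who wants the per-benchmark picture rather than the average, the gain over the best baseline on each task can be computed from the table as printed. The snippet is a plain arithmetic sketch; nothing in it comes from outside the table.

```python
benchmarks = ["HumanEval", "MBPP", "MATH", "GSM8K", "HotpotQA", "SQuAD"]

# Baseline rows from the main results table, in benchmark order.
baselines = {
    "IO":      [88.2, 72.2, 48.2, 90.0, 65.2, 84.2],
    "CoT":     [89.3, 72.2, 49.2, 89.2, 63.7, 85.3],
    "SCCOT":   [89.7, 73.3, 48.9, 90.8, 65.1, 86.7],
    "LogiCoT": [89.7, 73.7, 48.7, 90.2, 64.3, 85.2],
    "ToT":     [90.6, 74.2, 50.2, 90.4, 66.5, 86.2],
    "GIoT":    [90.2, 72.2, 47.5, 90.4, 63.3, 85.1],
    "AIoT":    [90.4, 72.8, 47.2, 91.2, 64.3, 86.3],
}
inot = [95.9, 84.3, 53.4, 93.3, 72.2, 88.8]

gains = {}
for i, name in enumerate(benchmarks):
    best = max(row[i] for row in baselines.values())  # best baseline on this task
    gains[name] = round(inot[i] - best, 1)

for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:10s} +{gain}")
```

Note that the best baseline is not always ToT: on GSM8K it is AIoT, and on SQuAD it is SCCOT.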
The cost claim is the business hook, but read its scope carefully
The cost result is the paper’s operator-facing headline. The authors compare token cost and performance on HumanEval across several base models. They report token costs ranging from 6.1 million to 84.7 million tokens across methods, with INoT typically around 26 million tokens. They then claim INoT’s token cost is 58.3% lower on average than the best-performing baseline.
This is where the paper becomes commercially interesting. Many enterprise agent pilots die not because the demo fails, but because the scaled workflow becomes slow, expensive, or operationally fussy. Reflection loops are seductive in prototypes. In production, they become line items.
The business interpretation is therefore:
| Paper claim | Evidence role | Business meaning | Boundary |
|---|---|---|---|
| INoT improves benchmark performance | Main evidence | Structured self-critique may increase answer quality without model training | Benchmarks are not workflows |
| INoT reduces token cost | Cost-performance comparison | Some external agent loops may be replaced by prompt-side reasoning | Cost analysis is centred on benchmark token usage, not full production systems |
| INoT works across several LLMs | Robustness/sensitivity test | The technique may not be tied to one model family | Model settings differ, and production prompts may behave differently |
| Image Augment improves multimodal QA | Exploratory extension plus ablation | The same scaffold can guide visual reasoning checklists | This does not prove broad visual reliability |
For a high-volume coding assistant, customer-support QA tool, analyst copilot, or internal knowledge assistant, a 58.3% token reduction would matter. It could mean lower inference cost, lower latency, or both. But it should be tested against the actual workload. A benchmark prompt is a tidy laboratory creature. Enterprise inputs arrive wearing muddy boots.
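A rough way to frame that test is a sensitivity table rather than a single number. In the sketch below, the monthly volume and the price per million tokens are hypothetical placeholders, and only the 58.3% figure comes from the paper (where it is a HumanEval benchmark average, not a production measurement).

```python
def projected_tokens(current_tokens: float, reduction_pct: float) -> float:
    """Tokens remaining after a given percentage reduction."""
    return current_tokens * (1 - reduction_pct / 100)


PAPER_REDUCTION_PCT = 58.3       # paper's reported average; may not transfer
monthly_tokens = 1_000_000_000   # hypothetical workload
price_per_million = 1.00         # hypothetical blended price, USD

# Treat the paper's figure as the optimistic end of a range, not a forecast.
for pct in (0.0, 20.0, 40.0, PAPER_REDUCTION_PCT):
    tokens = projected_tokens(monthly_tokens, pct)
    cost = tokens / 1_000_000 * price_per_million
    print(f"reduction {pct:5.1f}% -> {tokens:,.0f} tokens, ${cost:,.2f}")
```

Swap in your own measured token counts before drawing conclusions; the intermediate rows are there because real workloads rarely match the benchmark figure.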
The model-agnostic test is useful, but not a free portability certificate
The paper tests INoT with several LLMs: DeepSeek-V2.5, DeepSeek-V2, Claude-3.5 Sonnet, Gemma 2, Qwen2.5-Coder, and Llama 3.2. The authors report that performance variation across models does not exceed 5% on the same datasets.
This is best read as a robustness test. It suggests the framework is not merely exploiting one model’s idiosyncratic response to a particular prompt format. That is encouraging, especially for teams that do not want their reasoning layer locked to one vendor.
Still, “model-agnostic” should be translated carefully. It does not mean all models follow PromptCode equally well. It means that, in the tested benchmark setup, the framework produced reasonably stable scores across selected models. In production, prompt-following behaviour varies by model, context length, safety tuning, decoding settings, and task domain. A model that politely simulates two agents on a math benchmark may become less cooperative inside a messy procurement document with conflicting dates, duplicated clauses, and an executive asking it to “just be practical.”
The multimodal extension is a checklist, not a vision breakthrough
INoT also includes an Image Augment Module. This part instructs the model to analyse images through stages: basic visual understanding, advanced visual analysis, context awareness, and inference verification. The paper tests this on ScienceQA-IMG, LLaVA-Bench (COCO), and LLaVA-Bench (In-the-Wild), using LLaVA, LLaMA-Adapter, and MM-CoT settings.
The results show full INoT outperforming IO, CoT, and INoT without the Image Augment Module across the reported image QA benchmarks. For example, with LLaVA, full INoT reports 90.2 on ScienceQA-IMG, 83.4 on LLaVA-Bench (COCO), and 72.4 on LLaVA-Bench (In-the-Wild), compared with 88.9, 81.2, and 69.8 without Image Augment.
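The size of the Image Augment contribution in the LLaVA setting is easy to compute from the figures quoted above. This is simple subtraction on the reported scores, nothing more.

```python
# LLaVA scores quoted above: full INoT vs INoT without the Image Augment Module.
with_image_augment = {
    "ScienceQA-IMG": 90.2,
    "LLaVA-Bench (COCO)": 83.4,
    "LLaVA-Bench (In-the-Wild)": 72.4,
}
without_image_augment = {
    "ScienceQA-IMG": 88.9,
    "LLaVA-Bench (COCO)": 81.2,
    "LLaVA-Bench (In-the-Wild)": 69.8,
}

delta = {
    name: round(with_image_augment[name] - without_image_augment[name], 1)
    for name in with_image_augment
}
for name, points in delta.items():
    print(f"{name}: +{points} points")
```

The deltas are between roughly one and three points, which fits the reading of the module as a helpful checklist rather than a step change.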
The likely purpose of this experiment is not to prove that INoT is a state-of-the-art multimodal reasoning system. It is to show that the framework can absorb modality-specific instructions. The Image Augment Module acts like a structured checklist: look for objects, colours, spatial relationships, text, shadows, context, symbolic references, and uncertainty.
That matters because many multimodal failures are not caused by the model seeing nothing. They are caused by the model seeing something and then reasoning too casually. A visual checklist can slow the model down. Sometimes that is all the dignity a system needs.
The ablation supports PromptCode, but the paper leaves some details under-specified
The ablation section attempts to test whether the custom PromptCode contributes to performance. The paper says that removing custom PromptCode causes performance declines, and that execution of the PromptCode logic matters more than the surrounding PromptCode and “PromptComplier” design.
This is useful, but not as clean as the main table. The ablation appears as a heatmap rather than a fully tabulated result, and the surrounding explanation is brief. The intended purpose is clear: separate the effect of the reasoning logic from the effect of simply giving the model a long, structured prompt. But the exact implementation differences between ablated variants are not described in enough operational detail for a practitioner to reproduce or generalise confidently.
The ablation still supports the paper’s main mechanism-first story: the gains are not claimed to come only from XML tags or from giving the model more words. They are claimed to come from the simulated debate procedure encoded in PromptCode.
But from an engineering perspective, this is where the next paper needs to do more work. Which part matters most?
- Role priming?
- XML segmentation?
- Python-like variable naming?
- Explicit debate phases?
- The agreement check?
- The instruction to output without explanation?
- The sheer fact that the model is forced to reconsider?
These are not academic trivia. They decide whether the technique becomes a reusable design pattern or remains a charmingly elaborate prompt artefact.
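One way to turn the list above into an experiment plan is to treat each item as a toggle. The sketch below is my own framing, not the paper's ablation design, and the component names are hypothetical labels. The last item is hard to switch off independently of the debate phases, so treat that toggle with suspicion.

```python
from itertools import product
from typing import Dict, List

# Hypothetical labels for the candidate components listed above.
COMPONENTS = [
    "role_priming",
    "xml_segmentation",
    "python_like_names",
    "debate_phases",
    "agreement_check",
    "no_explanation_output",
    "forced_reconsideration",
]


def full_grid(components: List[str]) -> List[Dict[str, bool]]:
    """Every on/off combination: thorough, but grows as 2**n."""
    return [
        dict(zip(components, flags))
        for flags in product([True, False], repeat=len(components))
    ]


def leave_one_out(components: List[str]) -> List[Dict[str, bool]]:
    """Cheaper design: disable exactly one component per variant."""
    return [{c: (c != dropped) for c in components} for dropped in components]


print(len(full_grid(COMPONENTS)), len(leave_one_out(COMPONENTS)))
```

Leave-one-out is usually the affordable starting point; the full grid is what you run only once a component looks interesting enough to justify the bill.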
The paper’s own numbers need careful handling
There are several internal inconsistencies worth noting because they affect how confidently the results should be cited.
First, the abstract says INoT improves performance by 7.95% over baselines. The contribution list says 11.6%. The main results table supports the 7.95% relative improvement over the average baseline score, but not the 11.6% statement as written.
Second, the main table reports INoT at 72.2 on HotpotQA and 88.8 on SQuAD. The prose later describes 75.2 on HotpotQA and 87.8 on SQuAD. The table and figure align with 72.2 and 88.8, so those are the safer numbers to use.
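Third, the reasoning code contains a loop condition written as While not agreement or Counter < MaxRounds. If read as executable Python-style logic, that condition keeps looping while either clause is true, so the loop would continue after agreement until the round limit and would never exit if the agents still disagreed at the limit. The prose description, stopping when agreement is reached or the maximum round limit is hit, corresponds to “and” rather than “or”. This is a small but revealing issue: PromptCode is not executable code. It is code-shaped guidance for generation.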
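The loop-condition point can be seen by actually running both versions. The simulation below is a hypothetical harness of my own: `agree_at` is the round in which the simulated agents first agree, and `hard_cap` exists only to keep the demonstration from running away.

```python
def rounds_run(agree_at: int, max_rounds: int, use_or: bool, hard_cap: int = 50) -> int:
    """Count debate rounds executed before the loop exits.

    use_or=True  mimics `while not agreement or counter < max_rounds`
    use_or=False mimics `while not agreement and counter < max_rounds`
    """
    agreement = False
    counter = 0
    while counter < hard_cap:  # safety cap for the demo only
        if use_or:
            keep_going = (not agreement) or counter < max_rounds
        else:
            keep_going = (not agreement) and counter < max_rounds
        if not keep_going:
            break
        counter += 1
        agreement = counter >= agree_at  # agents agree from this round onward
    return counter


# Early agreement: "or" keeps debating until the limit, "and" stops at once.
print(rounds_run(agree_at=1, max_rounds=3, use_or=True))    # 3
print(rounds_run(agree_at=1, max_rounds=3, use_or=False))   # 1

# Late agreement: "or" ignores the round limit, "and" respects it.
print(rounds_run(agree_at=10, max_rounds=3, use_or=True))   # 10
print(rounds_run(agree_at=10, max_rounds=3, use_or=False))  # 3
```

An LLM reading the prompt is unlikely to follow the literal boolean semantics, which is precisely why the discrepancy went unnoticed and why it says more about PromptCode's nature than about the results.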
None of this invalidates the paper. It does lower the confidence level one should attach to the exact numeric claims. The correct business reaction is not dismissal. It is replication on your own task distribution before changing architecture.
Where this can matter in real systems
The strongest business use case is any workflow where three conditions hold.
First, the task benefits from reflection. Coding, mathematical reasoning, policy QA, compliance interpretation, technical support, and analytical summarisation often fall into this category. Simple extraction tasks probably do not. If you ask a model to pull an invoice number, forcing it to role-play two debating philosophers may be a spectacular waste of everyone’s remaining patience.
Second, external iteration is expensive. If your current architecture calls the model repeatedly for proposal, critique, revision, and judgment, a prompt-side debate scaffold may reduce token usage and latency.
Third, the answer does not require strict auditability of every intermediate step. External agent loops can log each stage. INoT hides the deliberation inside one generation. That is acceptable for some productivity tools and less acceptable for regulated decision workflows.
A practical adoption path would look like this:
| Step | What to test | Pass condition |
|---|---|---|
| 1. Baseline current prompt | Measure accuracy, latency, and token cost on real tasks | Establish a credible comparison |
| 2. Add INoT-style debate scaffold | Keep model and evaluation set constant | Quality improves without unacceptable latency |
| 3. Compare against external critique loop | Test one-call introspection vs multi-call reflection | INoT wins on cost-adjusted quality |
| 4. Stress test failure cases | Use adversarial, ambiguous, and long-context examples | No systematic degradation |
| 5. Decide logging needs | Determine whether hidden internal reflection is acceptable | Fit with risk and compliance requirements |
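Steps 1 to 3 of that path amount to logging the same three measurements for each variant and comparing them on the same evaluation set. A minimal sketch of the bookkeeping follows. The `RunRecord` fields, the variant names, and the sample records are hypothetical placeholders, and “tokens per correct answer” is one possible cost-adjusted metric rather than one the paper defines.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class RunRecord:
    variant: str      # e.g. "baseline", "inot_style", "external_loop"
    correct: bool     # graded against your own evaluation set
    latency_s: float
    tokens: int       # input + output tokens for the whole request


def summarise(records: List[RunRecord]) -> Dict[str, Dict[str, float]]:
    """Aggregate accuracy, latency, and tokens per correct answer by variant."""
    by_variant: Dict[str, List[RunRecord]] = {}
    for r in records:
        by_variant.setdefault(r.variant, []).append(r)

    summary = {}
    for name, runs in by_variant.items():
        n_correct = sum(r.correct for r in runs)
        total_tokens = sum(r.tokens for r in runs)
        summary[name] = {
            "accuracy": n_correct / len(runs),
            "mean_latency_s": mean(r.latency_s for r in runs),
            "tokens_per_correct": total_tokens / n_correct if n_correct else float("inf"),
        }
    return summary


# Placeholder records purely to show the shape of the output.
records = [
    RunRecord("baseline", True, 1.0, 100),
    RunRecord("baseline", False, 1.0, 100),
    RunRecord("inot_style", True, 2.0, 300),
    RunRecord("inot_style", True, 2.0, 300),
]
summary = summarise(records)
for name, stats in summary.items():
    print(name, stats)
```

In the placeholder data the scaffold is more accurate but costs more tokens per correct answer, which is exactly the kind of trade-off the pass conditions in the table are meant to surface.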
The ROI question is not “Does INoT beat Chain-of-Thought on a benchmark?” The useful question is: “Can we replace three calls with one without losing the checks that keep our system from embarrassing us in production?”
The real contribution is architectural pressure, not mystical introspection
The word “introspection” gives the paper a slightly grand costume. The actual contribution is more practical: it pressures the boundary between prompt engineering and agent architecture.
For the last two years, many teams have treated “agentic” as a reason to add orchestration: planners, critics, judges, memory stores, tool routers, reflection loops, and dashboards to watch the whole little civilisation argue with itself. Some of that architecture is necessary. Some of it is theatre with invoice privileges.
INoT asks whether part of that architecture can be compressed into the prompt. The answer, based on this paper, is: sometimes, on benchmarks, with promising results, and with enough caveats to keep adults employed.
That is still valuable. Prompt-side control is cheaper to deploy than a new agent framework. It is easier to A/B test. It can be layered into existing applications. And if it captures even part of the benefit of external debate, it offers a practical middle path between naive single-shot prompting and full multi-agent orchestration.
The bottom line: put the critic where the economics make sense
INoT is a useful reminder that “agentic” does not always mean more infrastructure. Sometimes it means better internal procedure. A model that is prompted to argue with itself may produce better answers than a model asked to answer immediately. The paper’s results support that idea across code, math, QA, and multimodal benchmarks.
But the mechanism should be named accurately. This is not true multi-agent execution. It is simulated multi-agent reasoning inside one model call. That makes it cheaper and easier to deploy, but also less observable and less enforceable.
For operators, the near-term lesson is simple: before building another external reflection loop, test whether a structured internal critique prompt gives you enough quality improvement at lower cost. If it does, congratulations. You have replaced a committee with a monologue pretending to be a committee. In enterprise AI, that may count as progress.
Cognaptus: Automate the Present, Incubate the Future.
[1] Haoran Sun and Shaoning Zeng, “Introspection of Thought Helps AI Agents,” arXiv:2507.08664, 2025. https://arxiv.org/abs/2507.08664