A workflow looks harmless until it starts waiting on itself.
One LLM call asks for a plan. Another evaluates the plan. A third revises the result. A fourth retrieves evidence. Somewhere in the middle, three subtasks could have run at the same time, two repeated calls could have been reused, and one prompt should probably have been tuned before anyone proudly called the system “agentic.” Instead, the whole thing runs as a neat little chain: expensive, slow, and quietly brittle. Very elegant, in the way a traffic jam is elegant if viewed from a drone.
The paper “Framework of Thoughts: A Foundation Framework for Dynamic and Optimized Reasoning based on Chains, Trees, and Graphs” is useful because it shifts attention from the surface of reasoning prompts to the machinery that executes them.[1] Its core contribution is not another magic phrase, not a new “think step by step” variant, and not a heroic claim that graphs are always better than chains. The paper introduces Framework of Thoughts, or FoT, as infrastructure for implementing, executing, caching, and optimizing reasoning schemes built from chains, trees, and graphs.
That distinction matters. Much of the public discussion around LLM reasoning still treats performance as a property of the prompt or the model. Better prompt, better reasoning. Bigger model, better answer. More test-time compute, more intelligence. Sometimes true. Often incomplete. FoT’s argument is more operational: once reasoning becomes multi-step, performance also depends on how the reasoning process is represented, scheduled, reused, and optimized.
In other words, the prompt is no longer the whole product. The execution system is part of the intelligence.
FoT is not a reasoning trick; it is a reasoning operating layer
The likely misunderstanding is simple: readers may see “Framework of Thoughts” beside Chain of Thought, Tree of Thoughts, and Graph of Thoughts, and assume it is another prompting scheme competing for the next acronym slot. The paper says otherwise.
FoT is designed as a foundation framework for building reasoning schemes. Chain, tree, and graph structures become objects that can be implemented inside the framework. The framework itself provides the mechanics: dynamic graph construction, safe parallel execution, caching, and hyperparameter or prompt optimization.
This is why the paper’s framing is more important than its name. The authors are not claiming that every task needs a complicated graph. They are asking a more basic engineering question: if reasoning schemes already use chains, trees, and graphs, why are so many of these structures still manually specified, statically executed, and under-optimized?
The answer is partly historical. Chain-of-thought prompting made reasoning look like text. Tree and graph variants made it look like structured exploration. But many implementations still behave like fixed pipelines: define the topology, run the prompts, collect the answer. FoT tries to turn that pipeline into a modifiable execution graph.
That change sounds abstract, so here is the practical translation.
| Layer | Usual prompting view | FoT view | Business meaning |
|---|---|---|---|
| Reasoning structure | A fixed chain, tree, or graph designed before execution | An execution graph that operations can modify during execution | Better fit for heterogeneous tasks where the required decomposition is not known upfront |
| Execution | Mostly sequential unless manually parallelized | Scheduler and controller run ready operations concurrently | Lower latency for multi-step workflows |
| Reuse | Repeated calls often recomputed | Process and persistent caches reuse operation outputs | Lower cost in repeated workflows and optimization loops |
| Improvement | Prompt and hyperparameters often hand-tuned | Built-in Optuna and DSPy-based optimization | Better configurations under explicit cost or quality objectives |
The central idea is not “graphs are smarter.” The central idea is “reasoning systems need runtime architecture.”
The key mechanism is graph mutation, not graph decoration
FoT separates two concepts that are easy to blur: the execution graph and the reasoning graph.
The execution graph describes operations and information flow. Its nodes are operations: LLM calls, retrieval calls, code execution, tool calls, or any Python-defined unit. Its edges carry thoughts between operations. This graph can include completed, currently executing, and planned operations.
The reasoning graph is simpler. It records which thoughts influenced which other thoughts after operations have produced outputs. It is the trace of reasoning, not the machinery that scheduled the work.
That distinction is not cosmetic. It allows FoT to treat reasoning as a live process rather than a pre-drawn diagram. In FoT, operations can generate thoughts, but they can also modify the execution graph itself: adding operations, removing operations, or changing connections. A reasoning process can therefore discover its own next steps during execution.
For business AI systems, this is the difference between a rigid workflow and an adaptive one. A static workflow says: classify the email, extract fields, call a database, draft a response. A dynamic workflow might decide that one email needs contract lookup, another needs customer-history retrieval, and a third should branch into escalation because the evidence is inconsistent. The point is not that FoT solves that entire business case out of the box. The point is that its abstraction makes such behavior a first-class design target.
The paper still requires users to specify an initial execution graph with at least one operation. FoT is not a self-born software organism, despite what a less disciplined LinkedIn post might imply. But once execution begins, operations can derive later parts of the graph from the problem instance.
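A minimal sketch of the two-graph idea may help here. It is not FoT's API: the `ExecutionGraph` class, `add_operation`, and the toy `decompose` operation are hypothetical names invented for illustration, and the "LLM calls" are plain string functions. The sketch starts from an initial graph with a single operation, lets that operation add later operations during execution, and records the reasoning graph as a trace of which thoughts fed which.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ExecutionGraph:
    """Toy execution graph (hypothetical, not FoT's actual classes)."""
    ops: Dict[str, Callable] = field(default_factory=dict)        # name -> fn(graph, inputs)
    parents: Dict[str, List[str]] = field(default_factory=dict)   # name -> parent op names
    outputs: Dict[str, str] = field(default_factory=dict)         # name -> produced thought
    # Reasoning graph: (source thought, derived thought) pairs, filled in after the fact.
    reasoning_edges: List[Tuple[str, str]] = field(default_factory=list)

    def add_operation(self, name, fn, parents=()):
        self.ops[name] = fn
        self.parents[name] = list(parents)

    def run(self):
        while True:
            # An operation is ready once every parent has produced an output.
            ready = [n for n in self.ops
                     if n not in self.outputs
                     and all(p in self.outputs for p in self.parents[n])]
            if not ready:
                return self.outputs
            name = ready[0]  # sequential here; scheduling is a separate concern
            inputs = [self.outputs[p] for p in self.parents[name]]
            thought = self.ops[name](self, inputs)
            self.outputs[name] = thought
            self.reasoning_edges += [(src, thought) for src in inputs]


def decompose(graph, inputs):
    """Stand-in for an LLM call that plans sub-questions and extends the graph."""
    subs = ["who wrote X?", "when was X published?"]
    for i, q in enumerate(subs):
        graph.add_operation(f"answer_{i}",
                            lambda g, ins, q=q: f"answer to: {q}",
                            parents=["decompose"])
    graph.add_operation("combine",
                        lambda g, ins: " | ".join(ins[1:]),  # skip the plan thought
                        parents=["decompose"] + [f"answer_{i}" for i in range(len(subs))])
    return "plan: 2 sub-questions"


g = ExecutionGraph()
g.add_operation("decompose", decompose)  # the initial graph has exactly one operation
g.run()
print(sorted(g.ops))          # operations that exist after execution
print(g.outputs["combine"])
```

The design point is that `decompose` returns a thought and also changes what will run next, while `reasoning_edges` stays a passive record. Keeping those two roles separate is what the execution-graph versus reasoning-graph distinction amounts to in this toy version.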
This is what justifies a mechanism-first reading of the paper. Caching, parallelism, and optimization are not isolated features. They become more valuable because the reasoning process is represented as operations and dependencies.
Parallel execution is useful only if the graph remains logically safe
Parallelism is the easy part to sell and the hard part to implement safely.
If a reasoning scheme branches into many candidate thoughts, some branches can run at the same time. If a document-merging workflow generates multiple candidate merges, those candidates need not wait politely in a queue. If a multi-hop question decomposes into leaf questions, some leaf answers can be produced independently. This is the obvious value.
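The latency argument can be sketched with nothing more than Python's standard `asyncio`. This is an illustration under stated assumptions, not FoT's scheduler: `answer_leaf` is a hypothetical placeholder for an LLM or retrieval call, and the 0.1-second sleep stands in for network latency.

```python
import asyncio
import time


async def answer_leaf(question: str) -> str:
    # Placeholder for an LLM or retrieval call; the sleep simulates latency.
    await asyncio.sleep(0.1)
    return f"answer({question})"


async def run_sequential(questions):
    # Each leaf waits for the previous one, even though they are independent.
    return [await answer_leaf(q) for q in questions]


async def run_concurrent(questions):
    # Independent leaves are awaited together; results keep input order.
    return await asyncio.gather(*(answer_leaf(q) for q in questions))


def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


leaves = ["q1", "q2", "q3"]
seq_result, seq_time = timed(run_sequential(leaves))
par_result, par_time = timed(run_concurrent(leaves))
print(f"sequential: {seq_time:.2f}s, concurrent: {par_time:.2f}s")
```

With three independent leaves, the sequential version pays for three waits and the concurrent version pays for roughly one. The answers are identical; only the waiting changes.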
The non-obvious problem is that FoT allows operations to modify the graph while other operations may be running. That creates the possibility of race conditions: two operations attempt conflicting graph modifications, or one operation changes a dependency that another operation assumes still exists.
FoT addresses this with dynamic execution constraints. The scheduler only runs operations that are ready, meaning their required ancestor outputs are available. More importantly, operations are limited in what parts of the graph they can inspect and modify. They cannot rewrite already executed ancestors. They cannot freely modify non-exclusive descendants that may also be reachable through another running branch. They can modify their exclusive descendants, add certain edges from ancestors to exclusive descendants, and move some downstream connections under controlled conditions.
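A rough sketch of the two checks described above, readiness and exclusive descendants, might look like the following. The graph, the function names, and in particular the definition of "exclusive" are assumptions made for illustration: here a descendant counts as exclusive to an operation when every one of its parents is either that operation or another exclusive descendant. The paper's precise rules may differ.

```python
from typing import Dict, List, Set

# child -> parents, for a hypothetical toy graph
PARENTS: Dict[str, List[str]] = {
    "plan": [],
    "branch_a": ["plan"],
    "branch_b": ["plan"],
    "refine_a": ["branch_a"],
    "merge": ["branch_a", "branch_b"],
}


def ready_ops(parents: Dict[str, List[str]], done: Set[str]) -> Set[str]:
    """Operations whose required ancestor outputs are all available."""
    return {n for n, ps in parents.items()
            if n not in done and all(p in done for p in ps)}


def descendants(parents: Dict[str, List[str]], op: str) -> Set[str]:
    found: Set[str] = set()
    frontier = [op]
    while frontier:
        current = frontier.pop()
        for node, ps in parents.items():
            if current in ps and node not in found:
                found.add(node)
                frontier.append(node)
    return found


def exclusive_descendants(parents: Dict[str, List[str]], op: str) -> Set[str]:
    """Descendants reachable only through `op` (assumed definition)."""
    candidates = descendants(parents, op)
    changed = True
    while changed:
        changed = False
        for node in list(candidates):
            # Drop any node that also hangs off something outside op's own subtree.
            if any(p != op and p not in candidates for p in parents[node]):
                candidates.discard(node)
                changed = True
    return candidates


def may_modify(parents: Dict[str, List[str]], op: str, target: str) -> bool:
    return target in exclusive_descendants(parents, op)


print(ready_ops(PARENTS, {"plan"}))
print(exclusive_descendants(PARENTS, "branch_a"))
print(may_modify(PARENTS, "branch_a", "merge"))
```

In this toy graph, `branch_a` may rewrite `refine_a` but not `merge`, because `merge` also depends on `branch_b`, which could be running at the same time. That is the kind of restriction that keeps concurrent graph edits from colliding.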
This is not the glamorous part of the paper. It is also the part that prevents the whole idea from becoming a concurrency accident wearing a research badge.
For companies building agentic workflows, the lesson is direct: “parallel agents” are not automatically efficient agents. Parallel execution needs dependency control. Otherwise, the workflow may become faster at producing inconsistent intermediate states. That is not productivity. That is merely accelerated confusion.
Caching turns reasoning optimization from theoretical to affordable
FoT includes two forms of caching.
The process cache stores results temporarily within a single execution for one problem instance. The persistent cache stores outputs across problem instances, so repeated operations can be reused later.
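As a hedged illustration of the two tiers, the sketch below uses an in-memory dictionary as the process cache and a small JSON file as the persistent cache. Everything here is an assumption for demonstration purposes: the `PersistentCache` class, the `cache_key` scheme, and the `expensive_op` stand-in are hypothetical, and FoT's own storage and keying may work differently.

```python
import hashlib
import json
import os
import tempfile


def cache_key(op_name, inputs):
    # Same operation + same inputs -> same key (assumed keying scheme).
    payload = json.dumps({"op": op_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class PersistentCache:
    """Survives across problem instances by living on disk."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as fh:
                self.data = json.load(fh)

    def get(self, key):
        return self.data.get(key)

    def put(self, key, value):
        self.data[key] = value
        with open(self.path, "w", encoding="utf-8") as fh:
            json.dump(self.data, fh)


calls = {"count": 0}


def expensive_op(text):
    calls["count"] += 1  # counts real executions, i.e. what would be billed
    return text.upper()


def run_instance(items, persistent):
    process_cache = {}  # discarded when this problem instance finishes
    results = []
    for item in items:
        key = cache_key("expensive_op", item)
        if key in process_cache:
            value = process_cache[key]
        else:
            value = persistent.get(key)
            if value is None:
                value = expensive_op(item)
                persistent.put(key, value)
            process_cache[key] = value
        results.append(value)
    return results


path = os.path.join(tempfile.mkdtemp(), "fot_cache.json")
first = run_instance(["a", "b", "a"], PersistentCache(path))   # second "a" hits the process cache
second = run_instance(["a", "c"], PersistentCache(path))       # "a" hits the persistent cache
print(first, second, calls["count"])
```

Five lookups across two problem instances trigger only three real executions: the repeated `"a"` is served once from the process cache and once from the persistent cache.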
This sounds like normal software engineering. It is. That is exactly why it matters. A large part of LLM infrastructure progress will come from rediscovering boring engineering principles and applying them to expensive probabilistic calls. The cloud bill, unlike the model, is very deterministic.
Caching has two distinct roles in the paper.
First, it reduces cost and runtime during normal execution when operations repeat with the same inputs. Second, and more strategically, it makes optimization more feasible. Prompt and hyperparameter optimization require repeated evaluations of many configurations. Without caching and parallelization, the search process itself can become too expensive to justify.
The paper’s results support this distinction. In per-instance evaluation, parallel execution and persistent caching produce runtime speed-ups between 1.9x and 35.4x depending on task and scheme, with an average 10.7x acceleration under parallel execution plus persistent caching. Caching reduces costs on Game of 24, Document Merging, and MuSiQue by 14–46%, while having no effect on Sorting and HotpotQA in the reported setup.
That unevenness is important. Caching is not magic dust. It helps when there are reusable operations. It does little when the workflow does not repeat meaningful calls. For business use, this means caching value depends on workflow shape. Contract comparison, repeated document processing, customer-support templates, recurring research pipelines, and evaluation loops are promising. One-off bespoke reasoning tasks may see less benefit.
The experiments are demonstrations of infrastructure leverage, not proof of universal reasoning superiority
The paper reimplements three popular schemes inside FoT: Tree of Thoughts, Graph of Thoughts, and ProbTree. It evaluates them across five task settings: Game of 24, Sorting, Document Merging, HotpotQA, and MuSiQue.
The experimental structure serves several purposes.
| Experiment area | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| ToT on Game of 24 | Main evidence for tree-style reasoning with repeated evaluation | Parallelism and caching can sharply reduce runtime and cost | That FoT makes all math reasoning reliable |
| ToT and GoT on Sorting | Comparison across tree and graph implementations | FoT can host multiple existing reasoning schemes | That caching always helps; Sorting shows no cache benefit in the reported setup |
| GoT on Document Merging | Main evidence on a more business-like task | Caching and prompt optimization can help multi-step document workflows | That NDA merging on a benchmark equals production legal review |
| ProbTree on HotpotQA and MuSiQue | Demonstration of dynamic double-tree and retrieval-compatible reasoning | FoT can implement retrieval-including multi-hop QA workflows | That hallucination is solved |
| Optimization experiments | Evidence that prompt and hyperparameter search can improve quality under cost constraints | Optimization becomes more viable with caching and parallelization | That gains are large or general; the reported gains are modest and benchmark-specific |
The results are strongest as evidence for execution efficiency and optimization feasibility. They are more modest as evidence for raw reasoning improvement.
For example, optimization improves Game of 24 accuracy from 63.0% to 66.0%. Sorting mistakes decrease from 18.4 to 18.2 for ToT and from 12.7 to 12.1 for GoT. Document Merging F1 rises from 8.4 to 8.8. These are not dramatic leaps. They are small improvements achieved while constraining costs so that the optimized variants do not simply buy better scores by spending more test-time compute.
That cost constraint is methodologically important. Without it, a reasoning benchmark can become a spending contest. FoT’s optimization setup is more disciplined: improve task score, but do not exceed the cost of the unoptimized variant. That makes the result more relevant to business deployment, where “just sample more” is a strategy until finance notices.
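The shape of that constraint can be shown with a deliberately small example. The paper reports using Optuna and DSPy; the sketch below swaps in a plain grid search so it runs with the standard library alone. The `evaluate` function, its score formula, and the hyperparameter names `branches` and `keep` are invented for illustration and are unrelated to the paper's reported numbers.

```python
from itertools import product


def evaluate(branches: int, keep: int):
    """Toy stand-in for running a reasoning scheme on a validation set.

    Returns (score, cost), with cost counted in hypothetical LLM calls.
    """
    score = 0.4 + 0.05 * branches - 0.02 * abs(keep - 2)
    cost = branches * keep
    return score, cost


baseline_config = (3, 3)
baseline_score, baseline_cost = evaluate(*baseline_config)

search_space = list(product(range(1, 6), range(1, 4)))  # branches 1..5, keep 1..3

# Unconstrained search: free to buy score with extra calls.
unconstrained_best = max(search_space, key=lambda cfg: evaluate(*cfg)[0])

# Constrained search: a config is feasible only if it costs no more than the baseline.
feasible = [cfg for cfg in search_space if evaluate(*cfg)[1] <= baseline_cost]
constrained_best = max(feasible, key=lambda cfg: evaluate(*cfg)[0])

print("baseline:", baseline_config, baseline_score, baseline_cost)
print("unconstrained best:", unconstrained_best, evaluate(*unconstrained_best))
print("constrained best:", constrained_best, evaluate(*constrained_best))
```

In this toy setup the unconstrained winner spends more calls than the baseline, while the constrained winner improves the score at lower cost. The two searches disagree, which is exactly why the constraint matters when reading reported gains.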
The larger result is in the optimization process itself. With parallel execution and persistent caching, total optimization duration improves by factors ranging from 8.7x to 50.2x across the reported settings. Persistent caching reduces total optimization costs to 9–36% of the no-cache baseline. This is where FoT’s business relevance becomes clearer: the framework does not merely run a reasoning scheme faster; it lowers the barrier to systematically tuning reasoning schemes.
The real business lesson is not “use FoT”; it is “treat reasoning as an operations problem”
A company deploying LLM workflows usually asks three questions too early:
- Which model should we use?
- What prompt should we write?
- How do we make the agent more autonomous?
FoT suggests a fourth question should come first:
What is the execution architecture of the reasoning process?
This question changes the design conversation.
A customer-support agent is not just a prompt with tools. It is a graph of classification, retrieval, policy checking, escalation logic, response drafting, and verification. A financial-research assistant is not just a long context window. It is a sequence of decompositions, evidence retrieval, contradiction checks, table extraction, synthesis, and audit trails. A document automation system is not just “summarize this contract.” It is a branching process of clause detection, comparison, missing-term identification, risk scoring, and revision suggestions.
Once the workflow is seen this way, several practical design principles follow.
| FoT contribution | Operational consequence | ROI relevance |
|---|---|---|
| Dynamic execution graphs | Workflows can adapt their decomposition to the instance | Better handling of heterogeneous cases without manually designing every path |
| Safe parallel execution | Independent operations can run concurrently without corrupting graph state | Lower latency in multi-step workflows |
| Process and persistent caching | Repeated calls can be reused within and across runs | Lower inference cost, especially in repeated tasks |
| Hyperparameter optimization | Search over branching, filtering, scheduling, and other design choices | Better quality-cost tradeoffs |
| Prompt optimization | Improve operation-level instructions systematically | Less reliance on artisanal prompt tinkering |
The mildly uncomfortable implication is that many “AI agent” prototypes are under-engineered. They may have tools, memory, and a charming demo video, but their reasoning structure is still a hand-wired sequence. When the task changes, the workflow either fails, becomes expensive, or requires manual redesign.
FoT does not remove that engineering burden. It makes the burden visible.
Where the paper is strong, and where it is still early
The paper is strongest in three areas.
First, it gives a clean abstraction for representing reasoning schemes as executable and modifiable graphs. This matters because current reasoning methods are fragmented across chains, trees, graphs, retrieval pipelines, and tool-use systems. A common framework makes comparison and implementation less ad hoc.
Second, it shows meaningful runtime improvements from parallel execution combined with persistent caching. These are not subtle. A reported range of 1.9x to 35.4x speed-up is enough to change whether a workflow feels interactive or painfully ceremonial.
Third, it makes optimization economically plausible. The paper’s optimization gains in task score are modest, but the reduction in optimization cost and duration is strategically more interesting. If optimization is too expensive, teams do not optimize; they ship whatever prompt survived the afternoon. FoT’s caching and parallelism reduce that friction.
The boundaries are also clear.
FoT is evaluated through reimplementations of ToT, GoT, and ProbTree, not through a broad population of newly invented, fully automatic reasoning schemes. The authors themselves note that GoT’s execution graph is static, while ToT and ProbTree exhibit only some of the dynamic graph modification FoT is designed to support. So the paper demonstrates the framework’s capacity, but the full value of dynamic graph evolution remains partly future-facing.
The benchmark tasks are useful but limited. Game of 24 is synthetic. Sorting is controlled. HotpotQA and MuSiQue test multi-hop QA, but production knowledge work often includes messy documents, changing policies, ambiguous user intent, and accountability requirements. Document Merging is closer to business use, but even there, a merged NDA benchmark is not legal assurance.
Finally, FoT improves orchestration; it does not guarantee truth. A faster graph can still route bad evidence into a fluent answer. Caching can preserve useful repeated outputs, but it can also preserve outputs that should be invalidated when context changes. Prompt optimization can find better configurations for a validation set, but it can overfit or miss operational risks not represented in the benchmark.
These are not fatal weaknesses. They are boundaries around what the paper actually proves.
What Cognaptus would take from FoT for real deployments
For business automation, the immediate takeaway is not to rebuild every workflow in FoT tomorrow morning. The practical lesson is to audit LLM systems at the graph level.
A useful audit would ask:
| Audit question | Why it matters |
|---|---|
| Which operations are truly dependent, and which can run in parallel? | Latency often hides in unnecessary sequencing |
| Which operation outputs repeat across cases? | Repetition is where persistent caching becomes valuable |
| Which prompts and hyperparameters are hand-tuned but never validated? | Untested design choices may dominate performance |
| Which workflow branches should be dynamically derived from the instance? | Static pipelines struggle with heterogeneous inputs |
| Which cached outputs require invalidation rules? | Reuse without freshness control becomes technical debt |
| Which evidence checks sit after generation rather than before it? | Verification placement affects reliability |
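The first audit question can be made concrete with a critical-path calculation. The sketch below is a back-of-the-envelope aid under stated assumptions: the operation names, durations in seconds, and dependencies are invented for a hypothetical support workflow, and real latencies would come from traces.

```python
from functools import lru_cache

# Hypothetical per-operation latencies in seconds.
DURATION = {
    "classify": 1.0,
    "retrieve_policy": 2.0,
    "retrieve_history": 3.0,
    "draft": 2.0,
    "verify": 1.0,
}

# op -> operations whose outputs it truly needs
DEPENDS_ON = {
    "classify": [],
    "retrieve_policy": ["classify"],
    "retrieve_history": ["classify"],
    "draft": ["retrieve_policy", "retrieve_history"],
    "verify": ["draft"],
}


def sequential_latency() -> float:
    """Latency when every operation waits for the previous one."""
    return sum(DURATION.values())


@lru_cache(maxsize=None)
def finish_time(op: str) -> float:
    """Earliest finish time when an operation starts as soon as its dependencies are done."""
    start = max((finish_time(dep) for dep in DEPENDS_ON[op]), default=0.0)
    return start + DURATION[op]


def critical_path_latency() -> float:
    return max(finish_time(op) for op in DURATION)


sequential = sequential_latency()
parallel_floor = critical_path_latency()
best_case_speedup = sequential / parallel_floor
print(sequential, parallel_floor, round(best_case_speedup, 2))
```

Here the two retrieval steps are the only independent pair, so the best-case gain is modest. That is itself useful audit output: it shows how much latency is removable by scheduling and how much is locked in by genuine dependencies.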
This is where FoT connects to AI process automation more than prompt engineering. The framework encourages teams to model reasoning systems as composable operations with dependencies, costs, and search spaces. That is a healthier mental model than treating an agent as a chat transcript with accessories.
The business value, therefore, is not simply cheaper reasoning. It is cheaper experimentation with reasoning designs. A team can test whether more branches help, whether fewer candidates preserve quality, whether prompt variants improve a scoring function, and whether parallelization reduces response time enough to change the user experience.
That is not glamorous. It is also how software becomes dependable.
From static prompts to self-optimizing reasoning graphs
The phrase “self-optimizing reasoning graphs” should be used carefully. FoT does not produce a fully autonomous system that invents all its own objectives, validates itself in the wild, and sends the engineering team home. Convenient fantasy, wrong decade.
What FoT does show is more grounded and more useful: reasoning schemes can be represented as dynamic execution graphs; those graphs can be scheduled safely; repeated operations can be cached; prompts and hyperparameters can be optimized under explicit objectives; and the result can be faster, cheaper, and sometimes more accurate.
That is enough.
The next generation of business AI systems will not be won only by the team with the cleverest prompt or the newest model endpoint. It will be won by teams that understand the shape of the work: which parts branch, which parts repeat, which parts can run together, which parts need validation, and which design choices should be optimized rather than guessed.
FoT is valuable because it moves the discussion from prompt text to reasoning infrastructure. It reminds us that intelligence in deployed AI systems is not only generated. It is orchestrated.
And once reasoning becomes orchestration, static prompts start to look less like strategy and more like a prototype that forgot to grow up.
Cognaptus: Automate the Present, Incubate the Future.
[1] Felix Fricke, Simon Malberg, and Georg Groh, “Framework of Thoughts: A Foundation Framework for Dynamic and Optimized Reasoning based on Chains, Trees, and Graphs,” arXiv:2602.16512, 2026. https://arxiv.org/abs/2602.16512