Plug Me In: Why LLMs with Tools Beat LLMs with Size

TL;DR for operators

The Athena paper is useful because it makes a simple operational point that many AI buying committees still manage to avoid: a bigger language model is not the same thing as a better workflow.¹ An LLM can explain, infer, and format. It is still a poor substitute for a calculator, a live database, a calendar API, a search service, or a domain-specific computation engine. This is not a moral failure. It is just architecture.

Athena connects an LLM to external tools through a structured orchestration layer. Tools are registered with schemas, the system monitors whether a user query requires outside execution, extracts parameters, calls the relevant API, then feeds the result back into the conversation. The model remains the linguistic interface. The tools do the parts that should not be left to vibes.

The reported results are strong but narrow. On selected MMLU mathematics questions, Athena reaches 83% accuracy, compared with the best tested baseline at 67%. On selected science questions, it reaches 88%, compared with the best tested baseline at 79%. That is a meaningful gap, especially in mathematics, where computation matters more than polished prose.

The business implication is not “buy Athena tomorrow” or “tool use solves reasoning.” Please, let us not make the usual pilgrimage to the altar of overstatement. The better takeaway is that organisations should stop treating model choice as the whole AI strategy. For workflows involving calculations, current information, structured records, compliance checks, scheduling, finance, research, or operations, the winning design may come from connecting a competent model to reliable tools.

The boundary is equally clear. The paper evaluates 100 mathematics and 100 science questions. It does not test production latency, cost, security, adversarial tool calls, bad parameter extraction, broken APIs, auditability, or whether the framework generalises to messier enterprise workflows. The evidence supports tool orchestration as a serious design pattern. It does not support declaring the scaling era over while standing dramatically beside a bar chart.

The familiar failure: asking a poet to do a spreadsheet’s job

Anyone who has used a frontier model for quantitative work has seen the performance split. Ask it to explain a concept and it may sound like a patient tutor. Ask it to perform a multi-step calculation, preserve every constraint, select the correct unit, and output the exact answer, and suddenly the tutor has misplaced the denominator.

This is the gap Athena targets. The paper starts from an increasingly practical observation: language models are strong at natural language processing, but weak when a task requires access to current data or active computational capability. A static model cannot know today’s weather unless connected to a weather service. It cannot manage a real meeting unless connected to a calendar. It can imitate arithmetic, but imitation is not computation. Close enough is not a business process.

The old reflex was to ask for a larger model. More parameters, more training, more expensive inference, more executive confidence in a procurement slide. Athena’s argument is quieter and more useful: keep the model, but change what it is allowed to touch.

That is the mechanism-first reading of the paper. The contribution is not merely that Athena scores higher on a benchmark. The contribution is the workflow pattern behind the score: schema-mediated delegation from the LLM to external tools.

Athena’s useful idea is not “tools”; it is controlled delegation

Many systems now claim to connect models to tools. The difference between a demo and an operational architecture is whether the handoff is explicit enough to inspect, manage, and improve.

Athena is built around a sequence of components that turn a user query into tool-assisted response generation. The paper names five main elements:

Component	What it does	Operational consequence
ExternalServiceIntegrator	Manages the tool repository and registers tools through schema-like descriptions	Tools become discoverable capabilities rather than hard-coded tricks
MessageSubmission	Handles user query submission and conversational context	The model receives the task in a consistent interaction flow
RunMonitoring	Detects when external tools may be needed	Tool use becomes a monitored decision rather than an accidental prompt flourish
HandleRequiredAction	Extracts parameters, formats calls, and invokes the external API	Natural language is converted into executable structured action
UpdateMessage	Integrates the tool output back into the LLM dialogue	The final answer can combine language fluency with external execution

The centre of gravity is the tool schema. Each tool is described with its name, function, expected arguments, and descriptions. The authors illustrate this with a simple function definition in the spirit of Pydantic-style schemas. The point is not that adding two integers is exciting. It is that a tool becomes legible to the system: what it does, what it needs, and what it returns.

That legibility matters. Without it, “tool use” becomes another magical phrase, like “agentic,” often used to mean “we let the model wander around until something happened.” Athena’s approach is more disciplined. The LLM analyses the user’s query, identifies whether a registered tool is relevant, extracts the parameters, sends a formatted request, receives the tool output, and updates the conversation.

A model with tools is not automatically smarter. A model with tools and a sane handoff protocol is less likely to confuse language generation with execution. That distinction is where most of the business value lives.

The plugged-in tools are ordinary, which is exactly the point

Athena’s evaluation integrates a small set of familiar services: Wolfram Alpha, Google SERPer, ArXiv, OpenWeatherMap, and Google Calendar. This list is not exotic. It is almost aggressively normal.

That normality is the useful part. Most enterprise AI value does not require inventing a new reasoning paradigm under fluorescent lighting. It requires a model that can call the right system of record, compute accurately, retrieve relevant material, schedule correctly, and return the result in language a human can use.

In the paper’s implementation discussion, Athena is also described through a LangChain and Unify setup. LangChain acts as middleware for connecting LLMs and tools, while Unify provides access to different open-source LLMs through a common API. The practical implication is modularity: the orchestration layer can sit above multiple models and tools rather than being welded to one model choice.

This matters for operators because model markets change quickly. The best model this quarter may be merely adequate next quarter. APIs change. Costs move. Governance requirements mutate, because apparently organisations enjoy making architecture behave like paperwork origami. A modular framework lets teams improve pieces without rebuilding the whole system.

The paper’s architecture therefore suggests a useful design principle: do not put all intelligence in the model. Put some of it in the routing, some in the schemas, some in the tools, and some in the monitoring layer. The system becomes less glamorous, but more useful. A tragedy for conference keynotes, perhaps. A win for operations.

The benchmark result is evidence for the mechanism, not a magic scoreboard

The evaluation uses selected MMLU mathematics and science questions. For mathematics, the authors build a 100-question test set from Elementary Mathematics, High School Mathematics, and College Mathematics. For science, they use 100 questions from high school and college Physics, Chemistry, and Biology.

Each question is presented as a multiple-choice prompt with four options, and the model is instructed to return a JSON answer. Responses are compared against the dataset’s correct answer. The baselines are GPT-3.5, GPT-4o, LLaMA-Large, Mistral-Large, and Phi-Large.

The headline numbers are straightforward:

Model	Mathematics accuracy	Science accuracy
GPT-3.5	36%	56%
GPT-4o	53%	77%
LLaMA-Large	67%	79%
Mistral-Large	57%	66%
Phi-Large	47%	66%
Athena Framework	83%	88%

The mathematics result is the sharper signal. Athena beats the best baseline, LLaMA-Large, by 16 percentage points. In science, the gain is 9 percentage points over the best baseline. The paper interprets this difference sensibly: science questions include more direct recall of concepts and definitions, where strong language models already perform relatively well; mathematics more often punishes fuzzy internal approximation.

That pattern is important. Athena’s advantage is not evenly magical across all knowledge work. It grows when the task requires execution. A calculator does not make a model wise. It makes arithmetic less hostage to token prediction. Wolfram Alpha does not solve every reasoning problem. It gives the system an external computational spine where the LLM’s internal representations are soft.

The paper’s results should therefore be read as mechanism-consistent evidence. The framework improves performance most clearly where tool access addresses the actual failure mode.

What each part of the experiment supports

The paper does not contain a large ablation suite, robustness analysis, or production stress test. It is closer to a framework proposal with a focused benchmark comparison. That does not make it useless. It just tells us what kind of evidence we are looking at.

Paper element	Likely purpose	What it supports	What it does not prove
Athena architecture	Implementation detail and core technical contribution	The system has a structured way to register tools, monitor tool need, call APIs, and update responses	That the routing will be reliable under messy enterprise conditions
Integrated tools list	Implementation detail	The framework can connect to computational, retrieval, academic, weather, and calendar services	That all tools are equally useful or safely callable
MMLU mathematics test	Main evidence	Tool-assisted execution can improve accuracy on selected math questions	General superiority across all quantitative tasks
MMLU science test	Main evidence	Tool use can improve selected science reasoning performance	General scientific reasoning reliability or lab-grade correctness
Baseline comparison	Comparison with standalone models	Athena outperforms tested models on this setup	That tool systems always beat larger models
Absence of ablations	Boundary, not evidence	The paper remains focused and readable	Which component contributes most to the gain

That last row matters. Without ablations, we do not know how much of Athena’s gain comes from Wolfram Alpha specifically, from prompting format, from tool routing, from base model selection, or from the interaction among these elements. The results are promising. They are not a component-level causal decomposition.

For business use, that means teams should not copy the architecture as religious doctrine. They should copy the discipline: register tools explicitly, separate interpretation from execution, measure tool-use decisions, and test the system against workflow-specific cases.

The misconception: tools do not prove size is dead

The tempting interpretation is obvious: Athena beats GPT-4o and other large models, therefore tools beat size. Convenient. Punchy. Slightly too neat, which is how we know it is dangerous.

The better interpretation is narrower. In selected educational multiple-choice tasks, a tool-augmented framework outperforms several standalone LLM baselines. This is strong enough to influence design choices. It is not strong enough to settle the scaling debate, prove general dominance, or declare that model capability no longer matters.

A weak base model with tools can still misunderstand the question, call the wrong API, extract the wrong parameters, or misread the returned result. A strong base model may use tools more effectively because it can parse intent and constraints better. Tool orchestration and model scale are not enemies. They are different levers.

The paper’s own results hint at this. Baseline performance varies substantially. In science, GPT-4o reaches 77% and LLaMA-Large reaches 79%, not far below Athena’s 88%. In mathematics, the gap widens because external computation better matches the task bottleneck. The business question is not “tools or models?” It is “which part of this workflow should be handled by language modelling, and which part should be delegated?”

That is less viral. It is also more likely to survive contact with reality.

Where businesses should copy Athena’s pattern

The operational lesson is especially relevant in workflows where the answer depends on information or execution outside the model.

Workflow type	What the LLM should do	What the tool should do	Why Athena’s pattern helps
Finance analytics	Interpret user intent, explain outputs, format summaries	Pull market data, compute ratios, run models	Reduces arithmetic and stale-data errors
Compliance support	Parse policy questions, map them to procedures	Retrieve current rules, check records, log actions	Separates language from auditable evidence
Education platforms	Explain concepts, adapt tone, guide learners	Calculate, verify answers, retrieve references	Improves precision without removing tutoring flexibility
Operations dashboards	Translate natural language requests into actions	Query databases, update tickets, schedule tasks	Makes AI an interface to systems, not a hallucinating dashboard
Research assistance	Summarise, compare, and contextualise papers	Search sources, retrieve metadata, manage citations	Keeps retrieval grounded in external databases
Scheduling and admin agents	Understand constraints and preferences	Access calendars, create events, check availability	Prevents “helpful” fictional scheduling

The ROI logic is not simply that tool use improves accuracy. It is that tool use can move reliability from the model’s statistical memory into systems already built for precision. Enterprises have spent decades creating databases, ERPs, CRMs, calendars, permission systems, audit logs, search indexes, and calculation engines. Ignoring all of that because a chatbot can write a confident paragraph would be an expensive form of amnesia.

A better enterprise AI stack treats the LLM as the interface and coordinator. It should not be the calculator, database, auditor, scheduler, and compliance officer simultaneously. That is not intelligence. That is job-title hoarding.

The implementation burden shifts from prompting to orchestration

Athena also points to a less comfortable truth: once tools enter the system, the hard problem moves.

A standalone chatbot can be improved with prompts, examples, and model upgrades. A tool-using system requires orchestration design. Teams must define tool schemas, parameter requirements, error handling, authentication, permission boundaries, logging, tool ranking, fallback behaviour, and response integration. The model may be fluent, but the workflow still needs plumbing.

This is where many enterprise agent pilots quietly suffer. The demo works because the happy path is obvious. The production version fails because real users ask ambiguous questions, APIs return unexpected formats, permissions block access, calendars conflict, databases disagree, and nobody decided what the agent should do when the calculator returns an answer that contradicts the model’s reasoning.

Athena’s paper does not solve all of this. It gives a framework-level route through part of the problem. The useful management conclusion is that tool integration is not a feature toggle. It is a product architecture decision.

If an organisation wants tool-augmented AI, it should evaluate at least four layers:

Layer	Question to ask
Tool eligibility	Which tasks genuinely require external execution or current data?
Schema quality	Are tool descriptions precise enough for reliable selection and parameter extraction?
Control and governance	Who is allowed to call which tools, with what data, under what conditions?
Evaluation	Are we measuring final answer accuracy, tool-call accuracy, latency, cost, and failure recovery separately?

The last point is easy to skip and expensive to rediscover. A tool-using model can fail in more ways than a standalone model. It can answer wrongly, call the wrong tool, call the right tool with wrong parameters, mishandle the returned data, or present a correct result with an incorrect explanation. More capability means more failure surfaces. Progress, annoyingly, comes with a maintenance bill.

The boundaries: promising benchmark, not production proof

The paper is concise and its evidence is focused. That focus should shape how operators interpret it.

First, the evaluation size is small: 100 mathematics questions and 100 science questions. The numbers are informative, but not exhaustive. A larger benchmark suite could reveal weaker performance on different question types, more complex multi-step tasks, or adversarially phrased prompts.

Second, the paper does not provide a detailed ablation analysis. We do not know which integrated tool contributed most, how often Athena selected tools correctly, or how performance changes if one tool is removed. This limits how confidently we can attribute gains to specific components.

Third, the comparison is against standalone model responses under the described testing procedure. It is not a comparison against other mature agent frameworks, retrieval pipelines, solver-augmented systems, or carefully prompted tool-use baselines.

Fourth, the production concerns remain outside the experiment. Latency, cost, API downtime, authentication, privacy, security, prompt injection through retrieved content, audit trails, and human override are not side issues in business systems. They are the part where the legal department develops facial tension.

None of these limitations erase the result. They locate it. Athena is evidence that structured tool integration can materially improve accuracy in selected educational reasoning tasks. It is not a universal guarantee that any LLM plus any API becomes enterprise-grade intelligence.

The better lesson: stop buying models as if workflows do not exist

The most useful message from Athena is architectural humility. LLMs are impressive, but they are not complete business systems. Their strength is language-mediated interpretation. Their weakness appears when the task demands exact execution, current data, or structured action.

Athena’s framework works because it assigns responsibilities more sensibly. The LLM reads and reasons over the user’s request. The schema defines available external capabilities. The monitoring layer decides when help is needed. The required-action handler turns language into structured API calls. The update step returns the external result to the conversation.

That design does not make the model obsolete. It makes the model useful in a system that knows its limits.

For Cognaptus readers, the takeaway is practical: when designing AI workflows, do not begin with “Which model is biggest?” Begin with “Which parts of this task should never be left inside the model?” Then wire those parts to tools, measure the handoffs, and treat orchestration as a first-class product layer.

The future of useful AI will not be built by asking one model to know everything. It will be built by teaching models when to stop pretending and start calling the right tool.

A modest proposal, apparently radical.

Cognaptus: Automate the Present, Incubate the Future.

Nripesh Niketan and Hadj Batatia, “Integrating External Tools with Large Language Models (LLM) to Improve Accuracy,” arXiv:2507.08034, 2025. https://arxiv.org/abs/2507.08034 ↩︎

TL;DR for operators#

The familiar failure: asking a poet to do a spreadsheet’s job#

Athena’s useful idea is not “tools”; it is controlled delegation#

The plugged-in tools are ordinary, which is exactly the point#

The benchmark result is evidence for the mechanism, not a magic scoreboard#

What each part of the experiment supports#

The misconception: tools do not prove size is dead#

Where businesses should copy Athena’s pattern#

The implementation burden shifts from prompting to orchestration#

The boundaries: promising benchmark, not production proof#

The better lesson: stop buying models as if workflows do not exist#