A notebook is not just a file. In most companies, it is where the analyst tried three joins, fixed the date column, discovered the leakage, reran the model, cursed quietly, and eventually produced the chart that made it into Monday’s meeting.

Then the notebook was archived, copied, half-forgotten, and treated as residue.

JUPITER starts from the opposite premise: that residue is training data. The paper, “JUPITER: Enhancing LLM Data Analysis Capabilities via Notebook and Inference-Time Value-Guided Search”, builds a dataset called NbQA from real Jupyter notebooks, then uses notebook execution itself as the substrate for tree search.[1] The result is not another “ask the model to analyse a CSV” demo with a heroic prompt and a suspiciously clean dataset. It is a more interesting proposition: data analysis agents should learn from the paths analysts actually took, including the dead ends, and should search over executable notebook states rather than emit one long, fragile answer.

That is the useful part. Not the leaderboard theatre. The mechanism.

The paper’s real object is the notebook trace

The obvious summary is that the paper introduces a dataset, NbQA, and a framework, JUPITER. Accurate, but mildly anaesthetic. The better reading is that the authors are turning notebooks into a structured memory system for tool-using agents.

They begin by crawling roughly 1.6 million Jupyter notebooks and 3.2 million associated data files from about 47,000 GitHub repositories. The raw crawl is aggressively filtered. Notebooks are removed if their JSON structure is invalid, if cells were not executed in order, if cells are unexecuted, if execution raised unhandled errors, if the data is too small, or if the notebook appears to use common educational or competition datasets such as Iris, Titanic, Boston Housing, Wine Quality, MNIST, and similar usual suspects. Neural-network, pretrained-model, GPU-heavy, and external-API-style notebooks are also filtered out, keeping the scope closer to classical Python data analysis.
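To make the structural filters concrete, here is a minimal sketch of three of them: invalid JSON, unexecuted or out-of-order cells, and unhandled errors. It is not the authors' pipeline. The field names (`cells`, `cell_type`, `execution_count`, `outputs`, `output_type`) come from the standard Jupyter notebook format; the function name `passes_basic_filters` and the rule that execution counts must strictly increase from top to bottom are assumptions made for illustration.

```python
import json


def passes_basic_filters(notebook_json: str) -> bool:
    """Hypothetical helper: apply three structural filters to a raw .ipynb string."""
    try:
        nb = json.loads(notebook_json)
    except json.JSONDecodeError:
        return False  # invalid JSON structure
    if not isinstance(nb, dict):
        return False

    code_cells = [c for c in nb.get("cells", []) if c.get("cell_type") == "code"]
    if not code_cells:
        return False  # nothing to learn from

    counts = [c.get("execution_count") for c in code_cells]
    if any(n is None for n in counts):
        return False  # at least one code cell was never executed

    # Assumption: "executed in order" means strictly increasing execution counts.
    if any(later <= earlier for earlier, later in zip(counts, counts[1:])):
        return False

    for cell in code_cells:
        if any(out.get("output_type") == "error" for out in cell.get("outputs", [])):
            return False  # an unhandled exception was recorded in the outputs
    return True
```

The dataset-size, dataset-name, and library-based filters described above would sit on top of checks like these and need heuristics the sketch does not attempt.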

That matters because the paper is not trying to build a general AI research assistant. It is building a training and evaluation resource for the kind of data analysis that already happens in Python notebooks: summary statistics, distributions, correlations, outlier detection, preprocessing, feature engineering, classical machine learning, and visualisation.

The extraction process then uses GPT-4o mini and GPT-4o, but not in the way a casual reader might assume. This is the first misconception worth removing. NbQA is not mainly “synthetic QA generated by an LLM”. The answers are meant to be grounded in notebook code and outputs. GPT-4o is used to extract representative subtasks, add constraints, standardise output formats, generate multi-step solution traces from existing notebook content, and filter bad examples. The paper explicitly prohibits fabricating outputs or skipping steps. Except for visualisation tasks, numeric answers must be directly extractable from code outputs.

After filtering, NbQA contains 38,635 task-solution pairs. Of these, 6,845 have complete, fully accessible data dependencies and low randomness, making them suitable for interactive execution and value-model training. Another 31,790 are still useful for supervised fine-tuning, even if they are less clean for full execution. The authors also generated 12,976 visualisation tasks, but excluded them from the main dataset because the reported experiments use text-based models and benchmarks rather than visual evaluation.

The important move is not just scale. It is standardisation.

Each extracted task receives constraints and an output format, using labels such as @answer_name[answer], so the system can evaluate outputs automatically. This is a small implementation detail with a large operational meaning. Most enterprise analytics failures are not caused by the absence of a model. They are caused by ambiguity: wrong metric definitions, hidden preprocessing assumptions, inconsistent output formats, and the analyst equivalent of “you know what I meant”. NbQA tries to make those assumptions explicit enough that a tool-using agent can be trained, executed, and judged.
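A labelled output format is what makes automatic evaluation cheap. The sketch below assumes labels look like `@name[value]`, for example `@mean_fare[34.65]`; the exact grammar, the regular expression, and the `parse_answers` helper are illustrative assumptions rather than the paper's parser.

```python
import re

# Assumption: a label is "@", a word-like name, then the value in square brackets.
ANSWER_PATTERN = re.compile(r"@(\w+)\[([^\]]*)\]")


def parse_answers(text: str) -> dict[str, str]:
    """Hypothetical helper: collect every labelled answer from a model response."""
    return {name: value.strip() for name, value in ANSWER_PATTERN.findall(text)}


response = "The average is stable. @mean_fare[34.65] @n_rows[ 891 ]"
answers = parse_answers(response)
```

Once answers are keyed by name, a grader can compare them field by field instead of guessing which number in a paragraph was meant to be the result.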

JUPITER makes analysis a tree, not a monologue

Once the dataset exists, JUPITER reframes notebook-based data analysis as a sequential decision problem. The root node is the initial question and data context. Each child node is created by generating a thought-action pair, executing the code in a Jupyter-like environment, and recording the output. A path through the tree is a partial notebook. A terminal node is either a candidate final answer or a failed branch.
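One way to picture that tree is as a linked set of executed steps, where walking from any node back to the root reconstructs a partial notebook. The `Node` class and its field names below are hypothetical, and the sketch omits the actual model call and code execution.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity comparison only; avoids recursive equality checks
class Node:
    """Hypothetical search-tree node: one thought-action pair plus its output."""
    thought: str = ""
    code: str = ""
    output: str = ""
    parent: Optional["Node"] = None
    children: list["Node"] = field(default_factory=list)
    final_answer: Optional[str] = None
    failed: bool = False

    def add_child(self, thought: str, code: str, output: str) -> "Node":
        child = Node(thought=thought, code=code, output=output, parent=self)
        self.children.append(child)
        return child

    @property
    def is_terminal(self) -> bool:
        # A branch ends with a candidate answer or with a failure.
        return self.final_answer is not None or self.failed

    def partial_notebook(self) -> list[str]:
        """Code cells along the path from the root (exclusive) to this node."""
        cells, node = [], self
        while node.parent is not None:
            cells.append(node.code)
            node = node.parent
        return cells[::-1]


root = Node()  # the root holds only the question and data context
load = root.add_child("Load the file", "df = pd.read_csv('sales.csv')", "")
mean_step = load.add_child("Compute the mean", "df['x'].mean()", "3.2")
median_step = load.add_child("Try the median instead", "df['x'].median()", "3.0")
```

Here `mean_step` and `median_step` share the same first cell, which is the point: a sibling branch reuses the work already done rather than restarting the analysis.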

This is a better mental model than the standard ReAct loop for hard data work. ReAct is basically a single walk: think, run code, observe, think again. That can work, but one bad intermediate step poisons the rest of the trajectory. JUPITER instead searches across multiple partial notebook states. It can try several branches, keep the ones that look promising, and preserve candidate answers discovered along the way.

The paper uses Monte Carlo Tree Search during trajectory collection. For each of the 6,845 fully interactive NbQA tasks, a fine-tuned model builds a search tree. Branches terminate when they produce a final answer, exceed the maximum depth, hit repeated errors, or otherwise fail. Correct terminal answers receive positive reward; invalid outputs, errors, and failures receive negative reward. Those rewards are backpropagated through the tree, producing value estimates for intermediate notebook states.
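The backpropagation step is simple enough to sketch. The version below assumes a reward of +1 for a correct terminal answer and -1 for a failure, and averages rewards over visits to get a Q-value; the source only says rewards are positive or negative, so the exact magnitudes and the `SearchNode` class are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class SearchNode:
    """Hypothetical node that tracks only the statistics MCTS needs."""
    parent: Optional["SearchNode"] = None
    visits: int = 0
    value_sum: float = 0.0

    @property
    def q(self) -> float:
        # Mean reward of all terminal outcomes reached through this node.
        return self.value_sum / self.visits if self.visits else 0.0


def backpropagate(leaf: SearchNode, reward: float) -> None:
    """Add a terminal reward to the leaf and to every ancestor up to the root."""
    node: Optional[SearchNode] = leaf
    while node is not None:
        node.visits += 1
        node.value_sum += reward
        node = node.parent


root = SearchNode()
shared_step = SearchNode(parent=root)
correct_leaf = SearchNode(parent=shared_step)
failed_leaf = SearchNode(parent=shared_step)

backpropagate(correct_leaf, +1.0)  # assumed reward for a correct answer
backpropagate(failed_leaf, -1.0)   # assumed reward for an error or wrong answer
```

After these two rollouts `shared_step.q` is 0.0: the intermediate state led to one success and one failure, which is exactly the kind of graded signal a value model can later be trained to predict.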

The collected trajectories are then used to train a value model. Technically, the value model attaches a regression head to the base model fine-tuned on NbQA. It takes the notebook context as input and predicts a scalar value in $[-1, 1]$, trained against normalised MCTS-derived Q-values. Operationally, it learns something more practical: “Does this partial analysis look like it is on the way to a correct answer?”

That is the part many agent stacks still lack. They can execute tools. They can log traces. They can retry. But they often cannot judge whether a partial workflow is becoming useful until the final answer fails, at which point the model has already spent the budget on nonsense. JUPITER’s value model is a learned triage function for notebook states.

A compact version of the mechanism looks like this:

| Stage | What happens | Why it matters |
|---|---|---|
| Notebook mining | Real notebooks and data files are crawled, filtered, and deduplicated | The training material reflects actual tool-use patterns, not only prompt-written examples |
| Task extraction | GPT-4o extracts self-contained tasks, answers, constraints, and formats | The result becomes evaluable rather than merely readable |
| SFT | Open models are fine-tuned on multi-turn notebook-style traces | The model learns the syntax of interactive analysis |
| MCTS collection | Search trees generate successful and failed trajectories | The system gets evidence about which partial states tend to work |
| Value-model training | A regression head learns to score notebook states | Inference can focus on promising branches |
| Inference search | JUPITER expands selected states and collects candidate answers | The agent becomes less dependent on one fragile chain of thought |

The punchline is slightly counterintuitive: during inference, JUPITER removes the exploration term from PUCT by setting $c_{\text{puct}} = 0$. In ordinary MCTS, exploration is precious because the system must discover unknown territory. Here, the authors argue that data-analysis search space is vast and sparse: most branches are invalid, and correct solutions cluster in relatively few high-quality regions. Once a value model has learned what promising notebook states look like, extra exploration becomes expensive curiosity. Charming in humans. Costly in production.
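The effect of dropping the exploration term is easiest to see in the selection rule itself. The sketch below uses the common AlphaZero-style form of PUCT, score = Q + c_puct · P · sqrt(N_parent) / (1 + N_child); the paper's exact variant may differ, and the priors, visit counts, and Q-values are made-up numbers for illustration.

```python
import math


def puct_score(q: float, prior: float, parent_visits: int,
               child_visits: int, c_puct: float) -> float:
    """AlphaZero-style PUCT: exploitation term plus a visit-count exploration bonus."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + exploration


def select(children: list[dict], parent_visits: int, c_puct: float) -> str:
    """Return the name of the child with the highest PUCT score."""
    best = max(
        children,
        key=lambda ch: puct_score(ch["q"], ch["prior"], parent_visits,
                                  ch["visits"], c_puct),
    )
    return best["name"]


# Illustrative numbers only: one well-visited promising state, one untried state.
children = [
    {"name": "promising", "q": 0.6, "prior": 0.5, "visits": 10},
    {"name": "untried", "q": 0.1, "prior": 0.5, "visits": 0},
]

greedy_choice = select(children, parent_visits=10, c_puct=0.0)
curious_choice = select(children, parent_visits=10, c_puct=1.5)
```

With `c_puct=0.0` the rule collapses to picking the highest Q, so `greedy_choice` is the promising state; with `c_puct=1.5` the unvisited child's bonus dominates and `curious_choice` is the untried one. In a space where most untried branches are invalid code, that bonus is where the budget goes to die.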

The dataset alone helps; the search changes the slope

The first main evidence is supervised fine-tuning on NbQA. The authors sample 8,975 NbQA instances for multi-turn ReAct-style fine-tuning, excluding benchmark data from both fine-tuning and trajectory collection. The model generates thoughts and code actions; code is executed in a sandbox; observations are fed back into the interaction.

On InfiAgent-DABench, fine-tuning substantially improves smaller models:

| Model | Before SFT | After NbQA SFT | Absolute gain |
|---|---|---|---|
| Mistral-7B-Instruct-v0.3 | 2.33% | 59.14% | +56.81 pp |
| Llama-3.1-8B-Instruct | 48.25% | 69.65% | +21.40 pp |
| Qwen2.5-7B-Instruct | 43.97% | 68.09% | +24.12 pp |
| Qwen2.5-14B-Instruct | 69.65% | 77.04% | +7.39 pp |

This is the first practical result: the notebook-derived dataset teaches models useful multi-step data-analysis behaviour. The stronger 14B model starts higher and gains less; the smaller models gain more. That is what one would expect if the dataset is teaching interaction discipline rather than simply injecting benchmark-specific trivia.
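The gains in the table are plain percentage-point differences, and they can be recomputed directly from the before and after scores reported above. The snippet is only arithmetic on those figures; the variable and function names are arbitrary.

```python
# Accuracy on InfiAgent-DABench (%), as reported above: (before SFT, after NbQA SFT).
scores = {
    "Mistral-7B-Instruct-v0.3": (2.33, 59.14),
    "Llama-3.1-8B-Instruct": (48.25, 69.65),
    "Qwen2.5-7B-Instruct": (43.97, 68.09),
    "Qwen2.5-14B-Instruct": (69.65, 77.04),
}


def absolute_gain_pp(before: float, after: float) -> float:
    """Gain in percentage points, rounded to two decimals."""
    return round(after - before, 2)


gains = {model: absolute_gain_pp(b, a) for model, (b, a) in scores.items()}
smallest_gain_model = min(gains, key=gains.get)
highest_baseline_model = max(scores, key=lambda m: scores[m][0])
```

Both `smallest_gain_model` and `highest_baseline_model` resolve to Qwen2.5-14B-Instruct, which is the pattern the paragraph above describes: the model that started highest gained least.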

The second evidence layer is JUPITER’s inference-time search. On InfiAgent-DABench, the authors compare ReAct, majority voting, agent frameworks such as AutoGen, TaskWeaver, and Data Interpreter, and JUPITER variants with or without a value model and with or without the PUCT exploration term.

The best configuration is clear:

| Setting | Qwen2.5-7B SFT | Qwen2.5-14B SFT |
|---|---|---|
| ReAct | 68.09% | 77.04% |
| Majority voting | 75.10% | 83.66% |
| JUPITER without value model, no exploration term | 70.04% | 79.38% |
| JUPITER with value model, with exploration term | 68.87% | 74.71% |
| JUPITER with value model, no exploration term | 77.82% | 86.38% |

The 14B JUPITER configuration reaches 86.38%, slightly above the 85.99% reported for TaskWeaver with GPT-4o in the comparison table. The 7B JUPITER configuration reaches 77.82%, up from 68.09% for its ReAct baseline and 70.04% for search without the value model.

The ablation is more important than the headline. Search alone helps a bit. Majority voting helps too. But the strongest result appears when the search is guided by the value model and the exploration term is removed. With the exploration term present, performance drops. In the 14B case, JUPITER with value model and exploration reaches only 74.71%, worse than the same model’s plain ReAct score. That is not a rounding error; it is a warning label. “More agentic exploration” is not automatically better. Sometimes it is just a very expensive way to wander into bad code.

The appendix is doing stress testing, not telling a second story

The supplementary experiments are useful, but they should not be inflated into a grand claim that JUPITER solves all reasoning. They mostly test whether the mechanism survives changes in task format, budget, and training strategy.

| Evidence item | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| NbQA SFT on InfiAgent-DABench | Main evidence for dataset utility | Real notebook-derived traces improve multi-step data-analysis performance | That the dataset covers all enterprise analytics tasks |
| JUPITER variants with and without value model / exploration term | Ablation | The value model and removal of exploration are central to the best result | That exploration is always bad in all domains |
| DSBench data-modeling experiments | Generalisation and task-format transfer | Value-guided search can help even when the task format differs from NbQA QA-style tasks | That the model understands messy business context without simplification |
| AIME 2025 math experiments | Exploratory out-of-domain extension | NbQA SFT improves tool-use and numerical reasoning somewhat outside data analysis | That JUPITER is a competitive math-reasoning system |
| Hyperparameter plots for iterations, exploration, and temperature | Robustness / sensitivity | More iterations help; high exploration hurts; temperature around 0.5–0.7 appears best in their setup | That those settings transfer unchanged to every deployment |
| Failed GRPO-style RL attempt | Implementation detail and negative result | Sparse rule-based rewards are hard for multi-turn tool-use RL | That RL is useless for data agents |

DSBench is particularly interesting because it is not the same problem format as InfiAgent-DABench. DSBench data-modeling tasks are adapted from Kaggle competitions: train a model, produce predictions, write a submission file, and score it against ground truth. The authors use vanilla Qwen2.5-7B and 14B models, but apply the trained value model during inference-time search. They also simplify the original task descriptions by removing redundant platform text, acknowledgements, company lists, and other clutter.

That simplification step is not a footnote. It materially affects interpretation. After task simplification, ReAct with Qwen2.5-7B and 14B reaches 63.51% and 66.22% task completion at 40 iterations. With value-model-assisted search, the 7B and 14B models reach 89.19% and 98.65% at 50 iterations. The paper’s own reading is blunt: much of DSBench’s difficulty lies in excessive and redundant context, not necessarily in the small model’s inability to solve the modelling task.

For business readers, this is both useful and inconvenient. It suggests that instruction hygiene can unlock a lot of performance. It also means one should not attribute the whole DSBench gain to deep analytical intelligence. Some of it is search. Some of it is value guidance. Some of it is removing noise before the model starts.

AIME 2025 is even more bounded. The benchmark is out-of-domain math competition reasoning, and JUPITER is not adapted for math-specific workflows. The vanilla Qwen2.5-7B model gets 0% with Python-tool-style solving. After NbQA SFT, ReAct reaches 10%. JUPITER without a value model reaches 26.7% under the “OR” metric, meaning at least one sampled answer is correct, but only 13.3% by majority vote. With the value model, that rises to 33.3% OR and 20.0% vote. This shows improved candidate discovery and some transfer of tool-use reasoning. It does not show that a notebook-trained value model suddenly understands olympiad mathematics. Let us remain adults.
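The gap between the “OR” and vote numbers is a property of the metrics, not a mystery. A minimal sketch, with invented candidate answers and hypothetical function names, shows how the same set of candidates can pass one and fail the other.

```python
from collections import Counter


def or_correct(candidates: list[str], truth: str) -> bool:
    """OR metric: the task counts as solved if any candidate matches."""
    return truth in candidates


def vote_correct(candidates: list[str], truth: str) -> bool:
    """Vote metric: the task counts as solved only if the most common candidate matches."""
    if not candidates:
        return False
    winner, _count = Counter(candidates).most_common(1)[0]
    return winner == truth


# Invented example: the right answer was found once but outvoted.
candidates = ["12", "15", "12", "7"]
found_somewhere = or_correct(candidates, "15")
won_the_vote = vote_correct(candidates, "15")
```

Here `found_somewhere` is true and `won_the_vote` is false. OR measures whether search discovered the answer at all; vote measures whether the system would actually have returned it, which is why the vote figure is the more sobering one.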

The business value is not “smaller model beats GPT-4o”

The tempting business headline is that a 14B open model plus JUPITER beats GPT-4o-based agents on a benchmark. That is technically attractive and strategically incomplete.

The real business value is that JUPITER gives a pattern for converting analytics exhaust into operational capability. Many organisations already have repositories of notebooks, SQL scratchpads, experiment logs, BI transformation scripts, and model-building artefacts. Usually these are treated as documentation at best and clutter at worst. JUPITER suggests they can become three things:

First, a fine-tuning corpus. Historical workflows can teach a smaller model how the organisation actually performs analysis: preferred libraries, naming conventions, metrics, preprocessing choices, and output formats.

Second, an evaluation suite. Standardised tasks with executable files and strict answer labels can become regression tests for data agents. This is more useful than asking whether a model “sounds like a good analyst”, a managerial ritual that should probably be taxed.

Third, a value-learning substrate. Successful and failed trajectories can train a scoring model that identifies promising intermediate states. In production, that scoring model can reduce wasted tool calls and help decide when to continue, branch, stop, or ask a human.

A practical pathway would look like this:

| Business asset | JUPITER-style conversion | Operational use |
|---|---|---|
| Old notebooks | Extract self-contained tasks, constraints, outputs, and data dependencies | Build internal agent training and evaluation sets |
| Executed cells and logs | Convert into thought-code-output trajectories | Teach models how analysis progresses over time |
| Failed runs and errors | Keep as negative trajectories | Train value models to avoid bad branches |
| Reproducible outputs | Standardise labels and tolerance rules | Automate grading and regression testing |
| Sandbox execution traces | Store code, outputs, files, hashes, and metadata | Support audit, debugging, and governance |
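The “tolerance rules” row deserves a concrete shape, because it is where automated grading usually breaks. The sketch below is one possible rule, not the paper's grader: numeric answers are compared with a relative tolerance, and anything non-numeric falls back to a case-insensitive string match. The `grade` function and its default tolerances are assumptions.

```python
import math


def grade(predicted: str, expected: str,
          rel_tol: float = 1e-6, abs_tol: float = 1e-9) -> bool:
    """Hypothetical grader: numeric comparison with tolerance, else normalised text match."""
    try:
        return math.isclose(float(predicted), float(expected),
                            rel_tol=rel_tol, abs_tol=abs_tol)
    except ValueError:
        # At least one side is not a number; compare as normalised strings.
        return predicted.strip().lower() == expected.strip().lower()


rounded_ok = grade("0.3333", "0.33333", rel_tol=1e-3)
strict_fail = grade("0.3333", "0.33333")
label_ok = grade("Yes ", " yes")
```

`rounded_ok` and `label_ok` are true, while `strict_fail` is false under the default tolerance. Writing the rule down per task, rather than leaving it to whoever reviews the output, is what turns an old notebook result into a regression test.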

Cognaptus inference from the paper: for enterprise analytics, the question is not only “Which foundation model should we use?” It is “What evidence do we have about our own analytical workflows, and can we turn that evidence into search guidance?” That is a more durable question. Models change. Notebook graveyards persist.

Reliability comes from execution, but governance still has to show up

The paper’s strongest design choice is execution grounding. Each branch in the search tree is not just text. It is code run in context, producing outputs, errors, and candidate answers. This creates a much better audit surface than a single generated response. You can inspect the branch, replay the code, review the intermediate outputs, and ask where the agent went wrong.

That does not automatically make the system enterprise-safe.

The paper uses sandbox execution, filtered public notebooks, and benchmark-style tasks with known answers. A business deployment would need data permissioning, secrets handling, lineage capture, row-level access controls where relevant, PII redaction, package governance, compute budgeting, and human-review policies. The JUPITER mechanism helps with observability, but it does not solve the politics of who is allowed to run what against which table. Annoying, yes. Also the part that prevents your “AI analyst” from becoming a compliance incident with a progress bar.

There is also a latency and cost boundary. JUPITER uses multiple iterations and expands multiple candidate branches per selected node. On InfiAgent-DABench, the reported search setting uses up to 40 iterations, three expansions per iteration, maximum tree depth 10, and up to three code execution errors per path. On DSBench, the search depth is higher and the iteration budget reaches 50. This may be acceptable for batch analysis, model-building, and high-value workflows. It may be too slow for interactive dashboard Q&A unless budgets are adaptive.
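A back-of-envelope budget makes the latency point tangible. The defaults below are the InfiAgent-DABench settings quoted above; the `SearchBudget` class, the assumption that each expansion costs exactly one generation plus one code execution, and the per-step timing are all illustrative, and value-model scoring time is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchBudget:
    """Hypothetical container for the search settings quoted in the text."""
    iterations: int = 40
    expansions_per_iteration: int = 3
    max_depth: int = 10
    max_errors_per_path: int = 3

    def max_expansions(self) -> int:
        # Upper bound on generate-and-execute steps for one question.
        return self.iterations * self.expansions_per_iteration

    def worst_case_seconds(self, seconds_per_expansion: float,
                           parallel_expansions: bool = False) -> float:
        # If sibling expansions run concurrently, wall time scales with iterations only.
        steps = self.iterations if parallel_expansions else self.max_expansions()
        return steps * seconds_per_expansion


budget = SearchBudget()
upper_bound_steps = budget.max_expansions()
sequential_seconds = budget.worst_case_seconds(5.0)  # 5 s per step is an assumed figure
parallel_seconds = budget.worst_case_seconds(5.0, parallel_expansions=True)
```

Under those assumptions a single question can cost up to 120 generate-and-execute steps, or ten minutes sequentially at five seconds a step. Fine for an overnight modelling job; a long time to stare at a dashboard.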

The value model also has a transfer boundary. It is trained from classical notebook-style tasks, with many data files in CSV format and a heavy focus on reproducible Python analysis. It may transfer to adjacent data-modelling tasks, as DSBench suggests. It should not be assumed to transfer cleanly to messy executive questions such as “why did Southeast Asia underperform last quarter?” where the answer depends on business definitions, missing data, incentives, and the delightful human habit of changing the target metric after seeing the chart.

The negative RL result is more useful than it looks

The appendix reports that the authors attempted additional GRPO-style reinforcement learning after SFT. Reward increased slightly during training, but downstream performance was worse than the SFT models, so they did not use those RL-finetuned models.

This is worth noticing. The failure mode is credible: multi-turn tool-use tasks are hard for reinforcement learning because reward is sparse, task difficulty varies, and exact rule-based answer checks may not provide sufficiently nuanced feedback. The model can learn to solve easy cases repeatedly while failing to improve on harder cases. In other words, the reward graph smiles while the product gets worse. Beautifully on-brand for machine learning.
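One mechanical reason sparse rewards hurt can be shown with the group-normalised advantage that GRPO-style methods use, where each sampled rollout is scored relative to the other rollouts for the same task. This is a general property of that normalisation, not an analysis taken from the paper; the binary rewards and the `group_advantages` helper are assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


hard_task = group_advantages([0.0, 0.0, 0.0, 0.0])   # every rollout fails
easy_task = group_advantages([1.0, 1.0, 1.0, 1.0])   # every rollout succeeds
mixed_task = group_advantages([1.0, 0.0, 0.0, 0.0])  # one rollout succeeds
```

Both `hard_task` and `easy_task` come out as all zeros, so neither contributes a gradient signal; only `mixed_task` does. With exact-match rewards on long tool-use trajectories, the hard tasks a team most wants to improve are the ones most likely to land in the all-fail group.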

For businesses, this argues against treating RL as the default next step for analytics agents. Before reinforcement learning, get the workflow data clean, standardise outputs, run execution-based evaluation, and collect both successful and failed trajectories. JUPITER’s value-model route is less glamorous than “agent RL”, but it is better aligned with what analytical work actually produces: partial states, errors, and recoveries.

What this paper directly shows, and what we can infer

The paper directly shows that NbQA fine-tuning improves several open models on InfiAgent-DABench, with especially large gains for smaller models. It directly shows that value-guided JUPITER search, with the exploration term removed, delivers the best InfiAgent-DABench results among the tested JUPITER variants. It also shows transfer-style evidence on DSBench and limited out-of-domain improvement on AIME.

Cognaptus would infer a more general product lesson: the next improvement in data agents may come less from ever-larger models and more from turning historical workflows into executable, scored, searchable state spaces. This is not a replacement for strong models. It is a way to stop wasting strong models on unguided wandering.

What remains uncertain is how well the approach performs on proprietary enterprise notebooks, inconsistent schemas, private data warehouses, access-controlled environments, subjective business analyses, and tasks where the “correct answer” is a negotiated interpretation rather than a string-matched output. Those are not minor details. They are the difference between benchmark competence and deployment reliability.

Still, JUPITER points in the right direction. It treats data analysis as something analysts already know it is: iterative, stateful, execution-grounded, and full of recoverable mistakes. The notebook is not a prompt transcript. It is a map of decisions under uncertainty.

And if an organisation already has thousands of those maps lying around, the first strategic move is not to buy a bigger model and hope. It is to stop treating the maps as trash.



[1] Shuocheng Li, Yihao Liu, Silin Du, Wenxuan Zeng, Zhe Xu, Mengyu Zhou, Yeye He, Haoyu Dong, Shi Han, and Dongmei Zhang, “JUPITER: Enhancing LLM Data Analysis Capabilities via Notebook and Inference-Time Value-Guided Search,” arXiv:2509.09245v2, 2025. https://arxiv.org/abs/2509.09245