MIRAGE-VC: Teaching LLMs to Think Like VCs (Without Drowning in Graphs)

Deal flow is rarely scarce. Attention is.

A venture-capital team may receive hundreds of startup introductions, each surrounded by founder biographies, investor histories, comparable companies, co-investment relationships, sector narratives, and enthusiastic claims about an inevitable Series A. The practical problem is not obtaining more evidence. It is deciding which fragments deserve serious attention before the partnership meeting begins.

This distinction is easy to miss when building AI systems. Give a large language model more documents, more graph neighbors, and a larger context window, and surely it should make a better decision. That sounds reasonable. It is also the assumption MIRAGE-VC is designed to challenge.

MIRAGE-VC is a venture-capital prediction framework that selects compact, task-relevant paths from investment networks, sends different evidence types to specialist LLM agents, and learns how much weight to assign to each perspective for each startup.¹ Its central contribution is not that an LLM can write a plausible investment memo. We had already cleared that rather modest bar. The more useful contribution is a mechanism for deciding which relational evidence the LLM should see before it begins writing.

The result is a system that performs better by reading less of the graph—but reading the right parts.

The Graph Is Evidence, Not the Answer

Many graph-and-LLM systems are built for questions whose answers already exist inside the graph.

A knowledge graph might contain a company, an investor, and an investment edge. Ask which investor funded the company, and the system searches for a path leading to the answer node. Retrieval quality can be judged by whether it reaches the correct entity or relation.

Startup-success prediction is different. Whether a seed-stage company will raise Series A within the next year is not currently stored as a node waiting to be discovered. It is a future outcome. The investment graph provides clues, but the answer exists outside it.

This makes venture-capital prediction an off-graph task:

The graph contains investors, startups, financing events, and relationships.
The model must predict an external outcome.
Retrieved paths are useful only when they improve that external prediction.

That final condition matters. A graph path can be factually correct, semantically interesting, and completely useless for the decision at hand.

Consider an investor connected to hundreds of companies and co-investors. A three-hop expansion can quickly produce thousands of valid paths. Some may reveal a meaningful chain of screening ability, sector expertise, syndication strength, or proximity to successful follow-on investors. Others merely document that active investors tend to know other active investors. Fascinating. The investment committee remains unmoved.

MIRAGE-VC therefore treats graph retrieval as a utility problem rather than a completeness problem.

The Real Bottleneck Is Choosing the Next Node

The paper’s most important mechanism begins with a simple question:

If one more company or investor is added to the current path, does the prediction improve?

Starting from a target startup, MIRAGE-VC expands through the investment graph one hop at a time. At each step, it considers candidate neighboring nodes and estimates the marginal value of adding each candidate to the current path.

The logic resembles information gain in a decision tree. A decision-tree split is valuable when it reduces uncertainty about the target label. MIRAGE-VC extends that intuition to graph paths: a candidate node is valuable when adding it pushes the LLM toward a more accurate and confident prediction.

The system evaluates two prompts during offline supervision:

A baseline prompt describing the current path.
An expanded prompt containing the current path plus one candidate node.

A frozen LLM produces success probabilities for both prompts. The difference becomes a task-specific gain label that rewards movement toward the correct class and greater confidence once the prediction is correct.
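One way to read this gain label is as the change in the probability the frozen LLM assigns to the true class. The sketch below assumes exactly that simplification. The function name `gain_label` and the signed-difference formula are illustrative assumptions, not the paper's exact definition.

```python
def gain_label(p_base: float, p_expanded: float, label: int) -> float:
    """Signed change in the probability placed on the true class.

    p_base and p_expanded are success probabilities from the baseline and
    expanded prompts; label is 1 for a successful startup and 0 otherwise.
    The formula is an assumed simplification for illustration.
    """
    if label not in (0, 1):
        raise ValueError("label must be 0 or 1")
    # Probability assigned to the correct class under each prompt.
    correct_base = p_base if label == 1 else 1.0 - p_base
    correct_expanded = p_expanded if label == 1 else 1.0 - p_expanded
    return correct_expanded - correct_base


# A candidate that raises the success probability helps a true positive
# and hurts a true negative by the same amount.
helps = gain_label(0.40, 0.55, label=1)
hurts = gain_label(0.40, 0.55, label=0)
```

Under this reading, the same candidate node can earn a positive label for one startup and a negative label for another, which is what makes the score task-specific rather than a generic relevance measure.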

This is not generic relevance scoring. A candidate is not selected merely because its description resembles the startup or because it occupies a central position in the network. It is selected because historical supervision suggests that including it improves prediction of the defined outcome.

That distinction separates evidence that looks relevant from evidence that has demonstrated decision value.

Expensive judgment is learned offline

Calling an LLM to score every possible graph expansion during live inference would be painfully inefficient. MIRAGE-VC avoids this by using the LLM-generated gain scores as offline supervision for a lightweight path selector.

For each expansion step, the selector receives representations of the existing path and candidate extension. A small neural network learns to rank candidates according to the gains previously estimated by the frozen LLM. At inference time, the selector chooses high-value expansions without asking the LLM to evaluate every alternative again.

The mechanism can be summarized as:

Historical graph expansions
        ↓
Frozen LLM estimates each candidate’s marginal decision value
        ↓
Lightweight selector learns the candidate ranking
        ↓
Selector retrieves compact high-value paths at inference
        ↓
LLM reasons over selected paths rather than the full neighborhood
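The inference-time half of this flow can be sketched as a greedy expansion loop. This is a minimal illustration, assuming a plain adjacency-list graph and a scoring callable that stands in for the trained selector. The names `expand_path`, `toy_gain`, and the toy graph are hypothetical, and the real selector scores learned representations rather than looking values up in a table.

```python
from typing import Callable, Dict, List, Sequence

Graph = Dict[str, List[str]]
Scorer = Callable[[Sequence[str], str], float]


def expand_path(graph: Graph, start: str, score: Scorer, max_hops: int) -> List[str]:
    """Grow one path hop by hop, keeping the highest-scoring neighbor each time."""
    path = [start]
    for _ in range(max_hops):
        # Skip nodes already on the path so the walk cannot loop back.
        candidates = [n for n in graph.get(path[-1], []) if n not in path]
        if not candidates:
            break
        path.append(max(candidates, key=lambda n: score(path, n)))
    return path


# Toy graph and a lookup table standing in for the learned selector.
graph = {
    "startup": ["investor_a", "investor_b"],
    "investor_a": ["company_x", "startup"],
    "investor_b": ["company_y"],
    "company_x": [],
    "company_y": [],
}
toy_gain = {"investor_a": 0.2, "investor_b": 0.5, "company_x": 0.1, "company_y": 0.3}

selected = expand_path(graph, "startup", lambda path, node: toy_gain[node], max_hops=3)
```

The loop never calls an LLM. All of the expensive judgment is assumed to be baked into `score` during offline training.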

The internal selector evaluation supports this mechanism. On its held-out test split, the trained selector reaches an NDCG@1 of 63.43%, compared with 44.92% for random ranking. Its Hit@1 rises from the random baseline’s 33.33% to 42.12%.

These figures are not the final startup-prediction results. They are an intermediate test asking whether the selector has learned to prefer candidates with higher oracle-assigned gain. Its purpose is narrower and useful: it confirms that the system is doing more than disguising random graph traversal with an impressive diagram.
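For readers who want to see what these two ranking metrics measure at a single expansion step, here is a small sketch. It assumes non-negative oracle gains and linear gain in the NDCG numerator; the paper may use a different relevance transformation, and the function names are illustrative.

```python
from typing import Sequence


def _top_index(scores: Sequence[float]) -> int:
    """Index of the candidate the selector ranks first."""
    return max(range(len(scores)), key=scores.__getitem__)


def hit_at_1(predicted: Sequence[float], oracle: Sequence[float]) -> float:
    """1.0 when the selector's top candidate has the highest oracle gain."""
    return 1.0 if oracle[_top_index(predicted)] == max(oracle) else 0.0


def ndcg_at_1(predicted: Sequence[float], oracle: Sequence[float]) -> float:
    """Oracle gain of the top-ranked candidate divided by the best achievable gain."""
    ideal = max(oracle)
    return oracle[_top_index(predicted)] / ideal if ideal > 0 else 0.0


# The selector prefers candidate 1, but the oracle prefers candidate 0.
predicted_scores = [0.1, 0.7, 0.2]
oracle_gains = [0.3, 0.2, 0.0]
hit = hit_at_1(predicted_scores, oracle_gains)
ndcg = ndcg_at_1(predicted_scores, oracle_gains)
```

In this example the selector misses the best candidate, so Hit@1 is zero, yet NDCG@1 still gives partial credit because the chosen candidate carries two thirds of the best available gain.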

More Context Helps Only After Selection

A tempting interpretation of the paper is that deeper graph reasoning produces better predictions. The evidence is more specific.

Correctly classified startups had deeper accessible paths on average: 4.44 hops, compared with 3.31 for misclassified startups. This suggests that richer relational environments can provide useful evidence.

It does not show that adding arbitrary depth improves prediction.

The paper’s sensitivity analysis finds that shallow or single-path configurations can provide insufficient evidence, while overly deep searches with many paths introduce noise and increase reasoning cost. In other words, useful network depth and indiscriminate graph expansion are not the same thing.

The ablation study makes the point more directly:

Configuration	Precision	F1
Full MIRAGE-VC	24.32	36.54
Without graph retrieval	23.01	34.06
Without path selector, using all neighbors	22.72	33.29
Without path selector, using random paths	23.24	34.76

Removing graph evidence hurts. Yet feeding all available three-hop neighbors performs even worse than removing graph retrieval entirely.

That result is the paper’s cleanest correction to the “more context is better” assumption. Unfiltered graph context does not merely waste tokens. It can dilute useful evidence enough to reduce predictive performance.

The investment graph is valuable. The graph dump is not.

Three Analysts Read Three Different Kinds of Evidence

After retrieving graph paths, MIRAGE-VC does not hand every available input to one heroic prompt and ask it to become McKinsey, Sequoia, and a graph neural network before lunch.

Instead, it separates evidence into three perspectives.

The peer-company analyst

This agent receives the target company’s profile and descriptions of historically similar companies. Comparable firms are retrieved using semantic similarity, with temporal restrictions intended to prevent later information from entering earlier predictions.

The perspective answers a familiar due-diligence question: how have companies with similar products, sectors, or market positions progressed?
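A time-restricted similarity search of this kind might look like the following sketch. It assumes each company has a precomputed embedding and a date on which its description became available. The `Company` fields, the strict cutoff comparison, and the function names are assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from datetime import date
from typing import List, Sequence


@dataclass
class Company:
    name: str
    embedding: Sequence[float]
    known_as_of: date  # when this description became available


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_peers(target: Company, pool: List[Company], cutoff: date, k: int) -> List[Company]:
    """Return the k most similar companies whose records predate the cutoff."""
    # Temporal filter first, so later information can never be ranked at all.
    eligible = [c for c in pool if c.known_as_of < cutoff and c.name != target.name]
    eligible.sort(key=lambda c: cosine(target.embedding, c.embedding), reverse=True)
    return eligible[:k]


target = Company("T", [1.0, 0.0], date(2022, 1, 1))
pool = [
    Company("A", [0.9, 0.1], date(2020, 5, 1)),
    Company("B", [1.0, 0.0], date(2023, 3, 1)),  # most similar, but too recent
    Company("C", [0.0, 1.0], date(2019, 7, 1)),
]
peers = retrieve_peers(target, pool, cutoff=date(2022, 1, 1), k=1)
```

Company B is the closest match by similarity, but it is excluded because its record postdates the prediction date.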

The investor-profile analyst

This agent examines the lead investor, defined as the investor contributing the largest amount in the startup’s first disclosed financing round. It receives the investor’s prior roles, demographic attributes, and time-filtered investment records.

Its task is to assess whether the investor’s history and capabilities provide a meaningful prior for the target company’s next financing round.
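The lead-investor definition translates into a short selection rule. The sketch below assumes the first disclosed round can be identified by its earliest date and that per-investor amounts are available. The `Contribution` record and the tie-breaking behavior are assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Contribution:
    investor: str
    round_date: date
    amount: float


def lead_investor(contributions: List[Contribution]) -> Optional[str]:
    """Largest contributor in the earliest disclosed round, or None if no data."""
    if not contributions:
        return None
    first_date = min(c.round_date for c in contributions)
    first_round = [c for c in contributions if c.round_date == first_date]
    # On equal amounts, max() keeps the first record encountered.
    return max(first_round, key=lambda c: c.amount).investor


records = [
    Contribution("Fund A", date(2022, 2, 1), 500_000.0),
    Contribution("Fund B", date(2022, 2, 1), 1_200_000.0),
    Contribution("Fund C", date(2023, 1, 15), 5_000_000.0),  # later round, ignored
]
lead = lead_investor(records)
```

Fund C writes the largest check overall, but only the first disclosed round counts, so Fund B is treated as the lead.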

The investment-chain analyst

This agent reasons over the compact paths selected from the investment graph. These paths connect companies and investors through historical investment and co-investment relationships.

Rather than receiving a numerical graph embedding, the agent sees a verbalized relational chain and supporting node profiles. It can therefore produce a human-readable argument about why a particular coalition, track record, or structural position may affect the target startup’s prospects.
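Verbalizing a path can be as simple as rendering each hop as a sentence and appending the available node profiles. The template below is a hypothetical illustration; the article does not reproduce the paper's actual prompt wording, and the relation phrases are invented for the example.

```python
from typing import Dict, List, Tuple

Step = Tuple[str, str, str]  # (source node, relation phrase, target node)


def verbalize_path(steps: List[Step], profiles: Dict[str, str]) -> str:
    """Render a relational chain and its node profiles as plain text."""
    lines = [
        f"Hop {i}: {src} {relation} {dst}."
        for i, (src, relation, dst) in enumerate(steps, start=1)
    ]
    # Collect nodes in order of first appearance, without duplicates.
    seen: List[str] = []
    for src, _, dst in steps:
        for node in (src, dst):
            if node not in seen:
                seen.append(node)
    lines += [f"Profile of {node}: {profiles[node]}" for node in seen if node in profiles]
    return "\n".join(lines)


steps = [
    ("Startup T", "received investment from", "Investor A"),
    ("Investor A", "co-invested with", "Investor B"),
]
text = verbalize_path(steps, {"Investor A": "seed-stage fund"})
```

The output is ordinary prose, so a human reviewer can read the same evidence the agent receives.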

All three specialist roles use the same frozen LLM. That design reduces one source of ambiguity: differences among their outputs should primarily reflect differences in evidence, rather than differences in underlying model capability.

The separation also creates an operationally useful audit trail. A reviewer can see whether a positive recommendation was driven by comparable companies, the lead investor, or the wider investment network. That is more informative than receiving a single score labeled “AI confidence: 83%,” the traditional unit of corporate reassurance.

The Gate Learns Which Analyst Deserves Attention

Specialization alone does not solve evidence fusion. The three analysts may disagree, and their relative usefulness can vary by startup.

A company with sparse network connections may receive little value from the investment-chain perspective. A startup operating in a mature category may have highly informative comparables. Another may depend heavily on the credibility and syndication ability of its lead investor.

MIRAGE-VC handles this with a learnable gating network. The gate receives representations of the three agents’ rationales alongside structured company attributes such as industry, region, and financing stage. It produces case-specific weights indicating how much each evidence stream should contribute.

The weighted evidence is then passed to a manager agent, which produces the final binary decision and a consolidated explanation.
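The gating step can be pictured as a small function that maps case features to three weights that sum to one. The sketch below assumes a single linear layer followed by a softmax; the paper's actual gate architecture, feature encoding, and parameter values are not given in this article, so every number here is made up.

```python
import math
from typing import Dict, List, Sequence

PERSPECTIVES = ("peer", "investor", "chain")


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(
    features: Sequence[float],
    weight_rows: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> Dict[str, float]:
    """One weight per perspective from a linear layer plus softmax (assumed form)."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weight_rows, bias)
    ]
    return dict(zip(PERSPECTIVES, softmax(logits)))


# Hypothetical parameters: this case's features favor the investor view.
weights = gate_weights(
    features=[1.0, 0.0],
    weight_rows=[[0.5, 0.0], [1.5, 0.0], [0.0, 1.0]],
    bias=[0.0, 0.0, 0.0],
)
```

Because the weights depend on `features`, two startups can receive different mixes of the same three analyst views.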

The full flow is:

Target startup
   ├── Similar-company retrieval → Peer-company analyst ──┐
   ├── Lead-investor retrieval → Investor-profile analyst ├── Learned gate → Manager agent
   └── High-gain graph paths → Investment-chain analyst ──┘

The gate matters because evidence quality is conditional. A fixed rule such as “always weight the investor view at 40%” assumes that the same due-diligence recipe applies to every company. Real investment committees do not behave that way, at least not the functional ones.

The ablations support the value of separating and adapting the perspectives:

Removed component	Precision	F1	Likely interpretation
Similar-company evidence	23.45	35.54	Peer trajectories add complementary information
Investor analysis	23.32	35.43	Lead-investor history contributes beyond company features
Multi-agent fusion	22.97	35.13	One combined agent handles heterogeneous evidence less effectively
Learnable gating network	24.05	35.94	Adaptive weights improve over fixed fusion, though the gain is smaller than the retrieval effects

The largest degradation comes from replacing selected paths with all graph neighbors. The smaller gating ablation still matters, but it should not distract from the main mechanism. Better fusion cannot rescue badly selected evidence.
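The ranking of these effects follows directly from the two ablation tables. The short script below subtracts each ablated F1 from the full system's 36.54 and sorts the drops. The dictionary keys are shortened labels chosen for this example.

```python
from typing import Dict, List, Tuple

FULL_F1 = 36.54

# F1 values copied from the two ablation tables in this article.
ABLATION_F1 = {
    "no graph retrieval": 34.06,
    "all neighbors, no selector": 33.29,
    "random paths, no selector": 34.76,
    "no similar-company evidence": 35.54,
    "no investor analysis": 35.43,
    "no multi-agent fusion": 35.13,
    "no learnable gate": 35.94,
}


def f1_drops(full: float, ablations: Dict[str, float]) -> List[Tuple[str, float]]:
    """F1 points lost by each ablation, largest loss first."""
    drops = {name: round(full - f1, 2) for name, f1 in ablations.items()}
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


ranked = f1_drops(FULL_F1, ABLATION_F1)
```

Feeding all neighbors costs 3.25 F1 points, the largest loss, while removing the learnable gate costs 0.60, the smallest.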

The Main Results Measure Triage Quality, Not Investment Returns

MIRAGE-VC is evaluated using PitchBook data spanning investment activity from 2005 to November 2023. The final evaluation contains 2,510 startups that completed their first financing round between October 2021 and November 2023. Of these, 533 are positive cases and 1,977 are negative.

The paper defines success narrowly: a seed- or angel-stage startup must secure Series A financing within one year.

This definition creates a manageable supervised-learning task, but it is not equivalent to long-term venture success. A company can raise Series A and later fail. Another can build a durable business without following the expected financing timeline. The model predicts near-term fundraising progression, not eventual fund returns, founder quality, or social value.
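The label itself is a simple date rule. The sketch below assumes "within one year" means within 365 days of the first financing round, with the boundary day included. Both choices are assumptions, as is the function name.

```python
from datetime import date, timedelta
from typing import Optional


def series_a_label(first_round: date, series_a: Optional[date], window_days: int = 365) -> int:
    """1 if Series A closed within the window after the first round, else 0."""
    if series_a is None:
        return 0  # no Series A on record
    deadline = first_round + timedelta(days=window_days)
    return 1 if first_round <= series_a <= deadline else 0


on_time = series_a_label(date(2022, 3, 1), date(2023, 2, 1))
too_late = series_a_label(date(2022, 3, 1), date(2023, 6, 1))
never = series_a_label(date(2022, 3, 1), None)

# Class balance in the evaluation set reported above.
positive_rate = 533 / 2510
```

The positive rate works out to roughly 21%, so a classifier that flagged every startup as a success would already reach about 21% precision. That is the reference point for reading the precision column below.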

Within that defined task, MIRAGE-VC produces the strongest overall F1 and top-of-ranking performance among the tested methods.

Method	Monthly AP@5	Precision	Recall	F1
SHGMNN	25.41	20.65	82.37	32.97
GST	26.71	21.75	83.54	34.51
Standard RAG	24.43	23.12	60.34	33.43
SSFF	28.23	23.23	69.41	34.81
GNN-RAG	29.42	22.81	71.10	34.54
MIRAGE-VC	34.29	24.32	73.44	36.54

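Compared with the strongest baseline on each metric, the paper reports a relative improvement of 16.6% in monthly-averaged Precision@5 (the Monthly AP@5 column: 34.29 versus 29.42 for GNN-RAG) and approximately 5.0% in F1 (36.54 versus 34.81 for SSFF).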

The AP@5 improvement is particularly relevant to a screening workflow. An investment team rarely has the capacity to pursue every company classified as potentially successful. It needs a small, ranked shortlist. Better performance among the highest-ranked candidates can therefore be more operationally useful than maximizing recall across the entire market.
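A shortlist metric of this kind can be sketched as follows. The code assumes the metric is the share of true positives among the five highest-scored startups in each month, averaged over months, which is how this article's phrase "monthly-averaged Precision@5" reads. The paper's exact definition may differ, and dividing by the number of startups actually ranked in a thin month is a further assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, float, int]  # (month, predicted score, true label)


def monthly_precision_at_k(records: Iterable[Record], k: int = 5) -> float:
    """Mean over months of the positive share among each month's top-k scores."""
    by_month: Dict[str, List[Tuple[float, int]]] = defaultdict(list)
    for month, score, label in records:
        by_month[month].append((score, label))
    if not by_month:
        raise ValueError("no records to evaluate")
    per_month = []
    for items in by_month.values():
        top = sorted(items, key=lambda item: item[0], reverse=True)[:k]
        per_month.append(sum(label for _, label in top) / len(top))
    return sum(per_month) / len(per_month)


records = [
    ("2022-01", 0.9, 1), ("2022-01", 0.8, 0), ("2022-01", 0.1, 1),
    ("2022-02", 0.7, 1), ("2022-02", 0.6, 1), ("2022-02", 0.2, 0),
]
score = monthly_precision_at_k(records, k=2)
```

Only the top of each month's ranking matters here. The positive in January with a score of 0.1 does not count, because no analyst would have reached it.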

Still, the absolute figures deserve adult interpretation. A precision of 24.32% means that most positive classifications remain false positives. An AP@5 of 34.29% is an improvement, not clairvoyance.
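The headline numbers can be checked against the results table with a few lines of arithmetic. The sketch below recomputes the relative improvements over the best baseline on each metric, the F1 implied by MIRAGE-VC's precision and recall, and the share of positive classifications that are wrong. All inputs are copied from the table; only the helper names are new.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Percentage gain of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# Best baselines per metric in the table: GNN-RAG for AP@5, SSFF for F1.
ap5_gain = relative_improvement(34.29, 29.42)
f1_gain = relative_improvement(36.54, 34.81)

# MIRAGE-VC row: precision 24.32, recall 73.44.
f1_check = f1_from(24.32, 73.44)
false_positive_share = 100.0 - 24.32

print(f"AP@5 gain: {ap5_gain:.1f}%  F1 gain: {f1_gain:.1f}%")
print(f"F1 from P and R: {f1_check:.2f}  wrong positives: {false_positive_share:.2f}%")
```

The reported 16.6% and 5.0% gains and the 36.54 F1 all reproduce. The last figure is the sobering one: roughly three of every four startups the system flags do not raise a Series A within the year.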

MIRAGE-VC is better understood as a triage system that improves the quality of attention allocation. It is not an automated partner with unusually low carried-interest requirements.

What the Experiments Support—and What They Do Not

The paper contains several experiments serving different purposes. Treating all of them as equal “proof” would flatten the argument and overstate what has been established.

Test	Likely purpose	What it supports	What it does not prove
Comparison with GNN, RAG, and LLM-based baselines	Main evidence	The complete system performs better on the selected PitchBook task	That it will outperform human investors or transfer unchanged to other datasets
Component ablations	Mechanism test	Selected graph paths, separate analysts, and adaptive fusion each contribute	That each component is globally optimal
Selector Hit@1 and NDCG@1	Intermediate validation	The lightweight selector learns useful candidate rankings	That oracle gain labels perfectly represent real investment value
Stronger LLM-backbone experiments	Robustness and extension	The architecture remains useful with other backbones	A clean leakage-free comparison, because newer models may know test-period companies
Encoder substitutions	Robustness test	Results are not highly dependent on one sentence encoder	That retrieval is insensitive to all representation choices
Depth and path-count sweeps	Sensitivity analysis	Performance is stable near the tuned region; extremes are weaker	That the same settings work for different graphs or business tasks
WhatsApp reasoning example	Interpretability illustration	The system can produce coherent, multi-perspective narratives	That generated rationales are faithful causal explanations

The appendix’s stronger-backbone results are encouraging. MIRAGE-VC improves over few-shot prompting with GPT-3.5 Turbo, GPT-4o-mini, and Qwen-3 4B. But the authors correctly treat the newer-backbone experiments cautiously because those models may have been pretrained on information overlapping the evaluation period.

That makes these experiments robustness evidence, not a second headline result.

The paper also reports AUC-PR of 0.354 and AUC-ROC of 0.591, outperforming the strongest tested baselines on those metrics. These broader discrimination measures reduce the risk that the main result is purely an artifact of one classification threshold. They do not change the more important operational conclusion: the system remains a noisy screening instrument.

Human-Readable Rationales Are Useful, but They Are Not Causal Proof

MIRAGE-VC’s manager agent can produce an investment-style explanation that combines peer-company evidence, investor history, selected graph paths, and learned perspective weights. This is a substantial usability improvement over opaque graph embeddings.

A human reviewer can inspect the cited relationships and challenge the argument. The system can expose whether it relied heavily on a lead investor’s track record or on a network path connecting the startup to successful companies and investor coalitions.

However, readable reasoning and faithful reasoning are not identical.

The learned gate provides evidence-level attribution: it indicates which perspective received more weight. The generated explanation then turns those perspectives into prose. That prose may be coherent and grounded in retrieved inputs, but the experiment does not establish that every sentence reflects the true causal process behind the model’s prediction.

For business deployment, the rationales should therefore be treated as review interfaces, not unquestionable explanations. Their best use is to help analysts inspect evidence, identify unsupported leaps, and decide what to investigate next.

A polished paragraph remains capable of being wrong. Language models have invested heavily in this feature.

The Transferable Business Pattern Is Utility-Aware Retrieval

The paper is framed around venture capital, but its more general contribution is architectural.

Many organizational decisions depend on relationship networks while targeting outcomes outside those networks:

A recommendation system predicts whether a user will engage with an item.
A lender predicts whether an applicant will default.
A procurement team predicts whether a supplier will create operational risk.
A compliance team predicts whether a transaction pattern warrants investigation.
A sales team predicts whether an account is likely to convert.

In each case, the organization may possess a large graph of interactions, affiliations, transactions, or historical relationships. The naive approach retrieves the nearest or most similar connections. The MIRAGE-VC pattern asks a more demanding question:

Which relationship evidence has historically improved this particular downstream decision?

That leads to a reusable design sequence:

Define the operational target precisely. Retrieval can only be utility-aware when the system knows which outcome it is meant to improve.
Generate candidate relational evidence. Build possible paths, neighborhoods, or interaction chains from the organization’s graph.
Estimate marginal decision value. Measure whether adding each evidence item improves prediction on historical cases.
Train a lightweight selector. Distill expensive historical evaluation into a cheaper retrieval policy.
Separate evidence by analytical role. Ask specialist agents to examine different sources rather than forcing one prompt to reconcile everything immediately.
Fuse perspectives conditionally. Learn which evidence types matter for which cases.
Expose the evidence to human reviewers. Preserve paths, source records, and perspective weights so decisions can be challenged.

The likely return on investment does not come merely from improving an accuracy metric. It comes from reducing the cost of reviewing irrelevant evidence while preserving enough context to support informed escalation.

For a high-volume screening process, even a modest improvement at the top of a ranking can redirect analyst hours toward better candidates. Whether that improvement justifies the data engineering, supervision, and governance costs remains an organization-specific calculation.

The Deployment Boundary Is Decision Support, Not Autonomous Investing

Three boundaries materially affect how MIRAGE-VC should be interpreted.

The outcome is narrow

The model predicts Series A financing within one year. It does not predict eventual exit value, return multiples, time to liquidity, or whether an investment fits a particular fund’s strategy.

A production system would need target labels aligned with the organization’s actual decision. Optimizing for follow-on financing may unintentionally favor companies already connected to established capital networks, even when the fund is seeking overlooked opportunities.

The evidence comes from one proprietary dataset

The system is trained and evaluated using PitchBook. That dataset offers broad coverage, but the paper does not establish performance on another venture database, a regional market, or an internally maintained deal-flow system.

Replication is also difficult because the complete proprietary dataset cannot be released. Before deployment, organizations would need their own temporal validation and careful checks for missingness, market bias, and entity-resolution errors.

Path selection is locally greedy

The selector evaluates the immediate value of adding one candidate node. This makes retrieval tractable, but it may miss combinations of nodes that are weak individually and powerful together.

The paper identifies this as a limitation: local information gain does not guarantee a globally optimal subgraph. Future work could use look-ahead or sequence-level optimization, although such methods would add complexity and cost.
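A toy example shows how step-by-step selection can miss a strong combination. It simplifies the problem to picking a set of evidence items rather than extending a path, and the values in `toy_value` are invented purely to show the effect.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

Value = Callable[[Tuple[str, ...]], float]


def greedy_pick(candidates: Sequence[str], value: Value, k: int) -> List[str]:
    """Add one item at a time, always taking the best immediate value."""
    chosen: List[str] = []
    for _ in range(k):
        remaining = [c for c in candidates if c not in chosen]
        chosen.append(max(remaining, key=lambda c: value(tuple(chosen + [c]))))
    return chosen


def exhaustive_pick(candidates: Sequence[str], value: Value, k: int) -> List[str]:
    """Score every k-item combination; only feasible for tiny candidate sets."""
    return list(max(combinations(candidates, k), key=value))


# Invented values: B and C are weak alone but strong together.
toy_value = {
    frozenset({"A"}): 0.3, frozenset({"B"}): 0.1, frozenset({"C"}): 0.1,
    frozenset({"A", "B"}): 0.4, frozenset({"A", "C"}): 0.4, frozenset({"B", "C"}): 0.9,
}


def value(items: Tuple[str, ...]) -> float:
    return toy_value[frozenset(items)]


greedy = greedy_pick(["A", "B", "C"], value, k=2)
best = exhaustive_pick(["A", "B", "C"], value, k=2)
```

Greedy selection commits to A because it looks best alone and ends at a value of 0.4, while the exhaustive search finds the B and C pair worth 0.9. The exhaustive version scales combinatorially, which is the cost the look-ahead alternatives would have to manage.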

There is also a governance issue beyond the paper’s primary evaluation. Investor profiles include attributes such as education, age, and gender. Even when these attributes come from licensed or aggregated records, organizations should examine whether the variables or their proxies create inappropriate decision patterns. A model that reproduces historical funding networks may become very good at recommending the kinds of founders the market already funds. That is prediction. It is not necessarily judgment.

The Valuable Lesson Is Not “Use More Agents”

It would be easy to summarize MIRAGE-VC as another multi-agent LLM system: three analysts, a manager, some retrieval, and a neural network quietly ensuring the agents appear coordinated.

That misses the contribution.

The system’s strongest idea occurs before the agents speak. MIRAGE-VC turns graph retrieval into a supervised decision problem, selecting paths according to their marginal usefulness for an external objective. The specialist agents and adaptive gate then preserve the distinctions among evidence types rather than blending everything into one large prompt.

The empirical results support that sequence. Selected graph evidence improves predictions. Unfiltered graph evidence degrades them. Separate analytical views add complementary value. Adaptive weighting improves their combination. The strongest gains appear where a screening process needs them most: near the top of the ranked shortlist.

MIRAGE-VC does not teach an LLM to possess venture instinct. It teaches the surrounding system to allocate the model’s attention more carefully.

For organizations building decision-support agents, that is the more durable lesson. The next improvement may not come from a larger model or a larger context window. It may come from becoming far less generous about what enters the prompt.

Cognaptus: Automate the Present, Incubate the Future.

¹ Haoyu Pei, Zhongyang Liu, Xiangyi Xiao, Xiaocong Du, Suting Hong, Kunpeng Zhang, and Haipeng Zhang, “The Gaining Paths to Investment Success: Information-Driven LLM Graph Reasoning for Venture Capital Prediction,” arXiv:2512.23489, https://arxiv.org/abs/2512.23489.
