Graphs, Gains, and Guile: How FinKario Outruns Financial LLMs

TL;DR for operators

FinKario is useful because it attacks a dull but expensive problem: financial research is rich, long, inconsistent, and usually trapped inside documents that models can quote more easily than they can use. The paper’s answer is not “ask a better LLM.” It is “turn research reports into a dynamic financial knowledge graph, then retrieve graph context before asking the LLM to reason.” Small difference. Large operational consequences.

The system builds two linked structures from equity research reports: an Attribute Graph for relatively stable company facts, and an Event Graph for time-sensitive drivers such as demand shifts, strategic actions, technology innovation, policy changes, and profitability movements. FinKario then uses a two-stage retrieval pipeline: first locate the relevant stock and date, then expand into surrounding entities, relations, and event context. That matters because most finance RAG systems still behave like interns with highlighter pens: they retrieve chunks, not causality.

The headline result is strong. In the paper’s weekly long-only backtest, FinKario-RAG reports the highest annualized return, Sharpe ratio, Calmar ratio, and directional accuracy among market indices, vanilla LLMs, financial LLMs, and selected institutional strategies. Its accuracy is 0.581, only modestly above the best institutional baselines, but its reported return profile is much stronger. The distinction matters: the paper is not just claiming “better prediction”; it is claiming that structured event retrieval improves the conversion of research signals into portfolio outcomes.

For business readers, the relevant lesson is not that FinKario is a deploy-and-print-money stock picker. Please do not let the spreadsheet goblin take over. The practical insight is that financial AI products need memory architectures that respect market time. Static company profiles are insufficient. Raw document search is blunt. Dynamic event graphs are closer to how analysts actually explain price movement.

The boundary is equally important. The evidence is based on East Money research reports from 2024-08-28 to 2025-02-28, Tushare/Wind market data, a weekly strategy evaluated through 2025-03-07, and paper-defined baselines. That is enough to make the architecture interesting. It is not enough to certify live trading performance, suitability for other markets, or regulatory readiness.

The problem is not that LLMs cannot read reports. It is that reports are the wrong shape.

Equity research reports are full of useful material: company fundamentals, target prices, risk assessments, industry context, supply-demand shifts, policy exposure, product cycles, margin drivers, and analyst judgement. They are also long, uneven, repetitive, and temporally awkward. A report may describe a company’s stable identity in one section, a recent event in another, and an implied causal chain across several paragraphs. A vanilla LLM can summarise that. A retrieval system can retrieve chunks from it. Neither automatically turns it into a structure that can be reused across companies, dates, industries, and investment questions.

That is the gap FinKario tries to close. The paper, FinKario: Event-Enhanced Automated Construction of Financial Knowledge Graph, introduces an automated pipeline for converting financial research reports into a dynamic knowledge graph and then using that graph for retrieval-augmented stock trend prediction.¹

The paper’s framing begins with a familiar asymmetry: institutional investors have more resources, better research workflows, and more disciplined access to information. Individual investors are buried in material they cannot process quickly. That is a fair motivation, though slightly philanthropic in the way finance papers often are. The more interesting business problem is broader: even professional research organisations struggle to convert narrative analysis into machine-usable, continuously updated decision context.

The usual RAG answer is to chunk the reports, embed the chunks, and retrieve the nearest passages. That helps with access. It does not solve structure. It does not know whether “overseas expansion” is a strategic action, whether “capacity adjustment” belongs to supply, whether a margin improvement is tied to automation, or whether a statement is a stable attribute or a time-sensitive event.

FinKario’s useful move is to treat those distinctions as architecture.

FinKario splits financial knowledge into what changes slowly and what moves the market

The system constructs a dual financial knowledge graph from equity research reports. The first part is an Attribute Graph. This captures relatively stable company-level information: stock ticker, exchange, industry, rating, current price, market capitalisation, target price, major shareholders, risk assessment, key products, and research institution. These attributes are not necessarily permanent, but they behave more like background context than trading events.

The second part is an Event Graph. This is where the paper becomes more interesting. The Event Graph captures drivers: supply, demand, revenue, efficiency and cost, strategic action, technology innovation, policy and regulation, and macro factors. These are the kinds of concepts analysts use when explaining why a company’s future may diverge from its past.

The architecture can be read as a simple distinction:

Layer	What it stores	Why it matters	Failure mode if missing
Attribute Graph	Company identity, industry, exchange, rating, market cap, target price, risk factors, key products	Provides stable context and entity grounding	The model may confuse entities, miss comparables, or reason without basic company scaffolding
Event Graph	Demand shifts, supply changes, earnings drivers, cost efficiency, strategic actions, technology innovation, policy, macro events	Captures time-sensitive causal signals	The model sees company facts but misses why expectations changed
FinKario-RAG	Retrieved subgraphs anchored by stock/date and expanded into related context	Turns the graph into usable decision context	The system falls back into ordinary chunk retrieval or isolated entity lookup
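The split between slow-moving attributes and dated events can be sketched as a data structure. This is a minimal illustration, not the paper's schema: the class names, field names, and the `company_view` helper are assumptions, and the sample company and values are invented.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class AttributeTriple:
    """Relatively stable company fact, e.g. (company, 'industry', value)."""
    company: str
    relation: str
    value: str
    report_date: date


@dataclass(frozen=True)
class EventTriple:
    """Time-sensitive driver extracted from a report."""
    subject: str
    category: str      # e.g. "Demand", "Strategic Action", "Policy Regulation"
    obj: str
    report_date: date


@dataclass
class DualGraph:
    attributes: list[AttributeTriple] = field(default_factory=list)
    events: list[EventTriple] = field(default_factory=list)

    def company_view(self, company: str, start: date, end: date) -> dict:
        """Stable attributes plus events dated within [start, end]."""
        return {
            "attributes": [a for a in self.attributes if a.company == company],
            "events": [e for e in self.events
                       if e.subject == company and start <= e.report_date <= end],
        }


# Invented sample data for illustration only.
graph = DualGraph()
graph.attributes.append(
    AttributeTriple("ExampleCo", "industry", "Semiconductors", date(2024, 9, 2)))
graph.events.append(
    EventTriple("ExampleCo", "Demand", "rising orders", date(2024, 12, 10)))
graph.events.append(
    EventTriple("ExampleCo", "Policy Regulation", "new subsidy", date(2025, 2, 20)))

view = graph.company_view("ExampleCo", date(2024, 12, 1), date(2024, 12, 31))
```

The point of the sketch is the asymmetry: attributes are returned regardless of the window, while events are filtered by date, which is roughly how background context and trading-relevant drivers differ in use.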

The paper reports that FinKario contains 305,360 entities, 9,625 relational triples, and 19 relation types. The entity count is large relative to the triple count, which is worth noticing. This is not a giant market-wide graph in the “tens of millions of triples” sense. Its claim is not maximum graph volume. Its claim is automated, event-aware, research-report-grounded structure.

That distinction is commercially relevant. Many firms already own large databases. Fewer have reliable systems for extracting event logic from analyst narratives and making it retrievable in context. In finance, the bottleneck is often not lack of text. The bottleneck is that the text refuses to sit nicely inside a decision engine.

The construction pipeline matters because finance punishes sloppy extraction

FinKario’s construction process has four modules.

First, the authors collect raw equity research reports from East Money and use MinerU to convert them into Markdown. They then remove non-informative material such as disclaimers, images, and repeated legal boilerplate. This sounds mundane because it is. It is also necessary. Financial PDFs are where clean data goes to experience character development.
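A cleaning pass of this kind might look like the sketch below. It assumes the reports are already in Markdown; the marker list and the `clean_markdown` helper are hypothetical, and a real corpus would need its own rules for what counts as boilerplate.

```python
import re

# Hypothetical boilerplate markers; a real corpus would need its own list.
BOILERPLATE_MARKERS = ("disclaimer", "免责声明")
# Matches Markdown image tags such as ![alt](path).
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def clean_markdown(text: str) -> str:
    """Drop image tags, boilerplate lines, and verbatim repeated lines."""
    seen = set()
    kept = []
    for raw_line in text.splitlines():
        line = IMAGE_PATTERN.sub("", raw_line).strip()
        if not line:
            continue
        if any(marker in line.lower() for marker in BOILERPLATE_MARKERS):
            continue
        if line in seen:
            continue
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)


# Invented sample report fragment.
sample = """# ExampleCo initiation
![chart](fig1.png)
Revenue grew on stronger overseas demand.
Disclaimer: this report is for reference only.
Revenue grew on stronger overseas demand.
"""
cleaned = clean_markdown(sample)
```

Dropping every repeated line is a blunt rule: it removes repeated legal text, but it would also remove a legitimately repeated sentence, so a production version would probably scope the deduplication more narrowly.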

Second, the system constructs schemas. For the Attribute Graph, it uses equity research templates from sources such as the CFA Institute and J.P. Morgan-style report structures to prompt the model into identifying core company attributes. For the Event Graph, it uses a top-down schema: high-level driven categories are derived with reference to a University of Wisconsin causal-analysis template, then refined using the Financial Industry Business Ontology.

Third, the system populates the graph. It extracts timestamped entities and relations from the cleaned reports according to the generated schemas. This gives the graph its temporal shape: company information is not merely stored as a static profile, but as dated report-derived knowledge.

Fourth, the system performs quality control. It normalises entities, completes missing numeric attributes through Tushare, and asks GPT-4o-mini to correct placeholders or extraction errors by re-reading source passages.
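The quality-control idea can be illustrated with a small sketch. Everything here is an assumption for illustration: the alias table, the placeholder set, and the in-memory reference dictionary stand in for the paper's entity normalisation and Tushare back-fill, and no real Tushare call is made. The LLM re-reading step is not shown; records that cannot be repaired are simply flagged.

```python
# Hypothetical alias table and reference data; in the paper the numeric
# back-fill comes from Tushare, which is not called here.
ALIASES = {"example co": "ExampleCo", "exampleco ltd": "ExampleCo"}
REFERENCE_MARKET_CAP = {"ExampleCo": 12.5}  # illustrative units
PLACEHOLDERS = {"", "n/a", "unknown", "-"}


def normalise_entity(name: str) -> str:
    """Map known aliases to one canonical company name."""
    key = " ".join(name.lower().replace(".", "").split())
    return ALIASES.get(key, name.strip())


def quality_control(record: dict) -> dict:
    """Normalise the company name and back-fill a missing market cap."""
    fixed = dict(record)
    fixed["company"] = normalise_entity(record["company"])
    cap = str(record.get("market_cap", "")).strip().lower()
    if cap in PLACEHOLDERS:
        reference = REFERENCE_MARKET_CAP.get(fixed["company"])
        if reference is not None:
            fixed["market_cap"] = reference
        else:
            # No trusted value available: route to review instead of guessing.
            fixed["needs_review"] = True
    return fixed


checked = quality_control({"company": "Example Co.", "market_cap": "N/A"})
flagged = quality_control({"company": "UnknownCo", "market_cap": ""})
```

The design choice worth copying is the last branch: when no reference value exists, the record is flagged rather than filled with something plausible.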

This quality-control step deserves more attention than the paper’s headline return chart. In financial AI, a wrong entity is not a cosmetic flaw. If a model confuses a stock code, company alias, industry peer, or dated event, the downstream answer may still sound plausible. Plausibility is the toxin. The paper’s appendix case study makes this concrete: several baseline models either refuse to give an actionable prediction, offer generic caution, or make factual mistakes such as misidentifying a stock code. FinKario-RAG’s advantage in that example is not rhetorical confidence; it is better grounding.

Two-stage retrieval is the engine, not a garnish

After constructing the graph, FinKario uses FinKario-RAG, a two-stage retrieval-augmented generation pipeline. This is where the mechanism-first reading pays off.

The system vectorises the knowledge graph at three levels: entities, relations, and the graph-level representation. A query is then encoded and retrieval proceeds in two stages.

The first stage is coarse-grained retrieval. It identifies broad anchors such as the relevant stock and date. In ordinary terms: before analysing anything, the system tries to know which company and which time window it is dealing with.

The second stage is fine-grained retrieval. It expands from those anchors into surrounding financial entities and relations: industry, market cap, price, related events, and graph neighbourhood. The retrieved pieces are mapped back into a coherent subgraph, which is then provided to the reasoning model to produce a prediction label, confidence level, and grounded rationale.

A crude but useful diagram looks like this:

Research reports
   ↓
Cleaned Markdown corpus
   ↓
Attribute Graph + Event Graph
   ↓
Entity / relation / graph embeddings
   ↓
Coarse retrieval: stock + date anchors
   ↓
Fine retrieval: surrounding financial subgraph
   ↓
LLM reasoning: direction + confidence + rationale

The key difference from ordinary RAG is not “more context.” More context is often just a larger haystack, now with a subscription fee. The difference is that FinKario retrieves a structured neighbourhood around a stock and time period. That makes it easier for the model to reason over connected signals instead of isolated text fragments.
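A toy version of the two stages makes the idea of a "structured neighbourhood" concrete. This sketch swaps the paper's embedding-based retrieval for exact matching on plain tuples, so it shows only the anchor-then-expand shape, not the actual method. The triples, function names, and 30-day lookback are all invented.

```python
from datetime import date, timedelta

# Toy triples: (head, relation, tail, report_date). All values are invented.
TRIPLES = [
    ("ExampleCo", "industry", "Semiconductors", date(2025, 2, 10)),
    ("ExampleCo", "Demand", "AI server orders", date(2025, 2, 12)),
    ("ExampleCo", "Strategic Action", "overseas plant", date(2024, 9, 1)),
    ("PeerCo", "industry", "Semiconductors", date(2025, 2, 11)),
    ("PeerCo", "Policy Regulation", "export rule", date(2025, 2, 13)),
    ("OtherCo", "industry", "Retail", date(2025, 2, 12)),
]


def coarse_retrieve(stock, as_of, lookback_days=30):
    """Stage 1: anchor on the stock and a date window ending at as_of."""
    start = as_of - timedelta(days=lookback_days)
    return [t for t in TRIPLES if t[0] == stock and start <= t[3] <= as_of]


def fine_retrieve(anchors, as_of, lookback_days=30):
    """Stage 2: expand one hop through entities shared with the anchors."""
    start = as_of - timedelta(days=lookback_days)
    anchor_heads = {t[0] for t in anchors}
    shared_tails = {t[2] for t in anchors}
    # Other companies that point at the same tail entity (e.g. same industry).
    linked = {t[0] for t in TRIPLES if t[2] in shared_tails} - anchor_heads
    neighbours = [t for t in TRIPLES
                  if t[0] in linked and start <= t[3] <= as_of]
    return anchors + neighbours


as_of = date(2025, 2, 14)
anchors = coarse_retrieve("ExampleCo", as_of)
subgraph = fine_retrieve(anchors, as_of)
```

Here the stale September event is excluded by the date anchor, the same-industry peer is pulled in by the expansion, and the unrelated retailer never appears. That is the behaviour chunk retrieval does not give for free.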

For operators, this is the strongest design lesson in the paper. If a user asks, “What is the investment outlook for this stock?”, the useful answer rarely lives in one passage. It lives in a relation among company facts, sector dynamics, reported events, market timing, and peer comparison. Retrieval should therefore reconstruct the relation, not merely quote the nearest paragraph.

The main evidence is a weekly backtest, and it should be read carefully

The paper evaluates FinKario-RAG through a weekly long-only trading strategy. On a given trading day, the system generates a signal for a stock. If the signal is a buy signal, the strategy purchases at that day’s closing price and sells at the closing price on the last trading day of the following week.
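The holding rule described above can be written down in a few lines. This is a sketch under stated assumptions: "following week" is read as the next calendar week (Monday to Sunday), the trading calendar is whatever dates appear in the price dictionary, and the prices are invented. The paper's exact handling of holidays and suspended stocks is not reproduced.

```python
from datetime import date, timedelta


def exit_date(signal_day, trading_days):
    """Last trading day of the calendar week after the signal's week."""
    monday = signal_day - timedelta(days=signal_day.weekday())
    next_monday = monday + timedelta(days=7)
    next_sunday = next_monday + timedelta(days=6)
    candidates = [d for d in trading_days if next_monday <= d <= next_sunday]
    return max(candidates) if candidates else None


def trade_return(signal_day, closes):
    """Buy at the signal day's close, sell at the exit day's close."""
    sell_day = exit_date(signal_day, closes.keys())
    if sell_day is None or signal_day not in closes:
        return None
    return closes[sell_day] / closes[signal_day] - 1.0


# Invented closing prices; weekdays only, no holiday calendar.
closes = {
    date(2025, 2, 19): 10.0,   # Wednesday, signal day
    date(2025, 2, 21): 10.2,
    date(2025, 2, 24): 10.4,
    date(2025, 2, 27): 10.8,
    date(2025, 2, 28): 11.0,   # Friday of the following week
}
sell_day = exit_date(date(2025, 2, 19), closes.keys())
result = trade_return(date(2025, 2, 19), closes)
```

Note that under this reading the holding period varies with the signal day: a Monday signal is held for almost two weeks, a Friday signal for one.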

The dataset combines research reports from East Money, stock price data from Tushare, and index/industry data from Wind. The reports cover 2024-08-28 to 2025-02-28. Price data for backtesting runs to 2025-03-07. The metrics include annualised rate of return, volatility, Sharpe ratio, maximum drawdown, Calmar ratio, and directional accuracy.
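For readers who want the metrics pinned down, the sketch below gives common textbook definitions. These are assumptions, not the paper's stated formulas: returns are treated as weekly, annualisation is geometric over 52 periods, the Sharpe ratio uses a zero risk-free rate, and the sample returns are invented.

```python
import math
import statistics

PERIODS_PER_YEAR = 52  # assumption: weekly strategy returns


def annualised_return(returns):
    """Geometric annualisation of compounded period returns."""
    growth = math.prod(1.0 + r for r in returns)
    return growth ** (PERIODS_PER_YEAR / len(returns)) - 1.0


def annualised_volatility(returns):
    """Sample standard deviation scaled by the square root of periods per year."""
    return statistics.stdev(returns) * math.sqrt(PERIODS_PER_YEAR)


def max_drawdown(returns):
    """Largest peak-to-trough fall of the compounded equity curve."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def directional_accuracy(predicted_up, realised_returns):
    """Share of periods where the predicted direction matched the outcome."""
    hits = sum(p == (r > 0) for p, r in zip(predicted_up, realised_returns))
    return hits / len(realised_returns)


weekly = [0.02, -0.01, 0.03, -0.04, 0.01]   # invented returns
arr = annualised_return(weekly)
vol = annualised_volatility(weekly)
mdd = max_drawdown(weekly)
sharpe = arr / vol          # risk-free rate assumed to be zero
calmar = arr / mdd
acc = directional_accuracy([True, True, True, False, False], weekly)
```

Five observations are far too few to say anything about a strategy; the sample exists only to make the functions runnable.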

The comparison set is broad:

Baseline category	Examples	What the comparison is testing
Market indices	CSI 300, CSI 500, SSE Composite, Dividend Index	Whether the strategy beats passive market exposure in the same period
Vanilla LLMs	Qwen3-8B, GPT-4o-mini	Whether general-purpose models can use report information effectively
Financial LLMs	FinMA, FinGPT, DISC-FinLLM, XuanYuan-6B, Stock-Chain	Whether finance-tuned or finance-oriented models solve the task without FinKario’s graph structure
Institutions	Selected brokerages with frequent research-report publication	Whether the system compares favourably with real-world institutional recommendation strategies
Retrieval variants	Raw reports, HiDy, vanilla RAG, LightRAG, graph ablations	Whether the graph source and retrieval design actually matter

FinKario-RAG reports the strongest overall performance in the main table: ARR 2.633, volatility 0.534, Sharpe ratio 4.926, maximum drawdown 0.172, Calmar ratio 15.315, and accuracy 0.581.
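A quick arithmetic check shows how these figures relate to one another. The definitions used here are assumptions (Sharpe as ARR divided by volatility with a zero risk-free rate, Calmar as ARR divided by maximum drawdown); the paper may use slightly different conventions.

```python
# Figures as reported for FinKario-RAG in the main table.
arr, vol, sharpe, mdd, calmar = 2.633, 0.534, 4.926, 0.172, 15.315

implied_sharpe = arr / vol      # zero risk-free rate is an assumption
implied_calmar = arr / mdd

# Relative gap between the implied ratio and the reported one.
sharpe_gap = abs(implied_sharpe - sharpe) / sharpe
calmar_gap = abs(implied_calmar - calmar) / calmar
```

Under those assumed definitions the implied ratios land within about 0.1% of the reported ones, which is consistent with rounding. The practical reading is that Sharpe and Calmar are not independent evidence here: both are largely restatements of the same ARR figure divided by a risk measure.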

A few comparisons make the result easier to interpret:

Model / strategy	ARR	Sharpe	Max drawdown	Accuracy
FinKario-RAG	2.633	4.926	0.172	0.581
Guolian-Minsheng	2.012	3.108	0.169	0.575
SOOCHOW	1.625	3.115	0.132	0.557
Stock-Chain	1.177	0.971	0.190	0.546
Qwen3-8B	0.941	2.051	0.132	0.475
GPT-4o-mini	0.351	0.944	0.178	0.471
Raw-report RAG using GPT-4o-mini	0.336	0.932	0.197	0.559

The accuracy result needs sober reading. FinKario-RAG’s 0.581 is the best reported accuracy, but it is only slightly above Guolian-Minsheng’s 0.575 and China-Fortune’s 0.573 (an institutional baseline not listed in the excerpt above). The larger difference is not directional accuracy alone. It is the combination of return, risk-adjusted return, and drawdown profile under the paper’s trading rule.

That matters because a trading system can be only modestly better at direction and still produce much stronger returns if its correct calls occur in higher-payoff situations, if its selection set is better, or if its sector exposure aligns with market momentum. The paper’s case study suggests FinKario-RAG concentrated more heavily in high-growth sectors such as electrical equipment, semiconductors, and healthcare during a technology-led rally around February 2025. That is a useful behavioural explanation, though it should be treated as interpretive evidence rather than a separate proof.
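The payoff argument is easy to see with a worked example. The two hit rates below are the reported accuracies for FinKario-RAG and Guolian-Minsheng; the average win and loss sizes are invented purely to illustrate the mechanism and are not taken from the paper.

```python
def expected_return(hit_rate, avg_win, avg_loss):
    """Expected per-trade return given hit rate and average win/loss sizes."""
    return hit_rate * avg_win - (1.0 - hit_rate) * avg_loss


# Hit rates are the reported accuracies; payoff sizes are invented.
a = expected_return(0.581, avg_win=0.05, avg_loss=0.03)
b = expected_return(0.575, avg_win=0.03, avg_loss=0.03)
```

With nearly identical hit rates, strategy `a` earns more than three times the expected per-trade return of strategy `b` simply because its winners are larger. Whether that is what actually happened in the backtest would require trade-level data the article does not have.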

The ablations say the Event Graph is doing real work

The ablation studies are where the paper’s mechanism becomes testable. They ask two practical questions:

Does the knowledge source matter?
Does the retrieval method matter?

The answer to both is yes, at least within the reported setup.

When FinKario-RAG is replaced with raw research-report input, performance falls sharply: ARR is 0.336, Sharpe is 0.932, and maximum drawdown rises to 0.197. This supports the claim that simply injecting long report text is not enough. The model struggles to filter and structure useful information from long documents.

When the system uses HiDy, an open-source financial knowledge base, ARR rises to 0.462 and Sharpe to 1.353, but it still remains far below the full FinKario setup. This suggests that generic or external financial knowledge helps, but the event-enhanced report-specific graph is carrying much of the advantage.

The most revealing ablations are the graph removals:

Variant	Likely purpose	Result pattern	Interpretation
Raw research reports	Test whether unstructured report text is enough	Low ARR and Sharpe, though accuracy remains 0.559	Long-form text retrieval gives signal but weak portfolio conversion
HiDy knowledge source	Test whether another financial knowledge base can substitute for FinKario	Better than raw reports but far below full FinKario	Financial knowledge helps, but source/task alignment matters
Without Event Graph	Test whether dynamic event drivers matter	ARR falls to 0.386, Sharpe to 0.903, accuracy to 0.474	Event structure appears central to the performance gain
Without Attribute Graph	Test whether stable company facts matter	ARR remains high at 2.230, Sharpe 4.691, but accuracy falls to 0.433	Event signals drive much of the return profile, while attributes may improve grounding and directional reliability
Full FinKario	Combined structure	ARR 2.633, Sharpe 4.926, accuracy 0.581	The best result comes from combining stable attributes, event drivers, and graph retrieval

One small textual wrinkle: the paper’s prose appears to conflate one table value when describing the event-graph removal, but the table itself reports 0.386 ARR for the “w/o Event graph” variant and 0.336 ARR for raw research-report injection. The table values are the safer basis for interpretation.

The retrieval ablation is equally blunt. Vanilla RAG over FinKario produces ARR 0.377, Sharpe 0.758, and accuracy 0.413. LightRAG improves those figures to ARR 0.821, Sharpe 1.313, and accuracy 0.495. Full FinKario-RAG reaches ARR 2.633, Sharpe 4.926, and accuracy 0.581.

This is the paper’s central engineering lesson: even after building the graph, retrieval design still matters. A graph that is retrieved poorly becomes an expensive filing cabinet. Elegant, perhaps. Still a filing cabinet.

The result is not “LLMs beat institutions.” It is “structured retrieval changes the comparison.”

The tempting headline is that FinKario outruns financial LLMs and brokerage strategies. The paper encourages that reading, and the title of this article does not exactly flee from it. But for business interpretation, the better reading is narrower.

FinKario-RAG does not prove that an autonomous LLM can replace institutional research. It uses institutional-style research reports as input. It also compares against selected institutional strategies reconstructed from brokerage recommendations, not against full internal buy-side workflows with risk teams, execution systems, portfolio constraints, and human override. Those are very different animals. One lives in a paper table. The other has a compliance department and a coffee budget.

What the paper more directly shows is this:

Claim	What the paper shows	Business meaning	Boundary
Event-enhanced graphs improve financial retrieval	Removing the Event Graph sharply reduces ARR and Sharpe	Extracting event drivers from reports can add usable signal beyond static company profiles	Tested on one report corpus and market window
Raw report RAG is insufficient	Raw-report injection performs far below full FinKario-RAG	Document search is not the same as decision infrastructure	Results depend on prompt, model, chunking, and retrieval implementation
Two-stage graph retrieval matters	Vanilla RAG and LightRAG underperform FinKario-RAG	Anchoring by stock/date before expanding context is operationally sensible	Retrieval design may need adaptation by asset class and data regime
FinKario-RAG beats selected baselines	Strongest ARR, Sharpe, Calmar, and accuracy in the reported backtest	The architecture deserves serious evaluation for analyst copilots and investment tools	Short backtest, Chinese A-share context, weekly long-only rule
Accuracy is not the whole story	Accuracy lead is small, return lead is larger	Portfolio outcomes depend on payoff distribution and selection quality	Requires deeper trade-level analysis before live allocation

This distinction matters because many financial AI projects still chase model prestige. They ask whether to use a larger LLM, a finance-tuned LLM, or a multi-agent wrapper with just enough theatre to impress a demo committee. FinKario points elsewhere: the edge may come from the shape of the memory layer and the discipline of retrieval.

Where this becomes useful in business practice

The obvious market for FinKario-like systems is investment decision support. But the deeper use case is any workflow where long financial narratives need to become structured, time-aware reasoning context.

For an investment platform, the architecture could support explainable stock screening: not merely ranking stocks, but showing which report-derived event drivers contributed to the ranking. For a wealth-advice interface, it could help prevent generic answers by grounding commentary in company-specific and sector-specific context. For an analyst team, it could act as a research memory layer: a system that remembers when a company’s narrative changed, which events drove that change, and how related companies were discussed in the same period.

For a data vendor, the product opportunity is even cleaner. Research reports are already a premium content category. A vendor that converts them into event graphs can sell not just documents, but queryable financial state. That shifts the product from “search our reports” to “interrogate the evolving investment narrative.” Less glamorous than “AI analyst,” but much more likely to survive procurement.

Cognaptus would separate the business implications into three layers:

Layer	What can be adopted now	What needs validation
Research operations	Automated extraction of attributes, events, and report-derived causal drivers	Extraction accuracy across report formats, languages, sectors, and issuers
Analyst copilots	Graph-grounded answers with entity identifiers, peer comparison, and event rationale	Human review workflows, citation traceability, compliance logging
Investment automation	Signal generation from event-enhanced retrieval	Out-of-sample live testing, transaction costs, turnover, risk limits, position sizing, market-regime robustness

The first two are closer to production-readiness than the third. Turning reports into structured research memory is a workflow improvement. Turning that memory into trades is a financial product with a very different risk surface. Same architecture, different blast radius.

The short backtest is the largest interpretive boundary

The backtest window is short: reports from late August 2024 through late February 2025, with price data through early March 2025. Annualised return metrics over such a period can look dramatic because annualisation stretches a few months into a full-year equivalent. That does not make the numbers fake. It does make them sensitive.
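The sensitivity is simple to demonstrate. The sketch below assumes geometric annualisation over 52 weeks, which may differ from the paper's convention, and the period returns are invented rather than taken from the backtest.

```python
def annualise(period_return, weeks, weeks_per_year=52):
    """Geometric annualisation of a return earned over `weeks` weeks."""
    return (1.0 + period_return) ** (weeks_per_year / weeks) - 1.0


base = annualise(0.90, weeks=26)   # 90% over roughly six months (invented)
lower = annualise(0.80, weeks=26)  # same window, ten points less
```

Here a ten-point difference in the half-year return (0.90 versus 0.80) becomes a 37-point difference in the annualised figure (2.61 versus 2.24). Over a six-month window, one or two strong weeks can move the headline number a long way.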

The paper’s own narrative notes market movements around late September 2024 and February 2025. This matters because FinKario-RAG’s sector concentration appears to have aligned well with a rally in technology-related sectors. That could indicate strong event understanding. It could also indicate a favourable regime for the extracted signals. The difference only becomes clear through longer, rolling, out-of-sample evaluation across market cycles.

Several further questions remain open:

Would the system work on markets where broker reports are less frequent, less structured, or less retail-accessible?
How much of the result depends on East Money report coverage and Chinese A-share market conditions?
What happens after transaction costs, liquidity constraints, position limits, and realistic execution?
How stable is performance when the model must process tables, charts, and time-series data inside reports rather than mainly text?
Can the system provide sufficiently auditable evidence chains for regulated advisory use?
Does the graph update process avoid look-ahead leakage when reports, prices, and recommendations are timestamped imperfectly?

These are not polite academic footnotes. They are product requirements wearing a trench coat.
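The look-ahead question in particular translates directly into code. A minimal point-in-time filter might look like the sketch below; the records and the `visible_facts` helper are invented, and the strict inequality is a conservative assumption for cases where the publication time within the day is unknown.

```python
from datetime import date

# Invented records: (stock, fact, publish_date)
FACTS = [
    ("ExampleCo", "Demand: orders rising", date(2025, 2, 10)),
    ("ExampleCo", "Revenue: guidance raised", date(2025, 2, 17)),
    ("ExampleCo", "Policy Regulation: subsidy", date(2025, 2, 14)),
]


def visible_facts(stock, decision_day):
    """Facts for `stock` published strictly before the decision day."""
    return [f for f in FACTS if f[0] == stock and f[2] < decision_day]


usable = visible_facts("ExampleCo", date(2025, 2, 14))
```

Whether a same-day report should count as visible depends on when it was actually published relative to the close. If timestamps only carry a date, excluding same-day material costs some signal but removes one obvious source of leakage.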

The appendix is implementation evidence, not a second thesis

The paper’s appendices are useful because they show how much of the system depends on prompt design and schema discipline. The Attribute Graph prompt extracts 11 relation types. The Event Graph prompt uses categories such as Supply, Demand, Revenue, Efficiency Cost, Strategic Action, Technology Innovation, Policy Regulation, and Macro. The event extraction prompt asks for subject, object, relevant entities, timeframe, driven category, and a reasoning statement.

That is important for reproducibility. It also highlights a commercial risk: automated graph construction is only as good as its schema governance. If the schema drifts, if relation categories become too broad, or if LLM extraction begins inserting plausible but unsupported causal links, the graph may become confidently polluted.
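One inexpensive form of schema governance is to validate every extracted event before it enters the graph. The sketch below uses the category names listed above; the field names are hypothetical stand-ins for the fields the prompt requests, and the sample records are invented.

```python
ALLOWED_CATEGORIES = {
    "Supply", "Demand", "Revenue", "Efficiency Cost",
    "Strategic Action", "Technology Innovation",
    "Policy Regulation", "Macro",
}
# Hypothetical field names mirroring the fields the prompt asks for.
REQUIRED_FIELDS = ("subject", "object", "relevant_entities",
                   "timeframe", "driven_category", "reasoning")


def validate_event(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if not record.get(name)]
    category = record.get("driven_category")
    if category and category not in ALLOWED_CATEGORIES:
        problems.append(f"unknown category: {category}")
    return problems


# Invented sample records.
good = {"subject": "ExampleCo", "object": "overseas plant",
        "relevant_entities": ["ExampleCo"], "timeframe": "2025Q1",
        "driven_category": "Strategic Action",
        "reasoning": "Capacity expansion supports export revenue."}
bad = dict(good, driven_category="Sentiment", reasoning="")

good_problems = validate_event(good)
bad_problems = validate_event(bad)
```

A check like this catches category drift and empty rationales. It does not catch the harder failure, a well-formed record asserting a causal link the source never made; that still needs sampling against the original passages.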

The appendix case study on Haier Biomedical is not main evidence for performance. It is a qualitative illustration of answer behaviour. Its likely purpose is to show that FinKario-RAG can provide more grounded, entity-correct, comparative investment analysis than models that either avoid prediction or reason from incomplete context. That supports the mechanism story, but it does not replace quantitative evaluation.

The supplementary institutional-strategy experiment is also useful, but it should be treated as a robustness-style extension. It expands the institutional comparison from agencies with at least 300 reports to those with at least 100. FinKario-RAG remains strongest on major return-oriented metrics in that broader table. Still, the comparison is only as strong as the method used to translate brokerage reports into backtestable signals.

The real lesson is architectural: finance AI needs temporal structure

FinKario is not important because it adds one more benchmark table to the already crowded theatre of financial LLM evaluation. It is important because it gives a plausible answer to a more stubborn question: how should AI systems represent financial knowledge that changes over time?

A static database can store company facts. A vector database can retrieve similar passages. A language model can write fluent analysis. None of those, by itself, is a financial reasoning system. Finance needs time, entities, events, relations, and causality stitched together carefully enough that a model can retrieve the right context before it starts sounding clever.

That is the quiet strength of FinKario. It does not ask the LLM to become an analyst from vibes and embeddings. It gives the model a structured memory of what the reports said, when they said it, which company it referred to, and what event category it belonged to. Then it retrieves a coherent subgraph instead of throwing paragraphs into the prompt and hoping probability behaves professionally.

For operators, the next step is not blind adoption. It is targeted replication. Rebuild the pipeline on your own report corpus. Test extraction quality. Validate timestamp handling. Compare graph retrieval against your current document search. Run paper trading before capital allocation. Add compliance traceability before client-facing advice. Then decide whether the graph is producing durable signal or just a beautifully organised backtest souvenir.

The irony in FinKario’s result is that the “intelligence” may not live mainly in the language model. It may live in the graph that tells the model what the market narrative has become. Bigger models can read. Better systems remember what changed.

Cognaptus: Automate the Present, Incubate the Future.

¹ Xiang Li, Penglei Sun, Wanyun Zhou, Zikai Wei, Yongqi Zhang, and Xiaowen Chu, “FinKario: Event-Enhanced Automated Construction of Financial Knowledge Graph,” arXiv:2508.00961, 2025. https://arxiv.org/abs/2508.00961
