Pricing Plans, Meet Prompt Engineering: LLMs and the Future of SaaS Monetization

TL;DR for operators

SaaS pricing has become too complex to live only as a web page. Plans, feature gates, usage limits, add-ons, annual discounts, enterprise exceptions, and product bundles now behave like operational logic. Yet in many companies, that logic is still scattered across marketing pages, billing systems, sales decks, spreadsheets, and someone’s memory. A robust governance model, naturally.

The paper behind this article, From Static to Intelligent: Evolving SaaS Pricing with LLMs, proposes a practical direction: use LLM-assisted extraction to transform static SaaS pricing pages into machine-readable “intelligent pricing” artifacts.¹ The authors implement a basic system, AI4Pricing2Yaml, that scrapes SaaS pricing pages and uses an LLM to identify four core pricing elements: plans, features, usage limits, and add-ons.

The key business lesson is not “LLMs can optimise your prices.” That would be a different paper, and a much more dangerous meeting with Finance. This paper is about making existing pricing structures legible to machines. Once pricing becomes structured, it can be audited, compared, versioned, monitored, and integrated into DevOps workflows.

The evidence is promising but uneven. On 30 commercial SaaS sites, feature extraction performs strongly, with mean accuracy of 88.2%, precision of 91.1%, and recall of 96.4%. Plan extraction reaches perfect recall, but only 64.3% mean accuracy because the model often confuses add-ons or neighbouring products with actual plans. Usage limits and add-ons are weaker: usage limits reach 67% mean accuracy, while add-ons reach only 53.5%.

For operators, the near-term value is pricing intelligence and operational hygiene, not autonomous monetisation. The system can help build structured pricing inventories, compare competitors, detect plan drift, and support pricing-governance tooling. The boundary is sharp: the current implementation is best suited to static, single-page pricing layouts with clear comparison tables. Dynamic pages, modals, multiple products, ambiguous add-ons, and hallucinated structure remain the tax bill for pretending the web is organised.

Pricing pages are becoming operational systems wearing marketing clothes

A SaaS pricing page looks simple until someone has to maintain it.

There are usually three or four plan cards, a reassuring “contact sales” button, a table of features, and a sprinkle of phrases like “advanced collaboration” and “premium support,” because apparently software buyers enjoy semantic fog. Underneath that familiar surface is a configuration system. Every feature belongs somewhere. Every limit has a unit. Every add-on has dependencies. Every plan boundary has consequences for billing, onboarding, feature toggling, revenue recognition, and customer support.

The paper frames this as part of “Pricing-driven Development and Operation of SaaS”: the idea that pricing is not merely a sales artifact but a driver of software behaviour. If a feature is available only to Business-tier users, that pricing distinction eventually becomes application logic. If storage limits differ by plan, that pricing decision becomes enforcement logic. If add-ons can be purchased independently, that pricing choice becomes entitlement logic.

This is why the paper’s starting point matters. The authors are not asking whether a model can invent the perfect price. They are asking whether pricing information already visible on SaaS websites can be transformed into a structured, machine-readable representation. That is a humbler goal. It is also more immediately useful.

The authors use GitHub as an example of how quickly pricing complexity can grow. Their cited prior analysis reports that GitHub’s pricing reached 8,960 possible subscription combinations in 2024, built from 3 plans, 81 features, 9 usage limits, and 14 add-ons. The specific number is less important than the operational pattern: even a compact pricing page can imply a large configuration space. When that space is managed manually, errors are not an edge case. They are a scheduling feature.

Intelligent pricing is not dynamic pricing with nicer branding

The phrase “intelligent pricing” is easy to misread. It sounds like AI changing prices in real time based on demand, customer profile, competitor behaviour, lunar phase, or whatever dashboard metric has most recently seduced the growth team.

That is not what this paper does.

Here, intelligent pricing means a dynamic, machine-readable pricing model that can be treated as a software artifact. It is closer to infrastructure-as-code than to airline ticket pricing. The target output is a structured representation, using the Pricing4SaaS metamodel and Pricing2Yaml-style syntax, that captures plans, features, usage limits, and add-ons in a format downstream systems can process.
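To make "machine-readable" concrete, here is a minimal Python sketch of the four element types as plain data classes. The class names, field names, and example data are all illustrative assumptions; this is not the Pricing4SaaS metamodel or the actual Pricing2Yaml syntax, only the general shape such an artifact might take.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional
import json

# Hypothetical field names for illustration; not the Pricing2Yaml schema.
@dataclass
class UsageLimit:
    name: str
    value: float
    unit: str

@dataclass
class Plan:
    name: str
    monthly_price: Optional[float]  # None stands in for "contact sales"
    features: list = field(default_factory=list)
    limits: list = field(default_factory=list)

@dataclass
class AddOn:
    name: str
    monthly_price: float
    available_for: list = field(default_factory=list)  # plan names

@dataclass
class Pricing:
    product: str
    plans: list
    add_ons: list

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses and lists.
        return json.dumps(asdict(self), indent=2)

# Invented example data.
pricing = Pricing(
    product="ExampleApp",
    plans=[
        Plan("Free", 0.0, ["Boards"], [UsageLimit("storage", 1, "GB")]),
        Plan("Team", 12.0, ["Boards", "SSO"], [UsageLimit("storage", 100, "GB")]),
    ],
    add_ons=[AddOn("Extra storage", 5.0, available_for=["Team"])],
)
print(pricing.to_json())
```

Once pricing lives in a structure like this, it can be serialised, diffed, and validated like any other configuration file.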

The distinction matters because it changes the business interpretation.

Reader assumption	What the paper actually targets	Why the difference matters
“The LLM decides what prices to charge.”	The LLM extracts existing public pricing information.	This is pricing operations, not revenue optimisation.
“The system replaces pricing strategy.”	The system structures pricing artifacts for review and reuse.	Humans still decide packaging, discounting, positioning, and margins.
“The main risk is bad price recommendations.”	The main risk is incorrect extraction or modelling.	Governance should focus on verification, schema validation, and audit trails.
“This is just scraping.”	The model must classify semantic pricing elements from messy page structures.	The hard part is not fetching HTML; it is interpreting product logic.

The useful mental model is not “AI pricing strategist.” It is “LLM-assisted compiler for SaaS pricing pages.” The source language is messy HTML and marketing copy. The target language is structured pricing logic. Like any compiler, it is only useful if it handles ambiguity, rejects invalid input, and does not silently invent syntax because it felt inspired.

The proposed machine has three layers, but the prototype mostly tests the first

The paper proposes an automatic intelligent pricing modeler with three components.

First, an Information Extractor retrieves the pricing page and uses an LLM to identify plans, features, usage limits, and add-ons. In the implementation, Selenium is used for scraping, and Gemini 1.5 Flash is used as the LLM, largely because its context window can process large page contents and because it is easy to access through an API.

Second, a Process Engine is supposed to validate and clean the extracted information. This layer is where duplicate elements, inconsistent monthly and annual prices, hallucinated items, and other structural problems should be flagged. In production terms, this is the boring but essential layer. Naturally, therefore, it is also where much of the real value lives.
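A minimal sketch of two checks such a layer might run. It assumes the extractor returns plan dictionaries with hypothetical keys `name`, `monthly`, and `annual` (the price for a full year). The paper describes the Process Engine's role, not this code, and the threshold rule is an assumption chosen for illustration.

```python
from collections import Counter

def check_plans(plans: list) -> list:
    """Return human-readable warnings; an empty list means nothing was flagged."""
    warnings = []
    # Duplicate detection on trimmed, case-insensitive names.
    counts = Counter(p["name"].strip().lower() for p in plans)
    for name, n in counts.items():
        if n > 1:
            warnings.append(f"duplicate plan name: {name!r} appears {n} times")
    # Assumed rule: an annual price above 12x the monthly price is suspicious.
    for p in plans:
        monthly, annual = p.get("monthly"), p.get("annual")
        if monthly is not None and annual is not None and annual > 12 * monthly:
            warnings.append(
                f"{p['name']}: annual {annual} exceeds 12 x monthly {monthly}"
            )
    return warnings

# Invented extractor output.
extracted = [
    {"name": "Pro", "monthly": 10.0, "annual": 100.0},
    {"name": "pro ", "monthly": 10.0, "annual": 100.0},
    {"name": "Team", "monthly": 20.0, "annual": 300.0},
]
for warning in check_plans(extracted):
    print(warning)
```

The checks return warnings rather than silently fixing the data, which keeps a human reviewer in the loop for anything the rules cannot settle.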

Third, a Results Modeler converts the cleaned output into a structured pricing file, such as Pricing2Yaml, along with logs and warnings that help developers review the resulting artifact.

A simplified version of the mechanism looks like this:

Pricing URL
   ↓
Scraped HTML
   ↓
LLM-based Information Extractor
   ↓
Validation and consistency checks
   ↓
Pricing2Yaml-style machine-readable model
   ↓
Warnings, logs, review, downstream automation

The important nuance is that the paper’s implemented prototype focuses on the Information Extractor. The broader architecture is the destination; the validation tests whether the extraction step can recover the raw ingredients needed for machine-readable pricing.

That makes the evidence easier to interpret. The paper is not proving that a complete autonomous pricing-governance system is ready. It is testing whether the most fragile front-end step—extracting structured pricing elements from public SaaS pages—is feasible enough to justify the architecture.

That is a sensible research move. It is also a useful filter for business readers: do not buy the cathedral when the paper has validated the quarry.

The evaluation tests extraction viability, not production reliability

The authors validate the extractor on a dataset of 30 commercial SaaS pricing websites. They evaluate four categories: plans, features, usage limits, and add-ons. The scoring uses true positives, false positives, false negatives, and true negatives, with partial credit for incomplete or partially wrong extraction.
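For reference, the three reported metrics follow the standard confusion-matrix definitions, sketched below in Python. The counts are invented, and the paper's partial-credit rules are not reproduced here; the function simply accepts fractional counts so partial credit could be passed in.

```python
def extraction_metrics(tp: float, fp: float, fn: float, tn: float = 0.0) -> dict:
    """Standard accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Invented counts for one site: 3 real plans found, 2 non-plans also
# returned, nothing missed.
metrics = extraction_metrics(tp=3, fp=2, fn=0)
print(metrics)
```

With these counts recall is 1.0 while precision and accuracy are both 0.6: finding everything is cheap if the system also returns things that are not there. Note that the paper reports means across sites, so its headline figures cannot be recomputed from one another.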

This design has a clear purpose. It is the main evidence for feasibility: can the model extract the elements required to populate an intelligent pricing model?

It is not an ablation, not a comparison against a prior extraction system, and not a robustness test across every possible pricing-page design. The paper does not compare Gemini 1.5 Flash against GPT-4, Llama, rule-based parsers, or human annotators. Nor does it test multiple prompting strategies as a formal sensitivity analysis. That is fine, but it limits the claim.

The results are best read category by category.

Extracted element	Mean accuracy	Mean precision	Mean recall	Interpretation
Plans	64.3%	61.4%	100%	The system finds all plans but also over-extracts non-plans.
Features	88.2%	91.1%	96.4%	The strongest category; structured feature tables are LLM-friendly.
Usage limits	67.0%	83.8%	77.8%	Limits are harder because they require interpretation of quantities, units, and conditions.
Add-ons	53.5%	63.0%	81.0%	The weakest category; add-ons are often visually and semantically ambiguous.

The headline is not “LLMs solve SaaS pricing extraction.” The headline is more precise: LLM-assisted extraction appears viable when pricing information is static, visible, and structurally organised, especially for feature tables. It becomes brittle when the web page asks the model to infer product boundaries, hidden interactions, add-on semantics, or usage-limit logic from inconsistent layouts.

In other words, the model is not merely reading a pricing page. It is trying to reconstruct the company’s commercial ontology from a web page designed by committee. One should not expect miracles. One may reasonably expect invoices.

Feature extraction works because comparison tables do half the thinking

The strongest result is feature extraction. Across the tested SaaS sites, the system achieves 88.2% mean accuracy, 91.1% precision, and 96.4% recall. The median accuracy is even higher at 96.3%.

This is the paper’s cleanest evidence that the approach has practical potential. But the reason matters.

Features are often displayed in comparison tables. Tables give the model repeated structure: rows correspond to features, columns correspond to plans, and cells encode availability or limits. The LLM still has to parse labels, map rows to plans, and handle variations, but the page layout provides a scaffold. The machine does not need to invent the schema; the website is quietly offering one, even if it refuses to admit it.

The paper notes strong feature performance for several SaaS examples, including near-perfect results for Buffer and Figma. But the errors are instructive. Dropbox and Mailchimp show notable false positives, while Jira suffers from missed features because the model skipped rows in HTML tables.

That last failure is especially relevant for implementation. A skipped row is not a philosophical limitation of artificial intelligence. It is a pipeline failure. It suggests that extraction quality depends not only on the LLM but also on how page content is scraped, linearised, chunked, and presented to the model. Prompt engineering receives the glamour; HTML preprocessing does the laundry.
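One way to do that laundry is to linearise comparison tables into one explicit line per row before the model sees them. The sketch below uses Python's standard `html.parser`; it is an assumption about what such preprocessing could look like, not the paper's pipeline, and it ignores nested tables and `colspan`.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of each <tr> as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append(" ".join(s for s in self._cell if s))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Invented pricing table.
html = """
<table>
  <tr><th>Feature</th><th>Free</th><th>Pro</th></tr>
  <tr><td>Boards</td><td>Yes</td><td>Yes</td></tr>
  <tr><td>SSO</td><td></td><td>Yes</td></tr>
</table>
"""
parser = TableRows()
parser.feed(html)
header, *body = parser.rows
# One explicit line per row, with empty cells spelled out as "no".
lines = [
    "; ".join(f"{plan}={cell or 'no'}" for plan, cell in zip(header[1:], row[1:]))
    for row in body
]
for row, line in zip(body, lines):
    print(f"{row[0]}: {line}")
```

Because the row count is known before the LLM call, a dropped row can be detected by comparing it with the number of features the model returns.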

For business use, this means feature extraction is the most credible early application. A pricing-intelligence team could use a system like this to build structured feature matrices for competitor monitoring, packaging analysis, and internal audits. But even here, outputs need verification. A 96.4% recall rate sounds high until the missing 3.6% is the enterprise SSO feature your sales team cares about.

Plan extraction finds everything, including things that are not plans

Plan extraction reaches 100% recall. That sounds excellent. It is also a lovely example of why recall without precision can flatter a system.

The mean accuracy for plan extraction is only 64.3%, and mean precision is 61.4%. The model generally finds the real plans, but it also returns extra items. In some cases, it confuses add-ons with plans. In others, it extracts plans from adjacent products appearing on the same page, such as confusion between Figma and FigJam.

This is a common enterprise extraction problem: the model can identify the right class of thing but struggles with scope. A pricing page may contain multiple products, bundled services, plan-like add-ons, trial options, enterprise contact blocks, and comparison sections. Human readers use context to decide what belongs to the core product. The model sees repeated pricing-shaped patterns and occasionally salutes the wrong flag.

Operationally, this matters because plan identity is foundational. If the plan list is polluted, every downstream mapping becomes suspect. Features attach to the wrong plan. Limits become misaligned. Add-ons appear purchasable under incorrect conditions. A structured YAML file built on a noisy plan set is not automation; it is a faster way to create a mess with indentation.

The fix is not simply “use a better model.” The paper’s discussion points toward internal improvements such as stronger prompt engineering, structured outputs, function calling, and validation logic. For plan extraction specifically, a production system would likely need explicit product-scope detection: which product is being modelled, which page sections belong to it, and which plan-like entities are merely add-ons, modules, or related products.

That is not a minor detail. It is the difference between scraping a web page and understanding a catalogue.

Usage limits and add-ons expose the semantic burden

Usage limits and add-ons are where the easy narrative collapses, usefully.

Usage limits achieve 67% mean accuracy, 83.8% precision, and 77.8% recall. Add-ons perform worse, with 53.5% mean accuracy, 63% precision, and 81% recall. These are not catastrophic results for a basic implementation, but they are not production-clean either.

The problem is that limits and add-ons are not always presented as neat rows in a table. A usage limit can be a number, a threshold, a condition, a quota, a rate, a storage allowance, a seat cap, or a “fair usage” phrase wearing a cardigan. It may apply to a feature, a plan, an add-on, a billing cycle, or a region. Add-ons are even messier: they can look like products, optional modules, support packages, usage expansions, integrations, or sales-led enterprise extras.

The paper’s examples show this unevenness. Microsoft 365 performs very poorly on usage-limit extraction, with 12.5% accuracy and 13% precision. For add-ons, Jira, Clockify, Overleaf, and Evernote present major difficulties. Some SaaS products with no add-ons still trigger false add-on extraction, which is exactly the sort of hallucination that makes automated governance teams quietly reach for manual review.

This is where the paper’s mechanism-first reading becomes important. The LLM is not just identifying words. It is being asked to infer relationships:

Is this item included in the plan or sold separately?
Is this number a price, quota, storage limit, seat cap, or usage threshold?
Does this add-on require a base plan?
Does this condition apply monthly, annually, per user, per workspace, or per organisation?
Is a visible option part of this product or a neighbouring product line?

Those are schema questions. A language model can help answer them, but the system needs validation rules, canonical units, dependency modelling, and human-readable warnings. Without that, the YAML may look tidy while quietly lying. The best lies, as every enterprise architect knows, are properly formatted.
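As a small illustration of canonical units, the Python sketch below normalises free-text limits into a value and a unit, and refuses to guess when it cannot. The alias table and the regular expression are hypothetical; a real schema would define its own units and would also need to handle conditions, billing periods, and scope.

```python
import re

# Hypothetical canonical units; a real schema would define its own.
UNIT_ALIASES = {
    "gb": "GB", "gigabytes": "GB", "tb": "TB",
    "user": "seat", "users": "seat", "seat": "seat", "seats": "seat",
}
LIMIT_RE = re.compile(r"(?P<value>\d[\d,]*(?:\.\d+)?)\s*(?P<unit>[A-Za-z]+)")

def parse_limit(text: str):
    """Return (value, unit), or None when the text needs human review."""
    if "unlimited" in text.lower():
        return (float("inf"), None)  # no unit is stated for an unbounded limit
    match = LIMIT_RE.search(text)
    if not match:
        return None
    unit = UNIT_ALIASES.get(match.group("unit").lower())
    if unit is None:
        return None  # unknown unit: flag it rather than guess
    return (float(match.group("value").replace(",", "")), unit)

for raw in ["100 GB", "Up to 1,000 users", "Unlimited", "Fair usage applies"]:
    print(raw, "->", parse_limit(raw))
```

Returning `None` for "Fair usage applies" is the point: a warning for a reviewer is more useful than a confident number that was never on the page.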

The “ideal pricing page” is a design implication, not a second experiment

The paper includes an “ideal extraction” section that describes what a machine-friendly SaaS pricing page should look like. It recommends clearly separated sections for plans, comparison-table features, usage limits, and add-ons. It also argues that pricing information should be static and visible without extra user interactions.

This section is not a second thesis. It is a design implication derived from the observed failures.

Its purpose is practical: if SaaS companies want pricing pages to be machine-readable, they should stop hiding critical information inside modals, toggles, scripts, fragmented product cards, and decorative layout experiments. A clear HTML table is not just good UX. It is also good machine interface design.

That creates an interesting reversal. The paper begins by using AI to adapt to messy pricing pages. By the end, it implies that pricing pages themselves should adapt to automation. LLMs may reduce the cost of disorder, but they do not repeal it.

For SaaS operators, this is a useful governance point. A company that wants better pricing operations should not only build extraction tools. It should also standardise how pricing information is authored and published. The simplest way to improve an LLM pricing extractor may be to make the pricing page less hostile in the first place. Radical, yes. Perhaps even cheaper.

What the paper directly shows

The paper directly shows that a basic LLM-assisted extractor can recover important pricing elements from real SaaS pricing pages, with strongest performance on features and weaker performance on usage limits and add-ons. It also shows that plan extraction has high recall but suffers from false positives, especially when pages contain multiple products or plan-like add-on structures.

The validation supports three narrower claims.

Claim	Evidence in the paper	Business meaning	Boundary
Machine-readable SaaS pricing is technically plausible.	The extractor processes 30 commercial SaaS pricing sites and extracts plans, features, limits, and add-ons.	Pricing can become a structured operational artifact rather than static collateral.	The prototype focuses on extraction, not full production governance.
Feature matrices are the strongest near-term use case.	Feature extraction reaches 88.2% mean accuracy and 96.4% recall.	Competitive feature comparison and packaging audits are realistic early applications.	Errors still occur when rows are skipped or features are misclassified.
Add-ons and usage limits need stronger modelling.	Usage limits reach 67% mean accuracy; add-ons reach 53.5%.	Monetisation logic cannot be trusted without validation layers.	Ambiguous layouts and hidden dynamic content remain major blockers.
Page structure heavily affects extraction quality.	Only 15 of 30 SaaS platforms had features, usage limits, and add-ons successfully extracted, largely due to missing structured tables or dynamic content.	Pricing-page design becomes part of pricing operations.	Better scraping or agentic browsing may help, but the paper does not prove it yet.

The evidence is strongest as a feasibility demonstration. It is not yet evidence for fully automated pricing management, autonomous monetisation, or reliable extraction across arbitrary pricing-page designs.

What Cognaptus infers for business use

The practical business pathway is straightforward: structured pricing artifacts can reduce the operational cost of understanding, comparing, and governing SaaS pricing.

For SaaS vendors, an intelligent pricing (iPricing) workflow could support internal consistency. Product managers could check whether pricing pages, billing rules, and feature-entitlement systems agree. DevOps teams could treat pricing changes like code changes, with diffs, reviews, and rollback. Sales operations could detect when a public pricing page drifts away from packaging logic used in quotes.

For buyers and analysts, the same approach could support competitive monitoring. Instead of manually maintaining spreadsheets of competitor plans and features, teams could build structured pricing snapshots over time. This would make packaging changes easier to detect: new add-ons, shifted limits, renamed tiers, removed features, or silent enterprise repositioning.
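Once snapshots are structured, detecting drift is an ordinary diff. The Python sketch below assumes a hypothetical snapshot shape, a dictionary mapping each plan name to its feature set and limits, and uses invented data for an imaginary competitor.

```python
def diff_snapshots(old: dict, new: dict) -> list:
    """Compare two {plan: {"features": set, "limits": dict}} snapshots."""
    changes = []
    for plan in sorted(old.keys() - new.keys()):
        changes.append(f"plan removed: {plan}")
    for plan in sorted(new.keys() - old.keys()):
        changes.append(f"plan added: {plan}")
    for plan in sorted(old.keys() & new.keys()):
        o, n = old[plan], new[plan]
        for feature in sorted(o["features"] - n["features"]):
            changes.append(f"{plan}: feature removed: {feature}")
        for feature in sorted(n["features"] - o["features"]):
            changes.append(f"{plan}: feature added: {feature}")
        for key in sorted(o["limits"].keys() | n["limits"].keys()):
            before, after = o["limits"].get(key), n["limits"].get(key)
            if before != after:
                changes.append(f"{plan}: limit {key}: {before} -> {after}")
    return changes

# Invented snapshots of a hypothetical competitor.
january = {
    "Free": {"features": {"Boards"}, "limits": {"storage_gb": 5}},
    "Pro": {"features": {"Boards", "SSO"}, "limits": {"storage_gb": 100}},
}
june = {
    "Free": {"features": {"Boards"}, "limits": {"storage_gb": 2}},
    "Pro": {"features": {"Boards"}, "limits": {"storage_gb": 100}},
    "Business": {"features": {"Boards", "SSO"}, "limits": {"storage_gb": 500}},
}
for change in diff_snapshots(january, june):
    print(change)
```

A renamed tier shows up here as one removal plus one addition; recognising it as a rename would need extra matching logic that this sketch leaves out.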

For platform companies, the more interesting long-term use is pricing-as-configuration. Once pricing is represented in a formal schema, it can feed entitlement services, experimentation frameworks, documentation, onboarding flows, and customer-facing calculators. The paper does not build that full stack, but it points toward the interface such a stack would need.

The ROI logic is therefore not “replace pricing strategists.” It is more mundane and more credible:

Operational pain	Structured iPricing benefit	Likely early user
Manual competitor pricing spreadsheets	Automated extraction and diffing	Product marketing, strategy, pricing analysts
Pricing-page and billing-system inconsistency	Machine-readable source-of-truth checks	RevOps, DevOps, product operations
Feature entitlements scattered across teams	Plan-feature mapping in a formal schema	Product managers, engineering leads
Add-on complexity and packaging drift	Dependency modelling and review warnings	Monetisation teams, sales operations
Slow pricing audits	Repeatable extraction and validation workflow	Finance, legal, internal audit

The unglamorous phrase here is “pricing operations.” That is exactly why the work matters. Most companies do not fail at pricing because nobody has heard of AI. They fail because packaging decisions become operational debt.

Where the current approach breaks

The limitations are not generic. They are specific enough to guide implementation.

First, the system depends heavily on the pricing page containing the necessary information in one place. The implementation is designed for a single webpage. If critical details are spread across modals, secondary pages, tooltips, calculators, plan-specific documentation, or sales-only flows, the extractor cannot reliably model what it cannot see.

Second, the system works best when pricing pages use structured comparison tables. The authors explicitly note that the current algorithm is optimised for pages with comparison tables containing plans and feature information. This explains why features perform well and why less structured elements suffer.

Third, dynamic content remains a problem. Selenium can handle some JavaScript-rendered content, but content requiring user actions—clicking buttons, opening modals, changing billing periods, expanding hidden rows—can still be missed. The paper suggests LLM agents as a future direction for navigating dynamic content, but that remains proposed work, not validated evidence.

Fourth, hallucination is operationally serious. False positives are not harmless when the output becomes a pricing artifact. A hallucinated add-on or misclassified plan can contaminate downstream analytics, audits, or entitlement logic. This is why the proposed Process Engine is not optional plumbing. It is the safety rail.
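One inexpensive guard, offered here as an assumption rather than anything the paper implements, is a grounding check: flag every extracted name that never appears in the scraped page text. The Python sketch below uses invented data.

```python
import re

def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores layout noise."""
    return re.sub(r"\s+", " ", text).strip().lower()

def ungrounded(items: list, page_text: str) -> list:
    """Return extracted names that do not appear in the page text."""
    haystack = normalise(page_text)
    return [item for item in items if normalise(item) not in haystack]

# Invented page text and extractor output.
page_text = """
Pro plan   $12 per user / month
Add-ons: Priority Support, Extra Storage
"""
extracted_add_ons = ["Priority support", "Extra Storage", "Advanced Analytics"]
print(ungrounded(extracted_add_ons, page_text))
```

Substring matching is crude. It catches names the model invented, but not real items that were misclassified: an add-on mislabelled as a plan still passes, so this check complements schema and scope validation rather than replacing them.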

Fifth, the evaluation does not isolate model choice, prompting strategy, scraping method, or schema design as separate causal factors. The paper uses Gemini 1.5 Flash and basic prompt engineering. Better models, structured outputs, tool calling, or knowledge graphs may improve performance, but the paper does not quantify those improvements.

This boundary should not be read as a dismissal. It is the map of what must be engineered next.

The real contribution is making pricing maintainable

The paper’s most useful contribution is not that it adds LLMs to pricing. Everyone is adding LLMs to everything. Somewhere, presumably, there is a toaster with retrieval-augmented generation and a seed round.

The contribution is more durable: it treats SaaS pricing as a maintainable software artifact.

That shift has consequences. If pricing is an artifact, it needs a schema. If it has a schema, it can be validated. If it can be validated, it can be versioned. If it can be versioned, it can be audited. If it can be audited, it can become part of the operating system of a SaaS business rather than a quarterly redesign ritual conducted in slides.

LLMs enter this story as translation machinery. They help convert messy human-facing pricing pages into structured representations. But the business value comes from what happens after translation: cleaner governance, better monitoring, fewer inconsistencies, and faster diagnosis when packaging logic changes.

The paper is still early. The prototype is not a complete pricing-operations platform. Its strongest results depend on favourable page structure. Its weakest categories are precisely the ones monetisation teams often care about most: usage limits and add-ons. But the direction is credible because it attacks a real operational bottleneck with a mechanism that fits the problem.

SaaS pricing will keep getting more modular. Plans will keep multiplying. Add-ons will keep pretending they are not products. Usage limits will keep hiding inside footnotes with the confidence of small-print lawyers. The companies that manage this complexity as structured software will have an operational advantage over those still reconciling pricing truth by spreadsheet archaeology.

The future of SaaS monetisation is not just smarter prices. It is pricing that machines can read without developing a migraine.

Cognaptus: Automate the Present, Incubate the Future.

¹ Francisco Javier Cavero, Juan C. Alonso, and Antonio Ruiz-Cortés, “From Static to Intelligent: Evolving SaaS Pricing with LLMs,” arXiv:2507.12104, 2025, https://arxiv.org/abs/2507.12104.
