Opening — Why this matters now

Pharmaceutical 3D printing has promised personalization for over a decade. In practice, it has mostly delivered spreadsheets, failed filaments, and a great deal of human patience. The bottleneck has never been imagination—it has been formulation. Every new drug–excipient combination still demands expensive trial-and-error, even as printers themselves have matured.

Into this friction steps FormuLLA, a study that asks a deceptively simple question: what if large language models (LLMs) could reason about formulations rather than merely predict outcomes? Not classify printability. Not regress mechanical scores. But recommend excipients directly—the way a formulation scientist might.

This is not another “ChatGPT for pharma” story. It is a careful, slightly uncomfortable exploration of what happens when general-purpose language models are dragged into a domain that punishes fluent nonsense.

Background — Context and prior art

Most AI applied to pharmaceutical formulation has been narrow by design. Classical machine learning and deep learning models excel at discriminative tasks: predicting dissolution rates, classifying printability, estimating mechanical properties. They require carefully structured inputs and reward narrowly defined outputs.

Generative approaches, such as conditional GANs, pushed further by producing novel formulations—but at a cost. They are unstable, data-hungry, and fundamentally awkward when knowledge is partially textual, partially numerical, and partially tacit.

LLMs change the framing entirely. Trained via next-token prediction rather than adversarial loss, they naturally absorb heterogeneous representations. More importantly, they invert the question. Instead of asking:

Given a formulation, is it printable?

They allow:

Given an API and dose, what formulation should I try?

That inversion is the conceptual leap at the heart of FormuLLA.

Analysis — What the paper actually does

The authors fine-tune four open-source LLMs—Llama 2 (7B), Mistral (~7B), T5-XL (~3B), and BioGPT (~350M)—on a curated dataset of ~1,400 fused deposition modelling (FDM) formulations. Each formulation is converted into instruction–response pairs using an Alpaca-style schema.

Input example:

Recommend excipients for 20 w/w% Paracetamol

Output example:

For this formulation, use these excipients: HPMC 60%, PEG8000 5%, … This is printable and has a Good filament aspect.
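
To make the schema concrete, here is a minimal sketch of how such instruction–response pairs might be assembled. The field names and record structure are illustrative assumptions, not the authors' exact pipeline; only the phrasing mirrors the examples above.

```python
# Illustrative only: the record fields and helper below are assumptions,
# not the paper's exact preprocessing code.
def to_alpaca_pair(api: str, dose_pct: float, excipients: dict[str, float],
                   printable: bool, filament_aspect: str) -> dict:
    """Convert one FDM formulation record into an Alpaca-style
    instruction-response pair."""
    instruction = f"Recommend excipients for {dose_pct:g} w/w% {api}"
    excipient_text = ", ".join(f"{name} {pct:g}%" for name, pct in excipients.items())
    response = (
        f"For this formulation, use these excipients: {excipient_text}. "
        f"This is {'printable' if printable else 'not printable'} "
        f"and has a {filament_aspect} filament aspect."
    )
    return {"instruction": instruction, "input": "", "output": response}

# Example record mirroring the sample output above
pair = to_alpaca_pair("Paracetamol", 20, {"HPMC": 60, "PEG8000": 5},
                      printable=True, filament_aspect="Good")
```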

Critically, the models are not fully retrained. Parameter-efficient fine-tuning (PEFT) with LoRA adapters is used to adjust only a small fraction of parameters—bringing training times down to minutes, not weeks.
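
How LoRA enters the picture can be sketched with the Hugging Face peft library. The rank, dropout, and checkpoint below are illustrative assumptions; only the Q/V versus Q/K/V/O target-module choice mirrors the configurations the study compares.

```python
# A minimal LoRA setup with Hugging Face PEFT. Hyperparameter values here are
# illustrative, not the paper's exact settings.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

base = "meta-llama/Llama-2-7b-hf"  # any of the four studied models could be swapped in
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                      # low-rank dimension: only small adapter matrices are trained
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # the Q/V configuration; Q/K/V/O adds k_proj and o_proj
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the full model
```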

The experimental design is methodical rather than flashy:

  • Multiple learning rates (10⁻², 10⁻⁴, 10⁻⁶)
  • Different LoRA adapter configurations (Q/V vs Q/K/V/O)
  • Controlled variation of generation parameters (temperature, top‑p)

The goal is not to win a benchmark. It is to understand failure modes.
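
The sweep itself is small enough to write down. The sketch below simply enumerates the combinations listed above, with illustrative sampling values, and assumes each entry feeds one fine-tune-and-evaluate run.

```python
# Enumerate the experimental grid: learning rates x adapter configurations x
# sampling settings. The sampling values are illustrative assumptions.
from itertools import product

learning_rates = [1e-2, 1e-4, 1e-6]
adapter_targets = {"QV": ["q_proj", "v_proj"],
                   "QKVO": ["q_proj", "k_proj", "v_proj", "o_proj"]}
sampling = [{"temperature": 0.2, "top_p": 0.9},
            {"temperature": 0.8, "top_p": 0.95}]

# Each entry corresponds to one fine-tune + generation run to be scored later.
grid = [
    {"learning_rate": lr, "adapters": name, "target_modules": mods, **gen}
    for lr, (name, mods), gen in product(learning_rates, adapter_targets.items(), sampling)
]
print(f"{len(grid)} configurations to evaluate")  # 3 x 2 x 2 = 12
```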

Findings — Results that actually matter

1. Architecture beats domain pretraining

Despite being trained exclusively on biomedical text, BioGPT performed worst overall. It hallucinated excipients, corrupted units, and produced alphanumeric noise under several conditions.

By contrast, Llama 2—trained on broad, non-specialized corpora—consistently outperformed other models when fine-tuned correctly.

Implication: architectural robustness and scaling laws matter more than domain-themed pretraining slogans.

2. Learning rate is destiny

A learning rate of 10⁻⁴ emerged as the only viable regime. Too high (10⁻²) and models collapsed into nonsense. Too low (10⁻⁶) and they drifted, losing semantic grounding.

This echoes a recurring lesson in applied AI: most failures are not conceptual—they are parametric.

3. Language metrics are dangerously misleading

BLEU and ROUGE scores correlated poorly with formulation quality. Some linguistically “good” outputs omitted essential excipient roles. Others scored poorly while recommending chemically sensible combinations.

To address this, the authors introduced VELVET—a domain-aware metric that evaluates excipient similarity using co-occurrence embeddings rather than text overlap.

Lower VELVET scores indicate formulations closer to historical best practice.
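
The article does not reproduce VELVET's exact formula, but the intuition can be sketched: embed each excipient from co-occurrence statistics, then measure how far a recommended set sits from a reference formulation. The aggregation rule and the embedding table below are assumptions, not the published metric.

```python
# A rough sketch of the idea behind a domain-aware, embedding-based metric.
# This is NOT the published VELVET definition; it only illustrates the concept.
import numpy as np

def excipient_distance(recommended: list[str], reference: list[str],
                       embeddings: dict[str, np.ndarray]) -> float:
    """Average cosine distance from each recommended excipient to its nearest
    reference excipient. Lower means closer to historical practice."""
    def cos_dist(a, b):
        return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    dists = [min(cos_dist(embeddings[r], embeddings[x]) for x in reference)
             for r in recommended]
    return float(np.mean(dists))
```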

4. Catastrophic forgetting is not theoretical

Smaller models exhibited clear signs of catastrophic forgetting: repeated tokens, corrupted strings, loss of unit consistency. Even with a dataset of just 1,400 samples, fine-tuning erased prior knowledge.

This is not a corner case. It is a structural risk when adapting LLMs to technical domains.
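
In practice this argues for cheap sanity checks on generated formulations before a human ever reads them. The heuristics below, covering repeated tokens, impossible percentages, and character-level noise, are illustrative examples of such checks, not the paper's evaluation procedure.

```python
# Illustrative heuristics for flagging forgetting-style failures in generated text.
import re

def looks_degenerate(text: str) -> bool:
    tokens = text.split()
    # Flag long runs of the same token (e.g. "HPMC HPMC HPMC HPMC ...").
    for i in range(len(tokens) - 3):
        if len(set(tokens[i:i + 4])) == 1:
            return True
    # Flag percentages outside a physically meaningful 0-100% range.
    for pct in re.findall(r"(\d+(?:\.\d+)?)\s*%", text):
        if not 0 <= float(pct) <= 100:
            return True
    # Flag outputs dominated by non-alphanumeric noise.
    clean = sum(ch.isalnum() or ch.isspace() for ch in text)
    return clean / max(len(text), 1) < 0.6
```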

Comparative snapshot

Model | Linguistic Quality | Formulation Accuracy (VELVET) | Stability
Llama 2 | High | Best | Strong
T5-XL | Moderate–High | Moderate | Moderate
Mistral | Low–Moderate | Improved w/ adapters | Fragile
BioGPT | Low | Mixed | Weak

Implications — What this means beyond pharma

FormuLLA is not really about 3D printing. It is about where LLMs break when language stops being the product.

Three broader lessons emerge:

  1. Evaluation must be domain-native. If your metric cannot tell nonsense from expertise, your model is already unsafe.
  2. Smaller is not safer. Compact models forget faster, hallucinate harder, and fail more silently.
  3. General models adapt better than themed ones—provided fine-tuning is restrained and well-measured.

For businesses, the implication is sobering: deploying LLMs into operational pipelines without task-specific validation is not innovation. It is deferred liability.

Conclusion — From fluency to function

FormuLLA demonstrates that LLMs can, under strict conditions, reason about pharmaceutical formulations. But it also exposes the cost of mistaking eloquence for competence.

The future here is not a single “pharma GPT.” It is layered systems: general models, carefully adapted, judged by metrics that reflect physical reality—not grammatical symmetry.

LLMs are ready to leave the chat window. Whether we are ready to evaluate them properly is still an open question.

Cognaptus: Automate the Present, Incubate the Future.