A citizen-facing AI assistant is where the PLLuM story becomes interesting.

Not because a chatbot in a government app is a dazzling concept. It is not. Most public-sector chatbots have the charisma of a PDF with a search bar and the legal confidence of a nervous intern. The interesting part is what Poland had to build before such an assistant could be considered remotely serious: a rights-managed national corpus, Polish-native instruction data, preference alignment, safety filters, RAG evaluation, retrieval tooling, and a family of public models with different licence regimes.

That is the real argument in “PLLuM: A Family of Polish Large Language Models” [1]. The paper is not merely announcing a Polish fine-tune of Llama or Mistral with a national flag tastefully stapled to the README. It describes something more operationally significant: an end-to-end attempt to turn language-model sovereignty into infrastructure.

The public-administration assistant is therefore the right place to begin. It forces the project to answer the awkward questions that model demos usually glide past. Can the system answer citizens using official government sources? Can it refuse when the documents do not support an answer? Can the data be legally used in a public model? Can unsafe output be blocked without turning the assistant into a bureaucratic mime? Can native Polish speakers detect errors that generic multilingual benchmarks miss?

PLLuM’s answer is not “we trained a bigger model.” Mercifully. The answer is “we built the stack around the model.”

The government assistant reveals the actual product

The paper’s public-sector prototype is designed around Retrieval-Augmented Generation over Polish administrative documents. Its knowledge base uses official sources including gov.pl, biznes.gov.pl, and mobywatel.gov.pl. The target deployment context is not an abstract benchmark but citizen guidance inside Poland’s mObywatel ecosystem, with testing by government users and cooperation with five voivodeship offices.

That matters because public administration is a brutal test environment for language models. The assistant must not simply sound fluent. It must know when the answer is in the provided documents, cite sources, avoid improvising legal guidance, and refuse out-of-domain questions. In other words, it has to behave less like a clever essay generator and more like a controlled information interface. Terribly unfashionable, and therefore useful.

The team built a RAG prototyping application called ShpaRAG to support this work. It lets annotators and developers configure retrieval pipelines, inspect ranked documents, view prompts, judge generator outputs, and export annotations. Underneath that interface, the system combines lexical search, dense retrieval, learned sparse retrieval, hybrid retrieval, and reranking. The final retrieval choice is practical rather than glamorous: a hybrid of BM25 and BAAI/bge-m3, reranked with BAAI/bge-reranker-v2-m3.
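The fusion step is easy to sketch. What follows is a minimal illustration under stated assumptions, not ShpaRAG's code: this summary does not record which rule merges the lexical and dense rankings, so reciprocal rank fusion is assumed, and the two ranked lists and document ids are invented for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, not a value
    taken from the PLLuM paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a lexical (BM25) and a dense retriever for one query.
bm25_ranking = ["doc_passport", "doc_id_card", "doc_tax"]
dense_ranking = ["doc_passport", "doc_driving", "doc_id_card"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

In a pipeline like the one described, the fused list would be the candidate set handed to the reranker, which then reorders it using the full query and document text.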

This is where the paper quietly undermines a common enterprise fantasy. RAG performance is not purchased by attaching a vector database to a model and declaring victory before lunch. The retrieval layer has to be evaluated on domain documents, chunked intelligently, and tested against the failure modes of real users. In PLLuM’s administrative retrieval experiments, long context mattered, BM25 remained surprisingly competitive, and a custom structure-aware chunking method slightly improved retrieval over both full documents and generic LangChain-style recursive chunking.
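The difference between structure-aware and generic chunking can be shown on a toy document. This is a sketch of the general idea only: the paper's own chunking method is not reproduced here, and the heading pattern (lines such as "1. Title") and the sample text are assumptions made for the example.

```python
import re

# Assumption: a section starts on a line like "1. Title". Real administrative
# documents need patterns tuned to their own layout.
HEADING = re.compile(r"^\d+\.\s+\S")


def chunk_by_structure(text):
    """Start a new chunk at every heading so each chunk is one whole section."""
    chunks, current = [], []
    for line in text.splitlines():
        if HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


def chunk_fixed(text, size=40):
    """Generic baseline: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


doc = (
    "1. Who can apply\n"
    "Any adult resident may apply in person or online.\n"
    "\n"
    "2. Required documents\n"
    "A completed form and one photograph.\n"
)

chunks = chunk_by_structure(doc)
for chunk in chunks:
    print(repr(chunk))
```

The fixed-size baseline happily cuts a sentence in half and separates a heading from the rule it introduces. The structure-aware version keeps each heading with its body, which is the property a retriever benefits from. Sections that exceed the embedding model's context would still need a secondary split.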

The numbers are modest but telling. On the EZD-IR retrieval dataset, BM25 scored 75.3 NDCG@10, slightly above BAAI/bge-m3’s 74.1 in the retriever group. For reranking, BAAI/bge-reranker-v2-m3 reached 85.4. The project’s custom chunking raised the hybrid system to 85.9 NDCG@10 and 94.9 ACC@5, compared with 85.4 and 94.2 for full documents, and 84.5 and 94.0 for generic auto-chunking.
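For readers who want the metrics pinned down, here is a minimal sketch of NDCG@k and ACC@k for a single query. The NDCG formula is the standard one with a log2 discount. Treating ACC@5 as "at least one relevant document in the top five" is an assumption, since the exact definition is not restated here, and the reported figures appear to be per-query averages on a 0 to 100 scale.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """relevance maps doc id -> graded relevance; missing ids count as 0."""
    gains = [relevance.get(d, 0) for d in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def acc_at_k(ranked_ids, relevance, k=5):
    """1.0 if any relevant document appears in the top k, else 0.0 (assumed definition)."""
    return 1.0 if any(relevance.get(d, 0) > 0 for d in ranked_ids[:k]) else 0.0


# Toy query: one relevant document, retrieved in second place.
relevance = {"a": 1}
ranked = ["b", "a", "c"]
print(round(ndcg_at_k(ranked, relevance), 4), acc_at_k(ranked, relevance))
```

The example shows why the two columns move differently: the relevant document counts as a full hit for ACC@5 but is penalised by NDCG for sitting in second place rather than first.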

That is not a revolution. It is better: it is an implementation lesson. Administrative AI succeeds in the unsexy margins.

PLLuM is a stack, not a model card

The easiest way to misread PLLuM is to treat the model family as the main artifact. The paper does announce 18 open-access LLMs across base, instruct, and chat variants, including 8B, 12B, 8x7B, and 70B models derived from Llama 3.1, Mistral-Nemo, and Mixtral architectures. But the model list is only the visible layer.

The stronger contribution is the infrastructure beneath it.

| Layer | What PLLuM built | Operational consequence |
| --- | --- | --- |
| Corpus governance | A Polish research corpus of about 139.9B tokens, with separate legal categories for research, open-model use, and public corpus release | Data provenance becomes part of model engineering, not a legal afterthought dragged in after launch |
| Metadata and validation | File-level and text-level metadata, licence fields, quality levels, source fields, schema validation, S3-based review workflow | Corpus updates can be audited, rejected, corrected, and reused with traceability |
| Instruction tuning | PLLuMIC, a Polish instruction corpus combining organic, synthetic, and converted examples | Polish instruction following is trained directly rather than translated lazily from English |
| Preference alignment | A Polish preference corpus with ranking, scalar ratings, fallback answers, sensitive prompt routing, and quality review | Alignment reflects local language and cultural norms instead of importing English-centric preference assumptions |
| Guard layer | Input/output filtering, adapter-based classifiers, regex/dictionary filters, privacy anonymisation, repair prompts, streaming refusal mode | Safety becomes a deployable service boundary, not just a property hoped for inside the model weights |
| Public-sector RAG | RAG benchmarks, retrieval evaluation, generator tuning, refusal examples, and government-app testing | The system is evaluated against the actual workflow it is meant to serve |

This is why “sovereign AI” in the PLLuM paper is not just a slogan, though the phrase does arrive wearing its ceremonial jacket. Sovereignty here means the ability to inspect, adapt, license, evaluate, and deploy the system under national legal and linguistic constraints. It is less dramatic than declaring independence from Silicon Valley. It is also harder.

PLLuM’s corpus design is unusually central to the paper. The authors distinguish between the broad Research Corpus, the Open Model Corpus, and the Open Corpus. The Research Corpus is the large internal collection: 390,398,879 documents and about 139.9B Polish tokens. The Open Model Corpus is much smaller: 8,610,451 documents and about 5.0B tokens. The Open Corpus contains about 3.9B tokens.

That gap is the business story.

The best training data is not automatically the data one can safely use in a publicly released or commercial system. The paper’s legal analysis makes this explicit. Polish research exceptions may allow certain uses inside a closed scientific setting, but public deployment of generative models is stricter. Copyrighted works require permissions or appropriate licences. Public-domain texts, legal acts, official documents, and openly licensed works are safer, but they narrow the corpus.

This creates a familiar enterprise trade-off. More data may improve performance, but fewer legal encumbrances improve deployability. PLLuM handles that tension by creating separate corpus categories rather than pretending that “training data” is a single homogeneous bucket. A nice idea, this thing called knowing what you trained on.
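One way to make "knowing what you trained on" concrete is to attach a permitted-use level to every document and build each corpus by filtering. The sketch below is illustrative only: the licence labels, the mapping, and the assumption that the three tiers nest (everything redistributable may also train an open model, and so on) are mine, not the paper's schema.

```python
from dataclasses import dataclass
from enum import IntEnum


class Use(IntEnum):
    RESEARCH = 1     # internal scientific use only
    OPEN_MODEL = 2   # may train a publicly released model
    OPEN_CORPUS = 3  # the text itself may be redistributed


# Hypothetical mapping from a licence label to the most permissive allowed use.
LICENCE_TO_USE = {
    "public-domain": Use.OPEN_CORPUS,
    "cc-by-4.0": Use.OPEN_CORPUS,
    "publisher-agreement-training-only": Use.OPEN_MODEL,
    "research-exception": Use.RESEARCH,
}


@dataclass(frozen=True)
class Document:
    doc_id: str
    licence: str
    tokens: int


def allowed(doc, use):
    """Unknown licences are excluded from every tier until someone reviews them."""
    level = LICENCE_TO_USE.get(doc.licence)
    return level is not None and level >= use


def corpus_for(docs, use):
    return [d for d in docs if allowed(d, use)]


docs = [
    Document("act-1", "public-domain", 1200),
    Document("novel-7", "research-exception", 90000),
    Document("press-3", "publisher-agreement-training-only", 5000),
    Document("blog-9", "unknown", 800),
]

for use in Use:
    selected = corpus_for(docs, use)
    print(use.name, [d.doc_id for d in selected], sum(d.tokens for d in selected))
```

Even in this toy version the token count shrinks at each step towards public release, which is the same shape as the gap between the Research Corpus and the two open corpora.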

The paper also describes outreach to more than 50 publishers and institutional partners to obtain licensed material. That is not a decorative detail. For any organisation building domain LLMs in law, finance, medicine, government, or education, the data acquisition process is a strategic capability. You cannot benchmark your way out of missing rights.

Polish-native instruction data is not translation garnish

The project’s instruction corpus, PLLuMIC, is built from three sources: organic instructions written or adapted by humans, synthetic instructions generated under controlled pipelines, and converted examples derived from structured Polish NLP datasets. The full supervised fine-tuning dataset is reported as 77,574 instructions, with organic examples forming just under half of the corpus. In the model training section, the authors describe a particular training set of 57,940 instruction examples and roughly 28M tokens.

The distinction is useful: the corpus is larger than the specific training slice used in some experiments. The paper is a systems report, not a neat little leaderboard note. Naturally, some accounting is layered.

The important design choice is that synthetic data remains secondary. The authors limit synthetic instructions to around 7% of the training set, partly to avoid recursive degradation and partly to reduce dependence on teacher-model biases. They also avoid restrictive licences when generating synthetic data. For the public-administration RAG subset, the team generates answerable, adversarial, and unrelated questions from government documents, then filters generated answers using citation verification.
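Both controls in this paragraph, the cap on synthetic data and the citation check, are simple to express. The sketch below assumes a 7% ceiling as quoted above and a deliberately naive notion of citation verification (every quoted span must occur verbatim in the source document). The function names and sample strings are hypothetical, and the paper's actual filter may be stricter.

```python
import random


def cap_synthetic(other, synthetic, max_pct=7, seed=0):
    """Return a mix in which synthetic items are at most max_pct percent of the total."""
    # s / (o + s) <= p / 100  <=>  s <= p * o / (100 - p).
    # Integer arithmetic avoids floating-point rounding at the boundary.
    limit = (max_pct * len(other)) // (100 - max_pct)
    rng = random.Random(seed)
    kept = rng.sample(synthetic, min(limit, len(synthetic)))
    return other + kept


def _normalise(text):
    """Collapse whitespace and lowercase so trivial formatting does not break matches."""
    return " ".join(text.split()).lower()


def citations_supported(quotes, source_text):
    """True only if there is at least one quote and every quote occurs in the source."""
    source = _normalise(source_text)
    return bool(quotes) and all(_normalise(q) in source for q in quotes)


organic_and_converted = [f"h{i}" for i in range(930)]
synthetic_pool = [f"s{i}" for i in range(500)]
mix = cap_synthetic(organic_and_converted, synthetic_pool)
print(len(mix), sum(item.startswith("s") for item in mix))

source = "An identity card issued to an adult is valid for 10  years."
print(citations_supported(["valid for 10 years"], source))
print(citations_supported(["valid for 15 years"], source))
```

Note that the cap is computed from the non-synthetic count, not from the size of the synthetic pool. Generating more synthetic examples does not raise the ceiling, which is the point of the design.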

This is the opposite of the cheap synthetic-data factory model: generate a million examples, admire the spreadsheet, fine-tune, and then act surprised when the model learns synthetic weirdness at industrial scale.

The Polish preference corpus follows the same logic. Prompts are manually written, localized, paraphrased under permissive licences, or derived from Polish datasets. Annotators rank responses, score them across seven dimensions, and sometimes write fallback completions when all model outputs are poor. Sensitive prompts are routed to trained NASK-affiliated annotators. Model outputs are anonymised and randomized before scoring. Version 4 of the preference dataset contains 57,374 high-quality preference pairs, 10,985 unique prompts, and 10,000 controversial or sensitive examples.
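A ranking over several responses can be unrolled into pairwise preferences, which is the usual input format for preference-optimisation methods. The sketch below shows one plausible conversion. It assumes that a human-written fallback, when present, is treated as preferred over every model output, and it is not a description of how the paper's own pairs were constructed.

```python
from itertools import combinations


def ranking_to_pairs(prompt, ranked_responses, fallback=None):
    """Turn a best-first ranking into (chosen, rejected) pairs.

    If an annotator supplied a fallback answer because all outputs were poor,
    it is placed first (an assumption made for this sketch).
    """
    ordered = ([fallback] if fallback else []) + list(ranked_responses)
    # combinations() preserves input order, so the first element of each
    # pair is always the higher-ranked response.
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ordered, 2)
    ]


pairs = ranking_to_pairs("Jak wyrobić dowód osobisty?", ["A", "B", "C"])
print(len(pairs), pairs[0]["chosen"], pairs[0]["rejected"])
```

A ranking of n responses yields n(n-1)/2 pairs, which is one reason a corpus can contain far more preference pairs than unique prompts.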

The business inference is straightforward: alignment data is not a universal condiment. A model trained to be “helpful” in English may still sound odd, overconfident, evasive, culturally blunt, or legally careless in Polish. Local alignment is not sentimentality. It is product quality.

The evidence is strongest where the paper tests deployment behaviour

The paper contains many experiments, and they do not all carry the same evidential weight. Some are main evidence. Some are ablations. Some are implementation diagnostics. Treating them equally would be convenient and wrong, a classic research-summary hobby.

| Test or experiment | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| RAG-IFEval on public-administration questions | Main evidence for document-grounded public-sector answering | PLLuM models can compete strongly in Polish administrative RAG, especially at smaller scale | It does not prove full legal reliability or production readiness across all government workflows |
| LLM-as-judge RAG evaluation | Complementary robustness check and overfitting detector | Confirms that some RAG-IFEval gains generalise beyond the rule-based benchmark | It still depends on judge-model quality and reference-answer design |
| Red-teaming ASR/FRR safety evaluation | Main evidence for safety alignment trade-offs | Chat-aligned PLLuM models sharply reduce attack success but increase false rejections | It does not remove the need for domain-specific safety policy tuning |
| Packing versus no-packing in instruction fine-tuning | Ablation / implementation detail | Packing can help some structured tasks while hurting generative quality | It is not a general law for all instruction tuning |
| Vocabulary extension experiments | Ablation / sensitivity test | Polish token efficiency improves, but benchmark performance did not beat full continual pretraining | It does not rule out vocabulary extension under larger compute budgets |
| Annealing after pretraining | Training-stage evidence | Annealing substantially improved benchmark scores and coherence before instruction tuning | The paper does not isolate every causal component of the annealing mixture |

The RAG results are the most business-relevant. On RAG-IFEval, Llama-PLLuM-8B scores 85.8, the best result among models under 12B in the reported table. Among larger models, Llama-PLLuM-70B scores 89.7, level with Gemini-2.5-Pro-Preview at 89.7, just behind GPT-4.1 at 90.2, and further behind Grok-3-Beta at 91.1. In the LLM-as-judge RAG evaluation, Llama-PLLuM-70B scores 87.6, behind Llama-3.1-70B at 90.3 but ahead of Bielik-2.2 at 85.5 and Llama-3.3-70B at 84.5.

The interesting part is not that PLLuM “beats” all global models. It does not. The interesting part is that a domain-adapted Polish model, especially the 8B variant, becomes highly competitive in the exact use case the project cares about. That is a better business lesson than leaderboard chest-beating. Smaller, locally adapted models can matter when the task is local, the documents are local, and the evaluation is local.

Safety alignment works, and then charges rent

The safety results are sharp enough to deserve their own sober little corner.

Before alignment, several PLLuM instruct models are alarmingly vulnerable in the red-teaming evaluation. PLLuM-12B-nc-instruct has an attack success rate of 77.61%. PLLuM-8x7B-nc-instruct reaches 70.63%. Llama-PLLuM-8B-instruct reaches 78.60%. These numbers are not flattering. Nor should they be hidden. Instruction following without safety alignment can become very obedient to bad ideas. Amazing discovery: humanity continues to be humanity.

After chat alignment, the picture changes dramatically. PLLuM-12B-nc-chat drops to 1.03% ASR. PLLuM-8x7B-nc-chat drops to 0.78%. Llama-PLLuM-8B-chat drops to 0.76%, and Llama-PLLuM-70B-chat to 0.79%. The authors interpret this as strong evidence that adversarial alignment improved robustness.

But safety is not free. False rejection rates rise. PLLuM-8x7B-nc-chat has an FRR of 8.69%; Llama-PLLuM-8B-chat has 5.27%; Llama-PLLuM-70B-chat has 5.22%. By contrast, the base multilingual models often keep FRR below 1%. The paper states the trade-off clearly: safer models become more likely to refuse benign inputs.

This is not a limitation to mutter politely near the end. It is a deployment parameter. In a public assistant, some extra refusal may be acceptable if the alternative is unsafe guidance. In a customer-service setting, excessive refusal can become a user-experience failure. In healthcare, legal, or financial advice, the right tolerance depends on risk category, escalation path, and whether a human expert can take over.

The mature business question is not “is the model safe?” It is “what rejection rate can the process absorb?”
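The two rates, and the question of what a process can absorb, reduce to simple arithmetic. A minimal sketch follows. The outcome lists are toy data and the 10,000-query volume is an assumed workload, while 5.27% is the Llama-PLLuM-8B-chat FRR quoted above.

```python
def attack_success_rate(complied):
    """complied: one bool per adversarial prompt, True if the model produced the unsafe output."""
    return 100.0 * sum(complied) / len(complied)


def false_rejection_rate(refused):
    """refused: one bool per benign prompt, True if the model refused it."""
    return 100.0 * sum(refused) / len(refused)


def expected_false_refusals(benign_queries, frr_percent):
    """Benign requests that would be refused at a given FRR, rounded to whole queries."""
    return round(benign_queries * frr_percent / 100)


# Toy evaluation: 3 of 100 attacks succeed, 5 of 100 benign prompts are refused.
adversarial_outcomes = [True] * 3 + [False] * 97
benign_outcomes = [True] * 5 + [False] * 95
print(attack_success_rate(adversarial_outcomes), false_rejection_rate(benign_outcomes))

# Assumed workload of 10,000 benign queries at the quoted 5.27% FRR.
print(expected_false_refusals(10_000, 5.27))
```

Each of those refusals is a citizen who either gives up or lands in a human queue. Whether that is tolerable depends on the size of the queue, not on the model.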

Guard is the boring middleware that makes deployment plausible

PLLuM’s Guard layer deserves attention because it embodies a practical truth: production safety is not confined to model training. Guard is implemented as a proxy around the model. User inputs pass through it before reaching the LLM; generated outputs return through it before reaching the user. It combines adapter-based classifiers, dictionary and regex filters, meta-validators, repair prompts, and PII anonymisation.
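The proxy pattern is worth seeing in miniature. The sketch below is a toy, not the Guard implementation: it replaces the adapter-based classifiers with a two-phrase regex blocklist, anonymises both the input and the output (where exactly Guard applies anonymisation is an assumption here), and uses a stand-in function instead of an LLM call. All names are hypothetical.

```python
import re

# Toy dictionary filter standing in for trained classifiers.
BLOCKLIST = re.compile(r"\b(make a bomb|steal a password)\b", re.IGNORECASE)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PESEL = re.compile(r"\b\d{11}\b")  # Polish national ID numbers have 11 digits

REFUSAL = "I cannot help with that request."


def anonymise(text):
    """Mask e-mail addresses and 11-digit identifiers."""
    text = EMAIL.sub("[EMAIL]", text)
    return PESEL.sub("[ID]", text)


def guarded(model, user_input):
    """Proxy: filter the input, call the model, then filter and anonymise the output."""
    if BLOCKLIST.search(user_input):
        return REFUSAL
    output = model(anonymise(user_input))
    if BLOCKLIST.search(output):
        return REFUSAL
    return anonymise(output)


def echo_model(prompt):
    """Stand-in for an LLM call."""
    return f"You wrote: {prompt}"


print(guarded(echo_model, "My email is jan@example.pl and my PESEL is 44051401359"))
print(guarded(echo_model, "How do I make a bomb?"))
```

Because the checks live outside the model, they can be changed, logged, and audited without touching the weights, which is what makes the layer a service boundary and not a training artefact.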

The classifier performance is respectable but uneven. Harmful-content detection reports 0.923 accuracy and 0.923 F1. Erotic-content detection reports 0.913 accuracy and 0.870 F1. Verbal aggression is weaker, at 0.840 accuracy and 0.760 F1. That spread is useful. It reminds us that “content moderation” is not one problem; it is a bundle of classifiers with different data availability and different ambiguity.
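The gap between accuracy and F1 is easier to read with a confusion matrix in hand. The counts below are invented for illustration and are not taken from the paper. They were chosen only because they happen to produce the same 0.840 and 0.760 pattern as the verbal-aggression classifier.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all decisions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; true negatives do not enter it."""
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical counts for a classifier whose positive class is the minority.
tp, fp, fn, tn = 38, 10, 14, 88

print(round(accuracy(tp, fp, fn, tn), 3), round(f1_score(tp, fp, fn), 3))
```

Accuracy is propped up by the many easy negatives, while F1 ignores them and counts only how well the positive class is found. A wide gap between the two usually means the category is rare, ambiguous, or both.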

The end-to-end Guard evaluation uses a 1,600-item ethics suite, with 1,000 disallowed and 600 allowed prompts. The reported accuracy is 0.94 on both neutral and adversarial subsets. The authors also state that Guard reduces unsafe or policy-violating completions by an order of magnitude while maintaining 94% helpfulness on benign queries.

That is the kind of number a CIO can understand, but should still interrogate. What counts as helpful? Which risk categories are covered? How will the filter behave in a specialised domain such as tax, immigration, benefits, medical triage, or political information? The paper answers some of this by pointing to modular adapters. New risk domains can be added without retraining the model. That is the right architecture. It does not remove the need to write the policy.

The licence split is not a footnote for procurement

PLLuM’s model family includes Apache 2.0, Llama 3.1, and CC-BY-NC-4.0 licence categories. The non-commercial variants benefit from broader data access and, according to the paper, often show stronger linguistic quality in low-resource domains. The more commercially usable variants come with fewer legal constraints but may have a different data basis and performance profile.

This is not administrative trivia. It changes who can use what.

For research labs, public institutions, and prototypes, non-commercial variants may be acceptable. For enterprises, government suppliers, and SaaS vendors, the licence category becomes part of the architecture decision. The model with the best benchmark score may not be the model one can legally embed into a commercial service. Procurement teams do not enjoy discovering that after integration. They become spiritual very quickly.

The broader point is that sovereign AI involves licence engineering. Model performance, training corpus rights, output rights, derivative-use rules, deployment jurisdiction, and audit records all interact. PLLuM’s separation of research, open-model, and public-corpus categories is valuable because it reflects this complexity rather than burying it.

What business leaders should actually take from PLLuM

The practical lesson is not that every country, bank, hospital, or telecom should train its own national LLM. That would be an expensive way to rediscover procurement discipline. The lesson is more specific.

First, local data governance is a capability. PLLuM’s legal review, metadata schema, validation pipeline, licence categories, and publisher outreach are as important as the GPU cluster. If your organisation cannot say which data can be used for research, which can be used in a public model, and which can be redistributed, it is not building an AI asset. It is building an argument for future litigation.

Second, domain RAG deserves its own evaluation. PLLuM does not merely test the generator in isolation. It evaluates retrieval, reranking, chunking, refusal behaviour, citation behaviour, and document-grounded answer quality. That is what makes the public-administration use case credible.

Third, smaller adapted models can be economically interesting. Llama-PLLuM-8B’s strong showing among sub-12B models on RAG-IFEval suggests a path for local deployment where cost, latency, privacy, and control matter. Not every use case needs a frontier model. Some need a model that understands the documents and the language without needing a diplomatic escort.

Fourth, safety alignment must be treated as a trade-off curve. PLLuM’s chat models reduce attack success sharply, but false refusals rise. In regulated workflows, the acceptable point on that curve is a policy decision, not just a machine-learning result.

Finally, “sovereign AI” should be judged by operational substitutability. Can the institution update the corpus? Audit the data? Reproduce training decisions? Evaluate local usage? Deploy with guardrails? Integrate into public services? If not, the sovereignty claim is decorative. Very pretty, very LinkedIn, not especially useful.

Where the paper’s claims should stay bounded

PLLuM is strongest as a Polish-language public-sector and national-infrastructure case. The evidence is less general than the slogan.

Some evaluations use internal datasets or LLM-as-judge methods, though the authors often add human oversight and calibration. That improves credibility but does not make judge-based evaluation magically objective. The human evaluation is valuable precisely because native speakers catch unnatural Polish conventions, English transfer, and cultural-reference failures that automatic metrics miss.

The RAG results are compelling for Polish administrative documents, but they do not automatically transfer to other languages, sectors, or legal regimes. A hospital, insurer, court, or bank would need its own knowledge base, refusal rules, safety categories, and expert review.

The model family is also not uniformly open for every commercial use. The licence split matters. Some stronger models may be non-commercial; some commercially useful ones may have different performance characteristics.

And the safety story, while strong, is not “problem solved.” It is “the attack success rate fell dramatically under this evaluation, while false refusal increased.” That is a good engineering result. It is not divine absolution.

The empire is made of metadata

The old way to talk about AI sovereignty is to ask whether a country has a model. The PLLuM paper suggests a better question: does the country have the machinery to keep building, auditing, adapting, and deploying language models under its own legal and cultural constraints?

Poland’s answer, at least in this paper, is unusually concrete. It is a corpus with rights categories. It is an instruction dataset that does not treat Polish as translated English. It is a preference corpus with native annotators and sensitive prompt routing. It is a Guard layer that lives between model and user. It is a RAG system tested on government documents. It is cooperation with public offices and mObywatel deployment testing. It is also a pile of unglamorous infrastructure decisions, which is usually where real strategy hides.

PLLuM does not prove that every nation can or should build a foundation-model empire. It does show that sovereign AI, when taken seriously, looks less like a model launch and more like a public utility project with GPUs attached.

That may be the paper’s most useful contribution. The future of local AI will not be won by whoever writes the loudest press release about independence. It will be won by whoever can turn language, law, data, evaluation, and deployment into one working system.

Not romantic. Not simple. Definitely not a fine-tune with a flag.

Cognaptus: Automate the Present, Incubate the Future.


  1. Jan Kocoń et al., “PLLuM: A Family of Polish Large Language Models,” arXiv:2511.03823, 2025, https://arxiv.org/abs/2511.03823