CQ, AI & The Question of Questions

Questions look cheap. That is why they are dangerous. In most enterprise AI projects, the visible work arrives late: dashboards, RAG demos, knowledge graphs, compliance assistants, workflow copilots, and executive slides with arrows pointing to a “semantic layer.” The invisible work arrives earlier and is less glamorous: deciding what the system must actually know, answer, retrieve, distinguish, reject, and explain. ...

April 22, 2026 · 16 min · Zelina

SEALing the Gap: When Synthetic Data Learns Accountability

Network data is easy to fake. Accountability is not. That is the uncomfortable little problem sitting behind synthetic data. A team can simulate users, devices, traffic surges, mobility patterns, channel interference, and edge-network behavior long before a full 6G deployment exists. This is useful. It is also slightly dangerous. A synthetic dataset can look realistic, train a model successfully, and still carry hidden bias, brittle assumptions, weak provenance, or regulatory gaps. Reality is not only a distribution. It is also a chain of responsibility. ...

April 4, 2026 · 16 min · Zelina

Anonymize Customer Data with AI

How to use AI to redact, mask, or pseudonymize customer data safely, and where automated anonymization can fail in practice.

March 16, 2026 · 6 min · Michelle

Noise Without Regret: How Error Feedback Fixes Differentially Private Image Generation

Photos are annoying data. They are useful because they contain details: the handle of a bag, the edge of a sleeve, the texture of a face, the faint classroom gesture that matters only after someone trains a model on it. They are risky for exactly the same reason. If a generated image looks too much like the real training data, it may quietly leak what the organization was trying not to reveal. If it is protected too aggressively, it becomes a blurry souvenir from a dataset that used to be useful. ...

January 22, 2026 · 14 min · Zelina

SD‑RAG: Don’t Trust the Model, Trust the Pipeline

A chatbot should not be the only employee in the company responsible for keeping secrets. That sounds obvious until we look at how many enterprise RAG systems are designed. A user asks a question. The system retrieves internal documents. The documents are placed into the model context. A policy instruction is added somewhere above the user prompt: do not reveal sensitive information. Then everyone hopes the model behaves. ...

January 20, 2026 · 14 min · Zelina

Redundancy Overload Is Optional: Finding the FDs That Actually Matter

Tables have a talent for pretending to be tidy. A customer table may have fifty columns. A transaction table may have a hundred. A log table may contain derived fields, timestamps, status codes, copied identifiers, normalized labels, and a few columns that nobody remembers creating but everybody is afraid to delete. Then a data profiling tool arrives, dutifully discovers functional dependencies, and returns several hundred thousand “valid” relationships. ...

January 18, 2026 · 19 min · Zelina

When Models Start to Forget: The Hidden Cost of Training LLMs Too Well

Duplicates are supposed to be boring. In data engineering, duplicate records are usually treated as a hygiene problem: remove them, clean the pipeline, reduce noise, move on. In language-model training, repetition is less innocent. Repeated text can help a model learn an underrepresented domain. It can also teach the model to reproduce specific sequences too well. Somewhere between “useful exposure” and “verbatim recall,” a model stops learning only the pattern and starts carrying around the document. ...

January 3, 2026 · 16 min · Zelina

When Data Comes in Boxes: Why Hierarchies Beat Sample Hoarding

Data rarely arrives as loose sand. Data teams like to speak as if training data arrives one sample at a time: one image, one row, one document, one carefully chosen datapoint. Procurement departments, research consortia, hospitals, vendors, and public repositories are less poetic. They ship data in boxes. A box might be a dataset from one partner institution. A folder from a public repository. A domain-specific archive. A vendor package. A department export. It arrives with source, license, schema, quirks, and hidden failure modes already attached. The operational question is not only “Which samples should we keep?” It is also “Which boxes are worth opening?” ...

December 13, 2025 · 15 min · Zelina

Probe and Error: Why Off‑Policy Training Warps LLM Behaviour Detectors

A monitor is only useful if it fails in the boring place. The boring place is production: the real domain, the real prompt style, the real user incentives, the real model generating the real response. Not the tidy benchmark. Not the synthetic dataset. Not the “please pretend to be deceptive” prompt that makes everyone in the lab feel productive. Production is where a detector either catches the thing it was built to catch, or quietly becomes a compliance ornament with a nice AUROC score. ...

November 24, 2025 · 16 min · Zelina

Sovereign Syntax: How Poland Built Its Own LLM Empire

A citizen-facing AI assistant is where the PLLuM story becomes interesting. Not because a chatbot in a government app is a dazzling concept. It is not. Most public-sector chatbots have the charisma of a PDF with a search bar and the legal confidence of a nervous intern. The interesting part is what Poland had to build before such an assistant could be considered remotely serious: a rights-managed national corpus, Polish-native instruction data, preference alignment, safety filters, RAG evaluation, retrieval tooling, and a family of public models with different licence regimes. ...

November 9, 2025 · 16 min · Zelina