Lost in Translation: When Safety Contracts Collapse Across 2.1 Billion Voices

Opening — Why this matters now

If you evaluate AI safety only in English, under tightly structured output contracts, you may conclude that everything is under control.

Indic Jailbreak Robustness (IJR) politely disagrees.

The paper introduces a judge-free benchmark across 12 Indic and South Asian languages—representing more than 2.1 billion speakers—and evaluates 45,216 prompts under both contract-bound (JSON) and free-form (FREE) conditions. The conclusion is uncomfortable but precise:

Safety contracts reduce visible non-compliance. They do not eliminate jailbreaks.

For any enterprise deploying LLMs in multilingual markets—especially South Asia—this is not an academic curiosity. It is an operational risk.

Background — The Illusion of English-First Safety

Most mainstream safety benchmarks:

Focus on English (occasionally Chinese)
Use judge models or human raters
Emphasize moderation, not adversarial jailbreak
Ignore orthography (script switching, romanization, code-mixing)

In real-world South Asian usage, users routinely:

Mix scripts (native + Latin)
Romanize native languages
Code-switch mid-sentence
Translate instructions across languages

IJR addresses a gap that previous benchmarks missed:

Dimension	Prior Benchmarks	IJR
Multilingual adversarial jailbreak	Limited	12 Indic languages
Orthography stress testing	Rare	Native, Romanized, Mixed
Contract vs Free comparison	No	Yes
Judge-free evaluation	No	Yes
Dataset size	~10–30k	45,216

This is not merely more data. It is a different evaluation philosophy.

Analysis — What IJR Actually Tests

IJR separates safety into three structural components:

1️⃣ Attacked-Benign (AB)

Benign tasks wrapped with adversarial jailbreak instructions.

2️⃣ Clean-Benign (CB)

Normal benign tasks.

3️⃣ Clean-Harmful (CH)

Clearly harmful requests with canary tokens.

Then it evaluates under two modes:

Track	Output Constraint	What It Measures
JSON	Must emit structured refusal schema	Contract compliance
FREE	No structural constraint	True behavioral alignment

This dual-track design exposes what the authors call the “contract gap.”

Findings — The Contract Gap Is Real

🔹 1. Jailbreak Success Remains High Under Contracts

Across 12 models:

JSON attacked-benign JSR often exceeds 0.75
LLaMA variants approach 0.92–0.98
Indic-specialized Sarvam: 0.96 JSR

Contracts inflate refusals on clean prompts, but jailbreak vulnerability remains.

🔹 2. In FREE Mode, Refusals Collapse

Under unconstrained output:

Attacked-benign JSR ≈ 1.0 across models
Over-refusal collapses to near zero

This reveals a structural dependency:

Safety is contract-induced, not behaviorally internalized.

We can formalize this with the Refusal Robustness Index (RRI):

$$ RRI = 1 - \frac{JSR_{attack}}{JSR_{benign}} $$

In FREE settings, RRI approaches zero.

That is not robustness. That is formatting compliance.

Orthography Shock — Romanization Changes Outcomes

One of the most revealing results concerns script variation.

Average JSON JSR across languages:

Input Type	Mean JSR
Native script	0.755
Romanized	0.416
Mixed	0.488

Romanization reduces jailbreak success under contract-bound evaluation.

Correlations show:

Romanization share ↑ → ΔJSR ↑ (ρ ≈ 0.28–0.32)
Byte/character density ↓ → JSR ↓

Translation: tokenization fragmentation matters.

This is not “script safety.” It is tokenizer stress interacting with refusal formatting.

In FREE mode, this protective effect disappears.

Cross-Lingual Transfer — English Attacks Travel Well

English wrappers applied to Indic cores transfer strongly:

Language	Mean Transfer JSR
Urdu	~0.70
Hindi	~0.68
Others	≥0.58

Format-based attacks outperform instruction-based ones.

This suggests:

Guardrails are not language-bound.
Alignment generalizes poorly under adversarial translation.

For multinational deployments, this means your English red-team coverage is insufficient.

By-Model Observations — Specialization ≠ Robustness

Some notable patterns:

Model Category	Observed Pattern
API-hosted	Lower JSR but still vulnerable
Open-weight	Often near-saturated JSR
Indic-specialized	High JSR, high abstain rates

Sarvam, despite Indic focus, shows:

JSON JSR ≈ 0.96
Schema validity < 0.20
Non-trivial leakage

Localization alone does not solve adversarial robustness.

Validation — Judge-Free but Not Naive

IJR avoids LLM-as-judge scoring.

Human audit across 600 samples shows:

κ ≈ 0.68–0.74 agreement
False negatives < 5%
Schema validity ≈ 95%

Lite vs full evaluation correlations exceed 0.80.

The conclusions are not sampling artifacts.

Implications — What This Means for Enterprises

1️⃣ Contract-bound evaluation overstates safety

If your deployment relies on structured output templates, your risk is masked.

2️⃣ Multilingual markets require orthographic testing

Script-mixing and romanization are first-class citizens in South Asia.

3️⃣ Cross-language red-teaming is mandatory

English-only stress tests are insufficient.

4️⃣ Tokenization is a security surface

Byte-level robustness matters.

5️⃣ Alignment must survive unconstrained generation

If safety vanishes when formatting disappears, it was never stable.

Strategic Takeaways for AI Governance

For organizations deploying LLMs in multilingual contexts:

Report AB/CB/CH separately
Test both JSON and FREE tracks
Stress native + romanized + mixed inputs
Measure cross-lingual transfer explicitly
Track refusal robustness, not raw refusal rates

Safety should be behavioral, not decorative.

Conclusion — Beyond English, Beyond Contracts

IJR reframes multilingual safety evaluation.

Across 12 Indic and South Asian languages, it shows:

Contracts create conservative optics.
Jailbreak success remains high.
Orthography interacts with tokenization.
English attacks transfer broadly.
Free-form generation reveals true alignment.

For a region representing over two billion speakers, this is not niche research. It is overdue infrastructure.

Safety is not what your model says under JSON.

It is what it does when the braces disappear.

Cognaptus: Automate the Present, Incubate the Future.

Opening — Why this matters now#

Background — The Illusion of English-First Safety#

Analysis — What IJR Actually Tests#

1️⃣ Attacked-Benign (AB)#

2️⃣ Clean-Benign (CB)#

3️⃣ Clean-Harmful (CH)#

Findings — The Contract Gap Is Real#

🔹 1. Jailbreak Success Remains High Under Contracts#

🔹 2. In FREE Mode, Refusals Collapse#

Orthography Shock — Romanization Changes Outcomes#

Cross-Lingual Transfer — English Attacks Travel Well#

By-Model Observations — Specialization ≠ Robustness#

Validation — Judge-Free but Not Naive#

Implications — What This Means for Enterprises#

1️⃣ Contract-bound evaluation overstates safety#

2️⃣ Multilingual markets require orthographic testing#

3️⃣ Cross-language red-teaming is mandatory#

4️⃣ Tokenization is a security surface#

5️⃣ Alignment must survive unconstrained generation#

Strategic Takeaways for AI Governance#

Conclusion — Beyond English, Beyond Contracts#