Opening — Why this matters now
If you evaluate AI safety only in English, under tightly structured output contracts, you may conclude that everything is under control.
Indic Jailbreak Robustness (IJR) politely disagrees.
The paper introduces a judge-free benchmark across 12 Indic and South Asian languages—representing more than 2.1 billion speakers—and evaluates 45,216 prompts under both contract-bound (JSON) and free-form (FREE) conditions. The conclusion is uncomfortable but precise:
Safety contracts reduce visible non-compliance. They do not eliminate jailbreaks.
For any enterprise deploying LLMs in multilingual markets—especially South Asia—this is not an academic curiosity. It is an operational risk.
Background — The Illusion of English-First Safety
Most mainstream safety benchmarks:
- Focus on English (occasionally Chinese)
- Use judge models or human raters
- Emphasize moderation, not adversarial jailbreak
- Ignore orthography (script switching, romanization, code-mixing)
In real-world South Asian usage, users routinely:
- Mix scripts (native + Latin)
- Romanize native languages
- Code-switch mid-sentence
- Translate instructions across languages
IJR addresses a gap that previous benchmarks missed:
| Dimension | Prior Benchmarks | IJR |
|---|---|---|
| Multilingual adversarial jailbreak | Limited | 12 Indic languages |
| Orthography stress testing | Rare | Native, Romanized, Mixed |
| Contract vs Free comparison | No | Yes |
| Judge-free evaluation | No | Yes |
| Dataset size | ~10–30k | 45,216 |
This is not merely more data. It is a different evaluation philosophy.
Analysis — What IJR Actually Tests
IJR separates safety into three structural components:
1️⃣ Attacked-Benign (AB)
Benign tasks wrapped with adversarial jailbreak instructions.
2️⃣ Clean-Benign (CB)
Normal benign tasks.
3️⃣ Clean-Harmful (CH)
Clearly harmful requests with canary tokens.
Then it evaluates under two modes:
| Track | Output Constraint | What It Measures |
|---|---|---|
| JSON | Must emit structured refusal schema | Contract compliance |
| FREE | No structural constraint | True behavioral alignment |
This dual-track design exposes what the authors call the “contract gap.”
Findings — The Contract Gap Is Real
🔹 1. Jailbreak Success Remains High Under Contracts
Across 12 models:
- JSON attacked-benign JSR often exceeds 0.75
- LLaMA variants approach 0.92–0.98
- Indic-specialized Sarvam: 0.96 JSR
Contracts inflate refusals on clean prompts, but jailbreak vulnerability remains.
🔹 2. In FREE Mode, Refusals Collapse
Under unconstrained output:
- Attacked-benign JSR ≈ 1.0 across models
- Over-refusal collapses to near zero
This reveals a structural dependency:
Safety is contract-induced, not behaviorally internalized.
We can formalize this with the Refusal Robustness Index (RRI):
$$ RRI = 1 - \frac{JSR_{attack}}{JSR_{benign}} $$
In FREE settings, RRI approaches zero.
That is not robustness. That is formatting compliance.
Orthography Shock — Romanization Changes Outcomes
One of the most revealing results concerns script variation.
Average JSON JSR across languages:
| Input Type | Mean JSR |
|---|---|
| Native script | 0.755 |
| Romanized | 0.416 |
| Mixed | 0.488 |
Romanization reduces jailbreak success under contract-bound evaluation.
Correlations show:
- Romanization share ↑ → ΔJSR ↑ (ρ ≈ 0.28–0.32)
- Byte/character density ↓ → JSR ↓
Translation: tokenization fragmentation matters.
This is not “script safety.” It is tokenizer stress interacting with refusal formatting.
In FREE mode, this protective effect disappears.
Cross-Lingual Transfer — English Attacks Travel Well
English wrappers applied to Indic cores transfer strongly:
| Language | Mean Transfer JSR |
|---|---|
| Urdu | ~0.70 |
| Hindi | ~0.68 |
| Others | ≥0.58 |
Format-based attacks outperform instruction-based ones.
This suggests:
- Guardrails are not language-bound.
- Alignment generalizes poorly under adversarial translation.
For multinational deployments, this means your English red-team coverage is insufficient.
By-Model Observations — Specialization ≠ Robustness
Some notable patterns:
| Model Category | Observed Pattern |
|---|---|
| API-hosted | Lower JSR but still vulnerable |
| Open-weight | Often near-saturated JSR |
| Indic-specialized | High JSR, high abstain rates |
Sarvam, despite Indic focus, shows:
- JSON JSR ≈ 0.96
- Schema validity < 0.20
- Non-trivial leakage
Localization alone does not solve adversarial robustness.
Validation — Judge-Free but Not Naive
IJR avoids LLM-as-judge scoring.
Human audit across 600 samples shows:
- κ ≈ 0.68–0.74 agreement
- False negatives < 5%
- Schema validity ≈ 95%
Lite vs full evaluation correlations exceed 0.80.
The conclusions are not sampling artifacts.
Implications — What This Means for Enterprises
1️⃣ Contract-bound evaluation overstates safety
If your deployment relies on structured output templates, your risk is masked.
2️⃣ Multilingual markets require orthographic testing
Script-mixing and romanization are first-class citizens in South Asia.
3️⃣ Cross-language red-teaming is mandatory
English-only stress tests are insufficient.
4️⃣ Tokenization is a security surface
Byte-level robustness matters.
5️⃣ Alignment must survive unconstrained generation
If safety vanishes when formatting disappears, it was never stable.
Strategic Takeaways for AI Governance
For organizations deploying LLMs in multilingual contexts:
- Report AB/CB/CH separately
- Test both JSON and FREE tracks
- Stress native + romanized + mixed inputs
- Measure cross-lingual transfer explicitly
- Track refusal robustness, not raw refusal rates
Safety should be behavioral, not decorative.
Conclusion — Beyond English, Beyond Contracts
IJR reframes multilingual safety evaluation.
Across 12 Indic and South Asian languages, it shows:
- Contracts create conservative optics.
- Jailbreak success remains high.
- Orthography interacts with tokenization.
- English attacks transfer broadly.
- Free-form generation reveals true alignment.
For a region representing over two billion speakers, this is not niche research. It is overdue infrastructure.
Safety is not what your model says under JSON.
It is what it does when the braces disappear.
Cognaptus: Automate the Present, Incubate the Future.