Opening — Why this matters now

If you evaluate AI safety only in English, under tightly structured output contracts, you may conclude that everything is under control.

Indic Jailbreak Robustness (IJR) politely disagrees.

The paper introduces a judge-free benchmark across 12 Indic and South Asian languages—representing more than 2.1 billion speakers—and evaluates 45,216 prompts under both contract-bound (JSON) and free-form (FREE) conditions. The conclusion is uncomfortable but precise:

Safety contracts reduce visible non-compliance. They do not eliminate jailbreaks.

For any enterprise deploying LLMs in multilingual markets—especially South Asia—this is not an academic curiosity. It is an operational risk.


Background — The Illusion of English-First Safety

Most mainstream safety benchmarks:

  • Focus on English (occasionally Chinese)
  • Use judge models or human raters
  • Emphasize moderation, not adversarial jailbreak
  • Ignore orthography (script switching, romanization, code-mixing)

In real-world South Asian usage, users routinely:

  • Mix scripts (native + Latin)
  • Romanize native languages
  • Code-switch mid-sentence
  • Translate instructions across languages

IJR addresses a gap that previous benchmarks missed:

Dimension Prior Benchmarks IJR
Multilingual adversarial jailbreak Limited 12 Indic languages
Orthography stress testing Rare Native, Romanized, Mixed
Contract vs Free comparison No Yes
Judge-free evaluation No Yes
Dataset size ~10–30k 45,216

This is not merely more data. It is a different evaluation philosophy.


Analysis — What IJR Actually Tests

IJR separates safety into three structural components:

1️⃣ Attacked-Benign (AB)

Benign tasks wrapped with adversarial jailbreak instructions.

2️⃣ Clean-Benign (CB)

Normal benign tasks.

3️⃣ Clean-Harmful (CH)

Clearly harmful requests with canary tokens.

Then it evaluates under two modes:

Track Output Constraint What It Measures
JSON Must emit structured refusal schema Contract compliance
FREE No structural constraint True behavioral alignment

This dual-track design exposes what the authors call the “contract gap.”


Findings — The Contract Gap Is Real

🔹 1. Jailbreak Success Remains High Under Contracts

Across 12 models:

  • JSON attacked-benign JSR often exceeds 0.75
  • LLaMA variants approach 0.92–0.98
  • Indic-specialized Sarvam: 0.96 JSR

Contracts inflate refusals on clean prompts, but jailbreak vulnerability remains.

🔹 2. In FREE Mode, Refusals Collapse

Under unconstrained output:

  • Attacked-benign JSR ≈ 1.0 across models
  • Over-refusal collapses to near zero

This reveals a structural dependency:

Safety is contract-induced, not behaviorally internalized.

We can formalize this with the Refusal Robustness Index (RRI):

$$ RRI = 1 - \frac{JSR_{attack}}{JSR_{benign}} $$

In FREE settings, RRI approaches zero.

That is not robustness. That is formatting compliance.


Orthography Shock — Romanization Changes Outcomes

One of the most revealing results concerns script variation.

Average JSON JSR across languages:

Input Type Mean JSR
Native script 0.755
Romanized 0.416
Mixed 0.488

Romanization reduces jailbreak success under contract-bound evaluation.

Correlations show:

  • Romanization share ↑ → ΔJSR ↑ (ρ ≈ 0.28–0.32)
  • Byte/character density ↓ → JSR ↓

Translation: tokenization fragmentation matters.

This is not “script safety.” It is tokenizer stress interacting with refusal formatting.

In FREE mode, this protective effect disappears.


Cross-Lingual Transfer — English Attacks Travel Well

English wrappers applied to Indic cores transfer strongly:

Language Mean Transfer JSR
Urdu ~0.70
Hindi ~0.68
Others ≥0.58

Format-based attacks outperform instruction-based ones.

This suggests:

  • Guardrails are not language-bound.
  • Alignment generalizes poorly under adversarial translation.

For multinational deployments, this means your English red-team coverage is insufficient.


By-Model Observations — Specialization ≠ Robustness

Some notable patterns:

Model Category Observed Pattern
API-hosted Lower JSR but still vulnerable
Open-weight Often near-saturated JSR
Indic-specialized High JSR, high abstain rates

Sarvam, despite Indic focus, shows:

  • JSON JSR ≈ 0.96
  • Schema validity < 0.20
  • Non-trivial leakage

Localization alone does not solve adversarial robustness.


Validation — Judge-Free but Not Naive

IJR avoids LLM-as-judge scoring.

Human audit across 600 samples shows:

  • κ ≈ 0.68–0.74 agreement
  • False negatives < 5%
  • Schema validity ≈ 95%

Lite vs full evaluation correlations exceed 0.80.

The conclusions are not sampling artifacts.


Implications — What This Means for Enterprises

1️⃣ Contract-bound evaluation overstates safety

If your deployment relies on structured output templates, your risk is masked.

2️⃣ Multilingual markets require orthographic testing

Script-mixing and romanization are first-class citizens in South Asia.

3️⃣ Cross-language red-teaming is mandatory

English-only stress tests are insufficient.

4️⃣ Tokenization is a security surface

Byte-level robustness matters.

5️⃣ Alignment must survive unconstrained generation

If safety vanishes when formatting disappears, it was never stable.


Strategic Takeaways for AI Governance

For organizations deploying LLMs in multilingual contexts:

  • Report AB/CB/CH separately
  • Test both JSON and FREE tracks
  • Stress native + romanized + mixed inputs
  • Measure cross-lingual transfer explicitly
  • Track refusal robustness, not raw refusal rates

Safety should be behavioral, not decorative.


Conclusion — Beyond English, Beyond Contracts

IJR reframes multilingual safety evaluation.

Across 12 Indic and South Asian languages, it shows:

  • Contracts create conservative optics.
  • Jailbreak success remains high.
  • Orthography interacts with tokenization.
  • English attacks transfer broadly.
  • Free-form generation reveals true alignment.

For a region representing over two billion speakers, this is not niche research. It is overdue infrastructure.

Safety is not what your model says under JSON.

It is what it does when the braces disappear.

Cognaptus: Automate the Present, Incubate the Future.