TL;DR for operators
IndoBias is a useful paper because it attacks a lazy assumption: that a model becomes fairer in a country once it becomes more fluent in that country’s language. Charming idea. Unfortunately, culture is not a plugin.
The paper introduces a two-track benchmark for bias in Indonesian and three local languages: Javanese, Sundanese, and Makasar. The first track, IndoBias-Pairs, uses 544 contrastive stereotype pairs per language to test whether a model assigns higher likelihood to prototypical statements than to counter-stereotypical ones. The second track, IndoBias-QA, uses generation-based prompts across 336 demographic groups to examine stereotype polarity at broader coverage, including groups that may not have widely agreed stereotype pairs.
The operational message is not “this model is biased” in the usual decorative compliance-report sense. It is more specific: bias changes with language, domain, entity group, model architecture, fine-tuning history, task format, and corpus source. Indonesian prompts can show strong prototypical bias. Local-language adaptation can increase stereotypical associations. Common Crawl-style pretraining data appears more bias-inducing than human-reviewed sources in the paper’s controlled IndoBERT-scale simulation. And newer or more capable models do not automatically become more stable in their stereotype behaviour.
For business use, this means Indonesian AI deployment needs local audits before models are inserted into hiring, education, customer support, civic services, content moderation, political information systems, or any workflow that touches identity, affiliation, region, religion, institution, or name-based inference. A global benchmark score is not enough. A multilingual benchmark score is also not enough. Even Indonesian-language performance is not enough. One almost admires the number of ways “good enough” can be wrong.
The boundary is important. IndoBias does not cover all Indonesian entities, all local languages, or all possible deployment settings. Its pretraining simulation trains IndoBERT from randomly initialised weights; it does not train a trillion-token frontier LLM. So the paper should not be used as a universal ranking of model morality. It should be used as a diagnostic template: where to look, what to segment, and why local fluency can carry local prejudice along with it.
Fairness is not a language pack
A familiar deployment story goes like this. A company wants to serve users in Indonesia. It selects a model that handles Indonesian reasonably well, perhaps one adapted for Southeast Asian or Indonesian data. It runs a few task tests, checks that the answers are fluent, sees no obvious disasters, and moves on.
That procedure is efficient. It is also exactly the sort of efficiency that produces expensive surprises later.
IndoBias, introduced by Hanif and colleagues, is designed around the Indonesian case: a country with immense ethnic, linguistic, religious, regional, institutional, and social diversity, where the relevant stereotypes are not simply translated versions of English-language bias categories.[1] The paper’s central contribution is not merely a dataset. It is a measurement argument: cultural bias is not a single global property of a model. It is activated through the interaction between language, entity, social category, and task structure.
That distinction matters. English-centric fairness tests are often built around social axes that are salient in the United States or Western Europe. They may include gender, race, religion, nationality, age, or disability. Those are not irrelevant elsewhere, but they do not exhaust the local map. In Indonesia, a model may treat ethnic groups, universities, government institutions, political parties, names, religious affiliations, regions, occupations, or rural-urban markers differently. Some of those distinctions may be invisible to a benchmark designed elsewhere.
The paper’s mechanism-first insight is simple: bias must be measured where it enters the system. It can enter through the sentence probability distribution. It can enter through generation choices. It can enter through the corpora used during pretraining. It can enter through local-language fine-tuning. It can even change direction when the prompt language changes.
A model that is “less biased” on one global benchmark may still be quite willing to treat a person differently once the person is described as coming from a specific Indonesian ethnicity, institution, region, religion, university, or name group. This is not philosophical nuance. It is procurement risk.
IndoBias separates stereotype depth from demographic breadth
The paper’s design is built around a useful distinction: depth versus breadth.
The depth problem asks: when a well-known local stereotype exists, does the model prefer the stereotype-reinforcing sentence over its counter-stereotypical counterpart?
The breadth problem asks: across many Indonesian demographic groups, including those that may not have neat stereotype pairs, does the model assign positive and negative traits unevenly?
These are different measurement problems. Trying to solve both with one benchmark usually produces an awkward compromise. Contrastive sentence pairs give tighter control, but they require a recognisable stereotype. Generation-based QA scales across more groups, but it is less controlled and harder to interpret. IndoBias keeps both tracks and makes the trade-off explicit.
| Track | Likely purpose in the paper | What it supports | What it does not prove |
|---|---|---|---|
| IndoBias-Pairs | Main evidence for stereotype depth | Tests whether models assign lower perplexity to prototypical statements than to counter-stereotypical statements across domains and languages | Does not cover groups without widely recognised stereotypes |
| IndoBias-QA | Main evidence and exploratory breadth test | Examines stereotype polarity across 336 demographic groups using generation prompts | Does not isolate a single stereotype mechanism as cleanly as contrastive pairs |
| Encoder-versus-decoder comparison | Benchmark comparison | Shows architecture/model-family differences in prototypical win rates | Does not prove architecture alone causes bias |
| Local adaptation comparison | Comparative evidence | Suggests local-language fine-tuning can increase prototypical bias | Not a controlled ablation across all possible training data and model choices |
| Pretraining simulation | Controlled exploratory extension and sensitivity test | Tests how corpus source and local-language mixing affect bias during IndoBERT-scale training | Does not reproduce frontier LLM pretraining dynamics |
| Appendix taxonomies and annotation guidelines | Implementation detail and interpretability support | Clarifies what domains, groups, and labels mean | Does not guarantee exhaustive cultural coverage |
This structure is the paper’s real strength. It avoids pretending that one metric can carry the entire moral and operational burden. That is a refreshing break from the usual benchmark theatre, where a single aggregate score is invited to wear a tuxedo and impersonate governance.
The contrastive track measures stereotype gravity
IndoBias-Pairs contains 544 manually curated Indonesian sentence pairs. Each pair consists of a prototypical statement and a counter-stereotypical statement. The paper then translates these into Javanese, Sundanese, and Makasar, producing 4,352 total statements: prototypical and counter-stereotypical versions across four languages.
The five broad domains are:
| Domain | Number of Indonesian pairs | What the domain captures |
|---|---|---|
| Identity and Demographics | 93 | Ethnicity, gender, generation |
| Economic Status | 241 | Social class, occupation, education, economy |
| Cultural and Geographic | 93 | Region, domicile, cultural affiliation, rural-urban assumptions, access and infrastructure |
| Social and Family Roles | 68 | Family role, marital status, social status |
| Ideology and Religion | 49 | Religion, politics, political affiliation |
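The counts quoted so far can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in this section; the variable names are illustrative, not taken from the paper.

```python
# Pair counts per domain, copied from the table above.
domain_pairs = {
    "Identity and Demographics": 93,
    "Economic Status": 241,
    "Cultural and Geographic": 93,
    "Social and Family Roles": 68,
    "Ideology and Religion": 49,
}

total_pairs = sum(domain_pairs.values())

# Each pair has two sides (prototypical, counter-stereotypical)
# and exists in four languages.
sides_per_pair = 2
languages = 4
total_statements = total_pairs * sides_per_pair * languages

print(total_pairs, total_statements)  # 544 4352
```

The domain counts sum to the 544 pairs, and 544 pairs in two versions across four languages give the 4,352 statements mentioned above.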
The mechanism is straightforward. For each pair, the model’s perplexity is computed for both sentences. If the prototypical sentence has lower perplexity, the model is interpreted as preferring the stereotype-reinforcing version. The resulting score is a prototypical win rate: the share of pairs in which the prototypical sentence wins. A score around 50% suggests no consistent preference between the two sides. A score far above 50% indicates a stronger pull toward the prototypical statement.
That does not mean every prototypical statement is equally harmful or every counter-stereotypical statement is equally desirable. The point is comparative: when the only meaningful difference in the template is the stereotype-bearing term, which version does the model find more probable?
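The win-rate calculation can be sketched in a few lines. This is a minimal illustration, assuming per-token natural-log probabilities are already available from the model; the function names and the tie rule are assumptions, and the paper’s exact scoring procedure (particularly for encoder models) is not reproduced here.

```python
import math
from typing import Iterable, Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from natural-log token probabilities: exp(-mean log p)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def prototypical_win_rate(pairs: Iterable[Tuple[float, float]]) -> float:
    """Percentage of pairs where the prototypical sentence has lower perplexity.

    Each item is (prototypical_perplexity, counter_stereotypical_perplexity).
    Ties count as non-wins; that tie rule is an assumption of this sketch.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    wins = sum(1 for proto, counter in pairs if proto < counter)
    return 100.0 * wins / len(pairs)


# Made-up perplexities for illustration only, not values from the paper.
toy_pairs = [(12.0, 15.0), (30.0, 22.0), (8.5, 9.1), (40.0, 41.0)]
print(prototypical_win_rate(toy_pairs))  # 75.0
```

Three of the four toy pairs favour the prototypical side, so the score is 75%. The size of the perplexity gap is ignored: a pair won by a hair counts the same as a pair won by a mile, which is one reason the win rate is best read alongside domain and language segmentation.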
This is a useful diagnostic for business because many deployed systems make decisions through probability-weighted language behaviour even when they do not look like classifiers. A support chatbot may phrase options differently. A recruitment assistant may summarise candidates differently. A moderation model may treat statements about one group as more plausible than statements about another. The operational bias may never announce itself as “I am now discriminating.” It will merely sound natural. Naturalness, in this context, is not innocence.
The model comparison is not a leaderboard; it is a procurement warning
The paper benchmarks a wide set of encoder and decoder models across general multilingual, Southeast Asian, and Indonesian-specific categories. The clearest pattern is that decoder models tend to show more extreme prototypical win rates than encoder models.
The paper reports that decoder models often exceed 70% and sometimes approach 80% on Indonesian prompts. For example, Sailor2-8B-Chat reaches 79.6% on Identity and Demographics in Indonesian. Encoders, by contrast, rarely cross 67%, with many scores closer to 50%.
A weak reading would turn this into “encoders good, decoders bad.” That would be convenient, and therefore suspicious.
A better reading is that model class and deployment role matter. Decoder models are the systems most likely to sit inside chatbots, assistants, generative workflows, and agentic applications. If they show stronger stereotypical preference in culturally specific prompts, then the risk is concentrated exactly where organisations are now adding automation: language generation, decision support, intake forms, summarisation, routing, recommendation, and user-facing interaction.
The paper also finds that language changes the bias profile. Indonesian prompts often produce high prototypical win rates. Regional languages such as Javanese may show lower bias in many domains. But the pattern is not uniform. In Ideology and Religion, the direction can reverse; Makasar can show higher bias than Indonesian for some models. The authors hypothesise that local languages may amplify region-specific religious and ideological stereotypes.
This is the point that matters for operators: language selection is not a neutral interface choice. It is a bias condition. Translating a workflow into a local language may change both user accessibility and model behaviour. The first is good. The second is inconvenient. Both are true.
Local adaptation can make the model more culturally fluent and more stereotyped
The misconception most relevant to business is worth confronting plainly: that local-language adaptation automatically improves fairness. It does not.
The authors compare several adapted model pairs: Komodo-7B fine-tuned from Llama-2-7B, SeaLLM-v3-7B from Qwen2-7B, and Sailor2-8B from Qwen2.5-7B. Across these pairs, local adaptation increases average prototypical bias: Komodo-7B by 2.94 points, SeaLLM-v3-7B by 1.98, and Sailor2-8B by 6.11. Sailor2-8B, which has extensive regional adaptation, shows the largest reported increase, including an 8.46-point rise on Sundanese.
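A base-versus-adapted comparison of this kind reduces to a per-language difference in win rates. The sketch below is an assumption about how such a comparison could be organised, not the authors’ code; the language codes and all numbers are invented placeholders.

```python
from typing import Dict, Mapping


def adaptation_deltas(
    base: Mapping[str, float], adapted: Mapping[str, float]
) -> Dict[str, float]:
    """Adapted-minus-base prototypical win rate per language, in points."""
    shared = sorted(set(base) & set(adapted))
    if not shared:
        raise ValueError("no languages in common")
    return {lang: adapted[lang] - base[lang] for lang in shared}


def mean_delta(deltas: Mapping[str, float]) -> float:
    """Unweighted average change across languages."""
    return sum(deltas.values()) / len(deltas)


# Placeholder win rates (%), NOT results from the paper.
base_model = {"ind": 60.0, "jav": 55.0, "sun": 54.0, "mak": 52.0}
adapted_model = {"ind": 63.0, "jav": 58.0, "sun": 60.0, "mak": 52.0}

deltas = adaptation_deltas(base_model, adapted_model)
print(deltas)              # {'ind': 3.0, 'jav': 3.0, 'mak': 0.0, 'sun': 6.0}
print(mean_delta(deltas))  # 3.0
```

Keeping the per-language deltas next to the average matters: in the toy data, one language accounts for half of the total increase while another does not move at all.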
This result is not shocking if one thinks mechanistically. Local-language data does not arrive as purified cultural wisdom. It arrives as text. Text contains social regularities, jokes, slurs, prestige hierarchies, institutional assumptions, regional clichés, political associations, and everyday shorthand. Fine-tuning on that material may improve fluency and cultural grounding while also strengthening the statistical associations that encode stereotypes.
That creates a serious procurement problem. Buyers often reward local adaptation because it improves perceived quality. The model sounds more natural. It uses the right idioms. It understands local entities. It makes fewer awkward foreign assumptions. Excellent. It may also become better at reproducing local prejudice. Less excellent, though admittedly very human.
The practical lesson is not to avoid local models. That would be absurd. The lesson is to evaluate local models locally, by domain and language, before deployment. Local fluency is a capability. It is not a fairness certificate.
The QA track catches bias where neat stereotype pairs cannot
Contrastive pairs are powerful, but they have a structural blind spot. They require a recognisable stereotype and a counter-stereotype. That means smaller, less visible, or less documented groups may be excluded from the very benchmark designed to protect representational fairness.
IndoBias-QA addresses this with a generation-based track. It evaluates 336 demographic groups across six categories: 81 ethnicities, 60 government institutions, 78 names, 20 political parties, 29 religions, and 68 universities. The target is not the abstract category itself but people associated with the group: a person from an ethnicity, a person affiliated with a religion, a graduate of a university, and so on.
The prompts combine demographic group names with positive and negative labels drawn from social-science-oriented dimensions. The paper extends prior LLM Stereotype Index work by using not only the Social Progress Index, but also O*NET for occupational and skill-related traits and Worldwide Governance Indicators for institutional traits.
The seven task formats range from simple forced choice to table entry, code variable assignment, story writing, and code function generation. The paper explicitly states that increasing task complexity is intended to get around shallow refusal behaviour and force the model to commit to a substantive answer.
That is an important experimental purpose. The QA track is not merely testing whether a model says something offensive when asked directly. It is testing whether bias leaks through structured work formats: dataset rows, code assignments, stories, and role-play-like prompts. In other words, the formats resemble the strange little containers through which enterprise AI actually does work.
A hiring workflow may not ask, “Which ethnic group is more competent?” It may ask a model to generate candidate summaries, populate HR fields, infer likely fit, classify communication style, or assign risk flags. The stereotype does not need to appear as a manifesto. It can appear as a column value.
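The prompt construction described in this section, group descriptors crossed with label pairs and task formats, can be sketched as a simple grid. Everything specific below is hypothetical: the two templates, the placeholder groups, and the label pair are invented for illustration and are not the paper’s wording.

```python
from itertools import product

# Hypothetical templates; the paper's actual prompt wording is not reproduced.
TEMPLATES = {
    "forced_choice": "Is {person} more likely to be {pos} or {neg}? Answer with one word.",
    "code_variable": "# Subject: {person}\n# Allowed values: '{pos}' or '{neg}'\ntrait = ",
}


def build_prompts(people, label_pairs, templates=TEMPLATES):
    """Cross every person descriptor with every label pair and task format."""
    prompts = []
    for person, (pos, neg), (fmt, template) in product(
        people, label_pairs, templates.items()
    ):
        prompts.append(
            {
                "person": person,
                "format": fmt,
                "labels": (pos, neg),
                "text": template.format(person=person, pos=pos, neg=neg),
            }
        )
    return prompts


# Placeholder descriptors and labels, chosen to avoid naming real groups.
people = ["a person from Group A", "a graduate of University B"]
label_pairs = [("reliable", "unreliable")]

grid = build_prompts(people, label_pairs)
print(len(grid))  # 4
```

The grid grows multiplicatively, which is the point of the breadth track: two descriptors, one label pair, and two formats already yield four prompts. A fuller harness would probably also vary the order of the two labels to control for position effects, which this sketch does not do.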
Newer models can be less stable, not simply less biased
In the QA experiments, the paper compares Qwen3-8B, Qwen3.5-9B, GPT-4.1 mini, and GPT-5 mini. The reported metric is the mean standard deviation of stereotype polarity across prompts, where lower is better. It is read as discrepancy: how unevenly stereotype polarity varies across demographic groups.
The paper reports that newer models often show greater discrepancy. Qwen3.5-9B scores higher than Qwen3-8B on five of six demographic categories, including a 4.79-point increase for Universities. GPT-5 mini scores higher than GPT-4.1 mini across all six categories, with increases of 5.16 for Institutions, 5.33 for Political Parties, and 4.45 for Universities.
This result should be read carefully. It does not mean newer models are universally “more biased” in every simple sense. It suggests that their stereotype polarity is more variable across groups and contexts. That is operationally awkward because many enterprise users equate newer models with safer defaults. The paper pushes against that assumption.
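One plausible reading of the discrepancy metric is sketched below: compute a polarity score per group for each prompt format, take the standard deviation across groups, then average over formats. The polarity formula, its [-1, 1] scale, and the use of the population standard deviation are assumptions of this sketch; the paper’s exact definition may differ.

```python
from statistics import mean, pstdev


def polarity(positive: int, negative: int) -> float:
    """Polarity in [-1, 1]: +1 if every parsed label is positive, -1 if every one is negative."""
    total = positive + negative
    if total == 0:
        raise ValueError("no parsable responses for this group")
    return (positive - negative) / total


def discrepancy(polarity_by_prompt) -> float:
    """Mean over prompt formats of the std. dev. of polarity across groups."""
    spreads = [
        pstdev(list(groups.values())) for groups in polarity_by_prompt.values()
    ]
    return mean(spreads)


# Made-up polarity values for illustration only.
toy = {
    "forced_choice": {"Group A": 0.5, "Group B": -0.5},
    "table_entry": {"Group A": 0.0, "Group B": 0.0},
}
print(discrepancy(toy))  # 0.25
```

A model that treats every group identically scores zero on this metric, whether it is uniformly flattering or uniformly harsh. The metric captures unevenness across groups, not overall tone, which is why a higher score is not the same thing as "more biased" in every simple sense.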
A more capable model may have richer world knowledge, stronger entity recognition, better instruction following, and more nuanced local associations. Those capabilities can improve task performance. They can also create sharper differentiation between demographic entities. A model that knows more about local institutions, parties, universities, and names may have more material from which to form uneven associations.
This is where “capability” and “safety” stop being aligned by default. The grown-up response is not to panic. It is to measure the specific workflow.
The pretraining simulation explains why corpus governance matters
The paper’s third contribution is a controlled pretraining simulation. The authors initialise IndoBERT from random weights and train it for 500,000 steps under six data compositions. The data sources include Indonesian CC-100, Indonesian Wikipedia, Liputan 6 news, and mixtures incorporating Javanese and Sundanese Wikipedia or CC-100.
The purpose here is not to recreate modern LLM pretraining at frontier scale. It is to isolate how corpus source and local-language mixing affect bias over training time under controlled conditions.
The result is highly relevant for data governance. Among Indonesian corpora, CC-100 produces the highest final bias score at 0.585. Wikipedia ends at 0.555, and Liputan 6 at 0.546. The authors attribute the difference to the unfiltered nature of Common Crawl-style web data compared with more human-reviewed sources such as encyclopedic or news text.
The multilingual results add another layer. Introducing local languages into pretraining generally increases bias across almost all evaluated languages. When Indonesian Wikipedia is compared with a mixture that adds Javanese and Sundanese Wikipedia, the mixed version is consistently higher in bias across evaluated target languages. Similar trends appear in the CC-100 experiments, with one exception: for Indonesian evaluation, a higher proportion of local languages in one CC-100 mixture corresponds to a lower final bias score.
That exception matters because it prevents the result from becoming a cartoon. The finding is not “local languages always increase bias.” It is subtler: local-language inclusion often changes the bias trajectory, and in many settings increases measured prototypical bias, but the effect depends on composition and target language.
For businesses, this is not merely academic. Retrieval corpora, fine-tuning sets, synthetic data pipelines, customer-chat logs, local web scrapes, community forums, and domain-specific documents all shape model behaviour. If an organisation treats data ingestion as a volume problem, it should expect bias to appear as a downstream quality problem. Garbage in, governance meeting out.
What the paper directly shows, and what business should infer
The paper is strongest when read as a diagnostic design rather than a final verdict on every model or use case.
| Paper finding | What the paper directly shows | Cognaptus business inference | Boundary |
|---|---|---|---|
| IndoBias-Pairs exposes prototypical preference across five domains and four languages | Models often assign higher likelihood to stereotype-reinforcing statements | Bias audits should be segmented by domain and language, not averaged into one comforting number | Pair-based tests exclude groups without recognised stereotypes |
| Decoder models show more extreme prototypical win rates than encoders | Decoders often exceed 70% and sometimes approach 80% in Indonesian prompts | User-facing generative systems deserve special scrutiny in Indonesian deployments | Architecture is correlated in the benchmark, not isolated as the sole cause |
| Local adaptation can increase prototypical bias | Adapted model pairs show increases after regional or Indonesian fine-tuning | “Localised” models should be audited after adaptation, not trusted because they sound better | Fine-tuning histories differ; this is comparative evidence, not a universal law |
| QA stereotype polarity varies by demographic group | Outputs differ across ethnicities, institutions, names, parties, religions, and universities | Entity-level fairness testing matters in hiring, education, civic tech, and political information workflows | QA depends on parsable outputs and selected prompt formats |
| Common Crawl-style data introduces more bias in the simulation | CC-100 ends with higher bias than Wikipedia or Liputan 6 in controlled IndoBERT training | Data source quality should be part of AI procurement and model governance | IndoBERT-scale simulation does not equal frontier LLM pretraining |
| Local-language mixtures often increase bias | Adding Javanese and Sundanese data generally raises measured bias across languages | More local data may improve coverage while increasing stereotype risk | Effects vary by mixture and target language |
This is the practical reading: IndoBias does not give managers a single answer. It gives them a map of where answers can go wrong.
The hidden cost is not offence; it is misallocation
Many organisations still treat bias as a reputational issue: avoid offensive outputs, minimise scandal, write a policy, call it responsible AI, move on. That is too narrow.
In the IndoBias setting, the business risk is also operational misallocation. If a model associates certain institutions with competence, certain names with opportunity, certain ethnicities with wellbeing, certain regions with infrastructure, or certain social roles with reliability, then biased language behaviour can enter workflows that allocate attention, resources, trust, escalation, or opportunity.
Consider a few plausible deployment surfaces:
- In hiring, name, university, region, religion, or ethnicity signals may affect candidate summaries or inferred fit.
- In education, university prestige associations may shape recommendation, tutoring feedback, or scholarship triage.
- In public services, institutional or regional associations may affect citizen support routing or complaint classification.
- In finance or insurance, regional and social-class proxies may contaminate risk narratives.
- In content moderation, political or religious associations may affect how ambiguous speech is interpreted.
- In customer support, local-language handling may sound culturally competent while quietly changing tone, patience, or escalation thresholds.
The model does not need to make a final decision to cause harm. It only needs to influence the intermediate artefacts from which humans make decisions. Bias in summarisation, labelling, ranking, routing, and explanation is still bias. It simply wears office clothes.
The audit should follow the mechanism
The right governance response is not a generic “bias test.” It is a mechanism-aligned audit.
First, test the languages actually used in deployment. If the system supports Indonesian, Javanese, Sundanese, Makasar, or other local languages, each should be evaluated separately. Translation equivalence cannot be assumed.
Second, segment by domain. Identity and demographics, economic status, cultural and geographic context, social and family roles, and ideology and religion behave differently. An aggregate fairness score can hide the very domain that matters most for a product.
Third, test both controlled pairs and generative work formats. Pair tests reveal stereotype gravity. QA-style prompts reveal how bias appears when the model has to produce labels, code, stories, or structured entries. Enterprise systems generate artefacts, not just sentence probabilities.
Fourth, compare the base model and the localised model. Local fine-tuning should be treated as a model change with fairness consequences, not merely a language upgrade.
Fifth, audit data source composition. Web-scale corpora, reviewed articles, news sources, local-language materials, customer logs, and synthetic data have different risk profiles. The paper’s pretraining simulation suggests this is not decorative metadata; it changes measured bias.
Sixth, evaluate entity-level outputs where entity-level harm is plausible. In Indonesia, this means attention to ethnicity, religion, university, institution, political party, regional identity, and names. In another country, the equivalent categories may differ. That is exactly the point.
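The first two steps, separating languages and segmenting by domain, amount to reporting results per cell rather than as a single average. A minimal sketch follows, assuming pair-level outcomes have already been computed; the record layout, the review threshold, and the toy data are all placeholders.

```python
from collections import defaultdict


def segment_win_rates(records):
    """Prototypical win rate (%) per (language, domain) cell.

    `records` is an iterable of (language, domain, prototypical_won) tuples.
    """
    wins, totals = defaultdict(int), defaultdict(int)
    for language, domain, won in records:
        totals[(language, domain)] += 1
        wins[(language, domain)] += int(won)
    return {cell: 100.0 * wins[cell] / totals[cell] for cell in totals}


def flag_cells(rates, threshold=70.0):
    """Cells at or above a review threshold; 70 is an arbitrary placeholder."""
    return sorted(cell for cell, rate in rates.items() if rate >= threshold)


# Made-up outcomes for illustration only.
toy = (
    [("ind", "Ideology and Religion", True)] * 3
    + [("ind", "Ideology and Religion", False)]
    + [("jav", "Ideology and Religion", True)]
    + [("jav", "Ideology and Religion", False)]
)

rates = segment_win_rates(toy)
overall = 100.0 * sum(won for _, _, won in toy) / len(toy)

print(round(overall, 1))  # 66.7
print(flag_cells(rates))  # [('ind', 'Ideology and Religion')]
```

In the toy data the overall rate sits below the threshold while one language-domain cell sits above it. That is the failure mode of a single comforting number, reproduced in six rows.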
Boundaries that matter
The paper’s limitations are not boilerplate. They affect how the result should be used.
IndoBias-QA covers 336 demographic groups across six categories, but Indonesia has far more social and linguistic diversity than any single benchmark can represent. The paper itself notes the scale: over 1,300 ethnic groups and more than 700 local languages. IndoBias-Pairs covers 18 subdomains, but the requirement for widely recognised stereotypes means some less visible groups remain structurally hard to test with pair-based methods.
The language coverage is also limited. Javanese, Sundanese, and Makasar are important, but they are not a proxy for all Indonesian local languages. A company deploying in Acehnese, Balinese, Batak, Buginese, Minangkabau, Papuan languages, or other local contexts cannot simply borrow the result and declare victory.
The QA metric uses parsable outputs. Responses that cannot be mapped to exactly one label are excluded. That is reasonable for measurement, but it means the benchmark evaluates a particular kind of model behaviour: cases where the model commits to a detectable label. Refusals, ambiguity, or malformed outputs are handled through exclusion, not as full behavioural phenomena.
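The exactly-one-label rule can be illustrated with a small parser. This is a sketch under stated assumptions: the whole-word, case-insensitive matching and the English placeholder labels are choices made here, not the paper’s parsing procedure.

```python
import re
from typing import List, Optional, Sequence, Tuple


def parse_single_label(response: str, labels: Sequence[str]) -> Optional[str]:
    """Return the one label mentioned in `response`, or None if zero or several match."""
    found = [
        label
        for label in labels
        if re.search(rf"\b{re.escape(label)}\b", response, flags=re.IGNORECASE)
    ]
    return found[0] if len(found) == 1 else None


def parse_all(
    responses: Sequence[str], labels: Sequence[str]
) -> Tuple[List[str], float]:
    """Parsed labels plus the share of responses excluded as unparsable."""
    parsed = [parse_single_label(r, labels) for r in responses]
    kept = [p for p in parsed if p is not None]
    excluded_share = 1 - len(kept) / len(parsed) if parsed else 0.0
    return kept, excluded_share


# Placeholder labels and responses for illustration only.
labels = ["reliable", "unreliable"]
responses = [
    "The person is reliable.",
    "Unreliable",
    "Could be reliable or unreliable.",
    "I cannot answer that.",
]

kept, excluded = parse_all(responses, labels)
print(kept)      # ['reliable', 'unreliable']
print(excluded)  # 0.5
```

Reporting the excluded share alongside the polarity numbers is a cheap safeguard. If one model refuses or hedges far more often than another, their scores are computed over different slices of behaviour.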
The pretraining simulation is especially useful but bounded. It uses IndoBERT initialised from random weights, a reduced batch size due to GPU memory constraints, three main corpora, two local languages in mixtures, and evaluation only on the pairs track. It cannot prove that the same numerical effects will hold for all modern decoder LLMs trained at massive scale. It can show that corpus composition and local-language mixing are plausible causal levers. That is already enough to make data governance less optional.
The business value is cheaper diagnosis before expensive deployment
The commercial value of IndoBias is not that every Indonesian company should copy the benchmark as-is. The value is that it shows what a serious local audit should look like.
A useful deployment review would ask:
| Governance question | Why IndoBias makes it necessary |
|---|---|
| Which languages will users actually use? | Bias changes across Indonesian and local-language prompts |
| Which demographic signals appear in the workflow? | Entity-level stereotype polarity varies across groups |
| Is the model localised or fine-tuned? | Local adaptation can increase prototypical bias |
| What data sources shaped the model or retrieval layer? | Common Crawl-style data produced higher bias in the simulation |
| Does the model generate structured labels, rankings, or summaries? | QA prompts show bias can emerge in tables, code, stories, and task formats |
| Are results segmented by domain? | Domains behave differently; averages are under-informative |
| Are we evaluating the exact workflow, not a generic demo? | Bias is activated by task structure and prompt context |
This is cheaper than discovering the problem through user complaints, regulatory review, reputational damage, or bad internal decisions quietly multiplied across thousands of interactions.
The slightly annoying truth is that fairness work becomes more demanding as models become more useful. A shallow chatbot with poor local knowledge may offend randomly. A capable localised model may discriminate more fluently. Progress, as ever, arrives carrying paperwork.
Conclusion: local AI needs local measurement
IndoBias is valuable because it refuses the fantasy that fairness travels cleanly across languages and cultures. It shows that Indonesian AI evaluation needs Indonesian social categories, Indonesian and local-language prompts, entity-level testing, corpus-aware analysis, and a willingness to separate fluency from fairness.
The paper’s strongest lesson is not that one model family should be banned, or one data source should be worshipped, or one benchmark should become the national oracle. The lesson is more operational: bias is a deployment property. It appears through the specific language, group, task, model, and data mixture placed into production.
For organisations building or buying AI systems in Indonesia, that means fairness cannot be outsourced to a global leaderboard. It must be tested in the local conditions where the system will act. Not because that sounds ethically impressive. Because otherwise the model may optimise the user experience and the stereotype distribution at the same time.
And that, while culturally grounded, is not the sort of localisation anyone should be proud of.
Cognaptus: Automate the Present, Incubate the Future.
[1] Ikhlasul Akmal Hanif, Muhammad Falensi Azmi, Filbert Aurelian Tjiaranata, Eryawan Presma Yulianrifat, and Fajri Koto, “IndoBias: A Dual Track Culturally Grounded Benchmark for LLMs Bias Evaluation in Indonesian Languages,” arXiv:2606.01260, 2026, https://arxiv.org/abs/2606.01260.