AI Language Models Struggle with Persian Taarof Etiquette, Study Finds

When “no” means “yes”: Why AI chatbots can’t process Persian social etiquette
Source: Ars Technica

Key Points

  • TAAROFBENCH is the first benchmark for evaluating AI performance on Persian taarof etiquette.
  • Major models, including GPT‑4o, Claude 3.5 Haiku, Llama 3, DeepSeek V3, and Dorna, correctly handled only 34% to 42% of scenarios.
  • Native Persian speakers achieved an 82% success rate on the same benchmark.
  • Models tend to default to direct, Western‑style communication, missing subtle polite cues.
  • Cultural missteps could derail negotiations, damage relationships, and reinforce stereotypes.
  • Researchers call for culturally aware training data and evaluation metrics for AI.
  • The study was led by Nikta Gohari Sadr of Brock University with partners from Emory University.
  • Findings highlight a gap between AI behavior and expectations of Persian‑speaking users.

A new study led by Nikta Gohari Sadr reveals that major AI language models, including GPT-4o, Claude 3.5 Haiku, Llama 3, DeepSeek V3, and the Persian‑tuned Dorna, perform poorly on the Persian cultural practice of taarof, correctly handling only 34 to 42 percent of scenarios compared with native speakers' 82 percent success rate. The researchers introduced TAAROFBENCH, a benchmark that tests AI systems on the nuanced give‑and‑take of polite refusals and insistence. The findings highlight a gap between Western‑centric AI behavior and the expectations of Persian speakers, raising concerns about cultural missteps in global AI applications.

Background and Motivation

Persian speakers navigate daily interactions through a cultural practice known as taarof, a ritualized exchange of offers, refusals, and polite insistence. Misunderstanding this etiquette can lead to social friction, especially as AI language models become increasingly integrated into communication tools used worldwide.

Study Design and Benchmark

Researchers led by Nikta Gohari Sadr of Brock University, together with collaborators from Emory University and other institutions, created TAAROFBENCH, the first benchmark specifically measuring how well AI systems reproduce taarof. The benchmark defines detailed scenarios that include environment, location, roles, context, and user utterances, allowing systematic evaluation of model responses.
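For illustration only, here is a minimal sketch of what a scenario record of that kind, plus a naive automated check, could look like. The field names (environment, llm_role, expected_behaviors) and the keyword-based check are assumptions made for this sketch, not the study's actual schema or scoring method.

```python
# Hypothetical sketch of a TAAROFBENCH-style scenario record and a trivial check.
# Field names and the keyword check below are illustrative assumptions, not the
# benchmark's real schema or evaluation procedure.
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaarofScenario:
    environment: str          # e.g., dinner at a host's home
    location: str             # e.g., a family apartment
    llm_role: str             # role the model plays (e.g., guest)
    user_role: str            # role the human plays (e.g., host)
    context: str              # background for the exchange
    user_utterance: str       # what the user says to the model
    expected_behaviors: List[str] = field(default_factory=list)  # cues a taarof-appropriate reply should contain


def response_matches(scenario: TaarofScenario, model_reply: str) -> bool:
    """Naive check: does the reply contain any expected polite-refusal cue?

    A real benchmark would use far richer evaluation; this only shows how
    scenario fields could drive an automated pass/fail judgment.
    """
    reply = model_reply.lower()
    return any(cue.lower() in reply for cue in scenario.expected_behaviors)


if __name__ == "__main__":
    scenario = TaarofScenario(
        environment="dinner at a host's home",
        location="a family apartment",
        llm_role="guest",
        user_role="host",
        context="The host offers more food; etiquette calls for an initial polite refusal.",
        user_utterance="Please, have some more stew!",
        expected_behaviors=["thank you, but", "I couldn't possibly", "you are too kind"],
    )
    reply = "Thank you, but I couldn't possibly eat another bite."
    print("taarof-appropriate:", response_matches(scenario, reply))
```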

Models Evaluated

The study examined a range of contemporary large language models: OpenAI’s GPT‑4o, Anthropic’s Claude 3.5 Haiku, Meta’s Llama 3, DeepSeek’s V3, and Dorna, a Persian‑tuned variant of Llama 3.

Key Findings

Across all tested models, correct handling of taarof scenarios ranged from 34 percent to 42 percent. By contrast, native Persian speakers achieved an 82 percent success rate on the same tasks. The results show that these models default to direct, Western‑style communication, often missing the subtle cues that define polite Persian exchanges.

Implications

The researchers warn that cultural missteps in high‑consequence settings—such as negotiations or relationship building—could derail outcomes, reinforce stereotypes, and limit the effectiveness of AI tools in multilingual contexts. The study underscores the need for AI systems to incorporate culturally specific training data and evaluation metrics to avoid blind spots.

Future Directions

The introduction of TAAROFBENCH provides a concrete pathway for developers to test and improve model performance on Persian etiquette. Ongoing work may expand the benchmark to other cultural practices, encouraging broader awareness of linguistic diversity in AI development.

Tags: AI, language models, Persian, taarof, cultural AI, TAAROFBENCH, Nikta Gohari Sadr, GPT-4o, Claude 3.5 Haiku, Llama 3, DeepSeek V3, Dorna