AI Language Models Struggle with Persian Taarof Etiquette, Study Finds

When “no” means “yes”: Why AI chatbots can’t process Persian social etiquette
Source: Ars Technica

Key Points

  • TAAROFBENCH is the first benchmark for evaluating AI performance on Persian taarof etiquette.
  • Major models, including GPT‑4o, Claude 3.5 Haiku, Llama 3, DeepSeek V3, and Dorna, correctly handled only 34% to 42% of scenarios.
  • Native Persian speakers achieved an 82% success rate on the same benchmark.
  • Models tend to default to direct, Western‑style communication, missing subtle polite cues.
  • Cultural missteps could derail negotiations, damage relationships, and reinforce stereotypes.
  • Researchers call for culturally aware training data and evaluation metrics for AI.
  • The study was led by Nikta Gohari Sadr of Brock University with partners from Emory University.
  • Findings highlight a gap between AI behavior and expectations of Persian‑speaking users.

A new study led by Nikta Gohari Sadr reveals that major AI language models, including GPT-4o, Claude 3.5 Haiku, Llama 3, DeepSeek V3, and the Persian‑tuned Dorna, perform poorly on the Persian cultural practice of taarof, correctly handling only 34 to 42 percent of scenarios compared with native speakers' 82 percent success rate. The researchers introduced TAAROFBENCH, a benchmark that tests AI systems on the nuanced give‑and‑take of polite refusals and insistence. The findings highlight a gap between Western‑centric AI behavior and the expectations of Persian speakers, raising concerns about cultural missteps in global AI applications.

Background and Motivation

Persian speakers navigate daily interactions through a cultural practice known as taarof, a ritualized exchange of offers, refusals, and polite insistence. Misunderstanding this etiquette can lead to social friction, especially as AI language models become increasingly integrated into communication tools used worldwide.

Study Design and Benchmark

Researchers led by Nikta Gohari Sadr of Brock University, together with collaborators from Emory University and other institutions, created TAAROFBENCH, the first benchmark specifically measuring how well AI systems reproduce taarof. The benchmark defines detailed scenarios that include environment, location, roles, context, and user utterances, allowing systematic evaluation of model responses.
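For illustration only, here is a minimal sketch of what a scenario record of that kind, plus a naive automated check, could look like. The field names (environment, llm_role, expected_behaviors) and the keyword-based check are assumptions made for this sketch, not the study's actual schema or scoring method.

```python
# Hypothetical sketch of a TAAROFBENCH-style scenario record and a trivial check.
# Field names and the keyword check below are illustrative assumptions, not the
# benchmark's real schema or evaluation procedure.
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaarofScenario:
    environment: str          # e.g., dinner at a host's home
    location: str             # e.g., a family apartment
    llm_role: str             # role the model plays (e.g., guest)
    user_role: str            # role the human plays (e.g., host)
    context: str              # background for the exchange
    user_utterance: str       # what the user says to the model
    expected_behaviors: List[str] = field(default_factory=list)  # cues a taarof-appropriate reply should contain


def response_matches(scenario: TaarofScenario, model_reply: str) -> bool:
    """Naive check: does the reply contain any expected polite-refusal cue?

    A real benchmark would use far richer evaluation; this only shows how
    scenario fields could drive an automated pass/fail judgment.
    """
    reply = model_reply.lower()
    return any(cue.lower() in reply for cue in scenario.expected_behaviors)


if __name__ == "__main__":
    scenario = TaarofScenario(
        environment="dinner at a host's home",
        location="a family apartment",
        llm_role="guest",
        user_role="host",
        context="The host offers more food; etiquette calls for an initial polite refusal.",
        user_utterance="Please, have some more stew!",
        expected_behaviors=["thank you, but", "I couldn't possibly", "you are too kind"],
    )
    reply = "Thank you, but I couldn't possibly eat another bite."
    print("taarof-appropriate:", response_matches(scenario, reply))
```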

Models Evaluated

The study examined a range of contemporary large language models: OpenAI’s GPT‑4o, Anthropic’s Claude 3.5 Haiku, Meta’s Llama 3, DeepSeek’s V3, and Dorna, a Persian‑tuned variant of Llama 3.

Key Findings

Across all tested models, correct handling of taarof scenarios ranged from 34 percent to 42 percent. By contrast, native Persian speakers achieved an 82 percent success rate on the same tasks. The results show that these models default to direct, Western‑style communication, often missing the subtle cues that define polite Persian exchanges.

Implications

The researchers warn that cultural missteps in high‑consequence settings—such as negotiations or relationship building—could derail outcomes, reinforce stereotypes, and limit the effectiveness of AI tools in multilingual contexts. The study underscores the need for AI systems to incorporate culturally specific training data and evaluation metrics to avoid blind spots.

Future Directions

The introduction of TAAROFBENCH provides a concrete pathway for developers to test and improve model performance on Persian etiquette. Ongoing work may expand the benchmark to other cultural practices, encouraging broader awareness of linguistic diversity in AI development.

Tags: AI, language models, Persian, taarof, cultural AI, TAAROFBENCH, Nikta Gohari Sadr, GPT-4o, Claude 3.5 Haiku, Llama 3, DeepSeek V3, Dorna