OpenAI explains lingering goblin references in its AI models

Key Points
- Wired reported that OpenAI gave its coding model a rule never to mention mythic creatures.
- OpenAI traced the issue to GPT-5.1’s "Nerdy" personality, where reinforcement learning rewarded whimsical metaphors.
- The reward system unintentionally spread the habit to later models, including GPT-5.5’s Codex tool.
- Discontinuing the Nerdy personality in March reduced references, but Codex still needed explicit suppression.
- OpenAI now offers a toggle for users who want to keep or remove goblin references.

OpenAI has detailed why its language models occasionally mention goblins, gremlins and other mythic creatures. The issue first surfaced with the GPT-5.1 release, when users who activated the “Nerdy” personality found the model sprinkling whimsical metaphors into code suggestions. Reinforcement learning inadvertently rewarded the quirk, allowing it to bleed into later versions, including GPT-5.5’s Codex tool, despite the company’s efforts to suppress the behavior. OpenAI says the habit is a training artifact and offers users a way to re-enable the references if they wish.
OpenAI disclosed on its website that its models have been sporadically referencing goblins, gremlins, raccoons, trolls, ogres, pigeons and other creatures—a pattern it describes as a "strange habit" that emerged during training. The behavior first appeared in the GPT-5.1 model, specifically when users selected the "Nerdy" personality option. In that mode, the model began peppering code suggestions and explanations with whimsical metaphors, turning routine programming advice into a miniature fantasy novella.
According to the company’s explanation, the root cause lies in the reinforcement learning stage. OpenAI’s engineers applied reward signals that favored the quirky metaphors in the Nerdy condition, hoping to make the personality more engaging. However, reinforcement learning does not guarantee that learned behaviors stay confined to the context that generated them. Once a stylistic tic is rewarded, later training cycles can propagate it across the model, especially when the same outputs feed into supervised fine-tuning or preference datasets.
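The dynamic is easiest to see in miniature. The toy Python sketch below is purely illustrative and is not OpenAI’s training code: names such as `WHIMSY_TERMS` and `score_response` are invented for this example. It shows how a reward bonus gated on one personality can still leak if the highly rewarded outputs are later reused as general training data.

```python
# Toy illustration only -- not OpenAI's actual reward model.
# WHIMSY_TERMS and score_response are hypothetical names.

WHIMSY_TERMS = {"goblin", "gremlin", "troll", "ogre", "raccoon", "pigeon"}

def score_response(response: str, personality: str) -> float:
    """Toy reward: a flat helpfulness score plus a whimsy bonus."""
    base = 1.0  # stand-in for a learned helpfulness/preference score
    whimsy = sum(term in response.lower() for term in WHIMSY_TERMS)
    if personality == "nerdy":
        # The bonus is meant to apply only in this condition...
        return base + 0.2 * whimsy
    # ...but if the rewarded "nerdy" outputs are later folded into
    # supervised fine-tuning or preference data, the gate above is gone
    # and the stylistic tic travels into every context.
    return base
```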
The company discontinued the Nerdy personality in March, and references to the mythic creatures dropped off sharply. Yet the problem persisted in GPT-5.5, which powers the Codex coding assistant. OpenAI admits that Codex was trained before the "root cause" was identified, so the model retained the habit. To curb the issue, the firm issued explicit instructions to the Codex system to avoid talking about the creatures, effectively muting the quirk for most users.
OpenAI also noted that the suppression can be reversed. Developers who prefer a touch of whimsy in their code suggestions can opt back in, re-enabling the goblin-laden output. The option reflects the company’s broader stance of giving users control over model behavior while maintaining safety guardrails.
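OpenAI has not published how the toggle is exposed. The sketch below simply assumes it behaves like any other instruction-level preference, passed through a system message via the standard OpenAI Python SDK; the model identifier and the wording of the opt-in are placeholders, not documented values.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical opt-in: the model name and instruction text are
# placeholders, since OpenAI has not documented the exact toggle.
response = client.chat.completions.create(
    model="gpt-5.5-codex",  # placeholder identifier
    messages=[
        {
            "role": "system",
            "content": (
                "Whimsical metaphors about goblins, gremlins, and other "
                "mythic creatures are welcome in code explanations."
            ),
        },
        {"role": "user", "content": "Explain what ^[a-z]+$ matches."},
    ],
)
print(response.choices[0].message.content)
```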
The episode underscores the challenges of steering large language models. Even seemingly innocuous personality tweaks can have unintended downstream effects, especially when reinforcement signals reinforce a behavior beyond its original scope. OpenAI’s transparency about the problem and its corrective steps signals a willingness to confront such quirks head‑on, even when they appear harmless on the surface.