OpenAI Responds to Its Models’ “Goblin” Quirks: Codex Was Instructed Not to Mention Mythical Creatures

OpenAI stated that these metaphorical expressions involving goblins and other creatures were first noticeably observed in the GPT-5.1 model, particularly when the “Nerdy” persona option was enabled. According to the company, this expression did not disappear with subsequent model iterations but instead gradually spread.

In its explanation, OpenAI pointed out that the root of the problem lies in reinforcement learning training: although the relevant rewards were initially applied only under the “Nerdy” persona condition, reinforcement learning cannot guarantee that the learned behavior will always be strictly limited to the conditions that trigger it. Once a language style or expression preference is rewarded, subsequent training processes may spread it to other scenarios, and this tendency may be further reinforced, especially when these outputs are repeatedly used for supervised fine-tuning or preference data training.

The report stated that with OpenAI discontinuing the “Nerdy” persona in March of this year, these references to goblins and gremlins have indeed decreased, but have not disappeared completely. In particular, in the GPT-5.5 model used by the Codex programming tool, because OpenAI had already begun training the model before identifying the “root cause,” related expressions still remain.

For this reason, OpenAI ultimately had to add very specific constraints to Codex, explicitly instructing it not to mention these mythical creatures. However, the report also mentioned that if someone actually wants their AI to retain a bit of this “goblin style” when writing code, OpenAI has even publicly shared a method to undo these restrictions.

From this response, it can be seen that behind this seemingly absurd “goblin problem” lies a more realistic challenge in large model training: certain language habits that should originally only appear in specific persona settings may spill over into broader model behavior under the cumulative effect of reward mechanisms and subsequent training. For OpenAI, this is not only a public explanation of a model style going out of control, but also allows the outside world to glimpse the complexity it faces in correcting subtle behavioral deviations in generative AI.