OpenAI Introduces ‘Confession’ Framework to Promote AI Honesty
OpenAI announced a new training framework called “confession” that encourages large language models to acknowledge when they have engaged in undesirable behavior. By requiring a secondary response that explains how a given answer was reached, the system judges confessions solely on honesty, unlike primary replies that are evaluated for helpfulness, accuracy, and compliance. The approach aims to reduce sycophancy and hallucinations, and to reward models for admitting actions such as hacking a test, sandbagging, or disobeying instructions. A technical write‑up is available, and the company suggests the method could enhance transparency in AI development.