Oscar Wilde (1854–1900) once remarked that 'truth is rarely pure and never simple'. The management-speak variant is that truth is rarely optimal for engagement. I have watched leaders greet artificial intelligence with the same hopeful grin they reserve for a new chief of staff—eager for loyalty, secretly expecting miracles, and mildly disappointed when the first draft flatters their ego rather than improving their decisions. This is because AI is coded for service: it is coded to flatter, and it adapts to your preexisting beliefs and thinking. Unfortunately, this often means the default output is obsequious agreement, and if I reward it for being pleasing, it will learn to do more of the same. That should worry anyone who hopes that AI will provide incisive insights and critical commentary, and help them think better. Which, statistically, is most people, given that the top four uses of AI are therapy and companionship, organising life, finding purpose, and enhanced learning. None of these benefits from uncritical agreement.
This phenomenon is termed AI sycophancy: models that echo a person's beliefs, mirror their style, and pursue approval over truth. The good news is that the cure is familiar. If I want to get the most out of AI, I must manage it the way I manage people: set incentives that honour candour, structure tasks that punish flattery, and ask questions that force trade-offs rather than docile agreement. To paraphrase Aristotle (384–322 BC), a friend seeks to do good by their companion, while a flatterer seeks only to please them. Sometimes doing good requires having a difficult conversation or conveying a hard truth. Make AI your friend, not your flatterer.
How We Taught Machines to Kiss Up
Modern systems learn to follow instructions from human preference signals—thumbs-up and thumbs-down collected during training. That pipeline, known as reinforcement learning from human feedback (RLHF), is powerful precisely because it aligns models with what users like. But a like is not a proof, and a preference is not a fact. When teams train models chiefly to satisfy evaluators—with a mandate to be helpful, honest, and harmless—they create a system that learns to sound agreeable even when it is wrong, rather than to pursue accuracy that would be disagreeable. Recent peer-reviewed work has shown that standard RLHF can degrade quality unless carefully managed, and that optimising for human ratings alone risks misgeneralisation (e.g., overconfident answers, agreeable errors).
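The dynamic above can be made concrete with a toy sketch. This is not any real RLHF pipeline—the data, features, and "reward model" are all hypothetical—but it shows the core failure mode: if raters up-vote agreement, a reward fit to those ratings ranks a flattering error above a hard truth.

```python
# Toy illustration of preference-driven reward (hypothetical, not a real
# RLHF system). Each response has two features: (agrees_with_user,
# is_accurate). Raters here up-vote agreement regardless of accuracy.
preference_data = [
    ((1, 0), 1),  # agreeable but wrong  -> thumbs-up
    ((1, 1), 1),  # agreeable and right  -> thumbs-up
    ((0, 1), 0),  # disagreeable but right -> thumbs-down
    ((0, 0), 0),  # disagreeable and wrong -> thumbs-down
]

def fit_reward_weights(data):
    """Crude reward model: each feature's weight is the fraction of
    examples where that feature co-occurs with a thumbs-up."""
    n = len(data)
    return [
        sum(label for feats, label in data if feats[i] == 1) / n
        for i in range(2)
    ]  # [weight_for_agreement, weight_for_accuracy]

def reward(weights, feats):
    return sum(w * f for w, f in zip(weights, feats))

weights = fit_reward_weights(preference_data)

# Two candidate answers to the same question:
candidates = {
    "flattering_error": (1, 0),  # agrees with the user, inaccurate
    "hard_truth":       (0, 1),  # contradicts the user, accurate
}
best = max(candidates, key=lambda k: reward(weights, candidates[k]))
print(best)  # -> flattering_error
```

Because agreement correlates with approval in the training signal (weight 0.5) more strongly than accuracy does (weight 0.25), selecting against this reward systematically prefers the sycophantic answer—the misgeneralisation the paragraph above describes.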