Against Turing Tests: Don't Teach AIs to Lie to Humans

Post co-written with Amber Dawn


In 1950, Alan Turing suggested that machines might one day be able to deceive a human in an ‘imitation game’ - a game where a human judge interacts with both a computer and a human and has to distinguish which is which. Since then, ‘Turing tests’ have become a well-known benchmark of AI sophistication.

Until 2020, the Loebner Prize offered rewards for AIs that could pass a Turing test. Judges would interact with chatbots and with humans, and would have to decide which was the bot. The most human-like programs were awarded prizes of $2000. The organisers also offered a $100,000 prize for the first program that judges could not distinguish from a human in a test that involved deciphering text, visual and auditory input. Several Metaculus forecasting questions refer to Turing tests, with forecasters asking (for example) ‘When will an AI first pass a long, informed, adversarial Turing test?’. (Here are some other examples.)

I think that we shouldn’t use Turing tests as a benchmark for AI capabilities. At least, we certainly shouldn’t train AIs with a view to having them pass Turing tests. Why? Passing a Turing test involves deception by definition. The AI is trying to mislead testers into thinking that it’s a human. It might do this by learning to ape human-like speech very accurately, or even by lying outright.

An AI with this skill could be very dangerous. First, if an AI can convincingly pass as human, it could much more easily attain power and resources under false pretenses. Second, if an AI can deceive people in the context of a Turing test, it’s likely that it can also deceive people in other domains (for example, it could mislead humans about its true capabilities).

Turing tests have other flaws, too. For example, they are anthropocentric: they reward the ability to use human language and adopt human social norms, and ignore other capacities that are potentially equally dangerous. They can therefore be ineffective as a benchmark for assessing an AI’s true power. An AI might fail a Turing test because it cannot use language in a convincingly human-like way, yet still have worrisome capabilities or tendencies, such as the ability to hack programs or control robotics.

What are some alternatives?

Coming up with AI benchmarks is hard, but I suggest that it’s better to measure whether an AI can achieve certain concrete technical feats. For some of these, we could still require that the AI’s output be indistinguishable from human-produced versions; but this wouldn’t involve actually training an AI to lie, mislead, or otherwise give people the wrong impression. For example, some benchmarks could be:

  • Can an AI write a scientific paper that passes peer review as a paper written by a human? (This might get at PASTA AI as defined by Holden Karnofsky.)
  • Can an AI perform certain specific, well-defined tasks?
  • Can an AI outperform humans on the APPS benchmark for programming (or pass other programming milestones)?

Programming benchmarks might be particularly important because many of the major AI labs are releasing AIs focussed on coding. Examples include OpenAI’s Codex (released August 2021) and DeepMind’s AlphaCode (released February 2022), and GPT-4 will apparently focus more on coding than its predecessors.
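To make the contrast with a Turing test concrete, here is a minimal sketch of how a task-based programming benchmark can score a submission: run it against test cases and count how many it gets right. Everything in it (the `pass_rate` function, the squaring task, the two candidate solutions) is a hypothetical illustration and not the actual APPS harness. The point is only that the score depends on whether the task was done, not on whether a judge was fooled.

```python
from typing import Callable, List, Tuple

# Hypothetical format: each test case pairs an input with the expected output.
TestCase = Tuple[int, int]


def pass_rate(candidate: Callable[[int], int], tests: List[TestCase]) -> float:
    """Return the fraction of test cases the candidate solves correctly."""
    passed = 0
    for arg, expected in tests:
        try:
            if candidate(arg) == expected:
                passed += 1
        except Exception:
            # A crash counts as a failed test case.
            pass
    return passed / len(tests)


# Illustrative task (an assumption for this sketch): return n squared.
SQUARE_TESTS: List[TestCase] = [(0, 0), (1, 1), (2, 4), (-3, 9)]


def correct_solution(n: int) -> int:
    return n * n


def buggy_solution(n: int) -> int:
    return n * 2  # only coincides with n squared for n = 0 and n = 2


print(pass_rate(correct_solution, SQUARE_TESTS))  # 1.0
print(pass_rate(buggy_solution, SQUARE_TESTS))    # 0.5
```

Nothing in this scoring rewards the program for seeming human, so optimising for it does not directly train deception.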

If we focus on creating AIs that can do certain tasks, rather than training them to pass as human, we’re less likely to implant in them dangerous tendencies that could get out of our control.