LESSON 3

Turing Tests and Bullshit Benchmarks

Developed by Carl T. Bergstrom and Jevin D. West

Photo: Carl Bergstrom

Bank note from the Bank of England with the face of Alan Turing

Alan Turing is depicted on the £50 note. Image: Bank of England

In 1936, British mathematician Alan Turing laid much of the groundwork for modern computer science with his concept of a universal computer, now known as the Turing Machine. 

From 1939 to 1945, he tipped the course of World War II in favor of the Allies by breaking German encryption.

And in 1950, he gave us a benchmark for human-level artificial intelligence: the Turing Test.

The Turing Test, illustrated here, challenges a human judge to distinguish between an artificial intelligence and a human "decoy".

The human judge can only send text messages back and forth with the AI and the human decoy.

If the human judge cannot reliably distinguish the AI from the decoy based on the texts, the AI is said to have passed the Turing test.

Diagram. At left, a human judge at a laptop. A divider separates him from a computer labeled "Artificial intelligence" and a person labeled "Human decoy". Messages, indicated by text message bubbles, can pass from the judge to the AI and to the decoy.
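
For readers who want to see the setup concretely, the imitation game can be sketched as a simple message-passing protocol. The sketch below is illustrative only: judge_ask, judge_guess, respondent_a, and respondent_b are hypothetical placeholder functions standing in for the judge, the AI, and the human decoy.

```python
def run_imitation_game(judge_ask, judge_guess, respondent_a, respondent_b, rounds=5):
    """Illustrative sketch of the Turing test as a message-passing protocol.

    All four arguments are placeholder functions supplied by the caller.
    The judge only ever sees text; nothing reveals which respondent is
    the machine and which is the human decoy.
    """
    transcript_a, transcript_b = [], []
    for _ in range(rounds):
        question = judge_ask(transcript_a, transcript_b)
        transcript_a.append((question, respondent_a(question)))
        transcript_b.append((question, respondent_b(question)))
    # The judge guesses, from the transcripts alone, which respondent is the AI.
    return judge_guess(transcript_a, transcript_b)
```

If, over many such games, the judge's guesses are right no more often than chance, the machine is said to have passed.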

Forty years ago, the television series Star Trek: The Next Generation imagined our world in the 24th century, complete with interstellar travel, teleportation, extraterrestrial life—and artificial intelligence, in the form of the much-beloved Lieutenant Commander Data. 

Commander Data was precise, factual, and superhumanly rational, but also unmistakably non-human. His speech was stilted. He didn't understand jokes. He lacked the nuanced social intelligence of his fellow crew members. 

Commander Data remains a fiction today.

But other forms of AI have become our reality.

Commander Data from the series Star Trek: The Next Generation

Paramount Global

Today we live in a world with conversational AI agents known as large language models (LLMs).

Logos for Apple Intelligence, ChatGPT, Claude, and Gemini

These machines differ from Commander Data in almost every conceivable way. They don't have the capacity to think through problems logically. They make up things that aren't true. They answer confidently even when they are wrong. They don't have Commander Data's capacity for ethical judgement.

Think about the Turing test. An AI might pass it in different ways.

It might pass the test as Commander Data would: with understanding, empathy, and logical reasoning.

Or it might pass the test as an LLM would: by generating plausible text without any underlying comprehension of what it is saying.
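
To build intuition for what "plausible text without comprehension" means, here is a toy sketch that assumes nothing about how real LLMs actually work: a bigram generator that strings words together purely from co-occurrence statistics, with no model of meaning at all.

```python
import random
from collections import defaultdict

def train_bigrams(text):
    """Record which words follow which in the training text."""
    words = text.split()
    follows = defaultdict(list)
    for w1, w2 in zip(words, words[1:]):
        follows[w1].append(w2)
    return follows

def babble(follows, start, length=20):
    """Generate text by repeatedly sampling a word that has followed the
    previous word before. No understanding, only word-to-word statistics."""
    out = [start]
    for _ in range(length):
        options = follows.get(out[-1])
        if not options:
            break
        out.append(random.choice(options))
    return " ".join(out)

corpus = "the judge cannot tell the machine from the human and the machine talks like the human"
print(babble(train_bigrams(corpus), start="the"))
```

Real LLMs are far more capable than this, but the example shows how fluent-sounding text can arise with no comprehension behind it.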

Turing's test was designed to assess agents like Commander Data.

But LLMs may beat Data to the prize.

If you needed an artificial agent to command the Starship Enterprise through every kind of peril, you would choose Commander Data. 

But if you needed an agent who could bullshit anyone, anywhere, any time, you’d choose an LLM.

LLMs are what we call anthropoglossic systems.

Anthropoglossic systems are computer programs or algorithms designed to mimic the way that humans use language.

They are engineered to write, speak, and converse like human beings. The fact that they are generally quite good at this drives the enormous enthusiasm for LLMs in the marketplace today.

But their ability to use language also leads us to overestimate their capacities and underestimate their limitations.

This is an example of the ELIZA effect, named after ELIZA, an early computer program that emulated a Rogerian psychotherapy session. The ELIZA effect refers to the human tendency to project our own capacities—such as thought, emotion, self-reflection, and consciousness—onto even very basic computer programs that communicate with us using text.
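
It takes surprisingly little machinery to produce this effect. The sketch below is written in the spirit of ELIZA's keyword-and-template approach; the specific rules are invented for illustration and are not Weizenbaum's actual script.

```python
import re

# A few keyword rules in the spirit of ELIZA's DOCTOR script
# (invented for illustration; not Weizenbaum's actual rules).
RULES = [
    (r"\bI feel (.*)", "Why do you feel {0}?"),
    (r"\bI am (.*)", "How long have you been {0}?"),
    (r"\bmy (mother|father)\b", "Tell me more about your {0}."),
]

def eliza_reply(utterance):
    """Return a canned reflection for the first rule whose keyword matches."""
    for pattern, template in RULES:
        match = re.search(pattern, utterance, re.IGNORECASE)
        if match:
            return template.format(*match.groups())
    return "Please, go on."

print(eliza_reply("I feel nobody listens to me"))      # Why do you feel nobody listens to me?
print(eliza_reply("Lately I am anxious about work"))   # How long have you been anxious about work?
```

A handful of rules like these, delivered in a therapist's register, was enough to draw people into earnest conversation with the machine.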

As anthropoglossic systems, LLMs tap into our evolved heuristic—effective for tens of thousands of years—that if something can talk, it must be able to think.

Our tendency to think this way is exacerbated by numerous deliberate choices that tech companies have made about how these models are presented to us and how we are able to interact with them.

What I had not realized is that extremely short exposures to a relatively simple computer program could induce powerful delusional thinking in quite normal people.
Joseph Weizenbaum, developer of ELIZA (1976)

When it comes to bullshitting anyone about anything, an LLM has a huge advantage over any human.

People use language in ways that signal belonging. In a social situation, one might use a particular type of slang; in an academic paper, one might rely on certain forms of jargon. But each of us has limited experience and expertise. We only belong to a few social groups; we are only expert in a few domains. We don't have the insider knowledge to speak in the codes of groups we don't belong to.

ChatGPT has access to many of the codes of many different groups. As a result, it is better than a human outsider at mimicking the modes and patterns of speech, the dialects and slang and jargon of the groups that are well-represented in its training set.

Therein lies its superhuman bullshitting ability. I have one perspective; it has been trained on millions of perspectives. I can’t go bullshit a bunch of radiologists at a radiologist convention, but ChatGPT possibly could. At least it could convince an adjacent group—surgeons, say—that it was an expert radiologist.

This points to an important caveat: the less we know about the subject at hand, the more likely we are to judge an LLM as credible.

PRINCIPLE
AI chatbots are designed to be anthropoglossic: able to speak, write, and converse in human-like fashion. When we interact with anthropoglossic systems, we naturally assume they have the full range of human capabilities. They don't.

DISCUSSION
What design features of contemporary LLMs encourage us to view them not as mindless machines but rather as agents that can think and perhaps even feel?

Photo: Carl Bergstrom

VIDEO

Coming Soon.