From Cats and Pigeons to AlphaGo

In 1905, American psychologist Edward Thorndike proposed the Law of Effect through his puzzle box experiments, in which he placed cats inside boxes that could only be opened by performing certain actions, like pressing a lever. Thorndike observed that when the cats were rewarded with food after escaping, they repeated the behaviours that led to success. From this, he concluded that actions followed by satisfying outcomes are more likely to recur. 

Decades later, Burrhus Frederic Skinner expanded on Thorndike’s work. During World War II, he launched Project Pigeon, an attempt to guide missiles using pigeons trained to peck at enemy targets. Skinner rewarded the birds with food when they pecked correctly, shaping their behaviour through reinforcement. The plan was to place the pigeons into the nose of a warhead, where they would steer it by pecking at a moving image of the target.

Though the project was never deployed, its legacy lasted far longer than the war. Skinner’s experiments helped define the theory of Operant Conditioning: the idea that behaviour is strengthened or weakened by its consequences. It became one of the most powerful explanations of how humans and animals learn and, unknowingly, a blueprint for how machines would one day do the same.

In reinforcement learning, a key branch of Artificial Intelligence, the same principle of trial and error resurfaces in digital form. An AI agent interacts with its environment, performs actions, and receives feedback in the form of rewards or penalties, much as a pet earns a treat for completing a task or misses out when it fails. Over time, the agent learns which strategies maximize its long-term reward.

In the words of Richard S. Sutton and Andrew G. Barto, “Reinforcement learning problems involve learning what to do—how to map situations to actions—so as to maximize a numerical reward signal.”
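That mapping from situations to actions can be sketched in a few lines of code. The toy example below is a hypothetical three-lever "bandit" problem, not any real system's API: the agent follows an epsilon-greedy strategy, mostly pulling the lever it currently believes pays best but occasionally exploring, and it updates its value estimates from reward feedback alone.

```python
import random

def train_bandit(true_means, episodes=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy trial-and-error learning on a toy bandit problem.

    Each 'lever' pays a noisy reward centred on its true mean; the agent
    never sees the means, only the rewards it earns.
    """
    rng = random.Random(seed)
    n = len(true_means)
    estimates = [0.0] * n   # the agent's learned value of each action
    counts = [0] * n        # how often each action has been tried

    for _ in range(episodes):
        # Explore occasionally; otherwise exploit the best-known action.
        if rng.random() < epsilon:
            action = rng.randrange(n)
        else:
            action = max(range(n), key=lambda a: estimates[a])

        # Environment feedback: a noisy reward around the lever's true mean.
        reward = rng.gauss(true_means[action], 1.0)

        # Incremental running-average update: this is the "learning" step.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]

    return estimates

# After enough trials, the agent's estimates should rank the levers
# correctly, favouring the lever with true mean 2.0.
values = train_bandit([0.0, 1.0, 2.0])
best = max(range(3), key=lambda a: values[a])
```

Nothing here is specific to games or robots; the same explore-exploit-update loop, scaled up with neural networks in place of a small table of estimates, underlies the systems described below.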

This approach reached a milestone with Google DeepMind’s AlphaGo, the program that famously defeated 18-time world champion Lee Sedol in 2016. Go, an ancient Chinese board game, has an estimated 10¹⁷⁰ possible board configurations. AlphaGo first studied a vast library of games played by humans, then improved by playing game after game against itself, an echo of Skinner’s training cycles.

In the match, AlphaGo played with startling creativity. It made a move, known as Move 37, so unconventional that experts estimated only a 1 in 10,000 chance a human would choose it. That single move marked a turning point: AI was no longer just imitating human intelligence; it was demonstrating a surprising creativity of its own.

Today, the same reinforcement principles guide self-driving cars, robotic systems, and adaptive algorithms across industries, enabling decisions even in unpredictable environments. And yet their foundation remains unchanged: trial, feedback, and improvement.

So, the next time you see a pigeon pecking at crumbs, remember that its behaviour helped inspire the mechanics of modern intelligence. From Thorndike’s cats to Skinner’s pigeons to AI models like AlphaGo, DeepSeek’s R1, OpenAI’s o1, and Anthropic’s Claude Opus 4, the thread is clear. Learning, whether human, animal, or artificial, always begins the same way: by trying, failing, and trying again until you succeed.

– Nikunj Kohli