Reinforcement Learning Explained Using the Beer Game

Michael Watson, Ph.D., Partner
Larry Snyder, Ph.D., Senior Research Associate
Read time: approx. 5 minutes

Space Invaders, Go, and the Beer Game

Reinforcement learning (RL) is a hot topic in the artificial intelligence/machine learning world these days. Google’s DeepMind group used “deep” RL to play classic Atari games like Space Invaders and Breakout, often outperforming expert human players. DeepMind also stunned the Go community when its AlphaGo program beat world-champion Go player Lee Sedol in 2016.

My colleague Martin Takáč and I, along with our Ph.D. students at Lehigh University, Afshin Oroojlooy and Reza Nazari, have built a deep RL algorithm to play the beer game. And Opex Analytics developers have been working on a user interface for the beer game that showcases the algorithm — and can also be played in the usual way (for free) in classrooms or other settings. (We’ll be back on these pages in a few months once the game is released.)

If you’ve ever taken a supply chain class, you’ve probably played the beer game. The game involves four players representing four stages of a supply chain: retailer, wholesaler, distributor, and manufacturer. Each player must decide how much to order from its upstream partner, given the order quantity that it received in the current time period. The players’ goal is to minimize the total cost of the supply chain, but they cannot communicate with each other during the game.

RL in a Nutshell

So, what is RL? In short, it is a type of machine learning algorithm that, at each time period, chooses an action based on the current state of the system, with the goal of maximizing some long-term reward. These algorithms begin by trying more or less random actions, observing the resulting rewards, and then learning to improve their actions in the future. Critically, they are not programmed to execute any pre-determined strategy — they learn one on their own.
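That loop — try an action, observe the reward, improve future actions — can be sketched as a minimal tabular Q-learning agent. The tiny single-item inventory "environment" below is purely illustrative (our assumption for the sketch), not the beer game itself or our actual algorithm:

```python
import random
from collections import defaultdict

ACTIONS = [0, 1]                  # order 0 units or 1 unit this period
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2

Q = defaultdict(float)            # Q[(state, action)] -> long-term reward estimate

def step(inventory, order):
    """One period: receive the order, then serve a demand of 1 unit."""
    inventory = min(inventory + order, 3)
    if inventory >= 1:
        inventory -= 1
        return -inventory, inventory   # holding cost on leftover stock
    return -5, inventory               # stockout penalty

def choose_action(state):
    # Mostly random at first (exploration); increasingly greedy later.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def train(periods=5000):
    state = 0
    for _ in range(periods):
        action = choose_action(state)
        reward, next_state = step(state, action)
        # Nudge the estimate toward the observed reward plus the discounted
        # best future value -- this update is what "learning" means here.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state
```

After training, the agent's greedy action when inventory is empty is to order rather than risk the stockout penalty — behavior it discovers on its own rather than being told.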

For example, when the DeepMind RL algorithm plays Space Invaders, the system state is just the pixels on the screen, which the algorithm parses to determine the locations of the enemy invaders, the player’s cannon, and so on. The actions are whether to move left, right, or neither, and whether to fire. And the reward is the score. DeepMind didn’t program the algorithm to hide beneath the shields or to target the high-value enemies — it learned that on its own.

RL and the Beer Game

In our beer game RL algorithm, the system state consists of the player’s current on-hand and on-order inventory, its backorders, and its inbound order quantity. The action is the outbound order quantity, and the reward is the negative of the total supply chain cost (negative because RL maximizes reward, while the players aim to minimize cost).
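One way to encode that state and reward is sketched below. The field names and cost coefficients are assumptions for illustration, not our exact formulation:

```python
from dataclasses import dataclass

@dataclass
class BeerGameState:
    on_hand: int        # units currently in inventory
    on_order: int       # units ordered upstream but not yet received
    backorders: int     # unfilled downstream demand still owed
    inbound_order: int  # the order just received this period

def period_reward(state: BeerGameState,
                  holding_cost: float = 0.5,
                  backorder_cost: float = 1.0) -> float:
    # RL maximizes reward, so the period's cost enters with a minus sign.
    return -(holding_cost * state.on_hand + backorder_cost * state.backorders)
```

For example, with 4 units on hand and no backorders, `period_reward(BeerGameState(on_hand=4, on_order=6, backorders=0, inbound_order=3))` evaluates to `-2.0` under these illustrative costs.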

Our beer game RL algorithm borrows ideas from the DeepMind research but extends the approach to account for the significant ways that the beer game is different from Atari and Go. For example, the DeepMind approach is designed for single-agent games (like Space Invaders) or competitive, zero-sum games (like Go), whereas the beer game is a cooperative, non-zero-sum game (since the players are trying to maximize the team’s performance). Moreover, in Atari and Go, the player knows the state of the system at each time step, whereas in the beer game, the other players’ inventory levels — and their associated rewards — are unknown to each individual player until the game ends.

Any RL algorithm needs to be “trained” — the process of choosing actions, observing the rewards, and then choosing better actions next time. Before our algorithm was trained, it was, unsurprisingly, a terrible beer game player. For example, when the RL algorithm played the role of the wholesaler, it ordered far too much, resulting in huge inventory levels at the wholesaler and big backorders upstream. (Inventory shows as brown boxes and positive numbers in the screenshot below, while backorders show as red boxes and negative numbers.)

After the algorithm played the game a few thousand times, it got better. It learned to keep the inventory levels at the wholesaler lower — in fact, it started to keep them too low:

But after playing for a while longer, the algorithm learned that the optimal strategy is a so-called base-stock policy, or order-up-to policy. It learned that it didn’t need much inventory at the wholesaler in order to keep the retailer well-stocked, and that since there is no cost for backorders upstream (only at the retailer), keeping inventory levels low or even negative there can be effective:
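A base-stock (order-up-to) policy is simple to state in code. In the sketch below — our illustration, not the algorithm's actual implementation — the base-stock level `S` is a parameter that, in effect, the agent learns, and the other variables mirror the state described earlier:

```python
def base_stock_order(S, on_hand, on_order, backorders):
    """Order whatever is needed to raise the inventory position back to S.

    Inventory position = on-hand stock + pipeline stock - backorders.
    A small (or even negative) S corresponds to deliberately running lean,
    which pays off when upstream backorders carry no cost.
    """
    position = on_hand + on_order - backorders
    return max(0, S - position)
```

For example, with a base-stock level of 8, 3 units on hand, 2 in the pipeline, and 1 backordered, the policy orders 4 units to bring the inventory position back up to 8.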

(Note that the screenshot shows inventory levels at the end of a period. An inventory level of 0 is ideal — it means we met the demand perfectly, with no extra inventory and no shortages.)

We didn’t explicitly tell our code to follow a base-stock policy or to exploit the free upstream stockouts — it learned that on its own.

Here’s a video that shows how the algorithm learns as it trains:

Where to Next?

It’s still early days for this research: for now, our algorithm can only handle simple demand structures, and we can’t yet operate more than one player at a time using an RL agent. But we think of this as a proof-of-concept demonstration that machine learning algorithms can recommend good actions in realistic supply chain environments. In the future, we’ll be applying these ideas to more complicated decisions in more complicated supply chains, like yours.

Visit the project website for updates on our research. You can also read the current version of our research paper on arXiv. And if you’d like to be notified when the Opex Analytics Beer Game is released, visit this page and leave us your e-mail address.

This post also appeared in Supply Chain Digest.