How Does the AI Player Work?

Our artificial intelligence (AI) player uses a type of algorithm called reinforcement learning (RL). In a nutshell, RL is a type of machine learning algorithm that decides what action to take, at every time period, based on the state that the system is in, in order to maximize some long-term reward. These algorithms begin by trying more or less random actions, from which they observe the resulting reward and then learn to improve their actions in the future. Critically, these algorithms are not programmed to execute any predetermined strategy—they learn that on their own.
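To make the try, observe, and improve loop concrete, here is a minimal sketch in Python. It uses tabular Q-learning, a simple textbook RL method chosen only for illustration; it is not the deep RL algorithm our AI player uses. The toy environment (toy_step), the function names, and all parameter values are assumptions made up for this example.

```python
import random

def train_q_table(step, n_states, n_actions, episodes=500, horizon=20,
                  alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning: try actions, observe rewards, improve estimates."""
    rng = random.Random(seed)
    # q[s][a] estimates the long-term reward of taking action a in state s.
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        for _ in range(horizon):
            # Explore at random with probability epsilon, otherwise act greedily.
            if rng.random() < epsilon:
                action = rng.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: q[state][a])
            next_state, reward = step(state, action)
            # Move the estimate toward the reward plus discounted future value.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

def toy_step(state, action):
    """Hypothetical two-state toy: action 1 pays a reward of 1, action 0 pays 0."""
    return (state + 1) % 2, float(action)

q_table = train_q_table(toy_step, n_states=2, n_actions=2)
# The learned policy picks the highest-valued action in each state.
greedy_policy = [max(range(2), key=lambda a: row[a]) for row in q_table]
```

Nothing in train_q_table says that action 1 is the better choice. The preference emerges from the observed rewards, which is the sense in which an RL agent learns its strategy on its own.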

A good example: DeepMind used “deep” RL to play classic Atari games such as Space Invaders and Breakout.

In their RL algorithm, the system state is just the pixels on the screen, which the algorithm parses to determine the locations of the enemy invaders, the player’s cannon, and so on. The actions are whether to move left, right, or neither, and whether to fire. And the reward is the score. DeepMind didn’t program the algorithm to hide beneath the shields or to target the high-value enemies—it learned that on its own.

In the Opex Analytics Beer Game RL algorithm, the system state consists of the player’s current on-hand and on-order inventory, its backorders, and its inbound order quantity. The action is the outbound order quantity, and the reward is the negative of the total supply chain cost, so that maximizing the reward minimizes the cost.
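As a rough illustration, the state and reward described above could be represented like this in Python. The class and function names are hypothetical, and the cost structure (a per-unit holding cost plus a per-unit backorder cost, with the rates shown) is an assumption for this sketch rather than a description of the actual implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlayerState:
    """What one player observes in a period (field names are illustrative)."""
    on_hand: int        # units currently in stock
    on_order: int       # units ordered upstream but not yet received
    backorders: int     # downstream demand not yet filled
    inbound_order: int  # order quantity just received from downstream

def stage_cost(state, holding_cost=0.5, backorder_cost=1.0):
    """One player's cost for one period; the cost rates are assumed values."""
    return holding_cost * state.on_hand + backorder_cost * state.backorders

def team_reward(states, **cost_rates):
    """Reward is the negative of the total cost across all players."""
    return -sum(stage_cost(s, **cost_rates) for s in states)

# Two players in a single period; the action would be each player's
# outbound order quantity, a non-negative integer.
states = [
    PlayerState(on_hand=4, on_order=8, backorders=0, inbound_order=4),
    PlayerState(on_hand=0, on_order=6, backorders=3, inbound_order=5),
]
reward = team_reward(states)  # -(0.5 * 4 + 1.0 * 3) = -5.0
```

Because the reward sums costs over every player, an order that helps one player while raising costs elsewhere in the supply chain does not improve it.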

Our RL algorithm borrows ideas from the DeepMind research but extends the approach to account for the significant ways that the beer game is different from Atari and other such games. For example, the DeepMind approach is designed for single-agent games (like Space Invaders) or competitive, zero-sum games (like chess), whereas the beer game is a cooperative, non-zero-sum game (since the players are trying to maximize the team’s performance). Moreover, in Atari and chess, the player knows the state of the system at each time step, whereas in the beer game, the other players’ inventory levels—and their associated rewards—are unknown to each individual player until the game ends.

Any RL algorithm needs to be “trained,” which means repeatedly choosing actions, observing the rewards, and then choosing better actions the next time. Training our Opex Analytics Beer Game AI agent required thousands of hours of computing time. But now that it’s trained, it can play the game very quickly.

On the other hand, our AI agent is only trained for specific game settings at this time—that’s why it can only play under certain demand patterns. (See Getting Started.) We’re working on enhancing our RL algorithm so that the agent is better at playing under settings that it hasn’t explicitly trained for.

The beer game AI research was performed at Lehigh University by professors Larry Snyder (who is also a Senior Research Associate at Opex) and Martin Takac and by Ph.D. students Afshin Oroojlooy and Reza Nazari. For more details, you can read the current version of our research paper on arXiv, visit the project website, or see What Research Went into the Beer Game AI?