Our artificial intelligence (AI) player uses a type of algorithm called reinforcement
learning (RL). In a nutshell, RL is a type of machine learning algorithm that decides
what action to take, at every time period, based on the state that the system is in, in
order to maximize some long-term reward. These algorithms begin by trying more or less
random actions, from which they observe the resulting reward and then learn to improve
their actions in the future. Critically, these algorithms are not programmed to execute
any predetermined strategy—they learn that on their own.
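To make the trial-and-error idea concrete, here is a minimal sketch of tabular Q-learning, one simple RL method, on a toy corridor. It is not the algorithm used by our AI player. The corridor environment, the function names, and all parameter values are assumptions chosen for illustration only.

```python
import random

def train_q_table(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a toy corridor (hypothetical environment).

    The agent starts in cell 0 and earns a reward of 1 on reaching the last cell.
    """
    rng = random.Random(seed)
    moves = (-1, 1)                      # action 0 = left, action 1 = right
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        for _ in range(100):             # cap on episode length
            left, right = q[state]
            # Act randomly with probability epsilon, or when the agent has no preference yet
            if rng.random() < epsilon or left == right:
                action = rng.randrange(2)
            else:
                action = 0 if left > right else 1
            next_state = min(max(state + moves[action], 0), goal)
            done = next_state == goal
            reward = 1.0 if done else 0.0
            # Move the estimate toward the reward plus the discounted future value
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q

def greedy_policy(q):
    """Best known action in each cell, according to the learned values."""
    return ["left" if left > right else "right" for left, right in q]

if __name__ == "__main__":
    print(greedy_policy(train_q_table()))
```

Nothing in the code tells the agent to walk right. It starts with random moves and comes to prefer that direction only because doing so leads to the reward.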
Here is a good example. DeepMind used “deep” RL to play classic Atari games
like Space Invaders and Breakout.
In their RL algorithm, the system state is just the pixels on the
screen, which the algorithm parses to determine the locations of the enemy invaders,
the player’s cannon, and so on. The actions are whether to move left,
right, or neither, and whether to fire. And the reward is the score.
DeepMind didn’t program the algorithm to hide beneath the shields or to target the
high-value enemies—it learned that on its own.
In the Opex Analytics Beer Game RL algorithm, the system state
consists of the player’s current on-hand and on-order inventory, its backorders, and
its inbound order quantity. The action is the outbound order
quantity, and the reward is the negative of the total supply chain cost, so that
maximizing the reward minimizes the cost.
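As a rough illustration, the state and reward described above could be represented as follows. This is a sketch, not our actual implementation: the class name, the field names, and the cost rates (0.5 per unit held and 1.0 per unit backordered, per period) are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlayerState:
    """What one player knows about itself in a given period (hypothetical field names)."""
    on_hand: int        # units currently in inventory
    on_order: int       # units ordered from upstream but not yet received
    backorders: int     # units of demand not yet fulfilled
    inbound_order: int  # order quantity just received from the downstream player

def period_cost(state, holding_cost=0.5, backorder_cost=1.0):
    """Cost incurred by one player in one period (assumed cost rates)."""
    return holding_cost * state.on_hand + backorder_cost * state.backorders

def team_reward(states, holding_cost=0.5, backorder_cost=1.0):
    """Reward is the negative of the summed cost, so a higher reward means a lower cost."""
    return -sum(period_cost(s, holding_cost, backorder_cost) for s in states)

if __name__ == "__main__":
    states = [PlayerState(10, 4, 0, 3), PlayerState(0, 8, 6, 5)]
    print(team_reward(states))
```

The action does not appear as a field because it is the output of the agent: a single non-negative integer, the outbound order quantity.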
Our RL algorithm borrows ideas from the DeepMind research but extends the approach to
account for the significant ways that the beer game is different from Atari and other
such games. For example, the DeepMind approach is designed for single-agent games (like
Space Invaders) or competitive, zero-sum games (like chess), whereas the beer game is a
cooperative, non-zero-sum game (since the players are trying to maximize the team’s
performance). Moreover, in Atari and chess, the player knows the state of the system at
each time step, whereas in the beer game, the other players’ inventory levels—and their
associated rewards—are unknown to each individual player until the game ends.
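The limited-information point can be sketched in code: each player chooses its order from its own local state only, even though the reward depends on the whole team. Everything named here is an assumption for illustration, including the role names, the `LocalState` fields, and the simple base-stock rule, which stands in for a learned policy and is not our RL agent.

```python
from typing import Callable, Dict, NamedTuple

class LocalState(NamedTuple):
    on_hand: int
    on_order: int
    backorders: int
    inbound_order: int

def observe(global_state: Dict[str, LocalState], role: str) -> LocalState:
    """Return only what `role` can see during play: its own local state."""
    return global_state[role]

def base_stock_policy(target: int) -> Callable[[LocalState], int]:
    """Stand-in policy: order enough to bring the inventory position up to `target`."""
    def policy(state: LocalState) -> int:
        position = state.on_hand + state.on_order - state.backorders
        return max(0, target - position)
    return policy

def choose_orders(global_state, policies):
    """Each player decides from its own observation, never from the full state."""
    return {role: policies[role](observe(global_state, role)) for role in global_state}

if __name__ == "__main__":
    global_state = {
        "retailer": LocalState(5, 4, 2, 3),
        "wholesaler": LocalState(30, 0, 0, 6),
    }
    policies = {role: base_stock_policy(20) for role in global_state}
    print(choose_orders(global_state, policies))
```

In a single-agent Atari game, the policy would receive the full screen. Here each policy receives one player's slice of the state, which is part of what makes the learning problem harder.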
Any RL algorithm needs to be “trained,” a process of choosing actions, observing the
rewards, and then choosing better actions next time. Training our Opex Analytics Beer
Game AI agent required thousands of hours of computing time. But now that it’s trained,
it can play the game very quickly.
On the other hand, our AI agent is only trained for specific game settings at this
time—that’s why it can only play under certain demand patterns. (See Getting
Started.) We’re working on enhancing our RL algorithm so that the agent is
better at playing under settings that it hasn’t explicitly trained for.