Reinforcement Learning: Bringing Use Cases to Life


Authors: Sofie Verrewaere, Senior Data Scientist, Daniel Miskell, Machine Learning Engineer, Hiru Ranasinghe, Machine Learning Engineer

Reinforcement Learning (RL) has been one of the most talked-about Machine Learning trends over the last few years. Often used in the world of gaming, RL leverages fast-paced trial and error to develop powerful solutions. By optimising decisions based on reward maximisation, RL has the potential to transform other industries too. However, identifying specific use cases and creating the conditions for RL applications to be successful can be challenging.

In this two-part blog series, Datatonic’s Data Scientists and Machine Learning Engineers cover the core concepts of RL and its real-world applications, and provide guidance on identifying suitable use cases and deploying them successfully.


What is Reinforcement Learning?

Reinforcement Learning (RL) is a type of Machine Learning in which the algorithm learns to interact with an environment to maximise a reward through trial and error. RL is a general framework for learning sequential decision-making tasks; it is about discovering the actions that lead to the desired outcome.


Fig 1: A typical RL framework.

The RL algorithm, or RL agent, learns smart behaviour by taking actions in an environment and receiving feedback (rewards and states). Rewards can be positive or negative (e.g. penalties) and are the only signal used to determine if the agent is doing well or not. 

The state is the current situation of the environment as the agent sees and experiences it, and it is the information the agent uses to decide its next action. We assume that the state contains all the information relevant to that choice.
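The loop of actions, states and rewards described above can be sketched in a few lines of Python. This is a minimal illustration, not a real RL library: the `LineTrack` environment, the two policies and the reward values are all made up for this example.

```python
import random


class LineTrack:
    """Hypothetical toy environment: start at position 0, reach position `goal`."""

    def __init__(self, goal=5):
        self.goal = goal
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the state is simply the current position

    def step(self, action):
        # action is -1 (step back) or +1 (step forward); position never drops below 0
        self.position = max(0, self.position + action)
        done = self.position == self.goal
        # Arbitrary reward values: a small penalty per step, a bonus at the goal
        reward = 1.0 if done else -0.1
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Let the agent act until the episode ends; return the total reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # agent chooses an action from the state
        state, reward, done = env.step(action)  # environment returns feedback
        total_reward += reward
        if done:
            break
    return total_reward


def random_policy(state):
    return random.choice([-1, 1])


def always_forward(state):
    return 1


if __name__ == "__main__":
    print("random policy:", run_episode(LineTrack(), random_policy))
    print("always forward:", run_episode(LineTrack(), always_forward))
```

Neither policy learns anything here; the sketch only shows the interaction loop that a learning algorithm would sit inside. The `always_forward` policy reaches the goal in five steps, which is the best this toy environment allows.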

Games such as Atari and Go have become testbeds for RL algorithms. They lend themselves well to being solved by RL as lots of sequential actions need to be taken over time, without a supervised signal, in a simulated environment. 


Fig 2: RL framework applied to Mario Kart.

To understand the fundamental concepts of RL, let’s look at an example: Mario Kart. An AI Mario would be the RL agent in this case. The possible actions Mario can take are 1.) turning the steering wheel and 2.) pressing the accelerator or the brake. The optimal actions would be based on the position and speed of his vehicle, the location on the track, and the other surrounding vehicles. All these elements will define the state Mario finds himself in. 

If Mario can reach the destination quickly while respecting the game rules, he will be rewarded with higher scores. Mario will play the game multiple times, with the game interface depicting the virtual environment. By gaining more experience, he will make smarter decisions on when to accelerate, turn or brake, allowing him to be faster and maximise his score. These concepts of agent, action, environment, state, and reward form the fundamental building blocks for RL. 

Fig 3: Branches of Machine Learning.

RL differs from Supervised Learning (SL) and Unsupervised Learning (UL) in three main ways: 

1. Learning

In RL, there is no answer, no supervisor, but only a reward signal. This reward signal indicates what is “good” and what is “bad” behaviour. The decision maker, i.e. the agent, has the freedom to decide what actions to take to maximise the reward. Conversely, in SL, the correct answer is given via a labelled training data set. 

2. Sequentiality & Timelines

Another distinction is that in RL, the feedback can be delayed over time. A decision made now might only be rewarded later. For example, AI Mario may only learn that accelerating at a specific point was “good” upon achieving a faster time at the end of the lap. This is because the algorithm seeks to maximise the long-term reward. In RL, the timing and the order of the agent’s decisions matter as we are dealing with sequential decision-making processes.
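One common way to express "a decision made now might only be rewarded later" is the discounted return, where each step is credited with the rewards that follow it. The sketch below assumes a discount factor `gamma`, which is standard in RL but is not mentioned in this post, and uses made-up reward values.

```python
def returns_to_go(rewards, gamma=0.99):
    """For each step, compute the discounted sum of the rewards from that step onwards."""
    returns = []
    running = 0.0
    # Work backwards so each step can reuse the return of the step after it
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


if __name__ == "__main__":
    # Hypothetical lap: no feedback for three steps, then a reward of 10 at the finish
    for step, value in enumerate(returns_to_go([0, 0, 0, 10], gamma=0.9)):
        print(f"step {step}: return-to-go = {value:.2f}")
```

Even though the first three steps receive no immediate reward, each one is assigned a positive return, which is how an early action such as accelerating at the right moment can be recognised as "good" after the fact.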

3. Data Generation

Perhaps the most important distinction in RL is how the data is generated. In RL, the agent’s actions influence the subsequent data it receives, unlike SL/UL. The RL agent gets to take actions, influence its environment, and move around in it. Depending on where the RL agent decides to go, it will “see” different things, and it will get different rewards. So, the agent is influencing the data that it receives as it “finds” it. 


Reinforcement Learning Solves Real-World Sequential Decision-Making Problems

RL has been gaining momentum, especially since March 2016, when DeepMind’s AlphaGo Reinforcement Learning algorithm defeated a former world champion of Go. Go is an ancient Chinese board game known for being one of the most complex and strategic games in the world. At the time, Go was seen as the next big challenge for Machine Learning; there was a gap of almost two decades since IBM’s Deep Blue beat the world chess champion in 1997. When AlphaGo won, it was a massive turning point.

RL has been at the heart of some of the most significant breakthroughs in recent years. Companies like Google, Amazon, and Spotify are just a few examples of those actively investing in RL. Their use cases include personalisation of push alerts, product suggestions, and operational purposes such as cooling of data centres and faster video loading by pre-fetching content.


Fig 4: The award-winning documentary about AlphaGo.

The real value of RL lies in solving real-world sequential decision-making problems. Shifting the focus from optimising predictions to optimising decisions based on reward maximisation could even lead to the development of artificial general intelligence. We are at the point where the tools and computational resources available allow us to build a growing range of RL solutions.


Current Uses of Reinforcement Learning

Environments in which at least one of the following conditions applies are typically good candidates for an RL solution:

  • There is a known model of the environment, but no analytical solution is available.
  • Only a simulation model of the environment is given (the subject of a simulation-based optimisation).
  • The only way to collect information about the environment is to interact with it.

Reinforcement Learning can be used in different industries such as gaming, media, telecommunications, logistics, energy, finance, advertising, healthcare, e-commerce, and many more. Let’s take a look at some examples.

A prime example of an RL application is a recommendation system. RL has significant advantages when developing a recommendation system for news feeds, products, or videos, and it is a good fit for personalising product recommendations on a dynamic website. Unlike SL, which would require constant model updates, RL can continually update recommendations based on user feedback. RL can learn to make good decisions in the short term while also doing well on a longer-term goal, such as customer retention.
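To make "continually update recommendations based on user feedback" concrete, here is a minimal sketch using an epsilon-greedy bandit. This is a deliberately simplified, single-step relative of full RL that we have chosen for illustration; it is not the method used by any company mentioned above, it ignores longer-term goals such as retention, and the class name, items and parameters are hypothetical.

```python
import random


class EpsilonGreedyRecommender:
    """Hypothetical recommender that learns item values from click feedback."""

    def __init__(self, items, epsilon=0.1, seed=None):
        self.items = list(items)
        self.epsilon = epsilon  # fraction of recommendations used for exploration
        self.counts = {item: 0 for item in self.items}
        self.values = {item: 0.0 for item in self.items}  # running mean reward
        self.rng = random.Random(seed)

    def recommend(self):
        # Explore: occasionally try a random item to gather new feedback
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.items)
        # Exploit: otherwise show the item with the best average reward so far
        return max(self.items, key=lambda item: self.values[item])

    def update(self, item, reward):
        # Incremental mean: no retraining on the full history is needed
        self.counts[item] += 1
        self.values[item] += (reward - self.values[item]) / self.counts[item]


if __name__ == "__main__":
    recommender = EpsilonGreedyRecommender(["news", "sport", "music"], seed=42)
    # Made-up click probabilities, used only to simulate user feedback
    click_rate = {"news": 0.1, "sport": 0.3, "music": 0.6}
    simulator = random.Random(0)
    for _ in range(2000):
        item = recommender.recommend()
        clicked = 1.0 if simulator.random() < click_rate[item] else 0.0
        recommender.update(item, clicked)
    print(recommender.values)
```

Each piece of feedback adjusts the estimates immediately, so the recommendations shift as user behaviour changes without a separate retraining step.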


Fig 5: Reinforcement Learning has been used in the energy sector.

RL can also be used to optimise processes such as smart energy grids and data centre cooling. These are difficult to optimise using traditional formula-based engineering or human intuition, and it is impractical to come up with rules and heuristics for every operating scenario. In this case, the RL agent continues to learn the best action to take over timescales ranging from microseconds to months.

Other applications of RL include abstractive text summarisation engines, dialogue agents for text and speech, learning optimal treatment policies in healthcare, and RL-based agents for online stock trading. 

Fig 6: Google’s DeepMind is using Reinforcement Learning to train an AI to Control Nuclear Fusion.

Unfortunately, the development of real-world solutions using RL has been lagging, as identifying potential use cases can be challenging.


Reinforcement Learning in Practice

Reinforcement Learning has many applications, including self-driving cars, ad recommendations, personalised chatbot responses, and many others. To identify use cases, it’s important to know which conditions need to be present for RL to work effectively. Below, we detail some of the hard constraints to consider when looking to use RL.

1. Formulating a Control Problem Instead of a Prediction Problem

Prediction and control problems are two of the most common problem classes. Prediction models try to learn about a domain and understand how it might evolve. Control models prompt agents to take actions in an environment. RL is applicable to control problems, where the solution is difficult to describe as a prediction problem. This can be due to a large complex problem space, or the need for a highly dynamic system that can easily adapt to unseen situations. If solving a certain problem can be learned by receiving feedback from the environment through trial and error, with well-defined rewards, RL could be the answer. 

2. Choosing the Right Variables to Describe the State

It is also important to consider which variables describe the state and how these will be quantified. To make a smart decision at each step, the RL agent needs to have access to the right information to define the state of the environment. More information might give a more accurate description to the agent, but it could result in an increased training time for the agent as more variables need to be taken into account. 

Furthermore, irrelevant information might confuse the agent, while too little information may result in a poorly performing agent or even in the agent learning the wrong behaviour.

3. Shifting from Predictions to Actions

The RL agent requires a well-defined set of available decisions. These actions determine what the agent can do given the situation it finds itself in. An agent needs to know which actions it can take, though it doesn’t necessarily know what effects they will have on the environment.
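Writing the state variables and the action set down explicitly is a useful first step when scoping a use case. The sketch below does this for the Mario Kart example from earlier; the exact fields, names and units are our own assumptions, chosen only to illustrate the idea.

```python
from dataclasses import dataclass, astuple
from enum import Enum


class Action(Enum):
    """A well-defined, finite set of decisions available to the agent."""
    STEER_LEFT = 0
    STEER_RIGHT = 1
    ACCELERATE = 2
    BRAKE = 3


@dataclass(frozen=True)
class KartState:
    """Hypothetical state: only the variables we believe matter for the decision."""
    speed: float                     # current speed of the kart
    track_position: float            # fraction of the lap completed, 0.0 to 1.0
    distance_to_nearest_kart: float  # gap to the closest other vehicle

    def as_vector(self):
        # Most learning algorithms expect the state as a flat list of numbers
        return list(astuple(self))


if __name__ == "__main__":
    state = KartState(speed=42.0, track_position=0.25, distance_to_nearest_kart=3.5)
    print(state.as_vector())
    print([action.name for action in Action])
```

Every field added to the state is one more variable the agent has to learn from, so a definition like this is also a convenient place to question whether each variable is really needed.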

4. Defining the Reward Function

Well-defined rewards are an essential part of RL as they indirectly determine the desired behaviour of the RL agent. The reward can be computed either after each action or at the end of a series of actions. However, if the reward function is poorly designed, the RL agent might exhibit unwanted behaviour.

A classic example of this is shown below in the game, CoastRunners. Instead of racing to the finish line while gathering as many points as possible, the agent finds a way to achieve a higher score by crashing into boats, catching fire and going the wrong way on the track. The agent will do what it is rewarded to do under the defined constraints.


Fig 7: Misspecified reward functions in the game, CoastRunners.
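The effect of a misspecified reward can be reproduced with a small, made-up comparison loosely inspired by the CoastRunners example. All numbers, function names and trajectories below are hypothetical; the point is only that the agent's preferred behaviour follows from the reward definition.

```python
def points_only_reward(points, finished_lap):
    """Misspecified: rewards points only, so finishing the race is never encouraged."""
    return float(points)


def lap_aware_reward(points, finished_lap, step_penalty=1.5, finish_bonus=100.0):
    """One possible fix: every step has a cost, and finishing earns a bonus."""
    reward = points - step_penalty
    if finished_lap:
        reward += finish_bonus
    return reward


def total_reward(reward_fn, trajectory):
    """Sum the per-step rewards over a list of (points, finished_lap) steps."""
    return sum(reward_fn(points, finished) for points, finished in trajectory)


# Two hypothetical behaviours, each collecting one point per step
looping = [(1, False)] * 300                 # circles forever, never finishes
racing = [(1, False)] * 49 + [(1, True)]     # finishes the lap after 50 steps

if __name__ == "__main__":
    for name, reward_fn in [("points only", points_only_reward),
                            ("lap aware", lap_aware_reward)]:
        print(name,
              "| looping:", total_reward(reward_fn, looping),
              "| racing:", total_reward(reward_fn, racing))
```

Under the points-only reward, looping scores higher than racing, so that is what an agent would learn to do. The lap-aware version reverses the preference, but only because the penalty and bonus were chosen to outweigh the points on offer; with a smaller step penalty, looping would still win.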

5. Using RL Safely

The agent learns through trial and error by interacting directly with the environment. This means that the agent will make mistakes. However, the freedom to explore without consequence is not always present. It is necessary to be able to define a safe exploration space; safety constraints are needed to avoid catastrophic results in many real-world settings. 

For example, in the case of autonomous vehicles, it is impractical to start the training from scratch in the real world. Therefore, numerous trials will be run in a simulated environment before testing models in real life. However, an agent that performs well in a simulation will not necessarily transfer its skills to real-world applications. This introduces another important concept: balancing the cost of making the simulation realistic against the benefit of a smaller domain transfer gap.


Fig 8: Teaching a car to follow a lane from scratch using Reinforcement Learning.
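One simple way to define a safe exploration space is to filter out unsafe actions before the agent chooses among them. The sketch below assumes a single hand-written safety rule, a speed limit; the action names and the rule itself are hypothetical, and real systems typically need far richer constraints.

```python
import random

ACTIONS = ("accelerate", "hold", "brake")


def safe_actions(speed, speed_limit, actions=ACTIONS):
    """Return only the actions allowed by the safety constraint."""
    if speed >= speed_limit:
        # At or above the limit, accelerating is never offered to the agent
        return [action for action in actions if action != "accelerate"]
    return list(actions)


def explore(speed, speed_limit, rng=random):
    """Explore by trial and error, but only within the safe action set."""
    return rng.choice(safe_actions(speed, speed_limit))


if __name__ == "__main__":
    print(safe_actions(speed=20, speed_limit=30))
    print(safe_actions(speed=30, speed_limit=30))
    print(explore(speed=35, speed_limit=30))
```

The agent is still free to make mistakes inside the allowed set, but the constraint is enforced outside the learning algorithm, so it holds even while the agent is untrained.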



Reinforcement Learning is an area with lots of interest, but identifying the right use cases can be difficult. While a wide range of industries can use Reinforcement Learning, it is important to consider a number of factors, both before starting and while developing your model, such as properly defining the reward function and choosing the right variables to describe the state.

This blog has covered several of these, and with these considerations, it becomes easier to identify where RL can be used effectively to help your business.

When used correctly, RL can be used to train models that generate huge business value at a lower cost than other types of Machine Learning model training. 

In the next blog, we’ll provide a code walkthrough, where you can train your first RL agent. We hope these blog posts give you some tips and insights to get you started with RL!


Datatonic is Google Cloud’s Machine Learning Partner of the Year with a wealth of experience developing and deploying impactful Machine Learning models and MLOps Platform builds. Need help with developing an ML model, or deploying your Machine Learning models fast? Have a look at our MLOps 101 webinar, where our experts talk you through how to get started with Machine Learning at scale or get in touch to discuss your ML or MLOps requirements!
