Bias and Fairness Part 1: Bias in Data and Machine Learning
In this current era of big data, machine learning is sweeping across multiple industries. However, as big data and machine learning become ever more prevalent, so too does their impact on society. At Datatonic we’re becoming increasingly mindful of the impact of our work on our clients and their end-users, dedicating plenty of time and resources to learning more about topics such as ethics and explainability in AI, all in line with one of our core values of “Purposeful Impact”. In this first part of a blog series on “Bias and Fairness in Machine Learning”, we will look at what bias is, how it can manifest in data, and what effect machine learning can potentially have on various biases. As machine learning tools and algorithms become more and more widespread in making critical decisions, so too does the scrutiny under which they are viewed.
What is bias?
The word “bias” itself is often thrown around in many different contexts, each with a slightly different meaning, although they are all based around the concept of errors. In machine learning, we often talk about the bias-variance trade-off in a model, where we don’t want models to overfit our data (i.e. have high variance) nor do we want models to underfit our data (i.e. have high bias). Here bias refers to a large loss, or error, both when we train our model on a training set and when we evaluate our model on a test set.
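The trade-off can be seen with two deliberately extreme models. The following is a toy, pure-Python sketch (the data and models are illustrative assumptions, not a real workload): a constant model that ignores the input entirely (high bias) and a lookup table that memorises the training points (high variance).

```python
import random

random.seed(0)

# Toy data: y = 2x + noise, split into train and test sets.
train = [(x, 2 * x + random.gauss(0, 1)) for x in range(20)]
test = [(x, 2 * x + random.gauss(0, 1)) for x in range(20)]

def mse(model, data):
    """Mean squared error of a model over a dataset."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# High-bias model: predicts the mean of y regardless of x (underfits).
mean_y = sum(y for _, y in train) / len(train)

def constant_model(x):
    return mean_y

# High-variance model: memorises the training points exactly (overfits).
lookup = dict(train)

def memoriser(x):
    return lookup.get(x, mean_y)

# The constant model has a large error on both sets; the memoriser has
# zero training error but still errs on the noisy test set.
print(mse(constant_model, train), mse(constant_model, test))
print(mse(memoriser, train), mse(memoriser, test))
```

The underfitting model is wrong everywhere in the same way; the overfitting model is only “right” on the data it has already seen.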
Statistics can offer us a more technical definition of bias: an estimator of a parameter is biased when its expected value differs from the true value of the quantity in question. Here the error is the disparity between what we expect the quantity to be and what the quantity actually is.
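A classic instance is the sample variance: dividing by n gives a biased estimator, while dividing by n − 1 gives an unbiased one. The simulation below (a minimal sketch with arbitrary choices of sample size and distribution) averages each estimator over many samples to expose the systematic gap from the true value.

```python
import random

random.seed(42)
TRUE_VAR = 4.0  # variance of a N(0, 2^2) distribution

def sample_var(xs, ddof):
    """Sample variance, dividing by len(xs) - ddof."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - ddof)

n, trials = 5, 20000
biased = unbiased = 0.0
for _ in range(trials):
    xs = [random.gauss(0, 2) for _ in range(n)]
    biased += sample_var(xs, 0)    # divide by n: biased
    unbiased += sample_var(xs, 1)  # divide by n - 1: unbiased

# The biased estimator averages (n-1)/n * TRUE_VAR = 3.2, while the
# unbiased estimator averages close to the true value of 4.0.
print(biased / trials, unbiased / trials)
```

No single sample reveals the problem; the bias only shows up as a consistent shift in the estimator's average over many repetitions, which is exactly the "systematic error" flavour of bias discussed next.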
More generally, we are interested in what constitutes bias in data. One definition gives bias as the “systematic error introduced into sampling or testing of data by selecting or encouraging one outcome or answer over others”. Once again this definition of bias is related to an error! However, more crucial are the following observations:
- This kind of bias can be conscious, unconscious, incidental or accidental;
- Compared to machine learning and statistical definitions, this kind of bias is much more difficult to eliminate or even mitigate.
In this definition, we can think of bias as a subjective viewpoint of the information available to us. Subjectivity is inherent in all areas of life: everyone has their own point of view, which leaks into the collection, manipulation, and dissemination of information. Depending on the situation and the viewpoint, this subjectivity might be more or less apparent than in other cases. In any case, trying to avoid all subjectivity is often impossible! However, we can still focus our efforts on mitigating it, to get as close to “fairness” as possible.
What types of bias exist in data?
Whilst it is almost impossible to completely eradicate all bias in data, it is also very difficult to be aware of when bias even exists. We are typically oblivious to our own bias at the best of times, and whilst tools like IBM’s AI Fairness 360 exist to tackle the problem of bias in machine learning (more on similar tools in future posts), a global adoption of these kinds of tools to help us examine data more carefully is still in its infancy.
One approach we’d like to share in tackling bias initially is to first take a step back and evaluate the data that we are using; in particular, to review the entire end-to-end lifecycle of a collection of data and consider all the possible steps at which bias could creep in. The We All Count organisation provides a handy definition of the lifecycle of data as a seven-step framework:
- Funding;
- Motivation;
- Project design;
- Data collection and sourcing;
- Analysis;
- Interpretation;
- Communication and distribution.
Of course, depending on your role, you might not have explicit visibility over the entire lifecycle of data. As machine learning and data science practitioners, we at Datatonic tend to be a 3rd party data processor, and are typically not involved in the data collection stage. We also tend to have less visibility over the funding or motivation stages, so mitigating bias in these early stages of the data lifecycle is largely out of our hands. Nevertheless, as experts in analysing and using data, data scientists like ourselves can still use our experience to help balance any subjectivity that might have arisen in the data. In addition, as external consultants we can often bring a different viewpoint to our clients’ table, which may also help in identifying biases further downstream. The key is to ask yourself: where in the data lifecycle do your role, business, or other main interests lie? From there you can start to tackle your own mitigation of bias.
The next step is to be aware of which biases could arise. Bias comes in many different forms; our data definition of bias leaves lots of room for interpretation. A recent survey listed a whopping 23 different types of bias that exist across the entire data lifecycle, including:
- Representation bias: the way we define and sample from a population;
- Popularity bias: giving preferential treatment to more popular items;
- Funding bias: influencing the data based on financial support and constraints;
- Observer bias: projecting your own expectations into analysis and communication.
Using the data lifecycle framework, we can start to identify which biases could typically arise at which stages. For example, funding bias most heavily occurs at the start of the data lifecycle, in the choice of data to be collected in the first place. But it could also occur in the final stage, affecting the communication of results based on what any sponsors might approve. Meanwhile, something like algorithmic bias (bias purely introduced by a model or algorithm) will only occur in the latter half of the data lifecycle. Keeping track of which biases can occur at which stages is key in helping to mitigate them.
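One lightweight way to keep track of this mapping is as a simple checklist keyed by lifecycle stage. The stage-to-bias assignments below are illustrative assumptions drawn from the examples above, not an exhaustive taxonomy:

```python
# Illustrative (not exhaustive) mapping of data-lifecycle stages to
# bias types that commonly arise there.
BIASES_BY_STAGE = {
    "funding": ["funding bias"],
    "project design": ["representation bias", "funding bias"],
    "data collection and sourcing": ["representation bias", "popularity bias"],
    "analysis": ["algorithmic bias", "observer bias"],
    "interpretation": ["observer bias"],
    "communication and distribution": ["funding bias", "observer bias"],
}

def biases_to_check(stages):
    """Return the unique bias types to review for the given stages."""
    found = set()
    for stage in stages:
        found.update(BIASES_BY_STAGE.get(stage, []))
    return sorted(found)

# A practitioner involved only in the later analysis stages:
print(biases_to_check(["analysis", "interpretation"]))
# ['algorithmic bias', 'observer bias']
```

Even a rough table like this turns “be aware of bias” into a concrete review step for whichever stages your role actually touches.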
How can machine learning affect biases?
Let’s now focus on algorithmic bias, looking into the effect that machine learning can have on subjectivity. Machine learning algorithms and models are often touted as being “objective”: after all, the maths can’t lie, right? However, whilst a purely “objective” algorithm might not add additional bias, it is also unlikely to remove any existing biases that get fed into it. For example, a few years ago there was an uproar when it was found that Google Photos was mislabelling pictures of black people as gorillas. Thankfully the flaw was quickly identified and corrected! Of course, the algorithm behind the app was (probably) never explicitly programmed to recognise black humans as gorillas. What likely happened instead was that there was an underlying bias in the data fed into the models behind the facial recognition software (specifically, population bias in the under-representation of black humans in the data), and the models learned patterns under this subjective view of reality.
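A basic check for this kind of under-representation is to compare each group's share of the training data against its share of a reference population. The function below is a minimal sketch; the group names, counts, and tolerance threshold are all hypothetical:

```python
# Sketch: flag groups whose share of the training data falls well below
# their share of a reference population. All names, counts, and the
# tolerance threshold are illustrative assumptions.
def underrepresented(sample_counts, population_shares, tolerance=0.5):
    """Return groups whose sample share is below tolerance * population share."""
    total = sum(sample_counts.values())
    flagged = []
    for group, pop_share in population_shares.items():
        sample_share = sample_counts.get(group, 0) / total
        if sample_share < tolerance * pop_share:
            flagged.append(group)
    return flagged

counts = {"group_a": 900, "group_b": 80, "group_c": 20}   # training data
shares = {"group_a": 0.6, "group_b": 0.3, "group_c": 0.1}  # population

print(underrepresented(counts, shares))
# ['group_b', 'group_c']
```

A model trained on this data would see group_b and group_c far less often than reality warrants, so its errors will tend to concentrate on exactly those groups.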
Moreover, even if maths itself can claim to be objective, there is still plenty of subjectivity involved in the design of a machine learning pipeline. Humans pick the architecture of models used: do we use linear regression or boosted decision trees? How many hidden layers do we include in our neural network? How many models do we want to use in total? How do we even frame the problem as a machine learning use case? Anytime we make an active decision, we introduce some more additional subjectivity. Therefore, whilst it is nice to hide behind the guise of mathematical objectivity, we need to be aware that what often comes along with it is a heap of decisive subjectivity.
The example of mislabelling humans, whilst still undesirable, was fortunately kept in check in the end. However, the potential consequences of unchecked machine learning models are much more significant. Consider a prototypical machine learning pipeline: data is collected, models are trained and deployed, and users then interact with the results, generating new data in the process.
This approximately maps to stages 4-7 of the data lifecycle framework, and we can see various steps at which human bias can be introduced. For algorithmic bias, we’re most interested in the stage after the models are deployed; namely, the interaction that users have with the results. In the case of the Google Photos scenario, this is where users are using the app to label their photos. Notice how the end-user interaction actually loops back around to the start of the process, affecting the data collected and trained on. This is known as a feedback loop and represents the main damaging effect that machine learning can have on bias: perpetuating and even amplifying any underlying biases in the data.
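The amplification effect can be seen in a toy simulation (the “recommender”, click rates, and starting data are all made-up assumptions): a system always promotes whichever item has been clicked most so far, and promotion itself boosts clicks. A tiny initial imbalance in the data then snowballs.

```python
import random

random.seed(1)

# Toy feedback loop: the "recommender" always promotes the item with
# the most recorded clicks, and promotion boosts click probability.
# True user preference is 50/50, but the small initial imbalance in
# the historical data gets amplified over time.
clicks = {"A": 6, "B": 4}  # slightly biased starting data

for _ in range(1000):
    promoted = max(clicks, key=clicks.get)
    # Users pick the promoted item 80% of the time; otherwise 50/50.
    if random.random() < 0.8:
        choice = promoted
    else:
        choice = random.choice(["A", "B"])
    clicks[choice] += 1  # the choice feeds straight back into the data

# The initial 60/40 tilt grows to a heavy majority for item A, even
# though users have no underlying preference for it.
print(clicks)
```

The model never “decides” to favour item A; it simply keeps learning from data that its own earlier outputs distorted, which is precisely the feedback loop described above.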
In some cases, this feedback loop can be broken or easily mitigated. In the Google Photos scenario, users reporting back the correct labels for their images and providing this as new data allows the model to retrain on increasingly accurate views of reality, where population and sampling biases have been incrementally mitigated. Left unchecked, however, feedback loops can quickly grow out of control, magnifying biases and eventually redefining the reality in which they exist. Cathy O’Neil refers to particularly egregious models exhibiting this kind of behaviour (arising anywhere from predictive policing to the college rankings published by the US News and World Report) as “weapons of math destruction” in her eponymous award-winning book, and gives three prerequisites for models to be truly destructive:
- Self-perpetuating, making use of feedback loops to redefine their own reality;
- Opaque models with little insight as to how they use data or make decisions;
- Easily scalable to make decisions on large populations.
Often these models start off as innocuous, even specifically trying to achieve something for the good of society, but quickly grow out of control as their usage becomes more widespread. As machine learning methodologies become increasingly business-critical, it is crucial to keep these points in mind during early development.
Bias introduces subjectivity into data in ways that are often overlooked. It can manifest in many different forms, and machine learning techniques can often maintain or even exacerbate biases. So what can we do to tackle this? In this post, we’ve defined what bias is, and we’ve already outlined a few steps that you can take towards mitigating bias in your domain:
- Step 1: think of where you lie in the data lifecycle using the seven-step framework;
- Step 2: identify the biases that are most likely to occur at that stage in the lifecycle.
In our next post, we will cover additional steps more specific to machine learning:
- How to define “fairness” as the antithesis to bias in machine learning;
- How to apply various techniques, tools, and frameworks in order to promote fairness and help mitigate some of the traces of bias in machine learning workloads.