Responsible AI: The Role of Data and Model Cards
Authors: Theodoros Marinos, Machine Learning Consultant and Matthew Gela, Senior Data Scientist
In recent years, there has been a growing concern about the lack of transparency in Machine Learning models, especially when used in high-stakes applications such as healthcare, finance, and criminal justice. Many stakeholders, including regulators, policymakers, and end-users, are calling for increased transparency and accountability in the development and deployment of these models.
In this first technical blog, we will show you how to use data cards and model cards to bring greater transparency between stakeholders and model development teams. We will show you what these look like, as well as how you can integrate them into your Machine Learning pipelines.
ML Documentation lacks consistency
Documentation is an essential part of the software development lifecycle that aims to communicate what has been developed to all stakeholders, including other developers, testers, project managers and the wider business. It makes the software project transparent to non-technical stakeholders. They can understand what the software does, how it works, and how it will benefit them. This helps in building trust between the development team and the stakeholders.
As in software development, documentation is a critical part of the Machine Learning development lifecycle: it helps keep Machine Learning systems easy to understand, maintain, and use. However, creating clear and transparent documentation for Machine Learning models presents additional challenges. Models are often complex and difficult to understand, they can continue to learn and change over time, and there has been no consistent format for documenting them.
Stakeholders also need information that is specific to ML models and datasets. For example, to judge whether a model is fit for purpose, they need to understand its performance, any biases in the model or its underlying dataset, and its limitations.
For these reasons, there is a need for a common standard that enables thorough documentation of models and datasets, for automation that reduces the friction of keeping this documentation up to date, and for a way to make the information easy to find for those who need it. Model cards and data cards are tools that provide this transparency and key information to users.
Improving transparency with model cards
Model cards are a form of documentation that provides a standardised way of presenting information about Machine Learning models. They were first introduced by Google in 2018, and have since become increasingly popular across the industry. For example, they have been adopted by the Hugging Face platform, where hosted models are typically accompanied by a model card alongside the model files.
Model cards aim to provide a concise, holistic picture of a Machine Learning model in a consistent format. They typically include the following information:
- Model description
- Intended use
- Features used in the model
- Data used for training & evaluation
- Ethical considerations
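To make the list above concrete, here is a minimal sketch of a model card held as a Python dataclass. The class name `ModelCard`, its field names and the example values are our own illustrative assumptions; they are not a standard schema.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class ModelCard:
    """Hypothetical container mirroring the sections listed above."""
    model_description: str
    intended_use: str
    features: list
    training_data: str
    evaluation_data: str
    ethical_considerations: list = field(default_factory=list)

    def to_json(self) -> str:
        # Serialise the card so it can be stored or rendered elsewhere.
        return json.dumps(asdict(self), indent=2)


# Illustrative values only.
card = ModelCard(
    model_description="Classifier that predicts loan default",
    intended_use="Support credit analysts; not for fully automated decisions",
    features=["loan_amount", "income", "property_value"],
    training_data="Public loan default dataset (training split)",
    evaluation_data="Public loan default dataset (held-out split)",
    ethical_considerations=["Check outcomes across sensitive attributes"],
)
print(card.to_json())
```

Keeping the card as structured data rather than free text makes it easier to update and render automatically later on.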
One of the key benefits of model cards is that they can help reduce friction in the Machine Learning development process. This is because they provide a standardised way of presenting information about models that can be easily shared and understood by business stakeholders, as well as developers, data scientists, and business analysts.
Creating clear and transparent model documentation can still be challenging, even with model cards. One of the biggest challenges is ensuring that the information presented in the model card is accurate, complete, and understandable. This requires a deep understanding of the Machine Learning model and the data used to train it, as well as an understanding of the intended use case for the model.
While a model card provides important information about a Machine Learning model, it does not contain detailed information about the data used to train the model. This is where a data card comes in.
Enter the data card
Data cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders across a project’s lifecycle for responsible AI development. They are very similar to model cards, but instead of containing information about the model, they summarise the lifecycle of the data that the model was trained on. These can include, but are not limited to:
- Dataset overview & example of data points
- Data source collection, validation, transformations and sampling methods
- Sensitive attributes
- Training and evaluation methods
- Intended and extended use of data
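Some of the items above, such as the dataset overview and the sensitive attributes, can be derived from the data itself. The sketch below assumes the dataset is a list of Python dictionaries; the function name `summarise_dataset` and the example rows are hypothetical.

```python
def summarise_dataset(rows, sensitive_attributes):
    """Build a small data card summary from a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    summary = {
        "num_rows": len(rows),
        "example": rows[0] if rows else None,
        "columns": {},
        "sensitive_attributes": [c for c in columns if c in sensitive_attributes],
    }
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        summary["columns"][col] = {
            # Record the Python type names seen, so mixed-type columns stand out.
            "types": sorted({type(v).__name__ for v in present}),
            "missing": len(values) - len(present),
        }
    return summary


# Illustrative rows only.
rows = [
    {"loan_amount": 12000, "income": 41000.0, "gender": "F"},
    {"loan_amount": 8000, "income": None, "gender": "M"},
]
card = summarise_dataset(rows, sensitive_attributes={"gender"})
print(card["columns"]["income"])
```

Items such as collection methods or intended use cannot be computed this way and still need to be written by the people who know the data.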
Having a clear view of the data’s lifecycle throughout a project can help to create more transparent, reproducible, and reliable models. It can also make it significantly easier to incorporate further additions or modifications. This translates to faster integration of model improvements since it becomes much simpler to identify areas that require improvement, such as in the preprocessing steps.
Data cards are becoming increasingly essential in Machine Learning as the amount of data used continues to grow rapidly. Keeping track of all the changes in data can be challenging, particularly when dealing with multiple datasets or performing numerous transformations on the data. A uniform method of representing this information, such as the data card, simplifies the process of accessing, using, and comprehending the data. It removes the need to search through multiple files to find the necessary information, as everything can be found in a single, accessible place.
Furthermore, information that cannot be inferred directly from the data can be included in the data card for further visibility and understanding. In this way, data cards can speed up decision-making by providing key information about the data that would otherwise be very time-consuming to find.
Making model cards – a walkthrough
Now that we’ve explored data and model cards and why they are beneficial, we’ll show in more detail how you can integrate model cards into your Machine Learning production pipelines. We’ll also share an example model card based on an XGBoost model that predicts whether a loan applicant will default on their bank loan.
Generating and updating cards
There is an increasing need to automate the generation and updating of data and model cards. Each time a dataset or model is updated, the resulting metrics can change, graphs can become outdated, and known issues may be resolved while new ones arise. Automation helps ensure that the information presented stays accurate and complete, and reduces the time and effort required to create and maintain these documents.
To do this, we have built Kubeflow Pipelines (KFP) components that automatically update the data cards and model cards within the training pipeline whenever the model or underlying dataset changes.
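One simple way to detect that a card is out of date is to store a fingerprint of the dataset and model configuration alongside it. The sketch below is an assumption on our part rather than a description of the components above; the function names and the card fields are hypothetical.

```python
import hashlib
import json


def fingerprint(obj) -> str:
    """Return a stable SHA-256 hash of any JSON-serialisable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def card_is_stale(card: dict, dataset_rows: list, model_params: dict) -> bool:
    # The card is stale if either fingerprint no longer matches what it recorded.
    return (
        card.get("dataset_fingerprint") != fingerprint(dataset_rows)
        or card.get("model_fingerprint") != fingerprint(model_params)
    )


# Illustrative data only.
rows = [{"income": 41000, "defaulted": 0}]
params = {"max_depth": 4}
card = {
    "dataset_fingerprint": fingerprint(rows),
    "model_fingerprint": fingerprint(params),
}
print(card_is_stale(card, rows, params))
```

For large datasets, hashing every row in memory would not scale; a version identifier or file checksum could play the same role.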
The diagram shows a reduced version of the Vertex AI training pipeline, which illustrates how the model card fits within the training pipeline.
For the model card, the pipeline component gathers information about the model by analysing the model code, examining the training data, and generating performance metrics and graphs. It then inserts this data into a BigQuery table and saves any relevant graphs in Google Cloud Storage, from where they can be easily retrieved to populate the model card. The data card follows a similar process, so we won’t elaborate on it here.
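The flow described above can be sketched without any cloud dependencies. In this minimal example, an in-memory SQLite database stands in for the BigQuery table. The table name, column names, function names, metric values and storage paths are all hypothetical placeholders, and this is not the pipeline component itself.

```python
import json
import sqlite3


def publish_model_card_row(conn, model_name, version, metrics, graph_uris):
    """Append one row of model card data for a newly trained model version."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS model_cards ("
        "model_name TEXT, version INTEGER, metrics TEXT, graph_uris TEXT)"
    )
    conn.execute(
        "INSERT INTO model_cards VALUES (?, ?, ?, ?)",
        (model_name, version, json.dumps(metrics), json.dumps(graph_uris)),
    )
    conn.commit()


def latest_model_card_row(conn, model_name):
    """Fetch the most recent row, which is what the card would display."""
    version, metrics, graph_uris = conn.execute(
        "SELECT version, metrics, graph_uris FROM model_cards "
        "WHERE model_name = ? ORDER BY version DESC LIMIT 1",
        (model_name,),
    ).fetchone()
    return {
        "version": version,
        "metrics": json.loads(metrics),
        "graph_uris": json.loads(graph_uris),
    }


# SQLite stands in for BigQuery; metric values and paths are placeholders.
conn = sqlite3.connect(":memory:")
publish_model_card_row(conn, "loan_default", 1, {"auc": 0.81}, ["gs://example-bucket/roc_v1.png"])
publish_model_card_row(conn, "loan_default", 2, {"auc": 0.84}, ["gs://example-bucket/roc_v2.png"])
print(latest_model_card_row(conn, "loan_default"))
```

Appending a row per version, rather than overwriting, keeps a history that the card can use to show how the model has changed.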
Adding interactivity to your model cards
In addition to automated model card generation, there is also a growing interest in interactive model cards. Interactive model cards allow users to explore and interact with Machine Learning models more dynamically. They can include features like interactive visualisations, and the ability to query the model to get a feel for how it works.
Interactive model cards can be particularly useful for non-technical stakeholders, who may not have a deep understanding of Machine Learning. By providing a more interactive and user-friendly way of exploring Machine Learning models, interactive model cards can help ensure that models are more transparent and understandable to all stakeholders.
To achieve this, we’ve used Gradio, a popular open-source Python library for building web apps, to build and populate our model card and provide an interface to query the model. A connection to our BigQuery table allows us to populate all the relevant fields dynamically and with flexibility.
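Populating the card dynamically comes down to turning a row of stored fields into something displayable. The sketch below renders a Markdown string from a dictionary using only the standard library. The field names and values are hypothetical, and the Gradio wiring is deliberately left out.

```python
def render_model_card(record: dict) -> str:
    """Turn a row of model card fields into a Markdown string."""
    lines = [f"# {record['model_name']} (version {record['version']})", ""]
    lines.append(record["description"])
    lines += ["", "## Metrics"]
    for name, value in sorted(record["metrics"].items()):
        lines.append(f"- {name}: {value:.3f}")
    lines += ["", "## Limitations"]
    # Fall back to an explicit note so an empty section is never silently hidden.
    for item in record.get("limitations") or ["None recorded"]:
        lines.append(f"- {item}")
    return "\n".join(lines)


# Placeholder record; in practice this would come from the stored card data.
record = {
    "model_name": "loan_default",
    "version": 2,
    "description": "Predicts whether a borrower will default on a loan.",
    "metrics": {"auc": 0.84, "recall": 0.7},
    "limitations": [],
}
print(render_model_card(record))
```

A string like this could then be displayed by a Markdown component in the web app, so the card refreshes whenever the underlying record changes.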
An interactive model card example
An example of what a model card can look like is shown below. This snapshot demonstrates some of the key elements that a model card should include. The model behind it is an XGBoost model that predicts whether a borrower will default on their bank loan. It was trained on a public loan default dataset from Kaggle.
The first section of the model card summarises the key information that a user or stakeholder might need to understand the use case and purpose of the model. This includes the performance of the model on appropriate metrics, the owners and versions of the model, and any considerations or limitations that apply to it. Including these limitations is important, as it helps teams plan actions to mitigate them.
The next section of our model card lists all of the features that the model requires to output a prediction, with the name and type of each feature. It also gives the name of the target variable and the values it can take, describing the meaning of each one.
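A feature section like this can double as a machine-readable schema. The sketch below checks an input against one; the feature names, type labels and target values are hypothetical and not taken from the dataset used in this post.

```python
# Hypothetical schema: feature names, types and target values are illustrative.
TYPE_CHECKS = {"number": (int, float), "category": (str,)}
FEATURE_SCHEMA = {
    "loan_amount": "number",
    "income": "number",
    "property_value": "number",
    "loan_purpose": "category",
}
TARGET = {"name": "status", "values": {0: "loan repaid", 1: "loan defaulted"}}


def validate_input(example: dict) -> list:
    """Return a list of problems; an empty list means the input fits the schema."""
    problems = []
    for name, kind in FEATURE_SCHEMA.items():
        if name not in example:
            problems.append(f"missing feature: {name}")
            continue
        value = example[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, TYPE_CHECKS[kind]):
            problems.append(f"{name}: expected {kind}, got {type(value).__name__}")
    return problems


print(validate_input({"loan_amount": "12000", "income": 41000.0, "loan_purpose": "home"}))
```

Keeping the schema next to the card means the documentation and the input checks cannot drift apart.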
The model evaluation section covers the performance metrics of the model in more depth. Any relevant evaluation graphs can be included here, such as an ROC curve, a confusion matrix and feature importance graphs. With all the evaluation information centralised in one place, it is much easier to see where the model is underperforming and to plan retraining that targets those areas.
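As a small illustration of what sits behind such a section, the sketch below computes a confusion matrix, precision and recall for a binary classifier in plain Python. The labels and predictions are made-up examples, not results from our model.

```python
def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = default)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["fn" if truth == 1 else "tn"] += 1
    return counts


def precision_recall(counts):
    # Guard against division by zero when a class is never predicted or present.
    predicted_pos = counts["tp"] + counts["fp"]
    actual_pos = counts["tp"] + counts["fn"]
    precision = counts["tp"] / predicted_pos if predicted_pos else 0.0
    recall = counts["tp"] / actual_pos if actual_pos else 0.0
    return precision, recall


# Made-up labels and predictions for illustration.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
counts = confusion_matrix(y_true, y_pred)
print(counts, precision_recall(counts))
```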
Another section of the model card is the fairness measurements section. This includes the outputs of any fairness measurements carried out on identified sensitive attributes that might bias the model’s predictions. With this section, you can easily check each of these attributes and find out whether your model is making decisions that favour one group or disadvantage another. We will discuss fairness measurement in more detail in an upcoming blog.
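As one example of such a measurement, the sketch below compares the rate of positive predictions between groups, often called the demographic parity difference. Choosing this metric is our assumption for illustration; the group labels and predictions are made up.

```python
from collections import defaultdict


def selection_rates(predictions, groups):
    """Share of positive predictions (1 = predicted default) within each group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_difference(predictions, groups):
    # Gap between the highest and lowest group selection rates; 0 means parity.
    rates = selection_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Made-up predictions and group labels for illustration.
preds = [1, 0, 1, 0, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(selection_rates(preds, groups))
print(demographic_parity_difference(preds, groups))
```

A large gap is a prompt for investigation rather than proof of unfair treatment, since the groups may differ in other relevant ways.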
Lastly, we also have an interactive part in our model card where stakeholders and users can actively use the model and check the model’s prediction. You can see and adjust all the input parameters of the model to your liking and then get a prediction from the model. For example, in our case, since we are using a dataset to predict if a loan borrower will default or not, we can adjust the values of some features such as loan amount, income and property value, and then click predict to check the model’s predictions along with its confidence for the prediction.
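Behind a predict button like this sits a function that takes the input values and returns a label with a confidence. The sketch below uses a toy logistic model with hypothetical coefficients as a stand-in; it is not the XGBoost model from this post.

```python
import math

# Hypothetical coefficients for a toy logistic model, chosen for illustration.
WEIGHTS = {"loan_amount": 0.00004, "income": -0.00003, "property_value": -0.000002}
BIAS = 0.2


def predict(loan_amount: float, income: float, property_value: float):
    """Return a label and the model's confidence in that label."""
    inputs = {
        "loan_amount": loan_amount,
        "income": income,
        "property_value": property_value,
    }
    score = BIAS + sum(WEIGHTS[name] * value for name, value in inputs.items())
    p_default = 1.0 / (1.0 + math.exp(-score))
    label = "default" if p_default >= 0.5 else "no default"
    # Confidence is the probability of whichever label was chosen.
    confidence = p_default if p_default >= 0.5 else 1.0 - p_default
    return label, round(confidence, 3)


print(predict(loan_amount=10000, income=100000, property_value=300000))
```

A function of this shape, with feature values in and a label plus confidence out, is the kind of callable that an interface library such as Gradio can wrap with input widgets.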
Additionally, we have implemented an explainability component within this interactive part of the model card, which identifies the importance of each parameter the model has received, and how it affected the model’s decision. Having access to this explainability component promotes better transparency for the users and the stakeholders and can even help identify if there are any potential ethical or social issues with the model’s prediction.
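There are several ways to attribute a prediction to its inputs. The sketch below shows one simple, model-agnostic approach: reset each feature to a baseline value and measure how much the prediction moves. This is an illustrative stand-in and not necessarily the method used in our component; the toy model and feature names are hypothetical.

```python
def occlusion_contributions(predict_proba, example: dict, baseline: dict) -> dict:
    """Change in predicted probability when each feature is reset to its baseline."""
    full = predict_proba(example)
    contributions = {}
    for name in example:
        occluded = dict(example, **{name: baseline[name]})
        # Positive value: this feature's actual value pushed the prediction up.
        contributions[name] = full - predict_proba(occluded)
    return contributions


def toy_model(x: dict) -> float:
    # Hypothetical stand-in scoring function, clipped to the range [0, 1].
    raw = 0.5 + 0.3 * x["loan_to_income"] - 0.2 * x["years_employed"] / 10
    return min(1.0, max(0.0, raw))


example = {"loan_to_income": 1.0, "years_employed": 5}
baseline = {"loan_to_income": 0.0, "years_employed": 0}
print(occlusion_contributions(toy_model, example, baseline))
```

This treats features independently, so it can mislead when features interact strongly; more principled attribution methods exist for that case.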
We’ve seen how you can create, update and display model cards. How should we share them with stakeholders?
Model Card Repository
Hosting a model card repository in an organisation can be a valuable way to ensure that Machine Learning models are documented and accessible to the necessary stakeholders.
By centralising model cards in a single location, organisations can promote transparency, consistency, and accountability, and ensure that models are developed and used ethically and responsibly.
One consideration, however, is that access to these model cards may need to be restricted in some cases. For example, if a model involves sensitive data or proprietary algorithms, it may be necessary to limit access to its model card to a smaller group of authorised users. By carefully managing access to model card repositories, organisations can make sure information is still shared with those who should have it.
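To illustrate the idea of restricted access, here is a minimal in-memory sketch of a repository where each card can be limited to certain user groups. The class, method and group names are hypothetical; a real repository would rely on the organisation’s existing identity and access management.

```python
class ModelCardRepository:
    """Hypothetical in-memory repository with per-card access groups."""

    def __init__(self):
        self._cards = {}

    def publish(self, name: str, card: dict, allowed_groups=None):
        # allowed_groups=None means the card is visible to everyone.
        groups = set(allowed_groups) if allowed_groups else None
        self._cards[name] = (card, groups)

    def get(self, name: str, user_groups: set) -> dict:
        card, allowed = self._cards[name]
        if allowed is not None and not (allowed & set(user_groups)):
            raise PermissionError(f"no access to model card '{name}'")
        return card

    def visible_to(self, user_groups: set) -> list:
        """Names of all cards the given groups are allowed to read."""
        return sorted(
            name for name, (_, allowed) in self._cards.items()
            if allowed is None or allowed & set(user_groups)
        )


repo = ModelCardRepository()
repo.publish("churn", {"owner": "marketing-ds"})
repo.publish("loan_default", {"owner": "risk-ds"}, allowed_groups={"risk-team"})
print(repo.visible_to({"marketing"}))
print(repo.visible_to({"risk-team"}))
```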
In summary, data and model cards are becoming increasingly important in the field of Machine Learning as a way to maintain transparency, reproducibility, and reliability in ML models. Together, a data card and a model card can provide a comprehensive picture of an ML model, helping to build trust and confidence in its use, as well as make it easier to identify areas of improvement and make informed decisions more responsibly.
Datatonic is Google Cloud’s Machine Learning Partner of the Year with a wealth of experience developing and deploying impactful Machine Learning models and MLOps Platform builds. Need help developing an ML model, or deploying your Machine Learning models fast?