Propensity Modelling at Scale using TensorFlow Estimators and Cloud AI Platform


Discover what a typical Machine Learning pipeline looks like for Propensity Modelling on Google Cloud Platform, using TensorFlow as our framework of choice. We take you through a full end-to-end example, going from exploring the data to deploying an omni-brand propensity model. 

If that sounds interesting, keep on reading! If you just want the code, you can find it on GitHub.

Why Propensity Modelling?

Personalisation has become the de facto standard in the contemporary retail sector. Businesses are beginning to fully utilise the wealth of data collected on their customers and unlock its potential, in order to provide better individual experiences and generate brand loyalty. A recent study discovered that 80% of consumers are more likely to make a purchase when their shopping experience is personalised. Generic marketing campaigns are no longer fitting the bill and are on their way out. A one-to-one shopping experience is the goal of the retailer and has become the expectation of the customer. 

Although a relatively old practice, Propensity Modelling is still widely used and is being re-purposed in new and innovative ways in light of recent Machine Learning advances. The objective of a Propensity Model is to predict the likelihood of a customer committing an action, such as making a purchase (the main focus of this tutorial), clicking on an advertisement, or accepting a promotional offer. It makes use of relevant features that capture customer and product attributes and online / offline behaviour, among others.

Propensity-based problems are often framed as classification tasks and modelled using classification Machine Learning algorithms, with the target variable being a binary outcome: 1 (committed action) or 0 (did not commit action). The output predictions from the model usually consist of the predicted class with an associated probability estimate, known as the propensity score, which measures the likelihood of committing the action. Propensity scores are the basic building blocks that are used to assemble appropriate audiences for targeted marketing campaigns.

Let’s dive into a particular use case now, to get some hands-on experience with what is involved in building a propensity model on Google Cloud!

Defining the problem

The open-source Acquire Valued Shoppers dataset (just over 20 GB uncompressed) was selected for this Propensity Modelling demonstration. The dataset includes almost 350 million rows of completely anonymised basket-level transactional data from over 300,000 shoppers, and information on promotional offers given to customers.

The individual tables are:

  • transactions (349,655,789 rows; 11 columns)
  • offers (37 rows; 6 columns)
  • history (160,057 rows; 7 columns)

The tables can be joined together as follows:

  • The transactions table can be joined to the history table by (id, chain)
  • The history table can be joined to the offers table by (offer)
  • The transactions table can be joined to the offers table by (category, brand, company)
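As a rough illustration of these join keys, here is a minimal Python sketch on toy rows. Only the key column names (id, chain, offer, category, brand, company) come from the dataset description; the values and the inner_join helper are hypothetical, and the actual pipeline performs its joins in BigQuery SQL.

```python
def inner_join(left, right, keys):
    """Inner-join two lists of dicts on the given key columns (hypothetical helper)."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    return [
        {**match, **row}
        for row in left
        for match in index.get(tuple(row[k] for k in keys), [])
    ]


# Toy rows with made-up values.
offers = [{"offer": 1, "category": 10, "company": 500, "brand": 7}]
history = [{"id": 42, "chain": 3, "offer": 1}]
transactions = [
    {"id": 42, "chain": 3, "category": 10, "company": 500, "brand": 7},
    {"id": 42, "chain": 3, "category": 11, "company": 501, "brand": 8},
]

tx_history = inner_join(transactions, history, ["id", "chain"])
history_offers = inner_join(history, offers, ["offer"])
tx_offers = inner_join(transactions, offers, ["category", "brand", "company"])
```

Both toy transactions match the history row on (id, chain), but only the first one matches the offer on (category, brand, company).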

Image 1

A user story was created to clearly define the question that we want to answer throughout this exercise.

“How can we leverage our transaction data to offer customers a more personalised shopping experience for our top 1000 brands?” – Major Online Retailer

Assuming that we are building a Propensity Model for an online retailer, we can highlight a few key areas that will positively impact the business:

  • Propensity scores can be fed into targeted and personalised marketing campaigns through multiple channels such as online advertising, email campaigns and personalisation of web experience
  • Better understanding of customers by surfacing relevant and personalised content leading to increased sales revenue and profit
  • An automated solution that can easily scale to include more customers and more brands

All about the data

As the Acquire Valued Shoppers dataset is rather large to manipulate in memory, we chose BigQuery for data pre-processing because it can query very large tables quickly without us having to manage any infrastructure.

A BigQuery pipeline was created which leveraged the bq command-line interface. A bash script controls the entire preprocessing, feature engineering, train / development / test set creation, baseline calculation, and sampling by executing several SQL queries in sequence.

The full pipeline can be found at processing-pipeline/.

Let’s look at each of the steps in more detail.


The data was saved to a Google Cloud Storage (GCS) bucket and imported into BigQuery. The transactions table is richest in relevant information, so it was the primary source. A summary of this table is provided below, which gives a good overview of the transactional volumes contained within the dataset.

Total Product Transactions: 349,655,789
Distinct Customers: 311,541
Time Period: 02-03-2012 to 28-07-2013
Distinct Store Chains: 134
Distinct Departments: 83
Distinct Product Categories: 836
Distinct Companies: 32,773
Distinct Brands: 35,689

The tables in BigQuery were connected to Tableau in order to understand the data distributions and trends, and gain key insights that could potentially feed into feature creation and the modelling process. A few of these are presented below.

A trend of the monthly transaction volumes is shown for the entire time period, and the percentage difference with respect to the previous month is coloured in green (positive difference) or red (negative difference). The transaction volume is relatively stable and increasing over time until it peaks in December 2012, shortly followed by a steep decline until July 2013. The reasons for this are unclear, as there is not much additional information provided within the dataset.

Image 2

Focussing on customers now, we can observe the number of customers binned by transaction volume. The majority of customers have made between 1,001 and 2,500 transactions, which could imply that the customers in this dataset are not individuals but corporate entities. There is even a small subset of customers (0.17%) who have made more than 5,000 transactions in the space of just over a year.

Image 3

Looking at the transaction volumes associated with brands, out of a total of 35,689 unique brands, 31% of brands have only ever had up to 10 transactions associated with them, and 51% have only ever had up to 100 transactions. On the other extreme, 4% of brands have more than 25,000 transactions. Therefore, it is understandable that the online retailer is primarily interested in the top 1000 brands.

Image 4

Finally, looking at the history table, we notice that 27% of customers, when incentivized with a promotion, made a repeat purchase. This is interesting, as it reflects price sensitivity and brand affinity amongst customers, which is a useful insight for Propensity Modelling.

Image 5

Let’s quantify the behaviour of customers with multiple purchases. The graph displays up to 10 repeat trips and shows that over half of the customers only made a repeat purchase once, and a small proportion continued to make repeat purchases. This highlights how customer loyalty can be built up.

Image 6


After some initial exploration, we set about assessing the quality of the data. This led to the development of the following preprocessing steps that were incorporated into our BigQuery pipeline:

  1. Missing values for productmeasure were imputed with “Unknown”
  2. Rows where purchasequantity = 0 were removed
  3. A binary variable was created to flag whether a product transaction represented a return, by checking that both purchaseamount and purchasequantity were negative
  4. Where purchaseamount or purchasequantity was negative on its own, without the transaction being a return, the value was converted to its absolute value
  5. The transactions are at the individual product level, so a pseudo-transaction ID was created by concatenating the customer ID, date and store chain
  6. Similarly, as the most granular data point in relation to product is the brand, a pseudo-product ID was created by concatenating the brand, product size and product measure, which resulted in 143,075 unique pseudo-product IDs
    1. A remaining limitation is that the price still varies widely within a single pseudo-product ID
  7. Assuming that purchaseamount = product price * purchasequantity, a separate field was created for the product price per unit
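To make the rules above concrete, here is a minimal Python sketch that applies them to a single product-level row. The input column names follow the Kaggle transactions table; the output field names (is_return, transaction_id, product_id, unit_price) and the "_" separator are our own assumptions, and the real pipeline implements these steps as BigQuery SQL.

```python
def preprocess(row):
    """Apply the cleaning rules to one transaction row (a dict). Sketch only."""
    if row["purchasequantity"] == 0:
        return None  # step 2: drop rows with zero quantity

    out = dict(row)
    # Step 1: impute missing product measure.
    out["productmeasure"] = row.get("productmeasure") or "Unknown"
    # Step 3: a return is flagged only when amount AND quantity are negative.
    is_return = row["purchaseamount"] < 0 and row["purchasequantity"] < 0
    out["is_return"] = int(is_return)
    # Step 4: a lone negative value on a non-return is converted to its absolute value.
    if not is_return:
        out["purchaseamount"] = abs(row["purchaseamount"])
        out["purchasequantity"] = abs(row["purchasequantity"])
    # Step 5: pseudo-transaction ID from customer ID, date and store chain.
    out["transaction_id"] = f'{row["id"]}_{row["date"]}_{row["chain"]}'
    # Step 6: pseudo-product ID from brand, product size and product measure.
    out["product_id"] = f'{row["brand"]}_{row["productsize"]}_{out["productmeasure"]}'
    # Step 7: unit price, assuming purchaseamount = price * quantity.
    out["unit_price"] = out["purchaseamount"] / out["purchasequantity"]
    return out


example = preprocess({
    "id": 86246, "chain": 205, "date": "2012-03-02", "brand": 5072,
    "productsize": 12.0, "productmeasure": None,
    "purchasequantity": 1, "purchaseamount": -3.0,
})
```

In the example row, only the amount is negative, so the row is not treated as a return and the amount becomes 3.0.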


Next, March 2013 was chosen as the target month due to its substantial coverage of customers and brands, and the availability of at least 12 months of data prior to it.

The top 1000 brands by number of purchases before March 2013, and the customers who have shopped one of those brands at least once were extracted. They were cross-joined to produce every possible combination, which served as the base for the feature table. The binary target variable was set to 1 if the customer had shopped the brand in March 2013 (excluding returns), and to 0 if they did not, yielding a class distribution where the positive instances accounted for 3.45% of the total.
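The cross-join and labelling step can be sketched in a few lines of Python. The function and field names here are hypothetical; the actual feature table is built with SQL in BigQuery.

```python
from itertools import product


def build_labels(customers, brands, target_purchases):
    """Cross-join customers and brands and attach a binary target.

    target_purchases is a set of (customer_id, brand) pairs with at least one
    non-return purchase in the target month.
    """
    return [
        {"customer_id": c, "brand": b, "target": int((c, b) in target_purchases)}
        for c, b in product(customers, brands)
    ]


rows = build_labels(
    customers=["c1", "c2"],
    brands=["b1", "b2", "b3"],
    target_purchases={("c1", "b2")},
)
positive_rate = sum(r["target"] for r in rows) / len(rows)
```

Because every customer is paired with every brand, positives are rare by construction, which is why the real table ends up so imbalanced.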

The features that serve as input to the Propensity Model were chosen to capture the transaction behaviour of the customer relative to the brand, as well as across brands and categories. As displayed in the table below, a large number of features were created and made available for feature selection during training to find the optimal set.

This is by no means a comprehensive set; many more features, such as brand loyalty, price sensitivity and the average time between transactions, could also have been produced.

Feature | Type | Description
customer_id | str | Unique identifier to represent a customer
brand | str | Unique identifier to represent a brand
promo_sensitive | int | If the customer made a repeat visit after being shown a promotion (from history table)
brand_seasonality | int | The number of months in the last 12 months that the customer has shopped the specific brand
aov_{1M/3M/6M/12M} | int | Average Order Value – the total sale amount divided by the number of transactions for the brand in the time frame
total_sale_quantity_{1M/3M/6M/12M} | int | The total sale quantity for the brand in the time frame based on purchasequantity (excluding returns)
max_sale_{amount/quantity}_{1M/3M/6M/12M} | int | The maximum sale amount / quantity in a single instance for the brand in the time frame based on purchaseamount or purchasequantity respectively (excluding returns)
total_returned_items_{1M/3M/6M/12M} | int | The total number of product transactions representing a return for the brand in the time frame
distinct_products_{1M/3M/6M/12M} | int | The total number of distinct products purchased from the brand in the time frame based on product ID
distinct_chains_{1M/3M/6M/12M} | int | The total number of distinct store chains that the customer bought the brand in the time frame
distinct_category_{1M/3M/6M/12M} | int | The total number of distinct categories of the brand that the customer purchased from in the time frame
distinct_days_shopped_{1M/3M/6M/12M} | int | The total number of separate occasions the customer purchased the brand in the time frame
overall_aov_{1M/3M/6M/12M} | int | Average Order Value – the total sale amount divided by the number of transactions across all brands in the time frame
overall_sale_quantity_{1M/3M/6M/12M} | int | The total sale quantity across all brands in the time frame based on purchasequantity (excluding returns)
overall_returned_items_{1M/3M/6M/12M} | int | The total number of product transactions representing a return across all brands in the time frame
overall_distinct_products_{1M/3M/6M/12M} | int | The total number of distinct products purchased across all brands in the time frame based on product ID
overall_distinct_brands_{1M/3M/6M/12M} | int | The total number of distinct brands the customer purchased from in the time frame
overall_distinct_chains_{1M/3M/6M/12M} | int | The total number of distinct store chains the customer purchased from across all brands in the time frame
overall_distinct_category_{1M/3M/6M/12M} | int | The total number of distinct categories the customer purchased from across all brands in the time frame
overall_distinct_days_shopped_{1M/3M/6M/12M} | int | The total number of separate occasions the customer transacted across all brands in the time frame
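To show how the windowed features could be derived, here is a small Python sketch for two of them, aov and distinct_days_shopped, for one customer and one brand. It assumes that the 1M window means the calendar month before the target month, the 3M window the three months before it, and so on; that window definition, and the helper names, are our assumptions rather than something taken from the pipeline.

```python
from datetime import date

TARGET_MONTH = date(2013, 3, 1)
WINDOWS = {"1M": 1, "3M": 3, "6M": 6, "12M": 12}


def months_before_target(d):
    """Whole calendar months between a date and the target month."""
    return (TARGET_MONTH.year - d.year) * 12 + (TARGET_MONTH.month - d.month)


def brand_features(rows):
    """rows: non-return product transactions of one customer for one brand."""
    features = {}
    for suffix, n_months in WINDOWS.items():
        in_window = [r for r in rows if 1 <= months_before_target(r["date"]) <= n_months]
        orders = {r["transaction_id"] for r in in_window}
        total = sum(r["purchaseamount"] for r in in_window)
        # Average Order Value: total sale amount / number of pseudo-transactions.
        features[f"aov_{suffix}"] = total / len(orders) if orders else 0.0
        features[f"distinct_days_shopped_{suffix}"] = len({r["date"] for r in in_window})
    return features


features = brand_features([
    {"date": date(2013, 2, 10), "transaction_id": "a", "purchaseamount": 10.0},
    {"date": date(2013, 2, 10), "transaction_id": "a", "purchaseamount": 6.0},
    {"date": date(2013, 1, 5), "transaction_id": "b", "purchaseamount": 4.0},
    {"date": date(2013, 3, 2), "transaction_id": "c", "purchaseamount": 100.0},
])
```

The March 2013 row is ignored because it falls inside the target month, which keeps the label from leaking into the features.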

Once the feature table was created, it was randomly split into train / development / test sets with 80% : 10% : 10% respectively, and exported to GCS. The schema was extracted from BigQuery as a JSON file for use in the training package later:

bq show --schema --format=prettyjson --project=<GCP PROJECT> <BQ DATASET>.<TABLE> > schema.json
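For the 80% : 10% : 10% split itself, one reproducible option is to assign each row to a set based on a hash of its key. The sketch below is a substitute technique for illustration, not necessarily what the pipeline does (the post only says the split was random), and the assign_split name is hypothetical.

```python
import hashlib


def assign_split(customer_id, brand):
    """Deterministically map a (customer, brand) row to train / dev / test."""
    key = f"{customer_id}|{brand}".encode("utf-8")
    bucket = int(hashlib.md5(key).hexdigest(), 16) % 10
    if bucket < 8:
        return "train"  # buckets 0-7: roughly 80%
    if bucket == 8:
        return "dev"    # bucket 8: roughly 10%
    return "test"       # bucket 9: roughly 10%


counts = {"train": 0, "dev": 0, "test": 0}
for i in range(10000):
    counts[assign_split(f"customer_{i}", "brand_1")] += 1
```

A hash-based assignment gives the same split every time the pipeline is re-run, which a plain random draw without a fixed seed would not.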


A randomly sampled dataset was created with an approximate 1:1 ratio to balance the classes in a highly imbalanced training set, leaving approximately 18 million rows for training. This simple sampling method was applied for demonstration purposes, but there are many other methods that could be more appropriate for a different use case. Whilst the training data was sampled, the development and test sets were kept as is to represent the true distribution. This also adheres to the general rule of thumb that the distributions for these two sets should be identical.
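A minimal sketch of this kind of balancing is shown below: all positives are kept and an equal number of negatives is drawn at random. This exact-count variant is an assumption made for illustration; the pipeline only targets an approximate 1:1 ratio and does so in BigQuery.

```python
import random


def downsample_negatives(rows, seed=0):
    """Keep all positives and an equally sized random sample of negatives."""
    rng = random.Random(seed)
    positives = [r for r in rows if r["target"] == 1]
    negatives = [r for r in rows if r["target"] == 0]
    kept = rng.sample(negatives, min(len(positives), len(negatives)))
    balanced = positives + kept
    rng.shuffle(balanced)  # avoid all positives appearing first
    return balanced


training_rows = [{"row": i, "target": int(i < 5)} for i in range(100)]
balanced_rows = downsample_negatives(training_rows)
```

Only the training rows should be passed through a function like this; the development and test sets stay untouched.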


A simple yet effective baseline was constructed to act as a benchmark for model performance, with the following logic – if a customer purchased a given brand in February 2013, then they would also purchase that same brand in the following month, March 2013.
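Expressed as code, the baseline is a single lookup. The sketch below uses toy data and hypothetical names.

```python
def baseline_predict(pairs, february_purchases):
    """Predict 1 for every (customer_id, brand) pair bought in February 2013."""
    return [int(pair in february_purchases) for pair in pairs]


pairs = [("c1", "b1"), ("c1", "b2"), ("c2", "b1")]
february_purchases = {("c1", "b1"), ("c2", "b1")}
march_labels = [1, 0, 0]  # toy ground truth for March 2013

predictions = baseline_predict(pairs, february_purchases)
correct = sum(p == y for p, y in zip(predictions, march_labels))
```

Any trained model should beat this rule on the development and test sets to justify its added complexity.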


Once you have cloned the repository and set up your GCP environment, you can execute the preprocessing and feature engineering pipeline by following these steps:

  1. Download the transactions and history data from Kaggle
  2. Create a GCS bucket and upload the CSV files
  3. Create a dataset in BigQuery (in this case, propensity_dataset)
  4. Import the transactions and history datasets from GCS into the specified BigQuery dataset
  5. Execute the bash script with the GCP project and BigQuery dataset as command-line arguments: bash <GCP PROJECT> <BQ DATASET>
  6. Export the tables to GCS

Modelling, evaluating and serving

The TensorFlow code for model training can be found in package/trainer/. The AI Platform bash scripts are in the parent directory ai-platform/.


Three pre-made estimators were trialled, but the initialize_estimator function can easily be extended to include additional estimators.

  • DNNClassifier – a feed-forward neural network consisting of fully connected layers. Except for the input nodes, each node is a neuron that uses a non-linear activation function, which enables it to distinguish data that is not linearly separable. Training is performed via backpropagation. 
  • DNNLinearCombinedClassifier – also known as a Wide & Deep model which combines a Linear model (Wide model) for memorisation, that can quickly ascertain broad patterns in the data, with a Deep Neural Network (Deep model) for generalisation, that slowly learns finer patterns. The combination of both methods enhances the overall score in several use cases, for example, a similar model is used to recommend apps on the Google Play store.
  • BoostedTreesClassifier – TensorFlow implementation of the gradient boosting decision tree algorithm. Gradient boosting is an approach where new models are created one at a time that predict the residuals or errors of prior models and then added together to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.

Feature columns

Next, we define feature_columns to use with our three estimators. Our build_feature_columns function returns feature lists whereby the dense features are fed into the DNNClassifier and the deep part of the DNNLinearCombinedClassifier, and the bucketized / categorical columns are fed into the BoostedTreesClassifier and the wide part of the DNNLinearCombinedClassifier. This is just one way of arbitrarily organising the features and there is plenty of room for further experimentation.

The boundaries for the bucketized columns were set by observing percentiles for the features in BigQuery. An improvement on this could be to use TensorFlow Transform (TFT) for operations that require a full pass over the dataset, like bucketization.
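As an illustration of percentile-based boundaries, the sketch below derives quartile cut points from a sample of values and assigns a bucket index with the same convention as a bucketized column, where each bucket includes its lower boundary. It uses only the Python standard library; in the pipeline the percentiles were read from BigQuery, and the function names here are hypothetical.

```python
import bisect
import statistics


def percentile_boundaries(values, n_buckets=4):
    """Cut points splitting the values into n_buckets roughly equal groups."""
    return sorted(set(statistics.quantiles(values, n=n_buckets)))


def bucketize(value, boundaries):
    """Bucket index: (-inf, b0) -> 0, [b0, b1) -> 1, ..., [b_last, +inf) -> len(boundaries)."""
    return bisect.bisect_right(boundaries, value)


sample_values = list(range(1, 101))  # toy feature values
boundaries = percentile_boundaries(sample_values)
bucket_ids = [bucketize(v, boundaries) for v in (1, 26, 60, 100)]
```

Deduplicating the cut points matters for skewed features, where several percentiles can collapse onto the same value.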

Feature selection

With 72 features in total, a --feature_selec argument is available for the user to supply a list of up to 7 options, each of which removes the features whose names contain the corresponding term in the dictionary below.

feature_dict = {
    1: "returned",
    2: "chains",
    3: "max_sale_quantity",
    4: "overall",
    5: "12m",
    6: "6m",
    7: "3m"
}

For example, if --feature_selec = [1, 2], then feature columns containing the terms “returned” or “chains” will be excluded from the feature set (a total of 16 features). This is useful for finding a simpler and more optimal feature set, and the option can also be tuned within a hyperparameter tuning job.
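The filtering logic can be sketched as a substring match on feature names. In the snippet below, the 72 names are reconstructed from the feature table earlier in the post and lower-cased, which is our assumption about how the columns are actually named; select_features is a hypothetical helper, not the function used in the trainer.

```python
FEATURE_DICT = {
    1: "returned", 2: "chains", 3: "max_sale_quantity",
    4: "overall", 5: "12m", 6: "6m", 7: "3m",
}

WINDOWS = ["1m", "3m", "6m", "12m"]
BRAND_FAMILIES = [
    "aov", "total_sale_quantity", "max_sale_amount", "max_sale_quantity",
    "total_returned_items", "distinct_products", "distinct_chains",
    "distinct_category", "distinct_days_shopped",
]
OVERALL_FAMILIES = [
    "overall_aov", "overall_sale_quantity", "overall_returned_items",
    "overall_distinct_products", "overall_distinct_brands",
    "overall_distinct_chains", "overall_distinct_category",
    "overall_distinct_days_shopped",
]
ALL_FEATURES = ["customer_id", "brand", "promo_sensitive", "brand_seasonality"] + [
    f"{family}_{window}"
    for family in BRAND_FAMILIES + OVERALL_FAMILIES
    for window in WINDOWS
]


def select_features(feature_names, feature_selec):
    """Drop every feature whose name contains one of the selected terms."""
    terms = [FEATURE_DICT[option] for option in feature_selec]
    return [
        name for name in feature_names
        if not any(term in name.lower() for term in terms)
    ]


remaining = select_features(ALL_FEATURES, [1, 2])
removed = sorted(set(ALL_FEATURES) - set(remaining))
```

With options 1 and 2, the four returned-items and four chains families are dropped across both the brand-level and overall feature groups.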


The evaluation metrics selected for assessing model performance are listed below, and are common across classification tasks:

  1. Accuracy – total number of correct predictions divided by the total number of instances
  2. Precision – number of correctly predicted positive instances divided by the number of all instances predicted as positive
  3. Recall – number of correctly predicted positive instances divided by the number of all actual positive instances
  4. F1-score – harmonic mean of precision and recall
  5. Area Under Curve (AUC) – probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance

The values for these are also exported locally / to GCS as a JSON file, along with the model settings and parameters, to help keep track of model trials and results instead of going through Stackdriver logs – see the export_train_results and export_eval_results functions.
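For reference, the five metrics listed above can be computed from labels and propensity scores with plain Python. This is a sketch for intuition, with a hypothetical function name and an assumed 0.5 decision threshold; the trainer itself relies on TensorFlow's built-in metrics.

```python
def classification_metrics(labels, scores, threshold=0.5):
    """Accuracy, precision, recall, F1 and AUC for binary labels and scores."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # AUC as the probability that a random positive outranks a random negative
    # (ties count as half a win).
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg)) if pos and neg else float("nan")

    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "auc": auc,
    }


metrics = classification_metrics(labels=[1, 0, 1, 0], scores=[0.9, 0.6, 0.4, 0.1])
```

With only 3.45% positives in the untouched development and test sets, accuracy alone is misleading, so precision, recall and AUC carry most of the signal.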

Training on AI Platform

The Python TensorFlow code defines:

  • Data reading
  • Mode: train, predict or evaluate
  • Model architecture for each of the three estimators
  • How feature columns are represented (one-hot encoding, label encoding, embeddings, etc.)
  • The training process / parameters and metrics used
  • How the model is saved (how often it is checkpointed)

The bash scripts provided in ai-platform/ contain the parameters for the best performing model from our experiments, which was a DNNLinearCombinedClassifier estimator.

  1. The train / dev / test data, schema and brand vocabulary should be exported to GCS from BigQuery
  2. The AI Platform training job can be submitted by running the training script. Local model training, prediction and evaluation are also possible; see the documentation for more information.
    1. The model type is specified directly in the bash script
    2. A configuration file can be added to the script, either for standard model training with custom infrastructure information, or for hyperparameter tuning, in which case it also contains the parameter search spaces
    3. Additional arguments are defined in the trainer code
    4. AI Platform automatically packages up the Python code, pushes it to GCS and then distributes it to the workers; a file is included to download the necessary dependencies
  3. Stackdriver Logging can be used to monitor the AI Platform training job and catch any errors
  4. The training job can be monitored in TensorBoard by pointing to the GCS directory that contains the model checkpoints:
    1. tensorboard --logdir=gs://<GCS BUCKET>/models/<MODEL TYPE>/<JOB NAME>/model
  5. A similar script runs an evaluation on the test set by specifying --mode=evaluate (disguised as an AI Platform training job) and the correct directory containing the model checkpoints. The same parameters supplied during training should also be supplied during evaluation, as the application re-builds the model using the saved weights.
  6. The signature (inputs / outputs) of the saved model can be observed using the command below:
saved_model_cli show --dir gs://<GCS BUCKET>/models/<MODEL TYPE>/<JOB NAME>/serving/<SERVING ID> --tag_set serve --signature_def predict

Hyperparameter tuning

Anyone who has trained Machine Learning models before knows that there is an art to it. There are many free parameters and choices to make, and hyperparameter tuning is a useful method that lets us explore the search space in a more disciplined way. Hyperparameter tuning on AI Platform makes use of the power of GCP by parallelizing the execution of jobs whenever possible. It trains the model using different configurations of the input parameters in a smart way using techniques like Bayesian optimization. Given enough time, it can produce near optimal values for the model hyperparameters, which in turn maximizes the model’s predictive power. 

The hyperparameter search space is defined in hptuning_config.yaml only for the DNNClassifier and DNNLinearCombinedClassifier estimators, but this can be easily adapted for BoostedTreesClassifier. It covers:

  • batch size
  • learning rate
  • dropout
  • optimizer
  • hidden units
  • feature selection

As well as the parameters, additional information is included in the configuration file about the number of trials, number of trials to run in parallel, the metric to maximise / minimise, and whether to enable early stopping.
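To give a feel for the shape of such a configuration, the sketch below builds an equivalent structure as a Python dictionary and prints it as JSON. The field names follow the AI Platform Training hyperparameter specification as we understand it; every parameter name, range and trial count shown is a made-up placeholder, not the contents of the hptuning_config.yaml in the repository.

```python
import json

# Hypothetical values throughout; only the overall structure is the point.
hptuning_config = {
    "trainingInput": {
        "hyperparameters": {
            "goal": "MAXIMIZE",                  # maximise or minimise the metric
            "hyperparameterMetricTag": "auc",    # metric reported by the trainer
            "maxTrials": 20,                     # total number of trials
            "maxParallelTrials": 4,              # trials run concurrently
            "enableTrialEarlyStopping": True,
            "params": [
                {
                    "parameterName": "learning-rate",
                    "type": "DOUBLE",
                    "minValue": 0.0001,
                    "maxValue": 0.1,
                    "scaleType": "UNIT_LOG_SCALE",
                },
                {
                    "parameterName": "batch-size",
                    "type": "DISCRETE",
                    "discreteValues": [128, 256, 512],
                },
                {
                    "parameterName": "optimizer",
                    "type": "CATEGORICAL",
                    "categoricalValues": ["Adam", "Adagrad"],
                },
            ],
        }
    }
}

print(json.dumps(hptuning_config, indent=2))
```

Running fewer trials in parallel generally lets the Bayesian search learn more from earlier trials, at the cost of a longer wall-clock time.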

A hyperparameter tuning job can be executed on AI Platform by running the training script with --config=hptuning_config.yaml included.

Image 7

Model serving

Once you have trained a model that meets your performance requirements, you can deploy it for serving using the deployment script. The script creates a model name, e.g. “WD” (Wide and Deep), and then a model version, e.g. “v1”.


Once deployed, the model can be used for both batch and online predictions. An extender, tf.contrib.estimator.forward_features, is used to ensure we can retrieve the customer_id and brand fields directly in the predictions for ease of interpretation. Copies of the two fields are made in the input_fn and serving_input_receiver_fn using tf.identity and forwarded through to the output, whilst the original versions are manipulated in build_feature_columns.

For batch prediction, the script can be executed with the correct model name and version as command-line arguments, provided the instances for scoring (e.g. batch_test.json) are present in GCS.


The predictions themselves are output to GCS as text files with newline delimited JSON blobs, along with an error statistics file, in case of any issues.
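Reading such an output file back is straightforward. The sketch below parses newline-delimited JSON into (customer_id, brand, propensity) tuples. It assumes each record carries the forwarded customer_id and brand fields plus a probabilities list whose second entry is the positive-class score; the exact output keys depend on the exported signature, so treat them as assumptions.

```python
import json


def parse_predictions(lines):
    """Turn newline-delimited JSON predictions into (customer, brand, score) tuples."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        rows.append((record["customer_id"], record["brand"], record["probabilities"][1]))
    return rows


sample_output = [
    '{"customer_id": "c1", "brand": "b1", "probabilities": [0.2, 0.8]}',
    "",
    '{"customer_id": "c2", "brand": "b1", "probabilities": [0.9, 0.1]}',
]
scores = parse_predictions(sample_output)
```

In practice the lines would come from the prediction files downloaded from GCS rather than from an in-memory list.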

For online prediction, a single prediction is generated at a time. The test JSON file online_test.json can be used as the input.



Image 8


In a real world scenario, once you have used your trained model to generate propensity scores on new data, you can ingest these scores back into BigQuery to create audiences for targeted marketing campaigns. 
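As a simple illustration of audience creation, the sketch below groups scored (customer_id, brand, propensity) tuples by brand, keeps customers above a threshold and ranks them by score. The function name, the 0.5 threshold and the optional size cap are hypothetical choices; in production this would more likely be a SQL query over the scores table in BigQuery.

```python
from collections import defaultdict


def build_audiences(scores, threshold=0.5, max_size=None):
    """Return {brand: [customer_id, ...]} ranked by descending propensity."""
    by_brand = defaultdict(list)
    for customer_id, brand, propensity in scores:
        if propensity >= threshold:
            by_brand[brand].append((propensity, customer_id))

    audiences = {}
    for brand, members in by_brand.items():
        members.sort(reverse=True)  # highest propensity first
        audiences[brand] = [customer_id for _, customer_id in members[:max_size]]
    return audiences


scored = [
    ("c1", "b1", 0.9),
    ("c2", "b1", 0.7),
    ("c3", "b1", 0.2),
    ("c1", "b2", 0.6),
]
audiences = build_audiences(scored)
top_one = build_audiences(scored, max_size=1)
```

The threshold trades audience size against expected conversion rate, so it is usually set per campaign rather than fixed globally.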


In this post we walked you through what a typical Machine Learning pipeline would look like on Google Cloud Platform for Propensity Modelling using TensorFlow Estimators. We leveraged technologies like BigQuery for preprocessing and feature engineering, and AI Platform for model training, optimisation and serving. We managed to build an omni-brand model capable of generating millions of propensity scores (i.e. the likelihood of a customer shopping each of the top 1000 brands in the next month), which can be used for marketing activation.

The architecture with respect to the core components is shown below. We hope that it has given you a good overview of how Propensity Modelling, and almost any other modelling task, can be done at scale on GCP.

Image 9

The code used throughout this exercise is available on GitHub.
