Building a BQML Linear Model with Looker Using Public Data
Author: Christelle Xu, Business Analyst
In my last article I introduced you to BQML and its implementation and integration with Looker. My goal this time is to walk you through a specific example to help solidify those concepts with a tangible use case.
I’ll be basing my walkthrough on Google’s tutorial “Using BigQuery ML to Predict Birth Weight”, which you can find here. Much of the BQML syntax stays the same, given that integrating this model in Looker simply involves building a Looker wrapper around BQML code. I’ll be building dimensions from the results to illustrate the visualisation capabilities available once BQML is integrated in Looker. I covered much of the technical justification for the integration in my previous blog, so please do use it as a reference whilst implementing the model below.
By the end of this tutorial you’ll be able to build a dashboard like this to help facilitate the evaluation of your model:
We’ll be using the well-known natality data set to create a model that predicts a baby’s birth weight. The data is publicly available through BigQuery, so please follow along.
Prerequisites
If you haven’t read part one of this series, I urge you to at least have it open. You can find it here. I introduce you to how BQML integrates with Looker and walk you through the logic behind the syntax.
I also assume you have basic Looker LookML experience and BigQuery knowledge.
Accessing The Data
When running a BQML model, you need to make sure that both your data and your model are sitting in the same location. In this case, however, Looker will be writing the model to its scratch schema, a process you’ll define when creating the necessary persistent derived table (PDT).
Given that the natality data set is sitting in the US, you’ll need to make sure that the connection you create in Looker is also in the US. That may mean creating a new connection where your host connection is US based.
By default, you’ll have access to public data sets via your service account in BigQuery. You should therefore also have access to these data sets in Looker, but you’ll want to make sure that that access hasn’t been manually turned off or removed.
The Setup
Once the connection is in place, you can start pulling through the fields that will be used in your model. Best practice is to create an autogenerated file by creating a view from table, selecting ‘natality’ under publicdata:samples. Nest this autogenerated view under the ‘ag’ folder, and rename the view as ag_natality for clarity.
This will pull through all fields in your underlying table as dimensions, with Looker inferring the underlying data type for each. Don’t make changes to this file.
Any changes will instead be made in an extended file, which includes the autogenerated file and overrides it with whatever you specify. This is important if your underlying schema changes often: you can always regenerate the underlying autogenerated view while keeping your business logic in the extended view.
This brings us to our next step: extending the autogenerated view into an extended view where we make our business logic changes. Namely, you’ll notice in the Google BQML tutorial that they cast mother_race as a string. This is because mother_race is a categorical variable but is saved as a numeric type. All this requires in Looker is a simple cast.
This is perhaps one of the greatest strengths of defining your BQML model in Looker. Once you’ve extended your file and defined the properties of your dimensions, you’re left with a single source of truth for your data. Your team can now work from the same data source using the same data definitions.
The next step is to create a native derived table (NDT) that contains the inputs we want to use. Keep in mind that in order for this NDT to work, the explore source must be exposed in the model.
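For reference, a minimal sketch of that exposure in the model file is below. The from: and hidden: parameters are my own assumptions; all that strictly matters is that an explore named ag_natality exists for the explore_source below to point at.

explore: ag_natality {
  from: natality   # point the explore at the extended view sketched above
  hidden: yes      # optional: keep this input explore out of the Explore menu
}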
Here we’ve called the inputs we want to include in our linear model, the same ones called in the BQML tutorial, but we can just as easily add or remove inputs as we see fit:
view: natality_input {
  derived_table: {
    explore_source: ag_natality {
      column: weight_pounds {}
      column: is_male {}
      column: gestation_weeks {}
      column: mother_age {}
      column: mother_race {}
    }
  }
}
Having built out our derived table, in this case called natality_input, we can reference and reuse these inputs without having to type out the required fields each time they’re needed. This mitigates human error and facilitates consistency.
Building out the Linear Model
To build out the linear model in Looker, we’d wrap the LookML syntax around the BQML query as follows:
view: natality_linear_reg {
  derived_table: {
    datagroup_trigger: bqml_datagroup
    sql_create:
      CREATE MODEL ${SQL_TABLE_NAME}
      OPTIONS (
        model_type = 'LINEAR_REG',
        input_label_cols = ['weight_pounds'],
        optimize_strategy = 'BATCH_GRADIENT_DESCENT'
      ) AS
      SELECT
        *
      FROM ${natality_input.SQL_TABLE_NAME}
      WHERE weight_pounds IS NOT NULL
        AND RAND() < 0.001 ;;
  }
}
A few notes on the syntax:
- datagroup_trigger: Where we define the caching policy. The model is saved in the scratch schema, and this parameter determines how often the model is regenerated (the datagroup itself is sketched just after this list).
- Under where, we’re excluding rows where weight is NULL and randomly selecting a small sample of rows to train our model with.
- optimize_strategy: An optional field which I’ve added for the sake of visualisation. In the example Google provides, the normal_equation (least squares) optimisation strategy is chosen automatically, so only one iteration is required to converge to the final model, which is logical. The option I purposefully chose, batch_gradient_descent, is definitely overkill, but for the sake of illustrating some of Looker’s visualisation capabilities I’ve opted for it, since it gives me more than one iteration.
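The datagroup referenced above lives in the model file. A minimal sketch, assuming you simply want the model retrained once a day (the name bqml_datagroup comes from the view above; the cadence is my own choice):

datagroup: bqml_datagroup {
  sql_trigger: SELECT CURRENT_DATE() ;;   # fires when the date changes, i.e. once a day
  max_cache_age: "24 hours"
}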
Evaluating the model
BQML offers a number of ways to evaluate the model produced. One advantage of evaluating the model through Looker is that the fields returned by these functions can all be visualised in one place.
We’ll first cover ml.TRAINING_INFO. The ml.TRAINING_INFO function returns information about the training iterations of a given model. The syntax is, once more, a Looker wrapper around BQML code:
view: natality_linear_reg_training {
  derived_table: {
    sql:
      SELECT
        *
      FROM ml.TRAINING_INFO(
        MODEL ${natality_linear_reg.SQL_TABLE_NAME}) ;;
  }
  dimension: training_run {type: number}
  dimension: iteration {type: number}
  dimension: loss {type: number}
  dimension: eval_loss {type: number}
  dimension: duration_s {
    label: "Duration (s)"
    type: number
    sql: ${TABLE}.duration_ms / 1000.00 ;;
  }
  dimension: learning_rate {type: number}
  measure: total_loss {
    type: sum
    sql: ${loss} ;;
  }
  measure: total_duration {
    type: sum
    sql: ${duration_s} ;;
  }
  measure: total_iterations {
    type: max
    sql: ${iteration} ;;
  }
  measure: average_learning_rate {
    type: average
    sql: ${learning_rate} ;;
  }
}
A few notes on the syntax:
- The dimensions created are a subset of fields available for evaluation. Pull through as many or as few as you feel are relevant.
- Feel free to use the full data manipulation capabilities of dimensions in Looker, whether that means bucketing, casting, value format changes, and so on (see the sketch just after this list).
- You’ll have to expose this view as an explore to be able to see and use the dimensions and measures you’ve created.
- BQML returns different metrics depending on the model being run.
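As an example of the kind of tweaks mentioned above, here is a sketch of a bucketing dimension you could add to the natality_linear_reg_training view; the tier boundaries are my own choice, and a value_format_name on an existing dimension works just as easily.

  # Bucket eval_loss into tiers so training runs can be compared at a glance
  dimension: eval_loss_tier {
    type: tier
    tiers: [1, 2, 5, 10]
    style: relational
    sql: ${eval_loss} ;;
  }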
Next we’ll take a look at ml.EVALUATE. The ml.EVALUATE function returns metrics that reflect the model’s predictive performance. The syntax is as follows:
view: natality_linear_reg_eval {
  derived_table: {
    sql:
      SELECT
        *
      FROM ml.EVALUATE(
        MODEL ${natality_linear_reg.SQL_TABLE_NAME},
        (SELECT
          *
        FROM ${natality_input.SQL_TABLE_NAME})) ;;
  }
  dimension: mean_absolute_error {type: number}
  dimension: mean_squared_error {type: number}
  dimension: mean_squared_log_error {type: number}
  dimension: median_absolute_error {type: number}
  dimension: r2_score {type: number}
  dimension: explained_variance {type: number}
}
The notes made on the ml.TRAINING_INFO function are relevant here as well. It is especially worth highlighting that the metrics returned by ml.EVALUATE will vary based on the type of model being run.
With these metrics in place, you can start thinking about how they can be exposed as explores, and perhaps even how they can be visualised within a dashboard like the one at the top of this blog.
Predicting
Once we have our model all together, we can run predictions and create dimensions and measures to deliver predictions to stakeholders alongside other key metrics.
You’ll notice below that, alongside the fields that are normally pulled through by ml.PREDICT, I’ve created derived dimensions that lend themselves to visualisation.
view: natality_prediction {
  derived_table: {
    sql:
      SELECT
        *
      FROM ml.PREDICT(
        MODEL ${natality_linear_reg.SQL_TABLE_NAME},
        (SELECT
          *
        FROM ${natality_input.SQL_TABLE_NAME})) ;;
  }
  dimension: predicted_weight_pounds {
    label: "Predicted Weight (Pounds)"
    type: number
  }
  dimension: predicted_weight_buckets {
    type: string
    sql: CASE
           WHEN ${predicted_weight_pounds} >= 0
            AND ${predicted_weight_pounds} < 2.5 THEN "0-2.5 LBS"
           WHEN ${predicted_weight_pounds} >= 2.5
            AND ${predicted_weight_pounds} < 5 THEN "2.5-5 LBS"
           WHEN ${predicted_weight_pounds} >= 5
            AND ${predicted_weight_pounds} < 7.5 THEN "5-7.5 LBS"
           WHEN ${predicted_weight_pounds} >= 7.5
            AND ${predicted_weight_pounds} < 10 THEN "7.5-10 LBS"
           ELSE ">10 LBS"
         END ;;
  }
  dimension: residual {
    type: number
    sql: ${predicted_weight_pounds} - ${TABLE}.weight_pounds ;;
  }
  dimension: residual_tiers {
    type: string
    sql: CASE
           WHEN ${residual} < -10 THEN "< -10"
           WHEN ${residual} >= -10 AND ${residual} < -7.5 THEN "-10 to -7.5"
           WHEN ${residual} >= -7.5 AND ${residual} < -5 THEN "-7.5 to -5"
           WHEN ${residual} >= -5 AND ${residual} < -2.5 THEN "-5 to -2.5"
           WHEN ${residual} >= -2.5 AND ${residual} < 0 THEN "-2.5 to 0"
           WHEN ${residual} >= 0 AND ${residual} < 2.5 THEN "0-2.5"
           WHEN ${residual} >= 2.5 AND ${residual} < 5 THEN "2.5-5"
           WHEN ${residual} >= 5 AND ${residual} < 7.5 THEN "5-7.5"
           WHEN ${residual} >= 7.5 AND ${residual} < 10 THEN "7.5-10"
           ELSE ">10"
         END ;;
  }
  dimension: weight_pounds {type: number}
  dimension: expected_weight {
    type: string
    sql: CASE
           WHEN ${weight_pounds} >= 0
            AND ${weight_pounds} < 2.5 THEN "0-2.5 LBS"
           WHEN ${weight_pounds} >= 2.5
            AND ${weight_pounds} < 5 THEN "2.5-5 LBS"
           WHEN ${weight_pounds} >= 5
            AND ${weight_pounds} < 7.5 THEN "5-7.5 LBS"
           WHEN ${weight_pounds} >= 7.5
            AND ${weight_pounds} < 10 THEN "7.5-10 LBS"
           ELSE ">10 LBS"
         END ;;
  }
  dimension: is_male {type: yesno}
  dimension: gestation_weeks {type: number}
  dimension: mother_age {type: number}
  dimension: mother_race {type: string}
  dimension: pk {
    hidden: yes
    sql: GENERATE_UUID() ;;
  }
  measure: m_average_residual {
    label: "Average Residual"
    type: average
    sql: ${residual} ;;
  }
  measure: count {
    type: count_distinct
    sql: ${pk} ;;
  }
}
A few notes on the syntax:
- The input data in ml.PREDICT is the same as that used to train the model. This is easy in Looker, as it can be done by referencing the NDT created at the beginning.
- I’ve created a primary key for this table to facilitate row counting.
- The field mother_race is a categorical variable, which we cast as a string at the very top.
Pulling it all together — Visualising
As I’ve mentioned before, one of the strengths of building out these queries in Looker is to have an at-a-glance view of the fields needed to evaluate and deliver key metrics.
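If you want to version-control that at-a-glance view alongside the model, you can also sketch it as a LookML dashboard. Below is a minimal example with a single tile plotting loss by iteration; the dashboard, model, and tile names are all my own assumptions.

- dashboard: bqml_natality_evaluation
  title: BQML Natality Model Evaluation
  layout: newspaper
  elements:
  - name: loss_by_iteration
    title: Loss by Iteration
    model: bqml_natality            # assumed model name, use your own
    explore: natality_linear_reg_training
    type: looker_line
    fields: [natality_linear_reg_training.iteration, natality_linear_reg_training.total_loss]
    sorts: [natality_linear_reg_training.iteration]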
Now obviously, this is a toy model. We can see from the R2 that our model isn’t actually a great fit, and, if you had any doubts, the Predicted Weights by Gestation Weeks graph shows that most predictions land only in the 5–7.5 pound range. But this should give you a taste of how much easier it is to train your model with all evaluation metrics in one place. When you want to look at a metric after a model tweak, all you have to do is select ‘Clear Cache and Refresh’ rather than query multiple relevant functions.
Delivering Predictions
For our use case, one powerful way to deliver data is to allow the user to filter on fields relevant to the prediction and have the Look you’ve created pull through the corresponding prediction. This visualisation can be created with the same dimensions we’ve built out for the dashboard.
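As a minimal sketch of that idea, you could expose the prediction view as its own explore and prompt users for the inputs they care about; the explore label and default filter values below are my own assumptions.

explore: natality_prediction {
  label: "Birth Weight Predictions"
  # Prompt users for the inputs they care about before the Look runs
  always_filter: {
    filters: [natality_prediction.gestation_weeks: "40", natality_prediction.is_male: "Yes"]
  }
}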
Congrats!
You’ve now navigated through a concrete example of an implementation of a BQML linear model in Looker. Hopefully this exercise has illustrated how much easier it is to evaluate and deliver your predictions and metrics when BQML is integrated with Looker. Try it with your own data!
Do you have questions or projects you’d like our help with? Get in touch – we’d love to hear from you.