Real-time streaming predictions using Google’s Cloud Dataflow and Cloud Machine Learning
Google Cloud Dataflow lets companies process huge amounts of data in real time, and chances are it already powers some service you use every day. But imagine combining that, still in real time, with the predictive power of neural networks. That is exactly what we will talk about in this blog post!
It all started with some fiddling around with Apache Beam, an incubating Apache project that provides a unified programming model for both batch and stream processing jobs. We wanted to test its streaming capabilities by running a pipeline on Google Cloud Dataflow, Google’s managed service for running such pipelines. Of course, you need a source of data, and in this case a streaming one. Enter Twitter, which provides a nice, continuous stream of data in the form of tweets.
To easily get the tweets into Beam, we hooked the stream into Google Cloud Pub/Sub, a highly scalable messaging service. We set some filters on the Twitter stream (obviously we included @teamdatatonic and #datatonic) and then connected the Pub/Sub stream to Dataflow.
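As a minimal sketch, here is what the start of such a pipeline could look like with the current Beam Java SDK (the project and topic names are hypothetical):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.values.PCollection;

public class TweetPipeline {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
    options.setStreaming(true);

    Pipeline pipeline = Pipeline.create(options);

    // Each element is one raw tweet, published to the topic as a JSON string.
    PCollection<String> rawTweets = pipeline.apply("Read from Pub/Sub",
        PubsubIO.readStrings().fromTopic("projects/my-project/topics/tweets"));

    // The parsing, vectorising, prediction and BigQuery steps are applied here.
    pipeline.run();
  }
}
```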
Dataflow has a connector that easily reads data from Pub/Sub. The incoming data points are raw, stringified JSON objects, which first need some parsing and filtering. This is done in the “Parse tweets” part of the pipeline (see below for an overview of the pipeline). We extract the text from the tweets, but in order to have a valid input for the machine learning model, we have to vectorise it as well. This is done in the “Construct vectors from text” step.
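A minimal sketch of the “Parse tweets” step could look as follows, assuming the raw payload is the tweet JSON from the Twitter API (whose text lives in the "text" field); using Gson for the parsing is our choice here, not a requirement:

```java
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import com.google.gson.JsonSyntaxException;
import org.apache.beam.sdk.transforms.DoFn;

// Extracts the tweet text from the raw JSON string coming out of Pub/Sub;
// anything that does not parse or has no "text" field is filtered out.
public class ParseTweetFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    try {
      JsonObject tweet = JsonParser.parseString(c.element()).getAsJsonObject();
      if (tweet.has("text")) {
        c.output(tweet.get("text").getAsString());
      }
    } catch (JsonSyntaxException | IllegalStateException e) {
      // Drop malformed payloads instead of failing the bundle.
    }
  }
}
```

This plugs into the pipeline as `rawTweets.apply("Parse tweets", ParDo.of(new ParseTweetFn()))`.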
As you might notice in the Dataflow graph below, a side input is used in this step as well. Since the inputs to the ML model are numbers rather than strings, we assign a different index to each word and use a lookup table to convert them. This is done by loading the vocabulary (the full vocabulary was only 4.5 MB) into the memory of every Dataflow worker, where the lookup table is implemented as a HashMap.
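A sketch of the “Construct vectors from text” step under these assumptions: `vocabularyView` is a `PCollectionView<Map<String, Integer>>` built elsewhere from the vocabulary file with `View.asMap()`, and the tokenisation shown is deliberately simplistic:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// vocabularyView: a map-valued side input, materialised in memory on each worker.
PCollection<List<Integer>> vectors = tweetTexts.apply(
    "Construct vectors from text",
    ParDo.of(new DoFn<String, List<Integer>>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        Map<String, Integer> vocabulary = c.sideInput(vocabularyView);
        List<Integer> indices = new ArrayList<>();
        for (String word : c.element().toLowerCase().split("\\s+")) {
          // Words missing from the vocabulary map to index 0 in this sketch.
          indices.add(vocabulary.getOrDefault(word, 0));
        }
        c.output(indices);
      }
    }).withSideInputs(vocabularyView));
```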
The next thing we needed was a simple but cool model to apply to our tweets. Luckily, there are some ready-to-use models on GitHub, such as this one, which predicts sentiment using a convolutional neural network (CNN). We will not go into much detail here, as we focused our work on the pipeline rather than on getting the most accurate predictions. The only changes we made to the model are the ones outlined here, which enable Google Cloud Machine Learning to handle the inputs and outputs of the TensorFlow graph, and then we deployed it.
One of the capabilities of Cloud Machine Learning is online prediction: predictions are made via HTTP requests whose body contains the instances we wish to classify. We used a client library for Java to handle the communication between the Dataflow workers and the Cloud Machine Learning API, which requires authorization.
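For illustration, here is a bare-bones version of that request using application default credentials against the REST endpoint directly; the project and model names are hypothetical, and our actual pipeline went through the Google API client library rather than raw HTTP:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;

// Minimal sketch of an online prediction request to Cloud Machine Learning.
// "my-project" and "tweet_sentiment" are hypothetical names.
public class CloudMlClient {

  private static final String PREDICT_URL =
      "https://ml.googleapis.com/v1/projects/my-project/models/tweet_sentiment:predict";

  public static String predict(String instanceJson) throws Exception {
    // Application default credentials authorize the request
    // (on Dataflow workers these come from the worker service account).
    GoogleCredential credential = GoogleCredential.getApplicationDefault()
        .createScoped(Collections.singleton(
            "https://www.googleapis.com/auth/cloud-platform"));
    credential.refreshToken();

    HttpURLConnection conn = (HttpURLConnection) new URL(PREDICT_URL).openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Authorization", "Bearer " + credential.getAccessToken());
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);

    // The body looks like: {"instances": [[3, 18, 42, ...]]}
    try (OutputStream os = conn.getOutputStream()) {
      os.write(("{\"instances\": [" + instanceJson + "]}")
          .getBytes(StandardCharsets.UTF_8));
    }
    return new String(conn.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
  }
}
```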
This is the most exciting part of this infrastructure: it is now possible to create a feed of data and make machine learning predictions on it in real time, all as a fully managed and scalable service. In our pipeline we sent the vectorised tweets to the deployed model and obtained a sentiment prediction for each one. All of this is done in the “Predict sentiment using CNN on Cloud Machine Learning” step.
We wrote each tweet with its sentiment to BigQuery, where it can be used for further analysis or visualisation. An overview of the full architecture can be found below. As you can see, we added Google Cloud Datastore and Google Cloud Bigtable, which can be used when the lookup file becomes too large to fit in memory, or to enrich the data in your pipeline in other ways. This enables a range of applications (such as time series analysis) where we need low-latency random access to large amounts of data.
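As a final sketch, the BigQuery step could look like this, assuming `scoredTweets` is a `PCollection<TableRow>` holding the tweet text and its sentiment score (the table name and schema are hypothetical):

```java
import java.util.Arrays;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;

// Hypothetical two-column schema: the tweet text and its predicted sentiment.
TableSchema schema = new TableSchema().setFields(Arrays.asList(
    new TableFieldSchema().setName("tweet").setType("STRING"),
    new TableFieldSchema().setName("sentiment").setType("FLOAT")));

scoredTweets.apply("Write to BigQuery",
    BigQueryIO.writeTableRows()
        .to("my-project:tweets.sentiment")  // hypothetical table
        .withSchema(schema)
        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND));
```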
Since this is more of a proof of concept and reference pipeline, the output is not exceptionally exciting. However, it opens up a whole new world of live processing and serving predictions, all of which can be seamlessly embedded into your data pipeline. We are really excited about these new possibilities and look forward to applying them in our future projects!