The latest from the TensorFlow Dev Summit 2018

This year’s TensorFlow Dev Summit saw the introduction of new TensorFlow technologies that promise to make machine learning even more efficient and fun for developers.

Some of the most interesting new features range from art inspiration with the Magenta project to user-friendly debugging in TensorBoard and fast scale-up with Distributed TensorFlow.

The summit highlighted much more than can be outlined in a single post. If you share Datatonic’s interests and want a brief description of the most interesting contributions, this post should be a useful debrief on the latest in TensorFlow.

Below is a summary of what we at Datatonic think are some of the most interesting developments from the summit, with a brief description of each announcement. It is far from an exhaustive list, so we encourage you to dig deeper by watching the event videos on the official TensorFlow YouTube page. As a starter, check this very nice two-minute highlights video.

Tf Data workflow

We all know how efficient the input data library is for defining the input_fn for our models: about 10 lines of code handle shuffling, repeating, mapping, dividing into batches and so on. What if you could do all of it in just one command?

The new make_csv_dataset() function reads your CSV files directly into (features, labels) batches, allowing you to specify all the input sub-functions as parameters.

# DATA_DIR is a placeholder for the directory that holds your CSV files.
dataset = tf.contrib.data.make_csv_dataset(
    os.path.join(DATA_DIR, '*.csv'), batch_size, num_epochs=num_epochs,
    column_names=_CSV_COLUMNS, column_defaults=_CSV_COLUMNS_DEFAULTS,
    label_name=LABEL, shuffle=shuffle, shuffle_buffer_size=1000000,
    shuffle_seed=RANDOM_SEED, prefetch_buffer_size=50,
    num_parallel_parser_calls=8)
return dataset

To use this functionality, you need to install tf-nightly.
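To make concrete what "directly into (features, labels) batches" means, here is a small sketch that uses only Python's standard library rather than TensorFlow. The helper csv_batches, the column names and the sample rows are all made up for illustration. The real function returns tensors and also takes care of shuffling, default values and prefetching.

```python
import csv
import io

def csv_batches(lines, batch_size, label_name):
    """Yield (features, labels) batches from CSV text lines.

    Hypothetical helper for illustration only; not a TensorFlow API.
    features is a dict mapping column name -> list of (string) values.
    """
    reader = csv.DictReader(lines)
    features, labels = {}, []
    for row in reader:
        labels.append(row.pop(label_name))
        for name, value in row.items():
            features.setdefault(name, []).append(value)
        if len(labels) == batch_size:
            yield features, labels
            features, labels = {}, []
    if labels:  # final, possibly smaller, batch
        yield features, labels

# Made-up data: two feature columns and one label column.
SAMPLE = "age,income,clicked\n25,30000,0\n40,52000,1\n31,41000,0\n"
batches = list(csv_batches(io.StringIO(SAMPLE), batch_size=2, label_name="clicked"))
```

With three rows and a batch size of 2, this yields one full batch and one smaller final batch. Values stay as strings here, whereas the TensorFlow function parses them according to the column defaults.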

If you want to wait for an official release or for the package to become more stable, there are some new interesting features at your disposal in the official package to speed up the input data pipeline.

You can set the number of parallel reads with num_parallel_reads to load the dataset in parallel, apply fused versions of the transformations for better performance (we show an example of these in the next section), or perform GPU prefetching so that the next batch of input data is already in memory for the next step.

Please refer to the official documentation and the video of the summit talk for more details.


Dealing with big data is an everyday task for ML engineers and researchers.

While your algorithm and computations may be optimized to run on GPUs or TPUs, the training data is processed on CPUs and, if the data is big enough, this becomes the bottleneck for your model.

The GPU/TPU runs fast, and it spends a lot of time idle while waiting for the CPU to process data. How can we get to a perfect flow, as in the picture below, where your resources are 100% optimised?

Optimized CPU/GPU/TPU flow

In his very interesting talk, Brennan Saeta highlights three performance optimisation tricks for your input_fn:

  • Parallelise your map_fn() function across multiple threads to load data faster.

dataset = dataset.map(map_fn, num_parallel_calls=64)

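  • Pipeline the stages, prefetching so that the next batch is ready before it is needed.

# files is a placeholder for your input files
dataset = tf.data.TFRecordDataset(files, num_parallel_reads=32)
dataset = dataset.shuffle(10000)
dataset = dataset.repeat(NUM_EPOCHS)
dataset = dataset.map(map_fn, num_parallel_calls=64)
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(2)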

  • Use fused transformations.

dataset = dataset.apply(tf.contrib.data.shuffle_and_repeat(10000, NUM_EPOCHS))
dataset = dataset.apply(tf.contrib.data.map_and_batch(map_fn, batch_size, num_parallel_batches=4))
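To see why prefetching keeps the accelerator busy, the pipelining idea can be sketched in plain Python with no TensorFlow at all. This is only a conceptual illustration, not how tf.data is implemented. The prefetch and preprocess functions are hypothetical names: a background thread prepares batches and places them in a bounded buffer while the consumer, standing in for the training step, works on the previous one.

```python
import queue
import threading

def prefetch(iterable, buffer_size):
    """Yield items from `iterable`, producing them in a background thread.

    Up to `buffer_size` items are prepared ahead of the consumer.
    """
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the input

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()  # blocks only if nothing is ready yet
        if item is done:
            return
        yield item

def preprocess(batch_id):
    # Stand-in for CPU-side parsing and augmentation of one batch.
    return [batch_id * 10 + i for i in range(3)]

batches = (preprocess(i) for i in range(4))
# The "training step" here is just a sum over each prefetched batch.
results = [sum(batch) for batch in prefetch(batches, buffer_size=2)]
```

The buffer size plays the same role as the argument to dataset.prefetch(2) above: it bounds how far ahead the producer may run.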

If you are wondering whether you can get this performance boost with the new make_csv_dataset() function presented in the previous section, the answer is yes. Just set the parameters appropriately (this may require some fine-tuning), and you are all set to make the most of your computing power.

For other tricks and a nice introduction to the use of DAWNBench as a benchmark suite for quantifying workload and how efficient your model is at execution, have a look at the awesome talk by Brennan Saeta.


Following this line of boosting efficiency, TensorFlow has introduced eager execution, a new imperative, object-oriented (OO) way of using TensorFlow that promises to make the user experience more straightforward by shifting away from the typical graph execution.

First, you can enable eager execution with:

import tensorflow.contrib.eager as tfe
tfe.enable_eager_execution()

Operations will from now on be executed immediately and return their values to Python without requiring a Session.run(). This should make it easier to debug intermediate results.

The tensorflow.contrib.eager library gives you:

  • a new dynamic data control flow
  • a new way of computing gradients by means of a tape, which records operations so that they can be traced back afterwards
  • easy profiling
  • an OO environment where variables are treated as objects and models can be organised in classes
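The tape idea from the list above can be illustrated with a toy example in plain Python. This is a minimal sketch of tape-based differentiation in general, not TensorFlow's implementation. The Tape and Value classes are hypothetical and support only addition and multiplication on scalars.

```python
class Tape:
    """Records each operation so gradients can be computed afterwards."""

    def __init__(self):
        self.ops = []  # entries: (output, [(input, local_gradient), ...])

    def record(self, out, inputs):
        self.ops.append((out, inputs))

    def gradient(self, target, sources):
        grads = {id(target): 1.0}
        # Walk the recording backwards, applying the chain rule.
        for out, inputs in reversed(self.ops):
            upstream = grads.get(id(out), 0.0)
            for var, local in inputs:
                grads[id(var)] = grads.get(id(var), 0.0) + upstream * local
        return [grads.get(id(s), 0.0) for s in sources]


class Value:
    """A scalar whose operations are executed at once and logged on a tape."""

    def __init__(self, data, tape):
        self.data, self.tape = data, tape

    def __add__(self, other):
        out = Value(self.data + other.data, self.tape)
        self.tape.record(out, [(self, 1.0), (other, 1.0)])
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, self.tape)
        self.tape.record(out, [(self, other.data), (other, self.data)])
        return out


tape = Tape()
x, y = Value(3.0, tape), Value(4.0, tape)
z = x * y + x  # ordinary Python, evaluated immediately
dz_dx, dz_dy = tape.gradient(z, [x, y])
```

Because every operation runs immediately, z.data can be inspected or printed like any Python value, and the gradients are obtained afterwards by replaying the recording in reverse.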

How can you exploit eager execution? The idea is to write, iterate and debug in eager mode, and then bring the code into a graph for deployment by means of tfe.py_func(). This is because, even though the performance of eager execution is improving fast, it is still slower than graph execution (identity mapping, GPU op enqueuing and so on).

In conclusion, it would be worth using Eager execution as a first approach to TensorFlow, but expert programmers may want to stick with graph execution for now.
Once again, check the original talk from Alex Passos for more insights (and code).

TensorFlow Extended (TFX)

TFX promises to make the production and deployment of ML models a simpler and quicker process by providing a general-purpose ML platform.

Your standardised end-to-end workflow will look like the one shown in the graph below.

standardized end-to-end workflow in TFX

Let’s go through the individual parts:


Perform both batch and real-time training at serving time within the same graph, consistently.

Common transformations are:


To allow for a distributed data pipeline, use the Analyzers, which build on Apache Beam so that the pipeline can run in different environments.

Analyzers schema


Your unchanged TensorFlow model. 

Model Analysis.

In a Jupyter notebook, you can use a very nice UI which allows you to interactively see performance by population bucket or feature from your evaluation model.

TFX UI on Jupyter notebook

This feature is quite important because, typically, you would use a single global performance measurement to decide whether the model behaves well or requires more fine-tuning, and a global figure can hide poor performance on a subset of your data.

What if your ML project is a recommender system for an e-commerce company, and your solution makes one customer segment happy and another very miserable? This could have the undesired effect of reducing profits and damaging your reputation with the client.

The solution is to use the UI above to check the fairness of your model on every segment, so that no unfair behaviour goes unnoticed.
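The point about global versus per-segment performance can be shown with a few lines of plain Python. This is a sketch with made-up evaluation results and a hypothetical accuracy_by_slice helper. It does not use the TFX Model Analysis API, which computes such sliced metrics for you.

```python
from collections import defaultdict

def accuracy_by_slice(examples):
    """Return (overall_accuracy, {slice_name: accuracy}).

    Each example is a (slice_name, label, prediction) tuple.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, label, prediction in examples:
        totals[slice_name] += 1
        correct[slice_name] += int(label == prediction)
    overall = sum(correct.values()) / sum(totals.values())
    per_slice = {name: correct[name] / totals[name] for name in totals}
    return overall, per_slice

# Made-up evaluation results for two customer segments.
examples = (
    [("segment_a", 1, 1)] * 9 + [("segment_a", 1, 0)] * 1
    + [("segment_b", 1, 1)] * 1 + [("segment_b", 1, 0)] * 1
)
overall, per_slice = accuracy_by_slice(examples)
```

Here the overall accuracy is 10 out of 12 (about 0.83), which looks healthy, while the smaller segment_b sits at 0.5. A single global number would hide that gap.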


Stay tuned for the release of the new TensorFlow Serving RESTful API, which moves away from gRPC and allows for far more efficient serving, where your input can be packed in JSON format and sent with a command as simple as

> POST https://host:port/v1/models/${MODEL}[/versions/${VERSION}]
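As a rough sketch of what such a request could look like, the snippet below only builds the URL and the JSON body with Python's standard library and sends nothing over the network. It assumes the conventions of the REST API as it was later released, where the verb is appended as a ":predict" suffix and inputs are listed under an "instances" key. The host, port, model name, version and input values are hypothetical, as is the build_predict_request helper.

```python
import json

def build_predict_request(host, port, model, instances, version=None):
    """Return (url, body) for a TensorFlow Serving REST predict call.

    Hypothetical helper: it only formats strings and sends nothing.
    """
    path = "/v1/models/{}".format(model)
    if version is not None:
        path += "/versions/{}".format(version)
    url = "https://{}:{}{}:predict".format(host, port, path)
    body = json.dumps({"instances": instances})
    return url, body

# Hypothetical host, port, model name and inputs.
url, body = build_predict_request(
    "localhost", 8501, "my_model", [[1.0, 2.0], [3.0, 4.0]], version=2
)
```

The resulting url and body can then be passed to any HTTP client as a POST request.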

For the full talk on the subject, check this video.

Distributed TensorFlow

ML models are getting bigger and bigger, and more computational resources are required to deal with the Big Data era.

Flexibility in scaling up is a must, and thankfully Distributed TensorFlow plans to give you even more flexibility in choosing how to scale up via its Distribution Strategy library.

Introducing Distributed TensorFlow strategy

Scaling up may be as simple as using a GPU rather than a CPU or, better, using multiple GPUs or, even better, deploying multiple machines. When deploying multiple machines, communication between workers and parameter servers is an issue: until now, the solution has been to have parameter servers hold the training weights, while each worker uses a copy of the graph to train its own network.

The problem: it is asynchronous.

The solution: the new synchronous All-reduce approach.

All-reduce is a fused algorithm capable of combining, extremely efficiently, all the weight updates provided by each worker and distributing the result back to all processes by broadcasting. In this way, only one copy of the graph (and one common checkpoint) is stored. Not much code change is required, which makes the Distribution Strategy library a must.
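The combine-then-broadcast idea can be sketched in plain Python. This is a single-process simulation under simplifying assumptions, not TensorFlow's implementation: the workers are just lists, the gradients are made-up numbers, and the all_reduce_mean and sync_step helpers are hypothetical. Real all-reduce implementations exchange data between devices in a far more efficient pattern.

```python
def all_reduce_mean(worker_grads):
    """Average gradients across workers and give every worker the result."""
    num_workers = len(worker_grads)
    # Reduce: sum each gradient component over all workers.
    summed = [sum(components) for components in zip(*worker_grads)]
    mean = [value / num_workers for value in summed]
    # Broadcast: every worker receives its own copy of the same result.
    return [list(mean) for _ in range(num_workers)]

def sync_step(weights_per_worker, worker_grads, learning_rate):
    """One synchronous SGD step: all replicas apply the same update."""
    reduced = all_reduce_mean(worker_grads)
    return [
        [w - learning_rate * g for w, g in zip(weights, grads)]
        for weights, grads in zip(weights_per_worker, reduced)
    ]

# Three hypothetical workers start with identical weights but see different data.
weights = [[1.0, 2.0] for _ in range(3)]
grads = [[0.5, 1.0], [1.5, 1.0], [1.0, 4.0]]
new_weights = sync_step(weights, grads, learning_rate=0.5)
```

Since every worker applies the same averaged update, the replicas stay identical after each step, which is why a single common checkpoint is enough.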

Once again, two things to note:

  1. Distribution Strategy is still in development, and the tf-nightly release is required to use it.
  2. Check the talk for more details!


As a leading ML and analytics solutions business, we like to keep up with all the latest technologies and news, especially when it comes to TensorFlow and AI.

The TensorFlow Dev Summit 2018 introduced many new features and technologies which may still be in a testing phase, but which have the potential to change the way we do ML analysis, spare us a huge amount of stress and make us more efficient developers.

This post has quickly summarised some of the most interesting new features from this year’s summit, but much more has been developed, and we highly recommend watching the full set of summit talks, as fresh ideas are bound to ignite. For example, why not use Magenta for a cool demo or project?

