Last week, Datatonic organised the first Apache Beam meetup in London, working together with Qubit and Google to educate and spark discussion around data engineering and, more specifically, Apache Beam.
Google started developing Dataflow / Beam to make previously written pipelines more portable, so that moving to new engines or environments wouldn’t necessarily involve rewriting well-written and tested business logic. In 2015, Google released the Dataflow Model paper, and in 2016 the project was submitted to the Apache Software Foundation, resulting in the Apache Beam project.
One of the main characteristics that distinguishes Beam from other data processing frameworks is that the processing logic (written using the Beam SDK) is independent of the runner (how the system actually does the relevant processing), which makes the code portable. For those interested, an overview of the different runners and their capabilities can be found here: Beam Capability Matrix. This portability makes it possible to run Beam code on your existing on-prem Spark cluster if your company hasn’t signed off on cloud technologies, and it also lets you future-proof your code in case you want to move to a managed cloud service (e.g. Google Cloud Dataflow, which is the name of Google’s managed Beam runner).
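To make the split between processing logic and runner a bit more concrete, here is a minimal sketch in plain Python. It is a toy illustration and deliberately does not use the Beam SDK: ToyPipeline, EagerRunner, LazyRunner and the two step functions are hypothetical names invented for this example. The point it tries to show is that the pipeline is only a description of the work, so the same description can be handed to different execution strategies.

```python
from typing import Callable, Iterable, List

# Toy illustration only: none of these names are part of the Beam SDK.
Step = Callable[[Iterable[str]], Iterable[str]]


class ToyPipeline:
    """Describes what to compute as an ordered list of steps; runs nothing itself."""

    def __init__(self) -> None:
        self.steps: List[Step] = []

    def apply(self, step: Step) -> "ToyPipeline":
        self.steps.append(step)
        return self


class EagerRunner:
    """Materialises the full output of each step before starting the next one."""

    def run(self, pipeline: ToyPipeline, data: Iterable[str]) -> List[str]:
        current = list(data)
        for step in pipeline.steps:
            current = list(step(current))
        return current


class LazyRunner:
    """Chains the steps as generators so elements are pulled through one at a time."""

    def run(self, pipeline: ToyPipeline, data: Iterable[str]) -> List[str]:
        stream: Iterable[str] = iter(data)
        for step in pipeline.steps:
            stream = step(stream)
        return list(stream)


def split_words(lines: Iterable[str]) -> Iterable[str]:
    # One input line can produce several output words.
    for line in lines:
        yield from line.split()


def lowercase(words: Iterable[str]) -> Iterable[str]:
    return (word.lower() for word in words)


def build_pipeline() -> ToyPipeline:
    # The "business logic": written once, with no knowledge of how it will be executed.
    return ToyPipeline().apply(split_words).apply(lowercase)


if __name__ == "__main__":
    lines = ["Apache Beam", "Beam Meetup London"]
    pipeline = build_pipeline()
    # The same pipeline description is executed by two different runners.
    print(EagerRunner().run(pipeline, lines))
    print(LazyRunner().run(pipeline, lines))
```

Both runners return the same words because they execute the same description. Real Beam runners differ in far more than evaluation order (distribution, fault tolerance, streaming support), so treat this only as a picture of the idea, not of how any actual runner works.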
If you want to get started using Apache Beam, have a look at some of the relevant resources at the bottom of the blog post!
Since we at Datatonic are quite excited about this new technology, and see interest in Beam from our clients as well, we wanted to build a community to share experience and learn from experts and from each other, so we set up a meetup. For the first event we invited three experts to talk about Beam-related topics!
Victor Kotai [LinkedIn, Twitter], software engineer at Qubit, gave an introduction to Beam. He went through some of the Beam concepts, like the different transforms and how Beam handles time. The session ended with some code examples and how they translate into a pipeline. This allowed us to get everyone to a similar level before digging deeper (although the audience was pretty advanced already!).
The second talk was a more use-case-focused session, in which Jibran Saithi [LinkedIn, Twitter], tech lead at Qubit, laid out how their data infrastructure evolved over time and how they ended up with Apache Beam running on Google’s managed service. They use Beam in both streaming and batch mode at scale, in harmony with other GCP tools like Pub/Sub and BigQuery. We were shown their current architecture, why they are still using Beam, and some of the features they would like to see in the future, which made for an interesting discussion.
To wrap up, we were fortunate enough to have Reuven Lax [LinkedIn, Twitter] talk. Reuven is one of the co-authors of both the MillWheel and Dataflow papers, so he is one of the most knowledgeable people in the Beam world. Since he was partly responsible for the inception of Beam and is currently team lead on the Dataflow team, he was able to talk us through the concepts in a clear and concise way. He helped us understand some of the more complicated aspects of Beam, like state & timers, and answered some of the more advanced questions people in the room had.
Overall, we think it was a great first meetup, but we are just getting started. We would love to invite you to our next sessions, which you should be able to find here soon! We will be posting the slides of the first session on the page as well, so stay tuned.
If you want to get involved or have something interesting to share, don’t hesitate to get in touch!