Real Time Streaming Data with Kafka and TensorFlow (Yong Tang, MobileIron) Kafka Summit NYC 2019

In mission-critical real-time applications, the use of machine learning to analyze streaming data is gaining momentum. In those applications, Apache Kafka is the most widely used framework for processing the data streams, and it typically works alongside machine learning frameworks for model training and inference. This talk focuses on the KafkaDataset module in TensorFlow. KafkaDataset feeds Kafka streaming data directly into TensorFlow’s graph. As a part of TensorFlow (in ‘tf.contrib’), KafkaDataset is implemented mostly in C++. The module exposes a machine-learning-friendly Python interface through TensorFlow’s ‘tf.data’ API, so it can be fed directly to ‘tf.keras’ and other TensorFlow modules for training and inference. Combined with Kafka streaming itself, the KafkaDataset module removes the need for an intermediate data processing infrastructure, which helps many mission-critical real-time applications adopt machine learning more easily. At the end of the talk, we will walk through a concrete example with a demo to showcase the usage described.
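
As a rough illustration of the ‘tf.data’ interface described above, the following is a minimal sketch, not the demo from the talk. It assumes TensorFlow 1.x (where ‘tf.contrib.kafka’ is available) and a reachable Kafka broker. The helper names `topic_spec` and `build_kafka_dataset`, the topic name, the broker address `localhost:9092`, the group name, and the batch size are all hypothetical choices made for this example.

```python
def topic_spec(topic, partition=0, offset=0, length=-1):
    """Format a KafkaDataset subscription as 'topic:partition:offset:length'.

    A length of -1 means "no limit" on the number of messages read.
    """
    return "{}:{}:{}:{}".format(topic, partition, offset, length)


def build_kafka_dataset(topics, servers="localhost:9092",
                        group="demo-group", batch_size=32):
    """Wrap Kafka subscriptions in a batched tf.data pipeline.

    `servers`, `group`, and `batch_size` are placeholder values
    (assumptions for this sketch), not values from the talk.
    """
    # Imported lazily so topic_spec() works without TensorFlow installed.
    import tensorflow as tf  # assumes TensorFlow 1.x with tf.contrib

    dataset = tf.contrib.kafka.KafkaDataset(
        topics=topics,    # list of 'topic:partition:offset:length' strings
        servers=servers,  # Kafka bootstrap servers
        group=group,      # consumer group id
        eof=True)         # end the dataset once the partition is exhausted
    # Each element is a scalar tf.string holding one raw message payload;
    # parsing it into features is left to the caller (e.g. via dataset.map).
    return dataset.batch(batch_size)
```

For example, `build_kafka_dataset([topic_spec("events")])` would subscribe to partition 0 of a hypothetical `events` topic from offset 0. The elements are raw message strings, so a `dataset.map(...)` step that decodes them into feature tensors would normally be needed before handing the dataset to ‘tf.keras’.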