From Kafka to BigQuery – a Guide for Delivering Billions of Daily Events (Ofir Sharony, MyHeritage) Kafka Summit 2018

What are the most important considerations when shipping billions of daily events for analysis? In this session, I'll share the journey we took building a reliable, near real-time data pipeline. I'll discuss and compare several data loading techniques and hopefully help you make better choices for your next pipeline.

MyHeritage collects billions of events every day: request logs from web servers and backend services, events describing user activity across different platforms, and change-data-capture logs recording every change made to its databases. Delivering these events for analysis is a complex task that requires a robust and scalable data pipeline. We decided to ship our events to Apache Kafka and load them into Google BigQuery for analysis.
In this talk, I’m going to share some of the lessons we learned and best practices we adopted, while describing the following loading techniques:
-Batch loading to Google Cloud Storage and using a load job to deliver the data to BigQuery (see the sketch after this list)
-Streaming data via the BigQuery API, with Kafka Streams as the streaming framework
-Streaming data to BigQuery with Kafka Connect
-Streaming data with Apache Beam and its Cloud Dataflow runner
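As a concrete illustration of the first two techniques, here is a minimal Python sketch against the google-cloud-bigquery client. The bucket, project, dataset, and table names are hypothetical, and our production pipeline used Kafka Streams on the JVM rather than this client code; the sketch only contrasts the two BigQuery entry points, asynchronous load jobs versus the streaming insert API.

```python
from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.analytics.events"  # hypothetical destination table


def load_batch_from_gcs(gcs_uri):
    """Technique 1: files already staged in Google Cloud Storage are pulled
    into BigQuery with an asynchronous load job."""
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    job = client.load_table_from_uri(gcs_uri, TABLE_ID, job_config=job_config)
    job.result()  # block until the load job finishes (or raises)


def stream_rows(rows):
    """Technique 2: rows are pushed one micro-batch at a time through the
    BigQuery streaming insert API, making them queryable within seconds."""
    errors = client.insert_rows_json(TABLE_ID, rows)
    if errors:
        raise RuntimeError("Streaming insert failed: %s" % errors)


# Example usage (hypothetical paths and payloads):
# load_batch_from_gcs("gs://my-bucket/events/2018-04-01/*.json")
# stream_rows([{"event_time": "2018-04-01T12:00:00Z", "payload": "page_view"}])
```

In production you would also want retry and deduplication logic around the streaming inserts (BigQuery deduplicates on a best-effort basis when rows carry an insertId), but that is beyond this sketch.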
Along with presenting our journey, I’ll discuss some important concepts of data loading:
-Batch vs. streaming load
-Processing-time partitioning vs. event-time partitioning (see the sketch after this list)
-Considerations for running your pipeline on-premises vs. in the cloud
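To make the partitioning distinction concrete, here is a hedged sketch (Python again, with hypothetical project and dataset names) that creates two tables: one partitioned on ingestion (processing) time, where each row is filed under the day it arrived, and one partitioned on an event_time column, where late-arriving rows are still filed under the day the event actually happened.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Processing-time (ingestion-time) partitioning: rows land in the partition
# of the day BigQuery received them.
by_load_time = bigquery.Table("my-project.analytics.events_by_load_time")
by_load_time.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY
)

# Event-time (column-based) partitioning: rows land in the partition of the
# timestamp they carry, so late-arriving events stay with their real day.
by_event_time = bigquery.Table(
    "my-project.analytics.events_by_event_time",
    schema=[
        bigquery.SchemaField("event_time", "TIMESTAMP"),
        bigquery.SchemaField("payload", "STRING"),
    ],
)
by_event_time.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_time"
)

client.create_table(by_load_time, exists_ok=True)
client.create_table(by_event_time, exists_ok=True)
```

With ingestion-time partitioning, date-bounded queries filter on the _PARTITIONTIME pseudo-column; with column partitioning, they filter on event_time directly.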
Hopefully, this case study can help others build better data pipelines.