Using Kafka to replace RabbitMQ and eliminate task processing outages at DoorDash with Ashwin Kachha
Scaling backend infrastructure to handle hyper-growth is one of the many exciting challenges of working at DoorDash. In this talk, we’ll discuss some scaling issues in 2019 that prompted us to accelerate our adoption of Kafka.
In mid-2019, we faced significant scaling challenges and frequent outages involving Celery and RabbitMQ, the two technologies powering the system that handles the asynchronous work behind critical functions of our platform, including order checkout and Dasher assignments. We quickly solved this problem with a simple, Apache Kafka-based asynchronous task processing system that stopped our outages while we continued to iterate on a robust solution. Our initial version implemented the smallest set of features needed to accommodate a large portion of existing Celery tasks. Once in production, we continued to add support for more Celery features while addressing novel problems that arose when using Kafka.
Thereafter, we adopted Kafka across a variety of domains, either directly or in conjunction with technologies like Flink and Cadence. Kafka’s ability to scale and provide at-least-once message delivery has been crucial for our use cases and has given us a boost in reliability across several domains.
Slides: https://www2.slideshare.net/ConfluentInc/doordash-using-kafka-to-replace-rabbitmq-and-eliminate-task-processing-outages/ConfluentInc/doordash-using-kafka-to-replace-rabbitmq-and-eliminate-task-processing-outages