Monitoring Exascale Supercomputers With Tim Osborne | Current 2022

The Oak Ridge Leadership Computing Facility (OLCF), a US Department of Energy Office of Science user facility, provides world-class high-performance computing (HPC) resources for open science as well as world-class expertise in scientific computing. The OLCF operates two of the top five supercomputers in the world: Frontier and Summit. Our Kafka cluster was built in 2018 to stream data from Summit, a 200-petaflop system with 4,000 compute nodes. But is the cluster ready for exascale? The OLCF has recently delivered Frontier, the world's first exascale system, and we engineered a significant increase in streaming bandwidth and volume to serve its performance metrics, system events, utilization metrics, job metadata, and facilities monitoring. Data is indexed and served through an Elasticsearch cluster and provided in real time to Grafana dashboards.

In this talk, we will discuss scaling and planning a system to meet the streaming demands of the world’s only exascale and most energy-efficient supercomputer. Tune in to learn more about HPC and how streaming fits into monitoring large-scale systems. We will discuss aggregating data from many clusters into a central streaming system, shedding technical debt by pivoting to Confluent Operator on Kubernetes, and how we use real-time data to optimize supercomputer performance.