m3: Accurate Flow-Level Performance Estimation using Machine Learning

  • Chenning Li
  • Arash Nasr-Esfahany
  • Kevin Zhao
  • Kimia Noorbakhsh
  • Mohammad Alizadeh
  • Thomas Anderson

ACM SIGCOMM

Data center network operators often need accurate estimates of aggregate network performance, such as the frequency of poor tail latency events, to guide network configuration: when and where to add capacity as load increases, which congestion control algorithm to use, how best to tune its parameters, and so forth. Unfortunately, existing methods for estimating aggregate network statistics are either fast but systematically inaccurate, or detailed but too slow to be practical at data center scale.

In this paper, we develop and evaluate a scale-free, fast, and accurate model for estimating data center network tail latency performance given a workload, topology, and network configuration. First, we show that path-level simulations (simulations of the traffic that intersects a given path) produce almost the same aggregate statistics as full network-wide packet-level simulations. Second, we use a simple and fast flow-level fluid simulation in a novel way to capture and summarize essential elements of the path workload, including the effect of cross-traffic on flows on that path. We then feed the output of this fast but inaccurate simulation to a simple machine-learning model that predicts path-level behavior, and run the model on a sample of paths to produce accurate network-wide estimates. Our model generalizes over the choice of congestion control (CC) protocol, CC protocol parameters, and routing. Relative to Parsimon, a state-of-the-art system for rapidly estimating aggregate network tail latency, our approach is significantly faster, more accurate, and more robust.
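The idea of sampling paths and combining path-level predictions into a network-wide tail estimate can be illustrated with a minimal Python sketch. It is not the paper's implementation. The flow record layout, the function names, the uniform sampling of paths, and the nearest-rank percentile are all assumptions made for illustration, and `predict_path_slowdowns` is a hypothetical stand-in for the fluid simulation plus learned model described above.

```python
import math
import random
from collections import defaultdict


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, for q in (0, 100]."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    # Multiply before dividing so integer q and len() stay exact.
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def group_flows_by_path(flows):
    """Map each path (a tuple of link names) to the flows that traverse it."""
    by_path = defaultdict(list)
    for flow in flows:
        by_path[flow["path"]].append(flow)
    return dict(by_path)


def estimate_network_tail(flows, predict_path_slowdowns, k, q=99, seed=0):
    """Estimate a network-wide slowdown percentile from a sample of k paths.

    predict_path_slowdowns(path, path_flows) is a hypothetical callback that
    returns one predicted slowdown per flow on that path.
    """
    rng = random.Random(seed)
    by_path = group_flows_by_path(flows)
    # Assumption: uniform sampling without replacement, so every flow is
    # equally likely to be represented in the pooled sample.
    sampled = rng.sample(list(by_path), min(k, len(by_path)))
    pooled = []
    for path in sampled:
        pooled.extend(predict_path_slowdowns(path, by_path[path]))
    return percentile(pooled, q)


def toy_predictor(path, path_flows):
    """Toy stand-in: slowdown grows with the number of flows on the path."""
    return [1.0 + len(path_flows) for _ in path_flows]


toy_flows = [
    {"id": 0, "path": ("a", "b")},
    {"id": 1, "path": ("a", "b")},
    {"id": 2, "path": ("a", "b")},
    {"id": 3, "path": ("c", "d")},
]
p99 = estimate_network_tail(toy_flows, toy_predictor, k=2, q=99)
```

With `k` equal to the number of distinct paths, the estimate covers every flow; smaller values of `k` trade accuracy for speed, which is the trade-off that path sampling is meant to expose.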