Network Evolution for DNNs

Deep Neural Networks (DNNs) increasingly power applications such as image search, voice recognition, autonomous vehicles, spam detection, and datacenter power management. Many of these applications require DNNs to be retrained periodically to maintain or improve prediction quality, so reducing DNN training time has a significant impact on application performance. Consequently, DNN training is increasingly distributed across machines and executed on GPUs, ASICs, or other specialized hardware. In this paper we analyze how the network fabric impacts DNN training time in order to determine how the fabric should change to better accommodate these jobs. We rely on analytical models and trace-driven simulation for our analysis, and find that changing the network fabric can significantly impact DNN training performance, but that, unlike in traditional data-parallel systems, the biggest improvements come from improving data distribution mechanisms rather than aggregation mechanisms.
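To give intuition for this finding, the sketch below is a minimal analytical model of one data-parallel training iteration; it is not the paper's actual model, and all parameter values and the ring all-reduce cost assumption are hypothetical. It illustrates how, for plausible input and model sizes, the time to distribute training data to workers can dominate the time to aggregate gradients.

```python
# Minimal sketch (not the paper's model): per-iteration training time
# under data parallelism, with hypothetical parameters.

def iteration_time(compute_s, model_bytes, batch_bytes, workers, bw_Bps):
    """Estimate one training iteration's wall-clock time in seconds.

    compute_s   -- per-worker forward/backward compute time (s)
    model_bytes -- size of the model's gradients/parameters (bytes)
    batch_bytes -- input data shipped to each worker per iteration (bytes)
    workers     -- number of workers
    bw_Bps      -- per-link network bandwidth (bytes/second)
    """
    # Data distribution: each worker must first receive its input batch.
    distribute = batch_bytes / bw_Bps
    # Gradient aggregation: ring all-reduce moves roughly
    # 2 * model_bytes * (workers - 1) / workers bytes per worker,
    # independent of worker count for large rings.
    aggregate = 2 * model_bytes * (workers - 1) / workers / bw_Bps
    return distribute + compute_s + aggregate

# Hypothetical numbers: 100 MB model, 1 GB of input data per worker per
# iteration, 10 Gbps (~1.25 GB/s) links, 32 workers, 0.2 s of compute.
# Here distribution (~0.8 s) dominates aggregation (~0.16 s).
print(f"{iteration_time(0.2, 100e6, 1e9, 32, 1.25e9):.2f} s per iteration")
```

Under these assumed parameters, shipping input data to workers costs several times more network time than aggregating gradients, which is the kind of asymmetry that motivates optimizing distribution rather than aggregation.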