
Google's distributed computing for dummies trains ResNet-50 in under half an hour

The AchieVer


Google's new "TF-Replicator" technology is meant to be drop-dead simple distributed computing for AI researchers. A key benefit is that it can take dramatically less time to reach benchmark results on standard tasks such as ImageNet.


Is it better to be as accurate as possible in machine learning, however long it takes, or pretty darned accurate in a really short amount of time? 


For DeepMind researchers Peter Buchlovsky and colleagues, the choice was to go for speed of learning over theoretical accuracy. 


Unveiling a new piece of technology called "TF-Replicator" this week, the researchers said they were able to match the accuracy of the top benchmark results on the familiar ImageNet competition in under half an hour, using 32 of Google's Tensor Processing Unit chips operating in parallel. The debut of Replicator comes as Google this week previewed the 2.0 version of TensorFlow.


The results from using TF-Replicator, the authors claim, approached the best results from some other projects that used many more GPUs, including prior work that employed 1,024 of Nvidia's "Tesla P100" GPUs.  


The implication of the TF-Replicator project is that results which once took epic engineering of GPU clusters can now be achieved with a few lines of Python code that haven't been specially tuned for any particular hardware configuration.


Figure: TF-Replicator can make multiple "workers" that either share a compute graph (left panel) or have separate compute graphs of their own (right panel).


The trick is basically to make Parallel Distributed Computing for Dummies, if you will. A set of new functions has been added to Google's TensorFlow framework that, DeepMind claims, "trivializes the process of building distributed machine learning systems" by letting researchers "naturally define their model and run loop as per the single-machine setting."


The system is more flexible than a previous TensorFlow approach, called an "estimator," which imposed restrictions on the ways models are built. While that system was geared toward production environments, TF-Replicator is aimed at the R&D lab, where researchers build new kinds of networks, so it is designed to be more flexible.


It's also meant to be much simpler to program than previous attempts at parallelism, such as "Mesh-TensorFlow," introduced last year by Google's Brain unit as a separate language to specify distributed computing. 


The research, "TF-Replicator: Distributed Machine Learning For Researchers," is posted on the arXiv pre-print server, and there's also a blog post by DeepMind.  

The working assumption in the paper is that the goal is to reach state-of-the-art results fast rather than to push the limit in terms of accuracy. As the authors point out, "Instead of attempting to improve classification accuracy, many recent papers have focused on reducing the time taken to achieve some performance threshold (typically ∼75% Top-1 accuracy)," using ImageNet benchmarks and, in most cases, training the common "ResNet-50" neural network.


The usual route to such fast results is known as "weak scaling," where the network is trained "in fewer steps with very large batches," grouping the data in sets of multiple thousands of examples.

Hence the need to parallelize models, so that they can work on those batches simultaneously across multiple cores and multiple GPUs or TPUs.
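
To make the data-parallel idea concrete, here is a minimal pure-Python sketch. It is not TF-Replicator code: the toy one-parameter model y = w * x, the batch size of 4,096, and the helper names `grad_mse` and `shard` are all assumptions made for illustration. It shows why a very large batch can be spread over devices: when the batch is cut into equal shards, averaging the per-shard gradients gives the same gradient as processing the whole batch in one place.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the toy model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def shard(batch, num_replicas):
    """Split a batch into equal, contiguous shards, one per replica."""
    if len(batch) % num_replicas != 0:
        raise ValueError("batch size must be divisible by num_replicas")
    size = len(batch) // num_replicas
    return [batch[i * size:(i + 1) * size] for i in range(num_replicas)]


# A "large" global batch of 4096 toy examples drawn from y = 3x (assumed size).
global_batch = [(float(i % 17), 3.0 * float(i % 17)) for i in range(4096)]
w = 0.5

# Gradient computed on the whole batch in one place.
full_grad = grad_mse(w, global_batch)

# The same batch split over 32 replicas, 128 examples each.
shards = shard(global_batch, 32)
per_replica_grads = [grad_mse(w, s) for s in shards]
averaged_grad = sum(per_replica_grads) / len(per_replica_grads)
```

The equivalence only holds exactly when every shard has the same number of examples, which is why the sketch rejects batch sizes that do not divide evenly.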


The authors set out to build a distributed computing system that would handle tasks ranging from classification to making fake images via generative adversarial networks (GANs) to reinforcement learning, while reaching the threshold of competent performance faster. 


The authors write that a researcher doesn't need to know anything about distributed computing. The researcher specifies their neural net as a "replica," a unit designed to run on a single computer. That replica can be automatically multiplied into separate instances running in parallel on multiple computers, provided that the researcher adds two Python functions to their TensorFlow code, called "input_fn" and "step_fn." The first supplies the dataset that populates each "step" of the neural network, which makes it possible to parallelize the work on data across different machines. The second specifies the computation to be performed, and can be used to parallelize the neural network operations across many machines.
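
The division of labor between the two functions can be sketched in plain Python. This is only an illustration of the pattern, not the TF-Replicator API: the function signatures, the `run` driver, the toy model y = w * x, and the learning rate are all assumptions. Here `input_fn` hands each replica its own slice of the data, `step_fn` describes the work one replica does, and the hypothetical driver repeats that work per replica and averages the resulting gradients.

```python
def input_fn(replica_id, num_replicas):
    """Return this replica's slice of a toy dataset (samples of y = 3x)."""
    data = [(float(x), 3.0 * float(x)) for x in range(1, 9)]
    return data[replica_id::num_replicas]


def step_fn(w, batch):
    """Per-replica computation: loss and gradient for the model y = w * x."""
    loss = sum((w * x - y) ** 2 for x, y in batch) / len(batch)
    grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
    return loss, grad


def run(input_fn, step_fn, num_replicas, num_steps, learning_rate=0.01):
    """Hypothetical driver: replicate step_fn, average gradients, update w."""
    w = 0.0
    batches = [input_fn(r, num_replicas) for r in range(num_replicas)]
    for _ in range(num_steps):
        # On real hardware these calls would run in parallel, one per device;
        # here they run one after another to keep the sketch self-contained.
        results = [step_fn(w, batch) for batch in batches]
        mean_grad = sum(grad for _, grad in results) / num_replicas
        w -= learning_rate * mean_grad
    return w


trained_w = run(input_fn, step_fn, num_replicas=4, num_steps=200)
```

Note that neither `input_fn` nor `step_fn` mentions other replicas; in this sketch only the driver knows how many there are, which is the property the article describes.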


Figure: How TF-Replicator builds the compute graph on multiple machines, leaving "placeholder" functions in the graph where communications will need to be filled in later (drawn as dotted lines in the original figure).


The authors note they had to overcome some interesting limitations. For example, communications between computing nodes can be important for things such as gathering up all the gradient descent computations happening across multiple machines. 


That can be challenging to engineer. If a single "graph" of a neural network is distributed across many computers, what's known as "in-graph replication," then problems can arise because parts of the compute graph may not yet be constructed, which frustrates dependencies between the computers. "One replica's step_fn can call a primitive mid graph construction," they write, referring to the communications primitives. "This requires referring to data coming from another replica that itself is yet to be built." 


Their solution is to put "placeholder" code in the compute graph of each machine, which "can be re-written once all replica subgraphs are finalized." 
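
The placeholder idea can be illustrated with a toy two-pass build in plain Python. This is a loose sketch under stated assumptions, not how TF-Replicator is implemented: the `Placeholder` class, the dict-based "graph," and the helper names are invented for illustration, and a plain number stands in for a tensor. During the first pass each replica records a placeholder wherever it needs a value from its peers; once every replica's subgraph exists, a second pass rewrites the placeholders into the real cross-replica sum.

```python
class Placeholder:
    """Stands in for a cross-replica value that cannot be wired up yet."""

    def __init__(self, source_name):
        self.source_name = source_name


def build_replica(replica_id):
    """Build one replica's toy graph: a dict mapping node name to node."""
    graph = {"local_grad": float(replica_id + 1)}  # stand-in for a real tensor
    # Other replicas may not exist yet, so record a placeholder instead of
    # reading their "local_grad" nodes directly.
    graph["summed_grad"] = Placeholder("local_grad")
    return graph


def rewrite_placeholders(graphs):
    """Once every subgraph exists, replace each placeholder with the real sum."""
    for graph in graphs:
        for name, node in list(graph.items()):
            if isinstance(node, Placeholder):
                graph[name] = sum(g[node.source_name] for g in graphs)
    return graphs


graphs = [build_replica(r) for r in range(4)]  # construction pass
graphs = rewrite_placeholders(graphs)          # rewrite pass
```

Splitting the work into two passes removes the ordering problem: no replica needs any other replica to exist while it is being built.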


Figure: Results of various configurations of TF-Replicator for the ImageNet tasks on different configurations of hardware.





