I worked on Neural Network in the past. It is by far the most interesting machine learning algorithm that I've learnt. There is always a question in my mind, how is it possible to train a neural network using many machines in a cluster. This paper gives me the first glimpse of it.
Parallelizing the learning computation of a neural network (more specifically Feed-Forward Neural Network) involves passing messages forward and then backward through the networks. Since the connections are very dense from layers to layers, the time spent on IO is usually the dominated factor in training a Neural Network. Therefore, an efficient way for message passing between neurons (i.e. communication, synchronization and data transfer) is needed for model parallelism across machines to be beneficial.
The paper suggests two ways to parallelizing the learning computations: Model Parallelism and Data Parallelism. Model parallelism is concerned about partitioning the neural network computation across several machines and using all available computational capability of each machine for the computation. The model parallelism is done by the framework called DistBelief. The secret sauce of the framework is "batching". To reduce the communication overhead of individual message, a batch of message is sent at a time across nodes. For instance, Node 1 has interconnections to Node 2. Messages from Node 1 are sent to Node 2 only once for each propagation (or vice-versa depending of the type of propagation). The largest cluster they have experimented has 32 machines with 16 cores for each (i.e. a total of 512 cores) which is pretty amazing.
Although the model parallelism discussed in the paper is interesting but the focus of the paper is actually on the data parallelism. Data parallelism is concerned with distributing training across multiple model replicas. Each model replica employs the model parallelism technique discussed above but instead of having only one model, data parallelism uses multiple of model replica to simultaneously solve a single optimization problem. Note that this is different than training multiple models and choose 1 model that performs best. The secret sauce of data parallelism is 3-folds: 1) having a centralized sharded parameter server which model replicas use to share their parameters, 2) dividing the training data so each model replica can train on independently and 3) combining model replicas training results to optimize the objective function for the model of interest.
The paper further describes how the setup described in the data parallelism works by introducing 2 algorithms: Downpour SGD and Sandblaster L-BFGS. Downpour SGD is an online training method whereas Sandblaster L-BFGS is a batch training method. Let focus on Downpour SGD for a moment. It employs asynchronous stochastic gradient descent that uses multiple model replicas for training. Each model replica trains on a subset of training data and send the gradient information back to the centralized parameter servers. Since there are no synchronization between the model replicas and between parameter server shards, the gradient information is applied asynchronously to the parameters. That is the parameters are most likely inconsistency because some parameter server shards could update their parameters faster than other. Requesting a full set of parameters at any given time does not guarantee to retrieve the full set of updated parameters. This stochastic nature of the algorithm does not guarantee that the optimization will converge. Therefore, the author suggests to use Adagrad adaptive learning rate in which the computation of the leraning rate can be computed locally within each parameter server because the learning rate for each parameter can be computed based on only the previous gradient information of the parameter itself. The only difference in Sandblaster L-BFGS is that it has a coordinator to schedule tasks for parameter servers and model replicas. It acts as a distributed transaction manager to regulate the learning process.
The paper carries out some benchmarks for model parallelism and data parallelism. For model parallelism, the ratio between the speed-up and the number of machines per model is less than 1, meaning adding additional machine does not provide a linear speed-up. The best benchmark has a ratio of ~0.16 which uses 64 machines for the training. This is quite wasteful from the computation perspective because 84% of computation is not doing the actual work (it is because of the network IO overhead). The model parallelism works best with neural network that favours local connectivities such as Convolutional Neural Network.
In data parallelism benchmarks, downpour SGD performs better than other algorithms with less number of machines. However, the accuracy of the trained model is far less than 50% on the test dataset. The takeaway from this paper is that it is possible to train a network of a very large number of parameters.
5 comments:
The given information was excellent and useful. This is one of the excellent blog, I have come across. Do share more.
AWS Training in Chennai
DevOps Training in Chennai
Data Science Course in Chennai
ccna course in Chennai
Python Training in Chennai
R Programming Training in Chennai
Angularjs Training in Chennai
RPA Training in Chennai
Blue Prism Training in Chennai
Thanks to the admin for sharing this blog with us. The info in this blog was really helpful to me.
IELTS Coaching in Chennai
IELTS Coaching Centre in Chennai
IELTS Training in Chennai
Best IELTS Coaching in Chennai
IELTS Classes in Mumbai
IELTS Coaching in Mumbai
Best IELTS Coaching in Mumbai
IELTS Center in Mumbai
Informative post indeed, I’ve being in and out reading posts regularly and I see alot of engaging people sharing things and majority of the shared information is very valuable and so, here’s my fine read.
click here call to action
click here css animation
click here to 2lz4drob
click here click here click here
click here code html
Wonderful blog.. Thanks for sharing informative blog.. its very useful to me..
Cyber Security Training Course in Chennai | Certification | Cyber Security Online Training Course | Ethical Hacking Training Course in Chennai | Certification | Ethical Hacking Online Training Course |
CCNA Training Course in Chennai | Certification | CCNA Online Training Course | RPA Robotic Process Automation Training Course in Chennai | Certification | RPA Training Course Chennai | SEO Training in Chennai | Certification | SEO Online Training Course
Amazing blog. It's very helpful to learn these things easily. Thank You so much. Keep supporting.
aws training in chennai | aws course in chennai
Post a Comment