Wednesday, May 8, 2013

Reviewed Paper: Large Scale Distributed Deep Networks

I worked on neural networks in the past, and they are by far the most interesting machine learning algorithm I have learnt. One question has always been in my mind: how is it possible to train a neural network using many machines in a cluster? This paper gives me a first glimpse of the answer.

Parallelizing the learning computation of a neural network (more specifically, a feed-forward neural network) involves passing messages forward and then backward through the network. Since the connections between layers are very dense, the time spent on IO is usually the dominant factor in training a neural network. Therefore, an efficient scheme for message passing between neurons (i.e. communication, synchronization and data transfer) is needed for model parallelism across machines to be beneficial.
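
To make the communication cost concrete, here is a minimal sketch (my own illustration, not code from the paper) of the forward and backward messages for a single fully connected layer; the layer sizes and activation function are arbitrary, and the point is that both directions ship dense arrays whose cost grows with the connectivity:

```python
import numpy as np

# One fully connected layer: every unit connects to every unit in the next
# layer, so both the forward and the backward "messages" are dense arrays.
rng = np.random.default_rng(0)
n_in, n_out = 2048, 2048
W = rng.normal(scale=0.01, size=(n_in, n_out))   # dense weights: n_in * n_out values
b = np.zeros(n_out)

def forward(x):
    # Forward pass: the message sent up to the next layer is the activation vector.
    return np.tanh(x @ W + b)

def backward(x, h, grad_h):
    # Backward pass: the message sent down to the previous layer is the gradient
    # with respect to this layer's input.
    grad_pre = grad_h * (1.0 - h ** 2)   # tanh'(z) = 1 - tanh(z)^2
    grad_W = np.outer(x, grad_pre)       # local weight gradient (stays on this machine)
    grad_x = grad_pre @ W.T              # message passed back to the layer below
    return grad_W, grad_x

x = rng.normal(size=n_in)
h = forward(x)
grad_W, grad_x = backward(x, h, np.ones(n_out))
```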

The paper suggests two ways to parallelize the learning computation: model parallelism and data parallelism. Model parallelism is concerned with partitioning the neural network computation across several machines and using all of the available computational capability of each machine. The model parallelism is implemented by a framework called DistBelief. The secret sauce of the framework is "batching": to reduce the per-message communication overhead, messages are sent between nodes in batches. For instance, if Node 1 has interconnections to Node 2, messages from Node 1 are sent to Node 2 only once per propagation (or vice versa, depending on the direction of the propagation). The largest cluster they experimented with has 32 machines with 16 cores each (i.e. a total of 512 cores), which is pretty amazing.
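
As a rough illustration of the partitioning-plus-batching idea (my own sketch with made-up sizes, not DistBelief code), a layer can be split so that each node owns a slice of the units, and each cross-node transfer carries one batched array per propagation rather than one message per neuron:

```python
import numpy as np

# Sketch: one layer split across two "machines". Each node owns half of the
# output units but needs the full input activation vector, so each propagation
# moves a single batched array per node pair instead of per-neuron messages.
rng = np.random.default_rng(1)
n_in, n_out = 1024, 1024

class NodePartition:
    def __init__(self, n_in, n_owned):
        # Each node stores only the weights for the output units it owns.
        self.W = rng.normal(scale=0.01, size=(n_in, n_owned))

    def forward(self, x_batched):
        # x_batched is the single batched message received for this propagation.
        return np.tanh(x_batched @ self.W)

node_1 = NodePartition(n_in, n_out // 2)
node_2 = NodePartition(n_in, n_out // 2)

x = rng.normal(size=n_in)           # activations produced by the layer below
h_1 = node_1.forward(x)             # one message to node 1, one slice back
h_2 = node_2.forward(x)             # one message to node 2, one slice back
h = np.concatenate([h_1, h_2])      # full activation vector of the layer
```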

Although the model parallelism discussed in the paper is interesting, the focus of the paper is actually on data parallelism. Data parallelism is concerned with distributing training across multiple model replicas. Each model replica employs the model parallelism technique discussed above, but instead of training only one model, data parallelism uses multiple model replicas to simultaneously solve a single optimization problem. Note that this is different from training multiple models and choosing the one that performs best. The secret sauce of data parallelism is three-fold: 1) a centralized, sharded parameter server which model replicas use to share their parameters, 2) dividing the training data so that each model replica can train on its own subset independently, and 3) combining the replicas' training results to optimize the objective function for the model of interest.
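
Here is a toy sketch of that three-part setup (class names and the plain-SGD update rule are my own assumptions, not taken from the paper): shards of a centralized parameter server each own a slice of the parameters, and every replica trains on its own data partition, pulling parameters from and pushing gradients to each shard:

```python
import numpy as np

class ParameterShard:
    """One shard of the centralized parameter server; it owns a slice of the model."""
    def __init__(self, size, lr=0.01):
        self.params = np.zeros(size)
        self.lr = lr

    def pull(self):
        return self.params.copy()

    def push_gradient(self, grad):
        # Plain SGD applied locally by the shard, independently of the other shards.
        self.params -= self.lr * grad

class ModelReplica:
    """A replica trains on its own partition of the data and talks to every shard."""
    def __init__(self, shards, data_partition):
        self.shards = shards
        self.data = data_partition

    def step(self):
        params = [shard.pull() for shard in self.shards]        # fetch current weights
        grads = [self.compute_gradient(p, self.data) for p in params]
        for shard, grad in zip(self.shards, grads):             # send gradients back
            shard.push_gradient(grad)

    def compute_gradient(self, params, batch):
        # Placeholder gradient (a real replica would run forward/backward passes).
        return params - batch.mean(axis=0)

shards = [ParameterShard(size=4) for _ in range(2)]
replicas = [ModelReplica(shards, np.random.default_rng(i).normal(size=(8, 4)))
            for i in range(3)]
for replica in replicas:
    replica.step()
```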

The paper further describes how the data parallelism setup works by introducing two algorithms: Downpour SGD and Sandblaster L-BFGS. Downpour SGD is an online training method, whereas Sandblaster L-BFGS is a batch training method. Let's focus on Downpour SGD for a moment. It is an asynchronous stochastic gradient descent procedure that uses multiple model replicas for training. Each model replica trains on a subset of the training data and sends its gradient information back to the centralized parameter servers. Since there is no synchronization between the model replicas or between the parameter server shards, the gradient information is applied to the parameters asynchronously. That is, the parameters are most likely inconsistent, because some parameter server shards may update their parameters faster than others, so requesting the full set of parameters at any given time is not guaranteed to return a fully up-to-date set. This stochastic nature of the algorithm means the optimization is not guaranteed to converge. Therefore, the authors suggest using the Adagrad adaptive learning rate, which can be computed locally within each parameter server shard because the learning rate for each parameter depends only on that parameter's previous gradients. The main difference in Sandblaster L-BFGS is that it has a coordinator that schedules tasks for the parameter servers and model replicas, acting much like a distributed transaction manager that regulates the learning process.
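
A minimal sketch of the per-parameter Adagrad rule as applied inside a single shard (the constants and class structure are assumptions, not from the paper): each shard only needs the running sum of squared gradients for the parameters it owns, so the learning rate is computed entirely locally:

```python
import numpy as np

class AdagradShard:
    """Parameter server shard with a per-parameter Adagrad learning rate.

    Each shard keeps its own accumulator of squared gradients, so no
    coordination with other shards is needed to scale an update.
    """
    def __init__(self, size, base_lr=0.01, eps=1e-8):
        self.params = np.zeros(size)
        self.sq_grad_sum = np.zeros(size)   # running sum of squared gradients
        self.base_lr = base_lr
        self.eps = eps

    def push_gradient(self, grad):
        self.sq_grad_sum += grad ** 2
        # Per-parameter learning rate: eta_i = base_lr / sqrt(sum of g_i^2).
        lr = self.base_lr / (np.sqrt(self.sq_grad_sum) + self.eps)
        self.params -= lr * grad

rng = np.random.default_rng(2)
shard = AdagradShard(size=4)
for _ in range(10):
    shard.push_gradient(rng.normal(size=4))   # asynchronous pushes from replicas
```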

The paper carries out some benchmarks for model parallelism and data parallelism. For model parallelism, the ratio between the speed-up and the number of machines per model is less than 1, meaning that adding machines does not provide a linear speed-up. The best benchmark has a ratio of roughly 0.16, using 64 machines for training. This is quite wasteful from a computation perspective, because roughly 84% of the compute is not doing useful work (largely due to network IO overhead). Model parallelism works best with neural networks that favour local connectivity, such as convolutional neural networks.

In the data parallelism benchmarks, Downpour SGD performs better than the other algorithms while using fewer machines. However, the accuracy of the trained model is still well below 50% on the test dataset. The takeaway from this paper is that it is possible to train a network with a very large number of parameters.

Thursday, May 2, 2013

Reviewed Article: LevelDB and Node: What is LevelDB Anyway?

From this article, the key takeaways for me are 1) a high-level overview of the architecture of LevelDB and 2) how other systems can use LevelDB as a backing store.

The architecture of LevelDB is similar to the storage component described in Google's BigTable architecture. It uses SSTables and a log-structured merge (LSM) approach to provide data access to the underlying physical storage. The design aims to maximize the use of disk throughput while providing CRUD operations. An SSTable is an immutable, disk-based data structure organized so that reading a single key, or a group of sorted keys, requires only a single disk seek. Combined with the LSM approach, it supports create, update and delete operations.
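
To make the SSTable-plus-LSM idea concrete, here is a toy in-memory sketch (my own simplification, not LevelDB's actual code): writes land in a mutable memtable, which is periodically flushed into an immutable sorted run, and deletes are recorded as tombstones:

```python
import bisect

class ToyLSM:
    """Simplified LSM store: a mutable memtable plus a list of immutable 'SSTables'."""
    TOMBSTONE = object()   # marker recorded for deletes

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []            # newest first; each is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        self.put(key, self.TOMBSTONE)   # a delete is just a write of a tombstone

    def get(self, key):
        # Check the memtable first, then the SSTables from newest to oldest.
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is self.TOMBSTONE else value
        for table in self.sstables:
            idx = bisect.bisect_left(table, (key,))
            if idx < len(table) and table[idx][0] == key:
                value = table[idx][1]
                return None if value is self.TOMBSTONE else value
        return None

    def flush(self):
        # The memtable becomes an immutable, sorted run (an 'SSTable' on disk).
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

db = ToyLSM()
db.put("a", 1)
db.put("b", 2)
db.delete("a")
assert db.get("a") is None and db.get("b") == 2
```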

The section that I found the most interesting is "Table File Hierarchy". It describes how LevelDB organizes its SSTable files, and it does exactly what the name LevelDB suggests: levels. The SSTable files are organized into levels: the top level (level 0) contains small SSTable files, and each level below it contains larger files than the level above. This organization keeps newly written data available at the top level, while older data is aggregated and stored in the lower levels.
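
A rough sketch of the levelled layout (the capacities and merge policy here are invented for illustration, not LevelDB's real compaction rules): new tables enter level 0, and when a level holds too much data its runs are merged down into the next, larger level:

```python
class LeveledStore:
    """Toy model of a table-file hierarchy: newest data sits at level 0,
    and overfull levels are compacted (merged) down into larger levels below."""

    def __init__(self, levels=4, level0_capacity=4):
        self.levels = [[] for _ in range(levels)]   # each level holds sorted (key, value) runs
        self.capacities = [level0_capacity * (10 ** i) for i in range(levels)]

    def add_sstable(self, sorted_items):
        self.levels[0].append(list(sorted_items))
        self._maybe_compact(0)

    def _maybe_compact(self, level):
        # When a level exceeds its capacity, merge all of its runs into one
        # bigger run and push it one level down (older data migrates downward).
        total = sum(len(run) for run in self.levels[level])
        if total <= self.capacities[level] or level + 1 >= len(self.levels):
            return
        merged = {}
        for run in self.levels[level]:
            merged.update(dict(run))                # later (newer) runs overwrite older keys
        self.levels[level] = []
        self.levels[level + 1].append(sorted(merged.items()))
        self._maybe_compact(level + 1)

store = LeveledStore()
for i in range(10):
    store.add_sstable([(f"key{i:02d}", i)])
```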

Node.js makes use of LevelDB as a backing store because LevelDB gives Node a way to stream a sorted range of keys very efficiently.
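
Because the keys are kept in sorted order, a range read is just a seek followed by a sequential scan, which is easy to expose as a stream; here is a small sketch of that access pattern (my own illustration, not the actual Node/LevelDB API):

```python
import bisect

def range_scan(sorted_items, start, stop):
    """Yield (key, value) pairs with start <= key < stop from a sorted table.

    Because the data is already sorted, the scan is one binary search followed
    by a sequential read, which is what makes streaming a key range cheap.
    """
    keys = [k for k, _ in sorted_items]
    idx = bisect.bisect_left(keys, start)
    while idx < len(sorted_items) and sorted_items[idx][0] < stop:
        yield sorted_items[idx]
        idx += 1

table = sorted({"user:01": "ann", "user:02": "bob", "user:03": "cat", "log:01": "x"}.items())
print(list(range_scan(table, "user:01", "user:03")))   # -> [('user:01', 'ann'), ('user:02', 'bob')]
```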

LevelDB can work well with Gecko, a contention-oblivious disk array that separates write and read workloads: writes are directed to dedicated write-only disks, while reads are spread across the other disks, with caches serving reads for data that sits on the write-only disks.