Wednesday, May 8, 2013

Reviewed Paper: Large Scale Distributed Deep Networks

I worked on neural networks in the past. They are by far the most interesting machine learning algorithm I've learnt. One question has always been in my mind: how is it possible to train a neural network using many machines in a cluster? This paper gives me a first glimpse of the answer.

Parallelizing the learning computation of a neural network (more specifically, a feed-forward neural network) involves passing messages forward and then backward through the network. Since the connections between layers are very dense, the time spent on IO is usually the dominant factor in training a neural network. Therefore, an efficient mechanism for passing messages between neurons (i.e. communication, synchronization and data transfer) is needed for model parallelism across machines to be beneficial.
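To get a feel for why IO dominates, here is a rough back-of-the-envelope sketch; the layer size, batch size, machine count and replication pattern are made-up numbers of mine, not figures from the paper.

```python
# Rough estimate of cross-machine traffic for one fully connected layer
# that is partitioned across several machines. All numbers here are
# illustrative assumptions, not figures from the paper.

units_per_layer = 4096        # neurons in the sending layer
bytes_per_value = 4           # float32 activations / gradients
machines = 8                  # machines the receiving layer is split across
batch_size = 256              # examples processed per forward/backward pass

# Each machine holding a slice of the next layer needs the full activation
# vector of the previous layer, so the activations are replicated to every
# machine; the backward pass sends gradients of the same size back.
bytes_per_pass = units_per_layer * bytes_per_value * batch_size * machines * 2

print(f"~{bytes_per_pass / 2**20:.0f} MiB moved across the network per pass")
```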

The paper suggests two ways to parallelize the learning computation: model parallelism and data parallelism. Model parallelism is concerned with partitioning the neural network computation across several machines and using all the available computational capability of each machine. Model parallelism is implemented by a framework called DistBelief. The secret sauce of the framework is "batching". To reduce the communication overhead of individual messages, a batch of messages is sent at a time between nodes. For instance, if Node 1 has interconnections to Node 2, the messages from Node 1 are sent to Node 2 only once per propagation (or vice versa, depending on the direction of the propagation). The largest configuration they experimented with has 32 machines with 16 cores each (i.e. a total of 512 cores), which is pretty amazing.
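A minimal sketch of the batching idea as I understand it; the shapes and the send() stub are my own illustration, not DistBelief code.

```python
import numpy as np

# Toy sketch of message batching between two "machines" during a forward
# pass. Instead of sending one message per neuron-to-neuron connection,
# machine A ships a single batched activation matrix to machine B once per
# propagation.

def send(destination, payload):
    """Stand-in for a real network transfer (e.g. an RPC or MPI send)."""
    print(f"sending {payload.nbytes} bytes to {destination} in one message")

batch_size, units_on_a = 128, 512
activations_on_a = np.random.randn(batch_size, units_on_a).astype(np.float32)

# One batched transfer per forward pass, rather than batch_size * units_on_a
# tiny messages, which is where the communication savings come from.
send("machine-B", activations_on_a)
```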

Although the model parallelism discussed in the paper is interesting, the focus of the paper is actually on data parallelism. Data parallelism is concerned with distributing training across multiple model replicas. Each model replica employs the model parallelism technique discussed above, but instead of having only one model, data parallelism uses multiple model replicas to simultaneously solve a single optimization problem. Note that this is different from training multiple models and choosing the one that performs best. The secret sauce of data parallelism is threefold: 1) a centralized, sharded parameter server which model replicas use to share their parameters, 2) dividing the training data so that each model replica can train independently on its own subset, and 3) combining the model replicas' training results to optimize the objective function for the model of interest.
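Here is a toy sketch of what a sharded parameter server might look like; the class names, shard sizes and update rule are my own simplification, not the paper's implementation.

```python
import numpy as np

# Toy sharded parameter server: parameters are partitioned across shards,
# and each model replica pushes gradients to / pulls parameters from the
# shards that own them. Illustrative sketch only, not DistBelief code.

class ParameterShard:
    def __init__(self, size, learning_rate=0.01):
        self.params = np.zeros(size, dtype=np.float32)
        self.learning_rate = learning_rate

    def apply_gradient(self, grad):
        # Plain SGD step; Downpour SGD layers an Adagrad rate on top of this.
        self.params -= self.learning_rate * grad

    def fetch(self):
        return self.params.copy()

# The full parameter vector is split across shards; a replica talks to every
# shard that owns a piece of the parameters it needs.
shards = [ParameterShard(size=1000) for _ in range(4)]

def push_gradients(grads_per_shard):
    for shard, grad in zip(shards, grads_per_shard):
        shard.apply_gradient(grad)

def pull_parameters():
    return np.concatenate([shard.fetch() for shard in shards])
```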

The paper further describes how the data parallelism setup works by introducing two algorithms: Downpour SGD and Sandblaster L-BFGS. Downpour SGD is an online training method whereas Sandblaster L-BFGS is a batch training method. Let's focus on Downpour SGD for a moment. It is an asynchronous stochastic gradient descent procedure that uses multiple model replicas for training. Each model replica trains on a subset of the training data and sends its gradient information back to the centralized parameter servers. Since there is no synchronization among the model replicas or among the parameter server shards, the gradient information is applied asynchronously to the parameters. That is, the parameters are most likely inconsistent, because some parameter server shards update their parameters faster than others; requesting the full set of parameters at any given time is not guaranteed to return a fully up-to-date set. Because of this stochastic nature, the optimization is not guaranteed to converge, so the authors suggest using the Adagrad adaptive learning rate, which can be computed locally within each parameter server shard because the learning rate for each parameter depends only on that parameter's own previous gradients. The main difference in Sandblaster L-BFGS is that it has a coordinator that schedules tasks for the parameter servers and model replicas, acting like a distributed transaction manager to regulate the learning process.
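Below is a minimal single-process sketch of a Downpour-style replica loop with a per-parameter Adagrad learning rate. The model, data and update functions are placeholders I made up; in the real system the gradient pushes and parameter pulls are asynchronous requests to the parameter server shards.

```python
import numpy as np

# Sketch of one model replica's training loop in a Downpour-SGD-like setup.
# The Adagrad rate needs only each parameter's own gradient history, which
# is why every shard can compute it locally.

def compute_gradient(params, minibatch):
    """Placeholder for forward/backward on this replica's data shard."""
    return np.random.randn(*params.shape).astype(np.float32) * 0.01

def adagrad_update(params, grad, hist, base_lr=0.01, eps=1e-6):
    hist += grad ** 2                       # per-parameter gradient history
    params -= base_lr * grad / (np.sqrt(hist) + eps)
    return params, hist

params = np.zeros(1000, dtype=np.float32)   # pulled from the parameter server
hist = np.zeros_like(params)

for minibatch in range(100):                # this replica's slice of the data
    grad = compute_gradient(params, minibatch)
    # In the real system the gradient is pushed to the shards and fresh
    # (possibly inconsistent) parameters are pulled back every few steps.
    params, hist = adagrad_update(params, grad, hist)
```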

The paper carries out some benchmarks for model parallelism and data parallelism. For model parallelism, the ratio between the speed-up and the number of machines per model is less than 1, meaning that adding machines does not provide a linear speed-up. The best benchmark has a ratio of ~0.16 using 64 machines for training, i.e. roughly a 10x speed-up. This is quite wasteful from a computation perspective, because about 84% of the machine time is not doing the actual work (mostly due to network IO overhead). Model parallelism works best with neural networks that favour local connectivity, such as convolutional neural networks.

In the data parallelism benchmarks, Downpour SGD performs better than the other methods while using fewer machines. However, the accuracy of the trained model is still far below 50% on the test dataset. The takeaway from this paper is that it is possible to train a network with a very large number of parameters.

Thursday, May 2, 2013

Reviewed Article: LevelDB and Node: What is LevelDB Anyway?

From this article, the key takeaways for me are 1) a high-level overview of the architecture of LevelDB and 2) how other systems can use LevelDB as a backing store.

The architecture of LevelDB is similar to the storage component described in the Google BigTable architecture. It uses SSTables and a log-structured merge (LSM) design to provide data access to the underlying physical storage. The design aims to maximize disk throughput while providing CRUD operations. An SSTable is an immutable, disk-based data structure organized so that reading a single key or a group of sorted keys requires only a single disk seek. In combination with the LSM design, it supports create, update and delete operations.
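A very small in-memory sketch of the LSM idea, as my own simplification rather than LevelDB's actual code: writes go to a sorted in-memory table, which is periodically frozen into an immutable sorted file; reads check the memtable first and then the SSTables from newest to oldest; deletes are just writes of a tombstone marker.

```python
# Simplified sketch of the LSM write/read path (not LevelDB's actual code).

TOMBSTONE = object()

memtable = {}          # mutable, in-memory, holds the most recent writes
sstables = []          # list of immutable sorted tables, newest first
MEMTABLE_LIMIT = 4

def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        # Flush: freeze the memtable into a sorted, immutable SSTable.
        sstables.insert(0, dict(sorted(memtable.items())))
        memtable.clear()

def delete(key):
    put(key, TOMBSTONE)

def get(key):
    # Check the memtable, then the SSTables from newest to oldest.
    for table in [memtable] + sstables:
        if key in table:
            value = table[key]
            return None if value is TOMBSTONE else value
    return None
```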

The section that I found the most interesting is "Table File Hierarchy". It describes how LevelDB organizes the SSTable files. It does exactly what the name LevelDB suggests: levels. The SSTable files are organized into levels: the top level (level 0) contains small SSTable files, and each level down contains larger files than the level above. This organization keeps the newest data available in the top level, while older data is aggregated and stored in the lower levels.
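A sketch of the level organization, just to make the idea concrete; the size budgets and file counts are numbers I made up, not LevelDB's actual tuning.

```python
# Illustrative sketch of LevelDB-style levels (numbers are made up).
# New SSTables from memtable flushes land in level 0; when a level holds
# too much data, its files are merged (compacted) into the next level down,
# which is allowed to hold roughly 10x more data.

LEVEL_LIMITS_MB = [4, 10, 100, 1000, 10000]   # rough size budget per level
levels = [[] for _ in LEVEL_LIMITS_MB]        # each entry: list of file sizes (MB)

def flush_sstable(size_mb):
    levels[0].append(size_mb)
    compact(0)

def compact(level):
    if level + 1 >= len(levels):
        return
    if sum(levels[level]) > LEVEL_LIMITS_MB[level]:
        # Merge everything from this level into one bigger file one level down.
        merged = sum(levels[level])
        levels[level].clear()
        levels[level + 1].append(merged)
        compact(level + 1)

for _ in range(10):
    flush_sstable(2)        # repeated flushes trickle data down the levels
print(levels)
```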

The Node.js community makes use of LevelDB as a backing store because LevelDB provides a way for Node to stream a set of sorted keys very efficiently.

LevelDB can also be paired well with Gecko, a contention-oblivious disk array that separates write and read workloads by dedicating some disks to writes only, while read workloads are spread across the other disks, with caches covering reads of data that lives on the write-only disks.

Saturday, December 10, 2011

My Experience with HBase Dynamic Partitioning (a.k.a. Region Splitting)

I have been working on HBase for 9 months and below is a summary of my experience with it to date. I hope it can help someone to evaluate the technology for their specific use cases.

In this post I will skip all the information that you can obtain from the official HBase website and the HBase: The Definitive Guide book. Both of them are great resources for understanding how to configure and use HBase in a scalable manner. I also won't go into the details of why we chose HBase, because that has been covered extensively by other HBase users (please search Google for those). What I'm going to do here is describe some challenges of using HBase in a real production environment.

The smallest component that encapsulates the data processing engine of HBase is the region. An HBase regionserver acts as a container for those regions. In the terminology of some In-Memory Data Grid (IMDG) technologies, a region is called a partition of the data. Each partition is responsible for a subset of the data in HBase. A user request is routed to the particular regionserver that contains the region holding the data the request specifies (usually by key). By default, there is only one region per table. HBase employs an automatic, dynamic partitioning scheme on regions to scale the system. For instance, if each region is configured to hold 100MB of data, a region will split into two regions of roughly 50MB each once that 100MB threshold is exceeded.
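A toy sketch of the split behaviour described above; splitting at the midpoint of the key range is a simplification of mine, since real HBase picks the split key from its store files.

```python
# Toy illustration of dynamic partitioning: a region covering a key range
# splits into two daughter regions once it exceeds a size threshold.

SPLIT_THRESHOLD_MB = 100

class Region:
    def __init__(self, start_key, end_key, size_mb=0):
        self.start_key, self.end_key, self.size_mb = start_key, end_key, size_mb

    def maybe_split(self):
        if self.size_mb <= SPLIT_THRESHOLD_MB:
            return [self]
        # Simplification: split at the byte midpoint of the key range.
        mid = chr((ord(self.start_key) + ord(self.end_key)) // 2)
        half = self.size_mb / 2
        return [Region(self.start_key, mid, half), Region(mid, self.end_key, half)]

regions = Region("a", "z", size_mb=120).maybe_split()
for r in regions:
    print(r.start_key, r.end_key, r.size_mb)   # two ~60MB daughter regions
```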

The dynamic partitioning scheme is both a "love" and a "hate" feature, IMHO. It is great when you have no knowledge about the data (such as the range of keys that will be written, the volume of the data, etc.), because it automatically scales the system and provides some degree of scalability. It is awful when the load on the system is high and it fails to partition the data appropriately. I was one of those who initially loved the feature and then came to hate it because of the following issues, observed after running HBase in production for several months.

Issue #1: Dangling Offline Parent Regions

If you query the .META. table of HBase directly, you will find some regions with OFFLINE=true and SPLIT=true. According to the HBase books, this shouldn't happen, because the CatalogJanitor should remove those parent regions once the split is completed. However, I have observed that even when the split was successful (in the sense that two new daughter regions were created, each holding half of the parent region's data), the meta info of the parent region is not removed from HBase under some conditions.
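A hypothetical sketch of how one might look for these leftovers, assuming the happybase Thrift client is available, the Thrift server runs on a host I've called hbase-master, and that a split parent still carries info:splitA / info:splitB columns in its .META. row until the CatalogJanitor cleans it up.

```python
import happybase

# Hypothetical sketch: scan .META. for rows that still reference daughter
# regions, which suggests a split parent that was never cleaned up.

connection = happybase.Connection('hbase-master')   # assumed Thrift endpoint
meta = connection.table('.META.')

dangling_parents = []
for row_key, columns in meta.scan():
    if b'info:splitA' in columns or b'info:splitB' in columns:
        # Still has daughter references recorded: likely a dangling parent.
        dangling_parents.append(row_key)

for region in dangling_parents:
    print("possible dangling split parent:", region)
```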

Issue #2: Overlapping Regions

This usually happens when a failover occurs during a region split (dynamic partitioning). The region state does not make clear whether the split completed, so HBase tries its best to recover by splitting the region again. However, this can produce overlapping regions if the region had already been split successfully before the failover occurred (or before the meta information about the split region was removed from HBase). This issue happens even more frequently if issue 1 happened before issue 2. Say you run the system for a month before a failover occurs and there are dangling offline parent regions left over in the .META. table; then HBase is practically guaranteed to hit issue 2, because those dangling offline parents contain stale information that no longer applies to the current state of the cluster (imagine the daughters becoming parents and undergoing their own splits).
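A small sketch of how overlaps can be detected from the region boundaries alone; the key ranges below are invented, and in practice they would come from .META. or from each region's .regioninfo file.

```python
# Detect overlapping regions from their key ranges (illustrative only).

def find_overlaps(regions):
    """regions: list of (start_key, end_key) tuples; '' means unbounded."""
    overlaps = []
    ordered = sorted(regions, key=lambda r: r[0])
    for (s1, e1), (s2, e2) in zip(ordered, ordered[1:]):
        # Each region should start exactly where the previous one ends.
        if e1 and s2 < e1:
            overlaps.append(((s1, e1), (s2, e2)))
    return overlaps

regions = [("", "row500"), ("row300", "row800"), ("row800", "")]
print(find_overlaps(regions))   # [(('', 'row500'), ('row300', 'row800'))]
```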

Issue #3: Out-of-Sync Meta Information

There are two places that hold the meta information of regions: 1) the .META. table and 2) the region directories under HDFS. The meta info goes out of sync when the directory of a region is not removed from HDFS after a region split. Normally this won't affect a running system, apart from consuming some disk space, but it becomes an issue when you try to recover the .META. table from HDFS, because the meta information on HDFS was not in sync with the .META. table before the .META. table was corrupted. Recovery becomes very difficult because you have no idea whether you should keep or remove a region directory in HDFS without digging into the .regioninfo of each region. Note that issue 3 will also cause issue 2.
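This is essentially a set-difference check between the two sources of truth. The sketch below stubs out both sets with invented region names; in practice one would come from scanning .META. and the other from listing the table's directory on HDFS.

```python
# Compare the regions recorded in .META. against the region directories on
# HDFS. Both input sets are stand-ins with made-up names.

regions_in_meta = {"region-aaa", "region-bbb", "region-ccc"}
regions_on_hdfs = {"region-aaa", "region-bbb", "region-ccc", "region-old"}

leftover_dirs = regions_on_hdfs - regions_in_meta   # dirs not cleaned up after splits
missing_dirs = regions_in_meta - regions_on_hdfs    # meta rows without data on disk

print("directories with no .META. entry:", leftover_dirs)
print(".META. entries with no directory:", missing_dirs)
```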

Issue #4: Inactive Regions

Dynamic partitioning is like a black box: it hides the details of which region is responsible for which subset of data. This becomes challenging to manage when you delete data and some regions become inactive, in the sense that they no longer hold any data because it has all been removed. Those inactive regions still use up resources and participate in the cluster's load balancing algorithm, which leads to unbalanced processing load across the cluster. For instance, you might see 50 regions on each of regionserver 1 and regionserver 2, but if the 50 regions on regionserver 1 are all inactive (i.e. no user request will route to regionserver 1), regionserver 2 ends up handling all the user requests (cluster utilization is 50% in this case).
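A sketch of how inactive regions could be spotted from per-region request counts; the counts and region names below are invented, and in practice they would come from the regionserver metrics.

```python
# Flag regions that no longer receive any requests (made-up numbers).

requests_per_region = {
    "rs1/region-01": 0, "rs1/region-02": 0,        # inactive: data was deleted
    "rs2/region-03": 1850, "rs2/region-04": 1720,  # active regions
}

inactive = [name for name, count in requests_per_region.items() if count == 0]
active = len(requests_per_region) - len(inactive)

print("inactive regions:", inactive)
print(f"share of regions doing real work: {active / len(requests_per_region):.0%}")
```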

Conclusion

Those are the biggest challenges I have seen in the past 9 months, and in the next post I will describe some solutions to fix these problems. However, the root cause of these issues stems from the dynamic partitioning scheme, and therefore pre-splitting the table is really a MUST and not an option if you want a smoothly running system, as suggested in the HBase books! On top of that, my experience with IMDGs is that no vendor has a REAL dynamic partitioning scheme. They use explicit/implicit partitioning, all of which starts with a fixed number of partitions. Some products give the illusion of dynamic partitioning, but in fact they spill data over to an already-created partition to handle the extra load.