Saturday, December 10, 2011

My Experience with HBase Dynamic Partitioning (a.k.a. Region Splitting)

I have been working with HBase for 9 months, and below is a summary of my experience with it to date. I hope it can help someone evaluate the technology for their specific use cases.

In this post I will skip all the information that you can obtain from the official HBase website and the HBase: The Definitive Guide book. Both of them are great resources for understanding how to configure and use HBase in a scalable manner. I also won't go into the details of why we chose HBase, because that has been covered extensively by other HBase users (please search on Google for those). What I will do here is describe some of the challenges of using HBase in a real production environment.

The smallest component that encapsulates the data processing engine of HBase is the region. The HBase regionserver acts as a container for those regions. In the terminology of some In-Memory Data Grid technologies, a region would be called a partition of the data. Each partition is responsible for a subset of the data in HBase. A user request is routed to the particular regionserver containing the region that holds the data the request addresses (usually by key). By default, there is only one region per table. HBase employs an automatic, dynamic partitioning scheme on regions to scale the system. For instance, if each region is configured to hold 100MB of data, then once that threshold (i.e. 100MB) is exceeded the region splits into two regions, each holding roughly 50MB of data.
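
As a concrete illustration, the split threshold can be set per table through the Java client. This is only a minimal sketch, assuming the 0.90-era API that was current at the time of writing; the table name "mytable", the column family "cf", and the 100MB value are placeholders, not anything from our production setup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTableWithSplitThreshold {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // "mytable" and "cf" are placeholder names.
    HTableDescriptor desc = new HTableDescriptor("mytable");
    desc.addFamily(new HColumnDescriptor("cf"));

    // Regions of this table split once they grow beyond ~100MB, overriding
    // the cluster-wide hbase.hregion.max.filesize setting from hbase-site.xml.
    desc.setMaxFileSize(100 * 1024 * 1024L);

    admin.createTable(desc);
  }
}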

The dynamic partitioning scheme is both a "love" and a "hate" feature IMHO. It is great when you have no knowledge about the data (such as the range of keys that will be written, the volume of the data, etc.) because it automatically scales the system and provides some degree of scalability. It is awful when the load on the system is high and it fails to partition the data appropriately. I was one of those who initially loved the feature and then came to hate it, because of the following issues observed after running HBase in production for several months.

Issue #1: Dangling Offline Parent Regions

If you query the .META. table of HBase directly, you will find some regions with "OFFLINE=true and SPLIT=true". According to the HBase books, this shouldn't happen, because the CatalogJanitor should remove those parent regions once the split is completed. However, I have observed that even when the split was successful (in the sense that two new daughter regions were created, each holding half of the parent region's data), the meta information of the parent region is not removed from HBase under some conditions.
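
To see whether you are affected, you can scan .META. yourself and look for region entries that are both offline and split. A sketch of that check, assuming the 0.90-era Java client (this is my own diagnostic, not an official tool):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Writables;

public class FindDanglingParents {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable meta = new HTable(conf, ".META.");
    Scan scan = new Scan();
    scan.addColumn(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER);
    ResultScanner scanner = meta.getScanner(scan);
    for (Result r : scanner) {
      byte[] bytes = r.getValue(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER);
      if (bytes == null) continue;
      HRegionInfo info = Writables.getHRegionInfo(bytes);
      // A parent that has split and gone offline should eventually be removed
      // by the CatalogJanitor; if it lingers here, it is a dangling parent.
      if (info.isOffline() && info.isSplit()) {
        System.out.println("Dangling parent: " + info.getRegionNameAsString());
      }
    }
    scanner.close();
    meta.close();
  }
}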

Issue #2: Overlapping Regions

This usually happens when a failover occurs during a region split (dynamic partitioning). The region's state does not make clear whether the split completed or not, so HBase tries its best to recover by splitting the region again. However, this can produce overlapping regions if the region had already been split successfully before the failover occurred (or before the meta information about the region being split was removed from HBase). This issue happens even more frequently if issue #1 happened before issue #2. Say you run the system for a month before a failover occurs and there are dangling offline parent regions left over in the .META. table; then you are practically guaranteed to hit issue #2, because those dangling offline parents contain stale information that no longer applies to the current state of the cluster (imagine the daughters becoming parents and undergoing their own splits).
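
A simple way to check for this condition (again my own sanity check, not something HBase ships with): collect the non-offline regions of a table, sort them by start key, and verify that each region starts exactly where the previous one ends. A sketch, assuming the list of HRegionInfo objects is gathered with a .META. scan like the one under issue #1:

import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.util.Bytes;

public class OverlapCheck {
  // 'regions' is assumed to hold the non-offline regions of a single table.
  public static void report(List<HRegionInfo> regions) {
    Collections.sort(regions, new Comparator<HRegionInfo>() {
      public int compare(HRegionInfo a, HRegionInfo b) {
        return Bytes.compareTo(a.getStartKey(), b.getStartKey());
      }
    });
    for (int i = 1; i < regions.size(); i++) {
      byte[] prevEnd = regions.get(i - 1).getEndKey();
      byte[] start = regions.get(i).getStartKey();
      // An empty end key means "end of table", so anything after it overlaps.
      int cmp = prevEnd.length == 0 ? 1 : Bytes.compareTo(prevEnd, start);
      if (cmp > 0) {
        System.out.println("Overlap between " + regions.get(i - 1).getRegionNameAsString()
            + " and " + regions.get(i).getRegionNameAsString());
      } else if (cmp < 0) {
        System.out.println("Hole before " + regions.get(i).getRegionNameAsString());
      }
    }
  }
}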

Issue #3: Out-of-Sync Meta Information

There are two places that hold the meta information of regions: 1) the .META. table and 2) the region directories under HDFS. The meta information goes out of sync when the directory of a region is not removed from DFS after a region split. Normally this won't affect your running system, apart from consuming some disk space, but it becomes an issue when you try to recover the .META. table from DFS, because the meta information on DFS is not in sync with the .META. table as it was before the corruption. Recovery becomes very difficult because you have no idea whether you should keep or remove a region directory in DFS without digging into the .regioninfo file of each region. Note that issue #3 will also cause issue #2.
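
The mismatch can be spotted by listing the table's directory in HDFS and comparing the encoded region names against what .META. reports. A rough sketch, assuming the 0.90-era layout where regions live under /hbase/<table>/<encoded-name> (the root directory depends on hbase.rootdir, and "mytable" is a placeholder):

import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CompareMetaWithDfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // Encoded region names found on DFS for the (hypothetical) table "mytable".
    Set<String> onDisk = new HashSet<String>();
    for (FileStatus status : fs.listStatus(new Path("/hbase/mytable"))) {
      if (status.isDir() && !status.getPath().getName().startsWith(".")) {
        onDisk.add(status.getPath().getName());
      }
    }

    // 'inMeta' should be filled with HRegionInfo.getEncodedName() values from a
    // .META. scan like the one shown under issue #1; omitted here for brevity.
    Set<String> inMeta = new HashSet<String>();

    onDisk.removeAll(inMeta);
    System.out.println("Region directories on DFS with no .META. entry: " + onDisk);
  }
}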

Issue #4: Inactive Regions

Dynamic partitioning is like a black box in that it hides the details of which region is responsible for which subset of the data. It becomes challenging to manage when you delete the data in some regions: those regions become inactive, in the sense that they no longer hold any data in the system because the data has been removed. Those inactive regions still use up resources and participate in the cluster's load balancing algorithm, which leads to unbalanced processing loads across the cluster. For instance, you might see 50 regions on each of regionserver 1 and regionserver 2, but if the 50 regions on regionserver 1 are all inactive (i.e. no user request will route to regionserver 1), then regionserver 2 handles all the user requests (cluster utilization is 50% in this case).
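
There is no flag in .META. that marks a region as "inactive", so the only way I know to find them is to probe each region's key range for at least one row. A sketch of that idea, again assuming the 0.90-era client and a made-up table name (note the probe scans can be slow on large tables):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class FindEmptyRegions {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");

    // getRegionsInfo() maps each HRegionInfo of the table to the server hosting it.
    for (HRegionInfo region : table.getRegionsInfo().keySet()) {
      // Scan only this region's key range and stop after the first row found.
      Scan probe = new Scan(region.getStartKey(), region.getEndKey());
      probe.setCaching(1);
      ResultScanner scanner = table.getScanner(probe);
      boolean empty = scanner.next() == null;
      scanner.close();
      if (empty) {
        System.out.println("Inactive (empty) region: " + region.getRegionNameAsString());
      }
    }
    table.close();
  }
}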

Conclusion

Those are the biggest challenges I have seen in the past 9 months, and in the next post I will describe some solutions to these problems. However, the root cause of these issues stems from the dynamic partitioning scheme, and therefore pre-splitting the table is really a MUST and not an option if you want a smoothly running system, as suggested in the HBase books! On top of that, my experience with IMDGs is that no vendor has a REAL dynamic partitioning scheme. They use explicit/implicit partitioning, all of which start with a fixed number of partitions. Some products give the illusion of dynamic partitioning, but in fact they spill over to an already created partition to handle extra load.
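
For reference, pre-splitting at creation time looks roughly like this with the Java admin API. The table name, column family, and split keys below are made up for illustration; real split keys must be chosen from your own key distribution:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("mytable");
    desc.addFamily(new HColumnDescriptor("cf"));

    // Ten regions up front: split keys are boundaries between regions, so
    // n split keys yield n + 1 regions. These digit prefixes assume row keys
    // start with a uniformly distributed digit; use your own boundaries.
    byte[][] splitKeys = new byte[][] {
        Bytes.toBytes("1"), Bytes.toBytes("2"), Bytes.toBytes("3"),
        Bytes.toBytes("4"), Bytes.toBytes("5"), Bytes.toBytes("6"),
        Bytes.toBytes("7"), Bytes.toBytes("8"), Bytes.toBytes("9")
    };
    admin.createTable(desc, splitKeys);
  }
}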

Friday, February 19, 2010

Computing as a Service in the Cloud Computing Stack: 2nd Revisit

First of all, I have to admit that I stole this article from one of my posts to the company mailing list. I just modified it to remove confidential information and add more content to align with my previous posts on the same topic.

Apparently, what I proposed previously in the speculation is so realizable that a company has actually built a product around it. :) The company I'm talking about is "Penguin Computing". Penguin Computing offers "HPC as a Service", specifically designed to bring HPC customers to the Cloud. POD (Penguin on Demand) is an HPC system in the cloud, but NO VIRTUALIZATION technology is used. Virtualization is avoided to keep the performance of the HPC applications optimal, because each customer gets dedicated PHYSICAL servers to run their HPC application.

You might wonder how this "HPC as a Service" relates to my previous posts. Firstly, "HPC as a Service" allows HPC customers to run their HPC tasks in the Cloud. Customers are unaware of whether their tasks run on specific designated resources, but they are guaranteed the performance they need. Secondly, they provide multi-core virtualization MANUALLY. The website says
"Experts in high-performance computing help setup and optimize the software environment and can help trouble-shoot issues that might occur."
That is, if an HPC task doesn't perform as expected, the HPC experts will help the customer optimize the task to run on the specific hardware and take advantage of its resources. Unlike other types of applications that aim for high throughput, HPC tasks are usually long-running algorithms processed in parallel, with the result of one task not dependent on the other tasks. The level of sophistication in this type of task is high, and therefore it has to be fine-tuned manually. Although this doesn't provide the same "virtualization" I mentioned in the previous post, it is very close to the idea, because someone (in this case, the HPC experts) takes care of the messy details of Computing.

Tuesday, February 16, 2010

Computing as a Service in the Cloud Computing Stack: 1st Revisit

After the previous post, "A Speculation of Computing as a Service in the Cloud Computing Stack", I kept a close watch on the blogosphere to see if anyone in the world had ideas similar to what I proposed in the speculation. Two articles I found today shed similar light on the speculation: "The HPC Software Conundrum" and "Engineering Parallel Applications with Tunable Architectures". The former article says
"... multi-core and streaming-core needs software in a big way. And, there is no sense in bragging about how well the hardware works unless there is an adequate software base to drive the hardware to its intended performance."
The author further lists a number of solutions that are available today for programmers to exploit this great hardware: OpenMP, CUDA, OpenCL and Ct. However, the author points out that none of these solutions actually tells programmers how the code should be updated/rewritten using these new tools; "no one size fits all or no silver bullet," as the author would say. He also says
"I still get the feel that it is a kludge of some sort and there should be a way to express what you want to do in a parallel cluster environment using a clear concise standard level of abstraction."
with which I agree. There is a need for a layer of abstraction that takes care of the hardware, be it SMP multi-core with or without a GPU architecture. This layer of abstraction should be intelligent enough to make use of all the available hardware in the machine to improve utilization.

The second article suggests a new approach to address the difficulty of adapting a serial program into a parallel program on the new hardware architectures. The approach automates the architecture adaptation of parallel programs and uses an auto-tuner to find the best-performing architectural choice for a given machine, by introducing a new architecture description language based on parallel patterns and a framework for expressing architecture variants in a generic way. This is a step toward multi-core virtualization.

I wish to see more contributions on the software side to bring programmers into this multi-core era!