Saturday, December 10, 2011

My Experience with HBase Dynamic Partitioning (a.k.a. Region Splitting)

I have been working on HBase for 9 months, and below is a summary of my experience with it to date. I hope it helps someone evaluate the technology for their specific use cases.

In this post I will skip all the information you can obtain from the HBase official website and the HBase: The Definitive Guide book. Both of them are great resources for understanding how to configure and use HBase in a scalable manner. Also, I won't go into the details of why we chose HBase, because that has been covered extensively by other HBase users (please search on Google for those). What I'm going to do here is describe some challenges of using HBase in a real production environment.

The smallest component that encapsulates the data processing engine of HBase is the region. The regionserver of HBase acts as a container for those regions. In the terminology of some In-Memory Data Grid technologies, a region is called a partition of the data. Each partition is responsible for a subset of the data in HBase. A user request is routed to the particular regionserver that contains the region holding the data the request specifies (usually by key). By default, there is only 1 region per table. HBase employs an automatic, dynamic partitioning scheme on regions to scale the system. For instance, if each region is configured to hold 100MB of data, the region will split into 2 regions, each holding 50MB of data, when this threshold (i.e. 100MB) is exceeded.
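To make the threshold concrete, here is a minimal sketch (assuming the 0.90-era HBase Java client; the table name "my_table" and column family "d" are hypothetical) of creating a table whose regions split once they grow past roughly 100MB, by setting the per-table equivalent of hbase.hregion.max.filesize:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTableWithSplitThreshold {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("my_table"); // hypothetical table name
        desc.addFamily(new HColumnDescriptor("d"));               // hypothetical column family
        // Per-table override of hbase.hregion.max.filesize:
        // regions of this table split once they grow past ~100MB.
        desc.setMaxFileSize(100L * 1024 * 1024);

        admin.createTable(desc);
    }
}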

The dynamic partitioning scheme is both a "love" and a "hate" feature IMHO. It is great when you have no knowledge about the data (such as the range of keys that will be written, the volume of the data, etc.) because it automatically scales the system and provides some degree of scalability. It is awful when the load on the system is high and it fails to dynamically partition the data appropriately. I was one of those who initially loved the feature and then turned to hate it, because of the following issues observed after running HBase in production for several months now.

Issue #1: Dangling Offline Parent Regions

If you query the .META. table of HBase directly, you will find some regions with "OFFLINE=true and SPLIT=true". According to the HBase books, this shouldn't happen because the CatalogJanitor should remove those parent regions once the split is completed. However, I observed that even when the split was successful (in the sense that two new daughter regions were created, each holding half of the parent region's data), the meta info of the parent region is not removed from HBase under some conditions.
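As a concrete illustration, here is a minimal sketch (assuming the 0.90-era client API) of how you can list these dangling parents yourself, by scanning .META. and checking the offline/split flags of each region:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Writables;

public class FindDanglingSplitParents {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable meta = new HTable(conf, ".META.");
        ResultScanner scanner = meta.getScanner(new Scan());
        for (Result row : scanner) {
            byte[] bytes = row.getValue(HConstants.CATALOG_FAMILY,
                                        HConstants.REGIONINFO_QUALIFIER);
            if (bytes == null) continue;
            HRegionInfo info = Writables.getHRegionInfo(bytes);
            // A parent left behind by a completed split is both offline and split.
            if (info.isOffline() && info.isSplit()) {
                System.out.println("Dangling parent: " + info.getRegionNameAsString());
            }
        }
        scanner.close();
        meta.close();
    }
}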

Issue #2: Overlapping Regions

This usually happens when a failover occurs during a region split (dynamic partitioning). The region's state is unclear as to whether the split completed or not, and HBase tries its best to recover by splitting the region again. However, this can create overlapping regions if the region had already been split successfully before the failover occurred (or before the meta information about the split region was removed from HBase). This issue happens even more frequently if issue #1 happened before issue #2. Let's say you run the system for 1 month before a failover occurs and there are dangling offline parent regions left over in the .META. table; then you are practically guaranteed to hit issue #2, because those dangling offline parents contain old information that no longer applies to the current state of the cluster (imagine the daughters becoming parents and undergoing their own splitting processes).
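If you want to check your own cluster for this, here is a minimal sketch (assuming the 0.90-era client API; "my_table" is a hypothetical table name) that walks a table's regions in start-key order and flags any region whose start key falls before the previous region's end key:

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.HServerAddress;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class FindOverlappingRegions {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "my_table"); // hypothetical table name

        // getRegionsInfo() returns region -> server, keyed in start-key order.
        Map<HRegionInfo, HServerAddress> regions = table.getRegionsInfo();
        byte[] previousEnd = null;
        for (HRegionInfo region : regions.keySet()) {
            if (previousEnd != null
                    && previousEnd.length > 0  // an empty end key means "end of table"
                    && Bytes.compareTo(region.getStartKey(), previousEnd) < 0) {
                System.out.println("Overlap at region: " + region.getRegionNameAsString());
            }
            previousEnd = region.getEndKey();
        }
        table.close();
    }
}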

Issue #3: Out-of-Sync Meta Information

There are two places that hold the meta information of regions: 1) the .META. table and 2) the region directory under HDFS. The meta info goes out of sync when the directory of a region is not removed from DFS after a region split occurs. Normally this won't affect your running system, apart from consuming some disk space, but it becomes an issue when you try to recover the .META. table from DFS, because the meta information on DFS was already out of sync with the .META. table before the .META. table got corrupted. Recovery becomes very difficult because you have no idea whether you should keep or remove a region in DFS without digging into the .regioninfo of each region. Note that issue #3 will also cause issue #2.
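Here is a minimal sketch of how you can spot the leftover directories (assumptions: the 0.90-era client API, the default hbase.rootdir layout of /hbase/<table>/<encoded region name>, and a hypothetical table called "my_table"). It compares the region directories on DFS with the regions that .META. knows about:

import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.client.HTable;

public class FindOrphanRegionDirs {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Encoded names of the regions that .META. knows about for this table.
        HTable table = new HTable(conf, "my_table"); // hypothetical table name
        Set<String> known = new HashSet<String>();
        for (HRegionInfo info : table.getRegionsInfo().keySet()) {
            known.add(info.getEncodedName());
        }
        table.close();

        // Region directories that actually exist under the table's directory on DFS.
        FileSystem fs = FileSystem.get(conf);
        Path tableDir = new Path("/hbase/my_table"); // assumes the default hbase.rootdir
        for (FileStatus status : fs.listStatus(tableDir)) {
            String dirName = status.getPath().getName();
            if (status.isDir() && !dirName.startsWith(".") && !known.contains(dirName)) {
                System.out.println("Orphan region dir: " + status.getPath());
            }
        }
    }
}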

Issue #4: Inactive Regions

Dynamic partitioning is like a black box: it hides the details of which region is responsible for which subset of data. It becomes challenging to manage when you delete data from some regions and those regions become inactive, in the sense that they no longer hold any data in the system because the data has been removed. Those inactive regions still use up resources and participate in the load balancing algorithm of the cluster, which leads to unbalanced processing loads in the cluster. For instance, you might see 50 regions on each of regionserver 1 and regionserver 2, but all 50 regions on regionserver 1 are inactive (i.e. no user request will route to regionserver 1), so regionserver 2 handles all the user requests (cluster utilization is 50% in this case).
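Here is a minimal sketch (assuming the 0.90-era client API; "my_table" is a hypothetical table name) of finding such inactive regions, by scanning each region's key range and flagging regions that no longer return a single row:

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.HServerAddress;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;

public class FindEmptyRegions {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "my_table"); // hypothetical table name

        for (Map.Entry<HRegionInfo, HServerAddress> e : table.getRegionsInfo().entrySet()) {
            HRegionInfo region = e.getKey();
            Scan scan = new Scan(region.getStartKey(), region.getEndKey());
            scan.setFilter(new FirstKeyOnlyFilter()); // we only care whether any row exists
            scan.setCaching(1);
            ResultScanner scanner = table.getScanner(scan);
            boolean empty = (scanner.next() == null);
            scanner.close();
            if (empty) {
                System.out.println("Inactive region: " + region.getRegionNameAsString()
                        + " on " + e.getValue());
            }
        }
        table.close();
    }
}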

Conclusion

Those are the biggest challenges I have seen in the past 9 months; in the next post, I will describe some solutions to these problems. However, the root cause of these issues stems from the dynamic partitioning scheme, and therefore pre-splitting the table is really a MUST and not an option if you want a smoothly running system, as suggested in the HBase books! On top of that, my experience with IMDGs is that no vendor has a REAL dynamic partitioning scheme. They use explicit/implicit partitioning, all of which start with a fixed number of partitions. Some products give the illusion of dynamic partitioning, but in fact they spill over to an already created partition to handle the extra load.

2 comments:

Unknown said...

Hi, thanks for sharing this very valuable information. It will definitely help me in the near future.

I have a time-series row key structure, and it will produce hot regions...
I chose this row key structure because I will make a lot of scans over the data, so read performance is the most important.

I want to ask you if it's possible to add new regions to an existing table. This would help me make the import more distributed. I searched Google for this question, but with no concrete results.

If you know whether this kind of operation can be done, please be kind enough to help me.

Tudor !

Jerry Lam said...

Hi Tudor,

If you are working with time-series data, I would recommend looking at how OpenTSDB (http://opentsdb.net/) did it with HBase. There are a lot of videos online that describe the rowkey design that makes this possible with HBase.

Also, the HBase documentation has a section that specifically describes rowkey design for time series (http://hbase.apache.org/book/rowkey.design.html).

To answer your question, HBase allows you to "create" regions by means of splitting a region into 2. The reason it is a split rather than a creation is that, at the beginning, HBase has only 1 region that spans the entire rowkey space. When a region holds more data than it is configured for (hbase.hregion.max.filesize), it will eventually split into 2.

You can pre-split the regions so that each region handles a non-overlapping subset of the key range (http://hbase.apache.org/book/important_configurations.html). This is the way to "create" regions up front in HBase.
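For example, here is a minimal sketch (assuming the 0.90-era client API; the table name, column family, and split points are hypothetical) of creating a pre-split table with explicit split keys, so it starts with several regions instead of one:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("my_table"); // hypothetical table name
        desc.addFamily(new HColumnDescriptor("d"));               // hypothetical column family

        // Four split points give five regions covering non-overlapping key ranges.
        byte[][] splitKeys = new byte[][] {
            Bytes.toBytes("2"), Bytes.toBytes("4"),
            Bytes.toBytes("6"), Bytes.toBytes("8")
        };
        admin.createTable(desc, splitKeys);
    }
}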

My experience with HBase tells me that you should always manage region splitting yourself for better and more predictable performance.
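One way to do that (a sketch under the same assumptions as above; "my_table" is hypothetical) is to raise hbase.hregion.max.filesize so high that automatic splits effectively never trigger, and then split regions explicitly when you decide to:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class ManualSplit {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Effectively disables automatic splitting; in practice this is set in
        // hbase-site.xml on the cluster rather than on a client Configuration.
        conf.setLong("hbase.hregion.max.filesize", 100L * 1024 * 1024 * 1024); // 100GB

        HBaseAdmin admin = new HBaseAdmin(conf);
        // Ask the master to split a table (or a single region) when you decide to.
        admin.split("my_table"); // hypothetical table name; a region name also works
    }
}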

HTH