Saturday, December 10, 2011

My Experience with HBase Dynamic Partitioning (a.k.a. Region Splitting)

I have been working with HBase for nine months, and below is a summary of my experience with it to date. I hope it helps someone evaluate the technology for their specific use cases.

In this post I will skip the information you can obtain from the official HBase website and the HBase: The Definitive Guide book; both are great resources for understanding how to configure and use HBase in a scalable manner. I also won't go into the details of why we chose HBase, because that has been covered extensively by other HBase users (please search on Google for those). What I will do here is describe some of the challenges of using HBase in a real production environment.

The smallest component that encapsulates HBase's data processing engine is the region, and the regionserver acts as a container for those regions. In In-Memory Data Grid terminology, a region is called a partition of the data: each region is responsible for a subset of the data in HBase. A user request is routed to the particular regionserver containing the region that holds the data the request specifies (usually by key). By default, a table starts with only one region. HBase employs an automatic, dynamic partitioning scheme on regions to scale the system: for instance, if each region is configured to hold 100MB of data, a region will split into two regions, each holding roughly 50MB, once that 100MB threshold is exceeded.
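For reference, the split threshold in that example maps to the hbase.hregion.max.filesize setting, which can also be overridden per table on the table descriptor. Here is a minimal sketch using the Java client API of that era; the table name "mytable" and column family "cf" are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class SplitThresholdSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Placeholder table "mytable" with a single column family "cf".
        HTableDescriptor desc = new HTableDescriptor("mytable");
        desc.addFamily(new HColumnDescriptor("cf"));

        // Per-table split threshold: a region splits once its store files
        // exceed this size (100MB, matching the example above). The
        // cluster-wide default lives in hbase-site.xml as
        // hbase.hregion.max.filesize.
        desc.setMaxFileSize(100L * 1024 * 1024);

        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.createTable(desc);
    }
}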

The dynamic partitioning scheme is both a "love" and a "hate" feature, IMHO. It is great when you have no knowledge about the data (such as the range of keys that will be written, the volume of the data, etc.), because it automatically scales the system and provides some degree of scalability. It is awful when the load on the system is high and it fails to partition the data appropriately. I was one of those who initially loved the feature and then turned to hate it, because of the following issues observed after running HBase in production for several months.

Issue #1: Dangling Offline Parent Regions

If you query the .META. table of HBase directly, you will find some regions with OFFLINE=true and SPLIT=true. According to the HBase books, this shouldn't happen, because the CatalogJanitor should remove those parent regions once the split is completed. However, I have observed that even when the split was successful (in the sense that two new daughter regions were created, each holding half of the parent region's data), the meta info of the parent region is not removed from HBase under some conditions.
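One way to spot these dangling parents is to scan .META. and look for region entries flagged as both offline and split. A minimal sketch along those lines, assuming the 0.90-era client API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Writables;

public class FindDanglingParents {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable meta = new HTable(conf, HConstants.META_TABLE_NAME);
        ResultScanner scanner = meta.getScanner(new Scan());
        try {
            for (Result row : scanner) {
                byte[] value = row.getValue(HConstants.CATALOG_FAMILY,
                                            HConstants.REGIONINFO_QUALIFIER);
                if (value == null) {
                    continue;
                }
                HRegionInfo info = Writables.getHRegionInfo(value);
                // A parent left behind after a completed split shows up as
                // both offline and split; the CatalogJanitor should have
                // removed it already.
                if (info.isOffline() && info.isSplit()) {
                    System.out.println("Dangling parent: "
                            + info.getRegionNameAsString());
                }
            }
        } finally {
            scanner.close();
        }
    }
}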

Issue #2: Overlapping Regions

This usually happens when a failover occurs during a region split (dynamic partitioning). The region state does not make it clear whether the split completed, so HBase tries its best to recover by splitting the region again. However, this can create overlapping regions if the region had already been split successfully before the failover occurred (or before the meta information about the region being split was removed from HBase). This issue happens even more frequently if issue #1 has already occurred. Say you run the system for a month before a failover occurs, and there are dangling offline parent regions left over in the meta table; then HBase is practically guaranteed to hit issue #2, because those dangling offline parents contain stale information that no longer applies to the current state of the cluster (imagine the daughters becoming parents themselves and undergoing their own splits).

Issue #3: Out-of-Sync Meta Information

There are two places that hold the meta information of regions: 1) the .META. table and 2) the region directory under HDFS. The meta info goes out of sync when the directory of a region is not removed from DFS after a region split occurs. Normally this won't affect your running system, apart from the stale directories consuming some disk space, but it becomes an issue when you try to recover the .META. table from DFS, because the meta information on DFS was not in sync with the .META. table before the .META. table was corrupted. Recovery becomes very difficult because you have no idea whether you should keep or remove a region in DFS without digging into the .regioninfo of each region. Note that issue #3 will also cause issue #2.
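To compare the two sources, the region directories on DFS can be walked and each .regioninfo deserialized, then checked against what .META. reports. A rough sketch, assuming the 0.90-era layout (hbase.rootdir/table/encoded-region/.regioninfo, where HRegionInfo was still a Writable) and a placeholder table name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;

public class ListRegionDirs {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Path rootDir = new Path(conf.get("hbase.rootdir"));
        FileSystem fs = rootDir.getFileSystem(conf);

        // "mytable" is a placeholder; each child directory is an encoded region name.
        Path tableDir = new Path(rootDir, "mytable");
        for (FileStatus regionDir : fs.listStatus(tableDir)) {
            Path regionInfoFile = new Path(regionDir.getPath(), ".regioninfo");
            if (!fs.exists(regionInfoFile)) {
                continue; // skip non-region entries such as .tableinfo
            }
            FSDataInputStream in = fs.open(regionInfoFile);
            HRegionInfo info = new HRegionInfo();
            info.readFields(in); // HRegionInfo was a Writable in this era
            in.close();
            System.out.println(regionDir.getPath().getName() + " -> "
                    + info.getRegionNameAsString()
                    + (info.isOffline() ? " (offline)" : ""));
        }
    }
}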

Issue #4: Inactive Regions

Dynamic partitioning is like a black box that hides the details of which region is responsible for which subset of data. It becomes challenging to manage when you delete the data in some regions: those regions become inactive in the sense that they no longer hold any data, because the data has been removed. These inactive regions still use up resources and participate in the cluster's load balancing algorithm, which leads to unbalanced processing loads. For instance, you might see 50 regions on regionserver 1 and 50 regions on regionserver 2, but if the 50 regions on regionserver 1 are all inactive (i.e. no user request routes to regionserver 1), then regionserver 2 handles all the user requests (cluster utilization is 50% in this case).

Conclusion

Those are the biggest challenges I have seen in the past nine months; in the next post, I will describe some solutions to fix these problems. The root cause of these issues stems from the dynamic partitioning scheme, and therefore pre-splitting the table is really a MUST and not an option if you want a smoothly running system, as suggested in the HBase books! On top of that, my experience with IMDGs is that no vendor has a REAL dynamic partitioning scheme. They use explicit/implicit partitioning, all of which starts with a fixed number of partitions. Some products give the illusion of dynamic partitioning, but in fact they spill over to an already-created partition to handle the extra load.
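For completeness, pre-splitting at table creation time looks roughly like the sketch below (the table name, column family and split points are placeholders chosen for illustration; HBaseAdmin.createTable also has an overload taking a start key, end key and region count):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Placeholder table and column family.
        HTableDescriptor desc = new HTableDescriptor("mytable");
        desc.addFamily(new HColumnDescriptor("cf"));

        // Explicit split points chosen from the expected key range, so the
        // table starts with four regions instead of one:
        // (-inf,"10"), ["10","20"), ["20","30"), ["30",+inf).
        byte[][] splitKeys = new byte[][] {
                Bytes.toBytes("10"),
                Bytes.toBytes("20"),
                Bytes.toBytes("30")
        };
        admin.createTable(desc, splitKeys);
    }
}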

Sunday, October 4, 2009

Data Grid for the Enterprise Application Overview (Part 1)

After working with some data grid technologies (e.g. GigaSpaces, Terracotta and Coherence) for a while, there are still a lot of questions about when one solution is better than another and in what circumstances I should select solution A and not solution B. My personal motto is that no technology is good for everything, and we should analyze every problem as a new problem by asking the very basic questions, starting from scratch.

I always tell my team members that I care about what the "right" solution to the problem is, regardless of the budget, the time required to solve the problem, or the skillsets we need to solve it. It is only by investigating the problem with the correct paradigm that a problem can really be "solved". Of course, at the end of the day, we need to look at the budget, the timeline of the project and other factors that might affect it, but these should only be considered after we fully understand the problem (sometimes you might be surprised that the actual problem does not even exist!). This way, when the project gets stuck, I know how to direct the team to do the "right" thing (i.e. get done the things that have the most business value).

Understanding the tools at hand is therefore very important for an engineer, so that she can equip herself with the best tool for the problem before trying to solve it.

Data Grid Overview
From Wikipedia, a data grid is a grid computing system that deals with data, which is a pretty general description. For more details about what a data grid is all about, I would recommend having a look at this blog; Nati will tell you how to make the best use of a data grid. ;) Personally, I see a data grid as a technology that allows applications to access a pool of memory that can be in-process or out-of-process. That pool of memory can be scaled nicely by adding or removing commodity hardware. The reason data grids are becoming more and more popular is that the real-time requirement for applications to access vast amounts of data is becoming more and more important. If the data is stored on disk, I/O latency becomes the bottleneck and might breach the application's real-time requirements. This is the situation where a data grid might provide an answer.
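To give a flavour of what "accessing a pool of memory" looks like in code, here is a minimal Coherence-style sketch (the cache name is a placeholder, and the same idea maps onto GigaSpaces or Terracotta through their own APIs):

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;

public class DataGridHello {
    public static void main(String[] args) {
        // Obtain a named cache backed by the grid; the data may live
        // in-process (near/local cache) or out-of-process (partitioned
        // across the cluster), depending on the cache configuration.
        NamedCache cache = CacheFactory.getCache("example-cache");

        cache.put("user:42", "Ricky");        // write into the grid
        Object value = cache.get("user:42");  // read it back, possibly from another node
        System.out.println(value);

        CacheFactory.shutdown();
    }
}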

Currently, there are many good data grid technologies on the market. The top three that I encounter frequently are GigaSpaces, Terracotta and Coherence.
They are all good data grid technologies, but they are very different in terms of their technological philosophy, and it is this difference that makes each of them unique. In a future post on my geek blog, I will spend time characterizing them, in the hope of shedding some light on their differences and on when one solution is more suitable than another.