Sunday, April 5, 2009

Data Partitioning: A Way to Deal with Data... A Lot of Data (Part 1)

After working for a year and a half in the domain of distributed computing, I have become interested in architecting distributed systems to avoid bottlenecks: bottlenecks in the business logic layer, the messaging layer and the persistence layer of a typical n-tier system. In this post, I will first focus on how to avoid bottlenecks in the persistence layer, as this is usually the weakest link in this type of system. For those who are interested in finding a technology that can address all three bottlenecks at once, I recommend having a look at a technology called GigaSpaces. It might help you solve the problem you are having with minimal changes, and it is a cloud-based platform, which allows you to migrate your software to a cloud infrastructure in the future. I will elaborate more on this in a future post, or you can find the master here.

Now, let's focus back on how to solve the bottleneck in the persistence layer. As many experts (references) have pointed out, the bottleneck in the persistence layer comes from the relational database. Relational databases are usually used in the persistence layer for durability and transactions. Distributed applications residing in the business logic layer access the persistence layer for the data they need to fulfill user requests. The problem is that data residing in a relational database might not be expressed in a structure that distributed applications can use directly. Even a well-designed relational database has this problem, because the problem is not about data normalization, indexing or database design. You can do all of the above correctly and still have a bottleneck in the persistence layer.

IMHO, this dilemma is the result of using "the wrong tool for the job". Relational databases are not fundamentally designed to be used in distributed systems. Although I have been working with RDBMS for quite some time, I always find it awkward to have a SQL statement in my application. You can use ORM tools, but you still need to know that there is an RDBMS underneath and that the tool does the translation. This worked well in the old days, when the information could be stored entirely in the physical memory of a single machine (although, IMO, this is a workaround). When information grows beyond the physical limits of a single machine, there is a need for a revolution. Why a revolution? Consider that the modern relational database has too many components that should not be part of a relational database. Companies like Oracle invest tons of money in RDBMS. They would like to be the "King of Data Management", but they fail to realize that relational databases are not designed to be used in distributed systems.
For them to survive, the only way is to break up their RDBMS and re-architect its internal components so that it can be used in distributed systems. With the increasing popularity of cloud computing, there is a need for better technologies for storing data in a distributed environment, and here is a list of them. I really think the concept of putting the relational database in the right place is correct.

I would like to clarify this with an analogy. The human brain has mainly two stores for retaining information: one short-term and the other long-term. Information that we need frequently over a short period of time is retained in short-term memory. Information that we need constantly over a long period of time is retained in long-term memory. It is not surprising that information in long-term memory takes longer to retrieve, whereas information in short-term memory takes less effort to retrieve because it is retained in the form in which you want to access it. Long-term memory, on the other hand, suits other purposes: it is used for associations, inferences and concept building. An RDBMS is analogous to long-term memory. Its relational aspect helps to build concepts and allows you to ask challenging questions that have not been asked before. Short-term memory, by contrast, is used to make decisions. Some actions need to be executed within milliseconds or else your life might be in danger, so this type of information is stored in a way that makes decision making faster. To take it to the extreme, information that you rarely need might as well be kept in external digital storage, organized (like an RDBMS) to facilitate retrieval in the future. It is slower, but for things that you only need once every 3 months, that is not bad.
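To make the short-term/long-term split a bit more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the `customers` table, the `ReadThroughCache` class and its `db_reads` counter are all hypothetical names, a plain dict plays the short-term memory, and an in-memory SQLite database stands in for the relational "long-term memory".

```python
import sqlite3


class ReadThroughCache:
    """Hypothetical sketch: a dict as short-term memory in front of a relational store."""

    def __init__(self, conn):
        self.conn = conn        # long-term memory (the relational database)
        self.short_term = {}    # short-term memory (fast, shaped for one question)
        self.db_reads = 0       # counts how often we had to go to the database

    def get_customer_name(self, customer_id):
        # Fast path: the answer is already held in the shape we need.
        if customer_id in self.short_term:
            return self.short_term[customer_id]
        # Slow path: ask the relational store, then remember the answer.
        self.db_reads += 1
        row = self.conn.execute(
            "SELECT name FROM customers WHERE id = ?", (customer_id,)
        ).fetchone()
        if row is None:
            return None
        self.short_term[customer_id] = row[0]
        return row[0]


# Illustrative setup: the table and data are made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Alice')")

cache = ReadThroughCache(conn)
first = cache.get_customer_name(1)   # goes to the database
second = cache.get_customer_name(1)  # served from short-term memory
```

The second lookup never touches the database, which is the whole point of keeping frequently needed answers in the faster store.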

Now we have the concept of another component acting as short-term memory. The information stored in this component is usually raw, meaning that it is not necessarily related to other information directly and strictly. Nonetheless, it is good enough to handle daily operations and satisfy a particular SLA. This is where the In-Memory Data Grid (IMDG) emerges. For those who are not familiar with IMDG, please go here. Now, the question is how to make use of an IMDG so that distributed applications can take advantage of it. Remember that an IMDG doesn't have the baggage that an RDBMS has. The data in an IMDG can be redundant, duplicated or even inconsistent if needed. As such, the data is ready to be partitioned or replicated across many machines to handle requests simultaneously. Remember also that data in an IMDG is not meant to be used as in an RDBMS (just like short-term and long-term memory). It is not used for querying the way an RDBMS is; instead, it answers typical questions very fast because those questions are hard-wired. So don't be surprised that it can provide an answer only if you ask the right question. This is not a problem in many applications, since the questions remain quite static over the application's lifetime.
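As a rough illustration of what "partitioned across many machines" and "hard-wired questions" could look like, here is a toy sketch in Python. This is not the GigaSpaces API or any real IMDG product: `PartitionedGrid`, the routing key format and the sample record are all hypothetical, and each "machine" is just a dict in the same process. I assume simple hash partitioning on a routing key, which is only one of several possible schemes.

```python
import zlib


class PartitionedGrid:
    """Toy stand-in for an IMDG: each partition is a dict playing one machine."""

    def __init__(self, num_partitions):
        self.partitions = [dict() for _ in range(num_partitions)]

    def _partition_for(self, routing_key):
        # crc32 gives the same value in every process, unlike Python's
        # built-in hash() for strings, so every client routes the same way.
        return zlib.crc32(routing_key.encode("utf-8")) % len(self.partitions)

    def put(self, routing_key, value):
        self.partitions[self._partition_for(routing_key)][routing_key] = value

    def get(self, routing_key):
        # Only one partition is consulted; there is no cross-partition query.
        return self.partitions[self._partition_for(routing_key)].get(routing_key)


grid = PartitionedGrid(4)

# The record is denormalized on purpose: it already contains the answer to
# the one question the application keeps asking ("what are the open orders
# of this customer?"), so no join is needed at read time.
grid.put("customer:42", {"name": "Alice", "open_orders": [1001, 1002]})

answer = grid.get("customer:42")
missing = grid.get("customer:7")
```

Asking by routing key is fast because it goes straight to one partition. Asking anything else (for example "which customers have more than one open order?") is not supported by this structure at all, which is the "right question" limitation described above.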

In the second part of this article, I will present a few problems that one might encounter when using an IMDG for data partitioning. Stay tuned :)
