Sunday, April 19, 2009

My Take on "RAM as the New Disk"

Can RAM really be the new disk, as suggested in this article? My take on this question is "No". First of all, making RAM the new disk will not help much under the current computer architecture. Secondly, building disks on RAM technology is not yet economically feasible.

Current computers follow the Von Neumann architecture. Under the Von Neumann architecture and its variants, the hard disk serves as persistent storage, while RAM serves as temporary memory with much faster I/O performance than the disk. Using RAM technology to speed up disk I/O is nice, but it yields a limited overall performance gain for an application because it only increases the throughput between the disk and RAM. The real performance bottleneck under this architecture is the limited throughput between the CPU and memory relative to the amount of memory, so improving disk performance alone will not speed up application throughput.
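To make this concrete, here is a minimal micro-benchmark sketch in plain Java (no external libraries). The array sizes and iteration counts are my own illustrative assumptions, and JIT effects can skew small benchmarks, but the idea is that the same amount of arithmetic slows down once the working set no longer fits in the CPU caches and every access has to cross the memory bus:

    // A plain-Java micro-benchmark sketch of the CPU-to-memory bottleneck.
    // The array sizes are illustrative assumptions; run with a large enough
    // heap (e.g. -Xmx1g) since the big array alone is about 256 MB.
    public class MemoryBottleneck {

        static long blackhole; // keeps the JIT from discarding the results

        // Sums the array 'iterations' times; total additions = iterations * length.
        static long sum(long[] data, int iterations) {
            long total = 0;
            for (int it = 0; it < iterations; it++) {
                for (int i = 0; i < data.length; i++) {
                    total += data[i];
                }
            }
            return total;
        }

        static double millis(long[] data, int iterations) {
            long start = System.nanoTime();
            blackhole += sum(data, iterations);
            return (System.nanoTime() - start) / 1e6;
        }

        public static void main(String[] args) {
            long[] small = new long[32 * 1024];        // ~256 KB, fits in CPU cache
            long[] large = new long[32 * 1024 * 1024]; // ~256 MB, lives in main memory

            int iterations = large.length / small.length; // same total work for both runs
            System.out.printf("cache-resident: %.1f ms%n", millis(small, iterations));
            System.out.printf("memory-bound:   %.1f ms%n", millis(large, 1));
        }
    }

On most machines the memory-bound run is noticeably slower even though the disk is never involved, which is the point: a faster disk would not change this number at all.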

The most promising RAM-like technology that could become the new disk is the solid-state drive (SSD), but for many workloads its cost/performance ratio is still high compared to a traditional HDD of the same capacity. This is explained very well in this paper. If the disk is used only as a storage medium (which is what it should be), SSD solutions are too expensive to justify in many situations, so we won't see wide adoption of SSDs as an enterprise-server storage medium in the near future. IMHO, it might never happen.

From the application perspective, if we adopt data partitioning as I suggested in the other post, the disk is used for storage and RAM is used as the system of record. At runtime the application never touches the disk, so we get the best performance we can under the current computer architecture without buying expensive hardware (scaling up). This is probably the only practical way to speed up applications in the near future.
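As a rough illustration of what "RAM as the system of record, disk as storage" can look like, here is a minimal Java sketch. The class name, the string key/value records and the file-per-record layout are my own illustrative assumptions, not a production design (no crash recovery, no batching): reads and writes are pure memory operations, and a background thread flushes changes to disk off the request path.

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.concurrent.*;

    public class RamSystemOfRecord {
        private final ConcurrentHashMap<String, String> records = new ConcurrentHashMap<>();
        private final BlockingQueue<String> dirtyKeys = new LinkedBlockingQueue<>();
        private final Path storageDir;

        public RamSystemOfRecord(Path storageDir) throws IOException {
            this.storageDir = Files.createDirectories(storageDir);
            Thread writer = new Thread(this::writeBehind, "write-behind");
            writer.setDaemon(true);
            writer.start();
        }

        // All runtime reads and writes are pure memory operations.
        public String read(String key)              { return records.get(key); }
        public void write(String key, String value) { records.put(key, value); dirtyKeys.add(key); }

        // The disk is only touched here, off the request path.
        private void writeBehind() {
            while (true) {
                try {
                    String key = dirtyKeys.take();
                    Files.writeString(storageDir.resolve(key), records.get(key));
                } catch (InterruptedException | IOException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }

A call like write("order-42", "...") returns as soon as the in-memory map is updated; durability happens asynchronously, which is exactly the trade-off argued for above.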

Monday, April 13, 2009

How to be an Engineer, IMHO (Introduction)

During my last 4 years working as a software engineer, experience has taught me that there are some basic ingredients a good engineer possesses, and you can smell them from kilometers away :)

The Basics:

Engineering is about understanding and solving "real-world" problems by designing and developing "economically feasible solutions" "using the available resources and technologies".

I will provide an explanation of the above sentence in my next post.

So after you have mastered the basics, the logical question is "what's next?". The basics will lead you to a position where you can solve practically any problem in a professional manner, which is what being an engineer is all about. To become a most-wanted engineer, additional ingredients are needed to spice up your dishes.

The Advanced:

Beyond the basics, a most-wanted engineer provides insight into future technologies that have business value and gets the company the most bang for the buck by exercising his/her leadership and management skills.

Future posts will clarify the points above. Stay tuned~

Sunday, April 5, 2009

Data Partitioning: A Way to Deal with Data... A Lot of Data (Part 1)

After working for a year and a half in the domain of distributed computing, I have started to get interested in architecting distributed systems to avoid bottlenecks: bottlenecks in the business logic layer, the messaging layer and the persistence layer of a typical n-tier system. In this post, I will focus on how to avoid bottlenecks in the persistence layer, as this is usually the weakest link in this type of system. For those interested in a technology that can address all three bottlenecks at once, I recommend having a look at GigaSpaces. It might help you solve the problem you are having with minimal changes, and it is a cloud-ready platform that allows you to migrate your software to a cloud infrastructure in the future. I will elaborate more on this in a future post, or you can find the master here.

Now, let's focus back on how to solve the bottleneck in the persistence layer. As many experts (references) have pointed out, the bottleneck in the persistence layer comes from the relational database. This is because relational databases are usually used in the persistence layer for durability and transactions. For instance, distributed applications residing in the business logic layer access the persistence layer for the data they need to fulfill user requests. The problem is that data residing in the relational database might not be expressed in a structure that distributed applications can use directly. Even a well-designed relational database will have this problem, because the problem is not about data normalization, indexing or database design. You can do all of the above correctly and still have a bottleneck in the persistence layer.

I think, IMHO, this dilemma is the result of using "the wrong tool for the job". Relational databases are fundamentally not designed for distributed systems. Although I have been working with RDBMSs for quite some time, I always find it awkward to have a SQL statement in my application. You can use ORM tools, but you still need to know there is an RDBMS underneath and that the tool does the translation. This works well when the information can be stored entirely in the physical memory of a single machine (although, IMO, this is a workaround). When information grows beyond the physical limits of a single machine, there is a need for a revolution. Why a revolution? Because the modern relational database has accumulated too many components that should not be part of a relational database. Companies like Oracle invest tons of money in RDBMSs. They would like to be the "King of Data Management", but they fail to realize that relational databases are not designed for distributed systems. For them to survive, the only way is to break up their RDBMSs and re-architect the internal components so they can be used in distributed systems. With the increasing popularity of cloud computing, there is a need for better technologies for storing data in a distributed environment, and here is a list of them.

I really think the concept of putting the relational database in its right place is correct, and I would like to clarify this with an analogy. The human brain has mainly two storages for retaining information: one short-term and one long-term. Information that we need frequently over a short period of time is retained in short-term memory; information that we need over a long period of time is retained in long-term memory. It is not surprising that information in long-term memory takes longer to retrieve, whereas information in short-term memory takes less effort to retrieve, because it is retained in the way you want to access it. Long-term memory, on the other hand, suits other purposes: it is used for associations, inferences and concept building. An RDBMS is analogous to long-term memory. The relational aspect of an RDBMS helps build concepts and allows you to ask challenging questions that were never asked before. Short-term memory, on the other hand, is used to make decisions. Some actions need to be executed within milliseconds or your life might be in danger; this type of information is stored in a way that makes decision-making faster. To take it to the extreme, information you rarely need might as well be kept in external digital storage, organized (like an RDBMS) to facilitate retrieval in the future. It is slower, but for things you only need once every 3 months, that's not bad.
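To make the analogy a bit more concrete, here is a minimal sketch of a "short-term memory" sitting in front of the relational "long-term memory". It is plain Java with JDBC; the JDBC URL, the customers table and its columns are illustrative assumptions on my part, not a recommendation of any particular schema:

    import java.sql.*;
    import java.util.concurrent.ConcurrentHashMap;

    public class ReadThroughCache {
        private final ConcurrentHashMap<String, String> shortTerm = new ConcurrentHashMap<>();
        private final String jdbcUrl;

        public ReadThroughCache(String jdbcUrl) { this.jdbcUrl = jdbcUrl; }

        // Fast path: frequently needed answers come from "short-term memory",
        // with no SQL involved on repeated lookups.
        public String lookup(String customerId) {
            return shortTerm.computeIfAbsent(customerId, this::loadFromLongTerm);
        }

        // Slow path: the relational database remains the place where data is
        // organized for associations, inferences and ad-hoc questions.
        private String loadFromLongTerm(String customerId) {
            String sql = "SELECT name FROM customers WHERE id = ?";
            try (Connection c = DriverManager.getConnection(jdbcUrl);
                 PreparedStatement ps = c.prepareStatement(sql)) {
                ps.setString(1, customerId);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getString("name") : null;
                }
            } catch (SQLException e) {
                throw new RuntimeException(e);
            }
        }
    }

The question asked constantly ("what is the name of customer 42?") is answered from memory; the relational database is still there for the associative, exploratory questions it is good at.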

Now we have the concept of another component acting as short-term memory. The information stored in this component is usually raw, meaning it is not necessarily related to other information directly or strictly. Nonetheless, it is good enough for handling daily operations within a particular SLA. This is the emergence of the In-Memory Data Grid (IMDG). For those who are not familiar with IMDGs, please go here. Now, the question is how to use an IMDG so that distributed applications can take advantage of it. Remember that an IMDG doesn't carry the baggage that an RDBMS does: the data in an IMDG can be redundant, duplicated or even inconsistent if needed. As such, the data is ready to be partitioned or replicated across many machines to handle requests simultaneously. Remember that data in an IMDG is not meant to be used the way it is in an RDBMS (just like short-term versus long-term memory), so it is not used for ad-hoc querying like an RDBMS. Instead, it answers the typical questions very fast because the data is hard-wired for them. Don't be surprised that it can only provide an answer if you asked the right question; this is not a problem in many applications, since the questions remain quite static over the application's lifetime.
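Here is a minimal sketch of the partitioning idea, assuming a fixed number of partitions and simple hash-based routing. This is my own illustration, not GigaSpaces' or any vendor's API; in a real grid each partition would live in its own JVM or machine, whereas here they are plain maps to keep the sketch self-contained. Each key lives on exactly one partition, so a lookup only ever touches one node and never touches the disk:

    import java.util.*;
    import java.util.concurrent.ConcurrentHashMap;

    public class PartitionedGrid {
        // In a real deployment, each partition is a separate node of the grid.
        private final List<Map<String, Object>> partitions = new ArrayList<>();

        public PartitionedGrid(int partitionCount) {
            for (int i = 0; i < partitionCount; i++) {
                partitions.add(new ConcurrentHashMap<>());
            }
        }

        // Routing rule: the owning partition is derived from the key alone,
        // so any client can compute it without a central lookup.
        private Map<String, Object> route(String key) {
            return partitions.get(Math.floorMod(key.hashCode(), partitions.size()));
        }

        public void put(String key, Object value) { route(key).put(key, value); }
        public Object get(String key)             { return route(key).get(key); }
    }

The "right question" in this sketch is a lookup by key: get("order-42") is answered immediately by the one partition that owns that key, whereas an ad-hoc question spanning many keys would have to fan out to every partition, which is exactly the kind of question the RDBMS is still better suited for.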

In the second part of this article, I will present a few problems one might encounter when using an IMDG for data partitioning. Stay tuned :)