Sharding in the Cloud
I/O reads to non-virtualized datastores can be one of the biggest bottlenecks in applications that have been deployed to the cloud. One of the easiest ways to avoid this problem is to simply shard your database.
It’s interesting: when you look at the performance of a typical enterprise application deployed to the cloud, the usual points of contention disappear, and a problem that seemed to dog only the tape drives of the 1970s starts to reappear. Yes, when your applications are floating in the cloud, basic I/O operations become the bottleneck, as your infinitely clustered applications keep going back to that one storage device that handles all of your data.
Performance problems in the cloud
So, what’s the problem? The problem comes from the fact that your cloud-based applications keep hitting non-cloud-based data, or at least data that isn’t properly designed for the cloud. Cloud-based applications can find themselves doing sixty to seventy percent of their input and output operations against non-virtualized storage, which can significantly depress the performance and scalability benefits of moving enterprise applications to the cloud. CPUs and clock cycles go unused as reads from datastores and hard drives slow down the application’s processing cycle.
Datastores and the virtualization market
So, what does the enterprise architect do to minimize the problem of reading and writing to non-virtualized datastores? Well, a simple solution is to virtualize the datastore as well. A number of players are jumping into the datastore and I/O virtualization market, including DataCore, NextIO and Neterion, providing solutions that pool bottlenecking I/O resources into clusters that can be consolidated and dynamically allocated; but not every enterprise architect enjoys the luxury of being able to simply buy more hardware in the hopes of solving a complex, cloud-based performance problem.
Another, perhaps more controversial, or at least more intellectually challenging, approach to closing the latency gap between middleware applications deployed to the cloud and the non-virtualized datastores that serve their requests is to take a new approach to database design and shard your existing relational database.
Sharding your database
Database sharding isn’t anything like clustering database servers, virtualizing datastores or partitioning tables. It goes far beyond all of that. In the simplest sense, sharding your database means breaking up your big database into many much smaller databases that share nothing and can be spread across multiple servers. These small databases are fast, easy to manage, and often much cheaper to run, as they are frequently implemented on open source databases.
And how do you do it? Well, there are a variety of approaches, but essentially it’s a matter of looking at your database and ‘horizontally partitioning’ your data into sets of logically related rows, as opposed to the vertical, column-oriented splitting you typically do with a relational database. Each set of logically related rows is isolated and deployed into its own database, and as a result, data interaction becomes much faster and more responsive, as the sketch below illustrates.
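To make the idea concrete, here is a minimal sketch of the routing side of horizontal partitioning, assuming a plain JDBC setup. The class name, the shard URLs and the choice of customer ID as the shard key are all illustrative, not part of any particular product; the point is simply that every row for a given key lives in one small database, and the application picks that database before it ever issues a query.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.List;

// Illustrative shard router: maps a customer ID to one of several
// small, shared-nothing databases by hashing the key across the list
// of shard connection URLs.
public class ShardRouter {

    private final List<String> shardUrls;

    public ShardRouter(List<String> shardUrls) {
        this.shardUrls = shardUrls;
    }

    // All rows for a given customer land on the same shard, so a query
    // scoped to one customer only touches one small database.
    public Connection connectionFor(long customerId) throws SQLException {
        int shardIndex = Math.floorMod(customerId, shardUrls.size());
        return DriverManager.getConnection(shardUrls.get(shardIndex));
    }

    public static void main(String[] args) throws SQLException {
        // Hypothetical shard URLs; in practice these come from configuration.
        ShardRouter router = new ShardRouter(List.of(
                "jdbc:postgresql://shard0.example.com/orders",
                "jdbc:postgresql://shard1.example.com/orders",
                "jdbc:postgresql://shard2.example.com/orders"));

        try (Connection conn = router.connectionFor(42L)) {
            // Run customer 42's queries against the one shard that owns its rows.
            System.out.println("Customer 42 routed to: " + conn.getMetaData().getURL());
        }
    }
}
```

A simple modulo over the key is the easiest scheme to sketch, but it makes rebalancing painful when you add shards; range-based or directory-based partitioning are common alternatives once the number of shards starts to change.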
Granted, this is a very simple look at sharding, but it’s something worth considering for modern enterprise applications that want to leverage the benefits of cloud computing without running into significant database I/O performance problems once their middleware applications reach economies of scale.