Some insightful articles and some of my own thoughts on the trends in data storage:
THE BACKGROUND:
Disk capacities are going up and costs are going down, but the effective transfer bandwidth (ETB) per byte of capacity has fallen dramatically. Despite capacities and transfer rates increasing by factors of 10,000 and 100 respectively, typical drive ETB has actually decreased by a factor of 100. As Jim Gray said, "Disks have become tapes" (link to source).
Consider, for example, a 10 TB database. Ten years ago, this database would have occupied two thousand 5 GB drives - a common size at the time. With a 3 MB/second transfer rate, the aggregate bandwidth of these 2,000 drives would have been 6 GB/second, enabling the entire database to be scanned in about 30 minutes. Today, only about 20 higher-capacity drives would be needed to hold this same database. Those 20 drives would have an aggregate bandwidth of 1.2 GB/second, increasing the time required to scan the entire database to 150 minutes - an increase of two hours.
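The arithmetic above can be checked with a quick back-of-envelope calculation. The drive counts and per-drive transfer rates are the round figures from the paragraph, not benchmarks:

```python
# Back-of-envelope scan times for a 10 TB database, then vs. now,
# using the round figures from the paragraph above.
TB = 10**12
GB = 10**9
MB = 10**6

db_size = 10 * TB

# Then: 2,000 x 5 GB drives at 3 MB/s each
then_drives = db_size // (5 * GB)            # 2,000 drives
then_bandwidth = then_drives * 3 * MB        # 6 GB/s aggregate
then_scan_min = db_size / then_bandwidth / 60

# Now: 20 higher-capacity drives at 60 MB/s each
now_drives = 20
now_bandwidth = now_drives * 60 * MB         # 1.2 GB/s aggregate
now_scan_min = db_size / now_bandwidth / 60

# Roughly 28 minutes then vs. 139 minutes now (the post rounds to 30 and 150)
print(f"then: {then_scan_min:.0f} min, now: {now_scan_min:.0f} min")
```

Note that the per-drive transfer rate rose 20x (3 MB/s to 60 MB/s) while capacity rose 100x, which is exactly why the aggregate scan time got worse.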
DISKS ARE BECOMING A SEQUENTIAL ACCESS DEVICE RATHER THAN A RANDOM ACCESS DEVICE:
Jim Gray points out - "We have to convert from random disk access to sequential access patterns. Disks will give you 200 accesses per second, so if you read a few kilobytes in each access, you're in the megabyte-per-second realm, and it will take a year to read a 20-terabyte disk. If you go to sequential access of larger chunks of the disk, you will get 500 times more bandwidth - you can read or write the disk in a day. So programmers have to start thinking of the disk as a sequential device rather than a random access device."
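Gray's random-vs-sequential gap is easy to reproduce numerically. The figures below (200 seeks/s, 2 KB per random read standing in for "a few kilobytes", a 500x sequential speedup) are assumptions taken from the quote, not measurements:

```python
# Random vs. sequential read time for a 20 TB disk, using Gray's figures.
KB = 10**3
TB = 10**12

seeks_per_sec = 200
bytes_per_seek = 2 * KB                      # "a few kilobytes" per access
random_bw = seeks_per_sec * bytes_per_seek   # 400 KB/s - megabyte-per-second realm

sequential_bw = random_bw * 500              # "500 times more bandwidth"

disk = 20 * TB
random_days = disk / random_bw / 86400       # roughly 1.6 years
sequential_days = disk / sequential_bw / 86400  # roughly a day

print(f"random: {random_days:.0f} days, sequential: {sequential_days:.1f} days")
```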
Tom White later says that "MapReduce is a programming model for processing vast amounts of data. One of the reasons that it works so well is because it exploits a sweet spot of modern disk drive technology trends. In essence MapReduce works by repeatedly sorting and merging data that is streamed to and from disk at the transfer rate of the disk. Contrast this to accessing data from a relational database that operates at the seek rate of the disk (seeking is the process of moving the disk's head to a particular place on the disk to read or write data)." Read more here.
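The sort-and-merge core that White describes can be sketched with a streaming merge of pre-sorted runs. This is a toy illustration, not Hadoop's actual implementation; each in-memory "run" stands in for a sorted spill file that would be read from disk sequentially, and the data is made up:

```python
import heapq

# Each run stands in for a sorted spill file streamed sequentially from disk.
run1 = iter([("a", 1), ("c", 1), ("c", 1)])
run2 = iter([("b", 1), ("c", 1), ("d", 1)])

# heapq.merge consumes both runs in key order without seeking back and forth.
merged = heapq.merge(run1, run2)

# Reduce step: sum the counts for each key as it streams past.
counts = {}
for key, value in merged:
    counts[key] = counts.get(key, 0) + value

print(counts)  # {'a': 1, 'b': 1, 'c': 3, 'd': 1}
```

Because every input run is already sorted, the merge touches each record once, in order - the access pattern stays sequential and runs at the transfer rate rather than the seek rate.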
My take is that SSDs are going to take a while to become an economically viable alternative to disks. Flash disks cost approximately $10/GB, and the OEM costs of good flash drives run about $60/GB or more (source here). Compare this with the cost of disk, which is about $0.20/GB. So we are looking at roughly a 300x price difference, and I think it's going to take a while before storing terabytes of data on SSDs becomes a reality. Until that time, we will have to keep disks 50-70% empty to enhance striping performance - and if we keep disks 50% empty, the cost of disk doubles for storing the same amount of data.
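The effect of half-empty disks on cost per usable gigabyte works out as follows (figures from the paragraph above):

```python
# Effective $/GB of usable capacity when disks are kept partly empty
# for striping performance.
disk_cost_per_gb = 0.20
ssd_cost_per_gb = 60.0    # OEM flash figure cited above

def effective_cost(cost_per_gb, fraction_empty):
    """Cost per GB actually used when part of each disk stays empty."""
    return cost_per_gb / (1 - fraction_empty)

half_empty = effective_cost(disk_cost_per_gb, 0.5)  # $0.40/GB: cost doubles
print(half_empty)                 # 0.4
print(ssd_cost_per_gb / half_empty)  # SSDs are still 150x pricier
```

Even with disks half empty, the flash premium only shrinks from roughly 300x to roughly 150x, which supports the conclusion that SSDs won't displace disks for terabyte-scale storage any time soon.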
Labels: bandwidth, disk, hadoop, mapreduce, performance, pubmatic, RAID, scalability, solid state disk, SSD, storage