Friday, November 07, 2008

Cloud Computing: How important is "data locality" from a costing perspective?

Nicholas Carr wrote an excellent article about cloud computing, "The new economics of computing".

"In late 2007, the New York Times faced a challenge. It wanted to make available over the web its entire archive of articles, 11 million in all, dating back to 1851. It had already scanned all the articles, producing a huge, four-terabyte pile of images in TIFF format. But because TIFFs are poorly suited to online distribution, and because a single article often comprised many TIFFs, the Times needed to translate that four-terabyte pile of TIFFs into more web-friendly PDF files.

Working alone, [Times programmer Derek Gottfrid] uploaded the four terabytes of TIFF data into Amazon's Simple Storage Service (S3) utility, and he hacked together some code for EC2 that would, as he later described in a blog post, "pull all the parts that make up an article out of S3, generate a PDF from them and store the PDF back in S3." He then rented 100 virtual computers through EC2 and ran the data through them. In less than 24 hours, he had his 11 million PDFs, all stored neatly in S3 and ready to be served up to visitors to the Times site.

The total cost for the computing job? Gottfrid told me that the entire EC2 bill came to $240. (That's 10 cents per computer-hour times 100 computers times 24 hours; there were no bandwidth charges since all the data transfers took place within Amazon's system - from S3 to EC2 and back.)"
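Just to picture the job: a minimal sketch of that S3-in, convert, S3-out loop might look something like the snippet below. The bucket names, key layout, and the use of boto3 and Pillow are my assumptions for illustration only; Gottfrid's actual code ran as a Hadoop job spread across the 100 rented EC2 instances.

```python
# Toy sketch of the S3 -> TIFF -> PDF -> S3 pipeline (assumed names and layout).
import io
import boto3                      # assumed: modern AWS SDK, not what the Times used in 2007
from PIL import Image             # Pillow can stitch TIFF pages into a multi-page PDF

s3 = boto3.client("s3")
SRC_BUCKET = "nyt-tiff-archive"   # hypothetical bucket names
DST_BUCKET = "nyt-pdf-archive"

def convert_article(article_id, tiff_keys):
    """Pull all TIFF parts of one article from S3, write a single PDF back to S3.
    Assumes at least one TIFF key per article."""
    pages = []
    for key in tiff_keys:
        obj = s3.get_object(Bucket=SRC_BUCKET, Key=key)
        pages.append(Image.open(io.BytesIO(obj["Body"].read())).convert("RGB"))

    pdf_buf = io.BytesIO()
    # Saving the first page with append_images folds the rest into one PDF.
    pages[0].save(pdf_buf, format="PDF", save_all=True, append_images=pages[1:])

    s3.put_object(Bucket=DST_BUCKET,
                  Key=f"{article_id}.pdf",
                  Body=pdf_buf.getvalue(),
                  ContentType="application/pdf")
```

In the real job, this per-article loop would be fanned out over the 100 machines, each handling a slice of the 11 million articles.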

One thing missing from this account of the NYT TIFF-to-PDF conversion task is the data transfer cost of getting 4 TB of TIFF images into S3 in the first place.

Doing some simple computations: Amazon would charge about $409.60 for uploading 4 TB of data into S3, plus an additional $261.12 for downloading the processed PDF files, which were about 1.5 TB in size. That comes to about $670.72. On top of that, there would be bandwidth charges for pushing 5.5 TB through the NYT datacenter's own connection (4 TB out and 1.5 TB in), which I would expect to be on the order of $400-$600. That could take the total data transfer cost into the $1,000-$1,200 range.
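For the record, here is the arithmetic behind those figures, using the per-GB rates implied by the numbers above ($0.10/GB in, $0.17/GB out, with 1 TB = 1024 GB):

```python
# Rough S3 transfer-cost arithmetic using the per-GB rates implied above.
PRICE_IN_PER_GB = 0.10    # data into S3
PRICE_OUT_PER_GB = 0.17   # data out of S3

upload_gb = 4 * 1024      # 4 TB of TIFFs in
download_gb = 1.5 * 1024  # 1.5 TB of PDFs out

upload_cost = upload_gb * PRICE_IN_PER_GB       # ~$409.60
download_cost = download_gb * PRICE_OUT_PER_GB  # ~$261.12
print(f"S3 transfer cost: ${upload_cost + download_cost:,.2f}")  # ~$670.72
```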

Beyond cost, consider the amount of time it would take to move that much data. At a sustained 10 Mbps, transferring 5.5 TB would take about 53.4 days.
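That 53.4-day figure falls out of the same back-of-the-envelope math, assuming 1024-based units and a fully utilized link with no protocol overhead:

```python
# Wall-clock time to move 5.5 TB over a sustained 10 Mbps link
# (1024-based units, link fully utilized, no protocol overhead assumed).
total_bytes = 5.5 * 1024**4            # 4 TB up + 1.5 TB down
link_bps = 10 * 1024**2                # 10 Mbps, binary prefix

seconds = total_bytes * 8 / link_bps
print(f"{seconds / 86400:.1f} days")   # ~53.4 days
```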

Using Hadoop on EC2 is definitely a great idea and is very helpful; however, data locality also matters a lot. Moving data is expensive, in my opinion, and its cost can sometimes dwarf the cost of the computation itself ($240 here).

Let me know your thoughts.


Thursday, June 26, 2008

All models are wrong, and increasingly you can succeed without them.

Says Peter Norvig, Google's research director, as an update to George Box's maxim.

This is an awesome article about petabyte-scale datasets and why correlation across the data can be enough, instead of finding a reason why the datasets are related and building a model around it. Read some excerpts here, and the full article via this link.

Excerpts from the original article:
Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database. Now Google and like-minded companies are sifting through the most measured age in history, treating this massive corpus as a laboratory of the human condition. They are the children of the Petabyte Age.

Google's founding philosophy is that we don't know why this page is better than that one: If the statistics of incoming links say it is, that's good enough. No semantic or causal analysis is required. That's why Google can translate languages without actually "knowing" them (given equal corpus data, Google can translate Klingon into Farsi as easily as it can translate French into German). And why it can match ads to content without any knowledge or assumptions about the ads or the content.

This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.

If the words "discover a new species" call to mind Darwin and drawings of finches, you may be stuck in the old way of doing science. Venter can tell you almost nothing about the species he found. He doesn't know what they look like, how they live, or much of anything else about their morphology. He doesn't even have their entire genome. All he has is a statistical blip — a unique sequence that, being unlike any other sequence in the database, must represent a new species.

Petabytes allow us to say: "Correlation is enough."
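As a toy illustration of that stance (my example, not from the article): with enough observations you can quantify how strongly two signals move together without any model of why they do.

```python
# Toy illustration with made-up numbers: measure how strongly two signals
# move together using nothing but a correlation coefficient, with no model
# of why they might be related.
import numpy as np

incoming_links = np.array([3, 12, 45, 80, 150, 400])
time_on_page   = np.array([20, 35, 60, 75, 110, 180])   # seconds

r = np.corrcoef(incoming_links, time_on_page)[0, 1]
print(f"Pearson r = {r:.2f}")   # close to 1 means a strong positive correlation
```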


Sunday, December 30, 2007

Thrudb: Better Storage?

I recently read about thrudb, and I must say I am very impressed with the lucidity with which Jake Luciani describes the problem and the solution. Here is an excerpt:

"Data on the web is often fluid and loosely structured and it is becoming increasingly difficult to fit this data into a fixed database schema which is amended over time. A simple example of this is tagging. The many-to-many relationship of tags is difficult to query efficiently using tables and SQL, such that ad-hoc solutions are required.
Also, web data is often "mashed up" and viewed together (e.g. Facebook profile) or viewed spatially (e.g. Google maps + event data).
In order to provide this new kind of data flexibility the web is moving towards a document-oriented data model, where records aren’t grouped by their structure but by their attributes.
There are also standard data-oriented issues like indexing, caching, replication and backups, which are left for "later" but are never easy to implement when it's time to do it. There are a number of great open source solutions to these problems, but they require proper integration and configuration. These components end up being learned over time and learned by trial and error.
Thrudb, therefore, is an attempt to simplify the modern web data layer and provide the features and tools most web-developers need. These features can be easily configured or turned off."

Looks very cool. I am going to try this out as soon as I get hold of my developer box tomorrow morning.

Thrudb talks about the following features:
• Client libraries for most languages
• Multi-master replication
• Incremental backups and redo logging
• Multiple storage backends (S3 included)
• Built for horizontal scalability
• Simple and powerful search API (Lucene)
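To make the data model from the excerpt concrete, here is a toy sketch in plain Python; it is my illustration of the document-oriented idea, not thrudb's actual API. Records are free-form attribute bags, and a tag lookup becomes an attribute filter rather than a join across tag tables.

```python
# Toy document store: each record is an attribute bag, and tag lookup
# is an attribute filter instead of a many-to-many SQL join.
docs = {
    "post-1": {"title": "Thrudb notes", "tags": ["storage", "thrift"]},
    "post-2": {"title": "EC2 costing",  "tags": ["aws", "storage"]},
    "post-3": {"title": "Map mashup",   "tags": ["maps"]},
}

def find_by_tag(tag):
    """Return the ids of documents carrying the given tag."""
    return [doc_id for doc_id, doc in docs.items() if tag in doc.get("tags", [])]

print(find_by_tag("storage"))   # ['post-1', 'post-2']
```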
