Saturday, June 28, 2008

Designing Caches for Highly Scalable Web 2.0 Applications

Unix/Linux file systems are designed so that reads are heavily cached and often pre-fetched. There are various techniques, algorithms and methods for read caching, and each file system has its own somewhat unique approach and therefore its own performance characteristics. Most file systems use the page cache for caching read I/O and the buffer cache for caching metadata.
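To make the pre-fetching point concrete, here is a minimal sketch of my own (assuming Python 3 on Linux, with a hypothetical file name) of how an application can ask the kernel to pull a file into the page cache before it is read:

# Sketch: hinting the Linux page cache to pre-fetch a file.
# Assumes Python 3 on Linux; "photo.jpg" is a hypothetical file name.
import os

fd = os.open("photo.jpg", os.O_RDONLY)
size = os.fstat(fd).st_size

# Ask the kernel to read the whole file into the page cache ahead of time,
# so the later read is served from memory instead of disk.
os.posix_fadvise(fd, 0, size, os.POSIX_FADV_WILLNEED)

data = os.read(fd, size)   # likely a page-cache hit by now
os.close(fd)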

There has been an immense amount of research in this area on how to improve read performance using caching (see here and here).

Enter the highly scalable Web 2.0 era, enter Facebook. If you look at the Facebook I/O profile in my previous post, 92% of the reads (for photos) are served by the CDN. That means a read typically reaches the backend storage (a NetApp filer in this case) only once; after that the file is cached in the CDN and subsequent reads never hit the backend. So most of the file system's read caching is probably going to waste, since we will rarely read from the file system cache again. Facebook photos are cached in the CDN for about 4.24 years (their HTTP Cache-Control max-age is 133,721,540 seconds), which means the CDN will not go back to the origin server for that period.
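For reference, the cache lifetime implied by that header works out like this (a quick sketch using the number quoted above):

# Quick check of the cache lifetime implied by the photo headers.
max_age_seconds = 133_721_540          # value from the Cache-Control header
seconds_per_year = 365.25 * 24 * 3600  # ~31,557,600

years = max_age_seconds / seconds_per_year
print(f"Cache-Control: max-age={max_age_seconds}")
print(f"~{years:.2f} years before the CDN revalidates")  # ~4.24 years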

This raises interesting questions: does the file system really need to do any caching for such an application? What is the read-write ratio for this kind of workload? How can the file system be better tuned for it?
Could the file system cache instead be used to pre-fetch the entire metadata into memory, so that the Facebook NetApp filer has to do fewer than 3 reads to serve a photo?
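One way to picture that metadata pre-fetch, sketched on a plain Linux box rather than a filer (the path is made up and a filer would have its own mechanism): walk the photo tree once and stat() every file, so the dentry and inode caches are already warm when reads arrive.

# Hypothetical sketch: warm the metadata caches (dentries/inodes) by
# walking a photo directory tree once and stat()ing every entry.
# "/srv/photos" is a made-up path used only for illustration.
import os

def warm_metadata(root):
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            os.stat(os.path.join(dirpath, name))  # pulls metadata into cache
            count += 1
    return count

if __name__ == "__main__":
    n = warm_metadata("/srv/photos")
    print(f"stat()ed {n} files; their metadata should now be cached in memory")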

Thoughts?

