Mukul Kumar's Blog: Early YouTube Engineer talks about outages, system scalability

Gigaom.com has posted a very interesting presentation by Cuong Do, an early YouTube software engineer who is now manager of the site’s Core Product Engineering group.

Gigaom.com has the video; here is a transcript of it.

Introduction

Do’s talk, titled “Behind the Scenes: A Look Into YouTube’s Infrastructure,” had harrowing tales of outages and gory details about the specific languages, architectures, and tools YouTube uses. “One of the key phrases we had in the early days was ‘These are good problems to have,’” Do said. “And after a while we’re like, ‘I’m going to kill the next person who says that.’”

Do describes the “Early team”

2 sysadmins
2 scalability software architects
2 feature developers
2 network engineers
1 DB admin
Zero chefs

Algorithm for handling rapid growth

while (true) {
    identify_and_fix_bottlenecks();
    drink();
    sleep(TOO_LITTLE);
    notice_new_bottleneck();
}

Web Request flow

End user browser
NetScaler load balancer
Web Servers (bank of)
Apache
Local app server
Python app
Memcached
DB
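
The last hops of the flow above (Python app, then Memcached, then DB) suggest that reads hit the cache before the database. The talk does not say exactly how the cache was used, so the following is only a minimal sketch of the common cache-aside pattern. Plain dictionaries stand in for Memcached and MySQL, and the names `CACHE`, `DATABASE`, `STATS`, and `get_video_metadata` are hypothetical.

```python
# Hypothetical stand-ins: one dict plays Memcached, another plays the DB.
CACHE = {}
DATABASE = {"video-1": {"title": "Example video"}}
STATS = {"db_reads": 0}  # counts how often we fall through to the DB


def get_video_metadata(video_id):
    """Return metadata for video_id, checking the cache before the DB."""
    if video_id in CACHE:
        return CACHE[video_id]

    # Cache miss: read from the (stand-in) database.
    STATS["db_reads"] += 1
    row = DATABASE.get(video_id)

    # Only cache rows that exist, so a later insert is not masked.
    if row is not None:
        CACHE[video_id] = row
    return row


first = get_video_metadata("video-1")   # miss, goes to the DB
second = get_video_metadata("video-1")  # hit, served from the cache
```

In this sketch the second lookup never touches the database, which is the point of putting Memcached in front of it.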

Video served through

CDN - internal project from Google (for non-US content)
Video – colo servers – from various US locations

Key technologies

Linux (SuSE 10.x)
Apache (2.2.x) / lighttpd 1.4.x
lighttpd - very fast at handling large files
MySQL 5.0.x - metadata storage
Python
[Very difficult to recruit people for a small company. It was faster to get 10 machines than 1 dev, so they got more machines to run slow code.]
Google technologies - search, Bigtable (video thumbnails), GFS (internal storage, transfer buffers)

Started with 1 DB server

Replica for backup, replica for reporting (later)
Vertical partitions for the DB, each taking part of the web site (more later)
Multiple users for DB, for scalability
Associate one user with one partition
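
The notes say each user is associated with one partition, but not how that mapping was made. As a minimal sketch, assuming a fixed number of partitions and a stable hash of the user name (both assumptions, as are the names `NUM_PARTITIONS` and `partition_for_user`):

```python
import hashlib

NUM_PARTITIONS = 4  # assumed value; the talk does not give a number


def partition_for_user(username):
    """Map a user name to a partition index in [0, NUM_PARTITIONS)."""
    # Use a cryptographic digest rather than the built-in hash(), which is
    # randomized per process for strings and so is not stable across runs.
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


partition = partition_for_user("alice")
```

Hashing is the simplest option. An explicit lookup table from user to partition would make it easier to move users between partitions later, at the cost of one more lookup per request.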

Scalability challenges

Rapid unpredictable growth (user growth always exceeds any amount of hardware scale)
Passionate users (users who video-blog their life)
New features (recommendation algorithms, compatibility, social graphs – always blew past scalability predictions)
Pushing hardware and software boundaries (if you are running hardware or software too close to its limits, it's more likely to fail)
Unknown problems (issues that you don't find on Google search, or issues that even the vendor of the third-party software doesn’t know about)

One example of an issue we had

Subject: oh @#!%
Date: October 22, 2005 2:24:33 AM PDT
We can't accept any more videos, too many videos.

All thumbnails were stored in separate sub-directories, and all sub-directories were in one flat directory.
More than 10K files in a directory; ran into an out-of-inodes problem.
Wasn't too difficult to solve: went to a tree structure.
Wasn't the most fun thing to do, with videos still being uploaded.
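
The talk says only that thumbnails moved to a tree structure. One common way to build such a tree, shown here as a sketch under assumptions, is to derive nested directory names from a hash of the video ID so that no single directory grows too large. The function name `thumbnail_path`, the two-level layout, and the `default.jpg` filename are all hypothetical.

```python
import hashlib
from pathlib import PurePosixPath


def thumbnail_path(root, video_id):
    """Build a path like <root>/ab/cd/<video_id>/default.jpg."""
    digest = hashlib.sha256(video_id.encode("utf-8")).hexdigest()
    # The first two pairs of hex characters pick the two directory levels.
    level1, level2 = digest[:2], digest[2:4]
    return PurePosixPath(root) / level1 / level2 / video_id / "default.jpg"


path = thumbnail_path("/thumbs", "video-1")
```

With two hex characters per level, this layout spreads the per-video sub-directories over 256 x 256 = 65,536 buckets instead of one flat directory.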

YouTube: 5 hours Outage

MySQL gave an error: checksum failed
Checksum stored for every page (15K of data)
MySQL checksum failed, it puked, lost 4-5 hours of data
Took 4-5 hours to recover
We found lots of questions, but no answers for this problem
"Maybe this is a hardware issue, not everybody is having it"
Found exactly one blog post; it turns out the combination of the RAID card and the other I/O card can sometimes cause interesting fluctuations in voltage
So the data was fine on the disk; the voltage fluctuation caused the CPU to read different data
It took weeks to figure this out
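
The failure mode above (good data on disk, bad data arriving at the CPU) is what a per-page checksum is meant to catch. As a rough illustration only, the sketch below uses CRC-32 from Python's `zlib` module rather than whatever algorithm MySQL actually uses, and the helper names `write_page` and `read_page` are hypothetical.

```python
import zlib


def write_page(data: bytes) -> bytes:
    """Return the page as stored: a 4-byte CRC-32 followed by the payload."""
    return zlib.crc32(data).to_bytes(4, "big") + data


def read_page(stored: bytes) -> bytes:
    """Return the payload, or raise ValueError if the checksum does not match."""
    expected = int.from_bytes(stored[:4], "big")
    payload = stored[4:]
    if zlib.crc32(payload) != expected:
        raise ValueError("page checksum mismatch")
    return payload


page = write_page(b"row data")

# Simulate a read that flips one bit in transit; the stored page is unchanged.
corrupted_read = page[:-1] + bytes([page[-1] ^ 0x01])
```

Calling `read_page(page)` returns the payload, while `read_page(corrupted_read)` raises `ValueError` even though the stored copy is intact, which matches the symptom described in the talk.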

One of the favorites: again ran out of disk space!!

Awesome.
