Sunday, July 13, 2008

Early YouTube Engineer talks about outages, system scalability

Gigaom.com has posted a very interesting presentation by Cuong Do, an early YouTube software engineer who is now manager of the site's Core Product Engineering group.

Gigaom.com has the video; here is a transcript of it.

Introduction
Do's talk, titled "Behind the Scenes: A Look Into YouTube's Infrastructure," had harrowing tales of outages and gory details about the specific languages, architectures, and tools YouTube uses. "One of the key phrases we had in the early days was 'These are good problems to have,'" Do said. "And after a while we're like, 'I'm going to kill the next person who says that.'"

Do describes the early team:
  • 2 sysadmins
  • 2 scalability software architects
  • 2 feature developers
  • 2 network engineers
  • 1 DB admin
  • Zero chefs
Algorithm for handling rapid growth
while (true) {
    identify_and_fix_bottlenecks();
    drink();
    sleep(TOO_LITTLE);
    notice_new_bottleneck();
}
Web Request flow
  • End user browser
  • NetScaler load balancer
  • Web servers (a bank of them)
  • Apache
  • Local app server
  • Python app
  • Memcached
  • DB (see the read-path sketch right after this list)
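
A quick Python sketch of the cache-aside read path implied by that Memcached-then-DB ordering: check memcached first, fall back to MySQL on a miss, then populate the cache for the next request. This is not YouTube's actual code; the cache host, key format, TTL, and function names are made up, and it assumes the python-memcached client library.

import memcache  # assumes the python-memcached client library

mc = memcache.Client(["127.0.0.1:11211"])  # hypothetical cache host

def fetch_video_metadata(video_id, db_lookup, ttl=300):
    # db_lookup is any callable that reads the row from MySQL
    key = "video:%s" % video_id
    row = mc.get(key)
    if row is None:                 # cache miss: go to the database
        row = db_lookup(video_id)
        mc.set(key, row, time=ttl)  # populate the cache for the next request
    return row
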
Video served through
  • CDN - internal project from Google (for non-US content)
  • Video – colo servers – from various US locations
Key technologies
  • Linux (SuSE 10.x)
  • Apache (2.2.x) / lighttpd 1.4.x
  • lighttpd - very fast at serving large files
  • MySQL 5.0.x - metadata storage
  • Python
  • [It was very difficult to recruit people for a small company; it was faster to get 10 machines than 1 developer, so they got more machines to run slower code.]
  • Google technologies - search, Bigtable (video thumbnails), GFS (internal storage, transfer buffers)
DB
Started with 1 DB server
  • Replica for backup, replica for reporting (later)
  • Vertical partitions for the DB: each partition takes part of the web site (more later)
  • Multiple DB users, for scalability
  • Associate one user with one partition (a routing sketch follows below)
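
To make "one user per partition" concrete, here is a minimal Python sketch of that kind of routing. The partition names, hosts, and user names are hypothetical; the point is just that each part of the site talks to the database with its own credentials, so its load can be measured and moved independently.

PARTITIONS = {
    "videos":   {"host": "db-videos",   "user": "yt_videos",   "db": "youtube"},
    "users":    {"host": "db-users",    "user": "yt_users",    "db": "youtube"},
    "comments": {"host": "db-comments", "user": "yt_comments", "db": "youtube"},
}

def connection_params(feature_area):
    # Return the connection settings for the partition that owns this feature.
    try:
        return PARTITIONS[feature_area]
    except KeyError:
        raise ValueError("no vertical partition registered for %r" % feature_area)

# With a real driver this would be something like MySQLdb.connect(**connection_params("videos")).
print(connection_params("videos")["user"])  # -> yt_videos
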
Scalability challenges
  • Rapid, unpredictable growth (user growth always outpaces however much hardware you can add)
  • Passionate users (users who video-blog their lives)
  • New features (recommendation algorithms, compatibility, social graphs - they always threw off scalability predictions)
  • Pushing hardware and software boundaries (if you run hardware or software too close to its limits, it's more likely to fail)
  • Unknown problems (issues you can't find with a Google search, or that even the vendor of the third-party software doesn't know about)
One example of an issue we had
Subject: oh @#!%
Date: October 22, 2005 2:24:33 AM PDT
We can't accept any more videos, too many videos.
  • All thumbnails were stored in separate sub-directories, and all sub-directories were in one flat directory.
  • With more than 10K files in a directory, we hit an out-of-inodes problem.
  • It wasn't too difficult to solve; we went to a tree structure (sketched right after this list).
  • It wasn't the most fun thing to do, though, with videos still being uploaded the whole time.
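
Going from one flat directory to a tree usually means hashing the file name and using the first few characters of the digest as nested sub-directories, so no single directory ever accumulates tens of thousands of entries. Here is a minimal Python 3 sketch; the paths, fan-out, and naming are illustrative, not YouTube's actual layout.

import hashlib
import os

def thumbnail_path(root, video_id):
    digest = hashlib.md5(video_id.encode("utf-8")).hexdigest()
    # two levels of fan-out: 256 * 256 = 65,536 buckets
    return os.path.join(root, digest[:2], digest[2:4], "%s.jpg" % video_id)

def store_thumbnail(root, video_id, data):
    path = thumbnail_path(root, video_id)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)

print(thumbnail_path("/thumbs", "someVideoId"))  # e.g. /thumbs/<xx>/<yy>/someVideoId.jpg
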
YouTube: a 5-hour outage
  • MySQL gave an error: a checksum had failed
  • A checksum is stored for every page (15K of data); a toy example follows below
  • When the checksum failed, MySQL puked and lost 4-5 hours of data
  • It took 4-5 hours to recover
  • We found lots of questions online, but no answers for this problem
  • "Maybe this is a hardware issue, not everybody is having it"
  • We found exactly one blog post; it turns out the combination of the RAID card and the other I/O card can sometimes cause interesting voltage fluctuations
  • So the data was fine on the disk; the voltage fluctuations caused the CPU to read different data
  • It took weeks to figure this out
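
A toy Python illustration of the per-page checksum idea in this story (not MySQL's actual implementation; the page size and CRC choice are purely illustrative): a checksum is stored alongside each page, and if what the CPU reads back doesn't hash to the same value, whether because the disk is bad or, as here, flaky voltage corrupted the read, the page is rejected.

import zlib

PAGE_SIZE = 15 * 1024  # the talk mentions ~15K per page; purely illustrative

def write_page(data):
    # store the page together with a checksum of its contents
    return data, zlib.crc32(data) & 0xffffffff

def read_page(data, stored_checksum):
    # recompute the checksum on read; a mismatch means the data got corrupted somewhere
    if (zlib.crc32(data) & 0xffffffff) != stored_checksum:
        raise IOError("page checksum mismatch")
    return data

page, cksum = write_page(b"\x00" * PAGE_SIZE)
read_page(page, cksum)    # fine
bad = b"\x01" + page[1:]  # one flipped byte between disk and CPU
# read_page(bad, cksum)   # would raise the checksum error
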
One of my favorite ones: we ran out of disk space, again!!

Awesome.
