Early YouTube Engineer talks about outages, system scalability
Gigaom.com has the video; here is a transcript of it.
- 2 sys admin
- 2 scalability software architects
- 2 feature developers
- 2 network engineers
- 1 DB admin
- Zero chefs
- End user browser
- NetScaler load balancer
- Web Servers (bank of)
- Local app server
- Python app
- CDN - internal project from Google (for non-US content)
- Video – colo servers – from various US locations
- Linux (SuSE 10.x)
- Apache (2.2.x) / lighttpd 1.4.x
- lighttpd - very fast at handling large files
- MySQL 5.0.x - metadata storage
- [Very difficult to recruit people for a small company; it was faster to get 10 machines than 1 developer, so they got more machines and ran slow code on them.]
- Google technologies - search, Bigtable (video thumbnails), GFS (internal storage, transfer buffers)
- Replica for backup, replica for reporting (later)
- Vertical partitions for the DB; each takes part of the web site (more later)
- Multiple users for DB, for scalability
- Associate one user with one partition
- Rapid, unpredictable growth (user growth always outpaced any amount of hardware)
- Passionate users (users who video-blog their life)
- New features (recommendation algorithms, compatibility, social graphs) constantly invalidated scalability predictions
- Pushing hardware and software boundaries (if you run hardware or software too close to its limits, it is more likely to fail)
- Unknown problems (issues you can't find via a Google search, or issues that even the vendor of the third-party software doesn't know about)
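The "associate one user with one partition" idea above can be sketched as follows. This is a minimal illustration, not YouTube's actual code; the partition table, DSN strings, and account names are hypothetical.

```python
# Each vertical partition gets its own database and its own DB account,
# so credentials and capacity can be managed per partition.
# All names and DSNs below are made up for illustration.
PARTITIONS = [
    {"dsn": "mysql://app_p0@db0/site_p0"},
    {"dsn": "mysql://app_p1@db1/site_p1"},
]

def partition_for_user(user_id: int) -> dict:
    # A stable mapping pins each user to one partition, so all of that
    # user's reads and writes go to the same database.
    return PARTITIONS[user_id % len(PARTITIONS)]
```

Because the mapping is deterministic, adding capacity means adding partitions and re-mapping users, not changing application logic.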
Date: October 22, 2005 2:24:33 AM PDT
We can't accept any more videos, too many videos.
- All thumbnails were stored in separate sub-directories, and all sub-directories were in one flat directory.
- More than 10K files in a directory: an out-of-inodes problem.
- Wasn't too difficult to solve, went to a tree structure.
- Wasn't the most fun thing to do, with videos continuing to be uploaded the whole time
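The fix described above, moving from one flat directory to a tree, can be sketched like this. This is an illustrative scheme, assuming hex segments of a hashed ID; the function name, depth, and extension are made up, not YouTube's actual layout.

```python
import hashlib
import os

def tree_path(root: str, video_id: str, depth: int = 2, width: int = 2) -> str:
    """Spread thumbnail files across nested sub-directories.

    Hashing the ID and using fixed-width hex segments keeps any single
    directory far below the ~10K-entry point where the flat layout broke.
    (Illustrative sketch; parameters and naming are assumptions.)
    """
    digest = hashlib.md5(video_id.encode()).hexdigest()
    segments = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return os.path.join(root, *segments, video_id + ".jpg")
```

With two levels of two hex characters each, files are spread over 256 × 256 = 65,536 directories, so even hundreds of millions of thumbnails stay at a few thousand entries per directory.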
- MySQL gave error, checksum has failed
- Checksum stored for every page (15K of data)
- MySQL's checksum check failed, it puked, and they lost 4-5 hours of data
- Took 4-5 hours to recover
- We found lots of questions, but no answers for this problem
- "Maybe this is a hardware issue, not everybody is having it"
- Found exactly one blog post; it turned out the combination of the RAID card and the other I/O card could sometimes cause interesting voltage fluctuations
- So the data on disk was fine; the voltage fluctuation caused the CPU to read different data
- It took weeks to figure this out
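The per-page checksum mechanism that caught this can be sketched as follows. This is a simplified illustration of the idea, not MySQL's actual on-disk format: a fixed-size page stores a checksum that is recomputed on every read, so any mismatch between what was written and what the CPU later reads, even a single flipped bit from a voltage glitch, is detected.

```python
import zlib

PAGE_SIZE = 16 * 1024  # fixed-size page; size here is illustrative

def make_page(payload: bytes) -> bytes:
    """Write path: pad the payload to page size and prepend a CRC32."""
    body = payload.ljust(PAGE_SIZE - 4, b"\x00")
    crc = zlib.crc32(body)
    return crc.to_bytes(4, "big") + body

def verify_page(page: bytes) -> bool:
    """Read path: recompute the checksum; a single flipped bit fails it."""
    stored = int.from_bytes(page[:4], "big")
    return zlib.crc32(page[4:]) == stored
```

The subtlety in the story above is that the checksum was correct on disk; the corruption happened on the way from disk to CPU, which is exactly the window this read-time verification covers.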