Joe Stump, Lead Architect at Digg, gave this presentation at the Web 2.0 Expo. I couldn’t find the actual presentation, but fortunately Kris Jordan took some great notes. That’s how key moments in history are accidentally captured forever. Joe was also kind enough to respond to my email questions with a phone call.
In this first part of the post Joe shares some timeless wisdom that you may or may not have read before. I of course take some pains to extract all the wit from the original presentation in favor of simple rules. What really struck me however was how Joe thought MemcacheDB will be the biggest new kid on the block in scaling. MemcacheDB has been around for a little while and I’ve never thought of it in that way. We’ll learn why Joe is so excited by MemcacheDB at the end of the post.
80th-100th largest site in the world
26 million uniques a month
30 million users.
Uniques are only half that traffic. Traffic = unique web visitors + APIs + Digg buttons.
2 billion requests a month
13,000 requests a second, peak at 27,000 requests a second.
3 Sys Admins, 2 DBAs, 1 Network Admin, 15 coders, QA team
Lots of servers.
Scaling is specialization. When off the shelf solutions no longer work at a certain scale you have to create systems that work for your particular needs.
Lesson of web 2.0: people love making crap and sharing it with the world.
Web 2.0 sucks for scalability. Web 1.0 was flat with a lot of static files. Additional load is handled by adding more hardware. Web 2.0 is heavily interactive. Content can be created at a crushing rate.
Languages don’t scale. 100% of the time bottlenecks are in IO. Bottlenecks aren’t in the language when you are handling so many simultaneous requests. Making PHP 300% faster won’t matter. Don’t optimize PHP by using single quotes instead of double quotes when the database is pegged.
Don’t share state. Decentralize. Partitioning is required to process a high number of requests in parallel.
Scale out instead of up. Expect failures. Just add boxes to scale and avoid the fail.
Database-driven sites need to be partitioned to scale both horizontally and vertically. Horizontal partitioning means storing subsets of rows on different machines. It is used when there’s more data than will fit on one machine. Vertical partitioning means putting some columns in one table and some columns in another table. This allows you to add data to the system without downtime.
Data are separated into separate clusters: User Actions, Users, Comments, Items, etc.
Build a data access layer so partitioning is hidden behind an API.
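As a rough illustration of hiding partitioning behind an API, here is a minimal sketch. It is not Digg's code: the `UserStore` class, its method names, and the CRC32-modulo routing are all assumptions made for the example, and plain dictionaries stand in for separate database servers.

```python
import zlib


class UserStore:
    """Hypothetical data access layer: callers never see which shard holds a user."""

    def __init__(self, shard_count):
        # Each dict stands in for a separate database server.
        self.shards = [dict() for _ in range(shard_count)]

    def _shard_for(self, user_id):
        # A stable hash so the same user always maps to the same shard.
        index = zlib.crc32(str(user_id).encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def save(self, user_id, record):
        self._shard_for(user_id)[user_id] = record

    def load(self, user_id):
        return self._shard_for(user_id).get(user_id)


store = UserStore(shard_count=4)
store.save(42, {"name": "kevin"})
print(store.load(42))
```

Because callers only use `save` and `load`, the routing rule can change later (for example to a lookup table) without touching application code.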
With partitioning comes the CAP Theorem: you can only pick two of the following three: Strong Consistency, High Availability, Partition Tolerance.
Partitioned solutions require denormalization, which has become a big problem at Digg. Denormalization means data is copied into multiple objects and must be kept synchronized.
MySQL replication is used to scale out reads.
Use an asynchronous queuing architecture for near-term processing.
– This approach pushes chunks of processing to another service and lets that service schedule the processing on a grid of processors.
– It’s faster and more responsive than cron and only slightly less responsive than real-time.
– For example, issuing 5 synchronous database requests slows you down. Do them in parallel.
– Digg uses Gearman. An example use is to get a permalink. Three operations are done in parallel: get the current logged-in user, get the permalink, and grab the comments. All three are then combined to return a single answer to the client. It’s also used for site crawling and logging. It’s a different way of thinking.
– See Flickr – Do the Essential Work Up-front and Queue the Rest and The Canonical Cloud Architecture for more information.
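The permalink example above can be sketched as follows. This is not Gearman: a thread pool from Python's standard library stands in for the job queue, and the three fetch functions and their return values are hypothetical placeholders for real backend calls.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for three independent backend calls.
def get_current_user(session_id):
    return {"session": session_id, "user": "example-user"}


def get_permalink(story_id):
    return {"story": story_id, "url": "/story/%d" % story_id}


def get_comments(story_id):
    return [{"story": story_id, "text": "first"}]


def build_permalink_page(session_id, story_id):
    # Submit all three jobs first so they run concurrently, then gather results.
    with ThreadPoolExecutor(max_workers=3) as pool:
        user = pool.submit(get_current_user, session_id)
        link = pool.submit(get_permalink, story_id)
        comments = pool.submit(get_comments, story_id)
        return {
            "user": user.result(),
            "permalink": link.result(),
            "comments": comments.result(),
        }


page = build_permalink_page("abc", 7)
print(page["permalink"]["url"])
```

The total latency is then roughly that of the slowest call rather than the sum of all three.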
Bottlenecks are in IO so you have to tune the database. When the database is bigger than RAM the disk is hit all the time, which kills performance. As the database gets larger the table can’t be scanned anymore. So you have to:
– avoid joins
– avoid large scans across databases by partitioning
– add read slaves
– don’t use NFS
Run numbers before you try and fix a problem to make sure things actually will work.
Files such as icons and photos are handled by MogileFS, a distributed file system. DFSs support high request rates because files are distributed and replicated around a network.
Cache forever and explicitly expire.
Cache fairly static content in a file based cache.
Cache changeable items in memcached
Cache rarely changed items in APC. APC is a local cache. It’s not distributed, so no other program has access to the values.
For caching use the Chain of Responsibility pattern. Cache in MySQL, memcached, APC, and PHP globals. First check PHP globals as the fastest cache. If not present check APC, then memcached, and on up the chain.
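A minimal sketch of such a cache chain, with explicit expiry, might look like this. The `CacheChain` class and its back-fill behavior are assumptions for illustration, and dictionaries stand in for PHP globals, APC, and memcached.

```python
class CacheChain:
    """Looks a key up in a list of caches ordered from fastest to slowest."""

    def __init__(self, layers):
        # layers: dict-like caches, fastest first.
        self.layers = layers

    def get(self, key, load_from_source):
        for depth, layer in enumerate(self.layers):
            if key in layer:
                value = layer[key]
                break
        else:
            # Every layer missed: fall through to the slow source of truth.
            depth = len(self.layers)
            value = load_from_source(key)
        # Back-fill the faster layers that missed.
        for layer in self.layers[:depth]:
            layer[key] = value
        return value

    def expire(self, key):
        # Cache forever, expire explicitly: drop the key from every layer.
        for layer in self.layers:
            layer.pop(key, None)


request_globals, apc, memcached = {}, {}, {"story:1": "cached story"}
chain = CacheChain([request_globals, apc, memcached])
database_hits = []


def load(key):
    database_hits.append(key)
    return "from database"


print(chain.get("story:1", load))
```

Here the first lookup is served from the slowest cache layer and copied into the two faster ones, so the database loader is never called.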
Digg’s recommendation engine is a custom graph database that is eventually consistent. Eventually consistent means that writes to one partition will eventually make it to all the other partitions. After a write, reads made one after another don’t have to return the same value, as they could be handled by different partitions. This is a more relaxed constraint than strict consistency, which means changes must be visible at all partitions simultaneously, so reads made one after another would always return the same value.
Assume 1 million people a day will bang on any new feature, so make it scalable from the start. Example: the About page on Digg did a live query against the master database to show all employees. It was a quick hack to get the page out. Then a spider went crazy and took the site down.
Digg buttons were a major key to generating traffic.
Uses Debian Linux, Apache, PHP, MySQL.
Pick a language you enjoy developing in, pick a coding standard, add inline documentation that’s extractable, use a code repository, and a bug tracker. Likes PHP, Trac, and SVN.
You are only as good as your people. Have to trust the guy next to you that he’s doing his job. To cultivate trust empower people to make decisions. Trust that people have it handled and they’ll take care of it. Cuts down on meetings because you know people will do the job right.
Completely a Mac shop.
Almost all developers are local. Some people are remote to offer 24 hour support.
Joe’s approach is pragmatic. He doesn’t have a language fetish. People went from PHP, to Python/Ruby, to Erlang. Uses vim. Develops from the command line. Has no idea how people constantly change tool sets. It’s not very productive.
Services (SOA) decoupling is a big win. Digg uses REST. Internal services return a vanilla structure that’s mapped to JSON, XML, etc. Version in URL because it costs you nothing, for example: /1.0/service/id/xml. Version both internal and external services.
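To show how cheap a version-in-URL scheme is to handle, here is a small sketch that splits a path of the shape given above into its parts. The function name, the returned dictionary keys, and the sample path values (`story`, `123`, `json`) are hypothetical.

```python
def parse_service_url(path):
    """Split a path shaped like /<version>/<service>/<id>/<format>."""
    parts = path.strip("/").split("/")
    if len(parts) != 4:
        raise ValueError("expected /version/service/id/format, got %r" % path)
    version, service, item_id, fmt = parts
    return {"version": version, "service": service, "id": item_id, "format": fmt}


print(parse_service_url("/1.0/story/123/json"))
```

A dispatcher can then pick a handler by version and a serializer by format, so an old client pinned to one version keeps working while a newer version is introduced.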
People don’t understand how many moving parts are in a website. Something is going to happen and it will go down.
MemcacheDB: Evolutionary Step For Code, Revolutionary Step For Performance
Imagine Kevin Rose, the founder of Digg, who at the time of this presentation had 40,000 followers. If Kevin diggs just once a day that’s 40,000 writes. As the most active diggers are the most followed it becomes a huge performance bottleneck. Two problems appear.
You can’t update 40,000 follower accounts at once. Fortunately the queuing system we talked about earlier takes care of that.
The second problem is the huge number of writes that happen. Digg has a write problem. If the average user has 100 followers that’s 300 million diggs a day. That’s 3,000 writes per second, 7GB of storage per day, and 5TB of data spread across 50 to 60 servers.
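The write-rate figure can be sanity-checked with a little arithmetic. This sketch assumes the 300 million figure counts one write per follower and that load is spread evenly over the day, neither of which is stated in the notes.

```python
SECONDS_PER_DAY = 24 * 60 * 60

fanned_out_writes_per_day = 300_000_000  # figure quoted in the talk notes
average_followers = 100                  # figure quoted in the talk notes

# Implied number of original diggs before fan-out to followers.
diggs_per_day = fanned_out_writes_per_day // average_followers
# Average write rate if the load were spread evenly over the day.
writes_per_second = fanned_out_writes_per_day / SECONDS_PER_DAY

print(diggs_per_day)             # 3000000
print(round(writes_per_second))  # 3472
```

Under those assumptions the average comes out near 3,500 writes per second, the same order of magnitude as the 3,000 quoted, and peak rates would be higher than the average.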
With such a heavy write load MySQL wasn’t going to work for Digg. That’s where MemcacheDB comes in. In initial tests on a laptop MemcacheDB was able to handle 15,000 writes a second. MemcacheDB’s own benchmark shows it capable of 23,000 writes/second and 64,000 reads/second. At those write rates it’s easy to see why Joe was so excited about MemcacheDB’s ability to handle their digg deluge.
What is MemcacheDB? It’s a distributed key-value storage system designed for persistence. It is NOT a cache solution, but a persistent storage engine for fast and reliable key-value based object storage and retrieval. It conforms to the memcache protocol (though not completely), so any memcached client can connect to it. MemcacheDB uses Berkeley DB as its storage backend, so lots of features including transactions and replication are supported.
Before you get too excited keep in mind this is a key-value store. You read and write records by a single key. There aren’t multiple indexes and there’s no SQL. That’s why it can be so fast.
Digg uses MemcacheDB to scale out the huge number of writes that happen when data is denormalized. Remember it’s a key-value store. The value is usually a complete application level object merged together from a possibly large number of normalized tables. Denormalizing introduces redundancies because you are keeping copies of data in multiple records instead of just one copy in a nicely normalized table. So denormalization means a lot more writes as data must be copied to all the records that contain a copy. To keep up they needed a database capable of handling their write load. MemcacheDB has the performance, especially when you layer memcached’s normal partitioning scheme on top.
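To make the write amplification concrete, here is a sketch of a denormalized write fanned out over a client-side partitioned key-value store. It does not use MemcacheDB or any memcached client library: dictionaries stand in for servers, and the `PartitionedKV` class, the `feed:` key format, and the CRC32-modulo routing are assumptions chosen for the example.

```python
import zlib


class PartitionedKV:
    """Client-side partitioning over several key-value servers (dicts here)."""

    def __init__(self, server_count):
        self.servers = [dict() for _ in range(server_count)]

    def _server_for(self, key):
        # The client hashes the key to choose a server; servers never coordinate.
        return self.servers[zlib.crc32(key.encode("utf-8")) % len(self.servers)]

    def set(self, key, value):
        self._server_for(key)[key] = value

    def get(self, key):
        return self._server_for(key).get(key)


def record_digg(kv, digger, story_id, followers):
    # Denormalized write: one copy of the event per follower's feed record.
    writes = 0
    for follower in followers:
        key = "feed:%s" % follower
        feed = kv.get(key) or []
        kv.set(key, feed + [{"digger": digger, "story": story_id}])
        writes += 1
    return writes


kv = PartitionedKV(server_count=3)
print(record_digg(kv, "kevin", 7, ["alice", "bob", "carol"]))  # 3
```

One digg becomes as many writes as the digger has followers, which is why the store's raw write throughput matters so much here.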
I asked Joe why he didn’t turn to one of the in-memory data grid solutions. Some of the reasons were:
This data is generated from many different databases and takes a long time to generate. So they want it in a persistent store.
MemcacheDB uses the memcache protocol. Digg already uses memcache so it’s a no-brainer to start using MemcacheDB. It’s easy to use and easy to setup.
Operations is happy with deploying it into the datacenter as it’s not a new setup.
They already have memcached high availability and failover code so that stuff already works.
Using a new system would require more ramp-up time.
If there are any problems with the code you can take a look. It’s all open source.
Not sure those other products are stable enough.
So it’s an evolutionary step for code and a revolutionary step for performance. Digg is looking at using MemcacheDB across the board.
Scaling Digg and Other Web Applications by Kris Jordan.
Joe Stump’s Blog
Memcached Related Tags on HighScalability
Caching Related Tags on HighScalability
Anti-RDBMS: A list of distributed key-value stores
An Unorthodox Approach to Database Design : The Coming of the Shard
Episode 4: Scaling Large Web Sites with Joe Stump, Lead Architect at DIGG