This document summarizes a keynote speech given by John Adams, an early Twitter engineer, about scaling Twitter operations from 2008-2009. Some key points:
1) Twitter saw exponential growth rates from 2008-2009, processing over 55 million tweets per day and 600 million searches per day.
2) Operations focused on improving performance, reducing errors and outages, and using metrics to identify weaknesses and bottlenecks like network latency and database delays.
3) Technologies like Unicorn, memcached, Flock, Cassandra, and daemons were implemented to improve scalability beyond a traditional RDBMS and handle Twitter's data volumes and real-time needs.
4) Caching with memcached, asynchronous daemons and queues (Kestrel), and feature switches ("darkmode") were used to keep the site responsive and to fail fast under load.
4. John Adams @netik
• Early Twitter employee (mid-2008)
• Lead engineer: Outward Facing Services (Apache, Unicorn, SMTP), Auth, Security
• Keynote Speaker: O’Reilly Velocity 2009
• O’Reilly Web 2.0 Speaker (2008, 2010)
• Previous companies: Inktomi, Apple, c|net
• Working on the Web Operations book with John Allspaw (Flickr, Etsy), out in June
12. Operations
• What do we do?
• Site Availability
• Capacity Planning (metrics-driven)
• Configuration Management
• Security
• Much more than basic Sysadmin
13. What have we done?
• Improved response time, reduced latency
• Fewer errors during deploys (Unicorn!)
• Faster performance
• Lower MTTD (Mean time to Detect)
• Lower MTTR (Mean time to Recovery)
14. Operations Mantra
Find Weakest Point → Take Corrective Action → Move to Next Weakest Point (repeat)
Metrics + Logs + Analysis + Science = Process Repeatability
16. Finding Weakness
• Metrics + Graphs
• Individual metrics are irrelevant
• We aggregate metrics to find knowledge
• Logs
• SCIENCE!
17. Monitoring
• Twitter graphs and reports critical metrics in as near real time as possible
• If you build tools against our API, you should too.
• RRD, other Time-Series DB solutions
• Ganglia + custom gmetric scripts
• dev.twitter.com - API availability
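For the custom gmetric scripts, a minimal sketch (assuming Ganglia's gmetric CLI is on the path; the metric name and units are invented, and the flags should be verified against your Ganglia version) of pushing an application metric into the time-series store:

```python
# Hedged sketch: feed a custom application metric into Ganglia by
# shelling out to gmetric, in the spirit of the "custom gmetric
# scripts" bullet above. Metric name/units are illustrative.
import subprocess

def report_metric(name, value, units):
    subprocess.check_call([
        "gmetric",
        "--name", name,
        "--value", str(value),
        "--type", "uint32",
        "--units", units,
    ])

# e.g. number of HTTP 503s ("whales") seen in the last minute
report_metric("whales_per_minute", 12, "errors/min")
```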
18. Analyze
• Turn data into information
• Where is the code base going?
• Are things worse than they were?
• Understand the impact of the last software deploy
• Run check scripts during and after deploys
• Capacity Planning, not Fire Fighting!
19. Data Analysis
• Instrumenting the world pays off.
• “Data analysis, visualization, and other techniques for seeing patterns in data are going to be an increasingly valuable skill set. Employers take notice!”
“Web Squared: Web 2.0 Five Years On”, Tim O’Reilly, Web 2.0 Summit, 2009
20. Forecasting
• Curve-fitting for capacity planning (R, fityk, Mathematica, CurveFit)
[Graph: status_id growth over time with fitted curve (r² = 0.99), marking the signed and unsigned 32-bit integer limits, each labelled “Twitpocalypse”]
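The curve fit itself was done with the tools named above; as a hedged illustration only (the data points below are invented), a NumPy sketch that fits status_id growth and projects when it crosses the signed 32-bit limit:

```python
# Rough sketch of the "Twitpocalypse" forecast: fit observed status_id
# growth and project when it exceeds the signed 32-bit integer maximum.
# The sample data is made up; the talk's real fit had r^2 ~= 0.99.
import numpy as np

# (days elapsed, max status_id observed that day) -- illustrative only
samples = np.array([
    (0,  1.20e9),
    (30, 1.45e9),
    (60, 1.78e9),
    (90, 2.05e9),
])
days, ids = samples[:, 0], samples[:, 1]

# A quadratic is enough to illustrate the approach.
fit = np.poly1d(np.polyfit(days, ids, deg=2))

SIGNED_32BIT_MAX = 2**31 - 1
future = np.arange(days[-1], days[-1] + 365)
crossing = future[fit(future) >= SIGNED_32BIT_MAX]
if crossing.size:
    print("Projected signed 32-bit overflow in ~%d days" % (crossing[0] - days[-1]))
```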
23. What’s a Robot ?
• Actual error in the Rails stack (HTTP 500)
• Uncaught Exception
• Code problem, or failure / nil result
• Increases our exception count
• Shows up in Reports
24. What’s a Whale ?
• HTTP Error 502, 503 (Timeout)
• Twitter has a hard and fast five second timeout
• We’d rather fail fast than block on requests
• We also kill long-running queries (mkill)
25. Whale Watcher
• Simple shell script, MASSIVE WIN by @ronpepsi
• Whale = HTTP 503 (timeout)
• Robot = HTTP 500 (error)
• Examines last 60 seconds of aggregated daemon / www logs
• “Whales per Second” > Wthreshold
• Thar be whales! Call in ops.
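The real Whale Watcher is a shell script; the following is only a hedged Python sketch of the same idea, assuming an aggregated log whose lines start with `<unix_timestamp> <status_code>` and an invented alert threshold:

```python
# Sketch of the Whale Watcher idea: count HTTP 503s ("whales") and
# 500s ("robots") in the last 60 seconds of an aggregated log and
# alert when whales-per-second exceeds a threshold.
import time

WINDOW_SECONDS = 60
WHALE_THRESHOLD_PER_SEC = 1.0   # illustrative, not Twitter's value

def scan(log_path="/var/log/aggregated/access.log"):   # path is assumed
    cutoff = time.time() - WINDOW_SECONDS
    whales = robots = 0
    with open(log_path) as log:
        for line in log:
            fields = line.split()
            if len(fields) < 2:
                continue
            timestamp, status = float(fields[0]), fields[1]
            if timestamp < cutoff:
                continue
            if status == "503":
                whales += 1
            elif status == "500":
                robots += 1
    whales_per_sec = whales / float(WINDOW_SECONDS)
    if whales_per_sec > WHALE_THRESHOLD_PER_SEC:
        print("Thar be whales! %.2f/s (robots: %d) -- call in ops"
              % (whales_per_sec, robots))

if __name__ == "__main__":
    scan()
```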
26. Deploy Watcher
Sample window: 300.0 seconds
First start time:
Mon Apr 5 15:30:00 2010 (Mon Apr 5 08:30:00 PDT 2010)
Second start time:
Tue Apr 6 02:09:40 2010 (Mon Apr 5 19:09:40 PDT 2010)
PRODUCTION APACHE: ALL OK
PRODUCTION OTHER: ALL OK
WEB0049 CANARY APACHE: ALL OK
WEB0049 CANARY BACKEND SERVICES: ALL OK
DAEMON0031 CANARY BACKEND SERVICES: ALL OK
DAEMON0031 CANARY OTHER: ALL OK
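A hedged sketch of what that report is doing: compare error counts from a pre-deploy and a post-deploy sample window per host group, and flag any group whose errors jumped. The groups, counts, and tolerance below are illustrative, not the real Deploy Watcher:

```python
# Sketch of the Deploy Watcher idea: per host group, compare error
# counts from two sample windows (before/after a deploy) and report
# ALL OK or a regression. All numbers are illustrative.
TOLERANCE = 1.5  # post-deploy errors may be at most 1.5x the baseline

def compare(group, errors_before, errors_after):
    baseline = max(errors_before, 1)  # avoid dividing by a clean window
    if errors_after <= baseline * TOLERANCE:
        print("%s: ALL OK" % group)
    else:
        print("%s: ERRORS UP %.1fx -- inspect the deploy"
              % (group, errors_after / float(baseline)))

windows = {
    "PRODUCTION APACHE": (120, 110),
    "WEB0049 CANARY APACHE": (3, 2),
    "DAEMON0031 CANARY BACKEND SERVICES": (1, 9),
}
for group, (before, after) in windows.items():
    compare(group, before, after)
```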
27. Feature “Darkmode”
• Specific site controls to enable and disable computationally or IO-heavy site functions
• The “Emergency Stop” button
• Changes logged and reported to all teams
• Around 60 switches we can throw
• Static / Read-only mode
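A minimal sketch of the feature-switch pattern behind darkmode, assuming an in-process registry with logged changes; the switch name and storage are illustrative, not Twitter's implementation:

```python
# Toy darkmode registry: named switches that turn off expensive site
# features at runtime, with every change logged for the other teams.
import logging
import threading

logging.basicConfig(level=logging.INFO)

class Darkmode(object):
    def __init__(self):
        self._lock = threading.Lock()
        self._disabled = set()

    def disable(self, feature, operator):
        with self._lock:
            self._disabled.add(feature)
        logging.info("DARKMODE ON: %s (by %s)", feature, operator)

    def enable(self, feature, operator):
        with self._lock:
            self._disabled.discard(feature)
        logging.info("DARKMODE OFF: %s (by %s)", feature, operator)

    def allows(self, feature):
        with self._lock:
            return feature not in self._disabled

switches = Darkmode()

def render_timeline(user):
    if switches.allows("search"):        # hypothetical switch name
        pass  # run the IO-heavy path only when the switch is on
    return "timeline for %s" % user

switches.disable("search", operator="ops-oncall")   # the "emergency stop"
```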
29. Servers
• Co-located, dedicated machines at NTT America
• No cloud; used only for monitoring, not for serving
• Need raw processing power; latency is too high in existing cloud offerings
• Frees us to deal with real, intellectual, computer science problems
• Moving to our own data center soon
30. unicorn
• A single-socket Rails application server (Rack)
• Zero Downtime Deploys (!)
• Controlled, shuffled transfer to new code
• Less memory, 30% less CPU
• Shift from mod_proxy_balancer to mod_proxy_pass
• HAProxy and Nginx weren’t any better, really.
31. Rails
• Mostly only for front-end.
• Back end mostly Scala and pure Ruby
• Not to blame for our issues. Analysis found:
• Caching + cache invalidation problems
• Bad queries generated by ActiveRecord, resulting in slow queries against the DB
• Queue Latency
• Replication Lag
32. memcached
• memcached isn’t perfect.
• Memcached SEGVs hurt us early on.
• Evictions make the cache unreliable for important configuration data (loss of darkmode flags, for example)
• Network Memory Bus isn’t infinite
• Segmented into pools for better performance
33. Loony
• Central machine database (MySQL)
• Python, Django, Paramiko SSH
• Paramiko - Twitter OSS (@robey)
• Ties into LDAP groups
• When the data center sends us email, machine definitions are built in real time
34. Murder
• @lg rocks!
• BitTorrent-based replication for deploys
• ~30-60 seconds to update >1k machines
• P2P - Legal, valid, Awesome.
35. Kestrel
• @robey
• Works like memcache (same protocol)
• SET = enqueue | GET = dequeue
• No strict ordering of jobs
• No shared state between servers
• Written in Scala.
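Because Kestrel speaks the memcache text protocol, a stock memcached client can act as both producer and consumer. A minimal sketch, assuming the python-memcached package and Kestrel on its default memcache port (22133); the queue name and payload are illustrative:

```python
# Enqueue/dequeue against Kestrel with an ordinary memcached client:
# SET pushes a job onto the named queue, GET pops the next one.
import json
import memcache

kestrel = memcache.Client(["127.0.0.1:22133"])

# SET = enqueue
job = json.dumps({"follower": 123, "followee": 456})
kestrel.set("follow_notifications", job)

# GET = dequeue (returns None when the queue is empty)
raw = kestrel.get("follow_notifications")
if raw is not None:
    print("processing", json.loads(raw))
```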
36. Asynchronous Requests
• Inbound traffic consumes a unicorn worker
• Outbound traffic consumes a unicorn worker
• The request pipeline should not be used to handle 3rd party communications or back-end work.
• Reroute traffic to daemons
37. Daemons
• Daemons touch every tweet
• Many different daemon types at Twitter
• Old way: One daemon per type (Rails)
• New way: Fewer Daemons (Pure Ruby)
• Daemon Slayer - A Multi Daemon that could do many different jobs, all at once.
38. Disk is the new Tape.
• Social Networking application profile has many O(ny) operations.
• Page requests have to happen in < 500 ms or users start to notice. Goal: 250-300 ms
• Web 2.0 isn’t possible without lots of RAM
• SSDs? What to do?
39. Caching
• We’re the real-time web, but there’s lots of caching opportunity. You should cache what you get from us.
• Most caching strategies rely on long TTLs (>60 s)
• Separate memcache pools for different data types to prevent eviction
• Optimized the Ruby gem to use libmemcached + FNV hash instead of Ruby + MD5
• Twitter now largest contributor to libmemcached
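To show why the hash swap matters (the actual change lived in the Ruby memcached gem and libmemcached, not in application code), a small FNV-1a (32-bit) implementation next to an MD5-derived hash for mapping keys to servers; the key and pool size are made up:

```python
# FNV-1a (32-bit) versus an MD5-derived hash for memcached key ->
# server mapping. FNV is a few arithmetic ops per byte; MD5 is a full
# cryptographic digest, which is overkill for cache routing.
import hashlib

FNV_OFFSET_BASIS_32 = 2166136261
FNV_PRIME_32 = 16777619

def fnv1a_32(data):
    h = FNV_OFFSET_BASIS_32
    for byte in bytearray(data):
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF
    return h

def md5_32(data):
    return int(hashlib.md5(data).hexdigest()[:8], 16)

key = b"status:12345"      # illustrative cache key
servers = 8                # illustrative pool size
print("FNV-1a -> server", fnv1a_32(key) % servers)
print("MD5    -> server", md5_32(key) % servers)
```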
40. MySQL
• Sharding large volumes of data is hard
• Replication delay and cache eviction produce inconsistent results to the end user.
• Locks create resource contention for popular data
41. MySQL Challenges
• Replication Delay
• Single threaded. Slow.
• Social Networking not good for RDBMS
• N x N relationships and social graph / tree traversal
• Disk issues (FS choice, noatime, scheduling algorithm)
42. Relational Databases not a Panacea
• Good for:
• Users, Relational Data, Transactions
• Bad:
• Queues. Polling operations. Social Graph.
• You don’t need ACID for everything.
43. Database Replication
• Major issues around the users and statuses tables
• Multiple functional masters (FRP, FWP)
• Make sure your code reads and writes to the right DBs. Reading from the master = slow death
• Monitor the DB. Find slow / poorly designed queries
• Kill long-running queries before they kill you (mkill)
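A minimal sketch of the read/write split those bullets are insisting on, with placeholder connection objects rather than Twitter's actual setup:

```python
# Route writes to the master and reads to replicas so the master never
# serves read traffic ("reading from the master = slow death").
# Connections are assumed to expose an execute(sql, params) method.
import random

class ReplicatedDB(object):
    def __init__(self, master, replicas):
        self.master = master        # single write connection
        self.replicas = replicas    # list of read-only connections

    def write(self, sql, params=()):
        # All INSERT/UPDATE/DELETE traffic goes to the master.
        return self.master.execute(sql, params)

    def read(self, sql, params=()):
        # SELECTs are spread across replicas, never the master.
        return random.choice(self.replicas).execute(sql, params)
```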
44. Flock
• Scalable Social Graph Store
• Sharding via Gizzard
• MySQL backend (many)
• 13 billion edges, 100K reads/second
• Open Source!
[Diagram: Flock → Gizzard → multiple MySQL shards]
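As a toy illustration of spreading graph edges over many MySQL backends (Gizzard's real routing uses configurable forwarding tables, so treat this purely as a sketch of the idea), edges can be placed on a shard chosen from the source user id:

```python
# Toy edge sharding: 'source follows dest' lands on the shard that owns
# the source user. Lists stand in for MySQL instances.
NUM_SHARDS = 16  # illustrative; the real store spans many MySQL hosts

def shard_for(source_user_id):
    return source_user_id % NUM_SHARDS

def store_edge(source_user_id, dest_user_id, shards):
    shards[shard_for(source_user_id)].append((source_user_id, dest_user_id))

shards = [[] for _ in range(NUM_SHARDS)]
store_edge(123, 456, shards)
print(shard_for(123), shards[shard_for(123)])
```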
45. Cassandra
• Originally written by Facebook
• Distributed Data Store
• @rk’s changes to Cassandra Open Sourced
• Currently double-writing into it
• Transitioning to 100% soon.
46. Lessons Learned
• Instrument everything. Start graphing early.
• Cache as much as possible
• Start working on scaling early.
• Don’t rely on memcache, and don’t rely on the database
• Don’t use Mongrel. Use Unicorn.