Scaling social games
“the order of magnitude challenge”
  Paolo Negri @hungryblank
Order of magnitude
DAU: daily active users
[Chart: DAU from July to December, y-axis from 0 to 1,000,000]
Social Games
Flash client (game)    HTTP API




                      http://www.flickr.com/photos/stars6/4381851322
Social Games
Flash client


               • Game actions need to be
                 persisted and validated

               • 1 API call every few secs
Social Games
                          HTTP API



• 5000 HTTP reqs/sec
• more than 90% writes
• 60K queries/sec
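The figures on this slide imply an average queries/request ratio. The arithmetic below is a minimal sketch, assuming the three numbers describe the same load; the variable names are illustrative only.

```python
# Figures taken from the slide.
http_requests_per_sec = 5_000
queries_per_sec = 60_000

# Average number of SQL queries triggered by one HTTP request.
queries_per_request = queries_per_sec / http_requests_per_sec

# "More than 90% writes" means at least this many write requests per second.
min_write_requests_per_sec = http_requests_per_sec * 90 // 100

print(queries_per_request, min_write_requests_per_sec)
```

Under that assumption each request triggers 12 queries on average, which is the ratio the later slides set out to reduce.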


                         http://www.flickr.com/photos/stars6/4381851322
July 2010
• ~170,000 daily users
• Plain Ruby on Rails app
• Persistency 100% SQL
[Diagram: stack of HAproxy, Ruby on Rails, MySQL]
July 2010
• 1 haproxy server
• multiple RoR servers
• 4 mysql servers (sharded dataset)
[Diagram: stack of HAproxy, Ruby on Rails, MySQL]
July 2010
[Diagram: stack of HAproxy, Ruby on Rails, MySQL; “Slow down” marked at MySQL]
July 2010
[Diagram: stack of HAproxy, Ruby on Rails, MySQL; “High queries/request ratio” marked at Ruby on Rails, “Slow down” marked at MySQL]
Queries/request


• Which code is triggering extra queries?
• Why is the ratio lower in our test environment
  than in live?
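One way to answer both questions is to count queries per request and tag each query with the layer that issued it. The sketch below is an assumption about how such a counter could look, not the instrumentation used in the talk; `QueryCounter`, `record` and the source labels are hypothetical names.

```python
from collections import Counter


class QueryCounter:
    """Counts queries per request and per layer that issued them."""

    def __init__(self):
        self.per_request = Counter()
        self.per_source = Counter()

    def record(self, request_id, source):
        # Call this from a (hypothetical) hook that runs before each SQL query.
        self.per_request[request_id] += 1
        self.per_source[source] += 1

    def ratio(self):
        # Average number of queries per request seen so far.
        return sum(self.per_request.values()) / len(self.per_request)


counter = QueryCounter()
for request_id in ("req-1", "req-2"):
    counter.record(request_id, "application")
    counter.record(request_id, "plugin")
    counter.record(request_id, "plugin")

print(counter.ratio(), counter.per_source.most_common(1))
```

Running the same counter in test and in live makes the two ratios directly comparable, and the per-source totals point at the layer responsible for the difference.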
Queries/request
       Running code of live system


Application   Plugins   Ruby on Rails
Queries/request
Source of extra queries: plugins
• Sharding plugin “breaks” the standard Rails query cache
• Flash wire protocol plugin generates extra queries
Plugins

• Deceiving “feature for free”
• Might provide the right feature
• But might not meet your scaling needs
Plugins

• Instant legacy code, even for new projects!
• Once added it’s your code
• Even if it’s maintained, will it follow your
  needs?
Plugins


• Assess code quality when you add it
• Can you afford to maintain/change it?
Plugins


• We fixed it!
• Queries cut by up to 40% on some requests
Early August
[Chart: query time in ms (y-axis 0 to 30), from 6:00 to 8:10]
• The MySQL hiccup
• Every 70 mins query time spikes x7
Hiccup causes
    Who is periodically blocking MySQL?

• Code (app + plugins + Rails)?
• Some periodic job?
• The devil (AWS)?
Hiccup quick fix
• We shard out the top queried table
  (40% of all queries)

                 MySQL servers

      shard 1   shard 2   shard 3   shard 4
Hiccup quick fix
• We shard out the top queried table
  (40% of all queries)

     Top table      Top table      Top table      Top table
      shard 1        shard 2        shard 3        shard 4



    Other tables   Other tables   Other tables   Other tables
      shard 1        shard 2        shard 3        shard 4
Hiccup quick fix
• MySQL likes it
• “top table” shards will go a long way in the
  scaling process

     Top table      Top table      Top table      Top table
      shard 1        shard 2        shard 3        shard 4



    Other tables   Other tables   Other tables   Other tables
      shard 1        shard 2        shard 3        shard 4
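The layout in the diagram can be read as a routing rule: the top table gets its own group of four servers, and every other table stays on the original group. The sketch below assumes rows are sharded by user id with a simple modulo, which the slides do not state; `TOP_TABLE`, `server_for` and the server names are hypothetical.

```python
NUM_SHARDS = 4
TOP_TABLE = "top_table"  # hypothetical name for the most queried table


def server_for(table, user_id):
    """Pick the server holding this user's rows for the given table."""
    # Assumption: shard number derived from the user id by modulo.
    shard = user_id % NUM_SHARDS + 1
    group = "top" if table == TOP_TABLE else "other"
    return f"{group}-shard-{shard}"


print(server_for("top_table", 123))  # top-shard-4
print(server_for("items", 123))      # other-shard-4
```

With this rule the 40% of queries that hit the top table never reach the servers holding the other tables.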
Hiccup causes
    Who is periodically blocking MySQL?

• Code (app + plugins + Rails)?
• Some periodic job?
• The devil (AWS)?
             None of the Above
Hiccup real cause

• MySQL internal behavior that emerges at high volume
• MySQL flushes its buffer
• Under heavy write IO the flush is blocking
Hiccup solution

• Percona MySQL patches (XtraDB) avoid
  blocking behavior
• Query time profile gets smooth
• IO capacity limit now manifests as gradual
  performance decay
Write through cache

• Memcache in front of MySQL
• Evaluated before sharding
• Was discarded
• Because of our read/write ratio
Write through cache


 90% of the time we read data
     in order to modify it
Write through cache

  It means 90% of the time

      1. read cache
      2. write cache
      3. write SQL
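The three steps above can be sketched with two dictionaries standing in for memcache and MySQL. This is a minimal illustration under those assumptions, not the library that was evaluated; `WriteThroughStore` and its method names are hypothetical.

```python
from collections import Counter


class WriteThroughStore:
    def __init__(self):
        self.cache = {}       # stand-in for memcache
        self.sql = {}         # stand-in for a MySQL table
        self.ops = Counter()  # how often each backend is touched

    def read(self, key):
        self.ops["read cache"] += 1
        if key not in self.cache:
            # Cache miss: fall back to SQL and fill the cache.
            self.ops["read SQL"] += 1
            self.ops["write cache"] += 1
            self.cache[key] = self.sql.get(key, 0)
        return self.cache[key]

    def write(self, key, value):
        # Write through: update the cache and SQL on every write.
        self.ops["write cache"] += 1
        self.cache[key] = value
        self.ops["write SQL"] += 1
        self.sql[key] = value

    def read_modify_write(self, key, change):
        self.write(key, self.read(key) + change)


store = WriteThroughStore()
store.write("user:123:coins", 100)  # warm the cache
store.ops.clear()
for _ in range(10):
    store.read_modify_write("user:123:coins", 1)

print(dict(store.ops))
```

Ten read-modify-write cycles still produce ten SQL writes: the cache removes the SQL reads but none of the write load, which is the part that was hurting.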
Write through cache
Bound to:
Read heavy
• memcache perfs
Write heavy
• MySQL write (unless async)
• Write through lib optimized for writes?
MySQL

• Sharding SQL is a painful way to scale
• Data migrations at high load imply
  downtime
• ACID benefits all lost because of sharding
  or in the name of performance
Redis
• A persistent cache
• Fast: 60,000 qps on AWS hardware
• Interesting data structures, not only KV
• Already some small scale experience in
  house
Redis adoption

• Which data to start from?
• How do we migrate without downtime?
• Which lib to map Ruby objects to Redis structures?
Redis adoption
• Which data to start from?
• Best data fit for Redis hashes
• 3rd most queried table
• a collection of integer fields that need only
  increment / decrement
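A Redis hash fits this data because the `HINCRBY` command increments or decrements one integer field in place, without reading the row first. The sketch below mimics that behavior with an in-memory stand-in so that it runs without a Redis server; `InMemoryHashes`, the key and the field names are hypothetical.

```python
from collections import defaultdict


class InMemoryHashes:
    """Tiny stand-in for the Redis hash commands HINCRBY and HGET."""

    def __init__(self):
        self._hashes = defaultdict(dict)

    def hincrby(self, name, field, amount=1):
        # Like HINCRBY: a missing field counts as 0, the new value is returned.
        new_value = self._hashes[name].get(field, 0) + amount
        self._hashes[name][field] = new_value
        return new_value

    def hget(self, name, field):
        # Unlike real Redis, this returns an int instead of a string.
        return self._hashes[name].get(field)


store = InMemoryHashes()
key = "user:123:counters"      # hypothetical key layout: one hash per user
store.hincrby(key, "coins", 50)
store.hincrby(key, "coins", -20)  # a decrement is a negative increment
store.hincrby(key, "energy", 5)
```

Each game action becomes a single write command, with no read-modify-write cycle in the application.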
Redis adoption
• How do we migrate without downtime?
• Migrate one user at a time
• Use a Redis set to keep track of migrated /
  non-migrated users
• No downtime, transparent to users
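The per-user migration can be sketched as a check performed on each request: if the user is not yet in the migrated set, copy the data over first. This is a simplified illustration with in-memory stand-ins for MySQL and Redis; it ignores concurrent requests for the same user, and `LazyMigrator` and `counters_for` are hypothetical names.

```python
class LazyMigrator:
    def __init__(self, mysql_rows):
        self.mysql_rows = mysql_rows  # user_id -> counters, stand-in for the MySQL table
        self.redis_hashes = {}        # user_id -> counters, stand-in for Redis hashes
        self.migrated = set()         # stand-in for the Redis set of migrated users

    def counters_for(self, user_id):
        if user_id not in self.migrated:
            # First request after rollout: copy this user's row to Redis.
            self.redis_hashes[user_id] = dict(self.mysql_rows[user_id])
            self.migrated.add(user_id)
        # From now on this user is served from Redis only.
        return self.redis_hashes[user_id]


migrator = LazyMigrator({123: {"coins": 100}, 456: {"coins": 7}})
migrator.counters_for(123)["coins"] += 5
```

Only users who actually play get migrated, so the extra load is spread over normal traffic instead of arriving in one migration window.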
Redis adoption
• How do we migrate without downtime?
[Diagram: User 123, RoR Server, MySQL and Redis]
Redis adoption
• How do we migrate without downtime?
[Diagram: for User 123, the RoR Server reads the original data from MySQL]
Redis adoption
• How do we migrate without downtime?
[Diagram: for User 123, the RoR Server writes the migrated data to Redis]
Redis adoption
• How do we migrate without downtime?

• Migration might never complete
• Use SQL + Redis set information to generate
  a final batch migration
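Users who never come back are never migrated on request, so the final batch is the difference between all user ids in SQL and the members of the Redis set. A minimal sketch, assuming both id lists have already been fetched; `users_left_to_migrate` is a hypothetical helper.

```python
def users_left_to_migrate(all_user_ids, migrated_ids):
    # all_user_ids: ids read from SQL; migrated_ids: members of the Redis set.
    return sorted(set(all_user_ids) - set(migrated_ids))


pending = users_left_to_migrate([1, 2, 3, 4, 5], {2, 4})
print(pending)  # [1, 3, 5]
```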
Redis 1st result

10% of the query load from 4 MySQL servers
is moved to 1 Redis server
Redis server load is 0.05
Redis


• Becomes the tool to use
• Migration plan for all write intensive data
• Migrate one “class” at a time
Redis honeymoon end

• Memory usage grows more than data
• Snapshot to disk causes spikes in query
  time
• Starting new slaves eats memory on the
  master node
Redis honeymoon end
           Russian Roulette Feeling



• Redis machine sized with overabundant
  RAM
• Rigorous slave/master starting plan
Redis


• Redis team acknowledges persistency /
  replication problems
• Redis 2.4 diskstore plan starts
1.000.000


And counting...
1.000.000
[Diagram: stack of HAproxy, Ruby on Rails, Persistency; “painless scaling” marked at HAproxy]
1.000.000
[Diagram: stack of HAproxy, Ruby on Rails, Persistency; “just add servers as load grows” marked at Ruby on Rails]
1.000.000
[Diagram: stack of HAproxy, Ruby on Rails, Persistency; “Painful and troublesome” marked at Persistency]
Infrastructure

• AWS
• Chef - through Scalarium
• Ganglia
Thanks
  ...
wooga
        Is looking for
Business Intelligence Engineer

   http://wooga.com/jobs
