Planning
to fail

@davegardnerisme
#phpne13
dave
the taxi app
Planning
 to fail
Planning
for failure
Planning
 to fail
Why?


http://en.wikipedia.org/wiki/High_availability
99.9%        (three nines)

Downtime:

43.8 minutes per month
8.76 hours per year
99.99%       (four nines)

Downtime:

4.32 minutes per month
52.56 minutes per year
99.999% (five nines)

Downtime:

25.9 seconds per month
5.26 minutes per year
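The arithmetic behind these figures, using an average month of 365.25/12 days:

unavailability = 1 − 0.999 = 0.001
per year:  0.001 × 365.25 × 24 h ≈ 8.76 hours
per month: 8.76 h × 60 / 12 ≈ 43.8 minutes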
www.whoownsmyavailability.com



             ?
www.whoownsmyavailability.com



          YOU
The beginning
<?php
My website: single VPS running PHP + MySQL
No growth, low volume, simple functionality, one engineer (me!)
Large growth, high volume, complex functionality, lots of engineers
• Launched in London
  November 2011

• Now in 5 cities in 3 countries
  (30%+ growth every month)

• A Hailo hail is accepted around
  the world every 5 seconds
“.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.”

http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf
• SOA (10+ services)

• AWS (3 regions, 9 AZs, lots of
  instances)

• 10+ engineers building services
                and you?
                (hailo is hiring)
Our overall
reliability is in
    danger
Embracing failure

(a coping strategy)
VPS
(running PHP+MySQL)




                      reliable?
Reliable
  !==
Resilient
Choosing a stack
“Hailo”
(running PHP+MySQL)




                      reliable?
Service    Service         Service        Service


      each service does one job well



          Service Oriented Architecture
• Fewer lines of code

• Fewer responsibilities

• Changes less frequently

• Can swap entire implementation
  if needed
Service
(running PHP+MySQL)




                      reliable?
Service                     MySQL




   MySQL running on different box
MySQL
Service
                            MySQL



 MySQL running in Multi-Master mode
Going global
MySQL: CRUD
       Locking
       Search
       Analytics
       ID generation
       also queuing…

        Separating concerns
At Hailo we look for technologies that are:

• Distributed
  run on more than one machine

• Homogenous
  all nodes look the same

• Resilient
  can cope with the loss of node(s) with no
  loss of data
“There is no such thing as standby
infrastructure: there is stuff you
always use and stuff that won’t
work when you need it.”




http://blog.b3k.us/2012/01/24/some-rules.html
Cassandra

• Highly performant, scalable and
  resilient data store

• Underpins much of what we do
  at Hailo

• Makes multi-DC easy!
ZooKeeper
• Highly reliable distributed
  coordination

• We implement locking and
  leadership election on top of ZK
  and use sparingly
ElasticSearch

• Distributed, RESTful, Search
  Engine built on top of Apache
  Lucene

• Replaced basic foo LIKE ‘%bar%’
  queries (so much better)
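For a flavour of what replaced those queries, a hedged sketch against ElasticSearch's REST API; the index name, field and host are invented.

// Illustrative full-text query replacing foo LIKE '%bar%'
$query = json_encode([
    'query' => ['match' => ['name' => 'bar']]
]);

$ch = curl_init('http://localhost:9200/places/_search');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $query);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT_MS, 500); // fail fast
$results = curl_exec($ch);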
NSQ
• Realtime message processing
  system designed to handle
  billions of messages per day

• Fault tolerant, highly available
  with reliable message delivery
  guarantee
Acunu Analytics

• Real time incremental analytics
  platform, backed by Apache
  Cassandra

• Powerful SQL-like interface

• Scalable and highly available
Cruftflake
• Distributed ID generation with
  no coordination required

• Rock solid
• All these technologies have
  similar properties of distribution
  and resilience

• They are designed to cope with
  failure

• They are not broken by design
Lessons learned
Minimise the
critical path
What is the minimum viable service?
class HailoMemcacheService {
    private $mc = null;

    public function __call($method, $args) {
        $mc = $this->getInstance();
        // do stuff
    }

    private function getInstance() {
        if ($this->mc === null) {
            $this->mc = new Memcached;
            $this->mc->addServers($s); // $s = memcache server list
        }
        return $this->mc;
    }
}

Lazy-init instances; connect on use
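Constructing the service object does no network I/O, so code paths that never touch Memcache never pay for (or hang on) a connection. A tiny usage sketch; the get() call is hypothetical, routed through __call():

$service = new HailoMemcacheService; // no connection made here
$result = $service->get('some-key'); // first real call connects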
Configure clients carefully
$this->mc = new Memcached;
$this->mc->addServers($s);

// connect timeout is in milliseconds
$this->mc->setOption(
    Memcached::OPT_CONNECT_TIMEOUT,
    $connectTimeout);
// send/recv timeouts are in microseconds
$this->mc->setOption(
    Memcached::OPT_SEND_TIMEOUT,
    $sendRecvTimeout);
$this->mc->setOption(
    Memcached::OPT_RECV_TIMEOUT,
    $sendRecvTimeout);
// poll timeout is in milliseconds
$this->mc->setOption(
    Memcached::OPT_POLL_TIMEOUT,
    $connectionPollTimeout);

Make sure timeouts are configured
[Latency graph; arrow asking “here?”]

Choose timeouts based on data
“Fail Fast: Set aggressive timeouts
such that failing components
don’t make the entire system
crawl to a halt.”




http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
[Latency graph; arrow at the tail of the distribution]

95th percentile
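One way to turn that measured data into configured timeouts, as a minimal sketch: it assumes per-operation latencies are already being recorded in milliseconds (Hailo pull theirs from instrumentation graphs), and the helper is illustrative.

// Illustrative: derive a timeout from observed latencies (in ms)
function percentile(array $latenciesMs, $p) {
    sort($latenciesMs);
    $idx = (int) ceil(($p / 100) * count($latenciesMs)) - 1;
    return $latenciesMs[max(0, $idx)];
}

// e.g. set the send/recv timeout at the 95th percentile,
// converting ms to the microseconds Memcached expects
$sendRecvTimeout = percentile($observedLatenciesMs, 95) * 1000;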
Test
• Kill memcache on box A,
  measure impact on application

• Kill memcache on box B,
  measure impact on application


All fine.. we’ve got this covered!
FAIL
• Box A, running in AWS, locks up

• Any parts of application that
  touch Memcache stop working
Things fail in
exotic ways
$ iptables -A INPUT -i eth0 \
     -p tcp --dport 11211 -j REJECT



    $ php test-memcache.php

    Working OK!




Packets rejected and source notified by ICMP. Expect fast fails.
$ iptables -A INPUT -i eth0 \
 -p tcp --dport 11211 -j DROP



$ php test-memcache.php

Working OK!




Packets silently dropped. Expect long timeouts.
$ iptables -A INPUT -i eth0 \
 -p tcp --dport 11211 \
 -m state --state ESTABLISHED \
 -j DROP



$ php test-memcache.php




           Hangs! Uh oh.
• When AWS instances hang they
  appear to accept connections
  but drop packets

• Bug!

https://bugs.launchpad.net/libmemcached/+bug/583031
Fix, rinse, repeat
RabbitMQ     RabbitMQ    RabbitMQ


                          HA cluster

      AMQP (port 5672)


 Service
$ iptables -A INPUT -i eth0 \
 -p tcp --dport 5672 \
 -m state --state ESTABLISHED \
 -j DROP



$ php test-rabbitmq.php




  Fantastic! Block AMQP port, client times out
FAIL
“RabbitMQ clusters do not
tolerate network partitions
well.”




http://www.rabbitmq.com/partitions.html
$ epmd -names
epmd: up and running on port
4369 with data:
name rabbit at port 60278




 Each node listens on a port assigned by EPMD
$ iptables -A INPUT -i eth0 \
 -p tcp --dport 60278 \
 -m state --state ESTABLISHED \
 -j DROP



$ php test-rabbitmq.php




           Hangs! Uh oh.
Mnesia('rabbit@dmzutilities03-global01-test'):
    ** ERROR ** mnesia_event got
    {inconsistent_database,
    running_partitioned_network,
    'rabbit@dmzutilities01-global01-test'}

application: rabbitmq_management
exited: shutdown
type: temporary

RabbitMQ logs show a partitioned network error; nodes shut down
while ($read < $n
    && !feof($this->sock->real_sock())
    && (false !== ($buf = fread(
        $this->sock->real_sock(),
        $n - $read)))) {
    $read += strlen($buf);
    $res .= $buf;
}




  PHP library didn’t have any time limit on reading a frame
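The shape of the fix, as a sketch: bound the total time spent reading one frame instead of trusting feof()/fread() to return. This assumes a plain PHP stream socket; the real client wraps its socket, so names differ.

// Illustrative: enforce a deadline across the whole frame read
$deadline = microtime(true) + 2.0;     // total budget for this frame
stream_set_timeout($sock, 0, 200000);  // cap each fread() at 200ms

while ($read < $n) {
    if (microtime(true) >= $deadline) {
        throw new Exception('timed out reading frame');
    }
    $buf = fread($sock, $n - $read);
    if ($buf === false || feof($sock)) {
        throw new Exception('socket error or EOF mid-frame');
    }
    if ($buf === '') {
        continue; // per-read timeout expired; loop re-checks deadline
    }
    $read += strlen($buf);
    $res .= $buf;
}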
Fix, rinse, repeat
It would be
nice if we could
 automate this
Automate!
• Hailo run a dedicated automated
  test environment

• Powered by bash, JMeter and
  Graphite

• Continuous automated testing
  with failure simulations
Fix attempt 1: bad timeouts configured
Fix attempt 2: better timeouts
Simulate in
system tests
Simulate failure

Assert monitoring endpoint picks this up

Assert features still work
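In the spirit of the deck's test-*.php scripts, a sketch of one such system test; every endpoint, path and JSON key here is hypothetical.

// Hypothetical failure-simulation system test
exec('iptables -A INPUT -i eth0 -p tcp --dport 11211 -j DROP');
sleep(5); // give the failure time to bite

// 1. the monitoring endpoint should notice the dead dependency
$status = json_decode(file_get_contents('http://localhost/status'), true);
assert($status['memcache'] === 'failed');

// 2. the feature should still work: degraded, not dead
assert(file_get_contents('http://localhost/api/jobs') !== false);

// clean up
exec('iptables -D INPUT -i eth0 -p tcp --dport 11211 -j DROP');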
In conclusion
“the best way to avoid
failure is to fail constantly.”




http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html
You should test for
failure

How does the software react?
How does the PHP client react?
Automation makes
continuous failure
testing feasible
Systems that cope well
with failure are easier
to operate
TIMED BLOCK ALL
THE THINGS
Thanks


Software used at Hailo

http://cassandra.apache.org/
http://zookeeper.apache.org/
http://www.elasticsearch.org/
http://www.acunu.com/acunu-analytics.html
https://github.com/bitly/nsq
https://github.com/davegardnerisme/cruftflake
https://github.com/davegardnerisme/nsqphp

Plus a load of other things I’ve not mentioned.
Further reading
Hystrix: Latency and Fault Tolerance for Distributed Systems
https://github.com/Netflix/Hystrix

Timelike: a network simulator
http://aphyr.com/posts/277-timelike-a-network-simulator

Notes on distributed systems for young bloods
http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/

Stream de-duplication (relevant to NSQ)
http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/

ID generation in distributed systems
http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems


Editor's Notes

  1. I’m dave!
  2. I work at Hailo. This presentation draws on my experiences building Hailo into one of the world’s leading taxi companies.
  3. The title of my talk is “planning to fail”
  4. First PHP conf; tempting fate. Thought about this title, but sounds more like monitoring.
  5. This talk more pro-active than that. Talking about my experiences at Hailo building reliable web services by continually failing.
  6. Why do we care about reliability?
  7. Advantages
  8. Advantages
  9. Advantages
  10. Advantages
  11. Advantages
  12. But first, let’s rewind to the beginning
  13. The pure joy of inserting a php tag in the middle of an HTML table
  14. My website still follows this pattern. I’d like to think my website is quite reliable.
  15. My website is reliable, but simple. Doesn’t change very often.
  16. Hailo is complex!
  17. Hailo is growing.
  18. Key quote: less machinery is quadratically better.
  19. Hailo have a lot of machinery!
  20. Enter the chaos monkey… If you want to be good at something, practice often!
  21. How about the “reliable” VPS that runs my website?
  22. But not resilient; my website would not cope well with the chaos monkey approach.
  23. We have to choose our stack appropriately if we are going to go down the chaos monkey route.
  24. Hailo didn’t start out this way; but the PHP component did
  25. Splitting into an SOA. Makes it much easier to change bits of code since each service does less, has fewer lines of code and changes less frequently. Also makes it easier to work in larger teams.
  26. Advantages
  27. Here’s one of our services… is this reliable?
  28. But Hailo is going global
  29. At Hailo we are splitting out the features of MySQL and using different technologies where appropriate
  30. Don’t pick things that are broken by design
  31. We remove services from the critical path using lazy-init pattern
  32. We want to define timeouts so that under failure conditions we don’t hang forever
  33. Instrumenting operation times – mean, upper 90th, upper bound (highest observed value)
  34. Let’s aim for 95th percentile as our timeout – but instrument when we do have timeouts so that we know what’s going on
  35. Yay!
  36. Boo
  37. Boo
  38. This was after we fixed the bug, but we had the timeouts configured badly.
  39. Better – memcache failure has less impact now; some features might be degraded, but the minimal viable service now works
  40. Runnable .md-based system tests