Planning to Fail #phpne13
- 8. 99.9% (three nines)
Downtime:
43.8 minutes per month
8.76 hours per year
- 9. 99.99% (four nines)
Downtime:
4.38 minutes per month
52.56 minutes per year
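The downtime figures on these two slides follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year and a month taken as one twelfth of that year (the function name is illustrative; a 30-day month gives slightly smaller monthly figures, such as 4.32 minutes for four nines):

```php
<?php
/**
 * Minutes of downtime permitted in a period at a given availability.
 * Illustrative helper, not part of the original talk.
 */
function allowedDowntimeMinutes(float $availabilityPercent, float $periodHours): float
{
    return (1 - $availabilityPercent / 100) * $periodHours * 60;
}

$hoursPerYear  = 365 * 24;           // 8760
$hoursPerMonth = $hoursPerYear / 12; // 730

foreach ([99.9, 99.99] as $nines) {
    printf(
        "%s%%: %.2f min/month, %.2f min/year\n",
        $nines,
        allowedDowntimeMinutes($nines, $hoursPerMonth),
        allowedDowntimeMinutes($nines, $hoursPerYear)
    );
}
// 99.9%: 43.80 min/month, 525.60 min/year (525.60 min = 8.76 hours)
// 99.99%: 4.38 min/month, 52.56 min/year
```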
- 18. • Launched in London
November 2011
• Now in 5 cities in 3 countries
(30%+ growth every month)
• A Hailo hail is accepted around
the world every 5 seconds
- 19. “.. Brooks [1] reveals that the complexity
of a software project grows as the square
of the number of engineers and Leveson
[17] cites evidence that most failures in
complex systems result from unexpected
inter-component interaction rather than
intra-component bugs, we conclude that
less machinery is (quadratically) better.”
http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf
- 20. • SOA (10+ services)
• AWS (3 regions, 9 AZs, lots of
instances)
• 10+ engineers building services
and you?
(hailo is hiring)
- 28. Service Service Service Service
each service does one job well
Service Oriented Architecture
- 29. • Fewer lines of code
• Fewer responsibilities
• Changes less frequently
• Can swap entire implementation
if needed
- 31. Service MySQL
MySQL running on different box
- 34. CRUD
Locking
MySQL Search
Analytics
ID generation
also queuing…
Separating concerns
- 35. At Hailo we look for technologies that are:
• Distributed
run on more than one machine
• Homogeneous
all nodes look the same
• Resilient
can cope with the loss of node(s) with no
loss of data
- 36. “There is no such thing as standby
infrastructure: there is stuff you
always use and stuff that won’t
work when you need it.”
http://blog.b3k.us/2012/01/24/some-rules.html
- 37. • Highly performant, scalable and
resilient data store
• Underpins much of what we do
at Hailo
• Makes multi-DC easy!
- 39. • Distributed, RESTful, Search
Engine built on top of Apache
Lucene
• Replaced basic foo LIKE '%bar%'
queries (so much better)
- 40. NSQ
• Realtime message processing
system designed to handle
billions of messages per day
• Fault tolerant, highly available
with reliable message delivery
guarantee
- 41. • Real time incremental analytics
platform, backed by Apache
Cassandra
• Powerful SQL-like interface
• Scalable and highly available
- 43. • All these technologies have
similar properties of distribution
and resilience
• They are designed to cope with
failure
• They are not broken by design
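- 47. class HailoMemcacheService {
    private $mc = null;
    private $servers = []; // server list; property name assumed for illustration

    // Every proxied call goes through the lazily created client
    public function __call($name, $args) {
        $mc = $this->getInstance();
        // do stuff
    }

    // Create and configure the client on first use only
    private function getInstance() {
        if ($this->mc === null) {
            $this->mc = new Memcached;
            $this->mc->addServers($this->servers);
        }
        return $this->mc;
    }
}
Lazy-init instances; connect on use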
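The lazy-init idea can be combined with graceful degradation, so that a cache failure costs a feature rather than the whole request. The following is a sketch under stated assumptions, not Hailo's implementation: `LazyCache`, the injected factory and the fake client are all hypothetical, and a return value of `false` is treated as a miss, mirroring `Memcached::get`.

```php
<?php
/**
 * Hypothetical wrapper: builds the client on first use and falls back
 * to a default value when the cache cannot be reached.
 */
final class LazyCache
{
    private $client = null;
    private $factory;

    public function __construct(callable $factory)
    {
        // Nothing is created or connected here.
        $this->factory = $factory;
    }

    public function get(string $key, $default = null)
    {
        try {
            if ($this->client === null) {
                $this->client = ($this->factory)();
            }
            $value = $this->client->get($key);
            return $value === false ? $default : $value;
        } catch (Throwable $e) {
            // A real service would record this failure in its metrics.
            return $default;
        }
    }
}

// Usage: a factory that always fails stands in for an unreachable cache.
$cache = new LazyCache(function () {
    throw new RuntimeException('cache unreachable');
});
echo $cache->get('driver:42', 'fallback'), "\n"; // prints "fallback"
```

Because a failed factory call leaves the client unset, the next call tries again, which suits a dependency that may recover.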
- 49. $this->mc = new Memcached;
$this->mc->addServers($s);
$this->mc->setOption(
Memcached::OPT_CONNECT_TIMEOUT,
$connectTimeout);
$this->mc->setOption(
Memcached::OPT_SEND_TIMEOUT,
$sendRecvTimeout);
$this->mc->setOption(
Memcached::OPT_RECV_TIMEOUT,
$sendRecvTimeout);
$this->mc->setOption(
Memcached::OPT_POLL_TIMEOUT,
$connectionPollTimeout);
Make sure timeouts are configured
- 51. “Fail Fast: Set aggressive timeouts
such that failing components
don’t make the entire system
crawl to a halt.”
http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
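The speaker notes suggest instrumenting operation times and aiming for roughly the 95th percentile as the timeout. A minimal sketch of that calculation, assuming latency samples in milliseconds and the nearest-rank percentile method (the function name and the sample data are invented for illustration):

```php
<?php
/**
 * Nearest-rank percentile of a list of numeric samples.
 * Illustrative helper; $p is a percentage in the range (0, 100].
 */
function percentile(array $samples, float $p): float
{
    if ($samples === [] || $p <= 0 || $p > 100) {
        throw new InvalidArgumentException('Need samples and 0 < p <= 100');
    }
    sort($samples);
    $rank = (int) ceil($p * count($samples) / 100);
    return (float) $samples[$rank - 1];
}

// Made-up latencies: mostly a few milliseconds, with one slow outlier.
$latenciesMs = [3, 2, 4, 3, 5, 4, 3, 6, 4, 5, 3, 4, 7, 5, 4, 3, 9, 4, 5, 250];

$timeoutMs = percentile($latenciesMs, 95);  // 9.0
$worstMs   = percentile($latenciesMs, 100); // 250.0
printf("timeout candidate: %.0f ms (worst observed: %.0f ms)\n", $timeoutMs, $worstMs);
```

The Memcached timeout options do not all share one unit, so it is worth checking the extension documentation before passing a millisecond value to any of them.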
- 54. • Kill memcache on box A,
measure impact on application
• Kill memcache on box B,
measure impact on application
All fine.. we’ve got this covered!
- 56. • Box A, running in AWS, locks up
• Any parts of application that
touch Memcache stop working
- 58. $ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j REJECT
$ php test-memcache.php
Working OK!
Packets rejected and source notified by ICMP. Expect fast fails.
- 59. $ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j DROP
$ php test-memcache.php
Working OK!
Packets silently dropped. Expect long time outs.
- 60. $ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 \
    -m state --state ESTABLISHED \
    -j DROP
$ php test-memcache.php
Hangs! Uh oh.
- 61. • When AWS instances hang they
appear to accept connections
but drop packets
• Bug!
https://bugs.launchpad.net/libmemcached/+bug/583031
- 63. RabbitMQ RabbitMQ RabbitMQ
HA cluster
AMQP (port 5672)
Service
- 64. $ iptables -A INPUT -i eth0 \
    -p tcp --dport 5672 \
    -m state --state ESTABLISHED \
    -j DROP
$ php test-rabbitmq.php
Fantastic! Block AMQP port, client times out
- 67. $ epmd -names
epmd: up and running on port
4369 with data:
name rabbit at port 60278
Each node listens on a port assigned by EPMD
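Because the node port is assigned rather than fixed, a failure test has to look it up before blocking it. A small sketch that extracts node names and ports from `epmd -names` output in the format shown above (the function name is hypothetical, and the line format is assumed to match this slide):

```php
<?php
/**
 * Parse "name <node> at port <port>" lines from `epmd -names` output.
 * Returns a map of node name => port. Illustrative helper only.
 */
function parseEpmdNames(string $output): array
{
    $ports = [];
    $pattern = '/^\s*name\s+(\S+)\s+at\s+port\s+(\d+)\s*$/m';
    if (preg_match_all($pattern, $output, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $ports[$match[1]] = (int) $match[2];
        }
    }
    return $ports;
}

$sample = "epmd: up and running on port 4369 with data:\n"
        . "name rabbit at port 60278\n";

print_r(parseEpmdNames($sample)); // maps "rabbit" to 60278
```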
- 69. $ iptables -A INPUT -i eth0 \
    -p tcp --dport 60278 \
    -m state --state ESTABLISHED \
    -j DROP
$ php test-rabbitmq.php
Hangs! Uh oh.
- 70. Mnesia('rabbit@dmzutilities03-global01-test'):
** ERROR ** mnesia_event got
{inconsistent_database,
running_partitioned_network,
'rabbit@dmzutilities01-global01-test'}
application: rabbitmq_management
exited: shutdown
type: temporary
RabbitMQ logs show partitioned network error; nodes shut down
- 72. while ($read < $n
&& !feof($this->sock->real_sock())
&& (false !== ($buf = fread(
$this->sock->real_sock(),
$n - $read)))) {
$read += strlen($buf);
$res .= $buf;
}
PHP library didn’t have any time limit on reading a frame
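One way to bound a read loop like this is to carry a deadline through it and wait with `stream_select` instead of blocking in `fread`. This is a sketch of the general technique, not the fix that was applied to the library; `readWithDeadline` and its parameters are hypothetical.

```php
<?php
/**
 * Read exactly $n bytes from a stream, giving up once $timeout seconds
 * have passed in total. Hypothetical helper, not a library API.
 */
function readWithDeadline($sock, int $n, float $timeout): string
{
    $deadline = microtime(true) + $timeout;
    $res = '';
    while (strlen($res) < $n) {
        $remaining = $deadline - microtime(true);
        if ($remaining <= 0) {
            throw new RuntimeException(
                sprintf('Timed out after %d of %d bytes', strlen($res), $n)
            );
        }
        $read = [$sock];
        $write = null;
        $except = null;
        $sec = (int) floor($remaining);
        $usec = (int) (($remaining - $sec) * 1000000);
        $ready = stream_select($read, $write, $except, $sec, $usec);
        if ($ready === false) {
            throw new RuntimeException('stream_select failed');
        }
        if ($ready === 0) {
            continue; // nothing arrived; the loop re-checks the deadline
        }
        $buf = fread($sock, $n - strlen($res));
        if ($buf === false || ($buf === '' && feof($sock))) {
            throw new RuntimeException('Connection closed mid-frame');
        }
        $res .= $buf;
    }
    return $res;
}
```

A caller can catch the exception, record the timeout, and either retry against another node or degrade the feature.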
- 76. • Hailo run a dedicated automated
test environment
• Powered by bash, JMeter and
Graphite
• Continuous automated testing
with failure simulations
- 82. “the best way to avoid
failure is to fail constantly.”
http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html
- 83. You should test for
failure
How does the software react?
How does the PHP client react?
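A failure test needs to record not just whether the client call succeeded, but how long it took to give up. A minimal sketch of such a check, assuming the operation under test is wrapped in a callable (`runWithinBudget` is a hypothetical name; it measures the call but cannot interrupt one that hangs forever, so a real harness would also need an outer process-level timeout):

```php
<?php
/**
 * Run an operation, capture any exception, and report whether it
 * finished within the time budget. Illustrative helper only.
 */
function runWithinBudget(callable $operation, float $budgetSeconds): array
{
    $start = microtime(true);
    $error = null;
    try {
        $operation();
    } catch (Throwable $e) {
        $error = get_class($e) . ': ' . $e->getMessage();
    }
    $elapsed = microtime(true) - $start;

    return [
        'elapsed'      => $elapsed,
        'error'        => $error,
        'withinBudget' => $elapsed <= $budgetSeconds,
    ];
}

// Usage: a sleep stands in for a client call made while a port is blocked.
$result = runWithinBudget(function () {
    usleep(200000); // simulate a 200 ms stall
}, 0.05);

echo $result['withinBudget'] ? "failed fast\n" : "too slow\n"; // prints "too slow"
```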
- 87. Thanks
Software used at Hailo
http://cassandra.apache.org/
http://zookeeper.apache.org/
http://www.elasticsearch.org/
http://www.acunu.com/acunu-analytics.html
https://github.com/bitly/nsq
https://github.com/davegardnerisme/cruftflake
https://github.com/davegardnerisme/nsqphp
Plus a load of other things I’ve not mentioned.
- 88. Further reading
Hystrix: Latency and Fault Tolerance for Distributed Systems
https://github.com/Netflix/Hystrix
Timelike: a network simulator
http://aphyr.com/posts/277-timelike-a-network-simulator
Notes on distributed systems for young bloods
http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
Stream de-duplication (relevant to NSQ)
http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/
ID generation in distributed systems
http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems
Editor's Notes
- I’m dave!
- I work at Hailo. This presentation draws on my experiences building Hailo into one of the world’s leading taxi companies.
- The title of my talk is “planning to fail”
- First PHP conf; tempting fate. Thought about this title, but sounds more like monitoring.
- This talk is more proactive than that. I'm talking about my experiences at Hailo building reliable web services by continually failing.
- Why do we care about reliability?
- Advantages
- But first, let’s rewind to the beginning
- The pure joy of inserting a php tag in the middle of an HTML table
- My website still follows this pattern. I’d like to think my website is quite reliable.
- My website is reliable, but simple. Doesn’t change very often.
- Hailo is complex!
- Hailo is growing.
- Key quote: less machinery is quadratically better.
- Hailo have a lot of machinery!
- Enter the chaos monkey… If you want to be good at something, practice often!
- How about the “reliable” VPC that runs my website?
- But not resilient; my website would not cope well with the chaos monkey approach.
- We have to choose our stack appropriately if we are going to go down the chaos monkey route.
- Hailo didn’t start out this way; but the PHP component did
- Splitting into an SOA. Makes it much easier to change bits of code since each service does less, has fewer lines of code and changes less frequently. Also makes it easier to work in larger teams.
- Advantages
- Here’s one of our services… is this reliable?
- But Hailo is going global
- At Hailo we are splitting out the features of MySQL and using different technologies where appropriate
- Don’t pick things that are broken by design
- We remove services from the critical path using lazy-init pattern
- We want to define timeouts so that under failure conditions we don’t hang forever
- Instrumenting operation times: mean, upper 90th percentile, upper bound (highest observed value)
- Let’s aim for the 95th percentile as our timeout, but instrument timeouts when they happen so that we know what’s going on
- Yay!
- Boo
- Boo
- This was after we fixed the bug, but we had the timeouts configured badly.
- Better: memcache failure has less impact now; some features might be degraded, but the minimal viable service now works
- Runnable .md based system tests