Planning
to fail

@davegardnerisme
#phpne13
dave
the taxi app
Planning
 to fail
Planning
for failure
Planning
 to fail
Why?


http://en.wikipedia.org/wiki/High_availability
99.9%        (three nines)

Downtime:

43.8 minutes per month
8.76 hours per year
99.99%       (four nines)

Downtime:

4.32 minutes per month
52.56 minutes per year
99.999% (five nines)

Downtime:

25.9 seconds per month
5.26 minutes per year
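The arithmetic behind these figures, using an average month of 365.25/12 days:

unavailability = 1 − 0.999 = 0.001
per year:  0.001 × 365.25 × 24 h ≈ 8.76 hours
per month: 8.76 h × 60 / 12 ≈ 43.8 minutes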
www.whoownsmyavailability.com



             ?
www.whoownsmyavailability.com



          YOU
The beginning
<?php
My website: single VPS running PHP + MySQL
No growth, low volume, simple functionality, one engineer (me!)
Large growth, high volume, complex functionality, lots of engineers
• Launched in London
  November 2011

• Now in 5 cities in 3 countries
  (30%+ growth every month)

• A Hailo hail is accepted around
  the world every 5 seconds
“.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.”

http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf
• SOA (10+ services)

• AWS (3 regions, 9 AZs, lots of
  instances)

• 10+ engineers building services
                and you?
                (hailo is hiring)
Our overall
reliability is in
    danger
Embracing failure

(a coping strategy)
VPS
(running PHP+MySQL)




                      reliable?
Reliable
  !==
Resilient
Choosing a stack
“Hailo”
(running PHP+MySQL)




                      reliable?
Service    Service         Service        Service


      each service does one job well



          Service Oriented Architecture
• Fewer lines of code

• Fewer responsibilities

• Changes less frequently

• Can swap entire implementation
  if needed
Service
(running PHP+MySQL)




                      reliable?
Service                     MySQL




   MySQL running on different box
MySQL
Service
                            MySQL



 MySQL running in Multi-Master mode
Going global
MySQL: CRUD
       Locking
       Search
       Analytics
       ID generation
       also queuing…

        Separating concerns
At Hailo we look for technologies that are:

• Distributed
  run on more than one machine

• Homogenous
  all nodes look the same

• Resilient
  can cope with the loss of node(s) with no
  loss of data
“There is no such thing as standby
infrastructure: there is stuff you
always use and stuff that won’t
work when you need it.”




http://blog.b3k.us/2012/01/24/some-rules.html
Cassandra

• Highly performant, scalable and
  resilient data store

• Underpins much of what we do
  at Hailo

• Makes multi-DC easy!
ZooKeeper
• Highly reliable distributed
  coordination

• We implement locking and
  leadership election on top of ZK
  and use sparingly
ElasticSearch

• Distributed, RESTful, Search
  Engine built on top of Apache
  Lucene

• Replaced basic foo LIKE ‘%bar%’
  queries (so much better)
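For a flavour of what replaced those queries, a hedged sketch against ElasticSearch's REST API; the index name, field and host are invented.

// Illustrative full-text query replacing foo LIKE '%bar%'
$query = json_encode([
    'query' => ['match' => ['name' => 'bar']]
]);

$ch = curl_init('http://localhost:9200/places/_search');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $query);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT_MS, 500); // fail fast
$results = curl_exec($ch);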
NSQ
• Realtime message processing
  system designed to handle
  billions of messages per day

• Fault tolerant, highly available
  with reliable message delivery
  guarantee
Acunu Analytics

• Real time incremental analytics
  platform, backed by Apache
  Cassandra

• Powerful SQL-like interface

• Scalable and highly available
Cruftflake
• Distributed ID generation with
  no coordination required

• Rock solid
• All these technologies have
  similar properties of distribution
  and resilience

• They are designed to cope with
  failure

• They are not broken by design
Lessons learned
Minimise the
critical path
What is the minimum viable service?
class HailoMemcacheService {
    private $mc = null;

    public function __call($method, $args) {
        $mc = $this->getInstance();
        // do stuff
    }

    private function getInstance() {
        if ($this->mc === null) {
            $this->mc = new Memcached;
            $this->mc->addServers($s); // $s = memcache server list
        }
        return $this->mc;
    }
}

Lazy-init instances; connect on use
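Constructing the service object does no network I/O, so code paths that never touch Memcache never pay for (or hang on) a connection. A tiny usage sketch; the get() call is hypothetical, routed through __call():

$service = new HailoMemcacheService; // no connection made here
$result = $service->get('some-key'); // first real call connects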
Configure clients carefully
$this->mc = new Memcached;
$this->mc->addServers($s);

// connect timeout is in milliseconds
$this->mc->setOption(
    Memcached::OPT_CONNECT_TIMEOUT,
    $connectTimeout);
// send/recv timeouts are in microseconds
$this->mc->setOption(
    Memcached::OPT_SEND_TIMEOUT,
    $sendRecvTimeout);
$this->mc->setOption(
    Memcached::OPT_RECV_TIMEOUT,
    $sendRecvTimeout);
// poll timeout is in milliseconds
$this->mc->setOption(
    Memcached::OPT_POLL_TIMEOUT,
    $connectionPollTimeout);

Make sure timeouts are configured
[Latency graph; arrow asking “here?”]

Choose timeouts based on data
“Fail Fast: Set aggressive timeouts
such that failing components
don’t make the entire system
crawl to a halt.”




http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
[Latency graph; arrow at the tail of the distribution]

95th percentile
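One way to turn that measured data into configured timeouts, as a minimal sketch: it assumes per-operation latencies are already being recorded in milliseconds (Hailo pull theirs from instrumentation graphs), and the helper is illustrative.

// Illustrative: derive a timeout from observed latencies (in ms)
function percentile(array $latenciesMs, $p) {
    sort($latenciesMs);
    $idx = (int) ceil(($p / 100) * count($latenciesMs)) - 1;
    return $latenciesMs[max(0, $idx)];
}

// e.g. set the send/recv timeout at the 95th percentile,
// converting ms to the microseconds Memcached expects
$sendRecvTimeout = percentile($observedLatenciesMs, 95) * 1000;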
Test
• Kill memcache on box A,
  measure impact on application

• Kill memcache on box B,
  measure impact on application


All fine.. we’ve got this covered!
FAIL
• Box A, running in AWS, locks up

• Any parts of application that
  touch Memcache stop working
Things fail in
exotic ways
$ iptables -A INPUT -i eth0 \
     -p tcp --dport 11211 -j REJECT



    $ php test-memcache.php

    Working OK!




Packets rejected and source notified by ICMP. Expect fast fails.
$ iptables -A INPUT -i eth0 \
 -p tcp --dport 11211 -j DROP



$ php test-memcache.php

Working OK!




Packets silently dropped. Expect long timeouts.
$ iptables -A INPUT -i eth0 \
 -p tcp --dport 11211 \
 -m state --state ESTABLISHED \
 -j DROP



$ php test-memcache.php




           Hangs! Uh oh.
• When AWS instances hang they
  appear to accept connections
  but drop packets

• Bug!

https://bugs.launchpad.net/libmemcached/+bug/583031
Fix, rinse, repeat
RabbitMQ     RabbitMQ    RabbitMQ


                          HA cluster

      AMQP (port 5672)


 Service
$ iptables -A INPUT -i eth0 \
 -p tcp --dport 5672 \
 -m state --state ESTABLISHED \
 -j DROP



$ php test-rabbitmq.php




  Fantastic! Block AMQP port, client times out
FAIL
“RabbitMQ clusters do not
tolerate network partitions
well.”




http://www.rabbitmq.com/partitions.html
$ epmd -names
epmd: up and running on port
4369 with data:
name rabbit at port 60278




 Each node listens on a port assigned by EPMD
$ iptables -A INPUT -i eth0 \
 -p tcp --dport 60278 \
 -m state --state ESTABLISHED \
 -j DROP



$ php test-rabbitmq.php




           Hangs! Uh oh.
Mnesia('rabbit@dmzutilities03-global01-test'):
    ** ERROR ** mnesia_event got
    {inconsistent_database,
    running_partitioned_network,
    'rabbit@dmzutilities01-global01-test'}

application: rabbitmq_management
exited: shutdown
type: temporary

RabbitMQ logs show a partitioned network error; nodes shut down
while ($read < $n
    && !feof($this->sock->real_sock())
    && (false !== ($buf = fread(
        $this->sock->real_sock(),
        $n - $read)))) {
    $read += strlen($buf);
    $res .= $buf;
}




  PHP library didn’t have any time limit on reading a frame
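The shape of the fix, as a sketch: bound the total time spent reading one frame instead of trusting feof()/fread() to return. This assumes a plain PHP stream socket; the real client wraps its socket, so names differ.

// Illustrative: enforce a deadline across the whole frame read
$deadline = microtime(true) + 2.0;     // total budget for this frame
stream_set_timeout($sock, 0, 200000);  // cap each fread() at 200ms

while ($read < $n) {
    if (microtime(true) >= $deadline) {
        throw new Exception('timed out reading frame');
    }
    $buf = fread($sock, $n - $read);
    if ($buf === false || feof($sock)) {
        throw new Exception('socket error or EOF mid-frame');
    }
    if ($buf === '') {
        continue; // per-read timeout expired; loop re-checks deadline
    }
    $read += strlen($buf);
    $res .= $buf;
}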
Fix, rinse, repeat
It would be
nice if we could
 automate this
Automate!
• Hailo run a dedicated automated
  test environment

• Powered by bash, JMeter and
  Graphite

• Continuous automated testing
  with failure simulations
Fix attempt 1: bad timeouts configured
Fix attempt 2: better timeouts
Simulate in
system tests
Simulate failure

Assert monitoring endpoint picks this up

Assert features still work
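In the spirit of the deck's test-*.php scripts, a sketch of one such system test; every endpoint, path and JSON key here is hypothetical.

// Hypothetical failure-simulation system test
exec('iptables -A INPUT -i eth0 -p tcp --dport 11211 -j DROP');
sleep(5); // give the failure time to bite

// 1. the monitoring endpoint should notice the dead dependency
$status = json_decode(file_get_contents('http://localhost/status'), true);
assert($status['memcache'] === 'failed');

// 2. the feature should still work: degraded, not dead
assert(file_get_contents('http://localhost/api/jobs') !== false);

// clean up
exec('iptables -D INPUT -i eth0 -p tcp --dport 11211 -j DROP');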
In conclusion
“the best way to avoid
failure is to fail constantly.”




http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html
You should test for
failure

How does the software react?
How does the PHP client react?
Automation makes
continuous failure
testing feasible
Systems that cope well
with failure are easier
to operate
TIMED BLOCK ALL
THE THINGS
Thanks


Software used at Hailo

http://cassandra.apache.org/
http://zookeeper.apache.org/
http://www.elasticsearch.org/
http://www.acunu.com/acunu-analytics.html
https://github.com/bitly/nsq
https://github.com/davegardnerisme/cruftflake
https://github.com/davegardnerisme/nsqphp

Plus a load of other things I’ve not mentioned.
Further reading
Hystrix: Latency and Fault Tolerance for Distributed Systems
https://github.com/Netflix/Hystrix

Timelike: a network simulator
http://aphyr.com/posts/277-timelike-a-network-simulator

Notes on distributed systems for young bloods
http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/

Stream de-duplication (relevant to NSQ)
http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/

ID generation in distributed systems
http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems


Editor's Notes

  1. I’m dave!
  2. I work at Hailo. This presentation draws on my experiences building Hailo into one of the world’s leading taxi companies.
  3. The title of my talk is “planning to fail”
  4. First PHP conf; tempting fate. Thought about this title, but sounds more like monitoring.
  5. This talk more pro-active than that. Talking about my experiences at Hailo building reliable web services by continually failing.
  6. Why do we care about reliability?
  7. Advantages
  8. Advantages
  9. Advantages
  10. Advantages
  11. Advantages
  12. But first, let’s rewind to the beginning
  13. The pure joy of inserting a php tag in the middle of an HTML table
  14. My website still follows this pattern. I’d like to think my website is quite reliable.
  15. My website is reliable, but simple. Doesn’t change very often.
  16. Hailo is complex!
  17. Hailo is growing.
  18. Key quote: less machinery is quadratically better.
  19. Hailo have a lot of machinery!
  20. Enter the chaos monkey… If you want to be good at something, practice often!
  21. How about the “reliable” VPS that runs my website?
  22. But not resilient; my website would not cope well with the chaos monkey approach.
  23. We have to choose our stack appropriately if we are going to go down the chaos monkey route.
  24. Hailo didn’t start out this way; but the PHP component did
  25. Splitting into an SOA. Makes it much easier to change bits of code since each service does less, has fewer lines of code and changes less frequently. Also makes it easier to work in larger teams.
  26. Advantages
  27. Here’s one of our services… is this reliable?
  28. But Hailo is going global
  29. At Hailo we are splitting out the features of MySQL and using different technologies where appropriate
  30. Don’t pick things that are broken by design
  31. We remove services from the critical path using lazy-init pattern
  32. We want to define timeouts so that under failure conditions we don’t hang forever
  33. Instrumenting operation times – mean, upper 90th, upper bound (highest observed value)
  34. Let’s aim for 95th percentile as our timeout – but instrument when we do have timeouts so that we know what’s going on
  35. Yay!
  36. Boo
  37. Boo
  38. This was after we fixed the bug, but we had the timeouts configured badly.
  39. Better – memcache failure has less impact now; some features might be degraded, but the minimal viable service now works
  40. Runnable .md-based system tests