Instagration Pt. 2: Scaling our infrastructure to multiple data centers

Instagram Engineering
Nov 11, 2015

In 2013, about a year after we joined Facebook, 200 million people were using Instagram every month and we were storing 20 billion photos. With no slow-down in sight, we began Instagration, our move from AWS servers to Facebook’s infrastructure. Two years later, Instagram has grown into a community of 400 million monthly active users with 40 billion photos and videos, serving over a million requests per second. To keep supporting this growth, and to make sure the community has a reliable experience on Instagram, we decided to scale our infrastructure geographically. In this post, we’ll talk about why we expanded our infrastructure from one to three data centers and some of the technical challenges we met along the way.

Motivation

Mike Krieger, Instagram’s co-founder and CTO, recently wrote a post that included a story about the time in 2012 when a huge storm in Virginia brought down nearly half of our instances. The small team spent the next 36 hours rebuilding almost all of our infrastructure, an experience they never wanted to repeat. Natural disasters like this have the potential to temporarily or permanently damage a data center, and we need to make sure we can sustain such a loss with minimal impact on the user experience.

Other motivations for scaling out geographically include:

  • Resilience to regional issues: More common than natural disasters are network disconnects, power issues, etc. For example, soon after we expanded our services to Oregon, a rack containing memcache and async tier servers was powered off, causing a spike of errors on user requests. With our new infrastructure in place, we were able to shift traffic away from the region and mitigate the issue while we recovered from the power failure.
  • Flexible capacity expansion: Facebook operates several data centers. Once our infrastructure can run beyond one region, even with considerable network latency between regions, it is much easier to expand Instagram’s capacity wherever it happens to be available. That lets us make quick decisions about getting new features ready for users without scrambling for infrastructure resources to support them.

From One to Two

So how did we start spreading things out? First let’s take a look at Instagram’s overall infrastructure stack.

The key to expanding to multiple data centers is to distinguish global data from local data. Global data needs to be replicated across data centers, while local data can be different for each region (for example, the async jobs created by a web server only need to be seen in that region).

The next consideration is hardware resources. These can be roughly divided into three types: storage, computing and caching.

Storage

Instagram mainly uses two backend database systems: PostgreSQL and Cassandra. Both PostgreSQL and Cassandra have mature replication frameworks that work well as a globally consistent data store.

Global data maps neatly to the data stored in these servers. The goal is eventual consistency of this data across data centers, with some acceptable delay. Because there are vastly more read than write operations, having a read replica in each region avoids cross-data-center reads from web servers.

Writing to PostgreSQL, however, still goes across data centers, because writes always go to the primary.
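
As an illustration of this split, a Django database router along the following lines can keep reads region-local while sending writes to the primary. The connection aliases `primary` and `local_replica` are hypothetical placeholders, not our actual configuration.

```python
# settings.py would define two connection aliases, e.g. "primary" (the
# cross-region writable PostgreSQL primary) and "local_replica" (the
# read-only replica in this data center), plus:
#   DATABASE_ROUTERS = ["routers.RegionalRouter"]

class RegionalRouter:
    """Route reads to the region-local replica and writes to the primary."""

    def db_for_read(self, model, **hints):
        # Read traffic never leaves the local data center.
        return "local_replica"

    def db_for_write(self, model, **hints):
        # Writes always go across the continent to the primary.
        return "primary"

    def allow_relation(self, obj1, obj2, **hints):
        # Both aliases point at the same logical database.
        return True
```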

CPU Processing

Web servers and async servers are both easily distributed computing resources; they are stateless and only need to access data locally. Web servers can create async jobs that are queued by async message brokers and then consumed by async servers, all in the same region.
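
To sketch what “all in the same region” looks like, here is roughly how a job could be enqueued to a region-local broker using Celery; the broker URL and task are illustrative placeholders rather than our actual setup.

```python
from celery import Celery

# Hypothetical region-local broker: each region runs its own broker and its
# own pool of async servers, so jobs are produced and consumed in one region.
app = Celery("tasks", broker="amqp://rabbitmq.region-a.internal:5672//")

@app.task
def send_like_notification(user_id, media_id):
    # Executed later by an async server in the same region.
    ...

# A web server in region A enqueues the job; a worker in region A consumes it.
send_like_notification.delay(42, 12345)
```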

Caching

The cache layer is the web servers’ most frequently accessed tier, and it needs to be colocated within a data center to keep user request latency low. This means that updates to the cache in one data center are not reflected in another data center, which creates a challenge for moving to multiple data centers.

Imagine a user commented on your newly posted photo. In the one data center case, the web server that served the request can just update the cache with the new comment. A follower will see the new comment from the same cache.

In the multi data center scenario, however, if the commenter and the follower are served in different regions, the follower’s regional cache will not be updated and the user will not see the comment.

Our solution is to use PgQ, enhanced to insert cache invalidation events into the databases that are being modified (a rough sketch of both sides follows the lists below).

On the primary side:

  • The web server inserts a comment into the PostgreSQL DB;
  • The web server inserts a cache invalidation entry into the same DB.

On the replica side:

  • Replicate the primary DB, including both the newly inserted comment and the cache invalidation entry
  • The cache invalidation process reads the cache invalidation entry and invalidates the regional cache
  • Django servers read the newly inserted comment from the DB and refill the cache
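
A minimal sketch of both sides, assuming a PgQ queue named `cache_invalidation`, hypothetical table and key names, and the python-memcached and psycopg2 clients; the real pipeline enhances PgQ further to make this work against read replicas.

```python
import json

import memcache   # python-memcached; the client choice is an assumption
import psycopg2


def add_comment(conn, media_id, user_id, text):
    """Primary side: the comment and the invalidation event commit together."""
    with conn:  # psycopg2 commits (or rolls back) the whole block atomically
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO comments (media_id, user_id, text) VALUES (%s, %s, %s)",
                (media_id, user_id, text),
            )
            # pgq.insert_event() writes the event into the same database, so it
            # replicates to every region alongside the comment row itself.
            cur.execute(
                "SELECT pgq.insert_event('cache_invalidation', 'invalidate', %s)",
                (json.dumps({"key": "media:%d:comments" % media_id}),),
            )


regional_cache = memcache.Client(["127.0.0.1:11211"])  # region-local cache tier


def process_invalidations(event_payloads):
    """Replica side: called by the cache invalidation process with replicated
    event payloads; deleting the regional key forces the next read to refill
    the cache from the local replica."""
    for payload in event_payloads:
        regional_cache.delete(json.loads(payload)["key"])
```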

This solves the cache consistency issue. On the other hand, compared to the one-region case, where Django servers update the cache directly without re-reading from the DB, it creates increased read load on the databases. To mitigate this problem, we took two approaches: 1) reduce the computational resources needed for each read by denormalizing counters; 2) reduce the number of reads by using cache leases.

De-normalizing Counters

The most commonly cached keys are counters. For example, we would use a counter to determine the number of people who liked a specific post from Justin Bieber. When there was just one region, we would update the memcache counters by incrementing from web servers, therefore avoiding a “select count(*)” call to the database, which would take hundreds of milliseconds.
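
The single-region hot path looked roughly like the sketch below; the key name, the fallback helper, and the client’s behavior on a missing key are illustrative assumptions.

```python
import memcache

mc = memcache.Client(["127.0.0.1:11211"])
KEY = "media:12345:like_count"   # hypothetical key naming scheme

def record_like(fallback_count_fn):
    """Bump the cached counter; fallback_count_fn runs the expensive
    "select count(*)" only if the key has fallen out of cache."""
    if mc.incr(KEY) is None:
        # Key is missing: pay for one count(*) and seed the counter.
        mc.set(KEY, fallback_count_fn())
```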

But with two regions and PgQ invalidation, each new like creates a cache invalidation event to the counter. This will create a lot of “select count(*)”, especially on hot objects.

To reduce the resources needed for each of these operations, we denormalized the counter for likes on the post. Whenever a new like comes in, the count is increased in the database. Therefore, each read of the count will just be a simple “select” which is a lot more efficient.

There is also an added benefit to denormalizing counters in the same database where the likes themselves are stored: both updates can be included in one transaction, making them atomic and always consistent. Before this change, the counter in cache could become inconsistent with what was stored in the database due to timeouts, retries, and so on.
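
A sketch of the denormalized write path, with hypothetical table and column names; the point is that the like row and the counter bump commit together or not at all.

```python
import psycopg2

def add_like(conn, media_id, user_id):
    """Insert the like and bump the denormalized counter in one transaction."""
    with conn:  # psycopg2 commits (or rolls back) both statements together
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO likes (media_id, user_id) VALUES (%s, %s)",
                (media_id, user_id),
            )
            cur.execute(
                "UPDATE media SET like_count = like_count + 1 WHERE id = %s",
                (media_id,),
            )

def read_like_count(conn, media_id):
    # A plain indexed select instead of "select count(*)" over the likes table.
    with conn.cursor() as cur:
        cur.execute("SELECT like_count FROM media WHERE id = %s", (media_id,))
        return cur.fetchone()[0]
```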

Memcache Lease

In the above example of a new post from Justin Bieber, during the first few minutes of the post, both views of the post and likes for it spike. With each new like, the counter is deleted from cache. It is very common for multiple web servers to try to retrieve the same counter from cache at the same time, only to get a “cache miss.” If they all go to the database server to retrieve it, it creates a thundering herd problem.

We used the memcache lease mechanism to solve this problem. It works like this (a client-side sketch follows the list):

  • The web server issues a “lease get”, not the normal “get”, to the memcache server.
  • Memcache server returns the value if it’s a hit. In this case, it is no different than a normal “get.”
  • If the memcache server does not find the key, it returns a “first miss” to just one web server within *n* seconds; any other “lease get” requests during that time get a “hot miss.” In the case of a “hot miss” where the key had been deleted from cache *recently*, it returns the stale value. If the cache key is not filled within *n* seconds, the memcache server again issues a “first miss” to a “lease get” request.
  • When a web server receives “first miss,” it goes to the database to retrieve data and fill the cache.
  • When a web server receives “hot miss” with a stale value, it can typically use that value. If it receives “hot miss” without any value, it can choose to wait for the cache to be filled by the “first miss” web server.
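
Open-source memcached clients do not expose this lease mechanism out of the box, so the sketch below only models the client-side decision logic; `cache.lease_get`, `cache.lease_set`, and the status names are hypothetical stand-ins for the behavior described above.

```python
import time

HIT, FIRST_MISS, HOT_MISS = "hit", "first_miss", "hot_miss"

def get_counter(cache, db, key, retries=3):
    """Handle the three possible lease results for a cached counter."""
    for _ in range(retries):
        status, value, token = cache.lease_get(key)   # hypothetical call
        if status == HIT:
            return value
        if status == FIRST_MISS:
            # Only this web server goes to the database and refills the cache.
            value = db.fetch(key)
            cache.lease_set(key, value, token)        # hypothetical fill call
            return value
        if status == HOT_MISS and value is not None:
            # A slightly stale counter is acceptable; use the stale value.
            return value
        # Hot miss with no stale value: wait briefly for the "first miss"
        # web server to fill the cache, then retry.
        time.sleep(0.05)
    return db.fetch(key)   # last resort: read the database directly
```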

In summary, with both of the above implementations, we can mitigate the increased database load by reducing the number of accesses to the database, as well as the resources required for each access.

It also improved the reliability of our backend in cases where hot counters fall out of cache, which was not an infrequent occurrence in the early days of Instagram. Each of those occurrences used to mean hurried manual work from an engineer to fix the cache. With these changes, those incidents have become memories for old-timer engineers.

From 10ms Latency to 60ms

So far, we have focused mostly on cache consistency when caches become regional. Network latency between data centers across the continent was another challenge that impacted multiple designs. Between data centers, a 60ms network latency can cause problems in database replication as well as web servers’ updates to the database. We needed to solve the following problems in order to support a seamless expansion:

PostgreSQL Read Replicas Can’t Catch Up

As a Postgres primary takes in writes, it generates delta logs. The faster the writes come in, the more frequently these logs are generated. The primaries themselves keep the most recent log files for occasional needs from the replicas, but they archive all the logs to storage to make sure they are saved and accessible to any replica that needs data older than what the primary has retained. This way, the primary does not run out of disk space.

When we build a new read replica, it starts by reading a snapshot of the database from the primary. Once that is done, it needs to apply the logs generated since the snapshot was taken. When all the logs are applied, it is up to date and can stream from the primary and serve reads from web servers.

However, when a large database’s write rate is quite high and there is a lot of network latency between the replica and the log storage, it is possible for the logs to be read more slowly than they are created. The replica falls further and further behind and never catches up!

Our fix was to start a second streamer on the new read replica as soon as it begins transferring the base snapshot from the primary. The streamer pulls logs and stores them on local disk. When the snapshot finishes transferring, the read replica can apply the logs from local disk, making recovery much faster.
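
Stock PostgreSQL tooling can do something similar today: `pg_basebackup -X stream` opens a second replication connection that streams WAL to local disk while the base snapshot is copied. Our fix used our own streamer, so treat the following as an analogous sketch (host, user, and paths are placeholders) rather than our exact procedure.

```python
import subprocess

# Build a new read replica: the base snapshot and the WAL stream arrive over
# two connections at once, so logs accumulate on local disk instead of the
# replica falling behind while the snapshot copies.
subprocess.run(
    [
        "pg_basebackup",
        "--host", "primary.example.internal",
        "--username", "replication",
        "--pgdata", "/var/lib/postgresql/data",
        "-X", "stream",      # stream WAL concurrently with the base backup
        "--progress",
    ],
    check=True,
)
```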

This not only solved our database replication issues across the US, but also cut down the time it took to build a new replica by half. Now, even if the primary and replica are in the same region, operational efficiency is drastically increased.

Summary

Instagram is now running in multiple data centers across the US, giving us more flexible capacity planning and acquisition, higher reliability, and better preparedness for the kind of natural disaster that happened in 2012. In fact, we recently survived a staged “disaster.” Facebook regularly tests its data centers by shutting them down during peak hours. About a month ago, right as we had finished migrating our data to a new data center, Facebook ran a test and took that data center down. It was a high-stakes simulation, but fortunately we survived the loss of capacity without users noticing. Instagration part 2 was a success!
