6

I am still trying to wrap my head around CDNs in a real-world scenario.

  • Suppose I am building a Netflix clone.
  • I have about 1,000 terabytes of video content stored in an S3 bucket.
  • Since Netflix is used globally, I'd probably replicate it across the globe for lower latency. So, now I have 20 of the same S3 buckets full of video content.
  • However, if 25 MILLION people are on the platform, that could potentially be 500 TB per hour being streamed (a rough back-of-envelope sketch of this math follows this list). This does not seem practical, and I can't just keep making more data stores.
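For reference, here is the rough back-of-envelope math. The concurrency and bitrate figures are assumptions chosen only to land near that 500 TB/h estimate, not real Netflix data:

```python
# Back-of-envelope sketch; the concurrency and bitrate figures are assumptions
# chosen only to land near the 500 TB/h estimate above, not real Netflix data.

SUBSCRIBERS = 25_000_000
CONCURRENT_FRACTION = 0.02     # assumption: 2% of subscribers streaming at any moment
GB_PER_STREAM_HOUR = 1.0       # assumption: ~1 GB per hour per stream

concurrent_streams = SUBSCRIBERS * CONCURRENT_FRACTION
tb_per_hour = concurrent_streams * GB_PER_STREAM_HOUR / 1_000

print(f"{concurrent_streams:,.0f} streams -> ~{tb_per_hour:,.0f} TB/hour")
# 500,000 streams -> ~500 TB/hour. The total scales with viewers, not with
# how many replicated copies of the bucket exist.
```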

Here is what I understand about CDNs:

  • Reduce latency (DNS will route you to the closest CDN server, wherever you are in the world).
  • Performance (CDNs optimize caching, so you may not need to fetch from the origin again).
  • Security (extra protection against DDoS attacks).
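For the latency point, here is a tiny sketch of the "route me to the closest edge" idea as I understand it. Real CDNs do this with GeoDNS or anycast rather than application code, and the node names and latencies below are made up:

```python
# Illustrative only: pick the edge with the lowest measured latency for a client.
EDGE_LATENCY_MS = {            # assumed latencies from one particular client
    "us-east-edge": 12,
    "eu-west-edge": 95,
    "ap-south-edge": 210,
}

def pick_edge(latencies: dict[str, int]) -> str:
    """Return the edge location this client should be directed to."""
    return min(latencies, key=latencies.get)

print(pick_edge(EDGE_LATENCY_MS))   # -> us-east-edge
```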

However, wouldn't the CDN (server) still have to hit the data center to get the video content (suppose nothing is cached)? How is the CDN helping reduce the 500 TB/h that is being streamed?

Here is what I have drawn out (I am looking for criticism).

[architecture diagram]

My idea is that, with load balancers and reverse proxies, the user visits one of the many application servers (which respond with the video page, genres page, etc.). Once the user selects a particular film, I think I want to use the CDN to deliver the GBs of video data to many users. However, I am still confused: how is the CDN going to serve that data? Where did it get it from? Wouldn't it still have to hit the datastore? (A sketch of the flow I have in mind is below.)
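Here is a minimal sketch of one way I imagine this flow could work (all names and URLs are hypothetical): the application servers only return metadata plus a CDN URL, and the CDN edge serves the actual video bytes, going back to the origin datastore only on a cache miss.

```python
# Sketch of the flow described above (all names hypothetical): app servers hand
# back metadata plus a CDN URL; the edge serves the heavy video bytes and only
# falls back to the origin store on a cache miss.

CDN_EDGE_CACHE: dict[str, bytes] = {}            # what this edge already holds

def origin_fetch(video_id: str) -> bytes:
    """Stand-in for pulling the file from the S3 origin (slow and expensive)."""
    return f"<bytes of {video_id}>".encode()

def app_server_select_film(video_id: str) -> dict:
    """App server responds with metadata and a CDN URL, never the video itself."""
    return {"title": video_id, "stream_url": f"https://cdn.example.com/{video_id}"}

def cdn_edge_serve(video_id: str) -> bytes:
    """Edge serves from its cache; only a miss goes back to the datastore."""
    if video_id not in CDN_EDGE_CACHE:
        CDN_EDGE_CACHE[video_id] = origin_fetch(video_id)   # one origin hit
    return CDN_EDGE_CACHE[video_id]

meta = app_server_select_film("some-film")
for _ in range(3):                     # three viewers behind the same edge
    cdn_edge_serve("some-film")        # the origin is contacted only once
```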

  • I think you need to consider the 80/20 rule: 80% of users are accessing 20% of the content, with the ratio possibly even higher, like 90/10. This has a big impact on caching and distribution strategies, as you don't need to store all the movies on all the nodes.
    – Euphoric, Jun 15, 2020 at 13:43
  • @Euphoric Well said! With this in mind, we could leverage the regional distribution to prioritize distributing content in those regions' languages. For example, in Punjab, you may want to cache new Punjabi movies first.
    – dsomel21, Jun 15, 2020 at 14:57

3 Answers

14

The answer is that at Netflix scale, you build your own CDN / cache device and offer it to ISPs to install in their own networks.

https://openconnect.netflix.com/en/

I should point out as well that network topologies usually look like tree structures, and Netflix places these devices not just within individual ISP networks but also at peering sites, which act as a “middle tier” of cache devices. So, assuming the movies are in an S3 bucket in some region, those movies presumably get stored in intermediate Open Connect appliances at the peering points where multiple ISPs connect, and each ISP then has its own Open Connect appliance that only fetches from the one at the peering point. In this way, the sum of the individual households streaming a show is not reflected in the bandwidth the origin actually has to handle; the aggregate bandwidth of households stays within the ISP network and can be managed appropriately.
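As a rough sketch of that tree-shaped lookup path (the class and node names below are purely illustrative, not Netflix's actual design): an ISP-level cache asks the peering-site cache, and only the peering-site tier ever touches the origin.

```python
# Illustrative two-tier cache hierarchy: ISP appliance -> peering-site appliance -> origin.

class CacheTier:
    def __init__(self, name: str, upstream=None):
        self.name = name
        self.upstream = upstream            # parent tier, or None for the origin
        self.store: dict[str, bytes] = {}

    def get(self, key: str) -> bytes:
        if key in self.store:
            return self.store[key]                            # served locally
        if self.upstream is None:
            data = f"<{key} from origin bucket>".encode()     # origin fetch
        else:
            data = self.upstream.get(key)                     # ask the parent tier
        self.store[key] = data
        return data

origin       = CacheTier("origin")                      # S3 bucket in one region
peering_site = CacheTier("peering-appliance", origin)   # shared by many ISPs
isp_a        = CacheTier("isp-a-appliance", peering_site)
isp_b        = CacheTier("isp-b-appliance", peering_site)

isp_a.get("new-show-ep1")   # origin is hit once, via the peering site
isp_b.get("new-show-ep1")   # served from the peering site; origin untouched
isp_a.get("new-show-ep1")   # served entirely inside ISP A's own network
```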

Final point: Netflix can probably also push content to these devices before a show or movie goes live, based on all kinds of factors of their choosing. So if they expect a movie launch to result in high traffic, they can push the movie content directly down to the ISPs so that the rush of viewers can be served within the ISP network.

Apple has done that for years with things like iOS software updates. They “preseed” the CDN with content before the content is made available to end-devices.
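A toy sketch of that preseeding idea (the scheduling hook and node names are assumptions, not any real Netflix or Apple API): content expected to be popular is pushed out to the edges before launch, typically during off-peak hours.

```python
# Illustrative preseeding loop; expected_launches() is an assumed hook, not a real API.
from datetime import timedelta

EDGE_NODES = ["isp-a-appliance", "isp-b-appliance", "peering-appliance"]

def expected_launches(within: timedelta) -> list[str]:
    """Assumed hook: titles launching within the window that are predicted to be popular."""
    return ["big-movie-premiere"]

def push_to_edges(title: str) -> None:
    for node in EDGE_NODES:
        print(f"pushing {title} to {node} ahead of launch, during off-peak hours")

for title in expected_launches(timedelta(days=2)):
    push_to_edges(title)
```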

5

CDNs are generally based on two key concepts:

1) The 80/20 rule: for most services, most users are only interested in a relatively small proportion of the content. For the Netflix example, there might be 10 TB of content that fulfills the vast majority of the requests. The CDN edge nodes can cache that commonly requested content close to the users, using only one transfer from the origin per edge node rather than one for each user behind the edge node. Latency benefits come from the edge node being able to serve the content directly. For less commonly accessed content, the edge node forwards the request to the origin server, so there are fewer benefits in that scenario.
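A worked version of that "one transfer per edge node" point, using the question's 20 regions plus assumed hot-set and title sizes:

```python
# Illustrative numbers only; the hot-set fraction and title size are assumptions.

CATALOG_TB = 1_000              # from the question
HOT_FRACTION = 0.10             # assumption: ~10% of titles serve most requests
EDGE_NODES = 20                 # from the question's 20 regions
VIEWERS_OF_HIT_TITLE = 1_000_000
TITLE_SIZE_GB = 5               # assumption: one HD film

hot_set_tb = CATALOG_TB * HOT_FRACTION
origin_transfer_gb = EDGE_NODES * TITLE_SIZE_GB             # once per edge node
without_cdn_gb     = VIEWERS_OF_HIT_TITLE * TITLE_SIZE_GB   # once per viewer

print(f"hot set to keep at each edge: ~{hot_set_tb:.0f} TB")
print(f"origin egress with edge caches: {origin_transfer_gb:,} GB")
print(f"origin egress without caching:  {without_cdn_gb:,} GB")
# With these assumptions the origin moves 100 GB instead of 5,000,000 GB for
# that one popular title; the difference is the cache hit ratio at work.
```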

2) A CDN's entire business is moving large amounts of data over long distances, so they build their networks to optimize this, with dedicated long-distance fiber links. These networks can be less congested than the general internet path.

0

Suppose I am building a Netflix clone. I have about 1,000 terabytes of video content stored in an S3 bucket.

Okay.

Since Netflix is used globally, I'd probably replicate it across the globe for lower latency. So, now I have 20 of the same S3 buckets full of video content.

Okay. This won't save you money, because (I think) S3 costs the same no matter which region your customer is in. Actually, it will cost you less (but still too much) if you only put your content in the cheapest region.

S3 is a really expensive way to serve lots of data, by the way. Even if you want to serve 500 TB per month, you should shop around for non-cloud dedicated server hosting.
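To make "really expensive" concrete, the arithmetic looks roughly like this. The per-GB rate below is a placeholder assumption, so check current S3 egress pricing rather than relying on it:

```python
# Illustrative cost arithmetic only; ASSUMED_EGRESS_USD_PER_GB is a placeholder,
# not a quote of current S3 pricing.

MONTHLY_EGRESS_TB = 500
ASSUMED_EGRESS_USD_PER_GB = 0.05    # assumption: ballpark per-GB transfer-out rate

monthly_cost = MONTHLY_EGRESS_TB * 1_000 * ASSUMED_EGRESS_USD_PER_GB
print(f"~${monthly_cost:,.0f}/month just for data transfer out")
# ~$25,000/month at that assumed rate, before storage or request charges,
# which is why this answer suggests shopping around for dedicated hosting.
```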

However, if 25 MILLION people are on the platform, that could potentially be 500 TB per hour being streamed. This does not seem practical, and I can't just keep making more data stores.

You can't? Netflix does. That is why Netflix costs money.

However, wouldn't the CDN (server) still have to hit the data center to get the video content (suppose nothing is cached)?

Yup.

How is the CDN helping reduce the 500 TB/h that is being streamed?

Well, it caches things. You just said "(suppose nothing is cached)", but the whole reduction comes from the cache.

However, I am still confused: how is the CDN going to serve that data? Where did it get it from? Wouldn't it still have to hit the datastore?

Yes. The first time. Not the second time. And not the third time.
