Andrew Spyker
@aspyker
Netflix Cloud Platform
and Open Source
Agenda
● Introduction
○ Big enterprise/datacenter to consumer/cloud
● Netflix Cloud Platform
○ Elastic and Web Scale
○ High Availability and Auto Recovery
○ Continuous Delivery
○ Operational Visibility
○ Security
About me, road to Netflix
● Worked for IBM on Java/middleware performance
○ Cloud & mobile made Enterprise Java benchmarks less interesting
○ Monolithic DBs; resiliency and code updates not required
● Acme Air (benchmark) example app
○ Showed web/cloud scale
■ 4B+ mobile requests per day end to end, hundreds of nodes
■ But it wasn’t operable
○ Rewrote it using NetflixOSS libraries & services
■ Now operable, at the same levels of scale
■ Also enabled microservices and CI/CD
■ Won the Netflix Cloud Prize
About me, road to Netflix
● Once NetflixOSS was understood
○ Ported libraries & services to IBM middleware and cloud
■ PoCs for OpenStack, Docker, Mesos, Kubernetes
○ Started to onboard and operate IBM SaaS businesses
■ Most interestingly … IBM Watson
● 2014 - “Should I work on applying this platform to more
systems or help build the next cloud platform?”
● Joined Netflix on the cloud platform team
○ Focusing on performance/scalability
○ Also helping with architecture, containers, open source
@aspyker | ispyker.blogspot.com
Elastic, Web and Hyper Scale
Doing this
Not doing that
Elastic, Web and Hyper Scale
[Diagram: load balancers → front-end API (browser and mobile) → authentication and booking services, with temporal caching and durable storage behind them]
Strategy → Benefit
Make deployments automated → Without automation, this scale is impossible
Expose a well-designed API to users → Offloads presentation complexity to clients
Remove state from mid-tier services → Allows easy elastic scale-out (sketch below)
Push temporal state to clients and the caching tier → Leverages clients, avoids data-tier overload
Use partitioned data storage → Data design and storage scale with HA
…
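To make the “remove state from mid-tier services” row concrete, here is a minimal sketch (not Netflix code) of a booking service that keeps per-session temporal state in a shared caching tier, the role EVCache plays at Netflix; the CacheTier interface, BookingService class, and key names are hypothetical.

// Hypothetical sketch: a stateless mid-tier service. Any instance can
// serve any request because session state lives in a shared caching
// tier, not in instance memory.
interface CacheTier {
    String get(String key);
    void put(String key, String value, int ttlSeconds);
}

class BookingService {
    private final CacheTier cache;

    BookingService(CacheTier cache) {
        this.cache = cache;
    }

    // No fields hold per-user state, so instances can be added or
    // terminated freely and elastic scale-out stays trivial.
    String currentItinerary(String sessionToken) {
        String itinerary = cache.get("itinerary:" + sessionToken);
        return itinerary != null ? itinerary : "none";
    }

    void saveItinerary(String sessionToken, String itinerary) {
        // Temporal state is pushed to the caching tier with a TTL,
        // keeping the durable store off the hot path.
        cache.put("itinerary:" + sessionToken, itinerary, 1800);
    }
}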
HA and Automatic Recovery
Feeling This
Not Feeling That
Highly Available Service Runtime Recipe
[Diagram: a web app front end (REST services) executes the auth-service call through a Hystrix-wrapped Ribbon REST client with Eureka awareness; the app service (auth-service) is built on Karyon and registered with redundant Eureka servers, and a fallback implementation covers auth-service failure]
Implementation Detail → Benefits
Decompose into microservices → Key user path always available; failure does not propagate across service boundaries
Karyon with automatic Eureka registration → New instances are quickly found; failing individual instances disappear
Ribbon client with Eureka awareness → Load balances & retries across instances with “smarts”; handles temporal instance failure
Hystrix as dependency circuit breaker → Allows for fast failure; provides graceful cross-service degradation/recovery (sketch below)
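A minimal sketch of the Hystrix piece of this recipe; the remote call is stubbed as a hypothetical callAuthService helper, where a real implementation would go through a Ribbon/Eureka-aware REST client.

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Wraps the auth-service dependency in a circuit breaker so a failing
// or slow auth-service degrades gracefully instead of cascading.
public class AuthCommand extends HystrixCommand<Boolean> {
    private final String userToken;

    public AuthCommand(String userToken) {
        super(HystrixCommandGroupKey.Factory.asKey("auth-service"));
        this.userToken = userToken;
    }

    @Override
    protected Boolean run() {
        // Hypothetical stub; a real call would use a Ribbon REST client
        // that resolves auth-service instances via Eureka.
        return callAuthService(userToken);
    }

    @Override
    protected Boolean getFallback() {
        // Fast failure / graceful degradation when the circuit is open
        // or the call fails or times out.
        return Boolean.FALSE;
    }

    private Boolean callAuthService(String token) {
        throw new UnsupportedOperationException("stubbed for this sketch");
    }
}

Calling new AuthCommand(token).execute() runs the call through the breaker; when error rates or latency spike, Hystrix opens the circuit and routes straight to getFallback().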
IaaS High Availability
[Diagram: global load balancers route into region us-east-1, where Eureka, the web app, and services run across zones us-east-1c, us-east-1d, and us-east-1e under Cluster Auto Recovery and Scaling Services (Auto Scaling Groups)]
Rule → Why?
Always > 2 of everything → 1 is a SPOF; 2 doesn’t web scale and makes DR recovery slow
Including IaaS and cloud services → You’re only as strong as your weakest dependency
Use auto scaler/recovery monitoring → Clusters guarantee availability and service latency
Use application-level health checks → An instance on the network != healthy (sketch below)
Worldwide availability → Data replication, global front-end routing, cross-region traffic
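To illustrate the application-level health check rule, here is a minimal sketch using the JDK’s built-in HTTP server rather than Karyon’s health check hooks; the port and the dependenciesAreHealthy probe are hypothetical stand-ins.

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// A /healthcheck endpoint that reports application health, not just
// network reachability: an instance on the network != healthy.
public class HealthCheckServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8077), 0);
        server.createContext("/healthcheck", exchange -> {
            boolean healthy = dependenciesAreHealthy();
            byte[] body = (healthy ? "OK" : "UNHEALTHY").getBytes();
            exchange.sendResponseHeaders(healthy ? 200 : 500, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }

    // Hypothetical stand-in: a real check probes dependencies (DB
    // connections, thread pools, downstream circuits), not just "true".
    static boolean dependenciesAreHealthy() {
        return true;
    }
}

The auto scaler and load balancers can then replace or route around instances whose health check fails, rather than trusting that a reachable instance is a working one.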
Testing is the only way to prove HA
● Chaos Monkey
○ Kills instances in production - runs regularly (see the sketch after this list)
● Chaos Gorilla
○ Kills availability zones (a single datacenter)
○ Testing for split brain is also important
● Chaos Kong
○ Kills an entire region and shifts traffic globally
○ Run frequently, but with prior scheduling
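A minimal sketch of the core Chaos Monkey idea using the AWS SDK for Java; this is not the real tool (which adds scheduling, opt-in configuration, and safety checks), and the hard-coded instance IDs stand in for members discovered from an Auto Scaling group.

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.TerminateInstancesRequest;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Randomly terminate one instance in a cluster to prove the cluster
// recovers automatically.
public class MiniChaosMonkey {
    public static void main(String[] args) {
        // Hypothetical instance IDs; the real tool discovers members of
        // an Auto Scaling group instead.
        List<String> cluster = Arrays.asList("i-0abc123", "i-0def456");
        String victim = cluster.get(new Random().nextInt(cluster.size()));

        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
        ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(victim));
        System.out.println("Terminated " + victim + "; auto recovery should replace it.");
    }
}

If the cluster stays healthy while this runs regularly, the HA claims are proven rather than assumed.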
Continuous Delivery
Reading This
Not This
Continuous Delivery
[Diagram: continuous build server output is baked into images (AMIs), then deployed as cluster v1 → canary v2 → cluster v2]
Step → Technology
Developers test locally → Unit test frameworks
Continuous build → Continuous build server based on Gradle builds
Build “bakes” a full instance image → Aminator and the deployment pipeline bake images from build artifacts
Developers work across dev and test → Archaius allows for environment-based context (sketch below)
Developers do canary tests and red/black deployments in prod → Asgard console provides a common devops approach for app clusters, plus security patterns and visibility
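A minimal sketch of the Archaius row; the config name (“app”) and property key are hypothetical, and cascaded loading overlays app-<environment>.properties on top of app.properties for the current deployment context.

import com.netflix.config.ConfigurationManager;
import com.netflix.config.DynamicPropertyFactory;
import com.netflix.config.DynamicStringProperty;

// Environment-based context with Archaius: the same artifact runs in
// dev, test, and prod, reading environment-specific values at runtime.
public class ArchaiusSketch {
    public static void main(String[] args) throws Exception {
        // Cascaded loading: app.properties, then overrides from
        // app-test.properties because the environment is "test".
        ConfigurationManager.getDeploymentContext().setDeploymentEnvironment("test");
        ConfigurationManager.loadCascadedPropertiesFromResources("app");

        // Dynamic properties re-read their value on each get(), so they
        // can also change at runtime without a redeploy.
        DynamicStringProperty endpoint = DynamicPropertyFactory.getInstance()
                .getStringProperty("acme.booking.endpoint", "http://localhost:8080");
        System.out.println("booking endpoint = " + endpoint.get());
    }
}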
Operational Visibility
If you can’t see it, you can’t improve it
Operational Visibility
[Diagram: the web app and auth service emit Servo metrics and Hystrix/Turbine streams into metric/event repositories (Atlas, Vector), send logs to LogStash/ElasticSearch/Kibana, and are watched by external uptime monitoring feeding incident management]
Visibility Point → Technology
Basic IaaS instance monitoring → Not enough (not scalable, not app-specific)
User-like external monitoring → SaaS offerings, or OSS like Uptime
Targeted performance, sampling → Vector performance and app-level metrics
Service-to-service interconnects → Hystrix streams ➔ Turbine aggregation ➔ Hystrix dashboard
Application-centric metrics → Servo gauges, counters, and timers sent to a metrics store like Atlas (sketch below)
Remote logging → Logstash/Kibana or similar log aggregation and analysis frameworks
Threshold monitoring and alerts → Services like Atlas and PagerDuty for incident management
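A minimal sketch of the application-centric metrics row using Servo; the counter name is hypothetical, and in a real deployment a metrics poller publishes registered monitors to a store like Atlas.

import com.netflix.servo.DefaultMonitorRegistry;
import com.netflix.servo.monitor.BasicCounter;
import com.netflix.servo.monitor.MonitorConfig;

// Application-centric metrics with Servo: register a counter once and
// increment it per request; a poller ships registered monitors to a
// metrics store such as Atlas for dashboards and alerting.
public class BookingMetrics {
    private static final BasicCounter BOOKINGS = new BasicCounter(
            MonitorConfig.builder("acme.bookings.completed").build());

    static {
        DefaultMonitorRegistry.getInstance().register(BOOKINGS);
    }

    public static void recordBooking() {
        BOOKINGS.increment();
    }
}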
Security
Solid Security
Done in new ways
NOT
Security
Security must consider a fluid environment
Security must be automated!
Security Monkey
● Monitors security policies, tracks changes, and alerts on risky situations
Scumblr
● Searches the web and social media for security “nuggets” (credentials, hacking
discussions, etc.), collecting via Sketchy
Sketchy
● A safe way to collect text and screenshots from websites
What did we not cover?
Over 50 GitHub projects
● “Technical indigestion as a service”
Big Data and User Interface Engineering
● Both deserve their own sections
● Extensive open source projects, existing and coming (e.g., Falcor)
How do I get started?
● All of the previous slides show NetflixOSS components
○ Code: http://netflix.github.io
○ Announcements: http://techblog.netflix.com/
● Want to get running a bit faster?
● ZeroToCloud
○ Workshop for getting started with build/bake/deploy in Amazon EC2
● ZeroToDocker
○ Docker images containing running Netflix technologies (not
production ready, but easy to understand)