YOW! Conference Australia
Nov-Dec 2018
Brendan Gregg
Senior Performance Architect
Cloud and Platform Engineering
Cloud Performance
Root Cause Analysis
at Netflix
Experience: CPU Dips
YOW2018 Cloud Performance Root Cause Analysis at Netflix
# perf record -F99 -a
# perf script
[…]
java 14327 [022] 252764.179741: cycles: 7f36570a4932 SpinPause (/usr/lib/jvm/java-8
java 14315 [014] 252764.183517: cycles: 7f36570a4932 SpinPause (/usr/lib/jvm/java-8
java 14310 [012] 252764.185317: cycles: 7f36570a4932 SpinPause (/usr/lib/jvm/java-8
java 14332 [015] 252764.188720: cycles: 7f3658078350 pthread_cond_wait@@GLIBC_2.3.2
java 14341 [019] 252764.191307: cycles: 7f3656d150c8 ClassLoaderDataGraph::do_unloa
java 14341 [019] 252764.198825: cycles: 7f3656d140b8 ClassLoaderData::free_dealloca
java 14341 [019] 252764.207057: cycles: 7f3657192400 nmethod::do_unloading(BoolObje
java 14341 [019] 252764.215962: cycles: 7f3656ba807e Assembler::locate_operand(unsi
java 14341 [019] 252764.225141: cycles: 7f36571922e8 nmethod::do_unloading(BoolObje
java 14341 [019] 252764.234578: cycles: 7f3656ec4960 CodeHeap::block_start(void*) c
[…]
Observability
Methodology
Velocity
Root Cause Analysis at Netflix
…
…
Load
Netflix
Application
ELB ASG Cluster SG
ASG 1
Instances (Linux)
JVM
Tomcat
Service
Hystrix
Zuul 2
AZ 1
ASG 2
AZ 2 AZ 3
…
Devices
Eureka
Ribbon
gRPC
Roots
Atlas
Chronos
Zipkin
Vector
sar, *stat
ftrace
bcc/eBPF
bpftrace
PMCs, MSRs
Agenda
1. The Netflix Cloud
2. Methodology
3. Cloud Analysis
4. Instance Analysis
Since 2014
Asgard → Spinnaker
Salp → Zipkin
gRPC adoption
New Atlas UI & Lumen
Java frame pointer
eBPF bcc & bpftrace
PMCs in EC2
...
From Clouds to Roots (2014 presentation): Old Atlas UI
>150k AWS EC2 server instances
~34% US Internet traffic at night
>130M members
Performance is customer satisfaction & Netflix cost
Acronyms
AWS: Amazon Web Services
EC2: AWS Elastic Compute Cloud (cloud instances)
S3: AWS Simple Storage Service (object store)
ELB: AWS Elastic Load Balancers
SQS: AWS Simple Queue Service
SES: AWS Simple Email Service
CDN: Content Delivery Network
OCA: Netflix Open Connect Appliance (streaming CDN)
QoS: Quality of Service
AMI: Amazon Machine Image (instance image)
ASG: Auto Scaling Group
AZ: Availability Zone
NIWS: Netflix Internal Web Service framework (Ribbon)
gRPC: gRPC Remote Procedure Calls
MSR: Model Specific Register (CPU info register)
PMC: Performance Monitoring Counter (CPU perf counter)
eBPF: extended Berkeley Packet Filter (kernel VM)
1. The Netflix Cloud
Overview
S3
EC2
Cassandra
EVCache
Applications (Services)
ELB
Elasticsearch
SQS
SES
The Netflix Cloud
Netflix
Microservices User Data
Personalization
Viewing Hist.
Authentication
Web Site API
Streaming API
Client
Devices
DRM
QoS Logging
CDN Steering
Encoding
OCA CDN
…
EC2
Freedom and Responsibility
● Culture deck memo is true
– https://jobs.netflix.com/culture
● Deployment freedom
– Purchase and use cloud instances without approvals
– Netflix environment changes fast!
● Usually open source
● Linux, Java, Cassandra, Node.js, …
● http://netflix.github.io/
Cloud Technologies
Linux (Ubuntu)
Java (JDK 8)
Tomcat
GC and thread dump logging
Application war files, base servlet, platform, hystrix, health check, metrics (Servo)
Optional Apache, memcached, non-Java apps (incl. Node.js, golang)
Atlas monitoring, S3 log rotation, ftrace, perf, bcc/eBPF
Typical BaseAMI
Cloud Instances
5 Key Issues
And How the Netflix Cloud is
Architected to Solve Them
1. Load Increases → Auto Scaling Groups
– Instances automatically added or removed by a custom scaling policy
– Alerts & monitoring used to check scaling is sane
– Good for customers: Fast workaround
– Good for engineers: Fix later, 9-5
Scaling Policy
loadavg, latency, …
CloudWatch, Servo
ASG
Instance
Instance
Instance
Instance
2. Bad Push → ASG Cluster Rollback
– ASG red black clusters: how code versions are deployed
– Fast rollback for issues
– Traffic managed by Elastic Load Balancers (ELBs)
– Automated Canary Analysis (ACA) for testing
…
ASG-v011
…
ASG-v010
ASG
Cluster
prod1
Canary
ELB
Instance
Instance
Instance
Instance
Instance
Instance
3. Instance Failure → Hystrix Timeouts
– Hystrix: latency and fault tolerance for dependency services
Fallbacks, degradation, fast fail and rapid recovery, timeouts, load shedding, circuit breaker, realtime monitoring
– Plus Ribbon or gRPC for more fault tolerance
Tomcat
Application
Hystrix
get A
Dependency
A1
Dependency
A2
>100ms
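A minimal sketch of the circuit breaker with fallback idea listed above. This is not Hystrix's implementation: the class name, thresholds, and `clock` parameter are hypothetical, and a timeout is assumed to surface as an exception raised by the dependency call.

```python
import time


class CircuitBreaker:
    """Hypothetical minimal circuit breaker; not Hystrix's actual code."""

    def __init__(self, max_failures=3, reset_after_s=5.0, clock=time.monotonic):
        self.max_failures = max_failures    # failures before opening (assumed value)
        self.reset_after_s = reset_after_s  # how long to fail fast (assumed value)
        self.clock = clock
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, dependency, fallback):
        # While open, fail fast to the fallback until the reset window passes.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()
            # Half-open: allow one trial call; a single failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = dependency()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

Usage: wrap each dependency call, for example `breaker.call(get_a, lambda: cached_a)`, where `get_a` and `cached_a` are placeholders for a real client call and a degraded response.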
4. Region Failure → Zuul 2 Reroute Traffic
– All device traffic goes through the Zuul 2 proxy: dynamic routing, monitoring, resiliency, security
– Region or AZ failure: reroute traffic to another region
Zuul 2, DNS Monitoring
Region 1 Region 2 Region 3
5. Overlooked Issue → Chaos Engineering
Instances: termination
Availability Zones: artificial failures
Latency: artificial delays by ChAP
Conformity: kills non-best-practices instances
Doctor: health checks
Janitor: kills unused instances
Security: checks violations
10-18: geographic issues
(Resilience)
A Resilient Architecture
…
…
Load
Netflix
Application
ELB ASG Cluster SG
ASG 1
Instances (Linux)
JVM
Tomcat
Service
Hystrix
Zuul 2
AZ 1
ASG 2
AZ 2 AZ 3
…
Devices
Chaos Engineering
Some services vary:
- Apache Web Server
- Node.js & Prana
- golang
Eureka
Ribbon
gRPC
2. Methodology
Cloud & Instance
Why Do Root Cause Perf Analysis?
…
…
Netflix
Application
ELB ASG Cluster SG
ASG 1
Instances (Linux)
JVM
Tomcat
Service
AZ 1
ASG 2
AZ 2 AZ 3
…
Often for:
● High latency
● Growth
● Upgrades
Cloud Methodologies
● Resource Analysis
● Metric and event correlations
● Latency Drilldowns
● RED Method
Service A
Service C
Service B
Service D
For each microservice, check:
- Rate
- Errors
- Duration
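The RED checklist can be sketched as a small per-service summary. This is an illustration only: the `Request` record and `red_summary` function are hypothetical names, and mean duration stands in for whatever latency statistic a real dashboard would use.

```python
from dataclasses import dataclass


@dataclass
class Request:
    service: str        # hypothetical record layout
    duration_ms: float
    error: bool


def red_summary(requests, window_s):
    """Return {service: (rate_per_s, error_ratio, mean_duration_ms)}."""
    by_service = {}
    for r in requests:
        by_service.setdefault(r.service, []).append(r)
    summary = {}
    for service, reqs in by_service.items():
        rate = len(reqs) / window_s                           # Rate
        errors = sum(1 for r in reqs if r.error) / len(reqs)  # Errors
        duration = sum(r.duration_ms for r in reqs) / len(reqs)  # Duration
        summary[service] = (rate, errors, duration)
    return summary
```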
Instance Methodologies
● Log Analysis
● Micro-benchmarking
● Drill-down analysis
● USE Method
CPU Memory
Disk
Controller
Network
Controller
Disk Disk Net Net
For each resource, check:
- Utilization
- Saturation
- Errors
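The USE checklist can be sketched as a loop over resources. The dictionary layout, the function name, and the 90% utilization threshold are assumptions for illustration, not values from the talk.

```python
def use_checklist(resources, util_threshold=0.9):
    """Return (resource, finding) pairs for anything that looks unhealthy.

    resources: {name: {"utilization": 0..1, "saturation": queue length,
                       "errors": count}}  (hypothetical layout)
    """
    findings = []
    for name, m in resources.items():
        if m["errors"] > 0:
            findings.append((name, "errors"))
        if m["saturation"] > 0:
            findings.append((name, "saturation"))
        if m["utilization"] >= util_threshold:  # threshold is an assumption
            findings.append((name, "utilization"))
    return findings
```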
Bad Instance Anti-Method
1. Plot request latency per-instance
2. Find the bad instance
3. Terminate it
4. Someone else’s problem now!
Bad instance latency
Terminate!
Could be an early warning of a bigger issue
3. Cloud Analysis
Atlas, Lumen, Chronos, ...
Netflix Cloud Analysis Process
Instance Analysis
Atlas Alerts
Atlas/Lumen Dashboards
Atlas Metrics
Zipkin
Slalom
PICSOU
4. Check Dependencies
Create
New Alert
Cost
3. Drill Down
5. Root Cause
Chronos
2. Check Events
Redirected to
a new Target
1. Check Issue
Example path
enumerated
Plus some other
tools not pictured
Slack
Chat
Atlas: Alerts
Custom alerts on streams per second (SPS) changes, CPU usage, latency, ASG
growth, client errors, …
Winston: Automated Diagnostics & Remediation
Chronos: Possible Related Events
Links to Atlas
Dashboards & Metrics
Atlas: Dashboards
Netflix perf vitals dashboard
1. RPS, CPU
2. Volume
3. Instances
4. Scaling
5. CPU/RPS
6. Load avg
7. Java heap
8. ParNew
9. Latency
10. 99th percentile
Atlas & Lumen: Custom Dashboards
● Dashboards are a checklist methodology: what to show first, second, third...
● Starting point for issues
1. Confirm and quantify issue
2. Check historic trend
3. Atlas metrics to drill down
Lumen: more flexible dashboards
eg, go/burger
Atlas: Metrics
Application
Metrics
Presentation
Interactive
graph
Summary
statistics
Region
Time range
Atlas: Metrics
● All metrics in one system
– System metrics: CPU usage, disk I/O, memory, …
– Application metrics: latency percentiles, errors, …
● Filters or breakdowns by region, application, ASG, metric, instance
● URL has session state: shareable
Chronos: Change Tracking
Scope Time Range Event Log
Slalom: Dependency Graphing
App
Dependency
Traffic Volume
Zipkin UI: Dependency Tracing
Dependency
Latency
PICSOU: AWS Usage
Cost per hour
Breakdowns
Details (redacted)
Slack: Chat
Latency is high in us-east-1
Sorry
We just did a bad push
Netflix Cloud Analysis Process
Instance Analysis
Atlas Alerts
Atlas Metrics
Zipkin
Slalom
4. Check Dependencies
Create
New Alert
Cost
3. Drill Down
5. Root Cause
Chronos
2. Check Events
Redirected to
a new Target
1. Check Issue
Example path
enumerated
Plus some other
tools not pictured
PICSOU
Atlas/Lumen Dashboards
Slack
Chat
Generic Cloud Analysis Process
Instance Analysis
Alerts
Custom Dashboards
Metric Analysis
Dependency Analysis
Usage Reports
4. Check Dependencies
Create
New Alert
Cost
3. Drill Down
5. Root Cause
Change Tracking
2. Check Events
Redirected to
a new Target
1. Check Issue
Example path
enumerated
Plus other tools
as needed
Messaging
Chat
4. Instance Analysis
1. Statistics
2. Profiling
3. Tracing
4. Processor Analysis
1. Statistics
Linux Tools
● vmstat, pidstat, sar, etc., are used mostly as normal
● Micro-benchmarking can be used to investigate hypervisor behavior that can’t be observed directly
$ sar -n TCP,ETCP,DEV 1
Linux 4.15.0-1027-aws (xxx) 12/03/2018 _x86_64_ (48 CPU)
09:43:53 PM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
09:43:54 PM lo 15.00 15.00 1.31 1.31 0.00 0.00 0.00 0.00
09:43:54 PM eth0 26392.00 33744.00 19361.43 28065.36 0.00 0.00 0.00 0.00
09:43:53 PM active/s passive/s iseg/s oseg/s
09:43:54 PM 18.00 132.00 17512.00 33760.00
09:43:53 PM atmptf/s estres/s retrans/s isegerr/s orsts/s
09:43:54 PM 0.00 0.00 11.00 0.00 0.00
[…]
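One derived figure that is often useful from the sar output above is the TCP retransmit percentage: retrans/s as a share of oseg/s. A minimal sketch, assuming the two values have already been read from the ETCP and TCP lines (the function name is hypothetical):

```python
def retransmit_pct(retrans_per_s, oseg_per_s):
    """Retransmitted segments as a percentage of segments sent."""
    if oseg_per_s == 0:
        return 0.0  # no outbound segments in this interval
    return 100.0 * retrans_per_s / oseg_per_s


# Values from the interval shown above: retrans/s = 11.00, oseg/s = 33760.00
print(f"{retransmit_pct(11.00, 33760.00):.3f}%")
```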
Exception: Containers
● Most Linux tools are still not container aware
– From the container, will show the full host
● We expose cgroup metrics in our cloud GUIs: Vector
Vector: Instance/Container Analysis
2. Profiling
Experience:
“ZFS is eating my CPUs”
Application (truncated)
38% kernel time (why?)
CPU Mixed-Mode Flame Graph
Zoomed
Java Profilers System Profilers
2014: Java Profiling
Java
JVM
Kernel
GC
2018: Java Profiling
CPU Mixed-mode Flame Graph
CPU Flame Graph
CPU Flame Chart (same data)
CPU Flame Graphs
a()
b() h()
c()
d()
e() f()
g()
i()
CPU Flame Graphs
a()
b() h()
c()
d()
e() f()
g()
i()
Top edge:
Who is running on CPU
And how much (width)
Ancestry
● Y-axis: stack depth
– 0 at bottom
– 0 at top == icicle graph
● X-axis: alphabet
– Time == flame chart
● Color: random
– Hues often used for language types
– Can be a dimension, eg, CPI
Application Profiling
● Primary approach:
– CPU mixed-mode flame graphs (eg, via Linux perf)
– May need frame pointers (eg, Java -XX:+PreserveFramePointer)
– May need a symbol file (eg, Java perf-map-agent, Node.js --perf-basic-prof)
● Secondary:
– Application profiler (eg, via Lightweight Java Profiler)
– Application logs
Vector: Push-button Flame Graphs
Future: eBPF-based Profiling
Linux 2.6: perf record → perf.data → perf script → stackcollapse-perf.pl → flamegraph.pl
Linux 4.9: profile.py → flamegraph.pl
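In the perf-based pipeline, the stack-collapsing step turns sampled stacks into the folded format that flamegraph.pl reads: frames joined by semicolons, then a space and a sample count. A rough sketch of that folding idea follows. It is not stackcollapse-perf.pl itself, it assumes stacks are already parsed into lists ordered root first, and the sample stacks reuse the a(), b(), c(), h() names from the flame graph slide purely as made-up data.

```python
from collections import Counter


def fold_stacks(samples):
    """Count identical stacks; each stack is a list of frames, root first."""
    folded = Counter()
    for stack in samples:
        folded[";".join(stack)] += 1
    return folded


def format_folded(folded):
    """Render as 'frame;frame;frame count' lines, sorted by stack."""
    return [f"{stack} {count}" for stack, count in sorted(folded.items())]


# Made-up samples for illustration only
samples = [["a()", "b()", "c()"], ["a()", "b()", "c()"], ["a()", "h()"]]
for line in format_folded(fold_stacks(samples)):
    print(line)
```

Sorting the folded stacks is what gives the flame graph its alphabetical x-axis, so identical ancestry merges into wide frames.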
3. Tracing
Core Linux Tracers
Ftrace 2.6.27+ Tracing views
perf 2.6.31+ Official profiler & tracer
eBPF 4.9+ Programmatic engine
bcc - Complex tools
bpftrace - Short scripts
Plus other kernel tech:
kprobes, uprobes
Experience: Disk %Busy
# iostat -x 1
[…]
avg-cpu: %user %nice %system %iowait %steal %idle
5.37 0.00 0.77 0.00 0.00 93.86
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
xvdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
xvdj 0.00 0.00 139.00 0.00 1056.00 0.00 15.19 0.88 6.19 6.19 0.00 6.30 87.60
[…]
# /apps/perf-tools/bin/iolatency 10
Tracing block I/O. Output every 10 seconds. Ctrl-C to end.
>=(ms) .. <(ms) : I/O |Distribution |
0 -> 1 : 421 |######################################|
1 -> 2 : 95 |######### |
2 -> 4 : 48 |##### |
4 -> 8 : 108 |########## |
8 -> 16 : 363 |################################# |
16 -> 32 : 66 |###### |
32 -> 64 : 3 |# |
64 -> 128 : 7 |# |
^C
# /apps/perf-tools/bin/iosnoop
Tracing block I/O. Ctrl-C to end.
COMM PID TYPE DEV BLOCK BYTES LATms
java 30603 RM 202,144 1670768496 8192 0.28
cat 6587 R 202,0 1727096 4096 10.07
cat 6587 R 202,0 1727120 8192 10.21
cat 6587 R 202,0 1727152 8192 10.43
java 30603 RM 202,144 620864512 4096 7.69
java 30603 RM 202,144 584767616 8192 16.12
java 30603 RM 202,144 601721984 8192 9.28
java 30603 RM 202,144 603721568 8192 9.06
java 30603 RM 202,144 61067936 8192 0.97
java 30603 RM 202,144 1678557024 8192 0.34
java 30603 RM 202,144 55299456 8192 0.61
java 30603 RM 202,144 1625084928 4096 12.00
java 30603 RM 202,144 618895408 8192 16.99
java 30603 RM 202,144 581318480 8192 13.39
java 30603 RM 202,144 1167348016 8192 9.92
java 30603 RM 202,144 51561280 8192 22.17
[...]
# perf record -e block:block_rq_issue --filter 'rwbs ~ "*M*"' -g -a
# perf report -n --stdio
[...]
# Overhead Samples Command Shared Object Symbol
# ........ ............ ............ ................. ....................
#
70.70% 251 java [kernel.kallsyms] [k] blk_peek_request
|
--- blk_peek_request
do_blkif_request
__blk_run_queue
queue_unplugged
blk_flush_plug_list
blk_finish_plug
_xfs_buf_ioapply
xfs_buf_iorequest
|
|--88.84%-- _xfs_buf_read
| xfs_buf_read_map
| |
| |--87.89%-- xfs_trans_read_buf_map
| | |
| | |--97.96%-- xfs_imap_to_bp
| | | xfs_iread
| | | xfs_iget
| | | xfs_lookup
| | | xfs_vn_lookup
| | | lookup_real
| | | __lookup_hash
| | | lookup_slow
| | | path_lookupat
| | | filename_lookup
| | | user_path_at_empty
| | | user_path_at
| | | vfs_fstatat
| | | |
| | | |--99.48%-- SYSC_newlstat
| | | | sys_newlstat
| | | | system_call_fastpath
| | | | __lxstat64
| | | |Lsun/nio/fs/UnixNativeDispatcher;.lstat0
| | | | 0x7f8f963c847c
# /usr/share/bcc/tools/biosnoop
TIME(s) COMM PID DISK T SECTOR BYTES LAT(ms)
0.000000000 tar 8519 xvda R 110824 4096 6.50
0.004183000 tar 8519 xvda R 111672 4096 4.08
0.016195000 tar 8519 xvda R 4198424 4096 11.88
0.018716000 tar 8519 xvda R 4201152 4096 2.43
0.019416000 tar 8519 xvda R 4201160 4096 0.61
0.032645000 tar 8519 xvda R 4207968 4096 13.16
0.033181000 tar 8519 xvda R 4207976 4096 0.47
0.033524000 tar 8519 xvda R 4208000 4096 0.27
0.033876000 tar 8519 xvda R 4207992 4096 0.28
0.034840000 tar 8519 xvda R 4208008 4096 0.89
0.035713000 tar 8519 xvda R 4207984 4096 0.81
0.036165000 tar 8519 xvda R 111720 4096 0.37
0.039969000 tar 8519 xvda R 8427264 4096 3.69
0.051614000 tar 8519 xvda R 8405640 4096 11.44
0.052310000 tar 8519 xvda R 111696 4096 0.55
0.053044000 tar 8519 xvda R 111712 4096 0.56
0.059583000 tar 8519 xvda R 8411032 4096 6.40
0.068278000 tar 8519 xvda R 4218672 4096 8.57
0.076717000 tar 8519 xvda R 4218968 4096 8.33
0.077183000 tar 8519 xvda R 4218984 4096 0.40
0.082188000 tar 8519 xvda R 8393552 4096 4.94
[...]
eBPF
eBPF: extended Berkeley Packet Filter
Kernel
kprobes
uprobes
tracepoints
sockets
SDN Configuration
User-Defined BPF Programs
…
Event Targets
Runtime
perf_events
BPF
actions
BPF
verifier
DDoS Mitigation
Intrusion Detection
Container Security
Observability
Firewalls (bpfilter)
Device Drivers
bcc
# /usr/share/bcc/tools/tcplife
PID COMM LADDR LPORT RADDR RPORT TX_KB RX_KB MS
2509 java 100.82.34.63 8078 100.82.130.159 12410 0 0 5.44
2509 java 100.82.34.63 8078 100.82.78.215 55564 0 0 135.32
2509 java 100.82.34.63 60778 100.82.207.252 7001 0 13 15126.87
2509 java 100.82.34.63 38884 100.82.208.178 7001 0 0 15568.25
2509 java 127.0.0.1 4243 127.0.0.1 42166 0 0 0.61
12030 upload-mes 127.0.0.1 34020 127.0.0.1 8078 11 0 3.38
12030 upload-mes 127.0.0.1 21196 127.0.0.1 7101 0 0 12.61
3964 mesos-slav 127.0.0.1 7101 127.0.0.1 21196 0 0 12.64
12021 upload-sys 127.0.0.1 34022 127.0.0.1 8078 372 0 15.28
2509 java 127.0.0.1 8078 127.0.0.1 34022 0 372 15.31
2235 dockerd 100.82.34.63 13730 100.82.136.233 7002 0 4 18.50
2235 dockerd 100.82.34.63 34314 100.82.64.53 7002 0 8 56.73
[...]
bpftrace
# biolatency.bt
Attaching 3 probes...
Tracing block device I/O... Hit Ctrl-C to end.
^C
@usecs:
[256, 512) 2 | |
[512, 1K) 10 |@ |
[1K, 2K) 426 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2K, 4K) 230 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[4K, 8K) 9 |@ |
[8K, 16K) 128 |@@@@@@@@@@@@@@@ |
[16K, 32K) 68 |@@@@@@@@ |
[32K, 64K) 0 | |
[64K, 128K) 0 | |
[128K, 256K) 10 |@ |
bpftrace: biolatency.bt
#!/usr/local/bin/bpftrace
BEGIN
{
printf("Tracing block device I/O... Hit Ctrl-C to end.\n");
}
kprobe:blk_account_io_start
{
@start[arg0] = nsecs;
}
kprobe:blk_account_io_completion
/@start[arg0]/
{
@usecs = hist((nsecs - @start[arg0]) / 1000);
delete(@start[arg0]);
}
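The script above stores a start timestamp per request, then passes the elapsed microseconds to hist(), which prints power-of-two buckets such as [256, 512). A small Python sketch of that bucketing idea, assuming non-negative integer latencies; the function names are hypothetical and this does not reproduce bpftrace's exact handling of edge buckets.

```python
from collections import Counter


def log2_bucket(value):
    """Return (low, high) of the power-of-two bucket [low, high) holding value."""
    if value < 1:
        return (0, 1)
    low = 1 << (int(value).bit_length() - 1)  # largest power of two <= value
    return (low, low * 2)


def log2_hist(values):
    """Count values per power-of-two bucket, ordered by bucket."""
    counts = Counter(log2_bucket(v) for v in values)
    return dict(sorted(counts.items()))


# Made-up latencies in microseconds, for illustration only
print(log2_hist([300, 400, 600, 1500, 1800, 1900]))
```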
Future: eBPF GUIs
4. Processor Analysis
What “90% CPU Utilization” might suggest:
What it typically means on the Netflix cloud:
PMCs
● Performance Monitoring Counters help you analyze stalls
● Some instances (eg, Xen-based m4.16xl) have the architectural set:
Instructions Per Cycle (IPC)
<0.2: “bad” (stall-cycle bound)
>2.0: “good*” (instruction bound)
* probably; exception: spin locks
PMCs: EC2 Xen Hypervisor
# perf stat -a -- sleep 30
Performance counter stats for 'system wide':
1921101.773240 task-clock (msec) # 64.034 CPUs utilized (100.00%)
1,103,112 context-switches # 0.574 K/sec (100.00%)
189,173 cpu-migrations # 0.098 K/sec (100.00%)
4,044 page-faults # 0.002 K/sec
2,057,164,531,949 cycles # 1.071 GHz (75.00%)
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
1,357,979,592,699 instructions # 0.66 insns per cycle (75.01%)
243,244,156,173 branches # 126.617 M/sec (74.99%)
4,391,259,112 branch-misses # 1.81% of all branches (75.00%)
30.001112466 seconds time elapsed
# ./pmcarch 1
CYCLES INSTRUCTIONS IPC BR_RETIRED BR_MISPRED BMR% LLCREF LLCMISS LLC%
38222881237 25412094046 0.66 4692322525 91505748 1.95 780435112 117058225 85.00
40754208291 26308406390 0.65 5286747667 95879771 1.81 751335355 123725560 83.53
35222264860 24681830086 0.70 4616980753 86190754 1.87 709841242 113254573 84.05
38176994942 26317856262 0.69 5055959631 92760370 1.83 787333902 119976728 84.76
[...]
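IPC is instructions divided by cycles, which is how the 0.66 in the perf stat output above is derived. A minimal sketch using those counter values, with the <0.2 and >2.0 rules of thumb from the IPC slide; the function names and the "in between" label are assumptions.

```python
def ipc(instructions, cycles):
    """Instructions per cycle from two PMC counts."""
    return instructions / cycles


def classify_ipc(value):
    """Rule-of-thumb reading of IPC; thresholds taken from the slide."""
    if value < 0.2:
        return "likely stall-cycle bound"
    if value > 2.0:
        return "likely instruction bound"
    return "in between"


# Counts from the perf stat output above
value = ipc(1_357_979_592_699, 2_057_164_531_949)
print(f"{value:.2f} {classify_ipc(value)}")
```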
PMCs: EC2 Nitro Hypervisor
● Some instance types (large, Nitro-based) support most PMCs!
● Meltdown KPTI patch TLB miss analysis on a c5.9xl:
nopti:
# tlbstat -C0 1
K_CYCLES K_INSTR IPC DTLB_WALKS ITLB_WALKS K_DTLBCYC K_ITLBCYC DTLB% ITLB%
2854768 2455917 0.86 565 2777 50 40 0.00 0.00
2884618 2478929 0.86 950 2756 6 38 0.00 0.00
2847354 2455187 0.86 396 297403 46 40 0.00 0.00
[...]
pti, nopcid:
# tlbstat -C0 1
K_CYCLES K_INSTR IPC DTLB_WALKS ITLB_WALKS K_DTLBCYC K_ITLBCYC DTLB% ITLB%
2875793 276051 0.10 89709496 65862302 787913 650834 27.40 22.63
2860557 273767 0.10 88829158 65213248 780301 644292 27.28 22.52
2885138 276533 0.10 89683045 65813992 787391 650494 27.29 22.55
2532843 243104 0.10 79055465 58023221 693910 573168 27.40 22.63
[...]
worst case
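The DTLB% and ITLB% columns above are consistent with page-walk cycles taken as a share of total cycles, for example K_DTLBCYC / K_CYCLES. A short sketch of that arithmetic using the first "pti, nopcid" row; the function name is hypothetical, and the formula is inferred from the numbers rather than from the tlbstat source.

```python
def tlb_walk_pct(k_tlb_cycles, k_cycles):
    """Share of cycles spent in TLB page walks, as a percentage."""
    return 100.0 * k_tlb_cycles / k_cycles


# First "pti, nopcid" row: K_CYCLES=2875793, K_DTLBCYC=787913, K_ITLBCYC=650834
print(f"DTLB% {tlb_walk_pct(787913, 2875793):.2f}")
print(f"ITLB% {tlb_walk_pct(650834, 2875793):.2f}")
```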
MSRs
● Model Specific Registers
● System config info, including current clock rate:
# showboost
Base CPU MHz : 2500
Set CPU MHz : 2500
Turbo MHz(s) : 3100 3200 3300 3500
Turbo Ratios : 124% 128% 132% 140%
CPU 0 summary every 1 seconds...
TIME C0_MCYC C0_ACYC UTIL RATIO MHz
23:39:07 1618910294 89419923 64% 5% 138
23:39:08 1774059258 97132588 70% 5% 136
23:39:09 2476365498 130869241 99% 5% 132
^C
Summary
Take-aways
1. Get push-button CPU flame graphs: kernel & user
2. Check out eBPF perf tools: bcc, bpftrace
3. Measure IPC as well as CPU utilization using PMCs
90% CPU busy:
… really means:
Observability
Methodology
Velocity
Observability
Statistics, Flame Graphs, eBPF Tracing, Cloud PMCs
Methodology
USE method, RED method, Drill-down Analysis, …
Velocity
Self-service GUIs: Vector, FlameScope, …
Resources
● 2014 talk From Clouds to Roots: http://www.slideshare.net/brendangregg/netflix-from-clouds-to-roots http://www.youtube.com/watch?v=H-E0MQTID0g
● Chaos: https://medium.com/netflix-techblog/chap-chaos-automation-platform-53e6d528371f https://principlesofchaos.org/
● Atlas: https://github.com/Netflix/Atlas
● Atlas: https://medium.com/netflix-techblog/introducing-atlas-netflixs-primary-telemetry-platform-bd31f4d8ed9a
● RED method: https://thenewstack.io/monitoring-microservices-red-method/
● USE method: https://queue.acm.org/detail.cfm?id=2413037
● Winston: https://medium.com/netflix-techblog/introducing-winston-event-driven-diagnostic-and-remediation-platform-46ce39aa81cc
● Lumen: https://medium.com/netflix-techblog/lumen-custom-self-service-dashboarding-for-netflix-8c56b541548c
● Flame graphs: http://www.brendangregg.com/flamegraphs.html
● Java flame graphs: https://medium.com/netflix-techblog/java-in-flames-e763b3d32166
● Vector: http://vectoross.io https://github.com/Netflix/Vector
● FlameScope: https://github.com/Netflix/FlameScope
● Tracing ponies: thanks Deirdré Straughan & General Zoi's Pony Creator
● ftrace: http://lwn.net/Articles/608497/ - usually already in your kernel
● perf: http://www.brendangregg.com/perf.html - perf is usually packaged in linux-tools-common
● tcplife: https://github.com/iovisor/bcc - often available as a bcc or bcc-tools package
● bpftrace: https://github.com/iovisor/bpftrace
● pmcarch: https://github.com/brendangregg/pmc-cloud-tools
● showboost: https://github.com/brendangregg/msr-cloud-tools - also try turbostat
Netflix Tech Blog
Thank you.
Brendan Gregg
@brendangregg

More Related Content

What's hot

Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016
Brendan Gregg
 
Linux Network Stack
Linux Network StackLinux Network Stack
Linux Network Stack
Adrien Mahieux
 
Extreme Linux Performance Monitoring and Tuning
Extreme Linux Performance Monitoring and TuningExtreme Linux Performance Monitoring and Tuning
Extreme Linux Performance Monitoring and Tuning
Milind Koyande
 
Meet cute-between-ebpf-and-tracing
Meet cute-between-ebpf-and-tracingMeet cute-between-ebpf-and-tracing
Meet cute-between-ebpf-and-tracing
Viller Hsiao
 
Linux kernel tracing
Linux kernel tracingLinux kernel tracing
Linux kernel tracing
Viller Hsiao
 
Netflix: From Clouds to Roots
Netflix: From Clouds to RootsNetflix: From Clouds to Roots
Netflix: From Clouds to Roots
Brendan Gregg
 
Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance Analysis
Brendan Gregg
 
Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at Netflix
Brendan Gregg
 
Java Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame GraphsJava Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame Graphs
Brendan Gregg
 
Kernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at NetflixKernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at Netflix
Brendan Gregg
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)
Brendan Gregg
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of Software
Brendan Gregg
 
Linux Performance Profiling and Monitoring
Linux Performance Profiling and MonitoringLinux Performance Profiling and Monitoring
Linux Performance Profiling and Monitoring
Georg Schönberger
 
Velocity 2015 linux perf tools
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf tools
Brendan Gregg
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019
Brendan Gregg
 
LinuxCon 2015 Linux Kernel Networking Walkthrough
LinuxCon 2015 Linux Kernel Networking WalkthroughLinuxCon 2015 Linux Kernel Networking Walkthrough
LinuxCon 2015 Linux Kernel Networking Walkthrough
Thomas Graf
 
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor Benchmarking
Brendan Gregg
 
BPF - in-kernel virtual machine
BPF - in-kernel virtual machineBPF - in-kernel virtual machine
BPF - in-kernel virtual machine
Alexei Starovoitov
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uring
ScyllaDB
 
The Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast StorageThe Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast Storage
Kernel TLV
 

What's hot (20)

Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016
 
Linux Network Stack
Linux Network StackLinux Network Stack
Linux Network Stack
 
Extreme Linux Performance Monitoring and Tuning
Extreme Linux Performance Monitoring and TuningExtreme Linux Performance Monitoring and Tuning
Extreme Linux Performance Monitoring and Tuning
 
Meet cute-between-ebpf-and-tracing
Meet cute-between-ebpf-and-tracingMeet cute-between-ebpf-and-tracing
Meet cute-between-ebpf-and-tracing
 
Linux kernel tracing
Linux kernel tracingLinux kernel tracing
Linux kernel tracing
 
Netflix: From Clouds to Roots
Netflix: From Clouds to RootsNetflix: From Clouds to Roots
Netflix: From Clouds to Roots
 
Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance Analysis
 
Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at Netflix
 
Java Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame GraphsJava Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame Graphs
 
Kernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at NetflixKernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at Netflix
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of Software
 
Linux Performance Profiling and Monitoring
Linux Performance Profiling and MonitoringLinux Performance Profiling and Monitoring
Linux Performance Profiling and Monitoring
 
Velocity 2015 linux perf tools
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf tools
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019
 
LinuxCon 2015 Linux Kernel Networking Walkthrough
LinuxCon 2015 Linux Kernel Networking WalkthroughLinuxCon 2015 Linux Kernel Networking Walkthrough
LinuxCon 2015 Linux Kernel Networking Walkthrough
 
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor Benchmarking
 
BPF - in-kernel virtual machine
BPF - in-kernel virtual machineBPF - in-kernel virtual machine
BPF - in-kernel virtual machine
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uring
 
The Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast StorageThe Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast Storage
 

Similar to YOW2018 Cloud Performance Root Cause Analysis at Netflix

(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
Amazon Web Services
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
GetInData
 
Flink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paasFlink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paas
Monal Daxini
 
Monitoring in Motion: Monitoring Containers and Amazon ECS
Monitoring in Motion: Monitoring Containers and Amazon ECSMonitoring in Motion: Monitoring Containers and Amazon ECS
Monitoring in Motion: Monitoring Containers and Amazon ECS
Amazon Web Services
 
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
Puppet
 
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
YOW2018 Cloud Performance Root Cause Analysis at Netflix

  • 1. YOW! Conference Australia Nov-Dec 2018 Brendan Gregg Senior Performance Architect Cloud and Platform Engineering Cloud Performance Root Cause Analysis at Netflix
  • 4. # perf record -F99 -a
       # perf script
       […]
       java 14327 [022] 252764.179741: cycles: 7f36570a4932 SpinPause (/usr/lib/jvm/java-8
       java 14315 [014] 252764.183517: cycles: 7f36570a4932 SpinPause (/usr/lib/jvm/java-8
       java 14310 [012] 252764.185317: cycles: 7f36570a4932 SpinPause (/usr/lib/jvm/java-8
       java 14332 [015] 252764.188720: cycles: 7f3658078350 pthread_cond_wait@@GLIBC_2.3.2
       java 14341 [019] 252764.191307: cycles: 7f3656d150c8 ClassLoaderDataGraph::do_unloa
       java 14341 [019] 252764.198825: cycles: 7f3656d140b8 ClassLoaderData::free_dealloca
       java 14341 [019] 252764.207057: cycles: 7f3657192400 nmethod::do_unloading(BoolObje
       java 14341 [019] 252764.215962: cycles: 7f3656ba807e Assembler::locate_operand(unsi
       java 14341 [019] 252764.225141: cycles: 7f36571922e8 nmethod::do_unloading(BoolObje
       java 14341 [019] 252764.234578: cycles: 7f3656ec4960 CodeHeap::block_start(void*) c
       […]
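One quick way to summarize on-CPU samples like those on slide 4 is to count them by symbol. The sketch below is only an illustration: it assumes the one-line-per-sample `perf script` layout shown on the slide (command, PID, CPU, timestamp, event, address, symbol), the names `SAMPLE_RE`, `count_symbols`, and `SLIDE_SAMPLES` are hypothetical, and the sample text is copied from the slide, truncation included.

```python
import re
from collections import Counter

# Assumed layout: comm pid [cpu] timestamp: event: address symbol ...
SAMPLE_RE = re.compile(
    r"^\s*(?P<comm>\S+)\s+(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<ts>[\d.]+):\s+(?P<event>[^:]+):\s+(?P<addr>[0-9a-f]+)\s+(?P<sym>\S+)"
)


def count_symbols(lines):
    """Count samples per symbol, dropping any argument list after '('."""
    counts = Counter()
    for line in lines:
        match = SAMPLE_RE.match(line)
        if match:  # lines such as "[…]" do not match and are skipped
            counts[match.group("sym").split("(")[0]] += 1
    return counts


# Sample lines copied from the slide (symbols are truncated in the source).
SLIDE_SAMPLES = """\
java 14327 [022] 252764.179741: cycles: 7f36570a4932 SpinPause (/usr/lib/jvm/java-8
java 14315 [014] 252764.183517: cycles: 7f36570a4932 SpinPause (/usr/lib/jvm/java-8
java 14310 [012] 252764.185317: cycles: 7f36570a4932 SpinPause (/usr/lib/jvm/java-8
java 14332 [015] 252764.188720: cycles: 7f3658078350 pthread_cond_wait@@GLIBC_2.3.2
java 14341 [019] 252764.191307: cycles: 7f3656d150c8 ClassLoaderDataGraph::do_unloa
java 14341 [019] 252764.198825: cycles: 7f3656d140b8 ClassLoaderData::free_dealloca
java 14341 [019] 252764.207057: cycles: 7f3657192400 nmethod::do_unloading(BoolObje
java 14341 [019] 252764.215962: cycles: 7f3656ba807e Assembler::locate_operand(unsi
java 14341 [019] 252764.225141: cycles: 7f36571922e8 nmethod::do_unloading(BoolObje
java 14341 [019] 252764.234578: cycles: 7f3656ec4960 CodeHeap::block_start(void*) c
""".splitlines()

if __name__ == "__main__":
    for symbol, n in count_symbols(SLIDE_SAMPLES).most_common():
        print(n, symbol)
```

With only ten samples this is a toy; the same counting applied to a full profile is what flame graphs visualize.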
  • 8. Root Cause Analysis at Netflix … … Load Netflix Application ELB ASG Cluster SG ASG 1 Instances (Linux) JVM Tomcat Service Hystrix Zuul 2 AZ 1 ASG 2 AZ 2 AZ 3 … Devices Eureka Ribbon gRPC Roots Atlas Chronos Zipkin Vector sar, *stat ftrace bcc/eBPF bpftrace PMCs, MSRs
  • 9. Agenda 1. The Netflix Cloud 2. Methodology 3. Cloud Analysis 4. Instance Analysis
  • 10. Since 2014: Asgard → Spinnaker; Salp → Zipkin; gRPC adoption; New Atlas UI & Lumen; Java frame pointer; eBPF bcc & bpftrace; PMCs in EC2; ... From Clouds to Roots (2014 presentation): Old Atlas UI
  • 11. >150k AWS EC2 server instances ~34% US Internet traffic at night >130M members Performance is customer satisfaction & Netflix cost
  • 12. Acronyms
       AWS: Amazon Web Services
       EC2: AWS Elastic Compute Cloud (cloud instances)
       S3: AWS Simple Storage Service (object store)
       ELB: AWS Elastic Load Balancers
       SQS: AWS Simple Queue Service
       SES: AWS Simple Email Service
       CDN: Content Delivery Network
       OCA: Netflix Open Connect Appliance (streaming CDN)
       QoS: Quality of Service
       AMI: Amazon Machine Image (instance image)
       ASG: Auto Scaling Group
       AZ: Availability Zone
       NIWS: Netflix Internal Web Service framework (Ribbon)
       gRPC: gRPC Remote Procedure Calls
       MSR: Model Specific Register (CPU info register)
       PMC: Performance Monitoring Counter (CPU perf counter)
       eBPF: extended Berkeley Packet Filter (kernel VM)
  • 13. 1. The Netflix Cloud Overview
  • 15. Netflix Microservices User Data Personalization Viewing Hist. Authentication Web Site API Streaming API Client Devices DRM QoS Logging CDN Steering Encoding OCA CDN … EC2
  • 16. Freedom and Responsibility ● Culture deck memo is true – https://jobs.netflix.com/culture ● Deployment freedom – Purchase and use cloud instances without approvals – Netflix environment changes fast!
  • 17. ● Usually open source ● Linux, Java, Cassandra, Node.js, … ● http://netflix.github.io/ Cloud Technologies
  • 18. Cloud Instances. Typical BaseAMI: Linux (Ubuntu); Java (JDK 8); Tomcat; GC and thread dump logging; Application war files, base servlet, platform, hystrix, health check, metrics (Servo); Optional Apache, memcached, non-Java apps (incl. Node.js, golang); Atlas monitoring, S3 log rotation, ftrace, perf, bcc/eBPF
  • 19. 5 Key Issues And How the Netflix Cloud is Architected to Solve Them
  • 20. 1. Load Increases → Auto Scaling Groups – Instances automatically added or removed by a custom scaling policy – Alerts & monitoring used to check scaling is sane – Good for customers: Fast workaround – Good for engineers: Fix later, 9-5 Scaling Policy loadavg, latency, … CloudWatch, Servo ASG Instance Instance Instance Instance
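The slide does not show the scaling policy itself, so the following is only a sketch of one common shape such a policy can take: scale the group so that a per-instance metric (for example load average or latency) returns to a target value, clamped to group limits. The function name `desired_instances`, its parameters, and the proportional rule are assumptions for illustration, not Netflix's policy or an AWS API.

```python
import math


def desired_instances(current, metric, target, min_size, max_size):
    """Return a group size that would bring `metric` back to `target`.

    Assumes the metric scales roughly inversely with instance count,
    which is an illustrative simplification.
    """
    if current <= 0 or target <= 0:
        raise ValueError("current and target must be positive")
    wanted = math.ceil(current * metric / target)
    # Clamp to the group's configured bounds.
    return max(min_size, min(max_size, wanted))


if __name__ == "__main__":
    # 10 instances at a metric of 80 with a target of 50: grow the group.
    print(desired_instances(10, 80.0, 50.0, min_size=2, max_size=100))
```

The clamp matters as much as the formula: as the slide notes, alerts and monitoring are still needed to check that scaling stays sane.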
  • 21. 2. Bad Push → ASG Cluster Rollback – ASG red black clusters: how code versions are deployed – Fast rollback for issues – Traffic managed by Elastic Load Balancers (ELBs) – Automated Canary Analysis (ACA) for testing … ASG-v011 … ASG-v010 ASG Cluster prod1 Canary ELB Instance Instance Instance Instance Instance Instance
  • 22. 3. Instance Failure → Hystrix Timeouts – Hystrix: latency and fault tolerance for dependency services: fallbacks, degradation, fast fail and rapid recovery, timeouts, load shedding, circuit breaker, realtime monitoring – Plus Ribbon or gRPC for more fault tolerance Tomcat Application Hystrix get A Dependency A1 Dependency A2 >100ms
  • 23. 4. Region Failure → Zuul 2 Reroute Traffic – All device traffic goes through the Zuul 2 proxy: dynamic routing, monitoring, resiliency, security – Region or AZ failure: reroute traffic to another region Zuul 2, DNS Monitoring Region 1 Region 2 Region 3
  • 24. 5. Overlooked Issue → Chaos Engineering Instances: termination Availability Zones: artificial failures Latency: artificial delays by ChAP Conformity: kills non-best-practices instances Doctor: health checks Janitor: kills unused instances Security: checks violations 10-18: geographic issues (Resilience)
  • 25. A Resilient Architecture … … Load Netflix Application ELB ASG Cluster SG ASG 1 Instances (Linux) JVM Tomcat Service Hystrix Zuul 2 AZ 1 ASG 2 AZ 2 AZ 3 … Devices Chaos Engineering Some services vary: - Apache Web Server - Node.js & Prana - golang Eureka Ribbon gRPC
  • 27. Why Do Root Cause Perf Analysis? … … Netflix Application ELB ASG Cluster SG ASG 1 Instances (Linux) JVM Tomcat Service AZ 1 ASG 2 AZ 2 AZ 3 … Often for: ● High latency ● Growth ● Upgrades
  • 28. Cloud Methodologies ● Resource Analysis ● Metric and event correlations ● Latency Drilldowns ● RED Method Service A Service C Service B Service D For each microservice, check: - Rate - Errors - Duration
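As a rough illustration of the RED method, the sketch below computes rate, errors, and duration per microservice from a list of request records. The record layout `(service, duration_ms, ok)`, the function name `red_summary`, and the choice of mean and max as duration statistics are assumptions; in practice these numbers would come from a metrics system rather than raw records.

```python
from collections import defaultdict
from statistics import mean


def red_summary(requests, window_s):
    """Summarize Rate, Errors, Duration per service.

    requests: iterable of (service, duration_ms, ok) tuples observed
    during a window of `window_s` seconds.
    """
    by_service = defaultdict(list)
    for service, duration_ms, ok in requests:
        by_service[service].append((duration_ms, ok))

    summary = {}
    for service, rows in by_service.items():
        durations = [d for d, _ in rows]
        errors = sum(1 for _, ok in rows if not ok)
        summary[service] = {
            "rate_per_s": len(rows) / window_s,   # Rate
            "error_ratio": errors / len(rows),    # Errors
            "mean_duration_ms": mean(durations),  # Duration
            "max_duration_ms": max(durations),
        }
    return summary


if __name__ == "__main__":
    sample = [("A", 10, True), ("A", 30, True), ("A", 20, False),
              ("A", 20, True), ("B", 5, True), ("B", 7, True)]
    for name, stats in red_summary(sample, window_s=2).items():
        print(name, stats)
```

Checking the same three numbers for every service in a dependency graph is what makes the method a quick first pass.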
  • 29. Instance Methodologies ● Log Analysis ● Micro-benchmarking ● Drill-down analysis ● USE Method CPU Memory Disk Controller Network Controller Disk Disk Net Net For each resource, check: - Utilization - Saturation - Errors
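The USE method can be read as a checklist applied to every resource. The sketch below encodes that checklist over already-collected samples; the `UseSample` type, the `use_findings` function, and the 80% utilization threshold are illustrative assumptions, and where each number comes from (for example sar or vmstat) is left out.

```python
from dataclasses import dataclass


@dataclass
class UseSample:
    resource: str           # e.g. "CPU", "Memory", "Disk", "Net"
    utilization_pct: float  # percent busy over the interval
    saturation: float       # queued or waiting work (0 means none)
    errors: int             # error events in the interval


def use_findings(samples, util_threshold_pct=80.0):
    """Return (resource, check) pairs that deserve a closer look."""
    findings = []
    for s in samples:
        if s.utilization_pct >= util_threshold_pct:
            findings.append((s.resource, "utilization"))
        if s.saturation > 0:
            findings.append((s.resource, "saturation"))
        if s.errors > 0:
            findings.append((s.resource, "errors"))
    return findings


if __name__ == "__main__":
    observed = [
        UseSample("CPU", 95.0, 3.0, 0),
        UseSample("Disk", 20.0, 0.0, 0),
        UseSample("Net", 40.0, 0.0, 2),
    ]
    print(use_findings(observed))
```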
  • 30. Bad Instance Anti-Method 1. Plot request latency per-instance 2. Find the bad instance 3. Terminate it 4. Someone else’s problem now! Bad instance latency Terminate! Could be an early warning of a bigger issue
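Steps 1 and 2 of the anti-method (plot latency per instance, find the bad one) amount to outlier detection. A minimal sketch follows, assuming per-instance latency has already been aggregated; the function name `find_outliers`, the instance IDs, and the "3x the fleet median" rule are hypothetical. In keeping with the slide's warning, it only reports outliers so they can be investigated as a possible early warning, and does not terminate anything.

```python
from statistics import median


def find_outliers(latency_ms_by_instance, factor=3.0):
    """Return instance IDs whose latency exceeds factor x fleet median."""
    fleet_median = median(latency_ms_by_instance.values())
    return sorted(
        instance
        for instance, latency in latency_ms_by_instance.items()
        if latency > factor * fleet_median
    )


if __name__ == "__main__":
    # Hypothetical instance IDs and request latencies in milliseconds.
    fleet = {"i-a": 10, "i-b": 12, "i-c": 11, "i-d": 95}
    print(find_outliers(fleet))
```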
  • 31. 3. Cloud Analysis Atlas, Lumen, Chronos, ...
  • 32. Netflix Cloud Analysis Process. Tools pictured: Atlas Alerts; Atlas/Lumen Dashboards; Atlas Metrics; Zipkin; Slalom; PICSOU; Chronos; Slack (Chat); Instance Analysis. Example path enumerated: 1. Check Issue 2. Check Events 3. Drill Down 4. Check Dependencies 5. Root Cause. Other labels: Create New Alert; Cost; Redirected to a new Target. Plus some other tools not pictured
  • 33. Atlas: Alerts Custom alerts on streams per second (SPS) changes, CPU usage, latency, ASG growth, client errors, …
  • 35. Winston: Automated Diagnostics & Remediation Chronos: Possible Related Events Links to Atlas Dashboards & Metrics
  • 37. Atlas: Dashboards. Netflix perf vitals dashboard: 1. RPS, CPU 2. Volume 3. Instances 4. Scaling 5. CPU/RPS 6. Load avg 7. Java heap 8. ParNew 9. Latency 10. 99th percentile
  • 38. Atlas & Lumen: Custom Dashboards ● Dashboards are a checklist methodology: what to show first, second, third... ● Starting point for issues: 1. Confirm and quantify issue 2. Check historic trend 3. Atlas metrics to drill down. Lumen: more flexible dashboards, e.g., go/burger
  • 41. Atlas: Metrics ● All metrics in one system – System metrics: CPU usage, disk I/O, memory, … – Application metrics: latency percentiles, errors, … ● Filters or breakdowns by region, application, ASG, metric, instance ● URL has session state: shareable
  • 43. Chronos: Change Tracking Scope Time Range Event Log
  • 46. Zipkin UI: Dependency Tracing Dependency Latency
  • 47. PICSOU: AWS Usage Cost per hour Breakdowns Details (redacted)
  • 48. Slack: Chat Latency is high in us-east-1 Sorry We just did a bad push
  • 49. Netflix Cloud Analysis Process ● Tools: Atlas Alerts, Atlas/Lumen Dashboards, Chronos, Atlas Metrics, Zipkin, Slalom, PICSOU, Slack, Instance Analysis ● Example path enumerated: 1. Check Issue 2. Check Events 3. Drill Down 4. Check Dependencies 5. Root Cause ● Other paths: Create New Alert, Cost, Chat, Redirected to a new Target ● Plus some other tools not pictured
  • 50. Generic Cloud Analysis Process ● Tools: Alerts, Custom Dashboards, Change Tracking, Metric Analysis, Dependency Analysis, Usage Reports, Messaging, Instance Analysis ● Example path enumerated: 1. Check Issue 2. Check Events 3. Drill Down 4. Check Dependencies 5. Root Cause ● Other paths: Create New Alert, Cost, Chat, Redirected to a new Target ● Plus other tools as needed
  • 51. 4. Instance Analysis 1. Statistics 2. Profiling 3. Tracing 4. Processor Analysis
  • 55. Linux Tools ● vmstat, pidstat, sar, etc, used mostly normally ● Micro benchmarking can be used to investigate hypervisor behavior that can’t be observed directly
    $ sar -n TCP,ETCP,DEV 1
    Linux 4.15.0-1027-aws (xxx)   12/03/2018   _x86_64_   (48 CPU)

    09:43:53 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
    09:43:54 PM        lo     15.00     15.00      1.31      1.31      0.00      0.00      0.00      0.00
    09:43:54 PM      eth0  26392.00  33744.00  19361.43  28065.36      0.00      0.00      0.00      0.00

    09:43:53 PM  active/s passive/s    iseg/s    oseg/s
    09:43:54 PM     18.00    132.00  17512.00  33760.00

    09:43:53 PM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
    09:43:54 PM      0.00      0.00     11.00      0.00      0.00
    […]
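One way to read the TCP rows of the sar output above is to express retransmits as a share of segments sent. The sketch below does that arithmetic in Python using the `retrans/s` and `oseg/s` values from the slide; the function name is made up for illustration, and what counts as a worrying ratio is left to the reader.

```python
def retransmit_pct(retrans_per_s, oseg_per_s):
    """TCP retransmitted segments as a percentage of segments sent."""
    if oseg_per_s == 0:
        return 0.0
    return 100.0 * retrans_per_s / oseg_per_s


# Values from the sar sample: retrans/s = 11.00, oseg/s = 33760.00
pct = retransmit_pct(11.00, 33760.00)
print(f"retransmits: {pct:.3f}% of output segments")
```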
  • 56. Exception: Containers ● Most Linux tools are still not container aware – Run from within a container, they show the full host ● We expose cgroup metrics in our cloud GUIs: Vector
  • 60. Application (truncated) 38% kernel time (why?) CPU Mixed-Mode Flame Graph
  • 62. Java Profilers System Profilers 2014: Java Profiling
  • 65. CPU Flame Chart (same data)
  • 66. CPU Flame Graphs a() b() h() c() d() e() f() g() i()
  • 67. CPU Flame Graphs (example frames: a() b() h() c() d() e() f() g() i()) ● Top edge: who is running on CPU, and how much (width) ● Beneath the top edge: ancestry ● Y-axis: stack depth – 0 at bottom – 0 at top == icicle graph ● X-axis: alphabet – Time == flame chart ● Color: random – Hues often used for language types – Can be a dimension, eg, CPI
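To make the axes concrete, here is a small Python sketch of the folding step behind a flame graph: identical stacks are counted, and a frame's width is the number of samples that share its ancestry. The sample stacks are hypothetical (they only reuse the letter names from the slide), and this is an illustration of the idea rather than a reimplementation of flamegraph.pl.

```python
from collections import Counter


def fold_stacks(samples):
    """Count identical stacks; each sample is a list of frames, root first."""
    return Counter(";".join(stack) for stack in samples)


def frame_widths(folded, depth):
    """Sum sample counts per frame at a given depth (0 = root).

    Frames are keyed by their full ancestry, so two frames with the same
    name but different parents are not merged.
    """
    widths = Counter()
    for stack, count in folded.items():
        frames = stack.split(";")
        if depth < len(frames):
            widths[tuple(frames[: depth + 1])] += count
    return widths


# Hypothetical stack samples, root first.
samples = [
    ["a", "b", "c"],
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["a", "h"],
]
folded = fold_stacks(samples)
for stack, count in sorted(folded.items()):  # sorted: the alphabetical x-axis
    print(stack, count)
```

Sorting the folded stacks alphabetically, rather than by time, is what lets adjacent identical frames merge into wide towers.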
  • 68. Application Profiling ● Primary approach: – CPU mixed-mode flame graphs (eg, via Linux perf) – May need frame pointers (eg, Java -XX:+PreserveFramePointer) – May need a symbol file (eg, Java perf-map-agent, Node.js --perf-basic-prof) ● Secondary: – Application profiler (eg, via Lightweight Java Profiler) – Application logs
  • 70. Future: eBPF-based Profiling ● Linux 2.6: perf record → perf.data → perf script → stackcollapse-perf.pl → flamegraph.pl ● Linux 4.9: profile.py → flamegraph.pl
  • 73. Core Linux Tracers ● Ftrace (2.6.27+): tracing views ● perf (2.6.31+): official profiler & tracer ● eBPF (4.9+): programmatic engine – bcc: complex tools – bpftrace: short scripts ● Plus other kernel tech: kprobes, uprobes
  • 75. # iostat -x 1
    […]
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               5.37    0.00    0.77    0.00    0.00   93.86

    Device:  rrqm/s  wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    xvda       0.00    0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
    xvdb       0.00    0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
    xvdj       0.00    0.00  139.00    0.00  1056.00     0.00    15.19     0.88    6.19    6.19    0.00   6.30  87.60
    […]
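The derived columns in the iostat output above can be checked by hand. As a sketch, the Python below recomputes `avgrq-sz` for the `xvdj` row from its throughput and IOPS columns, assuming the column is reported in 512-byte sectors; the function name is made up for this example.

```python
def avg_request_sectors(r_s, w_s, rkb_s, wkb_s, sector_bytes=512):
    """Average I/O request size in sectors, from iostat rate columns."""
    iops = r_s + w_s
    if iops == 0:
        return 0.0
    return (rkb_s + wkb_s) * 1024 / sector_bytes / iops


# xvdj row from the slide: r/s=139.00, w/s=0.00, rkB/s=1056.00, wkB/s=0.00
xvdj = avg_request_sectors(139.00, 0.00, 1056.00, 0.00)
print(f"{xvdj:.2f} sectors, about {xvdj * 512 / 1024:.1f} kB per I/O")
```

The result matches the 15.19 shown in the `avgrq-sz` column, which means these are small reads of roughly 7.6 kB each.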
  • 76. # /apps/perf-tools/bin/iolatency 10
    Tracing block I/O. Output every 10 seconds. Ctrl-C to end.

      >=(ms) .. <(ms)   : I/O      |Distribution                          |
           0 -> 1       : 421      |######################################|
           1 -> 2       : 95       |#########                             |
           2 -> 4       : 48       |#####                                 |
           4 -> 8       : 108      |##########                            |
           8 -> 16      : 363      |#################################     |
          16 -> 32      : 66       |######                                |
          32 -> 64      : 3        |#                                     |
          64 -> 128     : 7        |#                                     |
    ^C
  • 77. # /apps/perf-tools/bin/iosnoop
    Tracing block I/O. Ctrl-C to end.
    COMM     PID    TYPE DEV      BLOCK        BYTES   LATms
    java     30603  RM   202,144  1670768496   8192     0.28
    cat      6587   R    202,0    1727096      4096    10.07
    cat      6587   R    202,0    1727120      8192    10.21
    cat      6587   R    202,0    1727152      8192    10.43
    java     30603  RM   202,144  620864512    4096     7.69
    java     30603  RM   202,144  584767616    8192    16.12
    java     30603  RM   202,144  601721984    8192     9.28
    java     30603  RM   202,144  603721568    8192     9.06
    java     30603  RM   202,144  61067936     8192     0.97
    java     30603  RM   202,144  1678557024   8192     0.34
    java     30603  RM   202,144  55299456     8192     0.61
    java     30603  RM   202,144  1625084928   4096    12.00
    java     30603  RM   202,144  618895408    8192    16.99
    java     30603  RM   202,144  581318480    8192    13.39
    java     30603  RM   202,144  1167348016   8192     9.92
    java     30603  RM   202,144  51561280     8192    22.17
    [...]
  • 78. # perf record -e block:block_rq_issue --filter 'rwbs ~ "*M*"' -g -a
    # perf report -n --stdio
    [...]
    # Overhead       Samples  Command       Shared Object      Symbol
    # ........  ............  ............  .................  ....................
    #
        70.70%           251  java          [kernel.kallsyms]  [k] blk_peek_request
                |
                --- blk_peek_request
                    do_blkif_request
                    __blk_run_queue
                    queue_unplugged
                    blk_flush_plug_list
                    blk_finish_plug
                    _xfs_buf_ioapply
                    xfs_buf_iorequest
                    |
                    |--88.84%-- _xfs_buf_read
                    |          xfs_buf_read_map
                    |          |
                    |          |--87.89%-- xfs_trans_read_buf_map
                    |          |          |
                    |          |          |--97.96%-- xfs_imap_to_bp
                    |          |          |          xfs_iread
                    |          |          |          xfs_iget
                    |          |          |          xfs_lookup
                    |          |          |          xfs_vn_lookup
                    |          |          |          lookup_real
                    |          |          |          __lookup_hash
                    |          |          |          lookup_slow
                    |          |          |          path_lookupat
                    |          |          |          filename_lookup
                    |          |          |          user_path_at_empty
                    |          |          |          user_path_at
                    |          |          |          vfs_fstatat
                    |          |          |          |
                    |          |          |          |--99.48%-- SYSC_newlstat
                    |          |          |          |          sys_newlstat
                    |          |          |          |          system_call_fastpath
                    |          |          |          |          __lxstat64
                    |          |          |          |Lsun/nio/fs/UnixNativeDispatcher;.lstat0
                    |          |          |          |          0x7f8f963c847c
  • 80. # /usr/share/bcc/tools/biosnoop
    TIME(s)        COMM   PID    DISK  T  SECTOR    BYTES   LAT(ms)
    0.000000000    tar    8519   xvda  R  110824    4096       6.50
    0.004183000    tar    8519   xvda  R  111672    4096       4.08
    0.016195000    tar    8519   xvda  R  4198424   4096      11.88
    0.018716000    tar    8519   xvda  R  4201152   4096       2.43
    0.019416000    tar    8519   xvda  R  4201160   4096       0.61
    0.032645000    tar    8519   xvda  R  4207968   4096      13.16
    0.033181000    tar    8519   xvda  R  4207976   4096       0.47
    0.033524000    tar    8519   xvda  R  4208000   4096       0.27
    0.033876000    tar    8519   xvda  R  4207992   4096       0.28
    0.034840000    tar    8519   xvda  R  4208008   4096       0.89
    0.035713000    tar    8519   xvda  R  4207984   4096       0.81
    0.036165000    tar    8519   xvda  R  111720    4096       0.37
    0.039969000    tar    8519   xvda  R  8427264   4096       3.69
    0.051614000    tar    8519   xvda  R  8405640   4096      11.44
    0.052310000    tar    8519   xvda  R  111696    4096       0.55
    0.053044000    tar    8519   xvda  R  111712    4096       0.56
    0.059583000    tar    8519   xvda  R  8411032   4096       6.40
    0.068278000    tar    8519   xvda  R  4218672   4096       8.57
    0.076717000    tar    8519   xvda  R  4218968   4096       8.33
    0.077183000    tar    8519   xvda  R  4218984   4096       0.40
    0.082188000    tar    8519   xvda  R  8393552   4096       4.94
    [...]
  • 81. eBPF
  • 82. eBPF: extended Berkeley Packet Filter ● User-Defined BPF Programs: SDN Configuration, DDoS Mitigation, Intrusion Detection, Container Security, Observability, Firewalls (bpfilter), Device Drivers, … ● Kernel Runtime: BPF verifier, BPF actions ● Kernel Event Targets: sockets, kprobes, uprobes, tracepoints, perf_events
  • 84. bcc
    # /usr/share/bcc/tools/tcplife
    PID    COMM        LADDR         LPORT  RADDR           RPORT  TX_KB  RX_KB  MS
    2509   java        100.82.34.63  8078   100.82.130.159  12410      0      0  5.44
    2509   java        100.82.34.63  8078   100.82.78.215   55564      0      0  135.32
    2509   java        100.82.34.63  60778  100.82.207.252  7001       0     13  15126.87
    2509   java        100.82.34.63  38884  100.82.208.178  7001       0      0  15568.25
    2509   java        127.0.0.1     4243   127.0.0.1       42166      0      0  0.61
    12030  upload-mes  127.0.0.1     34020  127.0.0.1       8078      11      0  3.38
    12030  upload-mes  127.0.0.1     21196  127.0.0.1       7101       0      0  12.61
    3964   mesos-slav  127.0.0.1     7101   127.0.0.1       21196      0      0  12.64
    12021  upload-sys  127.0.0.1     34022  127.0.0.1       8078     372      0  15.28
    2509   java        127.0.0.1     8078   127.0.0.1       34022      0    372  15.31
    2235   dockerd     100.82.34.63  13730  100.82.136.233  7002       0      4  18.50
    2235   dockerd     100.82.34.63  34314  100.82.64.53    7002       0      8  56.73
    [...]
  • 85. bpftrace
    # biolatency.bt
    Attaching 3 probes...
    Tracing block device I/O... Hit Ctrl-C to end.
    ^C

    @usecs:
    [256, 512)             2 |                                                    |
    [512, 1K)             10 |@                                                   |
    [1K, 2K)             426 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [2K, 4K)             230 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
    [4K, 8K)               9 |@                                                   |
    [8K, 16K)            128 |@@@@@@@@@@@@@@@                                     |
    [16K, 32K)            68 |@@@@@@@@                                            |
    [32K, 64K)             0 |                                                    |
    [64K, 128K)            0 |                                                    |
    [128K, 256K)          10 |@                                                   |
  • 86. bpftrace: biolatency.bt
    #!/usr/local/bin/bpftrace

    BEGIN
    {
        printf("Tracing block device I/O... Hit Ctrl-C to end.\n");
    }

    kprobe:blk_account_io_start
    {
        @start[arg0] = nsecs;
    }

    kprobe:blk_account_io_completion
    /@start[arg0]/
    {
        @usecs = hist((nsecs - @start[arg0]) / 1000);
        delete(@start[arg0]);
    }
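The `hist()` call in the script above summarizes latencies into power-of-2 buckets, which is what produces the `[1K, 2K)` style rows in the biolatency output. The Python sketch below illustrates that bucketing idea only; it is not how bpftrace implements it (bpftrace does this in kernel context and handles zero and negative values in separate buckets), and the latency values are hypothetical.

```python
from collections import Counter


def log2_bucket(value):
    """Return (low, high) of the power-of-2 bucket [low, high) holding value."""
    if value < 1:
        return (0, 1)
    low = 1 << (int(value).bit_length() - 1)
    return (low, low * 2)


def log2_hist(values):
    """Count how many values fall into each power-of-2 bucket."""
    return Counter(log2_bucket(v) for v in values)


# Hypothetical I/O latencies in microseconds.
latencies_us = [300, 700, 1500, 1800, 2500, 9000]
hist = log2_hist(latencies_us)
for (low, high), count in sorted(hist.items()):
    print(f"[{low}, {high}) {count:3d} |{'@' * count}")
```

Power-of-2 buckets keep the summary small regardless of how many events are traced, while still exposing multi-modal distributions such as the fast and slow I/O groups in the slide.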
  • 89. What “90% CPU Utilization” might suggest: What it typically means on the Netflix cloud:
  • 90. PMCs ● Performance Monitoring Counters help you analyze stalls ● Some instances (eg, Xen-based m4.16xl) have the architectural set:
  • 91. Instructions Per Cycle (IPC) ● IPC <0.2: “bad”, stall-cycle bound ● IPC >2.0: “good*”, instruction bound ● * probably; exception: spin locks
  • 92. PMCs: EC2 Xen Hypervisor
    # perf stat -a -- sleep 30

     Performance counter stats for 'system wide':

        1921101.773240      task-clock (msec)         #   64.034 CPUs utilized          (100.00%)
             1,103,112      context-switches          #    0.574 K/sec                  (100.00%)
               189,173      cpu-migrations            #    0.098 K/sec                  (100.00%)
                 4,044      page-faults               #    0.002 K/sec
     2,057,164,531,949      cycles                    #    1.071 GHz                    (75.00%)
       <not supported>      stalled-cycles-frontend
       <not supported>      stalled-cycles-backend
     1,357,979,592,699      instructions              #    0.66  insns per cycle        (75.01%)
       243,244,156,173      branches                  #  126.617 M/sec                  (74.99%)
         4,391,259,112      branch-misses             #    1.81% of all branches        (75.00%)

          30.001112466 seconds time elapsed

    # ./pmcarch 1
    CYCLES        INSTRUCTIONS   IPC   BR_RETIRED   BR_MISPRED  BMR%  LLCREF     LLCMISS    LLC%
    38222881237   25412094046    0.66  4692322525   91505748    1.95  780435112  117058225  85.00
    40754208291   26308406390    0.65  5286747667   95879771    1.81  751335355  123725560  83.53
    35222264860   24681830086    0.70  4616980753   86190754    1.87  709841242  113254573  84.05
    38176994942   26317856262    0.69  5055959631   92760370    1.83  787333902  119976728  84.76
    [...]
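The ratio columns in the output above follow directly from the raw counters. As a sketch, the Python below recomputes IPC, branch misprediction rate, and LLC hit ratio from the first pmcarch row and the perf stat totals. The function names are illustrative, and treating `LLC%` as a hit ratio is an assumption that happens to reproduce the printed values.

```python
def ipc(instructions, cycles):
    """Instructions per cycle."""
    return instructions / cycles


def branch_mispredict_pct(mispredicted, retired):
    """Mispredicted branches as a percentage of retired branches."""
    return 100.0 * mispredicted / retired


def llc_hit_pct(references, misses):
    """Last-level cache hit ratio as a percentage."""
    return 100.0 * (1 - misses / references)


# perf stat totals from the slide (30 second system-wide sample).
print(f"perf stat IPC: {ipc(1_357_979_592_699, 2_057_164_531_949):.2f}")

# First pmcarch row from the slide.
print(f"IPC:  {ipc(25_412_094_046, 38_222_881_237):.2f}")
print(f"BMR%: {branch_mispredict_pct(91_505_748, 4_692_322_525):.2f}")
print(f"LLC%: {llc_hit_pct(780_435_112, 117_058_225):.2f}")
```

An IPC of about 0.66 sits well below the >2.0 "good" end of the scale, which suggests much of the reported CPU time on this instance is spent stalled rather than retiring instructions.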
  • 93. PMCs: EC2 Nitro Hypervisor ● Some instance types (large, Nitro-based) support most PMCs! ● Meltdown KPTI patch TLB miss analysis on a c5.9xl:
    nopti:
    # tlbstat -C0 1
    K_CYCLES  K_INSTR  IPC   DTLB_WALKS  ITLB_WALKS  K_DTLBCYC  K_ITLBCYC  DTLB%  ITLB%
    2854768   2455917  0.86  565         2777        50         40          0.00   0.00
    2884618   2478929  0.86  950         2756        6          38          0.00   0.00
    2847354   2455187  0.86  396         297403      46         40          0.00   0.00
    [...]
    pti, nopcid (worst case):
    # tlbstat -C0 1
    K_CYCLES  K_INSTR  IPC   DTLB_WALKS  ITLB_WALKS  K_DTLBCYC  K_ITLBCYC  DTLB%  ITLB%
    2875793   276051   0.10  89709496    65862302    787913     650834     27.40  22.63
    2860557   273767   0.10  88829158    65213248    780301     644292     27.28  22.52
    2885138   276533   0.10  89683045    65813992    787391     650494     27.29  22.55
    2532843   243104   0.10  79055465    58023221    693910     573168     27.40  22.63
    [...]
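The `DTLB%` and `ITLB%` columns above are consistent with dividing the cycles spent in TLB page walks by total cycles. The Python sketch below checks that against the first row of each tlbstat run; the function name is illustrative and the formula is inferred from the printed numbers rather than taken from the tool's source.

```python
def walk_cycle_pct(k_walk_cycles, k_cycles):
    """Percentage of CPU cycles spent with a TLB page walk active."""
    return 100.0 * k_walk_cycles / k_cycles


# First row with nopti: K_CYCLES=2854768, K_DTLBCYC=50, K_ITLBCYC=40
print(f"nopti       DTLB% {walk_cycle_pct(50, 2_854_768):.2f}  "
      f"ITLB% {walk_cycle_pct(40, 2_854_768):.2f}")

# First row with pti, nopcid: K_CYCLES=2875793, K_DTLBCYC=787913, K_ITLBCYC=650834
print(f"pti,nopcid  DTLB% {walk_cycle_pct(787_913, 2_875_793):.2f}  "
      f"ITLB% {walk_cycle_pct(650_834, 2_875_793):.2f}")
```

In the worst-case run, roughly half of all cycles have a data or instruction page walk active, which lines up with IPC falling from 0.86 to 0.10.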
  • 94. MSRs ● Model Specific Registers ● System config info, including current clock rate:
    # showboost
    Base CPU MHz : 2500
    Set CPU MHz  : 2500
    Turbo MHz(s) : 3100 3200 3300 3500
    Turbo Ratios : 124% 128% 132% 140%
    CPU 0 summary every 1 seconds...

    TIME      C0_MCYC     C0_ACYC    UTIL  RATIO  MHz
    23:39:07  1618910294  89419923    64%     5%  138
    23:39:08  1774059258  97132588    70%     5%  136
    23:39:09  2476365498  130869241   99%     5%  132
    ^C
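The summary columns in the showboost output above can be derived from the two cycle counters. The Python sketch below assumes `C0_MCYC` and `C0_ACYC` are per-interval deltas of the MPERF and APERF counters, that the ratio of the two scales the base clock, and that the tool truncates to integers; under those assumptions it reproduces the printed rows. The function names are made up for this example.

```python
BASE_MHZ = 2500  # "Base CPU MHz" from the slide


def utilization_pct(mcyc, base_mhz=BASE_MHZ, interval_s=1.0):
    """Non-idle (C0) time: base-clock cycles counted vs. cycles available."""
    return 100.0 * mcyc / (base_mhz * 1e6 * interval_s)


def ratio_pct(mcyc, acyc):
    """Actual cycles relative to base-clock cycles while in C0."""
    return 100.0 * acyc / mcyc


def effective_mhz(mcyc, acyc, base_mhz=BASE_MHZ):
    """Average clock rate while running, scaled from the base clock."""
    return base_mhz * acyc / mcyc


# (C0_MCYC, C0_ACYC) pairs from the slide.
rows = [
    (1618910294, 89419923),
    (1774059258, 97132588),
    (2476365498, 130869241),
]
for mcyc, acyc in rows:
    print(f"UTIL {int(utilization_pct(mcyc))}%  "
          f"RATIO {int(ratio_pct(mcyc, acyc))}%  "
          f"MHz {int(effective_mhz(mcyc, acyc))}")
```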
  • 96. Take Aways 1. Get push-button CPU flame graphs: kernel & user 2. Check out eBPF perf tools: bcc, bpftrace 3. Measure IPC as well as CPU utilization using PMCs 90% CPU busy: … really means:
  • 98. Observability Statistics, Flame Graphs, eBPF Tracing, Cloud PMCs Methodology USE method, RED method, Drill-down Analysis, … Velocity Self-service GUIs: Vector, FlameScope, …
  • 99. Resources ● 2014 talk From Clouds to Roots: http://www.slideshare.net/brendangregg/netflix-from-clouds-to-roots http://www.youtube.com/watch?v=H-E0MQTID0g ● Chaos: https://medium.com/netflix-techblog/chap-chaos-automation-platform-53e6d528371f https://principlesofchaos.org/ ● Atlas: https://github.com/Netflix/Atlas ● Atlas: https://medium.com/netflix-techblog/introducing-atlas-netflixs-primary-telemetry-platform-bd31f4d8ed9a ● RED method: https://thenewstack.io/monitoring-microservices-red-method/ ● USE method: https://queue.acm.org/detail.cfm?id=2413037 ● Winston: https://medium.com/netflix-techblog/introducing-winston-event-driven-diagnostic-and-remediation-platform-46ce39aa81cc ● Lumen: https://medium.com/netflix-techblog/lumen-custom-self-service-dashboarding-for-netflix-8c56b541548c ● Flame graphs: http://www.brendangregg.com/flamegraphs.html ● Java flame graphs: https://medium.com/netflix-techblog/java-in-flames-e763b3d32166 ● Vector: http://vectoross.io https://github.com/Netflix/Vector ● FlameScope: https://github.com/Netflix/FlameScope ● Tracing ponies: thanks Deirdré Straughan & General Zoi's Pony Creator ● ftrace: http://lwn.net/Articles/608497/ - usually already in your kernel ● perf: http://www.brendangregg.com/perf.html - perf is usually packaged in linux-tools-common ● tcplife: https://github.com/iovisor/bcc - often available as a bcc or bcc-tools package ● bpftrace: https://github.com/iovisor/bpftrace ● pmcarch: https://github.com/brendangregg/pmc-cloud-tools ● showboost: https://github.com/brendangregg/msr-cloud-tools - also try turbostat