Distributed System Performance Troubleshooting Like You’ve Been Doing it for Twenty Years

Troubleshooting performance issues across distributed systems can be intimidating if you don’t know where to start, and it’s even harder when the system is running on hundreds or thousands of nodes. We’re well past the point of logging into random nodes and poking around hoping we spot the problem. It’s critical to have a methodology to follow as well as a deep understanding of the tools that are available to help you prove (or disprove) your mental model. In this session, we’ll explore how to go about diagnosing performance problems you might run into, and teach you the tools and process for getting to the bottom of any issue, quickly -- even when it’s one of the biggest distributed database deployments on the planet.

Ask The Right Questions
"It’s slow!"
How slow? What’s it normally?
Can I see a latency histogram?
Did throughput change?
Every machine or just a subset?
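
To make "Can I see a latency histogram?" concrete, here is a minimal sketch that buckets request latencies into power-of-two ranges. The function name and the sample values are hypothetical and not from the talk; real data would come from your metrics system.

```python
from collections import Counter


def latency_histogram(latencies_ms):
    """Count latencies into power-of-two buckets keyed by each bucket's lower bound."""
    buckets = Counter()
    for value in latencies_ms:
        # Bucket 0 holds sub-millisecond values; otherwise use the largest
        # power of two that is <= the value (1, 2, 4, 8, ...).
        lower = 0 if value < 1 else 1 << (int(value).bit_length() - 1)
        buckets[lower] += 1
    return dict(sorted(buckets.items()))


# Hypothetical sample: mostly fast requests plus two slow outliers.
sample = [3, 4, 5, 6, 5, 4, 120, 5, 6, 250]
for lower, count in latency_histogram(sample).items():
    print(f"{lower:>5} ms and up: {'*' * count}")
```

A histogram like this separates "everything got a little slower" from "a few requests got much slower", which an average alone would hide.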

All Machines Or One Machine?
One query or all queries?
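
One way to answer "all machines or one machine?" is to compare each node's typical latency against the fleet. The sketch below is an illustration under assumptions: the node names, the sample latencies, and the 2x threshold are all hypothetical.

```python
from statistics import median


def outlier_nodes(latencies_by_node, factor=2.0):
    """Return nodes whose median latency exceeds `factor` times the fleet median."""
    node_medians = {node: median(values) for node, values in latencies_by_node.items()}
    fleet_median = median(node_medians.values())
    return sorted(node for node, m in node_medians.items() if m > factor * fleet_median)


# Hypothetical per-node latency samples in milliseconds.
fleet = {
    "node-a": [5, 6, 5],
    "node-b": [6, 5, 7],
    "node-c": [40, 55, 48],
}
print(outlier_nodes(fleet))
```

An empty result while users still report slowness suggests the shift is fleet-wide. The same grouping works per query type instead of per node for the "one query or all queries?" question.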

Understand Your Tools
info
-------------------------------------------------------------------
distribution: full
vectorized: true
• hash join
│ estimated row count: 124,482
│ equality: (rider_id) = (id)
│
├── • scan
│     estimated row count: 125,000 (100% of the table; stats collected 13 minutes ago)
│     table: rides@rides_pkey
│     spans: FULL SCAN
│
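
The plan above reports `spans: FULL SCAN` on `rides@rides_pkey`, meaning the whole table is read. As a rough sketch, a few lines of Python can flag that pattern in plan text. It assumes the `table:` line precedes the `spans:` line, as it does in the excerpt; the helper name is hypothetical.

```python
def full_scan_tables(plan_text):
    """Return the tables that a text query plan reports as fully scanned."""
    tables = []
    current_table = None
    for raw in plan_text.splitlines():
        # Drop the tree-drawing characters and surrounding whitespace.
        line = raw.strip("│├└─• \t")
        if line.startswith("table:"):
            current_table = line.split(":", 1)[1].strip()
        elif line.startswith("spans:") and "FULL SCAN" in line:
            tables.append(current_table)
    return tables


# The scan node from the slide, copied as plain text.
PLAN = """\
├── • scan
│ estimated row count: 125,000 (100% of the table; stats collected 13 minutes ago)
│ table: rides@rides_pkey
│ spans: FULL SCAN
│
"""
print(full_scan_tables(PLAN))
```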

Utilization, Saturation, Errors
(USE Method)
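
As an illustration of the utilization part of USE for CPUs, the sketch below derives utilization from two samples of the aggregate `cpu` line in Linux's `/proc/stat`. The counter values are made up, and counting iowait as not busy is a choice made for this example, not something the talk prescribes.

```python
def cpu_times(stat_line):
    """Parse the aggregate 'cpu' line of /proc/stat into its first eight counters."""
    fields = stat_line.split()
    # Order: user nice system idle iowait irq softirq steal.
    # The guest fields are skipped because guest time is already included in user.
    return [int(x) for x in fields[1:9]]


def cpu_utilization(before, after):
    """Fraction of CPU time spent busy between two samples (iowait counted as not busy)."""
    deltas = [b - a for a, b in zip(cpu_times(before), cpu_times(after))]
    total = sum(deltas)
    if total == 0:
        return 0.0
    not_busy = deltas[3] + deltas[4]  # idle + iowait
    return (total - not_busy) / total


# Hypothetical samples taken a few seconds apart.
before = "cpu 100 0 50 800 50 0 0 0 0 0"
after = "cpu 160 0 90 1400 350 0 0 0 0 0"
print(f"CPU utilization: {cpu_utilization(before, after):.0%}")
```

Saturation and errors need their own sources, for example run-queue length for CPUs or device error counters for disks.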

iostat
root@ubuntu-vm:~# iostat -dmc 2
Linux 5.15.0-84-generic (ubuntu-vm) 09/24/2023 _aarch64_ (2 CPU)

avg-cpu:   %user   %nice %system %iowait  %steal   %idle
            0.77    0.00    5.38   42.82    0.00   51.03

Device           tps   MB_read/s   MB_wrtn/s   MB_dscd/s   MB_read   MB_wrtn
dm-0            0.00        0.00        0.00        0.00         0         0
dm-1        12165.50       47.52        0.00        0.00        95         0
loop0           0.00        0.00        0.00        0.00         0         0
loop1           0.00        0.00        0.00        0.00         0         0
loop2           0.00        0.00        0.00        0.00         0         0
loop3           0.00        0.00        0.00        0.00         0         0
sr0             0.00        0.00        0.00        0.00         0         0
vda         12165.50       47.52        0.00        0.00        95         0
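
One number worth deriving from this output is the average size of each I/O: throughput divided by operations per second. A minimal sketch, assuming iostat's MB column can be treated as 1024 KiB:

```python
def average_io_size_kib(mb_per_sec, tps):
    """Average kibibytes moved per I/O operation, from iostat's MB/s and tps columns."""
    if tps == 0:
        return 0.0
    return mb_per_sec * 1024 / tps


# Values for vda from the iostat output above.
print(f"{average_io_size_kib(47.52, 12165.50):.2f} KiB per I/O")
```

For `vda` this comes out at roughly 4 KiB per operation, which would be consistent with many small reads rather than a large sequential scan.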

mpstat
root@ubuntu-vm:~# mpstat -P ALL 2
Linux 5.15.0-84-generic (ubuntu-vm) 09/24/2023 _aarch64_ (2 CPU)

03:12:50 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %gnic
03:12:52 AM  all    0.78    0.00    4.40   43.26    0.00    0.00    0.00    0.00     0.0
03:12:52 AM    0    1.03    0.00    4.62   39.49    0.00    0.00    0.00    0.00     0.0
03:12:52 AM    1    0.52    0.00    4.19   47.12    0.00    0.00    0.00    0.00     0.0

03:12:52 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %gnic
03:12:54 AM  all    0.78    0.00    4.91   42.89    0.00    0.00    0.00    0.00     0.0
03:12:54 AM    0    1.03    0.00    5.15   44.33    0.00    0.00    0.00    0.00     0.0
03:12:54 AM    1    0.52    0.00    4.66   41.45    0.00    0.00    0.00    0.00     0.0

Biolatency: Understanding I/O
root@ubuntu-vm:~# biolatency-bpfcc 2
Tracing block device I/O... Hit Ctrl-C to end.
     usecs               : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 4093     |**********                              |
        64 -> 127        : 15175    |****************************************|
       128 -> 255        : 250      |                                        |
       256 -> 511        : 108      |                                        |
       512 -> 1023       : 44       |                                        |
      1024 -> 2047       : 17       |                                        |
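
The histogram can be turned into approximate percentiles by walking the cumulative counts. The sketch below uses the bucket counts shown above; the function name is hypothetical, and the result is only as precise as the bucket boundaries.

```python
def percentile_bucket(buckets, pct):
    """Return the (low, high) microsecond bucket that contains the given percentile."""
    total = sum(count for _, _, count in buckets)
    if total == 0:
        raise ValueError("histogram is empty")
    target = total * pct / 100
    running = 0
    for low, high, count in buckets:
        running += count
        if running >= target:
            return (low, high)
    return buckets[-1][:2]


# (low_usecs, high_usecs, count) rows copied from the biolatency output above.
BIOLATENCY = [
    (0, 1, 0), (2, 3, 0), (4, 7, 0), (8, 15, 0), (16, 31, 0),
    (32, 63, 4093), (64, 127, 15175), (128, 255, 250),
    (256, 511, 108), (512, 1023, 44), (1024, 2047, 17),
]
for pct in (50, 99, 99.9):
    low, high = percentile_bucket(BIOLATENCY, pct)
    print(f"p{pct}: {low}-{high} usecs")
```

Read this way, the median I/O lands in the 64-127 microsecond bucket, while the tail stretches into the 512-1023 microsecond range at p99.9.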

Understanding Cache Effectiveness
root@ubuntu-vm:~# cachestat-bpfcc 2
    HITS  MISSES  DIRTIES  HITRATIO  BUFFERS_MB  CACHED_MB
       0   24016        0     0.00%          31        709
       0   24288        0     0.00%          31        677
       0   23686        0     0.00%          31        705
       0   22041        0     0.00%          31        664
       0   20342        0     0.00%          31        680
       0   22785        0     0.00%          31        705
       0   22714        0     0.00%          31        666
       0   22904        0     0.00%          31        692
       0   22805        0     0.00%          31        654
       0   22782        0     0.00%          31        679
       0   22999        0     0.00%          31        705
       0   22851        0     0.00%          31        667
       0   22758        0     0.00%          31        692
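
The hit ratio is hits divided by total lookups, and the miss count can be converted into an implied read rate as a rough cross-check against disk throughput. This sketch assumes a 4 KiB page size and that each row covers the 2-second interval passed on the command line; both are assumptions, not statements from the talk.

```python
def hit_ratio(hits, misses):
    """Fraction of page cache lookups served from memory."""
    total = hits + misses
    return 0.0 if total == 0 else hits / total


def miss_throughput_mib_per_sec(misses, interval_sec, page_kib=4):
    """Approximate MiB/s fetched from disk if every miss reads one page."""
    return misses * page_kib / 1024 / interval_sec


# First row of the cachestat output above, sampled over 2 seconds.
print(f"hit ratio: {hit_ratio(0, 24016):.2%}")
print(f"implied reads: {miss_throughput_mib_per_sec(24016, 2):.1f} MiB/s")
```

Under those assumptions the misses imply roughly 47 MiB/s of reads, which is in the same range as the read rate iostat reported for the device.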

Profiling and Flame Graphs
Linux: perf, Java: async-profiler
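
Flame graph tooling typically consumes "folded" stacks: one line per unique call stack, frames joined by semicolons, followed by a sample count. The sketch below shows that aggregation step on hypothetical stack samples; in practice the samples would come from a profiler such as perf or async-profiler.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse sampled call stacks (root frame first) into 'frame;frame count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]


# Hypothetical samples; the function names are illustrative only.
samples = [
    ["main", "handle_request", "read_row"],
    ["main", "handle_request", "read_row"],
    ["main", "handle_request", "serialize"],
    ["main", "compact"],
]
for line in fold_stacks(samples):
    print(line)
```

The width of each frame in the rendered graph is proportional to these counts, so the widest frames show where sampled time is going.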

Today’s Summary
■ Observability is Critical
■ Narrow the problem down
■ Distributed Tracing
■ Latency, Throughput
■ Utilization, Saturation, Errors (USE)
■ Profiling is easy and effective!

Jon Haddad
jon@rustyrazorblade.com
@rustyrazorblade (BlueSky)
rustyrazorblade.com
Thank you! Let’s connect.
