Latency is one of the most common Service Level Indicators (SLIs), but where should it be measured from? There are three main ways to measure latency:
• Server-side latency: precise and high-cardinality, but missing the big picture
• Client-side latency: the big picture, but noisy
• Blackbox monitoring latency: a good trade-off between the other two
In this talk, we will dive deeper into each perspective and how all of them can be leveraged. We will use Criteo’s large-scale key/value infrastructure as a case study.
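As a rough illustration of how the first two perspectives differ, here is a minimal Python sketch (the endpoint and timing approach are assumptions for illustration, not Criteo's implementation): server-side latency is measured around the handler only, while a client-side or blackbox measurement also includes network and queuing time.

```python
import time
import urllib.request

def handle_request():
    """Server-side view: time only the work done inside the service."""
    start = time.perf_counter()
    # ... do the actual work (e.g., a key/value lookup) ...
    return (time.perf_counter() - start) * 1000  # ms, excludes network time

def probe(url="http://localhost:8080/healthz"):  # placeholder probe endpoint
    """Client-side / blackbox view: time the whole round trip, network included."""
    start = time.perf_counter()
    urllib.request.urlopen(url, timeout=2).read()
    return (time.perf_counter() - start) * 1000  # ms, includes network + queuing
```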
Applying the power of Continuous Delivery to performance testing. Process, techniques, best practices. This talk describes a pragmatic approach to building a robust performance testing strategy.
Ricardo Jiménez-Peris, PhD, researcher and founder of LeanXcale, explains how his latest invention enables relational databases to scale linearly.
This document discusses building resilient predictive data pipelines. It begins by distinguishing between ETL and predictive data pipelines, noting that predictive pipelines require high availability with downtimes of less than an hour. The document then outlines design goals for resilient data pipelines, including being scalable, available, instrumented/monitored/alert-enabled, and quickly recoverable. It proposes using AWS services like SQS, SNS, S3, and Auto Scaling Groups to build such pipelines. The document also recommends using Apache Airflow for workflow automation and scheduling to reliably manage pipelines as directed acyclic graphs. It presents an architecture using these techniques and assesses how well it meets the outlined design goals.
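To make the Airflow recommendation concrete, here is a minimal sketch of a pipeline expressed as a directed acyclic graph (assuming a recent Apache Airflow 2.x install; the DAG name and the extract/predict callables are hypothetical placeholders, not the architecture from the document):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**context):
    # Hypothetical: pull the latest raw events, e.g. from an SQS queue or S3 prefix
    pass

def predict(**context):
    # Hypothetical: run the model over the extracted batch and write results to S3
    pass

with DAG(
    dag_id="predictive_pipeline",      # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    predict_task = PythonOperator(task_id="predict", python_callable=predict)
    extract_task >> predict_task       # dependencies form the DAG Airflow schedules and retries
```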
Chronix is a time series database designed specifically for anomaly detection in operational data. It offers several advantages over general purpose time series databases: 1) Chronix uses domain specific optimizations like optional timestamp compression, custom data records, and compression techniques tailored to the repetitive patterns in operational data. 2) It provides a programming interface to pre-compute representations of time series data and add domain-specific columns to speed up anomaly detection queries. 3) Chronix supports exploratory and correlating analyses through its multi-dimensional storage and ability to query on any combination of attributes. It also offers high-level domain-specific analysis functions evaluated server-side.
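To give a flavor of the "optional timestamp compression" idea, here is a generic delta-encoding sketch of the kind of optimization described (illustrative only, not Chronix's actual codec):

```python
def delta_encode(timestamps):
    """Store the first timestamp plus successive differences.

    Operational data is often sampled at a near-fixed interval, so the deltas
    are small and highly repetitive, which compresses far better than raw epochs.
    """
    if not timestamps:
        return []
    return [timestamps[0]] + [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]

def delta_decode(encoded):
    """Reconstruct the original timestamps from the deltas."""
    out, current = [], 0
    for i, v in enumerate(encoded):
        current = v if i == 0 else current + v
        out.append(current)
    return out

# Example: one sample per second collapses to a run of identical deltas
print(delta_encode([1000, 2000, 3000, 4000]))  # [1000, 1000, 1000, 1000]
```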
Create-Net is a research center that offers cloud computing research, consulting, training, and webinars. This webinar discusses monitoring in the cloud computing era, beginning with introductions to Ceilometer and Monasca. Ceilometer is OpenStack's metering framework that collects data from OpenStack services through agents and notifications. It stores data in a database and provides an API. Monasca is a monitoring as a service platform that processes metrics and events at scale through microservices and stores data for querying and visualization. The webinar concludes with a discussion of trends in cloud monitoring.
Amazon Redshift offers many powerful features. Yet there are many instances where customers encounter degraded performance and runaway costs. Scaling AWS Redshift clusters to meet increasing compute and reporting needs, while maintaining optimal cost, performance, and security standards, is quite a challenge for many organizations. This webinar covered the following:
• Key design/architectural considerations of AWS Redshift
• Tips & tricks to optimize cost & performance
• How Agilisium helped clients reduce AWS Redshift run costs by up to 40%
Presented by: Jay Palaniappan - CTO & Head of Innovation Labs || Smitha Basavaraju - Big Data Architect || Arun Chinnadurai - Associate Director – BD
High-speed reactive microservices (HSRM) are microservices that are in-memory, non-blocking, own their data through leasing, and use streams and batching. They provide advantages like lower costs, ability to handle more traffic with fewer resources, and cohesive codebases. The example service described handles 30k recommendations/second on a single thread through batching, streaming, and data faulting. The document discusses attributes of HSRM like single writer rules and service stores, and related concepts like reactive programming, streams, and service sharding.
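The batching and single-writer ideas behind the "30k recommendations/second on a single thread" claim can be sketched in a few lines of Python (a generic single-writer batching loop, not the actual service; the queue and batch size are assumptions):

```python
import queue
import threading

incoming = queue.Queue()

def single_writer_loop(batch_size=500):
    """One thread owns the data: it drains requests in batches rather than
    one at a time, amortizing per-call overhead and avoiding locks."""
    while True:
        batch = [incoming.get()]                      # block for the first item
        while len(batch) < batch_size:
            try:
                batch.append(incoming.get_nowait())   # drain whatever else is queued
            except queue.Empty:
                break
        handle_batch(batch)                           # one pass over the owned, in-memory data

def handle_batch(batch):
    for user_id in batch:
        pass  # placeholder: compute a recommendation from in-memory state

threading.Thread(target=single_writer_loop, daemon=True).start()
```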
The document discusses in-flux limiting for a multi-tenant logging service. It describes Symantec's logging and metrics architecture using Kafka, Elasticsearch, and InfluxDB. It addresses the issue of ingestion spikes overwhelming InfluxDB and presents a solution to normalize event rates using buffers that allocate ingestion quotas per tenant. The design implements rate limiting using a scheduled task pattern in Storm to track each tenant's event rate over a configurable window and throttle events if the threshold is exceeded.
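A simplified sketch of the per-tenant throttling idea follows (this is the general technique, not the Storm scheduled-task implementation; the quota and window values are assumptions):

```python
import time
from collections import defaultdict

class TenantRateLimiter:
    """Track each tenant's event count over a configurable window and
    throttle once the tenant's ingestion quota is exceeded."""

    def __init__(self, quota_per_window=10_000, window_seconds=60):
        self.quota = quota_per_window
        self.window = window_seconds
        self.counts = defaultdict(int)
        self.window_start = time.monotonic()

    def allow(self, tenant_id):
        now = time.monotonic()
        if now - self.window_start >= self.window:    # periodic reset, akin to a scheduled tick
            self.counts.clear()
            self.window_start = now
        self.counts[tenant_id] += 1
        return self.counts[tenant_id] <= self.quota   # False -> throttle this event

limiter = TenantRateLimiter()
if not limiter.allow("tenant-a"):
    pass  # drop, buffer, or delay the event instead of overwhelming InfluxDB
```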
This document discusses performance engineering for batch and web applications. It begins by outlining why performance testing is important. Key factors that influence performance testing include response time, throughput, tuning, and benchmarking. Throughput represents the number of transactions processed in a given time period and should scale roughly linearly with load until the system saturates. Response time is the duration between sending a request and receiving the first response. Tuning improves performance by adjusting configuration parameters without changing code. The performance testing process involves test planning, creating test scripts, executing tests, monitoring tests, and analyzing results. Methods for analyzing heap dumps and thread dumps to identify bottlenecks are also provided. The document concludes with tips for optimizing PostgreSQL performance by adjusting the shared_buffers configuration parameter.
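As a small worked example of the two core metrics (the sample data below is assumed, not taken from the document):

```python
# Latencies in milliseconds collected during a 60-second test run (assumed data)
latencies_ms = [120, 95, 110, 480, 105, 98, 102, 250, 99, 101]
test_duration_s = 60

throughput = len(latencies_ms) / test_duration_s            # transactions per second
p99 = sorted(latencies_ms)[int(0.99 * len(latencies_ms))]   # simple nearest-rank p99

print(f"throughput: {throughput:.2f} tps, p99 response time: {p99} ms")
```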
The presentation explains how to set up rate limits, how to work with the HTTP 429 status code, and how rate limits are implemented in Kubernetes, CNI plugins, load balancers, and so on.
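A minimal client-side sketch of working with 429 responses (the URL is a placeholder; a production client should also cap total wait time and add jitter):

```python
import time
import requests

def get_with_backoff(url, max_retries=5):
    """Retry on HTTP 429, honoring the Retry-After header when present."""
    for attempt in range(max_retries):
        resp = requests.get(url, timeout=5)
        if resp.status_code != 429:
            return resp
        retry_after = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(retry_after)                    # back off before retrying
    raise RuntimeError("rate limited: retries exhausted")

# resp = get_with_backoff("https://api.example.com/items")  # placeholder URL
```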
The objective of the engagement is for Citi to gain an understanding of, and a path forward for, monitoring their Confluent Platform, covering:
- Platform Monitoring
- Maintenance and Upgrades
Integrating content delivery networks into your application infrastructure can offer many benefits, including major performance improvements for your applications. So understanding how CDNs perform — especially for your specific use cases — is vital. However, testing and measurement are complicated and nuanced, and often result in metric overload and confusion. It's becoming increasingly important to understand measurement techniques, what they're telling you, and how to apply them to your actual content. In this session, we'll examine the challenges around measuring CDN performance and focus on the different methods for measurement. We'll discuss what to measure, important metrics to focus on, and different ways that numbers may mislead you. More specifically, we'll cover:
- Different techniques for measuring CDN performance
- Differentiating between network footprint and object delivery performance
- Choosing the right content to test
- Core metrics to focus on and how each impacts real traffic
- Understanding cache hit ratio, why it can be misleading, and how to measure for it
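To make the cache hit ratio point concrete, here is a small sketch contrasting a request-based hit ratio with a byte-weighted one, using an assumed log format (the field names and values are hypothetical); the two can diverge sharply, which is one way the headline number misleads:

```python
# Assumed per-request log entries: cache status plus object size in bytes
logs = [
    {"cache": "HIT",  "bytes": 2_000},
    {"cache": "HIT",  "bytes": 3_000},
    {"cache": "MISS", "bytes": 900_000},   # one large miss dominates the bytes served
]

hits = [e for e in logs if e["cache"] == "HIT"]

request_hit_ratio = len(hits) / len(logs)
byte_hit_ratio = sum(e["bytes"] for e in hits) / sum(e["bytes"] for e in logs)

print(f"request hit ratio: {request_hit_ratio:.0%}")  # 67%, looks healthy
print(f"byte hit ratio:    {byte_hit_ratio:.0%}")     # ~1%, most bytes came from origin
```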
Tom Valine and Bhinav Sura (Salesforce): We’ll present details about Argus, a time-series monitoring and alerting platform developed at Salesforce to provide insight into the health of infrastructure as an alternative to systems such as Graphite and Seyren.
BlazingMQ is a new open source* distributed message queuing system developed at and published by Bloomberg. It provides highly-performant queues to applications for asynchronous, efficient, and reliable communication. This system has been used at scale at Bloomberg for eight years, where it moves terabytes of data and billions of messages across tens of thousands of queues in production every day. BlazingMQ provides highly-available, fault-tolerant queues courtesy of replication based on the Raft consensus algorithm. In addition, it provides a rich set of enterprise message routing strategies, enabling users to implement a variety of scenarios for message processing. Written in C++ from the ground up, BlazingMQ has been architected with low latency as one of its core requirements. This has resulted in some unique design and implementation choices at all levels of the system, such as its lock-free threading model, custom memory allocators, compact wire protocol, multi-hop network topology, and more. This talk will provide an overview of BlazingMQ. We will then delve into the system’s core design principles, architecture, and implementation details in order to explore the crucial role they play in its performance and reliability. *BlazingMQ will be released as open source between now and P99 (exact timing is still TBD)
At Knewton we operate a total of 29 clusters across five different VPCs, each ranging from 3 to 24 nodes. For a team of three, maintaining this is not a herculean task; however, good tools to diagnose issues and gather information in a distributed manner are vital to moving quickly and minimizing engineering time spent. The database team at Knewton has been successfully using a combination of Ansible and custom open-sourced tools to maintain and improve the Cassandra deployment at Knewton. I will be talking about several of these tools and giving examples of how we are using them. Specifically, I will discuss the cassandra-tracing tool, which analyzes the contents of the system_traces keyspace, and the cassandra-stat tool, which gives real-time output of the operations of a Cassandra cluster. Distributed administration with ad-hoc Ansible will also be covered, and I will walk through examples of using these commands to identify and remediate cluster-wide issues.
About the Speaker
Jeffrey Berger, Lead Database Engineer, Knewton
Dr. Jeffrey Berger is currently the lead database engineer at Knewton, an education tech startup in NYC. He joined the tech scene in NYC in 2013 and spent two years working with MongoDB, becoming a certified MongoDB administrator and a MongoDB Master. He received his Cassandra Administrator certification at Cassandra Summit 2015. He holds a Ph.D. in Theoretical Physics from Penn State and spent several years working on high-energy nuclear interactions.
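For a flavor of the data cassandra-tracing digs into, here is a small sketch using the Python cassandra-driver to pull the slowest traced sessions from the system_traces keyspace (the contact point is a placeholder; this illustrates the data source, it is not the tool itself):

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])              # placeholder contact point
session = cluster.connect("system_traces")

# Each traced query has a row in system_traces.sessions with its total duration (microseconds)
rows = session.execute("SELECT session_id, duration, request FROM sessions LIMIT 1000")

slowest = sorted(rows, key=lambda r: r.duration or 0, reverse=True)[:10]
for r in slowest:
    print(r.session_id, f"{(r.duration or 0) / 1000:.1f} ms", r.request)
```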
This document discusses Cassandra drivers and how to optimize queries. It begins with an introduction to Cassandra drivers and examples of basic usage in Java, Python, and Ruby. It then covers the differences between synchronous and asynchronous queries. Prepared statements and consistency levels are also discussed. The document explores how consistency levels, driver policies, and node outages impact performance and latency. Hinted handoff is described as a performance optimization in which hints are stored for writes missed by a down node, so they can be replayed once the node recovers. Lastly, it provides best practices around driver usage.
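A minimal sketch of the prepared-statement, consistency-level, and asynchronous-query ideas with the Python driver (the contact point, keyspace, and table are placeholders):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])                       # placeholder contact point
session = cluster.connect("my_keyspace")               # placeholder keyspace

# Prepared once, executed many times: skips re-parsing and lets the driver route by token
prepared = session.prepare("SELECT * FROM users WHERE user_id = ?")
prepared.consistency_level = ConsistencyLevel.LOCAL_QUORUM

# Synchronous execution blocks until the result arrives
row = session.execute(prepared, ["user-123"]).one()

# Asynchronous execution returns a future, letting the client pipeline many queries
future = session.execute_async(prepared, ["user-456"])
other_row = future.result()                            # wait only when the result is needed
```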
In this presentation, we explore how standard profiling and monitoring methods may fall short in identifying bottlenecks in low-latency data ingestion workflows. Instead, we showcase the power of simple yet clever methods that can uncover hidden performance limitations. Attendees will discover unconventional techniques, including clever logging, targeted instrumentation, and specialized metrics, to pinpoint bottlenecks accurately. Real-world use cases will be presented to demonstrate the effectiveness of these methods. By the end of the session, attendees will be equipped with alternative approaches to identify bottlenecks and optimize their low-latency data ingestion workflows for high throughput.
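One generic form the "targeted instrumentation" idea can take is wrapping a single suspect stage with a lightweight timer that only logs outliers, so the instrumentation stays cheap in a low-latency path (the stage name and threshold below are assumptions for illustration):

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)

def log_if_slow(threshold_ms=5.0):
    """Instrument one suspect stage only, and log only the outliers."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                if elapsed_ms > threshold_ms:
                    logging.info("%s took %.2f ms", fn.__name__, elapsed_ms)
        return wrapper
    return decorator

@log_if_slow(threshold_ms=5.0)
def decode_batch(batch):      # hypothetical ingestion stage
    time.sleep(0.01)          # stand-in for real work

decode_batch([])
```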
Stream processing is a crucial component of modern data infrastructure, but constructing an efficient and scalable stream processing system can be challenging. Decoupling compute and storage architecture has emerged as an effective solution to these challenges, but it can introduce high latency issues, especially when dealing with complex continuous queries that necessitate managing extra-large internal states. In this talk, we focus on addressing the high latency issues associated with S3 storage in stream processing systems that employ a decoupled compute and storage architecture. We delve into the root causes of latency in this context and explore various techniques to minimize the impact of S3 latency on stream processing performance. Our proposed approach is to implement a tiered storage mechanism that leverages a blend of high-performance and low-cost storage tiers to reduce data movement between the compute and storage layers while maintaining efficient processing. Throughout the talk, we will present experimental results that demonstrate the effectiveness of our approach in mitigating the impact of S3 latency on stream processing. By the end of the talk, attendees will have gained insights into how to optimize their stream processing systems for reduced latency and improved cost-efficiency.
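A highly simplified sketch of the tiered-storage idea for operator state: serve hot keys from memory, fall back to a local cache, and go to S3 only on a double miss (the bucket, directory, and boto3 usage are placeholders, not the system described in the talk):

```python
import boto3

class TieredStateStore:
    """Hot state in memory, warm state on local disk, cold state in S3."""

    def __init__(self, bucket="my-state-bucket", local_dir="/tmp/state"):  # placeholders
        self.memory = {}
        self.local_dir = local_dir
        self.bucket = bucket
        self.s3 = boto3.client("s3")

    def get(self, key):
        if key in self.memory:                                   # fastest tier
            return self.memory[key]
        try:
            with open(f"{self.local_dir}/{key}", "rb") as f:     # warm tier: local disk
                value = f.read()
        except FileNotFoundError:
            obj = self.s3.get_object(Bucket=self.bucket, Key=key)  # cold tier: high latency
            value = obj["Body"].read()
        self.memory[key] = value                                 # promote on access
        return value
```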
Widya Salim and Victor Ma will outline the causal impact analysis, framework, and key learnings used to quantify the impact of reducing Twitter's network latency.
Noisy Real User Monitoring (RUM) data can ruin your P99! We introduce a fresh concept called "Human Visible Navigations" (HVN) to tackle this risk; we focus on the experiences you actually care about when talking about the speed of our sites:
- Human: We exclude noise coming from bots and synthetic measurements.
- Visible: We remove any partially or fully hidden experiences. These tend to be very slow, but users don’t see this slowness.
- Navigations: We ignore lightning-fast back-forward navigations, which usually have few optimisation opportunities.
Adopting Human Visible Navigations provides you with these key benefits:
- Fewer changes staying below the radar
- Fewer data fluctuations
- Fewer blind spots when finding bottlenecks
- Better correlation with business metrics
This is supported by plenty of real-world examples coming from the world's largest scale modeling site (6M monthly visits), in combination with aggregated data from the brand new rumarchive.com (open source). After attending this session, your P99 and other percentiles will become less noisy and easier to tune!
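As a minimal sketch of what an HVN filter might look like in beacon processing (the field names and sample values are hypothetical, not the site's actual schema):

```python
def is_human_visible_navigation(beacon):
    """Keep only Human, Visible, Navigation experiences (hypothetical field names)."""
    if beacon.get("is_bot") or beacon.get("is_synthetic"):
        return False                                   # Human: drop bots and synthetic tests
    if beacon.get("visibility_state") != "visible":
        return False                                   # Visible: drop hidden page loads
    if beacon.get("navigation_type") == "back_forward":
        return False                                   # Navigations: drop bfcache restores
    return True

beacons = [
    {"is_bot": True,  "visibility_state": "visible", "navigation_type": "navigate", "lcp_ms": 12000},
    {"is_bot": False, "visibility_state": "hidden",  "navigation_type": "navigate", "lcp_ms": 9000},
    {"is_bot": False, "visibility_state": "visible", "navigation_type": "navigate", "lcp_ms": 1800},
]
clean = [b for b in beacons if is_human_visible_navigation(b)]
print(len(clean), "of", len(beacons), "beacons kept for percentile calculations")
```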
Understanding the impacts of running a containerized Go application inside Kubernetes with a focus on the CPU.
In this session, Tanel introduces a new open source eBPF tool for efficiently sampling both on-CPU and off-CPU events for every thread (task) in the OS. Standard Linux performance tools (like perf) make it easy to profile threads doing on-CPU work, but if we want to include the off-CPU timing and reasons for the full picture, things get complicated. Combining eBPF task state arrays with periodic sampling for profiling lets us get a system-level overview of where threads spend their time, even when blocked or sleeping, and also drill down to the individual thread level to understand why.
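The tool itself is eBPF-based, but the underlying sampling idea can be approximated with a few lines of Python that periodically read thread states from /proc (a crude illustration only; it misses the off-CPU reasons and precision the eBPF approach captures):

```python
import glob
import time
from collections import Counter

def sample_thread_states(pid, interval=0.01, samples=500):
    """Periodically sample the state of every thread (task) of a process.
    R = running/runnable, S = sleeping, D = uninterruptible sleep (often blocked on I/O)."""
    counts = Counter()
    for _ in range(samples):
        for stat_path in glob.glob(f"/proc/{pid}/task/*/stat"):
            try:
                content = open(stat_path).read()
            except OSError:
                continue                                  # thread exited between listing and read
            state = content.rsplit(")", 1)[1].split()[0]  # state field follows the (comm) field
            counts[state] += 1
        time.sleep(interval)
    return counts

# print(sample_thread_states(1234))  # e.g. Counter({'S': 4200, 'R': 600, 'D': 200})
```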
Performance budgets have been around for more than ten years. Over those years, we’ve learned a lot about what works, what doesn’t, and what we need to improve. In this session, Tammy revisits old assumptions about performance budgets and offers some new best practices. Topics include:
• Understanding performance budgets vs. performance goals
• Aligning budgets with user experience
• Pros and cons of Core Web Vitals
• How to stay on top of your budgets to fight regressions
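A tiny sketch of the "stay on top of your budgets" idea: compare measured metrics against budgets in CI and fail the build on regressions (the metric names and thresholds below are assumed for illustration):

```python
import sys

# Assumed budgets and a measurement from the latest build (illustrative values)
budgets = {"lcp_ms": 2500, "total_js_kb": 300, "image_kb": 500}
measured = {"lcp_ms": 2650, "total_js_kb": 280, "image_kb": 490}

over_budget = {
    metric: (measured[metric], limit)
    for metric, limit in budgets.items()
    if measured.get(metric, 0) > limit
}

for metric, (value, limit) in over_budget.items():
    print(f"BUDGET EXCEEDED: {metric} = {value} (budget {limit})")

sys.exit(1 if over_budget else 0)   # nonzero exit fails the build on any regression
```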
Trying to figure out why your application is responding late can be difficult, especially if it is because of interference from the operating system. This talk will briefly go over how to write a C program that can analyze what in the Linux system is interfering with your application. It will use trace-cmd to enable kernel trace events as well as tracing lock functions, and it will then go over a quick tutorial on how to use libtracecmd to read the created trace.dat file and uncover the cause of interference to your application.
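The talk's examples use the libtracecmd C API; as a rough stand-in, here is a Python sketch that drives the trace-cmd CLI to capture scheduler events for a few seconds and then scans the report for lines involving the application of interest (the PID is hypothetical; requires trace-cmd installed and root privileges):

```python
import subprocess

PID = 1234   # hypothetical PID of the latency-sensitive application

# Record scheduler events system-wide for 5 seconds into trace.dat (requires root).
subprocess.run(["trace-cmd", "record", "-e", "sched", "sleep", "5"], check=True)

# Dump the recorded trace and keep only lines mentioning our application,
# to see what it was switched out for or what woke up in its place.
report = subprocess.run(
    ["trace-cmd", "report"], check=True, capture_output=True, text=True
)
for line in report.stdout.splitlines():
    if f"pid={PID}" in line or f"-{PID} " in line:
        print(line)
```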
With the low-latency garbage collector ZGC, GC pause times are no longer a big problem in Java. With sub-millisecond pause times, other things in the GC and JVM can instead cause application threads to experience unexpected latencies. This talk will dig into a specific case where the GC pauses are no longer the cause of unexpected latencies and look at how adding generations to ZGC helps lower the p99 application latencies.
Linters are a type of database! They are a collection of lint rules — queries that look for rule violations to report — plus a way to execute those queries over a source code dataset. This is a case study about using database ideas to build a linter that looks for breaking changes in Rust library APIs. Maintainability and performance are key: new Rust releases tend to have mutually-incompatible ways of representing API information, and we cannot afford to reimplement and optimize dozens of rules for each Rust version separately. Fortunately, databases don't require rewriting queries when the underlying storage format or query plan changes! This allows us to ship massive optimizations and support multiple Rust versions without making any changes to the queries that describe lint rules. "Ship now, optimize later" can be a sustainable development practice after all — join us to see how!
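To illustrate the "lint rules as queries over a source code dataset" idea in the abstract's spirit (a generic toy model, not the linter's actual query language or schema):

```python
# Toy "database" of public API items for two releases of a library (hypothetical data)
old_api = [
    {"kind": "function", "name": "parse", "public": True},
    {"kind": "function", "name": "render", "public": True},
]
new_api = [
    {"kind": "function", "name": "parse", "public": True},
]

def public_function_removed(old, new):
    """One lint rule, expressed as a query: which public functions disappeared?
    The rule stays the same even if the underlying storage format changes."""
    new_names = {item["name"] for item in new if item["public"]}
    return [item for item in old
            if item["public"] and item["kind"] == "function"
            and item["name"] not in new_names]

for violation in public_function_removed(old_api, new_api):
    print(f"breaking change: public function `{violation['name']}` was removed")
```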