Scaling Monitoring At Databricks From
Prometheus to M3
YY Wan & Nick Lanham
Virtual M3 Day 2/18/21
Introduction
Nick Lanham
Senior Software Engineer
Observability Team
YY Wan
Software Engineer
Observability Team
About
● Founded in 2013 by the original creators of Apache Spark
● Data and AI platform as a service for 5000+ customers
● 1500+ employees, 400+ engineers, >$400M annual recurring revenue
● 3 cloud providers, 50+ regions
● Launching millions of VMs / day to run data engineering and ML workloads, processing exabytes of data
Agenda
● Monitoring at Databricks before M3
● Deploying M3
○ Architecture
○ Migration
● Lessons Learned
○ Operational advice
○ Things to monitor
○ Updates and upgrades

Monitoring At Databricks
Before M3
Monitoring At Databricks
● Monitoring targets:
○ Cloud-native, majority of services run on Kubernetes
○ Customer Spark workloads run on VMs in customer environments
● Prometheus-based monitoring since 2016
● All service teams use metrics, dashboards, alerts
○ Most engineers are PromQL-literate
● Use-cases: real-time alerting, debugging, SLO reporting, automated event response
● Monitoring and data-drivenness are core to Databricks engineering culture
Prometheus Monitoring System
Scale Numbers
● 50+ regions / k8s clusters across multiple cloud providers
● 100+ microservices
● Infrastructure footprint of 4M+ VMs of Databricks services and customer Apache Spark workers
● Largest single Prometheus instance
○ 900k samples / sec
○ High churn rate: many series have fewer than 100 samples (i.e. metrics from short-lived Spark jobs exist for less than 100 minutes at a 1 min scrape interval)
○ Disk usage (15d retention): 4TB
○ Huge AWS VM: x1e.16xlarge, 64 core, 1952GB RAM

Scaling Bottlenecks & Pain Points
Operational
● Frequent capacity issues - OOMs, high disk usage
● Multi-hour Prometheus updates (long WAL recovery process during startup)
UX
● Mental overhead of sharded view of metrics
● Big queries never complete (and cause OOMs)
● Short retention period
● Subject to strict metric whitelist
Searching for a Scalable Monitoring Solution
Requirements:
● High metric volume, cardinality, churn rate
● Minimum 90d retention
● Compatible with PromQL
● Global view of (some) metrics
● High availability setup
Nice-to-have:
● Good update and maintenance story - less manual intervention, no metrics gaps
● Battle-tested in large scale production environment
● Open source
(Mid-2019) Alternatives considered: sharded Prometheus, Thanos, Cortex, Datadog, SignalFx
Why M3?
● Fulfilled all our hard requirements
○ Designed for large scale workloads and horizontally scalable
○ Exposes Prometheus API query endpoint
○ High availability with multi-replica setup
○ Designed for multi-region and cloud setup, with global querying feature
● Battle-tested at high scale at Uber in a production environment
● Has a Kubernetes operator for automated cluster operations
● Cool features that we would be interested to use
○ Aggregation on ingest
○ Downsampling (potentially longer retention)
Deploying M3

Initial Plan
Making the Write Path Scalable
Building Our Own Rule Engine
Zooming In On M3 Setup

Separating M3 Coordinator Groups
Monitoring M3 & Final Architecture
Migration
Migration
1. Shadow deployment
○ Dual-write metrics to both Prom and M3 storage
○ Evaluate alerts using both the Prom and M3 rule engines
○ Open a querying endpoint for Observability team to test queries and dashboarding
2. Behavior validation
○ Compare alert evaluation between old and new system
○ Compare dashboards side-by-side
3. Incremental rollout strategy
○ Percentage-based rollout of ad-hoc query traffic to M3, staged across environments
○ Per-service rollout of alert evaluation
4. Final outcome: All ad-hoc query traffic and alerts served from M3
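The slides do not say how the percentage-based rollout in step 3 was implemented. One common approach, sketched here purely as an illustration, is to hash a stable key into a bucket so that the same caller always lands on the same backend and raising the percentage only ever moves callers from Prometheus to M3. The function names, the backend labels "m3" and "prometheus", and the choice of key (for example a dashboard or user ID) are all hypothetical.

```python
import hashlib


def rollout_bucket(key: str) -> int:
    """Map a key to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route_backend(key: str, m3_percent: int) -> str:
    """Pick the query backend for `key` given the current rollout percentage.

    The backend labels are hypothetical placeholders for illustration.
    """
    if not 0 <= m3_percent <= 100:
        raise ValueError("m3_percent must be between 0 and 100")
    # Buckets below the threshold go to M3; everything else stays on Prometheus.
    return "m3" if rollout_bucket(key) < m3_percent else "prometheus"
```

Because the bucket depends only on the key, a caller routed to M3 at 25% is still routed to M3 at 50%, which keeps side-by-side comparisons stable while the rollout advances.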

Switching Over Ad-Hoc Querying Traffic
Per-Service Migration of Alerts
Outcome
● 1-yr migration (mid-2019 to mid-2020)
● M3 runs as the sole metrics provider in all environments across clouds
○ (beta) Global query endpoint available via M3 for all metrics
● User experience largely unchanged (PromQL everything)
● Retention is widely 90d
● Migration went pretty smoothly, avoided major outages
● Higher confidence to continue scaling metrics workloads into upcoming years
● No more giant VMs with 2TB RAM!!
Lessons Learned

M3 From The Trenches
● System metrics to monitor
● General operational advice
● What to alert on
● How we do updates/upgrades
Overview
● Overall, M3 has been amazingly stable
○ By far our biggest issue is running out of disk space
● Across more than 50 deployments only a few have been problematic
○ We'll dive into why, and how to avoid it
M3 at Databricks
● Large number of clusters means things HAVE to be automated
○ We use a combination of Spinnaker and Jenkins to kubectl apply templates
● About 900k samples per second in large clusters
● About 200k series read per second in large clusters
Key Metrics to Watch
● Memory used (alert if steadily over 60%)
○ We've seen that spikes can cause OOMs if you're consistently over this
○ Resolve by
■ Scale up cluster, or reduce incoming metric load
○ sum(container_memory_rss{filter}) by (kubernetes_pod_name)
● Disk space used (alert if predict_linear projects the disk will be full within 14 days)
○ 14 days seems long, but it gives us plenty of time to provision new nodes and allow data to migrate
○ Resolve by
■ Scale up cluster, reduce retention, reduce incoming metric load
○ (kubelet_volume_stats_capacity_bytes{filter} - kubelet_volume_stats_available_bytes{}) / kubelet_volume_stats_capacity_bytes{}
● Cluster scale-up can be slow
○ Be sure to test how long it takes in your cluster
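The disk-space alert above rests on linear extrapolation of recent usage. The following is a minimal Python sketch of that idea, assuming samples arrive as (unix timestamp, used fraction) pairs. It uses an ordinary least-squares fit to illustrate the principle and is not Prometheus's own predict_linear implementation; the function names and the 14-day default are taken from the slide or made up for the example.

```python
from typing import Optional, Sequence, Tuple

SECONDS_PER_DAY = 86_400


def days_until_full(samples: Sequence[Tuple[float, float]]) -> Optional[float]:
    """Fit a least-squares line through (unix_ts, used_fraction) samples and
    return the number of days from the last sample until usage reaches 1.0.

    Returns None when usage is flat or shrinking (no projected fill date).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    slope = cov / var_t  # used fraction gained per second
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    t_full = (1.0 - intercept) / slope  # timestamp at which the line hits 100%
    return max(0.0, (t_full - samples[-1][0]) / SECONDS_PER_DAY)


def should_alert(samples: Sequence[Tuple[float, float]],
                 horizon_days: float = 14.0) -> bool:
    """True when the fitted trend fills the disk within `horizon_days`."""
    days = days_until_full(samples)
    return days is not None and days <= horizon_days
```

For instance, a volume growing by two percentage points a day from 50% would not alert yet, while one growing by five points a day from the same start would.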

General Advice
● Avoid heavy customization
○ Staying as close as possible to what the operator expects works best
● Observe query rates and set limits
● Have a good testing env
○ Need to iterate quickly
○ Be able to throw away data
○ Try to have it at scale
● Have a look at the M3 dashboards and learn what things mean
○ https://grafana.com/grafana/dashboards/8126
○ Very dev focused, suggest making your own with key metrics
Other Alerting
● high latency ingesting samples: coordinator_ingest_latency_bucket
● rate(coordinator_write_errors{code!='4XX'}[1m])
● rate(coordinator_fetch_errors{code!='4XX'}[1m])
● high out of order samples:
○ rate(database_tick_merged_out_of_order_blocks[5m]) > X
○ this can help catch double scrapes
■ Due to pull based arch, this can cause false alerts
○ inhibit during node startup
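The ingest latency metric above is a histogram (the `_bucket` suffix), so an alert on it usually goes through a quantile estimate. As a rough illustration of how a quantile is derived from cumulative buckets, here is a Python sketch that interpolates linearly inside the bucket containing the target rank, which is approximately what PromQL's histogram_quantile does for classic histograms. The function name and the bucket boundaries in the usage note are made up for the example.

```python
import math
from typing import Sequence, Tuple


def estimate_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` holds (upper_bound, cumulative_count) pairs sorted by upper
    bound and ending with the +Inf bucket, as in a `_bucket` series.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # The rank falls beyond the last finite bound: report that bound.
                return buckets[-2][0]
            if count == prev_count:
                # Empty bucket (only reachable when rank is 0): nothing to interpolate.
                return upper
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return buckets[-2][0]
```

With hypothetical buckets of 0.1 s, 0.5 s, 1 s and +Inf, the estimate can never exceed the largest finite bound, so bucket boundaries should bracket the latency threshold being alerted on.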
Upgrades / Updates
● So far very smooth from compatibility standpoint
○ Only seen one small query eval regression
○ Just did the 1.0 update, also smooth
■ Some API changes
● We manage this via Spinnaker + Jenkins
○ One pain point here is the lack of fully self-driving updates (i.e. only kubectl apply)
■ This is actually available now
○ Requires us to be vigilant to ensure our configs and m3db versions stay in sync
● Suggestion: Have a readiness check for coordinators
○ Restarting many at the same time can make k8s unhappy
○ Requires setting a connect consistency on the coordinator config

Metric Spikes
For any high volume system, you will need a way to deal with spikes.
For example: A service adds a label with exploding cardinality
● Have a way to identify the source of the spike
● Be able to cut off that source easily
○ Preferable to OOMing your cluster
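To make "identify the source of the spike" concrete, here is a small Python sketch that counts distinct values per (metric name, label name) across a set of series and ranks the worst offenders. It assumes each series is available as a plain label dictionary with the metric name under `__name__`, as in the Prometheus data model; how those label sets are obtained from the cluster is left open, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Set, Tuple

LabelKey = Tuple[str, str]  # (metric name, label name)


def label_cardinality(series: Iterable[Mapping[str, str]]) -> Dict[LabelKey, int]:
    """Count distinct values seen for each (metric name, label name) pair."""
    values: Dict[LabelKey, Set[str]] = defaultdict(set)
    for labels in series:
        name = labels.get("__name__", "")
        for label, value in labels.items():
            if label != "__name__":
                values[(name, label)].add(value)
    return {key: len(seen) for key, seen in values.items()}


def top_offenders(series: Iterable[Mapping[str, str]],
                  limit: int = 3) -> List[Tuple[LabelKey, int]]:
    """Return the `limit` pairs with the most distinct values, largest first."""
    counts = label_cardinality(series)
    # Sort by descending count, then by key so ties are ordered deterministically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]
```

A label such as a request ID attached to a counter would surface at the top of this ranking, pointing at both the metric and the label to drop or relabel.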
Capacity
● Brief overview of capacity planning at Databricks
● We've found that one m3db replica per 50,000 incoming time-series works pretty well
○ We are write heavy
● For same workload need about 50 write coordinators in two deployments (100 total)
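The rule of thumb above translates into simple arithmetic. The sketch below applies the 50,000-series-per-replica ratio from the slide to a given load; the function name is hypothetical, and the ratio reflects Databricks' write-heavy workload, so it should be treated as a starting point to validate rather than a general constant.

```python
import math


def m3db_replicas_needed(incoming_series: int,
                         series_per_replica: int = 50_000) -> int:
    """Estimate the m3db replica count for a number of incoming time series.

    The default ratio is the rule of thumb quoted in the talk.
    """
    if incoming_series < 0:
        raise ValueError("incoming_series must be non-negative")
    if series_per_replica <= 0:
        raise ValueError("series_per_replica must be positive")
    # Round up so a partially filled replica still counts; never go below one.
    return max(1, math.ceil(incoming_series / series_per_replica))
```

Under this ratio, one million incoming series would call for 20 replicas, and any load just above a multiple of 50,000 rounds up to the next replica.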
Future Work
Some examples of nifty new things M3 will enable us to do now that we're getting operationally mature
● Downsampling for older metrics
○ Expect a significant savings in disk space
● Using different namespaces for metrics with different requirements
● Allowing direct push into M3 from difficult-to-scrape services
○ E.g. Databricks jobs, developer laptops
Conclusion
● Overall a successful migration for us
● Community has been helpful
● Nice new things on the horizon

Recommended for you

How to Design for Database High Availability
How to Design for Database High AvailabilityHow to Design for Database High Availability
How to Design for Database High Availability

Highly available databases are essential to organizations depending on mission-critical, 24/7 access to data. Postgres is widely recognized as an excellent open-source database, with critical maturity and features that allow organizations to scale and achieve high availability. This webinar will explore: - Evolution of replication in Postgres - Streaming replication - Logical replication - Replication for high availability - Important high availability parameters - Options to monitor high availability - HA infrastructure to patch the database with minimal downtime - EDB Postgres Failover Manager (EFM) - EDB tools to create a highly available Postgres architecture

 
by EDB
databasepostgresopen source
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month

TubeMogul grew from few servers to over two thousands servers and handling over one trillion http requests a month, processed in less than 50ms each. To keep up with the fast growth, the SRE team had to implement an efficient Continuous Delivery infrastructure that allowed to do over 10,000 puppet deployment and 8,500 application deployment in 2014. In this presentation, we will cover the nuts and bolts of the TubeMogul operations engineering team and how they overcome challenges.

tubemogulfastlyjenkins
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...

In this InfluxDays NYC 2019 session, Richard Laskey from the Wayfair Storefront team will share their monitoring best practices using InfluxEnterprise. These efforts are critical and help improve the user experience by driving forward site-wide improvements, establishing best practices, and driving change through many different teams.

influxdaysinfluxdbinfluxdata

More Related Content

What's hot

[오픈소스컨설팅]엔터프라이즈 오픈소스 도입전략
[오픈소스컨설팅]엔터프라이즈 오픈소스 도입전략[오픈소스컨설팅]엔터프라이즈 오픈소스 도입전략
[오픈소스컨설팅]엔터프라이즈 오픈소스 도입전략
Ji-Woong Choi
 
[SW 아키텍처 컨퍼런스] 클라우드 아키텍처 개론
[SW 아키텍처 컨퍼런스] 클라우드 아키텍처 개론[SW 아키텍처 컨퍼런스] 클라우드 아키텍처 개론
[SW 아키텍처 컨퍼런스] 클라우드 아키텍처 개론
Alex Hahn
 
Introduction to Google Compute Engine
Introduction to Google Compute EngineIntroduction to Google Compute Engine
Introduction to Google Compute Engine
Colin Su
 
Intro into Rook and Ceph on Kubernetes
Intro into Rook and Ceph on KubernetesIntro into Rook and Ceph on Kubernetes
Intro into Rook and Ceph on Kubernetes
Kublr
 
Microservices for Application Modernisation
Microservices for Application ModernisationMicroservices for Application Modernisation
Microservices for Application Modernisation
Ajay Kumar Uppal
 
[오픈소스컨설팅]유닉스의 리눅스 마이그레이션 전략_v3
[오픈소스컨설팅]유닉스의 리눅스 마이그레이션 전략_v3[오픈소스컨설팅]유닉스의 리눅스 마이그레이션 전략_v3
[오픈소스컨설팅]유닉스의 리눅스 마이그레이션 전략_v3
Ji-Woong Choi
 
Platform & Application Modernization
Platform & Application ModernizationPlatform & Application Modernization
Platform & Application Modernization
JK Tech
 
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
DevOpsDays Tel Aviv
 
[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화
[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화
[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화
OpenStack Korea Community
 
마이크로서비스 아키텍처 기반의 의료정보시스템 고도화 전환사례.건국대학교병원.이제관
마이크로서비스 아키텍처 기반의 의료정보시스템 고도화 전환사례.건국대학교병원.이제관마이크로서비스 아키텍처 기반의 의료정보시스템 고도화 전환사례.건국대학교병원.이제관
마이크로서비스 아키텍처 기반의 의료정보시스템 고도화 전환사례.건국대학교병원.이제관
제관 이
 
Modern CI/CD Pipeline Using Azure DevOps
Modern CI/CD Pipeline Using Azure DevOpsModern CI/CD Pipeline Using Azure DevOps
Modern CI/CD Pipeline Using Azure DevOps
GlobalLogic Ukraine
 
Jenkins를 활용한 Openshift CI/CD 구성
Jenkins를 활용한 Openshift CI/CD 구성 Jenkins를 활용한 Openshift CI/CD 구성
Jenkins를 활용한 Openshift CI/CD 구성
rockplace
 
Building an SRE Organization @ Squarespace
Building an SRE Organization @ SquarespaceBuilding an SRE Organization @ Squarespace
Building an SRE Organization @ Squarespace
Franklin Angulo
 
The Next Generation of Hyperconverged Infrastructure - Cisco
The Next Generation of Hyperconverged Infrastructure - CiscoThe Next Generation of Hyperconverged Infrastructure - Cisco
The Next Generation of Hyperconverged Infrastructure - Cisco
MarcoTechnologies
 
Dev ops Introduction
Dev ops IntroductionDev ops Introduction
Dev ops Introduction
영기 김
 
GT.M: A Tried and Tested Open-Source NoSQL Database
GT.M: A Tried and Tested Open-Source NoSQL DatabaseGT.M: A Tried and Tested Open-Source NoSQL Database
GT.M: A Tried and Tested Open-Source NoSQL Database
Rob Tweed
 
4. 대용량 아키텍쳐 설계 패턴
4. 대용량 아키텍쳐 설계 패턴4. 대용량 아키텍쳐 설계 패턴
4. 대용량 아키텍쳐 설계 패턴
Terry Cho
 
GCP-pde.pdf
GCP-pde.pdfGCP-pde.pdf
GCP-pde.pdf
NirajKumar938204
 
Elastic APM: Amping up your logs and metrics for the full picture
Elastic APM: Amping up your logs and metrics for the full pictureElastic APM: Amping up your logs and metrics for the full picture
Elastic APM: Amping up your logs and metrics for the full picture
Elasticsearch
 
Api observability
Api observability Api observability
Api observability
Red Hat
 

What's hot (20)

[오픈소스컨설팅]엔터프라이즈 오픈소스 도입전략
[오픈소스컨설팅]엔터프라이즈 오픈소스 도입전략[오픈소스컨설팅]엔터프라이즈 오픈소스 도입전략
[오픈소스컨설팅]엔터프라이즈 오픈소스 도입전략
 
[SW 아키텍처 컨퍼런스] 클라우드 아키텍처 개론
[SW 아키텍처 컨퍼런스] 클라우드 아키텍처 개론[SW 아키텍처 컨퍼런스] 클라우드 아키텍처 개론
[SW 아키텍처 컨퍼런스] 클라우드 아키텍처 개론
 
Introduction to Google Compute Engine
Introduction to Google Compute EngineIntroduction to Google Compute Engine
Introduction to Google Compute Engine
 
Intro into Rook and Ceph on Kubernetes
Intro into Rook and Ceph on KubernetesIntro into Rook and Ceph on Kubernetes
Intro into Rook and Ceph on Kubernetes
 
Microservices for Application Modernisation
Microservices for Application ModernisationMicroservices for Application Modernisation
Microservices for Application Modernisation
 
[오픈소스컨설팅]유닉스의 리눅스 마이그레이션 전략_v3
[오픈소스컨설팅]유닉스의 리눅스 마이그레이션 전략_v3[오픈소스컨설팅]유닉스의 리눅스 마이그레이션 전략_v3
[오픈소스컨설팅]유닉스의 리눅스 마이그레이션 전략_v3
 
Platform & Application Modernization
Platform & Application ModernizationPlatform & Application Modernization
Platform & Application Modernization
 
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
 
[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화
[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화
[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화
 
마이크로서비스 아키텍처 기반의 의료정보시스템 고도화 전환사례.건국대학교병원.이제관
마이크로서비스 아키텍처 기반의 의료정보시스템 고도화 전환사례.건국대학교병원.이제관마이크로서비스 아키텍처 기반의 의료정보시스템 고도화 전환사례.건국대학교병원.이제관
마이크로서비스 아키텍처 기반의 의료정보시스템 고도화 전환사례.건국대학교병원.이제관
 
Modern CI/CD Pipeline Using Azure DevOps
Modern CI/CD Pipeline Using Azure DevOpsModern CI/CD Pipeline Using Azure DevOps
Modern CI/CD Pipeline Using Azure DevOps
 
Jenkins를 활용한 Openshift CI/CD 구성
Jenkins를 활용한 Openshift CI/CD 구성 Jenkins를 활용한 Openshift CI/CD 구성
Jenkins를 활용한 Openshift CI/CD 구성
 
Building an SRE Organization @ Squarespace
Building an SRE Organization @ SquarespaceBuilding an SRE Organization @ Squarespace
Building an SRE Organization @ Squarespace
 
The Next Generation of Hyperconverged Infrastructure - Cisco
The Next Generation of Hyperconverged Infrastructure - CiscoThe Next Generation of Hyperconverged Infrastructure - Cisco
The Next Generation of Hyperconverged Infrastructure - Cisco
 
Dev ops Introduction
Dev ops IntroductionDev ops Introduction
Dev ops Introduction
 
GT.M: A Tried and Tested Open-Source NoSQL Database
GT.M: A Tried and Tested Open-Source NoSQL DatabaseGT.M: A Tried and Tested Open-Source NoSQL Database
GT.M: A Tried and Tested Open-Source NoSQL Database
 
4. 대용량 아키텍쳐 설계 패턴
4. 대용량 아키텍쳐 설계 패턴4. 대용량 아키텍쳐 설계 패턴
4. 대용량 아키텍쳐 설계 패턴
 
GCP-pde.pdf
GCP-pde.pdfGCP-pde.pdf
GCP-pde.pdf
 
Elastic APM: Amping up your logs and metrics for the full picture
Elastic APM: Amping up your logs and metrics for the full pictureElastic APM: Amping up your logs and metrics for the full picture
Elastic APM: Amping up your logs and metrics for the full picture
 
Api observability
Api observability Api observability
Api observability
 

Scaling Monitoring At Databricks From Prometheus to M3

  • 1. Scaling Monitoring At Databricks From Prometheus to M3 YY Wan & Nick Lanham Virtual M3 Day 2/18/21
  • 2. Introduction Nick Lanham Senior Software Engineer Observability Team YY Wan Software Engineer Observability Team
  • 3. About
    ● Founded in 2013 by the original creators of Apache Spark
    ● Data and AI platform as a service for 5000+ customers
    ● 1500+ employees, 400+ engineers, >$400M annual recurring revenue
    ● 3 cloud providers, 50+ regions
    ● Launching millions of VMs / day to run data engineering and ML workloads, processing exabytes of data
  • 4. Agenda
    ● Monitoring at Databricks before M3
    ● Deploying M3
      ○ Architecture
      ○ Migration
    ● Lessons Learned
      ○ Operational advice
      ○ Things to monitor
      ○ Updates and upgrades
  • 6. Monitoring At Databricks
    ● Monitoring targets:
      ○ Cloud-native, majority of services run on Kubernetes
      ○ Customer Spark workloads run on VMs in customer environments
    ● Prometheus-based monitoring since 2016
    ● All service teams use metrics, dashboards, alerts
      ○ Most engineers are PromQL-literate
    ● Use-cases: real-time alerting, debugging, SLO reporting, automated event response
    ● Monitoring and data-drivenness are core to Databricks engineering culture
  • 8. Scale Numbers
    ● 50+ regions / k8s clusters across multiple cloud providers
    ● 100+ microservices
    ● Infrastructure footprint of 4M+ VMs of Databricks services and customer Apache Spark workers
    ● Largest single Prometheus instance
      ○ 900k samples / sec
      ○ High churn rate: many metrics have fewer than 100 samples (i.e. metrics from short-lived Spark jobs persist for under 100 minutes at a 1-minute scrape interval)
      ○ Disk usage (15d retention): 4TB
      ○ Huge AWS VM: x1e.16xlarge, 64 cores, 1952GB RAM
  • 9. Scaling Bottlenecks & Pain Points
    Operational
    ● Frequent capacity issues - OOMs, high disk usage
    ● Multi-hour Prometheus updates (long WAL recovery process during startup)
    UX
    ● Mental overhead of sharded view of metrics
    ● Big queries never completing (and causing OOMs)
    ● Short retention period
    ● Subject to strict metric whitelist
  • 10. Searching for a Scalable Monitoring Solution
    Requirements:
    ● High metric volume, cardinality, churn rate
    ● Minimum 90d retention
    ● Compatible with PromQL
    ● Global view of (some) metrics
    ● High availability setup
    Nice-to-have:
    ● Good update and maintenance story - less manual intervention, no metrics gaps
    ● Battle-tested in large scale production environment
    ● Open source
    Alternatives considered (mid-2019): sharded Prometheus, Thanos, Cortex, Datadog, SignalFx
  • 11. Why M3?
    ● Fulfilled all our hard requirements
      ○ Designed for large scale workloads and horizontally scalable
      ○ Exposes Prometheus API query endpoint
      ○ High availability with multi-replica setup
      ○ Designed for multi-region and cloud setup, with global querying feature
    ● Battle-tested at high scale at Uber in a production environment
    ● Has a Kubernetes operator for automated cluster operations
    ● Cool features that we would be interested to use
      ○ Aggregation on ingest
      ○ Downsampling (potentially longer retention)
  • 14. Making the Write Path Scalable
  • 15. Building Our Own Rule Engine
  • 16. Zooming In On M3 Setup
  • 18. Monitoring M3 & Final Architecture
  • 20. Migration
    1. Shadow deployment
      ○ Dual-write metrics to both Prom and M3 storage
      ○ Evaluate alerts using both the Prom and M3 rule engines
      ○ Open a querying endpoint for the Observability team to test queries and dashboarding
    2. Behavior validation
      ○ Compare alert evaluation between old and new system
      ○ Compare dashboards side-by-side
    3. Incremental rollout strategy
      ○ Percentage-based rollout of ad-hoc query traffic to M3, staged across environments
      ○ Per-service rollout of alert evaluation
    4. Final outcome: All ad-hoc query traffic and alerts served from M3
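The behavior-validation step (comparing alert evaluation between the old and new system) could be sketched roughly as below. This is only an illustration, not the tooling Databricks used: the helper names `alert_key` and `diff_alerts`, and the idea of representing a firing alert as a name plus sorted label pairs, are assumptions made for the example.

```python
def alert_key(name, labels):
    """Build a hashable, order-independent identity for a firing alert.

    `name` is the alert name, `labels` a dict of label -> value.
    (Hypothetical representation chosen for this sketch.)
    """
    return (name, tuple(sorted(labels.items())))


def diff_alerts(prom_firing, m3_firing):
    """Compare the alerts firing in each system during the same window."""
    prom_set, m3_set = set(prom_firing), set(m3_firing)
    return {
        "only_prometheus": sorted(prom_set - m3_set),  # fired only in the old system
        "only_m3": sorted(m3_set - prom_set),          # fired only in the new system
        "agree": sorted(prom_set & m3_set),            # fired in both
    }


if __name__ == "__main__":
    prom = [alert_key("HighLatency", {"service": "api"}),
            alert_key("DiskFull", {"pod": "db-0"})]
    m3 = [alert_key("HighLatency", {"service": "api"})]
    for bucket, alerts in diff_alerts(prom, m3).items():
        print(bucket, alerts)
```

Any entry in `only_prometheus` or `only_m3` marks a disagreement worth investigating before moving a service's alert evaluation over.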
  • 21. Switching Over Ad-Hoc Querying Traffic
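A percentage-based rollout of ad-hoc query traffic, as described in the migration plan, is often implemented by hashing a stable request key into a bucket. The sketch below assumes that approach; the slides do not say how the routing was actually done, and `routes_to_m3`, `backend_for` and the choice of request key are hypothetical.

```python
import hashlib


def routes_to_m3(request_key: str, percent: int) -> bool:
    """Return True if this key falls inside the rollout percentage.

    Hashing keeps the decision stable for a given key, and raising
    `percent` only ever moves more keys to M3, never back.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < percent


def backend_for(request_key: str, percent: int) -> str:
    """Pick the query backend for one ad-hoc query."""
    return "m3" if routes_to_m3(request_key, percent) else "prometheus"


if __name__ == "__main__":
    for user in ["alice", "bob", "carol"]:
        print(user, backend_for(user, 25))
```

Keying on something stable such as a user or dashboard identifier means a given user sees one backend consistently at each stage, which makes side-by-side comparisons easier to reason about.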
  • 23. Outcome
    ● 1-yr migration (mid-2019 to mid-2020)
    ● M3 runs as the sole metrics provider in all environments across clouds
      ○ (beta) Global query endpoint available via M3 for all metrics
    ● User experience largely unchanged (PromQL everything)
    ● Retention is widely 90d
    ● Migration went pretty smoothly, avoided major outages
    ● Higher confidence to continue scaling metrics workloads into upcoming years
    ● No more giant VMs with 2TB RAM!!
  • 25. M3 From The Trenches
    ● System metrics to monitor
    ● General operational advice
    ● What to alert on
    ● How we do updates/upgrades
  • 26. Overview
    ● Overall M3 has been amazingly stable
      ○ By far our biggest issue is running out of disk space
    ● Across more than 50 deployments only a few have been problematic
      ○ We'll dive into why, and how to avoid it
  • 27. M3 at Databricks
    ● Large number of clusters means things HAVE to be automated
      ○ We use a combination of Spinnaker and Jenkins to kubectl apply templates
    ● About 900k samples per second in large clusters
    ● About 200k series read per second in large clusters
  • 28. Key Metrics to Watch
    ● Memory used (alert if steadily over 60%)
      ○ We've seen that spikes can cause OOMs if you're consistently over this
      ○ Resolve by
        ■ Scale up cluster, or reduce incoming metric load
      ○ sum(container_memory_rss{filter}) by (kubernetes_pod_name)
    ● Disk space used (alert if predict_linear shows the disk full within 14 days)
      ○ 14 days seems long, but it gives us plenty of time to provision new nodes and allow data to migrate
      ○ Resolve by
        ■ Scale up cluster, reduce retention, reduce incoming metric load
      ○ (kubelet_volume_stats_capacity_bytes{filter} - kubelet_volume_stats_available_bytes{}) / kubelet_volume_stats_capacity_bytes{}
    ● Cluster scale-up can be slow
      ○ Be sure to test how long it takes in your cluster
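To make the disk alert concrete, here is a minimal sketch of the idea behind a predict_linear-style check: fit a straight line to recent disk-usage samples and ask whether the extrapolated value reaches capacity within 14 days. It is a simplification written for illustration (it extrapolates from the last sample's timestamp, whereas PromQL's `predict_linear` extrapolates from the evaluation time), and the function names and sample data are assumptions.

```python
def predict_linear(samples, horizon_seconds):
    """Least-squares linear extrapolation over (timestamp, value) samples.

    Returns the value predicted `horizon_seconds` after the last sample.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + horizon_seconds) + intercept


def disk_full_within(samples, capacity, days=14):
    """True if the fitted trend reaches `capacity` within `days` days."""
    return predict_linear(samples, days * 86400) >= capacity


if __name__ == "__main__":
    # Hypothetical data: usage grows 1 GB per day starting from 80 GB.
    usage = [(day * 86400, 80.0 + day) for day in range(5)]
    print(disk_full_within(usage, capacity=100.0))  # trend stays below 100 GB
    print(disk_full_within(usage, capacity=95.0))   # trend crosses 95 GB
```

The long horizon matters because, as the slide notes, provisioning nodes and migrating data takes time; the alert should fire while there is still room to act.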
  • 29. General Advice
    ● Avoid a lot of custom things
      ○ As close to what the operator expects is the best
    ● Observe query rates and set limits
    ● Have a good testing env
      ○ Need to iterate quickly
      ○ Be able to throw away data
      ○ Try to have it at scale
    ● Have a look at the M3 dashboards and learn what things mean
      ○ https://grafana.com/grafana/dashboards/8126
      ○ Very dev focused, suggest making your own with key metrics
  • 31. Other Alerting
    ● High latency ingesting samples: coordinator_ingest_latency_bucket
    ● rate(coordinator_write_errors{code!='4XX'}[1m])
    ● rate(coordinator_fetch_errors{code!='4XX'}[1m])
    ● High out-of-order samples:
      ○ rate(database_tick_merged_out_of_order_blocks[5m]) > X
      ○ This can help catch double scrapes
        ■ Due to pull-based arch, this can cause false alerts
      ○ Inhibit during node startup
  • 32. Upgrades / Updates
    ● So far very smooth from a compatibility standpoint
      ○ Only seen one small query eval regression
      ○ Just did the 1.0 update, also smooth
        ■ Some API changes
    ● We manage this via Spinnaker + Jenkins
      ○ One pain point here is lack of fully self-driving updates (i.e. only kubectl apply)
        ■ Is actually now available
      ○ Requires us to be vigilant to ensure our configs and m3db versions stay in sync
    ● Suggestion: Have a readiness check for coordinators
      ○ Restarting many at the same time can make k8s unhappy
      ○ Requires setting a connect consistency on the coordinator config
  • 33. Metric Spikes
    For any high volume system, you will need a way to deal with spikes. For example: a service adds a label with exploding cardinality.
    ● Have a way to identify the source of the spike
    ● Be able to cut off that source easily
      ○ Preferable to OOMing your cluster
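One way to identify the source of a spike is to count series per service and distinct values per label over a sample of recently ingested series. The sketch below assumes series are available as plain label dictionaries; how such a sample is obtained, and the names `series_per_label_value` and `label_cardinality`, are hypothetical and not taken from the talk.

```python
from collections import Counter, defaultdict


def series_per_label_value(series, label):
    """Count series per value of `label`, largest first."""
    counts = Counter(s.get(label, "<missing>") for s in series)
    return counts.most_common()


def label_cardinality(series):
    """Count distinct values per label name, largest first."""
    values = defaultdict(set)
    for s in series:
        for name, value in s.items():
            values[name].add(value)
    return sorted(((name, len(vals)) for name, vals in values.items()),
                  key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical sample: one service attached a per-request label.
    sample = [{"__name__": "http_requests_total", "service": "api",
               "request_id": str(i)} for i in range(50)]
    sample.append({"__name__": "up", "service": "db"})
    print(series_per_label_value(sample, "service")[:3])
    print(label_cardinality(sample)[:3])
```

The first view points at the offending service, the second at the offending label, which together say what to cut off.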
  • 34. Capacity
    ● Brief overview of capacity planning at Databricks
    ● We've found that one m3db replica per 50,000 incoming time-series works pretty well
      ○ We are write heavy
    ● For same workload need about 50 write coordinators in two deployments (100 total)
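The rule of thumb above translates into simple arithmetic. The sketch below assumes "incoming time-series" means the number of series being written to a cluster, and the function name is made up for illustration; the 50,000 figure is the only number taken from the slide, and it reflects Databricks' write-heavy workload rather than a general guideline.

```python
SERIES_PER_REPLICA = 50_000  # rule of thumb quoted on the slide


def m3db_replicas_needed(incoming_series, series_per_replica=SERIES_PER_REPLICA):
    """Round up so a partially filled replica still counts as a whole one."""
    if incoming_series < 0 or series_per_replica <= 0:
        raise ValueError("incoming_series must be >= 0 and series_per_replica > 0")
    return -(-incoming_series // series_per_replica)  # integer ceiling division


if __name__ == "__main__":
    for series in (120_000, 1_000_000, 1_000_001):
        print(series, "series ->", m3db_replicas_needed(series), "replicas")
```

Integer ceiling division avoids floating-point rounding surprises at exact multiples of the per-replica figure.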
  • 35. Future Work
    Some examples of nifty new things M3 will enable us to do now that we're getting operationally mature:
    ● Downsampling for older metrics
      ○ Expect significant savings in disk space
    ● Using different namespaces for metrics with different requirements
    ● Allowing direct push into M3 from difficult-to-scrape services
      ○ E.g. Databricks jobs, developer laptops
  • 36. Conclusion
    ● Overall a successful migration for us
    ● Community has been helpful
    ● Nice new things on the horizon