Data Science for Infrastructure:
Observe, Understand, Automate
Zain Asgar & Natalie Serrino
https://px.dev
Natalie Serrino (@nserrino)
Principal Engineer - TLM @ New Relic
Prior: Eng @ Observe, Eng @ Trifacta, Eng @ Intel

Zain Asgar (@zainasgar)
GM @ New Relic
Adjunct Professor of CS @ Stanford
Prior: Co-founder/CEO - Pixie Labs; Eng @ Google, Trifacta, NVIDIA
https://px.dev
We see observability as a data problem
- It’s easy for machines to generate GBs of data per second
- It’s hard to get complete coverage of applications, especially in distributed environments
- It’s hard to make sure this data is relevant
- It’s hard to distill the data into something usable
https://px.dev
What we learned in the data space
- Collecting the right data is half the battle
- Simple models on relevant data usually outperform complex models on a
skewed/incomplete dataset
- Important to be able to audit and inspect your data pipelines
https://px.dev
How to do data-driven automation?
Gather raw data!
- Need variety and depth in input data
- ⏰ Most time is spent here
Transform data into signal!
- Can be a simple rule set or a statistical/ML model
- 👀 Disproportionate emphasis
Do something based on signal!
- Huge possibilities here with the Kubernetes API
- 🤞 Ideally with limits + alerts
https://px.dev
How to do data-driven automation?
Gather raw data!
- Logs
- Application metrics
- Raw requests
Transform data into signal!
- Aggregates
- Anomaly detection
- Regex
- Machine learning models
Do something based on signal!
- Ping Slack/JIRA
- Scale deployment up/down
- Allocate more resources
https://px.dev
How to do data-driven automation?
Gather raw data!
- Logs
- Infrastructure utilization
- Application metrics
- Raw requests
- Application profiles
- Network connections
- Kubernetes state
Transform data into signal!
- Mostly data wrangling...
- Aggregates
- Anomaly detection
- Thresholds (see the sketch after this list)
- Regex/pattern-matching
- Linear regression
- Machine learning models
Do something based on signal!
- Ping Slack/JIRA
- Scale deployment up/down
- Restart pod/service
- Page someone
- Allocate more resources
- Roll back
- Disable/enable feature
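The transform step does not have to be fancy. A rough sketch of the gather → transform → act loop in plain Python (the threshold, the samples, and the scaling rule are illustrative assumptions, not Pixie behavior):

# Minimal sketch of the gather -> transform -> act loop.
# The threshold and the scaling rule are illustrative assumptions.
THRESHOLD_REQ_PER_S = 100.0  # assumed per-pod throughput target

def to_signal(samples):
    """Transform: collapse raw per-second samples into one signal (a simple average)."""
    return sum(samples) / len(samples) if samples else 0.0

def decide_replicas(signal, current_replicas):
    """Act: a simple rule set deciding how many replicas the deployment should have."""
    if signal > THRESHOLD_REQ_PER_S:
        return current_replicas + 1
    if signal < THRESHOLD_REQ_PER_S / 2 and current_replicas > 1:
        return current_replicas - 1
    return current_replicas

# Gather (stubbed): pretend these samples came from your telemetry pipeline.
print(decide_replicas(to_signal([130.0, 142.0, 138.0]), current_replicas=3))  # -> 4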
https://px.dev
We built Pixie to solve these problems
Auto-telemetry using eBPF
100% scriptable & API-driven
Kubernetes native
https://px.dev
Auto-Telemetry using eBPF
Application, network, and infrastructure data
Full-body request traces and flamegraphs!
Low overhead! <5% CPU
https://px.dev
Kubernetes Native
Query Kubernetes entities like pods, services, deployments, nodes!
Entirely in-cluster data storage and edge compute
https://px.dev
API driven & 100% Scriptable
Infrastructure as code!
Everything is a script and can be accessed via API
Easily integrate with Grafana, Slack, or other tools
import px

def http_data():
    # Pull the last 30s of HTTP events traced by Pixie.
    df = px.DataFrame(table='http_events', start_time='-30s')
    # Attach the Kubernetes pod name from the execution context.
    df.pod = df.ctx['pod']
    # Keep only the columns we care about.
    return df[['pod', 'http_req_path', 'http_resp_latency_ns']]

px.display(http_data())
🔍 Query
⛏ Collect
Don’t invent a new language
PxL provides a programmable API for Pixie
● Valid
import px

def http_data():
    df = px.DataFrame(table='http_events', start_time='-30s')
    df.pod = df.ctx['pod']
    return df[['pod', 'http_req_path', 'http_resp_latency_ns']]

px.display(http_data())
PxL is an embedded DSL
● Valid
● Valid
import px

def http_data():
    df = px.DataFrame(table='http_events', start_time='-30s')
    df.pod = df.ctx['pod']
    return df[['pod', 'http_req_path', 'http_resp_latency_ns']]

px.display(http_data())
PxL is an embedded DSL
● Valid
● Valid
● Built for data analysis and ML
import px

def http_data():
    df = px.DataFrame(table='http_events', start_time='-30s')
    df.pod = df.ctx['pod']
    return df[['pod', 'http_req_path', 'http_resp_latency_ns']]

px.display(http_data())
PxL is an embedded DSL
import px

def http_data():
    df = px.DataFrame(table='http_events', start_time='-30s')
    df.pod = df.ctx['pod']
    return df[['pod', 'http_req_path', 'http_resp_latency_ns']]

px.display(http_data())
PxL is a dataflow language
- PxL specifies the logical flow of data (declarative)
- Pixie plans & optimizes the execution
[Diagram: execution graph of operators with data flowing between them]
How do I Transform Data?
import px

def http_data():
    df = px.DataFrame(table='http_events', start_time='-30s')
    df.pod = df.ctx['pod']
    return df[['pod', 'http_req_path', 'http_resp_latency_ns']]

px.display(http_data())
All transforms = methods on a PxL DataFrame:
- Aggregate
- Join
- Filter
- ...etc
PxL scripts use transforms to analyze data (a short sketch follows below)
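For example, a rough sketch of a filter plus a groupby/aggregate on a PxL DataFrame (column names follow the http_events example above; the 250ms cutoff is arbitrary, and px.count is assumed to be the count aggregate):

import px

def slow_pods():
    # Start from raw HTTP events, as in the example above.
    df = px.DataFrame(table='http_events', start_time='-5m')
    df.pod = df.ctx['pod']
    # Filter: keep requests slower than 250ms (latency is in nanoseconds).
    df = df[df.http_resp_latency_ns > 250 * 1000 * 1000]
    # Aggregate: count slow requests per pod.
    df = df.groupby(['pod']).agg(slow_requests=('http_resp_latency_ns', px.count))
    return df

px.display(slow_pods())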
import px

def http_data():
    df = px.DataFrame(table='http_events', start_time='-30s')
    df.pod = df.ctx['pod']
    return df[['pod', 'http_req_path', 'http_resp_latency_ns']]

px.display(http_data())
Declarative +
Functional +
No implicit side effects
=
Composable
PxL scripts are composable
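A rough illustration of that composability, reusing http_data() from the slide above as a building block (the 500ms latency cutoff is an arbitrary example value):

import px

def http_data():
    df = px.DataFrame(table='http_events', start_time='-30s')
    df.pod = df.ctx['pod']
    return df[['pod', 'http_req_path', 'http_resp_latency_ns']]

def slow_http_data():
    # http_data() has no implicit side effects, so it composes cleanly.
    df = http_data()
    return df[df.http_resp_latency_ns > 500 * 1000 * 1000]

px.display(slow_http_data())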
https://px.dev
PxL provides an interface to work with data
It allows us to construct powerful, composable workflows.
The following demos demonstrate this capability:
1. Slack alert on SQL injection attacks
2. Auto-scale deployment by HTTP request throughput
Demos!
https://px.dev
> px deploy
Demo 1: Slack Alert for SQL Injection Attacks
Demo app: DVWA
https://github.com/digininja/DVWA
https://px.dev
What is a SQL injection?
“SQL injection is a code injection technique used to attack applications, in which malicious SQL statements are inserted into an entry field for execution.”
https://px.dev
Example SQL injection
User accesses
http://foobar.com?user_id=123
Application executes
SELECT * from users where user_id=123
Malicious actor accesses
http://foobar.com?user_id=123 or 1=1
Application executes
SELECT * from users where user_id=123 or 1=1
https://px.dev
How can we detect SQL injections?
💥 Rules 💥
- Parse the query to detect prohibited syntax (e.g. unions)
- Regexes to detect prohibited syntax (rough sketch below)
💭 Complication: What if your app has a legitimate use of union?
💥 Machine learning 💥
- Train a model on real-world examples
- Can theoretically learn that certain usages of syntax are okay
💭 Complication: Where to get the dataset?
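A rough sketch of the rules approach in PxL, flagging SQL requests that match a tiny regex rule set (the mysql_events table, the req_body column, and px.regex_match are assumptions based on Pixie's protocol tracing; the real script is linked at blog.px.dev/sql-injection below):

import px

def possible_sqli():
    # Raw SQL events traced by Pixie (assumed table/column names).
    df = px.DataFrame(table='mysql_events', start_time='-5m')
    df.pod = df.ctx['pod']
    # Illustrative rule set only; a real one would be much broader.
    df.is_suspicious = px.regex_match('(?i).*(union select|or 1=1|drop table).*', df.req_body)
    df = df[df.is_suspicious]
    return df[['pod', 'req_body']]

px.display(possible_sqli())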
https://px.dev
Vulnerability testing tool 🚀
SQL Vulnerability testing via
github.com/SQLMapproject/SQLMap
Live Demo 1!
https://px.dev
Slack Alert for SQL Injection Attacks
Gather raw data!
- Collect raw SQL events
Transform data into signal!
- Diagnose SQL injection events
Do something based on signal!
- Generate alert about SQL injections
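The "do something" step here can be as small as posting the suspicious queries to a Slack incoming webhook. A minimal Python sketch, assuming you already have the matching rows (get_suspicious_queries() is a hypothetical placeholder; the real demo drives this through the Pixie API, as described in the blog post linked at the end):

import os
import requests  # third-party: pip install requests

SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]  # your incoming webhook (placeholder)

def get_suspicious_queries():
    # Hypothetical placeholder: in the real demo these rows come from running
    # the PxL detection script through the Pixie API.
    return ["SELECT * from users where user_id=123 or 1=1"]

def alert_on_sqli():
    queries = get_suspicious_queries()
    if not queries:
        return
    text = "Possible SQL injection detected:\n" + "\n".join(queries)
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)

if __name__ == "__main__":
    alert_on_sqli()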
Demo 2: Autoscale deployment by HTTP
request throughput
https://px.dev
Autoscaling
💭 How do you know how many pods your deployment should
have?
💭 How do you know the amount of resources to provision for
those pods?
https://px.dev
Possible autoscaling metrics
- CPU, memory of pod
- Avg / p90 / p99 request latency
- Latency of downstream dependencies
- # of outbound connections
- Application-specific metrics
- ….. Many more …...
https://px.dev
K8s Autoscalers
- Both “Horizontal” and “Vertical” scaling
- Some built-in autoscaling metrics:
- Pod CPU
- Pod Memory
- The custom metrics API allows scaling on custom metrics! 😎
https://github.com/kubernetes/metrics
Credit: kubernetes.io
https://px.dev
Very sophisticated demo app
https://px.dev
Other tools supporting this demo
Custom metrics server adapted from this project:
github.com/kubernetes-sigs/custom-metrics-apiserver
👆 Check it out to build your own K8s metrics server!
HTTP load testing via Hey
https://github.com/rakyll/hey
Live Demo 2!
https://px.dev
Autoscale deployment by HTTP request throughput
Gather raw data!
- Collect raw HTTP requests
Transform data into signal!
- Calculate HTTP req/s by pod
Do something based on signal!
- Autoscale # of pods by HTTP req/s
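A rough PxL sketch of the transform step: compute per-pod HTTP request throughput over the last 30 seconds (column names follow the earlier examples; the real end-to-end script is linked in the autoscaling blog post below):

import px

def http_throughput_per_pod():
    df = px.DataFrame(table='http_events', start_time='-30s')
    df.pod = df.ctx['pod']
    # Count requests per pod over the 30s window, then convert to req/s.
    df = df.groupby(['pod']).agg(requests=('http_req_path', px.count))
    df.requests_per_s = df.requests / 30.0
    return df[['pod', 'requests_per_s']]

px.display(http_throughput_per_pod())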
https://px.dev
We’d love to get your feedback
In these demos we showed some simple data workflows on Pixie.
- More details about SQL injection here: blog.px.dev/sql-injection
- More details about autoscaling: blog.px.dev/autoscaling-custom-k8s-metric
What’s next:
- We are working on XSS detection.
- We want to learn about more use cases. Find us on GitHub (pixie-io/pixie) or
Slack (slackin.px.dev).
Thanks!
Github: github.com/pixie-io/pixie
Blog: blog.px.dev
Website: px.dev