Data science for infrastructure dev week 2022
- 1. Data Science for Infrastructure: Observe, Understand, Automate
Zain Asgar & Natalie Serrino
- 2. https://px.dev
Zain Asgar (@zainasgar)
- GM @ New Relic
- Adjunct Professor of CS @ Stanford
- Prior: Co-founder/CEO - Pixie Labs; Eng @ Google, Trifacta, NVIDIA
Natalie Serrino (@nserrino)
- Principal Engineer - TLM @ New Relic
- Prior: Eng @ Observe, Eng @ Trifacta, Eng @ Intel
- 3. We see observability as a data problem
- It’s easy for machines to generate GBs of data per second
- It’s hard to get complete coverage of applications, especially in distributed environments
- It’s hard to make sure this data is relevant
- It’s hard to distill the data into something usable
- 4. What we learned in the data space
- Collecting the right data is half the battle
- Simple models on relevant data usually outperform complex models on a skewed/incomplete dataset
- It is important to be able to audit and inspect your data pipelines
- 5. How to do data-driven automation?
1. Gather raw data!
- ⏰ Most time is spent here
- Need variety and depth in input data
2. Transform data into signal!
- 👀 Disproportionate emphasis
- Can be a simple rule set or a statistical/ML model
3. Do something based on signal!
- 🤞 Ideally with limits + alerts
- Huge possibilities here with the Kubernetes API
- 6. How to do data-driven automation?
1. Gather raw data!
- Logs
- Application metrics
- Raw requests
2. Transform data into signal!
- Aggregates
- Anomaly detection
- Regex
- Machine learning models
3. Do something based on signal!
- Ping Slack/JIRA
- Scale deployment up/down
- Allocate more resources
- 7. How to do data-driven automation?
1. Gather raw data!
- Logs
- Infrastructure utilization
- Application metrics
- Raw requests
- Application profiles
- Network connections
- Kubernetes state
2. Transform data into signal!
- Mostly data wrangling...
- Aggregates
- Anomaly detection
- Thresholds
- Regex/pattern-matching
- Linear regression
- Machine learning models
3. Do something based on signal!
- Ping Slack/JIRA
- Scale deployment up/down
- Restart pod/service
- Page someone
- Allocate more resources
- Roll back
- Disable/enable feature
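The three stages above can be sketched end to end in plain Python. This is a minimal illustration, not Pixie code: the sample requests, the 500 ms threshold, and the function names (gather, transform, act) are all hypothetical.

```python
from statistics import mean

# Hypothetical sample standing in for raw request records.
RAW_REQUESTS = [
    {"pod": "web-1", "latency_ms": 120},
    {"pod": "web-1", "latency_ms": 180},
    {"pod": "web-2", "latency_ms": 900},
    {"pod": "web-2", "latency_ms": 1100},
]

LATENCY_THRESHOLD_MS = 500  # assumed threshold, chosen only for illustration


def gather():
    """Stage 1: gather raw data (here, a static sample)."""
    return list(RAW_REQUESTS)


def transform(requests):
    """Stage 2: turn raw data into a signal (mean latency per pod)."""
    by_pod = {}
    for req in requests:
        by_pod.setdefault(req["pod"], []).append(req["latency_ms"])
    return {pod: mean(values) for pod, values in by_pod.items()}


def act(signal, threshold_ms=LATENCY_THRESHOLD_MS):
    """Stage 3: pick an action per pod with a simple threshold rule."""
    return {
        pod: ("alert" if latency > threshold_ms else "ok")
        for pod, latency in signal.items()
    }


actions = act(transform(gather()))
print(actions)
```

Even in this toy version the rule in act is a one-liner; most of the code goes to getting the data into shape.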
- 12. PxL provides a programmable API for Pixie
import px
def http_data():
    df = px.DataFrame(table='http_events', start_time='-30s')
    df.pod = df.ctx['pod']
    return df[['pod', 'http_req_path', 'http_resp_latency_ns']]
px.display(http_data())
🔍 Query
⛏ Collect
Don’t invent a new language
- 15. PxL is an embedded DSL
● Valid
● Valid
● Built for data analysis and ML
import px
def http_data():
    df = px.DataFrame(table='http_events', start_time='-30s')
    df.pod = df.ctx['pod']
    return df[['pod', 'http_req_path', 'http_resp_latency_ns']]
px.display(http_data())
- 16. PxL is a dataflow language
import px
def http_data():
    df = px.DataFrame(table='http_events', start_time='-30s')
    df.pod = df.ctx['pod']
    return df[['pod', 'http_req_path', 'http_resp_latency_ns']]
px.display(http_data())
- PxL specifies logical flow of data (declarative)
- Pixie plans & optimizes the execution
[Diagram labels: Operator, Data]
- 18. PxL scripts use transforms to analyze data
import px
def http_data():
    df = px.DataFrame(table='http_events', start_time='-30s')
    df.pod = df.ctx['pod']
    return df[['pod', 'http_req_path', 'http_resp_latency_ns']]
px.display(http_data())
All transforms = methods on a PxL DataFrame
- Aggregate
- Join
- Filter
- ...etc
- 19. PxL scripts are composable
import px
def http_data():
    df = px.DataFrame(table='http_events', start_time='-30s')
    df.pod = df.ctx['pod']
    return df[['pod', 'http_req_path', 'http_resp_latency_ns']]
px.display(http_data())
Declarative + Functional + No implicit side effects = Composable
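The composability idea can be illustrated outside Pixie with a small plain-Python sketch. It is not PxL and does not use the Pixie API: the helpers keep, select, and compose, the sample rows, and the 50 ms cutoff are hypothetical. Each transform returns new rows and leaves its input untouched, which is what lets the steps be chained freely.

```python
from functools import reduce

# Hypothetical rows shaped like the columns used in the PxL script.
ROWS = [
    {"pod": "web-1", "http_req_path": "/home", "http_resp_latency_ns": 2_000_000},
    {"pod": "web-1", "http_req_path": "/cart", "http_resp_latency_ns": 90_000_000},
    {"pod": "web-2", "http_req_path": "/cart", "http_resp_latency_ns": 120_000_000},
]


def keep(predicate):
    """Build a filter transform; it returns a new list and mutates nothing."""
    return lambda rows: [row for row in rows if predicate(row)]


def select(*columns):
    """Build a projection transform that keeps only the named columns."""
    return lambda rows: [{c: row[c] for c in columns} for row in rows]


def compose(*transforms):
    """Chain transforms left to right into a single transform."""
    return lambda rows: reduce(lambda acc, step: step(acc), transforms, rows)


# Two small transforms combined into one reusable workflow.
slow = keep(lambda row: row["http_resp_latency_ns"] > 50_000_000)
pod_and_path = select("pod", "http_req_path")
slow_requests = compose(slow, pod_and_path)

result = slow_requests(ROWS)
print(result)
```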
- 20. PxL provides an interface to work with data
It allows us to construct powerful, composable workflows.
The following demos demonstrate this capability:
1. Slack alert on SQL injection attacks
2. Auto-scale deployment by HTTP request throughput
- 25. What is a SQL injection?
“SQL injection is a code injection technique used to attack applications, in which malicious SQL statements are inserted into an entry field for execution.”
- 26. Example SQL injection
User accesses
    http://foobar.com?user_id=123
Application executes
    SELECT * from users where user_id=123
Malicious actor accesses
    http://foobar.com?user_id=123 or 1=1
Application executes
    SELECT * from users where user_id=123 or 1=1
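The effect of the injected query can be reproduced with a small sketch using Python's built-in sqlite3 module. The table contents and the lookup function are hypothetical; the function is vulnerable on purpose, pasting the request parameter straight into the SQL string as the application on this slide does.

```python
import sqlite3

# In-memory database with a few made-up users.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(123, "alice"), (456, "bob"), (789, "carol")],
)


def lookup(user_id_param):
    """Vulnerable on purpose: the parameter is concatenated into the query."""
    query = "SELECT * from users where user_id=" + user_id_param
    return conn.execute(query).fetchall()


normal = lookup("123")            # one user, as intended
injected = lookup("123 or 1=1")   # the condition is now always true
print(len(normal), len(injected))
```

The injected request returns every row in the table, because `or 1=1` makes the WHERE clause true for all users.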
- 27. How can we detect SQL injections?
💥 Rules 💥
- Parse query to detect prohibited syntax (e.g. unions)
- Regexes to detect prohibited syntax
💭 Complication: What if your app has a legitimate use of union?
💥 Machine learning 💥
- Train model on real-world examples
- Can theoretically learn that certain uses of syntax are okay
💭 Complication: Where to get the dataset?
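The rule-based approach can be sketched with a few regular expressions. The patterns below are hypothetical examples, not the rule set used in the demo, and a real rule set would be larger and tuned per application.

```python
import re

# Hypothetical rules for illustration only.
SUSPICIOUS_PATTERNS = [
    re.compile(r"\bunion\b.+\bselect\b", re.IGNORECASE),  # UNION ... SELECT
    re.compile(r"\bor\s+1\s*=\s*1\b", re.IGNORECASE),     # tautology: or 1=1
    re.compile(r"--"),                                     # trailing SQL comment
]


def matched_rules(query):
    """Return the patterns that match the query text."""
    return [p.pattern for p in SUSPICIOUS_PATTERNS if p.search(query)]


def is_suspicious(query):
    """A query is flagged if any rule matches."""
    return bool(matched_rules(query))


benign = "SELECT * from users where user_id=123"
attack = "SELECT * from users where user_id=123 or 1=1"
legit_union = "SELECT id FROM customers UNION SELECT id FROM suppliers"

print(is_suspicious(benign), is_suspicious(attack), is_suspicious(legit_union))
```

The last query is a legitimate use of UNION that the rules still flag, which is the complication noted above for rule-based detection.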
- 30. Slack Alert for SQL Injection Attacks
1. Gather raw data!
- Collect raw SQL events
2. Transform data into signal!
- Diagnose SQL injection events
3. Do something based on signal!
- Generate alert about SQL injections
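The final stage can be sketched as follows, assuming alerts are delivered through a Slack incoming webhook, which accepts a JSON body with a "text" field. The demo itself may use a different Slack mechanism. The event fields, message wording, and function names are hypothetical, and the sketch builds the message without sending it.

```python
import json
import urllib.request


def build_slack_message(events):
    """Format flagged events as a Slack incoming-webhook payload."""
    lines = [
        f"Possible SQL injection on pod {event['pod']}: {event['query']}"
        for event in events
    ]
    return {"text": "\n".join(lines)}


def send_to_slack(webhook_url, message):
    """POST the payload to a webhook URL (defined but not called here)."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical output of the "diagnose" stage.
flagged = [
    {"pod": "default/web-1", "query": "SELECT * from users where user_id=123 or 1=1"},
]
message = build_slack_message(flagged)
print(json.dumps(message))
```

To deliver the alert, call send_to_slack with the webhook URL of your own workspace.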
- 34. K8s Autoscalers
- Both “Horizontal” and “Vertical” scaling
- Some built-in autoscaling metrics:
  - Pod CPU
  - Pod Memory
- Custom metrics API allows scaling on custom metrics! 😎
https://github.com/kubernetes/metrics
Credit: kubernetes.io
- 36. Other tools supporting this demo
Custom metrics server adapted from this project:
github.com/kubernetes-sigs/custom-metrics-apiserver
👆 Check it out to build your own K8s metrics server!
HTTP load testing via Hey
https://github.com/rakyll/hey
- 38. Autoscale deployment by HTTP request throughput
1. Gather raw data!
- Collect raw HTTP requests
2. Transform data into signal!
- Calculate HTTP req/s by pod
3. Do something based on signal!
- Autoscale # of pods by HTTP req/s
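The arithmetic behind this workflow can be sketched in plain Python. The scaling rule is the one documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current replicas × current value / target value). The sketch leaves out the autoscaler's tolerance and min/max replica bounds. The sample events, the 30-second window, and the target of 10 req/s per pod are hypothetical.

```python
import math
from collections import Counter


def requests_per_second(events, window_seconds):
    """Count requests per pod and divide by the window length."""
    counts = Counter(event["pod"] for event in events)
    return {pod: count / window_seconds for pod, count in counts.items()}


def desired_replicas(current_replicas, current_value, target_value):
    """HPA rule: ceil(current_replicas * current_value / target_value)."""
    return math.ceil(current_replicas * current_value / target_value)


# Hypothetical sample: two pods observed over a 30-second window.
events = [{"pod": "web-1"}] * 600 + [{"pod": "web-2"}] * 300
rps = requests_per_second(events, window_seconds=30)

# Average req/s across pods is the metric compared with the per-pod target.
average_rps = sum(rps.values()) / len(rps)
replicas = desired_replicas(
    current_replicas=len(rps),
    current_value=average_rps,
    target_value=10,  # assumed target of 10 req/s per pod
)
print(rps, replicas)
```

In the demo setup the per-pod rate would be served through a custom metrics server and the autoscaler would apply the rule itself; the sketch only shows the calculation.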
- 39. We’d love to get your feedback
In these demos we showed some simple data workflows on Pixie.
- More details about SQL injection here: blog.px.dev/sql-injection
- More details about autoscaling: blog.px.dev/autoscaling-custom-k8s-metric
What’s next:
- We are working on XSS detection.
- We want to learn about more use cases. Find us on GitHub (pixie-io/pixie) or Slack (slackin.px.dev).