Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:Invent 2018
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Shift-Left SRE: Self-Healing with AWS Lambda
Andreas Grabner
Global Technology Lead & DevOps Activist
Dynatrace
DEV313-S
Agenda
1. Remediation use cases
2. PREVENT in CI/CD vs. Repair in PROD with AWS Lambda
3. “Auto-Remediation as Code” with Lambda
4. The “Unbreakable Delivery Pipeline”
Crash -> Restart
Full or slow disk -> Clean up
$ find ./my_dir -mtime +10 -type f -delete
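The cleanup command above could also run from a remediation function. The following is a minimal Python sketch, not the speaker's implementation; the function name `delete_old_files` and its parameters are hypothetical. It approximates `find -mtime +10 -type f -delete` (note that `find` rounds file age down to whole days, so the cutoffs differ slightly).

```python
import time
from pathlib import Path
from typing import List, Optional


def delete_old_files(root: str, max_age_days: float = 10,
                     now: Optional[float] = None) -> List[str]:
    """Delete regular files under `root` last modified more than `max_age_days` ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    deleted = []
    # Materialize the listing first so deletions do not disturb the directory walk.
    for path in list(Path(root).rglob("*")):
        # Skip directories and symlinks, similar to `-type f`.
        if path.is_symlink() or not path.is_file():
            continue
        if path.stat().st_mtime < cutoff:
            path.unlink()
            deleted.append(str(path))
    return sorted(deleted)
```

Returning the list of deleted paths makes it easy to log what the remediation actually removed.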
Bad configuration changes -> Revert
Low on resources -> Scale up
Overprovisioned after drop in traffic -> Scale down
Blue vs. Green -> Redirect traffic
BLUE
GREEN
End user impact -> Reverse Blue / Green
Deploy Blue Back to Green
List of remediation actions we discussed
• Process restarts
• Resource (for example, disk) cleanup
• Revert bad configuration changes
• Scale up
• Scale down
• Blue vs. Green switching
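One way to picture these actions as code is a lookup table from a detected problem type to a remediation function. This is a sketch only: the problem type names and the action functions are hypothetical placeholders that return a description instead of calling any real API.

```python
from typing import Callable, Dict


def _action(description: str) -> Callable[[str], str]:
    """Build a placeholder action that only describes what it would do."""
    return lambda target: f"{description} on {target}"


# Hypothetical problem types mapped to the remediation actions listed above.
REMEDIATIONS: Dict[str, Callable[[str], str]] = {
    "PROCESS_CRASHED": _action("restart process"),
    "DISK_FULL": _action("clean up disk"),
    "BAD_CONFIG_CHANGE": _action("revert configuration"),
    "RESOURCES_LOW": _action("scale up"),
    "OVERPROVISIONED": _action("scale down"),
    "END_USER_IMPACT": _action("switch blue/green traffic"),
}


def remediate(problem_type: str, target: str) -> str:
    """Pick the remediation for a problem type, or ask for escalation."""
    action = REMEDIATIONS.get(problem_type)
    if action is None:
        return f"no automatic remediation for {problem_type}; escalate"
    return action(target)
```

For example, `remediate("DISK_FULL", "host-1")` returns `"clean up disk on host-1"`, while an unknown problem type falls through to escalation.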
Add key metrics from incidents to quality gates
[Diagram: a code / config change flows through CI/CD into Staging and Production (steps 1-3) and reaches end users (4); an issue impacting SLAs (5) leads to adding a metric to the quality gate (6).]
Use cases and metrics we can “Shift-Left”!
Detect change in log behavior
• Use cases
• Are we logging too much? Did we turn on verbose logging by accident?
• Metrics
• Total log size
• Number of total and critical log messages
• How to query?
• For example: Using Amazon CloudWatch log filters
aws logs put-metric-filter \
  --log-group-name MyApp/access.log \
  --filter-name EventCount \
  --filter-pattern "" \
  --metric-transformations \
  metricName=MyAppEventCount,metricNamespace=MyNamespace,metricValue=1,defaultValue=0
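The same metric filter can be created from Python. The sketch below assumes the `boto3` SDK is available for the final call; the helper names `metric_filter_request` and `apply_metric_filter` are hypothetical. The request builder is kept separate so it can be inspected without AWS credentials.

```python
from typing import Any, Dict


def metric_filter_request(log_group: str, filter_name: str, metric_name: str,
                          namespace: str, pattern: str = "") -> Dict[str, Any]:
    """Build keyword arguments for the CloudWatch Logs PutMetricFilter call."""
    return {
        "logGroupName": log_group,
        "filterName": filter_name,
        "filterPattern": pattern,  # an empty pattern matches every log event
        "metricTransformations": [{
            "metricName": metric_name,
            "metricNamespace": namespace,
            "metricValue": "1",    # count one per matching event
            "defaultValue": 0.0,   # reported when nothing matches
        }],
    }


def apply_metric_filter(request: Dict[str, Any]) -> None:
    """Send the request; needs boto3 and AWS credentials (assumption)."""
    import boto3  # imported lazily so the builder works without the SDK
    boto3.client("logs").put_metric_filter(**request)
```

Calling `metric_filter_request("MyApp/access.log", "EventCount", "MyAppEventCount", "MyNamespace")` mirrors the CLI example above.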
Detect change in resource consumption
• Use cases
• Does bad coding lead to higher costs?
• Metrics
• Memory usage
• Bytes sent/received
• Overall CPU
• CPU per transaction type
• How to query?
• Some through CloudWatch API
• Dynatrace Timeseries API
Detect change in dependencies
• Use cases
• Do we have new dependencies? On purpose?
• Are we connecting to the services we are supposed to connect to?
• How many container instances are required?
• Metrics
• Number of incoming / outgoing dependencies
• Number of instances running on
• How to query?
• Maybe CloudWatch API
• Dynatrace SmartScape API
Detect change in application exception handling
• Use cases
• Did we introduce new “hidden” exceptions?
• Metrics
• Total exceptions
• Exceptions by class & service
• How to query?
• Dynatrace Timeseries API
Detect change in performance behavior
• Use cases
• Are we jeopardizing our SLAs?
• Does load balancing work?
• Difference between canaries?
• Metrics
• Response time (percentiles)
• Throughput & perf per instance / canary
• How to query?
• Dynatrace Timeseries API
• Dynatrace SmartScape API
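To illustrate why the slide asks for response time percentiles rather than averages, here is a small sketch using the nearest-rank method. The function name `percentile` and the sample values are illustrative assumptions, not data from the talk.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Nine fast requests and one slow outlier (response times in ms).
samples = [100, 100, 100, 100, 100, 100, 100, 100, 100, 2000]
median = percentile(samples, 50)
p95 = percentile(samples, 95)
mean = sum(samples) / len(samples)
```

Here the mean is pulled up by a single outlier while the median stays at the typical value, and the high percentile exposes the slow request directly.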
Detect change in error behavior
• Use cases
• New unexpected error conditions?
• Metrics
• HTTP Failure Rate
• JavaScript Error Rate
• Query through
• Real user monitoring (RUM) solution
• Dynatrace Timeseries API
Detect change in end user scenarios
• Use cases
• Average number of page requests per user increased?
• How does this impact resource and capacity requirements?
• Metrics
• Number of user interactions / session
• Page sizes, number of resources
• Query through
• RUM solution
• Dynatrace Timeseries API
List of metrics we just discussed
• Logging
• Total log size
• Number of total and critical log messages
• Resources
• Memory usage
• Bytes sent / received
• Overall CPU
• CPU per transaction type
• Dependencies
• Number of incoming / outgoing dependencies
• Number of instances running on
• Exceptions
• Total exceptions
• Exceptions by class & service
• Performance
• Response time (percentiles)
• Throughput & perf per instance
• Errors
• HTTP failure rate
• JavaScript error rate
• End user scenarios
• Number of user interactions / session
• Page sizes, number of resources
How to add this to our pipeline?
[Diagram: a code / config change flows through CI/CD into Staging and Production (steps 1-3) and reaches end users (4).]
Inspiration from Curtis Bray (re:Invent 2017)
Inspiration from Thomas Steinmaurer @ Dynatrace
“Performance Signature” for Build Nov 16
“Performance Signature” for Build Nov 17
“Performance Signature” for every build
“Multiple Metrics” compared to previous timeframe
Simple Regression Detection per metric
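A simple per-metric regression check against the previous build might look like the sketch below. The 10% tolerance, the metric names, and the rule that "higher is worse" for every metric are assumptions made for illustration.

```python
from typing import Dict


def detect_regressions(previous: Dict[str, float], current: Dict[str, float],
                       tolerance: float = 0.10) -> Dict[str, float]:
    """Return metrics that grew by more than `tolerance` relative to the previous build."""
    regressions = {}
    for name, old in previous.items():
        new = current.get(name)
        if new is None or old == 0:
            continue  # nothing to compare against
        change = (new - old) / old
        if change > tolerance:
            regressions[name] = round(change, 3)
    return regressions


# Hypothetical signatures for two consecutive builds.
build_nov_16 = {"responsetime_p90": 200.0, "failurerate": 1.0}
build_nov_17 = {"responsetime_p90": 260.0, "failurerate": 1.05}
regressed = detect_regressions(build_nov_16, build_nov_17)
```

With these numbers only the response time is flagged, because it grew by 30% while the failure rate stayed within the tolerance.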
“Build validation / Monitoring as code”
monspec.json
{
  ...
  "perfsignature" : [
    {
      "timeseries" : "com.dynatrace.builtin:service.responsetime",
      "aggregate" : "p90", // min, max, avg, sum, median, count, percentile
      "validate" : "upper", // upper or lower
      // "upperlimit" : 100, // Optional: Can be used to define a FIXED THRESHOLD
      // "lowerlimit" : 50, // Optional: Can be used to define a FIXED THRESHOLD
    },
    {
      "timeseries" : "com.dynatrace.builtin:service.failurerate",
      "aggregate" : "avg"
    },
    {
      "timeseries" : "com.dynatrace.builtin:service.requestspermin",
      "aggregate" : "count",
      "validate" : "lower"
    },
    {
      "smartscape" : "toRelationships:calls",
      "aggregate" : "count",
      "upperlimit" : 1 // Validate that we only call to the one backend service and nowhere else!
    }
  ],
Metrics: Which metrics, aggregation, upper/lower boundaries?
Dependencies: How many involved services?
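To make the fields of a `perfsignature` entry concrete, the sketch below shows one possible way to evaluate an entry. The semantics are an assumption inferred from the comments in the monspec excerpt (fixed limits first, otherwise compare with the previous value in the direction given by `validate`); the function name `violates` is hypothetical.

```python
from typing import Optional


def violates(entry: dict, value: float, previous: Optional[float] = None) -> bool:
    """Return True if `value` breaks the entry's fixed limit or regresses vs. `previous`."""
    # Fixed thresholds, when present, are checked first.
    if "upperlimit" in entry and value > entry["upperlimit"]:
        return True
    if "lowerlimit" in entry and value < entry["lowerlimit"]:
        return True
    # Otherwise compare against the previous build, if one is known.
    if previous is not None:
        direction = entry.get("validate")
        if direction == "upper" and value > previous:
            return True
        if direction == "lower" and value < previous:
            return True
    return False


calls_entry = {"smartscape": "toRelationships:calls", "aggregate": "count", "upperlimit": 1}
throughput_entry = {"timeseries": "com.dynatrace.builtin:service.requestspermin",
                    "aggregate": "count", "validate": "lower"}
```

Under these assumptions, a service that suddenly calls two backends violates `calls_entry`, and a drop in requests per minute compared with the previous build violates `throughput_entry`.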
Automate validation into AWS CodePipeline with Lambda
[Diagram: the Invoke action RegisterStagingValidation calls the AWS Lambda function registerDynatraceBuildValidation; labels shown: StagingToProduction,5,ApproveStaging and Monspec.]
Automate validation into AWS CodePipeline with Lambda
Staging: Register Build Validation!
• registerDynatraceBuildValidation: adds build validation request (adds item)
• Build validation request item
- Pipeline Information
- Monspec
- Timestamp + Timeframe
- Comparison Definition Name
- Action Name to Approve / Reject
• CloudWatch Events (e.g., 1 min): triggers validateBuildDynatraceWork
• validateBuildDynatraceWork
- Monspec from Amazon S3
- Dynatrace entities & Timeseries REST API: resolves tags and gets list of entities, queries metrics for these entities
- Updates build validation request: updated Monspec, updated status
- Approves/Rejects IF “In Progress” & if RegisterBuildValidation was called with that Action Name
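The approve / reject condition on this slide can be sketched as a small decision function. The request fields (`status`, `action_name`, `violations`) and the return values are hypothetical; in a real worker the decision would presumably be sent back through CodePipeline's PutApprovalResult API, which is not shown here.

```python
from typing import Any, Dict


def decide(request: Dict[str, Any], approval_action_name: str,
           approval_action_status: str) -> str:
    """Return 'approve', 'reject', or 'wait' for one build validation request."""
    if request["status"] != "validated":
        return "wait"  # metrics have not been evaluated yet
    if approval_action_status != "InProgress":
        return "wait"  # the manual approval action is not pending
    if request["action_name"] != approval_action_name:
        return "wait"  # request was registered for a different action
    return "reject" if request["violations"] else "approve"


good = {"status": "validated", "action_name": "ApproveStaging", "violations": []}
bad = {"status": "validated", "action_name": "ApproveStaging",
       "violations": ["service.responsetime"]}
```

Running this on a schedule keeps the pipeline waiting until both the metrics are evaluated and the approval action is actually pending.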
Build over build results pulled from Amazon DynamoDB
[Chart labels, in order: GoodBuild, GoodBuild, GoodBuild, BadBuild, BadBuild, BadBuild, BadBuild, GoodBuild]
Inspiration from Beachbody
ChatOps
Erik Landsness, Beachbody
Problem evolution
Self-healing: Path to autonomous Ops
Auto-mitigate!
1 CPU exhausted? Add a new service instance to distribute load!
2 Exhausted connection pool? Increase pool size!
3 Caused by Canary Release? Redirect traffic to main canary!
Impact mitigated?
Still ongoing? Escalate?
How to escalate? Update teams:
• Inform #WebTeam about JavaScript issue on IE
• Push status update to inform our customers
• Inform Support about potential incoming user complaints!
• …
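The mitigate-then-escalate flow on this slide can be sketched as a loop that tries each action in order and checks the impact after every step. All names here are hypothetical, and the actions, the health check, and the notifier are passed in as plain callables so nothing depends on a specific monitoring or chat tool.

```python
from typing import Callable, List, Optional, Tuple


def self_heal(actions: List[Tuple[str, Callable[[], None]]],
              impact_mitigated: Callable[[], bool],
              notify: Callable[[str], None]) -> Optional[str]:
    """Run actions in order until the impact is mitigated; escalate otherwise."""
    for name, action in actions:
        action()
        if impact_mitigated():
            notify(f"impact mitigated by: {name}")
            return name
    notify("still ongoing after all actions; escalating")
    return None
```

The return value names the action that resolved the problem, which is useful for updating teams; `None` means every action ran without success and a human needs to take over.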
“Auto-Remediation as Code” triggered by Dynatrace
#1: Push deployment information, e.g., CodeDeploy DeploymentId
#2: Calling Lambda via API Gateway: handleDynatraceProblemNotification
#3: Uses Dynatrace Events API to pull CUSTOM_DEPLOYMENT events
#4: Redeploy previous revision
#5: Push comment to Dynatrace
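The core decision in steps #3 and #4 is which deployment to roll back to. The sketch below works on already-fetched deployment events; the event fields `startTime` and `deploymentId` and the function name `pick_rollback` are assumptions for illustration, and the calls to the Dynatrace and CodeDeploy APIs are deliberately left out.

```python
from typing import Dict, List, Optional


def pick_rollback(events: List[dict], problem_start: int) -> Optional[Dict[str, str]]:
    """Find the deployment that preceded the problem and the one to redeploy."""
    before = sorted(
        (e for e in events if e["startTime"] <= problem_start),
        key=lambda e: e["startTime"],
    )
    if len(before) < 2:
        return None  # no earlier deployment to go back to
    suspect, previous = before[-1], before[-2]
    return {"suspect": suspect["deploymentId"], "redeploy": previous["deploymentId"]}


# Hypothetical deployment events; timestamps are arbitrary integers.
events = [
    {"startTime": 100, "deploymentId": "d-1"},
    {"startTime": 200, "deploymentId": "d-2"},
    {"startTime": 400, "deploymentId": "d-3"},
]
plan = pick_rollback(events, problem_start=300)
```

Deployments that started after the problem began are ignored, since they cannot have caused it.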
Summary: Unbreakable cloud-native pipelines
Stages: Staging -> Approve staging -> Production -> Approve production (CI/CD at each stage; Build #17, Build #18)
1 Pushes deployment into Dynatrace entities
2 Compares builds and approves / rejects pipeline
3 Pushes deployment info into Dynatrace entities
4 Validates production and approves / rejects pipeline
5 Executes auto-remediating actions, e.g., roll-back
Thank you!
Andreas Grabner
twitter: @grabnerandi
email: andreas.grabner@dynatrace.com