Data-Driven Software Testing: The New, Lean Approach to Quality
- 2.
Ken Johnston is the principal engineering
manager for the Microsoft Operating Systems
Data Integration and Insights team. Since
joining Microsoft in 1998, Ken has filled many
other roles, including test lead, test manager,
and group program manager. Recently he has
worked on Bing Data Quality and
Measurements, Cosmos (the Microsoft big data
platform), and the Windows Apps Store. For
two and a half years, Ken served as the
Microsoft director of test excellence. He is a
frequent speaker, blogger, and author on big
data, software testing, and online services
development. Contact Ken on Twitter
@rkjohnston.
- 3. 11/25/2014
Data Driven Software
Testing Quality
The Lean Approach to not Testing
Ken Johnston Principal Data Science Manager
Twitter – @rkjohnston
Blog – http://blogs.msdn.com/kenj
Email – kenj@Microsoft.com
LinkedIn - http://linkedin.com/in/rkjohnston
@rkjohnston #BSCADC
About Ken
• Data Scientist in Data Driven Outcomes (D2O)
• Office Live, WebApps, Office Online
• Cosmos, AutoPilot, Local, Shopping
• Next is Data Driven Quality
• EaaSy – Everything as a Service
• “yes!”
• Write Books and Blog on Occasion
- 4.
This is a talk about Change, Big Change
• No Test Plans
• Fewer Test Cases
• Less Test Automation
• Releasing with Lower Initial Quality
About this Presentation
• Big Data + Agile are Driving Change
• Data Driven Quality Framework
• Analysis and Insights
• Minimum Viable Quality
• Designing for DDQ and Mitigating Risk
• Taking Action
“Big Data” Search Trends
- 5.
Big Data and Agile
The coming changes are being driven by Big Data and the rate of product release
Massive amounts of Internal Data
Engineering Data
• Test pass/fail results
• Bug counts
• Code Complexity
• Code Coverage
• Code Churn
• Performance
• Reliability
But Opinions Still Reign
• The HiPPO: Highest Paid Person’s Opinion
- 6.
Big Data Insights are Real
Improving Confidence Intervals
Here’s a Classic Story
Cocoa butter lotion
A large purse
Zinc and magnesium supplements
A bright blue rug
What would you do with this information?
But what if the expectant mother was an underage minor living at home?
- 7.
Predictive Modeling is Real
Microsoft Bing Launches Predictions
Predictions Lab - https://www.prediction.microsoft.com/
2014 Elections with CNN - http://blogs.bing.com/search/2014/10/21/bing-and-ie-team-with-cnn-for-elections-2014/
Dancing With the Stars - http://blogs.bing.com/search/2014/09/17/who-is-going-to-win-dancing-with-the-stars/
Uses Social, Search, and Betting Websites
Big Data and Quality is Happening
Improving IE Quality of Experience in a Dynamic Web
The Problem: Sites break, bad experience for our customers
• Huge engineering investment to stay on top of the ever-changing Web
• Traditionally evaluated through manual testing, bug reporting, and escalations
• Public telemetry unstructured, very poor signal‐noise ratio
Solution: Get more data!
• “Report website problems”, new feature added in April servicing release
• Enables users to provide semi‐structured issue reports to Microsoft: URLs are structured, comments are freeform
• Preserve user Privacy (URLs and min data)
- 8.
Process and Normalize Data
We used a SARIMA model (Seasonal Autoregressive Integrated Moving Average):
Supports observed seasonality in weekday/weekend reports, and non‐stationary mean as volume of reports increases over time.
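The weekday/weekend seasonality and the rising mean are handled in a SARIMA model by its differencing steps. Below is a minimal sketch of only those two steps. The daily report counts are made up and `difference` is a hypothetical helper; this is not the team's model, and it leaves out the autoregressive and moving-average terms.

```python
def difference(series, lag):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Hypothetical daily report counts: a weekly pattern (weekend dip)
# on top of a volume that grows by 10 reports each week.
weekly_pattern = [50, 52, 51, 53, 50, 20, 18]
reports = [weekly_pattern[d % 7] + 10 * (d // 7) for d in range(28)]

# Seasonal differencing (lag 7) removes the weekday/weekend pattern.
seasonal = difference(reports, 7)

# First differencing (lag 1) removes the remaining shift in the mean.
stationary = difference(seasonal, 1)

print(seasonal)    # week-over-week change for each day
print(stationary)  # residual after both differencing steps
```

With real data the residual would not be flat; spikes in it are the candidates for "a site just broke".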
Introduction to
Data Driven
Quality
Framework
(DDQF)
- 9.
Traditional Testing
• Test Planning
• Test Cases and Automation
• Test passes
• Defect Management
• Sign Off/Release
DDQF Cycle
• What could go Wrong
• Instrumentation
• Early Release
• Data Analysis
• Rolling Releases
DDQF cycle: Asking the Right Questions, Data Acquisition, Analysis, Take Action, Release
• DDQF is an iterative cycle
• Roots in DMAIC (Define, Measure, Analyze, Improve, Control)
• Less up front certainty and more iterative
• Release is the key to managing risk. Release is more than Control, it is also the break.
- 11.
know
• Mean Time Between Failures (MTBF)
• Launch Time
• Performance Metrics
• Hang Time
• Service Up Time and Availability
• Page Load Time
• Page Dwell Time
• Sessions per UU
• Engagement/Usage Time
• Feature Engagement
o Discoverability
o Return rate
• Click Through Rate (CTR)
• Hover Time
• Quick Back
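Two of these metrics have simple standard definitions: MTBF is operating time divided by the number of failures, and availability is uptime divided by total time. A minimal sketch follows; the function names and the 30-day numbers are illustrative assumptions, not figures from the talk.

```python
def mtbf_hours(uptime_hours, failure_count):
    """Mean Time Between Failures: operating time divided by failures."""
    if failure_count == 0:
        return float("inf")  # no failures observed in the window
    return uptime_hours / failure_count

def availability(uptime_hours, downtime_hours):
    """Fraction of the observation window the service was up."""
    total = uptime_hours + downtime_hours
    return uptime_hours / total if total else 0.0

# Hypothetical 30-day window: 720 hours, 3 failures, 6 hours of downtime.
uptime, downtime, failures = 714.0, 6.0, 3
print(mtbf_hours(uptime, failures))              # 238.0
print(round(availability(uptime, downtime), 4))  # 0.9917
```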
- 12.
To Measure Customer Product Satisfaction you need Scenarios
What is a Scenario?
A scenario is a clear and succinct description of a specific experience and customer benefit that the product is designed to deliver.
It is a finite set of product aspects which can be measured and evaluated from the customer’s perspective.
Can I connect to a WiFi hotspot easily
• Discoverability
• Trust
• Negotiate the connection
Can I file my expense report
• Does the software know me and keep my profile
• Did it save my default currency
Scenario at a Glance
Think about the User
• What does the user need to do
• What do they want to do
• How do alternative software products do it
• Can we take steps out of the way
• What would delight the User
Diagram: User has a goal; Each Step is a Scenario; Software Makes Goal Achievable; Task Completion Time
- 13.
Questions Drive Instrumentation
Instrument your Code
• Scenario Start
• Key Steps
• Lost Loops
• Scenario Complete
• Success Ratios
• Time to Task Completion
• Minutes of Usage
Diagram: User has a goal; Each Step is a Scenario; Software Makes Goal Achievable
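The instrumentation points for a scenario (start, key steps, complete) can be sketched as a small in-process tracker that derives a success ratio and time to task completion. `ScenarioTracker`, its methods, and the scenario and step names are hypothetical, chosen only for illustration. A real product would send these events to a telemetry pipeline, and lost loops are not modeled here.

```python
import time

class ScenarioTracker:
    """Collects start/step/complete events for one named scenario."""

    def __init__(self, name, clock=time.monotonic):
        self.name = name
        self.clock = clock
        self.started = {}    # session id -> start timestamp
        self.steps = {}      # session id -> list of key steps seen
        self.durations = []  # time to task completion per success

    def start(self, session):
        self.started[session] = self.clock()
        self.steps[session] = []

    def step(self, session, step_name):
        self.steps[session].append(step_name)

    def complete(self, session):
        self.durations.append(self.clock() - self.started[session])

    def success_ratio(self):
        """Completed scenarios divided by started scenarios."""
        if not self.started:
            return 0.0
        return len(self.durations) / len(self.started)

# Usage with a fake clock so the numbers are deterministic.
ticks = iter([0.0, 1.0, 5.0])
tracker = ScenarioTracker("file_expense_report", clock=lambda: next(ticks))
tracker.start("s1")                  # t = 0.0
tracker.start("s2")                  # t = 1.0, this user never finishes
tracker.step("s1", "pick_currency")
tracker.complete("s1")               # t = 5.0
print(tracker.success_ratio())       # 0.5
print(tracker.durations)             # [5.0]
```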
Data Acquisition
- 14.
Two types of data to acquire
Active = synthetic
Passive = organic
Active for services only?
Client: is the service there?
Staged Data Acquisition - Netflix
1B API requests per day
Canary Deployment
- 15.
Staged Data Acquisition - Facebook
Dogfood
In prod, no users (except internal ones)
Some servers in Production
World-wide deployment
Feature light-up
Staged Data Acquisition - Outlook
Filtering and aggregation at client
Be kind to the client
Pipeline to collect and process data
Make it easy
Staged Data Acquisition
• Feature Crew
• Outlook Team
• MS Office Team
• Microsoft
• Customers
Scale Validation
- 16.
Staged Data Acquisition
Service
• Stage 1: In prod, no users
• Stage 2: Dogfood
• Stage 3: Some servers in prod
• Stage 4: Some more servers in prod
• Stage 5: World-wide prod
Product (client, on-prem server)
• Stage 1: Partial or whole product team
• Stage 2: Dogfood
• Stage 3: Technology Adoption Programs (TAP)
• Stage 4: Some clients in production
• Stage 5: All Customers
Validation: Deployment Validation, Service Validation, Scale Validation, Real-time service quality
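A staged rollout like this needs a rule for when to move to the next stage. The following is a minimal sketch for the service track. The error-rate gate, the 1% threshold, and the `next_stage` helper are all assumptions made for illustration, not criteria stated in the talk.

```python
STAGES = [
    "in prod, no users",
    "dogfood",
    "some servers in prod",
    "some more servers in prod",
    "world-wide prod",
]

def next_stage(current_index, error_rate, max_error_rate=0.01):
    """Advance one stage when telemetry looks healthy, else hold."""
    if error_rate > max_error_rate:
        return current_index                        # hold and investigate
    return min(current_index + 1, len(STAGES) - 1)  # promote, cap at last

stage = 0
for observed in [0.002, 0.03, 0.004]:  # hypothetical error rates
    stage = next_stage(stage, observed)
print(STAGES[stage])  # some servers in prod
```

The second observation (3% errors) holds the release at dogfood, so three checks advance only two stages.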
- 18.
Analysis and Insights
Good Data lets you ask Questions
Analysis measures
• Success and Failure Ratios
• Are we good enough
• Is Customer Engagement up
• Is time to task completion down
• Did we get enough user feedback
Diagram: User has a goal; Each Step is a Scenario; Software Makes Goal Achievable
- 19.
Huge Impact
Production Data is Real Data
Power of Production Data
• Real users
• Multiple environments
• End to end
• Scale & geo‐diversity
Keep your eye on the target
The goal is not to get a bulls eye every time
The goal is to get the data and Learn
- 21.
A/B testing
Controlled experimentation
Usage data on different experiences
Combine into more complex scenarios
How did user get to shopping cart checkout?
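One common way to judge a controlled experiment is a two-proportion z-test on the success rates of the two experiences. The sketch below uses made-up checkout counts. The talk does not say which statistic its teams used, so treat the choice of test as an assumption.

```python
import math

def two_proportion_z(success_a, total_a, success_b, total_b):
    """z statistic for the difference between two conversion rates."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return (p_b - p_a) / std_err

def two_sided_p_value(z):
    """Two-sided p-value from the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical checkout counts: experience A (control) vs. B (treatment).
z = two_proportion_z(success_a=200, total_a=2000, success_b=260, total_b=2000)
print(round(z, 2), round(two_sided_p_value(z), 4))
```

A small p-value suggests the difference between A and B is unlikely to be noise; it says nothing about whether the difference is large enough to matter.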
We then Re-Evaluate
Analysis measures
• Success and Failure Ratios
• Are we good enough
• Is Customer Engagement up
• Is time to task completion down
• Did we get enough user feedback
• What should we Change Next
Diagram: User has a goal; Each Step is a Scenario; Software Makes Goal Achievable
- 23.
Say NO to
BUFT
Minimum Viable Quality
MVQ focuses on minimizing up front testing. Rich telemetry from production shifts testing and validation into production.
Levels of increasing test investment:
• Under Tested: frequent rollbacks, limited user engagement, strong negative customer feedback, bad press
• Limited Release: MVQ for sub-set of users. Beta Users, Enthusiasts, Flighting
• Minimum Viable Release Quality: MVQ for all users but still use a rolling release process. Fix final few critical bugs after release
• Possible Testing Waste: excessive automation and excessive testing that does not find any meaningful bugs.
Rich instrumentation identifies remaining critical to fix bugs in the shipped code.
- 24.
Speed is your friend because…
Code Churn Example 1
Six week coding milestone
• Code churn is cumulative
• Maximum point of instability is at end of milestone
Imagine this as part of a larger multi-layered project (Layer 1, Layer 2, Layer 3)
• Tightly coupled layers
• Long stabilization phase
• Complicated end-to-end integration
Sim-ship increases risk
- 25.
Code Churn Example 2 (Continuous Deployment)
Rapid release cadence (weekly or daily)
Max Risk is Production
Layers: Layer 1, Layer 2, Layer 3, … Layer N
• Risk per release decreases because of more incremental change
• You still must be careful of Risk within Production but…
• Total risk over time can be less with incremental change
User Segmentation
Organizing Users by profile and
Risk Tolerance
- 26.
User Segmentation Approaches
• Profile Based
• Usage behaviors
• new vs. power users
• Browser type
• Connection Type
• Device and Device OS
• Opted in
• Users Segment themselves
• Opting in indicates risk tolerance
Balancing Speed and Risk with Rings
• Ring 4: Everyone (no desire for risk)
• Ring 3: External Beta Users
• Ring 2: Company & NDA
• Ring 1: My Team
• Ring 0: Buddy Build (risk tolerance is highest)
Red Line demarks disclosure risk and possible loss of patent rights
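Ring membership can be sketched as a lookup from user profile to ring number, with a build released to ring N reaching every ring at or below N. The profile fields, the `ring_for` and `should_receive` helpers, and the numbering of external beta users as ring 3 are assumptions made for this example. Ring 0 (buddy build) is left out of the lookup on the assumption that it is arranged per build rather than per user.

```python
from dataclasses import dataclass

RINGS = {0: "Buddy Build", 1: "My Team", 2: "Company & NDA",
         3: "External Beta Users", 4: "Everyone"}

@dataclass
class User:
    on_my_team: bool = False
    is_employee_or_nda: bool = False
    opted_into_beta: bool = False

def ring_for(user):
    """Return the lowest-numbered (most risk tolerant) ring a user fits."""
    if user.on_my_team:
        return 1
    if user.is_employee_or_nda:
        return 2
    if user.opted_into_beta:
        return 3
    return 4

def should_receive(build_ring, user):
    """A build released to ring N reaches users in rings N and below."""
    return ring_for(user) <= build_ring

beta_user = User(opted_into_beta=True)
print(RINGS[ring_for(beta_user)])    # External Beta Users
print(should_receive(2, beta_user))  # False: build has only reached ring 2
print(should_receive(3, beta_user))  # True
```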
- 27.
The AutoPilot Watchdog Model
Servers have 3 states they can be in
• Healthy
• Failure mode
• Probation
Watchdogs report on server health
Repair Service has 3 actions
• Kill and restart a failing service
• Re-boot the server
• Re-image the server
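The three states and three repair actions suggest a small state machine with escalating repairs. The sketch below is one possible reading: the transition rules (repair, then probation; a healthy report during probation resets escalation) are assumptions for illustration and are not taken from AutoPilot itself.

```python
from enum import Enum

class State(Enum):
    HEALTHY = "healthy"
    FAILURE = "failure"
    PROBATION = "probation"

REPAIRS = ["restart service", "reboot server", "re-image server"]

class Server:
    def __init__(self):
        self.state = State.HEALTHY
        self.repair_level = 0  # index of the next repair action to try
        self.log = []          # repair actions taken so far

    def report(self, watchdog_ok):
        """Apply one watchdog report and update the server state."""
        if watchdog_ok:
            if self.state is State.PROBATION:
                self.repair_level = 0  # repair worked, reset escalation
            self.state = State.HEALTHY
            return
        # Watchdog reported a problem: mark failure, repair, then probation.
        self.state = State.FAILURE
        action = REPAIRS[min(self.repair_level, len(REPAIRS) - 1)]
        self.log.append(action)
        self.repair_level += 1
        self.state = State.PROBATION

server = Server()
for ok in [True, False, False, True]:
    server.report(ok)
print(server.state.name, server.log)
```

Two consecutive bad reports escalate from a service restart to a reboot; the final good report returns the server to healthy.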
- 28.
Watchdogs are not just for Services
• Watchdogs built into apps
• 3 states for an App
• Kill and restart app sending report
• Re-boot the device and re-launch
• Fail back to LKG (last known good)
• Devices are trickier
• Build logic into the device so that if it loses connectivity it can self correct.
• Instead of factory settings why not auto fail back to LKG
Generic Service Stack
Default Path for Production Traffic:
• Service UX Front Door
• Service Auth/Identity
• Layer A vCurrent
• Layer B vCurrent
• Service Layer C (Persistent Data Store)
Layer types:
• Front door servers for logging and access management
• UX rendering layers
• Identity or authentication layers
• Persistent data layers
- 29.
Runtime Flags Example 1
Side-by-Side Deployments
Runtime Flags
• Flags direct traffic through the stack
• Used to test vNext before full release
Traffic: Production Traffic; Test or Forked Traffic
Default Path: Service UX Front Door, Service Auth/Identity, Layer A vCurrent, Layer B vCurrent, Service Layer C (Persistent Data Store)
Runtime Path: Layer B vNext in place of Layer B vCurrent
Runtime Flags Example 2
N Test Environments
Traffic: Production Traffic; Test Case; Checkin Tests
Default Path: Service UX Front Door, Service Auth/Identity, Layer A vCurrent, Layer B vCurrent, Service Layer C (Persistent Data Store)
Runtime Path: Layer A DevBox and Layer B Test Cluster in place of Layer A vCurrent and Layer B vCurrent
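Both runtime-flag examples come down to the same idea: a request carries flags that override which deployment serves a given layer, and everything else follows the default path. A minimal sketch follows, assuming flags arrive as a simple layer-to-target mapping. The `route` function and the flag keys are hypothetical.

```python
DEFAULT_PATH = {
    "layer_a": "Layer A vCurrent",
    "layer_b": "Layer B vCurrent",
}

def route(request_flags):
    """Return the ordered list of components a request passes through."""
    path = dict(DEFAULT_PATH)
    # Only known layers can be overridden; unknown flags are ignored.
    for layer, target in request_flags.items():
        if layer in path:
            path[layer] = target
    return [
        "Service UX Front Door",
        "Service Auth/Identity",
        path["layer_a"],
        path["layer_b"],
        "Service Layer C (Persistent Data Store)",
    ]

production = route({})                          # default path
forked = route({"layer_b": "Layer B vNext"})    # Example 1: side-by-side
checkin = route({"layer_a": "Layer A DevBox",   # Example 2: test environment
                 "layer_b": "Layer B Test Cluster"})
print(forked[3])  # Layer B vNext
```

Because the front door, identity, and data layers are shared, test traffic exercises the new layer against the real production surroundings.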
- 30.
Apps as a Service: Facebook
How Facebook secretly redesigned its iPhone app with
your help
…a system for creating alternate versions… within the native app.
The team could then turn on certain new features for a subset of its users, directly,
…a system of "different types of Legos... and see the results on the server in real time."
From article on The Verge by Dieter Bohn September 18, 2013
That Was a lot of Content
• Big Data + Agile are Driving Change
• Data Driven Quality Framework
• Analysis and Insights
• Minimum Viable Quality
• Designing for DDQ and Mitigating Risk
• Taking Action
“Big Data” Search Trends