DevOps
It’s About How We Work
Randy Shoup
@randyshoup
linkedin.com/in/randyshoup
Background
• VP Engineering at Stitch Fix
o Combining “Art and Science” to revolutionize apparel retail
• Consulting “CTO as a service”
o Helping companies scale engineering organizations and technology
• Director of Engineering for Google App Engine
o World’s largest Platform-as-a-Service
• Chief Engineer / Distinguished Architect at eBay
o Multiple generations of eBay’s infrastructure
Stitch Fix
Combining Art and
[Data] Science
• 1:1 Ratio of Data Science to Engineering
o Almost 100 software engineers
o Almost 100 data scientists and algorithm developers
o Unique in our industry
• Apply intelligence to *every* part of the business
o Buying
o Inventory management
o Logistics optimization
o Styling recommendations
o Demand prediction
• Humans and machines augmenting each other
Styling at
Stitch Fix
Personal styling
Inventory
Personalized
Recommendations
Inventory
Algorithmic
recommendations
Machine learning
Expert Human
Curation
Human
curation
Algorithmic
recommendations
How do we work, and why
does it work?
Modern Software
Development
Practices
Culture
Technology
Organization
Modern Software
Development
TDD and Continuous Delivery
DevOps
Microservices
Small Teams
Conway’s Law
• Organization determines architecture
o Design of a system will be a reflection of the communication paths within the
organization
• Modular system requires modular organization
o Small, independent teams lead to more flexible, composable systems
o Larger, interdependent teams lead to larger systems
• We can engineer the system we want by engineering the
organization
Small
“Service” Teams
• Amazon “2 Pizza” Teams
o No team should be larger than can be fed by 2 large pizzas
o Typically 4-6 people
o Mix of junior and senior people
• Aligned to Business Domains
o Clear, well-defined area of responsibility
o Single service or set of related services
o Minimal, well-defined “interface”
Full-Stack
Teams
• All disciplines required for the team’s function
o Design
o Development
o Quality and Performance
o Maintenance
o Operations
• Symphony, not a Factory
o Diversity of skills and talents
o Working together more important than individual talent
• Depend on other teams for supporting services, libraries,
and tools
End-to-End
Ownership
• Teams own their roadmap
• No separate maintenance or sustaining engineering
team
• Teams are long-term
o Team owns service from design to deployment to retirement
Stable Points in
Organization Size
• ~5
o Everyone fits around a conference table
o Single team, no structure
o High bandwidth communication between individuals
o Fluid roles
• ~20
o Very difficult to manage as a single team, but possible
o Need to introduce structure, but can be challenging to make it optimal / efficient
o Potential trough of productivity and motivation
• ~50-100+
o Requires shift from coordinating individuals to coordinating teams
o High-bandwidth within teams, loose coupling between teams
o Focus on team structure and responsibilities
Test-Driven
Development
• Tests help you go faster
o Tests “have your back”
o Development velocity
• Tests make better code
o Confidence to break things
o Courage to refactor mercilessly
• Tests make better systems
o Catch bugs earlier, fail faster
“We don’t have time to do it right!”
“Do you have time to do it twice?”
Test-Driven
Development
• Do it right (enough) the first time
o The more constrained you are on time and resources, the more important it is to
build solid features
o Build one great thing instead of two half-finished things
• Right ≠ Perfect (80 / 20 Rule)
• Basically no bug tracking system (!)
o Bugs are fixed as they come up
o Backlog contains features we want to build
o Backlog contains technical debt we want to repay
Transitioning
to Testing
• Write functional tests around a component
o If you can only have a few tests, they should be meaningful ones
• Fail any build that breaks a test
• Opportunistically add tests
o For every new bug, add a test that reproduces the bug and verifies the fix
o For every new feature, add tests for that feature
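The "add a test for every new bug" step could look like the following minimal sketch. The function `parse_price` and the bug it once had (choking on thousands separators) are hypothetical, invented here only for illustration.

```python
def parse_price(text: str) -> int:
    """Parse a price string such as "$1,234.50" into integer cents.

    Hypothetical function. The imagined bug: an earlier version
    failed on inputs containing a thousands separator.
    """
    cleaned = text.strip().lstrip("$").replace(",", "")
    dollars, _, cents = cleaned.partition(".")
    # Pad or truncate the fractional part to exactly two digits.
    cents = (cents + "00")[:2]
    return int(dollars) * 100 + int(cents)


def test_thousands_separator_regression():
    # Reproduces the (hypothetical) reported bug and verifies the fix.
    assert parse_price("$1,234.50") == 123450


def test_whole_dollars():
    # Existing behavior that must keep working after the fix.
    assert parse_price("$20") == 2000
```

With a runner such as pytest wired into the build, a failing assertion here fails the build, which matches the "fail any build that breaks a test" rule above.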
Transitioning
to Testing
• Keep ratcheting up the level
o E.g., Compiler warnings in eBay search
Flickr user smurfie_77
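One way to picture "ratcheting" is a build check that never lets a quality metric get worse and tightens the limit whenever it improves. The sketch below assumes the metric is a compiler warning count; the function name `ratchet` and the idea of a stored baseline are illustrative assumptions, not a description of what eBay search actually did.

```python
def ratchet(baseline: int, current: int) -> int:
    """Return the new baseline for a 'lower is better' metric.

    Raises if the current count exceeds the baseline (the build
    should fail); otherwise the baseline tightens to the current
    count so that later builds cannot regress.
    """
    if current > baseline:
        raise RuntimeError(
            f"warning count increased: {current} > baseline {baseline}"
        )
    return current


# Hypothetical usage: the baseline would be stored alongside the code.
new_baseline = ratchet(baseline=120, current=117)
```

The design choice is that the limit only moves in one direction, so every small cleanup is locked in.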
Continuous
Delivery
• Most applications deployed multiple times per day
• More solid systems
o Release smaller units of work
o Smaller changes to roll back or roll forward
o Faster to repair, easier to understand, simpler to diagnose
• Rapid experimentation
o Small experiments and rapid iteration are cheap
Continuous
Delivery
• Enabled by
o API-driven infrastructure (“cloud”)
o PaaS
o Containers
• Contrast: eBay's 2-week release trains vs. multiple deployments per day today
Triangle of
Technical Tradeoffs
• When you choose date and
features, you implicitly
choose a level of quality
• Changing one changes the
others
• Be open and honest when
you are making these
tradeoffs
Date
Quality
Features
Vicious Cycle
of Technical Debt
Technical Debt
“No time to do it right”
Quick-and-dirty
Virtuous Cycle
of Investment
Solid Foundation
Confidence
Faster and Better
Testing
Culture eats strategy for
breakfast.
-- Peter Drucker
Culture eats strategy and
organization and technology and
process and … for breakfast.
-- me
Cross-Functional
Collaboration
• Open communication
o Individuals encouraged to work directly with each other
o Prefer informal cooperation over formal channels
• Best decisions made through partnership
o Agreement on goals and priorities makes it easier to agree on tactics
o Given common context, well-meaning people will generally agree
o “Disagree and Commit”
• Solve problems instead of pointing fingers
o Otherwise, playing strategy instead of solving the problem
o Otherwise, avoiding blame and hiding the ball
None of us is as smart as all of
us.
-- Japanese proverb,
as quoted by Bob Taylor
Goals of a
Service Owner
• Meet the needs of my clients …
• Functionality
• Quality
• Performance
• Stability and reliability
• Constant improvement over time
• … at minimum cost and effort
• Leverage common tools and infrastructure
• Leverage other services
• Automate building, deploying, and operating my service
• Optimize for efficient use of resources
Responsibilities of a
Service Owner
• End-to-end Ownership
o Team owns service from design to deployment to retirement
o No separate maintenance or sustaining engineering team
• Autonomy and Accountability
o Freedom to choose technology, methodology, working environment
o Responsibility for the results of those choices
You Build It, You Run It.
-- Werner Vogels
Service
Relationships
• Vendor – Customer Relationship
o Friendly and cooperative, but structured
o Clear ownership and division of responsibility
• Customer Focus
o Value of service comes from its value to its customers
• Customer can choose to use service or not (!)
o Must be strictly better than the alternatives of build, buy, borrow
Service-Service
Relationships
• Service-Level Agreement (SLA)
o Promise of service levels by the provider
o Customer needs to be able to rely on the service, like a utility
• Charging and Cost Allocation
o Charge customers for *usage* of the service
o Aligns economic incentives of customer and provider
o Motivates both sides to optimize for efficiency
o (+) Pre- / post-allocation at Google
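As a rough illustration of charging customers for usage, the sketch below splits a service's cost across client teams in proportion to what each consumed. The proportional rule, the function name `allocate_cost`, and the team names are all assumptions made for this example; the slides do not say how charges are actually computed.

```python
from typing import Dict, Mapping


def allocate_cost(total_cost: float, usage: Mapping[str, int]) -> Dict[str, float]:
    """Split total_cost across customers in proportion to their usage.

    `usage` maps a customer name to the units it consumed
    (requests, CPU-hours, or whatever the provider meters).
    """
    total_usage = sum(usage.values())
    if total_usage == 0:
        # Nobody used the service, so nobody is charged.
        return {customer: 0.0 for customer in usage}
    return {
        customer: total_cost * units / total_usage
        for customer, units in usage.items()
    }


# Hypothetical usage figures for two client teams.
charges = allocate_cost(1000.0, {"checkout": 3, "search": 1})
```

Because each customer's bill tracks its own consumption, a customer that halves its usage sees a smaller share, which is the incentive alignment the slide describes.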
Blameless
Post-Mortems
• Post-mortem After Every Incident
o Document exactly what happened
o What went right
o What went wrong
• Open and Honest Discussion
o What contributed to the incident?
o What could we have done better?
▪ Engineers compete to take personal responsibility (!)
▪ “Finally we can fix that broken system”
Blameless
Post-Mortems
• Action Items
o How will we change process, technology, documentation, etc.
o How could we have automated the problems away?
o How could we have diagnosed more quickly?
o How could we have restored service more quickly?
• Follow up (!)
Failure is not falling down,
but refusing to get back up.
-- Theodore Roosevelt
Thanks!
• Stitch Fix is hiring!
o www.stitchfix.com/careers
o Based in San Francisco
o Hiring everywhere!
o More than half remote, all across US
o Application development, Platform engineering, Data
Science
• Please contact me
o @randyshoup
o linkedin.com/in/randyshoup
Appendices
Sharing
Specialty Skills
• Specialty Skills
o Security, User Experience, Compliance, DBA, etc.
o Often quite difficult to hire
o Rarely need a full-time person on each team
• Approach 1: Service Model
o Make domain teams as self-sufficient as possible
o Encode best practices into service / tool / library
o Own those specialty services just like domain teams do
Sharing
Specialty Skills
• Approach 2: Shared Model
o Single person shared among multiple teams
• Approach 3: Coaching / Advisory Model
o Specialty team is a pool of advisors
o Provide special expertise as needed
o Goal is to make domain teams self-sufficient
Team
Anti-Patterns
• Skill-based teams
o Based around tiers or technologies (e.g., front-end team, application team, DBA
team, Ops team)
o (-) Every project crosses many team boundaries
o (-) No end-to-end ownership of the system
o (-) No end-to-end ownership of the customer experience
• Project-based teams
o Form ad-hoc team for a particular project, then disband
o (-) No long-term ownership of code, product, service
o (-) Encourages short-term approach instead of sustainable management of technical debt
Team
Anti-Patterns
• Large teams
o (-) Teams larger than 8-10 should be split
o (-) Communication and coordination overhead makes it increasingly difficult to
sustain velocity
Effective
Global Teams
• Local Ownership
o Well-defined area of responsibility
o Clean interface with the rest of the organization
• Individual teams are co-located
o High-bandwidth communication within a team
o Minimal coordination across teams
Global Team
Anti-Patterns
• Anti-Pattern: Split Teams Over Geographies
o (-) Constant need for coordination over time zones
o (-) Local conversations become disruptive rather than helpful
o (-) No local pride of ownership
• Anti-Pattern: Remote Team as Job Shop
o (-) Constant need for management and task assignment
o (-) Resentment between first-tier and second-tier sites
o (-) No local pride of ownership
o Ex. eBay remote offices vs. Google remote offices
Remote
Teams
• Fully remote *OR* fully co-located
o Remote teams rely on virtual proximity (chat, hangouts, IRC)
o Co-located teams rely on physical proximity (co-working)
• Anti-Pattern: “Mostly” co-located
o (-) Co-located majority ends up determining communication methods
o (-) Remote individuals left out, less able to contribute, less productive
Feature
Flags
• Configuration “flag” to enable / disable a feature for a
particular set of users
o Independently discovered at eBay, Facebook, Google, etc.
• More solid systems
o Decouple feature delivery from code delivery
o Rapid on and off
o Develop / test / verify in production
o Dark launches
• Enables experimentation
o A/B testing
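A feature flag along these lines might be sketched as follows. This is an illustrative in-memory version, not the implementation used at any of the companies named above; the class `FeatureFlag`, its fields, and the hash-based percentage rollout are assumptions chosen for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class FeatureFlag:
    name: str
    enabled: bool = False                         # master switch: rapid on and off
    allowed_users: FrozenSet[str] = frozenset()   # explicit set, e.g. for a dark launch
    rollout_percent: int = 0                      # 0-100, share of other users exposed

    def is_enabled_for(self, user_id: str) -> bool:
        """Decide whether this user sees the feature."""
        if not self.enabled:
            return False
        if user_id in self.allowed_users:
            return True
        return self._bucket(user_id) < self.rollout_percent

    def _bucket(self, user_id: str) -> int:
        # Stable bucket in 0..99, so a user gets the same answer on every call.
        key = f"{self.name}:{user_id}".encode("utf-8")
        return int(hashlib.sha256(key).hexdigest(), 16) % 100


# Hypothetical flag: code is deployed, but only one internal user sees it.
new_checkout = FeatureFlag(
    "new_checkout", enabled=True, allowed_users=frozenset({"alice"})
)
if new_checkout.is_enabled_for("alice"):
    pass  # render the new checkout flow here
```

Raising `rollout_percent` exposes the feature to more users without a new deployment, which is what decouples feature delivery from code delivery. The same stable bucketing can assign users to the two arms of an A/B test.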
