DevOps - It's About How We Work

DevOps
It’s About How We Work
Randy Shoup
@randyshoup
linkedin.com/in/randyshoup

Background
• VP Engineering at Stitch Fix
o Combining “Art and Science” to revolutionize apparel retail
• Consulting “CTO as a service”
o Helping companies scale engineering organizations and technology
• Director of Engineering for Google App Engine
o World’s largest Platform-as-a-Service
• Chief Engineer / Distinguished Architect at eBay
o Multiple generations of eBay’s infrastructure

Stitch Fix

Combining Art and [Data] Science
• 1:1 Ratio of Data Science to Engineering
o Almost 100 software engineers
o Almost 100 data scientists and algorithm developers
o Unique in our industry
• Apply intelligence to *every* part of the business
o Buying
o Inventory management
o Logistics optimization
o Styling recommendations
o Demand prediction
• Humans and machines augmenting each other

Styling at Stitch Fix
[Diagram labels: Personal styling, Inventory]

Personalized Recommendations
[Diagram labels: Inventory, Algorithmic recommendations, Machine learning]

Expert Human Curation
[Diagram labels: Human curation, Algorithmic recommendations]

How do we work, and why does it work?

Modern Software Development
• Practices
• Culture
• Technology
• Organization

Modern Software Development
• TDD and Continuous Delivery
• DevOps
• Microservices
• Small Teams

Conway’s Law
• Organization determines architecture
o Design of a system will be a reflection of the communication paths within the organization
• Modular system requires modular organization
o Small, independent teams lead to more flexible, composable systems
o Larger, interdependent teams lead to larger systems
• We can engineer the system we want by engineering the organization

Small “Service” Teams
• Amazon “2 Pizza” Teams
o No team should be larger than can be fed by 2 large pizzas
o Typically 4-6 people
o Mix of junior and senior people
• Aligned to Business Domains
o Clear, well-defined area of responsibility
o Single service or set of related services
o Minimal, well-defined “interface”

Full-Stack Teams
• All disciplines required for the team’s function
o Design
o Development
o Quality and Performance
o Maintenance
o Operations
• Symphony, not a Factory
o Diversity of skills and talents
o Working together more important than individual talent
• Depend on other teams for supporting services, libraries, and tools

End-to-End Ownership
• Teams own their roadmap
• No separate maintenance or sustaining engineering team
• Teams are long-term
o Team owns service from design to deployment to retirement

Stable Points in Organization Size
• ~5
o Everyone fits around a conference table
o Single team, no structure
o High bandwidth communication between individuals
o Fluid roles
• ~20
o Very difficult to manage as a single team, but possible
o Need to introduce structure, but can be challenging to make it optimal / efficient
o Potential trough of productivity and motivation
• ~50-100+
o Requires shift from coordinating individuals to coordinating teams
o High-bandwidth within teams, loose coupling between teams
o Focus on team structure and responsibilities

Test-Driven Development
• Tests help you go faster
o Tests “have your back”
o Development velocity
• Tests make better code
o Confidence to break things
o Courage to refactor mercilessly
• Tests make better systems
o Catch bugs earlier, fail faster

“Do you have time to do it twice?”
“We don’t have time to do it right!”

Test-Driven Development
• Do it right (enough) the first time
o The more constrained you are on time and resources, the more important it is to build solid features
o Build one great thing instead of two half-finished things
• Right ≠ Perfect (80 / 20 Rule)
• Basically no bug tracking system (!)
o Bugs are fixed as they come up
o Backlog contains features we want to build
o Backlog contains technical debt we want to repay

Transitioning to Testing
• Write functional tests around a component
o If you can only have a few tests, they should be meaningful ones
• Fail any build that breaks a test
• Opportunistically add tests
o For every new bug, add a test that reproduces the bug and verifies the fix
o For every new feature, add tests for that feature
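
A minimal sketch of the "add a test for every new bug" step, in Python. The `parse_price` helper and the bug it once had (failing on a thousands separator) are hypothetical, invented only for illustration; the deck does not name a language or a test framework.

```python
from decimal import Decimal


def parse_price(text: str) -> Decimal:
    """Hypothetical helper: turn a price string such as "$1,299.50" into a Decimal."""
    # The fix for the hypothetical bug: drop the currency symbol and
    # thousands separators before parsing.
    cleaned = text.strip().lstrip("$").replace(",", "")
    return Decimal(cleaned)


def test_price_with_thousands_separator():
    # Reproduces the hypothetical bug report: this input used to raise.
    assert parse_price("$1,299.50") == Decimal("1299.50")


def test_plain_price_still_parses():
    # Guards the behavior that already worked before the fix.
    assert parse_price("42") == Decimal("42")
```

Functions named `test_*` like these can be collected by a runner such as pytest; wiring that runner into the build is one way to "fail any build that breaks a test".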

Transitioning to Testing
• Keep ratcheting up the level
o E.g., Compiler warnings in eBay search
Flickr user smurfie_77
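
One way to read "ratcheting up the level" is a build check that never lets a quality count get worse. The sketch below is an assumption about how such a check could look, not a description of what eBay search actually did; `ratchet`, `RatchetError`, and the warning counts are illustrative names and numbers.

```python
class RatchetError(Exception):
    """Raised when the current build is worse than the recorded baseline."""


def ratchet(current: int, baseline: int) -> int:
    """Return the new baseline, or raise if the warning count went up.

    `current` is the number of warnings in this build; `baseline` is the
    lowest count recorded so far.
    """
    if current > baseline:
        raise RatchetError(
            f"{current} warnings in this build, but the baseline is {baseline}"
        )
    # Equal or better: tighten the baseline so the count cannot creep back up.
    return min(current, baseline)
```

A build script would store the returned value (for example, in a file checked into the repository) and pass it back in as `baseline` on the next run.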

Continuous Delivery
• Most applications deployed multiple times per day
• More solid systems
o Release smaller units of work
o Smaller changes to roll back or roll forward
o Faster to repair, easier to understand, simpler to diagnose
• Rapid experimentation
o Small experiments and rapid iteration are cheap

Continuous Delivery
• Enabled by
o API-driven infrastructure (“cloud”)
o PaaS
o Containers
• eBay 2-week trains vs. today

Triangle of Technical Tradeoffs
• When you choose date and features, you implicitly choose a level of quality
• Changing one changes the others
• Be open and honest when you are making these tradeoffs
[Triangle diagram labels: Date, Quality, Features]

Vicious Cycle of Technical Debt
[Cycle diagram labels: Technical Debt, “No time to do it right”, Quick-and-dirty]

Virtuous Cycle of Investment
[Cycle diagram labels: Solid Foundation, Confidence, Faster and Better, Testing]

Culture eats strategy for breakfast.
-- Peter Drucker

Culture eats strategy and organization and technology and process and … for breakfast.
-- me

Cross-Functional Collaboration
• Open communication
o Individuals encouraged to work directly with each other
o Prefer informal cooperation over formal channels
• Best decisions made through partnership
o Agreement on goals and priorities makes it easier to agree on tactics
o Given common context, well-meaning people will generally agree
o “Disagree and Commit”
• Solve problems instead of pointing fingers
o Otherwise, playing strategy instead of solving the problem
o Otherwise, avoiding blame and hiding the ball

None of us is as smart as all of us.
-- Japanese proverb, as quoted by Bob Taylor

Goals of a Service Owner
• Meet the needs of my clients …
• Functionality
• Quality
• Performance
• Stability and reliability
• Constant improvement over time
• … at minimum cost and effort
• Leverage common tools and infrastructure
• Leverage other services
• Automate building, deploying, and operating my service
• Optimize for efficient use of resources

Responsibilities of a Service Owner
• End-to-end Ownership
o Team owns service from design to deployment to retirement
o No separate maintenance or sustaining engineering team
• Autonomy and Accountability
o Freedom to choose technology, methodology, working environment
o Responsibility for the results of those choices

You Build It, You Run It.
-- Werner Vogels

Service Relationships
• Vendor – Customer Relationship
o Friendly and cooperative, but structured
o Clear ownership and division of responsibility
• Customer Focus
o Value of service comes from its value to its customers
• Customer can choose to use service or not (!)
o Must be strictly better than the alternatives of build, buy, borrow

Service-Service Relationships
• Service-Level Agreement (SLA)
o Promise of service levels by the provider
o Customer needs to be able to rely on the service, like a utility
• Charging and Cost Allocation
o Charge customers for *usage* of the service
o Aligns economic incentives of customer and provider
o Motivates both sides to optimize for efficiency
o (+) Pre- / post-allocation at Google
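
As a rough illustration of charging for *usage*, the sketch below splits a service's cost across its customers in proportion to what each one consumed. The proportional rule, the function name `allocate_cost`, and the team names in the usage note are assumptions for the example; the deck does not say how charges are actually computed.

```python
from __future__ import annotations


def allocate_cost(total_cost: float, usage: dict[str, int]) -> dict[str, float]:
    """Split `total_cost` across customers in proportion to their usage units."""
    total_usage = sum(usage.values())
    if total_usage == 0:
        # Nobody used the service, so nobody is charged.
        return {customer: 0.0 for customer in usage}
    return {
        customer: total_cost * units / total_usage
        for customer, units in usage.items()
    }
```

For example, `allocate_cost(1000.0, {"checkout": 750, "search": 250})` charges the hypothetical checkout team three times as much as the search team, which gives the heavier user the stronger incentive to optimize its calls.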

Blameless Post-Mortems
• Post-mortem After Every Incident
o Document exactly what happened
o What went right
o What went wrong
• Open and Honest Discussion
o What contributed to the incident?
o What could we have done better?
▪ Engineers compete to take personal responsibility (!)
▪ “Finally we can fix that broken system”

Blameless Post-Mortems
• Action Items
o How will we change process, technology, documentation, etc.
o How could we have automated the problems away?
o How could we have diagnosed more quickly?
o How could we have restored service more quickly?
• Follow up (!)

Failure is not falling down, but refusing to get back up.
-- Theodore Roosevelt

Thanks!
• Stitch Fix is hiring!
o www.stitchfix.com/careers
o Based in San Francisco
o Hiring everywhere!
o More than half remote, all across US
o Application development, Platform engineering, Data Science
• Please contact me
o @randyshoup
o linkedin.com/in/randyshoup

Appendices

Sharing Specialty Skills
• Specialty Skills
o Security, User Experience, Compliance, DBA, etc.
o Often quite difficult to hire
o Rarely need a full-time person on each team
• Approach 1: Service Model
o Make domain teams as self-sufficient as possible
o Encode best practices into service / tool / library
o Own those specialty services just like domain teams do

Sharing Specialty Skills
• Approach 2: Shared Model
o Single person shared among multiple teams
• Approach 3: Coaching / Advisory Model
o Specialty team is a pool of advisors
o Provide special expertise as needed
o Goal is to make domain teams self-sufficient

Team Anti-Patterns
• Skill-based teams
o Based around tiers or technologies (e.g., front-end team, application team, DBA team, Ops team)
o (-) Every project crosses many team boundaries
o (-) No end-to-end ownership of the system
o (-) No end-to-end ownership of the customer experience
• Project-based teams
o Form ad-hoc team for a particular project, then disband
o (-) No long-term ownership of code, product, service
o (-) Encourages a short-term approach and technical debt instead of a sustainable one

Team Anti-Patterns
• Large teams
o (-) Teams larger than 8-10 should be split
o (-) Communication and coordination overhead makes it increasingly difficult to sustain velocity

Effective Global Teams
• Local Ownership
o Well-defined area of responsibility
o Clean interface with the rest of the organization
• Individual teams are co-located
o High-bandwidth communication within a team
o Minimal coordination across teams

Global Team Anti-Patterns
• Anti-Pattern: Split Teams Over Geographies
o (-) Constant need for coordination over time zones
o (-) Local conversations become disruptive rather than helpful
o (-) No local pride of ownership
• Anti-Pattern: Remote Team as Job Shop
o (-) Constant need for management and task assignment
o (-) Resentment between first-tier and second-tier sites
o (-) No local pride of ownership
o Ex. eBay remote offices vs. Google remote offices

Remote Teams
• Fully remote *OR* fully co-located
o Remote teams rely on virtual proximity (chat, hangouts, IRC)
o Co-located teams rely on physical proximity (co-working)
• Anti-Pattern: “Mostly” co-located
o (-) Co-located majority ends up determining communication methods
o (-) Remote individuals left out, less able to contribute, less productive

Feature Flags
• Configuration “flag” to enable / disable a feature for a particular set of users
o Independently discovered at eBay, Facebook, Google, etc.
• More solid systems
o Decouple feature delivery from code delivery
o Rapid on and off
o Develop / test / verify in production
o Dark launches
• Enables experimentation
o A/B testing
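
A minimal sketch of a feature flag in Python, under stated assumptions: the `FeatureFlag` class, its fields, and the hash-based percentage rollout are illustrative choices, not the implementation used at eBay, Facebook, Google, or Stitch Fix. It shows the three ideas on the slide: a flag scoped to a set of users, rapid on and off, and a stable split that can back an experiment.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FeatureFlag:
    """Hypothetical in-memory flag; a real system would load this from config."""

    name: str
    enabled: bool = False                   # master switch for rapid on and off
    allowed_users: frozenset = frozenset()  # e.g. internal users for a dark launch
    rollout_percent: int = 0                # share of all other users (0-100)

    def bucket(self, user_id: str) -> int:
        # Stable hash, so a given user lands in the same bucket (0-99) every time.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100

    def is_enabled_for(self, user_id: str) -> bool:
        if not self.enabled:
            return False
        if user_id in self.allowed_users:
            return True
        return self.bucket(user_id) < self.rollout_percent


def checkout_variant(flag: FeatureFlag, user_id: str) -> str:
    # The new code path is already deployed; the flag decides who actually sees it.
    return "new" if flag.is_enabled_for(user_id) else "old"
```

Because the code ships with the flag off, delivering the code and delivering the feature become separate decisions: flipping `enabled` or raising `rollout_percent` changes behavior without a new deployment.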
