Good Things and Hard Things of SaaS Development/Operations
- 1. © 2019 Arm Limited
Satoshi Tagomori (@tagomoris)
Principal Software Engineer, Arm Treasure Data
Cloud Architecture and DevOps Case Study Meetup (クラウド活用のアーキテクチャとDevOps事例勉強会)
- 2.
Satoshi Tagomori (@tagomoris)
• Treasure Data: 2015~
• Arm Treasure Data (2018~)
• Current: Backend Team
• OSS: Fluentd, MessagePack-Ruby, Woothee, Norikra, ...
• ISUCON
• 2011~ (livedoor, NHN Japan, LINE)
- 4.
What's DevOps?
• Collaboration between Devs and Ops
• "Development" and "System Operations"
• For faster development, deployment, release cycles
• For quicker improvements
• Performance
• Business Values
- 5.
Treasure Data Platform
• Distributed Systems
• Distributed Database Systems
• Distributed Data Processing Systems
• Job Queue & Workers
• API Endpoints
• Data Transfer, Conversion
• ...
• CDP (Customer Data Platform) Application
• built on top of the platform
• our application, as a Marketing platform for customers
[Diagram labels: Treasure Data Platform; Treasure Data CDP; Customers' Marketing Businesses, Apps; Customers' Data Analytics Workloads]
- 6.
Engineering Teams: Own, Build and Operate Systems
Teams own components
• Design
• Development
• Configuration
• Testing
• Deployment
• Release
• Monitoring
• Alert
• Operation
• On-call!
SRE provides common features
• OS base images
• System-wide configurations
• Shared tools (chatbot, etc)
• ...
- 7.
DevOps... ?
A team owns everything around a component
including Monitoring, Operations, On-call, ...
🤔
- 8.
Developers do Operations
Nothing like "Devs vs Ops"!
😆
- 9.
THE END ... ?
No, No, Wait, WE HAVE PROBLEMS 👻
- 10.
Distributed Systems by Many Teams
Complexity across Teams
Component Dependencies
• Components depend on each other
• Query engine <-> Database
• Data ingestion <-> Database
• Worker <-> Query engine
• API <-> Worker
• ...
• Feature dependencies between components
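As a rough illustration of why these dependencies matter, the pairs listed above can be treated as an undirected graph, and a change in one component can be traced to everything connected to it. This is a minimal sketch: the edge list contains only the pairs named on the slide (the real platform has more), and the function names are made up for the example.

```python
from collections import deque

# Dependency pairs named on the slide; the real list continues ("...").
DEPENDENCIES = [
    ("Query engine", "Database"),
    ("Data ingestion", "Database"),
    ("Worker", "Query engine"),
    ("API", "Worker"),
]


def build_graph(pairs):
    """Build an undirected adjacency map from component pairs."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def reachable_from(graph, start):
    """Return every component connected to `start`, directly or indirectly."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in graph.get(current, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(start)
    return seen


graph = build_graph(DEPENDENCIES)
direct = sorted(graph["Database"])
indirect = sorted(reachable_from(graph, "Database"))
print(direct)
print(indirect)
```

Even with only four pairs, every other component is connected to the database, directly or through a neighbor.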
Data Compatibilities
• Write data, Read data *later*
• Data ingestion writes data
• Database subsystem reads data
• Query engine reads data
• ...
• Incompatible data causes crashes *later*
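One common way to keep a bad write from surfacing as a crash much later is to tag every record with a format version and check it on both the write and the read path. The sketch below assumes a hypothetical JSON envelope with a `version` field; none of these names come from the Treasure Data codebase, and the talk does not say this is how the platform handles it.

```python
import json

# Hypothetical: format versions that every currently deployed reader understands.
SUPPORTED_VERSIONS = {1, 2}


class IncompatibleRecord(Exception):
    """Raised when a record uses a format that readers do not know."""


def encode_record(payload, version):
    """Writer side: refuse to write a version that readers cannot decode yet."""
    if version not in SUPPORTED_VERSIONS:
        raise IncompatibleRecord(f"no deployed reader supports version {version}")
    return json.dumps({"version": version, "payload": payload})


def decode_record(raw):
    """Reader side: fail with a clear error instead of crashing on bad data."""
    envelope = json.loads(raw)
    version = envelope.get("version")
    if version not in SUPPORTED_VERSIONS:
        raise IncompatibleRecord(f"unknown record version: {version!r}")
    return envelope["payload"]


stored = encode_record({"user_id": 42}, version=2)
restored = decode_record(stored)
```

Rejecting the write moves the failure to release time, when the responsible change is still fresh, rather than to whenever another team's reader first touches the data.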
- 11.
Various Backend Team Components
[Diagram labels: Treasure Data Platform; Backend Team Components; Plazma (Distributed Database); Data Ingestion; Workers; Workflow; API; Hadoop; Presto; Data Connector; CDP; Frontend; Our Customers]
- 12.
Hard Things: Backend Components
with Long History, Many Components and Severe Uptime Requirement
Long History
• Historic components have stories...
• Chef-based deployments
• Old-style configurations
• Old JVMs
• Unorganized AWS components
• Unorganized monitoring/alerts
Many Components
• Backend was "not Frontend" in the past
• Different components
• Distributed database
• Data ingestion APIs
• Data ingestion workers
• Job workers
• Workflow manager
• Many ways to do things
Uptime Requirement
• Database, Workers
• Referred to by all other components
• Downtime means entire service downtime
• Data Ingestion
• Downtime means user-side data loss
- 13.
The Thing That Makes Us Slower
Outdated Systems
Poor System Improvements
Bad Stability
High Operation Cost
Low Business Income
Low QoL
😱 😨 😰
- 14.
Move Faster
Quick Delivery
Feedback from Customers & Teams
Frequent Deployment
Modernized Deployment
😀
- 15.
Modernizing Deployment
Old-style (Chef) vs Modern-style (CodeDeploy)
Chef Deployed Systems Periodically
• Developers push to the "production" branch
• Release: Merge into "production"
• Chef pulls "production" branch
• On EC2 instances
• Once every 30 minutes
• Then the Chef recipe does:
• Build, Install dependencies
• Restart processes (if needed)
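The periodic model above can be pictured as a run that compares the deployed revision with the branch head every 30 minutes. This is a simulation only, not Chef code: `converge`, `minutes_until_live`, and the revision strings are illustrative assumptions, and runs are assumed to start at fixed 30-minute marks.

```python
PULL_INTERVAL_MINUTES = 30  # from the slide: once every 30 minutes


def converge(deployed_revision, branch_head):
    """One periodic run: return (new_revision, actions performed)."""
    if deployed_revision == branch_head:
        return deployed_revision, []  # nothing changed, nothing restarted
    # The slide says restarts happen "if needed"; this sketch always restarts.
    actions = ["build", "install dependencies", "restart processes"]
    return branch_head, actions


def minutes_until_live(merge_minute):
    """Minutes between a merge and the next periodic run that picks it up."""
    remainder = merge_minute % PULL_INTERVAL_MINUTES
    return 0 if remainder == 0 else PULL_INTERVAL_MINUTES - remainder


revision, actions = converge("abc123", "def456")
wait = minutes_until_live(merge_minute=31)
```

Under these assumptions, a merge at minute 31 waits 29 minutes before it goes live, and the restart happens whenever the next run fires rather than when a developer chooses.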
Developers Kick CodeDeploy
• Service repo kicks CircleCI
• CI creates CodeDeploy packages on S3
• Developers kick "deploy" or "release"
• On Slack
• To trigger CodeDeploy
• CodeDeploy fetches packages from S3
• Build, Install, Restart, etc.
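The chat-triggered step could look roughly like the sketch below. The `deploy <service> <revision>` command syntax, the bucket name, the deployment group naming, and the S3 key layout are all hypothetical. The only real API assumed is boto3's CodeDeploy `create_deployment` call, and the client is injected so the parsing logic can be exercised without AWS credentials.

```python
def build_deployment_request(command, bucket="example-deploy-packages"):
    """Turn a chat command like 'deploy api 1a2b3c' into create_deployment kwargs."""
    parts = command.split()
    if len(parts) != 3 or parts[0] != "deploy":
        raise ValueError("usage: deploy <service> <revision>")
    _, service, revision = parts
    return {
        "applicationName": service,
        "deploymentGroupName": f"{service}-production",  # hypothetical naming
        "revision": {
            "revisionType": "S3",
            "s3Location": {
                "bucket": bucket,
                "key": f"{service}/{revision}.zip",  # package uploaded by CI
                "bundleType": "zip",
            },
        },
    }


def trigger_deploy(client, command):
    """Send the request through a CodeDeploy client and return the deployment id."""
    response = client.create_deployment(**build_deployment_request(command))
    return response["deploymentId"]


class FakeCodeDeployClient:
    """Stand-in for boto3.client('codedeploy') so the sketch runs offline."""

    def __init__(self):
        self.requests = []

    def create_deployment(self, **kwargs):
        self.requests.append(kwargs)
        return {"deploymentId": "d-EXAMPLE"}


fake = FakeCodeDeployClient()
deployment_id = trigger_deploy(fake, "deploy api 1a2b3c")
```

In a real setup the fake would be replaced by `boto3.client("codedeploy")`, and the command would arrive from the chatbot rather than a string literal.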
- 16.
Many, Quick, Frequent Deployments & Releases
Split Release into Small Releases
Minimize Affected Components
• Giant release affects many components
• Hard to say:
"it's safe for all components"
• Many small releases
• Easy to say:
"it's safe for THIS component!"
Minimize Affected Customers
• Many customers do various things on our platform
• Release MAY affect their workloads
• Query compatibility
• Query performance
• Invalid data handling
• Data ingestion request patterns
• "A customer says something goes wrong"
• "What happened in the past week?"
• Make things clear for support operations
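Small, per-component releases make the support question above answerable with a simple lookup. The sketch assumes a hypothetical in-memory release log; the `Release` fields, component names, and sample entries are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Release:
    component: str
    released_at: datetime
    summary: str


def releases_in_window(log, component, now, days=7):
    """Return releases of one component within the last `days`, newest first."""
    since = now - timedelta(days=days)
    matches = [
        r for r in log
        if r.component == component and since <= r.released_at <= now
    ]
    return sorted(matches, key=lambda r: r.released_at, reverse=True)


# Invented sample data.
now = datetime(2019, 6, 10, 12, 0)
log = [
    Release("query-engine", datetime(2019, 6, 9, 10, 0), "planner fix"),
    Release("query-engine", datetime(2019, 5, 20, 10, 0), "upgrade"),
    Release("data-ingestion", datetime(2019, 6, 8, 10, 0), "retry tuning"),
]
recent = releases_in_window(log, "query-engine", now)
```

With one giant release, the same lookup would return every component at once and narrow nothing down.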
- 18.
Revisiting: What's DevOps?
• Collaboration between Devs and Ops
• "Development" and "System Operations"
• Is It Only for System Operations?
• "Operations" often means more than system operations
(for example, Chief "Operating" Officer)
• We need to support many types of operations:
• Support Operations
• Sales Operations
• Audit Operations
- 19.
Audit
• Standards Compliance
• ISO/IEC 27001:2013
• SOC-2 (Service Organization Controls) type 2
https://www.treasuredata.co.jp/security/
- 20.
Executing Infra Operations in an Explicit Way
Terraform Enterprise
• AWS infra by code
• Run code on Terraform Enterprise workspaces with history
- 21.
Executing Release Operations in an Explicit Way
Slack, Jira, GitHub and Automation
• Trying release automation
• Communication Central - Slack
• History Central - GitHub
• Release Central - Jira
- 22.
Own Your Service Only, Use Cloud Services
Your service is heavy enough
Own Your Service Only
• Many things to do about your service
• Design
• Development
• Configuration
• Testing
• Deployment
• Release
• Monitoring
• Alert
• Operation
• On-call
Use Cloud Services
• No spare capacity to own additional things
• Cloud services are there to help your work
• Be Agile, Move Faster!