© 2019 Arm Limited
Satoshi Tagomori (@tagomoris)
Principal Software Engineer, Arm Treasure Data
Cloud Architecture and DevOps Case Study Meetup
Good Things and Hard Things
of SaaS
Development/Operations
Satoshi Tagomori (@tagomoris)
• Treasure Data: 2015~
• Arm Treasure Data (2018~)
• Current: Backend Team
• OSS: Fluentd, MessagePack-Ruby, Woothee, Norikra, ...
• ISUCON
• 2011~ (livedoor, NHN Japan, LINE)
What's DevOps?
• Collaboration between Devs and Ops
• "Development" and "System Operations"
• For faster development, deployment, release cycles
• For quicker improvements
• Performance
• Business Values
Treasure Data Platform
• Distributed Systems
• Distributed Database Systems
• Distributed Data Processing Systems
• Job Queue & Workers
• API Endpoints
• Data Transferring, Conversion
• ...
• CDP (Customer Data Platform) Application
• on the top of the platform
• our application, as a Marketing platform for customers
Treasure Data Platform
Treasure Data CDP
Customers' Marketing
Businesses, Apps
Customers'
Data
Analytics
Workloads
Engineering Teams: Own, Build and Operate Systems
Teams own components
• Design
• Development
• Configuration
• Testing
• Deployment
• Release
• Monitoring
• Alert
• Operation
• On-call!
SRE provides common features
• OS base images
• System-wide configurations
• Shared tools (chatbot, etc)
• ...
DevOps... ?
A team owns everything around a component
including Monitoring, Operations, On-call, ...
🤔
Developers do Operations
Nothing like "Devs vs Ops"!
😆
THE END ... ?
No, No, Wait, WE HAVE PROBLEMS 👻
Distributed Systems by Many Teams
Complexity over Teams
Component Dependencies
• Components depend on each other
• Query engine <-> Database
• Data ingestion <-> Database
• Worker <-> Query engine
• API <-> Worker
• ...
• Feature dependencies between components
Data Compatibilities
• Write data, Read data *later*
• Data ingestion writes data
• Database subsystem reads data
• Query engine reads data
• ...
• Incompatible data causes crashes *later*
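One common way to keep "write now, read later" data from crashing a reader much later is to tag each record with a schema version and check it at read time. The sketch below is only an illustration of that idea, not the platform's actual storage format: `SUPPORTED_VERSIONS`, `write_record`, `read_record`, and the JSON envelope are all hypothetical names chosen for this example.

```python
import json

# Hypothetical set of schema versions this reader understands (an assumption).
SUPPORTED_VERSIONS = {1, 2}


class IncompatibleRecordError(Exception):
    """Raised when a stored record uses a schema version the reader does not know."""


def write_record(payload: dict, version: int) -> str:
    """Writer side (e.g. data ingestion): tag each record with its schema version."""
    return json.dumps({"schema_version": version, "payload": payload})


def read_record(raw: str) -> dict:
    """Reader side (e.g. query engine): check the version before using the payload."""
    record = json.loads(raw)
    version = record.get("schema_version")
    if version not in SUPPORTED_VERSIONS:
        # Fail with a clear error instead of crashing somewhere deeper in the reader.
        raise IncompatibleRecordError(f"unsupported schema version: {version!r}")
    return record["payload"]


stored = write_record({"user": "alice", "count": 3}, version=2)
print(read_record(stored))
```

With a check like this, a writer released ahead of its readers surfaces as an explicit, attributable error rather than an unexplained crash in another team's component.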
Various Backend Team Components
Treasure Data Platform
Backend Team Components
Plazma
(Distributed Database)
Data Ingestion
Workers
Workflow
API
Hadoop
Presto
Data Connector
CDP
Frontend
Our Customers
Hard Things: Backend Components
with Long History, Many Components and Severe Uptime Requirement
Long History
• Historic components have stories...
• Chef-based deployments
• Old-style configurations
• Old JVMs
• Unorganized AWS components
• Unorganized monitoring/alerts
Many Components
• Backend was "not Frontend" in the past
• Different components
• Distributed database
• Data ingestion APIs
• Data ingestion workers
• Job workers
• Workflow manager
• Many ways to do things
Uptime Requirement
• Database, Workers
• Referred to by all other components
• Downtime means downtime of the entire service
• Data Ingestion
• Downtime means user-side data loss
The Thing That Makes Us Slower
Outdated Systems
Poor System Improvements
Bad Stability
High Operation Cost
Low Business Income
Low QoL
😱
😨 😰
Move Faster
Modernized Deployment
Frequent Deployment
Quick Delivery
Feedback From Customers & Teams
😀
Modernizing Deployment
Old-style (Chef) vs Modern-style (CodeDeploy)
Chef Deployed Systems Periodically
• Developers push to the "production" branch
• Release: Merge into "production"
• Chef pulls the "production" branch
• On EC2 instances
• Once every 30 minutes
• Then the Chef recipe does:
• Build, Install dependencies
• Restart processes (if needed)
Developers Kick CodeDeploy
• Service repo kicks CircleCI
• CI creates CodeDeploy packages on S3
• Developers kick "deploy" or "release"
• On Slack
• To trigger CodeDeploy
• CodeDeploy fetches packages from S3
• Build, Install, Restarts, etc
👤 👤
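The chat-driven half of this flow can be sketched as a small command parser that turns a "deploy" or "release" message into a request pointing at the CI-built package. This is a rough illustration under stated assumptions, not the team's actual bot: the command syntax, `DeployRequest`, `parse_command`, `package_location`, the bucket name, and the S3 key layout are all hypothetical, and the call that would actually start a CodeDeploy deployment is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployRequest:
    action: str    # "deploy" or "release"
    service: str   # hypothetical service name
    revision: str  # hypothetical build identifier produced by CI


def parse_command(text: str) -> DeployRequest:
    """Parse a chat message of the assumed form '<deploy|release> <service> <revision>'."""
    parts = text.split()
    if len(parts) != 3 or parts[0] not in ("deploy", "release"):
        raise ValueError(
            f"expected '<deploy|release> <service> <revision>', got: {text!r}"
        )
    return DeployRequest(*parts)


def package_location(req: DeployRequest, bucket: str = "example-deploy-packages") -> dict:
    """Locate the package CI uploaded; the bucket and key layout are assumptions."""
    return {"bucket": bucket, "key": f"{req.service}/{req.revision}.zip"}


request = parse_command("deploy api-server abc123")
print(request, package_location(request))
```

The point of the design is that a deployment starts from an explicit, logged human action instead of a periodic pull, so every deployment has a known time and initiator.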
Many, Quick, Frequent Deployments & Releases
Split Release into Small Releases
Minimize Affected Components
• Giant release affects many components
• Hard to say: "it's safe for all components"
• Many small releases
• Easy to say: "it's safe for THIS component!"
Minimize Affected Customers
• Many customers do various things on our platform
• Release MAY affect their workloads
• Query compatibility
• Query performance
• Invalid data handling
• Data ingestion request patterns
• "A customer says something went wrong"
• "What happened in the past week?"
• Make things clear for support operations
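When releases are small and recorded per component, the support question "what happened in the past week?" becomes a simple lookup. The following is a minimal sketch of such a lookup, assuming release records are available as plain data. `Release`, `releases_in_window`, and the sample entries are hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Release(NamedTuple):
    component: str
    released_at: datetime
    summary: str


def releases_in_window(releases, now, days=7):
    """Return releases from the last `days` days, newest first."""
    cutoff = now - timedelta(days=days)
    recent = [r for r in releases if cutoff <= r.released_at <= now]
    return sorted(recent, key=lambda r: r.released_at, reverse=True)


# Made-up sample history for illustration only.
history = [
    Release("query-engine", datetime(2019, 6, 1), "planner change"),
    Release("data-ingestion", datetime(2019, 6, 10), "request validation"),
    Release("api", datetime(2019, 6, 14), "new endpoint"),
]

for release in releases_in_window(history, now=datetime(2019, 6, 15)):
    print(release.released_at.date(), release.component, release.summary)
```

Because each entry names a single component, a report from a customer can be matched against a short list of candidate changes instead of one giant release.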
Revisiting: What's DevOps?
• Collaboration between Devs and Ops
• "Development" and "System Operations"
• Is It Only for System Operations?
• "Operation" often means more than system operations
(for example, Chief "Operation" Officer)
• We need to support many types of operations:
• Support Operations
• Sales Operations
• Audit Operations
Audit
• Standards Compliance
• ISO/IEC 27001:2013
• SOC-2 (Service Organization Controls) type 2
https://www.treasuredata.co.jp/security/
Executing Infra Operations in an Explicit Way
Terraform Enterprise
• AWS infra by code
• Run code on Terraform Enterprise workspaces with history
Executing Release Operations in an Explicit Way
Slack, Jira, GitHub and Automation
• Trying release automation
• Communication Central - Slack
• History Central - GitHub
• Release Central - Jira
Own Your Service Only, Use Cloud Services
Your service is heavy enough
Own Your Service Only
• Many things to do about your service
• Design
• Development
• Configuration
• Testing
• Deployment
• Release
• Monitoring
• Alert
• Operation
• On-call
Use Cloud Services
• No spare capacity to own additional things
• Cloud services are there to help your work
• Be Agile, Move Faster!
Thank You
Danke
Merci
谢谢
ありがとう
Gracias
Kiitos
감사합니다
धन्यवाद
شكرًا
תודה
