To Have Own Data Analytics Platform, Or NOT To
青山エンジニア勉強交流会 April 24, 2017
Satoshi Tagomori (@tagomoris)
Satoshi "Moris" Tagomori
(@tagomoris)
Fluentd, MessagePack-Ruby, Norikra, ...
Treasure Data, Inc.
To Have Own Data Analytics Platform, Or NOT To
http://tsuchinoko.dmmlabs.com/?p=1770
On Feb 23, 2015
• To Have Own Data Analytics Platform, Or NOT To,
In Startup Companies:
• "NOT To, in general"
• Data analytics services:
• AWS EMR, Redshift
• Google BigQuery
• Treasure Data
Options In 2017
• On Premise
• Cloudera CDH, Hortonworks HDP, ...
• Services
• AWS EMR, Redshift, Athena, Kinesis Analytics, ...
• Google BigQuery, Cloud Dataflow, Cloud Dataproc, ...
• MS Azure SQL Data Warehouse, Stream Analytics, Data Lake Analytics, ...
• Treasure Data
TO HAVE
OR
NOT TO HAVE
?
DO NOT
😝
Anyway,
NO FINE CONCLUSION
IN THIS PRESENTATION
On-Premise Platform in the Past
• 2011-2014: On-premise Hadoop & Presto cluster
• w/ Fluentd stream processing cluster
• w/ Norikra stream processing
• w/ Web UI (Shib)
https://www.slideshare.net/tagomoris/lambda-architecture-using-sql-hadoopcon-2014-taiwan
To Be Considered
• Distributed Processing Platform
• Data Management
• Process Management
• Platform Management
• Visualization and BI
• Connecting Data
Distributed Processing Platform
• Hadoop, Presto, Spark, Flink, Storm, ...
• + Servers
• EMR, Redshift, Dataproc, ...
• Cost per instance
• BigQuery, Athena, Treasure Data, ...
• Cost per data/queries/...
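The two pricing models above can be compared with simple arithmetic. The following is a minimal sketch, not a real price calculator: the function names and every price used with them are hypothetical, and real services bill in more dimensions (storage, network, reserved capacity) than shown here.

```python
def instance_cost(nodes: int, hours: float, price_per_node_hour: float) -> float:
    """Cost of a cluster billed per instance-hour, whether or not it is busy."""
    return nodes * hours * price_per_node_hour


def scan_cost(tb_scanned: float, price_per_tb: float) -> float:
    """Cost of a service billed per terabyte of data scanned by queries."""
    return tb_scanned * price_per_tb


def break_even_tb(nodes: int, hours: float,
                  price_per_node_hour: float, price_per_tb: float) -> float:
    """Terabytes scanned at which both models cost the same.

    Below this volume the per-data model is cheaper; above it the
    per-instance model is cheaper (under these simplified assumptions).
    """
    return instance_cost(nodes, hours, price_per_node_hour) / price_per_tb
```

With assumed figures, for example 10 nodes running 720 hours at a made-up 0.5 per node-hour against a made-up 5.0 per TB, `break_even_tb(10, 720, 0.5, 5.0)` gives the monthly scan volume at which the two models meet.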
Data Management
• How to collect data?
• How to ingest data?
• How to manage schema?
• How to move data from here to there?
Process Management
• How to run queries on schedule?
• How to build workflow between queries?
• How to run queries after data ingestion?
• How to move data from the platform to elsewhere after queries?
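The questions above about workflows between queries come down to running steps in dependency order. Here is a minimal sketch using Python's standard `graphlib` module (Python 3.9+). The step names and the `run_query` callback are hypothetical, and this is not how Digdag or any particular scheduler is implemented.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Iterable, List


def run_workflow(dependencies: Dict[str, Iterable[str]],
                 run_query: Callable[[str], None]) -> List[str]:
    """Run each step only after all of the steps it depends on.

    `dependencies` maps a step name to the names of the steps that must
    finish first. Returns the order in which the steps were executed.
    """
    # static_order() yields every node with its predecessors first and
    # raises graphlib.CycleError if the dependencies form a cycle.
    order = list(TopologicalSorter(dependencies).static_order())
    for step in order:
        run_query(step)
    return order
```

A step such as a daily summary query would list the ingestion step as its dependency, and an export step would list the summary query, so the export only runs once the query results exist.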
Platform Management
• How to upgrade software?
• How to add nodes?
• How to manage failures / downtime?
• How to replace hardware?
• How to switch platforms?
• How to provide compatibility for queries?
Visualization and BI
• How to show query results graphically?
• How to show relations between data graphically?
• How to query data interactively?
Connecting Data
• How to join logs and master data?
• How to join logs and user list?
• How to join logs and CRM data?
• How to push query results to marketing tools/services?
• How to send notifications using query results?
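Joining logs with master data, as asked above, is essentially a keyed lookup. The sketch below shows an in-memory inner join in plain Python. The `user_id` key and the record fields are assumptions made for illustration; on a real platform this would normally be a SQL join over tables.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def join_logs_with_master(logs: Iterable[Record],
                          master: Iterable[Record],
                          key: str = "user_id") -> List[Record]:
    """Inner-join log records with master records on a shared key.

    Logs without a matching master record are dropped. When both sides
    have a field with the same name, the value from the log wins.
    """
    # Build a lookup table once so each log record is matched in O(1).
    index = {row[key]: row for row in master}
    joined = []
    for log in logs:
        match = index.get(log[key])
        if match is not None:
            joined.append({**match, **log})
    return joined
```

The same shape applies to user lists or CRM exports: load the smaller data set into the lookup table, then stream the logs through it.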
Additional Topics
• Stream Processing Platform
• Machine Learning Platform
• AI(?) Services
In My Past Case:
• Distributed Processing Platform
• Hadoop & Presto (& Norikra)
• Data Management
• Hive schema & custom-made UI (Shib)
• Managed by engineers of each service
• Process Management
• Custom-made query scheduler (ShibUI)
• Platform Management
• By tagomoris
• Visualization, BI: N/A
• Connecting Data: N/A
About Treasure Data
• Distributed Processing Platform: Hive, Presto
• Data Management: Fluentd & Schema-less DB
• Process Management: Digdag / Treasure Workflow
• Platform Management: Automatic
• Visualization and BI: Treasure BI
• Connecting Data: Embulk / Data Connector
😝
Recent Improvements around Data Analytics
• Improvements of CDH/HDP to manage clusters
• Online Upgrade
• Support many processing frameworks
• Many new data processing software/frameworks
• Apache Flink, Apache Arrow, Apache Beam, ...
• Many new services available
• Stream processing, Machine learning, ...
MONEY
• Saving money is important - it's true.
MONEY
• Saving money introduces many issues - it's true!
MONEY
• Money solves many problems - is it true?
Complexity
• Connecting data / processing with applications
• Connecting data / processing with services
• Connecting data / processing with people
Chasing the World
• Many new software / services / platforms / paradigms, day by day
• Data sizes are growing day by day
• Complexity is growing day by day
• A data platform CANNOT live as-is for 5 years!
Finding Treasure From Data
• "Data Processing" is:
• NOT the purpose
• just a tool to get something great
• Use developers and their time to find treasures!
TBD
Thank you!
@tagomoris
