To Have Own Data Analytics Platform, Or NOT To
- 1. To Have Own Data Analytics Platform, Or NOT To
Aoyama Engineer Study Meetup (青山エンジニア勉強交流会), April 24, 2017
Satoshi Tagomori (@tagomoris)
- 5. On Feb 23, 2015
• To Have Own Data Analytics Platform, Or NOT To, In Startup Companies:
• "NOT To, in general"
• Data analytics services:
• AWS EMR, Redshift
• Google BigQuery
• Treasure Data
- 6. Options In 2017
• On Premise
• Cloudera CDH, Hortonworks HDP, ...
• Services
• AWS EMR, Redshift, Athena, Kinesis Analytics, ...
• Google BigQuery, Cloud Dataflow, Cloud Dataproc, ...
• MS Azure SQL Data Warehouse, Stream Analytics, Data Lake Analytics, ...
• Treasure Data
- 12. On Premise Platform In Past
• 2011-2014: On-premise Hadoop & Presto cluster
• w/ Fluentd stream processing cluster
• w/ Norikra stream processing
• w/ Web UI (Shib)
https://www.slideshare.net/tagomoris/lambda-architecture-using-sql-hadoopcon-2014-taiwan
- 13. To Be Considered
• Distributed Processing Platform
• Data Management
• Process Management
• Platform Management
• Visualization and BI
• Connecting Data
- 14. Distributed Processing Platform
• Hadoop, Presto, Spark, Flink, Storm, ...
• + Servers
• EMR, Redshift, Dataproc, ...
• Cost per instance
• BigQuery, Athena, Treasure Data, ...
• Cost per data/queries/...
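The two pricing models above can be compared with a rough back-of-the-envelope sketch. The function names and every number below are hypothetical placeholders chosen for illustration, not real prices of any service.

```python
def instance_cost(nodes: int, hourly_rate: float, hours: float) -> float:
    """Per-instance model: pay for every node-hour, whether queries run or not."""
    return nodes * hourly_rate * hours


def scan_cost(queries: int, tb_per_query: float, price_per_tb: float) -> float:
    """Per-data model: pay only for the data each query scans."""
    return queries * tb_per_query * price_per_tb


# Hypothetical month: 10 nodes running 720 hours vs. 1000 queries of 0.25 TB each.
monthly_cluster = instance_cost(nodes=10, hourly_rate=0.5, hours=720)
monthly_scans = scan_cost(queries=1000, tb_per_query=0.25, price_per_tb=5.0)
print(f"cluster: {monthly_cluster:.2f}, scans: {monthly_scans:.2f}")
```

Which model is cheaper depends on how busy the cluster is: an idle cluster still costs money, while a per-data service costs more as query volume grows.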
- 15. Data Management
• How to collect data?
• How to ingest data?
• How to manage schema?
• How to move data from here to there?
- 16. Process Management
• How to run queries on schedule?
• How to build workflow between queries?
• How to run queries after data ingestion?
• How to move data from the platform to elsewhere after queries?
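A minimal sketch of what "workflow between queries" means: each step runs only after the steps it depends on. The step names (`ingest`, `aggregate`, `export`) and the `run_workflow` helper are hypothetical, and this uses only Python's standard `graphlib` module; it is not how any particular workflow engine is implemented.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, deps):
    """Run each task after all of its dependencies; return the execution order.

    tasks: step name -> callable taking no arguments
    deps:  step name -> set of step names that must finish first
    """
    order = []
    for name in TopologicalSorter(deps).static_order():
        tasks[name]()
        order.append(name)
    return order


log = []
tasks = {
    "ingest": lambda: log.append("data ingested"),        # load new data
    "aggregate": lambda: log.append("query finished"),    # query runs after ingestion
    "export": lambda: log.append("results exported"),     # move results elsewhere
}
deps = {"aggregate": {"ingest"}, "export": {"aggregate"}}

print(run_workflow(tasks, deps))
```

A real platform also needs scheduling, retries, and failure notifications on top of this ordering, which is where most of the operational cost comes from.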
- 17. Platform Management
• How to upgrade software?
• How to add nodes?
• How to manage failures / downtime?
• How to replace hardware?
• How to switch platforms?
• How to provide compatibility for queries?
- 18. Visualization and BI
• How to show query results graphically?
• How to show relations between data graphically?
• How to query data interactively?
- 19. Connecting Data
• How to join logs and master data?
• How to join logs and user list?
• How to join logs and CRM data?
• How to push query results to marketing tools/services?
• How to send notifications using query results?
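As a small illustration of joining logs with master data, the sketch below enriches access-log rows with an attribute from a user list. The field names (`user_id`, `path`, `plan`) and the function name are assumptions made up for this example; on a real platform this would usually be a SQL join over much larger tables.

```python
def join_logs_with_master(logs, users):
    """Left-join log rows to user master data on user_id.

    Log rows with no matching user are kept, with plan set to None.
    """
    by_id = {u["user_id"]: u for u in users}
    joined = []
    for row in logs:
        user = by_id.get(row["user_id"])
        merged = dict(row)  # copy so the original log row is not modified
        merged["plan"] = user["plan"] if user else None
        joined.append(merged)
    return joined


logs = [
    {"user_id": 1, "path": "/top"},
    {"user_id": 2, "path": "/search"},
    {"user_id": 9, "path": "/top"},  # no matching user in master data
]
users = [
    {"user_id": 1, "plan": "free"},
    {"user_id": 2, "plan": "paid"},
]

for row in join_logs_with_master(logs, users):
    print(row)
```

The join itself is simple; the hard part the slide points at is getting the master data, user list, and CRM data into the same place as the logs, and keeping them current.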
- 21. In My Past Case:
• Distributed Processing Platform
• Hadoop & Presto (& Norikra)
• Data Management
• Hive schema & custom-made UI (Shib)
• Managed by engineers of each service
• Process Management
• Custom-made query scheduler (ShibUI)
• Platform Management
• By tagomoris
• Visualization, BI: N/A
• Connecting Data: N/A
- 22. About Treasure Data
• Distributed Processing Platform: Hive, Presto
• Data Management: Fluentd & Schema-less DB
• Process Management: Digdag / Treasure Workflow
• Platform Management: Automatic
• Visualization and BI: Treasure BI
• Connecting Data: Embulk / Data Connector
😝
- 23. Recent Improvements around Data Analytics
• Improvements of CDH/HDP to manage clusters
• Online Upgrade
• Support many processing frameworks
• Many new data processing software/frameworks
• Apache Flink, Apache Arrow, Apache Beam, ...
• Many new services available
• Stream processing, Machine learning, ...
- 27. Complexity
• Connecting data / processing with applications
• Connecting data / processing with services
• Connecting data / processing with people
- 28. Chasing the World
• New software / services / platforms / paradigms appear day by day
• Data sizes are growing day by day
• Complexity is growing day by day
• A data platform CANNOT live as-is for 5 years!
- 29. Finding Treasure From Data
• "Data Processing" is:
• NOT the purpose
• just a tool to get something great
• Use developers and their time to find treasures!