To Have Own Data Analytics Platform, Or NOT To

Aoyama Engineer Study Meetup (青山エンジニア勉強交流会), April 24, 2017
Satoshi Tagomori (@tagomoris)

Satoshi "Moris" Tagomori
(@tagomoris)
Fluentd, MessagePack-Ruby, Norikra, ...
Treasure Data, Inc.

http://tsuchinoko.dmmlabs.com/?p=1770

On Feb 23, 2015
• To Have Own Data Analytics Platform, Or NOT To,
In Startup Companies:
• "NOT To, in general"
• Data analytics services:
• AWS EMR, Redshift
• Google BigQuery
• Treasure Data

Options In 2017
• On Premise
• Cloudera CDH, Hortonworks HDP, ...
• Services
• AWS EMR, Redshift, Athena, Kinesis Analytics, ...
• Google BigQuery, Cloud Dataflow, Cloud Dataproc, ...
• MS Azure SQL Data Warehouse, Stream Analytics, Data Lake Analytics, ...
• Treasure Data

NO FINE CONCLUSION
IN THIS PRESENTATION

On Premise Platform In Past
• 2011-2014: On-premise Hadoop & Presto cluster
• w/ Fluentd stream processing cluster
• w/ Norikra stream processing
• w/ Web UI (Shib)
https://www.slideshare.net/tagomoris/lambda-architecture-using-sql-hadoopcon-2014-taiwan

To Be Considered
• Distributed Processing Platform
• Data Management
• Process Management
• Platform Management
• Visualization and BI
• Connecting Data

Distributed Processing Platform
• Hadoop, Presto, Spark, Flink, Storm, ...
• + Servers
• EMR, Redshift, Dataproc, ...
• Cost per instance
• BigQuery, Athena, Treasure Data, ...
• Cost per data/queries/...
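
The two pricing shapes above (pay per running instance vs. pay per data scanned or query) can be compared with a back-of-the-envelope sketch. Every number below is a hypothetical placeholder, not a real price from any vendor, and the function names are invented for illustration.

```python
def instance_cost(nodes, hourly_rate, hours):
    """Cost when you pay for running instances, busy or idle."""
    return nodes * hourly_rate * hours


def per_query_cost(tb_scanned_per_query, queries, rate_per_tb):
    """Cost when you pay only for the data each query scans."""
    return tb_scanned_per_query * queries * rate_per_tb


def break_even_queries(nodes, hourly_rate, hours, tb_scanned_per_query, rate_per_tb):
    """Number of queries at which both pricing models cost the same."""
    cluster = instance_cost(nodes, hourly_rate, hours)
    return cluster / (tb_scanned_per_query * rate_per_tb)


# Hypothetical workload: a 10-node cluster running for a 720-hour month,
# versus 2000 queries that each scan 0.1 TB.
monthly_cluster = instance_cost(nodes=10, hourly_rate=0.5, hours=720)
monthly_scan = per_query_cost(tb_scanned_per_query=0.1, queries=2000, rate_per_tb=5.0)
threshold = break_even_queries(10, 0.5, 720, 0.1, 5.0)

print(monthly_cluster, monthly_scan, threshold)
```

Below the break-even query count the per-query model is cheaper in this toy model; above it, the fixed cluster wins. Real bills also include storage, transfer, and operations time, which this sketch ignores.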

Data Management
• How to collect data?
• How to ingest data?
• How to manage schema?
• How to move data from here to there?

Process Management
• How to run queries on schedule?
• How to build workflow between queries?
• How to run queries after data ingestion?
• How to move data from the platform to elsewhere after queries?
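
The questions above mostly come down to running steps in dependency order. A minimal sketch, assuming a small in-process pipeline and using only Python's standard `graphlib` module; the task names and the `run_workflow` helper are hypothetical, and this is not the API of any particular workflow engine.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(tasks, depends_on):
    """Run every task once, only after all of its upstream tasks finished.

    tasks:      name -> callable taking the dict of results produced so far
    depends_on: name -> set of task names that must run first
    """
    order = list(TopologicalSorter(depends_on).static_order())
    results = {}
    for name in order:
        results[name] = tasks[name](results)
    return order, results


# Hypothetical three-step pipeline: ingest, then query, then export.
tasks = {
    "ingest": lambda r: [{"path": "/", "status": 200}, {"path": "/x", "status": 404}],
    "count_errors": lambda r: sum(1 for row in r["ingest"] if row["status"] >= 400),
    "export": lambda r: f"errors={r['count_errors']}",
}
depends_on = {
    "ingest": set(),
    "count_errors": {"ingest"},
    "export": {"count_errors"},
}

order, results = run_workflow(tasks, depends_on)
print(order, results["export"])
```

A production scheduler additionally needs time-based triggers, retries, and failure notifications, which is where most of the effort goes when it is built in-house.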

Platform Management
• How to upgrade software?
• How to add nodes?
• How to manage failures / downtime?
• How to replace hardware?
• How to switch platforms?
• How to provide compatibility for queries?

Visualization and BI
• How to show query results graphically?
• How to show relations between data graphically?
• How to query data interactively?

Connecting Data
• How to join logs and master data?
• How to join logs and user list?
• How to join logs and CRM data?
• How to push query results to marketing tools/services?
• How to send notifications using query results?
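
As an illustration of the "join logs and user list" question, here is a minimal sketch of a left join done in plain Python. The field names (`user_id`, `plan`, `path`) and the sample records are assumptions made up for this example; on a real platform the same join would normally be a SQL query over tables that hold the imported master data.

```python
def join_logs_with_users(logs, users):
    """Left-join log events to user records on user_id.

    Events without a matching user are kept, with plan set to None.
    """
    users_by_id = {user["user_id"]: user for user in users}  # build the lookup once
    joined = []
    for event in logs:
        user = users_by_id.get(event["user_id"])
        joined.append({**event, "plan": user["plan"] if user else None})
    return joined


# Hypothetical sample data.
logs = [
    {"user_id": 1, "path": "/pricing"},
    {"user_id": 2, "path": "/docs"},
    {"user_id": 9, "path": "/signup"},  # no matching user record
]
users = [
    {"user_id": 1, "plan": "free"},
    {"user_id": 2, "plan": "paid"},
]

joined = join_logs_with_users(logs, users)
for row in joined:
    print(row)
```

The code itself is the easy part; the harder question raised by this slide is how the user list or CRM data gets into the same place as the logs, and how it is kept fresh.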

Additional Topics
• Stream Processing Platform
• Machine Learning Platform
• AI(?) Services

In My Past Case:
• Distributed Processing Platform
• Hadoop & Presto (& Norikra)
• Data Management
• Hive schema & Custom-made UI (Shib)
• Managed by engineers of each service
• Process Management
• Custom-made query scheduler (ShibUI)
• Platform Management
• By tagomoris
• Visualization, BI: N/A
• Connecting Data: N/A

About Treasure Data
• Distributed Processing Platform: Hive, Presto
• Data Management: Fluentd & Schema-less DB
• Process Management: Digdag / Treasure Workflow
• Platform Management: Automatic
• Visualization and BI: Treasure BI
• Connecting Data: Embulk / Data Connector
😝

Recent Improvements around Data Analytics
• Improvements of CDH/HDP to manage clusters
• Online Upgrade
• Support many processing frameworks
• Many new data processing software/frameworks
• Apache Flink, Apache Arrow, Apache Beam, ...
• Many new services available
• Stream processing, Machine learning, ...

MONEY
• Saving money is important - it's true.

MONEY
• Saving money introduces many issues - it's true!

MONEY
• Money solves many problems - is it true?

Complexity
• Connecting data / processing with applications
• Connecting data / processing with services
• Connecting data / processing with people

Chasing the World
• Many new software / services / platforms / paradigms, day by day
• Data sizes are growing day by day
• Complexity is growing day by day
• A data platform CANNOT live as-is for 5 years!

Finding Treasure From Data
• "Data Processing" is:
• NOT the purpose
• just a tool to get something great
• Use developers and their time to find treasures!
