Masahiro Nakagawa
Feb 7, 2015
dots. Summit 2015
Treasure Data and OSS
Who are you?
> Masahiro Nakagawa
> github/twitter: @repeatedly
> Treasure Data, Inc.
> Senior Software Engineer
> Fluentd / td-agent developer
> I love OSS :)
> D language - Phobos committer
> Fluentd - Main maintainer
> MessagePack / RPC - D and Python (only RPC)
> The organizer of several meetups (Presto, DTM, etc)
> etc…
Company background
•  Founded 2011 in Mountain View, CA
–  The first cloud service for the entire data pipeline
–  Including: Acquisition, Storage, & Analysis
•  Provides a “Cloud Data Service”
–  Fast Time to Value
–  Cloud Flexibility and Economics
–  Simple and Well Supported
•  Treasure Data has over 100 customers in production
–  Incl. Fortune 500 companies
–  400k new records / second
–  Almost 9 Trillion records loaded
–  Variety of use cases and verticals
The Treasure Data Team
Hiro Yoshikawa – CEO
Open source business veteran
Kaz Ohta – CTO
Founder of world’s largest Hadoop Group
Sada Furuhashi – Software Architect
MessagePack / Fluentd Author
Notable Investors
Othman Laraki
Ex-VP of Growth at Twitter
Jerry Yang
Founder of Yahoo!
Yukihiro “Matz” Matsumoto
Creator of “Ruby” programming language
James Lindenbaum
Founder of Heroku
Sierra Ventures - Tim Guleri
Leading venture capital firm in Big Data
TD Service Architecture
[Diagram: Acquire -> Store -> Analyze -> Result Push]
Acquire
> Sources: Web Log, App Log, Sensor, CRM, ERP, RDBMS, POS
> Streaming Collector: Treasure Agent (Server), SDK (JS, Android, iOS, Unity)
> Bulk Uploader: Embulk, TD Toolbelt
Store
> Plazma DB: Flexible, Scalable, Columnar Storage
> @AWS or @IDCF
Analyze
> SQL-based query: SQL, Pig
> Batch / Reliability
> Ad-hoc / Low latency
> REST API, ODBC / JDBC
Result Push (send query result)
> KPI Dashboard
> BI Tools: Metric Insights, Tableau, Motion Board etc.
> Other Products: RDBMS, Google Docs, AWS S3, FTP Server, etc.
Time to Value, Connectivity, Economy & Flexibility, Simple & Supported
Data Acquisition
Log collecting in TD
> Treasure Agent
> Fluentd based log collector
> Embulk
> JavaScript SDK
> Mobile SDK (iOS, Android, Unity)
Structured logging
Reliable forwarding
Pluggable architecture
http://fluentd.org/
Fluentd
> Data collector for unified logging layer
> Streaming data transfer based on JSON
> Written in Ruby
> Various gem-based plugins
> http://www.fluentd.org/plugins
> Working in production
> http://www.fluentd.org/testimonials
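To make "structured logging" and "streaming data transfer based on JSON" concrete: a Fluentd event is made of a tag, a time, and a record. The sketch below builds such a triple in Python and serializes it as one JSON line. The helper names and the JSON envelope keys (`tag`, `time`, `record`) are assumptions for illustration, not Fluentd's wire format.

```python
import json
import time


def make_event(tag, record, timestamp=None):
    """Build a (tag, time, record) triple, the shape of a Fluentd event."""
    if timestamp is None:
        timestamp = int(time.time())
    return (tag, timestamp, record)


def to_json_line(event):
    """Serialize an event as one JSON line (illustrative envelope only)."""
    tag, timestamp, record = event
    return json.dumps({"tag": tag, "time": timestamp, "record": record},
                      sort_keys=True)


# The record is a structured key/value map, not a free-form text line.
event = make_event("web.access",
                   {"path": "/index.html", "status": 200},
                   timestamp=1423267200)
line = to_json_line(event)
print(line)
```

Because the record stays a key/value map end to end, downstream steps can filter or route on fields such as `status` without parsing text.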
Data Analytics Flow
[Diagram: Data source -> Collect -> Store -> Process -> Visualize -> Reporting, Monitoring]
Data Analytics Flow
[Diagram: Collect (???); Store and Process (Cloudera, Hortonworks, Treasure Data); Visualize (Tableau, Excel, R); label: easier & shorter time]
Divide & Conquer & Retry
[Diagram labels: Batch, Stream, Other stream; error -> retry]
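The "divide & conquer & retry" idea can be sketched as follows: split the data into small chunks, send each chunk on its own, and on error retry only the chunk that failed. This is a minimal sketch, not Fluentd's implementation; `split_into_chunks`, `send_with_retry`, and the flaky destination are hypothetical, and the wait between retries is omitted.

```python
def split_into_chunks(records, chunk_size):
    """Divide: cut the record list into chunks of at most chunk_size."""
    return [records[i:i + chunk_size]
            for i in range(0, len(records), chunk_size)]


def send_with_retry(chunk, send, max_retries=3):
    """Retry: resend only this chunk; return the number of attempts used."""
    for attempt in range(1, max_retries + 1):
        try:
            send(chunk)
            return attempt
        except IOError:
            if attempt == max_retries:
                raise  # give up after the last allowed attempt


delivered = []
failures = {"remaining": 2}


def flaky_send(chunk):
    # Hypothetical destination: fails twice on the chunk that starts with 3.
    if chunk[0] == 3 and failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise IOError("temporary failure")
    delivered.extend(chunk)


chunks = split_into_chunks([1, 2, 3, 4, 5], 2)
attempts = [send_with_retry(chunk, flaky_send) for chunk in chunks]
print(chunks, attempts, delivered)
```

Only the failing chunk is sent more than once; the other chunks go through on the first attempt.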
Core
> Divide & Conquer
> Buffering & Retrying
> Error handling
> Message routing
> Parallelism
Plugins
> read / receive data
> from API, database, command, etc…
> write / send data
> to API, database, alert, graph, etc…
Architecture (v0.12 or later)
Engine
> not pluggable
Input
> Forward
> File tail
> ...
Filter
> grep
> record_transformer
> …
Buffer
> File
> Memory
Output
> Forward
> File
> ...
Parser, Formatter
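As a rough illustration of how the Input, Filter, Buffer, and Output roles fit together, the sketch below passes events through two filters, collects them in a memory buffer, and hands full chunks to an output. All names are hypothetical stand-ins written for this example; they only mimic the roles of the grep filter, a field-adding filter, and a memory buffer, and are not Fluentd APIs.

```python
import re


def grep_filter(events, key, pattern):
    """Keep only events whose record[key] matches the pattern."""
    regex = re.compile(pattern)
    for tag, ts, record in events:
        if regex.search(str(record.get(key, ""))):
            yield tag, ts, record


def add_field_filter(events, key, value):
    """Return events with one extra field added to each record."""
    for tag, ts, record in events:
        enriched = dict(record)
        enriched[key] = value
        yield tag, ts, enriched


class MemoryBuffer:
    """Collect events into a chunk; pass the chunk to the output when full."""

    def __init__(self, limit):
        self.limit = limit
        self.chunk = []

    def append(self, event, output):
        self.chunk.append(event)
        if len(self.chunk) >= self.limit:
            self.flush(output)

    def flush(self, output):
        if self.chunk:
            output(self.chunk)
            self.chunk = []


# Input: a fixed list standing in for events read by an input plugin.
events = [
    ("app", 1, {"level": "info", "msg": "start"}),
    ("app", 2, {"level": "error", "msg": "boom"}),
    ("app", 3, {"level": "error", "msg": "again"}),
]
filtered = add_field_filter(grep_filter(events, "level", "error"),
                            "host", "web01")

written = []  # Output: chunks end up in this list
buffer = MemoryBuffer(limit=2)
for event in filtered:
    buffer.append(event, written.append)
buffer.flush(written.append)  # flush whatever is left at shutdown
print(written)
```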
Before (M x N)
After (M + N)
or Embulk
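The "M x N" versus "M + N" labels can be read as a count of integrations, assuming M is the number of data sources and N the number of destinations (the slide does not define them). A small worked example with made-up numbers:

```python
def point_to_point(m, n):
    """Before: every source is wired to every destination directly."""
    return m * n


def through_collector(m, n):
    """After: every source and every destination connects only to the collector."""
    return m + n


# Hypothetical sizes: 5 sources, 4 destinations.
before = point_to_point(5, 4)
after = through_collector(5, 4)
print(before, after)  # prints: 20 9
```

Adding one more destination costs 5 new integrations in the first setup and 1 in the second.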
Other Fluentd related OSS
> Treasure Agent
> https://github.com/treasure-data/omnibus-td-agent
> Fluentd Forwarder
> https://github.com/fluent/fluentd-forwarder
> Simple forwarder for Windows / Leaf node
> Fluentd UI
> https://github.com/fluent/fluentd-ui
> Management web UI
Other OSS products
> Scribe (C++)
> Developed by Facebook
> No longer maintained
> Apache Flume (Java)
> Mainly for Hadoop HDFS / HBase
> Logstash (JRuby)
> Mainly for Elasticsearch
Embulk
> Bulk Loader version of Fluentd
> Pluggable architecture
> JRuby, JVM languages (TBD)
> High performance parallel processing
> Share your script as a plugin
> https://github.com/embulk
http://www.slideshare.net/frsyuki/embuk-making-data-integration-works-relaxed
[Diagram: Embulk bulk-loads data between systems through plugins on both sides. Systems shown: HDFS, MySQL, Amazon S3, CSV Files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, Redis]
✓ Parallel execution
✓ Data validation
✓ Error recovery
✓ Deterministic behaviour
✓ Idempotent retrying
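"Idempotent retrying" means that running the same load task again leaves the destination in the same state as running it once. One common way to get this, shown here as a sketch and not as Embulk's actual mechanism, is to write each task's output under a key derived from the task, so a retry overwrites instead of appending. The `load_task` helper and the dict standing in for a destination are hypothetical.

```python
def load_task(store, task_id, rows):
    """Write rows under task_id; re-running the task overwrites its output."""
    store[task_id] = list(rows)


store = {}  # stands in for the destination system
load_task(store, "part-0", [1, 2])
load_task(store, "part-1", [3])
load_task(store, "part-0", [1, 2])  # retry of an already loaded task

total_rows = sum(len(rows) for rows in store.values())
print(total_rows)  # prints: 3
```

With an append-only write, the same retry would have produced 5 rows instead of 3.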
Computing Framework
3 query engines in TD
> Hive (HiveQL, Batch)
> for ETL and large jobs
> Hivemall for machine learning
> Pig (Pig Latin, Batch)
> DataFu for data mining and statistics
> Presto (SQL, Short batch)
> for Ad hoc queries
Hadoop
> Distributed computing framework
> Consists of many components…
http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/
http://nosqlessentials.com/
Apache Tez
> Low level framework for YARN applications
> New Query Engine
> Provides a good IR for Hive, Pig and more
> Task and DAG based pipelining
[Diagram: a Task is Input -> Processor -> Output; Tasks are connected into a DAG]
http://tez.apache.org/
Hive on MR vs. Hive on Tez
MapReduce Tez
http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9
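[Diagram: on MapReduce, the query runs as several map (M) / reduce (R) jobs with HDFS writes between jobs; on Tez, the same map and reduce steps run as one DAG without the intermediate HDFS writes]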
Avoid unnecessary HDFS write!
SELECT g1.x, g1.avg, g2.cnt
FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
ON (g1.x = g2.x) ORDER BY avg;
[Query plan stages, shown for both MapReduce and Tez: GROUP a BY a.x and GROUP b BY b.x -> JOIN (a, b) -> ORDER BY]
Other OSS products
> Apache Spark
> Mainly for in-memory processing
> Spark ecosystem is now growing
> Apache Flink
> Mainly for iterative processing
> Microsoft’s Dryad
> This was premature for human beings…
Presto
A distributed SQL query engine for interactive data analysis against GBs to PBs of data.
Presto overview
> Open sourced by Facebook
> http://prestodb.io/
> written in Java
> Built-in useful features
> Connectors
> Machine Learning
> Window function
> Approximate query
> etc…
> Used by Netflix, Dropbox, Treasure Data,
Qubole, Airbnb, LINE, GREE, Scaleout, etc
[Diagram: Batch analysis platform (HDFS, Hive; Daily/Hourly Batch) -> PostgreSQL, etc. -> Visualization platform (Interactive query; Commercial BI Tools, Dashboard)]
[Diagram: Batch analysis platform (HDFS, Hive; Daily/Hourly Batch) -> PostgreSQL, etc. -> Visualization platform (Interactive query; Commercial BI Tools, Dashboard), annotated with:]
✓ Less scalable
✓ Extra cost
✓ More work to manage 2 platforms
✓ Can’t query against “live” data directly
[Diagram: two setups side by side. One: HDFS / Hive -> Daily/Hourly Batch -> PostgreSQL, etc. -> Interactive query -> Dashboard. Other: HDFS / Hive (Daily/Hourly Batch) -> Presto -> Interactive query -> Dashboard]
[Diagram: Data analysis platform. Presto provides SQL on any data sets: HDFS / Hive (Daily/Hourly Batch), Cassandra, MySQL, Commercial DBs. Interactive query from the Dashboard and Commercial BI Tools]
✓ IBM Cognos
✓ Tableau
✓ ...
MapReduce vs. Presto
MapReduce
[Diagram: map and reduce stages with a disk between stages]
Write data to disk
Wait between stages
Presto
[Diagram: tasks connected directly to tasks]
All stages are pipelined
✓ No wait time
✓ No fault-tolerance
memory-to-memory data transfer
✓ No disk IO
✓ Data chunk must fit in memory
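The difference between "wait between stages" and "all stages are pipelined" can be shown with a toy example. In the staged version, the first stage finishes and materializes its whole output (a list standing in for the disk write) before the second stage starts. In the pipelined version, Python generators pass each row to the next stage as soon as it is produced. This is only an analogy with hypothetical stage names; it says nothing about how MapReduce or Presto are implemented.

```python
def scan(rows, log):
    """First stage: emit rows one by one, recording when each is handled."""
    for row in rows:
        log.append(("scan", row))
        yield row


def double(rows, log):
    """Second stage: transform rows one by one, recording each step."""
    for row in rows:
        log.append(("double", row))
        yield row * 2


def run_staged(rows):
    log = []
    scanned = list(scan(rows, log))        # stage 1 ends; output is materialized
    doubled = list(double(scanned, log))   # stage 2 starts only after stage 1
    return sum(doubled), log


def run_pipelined(rows):
    log = []
    total = sum(double(scan(rows, log), log))  # rows flow stage to stage in memory
    return total, log


staged_total, staged_log = run_staged([1, 2])
piped_total, piped_log = run_pipelined([1, 2])
print(staged_log)
print(piped_log)
```

Both runs compute the same total; only the order of work differs. The staged log shows all scans before any doubling, while the pipelined log interleaves them.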
Other OSS products
> Cloudera Impala
> Mainly for HDFS / HBase
> Apache Drill
> More flexible architecture
> Apache Tajo
> For building data warehouse
Visualization
Hmm…
> There are no popular OSS products
> We don’t focus on developing a visualization tool for now
> Commercial BI tools are popular
> Tableau, Motion Board, etc.
> Maybe the next presentation will cover this area in depth
Treasure Data resources
> https://github.com/treasure-data
> perfectqueue, perfectsched, etc
> https://sql.treasuredata.com/
> HiveQL syntax checker
> https://examples.treasuredata.com/
> Query catalog
http://blog.treasuredata.com/2014/11/26/12-open-source-software-innovations-from-treasure-data-engineers/
Check: treasuredata.com
Cloud service for the entire data pipeline
