Treasure Data and OSS

Masahiro Nakagawa
Feb 7, 2015
dots. Summit 2015

Who are you?
> Masahiro Nakagawa
> github/twitter: @repeatedly
> Treasure Data, Inc.
> Senior Software Engineer
> Fluentd / td-agent developer
> I love OSS :)
> D language - Phobos committer
> Fluentd - Main maintainer
> MessagePack / RPC - D and Python (only RPC)
> The organizer of several meetups (Presto, DTM, etc)
> etc…

Company background
•  Founded 2011 in Mountain View, CA
–  The first cloud service for the entire data pipeline
–  Including: Acquisition, Storage, & Analysis
•  Provide a “Cloud Data Service”
–  Fast Time to Value
–  Cloud Flexibility and Economics
–  Simple and Well Supported
•  Treasure Data has more than 100 customers in production
–  Incl. Fortune 500 companies
–  400k new records / second
–  Almost 9 trillion records loaded
–  Variety of use cases and verticals

The Treasure Data Team
> Hiro Yoshikawa (CEO): Open source business veteran
> Kaz Ohta (CTO): Founder of the world’s largest Hadoop user group
> Sada Furuhashi (Software Architect): MessagePack / Fluentd author
Notable Investors
> Othman Laraki: Ex-VP of Growth at Twitter
> Jerry Yang: Founder of Yahoo!
> Yukihiro “Matz” Matsumoto: Creator of the Ruby programming language
> James Lindenbaum: Founder of Heroku
> Sierra Ventures - Tim Guleri: Leading venture capital firm in Big Data

TD Service Architecture
> Acquire: Web Log, App Log, Sensor, CRM, ERP, RDBMS, POS
> Streaming Collector: Treasure Agent (Server), SDK (JS, Android, iOS, Unity)
> Bulk Uploader: Embulk, TD Toolbelt
> Store: Plazma DB (Flexible, Scalable, Columnar Storage), @AWS or @IDCF
> Analyze: SQL-based query (SQL, Pig), Batch / Reliability, Ad-hoc / Low latency
> Connectivity: REST API, ODBC / JDBC
> Result Push (send query result) to Other Products: RDBMS, Google Docs, AWS S3, FTP Server, etc.
> KPI Dashboard: Metric Insights
> BI Tools: Tableau, Motion Board, etc.
> Time to Value, Economy & Flexibility, Simple & Supported

Log collecting in TD
> Treasure Agent
> Fluentd based log collector
> Embulk
> JavaScript SDK
> Mobile SDK (iOS, Android, Unity)

> Structured logging
> Reliable forwarding
> Pluggable architecture
http://fluentd.org/

Fluentd
> Data collector for unified logging layer
> Streaming data transfer based on JSON
> Written in Ruby
> Various gem-based plugins
> http://www.fluentd.org/plugins
> Used in production
> http://www.fluentd.org/testimonials

Data Analytics Flow
> Data source -> Collect -> Store -> Process -> Visualize -> Reporting / Monitoring

Data Analytics Flow
> Collect: ???
> Store / Process: Cloudera, Hortonworks, Treasure Data
> Visualize: Tableau, Excel, R
> Store / Process and Visualize: easier & shorter time

Divide & Conquer & Retry
[Diagram: Batch, Stream, and Other stream, each annotated with error and retry]

Core
> Divide & Conquer
> Buffering & Retrying
> Error handling
> Message routing
> Parallelism
Plugins
> read / receive data from API, database, command, etc…
> write / send data to API, database, alert, graph, etc…
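
The "Buffering & Retrying" responsibility of the core can be illustrated with a minimal sketch. This is not Fluentd's implementation: the class `BufferedOutput`, its parameters, and the injected `write_chunk` and `sleep` callables are hypothetical names chosen for illustration. Records are grouped into small chunks (divide & conquer), and a chunk whose write fails is retried with an exponentially growing wait.

```python
import time


class BufferedOutput:
    """Toy buffered output: groups records into chunks and retries failed writes.

    Hypothetical illustration only; names and defaults are assumptions.
    """

    def __init__(self, write_chunk, chunk_size=3, max_retries=4,
                 base_wait=1.0, sleep=time.sleep):
        self.write_chunk = write_chunk  # callable that sends one chunk, may raise IOError
        self.chunk_size = chunk_size
        self.max_retries = max_retries
        self.base_wait = base_wait
        self.sleep = sleep              # injectable so tests do not really wait
        self.pending = []               # records not yet flushed
        self.failed = []                # chunks that exhausted their retries

    def emit(self, record):
        """Buffer one record and flush when the chunk is full."""
        self.pending.append(record)
        if len(self.pending) >= self.chunk_size:
            self.flush()

    def flush(self):
        """Send the pending chunk, retrying with exponential backoff on IOError."""
        if not self.pending:
            return
        chunk, self.pending = self.pending, []
        for attempt in range(self.max_retries + 1):
            try:
                self.write_chunk(chunk)
                return
            except IOError:
                if attempt < self.max_retries:
                    self.sleep(self.base_wait * (2 ** attempt))
        self.failed.append(chunk)       # give up; keep the chunk for inspection
```

Because only the failed chunk is retried, a temporary outage of the destination delays a small piece of the stream instead of forcing a whole batch to be resent.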

Architecture (v0.12 or later)
> Engine (not pluggable): Input -> Filter -> Buffer -> Output
> Input: Forward, File tail, ...
> Filter: grep, record_transformer, ...
> Buffer: File, Memory
> Output: Forward, File, ...
> Parser / Formatter
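
The flow of an event through filters to an output can be sketched in a few lines. This is a simplified model, not Fluentd's plugin API: the functions `grep`, `add_fields`, and `run_pipeline` are hypothetical, loosely modeled on the grep and record_transformer filters, and buffering is left out. An event is represented as a (tag, time, record) triple.

```python
import re


def grep(key, pattern):
    """Keep only records whose `key` matches `pattern`; others are dropped."""
    regex = re.compile(pattern)

    def apply(tag, time, record):
        return record if regex.search(str(record.get(key, ""))) else None

    return apply


def add_fields(**fields):
    """Return a copy of each record with fixed fields added."""

    def apply(tag, time, record):
        return {**record, **fields}

    return apply


def run_pipeline(events, filters, output):
    """Pass each (tag, time, record) event through the filters, then to the output."""
    for tag, time, record in events:
        for f in filters:
            record = f(tag, time, record)
            if record is None:      # a filter dropped the event
                break
        else:
            output(tag, time, record)


events = [
    ("web.access", 1, {"path": "/login", "code": 200}),
    ("web.access", 2, {"path": "/health", "code": 200}),
]
sink = []
run_pipeline(
    events,
    [grep("path", r"^/login"), add_fields(host="web01")],
    lambda tag, time, record: sink.append((tag, time, record)),
)
```

Here `sink` ends up holding only the `/login` event, with the extra `host` field added by the second filter.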

Other Fluentd related OSS
> Treasure Agent
> https://github.com/treasure-data/omnibus-td-agent
> Fluentd Forwarder
> https://github.com/fluent/fluentd-forwarder
> Simple forwarder for Windows / Leaf node
> Fluentd UI
> https://github.com/fluent/fluentd-ui
> Management web UI

Other OSS products
> Scribed (C++)
> Developed by Facebook
> No longer maintained
> Apache Flume (Java)
> Mainly for Hadoop HDFS / HBase
> Logstash (JRuby)
> Mainly for Elasticsearch

Embulk
> Bulk Loader version of Fluentd
> Pluggable architecture
> JRuby, JVM languages (TBD)
> High performance parallel processing
> Share your script as a plugin
> https://github.com/embulk
http://www.slideshare.net/frsyuki/embuk-making-data-integration-works-relaxed

[Diagram: Embulk bulk loads data between systems through input and output plugins]
> HDFS, MySQL, Amazon S3, CSV Files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, Redis
✓ Parallel execution
✓ Data validation
✓ Error recovery
✓ Deterministic behaviour
✓ Idempotent retrying
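
Parallel execution combined with idempotent retrying can be sketched as follows. This is not Embulk's implementation or plugin API: `run_with_retry`, `bulk_load`, and the `write(task_id, rows)` callable are hypothetical names. The assumption that makes a retry safe is that each task writes under its own task id and overwrites whatever a previous attempt left behind.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(task_id, rows, write, max_attempts=3):
    """Run one load task, retrying on IOError; returns the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            write(task_id, rows)
            return attempt
        except IOError:
            if attempt == max_attempts:
                raise


def bulk_load(tasks, write, workers=4):
    """Run all tasks in parallel. `tasks` maps task id -> list of rows."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            task_id: pool.submit(run_with_retry, task_id, rows, write)
            for task_id, rows in tasks.items()
        }
        return {task_id: future.result() for task_id, future in futures.items()}


# Toy destination: a dict keyed by task id, so a re-run overwrites instead of appending.
destination = {}
already_failed = set()


def write(task_id, rows):
    destination[task_id] = list(rows)
    # Simulate a connection loss after the write for task 1, first attempt only.
    if task_id == 1 and task_id not in already_failed:
        already_failed.add(task_id)
        raise IOError("connection lost after write")


attempts = bulk_load({0: ["a", "b"], 1: ["c"], 2: ["d", "e"]}, write)
```

Task 1 is written twice, but because the write is keyed by task id the destination holds its rows only once.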

3 query engines in TD
> Hive (HiveQL, Batch)
> for ETL and large jobs
> Hivemall for machine learning
> Pig (Pig Latin, Batch)
> DataFu for data mining and statistics
> Presto (SQL, Short batch)
> for Ad hoc queries

Hadoop
> Distributed computing framework
> Consists of many components…
http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/

Apache Tez
> Low level framework for YARN applications
> New Query Engine
> Provide good IR for Hive, Pig and more
> Task and DAG based pipelining
> Task: Input -> Processor -> Output; tasks are connected into a DAG
http://tez.apache.org/

Hive on MR vs. Hive on Tez
http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9
SELECT g1.x, g1.avg, g2.cnt
FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
ON (g1.x = g2.x) ORDER BY avg;
> MapReduce: GROUP a BY a.x, GROUP b BY b.x, JOIN (a, b), and ORDER BY each run as a separate map / reduce job, with an HDFS write between jobs
> Tez: the same operators run as a single DAG of map and reduce tasks
> Avoid unnecessary HDFS write
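
The query on this slide can be sanity-checked on toy data. The sketch below uses Python's built-in sqlite3 module rather than Hive, writes the aggregate as AVG, and gives the count its own alias `cnt` so that the statement runs; the table contents are made up for illustration. It shows the four logical operators (two GROUP BYs, a JOIN, and an ORDER BY) that MapReduce would run as separate jobs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (x TEXT, y REAL);
    CREATE TABLE b (x TEXT, y REAL);
    INSERT INTO a VALUES ('k1', 1.0), ('k1', 3.0), ('k2', 10.0);
    INSERT INTO b VALUES ('k1', 5.0), ('k2', 6.0), ('k2', 7.0), ('k2', 8.0);
    """
)

# Two grouped subqueries, joined on x and ordered by the average.
rows = conn.execute(
    """
    SELECT g1.x, g1.avg, g2.cnt
    FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
    JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
    ON (g1.x = g2.x)
    ORDER BY avg
    """
).fetchall()

for row in rows:
    print(row)
```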

Other OSS products
> Apache Spark
> Mainly for in-memory processing
> Spark ecosystem is now growing
> Apache Flink
> Mainly for iterative processing
> Microsoft’s Dryad
> This was premature for human beings…

Presto
A distributed SQL query engine 
for interactive data analysis 
against GBs to PBs of data.

Presto overview
> Open sourced by Facebook
> http://prestodb.io/
> written in Java
> Built-in useful features
> Connectors
> Machine Learning
> Window function
> Approximate query
> etc…
> Used by Netflix, Dropbox, Treasure Data, Qubole, Airbnb, LINE, GREE, Scaleout, etc.

[Diagram: two separate platforms]
> Batch analysis platform: HDFS, Hive (Daily/Hourly Batch)
> Visualization platform: PostgreSQL, etc. (Interactive query), Commercial BI Tools, Dashboard

[Diagram: problems with two separate platforms]
> Batch analysis platform: HDFS, Hive (Daily/Hourly Batch)
> Visualization platform: PostgreSQL, etc. (Interactive query), Commercial BI Tools, Dashboard
✓ Less scalable
✓ Extra cost
✓ More work to manage 2 platforms
✓ Can’t query against “live” data directly

[Diagram: Presto takes the place of PostgreSQL, etc.]
> Before: HDFS, Hive (Daily/Hourly Batch) -> PostgreSQL, etc. (Interactive query) -> Dashboard
> After: HDFS, Hive (Daily/Hourly Batch) -> Presto (Interactive query) -> Dashboard

[Diagram: Presto as a single data analysis platform]
> Presto: SQL on any data sets (HDFS, Hive, Cassandra, MySQL, Commercial DBs)
> Daily/Hourly Batch and Interactive query on the same platform
> Dashboard and Commercial BI Tools
✓ IBM Cognos
✓ Tableau
✓ ...

MapReduce vs. Presto
> MapReduce: map -> disk -> reduce -> disk -> map -> reduce
✓ Write data to disk
✓ Wait between stages
> Presto: task -> task with memory-to-memory data transfer
✓ No disk IO
✓ Data chunk must fit in memory
> All stages are pipelined
✓ No wait time
✓ No fault-tolerance
Other OSS products
> Cloudera Impala
> Mainly for HDFS / HBase
> Apache Drill
> More ﬂexible architecture
> Apache Tajo
> For building data warehouse

Hmm…
> There are no popular OSS visualization products
> We don’t focus on developing a visualization tool for now
> Commercial BI tools are popular
> Tableau, Motion Board, etc.
> Maybe the next presentation will cover this area in depth

Treasure Data resources
> https://github.com/treasure-data
> perfectqueue, perfectsched, etc
> https://sql.treasuredata.com/
> HiveQL syntax checker
> https://examples.treasuredata.com/
> Query catalog
http://blog.treasuredata.com/2014/11/26/12-open-source-software-innovations-from-treasure-data-engineers/

Check: treasuredata.com
Cloud service for the entire data pipeline
