Treasure Data and OSS
- 2. Who are you?
> Masahiro Nakagawa
> github/twitter: @repeatedly
> Treasure Data, Inc.
> Senior Software Engineer
> Fluentd / td-agent developer
> I love OSS :)
> D language - Phobos committer
> Fluentd - Main maintainer
> MessagePack / RPC - D and Python (only RPC)
> The organizer of several meetups (Presto, DTM, etc)
> etc…
- 3. Company background
• Founded 2011 in Mountain View, CA
– The first cloud service for the entire data pipeline
– Including: Acquisition, Storage, & Analysis
• Provide a “Cloud Data Service”
– Fast Time to Value
– Cloud Flexibility and Economics
– Simple and Well Supported
• Treasure Data has over 100 customers in production
– Incl. Fortune 500 companies
– 400k new records / second
– Almost 9 Trillion records loaded
– Variety of use cases and verticals
The Treasure Data Team
Hiro Yoshikawa – CEO
Open source business veteran
Kaz Ohta – CTO
Founder of world’s largest Hadoop Group
Sada Furuhashi – Software Architect
MessagePack / Fluentd Author
Notable Investors
Othman Laraki
Ex-VP of Growth at Twitter
Jerry Yang
Founder of Yahoo!
Yukihiro “Matz” Matsumoto
Creator of “Ruby” programming language
James Lindenbaum
Founder of Heroku
Sierra Ventures - Tim Guleri
Leading venture capital firm in Big Data
- 4. TD Service Architecture
[Diagram: Treasure Data service architecture, Acquire → Store → Analyze]
> Data sources: Web Log, App Log, Sensor, CRM, ERP, RDBMS, POS
> Acquire: Treasure Agent (Server), SDK (JS, Android, iOS, Unity), Streaming Collector, Bulk Uploader (Embulk, TD Toolbelt)
> Store: Plazma DB (Flexible, Scalable, Columnar Storage)
> Analyze: SQL-based query @AWS or @IDCF (Batch / Reliability, Ad-hoc / Low latency)
> Interfaces: REST API, ODBC / JDBC, SQL, Pig
> Result Push (send query result)
> KPI Dashboard: Metric Insights
> BI Tools: Tableau, Motion Board, etc.
> Other Products: RDBMS, Google Docs, AWS S3, FTP Server, etc.
> Time to Value, Economy & Flexibility, Simple & Supported, Connectivity
- 6. Log collecting in TD
> Treasure Agent
> Fluentd based log collector
> Embulk
> JavaScript SDK
> Mobile SDK (iOS, Android, Unity)
- 8. Fluentd
> Data collector for unified logging layer
> Streaming data transfer based on JSON
> Written in Ruby
> Various gem-based plugins
> http://www.fluentd.org/plugins
> Working in production
> http://www.fluentd.org/testimonials
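To make "streaming data transfer based on JSON" concrete: a Fluentd event is conceptually a tag, a timestamp, and a JSON-like record. The sketch below only illustrates that shape in Python; the helper name `make_event`, the tag `apache.access`, and the record fields are hypothetical and are not part of Fluentd's API.

```python
import json
import time


def make_event(tag, record, timestamp=None):
    """Build a Fluentd-style event: [tag, unix_time, record]."""
    if timestamp is None:
        timestamp = int(time.time())
    return [tag, timestamp, record]


# Hypothetical access-log event with a fixed timestamp for reproducibility.
event = make_event(
    "apache.access",
    {"path": "/index.html", "code": 200},
    timestamp=1420070400,
)

# Serialize the event as one JSON line, ready to be streamed.
line = json.dumps(event)
print(line)
```

The tag is what later stages use to route the event, while the record stays schema-free.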
- 10. Data Analytics Flow
[Diagram: data analytics flow, Collect → Store → Process → Visualize]
> Collect: ???
> Store / Process: Cloudera, Hortonworks, Treasure Data
> Visualize: Tableau, Excel, R
> easier & shorter time
- 11. Divide & Conquer & Retry
[Diagram: Batch, Stream, and Other stream flows, each with error → retry steps]
- 12. Core / Plugins
> Core
> Divide & Conquer
> Buffering & Retrying
> Error handling
> Message routing
> Parallelism
> Plugins
> read / receive data from API, database, command, etc…
> write / send data to API, database, alert, graph, etc…
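The "Buffering & Retrying" idea can be sketched as follows: records are collected into chunks, and a chunk that fails to send is retried with a growing wait. This is a minimal illustration in Python, not Fluentd's actual implementation; the class `BufferedOutput`, its parameters, and the `flaky_send` destination are all hypothetical.

```python
import time


class BufferedOutput:
    """Collect records into chunks and retry failed sends with backoff."""

    def __init__(self, send, chunk_limit=3, max_retries=4,
                 base_wait=0.0, sleep=time.sleep):
        self.send = send                # callable that delivers one chunk
        self.chunk_limit = chunk_limit  # records per chunk
        self.max_retries = max_retries  # retries after the first attempt
        self.base_wait = base_wait      # seconds; doubled on every retry
        self.sleep = sleep
        self.chunk = []
        self.failed_chunks = []         # chunks that exhausted all retries

    def emit(self, record):
        self.chunk.append(record)
        if len(self.chunk) >= self.chunk_limit:
            self.flush()

    def flush(self):
        if not self.chunk:
            return
        chunk, self.chunk = self.chunk, []
        for attempt in range(self.max_retries + 1):
            try:
                self.send(chunk)
                return
            except IOError:
                if attempt < self.max_retries:
                    # Exponential backoff before the next attempt.
                    self.sleep(self.base_wait * (2 ** attempt))
        self.failed_chunks.append(chunk)


delivered = []
calls = {"n": 0}


def flaky_send(chunk):
    """Hypothetical destination that fails on its first two calls."""
    calls["n"] += 1
    if calls["n"] <= 2:
        raise IOError("destination unavailable")
    delivered.append(list(chunk))


out = BufferedOutput(flaky_send, chunk_limit=2, base_wait=0.0)
for i in range(4):
    out.emit({"id": i})
print(delivered)
```

Because the chunk, not the single record, is the unit of retry, a temporary outage at the destination does not block the input side.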
- 13. Architecture (v0.12 or later)
[Diagram: Engine connecting Input → Filter → Buffer → Output]
> Engine: not pluggable
> Input: Forward, File tail, ...
> Filter: grep, record_transformer, …
> Buffer: File, Memory
> Output: Forward, File, ...
> Parser, Formatter
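The Input → Filter → Output flow with tag-based routing might look like the sketch below. It is a simplified model in Python, not Fluentd code: `Engine`, `grep_filter`, and `add_field` are hypothetical names, and `fnmatchcase` is used as a stand-in for Fluentd's tag patterns (its `*` also matches dots, which is looser than Fluentd's matching).

```python
from fnmatch import fnmatchcase


def grep_filter(key, needle):
    """Keep only records whose `key` field contains `needle`."""
    def apply(record):
        return record if needle in str(record.get(key, "")) else None
    return apply


def add_field(key, value):
    """Return a copy of the record with one extra field."""
    def apply(record):
        updated = dict(record)
        updated[key] = value
        return updated
    return apply


class Engine:
    """Route events through matching filters, then to the first matching output."""

    def __init__(self):
        self.filters = []
        self.outputs = []

    def add_filter(self, pattern, fn):
        self.filters.append((pattern, fn))

    def add_output(self, pattern, sink):
        self.outputs.append((pattern, sink))

    def emit(self, tag, record):
        # Every filter whose pattern matches the tag is applied in order.
        for pattern, fn in self.filters:
            if fnmatchcase(tag, pattern):
                record = fn(record)
                if record is None:
                    return  # dropped by a filter
        # Only the first matching output receives the event.
        for pattern, sink in self.outputs:
            if fnmatchcase(tag, pattern):
                sink.append((tag, record))
                return


engine = Engine()
engine.add_filter("app.*", grep_filter("level", "error"))
engine.add_filter("app.*", add_field("host", "web01"))

app_sink, other_sink = [], []
engine.add_output("app.*", app_sink)
engine.add_output("*", other_sink)

engine.emit("app.web", {"level": "error", "msg": "boom"})
engine.emit("app.web", {"level": "info", "msg": "ok"})
engine.emit("sys.cpu", {"load": 0.5})
print(app_sink, other_sink)
```

Here plain lists stand in for outputs; in a fuller sketch each output would own a buffer like the one described on the previous slide.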
- 16. Other Fluentd related OSS
> Treasure Agent
> https://github.com/treasure-data/omnibus-td-agent
> Fluentd Forwarder
> https://github.com/fluent/fluentd-forwarder
> Simple forwarder for Windows / Leaf node
> Fluentd UI
> https://github.com/fluent/fluentd-ui
> Management web UI
- 17. Other OSS products
> Scribe (C++)
> Developed by Facebook
> No longer maintained
> Apache Flume (Java)
> Mainly for Hadoop HDFS / HBase
> Logstash (JRuby)
> Mainly for Elasticsearch
- 18. Embulk
> Bulk Loader version of Fluentd
> Pluggable architecture
> JRuby, JVM languages (TBD)
> High performance parallel processing
> Share your script as a plugin
> https://github.com/embulk
http://www.slideshare.net/frsyuki/embuk-making-data-integration-works-relaxed
- 21. 3 query engines in TD
> Hive (HiveQL, Batch)
> for ETL and large jobs
> Hivemall for machine learning
> Pig (Pig Latin, Batch)
> DataFu for data mining and statistics
> Presto (SQL, Short batch)
> for Ad hoc queries
- 22. Hadoop
> Distributed computing framework
> Consists of many components…
http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/
- 25. Apache Tez
> Low level framework for YARN applications
> New Query Engine
> Provides good IR for Hive, Pig and more
> Task and DAG based pipelining
[Diagram: a Task is Input → Processor → Output; a DAG connects Tasks]
http://tez.apache.org/
- 26. Hive on MR vs. Hive on Tez
http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9
SELECT g1.x, g1.avg, g2.cnt
FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
ON (g1.x = g2.x) ORDER BY avg;
[Diagram: MapReduce vs. Tez execution plans for the query above, drawn as map (M) and reduce (R) tasks]
> MapReduce: GROUP a BY a.x, GROUP b BY b.x, JOIN (a, b), and ORDER BY run as separate M / R stages, with HDFS writes between them
> Tez: the same GROUP BY, JOIN (a, b), and ORDER BY steps run as one DAG of M and R tasks
> Avoid unnecessary HDFS writes
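To show what the stages of this query compute (two GROUP BYs, a JOIN, and an ORDER BY), here is a small in-memory equivalent in Python. The tables `a` and `b` hold made-up sample rows, and the helper names are hypothetical; this only mirrors the logical plan, not how Hive, MapReduce, or Tez execute it.

```python
from collections import defaultdict

# Made-up sample rows for tables a and b.
a = [{"x": 1, "y": 10.0}, {"x": 1, "y": 20.0}, {"x": 2, "y": 5.0}]
b = [{"x": 1, "y": "p"}, {"x": 2, "y": "q"},
     {"x": 2, "y": "r"}, {"x": 3, "y": "s"}]


def group_avg(rows):
    """Stage 1: average of y per x (GROUP a BY a.x)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["x"]].append(row["y"])
    return {x: sum(ys) / len(ys) for x, ys in groups.items()}


def group_count(rows):
    """Stage 2: count of non-null y per x (GROUP b BY b.x)."""
    counts = defaultdict(int)
    for row in rows:
        if row["y"] is not None:
            counts[row["x"]] += 1
    return dict(counts)


g1 = group_avg(a)
g2 = group_count(b)

# Stage 3: inner JOIN on x. Stage 4: ORDER BY avg.
joined = [(x, g1[x], g2[x]) for x in g1 if x in g2]
result = sorted(joined, key=lambda row: row[1])
print(result)
```

Each function corresponds to one box in the plan; the difference between the two engines is whether the intermediate results (`g1`, `g2`, `joined`) are written to HDFS between steps.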
- 27. Other OSS products
> Apache Spark
> Mainly for in-memory processing
> Spark ecosystem is now growing
> Apache Flink
> Mainly for iterative processing
> Microsoft’s Dryad
> This was ahead of its time…
- 29. Presto overview
> Open sourced by Facebook
> http://prestodb.io/
> written in Java
> Built-in useful features
> Connectors
> Machine Learning
> Window function
> Approximate query
> etc…
> Used by Netflix, Dropbox, Treasure Data, Qubole, Airbnb, LINE, GREE, Scaleout, etc.
- 31. HDFS
[Diagram: HDFS → Hive → PostgreSQL, etc. → Commercial BI Tools, Dashboard]
> Batch analysis platform: HDFS, Hive (Daily/Hourly Batch)
> Visualization platform: PostgreSQL, etc. (Interactive query), Commercial BI Tools, Dashboard
✓ Less scalable
✓ Extra cost
✓ More work to manage 2 platforms
✓ Can’t query against “live” data directly
- 34. MapReduce vs. Presto
[Diagram: MapReduce map and reduce stages with disk between them vs. Presto tasks connected directly]
> MapReduce
> Write data to disk
> Wait between stages
> Presto
> All stages are pipelined
✓ No wait time
✓ No fault-tolerance
> memory-to-memory data transfer
✓ No disk IO
✓ Data chunk must fit in memory
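The contrast between staged and pipelined execution can be illustrated with a toy example in Python. This is only an analogy under simplified assumptions: a temporary file stands in for the disk between MapReduce stages, and generators stand in for Presto's memory-to-memory transfer. The function names `staged` and `pipelined` are hypothetical.

```python
import json
import os
import tempfile


def staged(rows):
    """MapReduce-style: finish a stage, write it to disk, then start the next."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stage1.json")
        mapped = [r * 2 for r in rows]   # stage 1 runs to completion
        with open(path, "w") as f:
            json.dump(mapped, f)         # write data to disk
        with open(path) as f:
            loaded = json.load(f)        # stage 2 waits, then reads it back
        return sum(v for v in loaded if v % 3 == 0)


def pipelined(rows):
    """Presto-style: each value flows through all stages without touching disk."""
    mapped = (r * 2 for r in rows)                 # lazy stage 1
    filtered = (v for v in mapped if v % 3 == 0)   # lazy stage 2
    return sum(filtered)


print(staged(range(10)), pipelined(range(10)))
```

Both functions return the same result; the staged version can restart from the file after a failure, while the pipelined version avoids the disk round trip but has nothing to resume from.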
- 35. Other OSS products
> Cloudera Impala
> Mainly for HDFS / HBase
> Apache Drill
> More flexible architecture
> Apache Tajo
> For building data warehouse
- 37. Hmm…
> There are no popular OSS products
> We don’t focus on developing visualization tools for now
> Commercial BI tools are popular
> Tableau, Motion Board, etc.
> Maybe the next presentation will cover this area in depth
- 38. Treasure Data resources
> https://github.com/treasure-data
> perfectqueue, perfectsched, etc
> https://sql.treasuredata.com/
> HiveQL syntax checker
> https://examples.treasuredata.com/
> Query catalog
http://blog.treasuredata.com/2014/11/26/12-open-source-software-innovations-from-treasure-data-engineers/