Fighting Against Chaotically Separated Values with Embulk
- 2. A little about me…
Sadayuki Furuhashi
An open-source hacker.
github: @frsyuki
A founder of Treasure Data, Inc. located in Silicon Valley.
OSS projects I founded:
Fluentd - Unified log collection infrastructure
Embulk - Plugin-based ETL tool
- 4. What’s Embulk?
> An open-source parallel bulk data loader
> loads records from “A” to “B”
> using plugins
> for various kinds of “A” and “B”
> to make data loading easy.
> which was very painful…
(“A” and “B”: Storage, RDBMS, NoSQL, Cloud Services, etc.)
(the pains: broken records, transactions (idempotency), performance, …)
- 5. The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 1. Run a script → fails!
> 2. Improve the script to normalize records
• Convert “2015-01-27T19:05:00Z” → “2015-01-27 19:05:00 UTC”
• Convert “\N” → “” (NULL marker)
• …and many more normalizations
> 3. Second attempt → another error!
• Convert “Inf” → “Infinity”
> 4. Improve the script, fix, retry, fix, retry…
> 5. Oh, some data are loaded twice!?
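The normalizations above are the kind of ad-hoc rules such a script accumulates. A hypothetical Ruby sketch (the function name and rules are illustrative, not part of Embulk):

```ruby
# Hypothetical per-field normalization a hand-written loader ends up with.
def normalize_field(value)
  case value
  when /\A(\d{4}-\d{2}-\d{2})T(\d{2}:\d{2}:\d{2})Z\z/
    "#{$1} #{$2} UTC"   # ISO 8601 → a form PostgreSQL's COPY accepts
  when "\\N"
    ""                  # MySQL-style NULL marker → empty string
  when "Inf"
    "Infinity"          # the spelling PostgreSQL accepts
  else
    value
  end
end
```

Every new input file tends to add another `when` branch, which is exactly the "fix, retry, fix, retry" loop described above.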
- 6. The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 6. Ok, the script worked well today.
> 7. Register it to cron to sync data every day.
> 8. One day… it fails with another error
• Convert invalid UTF-8 byte sequence to U+FFFD
- 7. The pains of bulk data loading
Example: load 10GB CSV × 720 files
> Most scripts are slow.
• People have little time to optimize bulk load scripts
> One file takes 1 hour → 720 files take 1 month (!?)
A lot of effort for each format & storage:
> XML, JSON, Apache log format (+some custom), …
> SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile…
> MongoDB, Elasticsearch, Redshift, Salesforce, …
- 8. The problems:
> Difficult to parse files correctly
> How is the CSV file formatted?
> Complex error handling
> How to detect and remove broken records robustly?
> Transactional load, or idempotent retrying
> How to retry without duplicated loading?
> Hard to optimize performance
> How to parallelize the bulk data loading?
> Many formats & storages in the world
> How to save my time?
- 9. The problems at Treasure Data
What’s “Treasure Data”?
> “Fast, powerful SQL access to big data from connected
applications and products, with no new infrastructure or
special skills required.”
> Customers want to try Treasure Data, but
> SEs write scripts to bulk load their data. Hard work :(
> Customers want to migrate their big data, but
> It’s hard work :(
> Fluentd solved streaming data collection, but
> bulk data loading is another problem.
- 10. The solution (IMPORTANT!)
Embulk is an open-source, plugin-based, parallel bulk data loader
that makes data loading easy and fast.
- 11. Embulk connects many systems
Amazon S3, MySQL, FTP, CSV files, access logs, Salesforce.com,
Elasticsearch, Cassandra, Hive, Redis, …
A reliable framework :-) with parallel execution, transactions,
auto guess, …and many more by plugins.
- 13. $ embulk selfupdate
$ embulk example demo
$ vi demo/csv/sample_01.csv.gz
$ embulk guess demo/seed.yml -o config.yml
$ embulk run config.yml
$ vi config.yml
$ embulk run config.yml
out:
type: postgresql
host: localhost
user: pg
password: ''
database: embulk_demo
table: sample1
mode: replace
:%s/,/\t/g
:%s/""/"/g
# Created by Sada
# This is a comment
\N
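For reference, a full config.yml for this demo looks roughly like the following. The paths, columns, and parser options here are illustrative, not the exact output of `embulk guess`:

```yaml
in:
  type: file
  path_prefix: demo/csv/sample_
  decoders:
  - {type: gzip}
  parser:
    type: csv
    charset: UTF-8
    skip_header_lines: 1
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
out:
  type: postgresql
  host: localhost
  user: pg
  password: ''
  database: embulk_demo
  table: sample1
  mode: replace
```

`embulk guess` fills in the `in:` section (format, delimiter, column types) from the seed file; editing `out:` is then the only manual step.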
- 17. Examples of Plugins (input)
File Input:
Amazon S3, Google Cloud Storage, HDFS, Riak CS, SCP, FTP, …
File Parser:
CSV, JSON, MessagePack, Excel, Apache common logs, pcap format,
XML / XPath, regexp, grok, …
Input:
PostgreSQL, MySQL, Oracle, Vertica, Redis, Amazon Redshift,
Amazon DynamoDB, Salesforce.com, JIRA, Mixpanel, …
- 18. Examples of Plugins (output)
File Output:
Amazon S3, Google Cloud Storage, HDFS, SFTP, SCP, FTP, …
File Formatter:
CSV, JSON, MessagePack, Excel, …
Output:
PostgreSQL, MySQL, Oracle, Vertica, Redis, Amazon Redshift,
Elasticsearch, Salesforce.com, Treasure Data, BigQuery, …
- 19. Examples of Plugins (filters)
> Filtering columns out by conditions
> Extracting values from a JSON column into columns (JSON flattening)
> Converting User-Agent strings to browser name, OS name, etc.
> Parsing query strings (“?k1=v1&k2=v2…”) into columns
> Applying a SHA1 hash to a column
…
- 20. Use case 1: Sync PostgreSQL to Elasticsearch
PostgreSQL → column filter → encrypt filter → Elasticsearch
> embulk-input-postgresql
> embulk-filter-column: remove unnecessary columns
> embulk-filter-encrypt: encrypt password columns
> embulk-output-elasticsearch
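As a sketch, the whole pipeline can be described in one config file. The option names below follow the individual plugin READMEs and should be treated as illustrative; check each plugin's documentation for the exact keys:

```yaml
in:
  type: postgresql
  host: localhost
  user: pg
  password: ''
  database: app
  table: users
filters:
  - type: column        # embulk-filter-column: keep only needed columns
    columns:
      - {name: id}
      - {name: email}
      - {name: password}
  - type: encrypt       # embulk-filter-encrypt: protect sensitive columns
    algorithm: AES-256-CBC
    column_names: [password]
    key_hex: '...'      # placeholder; supply your own key
out:
  type: elasticsearch
  nodes:
    - {host: localhost, port: 9300}
  index: users
```

Filters run between input and output on every record, so the same pipeline works unchanged whether the sync is a one-off or a scheduled incremental load.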
- 21. Use case 2: Load CSV on S3 to Analytics
csv.gz on S3 → Treasure Data / BigQuery / Redshift
> embulk-input-s3 + embulk-decoder-gzip + embulk-parser-csv
> embulk-output-td / embulk-output-bigquery / embulk-output-redshift
> Distributed execution on Hadoop: embulk-executor-mapreduce
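A sketch of this use case as a config file; bucket names, credentials, and columns are placeholders, and the option keys follow the plugin READMEs rather than being authoritative:

```yaml
in:
  type: s3
  bucket: my-bucket            # placeholder
  path_prefix: logs/2015-01-   # placeholder
  access_key_id: '...'
  secret_access_key: '...'
  decoders:
    - {type: gzip}             # embulk-decoder-gzip
  parser:
    type: csv                  # embulk-parser-csv
    columns:
      - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
      - {name: path, type: string}
      - {name: status, type: long}
out:
  type: td                     # or bigquery / redshift
  apikey: '...'
  database: logs
  table: access
exec:
  type: mapreduce              # embulk-executor-mapreduce: run tasks on Hadoop
```

Swapping the destination is a one-block change to `out:`; the input, decoder, and parser configuration stays identical.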
- 22. Use case 3: Embulk as a Service at Treasure Data
> Bulk load jobs run as a service, triggered by a REST API call
(e.g. loading data from MySQL).
- 24. Plugin API
> A plugin is written in Java or Ruby (JRuby).
> A plugin implements “transaction” and “task”.
> transaction controls the entire bulk loading session.
> create a destination table, create a directory,
commit the transaction, etc.
> transaction creates multiple tasks.
> tasks load data.
> Embulk runs tasks in parallel.
> Embulk retries tasks if necessary.
- 25. Transaction stage & Task stage
Transaction → Task × taskCount
> The transaction runs on a single thread; tasks run on multiple threads
(or machines).
> The transaction creates taskCount task descriptions:
{ taskIndex: 0, task: {…} }, { taskIndex: 1, task: {…} }, …
- 28. Parallel execution of tasks
Task queue → threads → run tasks in parallel
(embulk-executor-local-thread)
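The local-thread executor above can be modeled as a toy in plain Ruby. This is only the shape of the transaction/task split, not the actual Embulk plugin API:

```ruby
# Toy model: the transaction stage runs once on a single thread and
# decides how many tasks to create; tasks then run in parallel threads,
# each returning a commit report (embulk-executor-local-thread style).
def transaction(config)
  task_count = config[:task_count]
  tasks = (0...task_count).map { |i| { taskIndex: i, task: config[:task] } }
  threads = tasks.map { |t| Thread.new { run_task(t) } }
  threads.map(&:value)  # collect one commit report per task
end

def run_task(t)
  # A real task would read, convert, and load records here.
  { taskIndex: t[:taskIndex], rows: 100 }
end

reports = transaction(task_count: 3, task: {})
```

Because each task is an independent, serializable description, the same model extends to running tasks on remote machines, which is what the MapReduce executor does.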
- 29. Distributed execution of tasks
Task queue → Map tasks → run tasks on Hadoop
(embulk-executor-mapreduce)
- 30. Distributed execution (w/ partitioning)
Task queue → Map - Shuffle - Reduce → run tasks on Hadoop
(embulk-executor-mapreduce)
Useful to partition data by hour or day before loading it into storage.
- 32. What’s added since the first release?
• v0.3 (Feb, 2015)
• Resuming
• Filter plugin type
• v0.4 (Feb, 2015)
• Plugin template generator
• Incremental load (ConfigDiff)
• Isolated ClassLoaders for Java plugins
• Polyglot command launcher
- 33. What’s added since the first release?
• v0.6 (Apr, 2015)
• Executor plugin type
• Liquid template engine
• v0.7 (Aug, 2015)
• EmbulkEmbed & Embulk::Runner
• Plugin bundle (embulk-mkbundle)
• JRuby 9000
• Gradle v2.6
- 34. What’s added since the first release?
• v0.8 (Jan, 2016)
• JSON column type
• Page scattering for more parallel execution
- 35. Future plan
• v0.9
• Error plugin type (#27)
• Stats & metrics (#199)
• v0.10
• More Guess (#242, #235)
• Multiple jobs using a single config file (#167)
- 37. Plugin Version Conflicts
Embulk Core (Java Runtime):
> embulk-input-s3.jar depends on aws-sdk.jar v1.9
> embulk-output-redshift.jar depends on aws-sdk.jar v1.10
→ Version conflicts!
- 38. Avoiding Conflicts in JVM
Embulk Core (Java Runtime):
> Class Loader 1: embulk-input-s3.jar + aws-sdk.jar v1.9
> Class Loader 2: embulk-output-redshift.jar + aws-sdk.jar v1.10
→ Isolated environments, no conflicts
- 41. Header of embulk.jar
: <<BAT
@echo off
setlocal
set this=%~f0
set java_args=
rem ...
java %java_args% -jar %this% %args%
exit /b %ERRORLEVEL%
BAT
# ...
exec java $java_args -jar "$0" "$@"
exit 127
PK...
- 42. embulk.jar is a shell script
> The “: <<BAT … BAT” block is a heredoc passed as the argument of the
“:” command, which does nothing, so the shell skips the Windows part.
> #!/bin/sh is optional: an empty first line makes it run as a shell script.
> “exec java … -jar "$0"” runs the file itself as a jar; the shell script
exits here, so the following binary data is never parsed.
- 43. embulk.jar is a bat file
> On Windows, a line starting with “:” is a comment, so “: <<BAT” is ignored
and the batch commands run normally.
> “exit /b %ERRORLEVEL%” ends the .bat here; the following lines are ignored.
- 44. embulk.jar is a jar file
> The jar (zip) format tolerates leading headers: file entries are located
via the central directory at the end of the file.
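The same trick can be reproduced with a minimal polyglot. This sketch covers only the unix side (the real embulk header also carries the bat block shown above, and its payload is a full jar rather than dummy bytes):

```shell
# Build a tiny sh/binary polyglot: the shell reads the file line by line
# up to "exit 0" and never sees the trailing payload (standing in for
# the jar's PK... bytes).
cat > polyglot <<'EOF'
: <<BAT
@echo off
rem Windows-only lines live inside the heredoc
BAT
echo "hello from shell"
exit 0
EOF
printf 'PK...binary payload...' >> polyglot
chmod +x polyglot
sh polyglot
```

The `exit 0` is essential: without it the shell would keep reading and try to execute the binary payload as commands.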
- 45. Type conversion
Input type system → Embulk type system → Output type system
Input types (e.g. PostgreSQL):
boolean, integer, bigint, double precision, text, varchar, date,
timestamp, timestamp with time zone, …
Embulk types:
boolean, long, double, string, timestamp
Output types (e.g. Elasticsearch):
boolean, integer, long, float, double, string, array,
geo point, geo shape, …
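As a sketch, the destination side of this conversion is essentially a lookup table from Embulk types to destination types. The PostgreSQL names below are illustrative defaults; real output plugins also let users override the type per column:

```ruby
# Illustrative default mapping from Embulk column types to PostgreSQL
# column types (not the exact table any particular plugin uses).
EMBULK_TO_POSTGRESQL = {
  boolean:   "boolean",
  long:      "bigint",
  double:    "double precision",
  string:    "text",
  timestamp: "timestamp with time zone",
}.freeze

def pg_type_for(embulk_type)
  EMBULK_TO_POSTGRESQL.fetch(embulk_type) do
    raise ArgumentError, "no default mapping for #{embulk_type}"
  end
end
```

Narrowing every input and output down to the five Embulk types is what lets any input plugin pair with any output plugin: each side only has to map to and from this small common set.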