Embulk Internals
Sadayuki Furuhashi
Founder & Software Architect, Treasure Data, Inc.
Execution overview
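[Diagram: one Transaction fans out into taskCount Tasks. Each Task receives its own index and the shared task configuration, for example {taskIndex: 0, task: {…}} and {taskIndex: 2, task: {…}}.]
The Transaction runs on a single thread. Tasks run on multiple threads (or machines).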
Parallel execution
[Diagram: Tasks wait in a task queue and are picked up by threads.]
Run tasks in parallel (embulk-executor-local-thread).
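
The queue-and-threads picture can be sketched with a thread pool. This is only an illustration in Python, not the code of embulk-executor-local-thread: `run_task`, `run_in_parallel`, and the `files` field are hypothetical names, and the pool's internal work queue stands in for the task queue.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(task_index, task):
    # Hypothetical task body: each index handles one slice of the shared task.
    return {"taskIndex": task_index, "input": task["files"][task_index]}


def run_in_parallel(task, task_count, max_threads=4):
    # The pool's internal work queue plays the role of the task queue.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = [pool.submit(run_task, i, task) for i in range(task_count)]
        # Collect results in task-index order, whatever the completion order.
        return [future.result() for future in futures]


task = {"files": ["a.csv", "b.csv", "c.csv", "d.csv"]}
reports = run_in_parallel(task, task_count=4)
```

Every task receives the same `task` object and differs only in its index, which matches the {taskIndex, task} pairs shown in the execution overview.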
Distributed execution
[Diagram: Tasks wait in a task queue and are picked up by map tasks.]
Run tasks on Hadoop (embulk-executor-mapreduce).
Distributed execution (w/ partitioning)
[Diagram: Tasks wait in a task queue and pass through Map - Shuffle - Reduce.]
Run tasks on Hadoop (embulk-executor-mapreduce).
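
The Map - Shuffle - Reduce flow can be illustrated in a single process. The slides do not say how records are partitioned, so the sketch below assumes partitioning by a key taken from each record. All function names are hypothetical, and nothing here touches Hadoop or the embulk-executor-mapreduce code.

```python
from collections import defaultdict


def map_phase(records, key_of):
    # Map: tag every record with its partition key.
    for record in records:
        yield key_of(record), record


def shuffle_phase(pairs):
    # Shuffle: group records that share a partition key.
    partitions = defaultdict(list)
    for key, record in pairs:
        partitions[key].append(record)
    return dict(partitions)


def reduce_phase(partitions, write_partition):
    # Reduce: hand each partition to an output step, one call per key.
    return {key: write_partition(key, recs) for key, recs in partitions.items()}


records = [
    {"day": "2015-01-01", "value": 1},
    {"day": "2015-01-02", "value": 2},
    {"day": "2015-01-01", "value": 3},
]
partitions = shuffle_phase(map_phase(records, key_of=lambda r: r["day"]))
written = reduce_phase(partitions, write_partition=lambda key, recs: len(recs))
```

Here `write_partition` only counts records; in a real run the reduce side would write each partition to the output.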
Transaction control
fileInput.transaction {                  (file input plugin)
  parser.transaction {                   (parser plugin)
    filters.transaction {                (filter plugins)
      formatter.transaction {            (formatter plugin)
        fileOutput.transaction {         (file output plugin)
          executor.transaction {         (executor plugin)
            …
          }
        }
      }
    }
  }
}
[Diagram: Tasks run inside the innermost executor.transaction block.]
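
One way to read the nesting is that each plugin opens its transaction, runs the next plugin's transaction inside it, and closes only after everything inside has finished. The sketch below illustrates that ordering in Python. The `Plugin` class and `nest` helper are hypothetical stand-ins, not Embulk's plugin interfaces.

```python
class Plugin:
    """Hypothetical stand-in for a plugin; it only records begin/end events."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def transaction(self, body):
        self.log.append(f"begin {self.name}")
        try:
            return body()
        finally:
            # Runs even if an inner transaction raises.
            self.log.append(f"end {self.name}")


def nest(plugins, innermost):
    # Wrap from the inside out so the first plugin becomes the outermost layer.
    body = innermost
    for plugin in reversed(plugins):
        body = (lambda p, b: lambda: p.transaction(b))(plugin, body)
    return body()


def run_tasks():
    log.append("run tasks")
    return "done"


log = []
names = ["fileInput", "parser", "filters", "formatter", "fileOutput", "executor"]
result = nest([Plugin(name, log) for name in names], run_tasks)
```

Because the outermost transaction is the last to close, a plugin in this sketch gets a chance to commit or clean up after all tasks have finished.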
Task configuration
fileInput.transaction { fileInputTask, taskCount →
  parser.transaction { parserTask, schema →
    filters.transaction { filterTasks, schema →
      formatter.transaction { formatterTask →
        fileOutput.transaction { fileOutputTask →
          executor.transaction { →
            task = {
              fileInputTask,
              parserTask,
              filterTasks,
              formatterTask,
              fileOutputTask,
            }
            taskCount.times.inParallel { taskIndex →
              run(taskIndex, task)
            }
          }
        }
      }
    }
  }
}
taskCount is decided by the input.
schema is decided by the input, and may be modified by filters.
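
The callback style above, where each transaction hands its task configuration to the next block, can be sketched in Python with nested functions. This is a simplified illustration under stated assumptions: the formatter and file output steps are left out for brevity, the input creates one task per file, and every function name and config key is hypothetical.

```python
def file_input_transaction(config, control):
    files = config["files"]
    # Assumption for this sketch: taskCount is one task per input file.
    return control({"files": files}, len(files))


def parser_transaction(config, control):
    # The schema is decided on the input side.
    return control({"delimiter": ","}, list(config["columns"]))


def filters_transaction(config, schema, control):
    # Filters may modify the schema; this hypothetical filter drops columns.
    dropped = set(config["drop"])
    kept = [column for column in schema if column not in dropped]
    return control([{"drop": sorted(dropped)}], kept)


def plan(config):
    def with_file_input(file_input_task, task_count):
        def with_parser(parser_task, schema):
            def with_filters(filter_tasks, output_schema):
                task = {
                    "fileInputTask": file_input_task,
                    "parserTask": parser_task,
                    "filterTasks": filter_tasks,
                    "schema": output_schema,
                }
                # One unit of work per task index, all sharing the same task.
                return [{"taskIndex": i, "task": task} for i in range(task_count)]
            return filters_transaction(config, schema, with_filters)
        return parser_transaction(config, with_parser)
    return file_input_transaction(config, with_file_input)


config = {"files": ["a.csv", "b.csv"], "columns": ["id", "name", "note"], "drop": ["note"]}
units = plan(config)
```

The innermost function sees every task configuration collected by the outer ones, which is what lets it assemble a single `task` and pair it with each `taskIndex`.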
Task execution
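fileInput.open()                         (file input plugin)
parser.run(fileInput, pageOutput)        (parser plugin)
formatter.open(fileOutput)               (formatter plugin)
fileOutput.open()                        (file output plugin)
[Diagram: within each Task, data flows through the file input plugin, parser plugin, filter plugins, formatter plugin, and file output plugin.]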
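
The data flow inside a task can be sketched as a chain of objects: the parser reads from the file input and pushes records to a page output, which passes them through a filter to a formatter, which writes to the file output. The classes below are hypothetical Python stand-ins (a comma-separated parser, an uppercase filter, a tab-separated formatter), not Embulk's interfaces.

```python
import io


class CsvParser:
    def run(self, file_input, page_output):
        # Read lines from the file input and push parsed records downstream.
        for line in file_input:
            page_output.add(line.rstrip("\n").split(","))
        page_output.finish()


class UppercaseFilter:
    def __init__(self, next_output):
        self.next_output = next_output

    def add(self, record):
        self.next_output.add([value.upper() for value in record])

    def finish(self):
        self.next_output.finish()


class TsvFormatter:
    def __init__(self, file_output):
        self.file_output = file_output

    def add(self, record):
        self.file_output.write("\t".join(record) + "\n")

    def finish(self):
        self.file_output.flush()


file_input = io.StringIO("1,alice\n2,bob\n")
file_output = io.StringIO()
page_output = UppercaseFilter(TsvFormatter(file_output))
CsvParser().run(file_input, page_output)
```

The filter and the formatter expose the same `add`/`finish` shape here, so any number of filters can sit between the parser and the formatter.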
Type conversion
Input type system (e.g. PostgreSQL): boolean, integer, bigint, double precision, text, varchar, date, timestamp, timestamp with zone, …
Embulk type system: boolean, long, double, string, timestamp
Output type system (e.g. Elasticsearch): boolean, integer, long, float, double, string, array, geo point, geo shape, …
The input plugin (or the parser plugin, if the input is file-based) converts input types to Embulk types.
The output plugin (or the formatter plugin, if the output is file-based) converts Embulk types to output types.
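
To make the input-side conversion concrete, the sketch below maps the PostgreSQL-style type names listed on the slide onto the five Embulk types. The mapping itself is an assumption chosen for illustration; an actual input plugin may map types differently, and `to_embulk_schema` is a hypothetical helper.

```python
EMBULK_TYPES = {"boolean", "long", "double", "string", "timestamp"}

# Illustrative mapping only (an assumption, not taken from any plugin).
POSTGRESQL_TO_EMBULK = {
    "boolean": "boolean",
    "integer": "long",
    "bigint": "long",
    "double precision": "double",
    "text": "string",
    "varchar": "string",
    "date": "timestamp",
    "timestamp": "timestamp",
    "timestamp with zone": "timestamp",
}


def to_embulk_schema(columns):
    # columns is a list of (column name, source type name) pairs.
    schema = []
    for name, source_type in columns:
        embulk_type = POSTGRESQL_TO_EMBULK.get(source_type)
        if embulk_type is None:
            raise ValueError(f"no Embulk type chosen for {source_type!r}")
        schema.append((name, embulk_type))
    return schema


schema = to_embulk_schema([("id", "bigint"), ("name", "varchar"), ("born", "date")])
```

Several source types collapse onto one Embulk type, so the output side has to choose a target type again when writing.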
