Embulk internals
- 5. Distributed execution (w/ partitioning)
[Diagram labels: Task (×4), Task queue, Map - Shuffle - Reduce]
Run tasks on Hadoop (embulk-executor-mapreduce)
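The Map - Shuffle - Reduce flow with partitioning can be sketched in a few lines. This is a single-process illustration, not the code of embulk-executor-mapreduce: the function names, the `key_fn` partition key and the `write_fn` output callback are all assumptions made for the example.

```python
from collections import defaultdict


def map_phase(records, key_fn):
    """Map: tag every record with its partition key (key_fn is hypothetical)."""
    return [(key_fn(record), record) for record in records]


def shuffle_phase(tagged):
    """Shuffle: group tagged records so each partition key gets one bucket."""
    partitions = defaultdict(list)
    for key, record in tagged:
        partitions[key].append(record)
    return dict(partitions)


def reduce_phase(partitions, write_fn):
    """Reduce: run one output step per partition (write_fn is hypothetical)."""
    return {key: write_fn(key, records) for key, records in partitions.items()}


def run_partitioned(input_tasks, key_fn, write_fn):
    """Each element of input_tasks stands in for the records of one input task."""
    tagged = []
    for records in input_tasks:
        tagged.extend(map_phase(records, key_fn))
    return reduce_phase(shuffle_phase(tagged), write_fn)


if __name__ == "__main__":
    tasks = [
        [{"day": "mon", "v": 1}, {"day": "tue", "v": 2}],
        [{"day": "mon", "v": 3}],
    ]
    # Partition by day and report how many records each partition received.
    print(run_partitioned(tasks, lambda r: r["day"], lambda key, recs: len(recs)))
```

On a real cluster the three phases run on different machines; here they are plain function calls so the data flow is easy to follow.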
- 7. Task configuration
fileInput.transaction { fileInputTask, taskCount →
  parser.transaction { parserTask, schema →
    filters.transaction { filterTasks, schema →
      formatter.transaction { formatterTask →
        fileOutput.transaction { fileOutputTask →
          executor.transaction { →
            task = {
              fileInputTask,
              parserTask,
              filterTasks,
              formatterTask,
              fileOutputTask,
            }
            taskCount.times.inParallel { taskIndex → run(taskIndex, task) }
          }
        }
      }
    }
  }
}
taskCount is decided by input.
schema is decided by input, and may be modified by filters.
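The nested transaction pattern above can be mimicked with plain callbacks. The sketch below is only an illustration under stated assumptions: `Stage`, `run_pipeline` and the `run` callback are hypothetical names, not Embulk's plugin interfaces, and a thread pool stands in for the executor.

```python
from concurrent.futures import ThreadPoolExecutor


class Stage:
    """Hypothetical stand-in for one plugin; not Embulk's plugin interface."""

    def __init__(self, name, log, task, *extras):
        self.name = name
        self.log = log
        self.task = task
        self.extras = extras  # e.g. taskCount or schema handed to the callback

    def transaction(self, control):
        # Everything the inner stages do happens inside this call, so the
        # stage can commit or clean up once the callback returns.
        self.log.append("begin " + self.name)
        try:
            return control(self.task, *self.extras)
        finally:
            self.log.append("end " + self.name)


def run_pipeline(task_count, schema, run):
    log = []
    file_input = Stage("fileInput", log, "fileInputTask", task_count)
    parser = Stage("parser", log, "parserTask", schema)
    filters = Stage("filters", log, ["filterTask"], schema)
    formatter = Stage("formatter", log, "formatterTask")
    file_output = Stage("fileOutput", log, "fileOutputTask")

    def with_input(file_input_task, count):
        def with_parser(parser_task, parsed_schema):
            def with_filters(filter_tasks, filtered_schema):
                def with_formatter(formatter_task):
                    def with_output(file_output_task):
                        task = {
                            "fileInputTask": file_input_task,
                            "parserTask": parser_task,
                            "filterTasks": filter_tasks,
                            "formatterTask": formatter_task,
                            "fileOutputTask": file_output_task,
                        }
                        # Stand-in for the executor: one run per task index.
                        with ThreadPoolExecutor() as pool:
                            return list(pool.map(lambda i: run(i, task), range(count)))
                    return file_output.transaction(with_output)
                return formatter.transaction(with_formatter)
            return filters.transaction(with_filters)
        return parser.transaction(with_parser)

    results = file_input.transaction(with_input)
    return results, log
```

Because each stage wraps the next, the outermost stage (the input) is the last one to finish, which is what lets it decide the task count up front and still observe the outcome of every task.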
- 9. Type conversion
Input type system (e.g. PostgreSQL): boolean, integer, bigint, double precision, text, varchar, date, timestamp, timestamp with zone, …
Embulk type system: boolean, long, double, string, timestamp
Output type system (e.g. Elasticsearch): boolean, integer, long, float, double, string, array, geo point, geo shape, …
Input type system → Embulk type system: input plugin (parser plugin if input is file-based)
Embulk type system → Output type system: output plugin (formatter plugin if output is file-based)
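A two-step conversion through the Embulk type system might look like the sketch below. The two mapping tables are assumptions made up for the example (in particular, writing `timestamp` out as `string`); real input and output plugins define their own conversions, and `convert_schema` is a hypothetical helper.

```python
# Hypothetical mapping an input plugin might apply (input type -> Embulk type).
INPUT_TO_EMBULK = {
    "boolean": "boolean",
    "integer": "long",
    "bigint": "long",
    "double precision": "double",
    "text": "string",
    "varchar": "string",
    "date": "timestamp",
    "timestamp": "timestamp",
    "timestamp with zone": "timestamp",
}

# Hypothetical mapping an output plugin might apply (Embulk type -> output type).
EMBULK_TO_OUTPUT = {
    "boolean": "boolean",
    "long": "long",
    "double": "double",
    "string": "string",
    "timestamp": "string",  # assumption for this sketch only
}


def convert_schema(columns, mapping):
    """Translate (name, type) pairs through one mapping table."""
    converted = []
    for name, type_name in columns:
        if type_name not in mapping:
            raise ValueError(f"no mapping for column {name!r} of type {type_name!r}")
        converted.append((name, mapping[type_name]))
    return converted


if __name__ == "__main__":
    source = [("id", "bigint"), ("name", "varchar"), ("created", "date")]
    embulk_schema = convert_schema(source, INPUT_TO_EMBULK)
    print(embulk_schema)
    print(convert_schema(embulk_schema, EMBULK_TO_OUTPUT))
```

Routing every conversion through the small Embulk type set means each plugin only needs to know its own side of the mapping, not every possible source and destination pair.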