SlideShare a Scribd company logo
Onyx - Data
Processing
The Clojure Way
By Bahadir Cambel
@bahadircambel
Raise your hands if you have used
- Cascalog
- Hadoop
- Spark
- Flink
- Samza
- Storm
- Sqoop
What’s good for ?
Realtime event stream processing
Continuous computation
Extract, transform, load (ETL)
Data transformation à la map-reduce
Data ingestion and storage medium transfer
Data cleaning
Data Structures + Simple Functions
- Sounds familiar ?
Hadoop
Onyx   data processing  the clojure way
Cascading
https://github.com/Cascading/cascading.samples/blob
/master/wordcount/src/java/wordcount/Main.java
Onyx   data processing  the clojure way
Storm
https://github.com/nathanmarz/storm-
starter/blob/master/src/clj/storm/starte
r/clj/word_count.clj
Onyx   data processing  the clojure way
Spark
http://spark.apache.org/examples.html
SCALA
JAVA dev having a productive
day
Onyx   data processing  the clojure way
Cascalog
http://cascalog.org/articles/getting_st
arted.html
?-
<-
:>
Onyx   data processing  the clojure way
Internet Scale MongoDB could save youu?!
Onyx   data processing  the clojure way
Onyx
Workflow
- articulate the paths that data
flows through the cluster at
runtime
- DAG
Catalog
- Describe and configure
workflow items
Function(s)
Flow Conditions
From -> To ( if predicate correct)
Flow conditions are used for isolating
logic about whether or not segments
should pass through different tasks in
a workflow, exception handling and
support a rich degree of composition
with runtime parameterization.
Windows / Triggers
partitions a possible unbounded
sequence of data into finite
pieces, allowing aggregations to
be specified
- Timer
- Segment
- Punctuation
- Watermark
Life Cycles
allows you to hook in and execute
arbitrary code at critical points
during a task (kinda middleware)
:lifecycle/start-task?
:lifecycle/before-task-start
:lifecycle/before-batch
:lifecycle/after-read-batch
:lifecycle/after-batch
:lifecycle/after-task-stop
:lifecycle/after-ack-segment
:lifecycle/after-retry-segment
Job
A job will be translated into multiple
tasks. Peers will take care of these
tasks.
If your number of tasks > available peers
A job won’t be complete ( Buy me a beer
or 10)
Bulk functions
perform a fn more efficiently over a
batch of segments rather than
processing one segment at a time.
- Write to DB
Onyx will ignore the output of your function and
pass the same segments that you received
downstream
Group by
“like” values are always
routed to the same virtual
peer
- Group by key
- Group by a fn
Specify in the catalog!
Fixed Windows
a data point will fall into
exactly one instance of a
window (often called an
extent in the literature)
Between t1=0 and t2=4 how many
events have happened?
t1=5 t2=9, t1=10 t2=14
And so on..
Sliding Window
a slide value for how long to wait
between spawning a new window
extent
Between t1=0 and t2=14 how many events have
happened?
t1=5 t2=19 ?
t1=10 t2=24 ?
Global Window
Session Window
dynamically resize their upper and
lower bounds in reaction to
incoming data
Sessions capture a time span of activity for a
specific key, such as a user ID. If no activity
occurs within a timeout gap, the session
closes. If an event occurs within the bounds of
a session, the window size is fused with the
new event, and the session is extended by its
timeout gap either in the forward or backward
direction
Aggregation
:onyx.windowing.aggregation/conj
:onyx.windowing.aggregation/count
:onyx.windowing.aggregation/sum
:onyx.windowing.aggregation/min
:onyx.windowing.aggregation/max
:onyx.windowing.aggregation/average
Architecture
Peer
is a node in the cluster responsible for processing data
Virtual Peer
A Virtual Peer refers to a single peer process running on a single physical
machine. A single Virtual Peer executes at most one task at a time.
ZooKeeper
Watches Peers
Aeron
Efficient reliable UDP unicast, UDP
multicast, and IPC message transport
Messaging layer takes care of the direct
peer to peer transfer of segment
batches, acks, segment completion and
segment retries to the relevant virtual
peers.
The Log
Scheduling
If there is no master, how does
scheduling work ?
Peers contend to work on tasks.
Types of Job Schedulers
- Greedy ( I need ALL!!!! Gimme all!!)
- Balanced Robin ( Fair play)
- Percentage ( Not so fair play)
Types of Task Schedulers
- Balanced
- Percentage
- Colocation (assigns them to the peers on a single physical machine, low latency, min network)
Tags
a set of machines in your cluster are
privileged
Run some tasks at some specific
machines
Declare a peer with capabilities
- Datomic
- Special Hardware (GPU, Memory)
- Network
Onyx Plugins
onyx-core-async
onyx-kafka
onyx-datomic
onyx-redis
onyx-sql
onyx-bookkeeper
onyx-seq
onyx-durable-queue
onyx-elasticsearch
Example - Let’s process some logs
49556677821280438558577372995495836672945903576549425154
Check out the repository
https://github.com/bcambel/onyx-
test
End users configuring what
- workflows should look like.
- Language agnostic
- Location agnostic
- Tolerant to machine generation
- Temporally agnostic ( should wait for a time to be realized)
If you are not enjoying your experience
There is something fundamentally wrong with the tool
Think about Apple’s smooth product experience.
Pain detected, thought through. (Pain -> Pleasure)
Tools
- Onyx ETL https://github.com/onyx-platform/onyx-etl
- Dashboard https://github.com/onyx-platform/onyx-dashboard
- Replica Cons https://github.com/onyx-platform/onyx-console-dashboard
- Ansible Playbook https://github.com/onyx-platform/ansible-onyx
- Metrics Suite https://github.com/onyx-platform/onyx-metrics
- Benchmark Suite https://github.com/onyx-platform/onyx-benchmark
- Jepsen https://github.com/onyx-platform/onyx-jepsen
Onyx Dashboard
https://github.com/onyx-
platform/onyx-dashboard
Questions
- How does Onyx distributes reads (input tasks) ? Parallelization??
- evenly break up a database table into chunks which can be read by multiple peers
- Segments realized. Tasks created. Peers get into action
Helpful Links
http://www.onyxplatform.org/
https://gitter.im/onyx-platform/onyx
https://clojurians.slack.com/messages/onyx
https://github.com/onyx-platform/onyx-examples
https://github.com/onyx-platform/learn-onyx
https://github.com/Yuppiechef/cqrs-server (Open source project using Onyx)
Shameless self plug
http://www.bahadir.io/
https://twitter.com/bahadircambel
https://www.strava.com/athletes/8258974
Thanks
Michael Drogalis - https://twitter.com/michaeldrogalis
Lucas Bradstreet - https://twitter.com/ghaz
Thanks for using Clojure!

More Related Content

Onyx data processing the clojure way