Hadoop in Practice (SDN Conference, Dec 2014)
- 1. Hadoop in Practice
Introduction to the Hadoop Stack at The New Motion
SDN Conference, Arnhem, Netherlands, Dec 2, 2014
Marcel Krcah @mkrcah, Data Engineer
Daan Debie @daandebie, Data Engineer
- 3. The New Motion: Dutch startup, born in 2009
• Provides infrastructure and services for drivers of electric vehicles
• Largest in Europe, 3rd in the world, with 15k charge-points
• Charge-points are online; customers see their data on a web/mobile app
• Customers save 20 tons of CO2 every day
• 10k charge sessions per day
• 100k messages per day from charge-points
- 4. Technical problems before Hadoop:
• Data scattered over 7 databases
• Many duplicates & inconsistencies
• Complex queries were too time-consuming
• Batch analytics running and sharing resources with production DBs
• Databases out-of-sync
Our industry was changing rapidly; systems needed to adapt very quickly.
Yes, we know, this is generally bad. We are working on it.
Our systems went down occasionally when a complex query was run.
- 5. Business problems before Hadoop:
• No unified and objective view on data
• Hard to get simple insights
• Hard to quickly analyze our charge-session data
• Data not shared within the company
The company ran largely on gut feeling instead of data.
- 6. Initial requirements for the data warehouse
• Software engineers:
– a tool that allows quick analysis of our session and log data
– automatic notifications if databases get out of sync
• Management:
– an easy-to-use (BI) tool to explore clean data
– make informed decisions based on data
• Data analysts:
– a platform to execute complex SQL queries on clean data (the most important requirement)
• Company:
– company-wide dashboards and emails with key metrics
– a single source of truth
• Hadoop proved very successful when migrating data to Salesforce.
- 7. Hadoop stack at TheNewMotion (diagram)
• Amazon S3: temporary data storage
• Tableau: Business Intelligence
• IPython Notebook: interactive data exploration
• Hue: UI for Hadoop
• Production DBs: mostly Postgres and MySQL
• Apache Pig: data mangling
• Impala: SQL engine for Hadoop
• Hermes: homegrown CLI and dashboard
• Hadoop: distributed filesystem
• Apache Spark: distributed collections
- 8. Hadoop stack at TheNewMotion (stack diagram repeated)
- 9. Apache Hadoop: distributed filesystem
• Consists of:
– a distributed, fault-tolerant filesystem (HDFS); the filesystem remains fully functional even if a node fails
– an execution engine (MapReduce)
• Free, open-source
• Built in Java
• Scales from one server to thousands of nodes
• We use Hadoop:
– as a central place to store all our cleaned data
– on a cluster of 5 nodes running on Amazon EC2
- 10. Apache Hadoop: example commands
Commands run within a Unix shell on a Hadoop node; a standard Unix filesystem command, prefixed with a dash, is executed on the cluster.

Example: show all files in the root Hadoop directory:

hdfs@hadoop-ec2:~$ hdfs dfs -ls /

Example: copy one file to another in Hadoop:

hdfs@hadoop-ec2:~$ hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2
- 11. Hadoop stack at TheNewMotion: Amazon S3 (stack diagram repeated)
Hadoop doesn't have to run 24/7. To save costs, we spun up the cluster at 8am and tore it down at 10pm.
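The scheduling trick above can be checked with back-of-the-envelope arithmetic, assuming cluster cost scales linearly with instance-hours (the 5-node EC2 cluster from slide 9; the hourly price below is a made-up illustration, not the actual rate):

```python
NODES = 5
HOURLY_PRICE = 0.35  # USD per node-hour, assumed purely for illustration

always_on = NODES * 24 * HOURLY_PRICE           # daily cost, running 24/7
office_hours = NODES * (22 - 8) * HOURLY_PRICE  # daily cost, 8am to 10pm
savings = 1 - office_hours / always_on          # fraction saved per day
```

Running 14 of 24 hours saves roughly 42% of the compute bill, regardless of the per-hour price.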
- 12. Every night, we dump production DBs into S3 as CSV files (stack diagram repeated)
S3 is very cheap data storage; it's cheaper than running Hadoop 24/7.
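A hedged sketch of the nightly dump step: serialize rows from a production database into a CSV file, which would then be uploaded to S3 (the upload itself is omitted). The `rows` argument stands in for anything iterable yielding tuples, such as a psycopg2 cursor after a SELECT; the function and file names are illustrative, not the actual pipeline.

```python
import csv

def dump_to_csv(column_names, rows, path):
    # Write a header row from the column names, then the data rows,
    # producing the kind of CSV that lands in S3 each night.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(column_names)
        writer.writerows(rows)

dump_to_csv(["session_id", "kwh"], [(1, 7.5), (2, 3.2)], "sessions.csv")
```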
- 13. Current Hadoop stack at TheNewMotion (stack diagram repeated)
- 14. Apache Pig: data mangling
• SQL on steroids:
– save queries into variables
– apply user-defined functions (in Python, Java, Ruby and JavaScript)
– load/save data from/to HDFS or S3
• A tool for data mangling, cleaning and complex queries of large datasets
• Runs on MapReduce (the Hadoop execution engine)
• Currently, we use Pig for data mangling, but we are moving to Apache Spark since we find Pig slow and hard to debug
Why "Pig"? Because it can eat any type of data.
- 15. Apache Pig: example script

Load from a CSV stored on Hadoop:

athletes = LOAD 'OlympicAthletes.csv' USING PigStorage(',')
    AS (athlete:chararray, country:chararray, year:int,
        sport:chararray, gold:int, silver:int, bronze:int,
        total:int);

There are many built-in functions:

athletes_grp_country = GROUP athletes BY country;
athletes_lim = LIMIT athletes 10;

Register and use a custom function:

REGISTER 'olympic_udfs.py' USING streaming_python AS udf;
athlete_score = FOREACH athletes GENERATE athlete,
    udf.calculate_score(gold, silver, bronze) AS score;

Store the result back to Hadoop as CSV:

STORE athlete_score INTO '/scores' USING PigStorage(',');
- 16. Hadoop stack at TheNewMotion (stack diagram repeated)
- 17. Cloudera Impala: SQL engine for Hadoop
• SQL engine on top of Hadoop
• Supports the SQL 2003 standard, incl. window functions
• Fastest SQL on Hadoop¹; also beats some RDBMSs
• Supports custom user-defined functions (in C++ or Python)
• Built in C++
• Developed by Cloudera, but open-sourced
• Runs a daemon on each Hadoop node
• Compliant with the JDBC/ODBC interface
1. TPC-DS benchmark: bit.ly/tpc-ds-impala
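The window-function support mentioned above can be illustrated with plain SQL. In the sketch below, the query ranks charge sessions per customer by energy delivered; sqlite3 is used only as a local stand-in for Impala so the SQL can be tried anywhere (the table and column names are made up for illustration, not taken from our schema):

```python
import sqlite3

def rank_sessions(rows):
    # rows: (customer, kwh) tuples; returns each session with its per-customer
    # rank by energy delivered, computed via a SQL window function.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sessions (customer TEXT, kwh REAL)")
    conn.executemany("INSERT INTO sessions VALUES (?, ?)", rows)
    return conn.execute("""
        SELECT customer, kwh,
               RANK() OVER (PARTITION BY customer ORDER BY kwh DESC) AS rnk
        FROM sessions
    """).fetchall()
```

Against Impala the same statement would run via Impyla or any JDBC/ODBC client.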
- 18. Hadoop stack at TheNewMotion (stack diagram repeated)
- 19. Tableau: business intelligence
• Desktop/cloud app to explore data
• Suitable for the management team, marketing and operations (our CEO loves it)
• Supports many data sources
• Connects to Impala via an ODBC driver
• Alternatives: JasperSoft, MicroStrategy, Datameer, GoodData
We opted for Tableau due to its intuitive UI, stable Impala integration and reasonable pricing.
- 20. Hadoop stack at TheNewMotion (stack diagram repeated)
- 21. Hue: UI for Hadoop
• Web UI on top of all Hadoop technologies:
– analyze data with Impala, Pig, Spark and Hive
– browse files and perform operations on HDFS
– schedule jobs with Oozie
– create dashboards (on top of Solr)
– download Impala results as CSV/XLS files
• Suitable for data analysts and anybody who knows SQL
• Demo: http://demo.gethue.com
- 22. Hue screenshot: SQL running on Hadoop, tables with their schemas, built-in support for basic charts, save & export queries
- 23. Hadoop stack at TheNewMotion (stack diagram repeated)
- 24. Hermes: homegrown web app with dashboards
• Internal web app with dashboards
• Any employee can log in with a Google account
• REST API backend built in Python/Flask and Impyla
• Frontend based on AngularJS with the D3/NVD3 lib
• Query results are accessible in JSON format via the API to any system
• Basically one JavaScript call and one HTML element to add a new graph
• We are considering open-sourcing Hermes (we have no logo yet)
- 25. Example: Hermes frontend with Angular and NVD3

One JavaScript call to load data into AngularJS; the delivered charge-point data is stored in the AngularJS variable $scope.data:

$http({method: 'GET', url: '/api/delivered-chargers'}).
  success(function(data) {
    $scope.data.deliveredChargers = data;
  });

One HTML element to show the graph, parameterized with attributes and pulling its data from AngularJS:

<nvd3-multi-bar-chart
  data="data.deliveredChargers"
  width="700"
  height="300"
  showLegend="true"
  noData="No data available."
  interactive="true"
  tooltips="true"
  xAxisLabel="Month">
</nvd3-multi-bar-chart>
- 26. Hermes CLI: a command-line tool to automate Hadoop actions
• Built in Python
• Supports:
– automatic provisioning of a Hadoop cluster on AWS
– querying Impala via the command line and saving the results
– generating CSVs based on a YAML config file
– sending scheduled reports by email
– sending daily notifications to our chat tool (Slack)
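One of the features above, generating CSVs from a config file, can be sketched as follows. This is a hypothetical illustration, not the actual Hermes interface: the config would normally be parsed from YAML, but a plain dict is used here to keep the sketch dependency-free, and the keys and helper name are made up.

```python
import csv
import io

def write_report(config, rows, out):
    # Header comes from the config; rows would come from an Impala query
    # (e.g. fetched via Impyla) in the real tool.
    writer = csv.writer(out)
    writer.writerow(config["columns"])
    writer.writerows(rows)

config = {"name": "daily-sessions", "columns": ["day", "sessions"]}
buf = io.StringIO()
write_report(config, [("2014-12-01", 9800), ("2014-12-02", 10100)], buf)
```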
- 27. Hadoop stack at TheNewMotion (stack diagram repeated)
- 28. Apache Spark: distributed Hadoop collections
• Distributed, resilient, in-memory collections
• Python, Scala and Java APIs (an R binding is coming soon)
• Functional API (LINQ users should feel at home)
• Built in Scala
• Supports HDFS and Cassandra for data storage
• Shipped with tools for SQL, machine learning, streaming and graph processing (also, a tool for approximate SQL is coming soon)
- 29. Apache Spark example: word count in Python
1. Load data from Hadoop; 2. calculate the count for each word; 3. save back to Hadoop:

lines = spark.textFile("hdfs://...")
counts = lines.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

All these operations are distributed; they run in parallel on each Hadoop node.
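To make the flatMap / map / reduceByKey steps concrete, here is a local, single-machine sketch of the same computation. Spark runs this logic partitioned across the cluster; plain Python stands in here purely for intuition.

```python
from collections import Counter

def word_count(lines):
    counts = Counter()
    for line in lines:              # flatMap: split each line into words
        for word in line.split(" "):
            counts[word] += 1       # map + reduceByKey: sum 1s per word
    return dict(counts)
```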
- 30. Hadoop stack at TheNewMotion (stack diagram repeated)
- 31. IPython Notebook: interactive data exploration
• Interactive Python shell embedded in a web UI
• Environment for interactive data experimentation
• Shows code together with documentation
• Easy to share, for example as a Gist (a core feature)
• Great for data analysts/scientists
• Supports Impyla (Python driver for Impala) and PySpark (Python driver for Apache Spark)
- 32. Current Hadoop stack at TheNewMotion (stack diagram repeated)
- 33. Hadoop distributions
• Free & easy deployment of the complete Hadoop stack; a basic deployment can be done in under an hour
• The Cloudera distribution (CDH) includes:
– Hadoop, Pig, Hive, Impala, Spark, Hue et al.
– Cloudera Manager, for easy provisioning and monitoring of the Hadoop cluster
• We use CDH because of Impala and the Cloudera Manager
• Try installing Hadoop yourself: bit.ly/virt-hadoop
- 34. Challenges that we are currently tackling
• Port ETL scripts to Apache Spark (quicker and easier to debug; also, we are a Scala shop)
• Gather click-stream data from the web
• And also some machine learning:
– predict failure of a charge-point
– score leads in Salesforce
– predict customer churn
– offer product recommendations
– detect anomalies in system & charger logs
- 35. • Cloudera just announced a strategic partnership with MS Azure; deploy Hadoop easily from the Azure Marketplace
• Selected Hadoop distributions can be run on Windows OS
• With Hadoop Streaming, you can write MapReduce jobs in any language and bind them to Hadoop via stdin/stdout
• With .NET being open-sourced, maybe a Spark-on-F# initiative will pop up soon...
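The Hadoop Streaming model above can be sketched as a word count: Hadoop pipes input lines to the mapper on stdin, sorts the mapper's key/value output, and pipes it to the reducer. The two functions below are written against iterables so they can be tested locally; in a real Streaming job each would live in its own script reading stdin and printing to stdout.

```python
from itertools import groupby

def mapper(lines):
    # Emit one "word<TAB>1" record per word, as a Streaming mapper would print.
    for line in lines:
        for word in line.strip().split():
            yield "%s\t1" % word

def reducer(sorted_records):
    # Hadoop delivers mapper output sorted by key; sum the counts per word.
    for word, group in groupby(sorted_records, key=lambda r: r.split("\t")[0]):
        total = sum(int(r.split("\t")[1]) for r in group)
        yield "%s\t%d" % (word, total)
```

Locally the shuffle step is just `sorted()`: `reducer(sorted(mapper(lines)))`.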
- 36. Thank you.
Marcel Krcah @mkrcah, Data Engineer
Daan Debie @daandebie, Data Engineer
TheNewMotion.com
Slides will be available at marcelkrcah.net/talks