Hadoop in Practice
Introduction to the Hadoop Stack at The New Motion
SDN Conference, Arnhem, Netherlands, Dec 2, 2014
Marcel Krcah @mkrcah, Data Engineer
Daan Debie @daandebie, Data Engineer
"War is ninety percent information" - Napoleon Bonaparte
Dutch startup, born in 2009:
• Provides infrastructure and services for drivers of electric vehicles
• Largest in Europe, 3rd in the world, with 15k charge-points
• Charge-points are online; customers see their data on a web/mobile app
• Customers save 20 tons of CO2 every day
• 10k charge sessions per day
• 100k messages per day from charge-points
Technical problems before Hadoop:
• Data scattered over 7 databases (our industry was changing rapidly; systems needed to adapt very quickly)
• Many duplicates & inconsistencies
• Complex queries were too time-consuming
• Batch analytics running and sharing resources with production DBs (our systems went down occasionally when a complex query was run)
• Databases out-of-sync (yes, we know, this is generally bad; we are working on it)
Business problems before Hadoop:
• No unified and objective view on the data
• Hard to get simple insights
• Hard to quickly analyze our charge-session data
• Data not shared within the company
The company ran a lot on gut feeling instead of data.
Initial requirements for a data warehouse
• Software engineers:
  – A tool that allows quick analysis of our session and log data
  – Automatic notifications if databases get out-of-sync
• Management:
  – An easy-to-use (BI) tool to explore clean data
  – Make informed decisions based on data
• Data analysts:
  – A platform to execute complex SQL queries on clean data (the most important requirement)
• Company:
  – Company-wide dashboards and emails with key metrics
  – Single source of truth
• Hadoop proved very successful when migrating data to SalesForce.
Current Hadoop stack at TheNewMotion
• Production DBs: mostly Postgres and MySQL
• Amazon S3: temporary data storage
• Hadoop: distributed filesystem
• Apache Pig: data mangling
• Impala: SQL engine for Hadoop
• Apache Spark: distributed collections
• Hue: UI for Hadoop
• Tableau: Business Intelligence
• IPython Notebook: interactive data exploration
• Hermes: homegrown CLI and dashboard
[Stack diagram repeated, highlighting Hadoop]
Apache Hadoop: Distributed filesystem
• Consists of:
  – a distributed fault-tolerant filesystem (HDFS); the filesystem remains fully functional even if a node fails
  – an execution engine (MapReduce)
• Free, open-source
• Built in Java
• Scales from one server to thousands of nodes
• We use Hadoop:
  – as a central place to store all our cleaned data
  – on a cluster of 5 nodes running on Amazon EC2
Apache Hadoop: Example Hadoop commands

Commands are executed within a Unix shell on a Hadoop node; each is a standard Unix filesystem command prefixed with a dash, which Hadoop executes on the cluster.

Example: show all files in the root Hadoop directory:

hdfs@hadoop-ec2:~$ hdfs dfs -ls /

Example: copy one file to another one in Hadoop:

hdfs@hadoop-ec2:~$ hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2
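Uploading local data into HDFS works the same way; a small hypothetical example (the file name and target directory are illustrative, but hdfs dfs -put is a standard command):

hdfs@hadoop-ec2:~$ hdfs dfs -put sessions.csv /data/sessions/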
[Stack diagram repeated, highlighting Hadoop]
Hadoop doesn't have to run 24/7. To save costs, we spin up the cluster at 8am and tear it down at 10pm.
[Stack diagram repeated, highlighting Amazon S3]
Every night, we dump the production DBs into S3 as CSV files. S3 is very cheap data storage; it's cheaper than running Hadoop 24/7.
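A minimal sketch of one way such a nightly dump could be done, e.g. from a cron job; the host, user, table, and bucket names are all illustrative assumptions, not the deck's actual setup:

psql -h prod-db -U reporter \
    -c "\copy charge_sessions TO STDOUT WITH CSV HEADER" |
  aws s3 cp - s3://example-dumps/$(date +%F)/charge_sessions.csv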
[Stack diagram repeated, highlighting Apache Pig]
Apache Pig: Data mangling
Why "Pig"? Because it can eat any type of data.
• SQL on steroids:
  – Save queries into variables
  – Apply user-defined functions (in Python, Java, Ruby and JavaScript)
  – Load/save data from/to HDFS or S3
• Tool for data mangling, cleaning and complex queries of large datasets
• Runs on MapReduce (the Hadoop execution engine)
• Currently, we use Pig for data mangling, but we are moving to Apache Spark since we find Pig slow & hard to debug
Apache Pig: Example script

-- Load from CSV stored on Hadoop
athletes = LOAD 'OlympicAthletes.csv' USING PigStorage(',')
    AS (athlete:chararray, country:chararray, year:int,
        sport:chararray, gold:int, silver:int, bronze:int,
        total:int);

-- There are many built-in functions
athletes_grp_country = GROUP athletes BY country;
athletes_lim = LIMIT athletes 10;

-- Register and use a custom function
REGISTER 'olympic_udfs.py' USING streaming_python AS udf;
athlete_score = FOREACH athletes GENERATE athlete,
    udf.calculate_score(gold, silver, bronze) AS score;

-- Store the result back to Hadoop as CSV
STORE athlete_score INTO '/scores' USING PigStorage(',');
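The deck doesn't show olympic_udfs.py itself; a minimal sketch of what it might contain, with made-up weights. Streaming Python UDFs declare their output schema with the pig_util decorator:

from pig_util import outputSchema

@outputSchema('score:int')
def calculate_score(gold, silver, bronze):
    # Hypothetical weighting: 3 points per gold, 2 per silver, 1 per bronze
    return 3 * (gold or 0) + 2 * (silver or 0) + (bronze or 0)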
[Stack diagram repeated, highlighting Impala]
Cloudera Impala: SQL engine for Hadoop
• SQL engine on top of Hadoop
• Supports the SQL:2003 standard, incl. window functions (an example query follows below)
• Fastest SQL on Hadoop [1]; also beats some RDBMSs
• Supports custom user-defined functions, in C++ or Python
• Built in C++
• Developed by Cloudera, but open-sourced
• Runs a daemon on each Hadoop node
• Compliant with the JDBC/ODBC interface

[1] TPC-DS benchmark: bit.ly/tpc-ds-impala
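As an illustration of those window functions, the kind of query Impala can run, executed with impala-shell; the host name is hypothetical, and the athletes table assumes the data from the Pig example was loaded into Impala:

impala-shell -i hadoop-node1 -q "
  SELECT country, athlete,
         RANK() OVER (PARTITION BY country ORDER BY total DESC) AS country_rank
  FROM athletes;"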
[Stack diagram repeated, highlighting Tableau]
Tableau: Business intelligence
• Desktop/cloud app to explore data
• Suitable for the management team, marketing and operations; our CEO loves it
• Supports many data sources
• Connects to Impala via the ODBC driver
• Alternatives: JasperSoft, MicroStrategy, DataMeer, GoodData
We opted for Tableau due to its intuitive UI, stable Impala integration & reasonable pricing.
[Stack diagram repeated, highlighting Hue]
Hue: UI for Hadoop
• Web UI on top of all Hadoop technologies:
  – Analyze data with Impala, Pig, Spark and Hive
  – Browse files and perform operations on HDFS
  – Schedule jobs with Oozie
  – Create dashboards (on top of Solr)
  – Download Impala results as CSV/XLS files
• Suitable for data analysts and anybody who knows SQL
• Demo: http://demo.gethue.com
[Hue screenshot: SQL running on Hadoop; tables with their schemas; built-in support for basic charts; save & export queries]
[Stack diagram repeated, highlighting Hermes]
Hermes: Homegrown web app with dashboards
(We have no logo yet.)
• Internal web app with dashboards
• Any employee can log in with a Google account
• REST API backend built in Python/Flask and Impyla; query results are accessible in JSON format via the API to any system (a sketch follows below)
• Frontend based on AngularJS with the D3/NVD3 lib; basically one JavaScript call and one HTML element to add a new graph
• We are considering open-sourcing Hermes
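The deck shows only the frontend; a minimal sketch of what the matching Flask + Impyla endpoint could look like, assuming a hypothetical impalad host and charge_points table:

from flask import Flask, jsonify
from impala.dbapi import connect

app = Flask(__name__)

@app.route('/api/delivered-chargers')
def delivered_chargers():
    # 21050 is Impala's default HiveServer2 port; the host name is an assumption
    conn = connect(host='hadoop-node1', port=21050)
    cur = conn.cursor()
    # Hypothetical query over a cleaned charge-points table
    cur.execute("SELECT month, COUNT(*) AS delivered "
                "FROM charge_points GROUP BY month ORDER BY month")
    cols = [c[0] for c in cur.description]
    return jsonify(data=[dict(zip(cols, row)) for row in cur.fetchall()])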
Example: Hermes frontend with Angular and NVD3

One JavaScript call to load data into AngularJS; the delivered charge-point data is stored in the AngularJS var $scope.data:

$http({method: 'GET', url: '/api/delivered-chargers'}).
  success(function(data) {
    $scope.data.deliveredChargers = data;
  });

One HTML element to show the graph; the attributes parameterize it, and the data attribute points at the AngularJS data to be shown:

<nvd3-multi-bar-chart
    data="data.deliveredChargers"
    width="700"
    height="300"
    showLegend="true"
    noData="No data available."
    interactive="true"
    tooltips="true"
    xAxisLabel="Month">
</nvd3-multi-bar-chart>
Hermes CLI: command-line tool to automate Hadoop actions
• Built on Python
• Supports:
  – Automatic provisioning of the Hadoop cluster on AWS
  – Querying Impala via the command line and saving the results
  – Generating CSVs based on a YAML config file (a hypothetical example follows below)
  – Sending scheduled reports by email
  – Sending daily notifications to our chat tool (Slack)
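The deck doesn't show the config format, so the following YAML is purely illustrative of what such a report definition might contain; every key and value here is an assumption:

# Hypothetical Hermes report definition
report: delivered-chargers
query: >
  SELECT month, COUNT(*) AS delivered
  FROM charge_points
  GROUP BY month
output: delivered_chargers.csv
schedule: daily
recipients:
  - data-team@example.com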
[Stack diagram repeated, highlighting Apache Spark]
Apache Spark: Distributed Hadoop collections
• Distributed resilient in-memory collections
• Python, Scala and Java API; an R binding is coming soon
• Functional API; LINQ users should feel at home
• Built in Scala
• Supports HDFS, Cassandra for data storage
• Shipped with tools for SQL, machine learning, streaming, graph processing; also, a tool for approximate SQL is coming soon
Apache Spark. Example: word count in Python

# 1. Load data from Hadoop
lines = spark.textFile("hdfs://...")

# 2. Calculate the count for each word
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)

# 3. Save back to Hadoop
counts.saveAsTextFile("hdfs://...")

All these operations are distributed; they run in parallel on each Hadoop node.
[Stack diagram repeated, highlighting IPython Notebook]
IPython Notebook: Interactive data exploration
• Interactive Python shell embedded in a web UI
• Environment for interactive data experimentation (the core feature)
• Show code together with documentation
• Share easily, for example as a Gist
• Great for data analysts/scientists
• Supports Impyla (Python driver for Impala) and PySpark (Python driver for Apache Spark); a sketch follows below
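A small sketch of what a notebook cell exploring Impala data might look like with Impyla, which can hand query results straight to pandas; the host and table names are assumptions:

from impala.dbapi import connect
from impala.util import as_pandas

conn = connect(host='hadoop-node1', port=21050)  # impalad host (assumed)
cur = conn.cursor()
cur.execute('SELECT * FROM charge_sessions LIMIT 100')  # hypothetical table
df = as_pandas(cur)  # fetch the result set as a pandas DataFrame
df.describe()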
[Stack diagram recap: the current Hadoop stack at TheNewMotion]
Hadoop distributions
• Free & easy deployment of the complete Hadoop stack; a basic deployment can be done in under one hour
• The Cloudera distribution (CDH) includes:
  – Hadoop, Pig, Hive, Impala, Spark, Hue et al.
  – Cloudera Manager, for easy provisioning and monitoring of the Hadoop cluster
• We use CDH because of Impala and the Cloudera Manager
• Try installing Hadoop yourself: bit.ly/virt-hadoop
Challenges that we are currently tackling
• Port the ETL scripts to Apache Spark: quicker and easier to debug, and we are a Scala shop (a sketch follows after this list)
• Gather click-stream data from the web
• And also some machine learning:
  – predict failure of a charge-point
  – score leads in SalesForce
  – predict customer churn
  – offer product recommendations
  – detect anomalies in system & charger logs
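As a taste of such a port, a minimal PySpark sketch of the grouping and scoring from the earlier Pig script; the field positions assume the same OlympicAthletes.csv layout, sc is the SparkContext, and the score weights mirror the hypothetical UDF above:

# Load the same CSV from HDFS and split each line into fields
athletes = sc.textFile('hdfs:///OlympicAthletes.csv') \
             .map(lambda line: line.split(','))

# GROUP athletes BY country, as in the Pig script
by_country = athletes.groupBy(lambda f: f[1])

# Score each athlete: fields 4, 5, 6 are gold, silver, bronze
scores = athletes.map(lambda f: (f[0], 3*int(f[4]) + 2*int(f[5]) + int(f[6])))
scores.saveAsTextFile('hdfs:///scores')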
• Cloudera just announced a strategic partnership with MS Azure: deploy Hadoop easily from the Azure Marketplace
• Selected Hadoop distributions can be run on Windows OS
• With Hadoop Streaming, you can write MapReduce jobs in any language and bind them with Hadoop via stdin/stdout (a sketch follows below)
• With .NET being open-source, maybe a Spark-on-F# initiative will pop up soon...
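To make the stdin/stdout binding concrete, a minimal streaming mapper in Python; the matching reducer would sum the counts per word, and the jar path and input/output directories are illustrative:

#!/usr/bin/env python
# mapper.py: read lines on stdin, emit "word<TAB>1" pairs on stdout
import sys

for line in sys.stdin:
    for word in line.split():
        print('%s\t1' % word)

Run it with the streaming jar that ships with Hadoop:

hadoop jar hadoop-streaming.jar \
    -input /logs -output /word-counts \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py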
Thank you.
Marcel Krcah @mkrcah, Data Engineer
Daan Debie @daandebie, Data Engineer
TheNewMotion.com
Slides will be available at marcelkrcah.net/talks
