YARN
- 2. CC BY 2.0 / Richard Bumgardner
Been there, done that.
- 3. Alex @ Cloudera
• Solutions Architect
• AKA consultant
• government
• Infrastructure
- 4. What Does Cloudera Do?
• product
  • distribution of Hadoop components, Apache licensed
  • enterprise tooling
• support
• training
• services (aka consulting)
• community
- 5. Disclaimer
• Cloudera builds software
  • most donated to Apache
  • some closed-source
• Cloudera “products” I reference are open source
  • Apache licensed
  • source code is on GitHub
  • https://github.com/cloudera
- 6. What This Talk Isn’t About
• deploying
  • Puppet, Chef, Ansible, homegrown scripts, intern labor
• sizing & tuning
  • depends heavily on data and workload
• coding
  • line diagrams don’t count
• algorithms
  • I suck at math, ask anyone
- 7. So What ARE We Talking About?
• Why YARN?
• Architecture
• Availability
• Resources & Scheduling
• MR1 to MR2 Gotchas
• History
• Interfaces
• Applications
• Storytime
- 9. Why “Ecosystem?”
• In the beginning, just Hadoop
  • HDFS
  • MapReduce
• Today, dozens of interrelated components
  • I/O
  • Processing
  • Specialty Applications
  • Configuration
  • Workflow
- 10. Partial Ecosystem
[diagram: web servers, device logs, and RDBMS/DWH systems feed Hadoop through log collection and DB table import; inside Hadoop: batch processing, machine learning, SQL, and Search; results flow back out via DB table export and API access to external systems, users, and BI tools over JDBC/ODBC]
- 11. HDFS
• Distributed, highly fault-tolerant filesystem
• Optimized for large streaming access to data
• Based on the Google File System
  • http://research.google.com/archive/gfs.html
- 12. Lots of Commodity Machines
Image: Yahoo! Hadoop cluster [OSCON ’07]
- 13. MapReduce (MR)
• Programming paradigm
• Batch oriented, not realtime
• Works well with distributed computing
• Lots of Java, but other languages supported
• Based on Google’s paper
  • http://research.google.com/archive/mapreduce.html
- 14. MR1 Components
• JobTracker
  • accepts jobs from the client
  • schedules jobs on particular nodes
  • accepts status data from TaskTrackers
• TaskTracker
  • one per node
  • manages tasks
  • crunches data in-place
  • reports to the JobTracker
- 15. Under the Covers
- 16. You specify map() and reduce() functions. The framework does the rest.
- 23. Why YARN / MR2?
• Scalability
  • the JT kept track of individual tasks and wouldn’t scale
• Utilization
  • all slots are equal even if the work is not equal
• Multi-tenancy
  • every framework shouldn’t need to write its own execution engine
  • all frameworks should share the resources on a cluster
- 24. An Operating System?
Traditional operating system
• Storage: file system
• Execution/scheduling: processes / kernel scheduler
Hadoop
• Storage: Hadoop Distributed File System (HDFS)
• Execution/scheduling: Yet Another Resource Negotiator (YARN)
- 25. Multiple Levels of Scheduling
• YARN
  • Which application (framework) to give resources to?
• Application (framework: MR etc.)
  • Which task within the application should use these resources?
- 28. Architecture – running multiple applications
- 29. Control Flow: Submit application
- 30. Control Flow: Get application updates
- 31. Control Flow: AM asking for resources
- 32. Control Flow: AM using containers
- 33. Execution Modes
• Local mode
• Uber mode
• Executors
  • DefaultContainerExecutor
  • LinuxContainerExecutor
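A minimal mapred-site.xml sketch for the first two modes, assuming stock property names; the values are illustrative, not tuning advice:

  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>   <!-- set to "local" to run the whole job in one client-side JVM -->
  </property>
  <property>
    <name>mapreduce.job.ubertask.enable</name>
    <value>true</value>   <!-- small jobs run entirely inside the AM's container -->
  </property>
  <property>
    <name>mapreduce.job.ubertask.maxmaps</name>
    <value>9</value>      <!-- one of the "small enough" thresholds; 9 is the commonly cited default -->
  </property>

With uber mode on, a job under the thresholds skips separate task containers entirely and runs inside the AM.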
- 35. 35
Client
Failover
Client
Failover
Availability
©2014
Cloudera,
Inc.
All
rights
reserved.
RM
Ele
ctor
RM
Ele
ctor
ZK
Store
NM
NM
NM
NM
Client
Client
Client
- 36. Availability – Subtleties
• Embedded leader elector
  • no need for a separate daemon like ZKFC
• Implicit fencing using the ZKRMStateStore
  • the active RM claims exclusive access to the store through ACL magic
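A minimal yarn-site.xml sketch of RM HA with the embedded elector and ZK-based store; the host names, RM ids, and ZK quorum below are placeholders:

  <property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
  <property><name>yarn.resourcemanager.cluster-id</name><value>yarn-cluster</value></property>
  <property><name>yarn.resourcemanager.ha.rm-ids</name><value>rm1,rm2</value></property>
  <property><name>yarn.resourcemanager.hostname.rm1</name><value>master1.example.com</value></property>
  <property><name>yarn.resourcemanager.hostname.rm2</name><value>master2.example.com</value></property>
  <property><name>yarn.resourcemanager.recovery.enabled</name><value>true</value></property>
  <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
  </property>
  <property><name>yarn.resourcemanager.zk-address</name><value>zk1:2181,zk2:2181,zk3:2181</value></property>

The ZKRMStateStore is what provides the implicit fencing described above: only the active RM holds write access to the store.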
- 37. Availability – Implications
• Previously submitted applications continue to run
  • new ApplicationMasters are created
  • if the AM checkpoints state, it can continue from where it left off
  • MR keeps track of completed tasks, so they don’t have to be re-run
• Future
  • work-preserving RM restart / failover
- 38. Availability – Implications
• Transparent to clients
  • RM unavailable for only a small duration
  • clients automatically fail over to the active RM
• Web UI redirects
• REST API redirects (starting 5.1.0)
- 40. Resource Model and Capacities
• Resource vectors
  • e.g. 1024 MB, 2 vcores, …
  • no more task slots!
• Nodes specify the amount of resources they have
  • yarn.nodemanager.resource.memory-mb
  • yarn.nodemanager.resource.cpu-vcores
  • vcores map to physical cores; they aren’t really “virtual”
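The two properties above live in yarn-site.xml on each NodeManager; a sketch with example numbers (sizing depends on your hardware and workload):

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>24576</value>   <!-- memory this node offers to containers -->
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>12</value>      <!-- vcores this node offers to containers -->
  </property>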
- 41. Resources and Scheduling
• What you request is what you get
  • no more fixed-size slots
• The framework/application requests resources for a task
  • the MR AM requests resources for map and reduce tasks; these requests can potentially be for different amounts of resources
- 42. YARN Scheduling
[diagram: a ResourceManager, Application Master 1, Application Master 2, and Node 1, Node 2, Node 3; the next slides animate one scheduling exchange]
- 43. YARN Scheduling
AM1 to the RM: “I want 2 containers with 1024 MB and 1 core each”
- 44. YARN Scheduling
RM to AM1: “Noted”
- 45. YARN Scheduling
AM1 to the RM: “I’m still here”
- 46. YARN Scheduling
RM: “I’ll reserve some space on Node 1 for AM1”
- 47. YARN Scheduling
AM1 to the RM: “Got anything for me?”
- 48. YARN Scheduling
RM to AM1: “Here’s a security token to let you launch a container on Node 1”
- 49. YARN Scheduling
AM1 to Node 1: “Hey, launch my container with this shell command”
- 50. YARN Scheduling
[diagram: the container now running on Node 1]
- 51. Resources on a Node
[diagram: a node offering 5 GB, filled by an MR AM container (1024 MB), map containers (512, 1024, 256, 256 MB), and reduce containers (1536, 512 MB)]
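For MapReduce, container sizes like the ones in this picture come from a few mapred-site.xml properties; a sketch with illustrative values:

  <property><name>yarn.app.mapreduce.am.resource.mb</name><value>1024</value></property>  <!-- the MR AM container -->
  <property><name>mapreduce.map.memory.mb</name><value>1024</value></property>            <!-- each map container -->
  <property><name>mapreduce.reduce.memory.mb</name><value>1536</value></property>         <!-- each reduce container -->

Note that the scheduler rounds requests up to its minimum/increment allocation, so the actual grant can be larger than the ask.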
- 52. FairScheduler (FS)
• When space becomes available to run a task on the cluster, which application do we give it to?
• Find the job that is using the least space.
- 53. FS: Apps and Queues
• Apps go in “queues”
• Share fairly between queues
• Share fairly between apps within queues
- 54. FS: Hierarchical Queues
[diagram: Root (mem capacity: 12 GB, CPU capacity: 24 cores)
  • Marketing: fair share 4 GB, 8 cores
  • RD: fair share 4 GB, 8 cores
  • Sales: fair share 4 GB, 8 cores
    • Jim’s Team: fair share 2 GB, 4 cores
    • Bob’s Team: fair share 2 GB, 4 cores]
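A sketch of the FairScheduler allocation file (fair-scheduler.xml) for a hierarchy like the one above; queue names follow the slide, the nesting is illustrative, and with no weights set, sibling queues split their parent’s share evenly:

  <allocations>
    <queue name="marketing"/>
    <queue name="rd"/>
    <queue name="sales">
      <queue name="jims_team"/>
      <queue name="bobs_team"/>
    </queue>
  </allocations>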
- 55. FS: Fast and Slow Lanes
[diagram: Root (mem capacity: 12 GB, CPU capacity: 24 cores)
  • Marketing: fair share 4 GB, 8 cores
  • Sales: fair share 4 GB, 8 cores
  • Fast Lane: max share 1 GB, 1 core
  • Slow Lane: fair share 3 GB, 7 cores]
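One plausible way to express a fast/slow lane pair in the allocation file is to cap the fast lane with maxResources so it stays small and responsive, while the slow lane keeps the remaining fair share; queue names and the cap are illustrative:

  <allocations>
    <queue name="fast_lane">
      <maxResources>1024 mb, 1 vcores</maxResources>
    </queue>
    <queue name="slow_lane"/>
  </allocations>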
- 56. FS: Fairness for Hierarchies
• Traverse the tree starting at the root queue
• Offer resources to subqueues in order of how few resources they’re using
- 58. FS: Multi-resource Scheduling
• Scheduling based on multiple resources
  • CPU, memory
  • future: disk, network
• Why multiple resources?
  • better utilization
  • more fair
- 59. FS: More Features
• Preemption
  • to avoid starvation, preempt tasks using more than their fair share after the preemption timeout
  • applications are warned; an application can choose to kill any of its containers
• Locality through delay scheduling
  • try to give node-local or rack-local resources by waiting for some time
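Preemption is off by default; a sketch of the two knobs involved, with an illustrative timeout:

  <!-- yarn-site.xml -->
  <property>
    <name>yarn.scheduler.fair.preemption</name>
    <value>true</value>
  </property>

  <!-- fair-scheduler.xml: seconds a queue waits below its fair share before preempting -->
  <fairSharePreemptionTimeout>60</fairSharePreemptionTimeout>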
- 60. Enforcing Resource Limits
• Memory
  • monitor process usage and kill the container if it crosses its limit
  • disable virtual memory checking
  • physical memory checking is being improved
• CPU
  • cgroups
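The memory checks are yarn-site.xml switches; a sketch matching the advice above (cgroup-based CPU enforcement additionally requires the LinuxContainerExecutor):

  <property><name>yarn.nodemanager.vmem-check-enabled</name><value>false</value></property>  <!-- disable virtual memory checking -->
  <property><name>yarn.nodemanager.pmem-check-enabled</name><value>true</value></property>   <!-- kill containers that exceed physical memory -->
  <property><name>yarn.nodemanager.vmem-pmem-ratio</name><value>2.1</value></property>       <!-- only relevant while vmem checking is on -->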
- 62. MR1 to MR2 Gotchas
• AMs can take up all resources
  • symptom: submitted jobs don’t run
  • fix in progress: limit the maximum number of applications
  • workaround: use scheduler allocations to limit the number of applications (sketch below)
• How to run 4 maps and 2 reduces per node?
  • don’t try to tune the number of tasks per node
  • set assignMultiple to false to spread allocations
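Both points map to scheduler settings; a sketch, with the per-queue cap purely illustrative:

  <!-- yarn-site.xml: hand out at most one container per node per heartbeat -->
  <property>
    <name>yarn.scheduler.fair.assignmultiple</name>
    <value>false</value>
  </property>

  <!-- fair-scheduler.xml: cap concurrent apps per queue so AMs can't take up all resources -->
  <queue name="default">
    <maxRunningApps>20</maxRunningApps>
  </queue>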
- 63. MR1 to MR2 Gotchas
• Comparing MR1 and MR2 benchmarks
  • TestDFSIO runs best on dedicated CPU/disk, harder to pin
  • TeraSort changed: less compressible == more network xfer
• Resource allocation vs. resource consumption
  • the RM allocates resources; the heap is specified elsewhere
  • JVM overhead not included
  • mind your mapred.[map|reduce].child.java.opts (sketch below)
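A sketch of how allocation and consumption relate for MR tasks: the *.memory.mb properties are what the RM allocates, while the JVM heap set in java.opts must fit inside that container with room left for JVM overhead (values illustrative):

  <property><name>mapreduce.map.memory.mb</name><value>1024</value></property>
  <property><name>mapreduce.map.java.opts</name><value>-Xmx800m</value></property>
  <property><name>mapreduce.reduce.memory.mb</name><value>1536</value></property>
  <property><name>mapreduce.reduce.java.opts</name><value>-Xmx1200m</value></property>

The mapred.[map|reduce].child.java.opts names the slide mentions are the older spellings of the same heap settings.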
- 64. MR1 to MR2 Gotchas
• Changes in logs make tracing problems harder
  • MR1: distributed grep on the JobId
  • YARN logs are more generic; they deal with containers, not apps
- 66. Job History
• Job history viewing was moved to its own server: the Job History Server
• Helps with load on the RM (the JT equivalent)
• Helps separate MR from YARN
- 67. How History Flows
• The AM
  • while running, keeps track of all events during execution
  • on success, before finishing up, writes the history information to done_intermediate_dir
• The JHS
  • periodically scans the done_intermediate dir
  • moves the files to done_dir
  • starts showing the history
- 68. History: Important Configuration Properties
• yarn.app.mapreduce.am.staging-dir
  • default (CM): /user ← want this also for security
  • default (CDH): /tmp/hadoop-yarn/staging
  • staging directory for MapReduce applications
• mapreduce.jobhistory.done-dir
  • default: ${yarn.app.mapreduce.am.staging-dir}/history/done
  • final location in HDFS for history files
• mapreduce.jobhistory.intermediate-done-dir
  • default: ${yarn.app.mapreduce.am.staging-dir}/history/done_intermediate
  • location in HDFS where AMs dump history files
- 69. History: Important Configuration Properties
• mapreduce.jobhistory.max-age-ms
  • default: 604800000 (7 days)
  • max age before the JHS deletes history
• mapreduce.jobhistory.move.interval-ms
  • default: 180000 (3 min)
  • frequency at which the JHS scans the intermediate done dir
- 70. History: Miscellaneous
• The JHS runs as ‘mapred’, the AM runs as the user who submitted the job, and the RM runs as ‘yarn’
• The done_intermediate dir needs to be writable by the user who submitted the job and readable by ‘mapred’
• The RM, AM, and JHS should have identical versions of the jobhistory-related properties so they all “agree”
- 71. Application History Server / Timeline Server
• Work in progress to capture history and other information for non-MR YARN applications
- 72. YARN Container Logs
• While the application is running
  • local to the NM: yarn.nodemanager.log-dirs
• After the application finishes
  • logs aggregated to HDFS
  • yarn.nodemanager.remote-app-log-dir
• Disable aggregation?
  • yarn.log-aggregation-enable
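Once aggregation is enabled and the application has finished, the aggregated logs can be pulled back from HDFS with the yarn CLI; the application id below is a placeholder:

  yarn logs -applicationId application_1400000000000_0001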
- 75. Interacting with a YARN Cluster
• Java API
  • the MR1 and MR2 APIs are compatible
• REST API
  • the RM, NM, and JHS all have REST APIs that are very useful
• Llama (Long-Lived Application Master)
  • Cloudera Impala can reserve, use, and release resource allocations without using YARN-managed container processes
• CLI
  • yarn rmadmin, application, etc. (examples below)
• Web UI
  • new and “improved”; needs time to get used to
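A few concrete examples of the CLI and REST surfaces named above; the hostname is a placeholder and 8088 is the RM web port’s usual default:

  yarn application -list                                 # applications known to the RM
  yarn rmadmin -getServiceState rm1                      # which RM is active (HA setups)
  curl http://rm.example.com:8088/ws/v1/cluster/apps     # same application list, as JSON/XML
  curl http://rm.example.com:8088/ws/v1/cluster/metrics  # cluster-wide resource metrics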
- 77. 77
©2014
Cloudera,
Inc.
All
rights
reserved.
YARN
ApplicaAons
• MR2
• Cloudera
Impala
• Apache
Spark
• Others?
Custom?
• Apache
Slider
(incubaAng);
not
producAon-‐ready
• Accumulo
• HBase
• Storm
- 79. The Cloudera View of YARN
• Shipping
  • enabled by default on CDH 5+
  • included for the past two years, not enabled
• Supported
• Recommended
- 80. Growing Pains
• Benchmarking is harder
  • different utilization paradigm
  • “whole cluster” benchmarks more important, e.g. SWIM
• Tuning still largely trial/error
  • MR1 was the same originally
  • YARN/MR2 will get there eventually
- 81. What Are Customers Doing?
• A few are using it in production
  • Spark
  • Impala via Llama
• Many are exploring
• Most are waiting
- 82. Why Not Mesos?
• Mesos
  • designed to be completely general purpose
  • more burden on the app developer (offer model vs. app request)
• YARN
  • designed with Hadoop in mind
  • supports Kerberos
  • more robust/familiar scheduling
  • rack/machine locality, out of the box
• Supportability
  • all commercial Hadoop vendors support YARN
  • support for Mesos limited to the startup Mesosphere
- 83. Is This the End for MapReduce?
- 85. Image Credits
• CC BY 2.0 flik https://flic.kr/p/4RVoUX
• CC BY 2.0 Ian Sane https://flic.kr/p/nRyHxd
• CC BY-NC 2.0 lollyknit https://flic.kr/p/49C1Xi
• CC BY-ND 2.0 jankunst https://flic.kr/p/deU71s
• CC BY-SA 2.0 pierrepocs https://flic.kr/p/9mgdMd
• CC BY-SA 2.0 bekathwia https://flic.kr/p/4FpABU
• CC BY-NC-ND 2.0 digitalnc https://flic.kr/p/dxyTt1
• CC BY-NC-ND 2.0 arselectronica https://flic.kr/p/7yw8z2
• CC BY-NC-ND 2.0 yum9me https://flic.kr/p/81hQ49
• CC BY-NC-SA 2.0 jimnix https://flic.kr/p/gsqpWC
• Microsoft Office EULA (really)
- 86. Thank You!
Alex Moundalexis
@technmsg
Insert witty tagline here.