DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data

© Cloudera, Inc. All rights reserved.

Apache Kudu*: Fast Analytics on Fast Data

Todd Lipcon (Kudu team lead), todd@cloudera.com
@tlipcon

Tweet about this talk: @apachekudu or #kudu

* Incubating at the Apache Software Foundation

Apache Kudu
Storage for Fast Analytics on Fast Data

• New updatable column store for Hadoop
• Apache-licensed open source
• Beta now available

Diagram labels: Columnar Store, Kudu

Why Kudu?

Current Storage Landscape in Hadoop Ecosystem

HDFS (GFS) excels at:
• Batch ingest only (e.g. hourly)
• Efficiently scanning large amounts of data (analytics)

HBase (BigTable) excels at:
• Efficiently finding and writing individual rows
• Making data mutable

Gaps exist when these properties are needed simultaneously

Kudu Design Goals

• High throughput for big scans
  Goal: Within 2x of Parquet
• Low latency for short accesses
  Goal: 1ms read/write on SSD
• Database-like semantics (initially single-row ACID)
• Relational data model
• SQL queries are easy
• "NoSQL" style scan/insert/update (Java/C++ client)

Changing Hardware Landscape

• Spinning disk -> solid state storage
  • NAND flash: Up to 450k read and 250k write IOPS, about 2GB/sec read and 1.5GB/sec write throughput, at a price of less than $3/GB and dropping
  • 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
• RAM is cheaper and more abundant:
  • 64->128->256GB over last few years
• Takeaway: The next bottleneck is CPU, and current storage systems weren't designed with CPU efficiency in mind.

What's Kudu?

Scalable and fast tabular storage

• Scalable
  • Tested up to 275 nodes (~3PB cluster)
  • Designed to scale to 1000s of nodes, tens of PBs
• Fast
  • Millions of read/write operations per second across cluster
  • Multiple GB/second read throughput per node
• Tabular
  • SQL-like schema: finite number of typed columns (unlike HBase/Cassandra)
  • Fast ALTER TABLE
  • "NoSQL" APIs: Java/C++/Python or SQL (Impala/Spark/etc)

Use cases and architectures

Kudu Use Cases

Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes

● Time Series
  ○ Examples: Stream market data; fraud detection & prevention; network monitoring
  ○ Workload: Inserts, updates, scans, lookups
● Online Reporting
  ○ Examples: ODS
  ○ Workload: Inserts, updates, scans, lookups

Real-Time Analytics in Hadoop Today
Fraud Detection in the Real World = Storage Complexity

Considerations:
● How do I handle failure during this process?
● How often do I reorganize data streaming in into a format appropriate for reporting?
● When reporting, how do I see data that has not yet been reorganized?
● How do I ensure that important jobs aren't interrupted by maintenance?

Diagram labels: Incoming Data (Messaging System); HBase (New Partition); "Have we accumulated enough data?"; Reorganize HBase file into Parquet; Parquet File (Most Recent Partition, Historic Data); Reporting Request; Impala on HDFS

Steps shown after the reorganization:
• Wait for running operations to complete
• Define new Impala partition referencing the newly written Parquet file

Real-Time Analytics in Hadoop with Kudu

Improvements:
● One system to operate
● No cron jobs or background processes
● Handle late arrivals or data corrections with ease
● New data available immediately for analytics or operations

Diagram labels: Incoming Data (Messaging System); Storage in Kudu (Historical and Real-time Data); Reporting Request

Xiaomi use case

• World's 4th largest smart-phone maker (most popular in China)
• Gather important RPC tracing events from mobile app and backend service.
• Service monitoring & troubleshooting tool.

• High write throughput
  • >20 Billion records/day and growing
• Query latest data and quick response
  • Identify and resolve issues quickly
• Can search for individual records
  • Easy for troubleshooting

Xiaomi Big Data Analytics Pipeline
Before Kudu

• Long pipeline
  High latency (1 hour ~ 1 day), data conversion pains
• No ordering
  Log arrival (storage) order not exactly logical order
  e.g. read 2-3 days of log for data in 1 day

Xiaomi Big Data Analysis Pipeline
Simplified With Kudu

• ETL Pipeline (0~10s latency)
  Apps that need to prevent backpressure or require ETL
• Direct Pipeline (no latency)
  Apps that don't require ETL and have no backpressure issues

Diagram labels: OLAP scan, Side table lookup, Result store

How it works
Replication and fault tolerance

Tables, Tablets, and Tablet Servers

• Table is horizontally partitioned into tablets
  • Range or hash partitioning
  • PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS
  • bucketNumber = hashCode(row['timestamp']) % 100
• Each tablet has N replicas (3 or 5), with Raft consensus
  • Automatic fault tolerance
  • MTTR: ~5 seconds
• Tablet servers host tablets on local disk drives
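The bucketNumber formula above can be sketched in a few lines. This is only an illustration of hash bucketing: the CRC32 hash, the `bucket_for` helper, and the sample rows are assumptions made for the example, not Kudu's actual hash function or API.

```python
import zlib

NUM_BUCKETS = 100  # mirrors "INTO 100 BUCKETS" on the slide


def bucket_for(row, column="timestamp", num_buckets=NUM_BUCKETS):
    """Map a row to a bucket from the hash of one column.

    CRC32 is a stand-in hash chosen for the sketch, not what Kudu uses.
    """
    encoded = str(row[column]).encode("utf-8")
    return zlib.crc32(encoded) % num_buckets


# Hypothetical rows keyed by (host, metric, timestamp).
rows = [
    {"host": "a.example.com", "metric": "cpu", "timestamp": 1442865155},
    {"host": "b.example.com", "metric": "cpu", "timestamp": 1442865155},
    {"host": "a.example.com", "metric": "cpu", "timestamp": 1442865156},
]
buckets = [bucket_for(r) for r in rows]
```

Rows that share a timestamp always land in the same bucket regardless of host or metric, because only the hashed column feeds the bucket number.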

Metadata and the Master

• Replicated master
  • Acts as a tablet directory
  • Acts as a catalog (which tables exist, etc)
  • Acts as a load balancer (tracks TS liveness, re-replicates under-replicated tablets)
• Not a bottleneck
  • Super fast in-memory lookups

Client: Hey Master! Where is the row for 'tlipcon' in table "T"?

Master: It's part of tablet 2, which is on servers {Z,Y,X}. BTW, here's info on other tablets you might care about: T1, T2, T3, …

Client: UPDATE tlipcon SET col=foo

Meta Cache
T1: …
T2: …
T3: …
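The exchange above, where the client asks the master once and then answers later lookups from its meta cache, might look roughly like the following. Every class, method, and key range here is hypothetical and invented for the sketch; the real client API is not shown in the slides.

```python
class FakeMaster:
    """Hypothetical stand-in for the master's tablet directory."""

    def __init__(self, tablets):
        # Each tablet: (start_key, end_key, tablet_id, servers), end exclusive.
        self.tablets = tablets
        self.lookups = 0

    def locate(self, key):
        """Return the tablet holding `key` plus info on all known tablets."""
        self.lookups += 1
        hit = next(t for t in self.tablets if t[0] <= key < t[1])
        return hit, list(self.tablets)


class Client:
    def __init__(self, master):
        self.master = master
        self.meta_cache = []  # tablet locations learned from the master

    def find_tablet(self, key):
        # Serve from the meta cache when possible.
        for start, end, tablet_id, servers in self.meta_cache:
            if start <= key < end:
                return tablet_id, servers
        # Cache miss: ask the master and remember everything it told us.
        hit, others = self.master.locate(key)
        self.meta_cache = others
        return hit[2], hit[3]


master = FakeMaster([
    ("", "m", "T1", ["X", "Y", "W"]),
    ("m", "u", "T2", ["Z", "Y", "X"]),
    ("u", "\uffff", "T3", ["W", "Z", "X"]),
])
client = Client(master)
first = client.find_tablet("tlipcon")   # goes to the master
second = client.find_tablet("alice")    # answered from the meta cache
```

After the first lookup the master is not contacted again for keys in tablets it already described, which is one reason the master need not be a bottleneck.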

How it works
Columnar storage

Columnar storage

Tweet_id:   {25059873, 22309487, 23059861, 23010982}
User_name:  {newsycbot, RideImpala, fastly, llvmorg}
Created_at: {1442865158, 1442828307, 1442865156, 1442865155}
text:       {Visual exp…, Introducing .., Missing July…, LLVM 3.7….}

Columnar storage

Tweet_id (1GB):   {25059873, 22309487, 23059861, 23010982}
User_name (2GB):  {newsycbot, RideImpala, fastly, llvmorg}
Created_at (1GB): {1442865158, 1442828307, 1442865156, 1442865155}
text (200GB):     {Visual exp…, Introducing .., Missing July…, LLVM 3.7….}

SELECT COUNT(*) FROM tweets WHERE user_name = 'newsycbot';

Only read 1 column
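To make the "only read 1 column" point concrete, here is a toy column store that records which columns a query touches. The dictionary layout and the `read_column` and `count_where_equals` helpers are assumptions for illustration, not how Kudu stores or scans data.

```python
# Each column is stored separately, as on the slide.
columns = {
    "tweet_id": [25059873, 22309487, 23059861, 23010982],
    "user_name": ["newsycbot", "RideImpala", "fastly", "llvmorg"],
    "created_at": [1442865158, 1442828307, 1442865156, 1442865155],
    "text": ["Visual exp…", "Introducing ..", "Missing July…", "LLVM 3.7…."],
}
columns_read = []  # bookkeeping so we can see which columns were touched


def read_column(name):
    """Simulate fetching one column from storage."""
    columns_read.append(name)
    return columns[name]


def count_where_equals(column, value):
    """COUNT(*) with an equality predicate needs only the predicate column."""
    return sum(1 for v in read_column(column) if v == value)


matches = count_where_equals("user_name", "newsycbot")
```

The large `text` column is never fetched, so the query cost is driven by the size of `user_name` alone.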

Columnar compression

Created_at: {1442865158, 1442828307, 1442865156, 1442865155}

Created_at      Diff(created_at)
1442865158      n/a
1442828307      -36851
1442865156      36849
1442865155      -1
64 bits each    17 bits each

• Many columns can compress to a few bits per row!
• Especially:
  • Timestamps
  • Time series values
  • Low-cardinality strings
• Massive space savings and throughput increase!
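The Diff(created_at) column and the "17 bits each" figure can be reproduced with a small delta-encoding sketch. The function names are made up for the example, and this is plain delta encoding rather than Kudu's actual on-disk encoders.

```python
def delta_encode(values):
    """Keep the first value and store each later value as a difference."""
    return values[0], [b - a for a, b in zip(values, values[1:])]


def delta_decode(first, deltas):
    """Rebuild the original values by summing the differences."""
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out


def signed_bits_needed(deltas):
    """Smallest two's-complement width that can hold every delta."""
    bits = 1
    while not all(-(1 << (bits - 1)) <= d < (1 << (bits - 1)) for d in deltas):
        bits += 1
    return bits


created_at = [1442865158, 1442828307, 1442865156, 1442865155]
first, deltas = delta_encode(created_at)
width = signed_bits_needed(deltas)
```

The largest difference has magnitude 36851, which exceeds 2^15 but fits below 2^16, so a sign bit plus 16 value bits gives the 17 bits quoted on the slide.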

Integrations

Spark DataSource integration (WIP)

sqlContext.load("org.kududb.spark",
  Map("kudu.table" -> "foo",
      "kudu.master" -> "master.example.com"))
  .registerTempTable("mytable")
val df = sqlContext.sql(
  "select col_a, col_b from mytable " +
  "where col_c = 123")

Available in Kudu 0.7.0, but still being improved

Impala integration

• CREATE TABLE … DISTRIBUTE BY HASH(col1) INTO 16 BUCKETS AS SELECT … FROM …
• INSERT/UPDATE/DELETE
• Not an Impala user? Community working on other integrations (Hive, Drill, Presto, Phoenix)

MapReduce integration

• Multi-framework cluster (MR + HDFS + Kudu on the same disks)
• KuduTableInputFormat / KuduTableOutputFormat
• Support for pushing predicates, column projections, etc

Performance

TPC-H (Analytics benchmark)

• 75 server cluster
  • 12 (spinning) disks each, enough RAM to fit dataset
  • TPC-H Scale Factor 100 (100GB)
• Example query:

SELECT n_name, sum(l_extendedprice * (1 - l_discount)) AS revenue
FROM customer, orders, lineitem, supplier, nation, region
WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey
  AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey
  AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey
  AND r_name = 'ASIA'
  AND o_orderdate >= date '1994-01-01' AND o_orderdate < '1995-01-01'
GROUP BY n_name
ORDER BY revenue DESC;

• Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data

Versus other NoSQL storage

• Phoenix: SQL layer on HBase
• 10 node cluster (9 worker, 1 master)
• TPC-H LINEITEM table only (6B rows)

Chart: Time (sec), log scale from 0.01 to 10000

Workload             Phoenix   Kudu    Parquet
Load                 2152      1918    155
TPCH Q1              219       13.2    9.3
COUNT(*)             76        1.7     1.4
COUNT(*) WHERE…      131       0.7     1.5
single-row lookup    0.04      0.15    1.37

What about NoSQL-style random access? (YCSB)

• YCSB 0.5.0-snapshot
• 10 node cluster (9 worker, 1 master)
• 100M row data set
• 10M operations each workload

Getting started

Project status

• Open source beta released in September
• Latest release 0.7.1 hot off the presses
  • Usable for many applications (Xiaomi in production)
  • Have not experienced unrecoverable data loss, reasonably stable (almost no crashes reported). Users testing up to 200 nodes so far.
  • Still requires some expert assistance, and you'll probably find some bugs
• Part of the Apache Software Foundation Incubator
  • Community-driven open source process

Getting started as a user

• http://getkudu.io
• user@kudu.incubator.apache.org
• http://getkudu-slack.herokuapp.com/
• Quickstart VM
  • Easiest way to get started
  • Impala and Kudu in an easy-to-install VM
• CSD and Parcels
  • For installation on a Cloudera Manager-managed cluster

Getting started as a developer

• http://github.com/apache/incubator-kudu
• Code reviews: http://gerrit.cloudera.org
• Public JIRA: http://issues.apache.org/jira/browse/KUDU
  • Includes bugs going back to 2013. Come see our dirty laundry!
• Mailing list: dev@kudu.incubator.apache.org
• Apache 2.0 license open source
• Contributions are welcome and encouraged!

Kudu storage - Updates

Diagram: a MemRowSet, plus DiskRowSet 1 and DiskRowSet 2. Each DiskRowSet holds columnar base data (name, pay, role) and a Delta MS.

Each DiskRowSet has its own DeltaMemStore to accumulate updates

Kudu storage - Updates

Diagram: a MemRowSet, plus DiskRowSet 1 and DiskRowSet 2, each with base data (name, pay, role) and a Delta MS.

UPDATE set pay="$1M" WHERE name="todd"

• Is the row in DiskRowSet 2? (check bloom filters)
• Is the row in DiskRowSet 1? (check bloom filters)
• Bloom filter answers shown: "Bloom says: no!" and "Bloom says: maybe!"
• Search key column to find offset: rowid = 150
• Delta MS entry: 150: pay=$1M @ time T
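The update path on this slide (consult each DiskRowSet's bloom filter, search the key column only on a "maybe", then record the change in that rowset's delta store) can be sketched as follows. The classes, the SHA-256-based bloom filter, and the tiny key lists are all assumptions for illustration; none of this is Kudu's real code or sizing.

```python
import bisect
import hashlib


class BloomFilter:
    """Small illustrative bloom filter; parameters are arbitrary."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False means "definitely absent"; True only means "maybe".
        return all((self.bits >> p) & 1 for p in self._positions(key))


class DiskRowSet:
    def __init__(self, sorted_keys):
        self.keys = sorted_keys          # key column, sorted
        self.bloom = BloomFilter()
        for k in sorted_keys:
            self.bloom.add(k)
        self.delta_mem_store = {}        # rowid -> [(timestamp, column, value)]

    def try_update(self, key, column, value, timestamp):
        if not self.bloom.might_contain(key):
            return False                 # "Bloom says: no!"
        rowid = bisect.bisect_left(self.keys, key)  # search the key column
        if rowid == len(self.keys) or self.keys[rowid] != key:
            return False                 # bloom false positive
        self.delta_mem_store.setdefault(rowid, []).append((timestamp, column, value))
        return True


# Hypothetical contents: "todd" lives in the first rowset only.
drs1 = DiskRowSet(["alice", "todd", "zach"])
drs2 = DiskRowSet(["bob", "mary"])
results = [drs.try_update("todd", "pay", "$1M", timestamp=10) for drs in (drs2, drs1)]
```

The base data is never rewritten here; the update is stored against the row's ordinal offset, which is what lets reads apply it cheaply later.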

Kudu storage - Read path

Diagram: a MemRowSet, plus DiskRowSet 1 and DiskRowSet 2, each with base data (name, pay, role) and a Delta MS. One Delta MS holds the entry "150: pay=$1M @ time T".

• Read rows in DiskRowSet 2
• Then, read rows in DiskRowSet 1

Updates are applied based on ordinal offset within DRS: array indexing = fast
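A rough sketch of applying deltas by ordinal offset during a scan is shown below. The `scan_with_deltas` function, the data layout, and the `as_of` snapshot parameter are assumptions made for the example and do not mirror Kudu's internal structures.

```python
def scan_with_deltas(base_columns, deltas, as_of):
    """Materialize rows from columnar base data, then patch in deltas.

    base_columns: column name -> list of values (one entry per row)
    deltas: rowid -> list of (timestamp, column, value)
    as_of: only deltas with timestamp <= as_of are applied
    """
    num_rows = len(next(iter(base_columns.values())))
    rows = [{c: vals[i] for c, vals in base_columns.items()} for i in range(num_rows)]
    for rowid, changes in deltas.items():
        # rowid is an ordinal offset, so finding the row is plain list indexing.
        for ts, column, value in sorted(changes, key=lambda change: change[0]):
            if ts <= as_of:
                rows[rowid][column] = value
    return rows


# Hypothetical base data and one delta against row offset 1.
base = {
    "name": ["alice", "todd"],
    "pay": ["$1", "$2"],
    "role": ["eng", "lead"],
}
deltas = {1: [(10, "pay", "$1M")]}
after_update = scan_with_deltas(base, deltas, as_of=10)
before_update = scan_with_deltas(base, deltas, as_of=5)
```

Because each delta carries a timestamp, the same base data can be read as of different points in time, while the lookup itself needs no key comparison.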

Kudu storage - Delta flushes

Diagram: a MemRowSet, plus DiskRowSet 1 and DiskRowSet 2, each with base data (name, pay, role) and a Delta MS. The Delta MS entry "150: pay=$1M @ time T" is flushed to a REDO DeltaFile.

A REDO delta indicates how to transform between the 'base data' (columnar) and a later version

Kudu storage - Major delta compaction

DiskRowSet (pre-compaction): columns name, pay, role; Delta MS; three REDO DeltaFiles
• Many deltas accumulate: lots of delta application work on reads
• Merge updates for columns with high update percentage

DiskRowSet (post-compaction): base data with columns name, pay, role; Delta MS; unmerged REDO deltas; UNDO deltas
• If a column has few updates, it doesn't need to be re-written: those deltas are maintained in a new DeltaFile

Kudu storage - RowSet Compactions

Before compaction:
DRS 1 (32MB): [PK=alice], [PK=joe], [PK=linda], [PK=zach]
DRS 2 (32MB): [PK=bob], [PK=jon], [PK=mary], [PK=zeke]
DRS 3 (32MB): [PK=carl], [PK=julie], [PK=omar], [PK=zoe]

After compaction:
DRS 4 (32MB): [alice, bob, carl, joe]
DRS 5 (32MB): [jon, julie, linda, mary]
DRS 6 (32MB): [omar, zach, zeke, zoe]

Reorganize rows to avoid rowsets with overlapping key ranges
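The compaction on this slide is in effect a merge of sorted runs followed by a re-split. A minimal sketch, assuming each rowset is just a sorted list of keys and using a fixed row count in place of the 32MB size target (both simplifications, as are the `compact` and `overlaps` helpers):

```python
import heapq


def compact(rowsets, rows_per_output):
    """Merge sorted rowsets, then cut the result into non-overlapping rowsets."""
    merged = list(heapq.merge(*rowsets))
    return [merged[i:i + rows_per_output] for i in range(0, len(merged), rows_per_output)]


def overlaps(a, b):
    """True when the key ranges [a[0], a[-1]] and [b[0], b[-1]] intersect."""
    return a[0] <= b[-1] and b[0] <= a[-1]


drs1 = ["alice", "joe", "linda", "zach"]
drs2 = ["bob", "jon", "mary", "zeke"]
drs3 = ["carl", "julie", "omar", "zoe"]
compacted = compact([drs1, drs2, drs3], rows_per_output=4)
```

Before compaction every input rowset spans nearly the whole key space, so a lookup may have to consult all three; afterwards any key falls in the range of at most one rowset.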

Raft consensus

Diagram: Client; TS A hosts Tablet 1 (LEADER); TS B and TS C host Tablet 1 (FOLLOWER); each replica has its own WAL.

1a. Client->Leader: Write() RPC
2a. Leader->Followers: UpdateConsensus() RPC
2b. Leader writes local WAL
3. Follower: write WAL (on each follower)
4. Follower->Leader: success
5. Leader has achieved majority
6. Leader->Client: Success!
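The numbered steps above boil down to "acknowledge the client once a majority of replicas have the entry in their WAL". The sketch below shows only that majority rule; the `replicate_write` function and its `reachable` parameter are invented for the example, and it omits terms, elections, retries, and everything else a real Raft implementation needs.

```python
def replicate_write(leader_wal, follower_wals, entry, reachable):
    """Return True when a majority of replicas have logged `entry`.

    follower_wals: follower name -> list acting as that follower's WAL
    reachable: names of followers that respond to this request
    """
    leader_wal.append(entry)              # 2b. leader writes its local WAL
    acks = 1                              # the leader counts toward the majority
    for name, wal in follower_wals.items():
        if name in reachable:             # 2a. UpdateConsensus() reaches this follower
            wal.append(entry)             # 3. follower writes its WAL
            acks += 1                     # 4. follower reports success
    replicas = 1 + len(follower_wals)
    return acks > replicas // 2           # 5. majority reached, so 6. can reply


leader_wal = []
follower_wals = {"B": [], "C": []}
ok_with_one_follower = replicate_write(leader_wal, follower_wals, "row=1", reachable={"B"})
ok_with_no_followers = replicate_write(leader_wal, follower_wals, "row=2", reachable=set())
```

With three replicas, the leader plus one follower is already a majority, so a single slow or failed follower does not block the write.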

Handling inserts and updates

• Inserts go to an in-memory row store (MemRowSet)
  • Durable due to write-ahead logging
  • Later flush to columnar format on disk
• Updates go to in-memory "delta store"
  • Later flush to "delta files" on disk
  • Eventually "compact" into the previously-written columnar data files
• Details elided here due to time constraints
  • Available in other slide decks online, or our academic-style paper
