DataEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
- 1. 1 © Cloudera, Inc. All rights reserved.
Apache Kudu*: Fast Analytics on Fast Data
Todd Lipcon (Kudu team lead) – todd@cloudera.com @tlipcon
Tweet about this talk: @apachekudu or #kudu
* Incubating at the Apache Software Foundation
- 2. 2
Apache Kudu: Storage for Fast Analytics on Fast Data
• New updatable column store for Hadoop
• Apache-licensed open source
• Beta now available
(Diagram: Kudu as a columnar store)
- 4. 4
Current Storage Landscape in Hadoop Ecosystem
HDFS (GFS) excels at:
• Batch ingest only (e.g. hourly)
• Efficiently scanning large amounts of data (analytics)
HBase (BigTable) excels at:
• Efficiently finding and writing individual rows
• Making data mutable
Gaps exist when these properties are needed simultaneously
- 5. 5
Kudu Design Goals
• High throughput for big scans (goal: within 2x of Parquet)
• Low latency for short accesses (goal: 1ms read/write on SSD)
• Database-like semantics (initially single-row ACID)
• Relational data model
  • SQL queries are easy
  • "NoSQL"-style scan/insert/update (Java/C++ client)
- 6. 6
Changing Hardware Landscape
• Spinning disk -> solid state storage
  • NAND flash: up to 450k read / 250k write IOPS, about 2GB/sec read and 1.5GB/sec write throughput, at a price of less than $3/GB and dropping
  • 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
• RAM is cheaper and more abundant: 64 -> 128 -> 256GB over the last few years
• Takeaway: the next bottleneck is CPU, and current storage systems weren't designed with CPU efficiency in mind.
- 8. 8
Scalable and Fast Tabular Storage
• Scalable
  • Tested up to 275 nodes (~3PB cluster)
  • Designed to scale to 1000s of nodes, tens of PBs
• Fast
  • Millions of read/write operations per second across the cluster
  • Multiple GB/second read throughput per node
• Tabular
  • SQL-like schema: finite number of typed columns (unlike HBase/Cassandra)
  • Fast ALTER TABLE
  • "NoSQL" APIs: Java/C++/Python, or SQL (Impala/Spark/etc.)
- 9. 9
Use cases and architectures
- 10. 10
Kudu Use Cases
Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes
● Time Series
  ○ Examples: streaming market data; fraud detection & prevention; network monitoring
  ○ Workload: inserts, updates, scans, lookups
● Online Reporting
  ○ Example: ODS
  ○ Workload: inserts, updates, scans, lookups
- 11. 11
Real-Time Analytics in Hadoop Today
Fraud detection in the real world = storage complexity
Considerations:
● How do I handle failure during this process?
● How often do I reorganize data streaming in into a format appropriate for reporting?
● When reporting, how do I see data that has not yet been reorganized?
● How do I ensure that important jobs aren't interrupted by maintenance?
(Diagram: incoming data from a messaging system lands in HBase (new partition, most recent partition); historic data sits in Parquet files; reporting requests are served by Impala on HDFS. Once enough data has accumulated, the HBase data is reorganized into Parquet: wait for running operations to complete, then define a new Impala partition referencing the newly written Parquet file.)
- 12. 12
Real-Time Analytics in Hadoop with Kudu
Improvements:
● One system to operate
● No cron jobs or background processes
● Handle late arrivals or data corrections with ease
● New data available immediately for analytics or operations
(Diagram: incoming data from a messaging system is stored in Kudu; reporting requests are served directly from historical and real-time data.)
- 13. 13
Xiaomi use case
• World's 4th largest smartphone maker (most popular in China)
• Gathers important RPC tracing events from mobile apps and backend services
• Service monitoring & troubleshooting tool
• High write throughput: >20 billion records/day and growing
• Query latest data with quick response: identify and resolve issues quickly
• Can search for individual records: easy troubleshooting
- 14. 14
Xiaomi Big Data Analytics Pipeline Before Kudu
• Long pipeline, high latency (1 hour ~ 1 day), data conversion pains
• No ordering: log arrival (storage) order is not exactly logical order, e.g. reading 2-3 days of logs to get the data for 1 day
- 15. 15
Xiaomi Big Data Analysis Pipeline Simplified with Kudu
• ETL pipeline (0~10s latency): apps that need to prevent backpressure or require ETL
• Direct pipeline (no latency): apps that don't require ETL and have no backpressure issues
(Diagram: OLAP scan, side-table lookup, result store)
- 16. 16
How it works: replication and fault tolerance
- 17. 17
Tables, Tablets, and Tablet Servers
• A table is horizontally partitioned into tablets
  • Range or hash partitioning
  • PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS
  • bucketNumber = hashCode(row['timestamp']) % 100
• Each tablet has N replicas (3 or 5), with Raft consensus
  • Automatic fault tolerance
  • MTTR: ~5 seconds
• Tablet servers host tablets on local disk drives
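The bucketing rule above can be sketched in Python. This is an illustration only: `bucket_for` is a made-up name, and Python's built-in `hash` stands in for Kudu's real hash over the encoded column bytes.

```python
def bucket_for(row, num_buckets=100):
    """Pick a tablet bucket by hashing the DISTRIBUTE BY column.

    hash() stands in for Kudu's internal hash function; the real
    client hashes the encoded column value."""
    return hash(row["timestamp"]) % num_buckets

row = {"host": "h1", "metric": "cpu", "timestamp": 1442865158}
assert 0 <= bucket_for(row) < 100
# Rows with the same hash column value always land in the same bucket,
# regardless of the other columns:
assert bucket_for(row) == bucket_for(dict(row, metric="io"))
```

Because the bucket depends only on the hashed column, writes for a hot timestamp range spread across all 100 tablets instead of hitting one.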
- 18. 18
Metadata and the Master
• Replicated master
  • Acts as a tablet directory
  • Acts as a catalog (which tables exist, etc.)
  • Acts as a load balancer (tracks TS liveness, re-replicates under-replicated tablets)
• Not a bottleneck: super-fast in-memory lookups
- 19. 19
Client
Client: "Hey Master! Where is the row for 'tlipcon' in table T?"
Master: "It's part of tablet 2, which is on servers {Z,Y,X}. BTW, here's info on other tablets you might care about: T1, T2, T3, …"
(Diagram: the client stores this in its meta cache (T1: …, T2: …, T3: …) and sends UPDATE tlipcon SET col=foo directly to the right tablet server.)
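The client-side meta cache can be sketched as follows. The names (`MetaCache`, `locate`) are illustrative, not Kudu's client API; the point is that one master round trip primes the cache for many subsequent operations.

```python
class MetaCache:
    """Toy client-side tablet location cache.

    The first lookup asks the master; the master answers with the
    requested tablet plus info on other tablets, so later lookups
    are served locally with no master RPC."""
    def __init__(self, master):
        self.master = master   # stands in for the master's directory
        self.cache = {}
        self.master_rpcs = 0

    def locate(self, key):
        if key not in self.cache:
            self.master_rpcs += 1
            # The master returns locations for nearby tablets too.
            self.cache.update(self.master)
        return self.cache[key]

master = {"tlipcon": ("tablet-2", ["Z", "Y", "X"]),
          "doug": ("tablet-1", ["A", "B", "C"])}
mc = MetaCache(master)
assert mc.locate("tlipcon") == ("tablet-2", ["Z", "Y", "X"])
assert mc.locate("doug")[0] == "tablet-1"  # served from cache
assert mc.master_rpcs == 1                 # only one master round trip
```

This is why the master's in-memory lookups stay off the hot path: steady-state reads and writes go straight to tablet servers.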
- 20. 20
How it works: columnar storage
- 21. 21
Columnar storage
Tweet_id:   {25059873, 22309487, 23059861, 23010982}
User_name:  {newsycbot, RideImpala, fastly, llvmorg}
Created_at: {1442865158, 1442828307, 1442865156, 1442865155}
text:       {Visual exp…, Introducing …, Missing July…, LLVM 3.7…}
- 22. 22
Columnar storage
Tweet_id:   {25059873, 22309487, 23059861, 23010982}   1GB
User_name:  {newsycbot, RideImpala, fastly, llvmorg}   2GB
Created_at: {1442865158, 1442828307, 1442865156, 1442865155}   1GB
text:       {Visual exp…, Introducing …, Missing July…, LLVM 3.7…}   200GB

SELECT COUNT(*) FROM tweets WHERE user_name = 'newsycbot';
-> Only read 1 column
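A minimal sketch of why that query is cheap in a columnar layout: each column is its own array, so the predicate touches only `user_name` and the other columns are never read.

```python
# Column-oriented layout: each column is stored as its own array.
table = {
    "tweet_id":   [25059873, 22309487, 23059861, 23010982],
    "user_name":  ["newsycbot", "RideImpala", "fastly", "llvmorg"],
    "created_at": [1442865158, 1442828307, 1442865156, 1442865155],
}

# SELECT COUNT(*) WHERE user_name = 'newsycbot' scans one column;
# the (large) text column is never touched.
count = sum(1 for name in table["user_name"] if name == "newsycbot")
assert count == 1
```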
- 23. 23
Columnar compression
Created_at: {1442865158, 1442828307, 1442865156, 1442865155}

Created_at    Diff(created_at)
1442865158    n/a
1442828307    -36851
1442865156    36849
1442865155    -1
64 bits each  17 bits each

• Many columns can compress to a few bits per row!
• Especially:
  • Timestamps
  • Time series values
  • Low-cardinality strings
• Massive space savings and throughput increase!
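The delta encoding in the table above can be reproduced directly. This is a sketch of the idea, not Kudu's actual on-disk encoder:

```python
created_at = [1442865158, 1442828307, 1442865156, 1442865155]

# Store the first value, then successive differences:
deltas = [b - a for a, b in zip(created_at, created_at[1:])]
assert deltas == [-36851, 36849, -1]

# Each delta fits in 17 bits (16 magnitude bits + 1 sign bit)
# instead of 64:
bits_needed = max(d.bit_length() for d in deltas) + 1
assert bits_needed == 17

# Decoding reverses the process losslessly:
decoded = [created_at[0]]
for d in deltas:
    decoded.append(decoded[-1] + d)
assert decoded == created_at
```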
- 25. 25
Spark DataSource integration (WIP)

sqlContext.load("org.kududb.spark",
    Map("kudu.table" -> "foo",
        "kudu.master" -> "master.example.com"))
  .registerTempTable("mytable")

df = sqlContext.sql(
    "select col_a, col_b from mytable " +
    "where col_c = 123")

Available in Kudu 0.7.0, but still being improved
- 26. 26
Impala integration
• CREATE TABLE … DISTRIBUTE BY HASH(col1) INTO 16 BUCKETS AS SELECT … FROM …
• INSERT/UPDATE/DELETE
• Not an Impala user? The community is working on other integrations (Hive, Drill, Presto, Phoenix)
- 27. 27
MapReduce integration
• Multi-framework cluster (MR + HDFS + Kudu on the same disks)
• KuduTableInputFormat / KuduTableOutputFormat
• Support for pushing predicates, column projections, etc.
- 29. 29
TPC-H (analytics benchmark)
• 75-server cluster
  • 12 (spinning) disks each, enough RAM to fit the dataset
• TPC-H Scale Factor 100 (100GB)
• Example query:
  SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
  FROM customer, orders, lineitem, supplier, nation, region
  WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey
    AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey
    AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey
    AND r_name = 'ASIA' AND o_orderdate >= date '1994-01-01'
    AND o_orderdate < '1995-01-01'
  GROUP BY n_name ORDER BY revenue desc;
- 30. 30
Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
- 31. 31
Versus other NoSQL storage
• Phoenix: SQL layer on HBase
• 10-node cluster (9 workers, 1 master)
• TPC-H LINEITEM table only (6B rows)

Time (sec)   Load    TPCH Q1   COUNT(*)   COUNT(*) WHERE…   single-row lookup
Phoenix      2152    219       76         131               0.04
Kudu         1918    13.2      1.7        0.7               0.15
Parquet      155     9.3       1.4        1.5               1.37
- 32. 32
What about NoSQL-style random access? (YCSB)
• YCSB 0.5.0-snapshot
• 10-node cluster (9 workers, 1 master)
• 100M-row data set
• 10M operations per workload
- 34. 34
Project status
• Open source beta released in September
  • Latest release 0.7.1 hot off the presses
• Usable for many applications (Xiaomi in production)
  • No unrecoverable data loss experienced; reasonably stable (almost no crashes reported). Users testing up to 200 nodes so far.
  • Still requires some expert assistance, and you'll probably find some bugs
• Part of the Apache Software Foundation Incubator
  • Community-driven open source process
- 36. 36
Getting started as a user
• http://getkudu.io
• user@kudu.incubator.apache.org
• http://getkudu-slack.herokuapp.com/
• Quickstart VM
  • Easiest way to get started
  • Impala and Kudu in an easy-to-install VM
• CSD and Parcels
  • For installation on a Cloudera Manager-managed cluster
- 37. 37
Getting started as a developer
• http://github.com/apache/incubator-kudu
• Code reviews: http://gerrit.cloudera.org
• Public JIRA: http://issues.apache.org/jira/browse/KUDU
  • Includes bugs going back to 2013. Come see our dirty laundry!
• Mailing list: dev@kudu.incubator.apache.org
• Apache 2.0-licensed open source
• Contributions are welcome and encouraged!
- 38. 38
http://getkudu.io/
@getkudu
- 40. 40
How it works: write and read paths
- 41. 41
Kudu storage – Inserts and Flushes
(Diagram: an INSERT ("todd", "$1000", "engineer") lands in the in-memory MemRowSet (columns name, pay, role); a flush writes it out as DiskRowSet 1.)
- 42. 42
Kudu storage – Inserts and Flushes
(Diagram: a later INSERT ("doug", "$1B", "Hadoop man") lands in a fresh MemRowSet; the next flush writes it out as DiskRowSet 2, alongside DiskRowSet 1.)
- 43. 43
Kudu storage – Updates
Each DiskRowSet has its own DeltaMemStore to accumulate updates against its base data.
(Diagram: the MemRowSet plus DiskRowSets 1 and 2, each with columnar base data and a DeltaMemStore.)
- 44. 44
Kudu storage – Updates
UPDATE SET pay="$1M" WHERE name="todd"
• Is the row in DiskRowSet 2? (check bloom filters) -> Bloom says: no!
• Is the row in DiskRowSet 1? (check bloom filters) -> Bloom says: maybe!
• Search the key column to find the offset: rowid = 150
• Record in the DeltaMemStore: 150: pay=$1M @ time T
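The bloom-filter short-circuit can be sketched with a toy filter (illustrative code, not Kudu's implementation; SHA-256 stands in for the real hash family). A definite "no" lets a DiskRowSet be skipped entirely; a "maybe" means the key column must actually be searched.

```python
import hashlib

def _h(key, i, m):
    """Deterministic hash number i of key, modulo m bits."""
    digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % m

class Bloom:
    """Toy Bloom filter: no false negatives, rare false positives."""
    def __init__(self, keys, m=256, k=3):
        self.m, self.k = m, k
        self.bits = 0
        for key in keys:
            for i in range(k):
                self.bits |= 1 << _h(key, i, m)

    def maybe_contains(self, key):
        return all(self.bits >> _h(key, i, self.m) & 1
                   for i in range(self.k))

drs1_keys = ["todd", "mike"]   # rows actually in DiskRowSet 1
drs2_keys = ["doug", "jd"]     # rows actually in DiskRowSet 2
drs1, drs2 = Bloom(drs1_keys), Bloom(drs2_keys)

# A filter never misses its own keys, so "todd" gets "maybe!" from
# DiskRowSet 1, and only then is its key column searched for the rowid.
assert drs1.maybe_contains("todd")
# DiskRowSet 2 will almost certainly answer "no" and be skipped.
```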
- 45. 45
Kudu storage – Read path
• Read rows in DiskRowSet 2, then read rows in DiskRowSet 1
• Updates (e.g. 150: pay=$1M @ time T) are applied based on ordinal offset within the DRS: array indexing = fast
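The ordinal-offset application step can be sketched as follows (toy data; the real store applies deltas per column block). The delta is keyed by the row's position, so no key lookup is needed at scan time:

```python
# Columnar base data for one DiskRowSet (the 'pay' column):
base_pay = ["$1000", "$500", "$750"]

# Deltas keyed by ordinal row offset within the rowset,
# e.g. rowid 0: pay=$1M @ time T.
deltas = {0: {"pay": "$1M"}}

# Applying deltas during a scan is plain array indexing:
scanned = list(base_pay)
for rowid, cols in deltas.items():
    if "pay" in cols:
        scanned[rowid] = cols["pay"]

assert scanned == ["$1M", "$500", "$750"]
```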
- 46. 46
Kudu storage – Delta flushes
• The DeltaMemStore (e.g. holding 150: pay=$1M @ time T) is flushed to a REDO DeltaFile
• A REDO delta indicates how to transform between the 'base data' (columnar) and a later version
- 47. 47
Kudu storage – Major delta compaction
• Many deltas accumulate: lots of delta-application work on reads
• Pre-compaction: a DiskRowSet's base data plus several REDO DeltaFiles and a DeltaMemStore
• Compaction merges updates into columns with a high update percentage
• Post-compaction: base data, unmerged REDO deltas, and UNDO deltas
• If a column has few updates, it doesn't need to be rewritten: those deltas are maintained in a new DeltaFile
- 48. 48
Kudu storage – RowSet compactions
Reorganize rows to avoid rowsets with overlapping key ranges:
• Before: DRS 1 (32MB) [PK=alice], [PK=joe], [PK=linda], [PK=zach]; DRS 2 (32MB) [PK=bob], [PK=jon], [PK=mary], [PK=zeke]; DRS 3 (32MB) [PK=carl], [PK=julie], [PK=omar], [PK=zoe]
• After: DRS 4 (32MB) [alice, bob, carl, joe]; DRS 5 (32MB) [jon, julie, linda, mary]; DRS 6 (32MB) [omar, zach, zeke, zoe]
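The reorganization above is essentially a merge-sort followed by a re-split, which can be sketched directly on the slide's example keys (sizes and split points are illustrative):

```python
import heapq

# Three rowsets whose primary-key ranges all overlap:
drs1 = ["alice", "joe", "linda", "zach"]
drs2 = ["bob", "jon", "mary", "zeke"]
drs3 = ["carl", "julie", "omar", "zoe"]

# Compaction merge-sorts the (already sorted) rowsets and re-splits
# the result into fixed-size rowsets with disjoint key ranges:
merged = list(heapq.merge(drs1, drs2, drs3))
out = [merged[i:i + 4] for i in range(0, len(merged), 4)]
assert out == [["alice", "bob", "carl", "joe"],
               ["jon", "julie", "linda", "mary"],
               ["omar", "zach", "zeke", "zoe"]]

# After compaction, a point lookup falls in at most one rowset's range:
ranges = [(rs[0], rs[-1]) for rs in out]
assert sum(lo <= "linda" <= hi for lo, hi in ranges) == 1
```

Before compaction, a lookup for "linda" would have had to consult all three overlapping rowsets; afterwards only one.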
- 49. 49
Raft consensus
Tablet 1 is replicated on tablet servers A (LEADER), B (FOLLOWER), and C (FOLLOWER), each with its own WAL:
1a. Client -> Leader: Write() RPC
2a. Leader -> Followers: UpdateConsensus() RPC
2b. Leader writes its local WAL
3. Each follower writes its WAL
4. Follower -> Leader: success
5. Leader has achieved a majority
6. Leader -> Client: Success!
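The majority rule in that write round can be sketched as a few lines of Python (a sketch of the quorum arithmetic only, not of Raft's terms, log matching, or leader election; `replicate` is an illustrative name):

```python
def replicate(follower_acks):
    """One write round: leader + followers, 3 or 5 replicas total.

    The leader always writes its own WAL; the write commits once a
    majority of all replicas (leader included) has logged it."""
    wal_writes = 1 + sum(follower_acks)       # leader counts itself
    majority = (1 + len(follower_acks)) // 2 + 1
    return wal_writes >= majority

assert replicate([True, True])        # both followers ack: commit
assert replicate([True, False])       # one follower down: still commits
assert not replicate([False, False])  # leader alone is not a majority
```

This is why a 3-replica tablet keeps accepting writes through a single server failure, matching the ~5 second MTTR claim earlier in the deck.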
- 50. 50
Handling inserts and updates
• Inserts go to an in-memory row store (MemRowSet)
  • Durable due to write-ahead logging
  • Later flushed to columnar format on disk
• Updates go to an in-memory "delta store"
  • Later flushed to "delta files" on disk
  • Eventually "compacted" into the previously-written columnar data files
• Details elided here due to time constraints
  • Available in other slide decks online, or our academic-style paper