High concurrency,
Low latency analytics
using Spark/Kudu
Chris George
Who is this guy?
Tech we will talk about:
Kudu
Spark
Spark Job Server
Spark Thrift Server
What was the
problem?
Apache Kudu
History of Kudu
Columnar vs other
types of storage
What if you could update
parquet/ORC easily?
HDFS vs Kudu vs
HBase/Cassandra/xyz
Kudu is purely a storage engine,
accessible through an API
To add SQL queries / more advanced
SQL-like operations:
Impala vs Spark
Kudu Slack Channel
Master and Tablets in
Kudu
Range and Hash
Partitioning
Number of cores =
number of partitions
Partitioning can be on
1+ columns
With composite primary keys, it's important
to filter on the key columns in order
A, B, C
i.e. don't scan for just B if possible,
it will be expensive
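For example, a minimal sketch of both scan shapes, assuming a table with composite primary key (a, b, c) read as df through the Kudu datasource (names and values are illustrative):

// (a, b, c) is the composite primary key of the Kudu table behind df
df.filter(df("a") === 1 && df("b") === 2).show() // fast: filters a leading prefix of the key
df.filter(df("b") === 2).show()                  // expensive: no leading key column, scans every row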
Scans on a tablet are single-threaded,
but you can run 200+ scans
on a tablet concurrently
To find your scale... load up a single tablet
with inserts, updates, and deletes concurrently
until it no longer meets your performance needs
Partitioning is
extremely important
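As a sketch of combining hash and range partitioning with the Java client's CreateTableOptions (column names, bucket count, and replica count are illustrative assumptions):

import org.apache.kudu.client.CreateTableOptions
import scala.collection.JavaConverters._

val options = new CreateTableOptions()
  .addHashPartitions(List("customer_id").asJava, 16)    // hash buckets spread writes across tablets
  .setRangePartitionColumns(List("event_time").asJava)  // range partitioning lets scans prune tablets
  .setNumReplicas(3)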
Kudu clients: Java and C++
Python connectors coming
The Java client loops through
tablets, but not concurrently
But you can code the
multithreading yourself or contribute it
Predicates on any
column
Summary of why
Kudu?
Predicates/Projections on
any column very quickly
at scale
Spark
Spark Datasource api:
Reads CSV
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("cars.csv")
Writes CSV
val selectedData = df.select("year", "model")
selectedData.write
.format("com.databricks.spark.csv")
.option("header", "true")
.save("newcars.csv")
But these are often simplified as:
val parquetDataframe = sqlContext.read.parquet("people.parquet")
parquetDataframe.write.parquet("people.parquet")
I wrote the current version
of the Kudu
Datasource/Spark
Integration
There are limitations
with the datasource api
Save Modes for datasource api:
append, overwrite, ignore, error
append = insert
overwrite = truncate + insert
ignore = create if not exists
error = throw exception
if the data exists
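As a generic illustration of the four modes with the built-in parquet datasource (the path is a placeholder):

import org.apache.spark.sql.SaveMode

df.write.mode(SaveMode.Append).parquet("people.parquet")        // insert
df.write.mode(SaveMode.Overwrite).parquet("people.parquet")     // truncate + insert
df.write.mode(SaveMode.Ignore).parquet("people.parquet")        // create if not exists
df.write.mode(SaveMode.ErrorIfExists).parquet("people.parquet") // throw exception if the data exists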
What if I want to
update? Nope
What about deletes?
Not individually
So how do you support
updates/deletes?
By not using the
datasource api... but I'll talk
more about that in a minute
Immutability of
dataframes
So why use
datasource api?
Because it's smarter
than it appears for reads
Pushdown predicates
and projections
Pushdown predicates:
val df = sqlContext.read
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table"))
  .kudu
df.filter("id >= 5").show()
The datasource has knowledge of what
can be pushed down to the underlying
store and what cannot.
Why am I telling you
this?
Because if you want things to
be fast you need to know
what is not pushed down!
EqualTo
GreaterThan
GreaterThanOrEqual
LessThan
LessThanOrEqual
And
https://github.com/cloudera/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala#L159
Did you notice what's missing?
EqualTo
GreaterThan
GreaterThanOrEqual
LessThan
LessThanOrEqual
And
"OR"
So Spark will use its optimizer to run two
separate Kudu scans for the OR
"IN" support is coming very soon, with nuanced
performance details
btw if you register the dataframe as
a temp table in spark
"select * from someDF where
id>=5" will also do pushdowns
"select * from someDF
where id>=5" will also do
pushdowns
things like select * from someDF where lower(name)="joe"
will pull the entire table into memory,
probably a bad thing
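A small sketch of both cases, assuming df was read through the Kudu datasource (names are placeholders):

df.registerTempTable("someDF")
// id >= 5 maps to a supported predicate, so it is pushed down to Kudu
sqlContext.sql("select * from someDF where id >= 5").show()
// lower(name) has no Kudu equivalent, so every row comes back and Spark filters it
sqlContext.sql("select * from someDF where lower(name) = 'joe'").show()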
Projections will also be pushed down to
Kudu so you're not retrieving the entire row
df.select("id", "name")
select id, name from someDf
Looked at lots of existing
datasources to design
Kudu's
How does kudu do
updates/deletes in spark?
// Use KuduContext to create, delete, or write to Kudu tables
val kuduContext = new KuduContext("kudu.master:7051")

// Insert data
kuduContext.insertRows(df, "test_table")
// Delete data
kuduContext.deleteRows(filteredDF, "test_table")
// Upsert data
kuduContext.upsertRows(df, "test_table")
// Update data
val alteredDF = df.select(df("id"), df("count") + 1)
kuduContext.updateRows(alteredDF, "test_table")
http://kudu.apache.org/docs/developing.html
Upserts are handled server side for performance
Upserts can also be handled through the datasource api:
df.write
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "test_table"))
  .mode("append")
  .kudu
You can also create,
check existence and
delete tables through api
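A hedged sketch of those calls (the table name, schema, key, and partitioning values are illustrative):

import org.apache.kudu.client.CreateTableOptions
import scala.collection.JavaConverters._

if (!kuduContext.tableExists("test_table")) {
  kuduContext.createTable(
    "test_table",
    df.schema,  // reuse the dataframe's schema for the new table
    Seq("id"),  // primary key columns
    new CreateTableOptions().addHashPartitions(List("id").asJava, 4))
}
kuduContext.deleteTable("test_table")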
Additional notes:
Kudu datasource currently works with spark 1.x
Next release it will support both 1.x and 2.x
It's being improved on a regular basis
The number of partitions on the dataframe
corresponds to how many tablets/partitions
match the filter.
Partition scans are parallel and have locality
awareness in spark
Be sure to set spark.locality.wait to
something small for low latency
(3 seconds is the Spark default)
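For example (the 100ms value is an illustrative choice, not a recommendation from the deck):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.locality.wait", "100ms") // default is 3s, far too long for low-latency queries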
Spark Job Server
(SJS)
Created for low
latency jobs on spark
Persistent contexts
reduce the runtime of a hello-world
type of job from 1 second to 10 ms
REST-based api to:
Run Jobs
Create contexts
Check status of job both async/sync
Creating a context calls spark
submit (in separate jvm mode)
Uses akka to communicate
between rest and spark driver
To create a persistent context you need:
cpu cores + memory footprint
name to reference it by
factory to use for the context,
i.e. HiveContextFactory vs SqlContextFactory
Our average job time is
30ms when coming through
api for simpler retrievals
Jobs need to implement an interface
context will be passed in
DON’T CREATE YOUR OWN
SQLCONTEXT!!
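A minimal sketch of such a job, assuming the SparkSqlJob trait from job-server-extras (pre-0.7 job API; names are illustrative):

import com.typesafe.config.Config
import org.apache.spark.sql.SQLContext
import spark.jobserver.{SparkJobValid, SparkJobValidation, SparkSqlJob}

object LowLatencyQueryJob extends SparkSqlJob {
  // Cheap sanity check SJS runs before the job itself
  override def validate(sql: SQLContext, config: Config): SparkJobValidation = SparkJobValid

  // SJS hands you the persistent SQLContext -- use it, never build your own
  override def runJob(sql: SQLContext, config: Config): Any =
    sql.sql("select count(*) from some_registered_table").collect()
}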
Currently only supports
spark 1.x
2.x is coming soonish
Keeps track of job
runtimes in a nice UI along
with additional metrics
You can cache data and it
will be available to later
jobs
You can also load objects and
they are available to later jobs
via NamedObject interface
Persistent contexts can be
run in a separate JVM or
within SJS
It does have some
sharp edges though...
Due to the JVM classloader,
contexts need to be restarted
on deploy to pick up new code
Some settings:
spark.files.overwrite = true
context-per-jvm = true
spray-can: parsing.max-content-length = 256m
spray-can: idle-timeout = 600 s
spray-can: request-timeout = 540 s
spark.serializer = "org.apache.spark.serializer.KryoSerializer"
filedao vs sqldao backend
have to build from source/no binary for SJS
hive-site.xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:memory:myDB;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
</configuration>
Spark Thrift Server
Extended/reused the Hive
thrift server
I run the following on a persistent context:
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

sc.getConf.set("spark.sql.hive.thriftServer.singleSession", "true")
sqlContext.setConf("hive.server2.thrift.port", port) // port to run the thrift server on
HiveThriftServer2.startWithContext(sqlContext)
Now I can connect using
hive-jdbc
odbc (Microsoft or Simba)
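For example, over hive-jdbc (host, port, credentials, and table are placeholders):

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "user", "")
val rs = conn.createStatement().executeQuery("select count(*) from kudu_table")
while (rs.next()) println(rs.getLong(1))
conn.close()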
Run a job with joins, or even just a
basic dataframe, through the
datasource api and
registerTempTable:
val df = sqlContext.read
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table"))
  .kudu
df.registerTempTable("kudu_table")
You could also potentially
cache/persist via spark and
register that way assuming joins
are expensive
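A sketch of that pattern (table and column names are placeholders):

val joined = df.join(otherDF, "id")       // do the expensive join once
joined.persist()                          // keep the result in Spark's cache
joined.registerTempTable("joined_table")  // thrift clients now hit the cached result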
Now you can run queries
as if it was a traditional
database
Hey thats great, but how fast?
500 ms average response time
200 concurrent complex queries
1+ Billion rows with 200+ columns
SQL queries with 5 predicates; min, max, count
on some values; and group by on 5 columns
No spark caching
We take this a step further and do
complex dataframes, made
available as a registered temp
table
Questions… If we run out
of time send me
questions on slack
