High concurrency,
Low latency analytics
using Spark/Kudu
Chris George
Who is this guy?
Tech we will talk about:
Kudu
Spark
Spark Job Server
Spark Thrift Server
What was the
problem?
Apache Kudu
History of Kudu
Columnar vs other
types of storage
What if you could update
parquet/ORC easily?
HDFS vs Kudu vs
HBase/Cassandra/xyz
Kudu is purely a storage engine,
accessible through an API
To add SQL queries / more advanced
SQL-like operations:
Impala vs Spark
Kudu Slack Channel
Master and Tablets in
Kudu
Range and Hash
Partitioning
Number of cores =
number of partitions
Partitioning can be on
1+ columns
With composite primary keys, it's important
to filter on the key columns in order
A, B, C
i.e. don't scan for just B if possible,
it will be expensive
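For example, a minimal sketch of both scan shapes, assuming a table with composite primary key (a, b, c) read as df through the Kudu datasource (names and values are illustrative):

// (a, b, c) is the composite primary key of the Kudu table behind df
df.filter(df("a") === 1 && df("b") === 2).show() // fast: filters a leading prefix of the key
df.filter(df("b") === 2).show()                  // expensive: no leading key column, scans every row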
Scans on a tablet are single-threaded,
but you can run 200+ scans
on a tablet concurrently
To find your scale... load up a single tablet
with inserts, updates, and deletes concurrently
until it no longer meets your performance needs
Partitioning is
extremely important
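As a sketch of combining hash and range partitioning with the Java client's CreateTableOptions (column names, bucket count, and replica count are illustrative assumptions):

import org.apache.kudu.client.CreateTableOptions
import scala.collection.JavaConverters._

val options = new CreateTableOptions()
  .addHashPartitions(List("customer_id").asJava, 16)    // hash buckets spread writes across tablets
  .setRangePartitionColumns(List("event_time").asJava)  // range partitioning lets scans prune tablets
  .setNumReplicas(3)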
Kudu clients: Java and C++
Python connectors coming
The Java client loops through
tablets, but not concurrently
But you can code the
multithreading yourself or contribute it
Predicates on any
column
Summary of why
Kudu?
Predicates/Projections on
any column very quickly
at scale
Spark
Spark Datasource api:
Reads CSV
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("cars.csv")
Writes CSV
val selectedData = df.select("year", "model")
selectedData.write
.format("com.databricks.spark.csv")
.option("header", "true")
.save("newcars.csv")
But these are often simplified as:
val parquetDataframe = sqlContext.read.parquet("people.parquet")
parquetDataframe.write.parquet("people.parquet")
I wrote the current version
of the Kudu
Datasource/Spark
Integration
There are limitations
with the datasource api
Save Modes for datasource api:
append, overwrite, ignore, error
append = insert
overwrite = truncate + insert
ignore = create if not exists
error = throw exception
if the data exists
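As a generic illustration of the four modes with the built-in parquet datasource (the path is a placeholder):

import org.apache.spark.sql.SaveMode

df.write.mode(SaveMode.Append).parquet("people.parquet")        // insert
df.write.mode(SaveMode.Overwrite).parquet("people.parquet")     // truncate + insert
df.write.mode(SaveMode.Ignore).parquet("people.parquet")        // create if not exists
df.write.mode(SaveMode.ErrorIfExists).parquet("people.parquet") // throw exception if the data exists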
What if I want to
update? Nope
What about deletes?
Not individually
So how do you support
updates/deletes?
By not using the
datasource api... but I'll talk
more about that in a minute
Immutability of
dataframes
So why use
datasource api?
Because it's smarter
than it appears for reads
Pushdown predicates
and projections
Pushdown predicates:
val df = sqlContext.read
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table"))
  .kudu
df.filter("id >= 5").show()
The datasource has knowledge of what
can be pushed down to the underlying
store and what cannot.
Why am I telling you
this?
Because if you want things to
be fast you need to know
what is not pushed down!
EqualTo
GreaterThan
GreaterThanOrEqual
LessThan
LessThanOrEqual
And
https://github.com/cloudera/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala#L159
Did you notice what's missing?
EqualTo
GreaterThan
GreaterThanOrEqual
LessThan
LessThanOrEqual
And
"OR"
So Spark will use its optimizer to run two
separate Kudu scans for the OR
"IN" support is coming very soon, with nuanced
performance details
btw if you register the dataframe as
a temp table in spark
"select * from someDF where
id>=5" will also do pushdowns
"select * from someDF
where id>=5" will also do
pushdowns
things like select * from someDF where lower(name)="joe"
will pull the entire table into memory,
probably a bad thing
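A small sketch of both cases, assuming df was read through the Kudu datasource (names are placeholders):

df.registerTempTable("someDF")
// id >= 5 maps to a supported predicate, so it is pushed down to Kudu
sqlContext.sql("select * from someDF where id >= 5").show()
// lower(name) has no Kudu equivalent, so every row comes back and Spark filters it
sqlContext.sql("select * from someDF where lower(name) = 'joe'").show()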
Projections will also be pushed down to
Kudu so you're not retrieving the entire row
df.select("id", "name")
select id, name from someDf
Looked at lots of existing
datasources to design
Kudu's
How does kudu do
updates/deletes in spark?
// Use KuduContext to create, delete, or write to Kudu tables
val kuduContext = new KuduContext("kudu.master:7051")

// Insert data
kuduContext.insertRows(df, "test_table")
// Delete data
kuduContext.deleteRows(filteredDF, "test_table")
// Upsert data
kuduContext.upsertRows(df, "test_table")
// Update data
val alteredDF = df.select(df("id"), df("count") + 1)
kuduContext.updateRows(alteredDF, "test_table")
http://kudu.apache.org/docs/developing.html
Upserts are handled server side for performance
Upserts can also be handled through the datasource api:
df.write
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "test_table"))
  .mode("append")
  .kudu
You can also create,
check existence and
delete tables through api
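A hedged sketch of those calls (the table name, schema, key, and partitioning values are illustrative):

import org.apache.kudu.client.CreateTableOptions
import scala.collection.JavaConverters._

if (!kuduContext.tableExists("test_table")) {
  kuduContext.createTable(
    "test_table",
    df.schema,  // reuse the dataframe's schema for the new table
    Seq("id"),  // primary key columns
    new CreateTableOptions().addHashPartitions(List("id").asJava, 4))
}
kuduContext.deleteTable("test_table")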
Additional notes:
Kudu datasource currently works with spark 1.x
Next release it will support both 1.x and 2.x
It's being improved on a regular basis
The number of partitions on the dataframe
corresponds to how many tablets/partitions
match the filter.
Partition scans are parallel and have locality
awareness in spark
Be sure to set spark.locality.wait to
something small for low latency
(3 seconds is the Spark default)
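For example (the 100ms value is an illustrative choice, not a recommendation from the deck):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.locality.wait", "100ms") // default is 3s, far too long for low-latency queries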
Spark Job Server
(SJS)
Created for low
latency jobs on spark
Persistent contexts
reduce the runtime of a hello-world
type of job from 1 second to 10 ms
REST-based api to:
Run Jobs
Create contexts
Check status of job both async/sync
Creating a context calls spark
submit (in separate jvm mode)
Uses akka to communicate
between rest and spark driver
To create a persistent context you need:
cpu cores + memory footprint
name to reference it by
factory to use for the context,
i.e. HiveContextFactory vs SqlContextFactory
Our average job time is
30ms when coming through
api for simpler retrievals
Jobs need to implement an interface
context will be passed in
DON’T CREATE YOUR OWN
SQLCONTEXT!!
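A minimal sketch of such a job, assuming the SparkSqlJob trait from job-server-extras (pre-0.7 job API; names are illustrative):

import com.typesafe.config.Config
import org.apache.spark.sql.SQLContext
import spark.jobserver.{SparkJobValid, SparkJobValidation, SparkSqlJob}

object LowLatencyQueryJob extends SparkSqlJob {
  // Cheap sanity check SJS runs before the job itself
  override def validate(sql: SQLContext, config: Config): SparkJobValidation = SparkJobValid

  // SJS hands you the persistent SQLContext -- use it, never build your own
  override def runJob(sql: SQLContext, config: Config): Any =
    sql.sql("select count(*) from some_registered_table").collect()
}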
Currently only supports
spark 1.x
2.x is coming soonish
Keeps track of job
runtimes in a nice UI along
with additional metrics
You can cache data and it
will be available to later
jobs
You can also load objects and
they are available to later jobs
via NamedObject interface
Persistent contexts can be
run in a separate JVM or
within SJS
It does have some
sharp edges though...
Due to the JVM classloader,
contexts need to be restarted
on deploy to pick up new code
Some settings:
spark.files.overwrite = true
context-per-jvm = true
spray-can: parsing.max-content-length = 256m
spray-can: idle-timeout = 600 s
spray-can: request-timeout = 540 s
spark.serializer = "org.apache.spark.serializer.KryoSerializer"
filedao vs sqldao backend
have to build from source/no binary for SJS
hive-site.xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:memory:myDB;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
</configuration>
Spark Thrift Server
Extended/reused the Hive
thrift server
I run the following on a persistent context:
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

sc.getConf.set("spark.sql.hive.thriftServer.singleSession", "true")
sqlContext.setConf("hive.server2.thrift.port", port) // port to run the thrift server on
HiveThriftServer2.startWithContext(sqlContext)
Now I can connect using
hive-jdbc
odbc (Microsoft or Simba)
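For example, over hive-jdbc (host, port, credentials, and table are placeholders):

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "user", "")
val rs = conn.createStatement().executeQuery("select count(*) from kudu_table")
while (rs.next()) println(rs.getLong(1))
conn.close()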
Run a job with joins, or even just a
basic dataframe, through the
datasource api and
registerTempTable:
val df = sqlContext.read
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table"))
  .kudu
df.registerTempTable("kudu_table")
You could also potentially
cache/persist via spark and
register that way assuming joins
are expensive
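A sketch of that pattern (table and column names are placeholders):

val joined = df.join(otherDF, "id")       // do the expensive join once
joined.persist()                          // keep the result in Spark's cache
joined.registerTempTable("joined_table")  // thrift clients now hit the cached result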
Now you can run queries
as if it was a traditional
database
Hey thats great, but how fast?
500 ms average response time
200 concurrent complex queries
1+ Billion rows with 200+ columns
SQL queries with 5 predicates; min, max, count
on some values; and group by on 5 columns
No spark caching
We take this a step further and do
complex dataframes, made
available as a registered temp
table
Questions… If we run out
of time send me
questions on slack
