TensorFrames:
Google TensorFlow on
Apache Spark
Tim Hunter
Meetup 08/2016 - Salesforce
How familiar are you with Spark?
1. What is Apache Spark?
2. I have used Spark
3. I am using Spark in production or I contribute to its development
2
How familiar are you with TensorFlow?
1. What is TensorFlow?
2. I have heard about it
3. I am training my own neural networks
3
About Databricks
Founded by the team who created Apache Spark
Offers a hosted service:
- Apache Spark in the cloud
- Notebooks
- Cluster management
- Production environment
4
About me
Software engineer at Databricks
Apache Spark contributor
Ph.D. in Machine Learning, UC Berkeley
(and Spark user since Spark 0.5)
5
Outline
• Numerical computing with Apache Spark
• Using GPUs with Spark and TensorFlow
• Performance details
• The future
6
Numerical computing for Data Science
• Queries are data-heavy
• However, algorithms are computation-heavy
• They operate on simple data types: integers, floats, doubles, vectors, matrices
7
The case for speed
• Numerical bottlenecks are good targets for optimization
• Let data scientists get faster results
• Faster turnaround for experimentation
• How can we run these numerical algorithms faster?
8
Evolution of computing power
9
[Diagram: hardware options, from scale-out clusters ("failure is not an option: it is a fact") to scale-up machines, GPGPUs, and dedicated chips ("when you can afford your dedicated chip")]
Evolution of computing power
10
[Diagram: software frameworks mapped onto that hardware landscape, e.g. NLTK and Theano; today's talk: Spark + TensorFlow]
Evolution of computing power
• Processor speed cannot keep up with memory and network improvements
• Access to the processor is the new bottleneck
• Project Tungsten in Spark: leverage the processor's heuristics for executing code and fetching memory
• Does not account for the fact that the problem is numerical
11
Asynchronous vs. synchronous
• Asynchronous algorithms perform updates concurrently
• Spark uses a synchronous model; deep learning frameworks are usually asynchronous
• A large number of ML computations are synchronous
• Even deep learning may benefit from synchronous updates
12
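To make the synchronous/asynchronous distinction concrete, here is a toy Python sketch (not from the deck; all names are illustrative) of the two update styles on a trivial quadratic loss:

import random

def grad(w, x):
    return w - x  # gradient of (w - x)^2 / 2

data = [random.gauss(1.0, 0.1) for _ in range(1000)]
workers = [data[i::4] for i in range(4)]  # 4 partitions

# Synchronous (Spark-like): a barrier per iteration; every worker
# computes against the same w, and one averaged update is applied.
w = 0.0
for _ in range(100):
    grads = [sum(grad(w, x) for x in p) / len(p) for p in workers]
    w -= 0.1 * sum(grads) / len(grads)

# Asynchronous (parameter-server-like): each worker applies its update
# as soon as it finishes, possibly against a stale w; no barrier.
w_async = 0.0
for _ in range(100):
    for p in workers:  # imagine these iterations interleaving across machines
        w_async -= 0.1 * sum(grad(w_async, x) for x in p) / len(p)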
Outline
• Numerical computing with Apache Spark
• Using GPUs with Spark and TensorFlow
• Performance details
• The future
13
GPGPUs
14
• Graphics Processing Units for General Purpose computations
[Bar charts: theoretical peak throughput and theoretical peak bandwidth, GPU vs. CPU]
Google TensorFlow
• Library for writing “machine intelligence” algorithms
• Very popular for deep learning and neural networks
• Can also be used for general purpose numerical computations
• Interface in C++ and Python
15
Numerical dataflow with TensorFlow
16
import tensorflow as tf

x = tf.placeholder(tf.int32, name="x")
y = tf.placeholder(tf.int32, name="y")
output = tf.add(x, 3 * y, name="z")
session = tf.Session()
output_value = session.run(output, {x: 3, y: 5})  # 3 + 3 * 5 = 18
[Graph: placeholders x (int32) and y (int32); y flows through "mul 3", then joins x at the add node z]
Numerical dataflow with Spark
import tensorflow as tf
import tensorframes as tfs

df = sqlContext.createDataFrame(…)
x = tf.placeholder(tf.int32, name="x")
y = tf.placeholder(tf.int32, name="y")
output = tf.add(x, 3 * y, name="z")
output_df = tfs.map_rows(output, df)
output_df.collect()
df: DataFrame[x: int, y: int]
output_df: DataFrame[x: int, y: int, z: int]
[Graph: columns x (int32) and y (int32); y flows through "mul 3", then joins x at the add node z]
Demo
18
Outline
• Numerical computing with Apache Spark
• Using GPUs with Spark and TensorFlow
• Performance details
• The future
19
20
It is a communication problem
[Diagram: each row travels Tungsten binary format → Java object → Python pickle inside the Spark worker process, over to a separate Python worker process, then Python pickle → C++ buffer]
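For contrast, a minimal sketch (not from the deck) of the plain Python UDF path: every row makes the full pickle round-trip pictured above.

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

df2 = sqlContext.createDataFrame([(1, 2), (3, 4)], ["x", "y"])
add3 = udf(lambda x, y: x + 3 * y, IntegerType())  # executes in a Python worker
df2.withColumn("z", add3(df2.x, df2.y)).collect()  # rows pickled out and back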
21
TensorFrames: native embedding of TensorFlow
[Diagram: everything stays inside the Spark worker process: Tungsten binary format → Java object → C++ buffer, with no Python round-trip]
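The same computation with TensorFrames stays in that single process. A sketch built from the tfs calls shown on the earlier slide:

import tensorflow as tf
import tensorframes as tfs

df2 = sqlContext.createDataFrame([(1, 2), (3, 4)], ["x", "y"])
x = tf.placeholder(tf.int32, name="x")  # bound by name to column "x"
y = tf.placeholder(tf.int32, name="y")
z = tf.add(x, 3 * y, name="z")
tfs.map_rows(z, df2).collect()  # no Python worker, no per-row pickling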
An example: kernel density scoring
• Estimation of a distribution from samples
• Non-parametric
• Unknown bandwidth parameter
• Can be evaluated with goodness of fit
22
An example: kernel density scoring
• In practice, compute:
score(x) = log( (1 / (N·b)) Σₖ exp( −(x − z_k)² / (2b²) ) )
with the numerically stable log-sum-exp form:
score(x) = m − log(N·b) + log Σₖ exp(d_k − m),  where d_k = −(x − z_k)² / (2b²) and m = maxₖ d_k
• In a nutshell: a complex numerical function
23
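A minimal NumPy sketch of that score (not from the deck; z holds the samples z_k, b the bandwidth) may make the log-sum-exp shift clearer:

import numpy as np

def kde_log_score(x, z, b):
    # log((1 / (N b)) * sum_k exp(-(x - z_k)^2 / (2 b^2))),
    # shifted by the largest exponent so nothing overflows
    d = -(x - z) ** 2 / (2 * b * b)
    m = d.max()
    return m - np.log(b * len(z)) + np.log(np.exp(d - m).sum())

z = np.random.normal(size=1000)  # samples defining the density estimate
print(kde_log_score(0.0, z, b=0.5))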
24
Speedup
[Bar chart: runtime in seconds (0–180) for Scala UDF, Scala UDF (optimized), TensorFrames, TensorFrames + GPU]
def score(x: Double): Double = {
  // Distance term to each sample point z_k, scaled by the bandwidth b
  val dis = points.map { z_k =>
    - (x - z_k) * (x - z_k) / (2 * b * b)
  }
  // Shifted log-sum-exp of the terms
  val minDis = dis.min
  val exps = dis.map(d => math.exp(d - minDis))
  minDis - math.log(b * N) + math.log(exps.sum)
}
val scoreUDF = sqlContext.udf.register("scoreUDF", score _)
sql("select sum(scoreUDF(sample)) from samples").collect()
25
Speedup
[Bar chart: runtime in seconds (0–180) for Scala UDF, Scala UDF (optimized), TensorFrames, TensorFrames + GPU]
def score(x: Double): Double = {
  // Manual while-loops over a preallocated array avoid closure and boxing overhead
  val dis = new Array[Double](N)
  var idx = 0
  while (idx < N) {
    val z_k = points(idx)
    dis(idx) = - (x - z_k) * (x - z_k) / (2 * b * b)
    idx += 1
  }
  val minDis = dis.min
  var expSum = 0.0
  idx = 0
  while (idx < N) {
    expSum += math.exp(dis(idx) - minDis)
    idx += 1
  }
  minDis - math.log(b * N) + math.log(expSum)
}
val scoreUDF = sqlContext.udf.register("scoreUDF", score _)
sql("select sum(scoreUDF(sample)) from samples").collect()
26
Speedup
[Bar chart: runtime in seconds (0–180) for Scala UDF, Scala UDF (optimized), TensorFrames, TensorFrames + GPU]
# X holds the sample points z_k and N their count; ops come from TensorFlow 1.x
from tensorflow import constant, device, exp, identity, log, reduce_max, reduce_sum, square
import tensorframes as tfs

def cost_fun(block, bandwidth):
    # Same kernel density score, computed on a whole block of rows at once
    distances = - square(constant(X) - block) / (2 * bandwidth * bandwidth)
    m = reduce_max(distances, 0)
    x = log(reduce_sum(exp(distances - m), 0))
    return identity(x + m - log(bandwidth * N), name="score")

sample = tfs.block(df, "sample")
score = cost_fun(sample, bandwidth=0.5)
df.agg(sum(tfs.map_blocks(score, df))).collect()
27
Speedup
[Bar chart: runtime in seconds (0–180) for Scala UDF, Scala UDF (optimized), TensorFrames, TensorFrames + GPU]
def cost_fun(block, bandwidth):
    distances = - square(constant(X) - block) / (2 * bandwidth * bandwidth)
    m = reduce_max(distances, 0)
    x = log(reduce_sum(exp(distances - m), 0))
    return identity(x + m - log(bandwidth * N), name="score")

# Same computation, with the graph pinned to the GPU
with device("/gpu"):
    sample = tfs.block(df, "sample")
    score = cost_fun(sample, bandwidth=0.5)
df.agg(sum(tfs.map_blocks(score, df))).collect()
Demo: Deep dreams
28
Demo: Deep dreams
29
Outline
• Numerical computing with Apache Spark
• Using GPUs with Spark and TensorFlow
• Performance details
• The future
30
31
Improving communication
[Diagram: inside the Spark worker process, a direct memory copy and columnar storage would connect the Tungsten binary format straight to the C++ buffer, bypassing the Java object step]
The future
• Integration with Tungsten:
  • Direct memory copy
  • Columnar storage
• Better integration with MLlib data types
• GPU instances in Databricks: official support coming this fall
32
Recap
• Spark: an efficient framework for running computations on thousands of computers
• TensorFlow: high-performance numerical framework
• Get the best of both with TensorFrames:
  • Simple API for distributed numerical computing
  • Can leverage the hardware of the cluster
33
Try these demos yourself
• TensorFrames source code and documentation:
github.com/databricks/tensorframes
spark-packages.org/package/databricks/tensorframes
• Demo notebooks available on Databricks
• The official TensorFlow website:
www.tensorflow.org
34
Spark Summit EU 2016
15% Discount Code: DatabricksEU16
35
Thank you.


Editor's Notes

  1. Explain that TensorFlow is a library for deep learning
  2. List a few algorithms: deep learning, clustering, classification, etc. Business logic and analysis are usually more concerned with complex structures: text, lists, associations like dictionaries. The bread and butter of data science can be told in three words: integers, floats, and doubles. Slicing and dicing data means matrices, vectors, reals.
  3. Not everybody is a Fortran or C++ programmer. There is considerable friction in writing optimized algorithms. How can we lower the barrier?
  4. Scale up or scale out: you have two options, better computers or more computers. The holy grail: a large number of specialized processors.
  5. For all these configurations of hardware, there are even more frameworks and libraries to access them, each with strengths and weaknesses: the classics for single-machine use; the distributed frameworks (Spark, Mahout, MapReduce); the libraries that target specialized hardware (CUDA and OpenCL for parallel programming); and, in the middle, MPI, which is hard to program and not very resilient to hardware failures. On top of these, frameworks for deep learning and computer vision have been built in recent years. The trend is to have multiple graphics cards communicate.
  6. MLlib has KDE, but how about making it work for other data types like floats, or other kernels?
  7. My Ph.D. adviser used to tell me that you always have to include one equation to show that you mean serious business.
  8. Do not dwell on the term UDF; simply say you can wrap a Scala function inside the SQL engine. A UDF is a Scala function that you can run inside a SQL query.
  9. Start from the login/home page, disable the debug menu, and go more slowly for the demo.