Introduction to Spark 
Glenn K. Lockwood, San Diego Supercomputer Center
July 2014 
Outline
I. Hadoop/MapReduce Recap and Limitations
II. Complex Workflows and RDDs 
III. The Spark Framework 
IV. Spark on Gordon 
V. Practical Limitations of Spark 
Map/Reduce Parallelism 
[Figure: input data split into blocks, each block processed by its own task (task 0 through task 5)]
Magic of HDFS 

Hadoop Workflow 
MapReduce Disk Spill
1. Map – convert raw input into key/value pairs. Output to local disk ("spill")
2. Shuffle/Sort – All reducers retrieve all spilled records from all mappers over the network
3. Reduce – For each unique key, do something with all the corresponding values. Output to HDFS
[Figure: three Map tasks feeding three Reduce tasks through Shuffle/Sort]
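The three steps above can be sketched in plain Python on a single machine. This is only an illustration of the data flow, not Hadoop code: the function names (map_phase, shuffle_sort, reduce_phase, word_count) are made up for this sketch, and nothing is spilled to disk or sent over a network.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: turn each raw input line into (word, 1) key/value pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_sort(pairs):
    """Shuffle/Sort: gather all values for each key, keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(grouped):
    """Reduce: for each unique key, combine all of its values."""
    return [(key, sum(values)) for key, values in grouped]

def word_count(lines):
    return reduce_phase(shuffle_sort(map_phase(lines)))

counts = word_count(["call me ishmael", "call me maybe"])
```

In real MapReduce the output of map_phase lands on each mapper's local disk, and shuffle_sort is where reducers fetch it over the network.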
MapReduce: Two Fundamental Limitations
1. MapReduce prescribes workflow.
• You map, then you reduce.
• You cannot reduce, then map...
• ...or anything else. See first point.
2. Full* data dump to disk between workflow steps.
• Mappers deliver output on local disk (mapred.local.dir)
• Reducers pull input over network from other nodes' local disks
• Output goes right back to local disks via HDFS
* Combiners do local reductions to prevent a full, unreduced dump of data to local disk
Beyond MapReduce 
• What if workflow could be arbitrary in length? 
• map-map-reduce 
• reduce-map-reduce 
• What if higher-level map/reduce operations 
could be applied? 
• sampling or filtering of a large dataset 
• mean and variance of a dataset 
• sum/subtract all elements of a dataset 
• SQL JOIN operator 

Beyond MapReduce: Complex Workflows
• What if workflow could be arbitrary in length? 
• map-map-reduce 
• reduce-map-reduce 
How can you do this without flushing intermediate 
results to disk after every operation? 
• What if higher-level map/reduce operations 
could be applied? 
• sampling or filtering of a large dataset 
• mean and variance of a dataset 
• sum/subtract all elements of a dataset 
• SQL JOIN operator 
How can you ensure fault tolerance for all of these 
baked-in operations? 
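As one example of a higher-level operation built from map and reduce, the mean and variance of a dataset can be computed in a single pass. The sketch below uses plain Python rather than any Hadoop or Spark API, and the helper names (to_moments, combine, mean_and_variance) are assumptions made for illustration.

```python
from functools import reduce

def to_moments(x):
    """Map step: turn one element into a (count, sum, sum of squares) triple."""
    return (1, x, x * x)

def combine(a, b):
    """Reduce step: add two partial triples element by element."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_and_variance(data):
    n, total, total_sq = reduce(combine, map(to_moments, data))
    mean = total / float(n)
    variance = total_sq / float(n) - mean * mean  # population variance
    return mean, variance

mean, variance = mean_and_variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Because combine is associative and commutative, partial triples can be reduced independently on each chunk of data and merged afterwards, which is what makes this kind of operation easy to distribute.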
MapReduce Fault Tolerance
Mapper Failure: 
1. Re-run map task 
and spill to disk 
2. Block until finished 
3. Reducers proceed 
as normal 
Reducer Failure: 
1. Re-fetch spills from 
all mappers' disks 
2. Re-run reducer task
Performing Complex Workflows 
How can you do complex workflows without
flushing intermediate results to disk after every operation?
1. Cache intermediate results in-memory 
2. Allow users to specify persistence in memory and 
partitioning of dataset across nodes 
How can you ensure fault tolerance? 
1. Coarse-grained atomicity via partitions (transform 
chunks of data, not record-by-record) 
2. Use transaction logging--forget replication 
Resilient Distributed Dataset (RDD) 
• Comprised of distributed, atomic partitions of elements 
• Apply transformations to generate new RDDs 
• RDDs are immutable (read-only) 
• RDDs can only be created from persistent storage (e.g., 
HDFS, POSIX, S3) or by transforming other RDDs 
# Create an RDD from a file on HDFS 
text = sc.textFile('hdfs://master.ibnet0/user/glock/mobydick.txt') 
# Transform the RDD of lines into an RDD of words (one word per element) 
words = text.flatMap( lambda line: line.split() ) 
# Transform the RDD of words into an RDD of key/value pairs 
keyvals = words.map( lambda word: (word, 1) )
sc is a SparkContext object that describes our Spark cluster 
lambda declares a "lambda function" in Python (an anonymous function in Perl and other languages) 
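The difference between map and flatMap is easiest to see on ordinary Python lists. The sketch below is not Spark code: flat_map is a hypothetical helper that mimics what flatMap does to the elements of an RDD, and the two input lines are made-up sample text.

```python
from itertools import chain

def flat_map(func, elements):
    """Apply func to each element and flatten the resulting lists into one."""
    return list(chain.from_iterable(func(e) for e in elements))

text = ["Call me Ishmael", "Some years ago"]

# map keeps one output element per input element: a list of words per line
nested = [line.split() for line in text]

# flatMap flattens those lists: one word per element
words = flat_map(lambda line: line.split(), text)

# a second map turns each word into a (word, 1) key/value pair
keyvals = [(word, 1) for word in words]
```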

Potential RDD Workflow 
RDD Transformation vs. Action 
• Transformations are lazy: nothing actually happens when 
this code is evaluated 
• RDDs are computed only when an action is called on 
them, e.g., 
• Calculate statistics over the elements of an RDD (count, mean) 
• Save the RDD to a file (saveAsTextFile) 
• Reduce elements of an RDD into a single object or value (reduce) 
• Allows you to define partitioning/caching behavior after 
defining the RDD but before calculating its contents 
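Lazy evaluation can be imitated with a small class that only records transformations and runs them when an action is called. This is a toy model in plain Python, assuming nothing about Spark's internals; LazyDataset and noisy_square are names invented for this sketch.

```python
class LazyDataset(object):
    """Toy stand-in for an RDD: records transformations, runs them on demand."""

    def __init__(self, source, steps=()):
        self._source = source        # the original data (a list here)
        self._steps = tuple(steps)   # recorded transformations, in order

    def map(self, func):
        # Transformation: returns a new dataset; nothing is computed yet
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, func):
        return LazyDataset(self._source, self._steps + (("filter", func),))

    def _compute(self):
        data = iter(self._source)
        for kind, func in self._steps:
            data = map(func, data) if kind == "map" else filter(func, data)
        return data

    def count(self):
        # Action: only now is the recorded pipeline actually executed
        return sum(1 for _ in self._compute())

    def collect(self):
        return list(self._compute())

calls = []

def noisy_square(x):
    calls.append(x)   # record that work was really done
    return x * x

squares = LazyDataset([1, 2, 3, 4]).map(noisy_square)
evens = squares.filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # no transformation has run yet
result = evens.collect()           # the action triggers evaluation
```

Each transformation returns a new object and leaves the old one untouched, which mirrors the immutability of RDDs.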
RDD Transformation vs. Action 
• Must insert an action here to get pipeline to execute. 
• Actions create files or objects: 
# The saveAsTextFile action dumps the contents of an RDD to disk 
>>> rdd.saveAsTextFile('hdfs://master.ibnet0/user/glock/output.txt') 
# The count action returns the number of elements in an RDD 
>>> num_elements = rdd.count(); num_elements; type(num_elements)
215136
<type 'int'>
Resiliency: The 'R' in 'RDD' 
• No replication of in-memory data 
• Restrict transformations to coarse granularity 
• Partition-level operations simplify data lineage

Resiliency: The 'R' in 'RDD' 
• Reconstruct missing data from its lineage 
• Data in RDDs are deterministic since partitions 
are immutable and atomic 
Resiliency: The 'R' in 'RDD' 
• Long lineages or complex interactions 
(reductions, shuffles) can be checkpointed 
• RDD immutability → nonblocking (background) checkpointing
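Rebuilding lost data from lineage, rather than from a replica, can be sketched as follows. This is a simplified model in plain Python that only covers partition-by-partition transformations (no shuffles); the Partitioned class and its methods are hypothetical and do not correspond to Spark's API.

```python
class Partitioned(object):
    """Toy dataset: a set of partitions plus the lineage that produced them."""

    def __init__(self, parent, func, source_partitions=None):
        self.parent = parent             # dataset this one was derived from
        self.func = func                 # per-partition transformation
        self.source = source_partitions  # only set for the root dataset
        self.cache = {}                  # partition index -> computed list

    def map_partitions(self, func):
        return Partitioned(self, func)

    def compute(self, index):
        """Return partition `index`, rebuilding it from lineage if missing."""
        if index not in self.cache:
            if self.parent is None:
                data = list(self.source[index])  # re-read persistent storage
            else:
                data = self.func(self.parent.compute(index))
            self.cache[index] = data
        return self.cache[index]

    def lose(self, index):
        """Simulate a node failure that drops one in-memory partition."""
        self.cache.pop(index, None)

root = Partitioned(None, None, source_partitions=[[1, 2], [3, 4], [5, 6]])
doubled = root.map_partitions(lambda part: [2 * x for x in part])
before = [doubled.compute(i) for i in range(3)]
doubled.lose(1)                # partition 1 disappears
rebuilt = doubled.compute(1)   # recomputed from its parent; others untouched
```

Only the missing partition is recomputed, and the result is the same as before because the parent partition and the transformation are both fixed.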
Spark Framework 
• Master/worker Model 
• Spark Master is analogous to Hadoop JobTracker (MRv1)
or Application Master (MRv2)
• Spark Worker is analogous to Hadoop TaskTracker
• Relies on "3rd party" storage for RDD generation 
(hdfs://, s3n://, file://, http://) 
• Spark clusters take three forms: 
• Standalone mode - workers communicate directly with 
master via spark://master:7077 URI 
• Mesos - mesos://master:5050 URI 
• YARN - no HA; complicated job launch 

Spark on Gordon: Configuration 
1. Standalone mode is the simplest configuration 
and execution model (similar to MRv1) 
2. Leverage existing HDFS support in myHadoop 
for storage 
3. Combine #1 and #2 to extend myHadoop to 
support Spark: 
$ export HADOOP_CONF_DIR=/home/glock/hadoop.conf
$ ...
myHadoop: Enabling experimental Spark support
myHadoop: Using SPARK_CONF_DIR=/home/glock/hadoop.conf/spark
myHadoop: To use Spark, you will want to type the following commands:"
source /home/glock/hadoop.conf/spark/ 
myspark start 
Spark on Gordon: Storage 
• Spark can use HDFS 
$ # after you run, of course 
$ pyspark 
>>> mydata = sc.textFile('hdfs://localhost:54310/user/glock/mydata.txt') 
>>> mydata.count()
982394
• Spark can use POSIX file systems too 
$ pyspark 
>>> mydata = sc.textFile('file:///oasis/scratch/glock/temp_project/mydata.txt') 
>>> mydata.count()
982394
• S3 Native (s3n://) and HTTP (http://) also work 
• file:// input will be served in chunks to Spark 
workers via the Spark driver's built-in httpd 
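It can help to see how storage URIs like the ones above break down into scheme, host, port, and path. The sketch below uses Python's standard-library URL parser as a stand-in; how Spark and Hadoop parse URIs internally is not shown here, and describe is a hypothetical helper.

```python
try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2

def describe(uri):
    """Split a storage URI into (scheme, host, port, path)."""
    parts = urlparse(uri)
    return (parts.scheme, parts.hostname, parts.port, parts.path)

# HDFS URI with an explicit namenode host and port
hdfs = describe('hdfs://localhost:54310/user/glock/mydata.txt')

# POSIX path: three slashes means an empty host followed by an absolute path
local = describe('file:///oasis/scratch/glock/temp_project/mydata.txt')

# Without a host:port part, the first path component is read as the host
no_host = describe('hdfs://user/glock/output.dir/')
```

The last case is a common slip: with only two slashes after the scheme, the first path component is taken as the host name.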
Spark on Gordon: Running 
Spark treats several languages as first-class citizens:
Feature      Scala  Java  Python
Interactive  YES    NO    YES
Shark (SQL)  YES    YES   YES
Streaming    YES    YES   NO
MLlib        YES    YES   YES
GraphX       YES    YES   NO
R is a second-class citizen; basic RDD API is 
available outside of CRAN 
myHadoop/Spark on Gordon (1/2) 
#!/bin/bash
#PBS -l nodes=2:ppn=16:native:flash
#PBS -l walltime=00:30:00 
#PBS -q normal 
### Environment setup for Hadoop 
export MODULEPATH=/home/glock/apps/modulefiles:$MODULEPATH 
module load hadoop/2.2.0 
export HADOOP_CONF_DIR=$HOME/mycluster.conf 
### Start HDFS. Starting YARN isn't necessary since Spark will be running in 
### standalone mode on our cluster. 
### Load in the necessary Spark environment variables 
source $HADOOP_CONF_DIR/spark/ 
### Start the Spark masters and workers. Do NOT use the provided 
### by Spark, as they do not correctly honor $SPARK_CONF_DIR 
myspark start 

myHadoop/Spark on Gordon (2/2) 
### Run our example problem. 
### Step 1. Load data into HDFS (Hadoop 2.x does not make the user's HDFS home 
### dir by default which is different from Hadoop 1.x!) 
hdfs dfs -mkdir -p /user/$USER 
hdfs dfs -put /home/glock/hadoop/run/gutenberg.txt /user/$USER/gutenberg.txt 
### Step 2. Run our Python Spark job. Note that Spark implicitly requires
### Python 2.6 (some features, like MLlib, require 2.7)
module load python scipy
/home/glock/hadoop/run/
### Step 3. Copy output back out 
hdfs dfs -get /user/$USER/output.dir $PBS_O_WORKDIR/ 
### Shut down Spark and HDFS 
myspark stop 
### Clean up 
Wordcount submit script and Python code online:
Major Problems with Spark 
1. Still smells like a CS project 
2. Debugging is a dark art 
3. Not battle-tested at scale 
#1: Spark Smells Like CS 
• Components are constantly breaking 
• Graph.partitionBy broken in 1.0.0 (SPARK-1931) 
• Some components never worked 
• SPARK_CONF_DIR ( doesn't work (SPARK-2058) 
• doesn't work 
• Spark with YARN will break with large data sets (SPARK-2398) 
• spark-submit for standalone mode doesn't work (SPARK-2260) 

#1: Spark Smells Like CS 
• Really obvious usability issues: 
>>> data = sc.textFile('file:///oasis/scratch/glock/temp_project/gutenberg.txt') 
>>> data.saveAsTextFile('hdfs://gcn-8-42.ibnet0:54310/user/glock/output.dir') 
14/04/30 16:23:07 ERROR Executor: Exception in task ID 19 
scala.MatchError: 0 (of class java.lang.Integer) 
at org.apache.spark.api.python.PythonRDD$$anon$ 
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153) 
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96) 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) 
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49) 
at org.apache.spark.executor.Executor$ 
at java.util.concurrent.ThreadPoolExecutor.runWorker( 
at java.util.concurrent.ThreadPoolExecutor$ 
Read an RDD, then write it out = unhandled exception with 
cryptic Scala errors from Python (SPARK-1690)
#2: Debugging is a Dark Art 
>>> data.saveAsTextFile('hdfs://s12ib:54310/user/glock/gutenberg.out') 
Traceback (most recent call last): 
File "<stdin>", line 1, in <module> 
File "/N/u/glock/apps/spark-0.9.0/python/pyspark/", line 682, in saveAsTextFile
File "/N/u/glock/apps/spark-0.9.0/python/lib/py4j-0.8.1-", line 537, in __call__ 
File "/N/u/glock/apps/spark-0.9.0/python/lib/", 
line 300, in get_return_value 
py4j.protocol.Py4JJavaError: An error occurred while calling o23.saveAsTextFile. 
: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with 
client version 4 
at org.apache.hadoop.ipc.RPC$Invoker.invoke( 
at $Proxy7.getProtocolVersion(Unknown Source) 
at org.apache.hadoop.ipc.RPC.getProxy( 
at org.apache.hadoop.ipc.RPC.getProxy( 
Cause: Spark built against Hadoop 2 DFS trying to access data 
on Hadoop 1 DFS 
#2: Debugging is a Dark Art 
>>> data.count() 
14/04/30 16:15:11 ERROR Executor: Exception in task ID 12 
org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/", 
serializer.dump_stream(func(split_index, iterator), outfile) 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers. 
self.serializer.dump_stream(self._batched(iterator), stream) 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers. 
for obj in iterator: 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers. 
for item in iterator: 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/", lin 
if acc is None: 
TypeError: an integer is required 
at org.apache.spark.api.python.PythonRDD$$anon$ 
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153) 
Cause: Master was using Python 2.6, but workers were only 
able to find Python 2.4 
#2: Debugging is a Dark Art 
>>> data.saveAsTextFile('hdfs://user/glock/output.dir/') 
14/04/30 17:53:20 WARN scheduler.TaskSetManager: Loss was due to org.apache.spark.api.p 
org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/", 
serializer.dump_stream(func(split_index, iterator), outfile) 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/serializers. 
for obj in iterator: 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/", lin 
if not isinstance(x, basestring): 
SystemError: unknown opcode 
at org.apache.spark.api.python.PythonRDD$$anon$ 
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153) 
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96) 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) 
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) 
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) 
Cause: Master was using Python 2.6, but workers were only 
able to find Python 2.4 

#2: Spark Debugging Tips 
• $SPARK_LOG_DIR/app-* contains master/worker 
logs with failure information 
• Try to find the salient error amidst the stack traces 
• Google that error--odds are, it is a known issue 
• Stick any required environment variables ($PATH,
$PYTHONPATH, $JAVA_HOME) in $SPARK_CONF_DIR/ to rule out
these problems
• If all else fails, look at the Spark source code
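Since two of the failures above came down to workers finding an older Python than the master, one cheap check is to run a small version report on every node before starting Spark. The script below is a sketch with hypothetical function names; how it gets launched on each worker (ssh, the batch system, and so on) is left open.

```python
import sys

def python_is_at_least(minimum, version=None):
    """Return True if `version` (default: this interpreter) is >= `minimum`."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= tuple(minimum)

def version_report(minimum=(2, 6)):
    """One-line report suitable for printing on every node."""
    status = "OK" if python_is_at_least(minimum) else "TOO OLD"
    return "%s python %d.%d: %s" % (
        sys.executable, sys.version_info[0], sys.version_info[1], status)

if __name__ == "__main__":
    print(version_report())
```

Comparing the reports from the master and the workers shows quickly whether every node resolves the same interpreter through $PATH.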
#3: Spark Isn't Battle Tested 
• Companies (Cloudera, SAP, etc) jumping on the 
Spark bandwagon with disclaimers about scaling 
• Spark does not handle multitenancy well at all. 
Wait scheduling is considered best way to achieve 
memory/disk data locality 
• Largest Spark clusters ~ hundreds of nodes 
Spark Take-Aways
• Facts
• Data is represented as resilient distributed datasets 
(RDDs) which remain in-memory and read-only 
• RDDs are comprised of elements 
• Elements are distributed across physical nodes in user-defined 
groups called partitions 
• RDDs are subject to transformations and actions 
• Fault tolerance achieved by lineage, not replication 
• Opinions 
• Spark is still in its infancy but its progress is promising 
• Good for evaluating--good for Gordon, Comet

Lazy Evaluation + In-Memory Caching = 
Optimized JOIN Operations 
Start every webpage with a rank R = 1.0 
1. For each webpage linking to N neighbor webpages,
have it "contribute" R/N to each of its N neighbors 
2. Then, for each webpage, set its rank R to (0.15 + 
0.85 * contributions) 
3. Repeat 
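The update rule above can be tried out on a tiny graph in plain Python before moving to Spark. This is a single-machine sketch, not a distributed implementation; the pagerank function and the three-page graph are made up for illustration.

```python
def pagerank(links, iterations=10):
    """links maps each page to the list of pages it links to."""
    ranks = dict((page, 1.0) for page in links)   # every page starts at R = 1.0
    for _ in range(iterations):
        contributions = dict((page, 0.0) for page in links)
        for page, neighbors in links.items():
            if not neighbors:
                continue                           # nothing to contribute
            share = ranks[page] / len(neighbors)   # R/N for each neighbor
            for neighbor in neighbors:
                contributions[neighbor] = contributions.get(neighbor, 0.0) + share
        # New rank is 0.15 + 0.85 * (sum of received contributions)
        ranks = dict((page, 0.15 + 0.85 * c) for page, c in contributions.items())
    return ranks

links = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}
ranks = pagerank(links)
```

In this graph page c receives contributions from both a and b, so it ends up ranked above b, which only receives half of a's rank.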
Lazy Evaluation + In-Memory Caching = 
Optimized JOIN Operations 
from operator import add

# computeContribs is not defined on the slide; this definition is an assumption
# modeled on the PageRank example bundled with Spark
def computeContribs(urls, rank):
    num_urls = len(urls)
    for url in urls:
        yield (url, rank / num_urls)

lines = sc.textFile('hdfs://master.ibnet0:54310/user/glock/links.txt')
# Load key/value pairs of (url, link), eliminate duplicates, and partition them such
# that all common keys are kept together. Then retain this RDD in memory.
# (tuple() is an addition: distinct() and groupByKey() need hashable pairs)
links = lines.map(lambda urls: tuple(urls.split())).distinct().groupByKey().cache()
# Create a new RDD of key/value pairs of (url, rank) and initialize all ranks to 1.0
ranks = links.map(lambda (url, neighbors): (url, 1.0))
# Calculate and update URL rank
for iteration in range(10):
    # Calculate URL contributions to their neighbors
    contribs = links.join(ranks).flatMap(
        lambda (url, (urls, rank)): computeContribs(urls, rank))
    # Recalculate URL ranks based on neighbor contributions
    ranks = contribs.reduceByKey(add).mapValues(lambda rank: 0.15 + 0.85*rank)
# Print all URLs and their ranks
for (link, rank) in ranks.collect():
    print '%s has rank %s' % (link, rank)


