Introduction to Spark 
Glenn K. Lockwood 
July 2014 
Outline 
I. Hadoop/MapReduce Recap and Limitations 
II. Complex Workflows and RDDs 
III. The Spark Framework 
IV. Spark on Gordon 
V. Practical Limitations of Spark 
Map/Reduce Parallelism
[Diagram: six blocks of data, each processed independently by its own task (task 0 through task 5)]
Magic of HDFS 

Hadoop Workflow 
MapReduce Disk Spill
1. Map – convert raw input into key/value pairs. Output to local disk ("spill")
2. Shuffle/Sort – All reducers retrieve all spilled records from all mappers over the network
3. Reduce – For each unique key, do something with all the corresponding values. Output to HDFS
[Diagram: three Map tasks feed a Shuffle/Sort stage, which feeds three Reduce tasks]
MapReduce: Two Fundamental Limitations
1. MapReduce prescribes the workflow.
   • You map, then you reduce.
   • You cannot reduce, then map...
   • ...or anything else. See first point.
2. Full* data dump to disk between workflow steps.
   • Mappers deliver output on local disk (mapred.local.dir)
   • Reducers pull input over the network from other nodes' local disks
   • Output goes right back to local disks via HDFS
* Combiners do local reductions to prevent a full, unreduced dump of data to local disk
[Diagram: Map tasks → Shuffle/Sort → Reduce tasks]
Beyond MapReduce
• What if the workflow could be arbitrary in length?
  • map-map-reduce
  • reduce-map-reduce
• What if higher-level map/reduce operations could be applied?
  • sampling or filtering of a large dataset
  • mean and variance of a dataset
  • sum/subtract all elements of a dataset
  • SQL JOIN operator

Beyond MapReduce: Complex Workflows
• What if the workflow could be arbitrary in length?
  • map-map-reduce
  • reduce-map-reduce
  How can you do this without flushing intermediate results to disk after every operation?
• What if higher-level map/reduce operations could be applied (see the PySpark sketch after this list)?
  • sampling or filtering of a large dataset
  • mean and variance of a dataset
  • sum/subtract all elements of a dataset
  • SQL JOIN operator
  How can you ensure fault tolerance for all of these baked-in operations?
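For illustration, a minimal PySpark sketch of the higher-level operations listed above, assuming a running SparkContext named sc; the input path, fractions, and values are hypothetical and not from the slides:

# Hedged sketch: higher-level operations as single RDD calls (assumes SparkContext sc)
nums = sc.textFile('hdfs:///user/glock/numbers.txt').map(lambda line: float(line))

sampled = nums.sample(False, 0.01, 42)       # sample ~1% of a large dataset
big     = nums.filter(lambda x: x > 100.0)   # filter the dataset
avg     = nums.mean()                        # mean of the dataset (action)
var     = nums.variance()                    # variance of the dataset (action)
total   = nums.sum()                         # sum of all elements (action)

# SQL-style JOIN on two key/value RDDs
left   = sc.parallelize([('a', 1), ('b', 2)])
right  = sc.parallelize([('a', 'x'), ('b', 'y')])
joined = left.join(right)                    # [('a', (1, 'x')), ('b', (2, 'y'))]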
MapReduce Fault Tolerance
Mapper failure:
1. Re-run the map task and spill to disk
2. Block until finished
3. Reducers proceed as normal
Reducer failure:
1. Re-fetch spills from all mappers' disks
2. Re-run the reducer task
[Diagram: Map tasks → Reduce tasks]
Performing Complex Workflows
How can you do complex workflows without flushing intermediate results to disk after every operation?
1. Cache intermediate results in memory
2. Allow users to specify persistence in memory and partitioning of the dataset across nodes (see the sketch after this list)
How can you ensure fault tolerance?
1. Coarse-grained atomicity via partitions (transform chunks of data, not record-by-record)
2. Use transaction logging; forget replication
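A minimal sketch of what "specify persistence and partitioning" looks like in PySpark, assuming a running SparkContext sc; the input path and partition count are made up for illustration:

# Hedged sketch: explicit persistence level and user-controlled partitioning
from pyspark import StorageLevel

# Build key/value pairs from a (hypothetical) edge list
pairs = sc.textFile('hdfs:///user/glock/edges.txt') \
          .map(lambda line: tuple(line.split()[:2]))

# Co-locate records sharing a key into 32 partitions, and ask Spark to keep
# the result in memory for reuse by later stages instead of recomputing it
partitioned = pairs.partitionBy(32).persist(StorageLevel.MEMORY_ONLY)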
Resilient Distributed Dataset (RDD) 
• Comprised of distributed, atomic partitions of elements 
• Apply transformations to generate new RDDs 
• RDDs are immutable (read-only) 
• RDDs can only be created from persistent storage (e.g., 
HDFS, POSIX, S3) or by transforming other RDDs 
# Create an RDD from a file on HDFS 
text = sc.textFile('hdfs://master.ibnet0/user/glock/mobydick.txt') 
# Transform the RDD of lines into an RDD of words (one word per element) 
words = text.flatMap( lambda line: line.split() ) 
# Transform the RDD of words into an RDD of key/value pairs 
keyvals = words.map( lambda word: (word, 1) ) 
sc is a SparkContext object that describes our Spark cluster 
lambda declares a "lambda function" in Python (an anonymous function in Perl and other languages) 

Potential RDD Workflow 
RDD Transformation vs. Action
• Transformations are lazy: nothing actually happens when this code is evaluated
• RDDs are computed only when an action is called on them, e.g.,
  • Calculate statistics over the elements of an RDD (count, mean)
  • Save the RDD to a file (saveAsTextFile)
  • Reduce elements of an RDD into a single object or value (reduce)
• Laziness allows you to define partitioning/caching behavior after defining the RDD but before calculating its contents (see the sketch below)
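A minimal sketch of that laziness, continuing from the words/keyvals RDDs defined earlier; the partition count and final reduction are illustrative assumptions, not from the slides:

# Hedged sketch: nothing below executes until the final action is called
counts = keyvals.reduceByKey(lambda a, b: a + b)   # still lazy

# Partitioning and caching can still be declared here, before any computation
counts = counts.repartition(64).cache()

# Only this action triggers the whole pipeline (textFile, flatMap, map, reduceByKey)
most_common = counts.reduce(lambda a, b: a if a[1] > b[1] else b)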
RDD Transformation vs. Action
• Must insert an action here to get the pipeline to execute.
• Actions create files or objects:
# The saveAsTextFile action dumps the contents of an RDD to disk
>>> rdd.saveAsTextFile('hdfs://master.ibnet0/user/glock/output.txt')
# The count action returns the number of elements in an RDD
>>> num_elements = rdd.count()
>>> num_elements
215136
>>> type(num_elements)
<type 'int'>
Resiliency: The 'R' in 'RDD' 
• No replication of in-memory data 
• Restrict transformations to coarse granularity 
• Partition-level operations simplify data lineage (see the lineage sketch below)
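A quick way to see the lineage Spark records is RDD.toDebugString(); a minimal sketch, assuming the keyvals RDD defined earlier:

# Hedged sketch: print the chain of parent RDDs Spark would replay
# to rebuild a lost partition of keyvals
print keyvals.toDebugString()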

Resiliency: The 'R' in 'RDD'
• Reconstruct missing data from its lineage
• Data in RDDs are deterministic, since partitions are immutable and atomic
Resiliency: The 'R' in 'RDD'
• Long lineages or complex interactions (reductions, shuffles) can be checkpointed (see the sketch below)
• RDD immutability → checkpointing can be nonblocking (done in the background)
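A minimal checkpointing sketch in PySpark, assuming a running SparkContext sc; the checkpoint directory is a hypothetical HDFS path:

# Hedged sketch: truncate a long lineage with a checkpoint
sc.setCheckpointDir('hdfs:///user/glock/checkpoints')

rdd = sc.textFile('hdfs://master.ibnet0/user/glock/mobydick.txt') \
        .flatMap(lambda line: line.split()) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)

rdd.checkpoint()   # mark for checkpointing before any action runs
rdd.count()        # the action computes the RDD and writes the checkpoint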
Introduction to Spark 
SPARK: AN IMPLEMENTATION OF RDDS
Spark Framework
• Master/worker model
  • Spark Master is analogous to the Hadoop Jobtracker (MRv1) or Application Master (MRv2)
  • Spark Worker is analogous to the Hadoop Tasktracker
• Relies on "3rd-party" storage for RDD generation (hdfs://, s3n://, file://, http://)
• Spark clusters take three forms (see the sketch after this list):
  • Standalone mode - workers communicate directly with the master via a spark://master:7077 URI
  • Mesos - mesos://master:5050 URI
  • YARN - no HA; complicated job launch
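A minimal sketch of pointing a PySpark application at a standalone master; the hostname and application name are placeholders, not from the slides:

# Hedged sketch: connect to a standalone Spark master from PySpark
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster('spark://master:7077')   # standalone mode; a mesos:// URI works the same way
        .setAppName('wordcount'))
sc = SparkContext(conf=conf)

text = sc.textFile('hdfs://master.ibnet0/user/glock/mobydick.txt')
print text.count()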

Spark on Gordon: Configuration 
1. Standalone mode is the simplest configuration 
and execution model (similar to MRv1) 
2. Leverage existing HDFS support in myHadoop 
for storage 
3. Combine #1 and #2 to extend myHadoop to 
support Spark: 
$ export HADOOP_CONF_DIR=/home/glock/hadoop.conf 
$ myhadoop-configure.sh 
... 
myHadoop: Enabling experimental Spark support 
myHadoop: Using SPARK_CONF_DIR=/home/glock/hadoop.conf/spark 
myHadoop: To use Spark, you will want to type the following commands:
source /home/glock/hadoop.conf/spark/spark-env.sh 
myspark start 
Spark on Gordon: Storage 
• Spark can use HDFS 
$ start-dfs.sh # after you run myhadoop-configure.sh, of course 
... 
$ pyspark 
>>> mydata = sc.textFile('hdfs://localhost:54310/user/glock/mydata.txt') 
>>> mydata.count() 
982394 
• Spark can use POSIX file systems too 
$ pyspark 
>>> mydata = sc.textFile('file:///oasis/scratch/glock/temp_project/mydata.txt') 
>>> mydata.count() 
982394 
• S3 Native (s3n://) and HTTP (http://) also work 
• file:// input will be served in chunks to Spark workers via the Spark driver's built-in httpd
Spark on Gordon: Running
Spark treats several languages as first-class citizens:

Feature       Scala   Java   Python
Interactive   YES     NO     YES
Shark (SQL)   YES     YES    YES
Streaming     YES     YES    NO
MLlib         YES     YES    YES
GraphX        YES     YES    NO

R is a second-class citizen; a basic RDD API is available outside of CRAN
(http://amplab-extras.github.io/SparkR-pkg/)
myHadoop/Spark on Gordon (1/2) 
#!/bin/bash 
#PBS -l nodes=2:ppn=16:native:flash 
#PBS -l walltime=00:30:00 
#PBS -q normal 
### Environment setup for Hadoop 
export MODULEPATH=/home/glock/apps/modulefiles:$MODULEPATH 
module load hadoop/2.2.0 
export HADOOP_CONF_DIR=$HOME/mycluster.conf 
myhadoop-configure.sh 
### Start HDFS. Starting YARN isn't necessary since Spark will be running in 
### standalone mode on our cluster. 
start-dfs.sh 
### Load in the necessary Spark environment variables 
source $HADOOP_CONF_DIR/spark/spark-env.sh 
### Start the Spark masters and workers. Do NOT use the start-all.sh provided 
### by Spark, as they do not correctly honor $SPARK_CONF_DIR 
myspark start 

myHadoop/Spark on Gordon (2/2) 
### Run our example problem. 
### Step 1. Load data into HDFS (Hadoop 2.x does not make the user's HDFS home 
### dir by default which is different from Hadoop 1.x!) 
hdfs dfs -mkdir -p /user/$USER 
hdfs dfs -put /home/glock/hadoop/run/gutenberg.txt /user/$USER/gutenberg.txt 
### Step 2. Run our Python Spark job. Note that Spark implicitly requires 
### Python 2.6 (some features, like MLLib, require 2.7) 
module load python scipy 
/home/glock/hadoop/run/wordcount-spark.py 
### Step 3. Copy output back out 
hdfs dfs -get /user/$USER/output.dir $PBS_O_WORKDIR/ 
### Shut down Spark and HDFS 
myspark stop 
stop-dfs.sh 
### Clean up 
myhadoop-cleanup.sh 
Wordcount submit script and Python code online: 
https://github.com/glennklockwood/sparktutorial
Introduction to Spark 
PRACTICAL LIMITATIONS 
Major Problems with Spark 
1. Still smells like a CS project 
2. Debugging is a dark art 
3. Not battle-tested at scale 
#1: Spark Smells Like CS 
• Components are constantly breaking 
• Graph.partitionBy broken in 1.0.0 (SPARK-1931) 
• Some components never worked 
• SPARK_CONF_DIR (start-all.sh) doesn't work (SPARK-2058) 
• stop-master.sh doesn't work 
• Spark with YARN will break with large data sets (SPARK-2398) 
• spark-submit for standalone mode doesn't work (SPARK-2260) 

#1: Spark Smells Like CS 
• Really obvious usability issues: 
>>> data = sc.textFile('file:///oasis/scratch/glock/temp_project/gutenberg.txt') 
>>> data.saveAsTextFile('hdfs://gcn-8-42.ibnet0:54310/user/glock/output.dir') 
14/04/30 16:23:07 ERROR Executor: Exception in task ID 19 
scala.MatchError: 0 (of class java.lang.Integer) 
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:110) 
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153) 
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96) 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) 
... 
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49) 
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
at java.lang.Thread.run(Thread.java:722) 
Read an RDD, then write it out = unhandled exception with 
cryptic Scala errors from Python (SPARK-1690)
#2: Debugging is a Dark Art 
>>> data.saveAsTextFile('hdfs://s12ib:54310/user/glock/gutenberg.out') 
Traceback (most recent call last): 
File "<stdin>", line 1, in <module> 
File "/N/u/glock/apps/spark-0.9.0/python/pyspark/rdd.py", line 682, in 
saveAsTextFile 
keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path) 
File "/N/u/glock/apps/spark-0.9.0/python/lib/py4j-0.8.1- 
src.zip/py4j/java_gateway.py", line 537, in __call__ 
File "/N/u/glock/apps/spark-0.9.0/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", 
line 300, in get_return_value 
py4j.protocol.Py4JJavaError: An error occurred while calling o23.saveAsTextFile. 
: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with 
client version 4 
at org.apache.hadoop.ipc.Client.call(Client.java:1070) 
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) 
at $Proxy7.getProtocolVersion(Unknown Source) 
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396) 
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379) 
Cause: Spark built against Hadoop 2 DFS trying to access data 
on Hadoop 1 DFS 
#2: Debugging is a Dark Art 
>>> data.count() 
14/04/30 16:15:11 ERROR Executor: Exception in task ID 12 
org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/worker.py", 
serializer.dump_stream(func(split_index, iterator), outfile) 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers. 
self.serializer.dump_stream(self._batched(iterator), stream) 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers. 
for obj in iterator: 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers. 
for item in iterator: 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", lin 
if acc is None: 
TypeError: an integer is required 
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131) 
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153) 
... 
Cause: Master was using Python 2.6, but workers were only 
able to find Python 2.4 
#2: Debugging is a Dark Art 
>>> data.saveAsTextFile('hdfs://user/glock/output.dir/') 
14/04/30 17:53:20 WARN scheduler.TaskSetManager: Loss was due to org.apache.spark.api.p 
org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/worker.py", 
serializer.dump_stream(func(split_index, iterator), outfile) 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/serializers. 
for obj in iterator: 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py", lin 
if not isinstance(x, basestring): 
SystemError: unknown opcode 
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131) 
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153) 
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96) 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) 
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) 
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) 
... 
Cause: Master was using Python 2.6, but workers were only 
able to find Python 2.4 

#2: Spark Debugging Tips 
• $SPARK_LOG_DIR/app-* contains master/worker 
logs with failure information 
• Try to find the salient error amidst the stack traces 
• Google that error--odds are, it is a known issue 
• Stick any required environment variables ($PATH, 
$PYTHONPATH, $JAVA_HOME) in 
$SPARK_CONF_DIR/spark-env.sh to rule out 
these problems 
• If all else fails, look at the Spark source code
#3: Spark Isn't Battle Tested 
• Companies (Cloudera, SAP, etc) jumping on the 
Spark bandwagon with disclaimers about scaling 
• Spark does not handle multitenancy well at all. Wait scheduling is considered the best way to achieve memory/disk data locality
• Largest Spark clusters ~ hundreds of nodes 
Spark Take-Aways 
• FACTS 
• Data is represented as resilient distributed datasets 
(RDDs) which remain in-memory and read-only 
• RDDs are comprised of elements 
• Elements are distributed across physical nodes in user-defined 
groups called partitions 
• RDDs are subject to transformations and actions 
• Fault tolerance achieved by lineage, not replication 
• Opinions 
• Spark is still in its infancy, but its progress is promising
• Worth evaluating; a good fit for Gordon and Comet
Introduction to Spark 
PAGERANK EXAMPLE 
(INCOMPLETE) 

Lazy Evaluation + In-Memory Caching = Optimized JOIN Operations
Start every webpage with a rank R = 1.0
1. For each webpage that links to N neighbor webpages, have it "contribute" R/N to each of its N neighbors
2. Then, for each webpage, set its rank R to (0.15 + 0.85 * contributions), as restated in the equation below
3. Repeat
[insert flow diagram here]
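Restated as an equation (notation introduced here for clarity, not taken from the slides): let L(j) be the set of pages that page j links to; each iteration then updates

    R_i^{(t+1)} = 0.15 + 0.85 \sum_{j \,:\, i \in L(j)} \frac{R_j^{(t)}}{|L(j)|}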
Lazy Evaluation + In-Memory Caching = Optimized JOIN Operations
from operator import add   # needed for reduceByKey(add) below

lines = sc.textFile('hdfs://master.ibnet0:54310/user/glock/links.txt')

# Load key/value pairs of (url, link), eliminate duplicates, and partition them such
# that all common keys are kept together. Then retain this RDD in memory.
links = lines.map(lambda urls: urls.split()).distinct().groupByKey().cache()

# Create a new RDD of key/value pairs of (url, rank) and initialize all ranks to 1.0
ranks = links.map(lambda (url, neighbors): (url, 1.0))

# Calculate and update URL ranks
for iteration in range(10):
    # Calculate URL contributions to their neighbors
    # (computeContribs is not shown on this slide; see the sketch below)
    contribs = links.join(ranks).flatMap(
        lambda (url, (urls, rank)): computeContribs(urls, rank))

    # Recalculate URL ranks based on neighbor contributions
    ranks = contribs.reduceByKey(add).mapValues(lambda rank: 0.15 + 0.85*rank)

# Print all URLs and their ranks
for (link, rank) in ranks.collect():
    print '%s has rank %s' % (link, rank)
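The deck labels this example incomplete: computeContribs is used but never defined. A minimal sketch of the missing helper, modeled on the stock PySpark PageRank example (the exact implementation here is an assumption):

def computeContribs(urls, rank):
    # Hedged sketch: split a page's rank evenly among the pages it links to,
    # yielding (neighbor_url, contribution) pairs for the flatMap above
    num_urls = len(urls)
    for url in urls:
        yield (url, rank / num_urls)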

 
Best Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdfBest Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdf
 
Coordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar SlidesCoordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar Slides
 
BLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALL
BLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALLBLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALL
BLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALL
 
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
 
The Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU CampusesThe Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU Campuses
 
Comparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdfComparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdf
 
Best Programming Language for Civil Engineers
Best Programming Language for Civil EngineersBest Programming Language for Civil Engineers
Best Programming Language for Civil Engineers
 
Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...
 
How to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptxHow to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptx
 
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxRPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
 

Overview of Spark for HPC

  • 1. Introduction to Spark Glenn K. Lockwood July 2014 SAN DIEGO SUPERCOMPUTER CENTER
  • 2. Outline I. Hadoop/MapReduce Recap and Limitations II. Complex Workflows and RDDs III. The Spark Framework IV. Spark on Gordon V. Practical Limitations of Spark SAN DIEGO SUPERCOMPUTER CENTER
  • 3. Map/Reduce Parallelism Data Data SAN DIEGO SUPERCOMPUTER CENTER Data Data Data taDsakt a0 task 5 task 4 task 3 task 1 task 2
  • 4. Magic of HDFS SAN DIEGO SUPERCOMPUTER CENTER
  • 5. Hadoop Workflow SAN DIEGO SUPERCOMPUTER CENTER
  • 6. Shuffle/Sort SAN DIEGO SUPERCOMPUTER CENTER MapReduce Disk Spill 1. Map – convert raw input into key/value pairs. Output to local disk ("spill") 2. Shuffle/Sort – All reducers retrieve all spilled records from all mappers over network 3. Reduce – For each unique key, do something with all the corresponding values. Output to HDFS Map Map Map Reduce Reduce Reduce
  • 7. 2. Full* data dump to disk SAN DIEGO SUPERCOMPUTER CENTER MapReduce: Two Fundamental Limitations 1. MapReduce prescribes workflow. • You map, then you reduce. • You cannot reduce, then map... • ...or anything else. See first point. Map Map Map Reduce Reduce Reduce between workflow steps. • Mappers deliver output on local disk (mapred.local.dir) • Reducers pull input over network from other nodes' local disks • Output goes right back to local * Combiners do local reductions to prevent a full, unreduced dump of data to local disk disks via HDFS Shuffle/Sort
  • 8. Beyond MapReduce • What if workflow could be arbitrary in length? • map-map-reduce • reduce-map-reduce • What if higher-level map/reduce operations could be applied? • sampling or filtering of a large dataset • mean and variance of a dataset • sum/subtract all elements of a dataset • SQL JOIN operator SAN DIEGO SUPERCOMPUTER CENTER
  • 9. Beyond MapReduce: Complex Workflows • What if workflow could be arbitrary in length? • map-map-reduce • reduce-map-reduce How can you do this without flushing intermediate results to disk after every operation? • What if higher-level map/reduce operations could be applied? • sampling or filtering of a large dataset • mean and variance of a dataset • sum/subtract all elements of a dataset • SQL JOIN operator How can you ensure fault tolerance for all of these baked-in operations? SAN DIEGO SUPERCOMPUTER CENTER
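The arbitrary-length pipelines described above are exactly what the RDD API provides. As an illustrative sketch only (the input path and a SparkContext named sc are assumptions, following the conventions of the later slides), a map-filter-map-reduce chain looks like this in PySpark, with no intermediate dump to disk between stages:
    # Multi-stage pipeline: map -> filter -> map -> reduce, with no
    # intermediate flush to HDFS or local disk between stages.
    lines = sc.textFile('hdfs://master.ibnet0:54310/user/glock/mobydick.txt')
    word_lengths = lines.flatMap(lambda line: line.split()) \
                        .filter(lambda word: len(word) > 3) \
                        .map(lambda word: len(word))
    total_chars = word_lengths.reduce(lambda a, b: a + b)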
  • 10. SAN DIEGO SUPERCOMPUTER CENTER MapReduce Fault Tolerance Map Map Map Reduce Reduce Reduce Mapper Failure: 1. Re-run map task and spill to disk 2. Block until finished 3. Reducers proceed as normal Reducer Failure: 1. Re-fetch spills from all mappers' disks 2. Re-run reducer task
  • 11. Performing Complex Workflows How can you do complex workflows without flushing intermediate results to disk after every operation? 1. Cache intermediate results in-memory 2. Allow users to specify persistence in memory and partitioning of dataset across nodes How can you ensure fault tolerance? 1. Coarse-grained atomicity via partitions (transform chunks of data, not record-by-record) 2. Use transaction logging--forget replication SAN DIEGO SUPERCOMPUTER CENTER
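As a hedged sketch of point 2 above (the path, key layout, and partition count are assumptions, not part of the original deck), PySpark exposes both knobs directly on the RDD:
    from pyspark import StorageLevel

    # User-specified partitioning and in-memory persistence of an intermediate RDD.
    pairs = sc.textFile('hdfs://master.ibnet0:54310/user/glock/mydata.txt') \
              .map(lambda line: (line.split()[0], 1))
    partitioned = pairs.partitionBy(16)              # co-locate records that share a key
    partitioned.persist(StorageLevel.MEMORY_ONLY)    # keep the intermediate result in memory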
  • 12. Resilient Distributed Dataset (RDD) • Comprised of distributed, atomic partitions of elements • Apply transformations to generate new RDDs • RDDs are immutable (read-only) • RDDs can only be created from persistent storage (e.g., HDFS, POSIX, S3) or by transforming other RDDs
    # Create an RDD from a file on HDFS
    text = sc.textFile('hdfs://master.ibnet0/user/glock/mobydick.txt')
    # Transform the RDD of lines into an RDD of words (one word per element)
    words = text.flatMap( lambda line: line.split() )
    # Transform the RDD of words into an RDD of key/value pairs
    keyvals = words.map( lambda word: (word, 1) )
    sc is a SparkContext object that describes our Spark cluster; lambda declares a "lambda function" in Python (an anonymous function in Perl and other languages) SAN DIEGO SUPERCOMPUTER CENTER
  • 13. Potential RDD Workflow SAN DIEGO SUPERCOMPUTER CENTER
  • 14. RDD Transformation vs. Action • Transformations are lazy: nothing actually happens when this code is evaluated • RDDs are computed only when an action is called on them, e.g., • Calculate statistics over the elements of an RDD (count, mean) • Save the RDD to a file (saveAsTextFile) • Reduce elements of an RDD into a single object or value (reduce) • Allows you to define partitioning/caching behavior after defining the RDD but before calculating its contents SAN DIEGO SUPERCOMPUTER CENTER
  • 15. RDD Transformation vs. Action • Must insert an action here to get pipeline to execute. • Actions create files or objects:
    # The saveAsTextFile action dumps the contents of an RDD to disk
    >>> rdd.saveAsTextFile('hdfs://master.ibnet0/user/glock/output.txt')
    # The count action returns the number of elements in an RDD
    >>> num_elements = rdd.count(); num_elements; type(num_elements)
    215136
    <type 'int'>
    SAN DIEGO SUPERCOMPUTER CENTER
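To make the transformation/action split concrete, here is a minimal sketch that reuses the keyvals RDD from slide 12 (the final reduce is an illustrative choice, not from the original deck):
    # Transformations only extend the lineage; nothing is computed yet.
    counts = keyvals.reduceByKey(lambda a, b: a + b)
    counts.cache()          # caching behavior declared before any computation happens

    # The action below is what finally triggers the whole pipeline
    # (textFile -> flatMap -> map -> reduceByKey) to execute.
    most_common = counts.reduce(lambda kv1, kv2: kv1 if kv1[1] >= kv2[1] else kv2)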
  • 16. Resiliency: The 'R' in 'RDD' • No replication of in-memory data • Restrict transformations to coarse granularity • Partition-level operations simplifies data lineage SAN DIEGO SUPERCOMPUTER CENTER
  • 17. Resiliency: The 'R' in 'RDD' • Reconstruct missing data from its lineage • Data in RDDs are deterministic since partitions are immutable and atomic SAN DIEGO SUPERCOMPUTER CENTER
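PySpark can print the lineage it would use for such a reconstruction via the RDD's toDebugString() method; a small sketch, again assuming the keyvals RDD from slide 12:
    # Show the chain of parent RDDs from which any lost partition can be recomputed.
    print keyvals.toDebugString()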
  • 18. Resiliency: The 'R' in 'RDD' • Long lineages or complex interactions (reductions, shuffles) can be checkpointed • RDD immutability  nonblocking (background) SAN DIEGO SUPERCOMPUTER CENTER
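A hedged sketch of explicit checkpointing (the checkpoint directory and the ten-iteration loop are made up for illustration):
    # Truncate a long lineage by persisting the RDD to reliable storage.
    sc.setCheckpointDir('hdfs://master.ibnet0:54310/user/glock/checkpoints')

    ranks = keyvals
    for i in range(10):                        # imagine a lineage that grows every iteration
        ranks = ranks.mapValues(lambda v: v + 1)

    ranks.checkpoint()                         # mark the RDD for checkpointing
    ranks.count()                              # the checkpoint materializes when an action runs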
  • 19. Introduction to Spark SPARK: AN IMPLEMENTATION OF RDDS SAN DIEGO SUPERCOMPUTER CENTER
  • 20. Spark Framework • Master/worker Model • Spark Master is analogous to Hadoop Jobtracker (MRv1) or Application Master (MRv2) • Spark Worker is analogous to Hadoop Tasktracker • Relies on "3rd party" storage for RDD generation (hdfs://, s3n://, file://, http://) • Spark clusters take three forms: • Standalone mode - workers communicate directly with master via spark://master:7077 URI • Mesos - mesos://master:5050 URI • YARN - no HA; complicated job launch SAN DIEGO SUPERCOMPUTER CENTER
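For standalone mode, attaching a PySpark driver to the master takes only a SparkConf and a SparkContext; a minimal sketch (the host name and application name are placeholders):
    from pyspark import SparkConf, SparkContext

    # Connect to a standalone-mode Spark master over the spark:// URI.
    conf = SparkConf().setMaster('spark://gcn-8-42.ibnet0:7077').setAppName('wordcount')
    sc = SparkContext(conf=conf)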
  • 21. Spark on Gordon: Configuration 1. Standalone mode is the simplest configuration and execution model (similar to MRv1) 2. Leverage existing HDFS support in myHadoop for storage 3. Combine #1 and #2 to extend myHadoop to support Spark:
    $ export HADOOP_CONF_DIR=/home/glock/hadoop.conf
    $ myhadoop-configure.sh
    ...
    myHadoop: Enabling experimental Spark support
    myHadoop: Using SPARK_CONF_DIR=/home/glock/hadoop.conf/spark
    myHadoop: To use Spark, you will want to type the following commands:
        source /home/glock/hadoop.conf/spark/spark-env.sh
        myspark start
    SAN DIEGO SUPERCOMPUTER CENTER
  • 22. Spark on Gordon: Storage • Spark can use HDFS
    $ start-dfs.sh # after you run myhadoop-configure.sh, of course
    ...
    $ pyspark
    >>> mydata = sc.textFile('hdfs://localhost:54310/user/glock/mydata.txt')
    >>> mydata.count()
    982394
  • Spark can use POSIX file systems too
    $ pyspark
    >>> mydata = sc.textFile('file:///oasis/scratch/glock/temp_project/mydata.txt')
    >>> mydata.count()
    982394
  • S3 Native (s3n://) and HTTP (http://) also work • file:// input will be served in chunks to Spark workers via the Spark driver's built-in httpd SAN DIEGO SUPERCOMPUTER CENTER
  • 23. Spark on Gordon: Running Spark treats several languages as first-class citizens:
    Feature      Scala  Java  Python
    Interactive  YES    NO    YES
    Shark (SQL)  YES    YES   YES
    Streaming    YES    YES   NO
    MLlib        YES    YES   YES
    GraphX       YES    YES   NO
    R is a second-class citizen; basic RDD API is available outside of CRAN (http://amplab-extras.github.io/SparkR-pkg/) SAN DIEGO SUPERCOMPUTER CENTER
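Since MLlib is usable from Python, a minimal sketch of calling it follows (the data points are invented, and per slide 25 MLlib needs Python 2.7):
    from pyspark.mllib.clustering import KMeans

    # Cluster a handful of made-up 2-D points into two groups with MLlib's KMeans.
    points = sc.parallelize([[0.0, 0.0], [0.1, 0.1], [9.0, 9.0], [9.1, 9.1]])
    model = KMeans.train(points, 2, maxIterations=10)
    print model.predict([0.2, 0.2])    # which cluster a new point falls into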
  • 24. myHadoop/Spark on Gordon (1/2)
    #!/bin/bash
    #PBS -l nodes=2:ppn=16:native:flash
    #PBS -l walltime=00:30:00
    #PBS -q normal

    ### Environment setup for Hadoop
    export MODULEPATH=/home/glock/apps/modulefiles:$MODULEPATH
    module load hadoop/2.2.0
    export HADOOP_CONF_DIR=$HOME/mycluster.conf
    myhadoop-configure.sh

    ### Start HDFS. Starting YARN isn't necessary since Spark will be running in
    ### standalone mode on our cluster.
    start-dfs.sh

    ### Load in the necessary Spark environment variables
    source $HADOOP_CONF_DIR/spark/spark-env.sh

    ### Start the Spark masters and workers. Do NOT use the start-all.sh provided
    ### by Spark, as they do not correctly honor $SPARK_CONF_DIR
    myspark start
    SAN DIEGO SUPERCOMPUTER CENTER
  • 25. myHadoop/Spark on Gordon (2/2)
    ### Run our example problem.
    ### Step 1. Load data into HDFS (Hadoop 2.x does not make the user's HDFS home
    ### dir by default which is different from Hadoop 1.x!)
    hdfs dfs -mkdir -p /user/$USER
    hdfs dfs -put /home/glock/hadoop/run/gutenberg.txt /user/$USER/gutenberg.txt

    ### Step 2. Run our Python Spark job. Note that Spark implicitly requires
    ### Python 2.6 (some features, like MLLib, require 2.7)
    module load python scipy
    /home/glock/hadoop/run/wordcount-spark.py

    ### Step 3. Copy output back out
    hdfs dfs -get /user/$USER/output.dir $PBS_O_WORKDIR/

    ### Shut down Spark and HDFS
    myspark stop
    stop-dfs.sh

    ### Clean up
    myhadoop-cleanup.sh
    SAN DIEGO SUPERCOMPUTER CENTER
    Wordcount submit script and Python code online: https://github.com/glennklockwood/sparktutorial
  • 26. Introduction to Spark PRACTICAL LIMITATIONS SAN DIEGO SUPERCOMPUTER CENTER
  • 27. Major Problems with Spark 1. Still smells like a CS project 2. Debugging is a dark art 3. Not battle-tested at scale SAN DIEGO SUPERCOMPUTER CENTER
  • 28. #1: Spark Smells Like CS • Components are constantly breaking • Graph.partitionBy broken in 1.0.0 (SPARK-1931) • Some components never worked • SPARK_CONF_DIR (start-all.sh) doesn't work (SPARK-2058) • stop-master.sh doesn't work • Spark with YARN will break with large data sets (SPARK-2398) • spark-submit for standalone mode doesn't work (SPARK-2260) SAN DIEGO SUPERCOMPUTER CENTER
  • 29. #1: Spark Smells Like CS • Really obvious usability issues:
    >>> data = sc.textFile('file:///oasis/scratch/glock/temp_project/gutenberg.txt')
    >>> data.saveAsTextFile('hdfs://gcn-8-42.ibnet0:54310/user/glock/output.dir')
    14/04/30 16:23:07 ERROR Executor: Exception in task ID 19
    scala.MatchError: 0 (of class java.lang.Integer)
        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:110)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
        ...
        at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Read an RDD, then write it out = unhandled exception with cryptic Scala errors from Python (SPARK-1690)
    SAN DIEGO SUPERCOMPUTER CENTER
  • 30. #2: Debugging is a Dark Art
    >>> data.saveAsTextFile('hdfs://s12ib:54310/user/glock/gutenberg.out')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/N/u/glock/apps/spark-0.9.0/python/pyspark/rdd.py", line 682, in saveAsTextFile
        keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path)
      File "/N/u/glock/apps/spark-0.9.0/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
      File "/N/u/glock/apps/spark-0.9.0/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling o23.saveAsTextFile.
    : org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
        at org.apache.hadoop.ipc.Client.call(Client.java:1070)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
        at $Proxy7.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
    Cause: Spark built against Hadoop 2 DFS trying to access data on Hadoop 1 DFS
    SAN DIEGO SUPERCOMPUTER CENTER
  • 31. #2: Debugging is a Dark Art
    >>> data.count()
    14/04/30 16:15:11 ERROR Executor: Exception in task ID 12
    org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/worker.py",
        serializer.dump_stream(func(split_index, iterator), outfile)
      File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers.
        self.serializer.dump_stream(self._batched(iterator), stream)
      File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers.
        for obj in iterator:
      File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers.
        for item in iterator:
      File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", lin
        if acc is None:
    TypeError: an integer is required
        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
        ...
    Cause: Master was using Python 2.6, but workers were only able to find Python 2.4
    SAN DIEGO SUPERCOMPUTER CENTER
  • 32. #2: Debugging is a Dark Art
    >>> data.saveAsTextFile('hdfs://user/glock/output.dir/')
    14/04/30 17:53:20 WARN scheduler.TaskSetManager: Loss was due to org.apache.spark.api.p
    org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/worker.py",
        serializer.dump_stream(func(split_index, iterator), outfile)
      File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/serializers.
        for obj in iterator:
      File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py", lin
        if not isinstance(x, basestring):
    SystemError: unknown opcode
        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        ...
    Cause: Master was using Python 2.6, but workers were only able to find Python 2.4
    SAN DIEGO SUPERCOMPUTER CENTER
  • 33. #2: Spark Debugging Tips • $SPARK_LOG_DIR/app-* contains master/worker logs with failure information • Try to find the salient error amidst the stack traces • Google that error--odds are, it is a known issue • Stick any required environment variables ($PATH, $PYTHONPATH, $JAVA_HOME) in $SPARK_CONF_DIR/spark-env.sh to rule out these problems • All else fails, look at Spark source code SAN DIEGO SUPERCOMPUTER CENTER
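One trick that follows from the tips above: because several of the failures shown earlier came from workers picking up the wrong Python, it can help to ask the cluster directly. This is an illustrative sketch, not from the original deck:
    import sys

    # Compare the driver's Python against whatever the workers are actually running.
    driver_version = sys.version
    worker_versions = sc.parallelize(range(sc.defaultParallelism)) \
                        .map(lambda _: sys.version) \
                        .distinct() \
                        .collect()
    print 'driver :', driver_version
    print 'workers:', worker_versions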
  • 34. #3: Spark Isn't Battle Tested • Companies (Cloudera, SAP, etc) jumping on the Spark bandwagon with disclaimers about scaling • Spark does not handle multitenancy well at all. Wait scheduling is considered best way to achieve memory/disk data locality • Largest Spark clusters ~ hundreds of nodes SAN DIEGO SUPERCOMPUTER CENTER
  • 35. Spark Take-Aways SAN DIEGO SUPERCOMPUTER CENTER • FACTS • Data is represented as resilient distributed datasets (RDDs) which remain in-memory and read-only • RDDs are comprised of elements • Elements are distributed across physical nodes in user-defined groups called partitions • RDDs are subject to transformations and actions • Fault tolerance achieved by lineage, not replication • Opinions • Spark is still in its infancy but its progress is promising • Good for evaluating--good for Gordon, Comet
  • 36. Introduction to Spark PAGERANK EXAMPLE (INCOMPLETE) SAN DIEGO SUPERCOMPUTER CENTER
  • 37. Lazy Evaluation + In-Memory Caching = Optimized JOIN Operations Start every webpage with a rank R = 1.0 1. For each webpage that links to N neighbor webpages, have it "contribute" R/N to each of its N neighbors 2. Then, for each webpage, set its rank R to (0.15 + 0.85 * contributions) 3. Repeat SAN DIEGO SUPERCOMPUTER CENTER
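A tiny worked example of one update step (the numbers are invented): if a page is linked to by one neighbor with rank 1.0 and 2 outbound links and another with rank 1.0 and 4 outbound links, it receives contributions of 0.5 and 0.25, so its new rank is 0.15 + 0.85 * 0.75 = 0.7875.
    # One rank update with made-up inputs, mirroring steps 1-2 above.
    contributions = [1.0 / 2, 1.0 / 4]            # R/N from each inbound neighbor
    new_rank = 0.15 + 0.85 * sum(contributions)   # 0.7875
    print new_rank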
  • 38. Lazy Evaluation + In-Memory Caching = Optimized JOIN Operations
    from operator import add   # 'add' is used by the reduceByKey below

    lines = sc.textFile('hdfs://master.ibnet0:54310/user/glock/links.txt')

    # Load key/value pairs of (url, link), eliminate duplicates, and partition them such
    # that all common keys are kept together. Then retain this RDD in memory.
    links = lines.map(lambda urls: urls.split()).distinct().groupByKey().cache()

    # Create a new RDD of key/value pairs of (url, rank) and initialize all ranks to 1.0
    ranks = links.map(lambda (url, neighbors): (url, 1.0))

    # Calculate and update URL rank
    for iteration in range(10):
        # Calculate URL contributions to their neighbors
        contribs = links.join(ranks).flatMap(
            lambda (url, (urls, rank)): computeContribs(urls, rank))
        # Recalculate URL ranks based on neighbor contributions
        ranks = contribs.reduceByKey(add).mapValues(lambda rank: 0.15 + 0.85*rank)

    # Print all URLs and their ranks (computeContribs is defined in the editor's notes below)
    for (link, rank) in ranks.collect():
        print '%s has rank %s' % (link, rank)
    SAN DIEGO SUPERCOMPUTER CENTER

Editor's Notes

  1. groupByKey: group the values for each key in the RDD into a single sequence
     mapValues: apply map function to all values of key/value pairs without modifying keys (or their partitioning)
     collect: return a list containing all elements of the RDD

     def computeContribs(urls, rank):
         """Calculates URL contributions to the rank of other URLs."""
         num_urls = len(urls)
         for url in urls:
             yield (url, rank / num_urls)