Using Spark ML on Spark Errors - What do the clusters tell us?


Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google focused on OSS Big Data
● Apache Spark PMC (think committer with tenure)
● Contributor to a lot of other projects
● Previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● Co-author of High Performance Spark & Learning Spark (+ more)
● Twitter: @holdenkarau
● Slideshare http://www.slideshare.net/hkarau
● LinkedIn https://www.linkedin.com/in/holdenkarau
● GitHub https://github.com/holdenk
● Related Spark Videos http://bit.ly/holdenSparkVideos

Normally I’d introduce my co-speaker
● However she was organizing the Apache Beam Summit and is just too
drained to be able to make it.
● I did have to cut a few corners (and re-use a few cat pictures) as a result
Sylvie burr

Some links (slides & recordings will be at):
Today’s talk:
http://bit.ly/2QoZuKz
Yesterday’s talk (Validating Pipelines):
https://bit.ly/2QqQUea
CatLoversShow

Who do I think you all are?
● Nice people*
● Familiar-ish to very familiar with Spark
● Possibly a little bit jaded (but also maybe not)
Amanda

What we are going to explore together!
● The Spark Mailing Lists
○ Yes, even user@
● My desire to be lazy
● The suspicion that srowen has a robot army to help
● A look at how much work it would be to build that robot
army
● The depressing realization that “heuristics” are probably
better anyway (and some options)

Some of the reasons my employer cares*
● We have a hosted Spark/Hadoop solution (called Dataproc)
● We also have a hosted pipeline management tool (Cloud Composer,
based on Airflow)
● Being good open source community members
*Probably, it’s not like I go to all of the meetings I’m invited to.
Khairil Zhafri

The Spark Mailing Lists & friends
● user@
○ Where people go to ask questions about using Spark
● dev@
○ Discussion about developing Spark
○ Also where people sometimes go when no one answers user@
● Stack Overflow
○ Some very active folks here as well
● Books/etc.
Petful

~ unanswered Spark posts
8536
:(
Richard J

Stack Overflow growth over time
Petful
Khalid Abduljaleel
*Done with BigQuery. Sorry!

Discoverability might matter
Petful

Anyone have an outstanding PR?
koi ko

So how do we handle this?
● Get more community volunteers
○ (hard & burn out)
● Answer more questions
○ (hard & burn out)
● Answer fewer questions?
○ (idk maybe someone will buy a support contract)
● Make robots!
○ Hard and doesn’t work entirely
Helen Olney

How many of you have had?
● Java OOM
● Application memory overhead exceeded
● Serialization exception
● Value is bigger than integer exception
● etc.
Helen Olney

Maaaaybe robots could help?
● It certainly seems like some folks have common issues
● Everyone loves phone trees right?
○ Press 1 if you’ve had an out-of-memory exception, press 2 if you’re
running Python
● Although, more seriously, some companies are building
recommendation systems on top of Spark to solve this
for their customers
Matthew Hurst
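
A phone tree like this is more or less a list of regexes. A minimal sketch, assuming we only care about the four errors from the previous slide: the label names and the `triage` helper are made up for illustration, and the patterns are just substrings that commonly show up in these JVM/Spark error messages, not an exhaustive list.

```python
import re

# Hypothetical label -> pattern table. The patterns are substrings that
# commonly appear in these errors; the labels are illustrative only.
COMMON_ERRORS = [
    ("java_oom", re.compile(r"java\.lang\.OutOfMemoryError")),
    ("memory_overhead", re.compile(r"exceeding memory limits|memoryOverhead", re.I)),
    ("serialization", re.compile(r"Task not serializable|NotSerializableException")),
    ("over_2gb", re.compile(r"Size exceeds Integer\.MAX_VALUE")),
]


def triage(body):
    """Return the labels of every known error pattern found in a message body."""
    return [label for label, pattern in COMMON_ERRORS if pattern.search(body)]
```

Something like `triage(email_body)` returning an empty list would be the "please hold for a human" branch of the phone tree.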

Ok well, let’s try and build some clusters?
● Not those clusters :p
● Let’s try k=4, we had 4 common errors right?
_torne

I’m lazy so let’s use Spark:
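from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import HashingTF, IDF, VectorAssembler

# Hash the tokenized message body into a 10,000-dimensional term-frequency vector
body_hashing = HashingTF(inputCol="body_tokens",
                         outputCol="raw_body_features", numFeatures=10000)
# Down-weight terms that show up in most messages
body_idf = IDF(inputCol="raw_body_features", outputCol="body_features")
# Combine the text features with the hand-built indicator columns
assembler = VectorAssembler(
    inputCols=["body_features", "contains_python_stack_trace",
               "contains_java_stack_trace",
               "contains_exception_in_task", "is_thread_start",
               "domain_features"],
    outputCol="features")
kmeans = KMeans(featuresCol="features", k=4, seed=42)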
_torne
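
The slides don't show how the indicator columns fed to the assembler were computed. One plausible way, as a plain-Python sketch: the regexes, the `message_flags` helper, and the idea that "thread start" means "no In-Reply-To header" are all assumptions for illustration, not the pipeline used in the talk.

```python
import re

# Assumed heuristics, not the talk's actual feature code.
PYTHON_TRACE = re.compile(r"Traceback \(most recent call last\)")
# A JVM stack frame such as "at org.foo.Bar.baz(Bar.java:42)", allowing
# leading whitespace and ">" quote markers from replies.
JAVA_TRACE = re.compile(r"^[>\s]*at [\w.$<>]+\([^)]*\)", re.MULTILINE)
EXCEPTION_IN_TASK = re.compile(r"Exception in task")


def message_flags(body, in_reply_to=None):
    """Compute 0.0/1.0 indicator features for one mailing-list message."""
    return {
        "contains_python_stack_trace": float(bool(PYTHON_TRACE.search(body))),
        "contains_java_stack_trace": float(bool(JAVA_TRACE.search(body))),
        "contains_exception_in_task": float(bool(EXCEPTION_IN_TASK.search(body))),
        # Assumption: a message with no parent starts a thread.
        "is_thread_start": float(in_reply_to is None),
    }
```

The same functions could be wrapped as UDFs to add the columns to a DataFrame before the assembler runs.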

Damn not quite one slide :(
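from pyspark.ml import Pipeline
# tokenizer, domains_hashing and domains_idf are defined elsewhere (not shown on the slides)
dataprep_pipeline = Pipeline(stages=[tokenizer, body_hashing,
                                     body_idf, domains_hashing, domains_idf, assembler])
pipeline = Pipeline(stages=[dataprep_pipeline, kmeans])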
_torne

Let’s see what the results look like
● Let’s grab an email or two from each cluster and take a
peek
Rikki's Refuge

Oh hmm. Well maybe 4*4 right?
● 159 non group-zero messages…
Sherrie Thai

Well when do we start to see something?
*Not actually a great way to pick K
w.vandervet
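
The "when do we start to see something" question boils down to counting, for each K, how many messages land outside the catch-all cluster 0. A small sketch, assuming the assignments for each K have been collected into plain Python lists; `outside_cluster_zero` and `sweep` are illustrative names, and (as the slide says) this is not a great way to pick K.

```python
from collections import Counter


def cluster_sizes(assignments):
    """Map each cluster id to the number of messages assigned to it."""
    return Counter(assignments)


def outside_cluster_zero(assignments):
    """Count messages that were NOT assigned to cluster 0."""
    return sum(1 for cluster_id in assignments if cluster_id != 0)


def sweep(assignments_by_k):
    """For each k, report how many messages fall outside cluster 0.

    assignments_by_k: dict mapping k -> list of cluster ids, one per message.
    """
    return {k: outside_cluster_zero(a) for k, a in sorted(assignments_by_k.items())}
```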

Let’s look at some of the records - 73
1 {plain=Hi All,\n Greetings ! I needed some help to read a Hive
table\nvia Pyspark for which the transactional property is set to 'True' (In
other\nwords ACID property is enabled). Following is the entire stacktrace and
the\ndescription of the hive table. Would you please be able to help me
resolve\nthe error:\n\n18/03/01 11:06:22 INFO BlockManagerMaster: Registered
BlockManager\n18/03/01 11:06:22 INFO EventLoggingListener: Logging events
to\nhdfs:///spark-history/local-1519923982155\nWelcome to\n ____
__\n / __/__ ___ _____/ /__\n _\ \/ _ \/ _ `/ __/ '_/\n /__ / .__/\_,_/_/ /_/\_\
version 1.6.3\n /_/\n\nUsing Python version 2.7.12 (default, Jul 2 2016
17:42:40)\nSparkContext available as sc, HiveContext available as
sqlContext.\n>>> from pyspark.sql import HiveContext\n>>> hive_context =
HiveContext(sc)\n>>> hive_context.sql("select count(*)
Susan Young

5 {plain=Hi Gourav,\n\nMy answers are below.\n\nCheers,\nBen\n\n\n> On Feb 23, 2017, at 10:57 PM, Gourav Sengupta
<gourav.sengupta@gmail.com> wrote:\n> \n> Can I ask where are you running your CDH? Is it on premise or have you created a cluster for
yourself in AWS? Our cluster in on premise in our data center.\n> \n> Also I have really never seen use s3a before, that was used way long
before when writing s3 files took a long time, but I think that you are reading it. \n> \n> Anyideas why you are not migrating to Spark 2.1,
besides speed, there are lots of apis which are new and the existing ones are being deprecated. Therefore there is a very high chance that
you are already working on code which is being deprecated by the SPARK community right now. We use CDH and upgrade with whatever
Spark version they include, which is 1.6.0. We are waiting for the move to Spark 2.0/2.1.\n> \n> And besides that would you not want to work
on a platform which is at least 10 times faster What would that be?\n> \n> Regards,\n> Gourav Sengupta\n> \n> On Thu, Feb 23, 2017 at
6:23 PM, Benjamin Kim <bbuild11@gmail.com <mailto:bbuild11@gmail.com>> wrote:\n> We are trying to use Spark 1.6 within CDH 5.7.1 to
retrieve a 1.3GB Parquet file from AWS S3. We can read the schema and show some data when the file is loaded into a DataFrame, but
when we try to do some operations, such as count, we get this error below.\n> \n> com.cloudera.com.amazonaws.AmazonClientException:
Unable to load AWS credentials from any provider in the chain\n> at
com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)\n> at
com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)\n> at
com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)\n> at
com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)\n> at
org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)\n> at
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)\n> at
org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)\n> at
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)\n> at
org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)\n> at
nagy.tamas

6 {plain=I see, that’s quite interesting. For problem 2, I think the issue is that Akka 2.0.5 *always* kept TCP connections open between
nodes, so these messages didn’t get lost. It looks like Akka 2.2 occasionally disconnects them and loses messages. If this is the case, and
this behavior can’t be disabled with a flag, then it’s a problem for other parts of the code too. Most of our code assumes that messages will
make it through unless the destination node dies, which is what you’d usually hope for TCP.\n\nMatei\n\nOn Oct 31, 2013, at 1:33 PM, Imran
Rashid <imran@quantifind.com> wrote:\n\n> pretty sure I found the problem -- two problems actually. And I think one of them has been a
general lurking problem w/ spark for a while.\n> \n> 1) we should ignore disassociation events, as you suggested earlier. They seem to just
indicate a temporary problem, and can generally be ignored. I've found that they're regularly followed by AssociatedEvents, and it seems
communication really works fine at that point.\n> \n> 2) Task finished messages get lost. When this message gets sent, we dont' know it
actually gets there:\n> \n>
https://github.com/apache/incubator-spark/blob/scala-2.10/core/src/main/scala/org/apache/spark/executor/StandaloneExecutorBackend.scala
#L90\n> \n> (this is so incredible, I feel I must be overlooking something -- but there is no ack somewhere else that I'm overlooking, is
there??) So, after the patch, spark wasn't hanging b/c of the unhandled DisassociatedEvent. It hangs b/c the executor has sent some
taskFinished messages that never get received by the driver. So the driver is waiting for some tasks to finish, but the executors think they are
all done.\n> \n> I'm gonna add the reliable proxy pattern for this particular interaction and see if its fixes the problem\n>
http://doc.akka.io/docs/akka/2.2.3/contrib/reliable-proxy.html#introducing-the-reliable-proxy\n> \n> imran\n> \n> \n> \n> On Thu, Oct 31, 2013
at 1:17 PM, Imran Rashid <imran@quantifind.com> wrote:\n> Hi Prashant,\n> \n> thanks for looking into this. I don't have any answers yet,
but just wanted to send you an update. I finally figured out how to get all the akka logging turned on, so I'm looking at those for more info.
One thing immediately jumped out at me -- the Disassociation is actually immediatley followed by an Association! so maybe I came to the
wrong conclusion of our test of ignoring the DisassociatedEvent. I'm going to try it again -- hopefully w/ the logging on, I can find out more
about what is going on. I might ask on akka list for help w/ what to look for. also this thread makes me think that it really should just
re-associate:\n> https://groups.google.com/forum/#!searchin/akka-user/Disassociated/akka-user/SajwwbyTriQ/8oxjbZtawxoJ\n> \n> also, I've
翮郡陳

*Problem Description*:\n\nThe program running in stand-alone spark cluster (1
master, 6 workers with\n8g ram and 2 cores).\nInput: a 468MB file with 133433
records stored in HDFS.\nOutput: just 2MB file will stored in HDFS\nThe program
has two map operations and one reduceByKey operation.\nFinally I save the result
to HDFS using "*saveAsTextFile*".\n*Problem*: if I don't add "saveAsTextFile",
the program runs very fast(a few\nseconds), otherwise extremely slow until about
30 mins.\n\n*My program (is very Simple)*\n\tpublic static void main(String[] args)
throws IOException{\n\t\t/**Parameter Setting***********/\n\t\t String localPointPath
= "/home/hduser/skyrock/skyrockImageFeatures.csv";\n\t\t String remoteFilePath
=\n"hdfs://HadoopV26Master:9000/user/skyrock/skyrockImageIndexedFeatures.cs
v";\n\t\t String outputPath =
"hdfs://HadoopV26Master:9000/user/sparkoutput/";\n\t\t final int row =
Марья

{plain=I'm glad that I could help :)\n19 sie 2015 8:52 AM "Shenghua(Daniel) Wan" <wanshenghua@gmail.com>\nnapisał(a):\n\n> +1\n>\n> I
wish I have read this blog earlier. I am using Java and have just\n> implemented a singleton producer per executor/JVM during the day.\n>
Yes, I did see that NonSerializableException when I was debugging the code\n> ...\n>\n> Thanks for sharing.\n>\n> On Tue, Aug 18, 2015 at
10:59 PM, Tathagata Das <tdas@databricks.com>\n> wrote:\n>\n>> Its a cool blog post! Tweeted it!\n>> Broadcasting the configuration
necessary for lazily instantiating the\n>> producer is a good idea.\n>>\n>> Nitpick: The first code example has an extra `}` ;)\n>>\n>> On Tue,
Aug 18, 2015 at 10:49 PM, Marcin Kuthan <marcin.kuthan@gmail.com>\n>> wrote:\n>>\n>>> As long as Kafka producent is thread-safe you
don't need any pool at\n>>> all. Just share single producer on every executor. Please look at my blog\n>>> post for more details.
http://allegro.tech/spark-kafka-integration.html\n>>> 19 sie 2015 2:00 AM "Shenghua(Daniel) Wan" <wanshenghua@gmail.com>\n>>>
napisał(a):\n>>>\n>>>> All of you are right.\n>>>>\n>>>> I was trying to create too many producers. My idea was to create a\n>>>> pool(for
now the pool contains only one producer) shared by all the\n>>>> executors.\n>>>> After I realized it was related to the serializable issues
(though I\n>>>> did not find clear clues in the source code to indicate the broacast\n>>>> template type parameter must be implement
serializable), I followed spark\n>>>> cassandra connector design and created a singleton of Kafka producer pools.\n>>>> There is not
exception noticed.\n>>>>\n>>>> Thanks for all your comments.\n>>>>\n>>>>\n>>>> On Tue, Aug 18, 2015 at 4:28 PM, Tathagata Das
<tdas@databricks.com>\n>>>> wrote:\n>>>>\n>>>>> Why are you even trying to broadcast a producer? A broadcast variable\n>>>>> is
some immutable piece of serializable DATA that can be used for\n>>>>> processing on the executors. A Kafka producer is neither DATA
nor\n>>>>> immutable, and definitely not serializable.\n>>>>> The right way to do this is to create the producer in the executors.\n>>>>>
Please see the discussion in the programming guide\n>>>>>\n>>>>>
http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams\n>>>>>\n>>>>> On Tue, Aug 18,
2015 at 3:08 PM, Cody Koeninger <cody@koeninger.org>\n>>>>> wrote:\n>>>>>\n>>>>>> I wouldn't expect a kafka producer to be
serializable at all... among\n>>>>>> other things, it has a background thread\n>>>>>>\n>>>>>> On Tue, Aug 18, 2015 at 4:55 PM,
Shenghua(Daniel) Wan <\n>>>>>> wanshenghua@gmail.com> wrote:\n>>>>>>\n>>>>>>> Hi,\n>>>>>>> Did anyone see

An idea of what some of the clusters “mean”
● 74 - (more or less) answers
● 53 - hive errors (more or less)
● 155 - non-hive stack traces (mostly map partitions)
● 126 - PR comments
● 183 - streaming
● etc.
● 0 - Everything else
w.vandervet

This probably isn’t good enough :(
● But maaaybe we don’t want to give folks an IVR response
● How about if we took un-answered questions and pointed
to similar questions (ideally ones with answers…)
● Human-in-the-loop?
○ Idk anyone want to volunteer for this?
Lisa Zins
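
Pointing an unanswered question at its most similar neighbour could look roughly like this. It is a plain-Python sketch using TF-IDF with cosine similarity (swapped in here as a simple similarity measure, not something from the talk); the tokenizer and the helper names are assumptions, and the smoothed IDF uses the same log((n + 1) / (df + 1)) form as Spark's IDF.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case and split on anything that is not a letter, digit or underscore."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    idf = {term: math.log((n + 1) / (df + 1)) for term, df in doc_freq.items()}
    return [{term: tf * idf[term] for term, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query_index, docs):
    """Index of the document most similar to docs[query_index], or None."""
    vectors = tfidf_vectors(docs)
    query = vectors[query_index]
    scored = [(cosine(query, v), i) for i, v in enumerate(vectors) if i != query_index]
    return max(scored)[1] if scored else None
```

In practice the candidate set would be restricted to threads that actually got an answer, and a human would probably still want to check the match before anything is sent.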

What else could we do?
● Transfer learning with the TF github summarizer?
● Explore Elasticsearch
● Label some data with Fiverr or similar
● Give up
● Go for a drink
● Explore network graphs on the Spark Mailing list
○ Like Paco did -
https://www.slideshare.net/pacoid/graph-analytics-in-spark

Oooor
● Spark-lint
● Dr. Elephant
● etc.

Sign up for the mailing list @
http://www.distributedcomputing4kids.com

High Performance Spark!
You can buy it today! Several copies!
Really not that much on ML
Cats love it*
*Or at least the box it comes in. If buying for a cat, get print
rather than e-book.

And some upcoming talks:
● October
○ Reversim
○ ScyllaDB Summit
○ Twilio’s Signal in SF
● November
○ PyCon Canada
○ Big Data Spain
● December
○ ScalaX

k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
I want us to build better Spark
testing support in 3+
Will tweet results
“eventually” @holdenkarau
Do you want more realistic
benchmarks? Share your UDFs!
http://bit.ly/pySparkUDF
I’m always trying to get better at giving talks so feedback is
welcome: http://bit.ly/holdenTalkFeedback
