52

I am running a Kinesis plus Spark application: https://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html

I am running it with the command below on an EC2 instance:

 ./spark/bin/spark-submit --class org.apache.spark.examples.streaming.myclassname --master yarn-cluster --num-executors 2 --driver-memory 1g --executor-memory 1g --executor-cores 1  /home/hadoop/test.jar 

I have installed Spark on EMR.

EMR details:

Master instance group - 1 Running MASTER m1.medium

Core instance group - 2 Running CORE m1.medium

I am getting the INFO output below, and it never ends.

15/06/14 11:33:23 INFO yarn.Client: Requesting a new application from cluster with 2 NodeManagers
15/06/14 11:33:23 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (2048 MB per container)
15/06/14 11:33:23 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
15/06/14 11:33:23 INFO yarn.Client: Setting up container launch context for our AM
15/06/14 11:33:23 INFO yarn.Client: Preparing resources for our AM container
15/06/14 11:33:24 INFO yarn.Client: Uploading resource file:/home/hadoop/.versions/spark-1.3.1.e/lib/spark-assembly-1.3.1-hadoop2.4.0.jar -> hdfs://172.31.13.68:9000/user/hadoop/.sparkStaging/application_1434263747091_0023/spark-assembly-1.3.1-hadoop2.4.0.jar
15/06/14 11:33:29 INFO yarn.Client: Uploading resource file:/home/hadoop/test.jar -> hdfs://172.31.13.68:9000/user/hadoop/.sparkStaging/application_1434263747091_0023/test.jar
15/06/14 11:33:31 INFO yarn.Client: Setting up the launch environment for our AM container
15/06/14 11:33:31 INFO spark.SecurityManager: Changing view acls to: hadoop
15/06/14 11:33:31 INFO spark.SecurityManager: Changing modify acls to: hadoop
15/06/14 11:33:31 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/06/14 11:33:31 INFO yarn.Client: Submitting application 23 to ResourceManager
15/06/14 11:33:31 INFO impl.YarnClientImpl: Submitted application application_1434263747091_0023
15/06/14 11:33:32 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
15/06/14 11:33:32 INFO yarn.Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: default
         start time: 1434281611893
         final status: UNDEFINED
         tracking URL: http://172.31.13.68:9046/proxy/application_1434263747091_0023/
         user: hadoop
15/06/14 11:33:33 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
15/06/14 11:33:34 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
15/06/14 11:33:35 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
15/06/14 11:33:36 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
15/06/14 11:33:37 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
15/06/14 11:33:38 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
15/06/14 11:33:39 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
15/06/14 11:33:40 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
15/06/14 11:33:41 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)

Could somebody please let me know why it's not working?

1
  • Alexei51: "Maybe remove setMaster("local[*]")"
    – Hille
    Commented May 8, 2018 at 14:05

13 Answers

24

I had this exact problem when multiple users were trying to run jobs on our cluster at once. The fix was to change a setting of the scheduler.

In the file /etc/hadoop/conf/capacity-scheduler.xml we changed the property yarn.scheduler.capacity.maximum-am-resource-percent from 0.1 to 0.5.

Changing this setting increases the fraction of cluster resources made available to ApplicationMasters, which increases the number of ApplicationMasters that can run at once and hence the number of concurrent applications.
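For reference, the property entry in capacity-scheduler.xml looks roughly like this (a sketch using standard Hadoop XML property syntax; the file path may differ between distributions):

<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
</property>

After editing the file, the CapacityScheduler can usually pick up the change without a full restart via yarn rmadmin -refreshQueues.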

4
  • 1
Just want to add that this is a very important parameter to set if you are running the cluster on a single machine where resources are small
    – Michael
    Commented Nov 10, 2015 at 8:21
I am working in Hue and have the same problem. Where can I find the capacity-scheduler.xml file?
    – user5227388
    Commented Jan 24, 2018 at 10:58
Worked for me. It's possible to run multiple Spark applications now. Commented Aug 21, 2018 at 9:17
  • @user5227388 sudo find / -name capacity-scheduler.xml
    – user1
    Commented Jul 6, 2020 at 11:32
14

I got this error in this situation:

  1. MASTER=yarn (or yarn-client)
  2. spark-submit runs on a computer outside of the cluster and there is no route from the cluster to it because it's hidden by a router

Logs for container_1453825604297_0001_02_000001 (from ResourceManager web UI):

16/01/26 08:30:38 INFO yarn.ApplicationMaster: Waiting for Spark driver to be reachable.
16/01/26 08:31:41 ERROR yarn.ApplicationMaster: Failed to connect to driver at 192.168.1.180:33074, retrying ...
16/01/26 08:32:44 ERROR yarn.ApplicationMaster: Failed to connect to driver at 192.168.1.180:33074, retrying ...
16/01/26 08:32:45 ERROR yarn.ApplicationMaster: Uncaught exception: 
org.apache.spark.SparkException: Failed to connect to driver!
    at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:484) 

I worked around it by using yarn-cluster mode: MASTER=yarn-cluster.
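As a sketch (the class and jar names below are placeholders, not from the question), submitting in cluster mode so that the driver runs inside the cluster rather than on the unreachable machine:

spark-submit --master yarn-cluster --class com.example.MyApp /path/to/my-app.jar

On more recent Spark versions the equivalent is --master yarn --deploy-mode cluster.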

On another computer configured in a similar way, but whose IP is reachable from the cluster, both yarn-client and yarn-cluster work.

Others may encounter this error for different reasons; my point is that checking the error logs (not visible in the terminal, but in the ResourceManager web UI in this case) almost always helps.

2
  • same issue as your case. should be a route table issue.
    – Keith
    Commented Jul 25, 2017 at 6:34
  • Thanks for this explanation!
    – frb
    Commented Jul 25, 2017 at 9:14
10

There are three ways we can try to fix this issue.

  1. Check for Spark processes on your machine and kill them.

Do

ps aux | grep spark

Take the process IDs of all Spark processes and kill them, like:

sudo kill -9 4567 7865
  2. Check the number of Spark applications running on your cluster.

To check this, do

yarn application -list

you will get an output similar to this:

Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
                Application-Id      Application-Name        Application-Type          User       Queue               State         Final-State         Progress                        Tracking-URL
application_1496703976885_00567       ta da                SPARK        cloudera       default             RUNNING           UNDEFINED              20%             http://10.0.52.156:9090

Check the application IDs; if there are more than one or two, kill them. Your cluster may not be able to run more than two Spark applications at the same time (I am not 100% sure about this, but the cluster starts complaining if you run more than two). Kill them like this:

yarn application -kill application_1496703976885_00567
  3. Check your Spark config parameters. For example, if you have set more executor memory, driver memory, or executors than the cluster can provide, that can also cause this issue. Reduce any of them and rerun your Spark application; a sketch of a more modest submission follows below.
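As a sketch of point 3 (the class and jar names are placeholders), a more modest submission might look like:

spark-submit --master yarn-cluster --class com.example.MyApp --num-executors 1 --driver-memory 512m --executor-memory 512m --executor-cores 1 /path/to/my-app.jar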
0
6

This suggests that YARN cannot assign resources for the new application you are submitting. Try reducing the resources requested for the container (see here), or try on a less busy cluster.

Another thing to try is to check whether YARN is working properly as a service:

sudo service hadoop-yarn-nodemanager status
sudo service hadoop-yarn-resourcemanager status
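You can also check how much memory and how many vcores each NodeManager is actually offering (node-id is a placeholder; take it from the first command's output):

yarn node -list
yarn node -status <node-id>

The second command prints memory and CPU used versus capacity for that node, which shows whether there is room for the containers you requested.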
0
3

I had a small cluster where resources were limited (~3 GB per node). I solved this problem by changing the minimum memory allocation to a sufficiently low number.

From:

yarn.scheduler.minimum-allocation-mb: 1g
yarn.scheduler.increment-allocation-mb: 512m

To:

yarn.scheduler.minimum-allocation-mb: 256m
yarn.scheduler.increment-allocation-mb: 256m
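If you are not using Cloudera Manager, the minimum allocation is a plain YARN property in yarn-site.xml (value in MB); a sketch of the equivalent change:

<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>256</value>
</property>

Note that yarn.scheduler.increment-allocation-mb is a Fair Scheduler setting, so it may not apply to every scheduler configuration.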
1
I am working in Hue and have the same problem. Where do I have to change the settings above?
    – user5227388
    Commented Jan 24, 2018 at 10:33
1

I am on a slightly different setup, using CDH 5.4. I think the cause of this issue on my setup is something getting stuck because of an error (file already exists, etc.), since this happens after some other part of my code errors out and I try to fix it and kick it off again.

I can get past this by restarting all services on the cluster in Cloudera Manager, so I agree with the earlier answers that it's probably due to resources still allocated to something that errored out; you need to reclaim those resources to be able to run again, or allocate them differently to begin with.

For example, my cluster has 4 executors available to it. In the SparkConf for one process, I set spark.executor.instances to 4. While that process is still running, potentially hung for some reason, I kick off another job (either the same way or with spark-submit) with spark.executor.instances set to 1 ("--num-executors 1" if using spark-submit). I only have 4, and all 4 are allocated to the earlier process, so this one, which is asking for 1 executor, has to wait in line.

1

In my case, I saw that some old Spark processes (stopped with Ctrl+Z) were still running, and their ApplicationMasters (drivers) were probably still occupying memory. So the new ApplicationMaster from the new spark command may be waiting indefinitely to get registered by the YARN scheduler, as spark.driver.memory cannot be allocated on the respective core nodes. This can also occur when maximum resource allocation is true and the driver is set to use the maximum resources available on a core node.

So I identified all those stale Spark client processes and killed them (which likely killed their drivers and released memory).

ps -aux | grep spark

hadoop    3435  1.4  3.0 3984908 473520 pts/1  Tl   Feb17   0:12  .. org.apache.spark.deploy.SparkSubmit --conf spark.driver.memory=1G --class org.apache.spark.examples.SparkPi /usr/lib/spark/lib/spark-examples.jar 10

hadoop   32630  0.9  3.0 3984908 468928 pts/1  Tl   Feb17   0:14 .. org.apache.spark.deploy.SparkSubmit --conf spark.driver.memory=1G --class org.apache.spark.examples.SparkPi /usr/lib/spark/lib/spark-examples.jar 1000

    kill -9 3435 32630

After that I do not see those messages.

1
  • If you are running Yarn, you can use yarn application -kill <appilication-ID>. You can check the status of your jobs on port 8088, if running locally, direct your browser to localhost:8088. When I develop something locally, I sometimes have Hive sessions or Zeppelin jobs running as well and I need to kill those first before the Spark job executes.
    – Hendrik F
    Commented Sep 21, 2016 at 15:01
0

When running with yarn-cluster, all the application logging and stdout end up on the assigned YARN ApplicationMaster node and do not appear in the spark-submit output. Also, being a streaming job, the application usually does not exit. Check the Hadoop ResourceManager web interface, and look at the Spark web UI and logs, which are available from the Hadoop UI.
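Once you have the application id (it appears in the spark-submit output, e.g. application_1434263747091_0023 in the question), you can also pull the aggregated container logs from the command line, assuming log aggregation is enabled:

yarn logs -applicationId application_1434263747091_0023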

1
This issue seems to persist even if the application has an end (a non-streaming batch job).
    – marios
    Commented Jun 30, 2015 at 23:43
0

I had the same problem on a local Hadoop cluster with Spark 1.4 and YARN while trying to run spark-shell. It had more than enough resources.

What helped was running the same thing from an interactive LSF job on the cluster. So perhaps there were some network limitations preventing YARN from working from the head node...

0

In one instance, I had this issue because I was asking for too many resources. This was on a small standalone cluster. The original command was

spark-submit --driver-memory 4G --executor-memory 7G --class "my.class" --master yarn --deploy-mode cluster --conf spark.yarn.executor.memoryOverhead my.jar

I succeeded in getting past 'Accepted' and into 'Running' by changing to

spark-submit --driver-memory 1G --executor-memory 3G --class "my.class" --master yarn --deploy-mode cluster --conf spark.yarn.executor.memoryOverhead my.jar

In other instances, I had this problem because of the way the code was written. We instantiated the Spark context inside the class where it was used, and it did not get closed. We fixed the problem by instantiating the context first, passing it to the class where the data is parallelized, and then closing the context (sc.close()) in the caller class.

1
  • 3
    --conf spark.yarn.executor.memoryOverhead. No value?
    – tokland
    Commented Dec 22, 2016 at 21:23
0

I hit the same problem on an MS Azure HDInsight Spark cluster. I finally found out that the issue was that the cluster couldn't talk back to the driver. I assume you used client mode when submitting the job, since you can see this debug log.

The reason is that the Spark executors have to talk to the driver program, and the TCP connection has to be bi-directional. So if your driver program is running on a VM (EC2 instance) that is not reachable via hostname or IP (which you have to specify in the Spark conf; it defaults to the hostname), your status will stay ACCEPTED forever.
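Two common ways around this (sketches only; the address is a placeholder): either tell Spark an address for the driver that the cluster can actually reach, or move the driver into the cluster with cluster deploy mode:

spark-submit --conf spark.driver.host=<reachable-ip-or-hostname> ...
spark-submit --master yarn --deploy-mode cluster ...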

0

I had a similar problem.

Like other answers here indicate, it's a resource availability issue.

In my case, I was doing an ETL process where the old data from the previous run was being trashed each time. However, the newly trashed data was being stored in the controlling user's /user/myuser/.Trash folder. Looking at the Ambari dashboard, I could see that the overall HDFS disk usage was near capacity, which was causing the resource issues.

So in this case, I used the -skipTrash option with hadoop fs -rm ... on the old data files (otherwise they take up space in the trash roughly equivalent to the size of all data stored in the ETL storage dir, effectively doubling the total space used by the application and causing resource problems). A sketch follows below.
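A sketch of that cleanup (the path is a placeholder for the old ETL output directory):

hadoop fs -rm -r -skipTrash /user/myuser/old-etl-output

With -skipTrash the files are deleted immediately instead of being moved into /user/myuser/.Trash, so the HDFS space is freed right away.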

-1

I faced the same issue in the Cloudera QuickStart VM when I tried to start the pyspark shell. When I looked at the job logs in the ResourceManager, I saw

17/02/18 22:20:53 ERROR yarn.ApplicationMaster: Failed to connect to driver at RM IP. 

That means the job is not able to connect to the RM (ResourceManager), because by default pyspark tries to launch in YARN mode in the Cloudera VM.

pyspark --master local 

worked for me. Even starting the RM resolved the issue.

Thanks
