We have JupyterHub installed on server A and Jupyter Enterprise Gateway (JEG) installed on server B. JEG is configured to launch kernel jobs on a CDP YARN cluster.
We can launch the spark-scala kernel successfully in client mode, but when we launch it in cluster mode, the YARN application gets stuck in the ACCEPTED state and then times out.
Below is the container log from the point where it gets stuck and times out. There are enough resources in the queue to launch the job, and there is no network connectivity issue between JEG and YARN: we verified that the required ports are open and reachable.
In fact, a sample Spark job submitted with higher memory works fine. The failure occurs only with the Toree launcher; we also tried manually submitting the same Spark command that JEG submits from the backend.
Kindly share any inputs to fix this.
25/05/30 12:59:16 INFO actor.EmptyLocalActorRef: [spark-kernel-actor-system-akka.actor.default-dispatcher-12]: Message [org.apache.toree.kernel.protocol.v5.KernelMessage] from Actor[akka://spark-kernel-actor-system/user/kernel_message_relay#-1527310745] to Actor[akka://spark-kernel-actor-system/user/status] was not delivered. [1] dead letters encountered. If this is not an expected behavior, then [Actor[akka://spark-kernel-actor-system/user/status]] may have terminated unexpectedly, This logging can be turned off or adjusted with configuration settings ‘akka.log-dead-letters’ and ‘akka.log-dead-letters-during-shutdown’.
25/05/30 12:59:16 INFO relay.KernelMessageRelay: [spark-kernel-actor-system-akka.actor.default-dispatcher-12]: Not ready for messages! Stashing until ready!
25/05/30 12:59:18 WARN toree.Main$$anon$1: [Driver]: No external magics provided to PluginManager!
25/05/30 12:59:20 INFO toree.Main$$anon$1: [Driver]: 12 internal plugins loaded
25/05/30 12:59:20 INFO toree.Main$$anon$1: [Driver]: 0 external plugins loaded
25/05/30 12:59:47 WARN scala.ScalaInterpreter: [Driver]: kernel variable: org.apache.toree.boot.layer.StandardComponentInitialization$$anon$1@17ec46b7
25/05/30 12:59:47 WARN scala.ScalaInterpreter: [Driver]: Binding List(@transient implicit) kernel org.apache.toree.kernel.api.Kernel org.apache.toree.boot.layer.StandardComponentInitialization$$anon$1@17ec46b7
25/05/30 12:59:52 INFO scala.ScalaInterpreter: [Driver]: Binding SparkContext into interpreter as sc
25/05/30 13:00:04 INFO toree.Main$$anon$1: [Driver]: Marking relay as ready for receiving messages
25/05/30 13:00:04 INFO relay.KernelMessageRelay: [spark-kernel-actor-system-akka.actor.default-dispatcher-3]: Unstashing all messages received!
25/05/30 13:00:04 INFO relay.KernelMessageRelay: [spark-kernel-actor-system-akka.actor.default-dispatcher-3]: Relay is now fully ready to receive messages!
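
A port check of the kind mentioned above can be sketched roughly as follows. This is only a minimal sketch, not the exact check that was run: `port_is_open` is a hypothetical helper, and the host names and port numbers in the usage comment are placeholders to be replaced with the real JEG and YARN endpoints. In cluster mode the driver runs inside a YARN container, so it can be worth running the same check from a YARN node towards the JEG host as well as from JEG towards YARN.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False


# Example usage (placeholder hosts and ports, NOT taken from the setup above):
#   port_is_open("yarn-rm.example.com", 8032)   # from the JEG host towards YARN
#   port_is_open("jeg-host.example.com", 8877)  # from a YARN node back towards JEG
```

A successful TCP connect only shows that the port is reachable from the machine where the check runs; it does not prove that the service behind it accepts the kernel's traffic.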