Hey everyone, I am trying to integrate docker-stacks with Spark, and I suspect I am misunderstanding something, probably on the networking side. When I use docker-stacks against a Spark standalone cluster, this first statement works fine:
from pyspark.sql import SparkSession
# Create session
spark = SparkSession.builder.master("spark://spark:7077").getOrCreate()
But the following hangs:
rdd = spark.sparkContext.parallelize(range(100 + 1))
rdd.sum()
The logs say
22/09/26 06:33:56 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/09/26 06:34:11 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/09/26 06:34:26 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/09/26 06:34:41 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/09/26 06:34:56 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/09/26 06:35:11 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
[I 06:35:23.429 NotebookApp] Saving file at /work/sample.ipynb
22/09/26 06:35:26 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/09/26 06:35:41 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/09/26 06:35:56 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
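From what I have read, this warning usually means the workers registered with the master fine but cannot open connections back to the driver running inside the notebook container. For reference, this is the kind of driver-side configuration I understand is usually needed; the hostname `pyspark-notebook` and the fixed port numbers below are my assumptions, not something from my current setup:

```python
from pyspark.sql import SparkSession

# Tell executors how to reach back to the driver: the hostname must
# resolve from the worker containers, and the ports must be reachable.
spark = (
    SparkSession.builder
    .master("spark://spark:7077")
    .config("spark.driver.host", "pyspark-notebook")  # assumed container hostname
    .config("spark.driver.port", "5001")              # pinned so it can be reached
    .config("spark.blockManager.port", "5003")        # likewise pinned
    .getOrCreate()
)
```

Without `spark.driver.host`, the driver advertises the container's internal ID/address, which the workers may not be able to resolve.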
I believe it has to do with the "Image Specifics" page of the Docker Stacks documentation. I have tried different networking combinations but can't figure out how to make the containers aware of one another. Any guidance would be great. Below is my docker-compose.yaml:
version: '3'
services:
  spark:
    image: docker.io/bitnami/spark:3.3
    hostname: spark
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
      - '4040:4040'
      - '4041:4041'
      - '7077:7077'
  spark-worker:
    image: docker.io/bitnami/spark:3.3
    hostname: spark-worker
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
  minio:
    image: docker.io/bitnami/minio:2022
    hostname: minio
    environment:
      - MINIO_ROOT_USER=minio-root-user
      - MINIO_ROOT_PASSWORD=minio-root-password
    ports:
      - '9000:9000'
      - '9001:9001'
    volumes:
      - 'minio_data:/data'
  pyspark-notebook:
    image: docker.io/jupyter/pyspark-notebook:python-3.8.8
    ports:
      - '8888:8888'
    environment:
      - TINI_SUBREAPER=true
    volumes:
      - 'notebook_data:/home/jovyan/work'
volumes:
  minio_data:
    driver: local
  notebook_data:
    driver: local
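For what it's worth, the direction I have been experimenting with is giving the notebook container a stable hostname the workers can resolve, and putting the Spark-related services on an explicit shared network. The network name, hostname, and driver port numbers below are just my guesses, not a confirmed fix (and `spark.driver.host` would still need to be set to this hostname on the driver side):

```yaml
# Assumed additions, not a confirmed fix.
services:
  pyspark-notebook:
    hostname: pyspark-notebook   # so workers can resolve the driver
    networks:
      - spark-net
  spark:
    networks:
      - spark-net
  spark-worker:
    networks:
      - spark-net

networks:
  spark-net:
    driver: bridge
```

Note that Compose already places all services in one file on a shared default network, so the explicit network may be redundant; the hostname (and a resolvable `spark.driver.host`) seems to be the essential part.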