We are getting a ‘Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.’ error when running PySpark code.

Can someone please help us?

  • User server Docker image (tried all of the versions below):

    • pyspark-notebook:python-3.8.8
    • pyspark-notebook:spark-3.2.1
    • pyspark-notebook:ubuntu-20.04
  • Spark cluster version: 3.2.1

    • Workers: Python 3.8
  • Spark Code

     import socket

     import pyspark
     from pyspark.sql import SparkSession
     from pyspark.sql import SQLContext
     from pyspark import SparkConf, SparkContext

     conf = SparkConf()
     # The config key was blank in our original snippet; the intent was to
     # set the driver host so executors can reach the notebook pod.
     conf.set('spark.driver.host', socket.gethostbyname(socket.gethostname()))
     conf.set('spark.executor.instances', '2')
     sc = SparkContext(conf=conf)
     rdd = sc.parallelize(range(0, 2))
  • Error

     An error occurred while calling 
     : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 
     0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 4) 
     ( executor 1): org.apache.spark.api.python.PythonException: Traceback 
     (most recent call last):
       File "/opt/bitnami/spark/python/lib/", line 481, in main
         raise RuntimeError(("Python in worker has different version %s than that in " +
     RuntimeError: Python in worker has different version 3.8 than that in driver 3.9, 
     PySpark cannot run with different minor versions. Please check environment 
     variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
     at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:555)
  • The original Python version mismatch was resolved by using the ‘jupyter/pyspark-notebook:python-3.8.8’ image as the driver (the single-user server).
  • However, the Spark worker nodes were then unable to report back to the driver (the single-user server).

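In case it helps pin down the mismatch: the error says the executors run Python 3.8 while the driver runs 3.9, so both interpreters need to be pinned to the same minor version before the SparkContext is created. A minimal sketch — the worker interpreter path here is an assumption and must be adjusted to wherever Python 3.8 lives in the executor image:

```python
import os
import sys

# Pin the worker-side interpreter. "/usr/bin/python3.8" is an assumed path;
# point this at the actual Python 3.8 binary inside the executor image.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.8"

# Use the notebook's own interpreter on the driver side, so the driver
# version is whatever the single-user server is actually running.
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

# Sanity check: this driver version must match the workers' (e.g. "3.8").
print(f"driver python: {sys.version_info.major}.{sys.version_info.minor}")
```

Both environment variables have to be set before `SparkContext(conf=conf)` runs, otherwise the executors fall back to whatever `python` resolves to in their image.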
Has anyone seen this? Any help resolving it would be appreciated.