First, for the Spark dataframes, have you tried the solutions offered here or here? (Maybe this one if you install the almond kernel via this?)
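I can't tell exactly which fix each of those links describes, but the usual approaches I've seen boil down to either turning on Spark's eager-eval HTML rendering or pulling a small sample over to Pandas. A rough sketch, assuming a `SparkSession` named `spark` already exists (as it does in the EMR PySpark kernel) and with made-up sample data:

```python
# Sketch only -- `spark` is assumed to already exist in the notebook session.

# Option 1: let Spark DataFrames render as HTML tables in Jupyter (Spark 2.4+).
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)
spark.conf.set("spark.sql.repl.eagerEval.maxNumRows", 20)

# Option 2: convert a small sample to Pandas and rely on its rich display.
sdf = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
sdf.limit(20).toPandas()  # last expression in a cell -> HTML table
```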
I think, as the original question here shows, what you are seeing is expected. The expectations you mention in your original post, though, may come from working in different kernels, such as the typical Python 3 kernel?
I just tried running an EMR notebook on a minimal AWS cluster, and one of the choices of kernel to start a notebook with is Python 3. (I also see PySpark, Spark, and SparkR.) Within a notebook launched with the Python 3 kernel, my Pandas dataframe rendered with the rich display typical of a Pandas dataframe in a notebook, and the ‘Hello, world’ code you first posted rendered fine, too. So I think your issues are tied to the kernel you are using and not to Jupyter.
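For reference, this is roughly the kind of check I ran in the Python 3 kernel (the data is just a placeholder):

```python
# In a Python 3 kernel notebook cell: a Pandas DataFrame left as the last
# expression renders as a rich HTML table rather than plain text.
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [0.9, 0.7]})
df  # rich display in Jupyter; plain repr in a console
```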
The ‘unexpected behaviour’ mentioned in your original post is maybe based on familiarity with a Python-based notebook? You won’t see that with other kernels/shells. For an extreme example, go here and launch a Binder session; when it comes up you’ll get a list of notebooks to start from. Choose the first one, ‘Getting Circos Up and Running’, from the list. That one opens with a Bash kernel and is really not happy to see Python code.
Admittedly ‘PySpark’ sounds like it would be a superset of the Python kernel, but that doesn’t seem to be fully the case. I think you are going to have to adjust your expectations for dealing with this new kernel. For example, in my limited attempts I saw that the PySpark kernel recognizes the `%%configure` cell magic, which isn’t recognized by a Python notebook kernel. Conversely, I can use `pwd` in the Python 3 kernel notebook but need the explicit form `%pwd` in the PySpark shell.
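To illustrate what I mean, here are the kinds of cells involved; each commented chunk is a separate notebook cell, and the JSON fields for `%%configure` are placeholders that depend on your EMR/Sparkmagic setup, not a recommendation:

```python
# PySpark (Sparkmagic) kernel -- cell magic to adjust the Spark session:
#   %%configure -f
#   {"executorMemory": "2g", "executorCores": 2}

# Python 3 kernel -- automagic lets you drop the % prefix:
#   pwd

# PySpark kernel -- the explicit line-magic form is needed:
#   %pwd
```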
So maybe while learning some of the more advanced PySpark tricks, you may want to run your Spark jobs in the PySpark environment and use a Python notebook to look at your results? If possible. I definitely don’t know how interconvertible things are. Here they seem to convert the RDD to a dataframe and later write it out in Parquet format, and from here it looks like Pandas can read that back (rough sketch below).
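A minimal sketch of that round trip, assuming a `SparkSession` named `spark` is already available (as in the PySpark kernel) and using a hypothetical output path:

```python
# In the PySpark kernel: RDD -> Spark DataFrame -> Parquet.
rdd = spark.sparkContext.parallelize([("Alice", 34), ("Bob", 45)])
sdf = rdd.toDF(["name", "age"])                       # convert the RDD to a DataFrame
sdf.write.mode("overwrite").parquet("s3://my-bucket/results")  # placeholder path

# Later, in a Python 3 kernel notebook: read the Parquet output with Pandas.
import pandas as pd
pdf = pd.read_parquet("s3://my-bucket/results")  # needs pyarrow or fastparquet installed
```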