I’m running JupyterHub on an AWS EMR cluster using a PySpark 3 notebook and am experiencing some unexpected behavior when trying to render objects to HTML. Rather than rendering the object as HTML, I see the following output:
As an example, the following:
from IPython.core.display import display, HTML
display(HTML('Hello, world'))
results in the above output.
I suspect I’m missing a necessary package or the like but have had no luck running down the source of the issue.
Maybe try running
%pip install IPython
and then see if import IPython fixes it when you restart the kernel after that?
Thanks for the suggestion. I cannot run pip from a cell, but by the looks of it I already have IPython installed. Running
Yes, it looks like a fairly new version, too. Does just
import IPython help?
I doubt this will make much difference, but looking at https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/ there are some differences in the environments, so have you tried without the
.core part? I put some examples with just
from IPython.display import display, HTML
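That variant of the original snippet would look like this (the 'Hello, world' payload mirrors the first post; IPython.display is the public import path):

```python
# Import from IPython.display (no .core) -- the public path in modern IPython
from IPython.display import display, HTML

# Same minimal example as the original post
display(HTML('Hello, world'))
```

In a Python 3 kernel this should render the string as HTML rather than showing a plain object representation.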
Also, you didn’t say if you are using the classic interface or the JupyterLab interface. If you are using JupyterLab, I suggest switching to the classic interface (or vice-versa) and seeing what happens, if you can. I don’t know how universal the ability to switch that I discuss in the middle paragraph here is. This usually fixes widgets that often show up as these
<object> tags in JupyterLab.
Thanks for all suggestions. I tried both changing the import and switching from classic to JupyterLab and neither seemed to make a difference.
Not sure if this is helpful but I can render HTML using a sparkmagic command with the example you provided:
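(The exact cell isn’t shown above. The sparkmagic PySpark kernel runs cells remotely via Livy, but it provides a %%local cell magic that executes a cell in the local IPython process, where rich display works. This is only a guess at the shape of the command, not necessarily the one actually used:)

```
%%local
from IPython.display import display, HTML
display(HTML('Hello, world'))
```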
Just to be clear, I just provided the info about sparkmagic to show that HTML rendering works there. I still can’t get dataframes to render as HTML.
Has the display ability improved for Spark Dataframes? This post at https://www.kdnuggets.com/2016/01/python-data-science-pandas-spark-dataframe-differences.html said the display was lacking in notebooks compared to Pandas. Maybe you are seeing the default handling?
I can’t say about Spark dataframes but the snippet I posted originally has no dependency on Spark so it seems to be an issue aside from that question. FWIW, I’ve tried using Pandas dataframes as well and haven’t been successful in getting those to display as HTML either.
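For completeness, pandas can always emit the HTML itself via DataFrame.to_html(), independent of whatever rich-display support the kernel has; a minimal sketch with made-up column names:

```python
import pandas as pd

# Build a small DataFrame and convert it to an HTML string explicitly,
# sidestepping the kernel's rich-display machinery entirely.
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
html = df.to_html()
print(html.startswith('<table'))
```

The resulting string can then be fed to display(HTML(...)) or inspected directly, which at least separates "pandas can't produce HTML" from "the kernel can't render it".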
First, for the Spark dataframes, have you tried the solutions offered here or here? (Maybe this, if you install the almond kernel via this?)
I think, as the original question here shows, what you are seeing is expected. The expectations mentioned in your original post, though, may come from working in different kernels, such as the typical Python 3 kernel?
I just tried running an EMR notebook on a minimal AWS cluster, and one of the choices for kernels to start up a notebook with is Python 3. (I also see PySpark, Spark, and SparkR.) Within a notebook launched using Python 3, my Pandas dataframe rendered with the rich display typical of a Pandas dataframe in a notebook, and your ‘Hello, world’ code that you first posted rendered fine, too. So I think your issues are tied to the kernel you are using and not Jupyter.
The ‘unexpected behaviour’ mentioned in your original post is maybe based on familiarity with a Python-based notebook? You won’t see that with other kernels/shells. For an extreme example, go here and launch a binder session; when it comes up you’ll get a list of notebooks to start. Choose the first one, ‘Getting Circos Up and Running’, from the list. That opens with a Bash kernel and is really not happy to see Python code.
Admittedly, ‘PySpark’ sounds like it would be a superset of the Python kernel, but that doesn’t seem to be fully the case. I think you are going to have to adjust your expectations for dealing with this new kernel. For example, in my limited attempts I saw that the PySpark kernel recognizes the
%%configure cell magic, which isn’t recognized by a Python notebook kernel. Conversely, I can use
pwd in the Python 3 kernel notebook but need the older form
%pwd in the PySpark shell.
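To illustrate the %pwd point outside a notebook: line magics can also be invoked through IPython’s programmatic API, which shows what a %pwd cell returns (a plain-IPython sketch, not specific to the PySpark kernel):

```python
import os
from IPython.core.interactiveshell import InteractiveShell

# Line magics like %pwd can be called programmatically; this mirrors
# typing %pwd in a cell of a Python-based notebook.
shell = InteractiveShell.instance()
cwd = shell.run_line_magic('pwd', '')  # same value a %pwd cell would show
print(cwd == os.getcwd())
```

Whether the bare pwd form works depends on IPython’s automagic handling, which is one of the places kernels can differ.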
So maybe, while learning some of the advanced PySpark tricks, you may want to run your Spark jobs in the PySpark environment and use the Python notebook to look at your results, if possible. I definitely don’t know how interconvertible things are. Here they seem to convert the RDD to a dataframe and later put it in parquet format. From here it looks like Pandas can read that.
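A rough sketch of that hand-off, with made-up paths and column names, assuming a live Spark session (spark) on the EMR side and pyarrow plus s3fs available on the pandas side; it won’t run outside such an environment:

```python
# --- PySpark kernel (on the cluster): RDD -> DataFrame -> parquet ---
rdd = spark.sparkContext.parallelize([(1, 'a'), (2, 'b')])
sdf = rdd.toDF(['id', 'label'])          # convert the RDD to a Spark DataFrame
sdf.write.mode('overwrite').parquet('s3://your-bucket/example.parquet')

# --- Python 3 kernel (local): read the same parquet with pandas ---
import pandas as pd
pdf = pd.read_parquet('s3://your-bucket/example.parquet')  # needs pyarrow + s3fs
pdf  # renders with pandas' usual rich HTML table in a Python kernel
```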