Using the pyspark-notebook Docker image, I can open a notebook in the browser and see that the pyspark module is installed (via `help('modules')`). But when I try to use what appears to be the same Python distribution (judging by the kernelspec) from the container command line, pyspark doesn't appear to be installed.
Aside from installing it as part of the build, am I missing something about how Jupyter prepares the environment when a notebook is initialised in the browser?
I can run pyspark from the command line, but ultimately I'm trying to run a pyspark-based notebook from the command line using papermill.
Jupyter doesn't really prepare environments at all; it runs in whatever environment you've given it.
When two Python installations behave differently like this, it's almost always because they are in fact different Python environments. Compare the value of `sys.prefix` in each; you can also check `sys.path`, which shows you where `import` looks for packages.
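For example, running the same short check in both the notebook kernel and the command-line interpreter makes the difference (or sameness) immediately visible:

```python
import sys

# The environment's root directory. If this differs between the
# notebook kernel and the command line, they are different environments.
print(sys.prefix)

# The directories `import` searches, in order. pyspark must appear
# somewhere under one of these to be importable.
for path in sys.path:
    print(path)
```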
If you're running `pyspark` from its own command-line entrypoint and it's not actually installed as a Python package, note that one of the things the pyspark launcher script does is add itself to `sys.path`, so you don't really have to install it to use it. That results in exactly the kind of confusion you describe.
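A quick way to tell the two situations apart (a hypothetical check, not from the original post) is to ask Python whether pyspark is importable at all, and if so where it resolves from:

```python
import importlib.util

spec = importlib.util.find_spec("pyspark")
if spec is None:
    # Not on sys.path: neither pip-installed nor added by a launcher.
    print("pyspark is not importable in this environment")
else:
    # spec.origin shows whether it came from site-packages
    # or from a Spark install directory added to sys.path.
    print("pyspark found at:", spec.origin)
```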
A long time ago I made the tiny findspark package, which does exactly this but the other way around (it looks for Spark and adds it to `sys.path`), so that you can get similar behavior. For personal use, though, you can generally replace it with a single `sys.path.extend` call with the right paths for your environment prior to importing pyspark.
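A minimal sketch of that approach. The `/usr/local/spark` default and the py4j zip layout are assumptions about a typical Spark install; adjust them for your image:

```python
import glob
import os
import sys

# Assumed install location -- check your container; SPARK_HOME is
# usually set in the pyspark-notebook image.
spark_home = os.environ.get("SPARK_HOME", "/usr/local/spark")

# pyspark itself lives in $SPARK_HOME/python; its bundled py4j
# dependency is a zip under $SPARK_HOME/python/lib.
spark_paths = [os.path.join(spark_home, "python")]
spark_paths += glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip"))

sys.path.extend(spark_paths)

# After this, `import pyspark` should resolve without a pip install.
```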