I have a question about the Anaconda distribution in the jupyterhub docker image.
I’m trying to use the new delta-sharing library created by Databricks.
If I use the anaconda distribution in the jupyterhub/singleuser docker image and I try the command “delta_sharing.load_as_pandas(table_url)”, the pyarrow library used throws a FileNotFoundError
If I don’t use the singleuser image and I configure, make, make install my own python then the command “delta_sharing.load_as_pandas(table_url)” works.
So my possible questions are:
What can I add to your jupyterhub/singleuser anaconda distribution for it to work with the delta-sharing library?
If it’s not possible to change/fix the anaconda distribution, how can I change the jupyterhub/singleuser image to point to my configured python version and still work?
In one case I don’t build anything and just use a predefined jupyterhub singleuser docker image. I’ve tried both of the following base images (both use anaconda) and haven’t seen any difference (I get the same error):
FROM artifactory.cib.echonet/jupyterhub/singleuser:1.2.0
FROM artifactory.cib.echonet/jupyterhub/k8s-singleuser-sample:0.11.1
In the other case I build python myself, using the following in a CentOS-based Dockerfile:
RUN yum -y group install "Development Tools"
RUN yum -y install zlib-devel
RUN yum -y install libffi-devel
RUN yum -y install openssl-devel
RUN yum -y install libsqlite3x-devel
RUN yum -y install bzip2-devel
RUN yum -y install xz-devel
Do I maybe need to add the delta-sharing library with ‘conda install’ rather than ‘pip install’ when I use the anaconda distribution in the singleuser image case?
I’ve tried ‘delta_sharing.load_as_spark(table_url)’ (the spark way) and that works with both python installations. It’s just load_as_pandas() that I’m having issues with.
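Since the same call works in one python and fails in the other, it may help to compare what each interpreter actually has installed. Below is a minimal sketch using only the standard library; the package list is an assumption about which dependencies load_as_pandas() relies on, and environment_report is just an illustrative helper name, not part of delta-sharing.

```python
import sys
from importlib import metadata

# Assumed list of packages relevant to load_as_pandas(); adjust as needed.
PACKAGES = ("delta-sharing", "pandas", "pyarrow", "fsspec")


def environment_report(packages=PACKAGES):
    """Collect the interpreter path, Python version and package versions."""
    report = {"executable": sys.executable, "python": sys.version.split()[0]}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # not installed in this interpreter
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value if value is not None else 'not installed'}")
```

Running this once in the singleuser image and once in the self-built python, then diffing the two outputs, should show whether the two setups resolve to different pyarrow or fsspec versions.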
delta-sharing looks pretty simple to build… but has a lot of not-so-trivial dependencies, many of which won’t be (up-to-date) in the anaconda distribution. Mixing pip install and conda install is relatively benign when done as a one-shot in a container… but you never really know.
Getting it on conda-forge (the community-led upstream of the anaconda distribution) would likely give a tested, compatible, continuously updated solution. I’ve opened up this pull request to kick the tires on it. Feel free to weigh in there!
More broadly: conda-forge’s Miniforge (or Mambaforge) can be a better fit for containerization for size/reproducibility purposes, as it encourages you to only bring what you need (e.g. not a compiler) and document exactly what goes in… in this case, having a pip stanza in an environment.yml is a good way to provide a more complete picture.
Also, IANAL, but: depending on your company size, anaconda stock prices, and the phases of the moon, etc. you may be in violation of the ToS for the anaconda distribution. This covers not only distribution, but also just “commercial activity.” The packages and installers created by conda-forge are, however, definitely not encumbered, hence we have all but shifted to them in various Jupyter projects.
As jupyterhub uses the Anaconda distribution, I need to install the ‘delta-sharing’ library via ‘conda’ rather than ‘pip’?
Yes, conda install -c conda-forge delta-sharing-python will work within the hour.
Would it be possible to run Jupyterhub without Anaconda?
It is, of course, but the convenience of having pre-built binaries, especially for nasty things like GDAL, generally leads people to rely on conda for at least the base level of python, as the system package managers generally lag for what data scientists demand.
The biggest win, though, for conda in a container is the ability to add advanced technology at run time as a non-root user.