Delta-sharing library not working with singleuser image anaconda

Hi,

I have a question about the anaconda distribution in the jupyterhub Docker image.

I’m trying to use the new delta-sharing library created by Databricks.

  1. If I use the anaconda distribution in the jupyterhub/singleuser docker image and I try the command “delta_sharing.load_as_pandas(table_url)”, the pyarrow library used throws a FileNotFoundError

  2. If I don’t use the singleuser image and I configure, make, make install my own python then the command “delta_sharing.load_as_pandas(table_url)” works.

So my possible questions are:

  • What can I add to your jupyterhub/singleuser anaconda distribution for it to work with the delta-sharing library ?

  • If it’s not possible to change/fix the anaconda distribution, how can I change the jupyterhub/singleuser image to point to my configured python version and still work ?

Thank you very much
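Since the same call fails under one Python and works under the other, a first step could be to compare what each interpreter actually imports. The following is a minimal sketch, not part of delta-sharing: `describe_environment` is a hypothetical helper, and the default package list is an assumption about which libraries are involved in `load_as_pandas`.

```python
import importlib
import sys


def describe_environment(packages=("delta_sharing", "pyarrow", "pandas", "fsspec")):
    """Return the interpreter details and the version of each package, if importable.

    The default package names are an assumption; adjust them as needed.
    """
    report = {"python": sys.version.split()[0], "executable": sys.executable}
    for name in packages:
        try:
            module = importlib.import_module(name)
        except Exception as exc:  # missing package or broken binary dependency
            report[name] = f"import failed: {exc!r}"
        else:
            # Not every package exposes __version__, so fall back to "unknown".
            report[name] = getattr(module, "__version__", "unknown")
    return report


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running this once in the singleuser image and once in the self-built Python, then comparing the two outputs line by line, may show a version difference that explains the error.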

Hi! Can you show us how you’re installing delta-sharing in your Docker image? Do you have a link to your Dockerfile?

Hi Manics,

In both cases I get the ‘delta-sharing’ library from our artifactory using the following:

RUN echo "[global]" > /tmp/pip.conf && \
    echo "index-url = https://${artifactory_username}:${artifactory_password}@artifactory.cib.echonet/artifactory/api/pypi/pypi/simple" >> /tmp/pip.conf && \
    echo "trusted-host = artifactory.cib.echonet" >> /tmp/pip.conf

and then I do pip install delta-sharing

The difference is how python was built

  1. In one case I don’t build anything and just use a predefined jupyterhub singleuser docker image. I’ve tried both of the following (and they both use anaconda) and I haven’t seen any difference (I get the same error)

FROM artifactory.cib.echonet/jupyterhub/singleuser:1.2.0
FROM artifactory.cib.echonet/jupyterhub/k8s-singleuser-sample:0.11.1

  2. In the other case I build python myself from a centos base, using the following in the Dockerfile:

#####################

ARG artifactory_username
ARG artifactory_password
ARG artifactory_apikey

ARG python_version="3.7.1"
ARG python_dir="/apps/python"
ARG python_dist_file="Python-${python_version}.tgz"

ENV LD_LIBRARY_PATH=/usr/local/lib:/usr/local/include

RUN yum -y group install "Development Tools"
RUN yum -y install zlib-devel
RUN yum -y install libffi-devel
RUN yum -y install openssl-devel
RUN yum -y install libsqlite3x-devel
RUN yum -y install bzip2-devel
RUN yum -y install xz-devel

RUN curl "https://artifactory.cib.echonet/artifactory/external-generic-local/python/python/python/linux/${python_dist_file}" \
    -o "/tmp/${python_dist_file}" -u ${artifactory_username}:${artifactory_password} && \
    mkdir -p ${python_dir} && \
    tar -xzvf /tmp/${python_dist_file} -C ${python_dir} && \
    rm "/tmp/${python_dist_file}"

RUN cd ${python_dir}/Python-${python_version} && \
    ./configure --with-openssl="/usr" --enable-loadable-sqlite-extensions && \
    make && make install

RUN echo "[global]" >> /etc/pip.conf && \
    echo "index-url = https://${artifactory_username}:${artifactory_apikey}@artifactory.cib.echonet/artifactory/api/pypi/pypi/simple" >> /etc/pip.conf && \
    echo "trusted-host = artifactory.cib.echonet" >> /etc/pip.conf

####################

Do I maybe need to add the delta-sharing library with ‘conda install’ rather than ‘pip install’ when I use the anaconda distribution in the singleuser image case ?

I’ve tried the ‘delta_sharing.load_as_spark(table_url)’ - the spark way - and that works with both python installations. It’s just the load_as_pandas() that I’m having issues with

Thank you very much,

Alison
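To narrow down which file is reported missing, one option is to wrap the failing call and print the details carried by the exception. This is a minimal sketch under assumptions: `try_load` is a hypothetical helper, and `exc.filename` may be `None` if the library raising the error did not set it.

```python
import traceback


def try_load(loader, table_url):
    """Call loader(table_url); on FileNotFoundError, print details and return None."""
    try:
        return loader(table_url)
    except FileNotFoundError as exc:
        # filename is only populated when the raising code supplied a path.
        print("missing path:", exc.filename)
        print("message:", exc)
        traceback.print_exc()  # full stack shows which library layer failed
        return None
```

It would be called as `try_load(delta_sharing.load_as_pandas, table_url)` in the singleuser image, with `table_url` being the same value used in the post above.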

delta-sharing looks pretty simple to build… but has a lot of not-so-trivial dependencies, many of which won’t be (up-to-date) in the anaconda distribution. Mixing pip install and conda install is relatively benign when done as a one-shot in a container… but you never really know.
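When pip and conda have both been used in one environment, it can help to check which of the two owns a given package. conda keeps one JSON record per installed package in the `conda-meta` directory of the environment, named `<name>-<version>-<build>.json`. The sketch below relies on that layout; `conda_installed` is a hypothetical helper name.

```python
import glob
import os
import sys


def conda_installed(package, prefix=sys.prefix):
    """Return True if conda-meta under `prefix` holds a record for `package`."""
    pattern = os.path.join(prefix, "conda-meta", f"{package}-*.json")
    for path in glob.glob(pattern):
        # Strip ".json", then split off the version and build fields from the right.
        name = os.path.basename(path)[: -len(".json")].rsplit("-", 2)[0]
        if name == package:
            return True
    return False
```

For example, `conda_installed("pyarrow")` returning `True` while `conda_installed("pandas")` returns `False` would suggest pandas came from pip in that environment.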

Getting it on conda-forge (the community-led upstream of the anaconda distribution) would likely give a tested, compatible, continuously updated solution. I’ve opened up this pull request to kick the tires on it. Feel free to weigh in there!

More broadly: conda-forge’s Miniforge (or Mambaforge) can be a better fit for containerization for size/reproducibility purposes, as it encourages you to only bring what you need (e.g. not a compiler) and document exactly what goes in… in this case, having a pip stanza in an environment.yml is a good way to provide a more complete picture.

Also, IANAL, but: depending on your company size, anaconda stock prices, and the phases of the moon, etc. you may be in violation of the ToS for the anaconda distribution. This covers not only distribution, but also just “commercial activity.” The packages and installers created by conda-forge are, however, definitely not encumbered, hence we have all but shifted to them in various Jupyter projects.

Thank you very much bollwyvl

So if I’ve understood correctly:

  • As jupyterhub uses the Anaconda distribution I need to install the ‘delta-sharing’ library via ‘conda’ rather than ‘pip’

Would it be possible to run Jupyterhub without Anaconda ? All the singleuser images seem to use Anaconda

Thank you very much,

Alison

As jupyterhub uses the Anaconda distribution I need to install the ‘delta-sharing’ library via ‘conda’ rather than ‘pip’

Yes, conda install -c conda-forge delta-sharing-python will work within the hour.

Would it be possible to run Jupyterhub without Anaconda

It is, of course, but the convenience of having pre-built binaries, especially for nasty things like GDAL, generally leads people to rely on conda for at least the base level of python, as the system package managers generally lag behind what data scientists demand.

The biggest win, though, for conda in a container is the ability to add advanced technology (new packages, including compiled ones) at run time as a non-root user.

Thank you very much bollwyvl !

I can see the file is here → Files :: Anaconda.org

I will try this via conda install

Alison