Hello,
I am attempting to use a PySpark kernel inside an EMR (Jupyter) Notebook. The notebook is provided through a managed service in AWS, but I am not sure of the full architecture or where the notebook is actually hosted.
When attempting to install a package using the command sc.install_pypi_package("pandas","https://<ARTIFACTORY_URL>"), I receive the following error:
Collecting pandas
Could not fetch URL https://<ARTIFACTORY_URL>: There was a problem confirming the ssl certificate: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1091) - skipping
The directory '/home/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/home/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Could not find a version that satisfies the requirement pandas (from versions: )
No matching distribution found for pandas
We do have a self-signed certificate on the server, and there is a pip.conf located at /etc/pip.conf that points to the Artifactory URL as an index-url and index, along with the location of the self-signed certificate. I have also validated that the server itself is able to reach Artifactory and complete an SSL handshake.
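To confirm what the notebook session itself sees in that file, something like the following could be run from a notebook cell. This is only a minimal sketch: `read_pip_global` is a hypothetical helper name, it assumes the settings live in the `[global]` section, and it assumes the cell runs on the same machine where /etc/pip.conf was created, which may not be the case in a managed notebook.

```python
import configparser


def read_pip_global(path="/etc/pip.conf"):
    """Return the [global] section of a pip config file as a dict.

    Returns an empty dict if the file is missing or has no [global] section.
    """
    parser = configparser.RawConfigParser()
    parser.read(path)  # files that cannot be opened are skipped silently
    if parser.has_section("global"):
        return dict(parser.items("global"))
    return {}


# An empty result would suggest this process cannot see the file at all.
print(read_pip_global())
```

If the result is empty, or the cert path it prints does not exist where the cell runs, the install command is probably executing somewhere other than the server that was checked by hand.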
I have also run the following command to determine which pip.conf is being used, and the result is shown below it. I do not have a file in xdg, and I am unsure where to find the two /home directories (/home/.pip and /home/.config/pip), since those are not showing up for me on the server itself.
from pip import create_main_parser
parser = create_main_parser()
print(parser.files)
['/etc/xdg/pip/pip.conf', '/etc/pip.conf', '/home/.pip/pip.conf', '/home/.config/pip/pip.conf']
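A quick way to see which of those candidate files actually exist from inside the notebook session is sketched below. The path list is copied from the output above, and `existing_files` is a hypothetical helper name. Printing the home directory is included because pip's per-user config paths are derived from it, so /home/.pip would be consistent with the process running with a home directory of /home; that is an assumption to verify, not a confirmed diagnosis.

```python
import os

# Candidate config files, copied from the parser.files output above.
candidates = [
    "/etc/xdg/pip/pip.conf",
    "/etc/pip.conf",
    "/home/.pip/pip.conf",
    "/home/.config/pip/pip.conf",
]


def existing_files(paths):
    """Return only the paths that exist as regular files for this process."""
    return [p for p in paths if os.path.isfile(p)]


print("home directory:", os.path.expanduser("~"))
print("existing config files:", existing_files(candidates))
```

If /etc/pip.conf is missing from the second list, the code is likely running on a different host or container than the one that was configured.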
Has anyone run into this, or does anyone have an idea of how to start troubleshooting?