I am attempting to use a PySpark kernel inside an EMR (Jupyter) Notebook. The notebook is provided through a managed service in AWS, but I do not know the full architecture or where the notebook is actually hosted.
When attempting to install a package using the command
sc.install_pypi_package("pandas","https://<ARTIFACTORY_URL>")
I receive the following error:
Collecting pandas
Could not fetch URL https://<ARTIFACTORY_URL>: There was a problem confirming the ssl certificate: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1091) - skipping
The directory '/home/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/home/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Could not find a version that satisfies the requirement pandas (from versions: )
No matching distribution found for pandas
We do have a self-signed certificate on the server, and there is a pip.conf located at /etc/pip.conf that points to the Artifactory URL as an index-url and index, along with the location of the self-signed certificate. I have also validated that the server itself is able to reach Artifactory and establish an SSL handshake.
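To compare what pip is configured to use with what the Python process can actually verify, a check along these lines could be run in the same notebook session. This is only a sketch: it assumes the pip.conf keeps its settings under a [global] section with index-url and cert keys, and the helper names read_pip_settings and try_handshake are made up for illustration.

```python
import configparser
import socket
import ssl


def read_pip_settings(conf_text):
    """Return the index-url and cert values from pip.conf text.

    Assumes both settings live in the [global] section; missing
    values come back as None.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    return {
        "index-url": parser.get("global", "index-url", fallback=None),
        "cert": parser.get("global", "cert", fallback=None),
    }


def try_handshake(host, port=443, ca_bundle=None, timeout=10.0):
    """Attempt a verified TLS handshake and return (ok, detail).

    ca_bundle is the path to a PEM file with the trusted CA
    certificates; None falls back to the system defaults.
    """
    try:
        # The default context verifies both the chain and the hostname.
        context = ssl.create_default_context(cafile=ca_bundle)
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return True, tls.version()
    except (ssl.SSLError, OSError) as exc:
        # Covers verification failures, a missing CA file, and network errors.
        return False, f"{type(exc).__name__}: {exc}"
```

Passing the contents of /etc/pip.conf to read_pip_settings, and then the Artifactory host together with the returned cert path to try_handshake, would show whether the handshake succeeds from that particular process with that CA bundle, rather than from a shell on the server.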
I have also run the following command to determine which pip.conf files are being used, and this was the result. I do not have a file in xdg, and I am unsure where to find the two home/.pip directories, since those are not showing up for me on the server itself.
from pip import create_main_parser
parser = create_main_parser()
print(parser.files)
['/etc/xdg/pip/pip.conf', '/etc/pip.conf', '/home/.pip/pip.conf', '/home/.config/pip/pip.conf']
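To see which of those candidate files actually exist from the point of view of the process running the notebook code, something like the following could be run in a cell. It is a rough sketch: the path list is copied from the output above, and describe_config_files is a hypothetical helper name, not part of pip.

```python
import os

# Candidate paths copied from the parser.files output above.
CANDIDATES = [
    "/etc/xdg/pip/pip.conf",
    "/etc/pip.conf",
    "/home/.pip/pip.conf",
    "/home/.config/pip/pip.conf",
]


def describe_config_files(paths):
    """Report, for each path, whether it exists and is readable by this process."""
    report = []
    for path in paths:
        exists = os.path.isfile(path)
        readable = exists and os.access(path, os.R_OK)
        report.append({"path": path, "exists": exists, "readable": readable})
    return report


# Show who the process runs as and what it considers its home directory,
# since the per-user pip.conf locations are derived from that.
print("uid:", os.getuid(), "HOME:", os.environ.get("HOME"))
for entry in describe_config_files(CANDIDATES):
    print(entry)
```

If /etc/pip.conf shows up as missing or unreadable here, that would suggest the code is running on a different host or as a different user than the one I checked by hand.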
Has anyone run across this or have an idea on how to start troubleshooting?