I have proprietary data that the users can use within the notebook for analysis but cannot download locally. I can’t have the data on each user’s notebook server. Is it possible to have the data on the hub or some shared directory (without users having access to it) or a storage bucket on the cloud and then access the data through an API within the notebook?
So you want to give users access to some data, without giving them access to that data. What level of security are you looking for? If users can access the data through an API, they can also print it as cell output and then save the notebook with that output locally, or copy and paste the output from their browsers. On one occasion, I printed a base64-encoded tar file from a cell to “download” it.
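For illustration, here is a minimal sketch of that kind of trick, using only the Python standard library. The file name and contents are made up, and `tar_to_base64` / `base64_to_files` are hypothetical helper names, not part of Jupyter.

```python
import base64
import io
import tarfile


def tar_to_base64(files):
    """Pack {name: bytes} into an in-memory .tar.gz and return it as base64 text."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return base64.b64encode(buf.getvalue()).decode("ascii")


def base64_to_files(text):
    """Reverse step, run on a local machine after copying the cell output."""
    buf = io.BytesIO(base64.b64decode(text))
    with tarfile.open(fileobj=buf, mode="r:gz") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar.getmembers()}


# Hypothetical stand-in for a data file that lives on the notebook server.
encoded = tar_to_base64({"sample.csv": b"id,value\n1,10\n2,20\n"})
print(encoded)  # plain text in the cell output, easy to copy from the browser
```

Anything a cell can read, it can also print, so disabling the file download button alone does not stop this.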
I want to give access to the data but only within the notebook environment. It’s fine if they copy a subset of the data. The data license doesn’t allow users to download all of the data locally, and copying the entire data set out of the notebooks is not feasible due to its size.
Users can use the full data set for any analysis in the notebook. Right now I can put the data file on each user’s server, where they can’t download it, but doing that for every user is not efficient. I found out about the shared volume mount, which should be fine for now.
Eventually, I think I’d need to move to a database.
There are usually many ways for you to design an API to allow access to a subset of the data or permission to analyze the data without seeing the details, though it is tricky to do it right. But beware that the Jupyter and JupyterHub environments typically include access to a shell along with all the power of Python, so the users may be able to work around whatever APIs you provide, unless you set things up very carefully.
See, e.g., the discussion and the comment by parente in “How to Disable terminal” (Issue #1195, jupyterhub/jupyterhub).
Are you sure? Assuming users can access all the data and make network connections from their Jupyter code, why couldn’t they stream it out over time?
Details on how to proceed depend a lot on the details of the data and the use/subsetting/analysis you want to enable.
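As a rough illustration of an API that allows analysis without exposing the details, the sketch below returns only per-group aggregates and suppresses groups that are too small. The function name, the field names, and the `MIN_GROUP_SIZE` threshold are all assumptions made up for this example.

```python
from statistics import mean

MIN_GROUP_SIZE = 5  # hypothetical threshold; pick one that fits the license


def aggregate(rows, group_key, value_key, min_group_size=MIN_GROUP_SIZE):
    """Return per-group count and mean, dropping groups below the threshold."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[value_key])
    return {
        key: {"count": len(values), "mean": mean(values)}
        for key, values in groups.items()
        if len(values) >= min_group_size
    }


# Made-up rows standing in for the proprietary data.
rows = [{"region": "north", "sales": v} for v in (10, 20, 30, 40, 50)]
rows += [{"region": "south", "sales": v} for v in (5, 15)]
result = aggregate(rows, "region", "sales")
print(result)  # the "south" group is suppressed because it has only 2 rows
```

This only helps if the function runs in a separate service that the notebook calls over the network, so the raw rows never reach the user's process. If it is simply imported into the notebook, the user can read the underlying data directly.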
I understand that better now. Thanks for the links. It’s much trickier than I had initially thought.
And it is true that, with network access, users could stream the data out over time. In that case, a license restriction on the data would be the way to go.