JupyterHub for 5000+ Users


#1

Hi everyone!

I am part of a team that is creating multiple Computer Vision courses. We have a major upgrade in mind which will increase the total number of students enrolled in our course from 500 to 5000+. We have already started moving our course platform to Open edX. Now we need something that gives students hands-on experience without the hassle of installing libraries (though we do provide support for that as well). Since I have already created Docker images for that, I was thinking of going with JupyterHub. What are your opinions on this? Will Z2JH be able to handle this many users without breaking? Another thing to note is that our code is primarily in C++ and Python. I was able to set up a TLJH to test the kernels and installation, and it works really smoothly. But it is only good for <100 users.

I have tried to set up Z2JH in the past on Google Cloud, but being more of an AWS guy, I find GCloud a bit strange. Will AWS actually be a good choice for Z2JH, or should I just stick with GCloud?

Finally, this might be a weird question, but I will still ask it. We are a team of just 3 members who have to handle the content, the forum, the management and, if we go for Z2JH, that as well. Mostly it will be just me handling the Z2JH part. Is it a one-man job? How likely is it for Z2JH to break or hit a critical issue that would require hiring a dedicated team to set up and maintain the hub?

Thanks in advance

Vishwesh


#2

@yuvipanda set up a JupyterHub for data8x that could handle this number of people. I think it was a non-trivial amount of work, but technically it was doable (though I think with a non-standard setup).


#3

Thanks @choldgraf
I am going to try to set up Z2JH and see if it works :smiley:


#4

I’d definitely go with Zero2JupyterHub. One thing to keep in mind is that 5000 students sounds like a lot, but when 5000 people sign up for an online course, it seems only a small fraction of them ever use the notebook server concurrently. However, I don’t know what the factor is. 10? 100?
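To put rough numbers on that factor, here is a back-of-the-envelope sketch. The 1 GB of memory per user and 26 GB of usable memory per node are purely hypothetical figures chosen for illustration, not measurements from any real course.

```python
import math


def peak_concurrent(enrolled, factor):
    """Estimate peak concurrent users as enrolled / factor, rounded up."""
    return math.ceil(enrolled / factor)


def nodes_needed(concurrent, mem_per_user_gb, node_mem_gb):
    """Number of nodes needed if memory is the limiting resource."""
    users_per_node = int(node_mem_gb // mem_per_user_gb)
    if users_per_node < 1:
        raise ValueError("a single user does not fit on one node")
    return math.ceil(concurrent / users_per_node)


# Hypothetical sizing: 1 GB guaranteed per user, 26 GB usable per node.
for factor in (10, 100):
    users = peak_concurrent(5000, factor)
    nodes = nodes_needed(users, mem_per_user_gb=1.0, node_mem_gb=26.0)
    print(f"factor {factor}: ~{users} concurrent users, ~{nodes} nodes")
```

With these assumed numbers, the difference between a factor of 10 and a factor of 100 is roughly 20 nodes versus 2, so it is worth measuring the real factor early on.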

Maybe someone who has run MOOCs could chime in with some experience.

One thing to make sure of is that you turn off or limit the outgoing network connectivity students have. To still get material into the students’ home directories, you can use nbgitpuller with a “git proxy” installed in the cluster.
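As a minimal sketch of the nbgitpuller side, the link students click can be generated like this. The hub hostname, the in-cluster git proxy address and the repository name are all hypothetical; the `git-pull` path and the `repo`, `branch` and `urlpath` query parameters are how nbgitpuller links are built as far as I know, so check them against the nbgitpuller docs for the version you install.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def nbgitpuller_link(hub_url, repo, branch, urlpath):
    """Build a link that pulls `repo` into the user's home and opens `urlpath`."""
    query = urlencode({"repo": repo, "branch": branch, "urlpath": urlpath})
    return f"{hub_url.rstrip('/')}/hub/user-redirect/git-pull?{query}"


# Hypothetical names: the hub URL and the git proxy service are made up.
link = nbgitpuller_link(
    hub_url="https://hub.example.org/",
    repo="git://git-proxy.example.svc/course-material",
    branch="master",
    urlpath="tree/course-material/",
)
print(link)
```

Because the `repo` parameter points at the in-cluster proxy rather than the public repository, the student pods never need to reach the outside network.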


#5

Hi Tim!

I am planning to set up Z2JH today and see how it works. 5000 users will be the total count. The number of concurrent users should not be more than 100.

How can I turn off the connectivity, and how do I use the git proxy? Is there any material I can check out?

Thanks


#6

To limit network access, network policies are the tool to use. There are a few examples of them being used in https://github.com/jupyterhub/mybinder.org-deploy/tree/master/mybinder/templates. This is the Helm chart that runs mybinder.org, so it is a bit more complex than just a Z2JH chart. It might be the best place to start if you can’t find anything in the Z2JH guide.

I think @yuvipanda had a git proxy set up for an edX deployment. I’d have to search a bit to find the code. IIRC the idea was to run the git CLI in server mode, serving a repository from a directory that was also being updated every N minutes by a git clone <actualrepolinkhere> executed in a while loop.
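A rough Python sketch of that idea, not the actual edX deployment code: it starts `git daemon` to serve a directory and refreshes the checkout in a loop. Note one deliberate swap: it clones only the first time and uses `git pull --ff-only` afterwards, instead of re-running `git clone`. The function names, the directory layout and the 5 minute interval are assumptions for illustration.

```python
import subprocess
import time
from pathlib import Path


def sync_command(repo_url, checkout):
    """Return the git command that brings `checkout` up to date."""
    if (checkout / ".git").exists():
        # Already cloned: fast-forward to the latest upstream commit.
        return ["git", "-C", str(checkout), "pull", "--ff-only"]
    return ["git", "clone", repo_url, str(checkout)]


def daemon_command(base):
    """Return the command that serves every repository under `base` read-only."""
    return ["git", "daemon", "--reuseaddr", "--export-all",
            f"--base-path={base}", str(base)]


def run(repo_url, base, name, interval_s=300):
    """Serve `base` over the git protocol and resync `base/name` forever."""
    server = subprocess.Popen(daemon_command(base))
    try:
        while True:
            # check=False: a failed sync is simply retried on the next round.
            subprocess.run(sync_command(repo_url, base / name), check=False)
            time.sleep(interval_s)
    finally:
        server.terminate()


# Example call (runs until interrupted):
# run("<actualrepolinkhere>", Path("/srv/git"), "course-material")
```

Only the pod running this loop needs outgoing network access; student pods talk to it over the cluster network.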

Edit: the Z2JH guide has a section on network policies: https://zero-to-jupyterhub.readthedocs.io/en/latest/security.html?highlight=policy#kubernetes-network-policies
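To give an idea of what such a policy looks like, here is a hedged sketch that builds a Kubernetes NetworkPolicy manifest as a Python dict and prints it as JSON (which `kubectl apply -f` also accepts). The pod labels are my assumption about how the Z2JH chart labels user pods, and the `10.0.0.0/8` range standing in for the cluster-internal network is hypothetical; verify both against your own cluster. Policies are only enforced if the cluster's network plugin supports them.

```python
import json


def egress_policy(name, pod_labels, allowed_cidrs, allow_dns=True):
    """Build a NetworkPolicy that restricts outgoing traffic of matching pods."""
    rules = []
    if allowed_cidrs:
        # Allow traffic only to these IP ranges (e.g. the in-cluster git proxy).
        rules.append({"to": [{"ipBlock": {"cidr": c}} for c in allowed_cidrs]})
    if allow_dns:
        # A rule with ports but no "to" allows those ports to any destination.
        rules.append({"ports": [{"protocol": "UDP", "port": 53},
                                {"protocol": "TCP", "port": 53}]})
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name},
        "spec": {
            "podSelector": {"matchLabels": pod_labels},
            "policyTypes": ["Egress"],
            # An empty list here means all egress is denied.
            "egress": rules,
        },
    }


# Assumed labels and CIDR; adjust to what your deployment actually uses.
policy = egress_policy(
    name="singleuser-egress",
    pod_labels={"app": "jupyterhub", "component": "singleuser-server"},
    allowed_cidrs=["10.0.0.0/8"],
)
print(json.dumps(policy, indent=2))
```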


#7

https://github.com/berkeley-dsep-infra/data8xhub/blob/85d6d65b9ab862ffbac3728e542d2b359bbb0898/hub/templates/reposync/deployment.yaml is the “git server sync” I was talking about.