I’ve been toying around with setting up a JupyterHub deployment for a while but many of the technologies involved are not super familiar to me. I’m perfectly comfortable with my local jupyter lab setup but the complexity of JupyterHub seems on another level. I thought I might describe what I’d like to be able to do and see if people thought JupyterHub was the appropriate solution or if it was overkill.
Currently, I run Jupyter Lab locally from my workstation, but I sometimes run into situations where I’d like more power. I’ve considered migrating my setup to an EC2 instance, but would really like to be able to spin up instances on demand when the need arises and have them shut down once I close that particular project. I could stage a number of EC2 instances of different sizes, but that seems awkward. Also, continuously running a Kubernetes cluster on EC2 or using EKS seems like overkill. It would be more justifiable if I got more people on my team using it, but I’m not at that point yet.
I think for my own use case the ideal would be to run JupyterLab from my workstation or a single EC2 instance in such a way that I could use remote kernels that launch different-sized EC2 instances, much like the way I can connect different notebooks to different kernels or conda environments. I’ve been intrigued by this description of how Harry’s approached the problem, as well as the cloudJHub implementation at Harvard, but again those are both pushing me out of my comfort zone a bit.
I’m happy to try to learn more about some of these technologies, but have trouble prioritizing which ones. Any advice or suggestions? Thanks!
Hi @sterlinm,
I think the hard part is the scalability you require. It is not easy, and might be something where Jupyter itself will not be the right answer. Do you need more CPU? Or memory? Are you trying to parallelize your code?
If so (and you use Python), you may want to look at dask, which itself should be able to scale.
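As a minimal sketch of the kind of workflow dask enables (the file path and column names here are made up):

```python
import dask.dataframe as dd

# Dask splits the files into partitions and builds a lazy task graph,
# so the full dataset never has to fit in memory at once.
df = dd.read_csv("data/big-dataset-*.csv")  # hypothetical path

# Still lazy: nothing has actually run yet.
means = df.groupby("category")["value"].mean()

# compute() executes the graph, processing partitions in parallel.
print(means.compute())
```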
But you are pretty much starting to hit the current state of the art in scalable computation.
If you are interested in kubernetes and deployments, maybe you want to try contributing to mybinder, at least until you get enough understanding of the kubernetes/jupyterhub integration.
I would love to spawn EC2 (or whatever) instances on a per-kernel basis. That seems to be the goal of enterprise gateway, and for it to work well with slow-starting kernels (e.g. starting an EC2 instance), we need to make all the internal APIs async. They are blocking right now, which leads to a number of problems.
Does some of this make sense? Sorry not to have an out-of-the-box answer.
Thanks @carreau,
I appreciate the answer. I’ve looked a bit into dask as well and am at a similar stage where I’m really interested in it but don’t feel I’ve got a handle on it yet. I’ve spent some time playing with it locally and occasionally run into issues where some step (probably a shuffle) ends up being surprisingly expensive, to the point that I have to force quit the process. I think the issue is my lack of understanding, but that’s prevented me from experimenting with setting up a dask cluster.
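For anyone else who hits this, one thing I’ve been meaning to try is the distributed scheduler with explicit per-worker memory limits, so workers spill to disk instead of eating all my RAM. A rough sketch (the worker counts and limits are just placeholders):

```python
from dask.distributed import Client, LocalCluster

# Cap each worker's memory so expensive steps (like shuffles) spill
# intermediate results to disk instead of taking down the machine.
cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit="4GB")
client = Client(cluster)

print(client.dashboard_link)  # live view of tasks, memory, and shuffles
```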
Usually the issue I’m wrestling with is memory, although I’d like to get more into parallelizing my code. To be honest, I don’t need the scale most of the time, but there are times (maybe 10-20%) when it would be helpful. That’s why I’ve been wanting to find some sort of on-demand EC2 setup that meshes relatively seamlessly with my development environment. It does seem like enterprise gateway is getting at what I want, but it might be too bleeding edge for me.
As an alternative to JupyterHub, I’ve wondered if I just need to customize a cloud setup that makes it easy to switch instances. Using AWS as an example, if I had my data and development configuration on a drive (EFS? S3? not sure) that was easy to move from one instance to another, then I could maybe have some custom scripts that make it easy to switch between instance sizes. It’s very possible this is already a service from one of the big cloud providers and I’m just not aware of it.
I think Google Colab has a UI toggle that turns on GPU access. That’s along the lines of what I’m imagining. While I’d love it if there were some super simple out-of-the-box answer, that would reflect very poorly on my googling skills. Thanks again for the information!
If you’re ok with stopping the machine, it’s not that hard in EC2. If you store your data on EBS or S3, all you have to do is stop the machine. Then, with the instance selected, you can change the instance type to make it bigger (the option is greyed out while an instance is running, but once you stop it, it becomes available).
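Scripted, that whole flow is just a few boto3 calls. A rough sketch (the instance ID and target type are placeholders):

```python
import boto3

ec2 = boto3.client("ec2")
instance_id = "i-0123456789abcdef0"  # placeholder

# Stop the instance (EBS volumes persist across stop/start).
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

# Change the instance type while it's stopped.
ec2.modify_instance_attribute(
    InstanceId=instance_id,
    InstanceType={"Value": "r5.2xlarge"},  # placeholder size
)

# Start it back up on the bigger hardware.
ec2.start_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```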
The downside, of course, is that you lose your kernel state, and you’ll have to restart Jupyter once you reconnect. I don’t know of a better way than that.
I also run a paid cloud-hosted Jupyter service where you can do something similar, if that is of interest to you. (You can find the link in my profile or message me; I don’t want to advertise in this forum.)
Thanks! That may be good enough for my needs. Much appreciated!