Infrastructure Advice for JupyterHub, Dask, and Airflow

Hi,

I’m building out the infrastructure for a new data science team and plan to use JupyterLab/JupyterHub, Dask, and Airflow. It seems like all of these can make use of a Kubernetes cluster (with which I have very little experience). I don’t want to end up spending all of my time managing a cluster, but I’m willing to learn enough to get something up and running if it would make sense.

My dream would be to have a Kubernetes cluster running on AWS (my company’s choice, so I’m not really considering competitors), set up in such a way that JupyterHub, Dask, and Airflow can scale up and down either automatically or with minimal tweaking from users. I’d also like to design it all using Terraform (which I’ve toyed around with) and Helm (which I don’t know much about).
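For the Dask piece in particular, what I’m picturing is that users could spin up an adaptive cluster straight from a notebook with something like this (a rough sketch on my part, assuming the dask-kubernetes package and its KubeCluster API; the image and resource numbers are just placeholders):

```python
# Rough sketch: an adaptive Dask cluster launched from a notebook pod.
# Assumes dask-kubernetes is installed and the pod's service account is
# allowed to create worker pods in its namespace.
from dask.distributed import Client
from dask_kubernetes import KubeCluster, make_pod_spec

pod_spec = make_pod_spec(
    image="daskdev/dask:latest",  # placeholder worker image
    memory_limit="4G",
    memory_request="4G",
    cpu_limit=1,
    cpu_request=1,
)

cluster = KubeCluster(pod_spec)
cluster.adapt(minimum=0, maximum=10)  # workers come and go with the workload

client = Client(cluster)
# ...run Dask computations; paired with the cluster autoscaler, nodes
# should scale up and down without anyone touching kubectl.
```

If that kind of thing is realistic without a dedicated cluster admin, that’s the level of hands-off I’m hoping for.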

Does this seem like a reasonable goal or a pipe dream without a dedicated person to manage it?

Thanks!

JupyterHub has a Kubernetes deployment guide for AWS, using either EKS or kops:


Have you seen this: https://github.com/michaelchanwahyan/datalab ?

From the repo blurb: “datalab is a JupyerLab-based open-source platform for scientific research for people of size mild scale team where sharing of code and dataset is allowed”.

It may provide some pointers? (tho it’s lacking in docs!)

(I haven’t had a chance to try it myself, unfortunately… If it does look relevant, and you give it a try, I’d be keen to hear how you get on…)

Perhaps more useful are some rather more developed tools that make use of Airflow:

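On the Airflow side of the original question, the usual pattern on Kubernetes is to run each task in its own short-lived pod, so capacity is handled by the cluster autoscaler rather than by hand. A rough sketch of what a DAG might look like (assuming the KubernetesPodOperator from Airflow’s Kubernetes provider; the namespace, image, and names are placeholders, and the import path varies a bit between Airflow versions):

```python
# Rough sketch: an Airflow DAG whose task runs in a throwaway Kubernetes pod.
# Import path shown is the Airflow 2.x cncf.kubernetes provider; 1.10 used
# airflow.contrib.operators.kubernetes_pod_operator instead.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="example_k8s_task",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    transform = KubernetesPodOperator(
        task_id="transform",
        name="transform",
        namespace="airflow",        # placeholder namespace
        image="python:3.9-slim",    # placeholder task image
        cmds=["python", "-c"],
        arguments=["print('hello from a pod')"],
        get_logs=True,
        is_delete_operator_pod=True,  # clean the pod up when the task finishes
    )
```

The nice part is that the scheduler itself stays small; the heavy lifting only exists while tasks are actually running.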

I’ve seen those and have been debating whether it’s worth learning Kubernetes. If knowing Kubernetes will help with JupyterHub, Dask, and Airflow, I think that would make it worth it.

Thanks! I hadn’t seen any of those, though I do have what feels like a thousand other Jupyter Lab projects that timkpaine has created…

I’ll check them out.