My JupyterHub cluster started floundering, and in an attempt to upgrade Kubernetes and increase the number and size of master nodes, the cluster got… Severely Borked (etcd can't mount its volume on 2 of 3 masters, they have different PKI keys than the working master, which is only working because I ssh'd into it and changed its API server manifest; it's a mess)
I feel like the only option now is to take off and nuke the whole site from orbit, but the last time I had to do that it clobbered the (AWS EBS) user storage volumes. Is there a way to save those, perhaps by starting a new cluster and moving them over, or some such? I'd hate for people to lose their work… And as much as I love Kubernetes, I suspect the ability to recreate a cluster without losing user storage would be a really valuable skill…
I'm sure this is possible, but I'm not sure how manual the steps have to be. The main thing is that this is a generic Kubernetes-on-AWS question, not specific to JupyterHub, so if you're Googling for ideas, probably omit anything Jupyter-related. I have more experience with GKE, where I've done things like create snapshots of volumes before taking actions that might destroy them, so that I can restore the data later, even if that means mounting the snapshot and a new empty volume on a new node and copying files across with rsync.
OK, from a generic AWS standpoint: you can snapshot the currently attached EBS volumes, and you can restore those snapshots to new volumes later.
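If it helps, here's a rough sketch of both halves with boto3. This is untested and the filter is deliberately broad (every in-use volume); the AZ and volume type are placeholders, so adjust everything to your setup:

```python
# Sketch: snapshot attached EBS volumes, then restore one to a new volume.
# Assumes boto3 credentials/region are already configured.
import boto3

ec2 = boto3.client("ec2")

# Snapshot every in-use EBS volume (broad on purpose; scope by tag in practice).
snapshot_ids = []
paginator = ec2.get_paginator("describe_volumes")
for page in paginator.paginate(Filters=[{"Name": "status", "Values": ["in-use"]}]):
    for vol in page["Volumes"]:
        snap = ec2.create_snapshot(
            VolumeId=vol["VolumeId"],
            Description=f"pre-rebuild backup of {vol['VolumeId']}",
        )
        snapshot_ids.append(snap["SnapshotId"])
        print(vol["VolumeId"], "->", snap["SnapshotId"])

# Later: restore a snapshot into whichever AZ the new node lives in.
new_vol = ec2.create_volume(
    SnapshotId=snapshot_ids[0],
    AvailabilityZone="us-east-1a",  # placeholder; must match where it will attach
    VolumeType="gp2",
)
print("restored as", new_vol["VolumeId"])
```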
As EBS volumes are AZ-specific, I don't think there is a straightforward way to remount the new volumes onto a Kubernetes JupyterHub cluster: when new containers are provisioned they will be spread across AZs, so the new single-user container might end up in a different AZ than the previous one. I would therefore suggest you mount each volume and rsync the data into an S3 bucket, creating a separate folder for each user. You will then need a sync script that runs as new containers are created, syncing the data back down to the new volume.
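For the upload half, something along these lines (the mount point, bucket name, and username are all placeholders); the sync-back script would do the reverse with `download_file`, or you could just shell out to `aws s3 sync` in both directions:

```python
# Sketch: copy a mounted user volume into a per-user S3 prefix.
import os
import boto3

s3 = boto3.client("s3")
BUCKET = "my-hub-backup"  # hypothetical bucket name

def upload_user_dir(mount_point: str, username: str) -> None:
    """Walk the mounted volume and copy every file under s3://BUCKET/<username>/."""
    for root, _dirs, files in os.walk(mount_point):
        for name in files:
            local_path = os.path.join(root, name)
            key = f"{username}/{os.path.relpath(local_path, mount_point)}"
            s3.upload_file(local_path, BUCKET, key)

upload_user_dir("/mnt/claim-alice", "alice")  # placeholder mount point and user
```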
Lots of this work can be done with Python scripts and boto, but you may need to do some tagging of the resources to ensure you only operate on the JupyterHub volumes.
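For example (the tag key/value and volume ID here are made up; use whatever labels your provisioner actually puts on the volumes):

```python
# Sketch: tag the relevant volumes once, then scope later scripts by that tag.
import boto3

ec2 = boto3.client("ec2")

# Tag the volumes you care about...
ec2.create_tags(
    Resources=["vol-0123456789abcdef0"],  # hypothetical volume ID
    Tags=[{"Key": "app", "Value": "jupyterhub"}],
)

# ...then every later script can select only those volumes.
resp = ec2.describe_volumes(
    Filters=[{"Name": "tag:app", "Values": ["jupyterhub"]}]
)
for vol in resp["Volumes"]:
    print(vol["VolumeId"])
```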
A quick question about your current cluster: how do you currently get around volumes being tied to an AZ? Do you simply over-provision nodes in your cluster?
It may be worth thinking about EFS instead of EBS volumes, but that depends entirely on your use case.