TL;DR: I destroyed the claim-username PVCs on a prod cluster deployed with current z2jh, and have a mob of nervous scientists scared they lost their data… how do I go about rebinding a hundred disk-HEXHEX GCP VM disks to the appropriate JupyterHub users?
In the process of finally converting our last cluster (prod, naturally, almost half a decade old and filled with custom hacks) to be fully Terraform-created, I accidentally destroyed our claim-username and pvc-HEXHEX YAML objects, and their snapshots.
I still have the data in GCP VM Disks, each user having a separate gcp disk for their homedir.
As far as I can tell, kubespawner creates the claim-username PVC objects, and then the pvc-HEXHEX PersistentVolumes are dynamically created by GKE (or any k8s storage provisioner), binding each claim to a GCP VM disk.
I can modify the claim-username objects with a fork of kubespawner or some z2jh hackery, but of course they aren’t where the disk reference itself is stored; the only ref to the disk lives in the pvc-HEXHEX objects created in fulfilling the claim.
Of course, the disks carry their metadata labels, including the org.jupyterhub/username field, so it’s not hard to find the right disk for each user… I just can’t figure out how to recreate dynamic claims that point at existing disks.
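For anyone following along, this is roughly how I’m matching disks to users. A sketch, assuming gcloud is pointed at the right project; the exact field the username lands in (labels vs. the description GKE’s provisioner writes) may differ per setup, and `claim-someuser` is a placeholder:

```shell
# List every disk with its description; on GKE the dynamic provisioner
# typically records the PVC it was created for (e.g. "claim-<username>")
# on the disk, so the right disk per user is greppable.
gcloud compute disks list \
  --format="table(name, zone.basename(), description)"

# Or narrow down to a single user's disk:
gcloud compute disks list \
  --filter="description ~ claim-someuser" \
  --format="value(name)"
```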
Any help, or even a “this is where I might start looking,” is appreciated even if this isn’t your expertise. I haven’t found as much time to sleep as I’d like; I’m so happy the prod cluster is back, but I’m too dopey right now to figure out new, complex k8s things. This has been a rough weekend, with 3 days of downtime on our prod juphub cluster (our longest in a few years of kubespawner + z2jh) at a pretty bad moment.
Because this is a prod emergency, I’ve decided to be annoying and dual-post here and on Gitter. Appreciate your tolerance. If you do have any ideas, or are just k8s-knowledgeable and would be willing to pair with me live for 30 minutes on Gitter, I would really appreciate another pair of eyes at this moment.
Hey! First, hugops! This sounds like a stressful situation, and I hope you are able to get past this soon. Terraform destroying things it shouldn’t is No. 1 in my nightmare scenarios.
Here’s how you can go about recreating the correct PVCs:
Create a PV for each of the disks. These should look something like:
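A sketch of what I mean; the disk name, size, namespace, and storageClassName here are placeholders you’d fill in per user from your own cluster:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-someuser-home
spec:
  capacity:
    storage: 10Gi               # should match what the old claim requested
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain   # so deleting a claim can't take the disk with it
  storageClassName: standard    # must match the storage class the claims use
  gcePersistentDisk:
    pdName: gke-prod-pvc-abc123 # the existing GCE disk's name
    fsType: ext4
  claimRef:                     # pre-bind to the claim kubespawner expects
    namespace: jhub
    name: claim-someuser
```

The claimRef pre-binds the volume, so when a claim-someuser PVC with a matching storageClassName and size shows up again, Kubernetes binds it to this PV instead of provisioning a fresh disk.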
I think this should do it. The Persistent Volumes page in the Kubernetes docs should give you some more information on this process. By default, z2jh uses Kubernetes’ dynamic provisioning feature to auto-create the PVs; in this case, you’re creating them manually.
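If you’d rather not wait for kubespawner to recreate the claims, you can also create the claim side by hand and bind it explicitly. Again a sketch; the namespace, size, storageClassName, and the PV name in volumeName are hypothetical and need to match whatever you actually created:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim-someuser
  namespace: jhub
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard
  volumeName: pv-someuser-home  # bind directly to the hand-made PV
  resources:
    requests:
      storage: 10Gi
```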
I really appreciate your response, I’m doing a hu-man dinner hour with my wife, but I will read closely after!
Our last serious prod outage was unrelated to z2jh but totally related to Terraform: it took out our single-S3-bucket-backed, high-bandwidth, scales-to-many-TB “hot dir sync” service for z2jh, aka hotflights (which, btw, if other folks might benefit from it, I think would be fine to share on GH). Another terraform apply --make-me-bleed. The risk with cannons is very real in my hide’s experience… but they sure are nice to have in place when you get sneak-attacked by some other part of your stack.
What are your thoughts on using a shared file server for <20 high-IO-bandwidth users (models, training) and <100 relatively low-IO-bandwidth users? I was thinking this might be the moment to switch to my preferred architecture, an NFS export from a high-speed GCP Filestore, and just get out of pre-guessing my users’ disk-size needs, etc. We don’t particularly need permissions, being able to copy things between users’ homedirs would aid us in debugging, and many of our users would use it while collaborating with one another.
This experience almost has me feeling like tiny-fixed-size-dynamic-disk-mounts is an antipattern at moderate scale, and only makes sense at very low scale (no setup required) and very high scale (no shared IO bottlenecks, greater HA; though I believe GCP Filestore has a reliable HA option too).
I wonder if there’s a future where it’s a single bit flip in z2jh to allocate an NFS server as part of the cluster, for the cases between “getting started” and “HA at scale” that might typify a lot of the heavily adminned installs on here… Would be very curious for your thoughts, @yuvipanda. I’ve wanted to make a significant feature contribution back to z2jh to thank y’all for all the lifting you’ve done, and this is an area I’m relatively experienced in.
This is exactly right, and I pretty much switched to using a shared home directory space for all my clusters a few years ago. I do recommend that pattern, and as you said, maybe now is the time to switch.
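In z2jh config terms, the switch looks roughly like this. A sketch only: `home-nfs` and the subPath layout are placeholders for whatever PVC you create on top of your NFS share:

```yaml
singleuser:
  storage:
    type: static              # stop dynamically provisioning per-user PVCs
    static:
      pvcName: home-nfs       # a PVC you create, backed by the shared NFS volume
      subPath: "home/{username}"   # each user gets a directory on the share
```

Here `home-nfs` would be a manually created PVC bound to a PV whose `nfs:` section points at your file server’s export.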
Any chance you could share one of your setups’ configs, or link me to a tutorial or example on GitHub you’d recommend following? I have a window here, with the memory of this event fresh, to make storage changes that would be particularly unwelcome later in the coming year… I’d like to get a KISS-but-future-smart storage layer in place as a consolation prize through this process.
Thanks so much for your help, @yuvipanda! Following your pattern above, I was able to switch us, amidst the crisis feeling, to a fully terraformed shared-homedir setup backed by an NFS-exporting GCP Filestore, and with my teammates’ help we got everyone their data back.
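For the record, the Filestore side of this is small in Terraform. A minimal sketch, assuming the basic SSD tier; the names, zone, capacity, and network are from my setup and will differ for yours:

```hcl
resource "google_filestore_instance" "homes" {
  name     = "jhub-homes"
  location = "us-central1-a"
  tier     = "BASIC_SSD"   # 2560 GB is the minimum for this tier

  file_shares {
    name        = "homes"
    capacity_gb = 2560
  }

  networks {
    network = "default"
    modes   = ["MODE_IPV4"]
  }
}
```

The instance’s IP then goes into the `nfs:` section of the PV backing the shared-home PVC.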
Still getting a few auxiliary services back in operation, but at least we have core juphub + GPUs going again on all clusters!
Awesome, @Seth_Nickell! Would it be possible to post any scripts or other commands that were helpful to you in the process? You’d definitely not be the last person to accidentally destroy their cluster…