Migrating user data between Hubs

When I first created my JupyterHub (following the z2jh tutorial), I put it in its own project on Google Cloud. I did this because I didn’t know what I was doing and didn’t want to break other systems while experimenting.

I would now like to migrate the original Hub (i.e. all user information and data) to a new Hub in a different Google Cloud project in order to allow better integration with other systems.

What are my options for this, please? On the original Hub, when users logged in for the first time, they were allocated 10 GB of “personal” storage (as a PVC) and then reconnected to it whenever they logged in again. I can create a new Hub in the new project on GCP, but when my users log in they’ll be assigned new, empty PVCs.

I suppose I could ask each user to log in to both Hubs, zip their files and transfer their data manually, but I’m hoping there’s a better way. For example, is there a way to transfer the old user database to the new Hub and create PVCs etc. with the correct metadata, so that I can transfer their data for them (and still have them correctly identified when they try to log in)?

I guess others must have tackled this, so any advice regarding the workflow or things to watch out for would be appreciated.

Thanks!


In theory you can manually create a PVC for each user, which should lead to a PV being dynamically created. You can then copy the data across. As long as each PVC matches what Z2JH expects, it should work.

Coincidentally someone recently posted about a data recovery situation:

It’s not quite the same, since there the underlying volumes already existed and they were trying to recreate the metadata, but the principles are similar.

There are also tools such as

though I don’t have any experience with it.


Thanks @manics, that’s very helpful!

So, if I’ve understood @yuvipanda’s post on the linked thread correctly, I can create PVCs on the new Hub for each existing user using something like:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim-<escaped-username>
  namespace: <your-namespace>
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: <size>
  storageClassName: standard
  volumeMode: Filesystem

and rely on Kubernetes’ “dynamic provisioning” to create the associated PVs automatically. Then, after I’ve transferred the data, existing users should be able to log in and be correctly assigned to their new PVCs?
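
If that’s right, I could presumably script the PVC creation for all existing users rather than applying the YAML by hand, something like the rough sketch below (it uses the kubernetes Python client; the namespace, size and usernames are placeholders I’d fill in from the old cluster, so treat it as untested):

# Sketch: create a claim-<escaped-username> PVC for each user on the new cluster.
# Assumes `pip install kubernetes` and a kubeconfig pointing at the new cluster.
from kubernetes import client, config

NAMESPACE = "<your-namespace>"   # same placeholders as the YAML template above
STORAGE_CLASS = "standard"
SIZE = "10Gi"

# Escaped usernames taken from the old cluster (e.g. from `kubectl get pvc`,
# with the leading "claim-" stripped off); these two are made-up examples.
escaped_usernames = ["alice", "bob-40example-2ecom"]

config.load_kube_config()
core = client.CoreV1Api()

for name in escaped_usernames:
    pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name=f"claim-{name}", namespace=NAMESPACE),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteOnce"],
            storage_class_name=STORAGE_CLASS,
            resources=client.V1ResourceRequirements(requests={"storage": SIZE}),
        ),
    )
    core.create_namespaced_persistent_volume_claim(namespace=NAMESPACE, body=pvc)
    print(f"created claim-{name}")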

Have I understood correctly that, when a user logs in, the Hub “just” looks for a PVC in the same namespace called claim-<escaped-username>, i.e. that’s the only metadata I need to preserve when I create the new PVCs? And I don’t need to worry about copying the old user database etc.? If so, that seems much easier than I was expecting, which would be great!

Thanks again for the reply 🙂

Something like that!

For minimal risk I’d probably try it this way:

  1. Deploy your new Z2JH
  2. Log in as yourself; this will cause Z2JH to generate a new PVC/PV in the usual way
  3. Compare the generated PVC YAML with your above template (see the sketch below this list)
  4. Create a PVC using the template for a trusted user who can test things for you
  5. Copy that user’s data across
  6. Ask that user to log in
  7. Check they can see their data!
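
For step 3, something like this will dump the PVC that Z2JH generated for your own login so you can diff it against your template (a quick sketch with the kubernetes Python client; kubectl get pvc <name> -o yaml gives you the same information):

# Sketch for step 3: fetch the auto-generated PVC and print it as YAML
# so it can be compared field by field with the manual template.
import yaml
from kubernetes import client, config

NAMESPACE = "<your-namespace>"
MY_CLAIM = "claim-<your-escaped-username>"  # the PVC created when you logged in

config.load_kube_config()
core = client.CoreV1Api()

pvc = core.read_namespaced_persistent_volume_claim(name=MY_CLAIM, namespace=NAMESPACE)
# sanitize_for_serialization turns the typed object back into plain dicts/strings
print(yaml.safe_dump(client.ApiClient().sanitize_for_serialization(pvc)))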

That sounds like an excellent suggestion - thanks @manics!

The original question doesn’t mention the hub.db data, so I’m wondering: if I use SQLite as my database and want to move the Hub application from one cluster to another in a different environment (for example, from a local k8s cluster to GKE), how should I migrate my database to the new cluster?

By default, a new PVC and PV will be created dynamically when the new application is installed. Can I set the hub.db parameters to bind the database to an existing PVC and PV, or can I only move the data into the new PV after it has been created?

I’m sure there are several ways to do this.

We followed the steps outlined by @manics above and everything went smoothly. In other words, we didn’t migrate the old Hub database at all - just transferred the persistent user data to the new cluster and allowed the new Hub to build itself a new database as users logged in.

The most fiddly part for us was creating new PVCs with the correct names. This is because our old Hub used GitHub OAuth for authentication, whereas the new one uses Azure AD. We therefore needed to figure out a mapping between the “sanitised” GitHub user names from the old Hub and the “sanitised” Azure user names on the new one (because otherwise users would be assigned a new, empty PVC at first sign in, rather than being linked to their old data). This was actually pretty easy - it just took a bit of experimentation and there were a few “gotchas” where users with unusual names/e-mail addresses were not correctly assigned first time.

We only have ~100 users on our Hub, so I just created a CSV mapping old PVC names to new ones and we wrote a script to migrate the user data.
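
In case it helps anyone doing the same, the “sanitised” names come from KubeSpawner, which (at least in the version we were running) builds the PVC name from a claim-{username}-style template and escapes the username with the escapism library. Below is a rough sketch of one way to generate the old-to-new mapping; usernames.csv is a made-up layout with one old_username,new_username pair per row, and you should verify the escaping against a real PVC name from your own cluster before trusting it:

# Sketch: work out old -> new PVC names from a CSV of authenticator usernames.
# KubeSpawner escapes usernames roughly like this; verify against a real PVC
# name from your cluster before relying on it.
import csv
import string
from escapism import escape  # pip install escapism

SAFE_CHARS = set(string.ascii_lowercase + string.digits)

def pvc_name(username: str) -> str:
    """Approximate the claim-<escaped-username> name Z2JH/KubeSpawner uses."""
    return "claim-" + escape(username, safe=SAFE_CHARS, escape_char="-").lower()

# usernames.csv (hypothetical layout): old_username,new_username per row,
# e.g. the GitHub login in the first column and the Azure AD name in the second.
with open("usernames.csv", newline="") as f:
    for old_user, new_user in csv.reader(f):
        print(f"{pvc_name(old_user)} -> {pvc_name(new_user)}")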

Good luck!


@JES Thanks for the quick reply.
Have you run into authentication-management issues for team work? In our case, we would like the collaboration features mentioned here, so I think we need to keep the data in the groups, roles and a few other mapping tables.

Sorry, I don’t have any experience with the real-time collaboration features, although they look interesting! But, yes, it sounds like that may make things a bit more complicated in terms of user groups, roles, etc.

That’s OK. We will use MySQL for production. I’m just curious whether the Helm chart offers a way to bind the database to an existing PVC.