Issues with Google Cloud Platform and The Littlest JupyterHub

Hi all,

I’m running into a problem gaining access to a GCP instance. I’ve got a few issues compounded here, but I think I’ve narrowed down what’s going on.

Relevant context:
–I set this up on behalf of a student team
–I followed this guide to a T back in July (https://the-littlest-jupyterhub.readthedocs.io/en/latest/install/google.html)
–I created a 20 GB Ubuntu 18.04 instance
–I did not have or use a domain name to point to the IP Address before following this guide (https://the-littlest-jupyterhub.readthedocs.io/en/latest/howto/admin/https.html); I set it up with the GCP external IP address itself
–Opening a window with “SSH” under the “Connect” column on the GCP VM instances page hangs and fails to connect me to the VM.

It says this indefinitely:

"Connecting…
Transferring SSH keys to the VM.

The key transfer to project metadata is taking an unusually long time. Transferring instead to instance metadata may be faster, but will transfer the keys only to this VM. If you wish to SSH into other VMs from this VM, you will need to transfer the keys accordingly.

[CLICK HERE] to transfer the key to instance metadata. Note that this setting is persistent and needs to be disabled in the Instance Details page once enabled.

You can drastically improve your key transfer times by migrating to [OS Login.]"


The TLJH that I set up worked until two days ago. It broke when a student uploaded a 2 GB file, likely because the GCP instance had run out of space. This is what the student had to say:

"Don’t know whether this info would help: I tried to upload a 2GB file last night and it got stuck at 63% uploading, so I refreshed the page. It gave me bad gateway at that time. Then I tried to log in again, after I typed in password, it shows a 500: Internal Server Error. This morning, when I go to D-Lab looking for help, I cannot even log in the page, like your state now"

At this point, I intervened trying to fix the issue, but I think I made it worse.
My first reaction was to restart the instance by stopping and starting it in GCP. However, this caused the external IP to change, and because I had set up HTTPS with the (old) IP address, I think it caused “scrambled credentials to be sent.”

At this point, I didn’t realize that this is probably what was happening, and restarted the instance another time. Interestingly, the error changed from “scrambled connection” to “connection refused,” where Chrome tells me there are firewall problems. I did some reading on how to fix this, and found that this error can occur for TLJH notebooks if the instance runs out of space, which is more or less what I suspect happened when the student uploaded the large file. So I increased the capacity of the instance and restarted again, but with no luck. Accessing the external IP would again give me a “connection refused” error.
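For anyone hitting the same wall: I now suspect that enlarging the disk in the GCP console isn’t always enough on its own, because the partition and filesystem inside the VM may not grow automatically. A rough sketch of what I’d try instead, once I had shell access (the instance name, zone, and device names below are placeholders, not my actual setup):

```bash
# Resize the boot disk from the gcloud CLI (name and zone are placeholders).
gcloud compute disks resize tljh-instance --size=50GB --zone=us-central1-a

# Then, from a shell on the VM, grow the partition and filesystem if they
# didn't grow on their own after a reboot:
sudo lsblk                  # confirm the disk and partition layout
sudo growpart /dev/sda 1    # extend partition 1 to fill the enlarged disk
sudo resize2fs /dev/sda1    # extend the ext4 filesystem to fill the partition
df -h /                     # verify the root filesystem now shows the new size
```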

Next, I created a snapshot of the instance in case we might lose any information (unfortunately I do not have a snapshot of the instance from before any of these problems, so this was my last-ditch effort at saving any data). I tried creating a new instance from this snapshot itself with a larger storage space (in case the earlier try just gave me unformatted storage), but this didn’t work.
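For reference, the snapshot can also be taken from the gcloud CLI; a minimal sketch, where the disk name, snapshot name, and zone are placeholders:

```bash
# Snapshot the (possibly broken) boot disk so nothing else is lost.
gcloud compute disks snapshot tljh-disk \
    --snapshot-names=tljh-rescue-snapshot \
    --zone=us-central1-a

# Confirm the snapshot was created.
gcloud compute snapshots list
```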

I then created a new TLJH notebook from scratch, with the hope of loading the snapshot as an additional disk to access its contents. I was able to access the notebook from an HTTP connection and added the snapshot as an additional disk via the GCP. However, navigating through the notebook’s terminal, I was unable to find this disk anywhere to load the data.

I then experimented with giving it HTTPS encryption, and found that enabling Automatic HTTPS with Let’s Encrypt using the external IP alone would also cause “scrambled credentials” to be sent and would lock me out of the instance. What’s interesting is that this seems to be a new problem, because I’ve been able to access the external IP address with encryption directly without a problem in the past (so I need to do more bug testing). But I suspect that I may have lost track and stopped/started the instance, which gave it a new external IP address and thereby replicated the first issue I encountered (note: my issue, not the student’s).
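In hindsight, what I think I should have done (once I had shell access again) was either turn automatic HTTPS off, or point Let’s Encrypt at a real domain name instead of the bare IP. The TLJH HTTPS guide linked above describes the tljh-config commands; roughly (the email and domain below are placeholders):

```bash
# Turn automatic HTTPS off entirely until a real domain is available:
sudo tljh-config set https.enabled false
sudo tljh-config reload proxy

# Or, once a registered domain points at the external IP (placeholders):
sudo tljh-config set https.enabled true
sudo tljh-config set https.letsencrypt.email admin@example.org
sudo tljh-config add-item https.letsencrypt.domains hub.example.org
sudo tljh-config reload proxy
```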

So now, without being able to access the notebook directly, I tried to ssh into the instance to update some configurations, but I had problems with that too. I created an SSH key and added it to the GCP console under “SSH Keys.” I then tried to ssh into the instance using the command ssh account@some.ip.address.here but kept getting “Permission denied (publickey)” as an error.

I’m actually not sure if this is the proper ssh call, so please correct me if I’m wrong, but I’m about 80% sure that I should have been able to ssh into the instance from my terminal at that step, since I had copied my SSH key into the GCP console.
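For anyone else stuck at this step: my understanding is that GCP takes the Linux username from the comment at the end of the public key you paste under “SSH Keys,” and that ssh needs to be told which private key to use. A sketch of what I believe should work (the username, key path, and IP are placeholders):

```bash
# Generate a key whose comment is the Linux username you want on the VM
# ("jupyteradmin" and the key path are placeholders for illustration).
ssh-keygen -t rsa -b 4096 -f ~/.ssh/gcp_tljh -C jupyteradmin

# Paste the contents of ~/.ssh/gcp_tljh.pub into the instance's "SSH Keys"
# section in the GCP console, then connect using that same key and username:
ssh -i ~/.ssh/gcp_tljh jupyteradmin@EXTERNAL_IP_ADDRESS

# Add -v for verbose output if "Permission denied (publickey)" persists:
ssh -v -i ~/.ssh/gcp_tljh jupyteradmin@EXTERNAL_IP_ADDRESS
```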

In any case, I ended up deleting the newly created instances and am back to square one trying to fix the problem.

Currently, my old instance gives me “connection refused” errors and I can’t ssh into it (I’ve also tried adding my SSH key to this instance). I’m pretty sure there are configuration problems in the instance, so SSH seems like a necessary step, but I can’t figure out how to get access.

This issue has been resolved. For future reference, here is what I learned, for anyone who finds themselves in a similar situation:

There are a couple of approaches you can take to solve the problem. The easiest is to load the faulty instance’s disk as an additional disk on an entirely new instance. The way I’m familiar with is to take a snapshot and attach that as an additional disk, although there also seems to be a way to detach the failed instance’s disk and attach it to the new instance directly. It might take some effort to find your files on the “additional disk,” but they should be there.
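If you prefer the command line, this is roughly the sequence for the snapshot route (the disk, snapshot, and instance names and the zone are placeholders); the disk then still has to be mounted by hand, as described in the follow-up post below:

```bash
# Re-create a disk from the rescue snapshot (all names and the zone are placeholders).
gcloud compute disks create tljh-rescue-disk \
    --source-snapshot=tljh-rescue-snapshot \
    --zone=us-central1-a

# Attach it to the fresh instance as a secondary (non-boot) disk.
gcloud compute instances attach-disk new-tljh-instance \
    --disk=tljh-rescue-disk \
    --zone=us-central1-a
```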

Keep in mind that the snapshot doesn’t have to be from before the crash; presumably you’re having trouble retrieving the data precisely because you didn’t set up automatic backups. That is, you can make a snapshot of the failed instance now and try accessing the files through that.

Another way you can approach it, which I don’t recommend, is to create an image file and copy it into a Google Cloud Storage bucket, which you can then download and run as a local VM. Here’s a guide on how to make a Cloud Storage bucket:
https://cloud.google.com/storage/docs/quickstart-console

This will, however, create an image as large as the instance that crashed, so it’s not very practical. There are a few guides on how to copy your image into a Cloud Storage bucket. This is the one that I used:

But Google’s own guide seems to work as well. I’m not linking it here because I can only post two links as a new user, but feel free to look that one up if you’re so inclined. I’d recommend using the above link though.

Also note that this method seems to incur additional costs, since making an image and hosting it in a bucket appears to cost money (I’m not actually sure, but that’s my impression as I’m working off credits), so you’re better off just doing it the first way.
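For completeness, the image-and-bucket route can also be scripted; a sketch, assuming a bucket named gs://my-rescue-bucket already exists (all names are placeholders):

```bash
# Create an image from the crashed instance's disk, then export it to a
# Cloud Storage bucket as a compressed archive (names are placeholders).
gcloud compute images create tljh-rescue-image \
    --source-disk=tljh-disk \
    --source-disk-zone=us-central1-a

gcloud compute images export \
    --image=tljh-rescue-image \
    --destination-uri=gs://my-rescue-bucket/tljh-rescue-image.tar.gz
```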

Whatever the case, DO NOT give up and delete your instance; this was the advice I got from reputable sources. Don’t give up until you have tried one of these approaches to retrieve your data.

I will say, if you follow the TLJH guide on how to set up on GCP, then I strongly advise that you also configure automatic backups, which aren’t outlined in that guide. Also, don’t try to encrypt your server with HTTPS unless you’ve got a registered domain name handy to attach to your IP. GCP’s external IP addresses are ephemeral by default and will change after a stop/start, so if you restart your instance you’re stuck: HTTPS will break with the “scrambled credentials” error, even though it initially works.
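A rough sketch of both precautions with gcloud (the schedule name, disk name, region, and zone are placeholders; double-check the flags against Google’s own docs):

```bash
# 1. Automatic backups: create a daily snapshot schedule and attach it to
#    the boot disk (names, region, and zone are placeholders).
gcloud compute resource-policies create snapshot-schedule daily-backup \
    --region=us-central1 \
    --max-retention-days=14 \
    --daily-schedule \
    --start-time=04:00
gcloud compute disks add-resource-policies tljh-disk \
    --resource-policies=daily-backup \
    --zone=us-central1-a

# 2. Keep the same external IP across restarts by promoting the instance's
#    current ephemeral address to a static one.
gcloud compute addresses create tljh-static-ip \
    --addresses=EXTERNAL_IP_ADDRESS \
    --region=us-central1
```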


Because I cannot edit, this reply is to elaborate on the post above. If you are having problems finding the attached drive, it’s because the disk doesn’t automount.

You can use the “blkid” command in your new instance to look at all the attached devices that you have.

After you find the device that you are trying to access, you can use the “mount” command to manually mount it to your instance.
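Concretely, something like this (the device name and mount point are placeholders; check the lsblk/blkid output for yours):

```bash
# List attached block devices and their filesystems; the extra disk usually
# shows up as /dev/sdb (placeholder - check your own output).
sudo lsblk
sudo blkid

# Mount the relevant partition read-only and browse the old home directories.
sudo mkdir -p /mnt/rescue
sudo mount -o ro /dev/sdb1 /mnt/rescue
ls /mnt/rescue/home
```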

You may receive an error if the drive is corrupted. I don’t have the exact error message on hand, but you can Google it and find a command that will mount the corrupted disk anyway. You should then be able to access its contents.


Hi Everyone,

I’m trying to create an instance and ran into an error message. I’m on Step 1, number 19, of the “Installing on Google Cloud” instructions. After hitting “Create” I ran into this message:
“The following tabs have errors:
Disks”

I’m not sure what I’m doing or did wrong. I thought I followed the instructions properly up to that point.

Also, looking ahead to number 25, it mentions using the admin user’s name from step 6. Is that supposed to say step 18?

Any help is much appreciated.

thanks, louie

@howe I had the same issue where I received 500 errors because the disk had run out of space. I am trying to solve this by adding a new zonal persistent disk to the compute instance that runs JupyterHub. I followed all of Google’s instructions to create and mount a new zonal persistent disk here, successfully (I think). However, I can’t work out how to check whether it has actually worked and made additional space available to the JupyterHub server. When I SSH into the instance, it still says it has the standard 20GB drive, whereas I thought it might say 220GB (I added a 200GB zonal persistent disk). Do you know how to check if this has worked as intended? Cheers!
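For what it’s worth, this is roughly how I’ve been checking from inside the instance (assuming the new disk shows up as /dev/sdb and is mounted under /mnt/disks/, per Google’s guide; both are placeholders for my actual setup):

```bash
lsblk     # the 200GB disk should appear as its own device (e.g. /dev/sdb)
df -h     # any extra space shows up under its own mount point rather than
          # being added to the 20GB root filesystem
```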