Success stories using NFS with Z2JH and K8s?

I’m bringing this question over from gitter because it may be a longer-term discussion.

I have a question for people deploying Z2JH on Google GKE. I’ve deployed an (external) NFS server using U18.04 on a VM. I can mount the NFS shares on other instances in GCE. However, I cannot mount the shares on instances created in GKE node pools, much less mount them in pods. I can ping the NFS server, but the NFS mount requests appear to just hang. I’m doing this from the U18.04 nodes on which pods are deployed, in an attempt to debug why the pods themselves can’t mount NFS.
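For anyone debugging a similar hang, here is roughly the sequence of checks I’m running from a node. The hostname `nfs-server-vm` and export path `/export/home` are placeholders, not my actual setup:

```shell
# Basic reachability (ping already works in my case).
ping -c 2 nfs-server-vm

# Can we reach rpcbind (port 111)? A hang or timeout here points at a firewall.
rpcinfo -p nfs-server-vm

# List the exports the server advertises (also goes through rpcbind/mountd).
showmount -e nfs-server-vm

# Try the mount with short timeouts so it fails fast instead of hanging.
sudo mount -t nfs -o vers=4,timeo=30,retrans=2 nfs-server-vm:/export/home /mnt
```

These commands require a live NFS server and the node’s network, so treat them as a diagnostic recipe rather than something to run verbatim.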

So, my question: If you’ve gotten NFS to work in such a situation, can you share your configurations and/or experience on how you got it to work?

I’m using the configuration at https://github.com/berkeley-dsep-infra/datahub/blob/22022e5cfbf6d610eb01fc49ac2277f9e0645f03/docs/topic/cluster-config.rst and also a modified version, shown below (disabling ip-alias and network policy).

In both cases, I can’t mount NFS on the nodes themselves. Clearly there’s a firewall involved, but I can’t seem to find a way to either disable it or allow the local connections.
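For anyone hitting the same wall: GKE node-to-VM traffic in the same VPC still has to pass the VPC firewall, so I suspect a rule along these lines is needed. The rule name, tags, and port list here are illustrative, not from my actual setup:

```shell
# Hypothetical rule: allow NFSv4 (2049) plus rpcbind/mountd for NFSv3
# (111, 20048) from instances tagged nfs-client to those tagged nfs-server.
gcloud compute firewall-rules create allow-nfs \
  --network=default \
  --direction=INGRESS \
  --allow=tcp:2049,tcp:111,udp:111,tcp:20048,udp:20048 \
  --source-tags=nfs-client \
  --target-tags=nfs-server
```

The exact ports worth opening should be checked against the server’s `rpcinfo -p` output, since NFSv3 services can register on additional ports.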

gcloud beta container clusters create \
  --enable-ip-alias \
  --enable-autoscaling \
  --num-nodes 1 \
  --max-nodes=2 --min-nodes=1 \
  --region=us-central1 --node-locations=us-central1-b \
  --image-type=ubuntu \
  --disk-size=100 --disk-type=pd-ssd \
  --machine-type=n1-standard-2 \
  --release-channel regular \
  --enable-autoupgrade \
  --enable-autorepair \
  --no-enable-network-policy \
  --create-subnetwork="" \
  --tags=hub-cluster \
  --node-labels hub.jupyter.org/node-purpose=core \
  jhub2

gcloud container node-pools create \
  --machine-type n1-standard-4 \
  --num-nodes 1 \
  --enable-autoscaling \
  --min-nodes 0 --max-nodes 20 \
  --node-labels hub.jupyter.org/node-purpose=user \
  --node-taints hub.jupyter.org_dedicated=user:NoSchedule \
  --region=us-central1 \
  --image-type=ubuntu \
  --disk-size=100 --disk-type=pd-ssd \
  --enable-autoupgrade \

Thought I would follow up on this. I don’t know whether a “best practices” section of the Z2JH docs would be useful, but I think attaching this kind of practical deployment detail there would save people a lot of time.

In our case, we’re trying to deploy JupyterHub to support general computing classes and light computing classes. Our default notebook image for students has Python, C++, etc., and Microsoft Visual Studio Code. We’ve been using a per-student PV solution since May 2018, but the costs are mounting. The motivation for moving to NFS was cost and improved startup times, both of which NFS appears to address. We expect storage cost to drop from $380/mo to $80/mo with similar or better performance.

We’re still working out a full solution, but some things we’ve found useful for our GCE / GKE deployment:

  • We’re using an external NFS server backed by, e.g., a 2TB standard PV
  • We switched to network-tag firewall rules: the NFS server is tagged “nfs-server”, the JH cluster is tagged “nfs-client”, and the firewall rule allows access to nfs-server from nfs-client. This is much easier to manage than a CIDR-based firewall rule
  • We used Berkeley’s method of a privileged DaemonSet that mounts the NFS share once per node ( https://github.com/berkeley-dsep-infra/datahub/blob/a3f40164e3a1ea86d49d134d2f68adeb0d78ed67/hub/templates/nfs-mounter.yaml )
  • The NFS server exports using all_squash and sets anonuid=1000, anongid=100, which is the default user/group in our docker-stacks-derived containers. This simplifies container startup because you don’t need an init container running as root to chown the directory, since all file I/O is then performed as the specified user. It also eliminates the need for no_root_squash. However, it also means we can’t enforce per-user filesystem quotas using NFS quotas
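To illustrate that last bullet, an export entry shaped like ours would look roughly as follows. The path and client range are placeholders; only the squash/anon options reflect what we actually use:

```
# /etc/exports (sketch): map all client users to uid 1000 / gid 100
/export/home  10.0.0.0/8(rw,sync,all_squash,anonuid=1000,anongid=100,no_subtree_check)
```

After editing /etc/exports, `exportfs -ra` on the server re-reads it without a restart.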

We’re not certain this is the best way forward, but we want to roll this out before the start of the 2020 term.


Having more best-practices/“this is how we did it” content would be great. There is http://z2jh.jupyter.org/en/latest/community/index.html which is meant as a lightweight way to link to resources created by community members.

The reasoning for linking to other people’s work instead of incorporating it directly into the docs is that it reduces the load on the Z2JH maintainers, and that several deployment setups require access to the infrastructure being described; for example, you need access to AWS to work on the AWS instructions.

I think we can even link to this thread (and make it a wiki post) as a quick way to get the content into the docs. It would probably need some more words/step-by-step guidance.