Unable to provision JupyterHub Helm Chart into multiple GKE namespaces with Terraform

I have been having issues provisioning the JupyterHub 0.11.1 Helm chart into multiple namespaces at once with Terraform.

Currently I have a Terraform module that loops through a map of namespaces and external IP addresses, assigning each namespace one IP to use for its external load balancer. If I attempt to provision the chart in 3 namespaces, only 1 succeeds. If I re-run my Terraform pipeline, the other 2 namespaces provision without any issues, so I am not sure why this doesn’t work on the first pass. I am keeping the proxy.secretToken value constant across all deployments.
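For readers following along, the namespace-to-IP map described above might be declared roughly as follows. This is only a sketch: the variable name matches the var.namespaces reference in the resource posted later in this thread, but the type, description, and default values are assumptions. phys202 appears in the events output further down; the other namespace names and all IP addresses are placeholders.

```hcl
# Sketch only: the real module is not shown in this thread.
variable "namespaces" {
  description = "Namespace name => reserved external IP for that namespace's proxy LoadBalancer"
  type        = map(string)

  # Placeholder values; 203.0.113.0/24 is a documentation-only address range.
  default = {
    phys202 = "203.0.113.10"
    phys203 = "203.0.113.11"
    phys204 = "203.0.113.12"
  }
}
```

With for_each over a map like this, each.key is the namespace and each.value is the IP. Because the resulting instances have no dependencies on one another, Terraform creates them in parallel by default.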

I am not sure where to start troubleshooting this, or where to find any logs explaining why the chart didn’t provision into the other namespaces.

Since it’s working with one deployment it sounds like your problem is with Terraform rather than the JupyterHub Helm chart, though it’s possible your Z2JH configuration has an effect, especially if you’ve enabled cross-namespace features. Please could you:

  • Show us your full Z2JH configs with secrets redacted
  • Provide as much information as you can on how your K8s cluster is set up
  • Provide a link to your Terraform files. This is a Jupyter forum, but some people may have enough experience with Terraform to spot some issues

Z2JH Helm Config.yaml:

proxy:
  secretToken: <TOKEN>
  service:
    loadBalancerIP: <IP>
hub:
  config:
    Authenticator:
      admin_users:
      allowed_users:
    DummyAuthenticator:
    JupyterHub:
      authenticator_class: dummy
singleuser:
  profileList:
    - display_name: "Minimal environment"
      description: "To avoid too much bells and whistles: Python."
      default: true
    - display_name: "Datascience environment"
      description: "If you want the additional bells and whistles: Python, R, and Julia."
      kubespawner_override:
        image: jupyter/datascience-notebook:latest
    - display_name: "Spark environment"
      description: "The Jupyter Stacks spark image!"
      kubespawner_override:
        image: jupyter/all-spark-notebook:latest  
  memory:
    limit: 1G
    guarantee: 1G
  cpu:
    limit: .5
    guarantee: .5
  image:
    # You should replace the "latest" tag with a fixed version from:
    # https://hub.docker.com/r/jupyter/datascience-notebook/tags/
    # Inspect the Dockerfile at:
    # https://github.com/jupyter/docker-stacks/tree/master/datascience-notebook/Dockerfile
    name: jupyter/datascience-notebook
    pullPolicy: Always
    tag: latest
  defaultUrl: "/lab"

scheduling:
#   userScheduler:
#     enabled: true
#   podPriority:
#     enabled: true
#   userPlaceholder:
#     enabled: true
#     replicas: 2
  userPods:
    nodeAffinity:
      matchNodePurpose: require
  corePods:
    nodeAffinity:
      matchNodePurpose: require

cull:
  enabled: true
  timeout: 3600
  every: 3600

K8s Cluster Setup:

cluster_autoscaling = {
  enabled             = true
  autoscaling_profile = "BALANCED"
  min_cpu_cores       = 2
  max_cpu_cores       = 8
  min_memory_gb       = 8
  max_memory_gb       = 32
}

node_pools = [
  {
    "name" : "core-pool",
    "auto_repair" : true,
    "auto_upgrade" : true,
    "autoscaling" : true
    "disk_size_gb" : "50",
    "disk_type" : "pd-standard",
    "enable_secure_boot" : true,
    "image_type" : "cos_containerd",
    "initial_node_count" : 1
    "local_ssd_count" : 0,
    "machine_type" : "n2-standard-4",
    "max_count" : 3,
    "min_count" : 1,
    "node_locations" : "us-central1-a",
    "preemptible" : true
  },
  {
    "name" : "user-pool",
    "auto_repair" : true,
    "auto_upgrade" : true,
    "autoscaling" : true
    "disk_size_gb" : "50",
    "disk_type" : "pd-standard",
    "enable_secure_boot" : true,
    "image_type" : "cos_containerd",
    "initial_node_count" : 1
    "local_ssd_count" : 0,
    "machine_type" : "n2-standard-4",
    "max_count" : 3,
    "min_count" : 1,
    "node_locations" : "us-central1-a",
    "preemptible" : true
  }
]

node_pools_labels = {
  user-pool = {
    "hub.jupyter.org/node-purpose" = "user"
  }
  core-pool = {
    "hub.jupyter.org/node-purpose" = "core"
  }
}

node_pools_taints = {
  user-pool = [
    {
      key    = "hub.jupyter.org/dedicated"
      value  = "user"
      effect = "NO_SCHEDULE"
    },
  ]
}

Helm Resource Provisioning:

resource "helm_release" "jupyterhub" {
  for_each = var.namespaces

  name       = var.release_name
  repository = var.repository_url
  chart      = var.helm_chart
  version    = var.helm_version
  namespace  = each.key
  timeout    = var.timeout
  values     = var.values

  set {
    name  = "proxy.service.loadBalancerIP"
    value = each.value
  }

  set {
    name  = "proxy.secretToken"
    value = random_id.secret_token.id
  }

  // There is a bug in the helm provider v2.0.3 and this is the workaround
  // https://github.com/hashicorp/terraform-provider-helm/issues/701
  // https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/1998

  set {
    name  = "custom.whatever"
    value = "doesnotmatter"
  }
}

I am unable to provide links to any GitHub file due to it being a private repository.

If I run kubectl get events in a namespace that failed, this is what I see:

85s         Normal    Scheduled                pod/hook-image-awaiter-22v7v           Successfully assigned phys202/hook-image-awaiter-22v7v to gke-tf-jh-cluster-core-pool-0d97fecb-sxd8
85s         Normal    Scheduled                pod/hook-image-puller-sr45s            Successfully assigned phys202/hook-image-puller-sr45s to gke-tf-jh-cluster-user-pool-bd2cec1e-jr3v
85s         Normal    SuccessfulCreate         daemonset/hook-image-puller            Created pod: hook-image-puller-sr45s
85s         Normal    SuccessfulCreate         job/hook-image-awaiter                 Created pod: hook-image-awaiter-22v7v
84s         Normal    Pulled                   pod/hook-image-awaiter-22v7v           Container image "jupyterhub/k8s-image-awaiter:0.11.1" already present on machine
84s         Normal    Started                  pod/hook-image-puller-sr45s            Started container image-pull-metadata-block
84s         Normal    Started                  pod/hook-image-awaiter-22v7v           Started container hook-image-awaiter
84s         Normal    Created                  pod/hook-image-awaiter-22v7v           Created container hook-image-awaiter
84s         Normal    Pulled                   pod/hook-image-puller-sr45s            Container image "jupyterhub/k8s-network-tools:0.11.1" already present on machine
84s         Normal    Created                  pod/hook-image-puller-sr45s            Created container image-pull-metadata-block
83s         Normal    Pulling                  pod/hook-image-puller-sr45s            Pulling image "jupyter/datascience-notebook:latest"
81s         Normal    Pulled                   pod/hook-image-puller-sr45s            Successfully pulled image "jupyter/datascience-notebook:latest"
81s         Normal    Started                  pod/hook-image-puller-sr45s            Started container image-pull-singleuser
81s         Normal    Created                  pod/hook-image-puller-sr45s            Created container image-pull-singleuser
80s         Normal    Pulling                  pod/hook-image-puller-sr45s            Pulling image "jupyter/datascience-notebook:latest"
79s         Normal    Pulled                   pod/hook-image-puller-sr45s            Successfully pulled image "jupyter/datascience-notebook:latest"
79s         Normal    Created                  pod/hook-image-puller-sr45s            Created container image-pull-singleuser-profilelist-1
79s         Normal    Started                  pod/hook-image-puller-sr45s            Started container image-pull-singleuser-profilelist-1
78s         Normal    Pulling                  pod/hook-image-puller-sr45s            Pulling image "jupyter/all-spark-notebook:latest"
77s         Normal    Pulled                   pod/hook-image-puller-sr45s            Successfully pulled image "jupyter/all-spark-notebook:latest"
77s         Normal    Created                  pod/hook-image-puller-sr45s            Created container image-pull-singleuser-profilelist-2
77s         Normal    Started                  pod/hook-image-puller-sr45s            Started container image-pull-singleuser-profilelist-2
76s         Normal    Pulled                   pod/hook-image-puller-sr45s            Container image "k8s.gcr.io/pause:3.2" already present on machine
76s         Normal    Created                  pod/hook-image-puller-sr45s            Created container pause
76s         Normal    Started                  pod/hook-image-puller-sr45s            Started container pause
4s          Normal    NoPods                   poddisruptionbudget/user-placeholder   No matching pods found
4s          Normal    NoPods                   poddisruptionbudget/user-scheduler     No matching pods found
73s         Normal    Killing                  pod/hook-image-puller-sr45s            Stopping container pause
70s         Normal    ProvisioningSucceeded    persistentvolumeclaim/hub-db-dir       Successfully provisioned volume pvc-5d26cf3e-9717-4640-bc3a-39e8a781ebeb using kubernetes.io

Just to ask, do I need a separate secret token for each namespace that the Helm chart is provisioned into? The secret token is used to authenticate between the hub and proxy pods, correct?

This is a guess, but maybe having multiple image pullers for the same image across different deployments is a problem? You could try disabling it.

proxy.secretToken secures traffic between the hub and proxy. It’s best practice to use different tokens, but you’re obviously free to make the security trade-off. The latest dev version of Z2JH uses some newish Helm features to autogenerate the secret tokens for new deployments.
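If separate tokens are wanted, one way to do it in Terraform is to generate a random_id per namespace and index it with each.key. The sketch below reuses the variables from the resource posted earlier. The byte_length of 32 and the use of .hex (rather than .id, which is base64url-encoded) are my assumptions, chosen to mirror the openssl rand -hex 32 command the Z2JH docs suggest for this value.

```hcl
# One random token per namespace, keyed the same way as the releases.
resource "random_id" "secret_token" {
  for_each    = var.namespaces
  byte_length = 32
}

resource "helm_release" "jupyterhub" {
  for_each = var.namespaces

  name       = var.release_name
  repository = var.repository_url
  chart      = var.helm_chart
  version    = var.helm_version
  namespace  = each.key
  timeout    = var.timeout
  values     = var.values

  set {
    name  = "proxy.service.loadBalancerIP"
    value = each.value
  }

  set {
    name  = "proxy.secretToken"
    # 64 hex characters, unique to this namespace's hub/proxy pair.
    value = random_id.secret_token[each.key].hex
  }
}
```

The custom.whatever workaround block from the original resource is left out here only for brevity.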

Thanks for the reply. Just to confirm, the image puller you reference is the hook image puller? I’ll give the below code a try.

prePuller:
  hook:
    enabled: false

It’s probably worth disabling the continuous prePuller too:

prePuller:
  hook:
    enabled: false
  continuous:
    enabled: false

If that doesn’t work, you could try disabling the user scheduler and userPlaceholder. Those are the only other cluster-wide resources I can think of.
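In case it helps, all of the toggles mentioned so far could be collected into one extra values document on the Terraform side. This is a sketch, not something I have tested against your module: the local name debug_overrides is made up, and it assumes var.values is a list of YAML strings, as the values = var.values line suggests.

```hcl
locals {
  # Hypothetical name; turns off the pre-pullers, user scheduler and placeholders.
  debug_overrides = yamlencode({
    prePuller = {
      hook       = { enabled = false }
      continuous = { enabled = false }
    }
    scheduling = {
      userScheduler   = { enabled = false }
      userPlaceholder = { enabled = false }
    }
  })
}
```

It could then be appended in the helm_release with values = concat(var.values, [local.debug_overrides]). Entries later in the values list are merged over earlier ones, so these overrides would win over the base config.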

Thanks for the response, I’ll try that out and see.

My workaround at the moment is to run two terraform apply steps within a single pipeline YAML file, so that the remaining namespaces get the Helm chart provisioned on the second pass.