Log Monitoring for JupyterLab on DockerSpawner

Hi folks, I have JupyterHub running on a server and I use DockerSpawner to spawn a container for each user. I want to use a monitoring tool like Grafana to monitor the Hub logs as well as the container logs. I know this is possible for JupyterHub on Kubernetes, but how do I do the same for deployments that don't use Kubernetes?

Any help is appreciated. Thanks in advance!

Grafana is typically used for visualising metrics, e.g. resource consumption, number of servers, performance, etc. JupyterHub exposes some Prometheus metrics (served from the Hub's /hub/metrics endpoint).

These can be scraped by Prometheus and viewed with Grafana, regardless of the hosting platform (Kubernetes, Docker, etc.).
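
If you haven't set up Prometheus before, a minimal prometheus.yml scrape job for the Hub could look like the sketch below. It assumes the Hub is reachable on localhost:8000 and that /hub/metrics is accessible to the scraper (depending on your JupyterHub version and configuration you may need to allow unauthenticated metrics access or supply an API token), so treat it as a starting point rather than a drop-in config:

# Sketch of a prometheus.yml scrape job for JupyterHub metrics.
# Assumes the Hub listens on localhost:8000 and /hub/metrics is reachable by Prometheus.
scrape_configs:
  - job_name: jupyterhub
    metrics_path: /hub/metrics
    static_configs:
      - targets:
          - localhost:8000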

If you’re actually interested in gathering logs, try searching for “log aggregation”, “centralised logging”, or similar, as this is a topic that applies to all infrastructure, not just JupyterHub.

We did this sort of thing using Promtail and Grafana Loki. The Ansible repos are private so I can't share them here, but here are the config files for Promtail and Grafana Loki:

Promtail:

#
# Ansible managed
#
# https://github.com/grafana/loki/blob/master/docs/clients/promtail/configuration.md
server:
  grpc_listen_port: 9081
  http_listen_address: localhost
  http_listen_port: 9080


positions:
  filename: /var/lib/promtail/positions.yml


clients:
  - url: http://localhost:3100/loki/api/v1/push


scrape_configs:
  - file_sd_configs:
    - files:
      - /etc/promtail/file_sd/*.yml
      - /etc/promtail/file_sd/*.yaml
      - /etc/promtail/file_sd/*.json
    job_name: file_sd

  - job_name: system
    pipeline_stages:
    # Mask the last octet of client IP addresses before shipping the access log
    - replace:
        expression: (?:[0-9]{1,3}\.){3}([0-9]{1,3})
        replace: '***'
    static_configs:
    - labels:
        __path__: /var/log/nginx/json_access.log
        agent: promtail
        host: jupyterhub
        job: nginx_access_log
      targets:
      - localhost
  - job_name: jupyterhub
    pipeline_stages:
    - regex:
        expression: ^\[(?P<level>\w{1}) (?P<timestamp>[\d:\.\s-]*) (?P<app>\w*) (?P<func>[\w:\d]*)\]
          (?P<message>.*)
    - drop:
        drop_counter_reason: promtail_424_non_running_server
        expression: .*424.*
    - drop:
        drop_counter_reason: promtail_non_running_server_api_error
        expression: .*Failing suspected API.*
    - drop:
        drop_counter_reason: promtail_http_redirect_log
        expression: (.*)302 GET /user/(.*)/(.*)/(.*)
    - drop:
        drop_counter_reason: promtail_metrics_log
        expression: .*/metrics.*
    static_configs:
    - labels:
        __path__: /var/log/jupyterhub-production/jupyterhub.log
        agent: promtail
        host: jupyterhub
        job: jupyterhub_log
      targets:
      - localhost
  - job_name: jupyterhub_proxy
    pipeline_stages:
    - regex:
        expression: ^(?P<timestamp>[\d:\.]*) \[(?P<app>\w*)\] (?P<level>[\w:]*) (?P<message>.*)
    static_configs:
    - labels:
        __path__: /var/log/jupyterhub-production/jupyterhub-proxy.log
        agent: promtail
        host: jupyterhub
        job: jupyterhub_proxy_log
      targets:
      - localhost
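
Your original question also mentioned the logs of the per-user containers spawned by DockerSpawner. The config above only tails log files on the host, but recent Promtail versions can also discover and tail Docker containers directly via docker_sd_configs (Promtail needs read access to the Docker socket for this). Something along these lines could be added under scrape_configs; the job name and label choices here are just placeholders, so check the docker_sd_configs section of the Promtail docs for your version:

  # Hypothetical extra scrape job: tail logs from all local Docker containers
  - job_name: docker_containers
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 15s
    relabel_configs:
      # Docker container names come with a leading slash; strip it into a "container" label
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container'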

Grafana Loki:

# WARNING: This file is Ansible managed. Do not modify it

# Loki Config file

# based on https://github.com/grafana/loki/blob/master/cmd/loki/loki-docker-config.yaml

# Documentation: https://grafana.com/docs/loki/latest/configuration/
# Reference: https://github.com/grafana/loki/issues/4613#issuecomment-1018367471

# Enables authentication through the X-Scope-OrgID header, which must be present
# if true. If false, the OrgID will always be set to "fake".
auth_enabled: False

# Configures the server of the launched module(s).
server:
  http_listen_address: localhost
  http_listen_port: 3100
  http_server_read_timeout: 310s # allow longer time span queries
  http_server_write_timeout: 310s # allow longer time span queries
  grpc_server_max_recv_msg_size: 33554432 # 32MiB (int bytes), default 4MB
  grpc_server_max_send_msg_size: 33554432 # 32MiB (int bytes), default 4MB

  # Log only messages with the given severity or above. Supported values [debug,
  # info, warn, error]
  # CLI flag: -log.level
  log_level: info

# Configures the ingester and how the ingester will register itself to a
# key value store.
ingester:
  wal:
    enabled: true
    dir: /var/lib/loki/wal
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 1h       # Any chunk not receiving new logs in this time will be flushed
  max_chunk_age: 1h           # All chunks will be flushed when they hit this age, default is 1h
  chunk_target_size: 1048576  # Loki will attempt to build chunks up to 1.5MB, flushing first if chunk_idle_period or max_chunk_age is reached first
  chunk_retain_period: 30s    # Must be greater than index read cache TTL if using an index cache (Default index read cache TTL is 5m)
  max_transfer_retries: 0     # Chunk transfers disabled

schema_config:
  configs:
  - from: 2020-05-15
    store: boltdb-shipper
    object_store: filesystem
    schema: v11
    index:
      prefix: index_
      period: 24h

storage_config:
  boltdb:
    directory: /var/lib/loki/index
  filesystem:
    directory: /var/lib/loki/chunks
  boltdb_shipper:
    active_index_directory: /var/lib/loki/boltdb-shipper-active
    cache_location: /var/lib/loki/boltdb-shipper-cache
    cache_ttl: 72h         # Can be increased for faster performance over longer query periods, uses more disk space
    shared_store: filesystem

compactor:
  working_directory: /var/lib/loki/boltdb-shipper-compactor
  shared_store: filesystem
  compaction_interval: 2h
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150

limits_config:
  retention_period: 168h
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 84h

  # Per-user ingestion rate limit in sample size per second. Units in MB.
  # CLI flag: -distributor.ingestion-rate-limit-mb
  ingestion_rate_mb: 8 # <float> | default = 4

  # Per-user allowed ingestion burst size (in sample size). Units in MB.
  # The burst size refers to the per-distributor local rate limiter even in the
  # case of the "global" strategy, and should be set at least to the maximum logs
  # size expected in a single push request.
  # CLI flag: -distributor.ingestion-burst-size-mb
  ingestion_burst_size_mb: 16 # <int> | default = 6

  # Maximum byte rate per second per stream,
  # also expressible in human readable forms (1MB, 256KB, etc).
  # CLI flag: -ingester.per-stream-rate-limit
  per_stream_rate_limit: 5MB # <string|int> | default = "3MB"

  # Maximum burst bytes per stream,
  # also expressible in human readable forms (1MB, 256KB, etc).
  # This is how far above the rate limit a stream can "burst" before the stream is limited.
  # CLI flag: -ingester.per-stream-rate-limit-burst
  per_stream_rate_limit_burst: 15MB # <string|int> | default = "15MB"

  # The limit to length of chunk store queries. 0 to disable.
  # CLI flag: -store.max-query-length
  max_query_length: 168h # <duration> | default = 721h

  # Limit how far back in time series data and metadata can be queried,
  # up until lookback duration ago.
  # This limit is enforced in the query frontend, the querier and the ruler.
  # If the requested time range is outside the allowed range, the request will not fail,
  # but will be modified to only query data within the allowed time range.
  # The default value of 0 does not set a limit.
  # CLI flag: -querier.max-query-lookback
  max_query_lookback: 168h

  # Split queries by a time interval and execute in parallel.
  # The value 0 disables splitting by time.
  # This also determines how cache keys are chosen when result caching is enabled
  split_queries_by_interval: 30m

  # Maximum number of active streams per user, across the cluster. 0 to disable.
  # When the global limit is enabled, each ingester is configured with a dynamic
  # local limit based on the replication factor and the current number of healthy
  # ingesters, and is kept updated whenever the number of ingesters change.
  # CLI flag: -ingester.max-global-streams-per-user
  max_global_streams_per_user: 100000 # <int> | default = 5000

  # Limit the maximum of unique series that is returned by a metric query.
  # When the limit is reached an error is returned.
  # CLI flag: -querier.max-query-series
  max_query_series: 100000 # <int> | default = 500

  # Timeout when querying backends (ingesters or storage) during the execution of
  # a query request. If a specific per-tenant timeout is used, this timeout is
  # ignored.
  # CLI flag: -querier.query-timeout
  query_timeout: 5m # default = 1m

frontend:
  # Maximum number of outstanding requests per tenant per frontend; requests
  # beyond this error with HTTP 429.
  # CLI flag: -querier.max-outstanding-requests-per-tenant
  max_outstanding_per_tenant: 2048 # default = 100

  # Compress HTTP responses.
  # CLI flag: -querier.compress-http-responses
  compress_responses: true # default = false

  # Log queries that are slower than the specified duration. Set to 0 to disable.
  # Set to < 0 to enable on all queries.
  # CLI flag: -frontend.log-queries-longer-than
  log_queries_longer_than: 20s

frontend_worker:
  grpc_client_config:
    # The maximum size in bytes the client can send.
    # CLI flag: -<prefix>.grpc-max-send-msg-size
    max_send_msg_size: 33554432 # 32MiB, default = 16777216
    max_recv_msg_size: 33554432

ingester_client:
  grpc_client_config:
    # The maximum size in bytes the client can send.
    # CLI flag: -<prefix>.grpc-max-send-msg-size
    max_send_msg_size: 33554432 # 32mb, default = 16777216
    max_recv_msg_size: 33554432

query_scheduler:
  max_outstanding_requests_per_tenant: 2048
  grpc_client_config:
    # The maximum size in bytes the client can send.
    # CLI flag: -<prefix>.grpc-max-send-msg-size
    max_send_msg_size: 33554432 # 32mb, default = 16777216
    max_recv_msg_size: 33554432

# Don't enable anonymous usage reporting.
analytics:
  reporting_enabled: false

And you can set up dashboards in Grafana to explore these logs. I hope this gives you an idea!
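
For the Grafana side, one way to wire this up is to provision Loki as a data source. A minimal provisioning file sketch (the path is just an example, and it assumes Loki is listening on localhost:3100 as configured above):

# e.g. /etc/grafana/provisioning/datasources/loki.yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://localhost:3100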
