Conscientiously logging the origin of launch events

analytics

#1

Two questions about Binder usage have come up recently that we can’t answer, because our analytics don’t capture the data:

  • how many launches come via the API vs pageviews
  • which specific clients (thebelab or other clients) are launching

Until the events publication, we weren’t tracking API launches at all (outside Prometheus).

The Referer and Origin headers can be used to identify where a launch request came from. Logging the Referer whole would be a big ol’ privacy violation, but I see two options:

  1. log only the hostname in the Referer/Origin, and only if it’s not an IP address
  2. log a kind='api' flag (or similar) if the Referer doesn’t match the Host
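
To make the two options concrete, here is a rough Python sketch. The function names, the plain-dict `headers` argument, and the `'page'` value are all made up for illustration, and I’m assuming a missing Referer/Origin counts as “doesn’t match the Host”. It is not what BinderHub does today.

```python
import ipaddress
from urllib.parse import urlsplit


def _hostname(url):
    """Return the lowercased hostname of a URL, or None if there is none."""
    try:
        return urlsplit(url).hostname
    except ValueError:
        # urlsplit raises ValueError on malformed URLs (e.g. a bad IPv6 literal)
        return None


def origin_hostname(headers):
    """Option 1: hostname from Origin/Referer, dropped if it is an IP address."""
    host = _hostname(headers.get("Origin") or headers.get("Referer") or "")
    if host is None:
        return None
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return host  # not an IP address, so it is a name we could log
    return None  # an IP address: do not log it


def launch_kind(headers):
    """Option 2: 'api' unless the Referer/Origin hostname matches the Host."""
    # Prefix with '//' so urlsplit treats the Host header as a netloc
    # and strips any port for us.
    own = _hostname("//" + headers.get("Host", ""))
    source = _hostname(headers.get("Origin") or headers.get("Referer") or "")
    # Assumption: a missing or unparseable Referer counts as an API launch.
    return "page" if source is not None and source == own else "api"
```

Note that neither function ever keeps the path or query string of the Referer; only the hostname is looked at.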

I don’t know if capturing the hostname of the Referer is too personal to collect long-term (short-term logging is a different story). On one hand, I could imagine it identifying users when the hostname is low-volume or unique. On the other hand, I feel like it’s appropriate to see what sites are using mybinder.org to provide free, anonymous compute. I think we should at least do the second option so we can see how much Binder is being used to launch kernels via the API from other sites, even if we decide not to track the site of origin.

Note that I am specifically not talking about the Referer to the page that builds, but only the Referer for the API requests. Binder links would show the Referer as mybinder.org, because the API request originates on our page.

For the second question, it might be useful to define an opt-in field that clients should use to identify themselves (e.g. X-Binder-Client: “thebelab 3.1.5”), so that we can see what clients are being used if they are interested in being tracked.
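
A rough sketch of how the server side could read such a header, assuming the value is a client name optionally followed by a space and a version, as in the example above. The function name, the allowed characters, and the length limits are my own assumptions, chosen so that arbitrary text can’t end up in the event log.

```python
import re

# Hypothetical format: "<name>" or "<name> <version>", conservative charset.
_CLIENT_RE = re.compile(r"([A-Za-z0-9_.-]{1,64})(?: ([A-Za-z0-9_.+-]{1,32}))?")


def parse_binder_client(headers):
    """Return {'client': ..., 'version': ...} or None if absent or malformed."""
    value = headers.get("X-Binder-Client", "").strip()
    match = _CLIENT_RE.fullmatch(value)
    if match is None:
        return None  # header missing, or not in the expected shape
    return {"client": match.group(1), "version": match.group(2)}
```

Clients that don’t send the header simply produce `None`, which keeps it opt-in.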


#2

We already collect the Referer for people who come to the mybinder.org launch page via their browser, so I think we can also collect this for API-based launches. When a client goes haywire, it would be incredibly useful to know who to contact. Currently it is quite tricky to track down the source of API requests, so all we can do is ban those repositories and hope their owner stops by to ask why. Helping them debug their setup seems like a much more productive way of spending our time.


#3

Collecting is one thing (this is fine in short-lived private analytics for diagnostic purposes, an explicitly whitelisted GDPR use case). The public event export is another story, where including the full Referer for every launch would likely be a violation. I think we have to truncate at least to a hostname, or most conservatively to a bool(is remote).
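
As a sketch of what that truncation could look like when copying an event from the private log to the public export: the field names (`referer`, `is_remote`, `origin`), the function name, and treating a missing Referer as remote are all assumptions here, not the actual event schema.

```python
import ipaddress
from urllib.parse import urlsplit


def public_event(private_event, own_host="mybinder.org", keep_hostname=False):
    """Copy an event for public export without the full Referer."""
    event = {k: v for k, v in private_event.items() if k != "referer"}
    try:
        source = urlsplit(private_event.get("referer") or "").hostname
    except ValueError:
        source = None  # malformed Referer: treat as unknown
    # Most conservative form: a single boolean.
    # Assumption: an absent Referer counts as remote.
    event["is_remote"] = source != own_host
    if keep_hostname and source is not None:
        try:
            ipaddress.ip_address(source)
        except ValueError:
            event["origin"] = source  # a name, not an IP address
    return event
```

With `keep_hostname=False` the export only ever contains the boolean; flipping it on gives the hostname-truncated variant.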