Conscientiously logging the origin of launch events

minrk · November 9, 2018, 8:51am

Two Binder metrics have come up recently that we don’t have data on, because our analytics don’t capture them:

launches via the API vs pageviews
launches from specific clients (thebelab or other clients)

Until the events publication, we weren’t tracking API launches at all (outside Prometheus).

The Referer and Origin headers can be used to identify users. Logging the Referer whole would be a big ol’ privacy violation, but I see two options:

log only the hostname in the Referer/Origin, and only if it’s not an ip
log a kind='api' flag (or similar) if the Referer doesn’t match the Host
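A minimal sketch of how the two options could be combined, not BinderHub's actual implementation. The function names, the `'page'` value, the returned dict shape, and the choice to count a missing Referer as `'api'` are all assumptions made for illustration; `headers` is assumed to be a plain mapping of header names to values.

```python
from ipaddress import ip_address
from urllib.parse import urlsplit


def origin_hostname(header_value):
    """Return only the hostname of a Referer/Origin value, or None.

    The path and query string are discarded, and IP addresses are
    dropped entirely (option 1 above).
    """
    if not header_value:
        return None
    host = urlsplit(header_value).hostname
    if not host:
        return None
    try:
        ip_address(host)
    except ValueError:
        return host  # not an IP address, so keep the hostname
    return None  # an IP address: do not log it


def launch_origin(headers):
    """Classify a launch request from its headers (hypothetical helper).

    kind is 'api' when the Referer/Origin host does not match Host
    (option 2 above); 'page' is an assumed name for the other case.
    """
    # Parse Host as a network location so that any port is ignored.
    own_host = urlsplit("//" + (headers.get("Host") or "")).hostname
    raw = headers.get("Referer") or headers.get("Origin")
    raw_host = urlsplit(raw).hostname if raw else None
    # Assumption: a request with no Referer at all counts as an API launch.
    kind = "page" if raw_host and raw_host == own_host else "api"
    return {"kind": kind, "origin": origin_hostname(raw)}
```

Dropping the `origin` key from the result gives the more conservative variant that records only the flag.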

I don’t know if capturing the hostname of the Referer is too personal to collect long-term (short-term logging is a different story). On one hand, I could imagine it potentially identifying users for low-volume or unique hostnames. On the other hand, I feel like it’s appropriate to see what sites are using mybinder.org to provide free, anonymous compute. I feel like we should at least do the second option so we can see how much Binder is being used to launch kernels via the API from other sites, even if we decide not to track the site of origin.

Note that I am specifically not talking about the Referer to the page that builds, but only the Referer for the api requests. Binder links would show the Referer as mybinder.org, because the API request originates on our page.

For the second question, it might be useful to define an opt-in field that clients should use to identify themselves (e.g. X-Binder-Client: “thebelab 3.1.5”), so that we can see what clients are being used if they are interested in being tracked.
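As a rough sketch of what reading such an opt-in header might look like on the server side: the "name version" format is inferred from the example value above, while the function name, the length limits, and the returned dict are assumptions for illustration only, not an agreed specification.

```python
import re

# Assumed format: a short client name, optionally followed by one space
# and a version string. Anything else is ignored rather than logged.
_CLIENT_RE = re.compile(r"([A-Za-z0-9_.-]{1,64})(?: ([A-Za-z0-9_.+-]{1,32}))?")


def parse_binder_client(header_value):
    """Parse an X-Binder-Client value into name and version, or None."""
    if not header_value:
        return None  # header absent: the client did not opt in
    match = _CLIENT_RE.fullmatch(header_value.strip())
    if match is None:
        return None  # unexpected shape: do not record arbitrary text
    name, version = match.groups()
    return {"client": name.lower(), "version": version}
```

Rejecting values that do not fit the expected shape keeps arbitrary, possibly identifying, strings out of the event log.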

betatim · November 9, 2018, 12:25pm

We already collect the Referer for people who come to the mybinder.org launch page via their browser, so I think we can also collect it for API-based launches. When a client goes haywire, it would be incredibly useful to know who to contact; currently it is quite tricky to track down the source of API requests. This means all we can do is ban those repositories and hope their owner stops by to ask why. Helping them debug their setup seems like a much more productive way of spending our time.

minrk · November 9, 2018, 3:01pm

Collecting is one thing (this is fine in short-lived private analytics for diagnostic purposes, an explicitly whitelisted GDPR use case). The public event export is another story, where including the full Referer for every launch would likely be a violation. I think we have to truncate at least to a hostname or, most conservatively, to a bool(is remote).
