A datasette of mybinder.org launches

betatim · March 23, 2020, 10:15pm

I wanted to explore the data about launches on mybinder.org that we collect in our analytics. I also wanted to learn about datasette. As a result there is now https://binderlytics.herokuapp.com/binder-launches/binder

It contains the last ~~90 days of data, about 1.6 million rows~~ 270 days of data, about 4.4 million rows.

Please share interesting things you find, queries you use, things that look a bit weird. Here are some I used:

The notebook used to create the DB is https://github.com/betatim/binder-datasette/blob/fcbf0fc9d468fc46aadde8c7cf762964ac589c1a/create-db.ipynb. I will probably re-run it with more than the last 90 days. Maybe even all ~500 days of data, resulting in a dataset of about 7-8 million rows!

betatim · March 23, 2020, 11:37pm

Now with ~4.4 million launches or 270 days of history.

willingc · March 24, 2020, 1:46pm

This is fabulous. Thank you @betatim.

manics · March 25, 2020, 6:36pm

Graph of launches by provider per day (warning: takes a while for the graph to load). Not surprisingly it’s dominated by GitHub. I couldn’t find a way to display log10(count(provider)). You can easily distinguish weekday vs weekend though.
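One way around the missing log10 is to run the aggregation in SQL and take the logarithm afterwards in Python. This is only a sketch: the table name `binder` and the columns `timestamp`, `provider` and `spec` are assumptions about the schema, and the in-memory rows are made-up stand-ins for the real data.

```python
import math
import sqlite3


def launches_by_provider_per_day(conn):
    """Return (day, provider, count, log10(count)) tuples, one per day and provider."""
    rows = conn.execute(
        """
        SELECT date(timestamp) AS day, provider, count(*) AS n
        FROM binder
        GROUP BY day, provider
        ORDER BY day, provider
        """
    ).fetchall()
    # SQLite builds without the math extension have no log10(), so compute it here.
    return [(day, provider, n, math.log10(n)) for day, provider, n in rows]


# Tiny in-memory stand-in for the real database (assumed schema, invented rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE binder (timestamp TEXT, provider TEXT, spec TEXT)")
sample = [("2020-03-23T10:15:00", "GitHub", "owner/repo/master")] * 10
sample += [("2020-03-23T11:00:00", "Gist", "owner/abc123")]
conn.executemany("INSERT INTO binder VALUES (?, ?, ?)", sample)

rows = launches_by_provider_per_day(conn)
for row in rows:
    print(row)
```

Plotting the last element of each tuple instead of the raw count should keep the smaller providers visible next to GitHub.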

Least popular GitHub repos (only 1 launch ever)

betatim · March 26, 2020, 6:37am

The chart is super cool. Even super cooler is that the link contains all the info, so it just works when I click it.

Based on your “less popular repos”: a list of “somewhat popular repos” (more than 5 but less than 50)
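A query of this kind could look roughly like the sketch below. The table name `binder` and the column names `provider` and `spec` are assumptions, not taken from the actual database, and the sample rows are invented so the example runs on its own.

```python
import sqlite3


def somewhat_popular(conn, low=5, high=50):
    """Return (spec, launches) for repos with more than `low` and fewer than `high` launches."""
    return conn.execute(
        """
        SELECT spec, count(*) AS launches
        FROM binder
        WHERE provider = 'GitHub'
        GROUP BY spec
        HAVING count(*) > ? AND count(*) < ?
        ORDER BY launches DESC
        """,
        (low, high),
    ).fetchall()


# In-memory stand-in with invented rows (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE binder (timestamp TEXT, provider TEXT, spec TEXT)")
for spec, n in [("a/one/master", 1), ("b/six/master", 6), ("c/fifty/master", 50)]:
    conn.executemany(
        "INSERT INTO binder VALUES (?, ?, ?)",
        [("2020-03-26T06:37:00", "GitHub", spec)] * n,
    )

popular = somewhat_popular(conn)
print(popular)
```

Setting `low=0, high=2` in the same function would give the "only 1 launch ever" list.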

betatim · April 3, 2020, 7:50am

I’ve updated the datasette. It now contains all of our historical analytics events. All 7.6 million of them.

https://binderlytics.herokuapp.com/binder-launches/binder

I will try to set up a GitHub Action that runs every day on https://github.com/betatim/binder-datasette to keep the binderlytics page updated.

It is crazy to think that we will sooner or later reach 10 million launches!

betatim · April 3, 2020, 8:40am

A new query showing the number of launches per week.

We are getting close to 30k per day!
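For reference, a per-week count can be written along these lines. It is a sketch under assumptions: the table `binder` and its `timestamp` column are guesses at the schema, and the three sample rows are invented. Weeks are labelled with SQLite's `%W` (weeks starting on Monday), so a week that spans New Year is split in two.

```python
import sqlite3


def launches_per_week(conn):
    """Return (week, launches, average launches per day) ordered by week."""
    return conn.execute(
        """
        SELECT strftime('%Y-%W', timestamp) AS week,
               count(*) AS launches,
               count(*) / 7.0 AS avg_per_day
        FROM binder
        GROUP BY week
        ORDER BY week
        """
    ).fetchall()


# In-memory stand-in with invented rows (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE binder (timestamp TEXT, provider TEXT, spec TEXT)")
conn.executemany(
    "INSERT INTO binder VALUES (?, ?, ?)",
    [
        ("2020-03-23T09:00:00", "GitHub", "owner/repo/master"),  # Monday
        ("2020-03-29T21:00:00", "GitHub", "owner/repo/master"),  # Sunday, same week
        ("2020-03-30T08:00:00", "GitHub", "owner/repo/master"),  # Monday, next week
    ],
)

weekly = launches_per_week(conn)
for week, launches, avg_per_day in weekly:
    print(week, launches, round(avg_per_day, 2))
```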

betatim · April 20, 2020, 8:19am

I’ve updated https://binderlytics.herokuapp.com/binder-launches/binder. It now contains ~8,000,000 launch events.

mybinder.org is ticking along at roughly 140,000 launches a week. So you still have some time (14-15 weeks) to get your favourite drink chilled for the 10M celebration.
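The estimate follows from the numbers in this post, assuming the weekly rate stays flat:

```python
target = 10_000_000   # the 10M celebration
current = 8_000_000   # approximate launch events in the DB now
per_week = 140_000    # rough current weekly launch rate

weeks_remaining = (target - current) / per_week
print(f"{weeks_remaining:.1f} weeks")
```

If the weekly rate keeps growing, the milestone would arrive sooner than this.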

choldgraf · March 30, 2021, 2:52am

Binderlytics seems to be down

betatim · March 30, 2021, 5:07am

Heroku is saying that “the app crashed”, which is not super insightful. The last deploy was in January, so it seems weird that it would start crashing now.

I’ll look into it.

Longer term we need some smart ideas because the size of the database file is pushing various limits for free Heroku instances. Maybe we can host it on mybinder.org infrastructure?

betatim · March 30, 2021, 5:24am

Looks like redeploying it did the trick. Let me know if it has started working for you as well.

yuvipanda · March 30, 2021, 6:45am

I’d like that very much! Along with the analytics archive maybe?

betatim · March 30, 2021, 7:37am

Right now I run `datasette deploy heroku ...` and it magically figures out all the things. Building the SQLite database from the analytics archive takes quite a while on my laptop (I start it, leave it, and a few hours later notice it is done). All this made me think about what the easiest way to build and deploy it would be. Maybe with a bit of tweaking we can make the notebook in betatim/binder-datasette faster at creating the DB. Then we could build an image for this service like we do for the analytics archive and federation redirector images (via a GH action on mybinder.org-deploy)?

What do you think?

minrk · March 30, 2021, 8:50am

Is this because it creates the whole db from scratch every time? I imagine it would be pretty quick to only do inserts on new data, since I would guess the bulk of the time is making an http request for every day since we started collecting data.

If you could:

1. fetch/open an existing db
2. retrieve the date of the last item
3. collect and insert only new events since the last item

then it seems like it wouldn’t be such a big job to run every day or so, since it would only be one or two http requests, a few thousand inserts.
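The three steps could be sketched roughly as follows. Everything specific here is an assumption for illustration: the `binder` table and its columns, and the hypothetical `fetch_day(day)` callable that stands in for one HTTP request to the analytics archive and returns that day's events as dicts. Because the most recent stored day may have been incomplete when it was last fetched, this version drops that day and fetches it again rather than trying to resume mid-day.

```python
import sqlite3
from datetime import date, timedelta


def update_db(conn, fetch_day, today, first_day):
    """Append events newer than what is stored; return the net number of rows added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS binder (timestamp TEXT, provider TEXT, spec TEXT)"
    )
    # Step 2: date of the last stored item (None for an empty database).
    last = conn.execute("SELECT max(timestamp) FROM binder").fetchone()[0]
    day = date.fromisoformat(last[:10]) if last else first_day
    before = conn.execute("SELECT count(*) FROM binder").fetchone()[0]

    # The last stored day may be partial: remove it, then re-fetch it in full.
    conn.execute("DELETE FROM binder WHERE timestamp >= ?", (day.isoformat(),))

    # Step 3: one fetch per missing day, inserting only those events.
    while day <= today:
        conn.executemany(
            "INSERT INTO binder VALUES (:timestamp, :provider, :spec)",
            fetch_day(day),
        )
        day += timedelta(days=1)
    conn.commit()
    after = conn.execute("SELECT count(*) FROM binder").fetchone()[0]
    return after - before


# Fake archive standing in for the real one (invented events).
archive = {
    date(2021, 3, 29): [
        {"timestamp": "2021-03-29T08:00:00", "provider": "GitHub", "spec": "a/b/master"}
    ],
    date(2021, 3, 30): [
        {"timestamp": "2021-03-30T09:00:00", "provider": "GitHub", "spec": "c/d/master"}
    ],
}


def fetch(day):
    return archive.get(day, [])


conn = sqlite3.connect(":memory:")  # Step 1 would open the existing file instead.
first_run = update_db(conn, fetch, today=date(2021, 3, 30), first_day=date(2021, 3, 29))
print("rows added on first run:", first_run)
```

Running it again with nothing new in the archive adds no rows, so a daily job would stay cheap. The date comparison relies on timestamps being ISO 8601 strings, which is also an assumption.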

Could it be something like:

- volume: sqlite database
- run `datasette serve volume/events.db` (ref)
- every day or so runs an update to fetch only new events and add them to events.db (I’m not sure if `datasette serve` would need to restart after these updates. I wouldn’t think so.)

?

betatim · March 30, 2021, 9:12am

I think so. I don’t remember exactly why it is set up like this. My vague memory is that the file size got bigger if I appended instead of recreated? Or maybe I was too lazy to write the “figure out where to resume from” code.

TL;DR: append-instead-of-recreate is the thing to do

I just updated the DB and now the resulting Heroku image is not only over their soft limit (300MB) but also above their hard limit (500MB). So moving to mybinder.org hosting is required if we want to add new events. (We are at about 16M rows now.)

minrk · April 2, 2021, 6:36pm

Now that we’ve got chartpress all set up on mybinder.org-deploy, building an image with datasette and the scraper and adding it to the deployment like our analytics archive shouldn’t be a huge undertaking.
