A datasette of mybinder.org launches

I wanted to explore the data about launches on mybinder.org that we collect in our analytics. I also wanted to learn about datasette. As a result there is now https://binderlytics.herokuapp.com/binder-launches/binder

It contains the last 90 days of data, about 1.6 million rows.

Please share interesting things you find, queries you use, things that look a bit weird. Here are some I used:

The notebook used to create the DB is https://github.com/betatim/binder-datasette/blob/fcbf0fc9d468fc46aadde8c7cf762964ac589c1a/create-db.ipynb. I will probably re-run it with more than the last 90 days. Maybe even all ~500 days of data. Resulting in a dataset of about 7-8 million rows!


Now with ~4.4 million launches or 270 days of history.

This is fabulous. Thank you @betatim.


Graph of launches by provider per day (warning: it takes a while for the graph to load). Not surprisingly it’s dominated by GitHub. I couldn’t find a way to display log10(count(provider)). You can easily distinguish weekdays from weekends though.
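In case it helps anyone tinkering offline: the per-provider-per-day counts can be pulled out and log-scaled in Python, since SQLite itself has no built-in log10() SQL function. This is just a sketch — the table and column names (`binder`, `provider`, `timestamp`) are my assumptions based on the URLs above:

```python
import math
import sqlite3

# Hypothetical miniature of the binder-launches DB (names are guesses).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE binder (provider TEXT, timestamp TEXT)")
conn.executemany(
    "INSERT INTO binder VALUES (?, ?)",
    [("GitHub", "2019-03-01")] * 1000 + [("Git", "2019-03-01")] * 3,
)

# Count launches per provider per day; take log10 afterwards in Python.
rows = conn.execute(
    """
    SELECT provider, date(timestamp) AS day, count(*) AS n
    FROM binder
    GROUP BY provider, day
    ORDER BY n DESC
    """
).fetchall()

for provider, day, n in rows:
    print(provider, day, round(math.log10(n), 2))
```

With real data you’d chart the `log10` column instead of the raw count, which keeps GitHub from flattening every other provider.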

Least popular GitHub repos (only 1 launch ever)


The chart is super cool. Even super cooler is that the link contains all the info so it just works when I click it :slight_smile:

Based on your “least popular repos”: a list of “somewhat popular repos” (more than 5 but less than 50 launches)
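Both popularity buckets boil down to `GROUP BY ... HAVING` queries. A self-contained sketch — the `spec` column name is a guess at how the launched repo is identified:

```python
import sqlite3

# Tiny stand-in for the launches table (schema is assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE binder (spec TEXT)")
launches = ["a/repo1"] * 1 + ["b/repo2"] * 7 + ["c/repo3"] * 60
conn.executemany("INSERT INTO binder VALUES (?)", [(s,) for s in launches])

# Repos with exactly one launch ever:
one_hit = conn.execute(
    "SELECT spec FROM binder GROUP BY spec HAVING count(*) = 1"
).fetchall()

# "Somewhat popular": more than 5 but fewer than 50 launches:
somewhat = conn.execute(
    "SELECT spec, count(*) FROM binder GROUP BY spec "
    "HAVING count(*) > 5 AND count(*) < 50"
).fetchall()
print(one_hit, somewhat)
```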

I’ve updated the datasette. It now contains all of our historical analytics events. All 7.6 million of them.


I will try to set up a GitHub Action that runs every day on https://github.com/betatim/binder-datasette to try and keep the binderlytics page updated.

It is crazy to think that we will sooner or later reach 10 million launches!

A new query showing the number of launches per week.
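A per-week query like that can be written with SQLite’s `strftime()` week format. A sketch, assuming the timestamps are ISO-8601 strings (column name is a guess):

```python
import sqlite3

# Miniature launches table with an assumed `timestamp` column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE binder (timestamp TEXT)")
conn.executemany(
    "INSERT INTO binder VALUES (?)",
    [("2019-04-01T12:00:00",)] * 4 + [("2019-04-08T12:00:00",)] * 2,
)

# strftime('%Y-%W') buckets rows by year + week number (Monday-based).
per_week = conn.execute(
    """
    SELECT strftime('%Y-%W', timestamp) AS week, count(*) AS launches
    FROM binder
    GROUP BY week
    ORDER BY week
    """
).fetchall()
print(per_week)
```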

We are getting close to 30k per day!

I’ve updated https://binderlytics.herokuapp.com/binder-launches/binder. It now contains ~8,000,000 launch events.

mybinder.org is ticking along at roughly 140,000 launches a week. So you still have some time (14-15 weeks) to get your favourite drink chilled for the 10M celebration.
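(For anyone checking the arithmetic behind that estimate:)

```python
# ~8M launches so far, ~140k launches/week, target 10M.
current = 8_000_000
rate_per_week = 140_000
weeks_left = (10_000_000 - current) / rate_per_week
print(round(weeks_left, 1))  # ≈ 14.3 weeks
```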


Binderlytics seems to be down :cry:


Heroku is saying that “the app crashed”, which is not super insightful :wink: The last deploy was in January so it seems weird that it would start crashing now.

I’ll look into it.

Longer term we need some smart ideas because the size of the database file is pushing various limits for free heroku instances. Maybe we can host it on mybinder.org infrastructure?

Looks like redeploying it did the trick. Let me know if it has started working for you as well.

I’d like that very much! Along with the analytics archive maybe?


Right now I run datasette deploy heroku ... and it magically figures out all the things. Building the sqlite database from the analytics archive takes quite a while on my laptop (I start it and then leave it until, a few hours later, I notice it is done). All this made me think about what the easiest way to build and deploy it would be. Maybe with a bit of tweaking we can make the notebook in GitHub - betatim/binder-datasette faster at creating the DB. Then we could build an image for this service like we do for the analytics archive and federation redirector images (via a GH action on mybinder.org-deploy)?

What do you think?

Is this because it creates the whole db from scratch every time? I imagine it would be pretty quick to only do inserts on new data, since I would guess the bulk of the time is making an http request for every day since we started collecting data.

If you could:

  • fetch/open an existing db
  • retrieve the date of the last item
  • collect and insert only new events since the last item

then it seems like it wouldn’t be such a big job to run every day or so, since it would only be one or two http requests, a few thousand inserts.

Could it be something like:

  1. volume: sqlite database
  2. run datasette serve volume/events.db
  3. every day or so runs an update to fetch only new events and add them to events.db (I’m not sure if datasette serve would need to restart after these updates. I wouldn’t think so.)
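That incremental flow could look something like this sketch. The schema and the `fetch_events_for_day()` helper are hypothetical stand-ins for the real one-request-per-day analytics archive:

```python
import sqlite3
from datetime import date, timedelta

def fetch_events_for_day(day):
    # Placeholder: the real code would fetch that day's events from
    # the mybinder.org analytics archive over HTTP.
    return [(day.isoformat() + "T00:00:00", "GitHub", "example/repo")]

def update(db_path, today):
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS binder "
        "(timestamp TEXT, provider TEXT, spec TEXT)"
    )
    # 1. retrieve the date of the last item already in the db
    row = conn.execute("SELECT max(date(timestamp)) FROM binder").fetchone()
    last = date.fromisoformat(row[0]) if row[0] else today - timedelta(days=1)
    # 2. collect and insert only events newer than that
    day = last + timedelta(days=1)
    while day <= today:
        conn.executemany("INSERT INTO binder VALUES (?, ?, ?)",
                         fetch_events_for_day(day))
        day += timedelta(days=1)
    conn.commit()
    return conn.execute("SELECT count(*) FROM binder").fetchone()[0]

print(update(":memory:", date(2019, 6, 3)))
```

Run daily, this only ever touches the days since the last insert, so the expensive full rebuild disappears.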


I think so. I don’t remember exactly why it is set up like this. My vague memory is that the file size got bigger if I appended instead of recreated? Or maybe I was too lazy to write the “figure out where to resume from” code.

TL;DR: append-instead-of-recreate is the thing to do :+1:

I just updated the DB and now the resulting heroku image is not only over their soft limit (300MB) but also above their hard limit (500MB). So moving to mybinder.org hosting is required if we want to add new events :smiley: (We are at about 16M rows now)

Now that we’ve got chartpress all set up on mybinder.org-deploy, building an image with datasette and the scraper and adding it to the deployment like our analytics archive shouldn’t be a huge undertaking.
