Confusion about the db instance

JupyterHub uses a single session instance throughout the application life cycle. This makes sense for a synchronous application, but JupyterHub is built on Tornado, which is asynchronous.

As far as I know, in Tornado it’s better to create a session when a request comes in and close it when that request finishes. That way each request keeps its own session instance and requests never get mixed up with each other.
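For example, that per-request pattern could look roughly like this (just a rough sketch, not JupyterHub code; the handler name and Session factory are only for illustration):

import tornado.web
from sqlalchemy.orm import sessionmaker

Session = sessionmaker()  # assumed to be bound to an engine elsewhere


class PerRequestHandler(tornado.web.RequestHandler):

    def prepare(self):
        # open a fresh session for this request only
        self.db = Session()

    def on_finish(self):
        # close it when the request finishes, so sessions are never shared
        self.db.close()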

Here is a tiny script that demonstrates what I mean:

import json

import tornado.gen
import tornado.ioloop
import tornado.web
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from jupyterhub import orm

# placeholder credentials; assumes the JupyterHub tables already exist in hub_test
engine = create_engine('postgresql+psycopg2://xxxx:xxxx@localhost:5432/hub_test')
Session = sessionmaker(bind=engine)


class MainHandler(tornado.web.RequestHandler):

    @property
    def db(self):
        return self.settings['db']

    async def get(self):
        delay = self.get_argument('sleep', 0)
        name = self.get_argument('name')
        new_user = orm.User(name=name)
        self.db.add(new_user)

        # executing some heavy task
        await tornado.gen.sleep(int(delay))

        # users pending in the session for this request
        new_instances = self.db.new

        self.db.commit()
        self.write(json.dumps(
            list(map(lambda u: u.name, new_instances))
        ))


def make_app():
    settings = dict(
        db=Session()
    )
    return tornado.web.Application([
        (r"/", MainHandler),
    ], **settings)


if __name__ == "__main__":
    app = make_app()
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()

After starting it up, execute the following commands in sequence:

curl "http://localhost:8888/?name=JoJo1&sleep=10"    # output [] after 30 after 10 seconds
# after 2 seconds
curl "http://localhost:8888/?name=JoJo2&sleep=1"     # output ["JoJo1", "JoJo2"] after 1 seconds

This is where the mess occurs: both JoJo1 and JoJo2 end up committed by the second request, while the first request appears to have added nothing.

I’m not sure whether this is a problem for JupyterHub; can anybody explain it? Thanks.

Does anybody know anything about this question?

Great question! There’s been work related to this in the past. These PRs and issues linked from them might be interesting to you:

  1. https://github.com/jupyterhub/jupyterhub/pull/85
  2. https://github.com/jupyterhub/jupyterhub/pull/1291
  3. https://github.com/jupyterhub/jupyterhub/pull/1809

We’ve ended up in a somewhat stable state at the moment, but it’s always useful to keep digging. Hope this is useful!

I’m curious whether anyone has ever tried using tornado-sqlalchemy with JupyterHub. I’ve been reading through this old issue because we’re experiencing scaling problems with the hub as well, and they seem to be rooted in the database ORM layer and the fact that there is a single session for all requests.

I thought maybe our max_connections setting for the PostgreSQL database was too low (115), but when looking at actual connections to the hub DB in our production cluster, there are only 5 because that’s the default in the SQLAlchemy QueuePool. Or am I misunderstanding, and people with larger hubs commonly pass a larger pool_size through the db_kwargs config?

For an idea of target numbers, we’re trying to support up to 2000 active users/notebooks (pods) on our hub at the same time. So far we’ve had up to 800 after changing c.JupyterHub.activity_resolution, and I’m doing stress testing to see how far beyond that we can get before the hub starts to crash.
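In case it helps to be concrete, passing a larger pool through db_kwargs would look roughly like this (a sketch for jupyterhub_config.py; the URL and numbers are placeholders, not tuned recommendations):

# jupyterhub_config.py
c.JupyterHub.db_url = 'postgresql+psycopg2://hub:xxxx@localhost:5432/hub_prod'
c.JupyterHub.db_kwargs = {
    'pool_size': 20,      # SQLAlchemy QueuePool default is 5
    'max_overflow': 40,   # default is 10
}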

This single session is used as an optimization (local caches mean fewer repeated queries for already-fetched state), but it is also one of the biggest impediments to running the multiple instances needed for HA deployments. To use per-handler Sessions, we would need to eliminate long-lived database-referencing objects (mainly User and Spawner). That’s a big change! It would also make certain operations noticeably more expensive, since it would mean re-fetching a lot of data from the database that we currently tend to have in memory. That would already be a requirement if we ever had another process writing to the database and invalidating our cache.

Since our use of async/await means “cooperative” multitasking, the main thing we really have to do to ensure consistent database state is to never do this:

        self.db.add(new_user)
        # executing some heavy task
        await tornado.gen.sleep(int(delay))
        self.db.commit()

that is, have an await between preparing a transaction and committing it. As long as we always commit any changes before we await and let another handler start working with the database, we stay consistent.
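In the toy handler from earlier, that would mean committing before the sleep, roughly like this (the same sketch reordered, not actual JupyterHub code):

        self.db.add(new_user)
        # commit before yielding control, so no other handler sees a half-finished transaction
        self.db.commit()

        # executing some heavy task
        await tornado.gen.sleep(int(delay))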

Just to follow up on this: I did some digging into the max DB connections used in our production Hub database, and in the last two weeks it hasn’t gone over 7. The default connection pool size in SQLAlchemy is 5, with a default max_overflow of 10, so we have at most 15 connections to give out from the pool before connection requests start waiting. Since we’re nowhere near that 15-connection limit, increasing the pool_size likely wouldn’t help us.
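If it’s useful for anyone checking the same thing, the SQLAlchemy pool can report its own usage; something like this rough sketch (placeholder URL, illustrative defaults):

from sqlalchemy import create_engine

engine = create_engine(
    'postgresql+psycopg2://hub:xxxx@localhost:5432/hub_prod',
    pool_size=5,      # QueuePool default
    max_overflow=10,  # default; 5 + 10 = 15 connections before checkouts start waiting
)
# prints a one-line summary of pool size, overflow, and checked-out connections
print(engine.pool.status())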

Our last big event was at the beginning of May, and we have another one coming up next week, so I guess we’ll see if those numbers change. So far, though, it seems as though the defaults are sufficient for us.