Persist auth_state between login attempts

I am trying to integrate JupyterHub with Shibboleth by writing my own authenticator for SAML-based authentication. There is a SAML authenticator out there, but it’s not robust enough for my needs. One of the things I want to do is prevent replay attacks. To do this, I need to record the specific ID of a SAML authorization message at login, and then make sure that I don’t allow any SAML authorization message with that ID to log in again. However, I’m having a hard time figuring out how to do this.

I figure it has to be somehow related to auth_state, but I can’t figure out how to access auth_state from a previous login in a new login attempt. I basically need a data structure where I can quickly check to see if the message ID is in the data structure, and if it is, fail authentication. I also intend to record timestamps for each of the message IDs so that I can drop the IDs after a reasonable period of time. I’m not too worried about this persisting past restarts of the hub, so I don’t need it to be stored permanently.

Is there any way to do this? I apologize if this is obvious, but it hasn’t been to me.

Thanks for your help!

Have you checked the source code of the authenticator? Have you considered adjusting the code yourself if necessary? I haven’t worked with it, but if this isn’t already provided, it should be easy to store that information yourself in a database of your choice. Or am I missing something?

Thanks for your response! To be clear, I’m writing a custom authenticator. There is an existing authenticator that implements SAML authentication (written by an individual, not written or supported by the Jupyter foundation), but it does not do so in a way that meets my needs, and it does not protect against replay attacks. I could certainly update the code, but it will honestly be easier just to write my own authenticator.

My question is about persisting information between calls to the authenticator. What is the best practice for this? Specifically, I need to keep track of which message IDs I have already seen and reject any message that reuses a message ID, to prevent replay attacks. I’m actually not sure if the authenticator is re-instantiated every time someone logs in. If it’s not, then I can create a data structure within the authenticator that keeps track of things. If it is, then the hub itself somehow needs to keep track of things, and it’s not clear to me how to use hub data structures from within the authenticator.

Is there a best practice around writing a custom authenticator with “memory”?

After some experimentation, it appears that a new authenticator instance is not created for each login attempt. Therefore, I can just use a new attribute on the authenticator to keep track of previous message IDs. I can also persist these across restarts of the hub by using the hub database. After going through the code for JupyterHub, it appears that the database is exposed to the authenticator as self.db.
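For anyone landing here later, a minimal sketch of the in-memory approach. ReplayCache, check_and_record, and the 300-second default are hypothetical names and values chosen for illustration, not part of JupyterHub; the injectable clock is only there to make the expiry testable.

```python
import time


class ReplayCache:
    """Remember recently seen message IDs and reject repeats (hypothetical helper)."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._seen = {}  # message ID -> time it was first seen

    def _purge(self, now):
        # Drop IDs older than the TTL so the dict cannot grow without bound.
        expired = [
            message_id
            for message_id, seen_at in self._seen.items()
            if now - seen_at > self.ttl_seconds
        ]
        for message_id in expired:
            del self._seen[message_id]

    def check_and_record(self, message_id):
        """Return True if the ID is new (and record it), False if it is a replay."""
        now = self._clock()
        self._purge(now)
        if message_id in self._seen:
            return False
        self._seen[message_id] = now
        return True
```

The authenticator would create one ReplayCache when it is set up and fail the login (authenticate returns None) whenever check_and_record returns False. The purge here scans every entry on each call, which is fine for modest login rates; the TTL should presumably be at least as long as the validity window of the SAML messages.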

I think that answers my questions, though I would love someone more experienced to chime in about best practices. However, I wanted to post this follow up in case anyone stumbles on this question.

I would be very careful about keeping information in memory when several users use that data structure in parallel. At some point you risk running out of memory in a way that is hard to track down, especially during an attack. Hence, I would prefer a database, which automatically manages what is kept in memory and what is written to disk. That way you can also examine an attack after your system has crashed and you have carefully restarted the machine. Databases usually ensure that data is not lost.

I think you’ve found your answers, but I’ll chime in to confirm that you are exactly right:

  1. One Authenticator instance lives for the duration of the JupyterHub process, so storing short-term state in attributes is totally reasonable, and
  2. the database is accessible as self.db and is the place JupyterHub stores long-term information. It’s okay to piggy-back on this, or you can maintain your own state in a separate file, e.g. a simple sqlite database of your own or even a JSON file.
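As a rough illustration of the "simple sqlite database of your own" option, here is a sketch that uses only the standard library. SqliteReplayStore, the seen_messages table, and the method names are all hypothetical; it relies on a PRIMARY KEY constraint so that inserting an already-seen ID fails.

```python
import sqlite3
import time


class SqliteReplayStore:
    """Persist seen message IDs in a separate SQLite file (hypothetical helper)."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS seen_messages ("
            "message_id TEXT PRIMARY KEY, seen_at REAL NOT NULL)"
        )
        self._conn.commit()

    def check_and_record(self, message_id, now=None):
        """Return True if the ID is new (and store it), False if it is a replay."""
        now = time.time() if now is None else now
        try:
            with self._conn:  # commits on success, rolls back on error
                self._conn.execute(
                    "INSERT INTO seen_messages (message_id, seen_at) VALUES (?, ?)",
                    (message_id, now),
                )
        except sqlite3.IntegrityError:
            # The primary key already exists: this ID was seen before.
            return False
        return True

    def purge_older_than(self, max_age_seconds, now=None):
        """Delete IDs older than max_age_seconds; return how many were removed."""
        now = time.time() if now is None else now
        with self._conn:
            cursor = self._conn.execute(
                "DELETE FROM seen_messages WHERE seen_at < ?",
                (now - max_age_seconds,),
            )
        return cursor.rowcount
```

Wall-clock time is used here rather than a monotonic clock because the timestamps need to stay meaningful across hub restarts. Passing a real file path instead of ":memory:" is what makes the history survive a restart.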

I’ll add that auth_state is the field where authenticators are expected to store extensible information about users in the database. This is designed in such a way that Authenticators do not need to access the db directly. If, on completion of authentication, you return

{
  "name": "myuser",
  "auth_state": {"key": "value"},
}

that state dict will be persisted in an encrypted column in the database, and will be accessible during spawn, etc. This is meant for things like passing auth tokens to Spawners.

However, auth_state was not designed in a way that makes it easy for the Authenticator itself to check during authentication, because the Authenticator does not have access to the high-level wrappers of the User data in the database. Currently you have to do something like:

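async def authenticate(self, handler, data):
    app = self.parent
    # "username" is an assumed field; a SAML authenticator would take the name from the assertion
    username = self.normalize_username(data["username"])
    try:
        user = app.users[username]
    except KeyError:
        # first-time login, user not defined yet
        user = None
        auth_state = None
    else:
        auth_state = await user.get_auth_state()
    # fall back to an empty history when there is no stored auth_state
    message_history = (auth_state or {}).get("message_history", [])
    ...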

This has come up a few times, though, and I think we should probably make accessing auth-state during authentication a first-class supported activity. Would you mind opening an Issue about this use case?

Small comment on the code: The KeyError was designed for a real error case, when something unintended happens. Otherwise, different constructs such as if key in my_dict or my_dict.get(key) can be used.

Therefore, wouldn’t it be cleaner to phrase the code like this?


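async def authenticate(self, handler, data):
    app = self.parent
    # "username" is an assumed field; a SAML authenticator would take the name from the assertion
    username = self.normalize_username(data["username"])
    if username in app.users:
        user = app.users[username]
        auth_state = await user.get_auth_state()
        # get_auth_state() may return None, so guard before calling .get()
        message_history = (auth_state or {}).get("message_history", [])
    else:
        user = None        # if these are needed at all?
        auth_state = None  # if these are needed at all?
    ...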

Or did I miss something?

Not quite, because the UserDict is a somewhat weird dict subclass which contains a cache of User wrapper objects that map to the active subset of users in the database. Using in only checks whether the User object is in the cache, and a miss there doesn’t necessarily mean that the user doesn’t exist in the database. Actually attempting to access the user object ensures retrieval from the database.
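To make the cache-versus-database distinction concrete, here is a toy model of that behaviour. It is not JupyterHub's actual UserDict; CachingUserDict and its backend mapping are hypothetical stand-ins, with a plain dict playing the role of the database.

```python
class CachingUserDict(dict):
    """Toy model: the dict itself is the cache, `backend` plays the database."""

    def __init__(self, backend):
        super().__init__()
        self.backend = backend  # name -> record; a stand-in for database rows

    def __missing__(self, key):
        # dict.__getitem__ calls this on a cache miss.
        record = self.backend[key]  # raises KeyError if the user truly does not exist
        self[key] = record          # remember it for next time
        return record


users = CachingUserDict({"alice": {"name": "alice"}})

in_cache_before = "alice" in users  # consults the cache only
record = users["alice"]             # item access falls through to the backend
in_cache_after = "alice" in users   # the record is cached now
```

In this toy model, in reports the user as absent until the first item access has loaded them, which is why the try/except form is the safer way to ask "does this user exist?".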

What I would expect to work, but currently doesn’t, is user = app.users.get(username, None), because get wasn’t implemented in the users dict. This is just a bug that I found when first writing this answer.

“The KeyError was designed for a real error case”

I also don’t think this is quite true. It’s a pretty common pattern to attempt access and catch key errors instead of checking presence and then accessing, which redundantly checks presence again. In dict-like proxy objects, this can mean multiple database accesses. Raising KeyError doesn’t mean a mistake was made, it just means the item isn’t present. get would be the most common way to do this, though, I think, and what I would use after I fix get in the UserDict.
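The redundant-check point can be shown with a small counting proxy. This is only a sketch: CountingProxy and the two helper functions are hypothetical, and each counted lookup stands in for what could be a database access in a real dict-like proxy.

```python
class CountingProxy:
    """Dict-like proxy that counts how often the underlying store is consulted."""

    def __init__(self, rows):
        self._rows = rows
        self.lookups = 0

    def __contains__(self, key):
        self.lookups += 1
        return key in self._rows

    def __getitem__(self, key):
        self.lookups += 1
        return self._rows[key]


def look_before_you_leap(proxy, key):
    # Checks presence, then accesses: two lookups when the key exists.
    if key in proxy:
        return proxy[key]
    return None


def ask_forgiveness(proxy, key):
    # Attempts access and handles absence: always a single lookup.
    try:
        return proxy[key]
    except KeyError:
        return None


lbyl = CountingProxy({"alice": 1})
look_before_you_leap(lbyl, "alice")

eafp = CountingProxy({"alice": 1})
ask_forgiveness(eafp, "alice")
```

After running this, lbyl.lookups is 2 and eafp.lookups is 1 for a key that exists; for a missing key both forms cost one lookup.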

Thank you for the insights!

@minrk and @1kastner, thank you for your responses and insights! @minrk Your information has been super helpful in figuring out how to proceed with writing the authenticator.

I’m happy to open an issue describing my use case, but it may take me a couple of days to find the time.

@minrk I just submitted an issue. Thanks for your help!