Extract specific cells from students' notebooks

I released a Jupyter notebook to my students in which they need to fill in some of the cells with their answers. I am also using nbgrader to grade their answers automatically.

What I would like to do is export specific cells for each student and run a plagiarism check using the Urkund app. However, I am not sure whether it is possible to extract, for instance, the same cell from each student's notebook submission.

Is there a way to achieve this?

Well, it might be easiest to just parse the JSON content of the Jupyter Notebook file yourself with a programming language of your choice.

Hey, sounds cool, but is there a library I can use to do that? For instance, how can I export the JSON from the .ipynb file?

The JSON content and the .ipynb file are one and the same; what differs is how you access them. You can open the file as a notebook in Jupyter, open it in a text editor, or read it in as a file stream in Python, etc. The latter two would let you see the JSON directly.
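To make that concrete, here is a minimal sketch that reads an .ipynb file with nothing but Python's standard `json` module. The notebook content and the helper name `read_cells` are made up for illustration; a real submission would simply be opened from its path.

```python
import json
import tempfile
from pathlib import Path

# A tiny stand-in for a student's notebook (hypothetical content).
notebook_dict = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "id": "intro", "metadata": {},
         "source": ["# Question 1\n", "Explain your answer."]},
        {"cell_type": "code", "id": "answer", "metadata": {},
         "execution_count": None, "outputs": [],
         "source": ["x = 1 + 1\n", "print(x)"]},
    ],
}


def read_cells(path):
    """Return (cell_type, source_text) pairs from an .ipynb file."""
    with open(path, encoding="utf-8") as handle:
        raw = json.load(handle)
    cells = []
    for cell in raw["cells"]:
        source = cell["source"]
        # On disk, "source" may be a list of lines or a single string.
        if isinstance(source, list):
            source = "".join(source)
        cells.append((cell["cell_type"], source))
    return cells


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.ipynb"
    path.write_text(json.dumps(notebook_dict), encoding="utf-8")
    cells = read_cells(path)

for cell_type, source in cells:
    print(cell_type, repr(source))
```

Having to join the `source` lines by hand is the kind of detail a notebook-aware library takes care of for you.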

I would actually suggest going one level up from the raw JSON by using nbformat to access the notebook elements and then the cell content you want. It already handles parsing notebooks. nbformat lets you access the cell contents as strings, so you don’t have to worry about the encoding that may be in the JSON. Search this forum for ‘nbformat’ to find a number of examples of using it that I’ve linked to. In particular, this rather complex-looking example links to some of my uses of it.

Your implementation would just include the reading part (see mainly the first few lines here), and you’d then be able to parse the cells in a number of ways to extract the particular ones and the content you want.

Alternatively, you could use Jupytext to convert the .ipynb file to a Python script. That would then allow you to access the content you want if you further parsed the .py file using Python, reading it in as you would a text file.
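As a rough sketch of that last route: in Jupytext's percent format, each cell starts with a `# %%` marker line, and markdown cells are marked `# %% [markdown]`. The script text below is a made-up stand-in for what `jupytext --to py:percent notebook.ipynb` would write, and `split_percent_cells` is a hypothetical helper, not part of Jupytext. It assumes no code line itself begins with `# %%`.

```python
# Hypothetical text of the kind a percent-format conversion produces.
script_text = '''\
# %% [markdown]
# # Question 1
# Explain your answer.

# %%
x = 1 + 1
print(x)

# %% [markdown]
# # Question 2

# %%
y = x * 3
'''


def split_percent_cells(text):
    """Split percent-format text into (cell_type, source) pairs."""
    cells = []
    current_type, current_lines = None, []
    for line in text.splitlines():
        if line.startswith("# %%"):
            # A marker closes the previous cell (if any) and opens a new one.
            if current_type is not None:
                cells.append((current_type, "\n".join(current_lines).strip()))
            current_type = "markdown" if "[markdown]" in line else "code"
            current_lines = []
        elif current_type is not None:
            # Lines before the first marker (e.g. a header) are ignored.
            current_lines.append(line)
    if current_type is not None:
        cells.append((current_type, "\n".join(current_lines).strip()))
    return cells


cells = split_percent_cells(script_text)
code_cells = [source for cell_type, source in cells if cell_type == "code"]
print(code_cells)
```

Note that this route drops the cell metadata unless the conversion is configured to keep it, so it suits cases where you only need the text of the cells.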

Which one of those three you choose sort of depends on what is the cell content you are looking for and how easy it is to find the hooks you need to extract what will work for your downstream uses.

And I could easily see what you describe in your post as a Snakefile that lets you run a Snakemake pipeline to do that process for students’ notebooks. The advantage there is that Snakemake defines recipes to process each file all the way through the pipeline, and you can add files later without running the steps on all the input files again, just the new ones. That may be nice since your students may not hand things in all at the same time. And even if they do, certain notebooks may require some massaging to get them to process right, and that way you aren’t running all the steps again on the ones that worked on the first pass. If that seems unclear or unfamiliar, I’d be glad to help you more.


It works nicely and I have access to the cells. Now, it seems that I can filter them by only two types, markdown or code. However, since I used nbgrader, my cells have more options (read-only, automatic-answer, automatic-test, and so forth). Is there a way to access these types or, alternatively, to access cell IDs?

Is there also a way in nbformat to filter a cell based on a string and convert it back to a cell?

Example for nbgrader options is below. To run this, launch a session from here by clicking on the launch binder badge. Open a new notebook. Paste the following into a cell and run it:

import os
import nbformat as nbf

notebook_example = "4%20-%20Manipulating%20and%20Plotting%20Data.ipynb"
# Download the demo notebook only if it is not already present
if not os.path.isfile(notebook_example):
    !curl -OL https://raw.githubusercontent.com/jhamrick/nbgrader-demo/master/instructor/source/ps1/4%20-%20Manipulating%20and%20Plotting%20Data.ipynb
ntbk = nbf.read(notebook_example, nbf.NO_CONVERT)
for cell in ntbk.cells:
    # nbgrader stores its per-cell settings under the "nbgrader" metadata key
    if "nbgrader" in cell.metadata:
        if cell.metadata.nbgrader.grade:
            # Graded cell: flag it and show its source
            print("\n\n\nWAIT. THIS CELL IS DIFFERENT:\n---------------------------\n")
            print(cell.source)
        else:
            print("for this cell: 'nbgrader', {'grade': False,")

It will indicate most cells have a ‘grade’ of False and print the source code of the few cells that aren’t. The ‘few cells’ seemed to be three when I ran it.
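Applied to the original question, the same metadata check can pull one particular cell out of every submission. Below is a minimal sketch using only the standard library. It assumes the cell ID set in nbgrader's toolbar is stored under `grade_id` in the `nbgrader` metadata, and that submissions sit in a `submitted/<student>/<assignment>/` folder layout; the student names, answers, file names, and the helper names `make_notebook` and `collect_cell` are all made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def make_notebook(answer_text):
    """Build a minimal notebook dict with one nbgrader answer cell (made-up content)."""
    return {
        "nbformat": 4, "nbformat_minor": 5, "metadata": {},
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": "Question 1"},
            {
                "cell_type": "markdown",
                "metadata": {"nbgrader": {"grade_id": "q1_answer",
                                          "grade": True, "solution": True}},
                "source": answer_text,
            },
        ],
    }


def collect_cell(submitted_dir, grade_id):
    """Map each student folder name to the source of the cell with the given grade_id."""
    answers = {}
    for path in sorted(Path(submitted_dir).glob("*/*/*.ipynb")):
        student = path.relative_to(submitted_dir).parts[0]
        notebook = json.loads(path.read_text(encoding="utf-8"))
        for cell in notebook["cells"]:
            nbgrader_meta = cell.get("metadata", {}).get("nbgrader", {})
            if nbgrader_meta.get("grade_id") == grade_id:
                source = cell["source"]
                # "source" may be a list of lines or a single string on disk.
                answers[student] = "".join(source) if isinstance(source, list) else source
    return answers


with tempfile.TemporaryDirectory() as tmp:
    submitted = Path(tmp) / "submitted"
    samples = [("alice", "Entropy measures disorder."),
               ("bob", "Entropy is about disorder.")]
    for student, text in samples:
        folder = submitted / student / "ps1"
        folder.mkdir(parents=True)
        (folder / "problem1.ipynb").write_text(json.dumps(make_notebook(text)),
                                                encoding="utf-8")
    answers = collect_cell(submitted, "q1_answer")

print(answers)
```

Each entry of `answers` could then be written to its own text file for upload to the plagiarism checker.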

You can also access the cells by their overall position or by execution_count. Unfortunately the nbgrader demo above didn’t have executed cells, so you’ll need to run this in a different cell to get another notebook. This will show the source for the 13th cell overall and for the cell that was executed 13th when the notebook was run.

import os
import nbformat as nbf

notebook_example = "Working%20with%20PDBsum%20in%20Jupyter%20Basics.ipynb"
# Download the example notebook only if it is not already present
if not os.path.isfile(notebook_example):
    !curl -OL https://raw.githubusercontent.com/fomightez/pdbsum-binder/main/notebooks/Working%20with%20PDBsum%20in%20Jupyter%20Basics.ipynb
ntbk = nbf.read(notebook_example, nbf.NO_CONVERT)
print("13th cell in notebook:")
print(ntbk.cells[12].source)  # the cell list is zero-indexed
for cell in ntbk.cells:
    # Only code cells carry an execution_count
    if 'execution_count' in cell and (cell.execution_count == 13):
        print("\n\n13th cell executed in notebook:")
        print(cell.source)

Yes. The ideas demonstrated in this example can be modified to do that. In that example the cells are examined and just the ones that are markdown are kept. You could instead examine the source of the cell and if it contains a string, you do something to that cell and then append the modified version to the list of cells. All other cells just get appended to the list of cells.
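A minimal sketch of that approach follows, built on an in-memory notebook so it runs on its own. The marker string and the cell contents are made up, and dropping the marker line is just one arbitrary example of "doing something" to a matching cell.

```python
import nbformat
from nbformat.v4 import new_code_cell, new_markdown_cell, new_notebook

# A small in-memory notebook standing in for a student's submission (made-up content).
ntbk = new_notebook(cells=[
    new_markdown_cell("# Question 1"),
    new_code_cell("# YOUR CODE HERE\nanswer = 42"),
    new_code_cell("print(answer)"),
])

marker = "# YOUR CODE HERE"  # hypothetical string to search for
new_cells = []
for cell in ntbk.cells:
    if marker in cell.source:
        # Modify the matching cell; here the marker line is simply dropped.
        kept = [line for line in cell.source.splitlines() if marker not in line]
        cell.source = "\n".join(kept)
    # Matching or not, every cell is appended to the new list.
    new_cells.append(cell)

new_ntbk = new_notebook(cells=new_cells, metadata=ntbk.metadata)
nbformat.validate(new_ntbk)
# JSON text of the result; nbformat.write(new_ntbk, "out.ipynb") would save it to a file.
text = nbformat.writes(new_ntbk)
print(new_ntbk.cells[1].source)
```

Because the cells stay notebook cell objects the whole time, there is no separate step to "convert back": the modified cell keeps its type and metadata.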
