Running A Cell In A Background Thread

I’m trying to run a long-running cell in a background thread so I can check on it from other cells. If I just run the cell normally, the other cells hang waiting on the long-running cell.

Surprisingly, I can get pretty close to what I want just by running the long-running cell in a background thread. The only issue is that the output from the background thread starts creeping into the output of the other cells I run. It looks like this:

# Long running cell
import threading, time

def network_call():
    for i in range(20):
        print(i)
        time.sleep(1)
    
threading.Thread(target=network_call).start()

0
1
2

# Another cell
print("Output from another cell")

Output from another cell
3
4
5
6

Any tricks I can try that might keep the long-running cell’s output in that cell? Or other workarounds?

I was hoping that, as a workaround for this flaw, you could use the %%capture out magic on the long-running cell to collect its output into a separate stream so it wouldn’t contaminate the other cells’ output. I explain it here; you can just ignore the %store information if you want to display the output in the notebook.

However, when I tested it, I found that %%capture didn’t capture the background output. But you can keep the output of the other cell isolated from the long-running output by using %%capture on the normal cell, and then show the normal cell’s output (the print() in your example) in another cell with:

import sys
sys.stdout.write(out.stdout);
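
Put together, a minimal sketch of that workaround as I understand it (out is just whatever name you pass to %%capture):

%%capture out
# Normal cell: its print output is stored in `out` rather than displayed immediately
print("Output from another cell")


# Another cell: replay the captured output
import sys
sys.stdout.write(out.stdout);  # trailing ; suppresses echoing write()'s return value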

A better approach to try is running the long-running cell in a multiprocessing process; see Python 3 Module of the Week: multiprocessing – Manage Processes Like Threads.
Following your posted example, you’d run:

# Long running cell
import multiprocessing, time

def network_call():
    for i in range(20):
        print(i)
        time.sleep(1)
    
multiprocessing.Process(target=network_call).start()


# Another cell
print("Output from another cell")

This is working in my tests in the classic notebook interface and JupyterLab.
The output from the first cell stays isolated in that cell as it continues to run and doesn’t pollute the other cells. And while the first ‘long-running’ cell keeps running, you can still run the other ‘normal’ cells.
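
If you also want to poll or stop the long-running work from another cell, it may help to keep a reference to the Process object rather than starting it anonymously; a small sketch (the variable name ps is just illustrative):

# Long running cell (variant that keeps a handle on the process)
ps = multiprocessing.Process(target=network_call)
ps.start()


# Another cell
ps.is_alive()    # True while network_call() is still running
# ps.terminate() # ends the process early if needed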

3 Likes

That works PERFECTLY! Thanks!

I’ve generally had better luck using the multiprocessing module in Python, and this is another example where multiprocessing shines. (I couldn’t tell you why.) Fortunately, as noted in that link, the developers of multiprocessing modeled its API on the threading API, so you can pretty much swap one for the other without changing much, which makes it easy to test either.
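
Because the two APIs mirror each other, the swap can be as small as changing which class you construct; a tiny illustration (not from the linked page):

import threading, multiprocessing

def network_call():
    print("working...")

# The constructor and start() call are identical either way:
threading.Thread(target=network_call).start()
# multiprocessing.Process(target=network_call).start()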

Here is a slightly more illustrative example showing the concurrent/interleaved output:

# Long running cell
import multiprocessing, time
t0 = time.time()

def network_call():
    for i in range(5):
        print("Step %i at time %1.1f" % (i, time.time() - t0))
        time.sleep(1)

multiprocessing.Process(target=network_call).start()

Step 0 at time 0.0
Step 1 at time 1.0
Step 2 at time 2.0
Step 3 at time 3.0
Step 4 at time 4.0

print("doing other stuff at time %1.1f" % (time.time() - t0))

doing other stuff at time 1.6

1 Like

A problem with multiprocessing is that the processes have separate memory spaces. This means I have to design inter-process communication rather than being able to quick-and-dirty share the same variables. What I often do during development is run some long-running job in a separate thread and control it via a few global variables. For that it would be awesome to have the long-runner’s output in the right cell. But I’m not sure this is possible at all, because the Jupyter server would have to figure out, at any given moment, which cell’s execution the kernel’s stdout output belongs to.
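
For what it’s worth, a minimal sketch of that thread-plus-global-variables pattern (stop_requested and progress are illustrative names, not from my actual code):

# Long running cell
import threading, time

stop_requested = False   # set to True from another cell to stop the worker
progress = 0             # read from another cell to check on the worker

def worker():
    global progress
    while not stop_requested:
        progress += 1
        time.sleep(1)

threading.Thread(target=worker, daemon=True).start()


# Another cell: threads share the kernel's memory space, so this just works
print(progress)
# stop_requested = True   # uncomment to stop the worker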

But I’m amazed that this works for multiprocessing (where the process can be used for this identification, I guess). I wasn’t aware of this.

1 Like

In my case, the thing I’m kicking off in the long-running cell runs out-of-process anyway, so I’m no worse off introducing additional processes. I’m guessing that even though it runs in a separate process, it blocks the kernel so it can return stdout back to the caller (that’s just a guess, though). In any event, the multiprocessing invocation does exactly what I needed. Obviously, YMMV…

1 Like

I tried this code and am getting an error. This may be caused by changes to the multiprocessing module that have taken effect since others tested it, in particular the switch to ‘spawn’ as the default start method on macOS in Python 3.8.

I tested the code I quoted using Python 3.10.7 and JupyterLab 3.4.8 on macOS 12.6, on both an M1 Mac and an x86 Mac. Relative to the code in this solution, I modified it slightly to make it easy to try different start methods and to have the subprocess write to a file, so I could rule out problems related to having two different output streams in the notebook:

# Long running cell
import multiprocessing as mp
import time
from pathlib import Path

def subprocess_function(p):
    # Write progress to a file instead of printing, to rule out problems
    # with mixing two output streams in the notebook
    for i in range(1000):
        p.write_text("i = " + str(i))
        # print(i)
        time.sleep(1)

# Try 'spawn' (the macOS default since Python 3.8), 'forkserver', or 'fork'
mp.set_start_method('spawn')
ps = mp.Process(target=subprocess_function, args=((Path.home() / "mp_test.txt"), ))
ps.start()

When I run this using ‘spawn’ as the start method I get the following error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Library/Frameworks/gpython.framework/Versions/3.10/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/Library/Frameworks/gpython.framework/Versions/3.10/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'subprocess_function' on <module '__main__' (built-in)>

I get the same error when I use ‘forkserver’ as the start method.

On the other hand, if I change the start method to ‘fork’, this code works and so do the examples cited above.
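
One practical note in case anyone else tries switching methods within the same kernel session (this detail is mine, not part of the original test): set_start_method() raises a RuntimeError if the start method has already been set, so either restart the kernel between tries or pass force=True:

import multiprocessing as mp
mp.set_start_method('fork', force=True)  # force=True allows overriding an already-set start method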

According to the documentation for multiprocessing, this switch to ‘spawn’ was made because:

The fork start method should be considered unsafe as it can lead to crashes of the subprocess. See bpo-33725.

There is a report that the problem arises because of the limitations of ‘pickle’, and that because the multiprocess fork of the multiprocessing library uses ‘dill’ instead, this error can be avoided. But a preliminary look suggests that the key difference is actually that multiprocess has reverted to ‘fork’ as the default start method, even on macOS: https://github.com/uqfoundation/multiprocess/issues/65.
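
If the goal is to keep using ‘spawn’, one approach that seems worth trying (I have not tested it here) is to define the target function in an importable module file rather than in a notebook cell, since ‘spawn’ pickles the target by reference and the child process has to re-import it; the file name worker_module.py below is hypothetical:

# worker_module.py -- a separate file on disk, importable by the spawned child
import time

def subprocess_function(p):
    for i in range(1000):
        p.write_text("i = " + str(i))
        time.sleep(1)


# Notebook cell
import multiprocessing as mp
from pathlib import Path
import worker_module  # the hypothetical module saved next to the notebook

mp.set_start_method('spawn', force=True)
ps = mp.Process(target=worker_module.subprocess_function,
                args=(Path.home() / "mp_test.txt",))
ps.start()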

I have not tried to reproduce the crash of the subprocess on my more recent version of macOS. One report (https://github.com/python/cpython/issues/77906#issuecomment-1093788360) suggests that it is an intermittent problem and expresses pessimism that Apple will make the change to macOS that would be required to fix it.

It might be worth trying to confirm that the problem is still present in the current version of macOS. The fact that people using multiprocess have not objected to the reversion to the ‘fork’ method may suggest that the crash is rare; or that it has been fixed; or that people who use multiprocess work mainly on Windows. Does anyone have more recent information about the status of this problem with macOS?

Unless there is other evidence that the underlying issue with macOS has been resolved, the safe course would presumably be to stick with the ‘spawn’ method on that operating system and to look for some way to avoid the error I report above. Does anyone have any suggestions about what to try?

1 Like