Segmentation Fault

I am using a supercomputer at our organization to train an AI model. The supercomputer has 4 GPU nodes. Since I am a beginner, I used the following command from the login node to run a Jupyter notebook on one of the GPU nodes:
$srun --partition=gpu --pty --nodelist=hpc-node-03 jupyter notebook --ip=0.0.0.0

I encountered the following two fault scenarios:

  1. While running programs in the Jupyter notebook for a long time, I get segmentation fault (core dumped).
  2. When a large dataset is being trained with a higher number of epochs, the kernel dies and doesn’t restart.

So, what would be the reason for the segmentation fault and disconnected kernel?

There are lots of possible reasons, a common one is running out of compute resources (e.g. memory). It’s also possible there’s a bug in your code or a library you’re using. If you’ve got any colleagues who use the same supercomputer they should be able to help you work out whether it’s a resource problem.
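If it helps to check the resource angle, here’s a rough sketch of watching memory from Python while a job runs. It assumes the node is Linux (it parses `/proc/meminfo`, the same file you’d inspect with `cat`); the function name is just illustrative:

```python
def read_meminfo(path="/proc/meminfo"):
    """Parse /proc/meminfo into a dict of field name -> value in kB."""
    info = {}
    with open(path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            # lines look like "MemAvailable:   123456 kB"
            info[key] = int(rest.split()[0])
    return info

if __name__ == "__main__":
    mem = read_meminfo()
    total = mem["MemTotal"]
    avail = mem.get("MemAvailable", mem["MemFree"])
    print(f"memory in use: {100 * (total - avail) / total:.1f}%")
```

Calling this periodically from the training loop (or just watching `free -h` in another terminal on the node) would show whether usage climbs steadily toward exhaustion.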


Following up on @manics’ response (I suspect resources as well), if you could provide any of the information surrounding the seg-fault (console messages, stack traces, etc.), we (and/or others) might be able to glean some hints. Also, about how long is “a long time”?

Thanks.

@manics I have attached the result of `cat /proc/meminfo` for reference.

I use Python code for training the AI model, and as of now no colleagues are using the HPC for this kind of application; they use the system for OpenMP programming. These are the libraries I use in my implementation:

import os
import os.path
import sys
import h5py
import numpy as np
import tensorflow as tf
from keras.models import Model, Sequential
from keras.layers import Flatten, Dense, Input, Conv1D, MaxPooling1D, GlobalAveragePooling1D, GlobalMaxPooling1D, AveragePooling1D
from keras.engine.topology import get_source_inputs
from keras.utils import layer_utils
from keras.utils.data_utils import get_file
from keras import backend as K
from keras.applications.imagenet_utils import decode_predictions
from keras.applications.imagenet_utils import preprocess_input
from keras.optimizers import RMSprop
from keras.callbacks import ModelCheckpoint
from keras.utils import to_categorical
from keras.models import load_model

I use a CNN model for my application.

I use a dataset I created myself for training the model. I have attached a few details of my analysis below:

  1. When I use a dataset of size 80,000 × 10,000 for 50 epochs, I get a segmentation fault after completing the 15th epoch out of 50. I then tried the same run after setting `ulimit -s unlimited`, but got the same segmentation fault after the 15th epoch.
  2. When I use a dataset of size 80,000 × 2,500 for 50 epochs, the model runs fine and I get the required output. But with the same dataset, when the epochs are increased to 75, I get a segmentation fault after the 50th epoch.
  3. When I use a dataset of size 80,000 × 1,500 for 100 epochs, the segmentation fault occurs after completing the 78th epoch.
  4. When I use a dataset of size 60,000 × 1,500 for 150 epochs, the segmentation fault occurs after completing the 103rd epoch.

In all the scenarios I didn’t change any code parameters other than the epoch value. Also, I used the same dataset (80,000 × 10,000) to derive the smaller datasets.
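Interestingly, multiplying rows × columns × completed epochs gives roughly the same total in every scenario, which would be consistent with memory leaking in proportion to the amount of data processed (the numbers below are copied from the list above):

```python
# (rows, columns, epochs completed before the segmentation fault)
runs = [
    (80_000, 10_000, 15),
    (80_000, 2_500, 50),
    (80_000, 1_500, 78),
    (60_000, 1_500, 103),
]

totals = [rows * cols * epochs for rows, cols, epochs in runs]
for (rows, cols, epochs), total in zip(runs, totals):
    print(f"{rows:>6} x {cols:>6} x {epochs:>3} epochs = {total:.2e} elements")

# All four totals fall within roughly 30% of each other.
print(f"spread: max/min = {max(totals) / min(totals):.2f}")
```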

@kevin-bates
When I access the Jupyter notebook directly on the GPU (node 03) using the following command, I get a segmentation fault after using the notebook for around 3–5 hours continuously (not at a constant time), even when it is open and kept idle. In the other cases, when the dataset size is changed and the number of epochs is varied, the segmentation fault occurs after a particular epoch, as detailed in my previous reply to @manics:
$srun --partition=gpu --pty --nodelist=hpc-node-03 jupyter notebook --ip=0.0.0.0

Following is the screenshot of the error in logs:

This sure smells like a resources issue related to your experiment and/or environment.

Are you able to reproduce the issue outside of Jupyter by running the experiment’s code directly?

Have you closely examined your epoch boundary, which I would assume is where potential leaks would probably reside?


Thank you for your reply @kevin-bates.

I use a virtual environment and the same Kernel in Jupyter notebook too.

I didn’t try to reproduce outside Jupyter notebook.

What does epoch boundary mean? Is there a way I can find it?

Even if it is due to the epoch boundary, what could be the reason for getting Segmentation fault (core dumped) even when the Jupyter notebook is open and kept idle?

Sorry for the delay in responding…

What does epoch boundary mean?

This is probably naive on my part, but I’m figuring the calculations restart at each epoch and if resources are not freed then, eventually, they could be exhausted. When resources like memory get exhausted, a program’s behavior is essentially unpredictable because any particular method can be impacted at any moment.

Since the jupyter stack and kernels can introduce lots of moving parts, I suspect it might be in your best interest to try reproducing the issue outside of Jupyter by running your script directly (if that’s manageable). This is probably not an easy thing to troubleshoot, but you’ve already shown that the total number of calculations has some influence here - so that’s a good clue (and implies resource limitations).

It’s been years since I’ve dug through a core dump, but sometimes (surprisingly) bits and pieces can be gleaned that can help one reach a solution.

even when the Jupyter notebook is opened and kept idle?

What do you mean by this? I’m assuming that the Seg fault occurs during the notebook cell’s execution. So although the notebook may not be actively displaying results its kernel is processing the cell’s code and, eventually, encounters the segmentation violation.


Thank you @kevin-bates. Sorry for the late response; it took some time to execute the training outside of the Jupyter notebook directly using a Slurm script.
That run also resulted in a segmentation fault at the same epoch.
I am also attaching the stackoverflow link where I have added the code that I use for training:

The dataset size is around 9GB.

Following is the slurm script used when run without Jupyter Notebook:

#!/bin/bash
#SBATCH --partition=GPU_three
#SBATCH --nodelist=Node_03
#SBATCH --output=output

python3 train.py

Thanks for the update. Since this occurs outside of Jupyter, you’ll likely need to explore this issue from a different angle. I’m sorry I don’t have experience with TensorFlow, Keras, or deep-learning modeling, but this will likely take someone with that kind of experience, along with knowledge of your configuration. As @manics originally suggested, perhaps collaborating with a colleague or two can shed some light.

I would also try to locate the core dump file and give it a scan. It might look like a lost cause, but you may be able to pick out a library or method call that was active at the time. It’s not always the case, but there’s a decent chance that such a method reflects a “hot spot” where further analysis can take place. For example, if this leads to a method in an installed package, search that package’s repository for similar issues; perhaps there’s an update, etc. Just keep in mind that this clue could also be misleading, since memory issues can side-effect anything (including locations that are not the “hot spot”).

You might also want to look at instrumenting your code with utilities like resource and/or psutil and try to get an idea of where memory (assuming that’s the actual issue here) is being consumed when it should have been freed. I would also keep an eye out for wherever iterations or cycles might be present. Depending on that location, you might have missed an opportunity to release resources, eventually resulting in their exhaustion.
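As a concrete starting point, here’s a rough sketch of per-epoch memory logging using only the stdlib `resource` module. The class and method names simply mirror the Keras callback interface and are illustrative; in a real run you would subclass `keras.callbacks.Callback` and pass an instance via `model.fit(callbacks=[...])` so Keras invokes it at each epoch boundary:

```python
import gc
import resource

class MemoryLogger:
    """Record peak RSS at each epoch boundary.

    Illustrative sketch: in real use, subclass keras.callbacks.Callback
    so that model.fit() calls on_epoch_end automatically.
    """

    def __init__(self):
        self.peaks_kb = []

    def on_epoch_end(self, epoch, logs=None):
        gc.collect()  # free anything Python can reclaim before measuring
        # ru_maxrss is the peak resident set size (in kB on Linux)
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        self.peaks_kb.append(peak)
        print(f"epoch {epoch}: peak RSS {peak / 1024:.1f} MB")

if __name__ == "__main__":
    logger = MemoryLogger()
    logger.on_epoch_end(0)
```

If the logged peak keeps climbing from epoch to epoch rather than plateauing after the first one, that’s a strong sign something at the epoch boundary isn’t being released.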

Speaking from years of experience, this will be one of those issues that will be challenging to solve, will likely be something fairly obvious (in retrospect), and feel great once resolved - and they do get resolved.

Good luck!

P.S., when you do find the solution, please feel free to post back here. It might help someone out later.


Thank you @kevin-bates for your guidance. I will follow the steps suggested above and will let you know the updates.
