A number of the data sets we want to perform machine learning on using JupyterLab (version 2.2.9) range from 100 GB to 1 TB. We are provisioning very large Kubernetes nodes (96 CPUs, 16 GPUs, and 1.5 TB of RAM).
Can you tell me the size limit for data sets (files) loaded into the memory of a user's JupyterLab server instance, assuming each user gets a dedicated node with the resources mentioned above?
JupyterLab, and Jupyter in general, isn't really the limiting factor here; the host and server processes only occupy on the order of 30-100 MB once running. At the scale of your data sets, Jupyter's contribution to the memory footprint is negligible.
Instead, look at the tools you are using to load the data (dask, pandas, xarray, vaex, etc.) and measure how much memory they need for your data sets. The relationship between file size and memory use depends heavily on the nature of the data and the tool that loads it, and the computations you run afterwards also greatly affect how much memory you need. Many of these tools support clever out-of-core operations to avoid filling up RAM when the data doesn't fit (see the sketch below). You may have better luck asking in the community for your particular data-loading/processing tool about how to best predict RAM usage from data files.
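For example, here is a minimal sketch (the file path and column name are hypothetical placeholders) showing how much the choice of loading tool changes the memory picture: pandas pulls the whole file into RAM, while dask builds a lazy, partitioned view and streams the data chunk by chunk.

```python
import pandas as pd
import dask.dataframe as dd

# Hypothetical path to one of your large data sets.
path = "/data/measurements.csv"

# pandas reads the entire file into RAM; in-memory size often exceeds
# the on-disk size (object/string columns in particular expand a lot).
df = pd.read_csv(path)
print(df.memory_usage(deep=True).sum() / 1e9, "GB in RAM")

# dask instead creates a lazy, partitioned DataFrame; partitions are
# read from disk on demand, so the full file never has to fit in RAM.
ddf = dd.read_csv(path, blocksize="256MB")
print(ddf.npartitions, "partitions, loaded on demand")

# Only a few partitions are in memory at any time during the reduction.
result = ddf["value"].mean().compute()  # "value" is a placeholder column
```

So rather than a single file-size limit, the practical ceiling is set by how your chosen tool materializes the data relative to the node's 1.5 TB of RAM.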