This is a rather particular example compared to the others we have: it uses our analysis package ROOT and runs over more than 60M (!!) particle physics collisions. It used to run nicely (from July 2019 until roughly September), which was amazing! For one reason or another it stopped working, without any changes to the repository.
The error messages, I was told, indicate that it cannot access the file.
```
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-4-29596316c4b5> in <module>
----> 1 df_2mu = df.Filter("nMuon == 2", "Events with exactly two muons")
Exception: ROOT::RDF::RInterface<ROOT::Detail::RDF::RJittedFilter,void> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Filter(basic_string_view<char,char_traits<char> > expression, basic_string_view<char,char_traits<char> > name = "") =>
Cannot interpret the following expression:
nMuon == 2
Make sure it is valid C++. (C++ exception of type runtime_error)
Error in <TNetXNGFile::Open>: [FATAL] Connection error
input_line_90:1:46: error: use of undeclared identifier 'nMuon'
namespace __tdf_0{ auto tdf_f = []() {return nMuon == 2
^
```
Would you be able to tell whether there have been any changes or restrictions (we do understand that this amount of data is pushing the limits) that could cause the failure, and whether there is something we can do to debug it?
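One thing we could try, to separate the C++ interpretation error from the connection error, is to open the remote file directly with TFile::Open from a notebook cell. A minimal sketch, with a placeholder URL (substitute the actual xrootd path used in the notebook):

```python
import ROOT

# Placeholder URL -- replace with the real xrootd path from the notebook.
url = "root://server.example//path/to/Events.root"

f = ROOT.TFile.Open(url)
if not f or f.IsZombie():
    print("Could not open the remote file -- consistent with the TNetXNGFile connection error")
else:
    print("File opened fine; the problem is probably elsewhere")
    f.Close()
```

If this fails with the same [FATAL] Connection error, the "Cannot interpret the following expression" message is just a knock-on effect: RDataFrame cannot learn the column names (such as nMuon) from a file it cannot read.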
@bitnik do you know what the network policy is on GESIS? When I launch the repository on GKE it works (at least there is no error message). That makes me wonder whether GESIS allows the same outgoing ports that GKE does.
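One way to check this from inside a running session: plain xrootd traffic normally uses port 1094, so a quick socket test shows whether outgoing connections on that port are allowed (the hostname below is a placeholder for the actual storage server):

```python
import socket

host, port = "server.example", 1094  # xrootd default port; host is a placeholder
try:
    socket.create_connection((host, port), timeout=10).close()
    print(f"outgoing connection to {host}:{port} works")
except OSError as err:
    print(f"cannot reach {host}:{port}: {err}")
```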
It does indeed look like, on the GKE cluster, it gets past the point where it failed on GESIS.
But it still fails when it comes to the actual computation at step 10, hist.Draw(); (execution on the ROOT side is “lazy”, so the work only actually happens there):
The kernel appears to have died. It will restart automatically.
Earlier it certainly took a long time (minutes), but it went through nicely. Maybe a time limit?
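For context on the “lazy” remark: RDataFrame only books the operations, and the full event loop over the dataset runs the first time a result is actually needed, i.e. at hist.Draw(). Roughly, with a placeholder file path and histogrammed column (the filter string is the one from the traceback above):

```python
import ROOT

df = ROOT.RDataFrame("Events", "root://server.example//path/to/Events.root")  # placeholder URL
df_2mu = df.Filter("nMuon == 2", "Events with exactly two muons")
hist = df_2mu.Histo1D("nMuon")  # lazy: only booked, nothing has been read yet
hist.Draw()                     # the event loop over all ~60M events runs here
```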
This almost always means it uses too much memory. The memory limit hasn’t changed, so maybe the dataset got bigger? I would try with a subset of the data to verify that it works.
Thanks @betatim and @bitnik!
If I restrict the number of events, e.g. to 10M (df_mass = df_mass.Range(10000000) in front of hist = df_mass.Histo1D(...)), and do not call ROOT.ROOT.EnableImplicitMT(), it now indeed runs nicely.
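Roughly, the variant that now works looks like this (file path and histogrammed column are placeholders; the real notebook defines a dimuon-mass column first, and Range() only works without implicit multithreading anyway):

```python
import ROOT

# Note: do NOT call ROOT.ROOT.EnableImplicitMT() -- Range() requires single-threaded mode.
df = ROOT.RDataFrame("Events", "root://server.example//path/to/Events.root")  # placeholder URL
df_mass = df.Filter("nMuon == 2", "Events with exactly two muons")            # simplified stand-in for the real selection
df_mass = df_mass.Range(10000000)   # process only the first 10M events
hist = df_mass.Histo1D("nMuon")     # placeholder column; the notebook fills a dimuon-mass histogram
hist.Draw()
```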
Kati and I just had a second look at the problem, and we found that the issue comes from the number of threads spawned in the workflow: ROOT.ROOT.EnableImplicitMT() detects the available resources (similar to nproc) and spawns a thread pool of that size.
The problem now is that the sessions have 72 CPUs available, as detected by both ROOT and nproc. So here is the question: I suppose you protect your resources in case someone spawns a massively parallel task on Binder?
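If the auto-detection is the culprit, EnableImplicitMT also accepts an explicit thread count, so the pool size can be pinned instead of being derived from the 72 visible cores; the value below is just an example:

```python
import ROOT

# Cap the implicit-MT thread pool explicitly instead of letting ROOT use all visible cores.
ROOT.ROOT.EnableImplicitMT(4)  # 4 is an arbitrary example value
```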
A Binder launched by someone is free to start as many threads or processes as it wants. We limit the total share of the available CPU each Binder can use (via the resource limits of Linux containers). So spawning lots of threads or processes harms the individual user because of all the overhead, but to everyone else on the machine it just looks like they are maxing out their allocated resources.
The resource limit per Binder is 1 CPU (or 1000m, i.e. 1000 millicores, in Kubernetes speak).
Unfortunately, you can’t trust the number of CPUs you “detect” from inside a Docker container, as that only tells you how many cores the host has, not how many you actually have access to. You should be able to find out the limit on mybinder.org by looking at the CPU_LIMIT environment variable.
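Putting that together, a small sketch that sizes the thread pool from CPU_LIMIT rather than from the host’s core count (assuming CPU_LIMIT holds a decimal string such as "1.0"; fall back to a single thread if it is unset):

```python
import os
import ROOT

# CPU_LIMIT is set on mybinder.org; assume a decimal string such as "1.0".
n_threads = max(1, int(float(os.environ.get("CPU_LIMIT", "1"))))
ROOT.ROOT.EnableImplicitMT(n_threads)
```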