A repo failing to access a file <TNetXNGFile::Open>: [FATAL] Connection error

Hi binder team!

I’m not sure if this is the right place to ask this sort of question; just ignore it if it isn’t.
We have this wonderful example of an analysis on CMS experiment particle physics open data: https://mybinder.org/v2/gh/cms-opendata-analyses/DimuonSpectrumNanoAODOutreachAnalysis/master?filepath=dimuonSpectrum.ipynb
(https://github.com/cms-opendata-analyses/DimuonSpectrumNanoAODOutreachAnalysis)

This is a rather particular example compared to the others we have: it uses our analysis package ROOT and runs over more than 60M (!!) particle physics collisions. It used to run nicely (from July 2019 until ~September or so), which was amazing! For one reason or another, it stopped working without any changes in the repository.

The error messages, I was told, indicate that it cannot access the file.

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-4-29596316c4b5> in <module>
----> 1 df_2mu = df.Filter("nMuon == 2", "Events with exactly two muons")

Exception: ROOT::RDF::RInterface<ROOT::Detail::RDF::RJittedFilter,void> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Filter(basic_string_view<char,char_traits<char> > expression, basic_string_view<char,char_traits<char> > name = "") =>
    Cannot interpret the following expression:
nMuon == 2

Make sure it is valid C++. (C++ exception of type runtime_error)

Error in <TNetXNGFile::Open>: [FATAL] Connection error
input_line_90:1:46: error: use of undeclared identifier 'nMuon'
namespace __tdf_0{ auto tdf_f = []() {return nMuon == 2
                                             ^

Would you be able to tell whether there have been changes or restrictions (we do understand that this amount of data is quite close to the limit) that could cause the failure, and whether there is something we can do to debug?

Many thanks!

Best, Kati


@bitnik do you know what the network policy is on GESIS? When I launch the repository on GKE it works (at least there is no error message), which makes me wonder whether GESIS allows the same ports that GKE does.

@katilp as a workaround you can explicitly launch the repository on our GKE cluster with https://gke.mybinder.org/v2/gh/cms-opendata-analyses/DimuonSpectrumNanoAODOutreachAnalysis/master?filepath=dimuonSpectrum.ipynb

Thanks for letting us know!

Thanks @betatim!

It does indeed look like it gets past the point where it failed on GESIS when run on the GKE cluster.
But it still fails when the actual computation happens at step 10, hist.Draw(); (execution is “lazy” on the ROOT side, so the work only takes place there):

The kernel appears to have died. It will restart automatically.

Earlier it certainly took a while (minutes), but it went through fine. Maybe a time limit?

Best, Kati

This almost always means it uses too much memory. The memory limit hasn’t changed, so maybe the dataset got bigger? I would try with a subset of the data to verify that it works.

@betatim you were right, we just updated the allowed ports. It now works on GESIS Binder too, but the kernel dies there as well.


Thanks @betatim and @bitnik!
If I restrict the number of events, e.g. to 10M (df_mass = df_mass.Range(10000000) just before hist = df_mass.Histo1D), and do not call ROOT.ROOT.EnableImplicitMT(), it now runs nicely :smiley:
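In case it is useful for others, here is a rough sketch of what that workaround looks like; the file URL and the intermediate analysis steps are placeholders standing in for the actual notebook code, only the Range() call and the omitted EnableImplicitMT() are the point:

import ROOT

# Deliberately NOT calling ROOT.ROOT.EnableImplicitMT() here, so the
# event loop stays single-threaded and within the Binder resource limits.

# Placeholder URL: the real notebook opens a CMS NanoAOD file over XRootD.
df = ROOT.RDataFrame("Events", "root://example.cern.ch//path/to/nanoaod.root")

df_2mu = df.Filter("nMuon == 2", "Events with exactly two muons")
# ... the notebook's further Filter/Define steps go here, ending in df_mass ...
df_mass = df_2mu  # stand-in so this sketch runs end to end

# Optional workaround: only process the first 10M events.
df_mass = df_mass.Range(10000000)

hist = df_mass.Histo1D("nMuon")  # booking the histogram is lazy ...
hist.Draw()                      # ... the actual event loop runs here

Note that Range() only works with a single-threaded event loop, which is another reason the two changes go together.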


Thanks! We should add “check ports” to the (not yet existent) list of things to do when adding a new cluster to the federation :slight_smile:

And just for the record: no need to limit the number of events if ROOT.ROOT.EnableImplicitMT() is left out. All is good then! Thanks!


Hi all!

Kati and I just had a second look at the problem, and we found that the issue comes from the number of threads spawned in the workflow. ROOT.ROOT.EnableImplicitMT() detects the available resources (much like nproc) and spawns a thread pool of that size.

The problem is that the sessions report 72 CPUs available, as detected by both ROOT and nproc. So here is the question: I suppose you protect your resources in case someone spawns a massively parallel task on Binder?

@betatim @bitnik Could this be the culprit?

Best
Stefan


A Binder launched by someone is free to start as many threads or processes as they want. We limit the total share of the available CPU each Binder can use (via the resource limits of Linux containers). So spawning lots of threads or processes hurts the individual user (because of all the overhead), but to everyone else on the machine it just looks like they are maxing out their allocated resources.

The resource limit per Binder is 1 CPU (or 1000m in Kubernetes speak).

Unfortunately you can’t trust the number of CPUs you “detect” from inside a docker container as that only tells you how many cores the host has, not how many you actually have access to. You should be able to find out the limit on mybinder.org by looking at the CPU_LIMIT environment variable.
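If it helps, here is a small sketch of how a notebook could respect that limit instead of trusting nproc; it assumes CPU_LIMIT holds a plain number of cores (as on mybinder.org) and uses the fact that EnableImplicitMT() accepts an explicit pool size:

import math
import os

import ROOT

# CPU_LIMIT on mybinder.org is a (possibly fractional) number of cores,
# e.g. "1.0"; fall back to 1 if the variable is not set.
cpu_limit = float(os.environ.get("CPU_LIMIT", "1"))
n_threads = max(1, math.floor(cpu_limit))

if n_threads > 1:
    # Passing an explicit thread count stops ROOT from spawning 72 threads
    # just because the host machine has 72 cores.
    ROOT.ROOT.EnableImplicitMT(n_threads)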

Java can detect the correct number of CPUs inside a container. There are a couple of open issues on the Python tracker asking for something similar.
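For completeness, a sketch of what one can do from Python today: len(os.sched_getaffinity(0)) respects CPU pinning but not cgroup quotas, so reading the cgroup CPU quota directly (assuming cgroup v1, as on most clusters at the time of writing) is the closest analogue to what Java does:

import os

def available_cpus():
    """Best-effort CPU count inside a container (assumes cgroup v1)."""
    try:
        # cgroup v1 CPU quota: quota / period = number of cores allowed
        # (quota is -1 when there is no limit).
        with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
            quota = int(f.read())
        with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
            period = int(f.read())
        if quota > 0 and period > 0:
            return max(1, quota // period)
    except OSError:
        pass
    # Fallbacks: CPUs this process may be scheduled on, then the host count.
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:
        return os.cpu_count() or 1

print("CPUs available here:", available_cpus())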
