Binder & Tesseract?

Hi, folks,

I’m looking for some advice about using Tesseract in Binder. I have previously been able to use Tesseract in Binder successfully, but recently I encountered the following error.

Install Tesseract:

!conda install -c conda-forge -y tesseract

Run Tesseract:

# Import the Image module from the Pillow Library, which will help us access the image.
from PIL import Image

# Import the pytesseract library, which will run the OCR process.
import pytesseract

# Open a specific image file, convert the text in the image to computer-readable text (OCR),
# and then print the results for us to see here.
print(pytesseract.image_to_string(Image.open("filename.jpg"), lang="eng"))

Error:

TesseractError: (1, 'Error opening data file /srv/conda/envs/notebook/share/tessdata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'eng\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')

In the past when I’ve run into similar issues, the following has helped to fix this, but is no longer working:

!wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
!mv eng.traineddata /srv/conda/envs/notebook/share/tessdata/eng.traineddata

Any advice would be much appreciated. Thank you!

I found your own MyBinder launch here (from that Binder-ready repo) and can confirm that it no longer works as written.
Change the wget line to:

!wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata

Or

!curl -OL https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata

This is because main is now the preferred name for the default branch, and it looks like the tessdata repository has switched over from master. (See Adeoy’s comment from Feb 18 2022 here.)

… Still testing if anything else is needed because I think there is a permissions issue with /srv/conda/envs/notebook/share/. …
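To narrow down the permissions question, a rough sketch like the one below can report what the notebook user actually sees in the tessdata directory. The helper name describe_tessdata is my own invention for illustration, not part of pytesseract, and the path in the usage line is simply the one from the error message.

```python
import os
from pathlib import Path


def describe_tessdata(tessdata_dir, lang="eng"):
    """Report whether a tessdata directory looks usable for `lang`.

    Hypothetical debugging helper; not part of pytesseract or Tesseract.
    """
    directory = Path(tessdata_dir)
    exists = directory.is_dir()
    return {
        "exists": exists,
        # os.access answers for the current (notebook) user.
        "writable": exists and os.access(directory, os.W_OK),
        # Tesseract looks for <lang>.traineddata inside this directory.
        "has_language": (directory / f"{lang}.traineddata").is_file(),
    }
```

In a notebook cell, describe_tessdata("/srv/conda/envs/notebook/share/tessdata") would show whether the directory is missing, read-only for the notebook user, or just lacking eng.traineddata.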

In case someone wants Tesseract already installed at launch, I note that at one time the suggestion here was to install it using apt.txt, so that it gets installed by apt-get as the environment is built. I suspect, though, that with the proper conda commands, and then adding the trained data via postBuild, the same thing can be accomplished without apt.txt.
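For the postBuild idea, here is a minimal sketch of the download step written in Python. I am assuming a postBuild file may be a Python script if it starts with a suitable shebang line, and that other languages follow the same <lang>.traineddata naming as the eng file linked above. The function name fetch_traineddata and its download parameter are hypothetical, added so the step can be tried without network access.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Same URL pattern as the eng.traineddata link above, on the main branch.
TESSDATA_URL = "https://github.com/tesseract-ocr/tessdata/raw/main/{lang}.traineddata"


def fetch_traineddata(lang, dest_dir, download=urlretrieve):
    """Download <lang>.traineddata into dest_dir unless it is already there.

    `download` is any callable taking (url, filename); it defaults to
    urllib's urlretrieve and exists only so the function can be tested.
    """
    dest = Path(dest_dir) / f"{lang}.traineddata"
    if dest.is_file():
        return dest  # already present, nothing to do
    dest.parent.mkdir(parents=True, exist_ok=True)
    download(TESSDATA_URL.format(lang=lang), str(dest))
    return dest
```

A postBuild script could then call fetch_traineddata("eng", "/srv/conda/envs/notebook/share/tessdata"), using the directory named in the error message. Doing this at build time should also avoid the in-session permissions question, though I have not verified that.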


Minor thing, you’ll note that I suggest your install should be:

%conda install -c conda-forge -y tesseract
%conda install -c conda-forge pytesseract

The reason is that these use the more modern magics; see here. Using the exclamation point there is no longer best practice, and you are actually better off without it because of automagics.

Thanks so much. If you figure anything else out, please let me know. I changed to curl from wget, and that hasn’t changed anything. We had originally tried including Tesseract in a requirements.txt but found it was too heavy to load during the Binder launch. I will take a look at the apt.txt option as well.

That one with the apt.txt works. I’m still trying with your launches. It worked once; however, now I cannot reproduce the steps that allowed it to work. I had tried too many things in that session apparently.

The wget failing might have been the URL being wrong too. Note the URL I used in curl and what you had posted in your question are different.

Some recommendations from a biased former Anaconda employee and maintainer of a bunch of conda-forge packages:

  • check in an environment.yml including all your dependencies
    • any “special” things will be done properly when the server starts
      • in this case, TESSDATA_PREFIX is set when the environment is activated
      • this is hard to achieve inside an interactive session
    • all of the install stuff will happen even before your server starts
      • … even before “computer” starts,
        • and get cached better between runs
    • get everything you can from conda(-forge),
      • all conda packages are pre-built
      • some pip packages might need to be compiled from source distributions, and might need specific versions of apt.txt dependencies
      • if you need stuff from pip, you can have a pip: subsection
  • if you must install packages interactively
    • use mamba instead of conda
      • it’s faster to solve, download, and extract packages… at the very small risk of a different solve
    • binder already has and is using mamba to set up your environment
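If you are stuck in an interactive session anyway, a rough approximation of the activation step is to set the variable the error message asks for, TESSDATA_PREFIX, yourself before calling Tesseract. The sketch below assumes the kernel runs from the conda environment (so sys.prefix points at it) and that the data lives under share/tessdata, as in the error message. The helper name ensure_tessdata_prefix is hypothetical.

```python
import os
import sys
from pathlib import Path


def ensure_tessdata_prefix(environ=os.environ, env_prefix=sys.prefix):
    """Set TESSDATA_PREFIX if it is unset, and return the value in effect.

    Assumes tessdata lives at <env_prefix>/share/tessdata; an existing
    value is left untouched.
    """
    default = str(Path(env_prefix) / "share" / "tessdata")
    return environ.setdefault("TESSDATA_PREFIX", default)
```

Call it before the first pytesseract call: pytesseract runs tesseract as a subprocess, so the variable has to be in the notebook process's environment by then. Checking in an environment.yml is still the better route.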

Following Nick’s (@bollwyvl) excellent advice, I switched to using environment.yml, and both the conda/mamba install and the ‘01-WhatIsOCR.ipynb’ notebook worked immediately with no trouble at all. You can find and test it from my fork here:

https://mybinder.org/v2/gh/fomightez/tapi_2021_ocr/main

I still left the step that fetches the bigger training data file inside the notebook, but OCR also worked using the default English language data from the installation even before I put that in there.

I would have tried the environment.yml a lot earlier if I hadn’t seen it work once early on. I hope it wasn’t in the apt.txt one that I had seen the conversion of sessionlawsresol1955nort_0057.jpg be successful. I thought it was a launch from yours, but I could never find the magic combination again.


Thank you so much @fomightez @bollwyvl. Your responses have been very helpful. I have opted to update my environment.yml to include tesseract. I should have said that we tried this originally in early 2021, and in the older version of Binder/Jupyter the build took an excruciatingly long time every time it was loaded and sometimes failed. As these materials are for teaching, and we wanted students to be encouraged to work with these materials on their own and not get frustrated or continually reload their browser, we opted to break up the build instead by including some of the installs within the notebooks. I’m not sure if this was before Docker images were available in Binder (or if the ways we build Docker images have changed), but while the initial build takes a few minutes the notebooks load and run Tesseract without issue once the image is created.


excruciatingly long time

Yeah, aside from some generally better entropy management in repo2docker, adopting mamba had a lot to do with improving bottom-line performance. Among other things, its solver is faster: indeed, conda can now enable it behind a flag.

these materials are for teaching

When a project/course has a large, slow-changing environment and some (relatively) small content, nbgitpuller is a pretty good solution. While I wouldn’t want to use it locally (it kinda makes a mess of the filesystem), it does “the binder thing” really well, at the expense of longer URLs. With a link shortener in front of it, even that isn’t a problem.
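To give a feel for why the URLs get long, here is a sketch of how such a link could be assembled. The query layout (an outer urlpath parameter wrapping a git-pull request with repo, branch, and urlpath) is my understanding of the nbgitpuller link format, so treat it as an assumption and check the result against the official link generator. The function name and the repository names in the usage note are hypothetical, and the environment repo would need nbgitpuller installed.

```python
from urllib.parse import urlencode


def binder_gitpuller_url(env_repo, content_repo_url, env_ref="main",
                         content_branch="main", open_path="lab"):
    """Build a mybinder.org link that pulls content into a prebuilt environment.

    env_repo is "owner/name" of the GitHub repo holding environment.yml;
    content_repo_url is the full URL of the repo holding the notebooks.
    """
    # Inner request handled by nbgitpuller once the server is running.
    inner = "git-pull?" + urlencode({
        "repo": content_repo_url,
        "branch": content_branch,
        "urlpath": open_path,
    })
    # The whole inner request is percent-encoded as Binder's urlpath value.
    return f"https://mybinder.org/v2/gh/{env_repo}/{env_ref}?" + urlencode({"urlpath": inner})
```

For example, binder_gitpuller_url("example-org/course-env", "https://github.com/example-org/course-content") yields one long link that students can click; the environment image is only rebuilt when the environment repo changes, not when the notebooks do.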
