Hi, folks,
I’m looking for some advice about using Tesseract in Binder. I have previously been able to use Tesseract in Binder successfully, but recently I encountered the following error.
Install Tesseract:
!conda install -c conda-forge -y tesseract
Run Tesseract:
# Import the Image module from the Pillow Library, which will help us access the image.
from PIL import Image
# Import the pytesseract library, which will run the OCR process.
import pytesseract
# Open a specific image file, convert the text in the image to computer-readable text (OCR),
# and then print the results for us to see here.
print(pytesseract.image_to_string(Image.open("filename.jpg"), lang="eng"))
Error:
TesseractError: (1, 'Error opening data file /srv/conda/envs/notebook/share/tessdata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'eng\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')
In the past when I’ve run into similar issues, the following has helped to fix this, but is no longer working:
!wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
!mv eng.traineddata /srv/conda/envs/notebook/share/tessdata/eng.traineddata
Any advice would be much appreciated. Thank you!
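One way to see what the error is complaining about is to list which language files are actually present in the directory it names. The following is a minimal sketch using only the Python standard library; the helper name installed_languages is made up for illustration, and the path in the final line is simply the one from the error message above.

```python
import os
from pathlib import Path


def installed_languages(tessdata_dir=None):
    """Return the language codes that have a .traineddata file in tessdata_dir.

    Falls back to the TESSDATA_PREFIX environment variable when no
    directory is given. Returns an empty list if the directory is missing.
    """
    raw = tessdata_dir or os.environ.get("TESSDATA_PREFIX")
    if not raw:
        return []
    directory = Path(raw)
    if not directory.is_dir():
        return []
    return sorted(p.stem for p in directory.glob("*.traineddata"))


# Path taken from the error message in this post; adjust for your environment.
print(installed_languages("/srv/conda/envs/notebook/share/tessdata"))
```

If "eng" is not in the printed list, Tesseract has no English data in that directory, which would match the "Failed loading language" message.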
I found your own MyBinder session launch here (that binder-ready repo) and see that it no longer works as written right now.
Change the wget line to:
!wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
Or
!curl -OL https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
This is because main is now the preferred name for the default branch, and it looks like they converted. (See Adeoy’s comment from Feb 18 2022 here.)
… Still testing if anything else is needed because I think there is a permissions issue with /srv/conda/envs/notebook/share/. …
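If the permissions issue does turn out to be the blocker, one possible workaround is to keep the trained data in a directory you can write to and point TESSDATA_PREFIX at it, as the error message suggests. This is only a sketch under assumptions: the helper names and the default ~/tessdata location are made up for illustration, and it assumes a Tesseract version where TESSDATA_PREFIX names the tessdata directory itself.

```python
import os
import urllib.request
from pathlib import Path

# URL pattern from this thread, using the "main" branch.
TESSDATA_URL = "https://github.com/tesseract-ocr/tessdata/raw/main/{lang}.traineddata"


def traineddata_url(lang="eng"):
    """Build the download URL for one language's trained data."""
    return TESSDATA_URL.format(lang=lang)


def fetch_traineddata(lang="eng", dest_dir=None):
    """Download <lang>.traineddata into a writable directory (if missing)
    and point TESSDATA_PREFIX at that directory for this process."""
    # Hypothetical default location; any writable directory should do.
    dest = Path(dest_dir) if dest_dir else Path.home() / "tessdata"
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / f"{lang}.traineddata"
    if not target.exists():
        urllib.request.urlretrieve(traineddata_url(lang), str(target))
    # pytesseract launches tesseract as a subprocess, which inherits this.
    os.environ["TESSDATA_PREFIX"] = str(dest)
    return target
```

Calling fetch_traineddata("eng") in a cell before the pytesseract.image_to_string(...) call should then let Tesseract find the English data without touching /srv/conda/envs/notebook/share/.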
In case someone wants a launch with this already installed, note that at one time the suggestion here was to install it using apt.txt, so that it gets installed by apt-get as the environment is built. I suspect, though, that with the proper conda commands and then adding the trained data via postBuild, the same thing can be accomplished without apt.txt.
Minor thing, you’ll note that I suggest your install should be:
%conda install -c conda-forge -y tesseract
%conda install -c conda-forge pytesseract
The reason is that it uses the more modern magics, see here. Using the exclamation point there is no longer the best practice and you are actually better off without it because of automagics.
Thanks so much. If you figure anything else out, please let me know. I changed to curl from wget, and that hasn’t changed anything. We had originally tried including Tesseract in a requirements.txt but found it was too heavy to load during the Binder launch. I will take a look at the apt.txt option as well.
That one with the apt.txt works. I’m still trying with your launches. It worked once; however, now I cannot reproduce the steps that allowed it to work. I had tried too many things in that session apparently.
The wget failing might have been the URL being wrong too. Note the URL I used in curl and what you had posted in your question are different.
Some recommendations from a biased former Anaconda employee and maintainer of a bunch of conda-forge packages:
- check in an environment.yml including all your dependencies
  - any “special” things will be done properly when the server starts
    - in this case, TESSERACT_HOME is set when the environment is activated
    - this is hard to achieve inside an interactive session
  - all of the install stuff will happen even before your server starts
    - … even before “computer” starts,
    - and get cached better between runs
- get everything you can from conda(-forge)
  - all conda packages are pre-built
  - some pip packages might need to be compiled from source distributions, and might need specific versions of apt.txt dependencies
  - if you need stuff from pip, you can have a pip: subsection
- if you must install packages interactively
  - use mamba instead of conda
    - it’s faster to solve, download, and extract packages… at the very small risk of a different solve
    - binder already has and is using mamba to set up your environment
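As a rough illustration of the first two recommendations, an environment.yml for this case might look something like the fragment below. The environment name and the exact package list are assumptions based on what this thread uses (Pillow, Tesseract, pytesseract), not a copy of any repo's actual file, and no versions are pinned.

```yaml
# Sketch of an environment.yml; name and package list are illustrative.
name: notebook
channels:
  - conda-forge
dependencies:
  - python
  - pillow
  - tesseract
  - pytesseract
  # Only if something is unavailable from conda-forge:
  # - pip
  # - pip:
  #     - <pip-only package name goes here>
```

With a file like this checked in at the top of the repo, the install happens while the image is built, so the notebooks would not need install cells at all.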
Following Nick’s (@bollwyvl) excellent advice, I switched it to using environment.yml, and both the conda/mamba install and the ‘01-WhatIsOCR.ipynb’ notebook worked immediately with no trouble at all. You can find & test from my fork here:
https://mybinder.org/v2/gh/fomightez/tapi_2021_ocr/main
I still left getting the bigger training data file inside the notebook, but it also worked using the default english language data from the installation even before I put that in there.
I would have tried the environment.yml a lot earlier if I hadn’t seen it work once early on. I hope it wasn’t in the apt.txt one that I had seen the conversion of sessionlawsresol1955nort_0057.jpg be successful. I thought it was a launch from yours, but I could never find the magic combination again.
Thank you so much @fomightez @bollwyvl. Your responses have been very helpful. I have opted to update my environment.yml to include tesseract. I should have said that we tried this originally in early 2021, and in the older version of Binder/Jupyter the build took an excruciatingly long time every time it was loaded and sometimes failed. As these materials are for teaching, and we wanted students to be encouraged to work with these materials on their own and not get frustrated or continually reload their browser, we opted to break up the build instead by including some of the installs within the notebooks. I’m not sure if this was before Docker images were available in Binder (or if the ways we build Docker images have changed), but while the initial build takes a few minutes the notebooks load and run Tesseract without issue once the image is created.
excruciatingly long time
Yeah, aside from some generally-better entropy management in repo2docker, adopting mamba had a lot to do with improving bottom line performance. Among other things, its solver is faster: indeed, conda can now enable it behind a flag.
these materials are for teaching
When a project/course has a large, slow-changing environment and some (relatively) small content, nbgitpuller is a pretty good solution. While I wouldn’t want to use it locally (it kinda makes a mess of the filesystem), it does “the binder thing” really well, at the expense of longer URLs. With a link shortener in front of it, even that isn’t a problem.
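To give a feel for why those URLs get long, here is a hedged sketch of building such a link in Python. It assumes the link shape I understand nbgitpuller to use on Binder (an outer urlpath parameter wrapping a git-pull query), and it assumes nbgitpuller is installed in the environment repo. The function name, parameter names, and example repos are all hypothetical; the official nbgitpuller link generator remains the authoritative way to produce these.

```python
from urllib.parse import urlencode


def binder_gitpull_link(env_repo, content_repo_url, branch="main",
                        env_ref="main", open_urlpath=None):
    """Build a mybinder.org link that launches env_repo's environment
    and asks nbgitpuller to pull content_repo_url into the session.

    env_repo is "user/repo" on GitHub; open_urlpath is an optional
    path to open after the pull (for example "lab/tree/<repo>/<notebook>").
    """
    inner = {"repo": content_repo_url, "branch": branch}
    if open_urlpath:
        inner["urlpath"] = open_urlpath
    # The inner query is itself URL-encoded again as the outer urlpath value.
    git_pull = "git-pull?" + urlencode(inner)
    outer = urlencode({"urlpath": git_pull})
    return f"https://mybinder.org/v2/gh/{env_repo}/{env_ref}?{outer}"


# Hypothetical repos, for illustration only.
print(binder_gitpull_link("someuser/env-repo",
                          "https://github.com/someuser/content-repo"))
```

The double encoding is what makes the result unwieldy, which is where the link shortener comes in.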