Models for OCR-D processors

OCR engines rely on pre-trained models for their recognition. Every engine has its own internal format(s) for models. Some support central storage of models at a specific location (tesseract, ocropy, kraken) while others require the full path to a model (calamari).

Likewise, model distribution is not currently centralised within OCR-D, though we are working towards a central model repository.

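In the meantime, this guide shows, for each OCR engine, which format its models use, where they are stored, and how to select a model in the corresponding OCR-D processor.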

Tesseract / ocrd_tesserocr

Tesseract models are single files with a .traineddata extension.

Tesseract expects models to be in a directory tessdata within what Tesseract calls TESSDATA_PREFIX. When installing Tesseract from Ubuntu packages, that location is /usr/share/tesseract-ocr/4.00/tessdata. When building from source using ocrd_all, models are looked up in /path/to/ocrd_all/venv/share/tessdata. To override these locations, set the TESSDATA_PREFIX environment variable. For example, if you want the models location to be $HOME/tessdata, add the following to your $HOME/.bashrc: export TESSDATA_PREFIX=$HOME.
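As a minimal sketch of the override described above, assuming the models are meant to live in $HOME/tessdata, the following sets the variable for the current shell, reports whether the model directory exists, and, only if a tesseract binary is on the PATH, asks Tesseract which models it finds. The variable name tessdata_dir is introduced here purely for illustration.

```shell
# Assumption: models are stored in $HOME/tessdata, as in the example above.
export TESSDATA_PREFIX="$HOME"
tessdata_dir="$TESSDATA_PREFIX/tessdata"   # illustrative variable, not used by Tesseract

# Show the model files present in that directory, if it exists yet.
if [ -d "$tessdata_dir" ]; then
    ls "$tessdata_dir"
else
    echo "No model directory at $tessdata_dir yet"
fi

# If Tesseract is installed, let it report the models it can find.
if command -v tesseract >/dev/null 2>&1; then
    tesseract --list-langs || echo "tesseract could not list any models"
fi
```

Comparing the two listings is a quick way to confirm that Tesseract is really searching the directory you intended.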

We recommend you download the following models, either by downloading and saving to the right location or by running make install-models-tesseract when using ocrd_all:

If you installed Tesseract with Ubuntu’s apt package manager, you may want to install standard models like deu or script/Fraktur with apt:

sudo apt install tesseract-ocr-deu tesseract-ocr-script-frak

NOTE: When installing with apt, the script/* models are installed without the script/ prefix, so script/Latin becomes just Latin, script/Fraktur becomes Fraktur, etc.
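The renaming in the note above can be sketched as a small shell function. The name apt_model_name is hypothetical and not part of Tesseract or OCR-D; it only drops the script/ prefix from each component of a model name, so that a name taken from upstream documentation can be turned into the name to use after an apt installation.

```shell
# Hypothetical helper: drop the script/ prefix from every component of a
# (possibly "+"-joined) model name, mirroring how apt installs the files.
apt_model_name() {
    printf '%s\n' "$1" | sed -e 's|^script/||' -e 's|+script/|+|g'
}

apt_model_name "script/Fraktur"
apt_model_name "deu+script/Latin"
```

Names without the prefix, such as deu, pass through unchanged.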

OCR-D’s Tesseract wrapper, ocrd_tesserocr, and more specifically the ocrd-tesserocr-recognize processor, expects the name of the model(s) in the model parameter. Multiple models can be combined by joining their names with + (which generally improves accuracy but always slows processing):

# Use the deu and frk models
ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -p '{"model": "deu+frk"}'
# Use the script/Fraktur model
ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -p '{"model": "script/Fraktur"}'
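Before running the processor with a combined model, it can help to confirm that every component is actually installed. The following is a sketch only: missing_models is a hypothetical helper, and the directory passed to it is the Ubuntu package location mentioned earlier, which may differ on your system.

```shell
# Hypothetical helper: print each component of a "+"-joined model name that
# has no matching .traineddata file in the given tessdata directory.
# The body runs in a subshell so the IFS change does not leak to the caller.
missing_models() (
    set -f          # no filename expansion while splitting the model name
    IFS='+'
    for name in $2; do
        [ -f "$1/$name.traineddata" ] || printf '%s\n' "$name"
    done
)

# Assumption: models were installed from Ubuntu packages.
missing_models /usr/share/tesseract-ocr/4.00/tessdata "deu+frk"
```

An empty result means all components were found; any name that is printed still needs to be downloaded or installed.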

Ocropy / ocrd_cis

An Ocropy model is simply the neural network serialized with Python’s pickle mechanism. It is generally distributed in gzipped form, with a .pyrnn.gz extension.

Ocropy has a rather convoluted algorithm to look up models, so we recommend you explicitly set the OCROPUS_DATA variable to point to the directory with ocropy’s models. For example, if you intend to store your models in $HOME/ocropus-models, add the following to your $HOME/.bashrc: export OCROPUS_DATA=$HOME/ocropus-models.

We recommend you download the following models, either by downloading and saving to the right location or by running make install-models-ocropus when using ocrd_all:

To use a specific model with OCR-D’s ocropus wrapper in ocrd_cis, and more specifically the ocrd-cis-ocropy-recognize processor, use the model parameter:

ocrd-cis-ocropy-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-OCRO -p '{"model": "fraktur-jze.pyrnn.gz"}'
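Since the processor is given only a file name, a quick sanity check of the model directory can save a failed run. The sketch below assumes the $HOME/ocropus-models location used as an example earlier; check_ocropy_model is a hypothetical helper that tests only that the file exists and is an intact gzip archive, not that it contains a valid network.

```shell
# Hypothetical helper: succeed only if the named model exists in the given
# directory and passes gzip's integrity test.
check_ocropy_model() {
    model_path="$1/$2"
    [ -f "$model_path" ] && gzip -t "$model_path" 2>/dev/null
}

# Assumption: models are stored in $HOME/ocropus-models.
export OCROPUS_DATA="$HOME/ocropus-models"
if check_ocropy_model "$OCROPUS_DATA" "fraktur-jze.pyrnn.gz"; then
    echo "model file is present and an intact gzip archive"
else
    echo "model file is missing or not a valid gzip archive"
fi
```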

Calamari / ocrd_calamari

Calamari models are TensorFlow model directories. For distribution, such a directory is usually packed into a tarball or ZIP file. Once downloaded, the archive must be unpacked into a directory again.

Because calamari has no model discovery mechanism, you must always provide the path to the model, using a wildcard that matches all *.ckpt.json (“checkpoint”) files.

We recommend you download the following model, either by downloading and unpacking manually or by using make install-models-calamari if using ocrd_all:

To use a specific model with OCR-D’s calamari wrapper ocrd_calamari, and more specifically the ocrd-calamari-recognize processor, use the checkpoint parameter:

ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA -p '{"checkpoint": "/path/to/model/*.ckpt.json"}'
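Unpacking a downloaded model and locating its checkpoint files can be sketched as follows. Everything named here is an assumption for illustration: unpack_calamari_model is a hypothetical helper, calamari-model.tar.gz is a placeholder for whatever archive you downloaded, and $HOME/calamari-models is an arbitrary target directory. The sketch handles tarballs only; a ZIP file would need unzip instead of tar.

```shell
# Hypothetical helper: unpack a model tarball into a directory and print
# the checkpoint files found anywhere below it.
unpack_calamari_model() {
    archive=$1
    target_dir=$2
    mkdir -p "$target_dir" &&
    tar -xf "$archive" -C "$target_dir" &&
    find "$target_dir" -name '*.ckpt.json' | sort
}

# Assumption: calamari-model.tar.gz is a placeholder archive name.
if [ -f calamari-model.tar.gz ]; then
    unpack_calamari_model calamari-model.tar.gz "$HOME/calamari-models"
fi
```

The directory containing the printed files is the one to use in the wildcard path passed as the checkpoint parameter.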