FAQ

General

Where can I start my journey into the OCR-D ecosphere?

Who is the target audience of OCR-D?

OCR-D’s primary target audience is libraries and archives that digitize historical prints at scale.

Where can I get support on OCR-D?

What is the difference between OCR-D and ABBYY?

ABBYY is a software developer producing the ABBYY Recognition Server which offers layout detection and text recognition with a pay-per-page pricing model. OCR-D is a project that integrates a wide variety of solutions for the full gamut of possible OCR workflow steps. ABBYY is simple to use but offers few options for customization whereas OCR-D workflows can be fine-tuned for best recognition of specific corpora. OCR-D has a strong focus on historical prints, trainable layout detection and text recognition and open interfaces to accommodate future developments, whereas ABBYY performs more strongly for modern print. Finally, OCR-D is a community effort with a strong focus on transparency and Free Software.

What is the difference between OCR-D and Tesseract?

Tesseract is the leading Free Software OCR solution and tightly integrated into OCR-D in both a technical and organizational sense. Technically, Tesseract has been wrapped as ocrd_tesserocr, an OCR-D-compliant processor that is more powerful than the command line tool bundled with Tesseract. Organizationally, Tesseract maintainers and contributors have been part of the OCR-D project from the beginning and the originally OCR-D-developed Tesseract training tool tesstrain has been adopted by the wider Tesseract community.

What is the difference between OCR-D and TRANSKRIBUS?

TRANSKRIBUS is a software platform and server infrastructure to make it easier for Digital Humanities practitioners to collaborate on Handwriting Text Recognition. Apart from the different use cases

Is OCR-D production-ready?

Yes! Several libraries in Germany (e.g. Staatsbibliothek Berlin, ULB Göttingen, ULB Sachsen-Anhalt) are already using OCR-D at large scale, with over 10 million pages digitized.

Which formats are supported by OCR-D?

OCR-D is primarily based around METS as a container format and PAGE-XML for layout detection and text recognition results. Other OCR formats such as ALTO, hOCR or ABBYY FineReader XML are supported through conversion with ocrd_fileformat.

The preferred image format within OCR-D is TIFF but PNG and JPEG are also supported. JPEG2000 is not currently supported but can be added in the future if there is demand for it.

Why does OCR-D need METS files? How can I process images without METS?

The processes within OCR-D are designed around METS for the simple reason that it is such a ubiquitous and well-defined format, used in libraries and archives around the world. By relying on a container format instead of just images, processors can make use of more information and can store detailed results in a well-defined fashion.

If the data to be processed isn’t already described by a METS file, the ocrd command line tool offers simple ways to create new METS files or augment existing ones.
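As a rough sketch of what creating a METS file for a folder of images could look like when scripted, the following Python snippet only builds the ocrd calls and prints them, without running anything. The subcommands and flags used (ocrd workspace init, ocrd workspace -d DIR add with -G, -i, -g and -m) are assumptions that should be checked against ocrd workspace --help for the installed version; the workspace directory, the file IDs and the page IDs are purely hypothetical.

```python
import mimetypes
from pathlib import Path


def mets_commands(workspace_dir, image_paths, file_grp="DEFAULT"):
    """Build (but do not run) ocrd calls that create a METS file in
    workspace_dir and register every image in the file group file_grp.

    The flag names are assumptions; verify with `ocrd workspace add --help`.
    """
    commands = [["ocrd", "workspace", "init", str(workspace_dir)]]
    for number, image in enumerate(sorted(map(Path, image_paths)), start=1):
        mimetype, _ = mimetypes.guess_type(image.name)
        commands.append([
            "ocrd", "workspace", "-d", str(workspace_dir), "add",
            "-G", file_grp,                       # file group to add the image to
            "-i", f"{file_grp}_{number:04d}",     # hypothetical file ID scheme
            "-g", f"PHYS_{number:04d}",           # hypothetical page ID scheme
            "-m", mimetype or "application/octet-stream",
            str(image),
        ])
    return commands


# Hypothetical workspace directory and image paths, for illustration only.
for argv in mets_commands("my-workspace", ["scans/page2.png", "scans/page1.tif"]):
    print(" ".join(argv))
```

Each printed line can be reviewed and then run in a shell, or passed to subprocess.run once the flags have been confirmed.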

How much does it cost to deploy OCR-D?

OCR-D is Free Software, licensed under the terms of the Apache 2.0 license and will be free to use and adapt in perpetuity.

What are the system requirements for OCR-D software?

The OCR-D/core framework is fairly light compared with other interoperability platforms. System requirements therefore depend on the actual processors to be used and the scale of the operation. OCR-D can be used on commodity hardware such as desktop PCs and laptops, but it can also be deployed to massive servers or even single-board computers.

However, OCR workflows can be very memory-intensive, in particular when working with large neural network models that have to be loaded into memory. We recommend at least 16 GB of RAM to support even the most demanding workflow steps.

Another bottleneck for OCR workflows is input/output. We recommend storing data on SSD instead of HDD.

CLI

How can I find out the version of OCR-D software?

To find the version of the OCR-D/core framework installed, run the ocrd CLI with the --version flag:

$ ocrd --version
ocrd, version 2.2.2

All OCR-D processors also support the --version flag, e.g.:

$ ocrd-tesserocr-recognize --version
Version 0.7.0, ocrd/core 2.2.2
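
When a script needs to check for a minimum version, the output can be parsed. The following is a minimal sketch that assumes the output looks exactly like the two samples above; the helper name parse_versions is hypothetical, and the sample strings stand in for real command output.

```python
import re

# Matches plain x.y.z version numbers such as 2.2.2
VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def parse_versions(output):
    """Return every x.y.z version found in output as a tuple of ints."""
    return [tuple(int(part) for part in match)
            for match in VERSION_RE.findall(output)]


# Sample strings copied from the outputs shown above.
core = parse_versions("ocrd, version 2.2.2")[0]
processor, processor_core = parse_versions("Version 0.7.0, ocrd/core 2.2.2")

print(core >= (2, 2, 0))          # tuple comparison works component by component
print(processor, processor_core)
```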

How do I get help on ocrd CLI commands?

Every command and subcommand of the ocrd CLI tool supports the --help option to print a description, arguments and options:

ocrd --help
ocrd workspace --help
ocrd workspace add --help

How do I get help on OCR-D processors?

All OCR-D-compliant processors support the -h/--help flag as well:

$ ocrd-tesserocr-recognize --help

How can I specify parameters on the command line?

Parameters to an OCR-D-compliant processor must be specified as JSON. The JSON data is passed to a processor with the -p CLI option, whose argument is either the name of a file containing the JSON data or the JSON data itself:

ocrd-tesserocr-recognize -I IN -O OUT -p '{"model": "Fraktur"}'
# same effect:
echo '{"model": "Fraktur"}' > /tmp/params.json
ocrd-tesserocr-recognize -I IN -O OUT -p /tmp/params.json
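
When processors are called from a script, generating the JSON programmatically avoids shell quoting mistakes. The sketch below builds both variants of the call as argument lists without executing them; it assumes the processor name ocrd-tesserocr-recognize and the placeholder file groups IN and OUT from the example above, and writes the parameter file to a temporary directory chosen only for illustration.

```python
import json
import tempfile
from pathlib import Path

params = {"model": "Fraktur"}

# Variant 1: pass the JSON data itself as the argument of -p.
inline_argv = ["ocrd-tesserocr-recognize", "-I", "IN", "-O", "OUT",
               "-p", json.dumps(params)]

# Variant 2: write the JSON to a file and pass the file name instead.
params_file = Path(tempfile.mkdtemp()) / "params.json"
params_file.write_text(json.dumps(params), encoding="utf-8")
file_argv = ["ocrd-tesserocr-recognize", "-I", "IN", "-O", "OUT",
             "-p", str(params_file)]

print(inline_argv[-1])
print(file_argv[-1])
```

Either list can be handed to subprocess.run as-is; because no shell is involved, the JSON needs no additional quoting.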

How do I specify multiple input/output file groups?

You can specify multiple file group names for both input and output by joining the names with a comma (,).

ocrd-tesserocr-recognize -I DEFAULT,REGIONS -O OCR-TESSERACT

This would instruct ocrd-tesserocr-recognize to take images from the DEFAULT group and region-segmented layout information from the REGIONS group.
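
In a script, the comma-separated value can be assembled from a list of group names. This is a minimal sketch; the helper join_file_groups is hypothetical, and rejecting names that contain commas or whitespace is an assumption made here to keep the joined value unambiguous, not a rule taken from OCR-D.

```python
def join_file_groups(names):
    """Join file group names into the comma-separated form used by -I and -O."""
    for name in names:
        # Assumption: a comma or whitespace inside a name would make the
        # joined value ambiguous, so such names are rejected up front.
        if not name or "," in name or any(ch.isspace() for ch in name):
            raise ValueError(f"unsuitable file group name: {name!r}")
    return ",".join(names)


argv = ["ocrd-tesserocr-recognize",
        "-I", join_file_groups(["DEFAULT", "REGIONS"]),
        "-O", join_file_groups(["OCR-TESSERACT"])]
print(" ".join(argv))
```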

How do I stop TensorFlow logging spam?

As @bertsky notes, TensorFlow's native log output can be tamed from Python by setting os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' before the tensorflow module gets imported.

To achieve the same, run this before executing a TF-based processor in the shell (or even add it to your $HOME/.bashrc to set this permanently):

export TF_CPP_MIN_LOG_LEVEL=3
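
If a processor is launched from a Python script rather than from the shell, the variable can instead be set for that child process only. The following is a sketch under that assumption; run_quietly is a hypothetical helper, and a small Python one-liner stands in for a real TF-based processor so that the example runs anywhere.

```python
import os
import subprocess
import sys


def run_quietly(argv):
    """Run argv with TF_CPP_MIN_LOG_LEVEL=3 set only in the child's environment."""
    env = dict(os.environ, TF_CPP_MIN_LOG_LEVEL="3")
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=True)


# Stand-in for a TF-based processor: the child just prints the variable it sees.
result = run_quietly([sys.executable, "-c",
                      "import os; print(os.environ['TF_CPP_MIN_LOG_LEVEL'])"])
print(result.stdout.strip())
```

The environment of the calling script is left untouched, which can be preferable to a permanent setting when only some workflow steps are noisy.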