OCR-D Glossary

Glossary of terms from the domain of image processing/OCR and how they are used within the OCR-D framework

This section is non-normative.

Layout and Typography

Block

See Region

Border

From the PAGE-XML content schema documentation

Border of the actual page (if the scanned image contains parts not belonging to the page).

Font family

Within OCR-D, font family refers to grouping elements by font similarity. The semantics of a font family are up to the data producer.

Glyph

Within OCR-D, a glyph is the atomic unit within a word.
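
As an illustration of what "atomic unit within a word" can mean for text data, the following minimal sketch groups each base character with the combining marks that follow it. This is an assumption made for illustration only: the function name grapheme_clusters is hypothetical, and full Unicode grapheme cluster segmentation involves more rules than this.

```python
import unicodedata


def grapheme_clusters(text):
    """Group each base character with its trailing combining marks.

    Simplified approximation: a character with a non-zero canonical
    combining class is attached to the preceding cluster.
    """
    clusters = []
    for char in text:
        if clusters and unicodedata.combining(char):
            clusters[-1] += char  # attach combining mark to previous base
        else:
            clusters.append(char)  # start a new cluster
    return clusters


# "a" followed by U+0364 (combining Latin small letter e), as found in
# historical German prints, counts as one unit in this sketch.
word = "Ma\u0364dchen"
print(len(word), len(grapheme_clusters(word)))
```

The word has eight code points but seven units in this sketch, because the combining mark is merged with its base letter.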

Grapheme Cluster

See Glyph

Line

See TextLine

Reading Order

Reading order describes the logical sequence of regions within a document.

Region

A region is described by a polygon inside a page.
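
A minimal sketch of a region as a polygon, assuming the polygon is given as a list of (x, y) pixel coordinates. The helper names bounding_box and polygon_area are hypothetical and not part of any OCR-D API.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def bounding_box(polygon: List[Point]) -> Tuple[int, int, int, int]:
    """Return (x_min, y_min, x_max, y_max) of a polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def polygon_area(polygon: List[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    total = 0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # wrap around to the first point
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


# A rectangular region, 100 px wide and 50 px high
region = [(10, 10), (110, 10), (110, 60), (10, 60)]
print(bounding_box(region))  # (10, 10, 110, 60)
print(polygon_area(region))  # 5000.0
```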

Region type

The semantics or function of a region such as heading, page number, column, table…

Symbol

See Glyph

TextLine

A TextLine is a region of text without a line break.

Word

A word is a sequence of glyphs not containing any word-bounding whitespace.

Data

Ground Truth

In the context of OCR-D, ground truth (GT) consists of transcriptions, specific structure descriptions and word lists. These are mainly available in PAGE XML format in combination with the original image. Essential parts of the GT were created manually.

We distinguish different usage scenarios for GT:

Reference data

With the term reference data, we refer to data that illustrates different stages of an OCR/OLR process on representative materials. They are supposed to support the assessment of commonly encountered difficulties and challenges when running certain analysis operations and are therefore manually annotated at all levels.

Evaluation data

Evaluation data are used to quantitatively evaluate the performance of OCR tools and/or algorithms. Parts of these data which correspond to the tool(s) under consideration are guaranteed to be recorded manually.
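
One common way to evaluate recognition output quantitatively against manually recorded text is the character error rate (CER). The sketch below is an illustration under that assumption, not a description of how OCR-D evaluates tools; the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, ocr_text: str) -> float:
    """Edit distance divided by the length of the ground truth text."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_text) / len(ground_truth)


# One substituted character in a seven-character word
print(character_error_rate("Fraktur", "Frakiur"))
```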

Training data

Many OCR-related tools need to be adapted to the specific domain of the works which are to be processed. This domain adaptation is called training. Data used to guide this process are called training data. It is essential that those parts of these data which are fed to the training algorithm are captured manually.

Activities

Binarization

Binarization means converting all color or grayscale pixels in an image to either black or white.

Controlled term: binarized (comments of a mets:file), preprocessing/optimization/binarization (step in ocrd-tool.json)

See Felix Niklas’ interactive demo.
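
A minimal sketch of binarization, assuming a grayscale image stored as nested lists of values from 0 (black) to 255 (white) and a fixed global threshold. Both the representation and the threshold value are assumptions for illustration; practical tools usually choose thresholds adaptively.

```python
def binarize(gray, threshold=128):
    """Map every pixel to 0 (black) or 255 (white) using a global threshold."""
    return [
        [0 if pixel < threshold else 255 for pixel in row]
        for row in gray
    ]


gray_image = [
    [12, 200, 130],
    [127, 128, 255],
]
print(binarize(gray_image))  # [[0, 255, 255], [0, 255, 255]]
```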

Dewarping

Manipulate an image in such a way that all text lines are straightened and any geometrical distortions have been corrected.

Controlled term: preprocessing/optimization/dewarping

See Matt Zucker’s entry on Dewarping.

Despeckling

Remove artifacts such as smudges, ink blots, underlinings etc. from an image. Typically applied to remove “salt-and-pepper” noise resulting from Binarization.

Controlled term: preprocessing/optimization/despeckling
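
A minimal sketch of despeckling on a binarized image, assuming that ink pixels are encoded as 1 and background as 0 and that a speckle is any 4-connected group of ink pixels smaller than min_size. The function name, the encoding and the size criterion are assumptions for illustration.

```python
from collections import deque


def despeckle(binary, min_size=3):
    """Remove 4-connected ink components with fewer than min_size pixels."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    result = [row[:] for row in binary]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Collect the connected component containing (y, x)
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Erase components that are too small to be part of a glyph
            if len(component) < min_size:
                for cy, cx in component:
                    result[cy][cx] = 0
    return result


noisy = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
]
for row in despeckle(noisy):
    print(row)
```

In this example the two isolated pixels in the corners are removed, while the 2×2 block is kept.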

Deskewing

Rotate an image so that all text lines are horizontal.

Controlled term: preprocessing/optimization/deskewing
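
The geometry behind deskewing can be sketched as follows, assuming the skew is estimated from the two end points of a single text baseline (a hypothetical input) and then undone by rotating coordinates. Real tools estimate the angle from pixel data and resample the whole image; the function names here are illustrative only.

```python
import math


def skew_angle(baseline):
    """Angle in degrees between the baseline's end points and the x-axis."""
    (x1, y1), (x2, y2) = baseline[0], baseline[-1]
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def rotate_point(point, angle_deg, center=(0.0, 0.0)):
    """Rotate a point around center by angle_deg (standard rotation matrix)."""
    angle = math.radians(angle_deg)
    px, py = point[0] - center[0], point[1] - center[1]
    return (
        center[0] + px * math.cos(angle) - py * math.sin(angle),
        center[1] + px * math.sin(angle) + py * math.cos(angle),
    )


# A baseline that drifts 5 px vertically over 100 px horizontally
baseline = [(0, 0), (100, 5)]
angle = skew_angle(baseline)

# Rotating by the negative skew angle makes the baseline horizontal
deskewed = [rotate_point(p, -angle) for p in baseline]
print(round(angle, 2), deskewed)
```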

Font identification

Detect the font type(s) used in the document, either before or after an OCR run.

Controlled term: recognition/font-identification

Grayscale normalization

ISSUE: https://github.com/OCR-D/spec/issues/41

Controlled term:

Grayscale normalization is similar to binarization, but instead of a purely bitonal image, the output can also contain shades of gray to avoid inadvertently combining glyphs when they are very close together.
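
One possible reading of this activity is linear contrast stretching, where the darkest pixel is mapped to 0, the brightest to 255 and everything in between stays gray. The sketch below assumes that reading and a nested-list image representation; it is not a definition endorsed by the glossary.

```python
def stretch_contrast(gray):
    """Linearly rescale pixel values so they span the full 0..255 range."""
    pixels = [pixel for row in gray for pixel in row]
    if not pixels:
        return []
    low, high = min(pixels), max(pixels)
    if low == high:
        # A flat image has no contrast to stretch; return a copy unchanged
        return [row[:] for row in gray]
    scale = 255 / (high - low)
    return [[int((pixel - low) * scale + 0.5) for pixel in row] for row in gray]


faded = [
    [50, 100],
    [150, 200],
]
print(stretch_contrast(faded))  # [[0, 85], [170, 255]]
```

Unlike binarization, intermediate values such as 85 and 170 survive, so faint gaps between neighbouring glyphs are not forced to black.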

Document analysis

Document analysis is the detection of structure on the document level, e.g. to create a table of contents.

Reading order detection

Detect the reading order of regions.
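
A naive heuristic for reading order, assuming each region is reduced to a bounding box (x0, y0, x1, y1) and that the page has a single column, so sorting top to bottom and then left to right is sufficient. The region IDs and the function name are hypothetical; multi-column layouts need more elaborate methods.

```python
def naive_reading_order(regions):
    """Order region IDs by the top edge, then the left edge, of their boxes."""
    return sorted(regions, key=lambda rid: (regions[rid][1], regions[rid][0]))


# Bounding boxes as (x0, y0, x1, y1) in pixels
regions = {
    "r_para2": (50, 400, 500, 600),
    "r_heading": (50, 20, 500, 80),
    "r_para1": (50, 100, 500, 380),
}
print(naive_reading_order(regions))  # ['r_heading', 'r_para1', 'r_para2']
```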

Cropping

Detecting the print space in a page, as opposed to the margins. It is a form of region segmentation.

Controlled term: preprocessing/optimization/cropping
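
A minimal sketch of cropping, assuming a binarized page with ink encoded as 1 and approximating the print space by the smallest rectangle that contains all ink pixels. This approximation and the function name are assumptions; real pages need handling for scan borders and marginal noise.

```python
def ink_bounding_box(binary):
    """Return (x0, y0, x1, y1) enclosing all ink pixels, or None if blank."""
    ink_rows = [y for y, row in enumerate(binary) if any(row)]
    if not ink_rows:
        return None
    width = len(binary[0])
    ink_cols = [x for x in range(width) if any(row[x] for row in binary)]
    return ink_cols[0], ink_rows[0], ink_cols[-1], ink_rows[-1]


scan = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(ink_bounding_box(scan))  # (1, 1, 3, 2)
```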

Border removal

See Cropping

Segmentation

Segmentation means detecting areas within an image.

Specific segmentation algorithms are labelled by the semantics of the regions they detect, not the semantics of the input, i.e. an algorithm that detects regions is called region segmentation.

Region segmentation

Segment an image into regions. Also determines whether each region is a text or non-text region (e.g. an image).

Controlled term:

Region classification

Determine the type of a detected region.

Line segmentation

Segment text regions into textlines.

Controlled term:
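
A classic way to illustrate line segmentation is a horizontal projection profile: rows that contain ink are grouped into runs, and each run is treated as one textline. The sketch assumes a deskewed, binarized text region with ink encoded as 1; the function name is hypothetical.

```python
def segment_lines(binary):
    """Return (first_row, last_row) for each run of consecutive ink rows."""
    lines = []
    start = None  # first row of the line currently being tracked
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            lines.append((start, y - 1))
            start = None
    if start is not None:
        # The last line runs to the bottom edge of the region
        lines.append((start, len(binary) - 1))
    return lines


text_region = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]
print(segment_lines(text_region))  # [(1, 2), (4, 4)]
```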

MP

Module Project, a software project producing one or more tools. Tools can comprise multiple methods/activities that are called processors for OCR-D. There were eight MPs in the second phase of OCR-D (2018–2020).

OCR

Map pixel areas to glyphs and words.

Processor

A processor is a method provided by a tool that implements the OCR-D CLI and implements one or more activities.

OCR-D Workflow Guide

Word segmentation

Segment a textline into words.

Controlled term:
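
Word segmentation can be sketched with a vertical projection profile: columns without ink form gaps, and only gaps of at least min_gap columns are treated as word boundaries, so that narrow gaps between glyphs stay inside a word. The ink encoding (1), the min_gap value and the function name are assumptions for illustration.

```python
def segment_words(textline, min_gap=2):
    """Return (first_col, last_col) per word in a binarized textline image."""
    width = len(textline[0]) if textline else 0
    ink_columns = [any(row[x] for row in textline) for x in range(width)]

    words = []
    start = None     # first ink column of the current word
    last_ink = None  # most recent ink column seen
    for x, has_ink in enumerate(ink_columns):
        if not has_ink:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= min_gap:
            # The empty run before this column is wide enough to split words
            words.append((start, last_ink))
            start = x
        last_ink = x
    if start is not None:
        words.append((start, last_ink))
    return words


line_image = [
    [1, 1, 0, 1, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 0, 0, 1],
]
print(segment_words(line_image))  # [(0, 3), (7, 8)]
```

The one-column gap at index 2 is kept inside the first word, while the three-column gap separates the two words.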

Glyph segmentation

Segment a textline into glyphs.

Controlled term: SEG-GLYPH

Text recognition

See OCR.

Text optimization

Text optimization encompasses the manipulations to the text based on the steps up to and including text recognition. This includes (semi-)automatically correcting recognition errors, orthographical harmonization, fixing segmentation errors etc.

Data Persistence

Software repository

The software repository contains all OCR-D algorithms and tools developed during the project including tests. It will also contain the documentation and installation instructions for deploying a document analysis workflow.

Ground Truth repository

Contains all the ground truth data.

Research data repository

The research data repository may contain the results of all activities during document analysis. At a minimum, it contains the end results of every processed document and its full provenance. The research data repository must be available locally.

Model repository

Contains all trained (OCR) models for text recognition. The model repository has to be available at least locally. Ideally, a publicly available model repository will be developed.

OCR-D modules

The OCR-D project divided the various elements of an OCR workflow into six modules.

Image preprocessing

Manipulating the input images for subsequent layout analysis and text recognition.

Layout analysis

Detection of structure within the page.

Text recognition and optimization

Recognition of text and post-correction of recognition errors.

Model training

Generating data files from aligned ground truth text and images to configure the prediction of text and layout recognition engines.

Long-term preservation and persistence

Storing results of OCR and OLR indefinitely, taking into account versioning, multiple runs, provenance/parametrization and providing access to these saved snapshots in a granular fashion.

Print space

From the PAGE-XML content schema documentation

Determines the effective area on the paper of a printed page. Its size is equal for all pages of a book (exceptions: titlepage, multipage pictures).

It contains all living elements (except marginals) like body type, footnotes, headings, running titles.

It does not contain pagenumber (if not part of running title), marginals, signature mark, preview words.

Quality assurance

Providing measures, algorithms and software to estimate the quality of the individual processes within the OCR-D domain.