OCR-D Glossary
Glossary of terms from the domain of image processing/OCR and how they are used within the OCR-D framework.
This section is non-normative.
Layout and Typography
Block
See Region
Border
From the PAGE-XML content schema documentation
Border of the actual page (if the scanned image contains parts not belonging to the page).
Font family
Within OCR-D, font family refers to grouping elements by font similarity. The semantics of a font family are up to the data producer.
Glyph
Within OCR-D, a glyph is the atomic unit within a word.
Grapheme Cluster
See Glyph
Line
See TextLine
Reading Order
Reading order describes the logical sequence of regions within a document.
Region
A region is described by a polygon inside a page.
Region type
The semantics or function of a region such as heading, page number, column, table…
Symbol
See Glyph
TextLine
A text line is a single row of words within a text region. (Depending on the region’s or page’s orientation, and the script’s writing direction, it can be horizontal or vertical.)
Print space
From the PAGE-XML content schema documentation
Determines the effective area on the paper of a printed page. Its size is equal for all pages of a book (exceptions: titlepage, multipage pictures).
It contains all living elements (except marginalia) like paragraphs, footnotes, headings and running titles.
It does not contain pagenumber (if not part of running title), marginalia, signature mark, preview words.
Word
A word is a sequence of glyphs within a line which does not contain any word-bounding whitespace. (That is, it includes punctuation and is synonym to token in NLP.)
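As a minimal sketch of this definition, splitting the text of a line at whitespace yields words with their punctuation still attached. The sample line is an assumption chosen for illustration only.

```python
# Hypothetical text of a single line; not taken from any OCR-D data set.
line_text = "Hello, world! This is OCR-D."

# str.split() without arguments splits at runs of whitespace,
# so punctuation stays attached to the neighbouring word.
words = line_text.split()
print(words)
```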
Data
Ground Truth
Ground truth (GT) in the context of OCR-D comprises transcriptions, specific structure descriptions and word lists. It is mainly available in PAGE XML format in combination with the original image. Essential parts of the GT were created manually.
We distinguish different usage scenarios for GT:
Reference data
With the term reference data, we refer to data that illustrates different stages of an OCR/OLR process on representative materials. They are supposed to support the assessment of commonly encountered difficulties and challenges when running certain analysis operations and are therefore manually annotated at all levels.
Evaluation data
Evaluation data are used to quantitatively evaluate the performance of OCR tools and/or algorithms. Parts of these data which correspond to the tool(s) under consideration are guaranteed to be recorded manually.
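One common way to evaluate OCR output quantitatively against manually recorded text is the character error rate (CER). The glossary does not prescribe a metric, so the following is only a sketch under that assumption; the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the ground truth text."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


print(character_error_rate("abcd", "abed"))
```

Normalizing by the ground truth length means the rate can exceed 1.0 when the OCR output is much longer than the reference.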
Training data
Many OCR-related tools need to be adapted to the specific domain of the works which are to be processed. This domain adaptation is called training. Data used to guide this process are called training data. It is essential that those parts of these data which are fed to the training algorithm are captured manually.
Activities
Binarization
Binarization means converting all color or grayscale pixels in an image to either black or white.
Controlled term: binarized (comments of a mets:file), preprocessing/optimization/binarization (step in ocrd-tool.json)
See Felix Niklas’ interactive demo
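The simplest form of binarization is a fixed global threshold. The sketch below assumes a grayscale image given as a list of pixel rows with values from 0 to 255; the threshold of 128 is an arbitrary assumption, and practical tools usually choose thresholds adaptively.

```python
def binarize(gray, threshold=128):
    """Map each grayscale pixel (0-255) to 0 (black) or 255 (white)."""
    return [[255 if pixel >= threshold else 0 for pixel in row] for row in gray]


# Hypothetical 2x3 grayscale image.
page = [
    [250, 240, 30],
    [20, 128, 127],
]
print(binarize(page))
```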
Dewarping
Manipulate an image in such a way that all text lines are straightened and any geometrical distortions have been corrected.
Controlled term: preprocessing/optimization/dewarping
See Matt Zucker’s entry on Dewarping.
Despeckling
Remove artifacts such as smudges, ink blots, underlinings etc. from an image. Typically applied to remove “salt-and-pepper” noise resulting from Binarization.
Controlled term: preprocessing/optimization/despeckling
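A very simple despeckling rule, given here only as an illustration, removes foreground pixels that have no foreground neighbour. The image representation (rows of 0/1 values with 1 as foreground) and the function name are assumptions; real tools typically work on connected components of configurable size.

```python
def despeckle(bitonal):
    """Remove foreground pixels (1) with no foreground among their 8 neighbours."""
    height = len(bitonal)
    width = len(bitonal[0]) if height else 0
    cleaned = [row[:] for row in bitonal]
    for y in range(height):
        for x in range(width):
            if bitonal[y][x] != 1:
                continue
            neighbours = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        neighbours += bitonal[ny][nx]
            if neighbours == 0:
                cleaned[y][x] = 0  # isolated speck
    return cleaned


# Hypothetical image: one isolated speck (top left) and a two-pixel stroke.
noisy = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
print(despeckle(noisy))
```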
Deskewing
Rotate an image so that all text lines are horizontal.
Controlled term: preprocessing/optimization/deskewing
Font identification
Detect the font type(s) used in the document, either before or after an OCR run.
Controlled term: recognition/font-identification
Grayscale normalization
ISSUE: https://github.com/OCR-D/spec/issues/41
Controlled term: gray_normalized (comments in file), preprocessing/optimization/cropping (step)
Gray normalization is similar to binarization but instead of a purely bitonal image, the output can also contain shades of gray to avoid inadvertently combining glyphs when they are very close together.
Document analysis
Document analysis is the detection of structure at the document level, for example to create a table of contents.
Reading order detection
Detect the reading order of regions.
Cropping
Detecting the print space in a page, as opposed to the margins. It is a form of region segmentation.
Controlled term: preprocessing/optimization/cropping
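As a naive stand-in for print space detection, the sketch below takes the bounding box of all foreground pixels in a bitonal image and cuts the image down to it. This is an assumption made for illustration: it ignores marginalia, noise and page borders, which real cropping tools have to handle. All names are hypothetical.

```python
def detect_print_space(bitonal):
    """Return (x_min, y_min, x_max, y_max), inclusive, of all foreground pixels (1).

    Returns None for an image without any foreground.
    """
    xs, ys = [], []
    for y, row in enumerate(bitonal):
        for x, pixel in enumerate(row):
            if pixel == 1:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)


def crop(image, box):
    """Cut the image (a list of pixel rows) down to the inclusive box."""
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max + 1] for row in image[y_min:y_max + 1]]


# Hypothetical page with an empty margin around the content.
page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
box = detect_print_space(page)
print(box, crop(page, box))
```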
Border removal
See Cropping
Segmentation
Segmentation means detecting areas within an image.
Specific segmentation algorithms are labelled by the semantics of the regions they detect, not the semantics of the input. For example, an algorithm that detects regions is called region segmentation.
Region segmentation
Segment an image into regions. Also determines whether each region is a text or non-text region (e.g. images).
Controlled term: SEG-REGION (USE), layout/segmentation/region (step)
Region classification
Determine the type of a detected region.
Line segmentation
Segment text regions into textlines.
Controlled term: SEG-LINE (USE), layout/segmentation/line (step)
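A classic heuristic for line segmentation in deskewed, horizontally written text is the horizontal projection profile: consecutive pixel rows that contain foreground form one text line. The sketch below assumes a bitonal text region given as rows of 0/1 values; it is an illustration, not the method of any particular OCR-D processor.

```python
def segment_lines(bitonal):
    """Return inclusive (top, bottom) row indices for each run of rows with foreground."""
    lines = []
    start = None
    for y, row in enumerate(bitonal):
        has_ink = any(row)
        if has_ink and start is None:
            start = y  # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))  # the current text line ended above
            start = None
    if start is not None:
        lines.append((start, len(bitonal) - 1))
    return lines


# Hypothetical text region with two lines separated by an empty row.
region = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
]
print(segment_lines(region))
```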
Line recognition
See OCR.
OCR
Map pixel areas to glyphs and words.
Word segmentation
Segment a textline into words.
Controlled term: SEG-WORD (USE), layout/segmentation/word (step)
Glyph segmentation
Segment a textline into glyphs.
Controlled term: SEG-GLYPH
Text recognition
See OCR.
Text optimization
Text optimization encompasses manipulations of the text based on the steps up to and including text recognition. This includes (semi-)automatically correcting recognition errors, orthographical harmonization, fixing segmentation errors etc.
Data Persistence
Software repository
The software repository contains all OCR-D algorithms and tools developed during the project including tests. It will also contain the documentation and installation instructions for deploying a document analysis workflow.
Ground Truth repository
Contains all the ground truth data.
Research data repository
The research data repository may contain the results of all activities during document analysis. At a minimum, it contains the end results of every processed document and its full provenance. The research data repository must be available locally.
Model repository
Contains all trained (OCR) models for text recognition. The model repository has to be available at least locally. Ideally, a publicly available model repository will be developed.
Workspace
A workspace is a representation of a document in the local file system. Minimally, it consists of a directory with a copy of the METS file. Additionally, that directory may contain physical data files and sub-directories belonging to the document (required or generated by run-time OCR-D processing), as referenced by the METS via mets:file/mets:FLocat/@href and mets:fileGrp/@USE. Files and sub-directories that are not referenced (like log or config files) are not part of the workspace, and neither are references to remote locations. They can be added to the workspace by referencing them in the METS via their relative local path names.
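To illustrate which files belong to a workspace, the following sketch lists the local file references of a METS document together with the USE of their file group. The sample METS, its file names and the "://" test for remote locations are assumptions; the href attribute is read from the xlink namespace, as in standard METS.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Hypothetical minimal METS with one local and one remote file reference.
METS_XML = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG_0001">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="OCR-D-IMG/0001.tif"/>
      </mets:file>
      <mets:file ID="IMG_0002">
        <mets:FLocat LOCTYPE="URL" xlink:href="https://example.org/0002.tif"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def local_files(mets_xml):
    """Return (USE, relative path) pairs for all non-remote file references."""
    root = ET.fromstring(mets_xml)
    href_attr = "{%s}href" % NS["xlink"]
    result = []
    for group in root.iterfind(".//mets:fileGrp", NS):
        use = group.get("USE")
        for flocat in group.iterfind("mets:file/mets:FLocat", NS):
            href = flocat.get(href_attr)
            # Assumption: anything with a URL scheme counts as remote.
            if href and "://" not in href:
                result.append((use, href))
    return result


print(local_files(METS_XML))
```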
Workflow modules
The OCR-D project divided the various elements of an OCR workflow into six abstract modules.
Image preprocessing
Manipulating the input images for subsequent layout analysis and text recognition.
Layout analysis
Detection of structure within the page.
Text recognition and optimization
Recognition of text and post-correction of recognition errors.
Model training
Generating data files from aligned ground truth text and images to configure the prediction of text and layout recognition engines.
Long-term preservation and persistence
Storing results of OCR and OLR indefinitely, taking into account versioning, multiple runs, provenance/parametrization and providing access to these saved snapshots in a granular fashion.
Quality assurance
Providing measures, algorithms and software to estimate the quality of the individual processes within the OCR-D domain.
Component architecture
(OCR-D) Application
Overall system consisting of various servers on which processors can be run; it can be a single workstation, a distributed system made up of a controller and several processing servers, or an HPC cluster.
(OCR-D) Web API
As outlined in OCR-D/spec#173: uniformly defined, interrelated services that in practice (depending on the IP scenario) can be distributed across different network components.
(OCR-D) Service
Group of functions from the Web API; discovery/workspace/processing/workflow/…
(OCR-D) Server
Concrete web server for a subset of the services.
(OCR-D) Controller
Server (at least discovery + workspace + workflow) that executes workflows (one or several at a time) by distributing them to the processing servers known to it, providing and retrieving the respective workspaces in each case; load balancing also belongs here.
(OCR-D) Processing Server
Server (at least discovery + processing) that runs one or more (locally installed) processors or evaluators (but only one at a time), fetching and extending the respective workspaces in each case; this is where the trade-off belongs between several OPS on one (multiscalar) machine and a single OPS with page-parallel processing, as well as GPU-specific OPS (with CUDA processors only) or similar installations.
(OCR-D) Backend
Network-specific software component of a server; e.g. a Python library with a request handler, implementing service discovery and network-capable workspace management.
(OCR-D) Workflow Runtime Library
Model-specific software component of a server or processor; e.g. a Python API in core with classes for all essential functional parts (OcrdPage, OcrdMets, Workspace, Resolver, Processor, ProcessorTask, Workflow, WorkflowTask), including mechanisms for signalling and flow control of workflows, with which the individual components (from processor to controller) can be implemented.
(OCR-D) Workflow Engine
Central software component in the controller that executes workflows including their control structures (linear/parallel/incremental); also needed on single-workstation installations with command-line interfaces (where it can be implemented on the basis of inter-process communication and file system I/O), such as ocrd process.
Processor
A processor is a method provided by a tool that implements the OCR-D CLI and carries out one or more OCR-related activities.
Evaluator
CLI tool that assesses the quality of the result annotation of a particular workflow step or processor and signals complete or partial success relative to a given threshold.
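The threshold-based signalling of an evaluator could be sketched as follows. Everything here is hypothetical: the per-page scores, the Outcome names and the rule that "partial" means some but not all pages reach the threshold are assumptions for illustration, not part of any OCR-D interface. The FAILURE case is likewise an addition of this sketch.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    PARTIAL = "partial"
    FAILURE = "failure"


def evaluate(page_scores, threshold):
    """Compare per-page quality scores (higher is better) with a threshold.

    page_scores maps a page ID to a score; both are hypothetical inputs.
    """
    if not page_scores:
        raise ValueError("no pages to evaluate")
    passed = [page for page, score in page_scores.items() if score >= threshold]
    if len(passed) == len(page_scores):
        return Outcome.SUCCESS
    if passed:
        return Outcome.PARTIAL
    return Outcome.FAILURE


print(evaluate({"p0001": 0.97, "p0002": 0.62}, threshold=0.9))
```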
Module
Modules are software packages/repositories that contain one or more methods/activities in the form of processors or evaluators.
Messaging
Notification system based on publish/subscribe architectures (or similar) for coordinating network components; used here, among other things, for distributing tasks and balancing their load, and for signalling processor/evaluator results.
OCR-D Workflow
Configuration of activities through processors/evaluators and their parameters, depending on their success. Implemented as the OCR-D Workflow Runtime Library and serializable in a format yet to be specified (as of 2020/10).
In other contexts the term workflow is used more broadly and may, for example, include manual intervention by the user. In contrast to this terminology in workflow engines such as Taverna or digitization frameworks such as Kitodo, an OCR-D workflow denotes a fully automatic process.
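The idea of a fully automatic sequence of steps whose continuation depends on the success of each step can be sketched as below. This is not the OCR-D Workflow Runtime Library: the Step class, run_workflow and the toy processors and evaluators are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Step:
    name: str
    processor: Callable[[dict, dict], dict]  # (document, parameters) -> document
    evaluator: Callable[[dict], bool]        # document -> success?
    parameters: Dict[str, object] = field(default_factory=dict)


def run_workflow(steps, document):
    """Run steps in order without user interaction; stop after the first failed evaluation."""
    completed = []
    for step in steps:
        document = step.processor(document, step.parameters)
        if not step.evaluator(document):
            break  # later steps depend on the success of this one
        completed.append(step.name)
    return document, completed


# Toy steps: the second one fails its evaluation, so the third never runs.
steps = [
    Step("binarize",
         lambda doc, params: {**doc, "binarized": True},
         lambda doc: doc.get("binarized", False)),
    Step("segment",
         lambda doc, params: {**doc, "lines": params["lines"]},
         lambda doc: doc["lines"] > 0,
         {"lines": 0}),
    Step("recognize",
         lambda doc, params: {**doc, "text": "..."},
         lambda doc: True),
]
result, completed = run_workflow(steps, {})
print(completed, result)
```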