📢 Ground truth level overview

Creating ground truth (GT) involves several facets. Among other things, GT always serves a specific purpose or use. The term ground truth derives from the German Grundwahrheit. In this sense, it generally means that everything on the printed page is reproduced faithfully. However, individual elements and regions may be reproduced either in a simplified manner or in a complex and detailed one. The extent to which this is done depends, among other things, on the purpose, use, scope, affordability, technical availability, and ... In many cases this means that individual typographic and graphemic phenomena are interpreted. To make these interpretations comprehensible, GT can be transcribed according to defined levels of fidelity to the original or, in the case of existing GT, classified or evaluated against those levels. Descriptions and explanations of the levels can be found in various places in these guidelines. The following overview lists these level descriptions.

Note: see "ground truth, n." (Grundwahrheit) in Oxford English Dictionary (OED) Online. September 2021. Oxford University Press. Link (accessed October 28, 2021).

General explanation of the ground truth levels

Transcription in the corresponding level?

Structure-Ground-Truth in the corresponding level?

Explanations of certain cases such as ligatures, punctuation marks, differences between I/J...

This part of the documentation should be particularly noted:

Overviews and examples (concerning the transcription of characters)

Level recommendation

  • OCR-D recommends Level 2 for ground truth creation/transcription.

What are the levels not?

  • In addition to transcription, the classification of the levels also serves to evaluate ground truth.
  • The levels are not a seal of quality.
  • Levels 2 and 3 make text-critical automatic capture/transcription more feasible, provided that the respective model has been trained with ground truth of that level.
  • There is only limited compatibility between texts in levels 1, 2 and 3. In most cases, there is only compatibility in the descending direction (3->2->1).
  • Automatic conversion between individual levels is possible only to a limited extent.

Problems:

Some problem cases are listed below. Because the phenomena are so heterogeneous, the list cannot be complete.

The problem of long S

see: https://de.wikipedia.org/wiki/Langes_s

Quotation from Wikipedia:

  • Wachſtube (Wach·stu·be [Room of a security guard]) and Wachstube (Wachs·tu·be [Tube (see https://www.dwds.de/wb/Tube#d-1-1) filled with wax])
  • Kreiſchen (Krei·schen, Screeching) and Kreischen (Kreis·chen, a small circle)
  • Verſendung (Ver·sen·dung [Dispatch: something is sent to another place]) and Versendung (Vers·en·dung [the end of a verse])
  • Röschenhof (Rös·chen·hof, a courtyard with small roses) and Röſchenhof (Rö·schen·hof, from given name Röschen)
  • Lachſturm (Lach·sturm [hearty laughter]) and Lachsturm (Lachs·turm [a tower made of the fish salmon])
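The word pairs above illustrate why texts are usually only compatible in the descending direction. The following is a minimal Python sketch, not part of the guidelines: the helper `to_round_s` is hypothetical and covers only one aspect of a Level 1 normalisation (replacing the long s). It shows that two different Level 2 words collapse into the same string, so the long s cannot be restored automatically without knowing the word.

```python
LONG_S = "\u017f"  # ſ, LATIN SMALL LETTER LONG S


def to_round_s(text: str) -> str:
    """Hypothetical helper: replace every long s with a round s.

    This is only one aspect of a Level 1 normalisation, used here
    purely for illustration.
    """
    return text.replace(LONG_S, "s")


# Word pairs from the list above: (spelling with long s, spelling with round s)
pairs = [
    ("Wach\u017ftube", "Wachstube"),
    ("Krei\u017fchen", "Kreischen"),
    ("Ver\u017fendung", "Versendung"),
    ("R\u00f6\u017fchenhof", "R\u00f6schenhof"),
    ("Lach\u017fturm", "Lachsturm"),
]

for with_long_s, with_round_s in pairs:
    # At Level 2 the two spellings are different words ...
    assert with_long_s != with_round_s
    # ... but after the replacement they are indistinguishable.
    assert to_round_s(with_long_s) == to_round_s(with_round_s)
```

The reverse direction would need a decision per word (for example from a lexicon or from the page image), which is why a simple character mapping is not enough.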

Problem of macron above the sign

Transcription:

vnderlaſſen/ vñ fuͤrnemlich = Level 2

General rule: Level 3: ñ, Level 2: ñ, Level 1: nn (always consider the context!)

The normalised paragraph reads:

underlassen/ und fürnemlich = Level 1

unterlassen/ und vornemlich = possible Level 1 (a very strong normalisation has been made)
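Characters with marks above them can be encoded in more than one way, which matters when transcriptions are compared. The sketch below is an illustration under assumptions, not a rule from these guidelines: it uses Python's standard `unicodedata` module to show that `n` plus a combining tilde and the single code point `ñ` look the same but compare as different strings unless both are normalised first. The helper `same_text` is hypothetical.

```python
import unicodedata

decomposed = "n\u0303"   # n + COMBINING TILDE (two code points)
precomposed = "\u00f1"   # ñ (one code point)

# Visually identical, but not equal as raw strings.
assert decomposed != precomposed

# NFC composes where a precomposed character exists, NFD decomposes.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed

# u + COMBINING LATIN SMALL LETTER E (as in "fuͤrnemlich") has no
# precomposed form, so it stays two code points even after NFC.
u_with_small_e = "u\u0364"
assert unicodedata.normalize("NFC", u_with_small_e) == u_with_small_e
assert len(u_with_small_e) == 2


def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two transcriptions after NFC normalisation."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


assert same_text("vn\u0303", "v\u00f1")
```

Normalising only removes encoding differences. Whether `ñ` should become `nn`, `n` or something else at Level 1 still depends on the context, as the general rule above says.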

Problem of normalisation of transcriptions/text

  • ihren Haaren seine Füße, = Possible level 1
  • yhren harẽ ſeyne fuſſe/ = Level 2

  • doch soll der Vischer solch(es) Haar = Possible level 1
  • doch soll der Vischer sollich haar = Level 1
  • doch ſoll der Viſcher ſollich haar = Level 2

Source: [Württemberg, Fürstentum]: Des Fürstenthumbs Wirtemberg newe Landtsordnung/ gebessert vnd gemehret/ sampt darzu gedruckten der armen Casten/ auch Holtz vnnd Vorst ordnungen. [Tübingen], 1552. In: Deutsches Textarchiv https://www.deutschestextarchiv.de/wuerttemberg_landtsordnung_1552, retrieved on 05.08.2021.
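When evaluating existing GT, a quick character check can give a first hint about how far a text has been normalised. The following Python sketch rests on a strong assumption that is not taken from these guidelines: it treats ASCII plus the German umlauts and ß as the only characters of a normalised text. The function name `non_level_1_chars` and the character set are hypothetical.

```python
import unicodedata

# Assumption for illustration only: besides ASCII, a normalised text
# uses just the German umlauts and sharp s (ä ö ü Ä Ö Ü ß).
ASSUMED_EXTRA = set("\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u00df")


def non_level_1_chars(text: str) -> list[str]:
    """Return the characters that fall outside the assumed Level 1 set.

    The text is NFC-normalised first so that, for example, e + combining
    tilde is reported as one character.
    """
    composed = unicodedata.normalize("NFC", text)
    return [ch for ch in composed if not (ch.isascii() or ch in ASSUMED_EXTRA)]


# Examples from the list above.
assert non_level_1_chars("ihren Haaren seine F\u00fc\u00dfe,") == []
assert non_level_1_chars("doch soll der Vischer sollich haar") == []

# Long s and e with tilde point to a Level 2 (or higher) transcription.
found = non_level_1_chars("yhren hare\u0303 \u017feyne fu\u017f\u017fe/")
assert "\u017f" in found
assert len(found) == 4  # one e with tilde and three long s
```

Such a check cannot distinguish "Level 1" from "possible Level 1": "sollich haar" and "solch(es) Haar" both pass, although the second has been normalised much more strongly. That difference is an interpretation and has to be judged by a person.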

Addition:

When working with editions, normalisation must also be taken into account, as this example clearly shows.

Source: Sammlung der württembergischen Regierungs-Gesetze / 3: Enthaltend den dritten Theil der Samml. der Regierungs-Gesetze : … Regierungs-Gesetze vom Jahre 1727 bis zum Jahre 1805 (Th. 1, 1489 - 1634, Band 12) [http://opacplus.bsb-muenchen.de/title/BV006590720/ft/bsb10552294?page=702] retrieved on 05.08.2021.