📢 Ground truth level overview

The creation of ground truth (GT) involves several facets. Among other things, GT is subject to a specific purpose or use. The origin of the word ground truth is the German Grundwahrheit. In this sense, it generally means that everything on the printed page is reproduced in the same way. However, individual elements and regions may be reproduced in a simplified or complex and detailed manner. The extent to which this is done depends, among other things, on the purpose, use, scope, affordability, technical availability, and ... This means that in many cases an interpretation is made of individual typographic and graphemic phenomena. In order to make the interpretations made comprehensible, the GT can be transcribed according to corresponding levels of reproduction of the original or, in the case of existing GT, classified or evaluated. A description or explanation of the levels can be found in various places in these guidelines. The following overview lists these level descriptions.

Note: see: Grundwahrheit="ground truth, n.". in Oxford English Dictionary (OED) Online. September 2021. Oxford University Press. Link (accessed October 28, 2021) .

General explanation of the ground truth levels

Transcription in the corresponding level?

Structure-Ground-Truth in the corresponding level?

Explanations of certain cases such as ligatures, punctuation marks, differences between I/J...

This part of the documentation should be particularly noted:

Übersichten und Beispiele (betrifft die Transkription von Zeichen)

Level recommendation

  • OCR-D recommends Level 2 for ground truth creation/transcription.

What are the levels not?

  • In addition to transcription, the classification of the levels also serves to evaluate ground truth.
  • The levels are not a seal of quality.
  • When using levels 2 and 3, text-critical automatic recording/transcription is better possible. The precondition is that the respective model has been trained with this ground truth.
  • There is only limited compatibility between the texts in levels 1, 2 and 3. In most cases, there is only compatibility in the descending direction (3->2->1).
  • The levels can only be used to a limited extent to convert automatically between individual levels.

Problems:

Some problem cases are listed below. Due to the heterogeneity of the phenomena, completeness is not possible.

The problem of long S

siehe: https://de.wikipedia.org/wiki/Langes_s

Zitat aus Wikipedia:

  • Wachſtube (Wach·stu·be [Room of a security guard]) and Wachstube (Wachs·tu·be [Tube (see https://www.dwds.de/wb/Tube#d-1-1) filled with wax])
  • Kreiſchen (Krei·schen, Screeching) and Kreischen (Kreis·chen, a small circle)
  • Verſendung (Ver·sen·dung [Dispatch: something is sent to another place] ) and Versendung (Vers·en·dung [the end of a verse])
  • Röschenhof (Rös·chen·hof, a courtyard with small roses) and Röſchenhof (Rö·schen·hof, from given name Röschen)
  • Lachſturm (Lach·sturm [hearty laughter]) und Lachsturm (Lachs·turm [a tower made of the fish salmon])

Problem of macron above the sign

Transkription:

vnderlaſſen/ vñ fuͤrnemlich = Level 2

General rule: Level 3: ñ, Level 2 ñ, Level 1 nn (Always consider the context!!)

The normalised paragraph reads:

underlassen/ and fürnemlich = Level 1

unterlassen/ and vornemlich = possible level 1 (A very strong normalisation has been made.)

Problem of normalisation of transcriptions/text

  • ihren Haaren seine Füße, = Possible level 1
  • yhren harẽ ſeyne fuſſe/ = Level 2

  • doch soll der Vischer solch(es) Haar = Possible level 1
  • doch soll der Vischer sollich haar = Level 1
  • doch ſoll der Viſcher ſollich haar = Level 2

Quelle: [Württemberg, Fürstentum]: Des Fürstenthumbs Wirtemberg newe Landtsordnung/ gebessert vnd gemehret/ sampt darzu gedruckten der armen Casten/ auch Holtz vnnd Vorst ordnungen. [Tübingen], 1552. In: Deutsches Textarchiv https://www.deutschestextarchiv.de/wuerttemberg_landtsordnung_1552, retrieved on 05.08.2021.

Addition:

When using editions, normalisation must also be taken into account. This example clearly shows this.

Quelle: Sammlung der württembergischen Regierungs-Gesetze / 3: Enthaltend den dritten Theil der Samml. der Regierungs-Gesetze : … Regierungs-Gesetze vom Jahre 1727 bis zum Jahre 1805 ( Th. 1, 1489 - 1634, Band 12) [http://opacplus.bsb-muenchen.de/title/BV006590720/ft/bsb10552294?page=702] retrieved on 05.08.2021.