­čôó Ground truth level overview

The creation of ground truth (GT) involves several facets. Among other things, GT is subject to a specific purpose or use. The origin of the word ground truth is the German Grundwahrheit. In this sense, it generally means that everything on the printed page is reproduced in the same way. However, individual elements and regions may be reproduced in a simplified or complex and detailed manner. The extent to which this is done depends, among other things, on the purpose, use, scope, affordability, technical availability, and ... This means that in many cases an interpretation is made of individual typographic and graphemic phenomena. In order to make the interpretations made comprehensible, the GT can be transcribed according to corresponding levels of reproduction of the original or, in the case of existing GT, classified or evaluated. A description or explanation of the levels can be found in various places in these guidelines. The following overview lists these level descriptions.

Note: see: Grundwahrheit="ground truth, n.". in Oxford English Dictionary (OED) Online. September 2021. Oxford University Press. Link (accessed October 28, 2021) .

General explanation of the ground truth levels

Transcription in the corresponding level?

Structure-Ground-Truth in the corresponding level?

Explanations of certain cases such as ligatures, punctuation marks, differences between I/J...

This part of the documentation should be particularly noted:

├ťbersichten und Beispiele (betrifft die Transkription von Zeichen)

Level recommendation

  • OCR-D recommends Level 2 for ground truth creation/transcription.

What are the levels not?

  • In addition to transcription, the classification of the levels also serves to evaluate ground truth.
  • The levels are not a seal of quality.
  • When using levels 2 and 3, text-critical automatic recording/transcription is better possible. The precondition is that the respective model has been trained with this ground truth.
  • There is only limited compatibility between the texts in levels 1, 2 and 3. In most cases, there is only compatibility in the descending direction (3->2->1).
  • The levels can only be used to a limited extent to convert automatically between individual levels.

Problems:

Some problem cases are listed below. Due to the heterogeneity of the phenomena, completeness is not possible.

The problem of long S

siehe: https://de.wikipedia.org/wiki/Langes_s

Zitat aus Wikipedia:

  • Wach┼┐tube (Wach┬Ěstu┬Ěbe [Room of a security guard]) and Wachstube (Wachs┬Ětu┬Ěbe [Tube (see https://www.dwds.de/wb/Tube#d-1-1) filled with wax])
  • Krei┼┐chen (Krei┬Ěschen, Screeching) and Kreischen (Kreis┬Ěchen, a small circle)
  • Ver┼┐endung (Ver┬Ěsen┬Ědung [Dispatch: something is sent to another place] ) and Versendung (Vers┬Ěen┬Ědung [the end of a verse])
  • R├Âschenhof (R├Âs┬Ěchen┬Ěhof, a courtyard with small roses) and R├Â┼┐chenhof (R├Â┬Ěschen┬Ěhof, from given name R├Âschen)
  • Lach┼┐turm (Lach┬Ěsturm [hearty laughter]) und Lachsturm (Lachs┬Ěturm [a tower made of the fish salmon])

Problem of macron above the sign

Transkription:

vnderla┼┐┼┐en/ vn╠â fu═Ąrnemlich = Level 2

General rule: Level 3: ├▒, Level 2 ├▒, Level 1 nn (Always consider the context!!)

The normalised paragraph reads:

underlassen/ and f├╝rnemlich = Level 1

unterlassen/ and vornemlich = possible level 1 (A very strong normalisation has been made.)

Problem of normalisation of transcriptions/text

  • ihren Haaren seine F├╝├če, = Possible level 1
  • yhren hare╠â ┼┐eyne fu┼┐┼┐e/ = Level 2

  • doch soll der Vischer solch(es) Haar = Possible level 1
  • doch soll der Vischer sollich haar = Level 1
  • doch ┼┐oll der Vi┼┐cher ┼┐ollich haar = Level 2

Quelle: [W├╝rttemberg, F├╝rstentum]: Des F├╝rstenthumbs Wirtemberg newe Landtsordnung/ gebessert vnd gemehret/ sampt darzu gedruckten der armen Casten/ auch Holtz vnnd Vorst ordnungen. [T├╝bingen], 1552. In: Deutsches Textarchiv https://www.deutschestextarchiv.de/wuerttemberg_landtsordnung_1552, retrieved on 05.08.2021.

Addition:

When using editions, normalisation must also be taken into account. This example clearly shows this.

Quelle: Sammlung der w├╝rttembergischen Regierungs-Gesetze / 3: Enthaltend den dritten Theil der Samml. der Regierungs-Gesetze : ÔÇŽ Regierungs-Gesetze vom Jahre 1727 bis zum Jahre 1805 ( Th. 1, 1489 - 1634, Band 12) [http://opacplus.bsb-muenchen.de/title/BV006590720/ft/bsb10552294?page=702] retrieved on 05.08.2021.