Particularities of the printing technology and typographical aspects are not taken into account and are not documented in the Ground Truth corpus. Normalization is carried out to a large extent. The following characters are normalized:
- long-s to round-s
- umlauts written with a small e above the vowel to ä, ö, ü, Ä, Ö, Ü
- sz to ß
- virgule to comma
- Quotation marks are rendered according to modern usage and are not differentiated
- Separators are rendered according to modern usage and are not differentiated
- the round r followed by c is expanded to etc.
- The reproduction of spaces is limited to the separation of words.
- Punctuation marks are always attached directly to the preceding word.
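
The rules above can be illustrated with a minimal Python sketch. It is not part of the guidelines: the function name `normalize_line`, the concrete code points (U+017F for long s, U+0364 for the small e above, U+A75B for the round r, `/` for the virgule) and the set of quotation marks are assumptions chosen for illustration. The separator rule is left out because the text does not name the characters involved.

```python
import re

# Assumption: the "e above a vowel" is encoded as U+0364 (combining small e).
UMLAUTS = {"a": "ä", "o": "ö", "u": "ü", "A": "Ä", "O": "Ö", "U": "Ü"}

# Assumed code points; order matters: the sz ligature must be handled
# before the plain long s is replaced.
REPLACEMENTS = [
    ("\u017fz", "ß"),       # long s + z  -> ß
    ("\u017f\u0292", "ß"),  # long s + ezh (tailed z) -> ß
    ("\u017f", "s"),        # long s -> round s
    ("\ua75bc", "etc"),     # round r + c -> etc
    ("/", ","),             # virgule -> comma (crude: replaces every slash)
]

# Assumption: all of these are reduced to one undifferentiated quotation mark.
QUOTES = "„“”»«"


def normalize_line(text: str) -> str:
    """Apply the character normalizations to one transcribed line."""
    # Vowel + small e above -> precomposed umlaut.
    text = re.sub("([aouAOU])\u0364", lambda m: UMLAUTS[m.group(1)], text)
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    for quote in QUOTES:
        text = text.replace(quote, '"')
    # Spaces only separate words: collapse runs and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Punctuation is attached to the preceding word.
    text = re.sub(r" ([,.;:!?])", r"\1", text)
    return text


if __name__ == "__main__":
    print(normalize_line("Wa\u017f\u017fer / vnd Fu\u0364\u00dfe  \ua75bc ."))
```

Replacing every slash with a comma is a simplification; a real workflow would have to decide case by case whether a slash in the source is actually a virgule.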