The OCR-D Project

**OCR-D is a coordination project aimed at the further development of Optical Character Recognition (OCR) techniques for historical prints.**

Workflows and methods of automatic text recognition are examined, described and, if necessary, optimized. An essential goal is the conceptual preparation of the transformation of German prints from the 16th to the 19th century into electronic full texts.

The project involves the Duke August Library Wolfenbüttel, the Berlin-Brandenburg Academy of Sciences and Humanities in Berlin, the State Library of Berlin Prussian Cultural Heritage and the Karlsruhe Institute of Technology. The Bavarian State Library was also involved until 31 August 2016. The project is supported by experts, scientists and libraries.

In recent years, scientific libraries in particular have digitised extensive holdings as images. With the help of OCR procedures, searchable full texts can be generated automatically from this image data. Digital full texts are now indispensable in many scientific disciplines, especially in the (digital) humanities.

So far, however, access to the electronic full text has often been impossible or possible only to an inadequate degree. Many historical holdings are available in digitised form through the “Union Catalogues of Books Printed in German Speaking Countries” (VD), but results from common OCR procedures have so far been insufficient for them. Old typefaces in particular, especially Gothic types, are difficult to recognize.

OCR-D has identified where further development is needed. On the basis of existing tools and investigations, the OCR process is to be optimized for VD prints. In addition, the project seeks answers to the associated technical, information-science and organizational problems. In contrast to other OCR projects, the focus is not on developing a new, powerful OCR engine. Instead, full-text digitization is seen as a process that is implemented in modular open source software. The processing steps and their parameters can be traced and, if required, tailor-made workflows can be defined that deliver optimal results for specific titles.
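The idea of a modular, traceable workflow can be illustrated with a small Python sketch. It is only an illustration under assumptions: the `Workflow` class, the `binarize` and `recognize` steps, the dictionary used as a page, and the model name `fraktur-demo` are all hypothetical and are not taken from the OCR-D software. The sketch shows independent steps being chained, with each step and its parameters recorded so that a result can be traced back to the settings that produced it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A step receives a page (here simply a dict) and its parameters,
# and returns a new page. All names in this sketch are hypothetical.
Step = Callable[[dict, dict], dict]


@dataclass
class Workflow:
    steps: List[Tuple[str, Step, dict]] = field(default_factory=list)
    log: List[Tuple[str, dict]] = field(default_factory=list)

    def add(self, name: str, step: Step, **params) -> "Workflow":
        """Append a named step with its parameters; returns self for chaining."""
        self.steps.append((name, step, params))
        return self

    def run(self, page: dict) -> dict:
        """Apply all steps in order and record which parameters were used."""
        for name, step, params in self.steps:
            page = step(page, params)
            self.log.append((name, dict(params)))
        return page


def binarize(page: dict, params: dict) -> dict:
    # Toy binarization: grey values at or above the threshold become 1, others 0.
    threshold = params["threshold"]
    pixels = [1 if p >= threshold else 0 for p in page["pixels"]]
    return {**page, "pixels": pixels}


def recognize(page: dict, params: dict) -> dict:
    # Placeholder for a real recognition engine; it only notes the model name.
    return {**page, "text": f"<recognized with model {params['model']}>"}


# A workflow tailored to one (imaginary) title.
workflow = (
    Workflow()
    .add("binarize", binarize, threshold=128)
    .add("recognize", recognize, model="fraktur-demo")
)
result = workflow.run({"pixels": [12, 200, 130, 90]})
```

After the run, `workflow.log` lists every step together with the parameters it was given. A different title could get its own workflow simply by swapping steps or changing parameter values, without touching the steps themselves.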

The project is funded by the German Research Foundation and runs until July 2020. In the first phase, needs were identified and concepts for the further course of the project were developed. The cooperation structure was consolidated and continued in the second phase. In this phase, the identified needs are addressed by eight module projects, some of which further develop existing tools for the automated processing of early modern prints, while others build new tools. At every step, we welcome a lively exchange with colleagues from related projects and institutions as well as with service providers.

At the end of the overall project, a consolidated procedure for the OCR processing of digital copies of the printed German cultural heritage of the 16th to 19th centuries is to be developed.