The OCR-D Ground Truth text and structure corpus was created between 2015 and 2017. In the years since 2017, this corpus has been further curated and supplemented with metadata where appropriate. The corpus includes PAGE XML files that contain the annotations of the text and structure.
This project is maintained by OCR-D
The GT data has been labeled. The labeling is based on an ontology defined by the Pattern Recognition and Image Analysis Research Lab (PRImA-Research-Lab) at the University of Salford. This normalized and semantic description of the OCR-GT data can be found in the METS metadata file. The labeling metadata is created for each available page. The following labeling metadata is available for the complete collection.
Here you will find a description and explanation of the labeling metadata.
Description: In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order. Examples: Page layout analysis (segmentation into regions, classification into text, graphic, table etc.) Related: "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.
Description:
Description: The recognition of table/form structure and/or contents. Examples: Stock exchange data in a newspaper, Filled-in questionnaire form Related: OCR Object / shape recognition (e.g. table separator detection)
Description: Translation of any kind of depicted symbols to a machine-readable format Examples: OCR Mathematical equation recognition Related: Text processing (separate category) Table recognition Map reading
Description: Part of preceding or succeeding object included (e.g. other page)
Description: Visible page curl (e.g. book scanning)
Description: Perspective distortions (e.g. due to camera-based acquisition)
Description: Uneven illumination leading to brightness or contrast variations
Description: Arbitrary warping (e.g. due to moisture)
Description: The contrast between the paper and the page content is very low
Description: Ink from facing page was transferred to this page
Description: Annotations regarding the content
Description: The medium was stamped
Description: Paper was repaired (e.g. with patches)
Description: Noticeable stains on medium
Description: E.g. XML
Description: Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject. Examples: A text corpus, An image database
Description: Description coming soon.
Description: Description coming soon.
Description: Description coming soon.
Description: Description coming soon.
Description: Description coming soon.
Description: Description coming soon.
Description: Description coming soon.
Description:
Description: Footnotes at bottom of page
Description: Titles repeated on each page
Description: Decorations of some kind
Description: Illustrations in content
Description: Multi-colour illustrations in content
Description: Drop capitals (large capitals at beginning of paragraph)
Description: More than one font size used
Description: More than one typeface used
Description: Antiqua font (more modern)
Description: More than one language used
Description: Description coming soon.
Description: Description coming soon.
Description: Region, zone, block
Description: Description coming soon.
Description: Word or partial word, if separated by line break, for example
Description: Description coming soon.
You can download the complete data here. The download is a zip file in which the components of the collection are themselves zip files. Metadata for the complete collection and for its components is in METS format.
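Because the download is a zip file that contains further zip files, unpacking takes two passes. The following is a minimal sketch using only the Python standard library; the function name `extract_nested` and the file names in the usage note are hypothetical and not part of the corpus documentation.

```python
import zipfile
from pathlib import Path


def extract_nested(archive: Path, target: Path) -> list[Path]:
    """Extract `archive` into `target`, then unpack each inner zip it contained.

    Returns the directories created for the inner archives.
    Only one level of nesting is handled, matching the layout described above.
    """
    with zipfile.ZipFile(archive) as outer:
        outer.extractall(target)

    extracted = []
    # sorted() materializes the listing before new directories are created
    for inner in sorted(target.rglob("*.zip")):
        dest = inner.with_suffix("")  # e.g. component.zip -> component/
        with zipfile.ZipFile(inner) as zf:
            zf.extractall(dest)
        extracted.append(dest)
    return extracted
```

Calling `extract_nested(Path("collection.zip"), Path("gt"))` would, under these assumptions, leave one directory per component under `gt/`, each holding that component's files, including its METS metadata.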
| TextLine | Page | TxtRegion | ImgRegion | GraphRegion | TabRegion | SepRegion | MathRegion | MusicRegion | NoiseRegion |
|---|---|---|---|---|---|---|---|---|---|
| 6609 | 217 | 1648 | 1 | 74 | 3 | 141 | 1 | 4 | 17 |
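Counts like those in the table can be reproduced per file by tallying element names in a PAGE XML document. The sketch below is an illustration, not the tool used to build the table. It assumes the full PAGE element names (`TextRegion`, `ImageRegion`, and so on) rather than the abbreviated column headers, and it ignores the XML namespace so that it works across PAGE schema versions. The helper names `local_name` and `count_elements` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def count_elements(page_xml: str) -> Counter:
    """Count Page, TextLine and *Region elements in one PAGE XML string."""
    root = ET.fromstring(page_xml)
    counts = Counter()
    for elem in root.iter():
        name = local_name(elem.tag)
        # endswith("Region") skips reading-order references such as RegionRef
        if name in ("Page", "TextLine") or name.endswith("Region"):
            counts[name] += 1
    return counts
```

Summing the returned counters over all PAGE XML files of a component would give totals per element type for that component.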