
Workflows

Several steps are necessary to obtain the full text of a scanned print. The whole OCR process is shown in the following figure:

The following instructions describe all steps of an OCR workflow. Depending on your particular print (or rather images), not all of those steps might be necessary to obtain good results. Whether a step is required or optional is indicated in the description of each step.

Image Optimization (Page Level)

First, the image should be prepared for OCR.

Step 0: Image Enhancement (Page Level, optional)

Optionally, you can start off your workflow by enhancing your images, which can be vital for the following binarization. In this processing step, the raw image is taken and enhanced by e.g. grayscale conversion, brightness normalization, noise filtering, etc.

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-im6convert | `{"output-format": "image/tiff"}` (or `"image/jp2"`, `"image/png"`, ...) | for `output-options` see [IM Documentation](https://imagemagick.org/script/command-line-options.php) | `ocrd-im6convert -I OCR-D-IMG -O OCR-D-ENH -p '{"output-format": "image/tiff"}'` |

Step 1: Binarization (Page Level)

All images should be binarized right at the beginning of your workflow, because many of the following processors require binarized images. Some implementations (for deskewing, segmentation or recognition) may produce better results using the original image, but these can always retrieve the raw image automatically instead of the binarized version.

In this processing step, a scanned color or grayscale document image is taken as input and a black-and-white binarized image is produced. This step should separate the background from the foreground.

Note: Binarization tools usually provide a threshold parameter which allows you to increase or decrease the weight of the foreground. This is optional and can be especially useful for images which have not been enhanced.

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-anybaseocr-binarize | `{"threshold": float}` | Fast | `ocrd-anybaseocr-binarize -I OCR-D-IMG -O OCR-D-BIN` |
| ocrd-cis-ocropy-binarize | `{"noise_maxsize": float}` | | `ocrd-cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN` |
| ocrd-olena-binarize | `{"impl": "sauvola"}`, `{"impl": "sauvola-ms"}`, `{"impl": "sauvola-ms-fg"}`, `{"impl": "sauvola-ms-split"}`, `{"impl": "kim"}`, `{"impl": "wolf"}`, `{"impl": "niblack"}`, `{"impl": "singh"}`, `{"impl": "otsu"}`; `{"k": float}` | Recommended | `ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN -p '{"impl": "sauvola-ms-split"}'` |
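
Since the best algorithm depends on the images, it can help to run several of the `impl` variants listed above on the same input and compare the results. The following is a minimal sketch, not an official recipe: the output group naming scheme (`OCR-D-BIN-<IMPL>`), the helper `bin_group` and the `DRY_RUN` switch are assumptions made for illustration. With the default `DRY_RUN=1` the commands are only printed; set `DRY_RUN=0` inside a workspace to execute them.

```shell
#!/bin/sh
# Sketch: try several ocrd-olena-binarize algorithms side by side.
# Assumptions (not from the guide): the OCR-D-BIN-<IMPL> naming scheme,
# the bin_group helper and the DRY_RUN switch.
DRY_RUN="${DRY_RUN:-1}"

# Derive an output file group name from the algorithm name,
# e.g. "sauvola-ms-split" -> "OCR-D-BIN-SAUVOLA-MS-SPLIT".
bin_group() {
  printf 'OCR-D-BIN-%s\n' "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')"
}

for impl in sauvola sauvola-ms-split wolf otsu; do
  out=$(bin_group "$impl")
  params=$(printf '{"impl": "%s"}' "$impl")
  if [ "$DRY_RUN" = 1 ]; then
    # Only show what would be run.
    printf "ocrd-olena-binarize -I OCR-D-IMG -O %s -p '%s'\n" "$out" "$params"
  else
    ocrd-olena-binarize -I OCR-D-IMG -O "$out" -p "$params"
  fi
done
```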

Step 2: Cropping (Page Level)

In this processing step, a document image is taken as input and the page is cropped to the content area only (i.e. without noise at the margins or facing pages) by marking the coordinates of the page frame.

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-anybaseocr-crop | | The input image has to be binarized and should be deskewed for the module to work. | `ocrd-anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP` |

Step 3: Binarization (Page Level)

For better results, the cropped images can be binarized again at this point or later on (on region level).

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cis-ocropy-binarize | | | `ocrd-cis-ocropy-binarize -I OCR-D-CROP -O OCR-D-BIN2` |
| ocrd-olena-binarize | `{"impl": "sauvola"}`, `{"impl": "sauvola-ms"}`, `{"impl": "sauvola-ms-fg"}`, `{"impl": "sauvola-ms-split"}`, `{"impl": "kim"}`, `{"impl": "wolf"}`, `{"impl": "niblack"}`, `{"impl": "singh"}`, `{"impl": "otsu"}` | Recommended | `ocrd-olena-binarize -I OCR-D-CROP -O OCR-D-BIN2 -p '{"impl": "sauvola-ms-split"}'` |

Step 4: Denoising (Page Level)

In this processing step, artifacts like little specks (in both foreground and background) are removed from the binarized image.

This may not be necessary for all prints, and depends heavily on the selected binarization algorithm.

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cis-ocropy-denoise | `{"level-of-operation": "page"}` | | `ocrd-cis-ocropy-denoise -I OCR-D-BIN2 -O OCR-D-DENOISE` |

Step 5: Deskewing (Page Level)

In this processing step, a document image is taken as input and the skew of that page is corrected by annotating the detected angle (-45° .. 45°) and rotating the image. Optionally, also the orientation is corrected by annotating the detected angle (multiples of 90°) and transposing the image. The input images have to be binarized for this module to work.

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-anybaseocr-deskew | | | `ocrd-anybaseocr-deskew -I OCR-D-DENOISE -O OCR-D-DESKEW-PAGE` |
| ocrd-tesserocr-deskew | `{"operation_level": "page"}` | Fast, also performs a decent orientation correction | `ocrd-tesserocr-deskew -I OCR-D-DENOISE -O OCR-D-DESKEW-PAGE -p '{"operation_level": "page"}'` |
| ocrd-cis-ocropy-deskew | `{"level-of-operation": "page"}` | Recommended | `ocrd-cis-ocropy-deskew -I OCR-D-DENOISE -O OCR-D-DESKEW-PAGE -p '{"level-of-operation": "page"}'` |

Step 6: Dewarping (Page Level)

In this processing step, a document image is taken as input and the text lines are straightened or stretched if they are curved. The input image has to be binarized for the module to work.

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-anybaseocr-dewarp | `{"pix2pixHD": "/path/to/pix2pixHD/", "model_name": "/path/to/pix2pixHD/models"}` | For available models take a look at this site. The parameter name `model_name` is misleading: the given directory has to contain a file named `latest_net_G.pth`. GPU required! | `ocrd-anybaseocr-dewarp -I OCR-D-DESKEW-PAGE -O OCR-D-DEWARP-PAGE -p '{"pix2pixHD": "/path/to/pix2pixHD/", "model_name": "/path/to/pix2pixHD/models"}'` |
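
Each processor reads one file group (`-I`) and writes another (`-O`), so the page-level steps above form a chain in which every output group becomes the input group of the next step. The following sketch illustrates that chaining with the file group names used in the tables above. The `run_step` helper, the `prev` variable and the `DRY_RUN` switch are hypothetical conveniences and not part of OCR-D. With the default `DRY_RUN=1` the commands are only printed as a trace (the JSON parameter is shown without shell quoting).

```shell
#!/bin/sh
# Sketch: chain page-level steps so that each output group feeds the next step.
# run_step, prev and DRY_RUN are illustrative assumptions, not OCR-D features.
DRY_RUN="${DRY_RUN:-1}"
prev=OCR-D-IMG   # file group holding the raw images

# Usage: run_step PROCESSOR OUTPUT_GROUP [JSON_PARAMETERS]
run_step() {
  if [ -n "${3:-}" ]; then
    set -- "$1" -I "$prev" -O "$2" -p "$3"
  else
    set -- "$1" -I "$prev" -O "$2"
  fi
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "$*"      # trace only
  else
    "$@" || return 1        # stop the chain if a processor fails
  fi
  prev=$5                   # the output group becomes the next input group
}

run_step ocrd-olena-binarize     OCR-D-BIN         '{"impl": "sauvola-ms-split"}'
run_step ocrd-anybaseocr-crop    OCR-D-CROP
run_step ocrd-olena-binarize     OCR-D-BIN2        '{"impl": "sauvola-ms-split"}'
run_step ocrd-cis-ocropy-denoise OCR-D-DENOISE     '{"level-of-operation": "page"}'
run_step ocrd-cis-ocropy-deskew  OCR-D-DESKEW-PAGE '{"level-of-operation": "page"}'
```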

Layout Analysis

By now the image should be well prepared for segmentation.

Step 7: Page segmentation

In this processing step, an (optimized) document image is taken as an input and the image is segmented into the various regions, including columns. Segments are also classified, either coarse (text, separator, image, table, …) or fine-grained (paragraph, marginalia, heading, …).

Note: If you use ocrd-tesserocr-segment-region, which uses only bounding boxes instead of polygon coordinates, then you should post-process via ocrd-segment-repair with plausibilize=True to obtain better results without large overlaps.

Note: The ocrd-sbb-textline-detector processor does not only segment the page, but also the text lines within the detected text regions in one step. Therefore with this (and only with this!) processor you don’t need to segment into lines in an extra step.

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-tesserocr-segment-region | | Recommended | `ocrd-tesserocr-segment-region -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-REG` |
| ocrd-segment-repair | `{"plausibilize": true}` | Only to be used after `ocrd-tesserocr-segment-region` | `ocrd-segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -p '{"plausibilize": true}'` |
| ocrd-sbb-textline-detector | | | `ocrd-sbb-textline-detector -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-REG -p '{"level-of-operation": "page"}'` |
| ocrd-anybaseocr-block-segmentation | `{"block_segmentation_model": "/path/to/mrcnn", "block_segmentation_weights": "/path/to/model/block_segmentation_weights.h5"}` | For available models take a look at this site. Should also work for original images!? | `ocrd-anybaseocr-block-segmentation -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-REG -p '{"block_segmentation_model": "/path/to/mrcnn", "block_segmentation_weights": "/path/to/model/block_segmentation_weights.h5"}'` |
| ocrd-cis-ocropy-segment | `{"level-of-operation": "page"}` | | `ocrd-cis-ocropy-segment -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-REG -p '{"level-of-operation": "page"}'` |

Image Optimization (Region Level)

In the following steps, the text regions should be optimized for OCR.

Step 8: Binarization (Region Level)

In this processing step, a scanned color or grayscale document image is taken as input and a black-and-white binarized image is produced. This step should separate the background from the foreground.

Binarization should be executed at least once (on page or region level). If you already binarized your image twice on page level, and have no large images, you can probably skip this step.

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-tesserocr-binarize | `{"operation_level": "region"}` | | `ocrd-tesserocr-binarize -I OCR-D-SEG-REG -O OCR-D-BIN-REG -p '{"operation_level": "region"}'` |
| ocrd-cis-ocropy-binarize | `{"level-of-operation": "region", "noise_maxsize": float}` | | `ocrd-cis-ocropy-binarize -I OCR-D-SEG-REG -O OCR-D-BIN-REG -p '{"level-of-operation": "region"}'` |

Step 9: Deskewing (Region Level)

In this processing step, text region images are taken as input and their skew is corrected by annotating the detected angle (-45° .. 45°) and rotating the image. Optionally, also the orientation is corrected by annotating the detected angle (multiples of 90°) and transposing the image.

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cis-ocropy-deskew | `{"level-of-operation": "region"}` | | `ocrd-cis-ocropy-deskew -I OCR-D-BIN-REG -O OCR-D-DESKEW-REG -p '{"level-of-operation": "region"}'` |
| ocrd-tesserocr-deskew | | Fast, also performs a decent orientation correction | `ocrd-tesserocr-deskew -I OCR-D-BIN-REG -O OCR-D-DESKEW-REG` |

Step 10: Clipping (Region Level)

In this processing step, intrusions of neighbouring non-text (e.g. separator) or text segments (e.g. ascenders/descenders) into text regions of a page can be removed. A connected component analysis is run on every text region, as well as its overlapping neighbours. For each conflicting binary object, a rule based on majority and proper containment then determines whether it belongs to the neighbour, and can therefore be clipped to the background.

This basic text-nontext segmentation ensures that for each text region there is a clean image without interference from separators and neighbouring texts. (Cleaning via coordinates would be impossible in many common cases.)

TODO: add images

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cis-ocropy-clip | `{"level-of-operation": "region"}` | | `ocrd-cis-ocropy-clip -I OCR-D-DESKEW-REG -O OCR-D-CLIP-REG -p '{"level-of-operation": "region"}'` |

Step 11: Line segmentation

In this processing step, text regions are segmented into text lines. A line detection algorithm is run on every text region of every PAGE in the input file group, and a TextLine element with the resulting polygon outline is added to the annotation of the output PAGE.

Note: If you use ocrd-tesserocr-segment-line, which uses only bounding boxes instead of polygon coordinates, then you should post-process with the processors described in Step 12. If you use ocrd-cis-ocropy-segment, you can directly go on with Step 13.

Note: As described in Step 7, the ocrd-sbb-textline-detector also segments text lines. As it segments the page in a first step, too, with this (and only with this!) processor you don’t need to segment into regions in an extra step.

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cis-ocropy-segment | `{"level-of-operation": "region"}` | | `ocrd-cis-ocropy-segment -I OCR-D-CLIP-REG -O OCR-D-SEG-LINE -p '{"level-of-operation": "region"}'` |
| ocrd-tesserocr-segment-line | | | `ocrd-tesserocr-segment-line -I OCR-D-CLIP-REG -O OCR-D-SEG-LINE` |

Step 12: Resegmentation (Line Level)

In this processing step the segmented lines can be corrected.

TODO: add images

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cis-ocropy-clip | `{"level-of-operation": "line"}` | | `ocrd-cis-ocropy-clip -I OCR-D-SEG-LINE -O OCR-D-CLIP-LINE -p '{"level-of-operation": "line"}'` |
| ocrd-cis-ocropy-resegment | | | `ocrd-cis-ocropy-resegment -I OCR-D-SEG-LINE -O OCR-D-RESEG` |

Step 13: Dewarping (Line Level)

In this processing step, the text line images get vertically aligned if they are curved.

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cis-ocropy-dewarp | | | `ocrd-cis-ocropy-dewarp -I OCR-D-CLIP-LINE -O OCR-D-DEWARP-LINE` |

Text Recognition

Step 14: Text recognition

In this processing step, the text in the segmented lines is recognized.

An overview of the existing model repositories and short descriptions of the most important models can be found here.

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-tesserocr-recognize | `{"textequiv_level": "glyph", "overwrite_words": true, "model": "Fraktur"}` or `{"textequiv_level": "glyph", "overwrite_words": true, "model": "GT4HistOCR_50000000.997_191951"}` | Recommended. Model can be found here (/tessdata_best/GT4HistOCR_50000000.997_191951.traineddata) | `ocrd-tesserocr-recognize -I OCR-D-DEWARP-LINE -O OCR-D-OCR -p '{"model": "Fraktur"}'` |
| ocrd-calamari-recognize | `{"checkpoint": "/path/to/models/*.ckpt.json"}` | Recommended. Model can be found here | `ocrd-calamari-recognize -I OCR-D-DEWARP-LINE -O OCR-D-OCR -p '{"checkpoint": "/path/to/models/*.ckpt.json"}'` |

Note: For ocrd-tesserocr, the environment variable TESSDATA_PREFIX has to point to the directory where the models are stored. (The directory should contain at least the following models: deu.traineddata, eng.traineddata, osd.traineddata)
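
A quick check before a long recognition run can show whether the directory named by `TESSDATA_PREFIX` actually contains the three models listed in the note. This is only a sketch: the function name `missing_models` is hypothetical, and the check tests nothing beyond the presence of the files.

```shell
#!/bin/sh
# Sketch: verify that TESSDATA_PREFIX contains the models named in the note
# before starting ocrd-tesserocr-recognize. missing_models is a hypothetical helper.

# Usage: missing_models DIRECTORY MODEL...
# Prints the file name of every missing model; returns non-zero if any is missing.
missing_models() {
  dir=$1; shift
  status=0
  for model in "$@"; do
    if [ ! -f "$dir/$model.traineddata" ]; then
      printf '%s.traineddata\n' "$model"
      status=1
    fi
  done
  return "$status"
}

if [ -z "${TESSDATA_PREFIX:-}" ]; then
  echo "TESSDATA_PREFIX is not set" >&2
elif missing_models "$TESSDATA_PREFIX" deu eng osd; then
  echo "all required models found in $TESSDATA_PREFIX"
fi
```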

Post Correction (Optional)

Step 15: Text alignment

In this processing step, text results from multiple OCR engines (in different annotations sharing the same line segmentation) are aligned into one annotation with TextEquiv alternatives.

Note: This step is only required if you want to do post-correction afterwards, feeding alternative character hypotheses from several OCR-engines to improve the search space. The previous recognition step must be run on glyph or at least on word level.

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cis-align | | | `ocrd-cis-align -I OCR-D-OCR1,OCR-D-OCR2 -O OCR-D-ALIGN` |
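
The input groups `OCR-D-OCR1` and `OCR-D-OCR2` in the call above have to be produced by running two recognition engines on the same line segmentation. The sketch below shows one way this could look, reusing the recognition calls from Step 14; the `run` helper and the `DRY_RUN` switch are assumptions for illustration, and the commands are only printed by default. Whether the second engine annotates at word or glyph level, as the note requires, depends on its own parameters and should be checked in its documentation.

```shell
#!/bin/sh
# Sketch: run two OCR engines on the same lines, then align their results.
# run and DRY_RUN are illustrative assumptions; with DRY_RUN=1 nothing is executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = 1 ]; then printf '%s\n' "$*"; else "$@"; fi
}

LINES_GRP=OCR-D-DEWARP-LINE   # file group with the (dewarped) text lines

# Engine 1: Tesseract, annotating down to glyph level.
run ocrd-tesserocr-recognize -I "$LINES_GRP" -O OCR-D-OCR1 \
    -p '{"textequiv_level": "glyph", "overwrite_words": true, "model": "Fraktur"}'

# Engine 2: Calamari on the very same input group.
run ocrd-calamari-recognize -I "$LINES_GRP" -O OCR-D-OCR2 \
    -p '{"checkpoint": "/path/to/models/*.ckpt.json"}'

# Alignment: both result groups are passed as a comma-separated list.
aligned_inputs=OCR-D-OCR1,OCR-D-OCR2
run ocrd-cis-align -I "$aligned_inputs" -O OCR-D-ALIGN
```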

Step 16: Post-correction

In this processing step, the recognized text is corrected by statistical error modelling, language modelling, and word modelling (dictionaries, morphology and orthography).

Note: Most tools benefit strongly from input which includes alternative OCR hypotheses. Currently, models for ocrd-cor-asv-ann-process are optimised for input from single OCR engines, whereas ocrd-cis-post-correct.sh expects input from multi-OCR alignment.

See also: ToDo reference to the result inside talk on final workshop

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cor-asv-ann-process | `{"textequiv_level": "line", "model_file": "/path/to/model/model.h5"}` | Models can be found here | `ocrd-cor-asv-ann-process -I OCR-D-OCR -O OCR-D-PROCESS -p '{"textequiv_level": "line", "model_file": "/path/to/model/model.h5"}'` |
| ocrd-cis-post-correct.sh | ??? | Not tested yet! | `ocrd-cis-post-correct.sh -I OCR-D-ALIGN -O OCR-D-CORRECT` |

Evaluation (Optional)

If Ground Truth data is available, the OCR can be evaluated.

Step 17: OCR Evaluation

In this processing step, the text output of the OCR or post-correction can be evaluated by aligning with ground truth text and measuring the error rates.

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-dinglehopper | | First input group should point to the ground truth. | `ocrd-dinglehopper -I OCR-D-GT,OCR-D-OCR -O OCR-D-EVAL` |
| ocrd-cor-asv-ann-evaluate | `{"metric": "Levenshtein"}` (default), or `"NFC"`, `"NFKC"`, `"historic-latin"`; `{"confusion": integer}` | First input group should point to the ground truth. There is no output file group, it only uses logging. If you want to save the evaluation findings in a file, you could e.g. add `2> eval.txt` at the end of your command | `ocrd-cor-asv-ann-evaluate -I OCR-D-GT,OCR-D-OCR` |
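
The `2> eval.txt` redirection mentioned for ocrd-cor-asv-ann-evaluate can be wrapped in a small helper so that the log file name is stated explicitly. This is a sketch only; `log_to_file` is a hypothetical name, and the processor is called only if it is installed.

```shell
#!/bin/sh
# Sketch: keep the log of a processor that reports its findings only via logging.
# log_to_file is a hypothetical helper: it runs a command and stores everything
# the command writes to standard error in the given file.
log_to_file() {
  logfile=$1; shift
  "$@" 2> "$logfile"
}

if command -v ocrd-cor-asv-ann-evaluate >/dev/null 2>&1; then
  # The first input group is the ground truth, the second the OCR result.
  log_to_file eval.txt ocrd-cor-asv-ann-evaluate -I OCR-D-GT,OCR-D-OCR
  echo "evaluation log written to eval.txt"
else
  echo "ocrd-cor-asv-ann-evaluate is not installed; nothing evaluated" >&2
fi
```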

Format Conversion (Optional)

OCR-D produces PAGE XML files which contain the recognized text as well as detailed information on the structure of the processed pages, the coordinates of the recognized elements etc. Optionally, the PAGE XML can be converted to a different output format.

Step 18: Format Conversion

In this processing step the produced PAGE XML files can be converted to ALTO, PDF, hOCR or text files. Note that ALTO and hOCR can also be converted into different formats whereas the PDF version of PAGE XML OCR results is a widely accessible format that can be used as-is by expert and layman alike.

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-fileformat-transform | `{"from-to": "alto2.0 alto3.0"}`; other `from-to` values: `"alto2.0 alto3.1"`, `"alto2.0 hocr"`, `"alto2.1 alto3.0"`, `"alto2.1 alto3.1"`, `"alto2.1 hocr"`, `"alto page"`, `"alto text"`, `"gcv hocr"`, `"hocr alto2.0"`, `"hocr alto2.1"`, `"hocr text"`, `"page alto"`, `"page hocr"`, `"page text"` | | `ocrd-fileformat-transform -I OCR-D-OCR -O OCR-D-ALTO` |
| ocrd-pagetopdf | `{"negative2zero": true, "multipage": true, "textequiv_level": "word", "outlines": "line"}` | `negative2zero`: fix (invalid) negative coordinates; `multipage`: create a single "fat" PDF; `textequiv_level`: render text on this level; `outlines`: draw polygon outlines in the PDF | `ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -p '{"textequiv_level" : "word"}'` |
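
The `from-to` value of ocrd-fileformat-transform is a single string with the source and target format separated by a space. The following sketch builds that parameter for the three conversions from PAGE listed above and prints the resulting calls for inspection. The helper names and the output group naming scheme (`OCR-D-<FORMAT>`) are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: build the "from-to" parameter for ocrd-fileformat-transform.
# from_to_param, out_group and the OCR-D-<FORMAT> naming are illustrative assumptions.

# e.g. from_to_param page alto -> {"from-to": "page alto"}
from_to_param() {
  printf '{"from-to": "%s %s"}\n' "$1" "$2"
}

# e.g. out_group alto -> OCR-D-ALTO
out_group() {
  printf 'OCR-D-%s\n' "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')"
}

# Print one call per target format; nothing is executed here.
for target in alto hocr text; do
  printf "ocrd-fileformat-transform -I OCR-D-OCR -O %s -p '%s'\n" \
    "$(out_group "$target")" "$(from_to_param page "$target")"
done
```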

Recommendations

All processors, with the exception of those for post-correction, were tested on selected pages of some prints from the 17th and 18th century.

The results vary quite a lot from page to page. In most cases, segmentation is a problem.

These recommendations may also work well for other prints of those centuries.

Note that for our test pages, not all steps described above were needed to obtain the best results. Depending on your particular images, you might want to include the omitted processors again for better results.

Best results for selected pages

The following workflow has produced best results for ‘simple’ pages (e.g. this page) (CER ~1%).

| Step | Processor | Parameter |
| --- | --- | --- |
| 1 | ocrd-olena-binarize | `{"impl": "sauvola"}` |
| 2 | ocrd-anybaseocr-crop | |
| 3 | ocrd-olena-binarize | `{"impl": "kim"}` |
| 4 | ocrd-cis-ocropy-denoise | `{"level-of-operation": "page"}` |
| 5 | ocrd-cis-ocropy-deskew | `{"level-of-operation": "page"}` |
| 7 | ocrd-tesserocr-segment-region | |
| 7a | ocrd-segment-repair | `{"plausibilize": true}` |
| 8 | ocrd-olena-binarize | `{"impl": "kim"}` |
| 9 | ocrd-cis-ocropy-deskew | `{"level-of-operation": "region"}` |
| 10 | ocrd-cis-ocropy-clip | `{"level-of-operation": "region"}` |
| 11 | ocrd-cis-ocropy-segment | `{"level-of-operation": "region"}` |
| 11a | ocrd-segment-repair | `{"sanitize": true}` |
| 13 | ocrd-cis-ocropy-dewarp | |
| 14 | ocrd-calamari-recognize | `{"checkpoint": "/path/to/models/*.ckpt.json"}` |

Example with ocrd-process

ocrd process \
  "olena-binarize -I OCR-D-IMG -O OCR-D-BIN -p '{\"impl\": \"sauvola\"}'" \
  "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
  "olena-binarize -I OCR-D-CROP -O OCR-D-BIN2 -p '{\"impl\": \"kim\"}'" \
  "cis-ocropy-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -p '{\"level-of-operation\":\"page\"}'" \
  "cis-ocropy-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -p '{\"level-of-operation\":\"page\"}'" \
  "tesserocr-segment-region -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-REG" \
  "segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -p '{\"plausibilize\":true}'" \
  "olena-binarize -I OCR-D-SEG-REPAIR -O OCR-D-BIN3 -p '{\"impl\": \"kim\"}'" \
  "cis-ocropy-deskew -I OCR-D-BIN3 -O OCR-D-SEG-REG-DESKEW -p '{\"level-of-operation\":\"region\"}'" \
  "cis-ocropy-clip -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-REG-DESKEW-CLIP -p '{\"level-of-operation\":\"region\"}'" \
  "cis-ocropy-segment -I OCR-D-SEG-REG-DESKEW-CLIP -O OCR-D-SEG-LINE -p '{\"level-of-operation\":\"region\"}'" \
  "segment-repair -I OCR-D-SEG-LINE -O OCR-D-SEG-REPAIR-LINE -p '{\"sanitize\":true}'" \
  "cis-ocropy-dewarp -I OCR-D-SEG-REPAIR-LINE -O OCR-D-SEG-LINE-RESEG-DEWARP" \
  "calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -p '{\"checkpoint\":\"/path/to/models/*.ckpt.json\"}'"

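In a long `ocrd process` call, a mistyped or skipped file group is easy to overlook, because every step only names its own input and output group. A small consistency check over the step list can catch this before a long run. The following is a sketch under stated assumptions: `check_chain` is a hypothetical helper, it expects one step per line with `-I` and `-O` separated from their values by whitespace, and it reports both inputs that no earlier step produced and intermediate outputs that no later step reads.

```shell
#!/bin/sh
# Sketch: check that the file groups of a step list form a consistent chain.
# check_chain is a hypothetical helper, not part of OCR-D.
# Usage: check_chain INITIAL_GROUP < steps   (one step per line)
check_chain() {
  awk -v first="$1" '
    BEGIN { produced[first] = 1; status = 0 }
    {
      in_grp = ""; out_grp = ""
      for (i = 1; i < NF; i++) {
        if ($i == "-I") in_grp = $(i + 1)
        if ($i == "-O") out_grp = $(i + 1)
      }
      # -I may hold several comma-separated groups.
      n = split(in_grp, parts, ",")
      for (j = 1; j <= n; j++) {
        if (!(parts[j] in produced)) {
          printf "step %d: input %s is not produced by an earlier step\n", NR, parts[j]
          status = 1
        }
        used[parts[j]] = 1
      }
      if (out_grp != "") { produced[out_grp] = 1; outputs[NR] = out_grp }
      last = NR
    }
    END {
      # Every output except that of the final step should be read by a later step.
      for (s = 1; s < last; s++)
        if ((s in outputs) && !(outputs[s] in used)) {
          printf "step %d: output %s is never used as input\n", s, outputs[s]
          status = 1
        }
      exit status
    }
  '
}

if check_chain OCR-D-IMG <<'EOF'
olena-binarize -I OCR-D-IMG -O OCR-D-BIN -p '{"impl": "sauvola"}'
anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP
olena-binarize -I OCR-D-CROP -O OCR-D-BIN2 -p '{"impl": "kim"}'
EOF
then
  echo "file group chain is consistent"
fi
```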
Good results for slower processors

If your computer is not that powerful, you may try this workflow. It works fine for simple pages and also produces good results in less time.

| Step | Processor | Parameter |
| --- | --- | --- |
| 1 | ocrd-olena-binarize | `{"impl": "sauvola"}` |
| 2 | ocrd-anybaseocr-crop | |
| 3 | ocrd-olena-binarize | `{"impl": "kim"}` |
| 4 | ocrd-cis-ocropy-denoise | `{"level-of-operation": "page"}` |
| 5 | ocrd-tesserocr-deskew | `{"operation_level": "page"}` |
| 7 | ocrd-tesserocr-segment-region | |
| 7a | ocrd-segment-repair | `{"plausibilize": true}` |
| 9 | ocrd-cis-ocropy-deskew | `{"level-of-operation": "region"}` |
| 10 | ocrd-cis-ocropy-clip | `{"level-of-operation": "region"}` |
| 11 | ocrd-tesserocr-segment-line | |
| 11a | ocrd-segment-repair | `{"sanitize": true}` |
| 13 | ocrd-cis-ocropy-dewarp | |
| 14 | ocrd-calamari-recognize | `{"checkpoint": "/path/to/models/*.ckpt.json"}` |

Example with ocrd-process

ocrd process \
  "olena-binarize -I OCR-D-IMG -O OCR-D-BIN -p '{\"impl\": \"sauvola\"}'" \
  "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
  "olena-binarize -I OCR-D-CROP -O OCR-D-BIN2 -p '{\"impl\": \"kim\"}'" \
  "cis-ocropy-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -p '{\"level-of-operation\":\"page\"}'" \
  "tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -p '{\"operation_level\":\"page\"}'" \
  "tesserocr-segment-region -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-REG" \
  "segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -p '{\"plausibilize\":true}'" \
  "cis-ocropy-deskew -I OCR-D-SEG-REPAIR -O OCR-D-SEG-REG-DESKEW -p '{\"level-of-operation\":\"region\"}'" \
  "cis-ocropy-clip -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-REG-DESKEW-CLIP -p '{\"level-of-operation\":\"region\"}'" \
  "tesserocr-segment-line -I OCR-D-SEG-REG-DESKEW-CLIP -O OCR-D-SEG-LINE" \
  "segment-repair -I OCR-D-SEG-LINE -O OCR-D-SEG-REPAIR-LINE -p '{\"sanitize\":true}'" \
  "cis-ocropy-dewarp -I OCR-D-SEG-REPAIR-LINE -O OCR-D-SEG-LINE-RESEG-DEWARP" \
  "calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -p '{\"checkpoint\":\"/path/to/models/*.ckpt.json\"}'"