Workflows

Several steps are necessary to obtain the full text of a scanned print. The whole OCR process is shown in the following figure:

The following instructions describe all steps of the OCR workflow. Depending on your particular print (or rather its images), not all of these steps will be necessary to obtain good results. Whether a step is required or optional is indicated in the description of each step.

Image Optimization

Prepare the images for better OCR results.

Step 1: Binarization

First, all the images should be binarized. Many of the following processors require binarized images. Note that some segmentation algorithms seem to produce better results using the original image.

This processor takes a scanned color or grayscale document image as input and produces a black-and-white (binarized) image. This step should separate the background from the foreground.

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-anybaseocr-binarize | | Fast | ocrd-anybaseocr-binarize -I OCR-D-IMG -O OCR-D-BIN |
| ocrd-cis-ocropy-binarize | | | ocrd-cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN |
| ocrd-olena-binarize | {"impl": "sauvola"}; {"impl": "sauvola-ms"}; {"impl": "sauvola-ms-fg"}; {"impl": "sauvola-ms-split"}; {"impl": "kim"}; {"impl": "wolf"}; {"impl": "niblack"}; {"impl": "singh"}; {"impl": "otsu"} | Recommended | ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN -p '{"impl": "sauvola"}' |
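
Since ocrd-olena-binarize offers several algorithms, it can help to try more than one on the same images and compare the results. The following is a minimal sketch under assumptions: binarize_calls is a hypothetical helper, and the output group suffix (OCR-D-BIN-<impl>) is only an illustrative naming choice. It prints the calls instead of running them.

```shell
# Print (dry run) one ocrd-olena-binarize call per algorithm so that the
# results can be compared side by side.
# binarize_calls and the OCR-D-BIN-<impl> group names are hypothetical.
binarize_calls() {
  for impl in "$@"; do
    printf "ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN-%s -p '{\"impl\": \"%s\"}'\n" "$impl" "$impl"
  done
}

binarize_calls sauvola wolf otsu
```

To execute the calls rather than print them, the output can be piped to a shell, for example `binarize_calls sauvola wolf otsu | sh`.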

Step 2: Denoising

This processor removes artifacts from the binarized image.

May not be necessary for all prints.

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cis-ocropy-denoise | {"level-of-operation":"page"} | | ocrd-cis-ocropy-denoise -I OCR-D-BIN -O OCR-D-DENOISE |

Step 3: Deskewing

This processor takes a document image as input and corrects its skew. The input images have to be binarized for this module to work.

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-anybaseocr-deskew | | | ocrd-anybaseocr-deskew -I OCR-D-DENOISE -O OCR-D-DESKEW-PAGE |
| ocrd-tesserocr-deskew | {"operation_level":"page"} | Fast | ocrd-tesserocr-deskew -I OCR-D-DENOISE -O OCR-D-DESKEW-PAGE -p '{"operation_level":"page"}' |
| ocrd-cis-ocropy-deskew | {"level-of-operation":"page"} | Recommended | ocrd-cis-ocropy-deskew -I OCR-D-DENOISE -O OCR-D-DESKEW-PAGE -p '{"level-of-operation":"page"}' |

Step 4: Dewarping

This processor takes a document image as input and straightens the text lines if they are curved. The input image has to be binarized for the module to work.

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-anybaseocr-dewarp | {"pix2pixHD":"/path/to/pix2pixHD/", "model_name":"/path/to/pix2pixHD/models"} | For available models take a look at this site. The parameter name model_name is misleading: the given directory has to contain a file named 'latest_net_G.pth'. GPU required! | ocrd-anybaseocr-dewarp -I OCR-D-DESKEW-PAGE -O OCR-D-DEWARP-PAGE -p '{"pix2pixHD":"/path/to/pix2pixHD/","model_name":"/path/to/pix2pixHD/models"}' |
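
Long JSON parameters are easy to break with shell quoting. As a sketch, and assuming your OCR-D version accepts a path to a JSON file for -p (as well as a verbatim JSON string), the parameters can be written to a file first. The file location is a hypothetical choice, and the paths are the same placeholders used in the table above. The call is only printed here.

```shell
# Write the dewarp parameters to a JSON file via a quoted here-document,
# which avoids shell quoting problems with long -p arguments.
# The location of param_file is a hypothetical choice.
param_file="${TMPDIR:-/tmp}/dewarp-params.json"
cat > "$param_file" <<'EOF'
{
  "pix2pixHD": "/path/to/pix2pixHD/",
  "model_name": "/path/to/pix2pixHD/models"
}
EOF

# Dry run: print the call; remove 'echo' to execute it.
echo ocrd-anybaseocr-dewarp -I OCR-D-DESKEW-PAGE -O OCR-D-DEWARP-PAGE -p "$param_file"
```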

Step 5: Cropping

This processor takes a document image as input and crops/selects the page content area only (i.e. it removes textual noise as well as any other noise around the page content area).

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-anybaseocr-crop | | The input image has to be binarized and should be deskewed for the module to work. | ocrd-anybaseocr-crop -I OCR-D-DEWARP-PAGE -O OCR-D-CROP |

Layout Analysis

By now, the image should be optimized for segmentation.

Step 6: Text segmentation (page)

This processor takes an (optimized) document image as an input and segments the image into the different text blocks. During this step a classification (text, marginalia, image, …) should also be done.

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-tesserocr-segment-region | | Should also work for original images!? | ocrd-tesserocr-segment-region -I OCR-D-CROP -O OCR-D-SEG-REG |
| ocrd-anybaseocr-block-segmentation | {"block_segmentation_model": "/path/to/mrcnn", "block_segmentation_weights": "/path/to/model/block_segmentation_weights.h5"} | For available models take a look at this site. Should also work for original images!? | ocrd-anybaseocr-block-segmentation -I OCR-D-CROP -O OCR-D-SEG-REG -p '{"block_segmentation_model": "/path/to/mrcnn","block_segmentation_weights": "/path/to/model/block_segmentation_weights.h5"}' |
| ocrd-cis-ocropy-segment | {"level-of-operation":"page"} | | ocrd-cis-ocropy-segment -I OCR-D-CROP -O OCR-D-SEG-REG -p '{"level-of-operation":"page"}' |

Image Optimization (on Block Level)

Now the blocks should be optimized for OCR.

Step 7: Binarization

This processor takes a scanned color or grayscale block as input and produces a black-and-white (binarized) image. This step should separate the background from the foreground.

Binarization should be executed at least once (on page, block, or line level).

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-tesserocr-binarize | {"operation_level":"region"} | | ocrd-tesserocr-binarize -I OCR-D-SEG-REG -O OCR-D-BIN-REG -p '{"operation_level":"region"}' |

Step 8: Deskewing

This processor takes an image as input and does the skew correction for all text blocks.

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cis-ocropy-deskew | {"level-of-operation":"region"} | | ocrd-cis-ocropy-deskew -I OCR-D-BIN-REG -O OCR-D-DESKEW-REG -p '{"level-of-operation":"region"}' |

Step 9: Clipping

This processor can be used to remove intrusions of neighbouring segments in regions / lines of a workspace. It runs a (ad-hoc binarization and) connected component analysis on every text region / line of every PAGE in the input file group, as well as its overlapping neighbours. For each binary object of conflict, it determines whether it belongs to the neighbour, and can therefore be clipped to white. It references the resulting segment image files in the output PAGE (as AlternativeImage).

TODO: add images

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cis-ocropy-clip | {"level-of-operation":"region"} | | ocrd-cis-ocropy-clip -I OCR-D-DESKEW-REG -O OCR-D-CLIP-REG -p '{"level-of-operation":"region"}' |

Step 10: Line segmentation

This processor can be used to segment regions into lines. It runs a (ad-hoc binarization and) line segmentation on every text region of every PAGE in the input file group, and adds a TextLine element with the resulting polygon outline to the annotation of the output PAGE.

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cis-ocropy-segment | {"level-of-operation":"region"} | | ocrd-cis-ocropy-segment -I OCR-D-CLIP-REG -O OCR-D-SEG-LINE -p '{"level-of-operation":"region"}' |
| ocrd-tesserocr-segment-line | | | ocrd-tesserocr-segment-line -I OCR-D-CLIP-REG -O OCR-D-SEG-LINE |

Step 11: Line correction

This processor can be used to correct the segmented lines.

TODO: add images

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cis-ocropy-clip | {"level-of-operation":"line"} | | ocrd-cis-ocropy-clip -I OCR-D-SEG-LINE -O OCR-D-CLIP-LINE -p '{"level-of-operation":"line"}' |
| ocrd-cis-ocropy-resegment | | | ocrd-cis-ocropy-resegment -I OCR-D-SEG-LINE -O OCR-D-RESEG |
| ocrd-segment-repair | {"sanitize":true} | | ocrd-segment-repair -I OCR-D-SEG-LINE -O OCR-D-SEG-REPAIR -p '{"sanitize":true}' |

Step 12: Dewarping (on line level)

This processor can be used to dewarp the segmented lines.

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-anybaseocr-dewarp | {"operation_level":"line", "pix2pixHD":"/path/to/pix2pixHD/", "model_name":"/path/to/pix2pixHD/models"} | For available models take a look at this site. The parameter name 'model_name' is misleading: the given directory has to contain a file named 'latest_net_G.pth'. GPU required! | ocrd-anybaseocr-dewarp -I OCR-D-CLIP-LINE -O OCR-D-DEWARP-LINE -p '{"operation_level":"line","pix2pixHD":"/path/to/pix2pixHD/","model_name":"/path/to/pix2pixHD/models"}' |
| ocrd-cis-ocropy-dewarp | | | ocrd-cis-ocropy-dewarp -I OCR-D-CLIP-LINE -O OCR-D-DEWARP-LINE |

Text Recognition and Optimization

Step 13: Text recognition

This processor recognizes text in segmented lines.

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-tesserocr-recognize | {"textequiv_level": "glyph", "overwrite_words": true, "model": "Fraktur"}; {"textequiv_level": "glyph", "overwrite_words": true, "model": "GT4HistOCR_50000000.997_191951"} | Recommended. Model can be found here (…00000/tessdata_best/GT4HistOCR_50000000.997_191951.traineddata) | ocrd-tesserocr-recognize -I OCR-D-DEWARP-LINE -O OCR-D-OCR -p '{"model": "Fraktur"}' |
| ocrd-calamari-recognize | {"checkpoint":"/path/to/models/*.ckpt.json"} | Recommended. Model can be found here | ocrd-calamari-recognize -I OCR-D-DEWARP-LINE -O OCR-D-OCR -p '{"checkpoint":"/path/to/models/*.ckpt.json"}' |

Note: For ocrd-tesserocr, the environment variable TESSDATA_PREFIX has to point to the directory where the models in use are stored. (The directory should contain at least the following models: deu.traineddata, eng.traineddata, osd.traineddata)
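
The requirement in the note above can be checked before starting a long-running workflow. This is a minimal sketch: check_tessdata is a hypothetical helper, not part of ocrd-tesserocr, and it only tests whether the three model files named in the note exist.

```shell
# Check that TESSDATA_PREFIX is set and that the expected models are present.
# check_tessdata is a hypothetical helper, not part of ocrd-tesserocr.
check_tessdata() {
  dir="${TESSDATA_PREFIX:-}"
  if [ -z "$dir" ]; then
    echo "TESSDATA_PREFIX is not set"
    return 1
  fi
  status=0
  for model in deu eng osd; do
    if [ ! -f "$dir/$model.traineddata" ]; then
      echo "missing: $dir/$model.traineddata"
      status=1
    fi
  done
  return "$status"
}

# Example: export TESSDATA_PREFIX=/path/to/tessdata
check_tessdata || echo "tessdata directory incomplete"
```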

Post Correction (Optional)

Step 14: Text aligning

This processor aligns texts from multiple OCR engines in one PAGE.xml.

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cis-align | | | ocrd-cis-align -I OCR-D-OCR1,OCR-D-OCR2 -O OCR-D-ALIGN |
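
Aligning needs at least two OCR results as input. The sketch below assumes the two file groups OCR-D-OCR1 and OCR-D-OCR2 are produced by the two recognizers from Step 13, each writing to its own output group. The run helper and the DRY_RUN variable are hypothetical; by default the calls are only printed, and the printed form drops the shell quoting around the JSON.

```shell
# DRY_RUN=1 (the default here) only prints the calls; set DRY_RUN=0 to execute.
# run and DRY_RUN are hypothetical helpers for this sketch.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Two recognizers write to separate output groups ...
run ocrd-tesserocr-recognize -I OCR-D-DEWARP-LINE -O OCR-D-OCR1 -p '{"model": "Fraktur"}'
run ocrd-calamari-recognize -I OCR-D-DEWARP-LINE -O OCR-D-OCR2 -p '{"checkpoint":"/path/to/models/*.ckpt.json"}'
# ... and the aligner takes both groups as comma-separated input.
run ocrd-cis-align -I OCR-D-OCR1,OCR-D-OCR2 -O OCR-D-ALIGN
```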

Step 15: Post correction

This processor tries to optimize the recognized text.

See also: ToDo reference to the result inside talk on final workshop

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cor-asv-ann-process | {"textequiv_level":"line","model_file":"/path/to/model/model.h5"} | Models can be found here | ocrd-cor-asv-ann-process -I OCR-D-ALIGN -O OCR-D-PROCESS -p '{"textequiv_level":"line","model_file":"/path/to/model/model.h5"}' |
| ocrd-cis-post-correct.sh | ??? | Not tested yet! | ocrd-cis-post-correct.sh -I OCR-D-ALIGN -O OCR-D-CORRECT |

Analysis (Optional)

If Ground Truth data is available, the OCR can be analysed.

Step 16: Analysis

This processor can be used to analyse the output of the OCR.

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-dinglehopper | | First input group should point to the ground truth. | ocrd-dinglehopper -I OCR-D-GT,OCR-D-OCR -O OCR-D-EVAL |

Recommendations

All processors, with the exception of those for post-correction, were tested on selected pages of some prints from the 17th and 18th century.

The results vary quite a lot from page to page. In most cases, segmentation is a problem.

These recommendations may also work well for other prints of those centuries.

Note that for our test pages, not all steps described above were needed to obtain the best results. Depending on your particular images, you might want to include those processors again for better results.

Best results for selected pages

The following workflow has produced best results for ‘simple’ pages (e.g. this page) (CER ~1%).

| Step | Processor | Parameter |
| --- | --- | --- |
| 1 | ocrd-olena-binarize | {"impl": "sauvola-ms-split"} |
| 2 | ocrd-cis-ocropy-denoise | {"level-of-operation":"page"} |
| 3 | ocrd-anybaseocr-deskew | |
| 5 | ocrd-anybaseocr-crop | |
| 6 | ocrd-cis-ocropy-segment | {"level-of-operation":"page"} |
| 8 | ocrd-cis-ocropy-deskew | {"level-of-operation":"region"} |
| 9 | ocrd-cis-ocropy-clip | {"level-of-operation":"region"} |
| 10 | ocrd-cis-ocropy-segment | {"level-of-operation":"region"} |
| 11 | ocrd-cis-ocropy-resegment | |
| 12 | ocrd-cis-ocropy-dewarp | |
| 13 | ocrd-calamari-recognize | {"checkpoint":"/path/to/models/*.ckpt.json"} |

Example with ocrd-process

ocrd process \
  "olena-binarize -I OCR-D-IMG -O OCR-D-BIN -p '{\"impl\": \"sauvola-ms-split\"}'" \
  "cis-ocropy-denoise -I OCR-D-BIN -O OCR-D-BIN-DENOISE -p '{\"level-of-operation\":\"page\"}'" \
  "anybaseocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW" \
  "anybaseocr-crop -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-CROP" \
  "cis-ocropy-segment -I OCR-D-CROP -O OCR-D-SEG-REG -p '{\"level-of-operation\":\"page\"}'" \
  "cis-ocropy-deskew -I OCR-D-SEG-REG -O OCR-D-SEG-REG-DESKEW -p '{\"level-of-operation\":\"region\"}'" \
  "cis-ocropy-clip -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-REG-DESKEW-CLIP -p '{\"level-of-operation\":\"region\"}'" \
  "cis-ocropy-segment -I OCR-D-SEG-REG-DESKEW-CLIP -O OCR-D-SEG-LINE -p '{\"level-of-operation\":\"region\"}'" \
  "cis-ocropy-resegment -I OCR-D-SEG-LINE -O OCR-D-SEG-LINE-RESEG" \
  "cis-ocropy-dewarp -I OCR-D-SEG-LINE-RESEG -O OCR-D-SEG-LINE-RESEG-DEWARP" \
  "calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -p '{\"checkpoint\":\"/path/to/models/*.ckpt.json\"}'"

Good results for all pages

Overall, this workflow produces good results for all kinds of pages.

| Step | Processor | Parameter |
| --- | --- | --- |
| 1 | ocrd-olena-binarize | {"impl": "sauvola-ms-split"} |
| 2 | ocrd-cis-ocropy-denoise | {"level-of-operation":"page"} |
| 3 | ocrd-anybaseocr-deskew | |
| 5 | ocrd-anybaseocr-crop | |
| 6 | ocrd-cis-ocropy-segment | {"level-of-operation":"page"} |
| 10 | ocrd-tesserocr-segment-line | |
| 11 | ocrd-cis-ocropy-clip | {"level-of-operation":"line"} |
| 12 | ocrd-cis-ocropy-dewarp | |
| 13 | ocrd-tesserocr-recognize | {"textequiv_level":"glyph", "overwrite_words":true, "model":"GT4HistOCR_50000000.997_191951"} |

Example with ocrd-process

ocrd process \
  "olena-binarize -I OCR-D-IMG -O OCR-D-BIN -p '{\"impl\": \"sauvola-ms-split\"}'" \
  "cis-ocropy-denoise -I OCR-D-BIN -O OCR-D-BIN-DENOISE -p '{\"level-of-operation\":\"page\"}'" \
  "anybaseocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW" \
  "anybaseocr-crop -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-CROP" \
  "cis-ocropy-segment -I OCR-D-CROP -O OCR-D-SEG-REG -p '{\"level-of-operation\":\"page\"}'" \
  "tesserocr-segment-line -I OCR-D-SEG-REG -O OCR-D-SEG-LINE" \
  "cis-ocropy-clip -I OCR-D-SEG-LINE -O OCR-D-SEG-LINE-CLIP -p '{\"level-of-operation\":\"line\"}'" \
  "cis-ocropy-dewarp -I OCR-D-SEG-LINE-CLIP -O OCR-D-SEG-LINE-CLIP-DEWARP" \
  "tesserocr-recognize -I OCR-D-SEG-LINE-CLIP-DEWARP -O OCR-D-OCR -p '{\"textequiv_level\":\"glyph\",\"overwrite_words\":true,\"model\":\"GT4HistOCR_50000000.997_191951\"}'"

Good results for slower processors

If your computer is not that powerful, you may try this workflow. It works fine for simple pages and also produces good results in a shorter time.

| Step | Processor | Parameter |
| --- | --- | --- |
| 1 | ocrd-olena-binarize | {"impl": "sauvola-ms-split"} |
| 2 | ocrd-cis-ocropy-denoise | {"level-of-operation":"page"} |
| 3 | ocrd-anybaseocr-deskew | |
| 5 | ocrd-anybaseocr-crop | |
| 6 | ocrd-tesserocr-segment-region | |
| 8 | ocrd-cis-ocropy-deskew | {"level-of-operation":"region"} |
| 10 | ocrd-cis-ocropy-segment | {"level-of-operation":"region"} |
| 12 | ocrd-cis-ocropy-dewarp | |
| 13 | ocrd-tesserocr-recognize | {"textequiv_level":"glyph", "overwrite_words":true, "model":"GT4HistOCR_50000000.997_191951"} |

Example with ocrd-process

ocrd process \
  "olena-binarize -I OCR-D-IMG -O OCR-D-BIN -p '{\"impl\": \"sauvola-ms-split\"}'" \
  "cis-ocropy-denoise -I OCR-D-BIN -O OCR-D-BIN-DENOISE -p '{\"level-of-operation\":\"page\"}'" \
  "anybaseocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-DESKEW-PAGE" \
  "anybaseocr-crop -I OCR-D-DESKEW-PAGE -O OCR-D-CROP" \
  "tesserocr-segment-region -I OCR-D-CROP -O OCR-D-SEG-REG" \
  "cis-ocropy-deskew -I OCR-D-SEG-REG -O OCR-D-SEG-REG-DESKEW -p '{\"level-of-operation\":\"region\"}'" \
  "cis-ocropy-segment -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-LINE -p '{\"level-of-operation\":\"region\"}'" \
  "cis-ocropy-dewarp -I OCR-D-SEG-LINE -O OCR-D-SEG-LINE-DEWARP" \
  "tesserocr-recognize -I OCR-D-SEG-LINE-DEWARP -O OCR-D-OCR -p '{\"textequiv_level\":\"glyph\",\"overwrite_words\":true,\"model\":\"GT4HistOCR_50000000.997_191951\"}'"