
Workflows

Several steps are necessary to obtain the full text of a scanned print. The whole OCR process is shown in the following figure:

The following instructions describe all steps of an OCR workflow. Depending on your particular print (or rather images), not all of those steps might be necessary to obtain good results. Whether a step is required or optional is indicated in the description of each step. This guide provides an overview of the available OCR-D processors and their required parameters. For more complex workflows and recommendations see the OCR-D-Website-Wiki. Feel free to add your own experiences and recommendations in the Wiki! We will regularly amend this guide with valuable contributions from the Wiki.

Note: In order to be able to run the workflows described in this guide, you need to have prepared your images in an OCR-D-workspace. We expect that you are familiar with the OCR-D-user guide which explains all preparatory steps, syntax and different solutions for executing whole workflows.

Image Optimization (Page Level)

First, the image should be prepared for OCR.

Step 0: Image Enhancement (Page Level, optional)

Optionally, you can start off your workflow by enhancing your images, which can be vital for the following binarization. In this processing step, the raw image is taken and enhanced by e.g. grayscale conversion, brightness normalization, noise filtering, etc.

Note: ocrd-preprocess-image can be used to run arbitrary shell commands for preprocessing (original or derived) images, and can be seen as a generic OCR-D wrapper for many of the following workflow steps, provided a matching external tool exists. (The only restriction is that the tool must not change image size or the position/coordinates of its content.)

Available processors

| Processor | Parameter | Remark | Call |
| --- | --- | --- | --- |
| ocrd-im6convert | -P output-format image/tiff | for output-options see IM Documentation | ocrd-im6convert -I OCR-D-IMG -O OCR-D-ENH -P output-format image/tiff |
| ocrd-preprocess-image | -P input_feature_filter binarized<br>-P output_feature_added binarized<br>-P command "scribo-cli sauvola-ms-split '@INFILE' '@OUTFILE' --enable-negate-output" | for parameters and command examples (presets) see the Readme | ocrd-preprocess-image -I OCR-D-IMG -O OCR-D-PREP -P output_feature_added binarized -P command "scribo-cli sauvola-ms-split @INFILE @OUTFILE --enable-negate-output" |
| ocrd-skimage-normalize | | | ocrd-skimage-normalize -I OCR-D-IMG -O OCR-D-NORM |
| ocrd-skimage-denoise-raw | | | ocrd-skimage-denoise-raw -I OCR-D-IMG -O OCR-D-DENOISE |

Step 1: Binarization (Page Level)

All images should be binarized right at the beginning of your workflow, because many of the following processors require binarized images. Some implementations (for deskewing, segmentation or recognition) may produce better results using the original image, but these can always retrieve the raw image automatically instead of the binarized version.

In this processing step, a scanned color or grayscale document image is taken as input and a black-and-white binarized image is produced. This step should separate the background from the foreground.

Note: Binarization tools usually provide a threshold parameter which allows you to increase or decrease the weight of the foreground. This is optional and can be especially useful for images which have not been enhanced.
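A minimal sketch of how a few threshold values could be compared, using the threshold parameter listed for ocrd-cis-ocropy-binarize in this step. The candidate values and the OCR-D-BIN-T* output file group names are assumptions made for illustration. The script only prints the calls (dry run) so that they can be reviewed before being executed in a workspace.

```shell
#!/bin/sh
# Print one binarization call per candidate threshold (dry run only).
# Threshold values and OCR-D-BIN-T* file group names are illustrative assumptions.
binarize_call() {
    # $1 = threshold value, e.g. 0.3
    suffix=$(printf '%s' "$1" | tr -d '.')
    printf 'ocrd-cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN-T%s -P threshold %s\n' "$suffix" "$1"
}

for t in 0.1 0.3 0.5; do
    binarize_call "$t"
done
```

Writing each result to its own output file group keeps the variants side by side in the workspace, so the binarized images can be compared visually before settling on one value.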

Available processors

| Processor | Parameter | Remark | Call |
| --- | --- | --- | --- |
| ocrd-olena-binarize | -P k 0.10 | Recommended | ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN |
| ocrd-cis-ocropy-binarize | -P threshold 0.1 | Fast | ocrd-cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN |
| ocrd-sbb-binarize | -P model | pre-trained models can be downloaded from [here](https://qurator-data.de/sbb_binarization/) | ocrd-sbb-binarize -I OCR-D-IMG -O OCR-D-BIN -P model /path/to/model |
| ocrd-skimage-binarize | -P k 0.10 | Slow | ocrd-skimage-binarize -I OCR-D-IMG -O OCR-D-BIN |
| ocrd-anybaseocr-binarize | -P threshold 0.1 | Fast | ocrd-anybaseocr-binarize -I OCR-D-IMG -O OCR-D-BIN |

Step 2: Cropping (Page Level)

In this processing step, a document image is taken as input and the page is cropped to the content area only (i.e. without noise at the margins or facing pages) by marking the coordinates of the page frame. We strongly recommend executing this step if your images are not cropped already (i.e. if they show more than the page of a book, such as a ruler, footer or color scale). Otherwise you might run into severe segmentation problems.

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-anybaseocr-crop | | The input image has to be binarized and should be deskewed for the module to work. | ocrd-anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP |
| ocrd-tesserocr-crop | | Cannot cope well with facing pages (textual noise is detected as text). | ocrd-tesserocr-crop -I OCR-D-BIN -O OCR-D-CROP |

Step 3: Binarization (Page Level)

For better results, the cropped images can be binarized again at this point or later on (on region level).

Available processors

| Processor | Parameter | Remark | Call |
| --- | --- | --- | --- |
| ocrd-olena-binarize | | Recommended | ocrd-olena-binarize -I OCR-D-CROP -O OCR-D-BIN2 |
| ocrd-sbb-binarize | -P model | pre-trained models can be downloaded from [here](https://qurator-data.de/sbb_binarization/) | ocrd-sbb-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P model /path/to/model |
| ocrd-skimage-binarize | | | ocrd-skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 |
| ocrd-cis-ocropy-binarize | | | ocrd-cis-ocropy-binarize -I OCR-D-CROP -O OCR-D-BIN2 |

Step 4: Denoising (Page Level)

In this processing step, artifacts like little specks (in both foreground and background) are removed from the binarized image. (Not to be confused with raw denoising in Step 0.)

This may not be necessary for all prints, and depends heavily on the selected binarization algorithm.

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cis-ocropy-denoise | -P noise_maxsize 3.0 | | ocrd-cis-ocropy-denoise -I OCR-D-BIN2 -O OCR-D-DENOISE |
| ocrd-skimage-denoise | -P maxsize 3.0 | Slow | ocrd-skimage-denoise -I OCR-D-BIN2 -O OCR-D-DENOISE |

Step 5: Deskewing (Page Level)

In this processing step, a document image is taken as input and the skew of that page is corrected by annotating the detected angle (-45° .. 45°) and rotating the image. Optionally, also the orientation is corrected by annotating the detected angle (multiples of 90°) and transposing the image. The input images have to be binarized for this module to work.

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cis-ocropy-deskew | -P level-of-operation page | Recommended | ocrd-cis-ocropy-deskew -I OCR-D-DENOISE -O OCR-D-DESKEW-PAGE -P level-of-operation page |
| ocrd-tesserocr-deskew | -P operation_level page | Fast, also performs a decent orientation correction | ocrd-tesserocr-deskew -I OCR-D-DENOISE -O OCR-D-DESKEW-PAGE -P operation_level page |
| ocrd-anybaseocr-deskew | | | ocrd-anybaseocr-deskew -I OCR-D-DENOISE -O OCR-D-DESKEW-PAGE |

Step 6: Dewarping (Page Level)

In this processing step, a document image is taken as input and the text lines are straightened or stretched if they are curved. The input image has to be binarized for the module to work.

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-anybaseocr-dewarp | -P pix2pixHD /path/to/pix2pixHD/<br>-P model_name /path/to/pix2pixHD/models | For available models take a look at this site<br>Parameter model_name is misleading. Given directory has to contain a file named ‘latest_net_G.pth’<br>GPU required! | ocrd-anybaseocr-dewarp -I OCR-D-DESKEW-PAGE -O OCR-D-DEWARP-PAGE -p '{"pix2pixHD":"/path/to/pix2pixHD/","model_name":"/path/to/pix2pixHD/models"}' |

Layout Analysis

By now the image should be well prepared for segmentation.

Step 7: Region segmentation

In this processing step, an (optimized) document image is taken as an input and the image is segmented into the various regions, including columns. Segments are also classified, either coarse (text, separator, image, table, …) or fine-grained (paragraph, marginalia, heading, …).

Note: If you use ocrd-tesserocr-segment-region, which uses only bounding boxes instead of polygon coordinates, then you should post-process via ocrd-segment-repair with plausibilize=True to obtain better results without large overlaps.

Note: The ocrd-sbb-textline-detector and ocrd-cis-ocropy-segment processors do not only segment the page, but also the text lines within the detected text regions in one step. Therefore with those (and only with those!) processors you don’t need to segment into lines in an extra step.
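The two notes above can be read as a small decision rule for what follows region segmentation. The sketch below encodes that rule in helper functions; the function names and the printed messages are hypothetical and only restate the notes, they are not part of any OCR-D tool.

```shell
#!/bin/sh
# Encode the two notes above as helper functions (names are hypothetical).

# Succeeds if the segmenter yields bounding boxes that should be
# post-processed with ocrd-segment-repair.
needs_repair() {
    [ "$1" = "ocrd-tesserocr-segment-region" ]
}

# Succeeds if a separate line segmentation step (Step 11) is still required.
needs_line_segmentation() {
    case "$1" in
        ocrd-sbb-textline-detector|ocrd-cis-ocropy-segment) return 1 ;;
        *) return 0 ;;
    esac
}

# Print the follow-up actions for a given region segmentation processor.
follow_up() {
    if needs_repair "$1"; then
        echo "repair: ocrd-segment-repair -P plausibilize true"
    fi
    if needs_line_segmentation "$1"; then
        echo "lines: continue with Step 11"
    else
        echo "lines: already segmented, skip Step 11"
    fi
}

for p in ocrd-tesserocr-segment-region ocrd-sbb-textline-detector; do
    echo "== $p"
    follow_up "$p"
done
```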

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-tesserocr-segment-region | -P find_tables false | Recommended | ocrd-tesserocr-segment-region -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-REG |
| ocrd-segment-repair | -P plausibilize true | Only to be used after ocrd-tesserocr-segment-region | ocrd-segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -P plausibilize true |
| ocrd-sbb-textline-detector | -P model /path/to/model | Models can be found here; for model you need to pass the local path on your hard drive as parameter value. | ocrd-sbb-textline-detector -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-LINE -P model /path/to/model |
| ocrd-cis-ocropy-segment | -P level-of-operation page | | ocrd-cis-ocropy-segment -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-LINE -P level-of-operation page |
| ocrd-anybaseocr-block-segmentation | -P block_segmentation_model /path/to/mrcnn<br>-P block_segmentation_weights /path/to/model/block_segmentation_weights.h5 | For available models take a look at this site; you need to pass the local path on your hard drive as parameter value. | ocrd-anybaseocr-block-segmentation -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-REG -P block_segmentation_model /path/to/mrcnn -P block_segmentation_weights /path/to/model/block_segmentation_weights.h5 |
| ocrd-pc-segmentation | | | ocrd-pc-segmentation -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-REG |

Image Optimization (Region Level)

In the following steps, the text regions should be optimized for OCR.

Step 8: Binarization (Region Level)

In this processing step, a scanned color or grayscale document image is taken as input and a black-and-white binarized image is produced. This step should separate the background from the foreground.

Binarization should be executed at least once (on page or region level). If you already binarized your image twice on page level and have no large images, you can probably skip this step.

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-skimage-binarize | -P level-of-operation region | | ocrd-skimage-binarize -I OCR-D-SEG-REG -O OCR-D-BIN-REG -P level-of-operation region |
| ocrd-sbb-binarize | -P model<br>-P operation_level region | pre-trained models can be downloaded from [here](https://qurator-data.de/sbb_binarization/) | ocrd-sbb-binarize -I OCR-D-SEG-REG -O OCR-D-BIN-REG -P model /path/to/model -P operation_level region |
| ocrd-preprocess-image | -P level-of-operation region<br>-P output_feature_added binarized<br>-P command "scribo-cli sauvola-ms-split '@INFILE' '@OUTFILE' --enable-negate-output" | | ocrd-preprocess-image -I OCR-D-SEG-REG -O OCR-D-BIN-REG -P level-of-operation region -P output_feature_added binarized -P command "scribo-cli sauvola-ms-split @INFILE @OUTFILE --enable-negate-output" |
| ocrd-cis-ocropy-binarize | -P level-of-operation region<br>-P noise_maxsize <float> | | ocrd-cis-ocropy-binarize -I OCR-D-SEG-REG -O OCR-D-BIN-REG -P level-of-operation region |

Step 9: Deskewing (Region Level)

In this processing step, text region images are taken as input and their skew is corrected by annotating the detected angle (-45° .. 45°) and rotating the image. Optionally, also the orientation is corrected by annotating the detected angle (multiples of 90°) and transposing the image.

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cis-ocropy-deskew | -P level-of-operation region | | ocrd-cis-ocropy-deskew -I OCR-D-BIN-REG -O OCR-D-DESKEW-REG -P level-of-operation region |
| ocrd-tesserocr-deskew | | Fast, also performs a decent orientation correction | ocrd-tesserocr-deskew -I OCR-D-BIN-REG -O OCR-D-DESKEW-REG |

Step 10: Clipping (Region Level)

In this processing step, intrusions of neighbouring non-text (e.g. separator) or text segments (e.g. ascenders/descenders) into text regions of a page (or into text lines of a text region) can be removed. A connected component analysis is run on every segment, as well as on its overlapping neighbours. For each conflicting binary object, a rule based on majority and proper containment then determines whether it belongs to the neighbour and can therefore be clipped to the background.

This basic text-nontext segmentation ensures that for each text region there is a clean image without interference from separators and neighbouring texts. (On the region level, cleaning via coordinates would be impossible in many common cases.) On the line level, this can be seen as an alternative to resegmentation.

Note: Clipping must be applied before any processor that produces derived images for the same hierarchy level (region/line). Annotations on the next higher level (page/region) are fine of course.

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cis-ocropy-clip | -P level-of-operation region | | ocrd-cis-ocropy-clip -I OCR-D-DESKEW-REG -O OCR-D-CLIP-REG -P level-of-operation region |

Step 11: Line segmentation

In this processing step, text regions are segmented into text lines. A line detection algorithm is run on every text region of every PAGE in the input file group, and a TextLine element with the resulting polygon outline is added to the annotation of the output PAGE.

Note: If you use ocrd-tesserocr-segment-line, which uses only bounding boxes instead of polygon coordinates, then you should post-process with the processors described in Step 12. If you use ocrd-cis-ocropy-segment, you can directly go on with Step 13.

Note: As described in Step 7, ocrd-sbb-textline-detector and ocrd-cis-ocropy-segment do not only segment the page, but also the text lines within the detected text regions in one step. Therefore with those (and only with those!) processors you don’t need to segment into lines in an extra step.

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cis-ocropy-segment | -P level-of-operation region | | ocrd-cis-ocropy-segment -I OCR-D-CLIP-REG -O OCR-D-SEG-LINE -P level-of-operation region |
| ocrd-tesserocr-segment-line | | | ocrd-tesserocr-segment-line -I OCR-D-CLIP-REG -O OCR-D-SEG-LINE |

Step 12: Resegmentation (Line Level)

In this processing step the segmented text lines can be corrected in order to reduce their overlap.

This can be done either via coordinates (polygonalizing the bounding boxes tightly around the glyphs) – which is what ocrd-cis-ocropy-resegment offers – or via derived images (clipping pixels that do not belong to a text line to the background color) – which is what ocrd-cis-ocropy-clip (on the line level) offers. The former is usually more accurate, but not always possible (for example, when neighbors intersect heavily, creating non-contiguous contours). The latter is only possible if no preceding workflow step has already annotated derived images (AlternativeImage references) on the line level (see also region-level clipping).

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cis-ocropy-clip | -P level-of-operation line | | ocrd-cis-ocropy-clip -I OCR-D-SEG-LINE -O OCR-D-CLIP-LINE -P level-of-operation line |
| ocrd-cis-ocropy-resegment | | | ocrd-cis-ocropy-resegment -I OCR-D-SEG-LINE -O OCR-D-RESEG |

Step 13: Dewarping (Line Level)

In this processing step, the text line images get vertically aligned if they are curved.

   

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cis-ocropy-dewarp | | | ocrd-cis-ocropy-dewarp -I OCR-D-CLIP-LINE -O OCR-D-DEWARP-LINE |

Text Recognition

Step 14: Text recognition

In this processing step, the text in the segmented lines is recognized.

An overview of the existing model repositories and short descriptions of the most important models can be found here.

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-tesserocr-recognize | -P model GT4HistOCR_50000000.997_191951 | Recommended<br>Model can be found here<br>a faster variant is here | TESSDATA_PREFIX="/test/data/tesseractmodels/" ocrd-tesserocr-recognize -I OCR-D-DEWARP-LINE -O OCR-D-OCR -P model Fraktur+Latin |
| ocrd-calamari-recognize | -P checkpoint "/path/to/models/*.ckpt.json" | Recommended<br>Model can be found here; for checkpoint you need to pass the local path on your hard drive as parameter value, and keep the verbatim asterisk (*). | ocrd-calamari-recognize -I OCR-D-DEWARP-LINE -O OCR-D-OCR -P checkpoint /path/to/models/\*.ckpt.json |

Note: For ocrd-tesserocr the environment variable TESSDATA_PREFIX has to be set to point to the directory where the used models are stored unless the default directory (normally $VIRTUAL_ENV/share/tessdata) is used. The directory should at least contain the following models: deu.traineddata, eng.traineddata, osd.traineddata.
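Before running recognition, it can help to verify that the three models named in the note above are present. The following is a minimal sketch, assuming the default location given in the note when TESSDATA_PREFIX is unset. The function name missing_tessdata_models is hypothetical.

```shell
#!/bin/sh
# List the models named in the note above that are missing from a tessdata directory.
# missing_tessdata_models is a hypothetical helper, not part of OCR-D.
missing_tessdata_models() {
    # $1 = directory that should contain the *.traineddata files
    dir=$1
    for model in deu eng osd; do
        [ -f "$dir/$model.traineddata" ] || echo "$model.traineddata"
    done
}

# Fall back to the default directory mentioned above if TESSDATA_PREFIX is unset.
tessdata_dir=${TESSDATA_PREFIX:-${VIRTUAL_ENV:-}/share/tessdata}
missing=$(missing_tessdata_models "$tessdata_dir")
if [ -n "$missing" ]; then
    echo "Missing in $tessdata_dir:"
    echo "$missing"
else
    echo "All expected models found in $tessdata_dir"
fi
```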

Note: Faster models for tesserocr-recognize are available from https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/Fraktur_5000000/tessdata_fast/. A good and currently the fastest model is Fraktur-fast. UB Mannheim provides many more models online which were trained on different GT data sets, for example from Austrian Newspapers.

Note: If you want to continue with the optional post-correction, you should also set textequiv_level to glyph, or in the case of ocrd-calamari-recognize at least to word (which is already the default for ocrd-tesserocr-recognize).

Post Correction (Optional)

Step 15: Text alignment

In this processing step, text results from multiple OCR engines (in different annotations sharing the same line segmentation) are aligned into one annotation with TextEquiv alternatives.

Note: This step is only required if you want to do post-correction afterwards, feeding alternative character hypotheses from several OCR-engines to improve the search space. The previous recognition step must be run on glyph or at least on word level.

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cis-align | | | ocrd-cis-align -I OCR-D-OCR1,OCR-D-OCR2 -O OCR-D-ALIGN |
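The call above presupposes two OCR results in the file groups OCR-D-OCR1 and OCR-D-OCR2. As a rough sketch, these could come from running both recognizers of Step 14 on the same line segmentation with word-level output. The model and checkpoint values are the placeholders from Step 14, and the helper name alignment_calls is hypothetical. The script only prints the calls (dry run).

```shell
#!/bin/sh
# Dry run: print two recognition calls on the same input, followed by alignment.
# alignment_calls is a hypothetical helper; model values are placeholders from Step 14.
alignment_calls() {
    # $1 = input file group containing the segmented (and dewarped) lines
    in_grp=$1
    printf '%s\n' \
        "ocrd-tesserocr-recognize -I $in_grp -O OCR-D-OCR1 -P textequiv_level word -P model Fraktur+Latin" \
        "ocrd-calamari-recognize -I $in_grp -O OCR-D-OCR2 -P textequiv_level word -P checkpoint /path/to/models/\*.ckpt.json" \
        "ocrd-cis-align -I OCR-D-OCR1,OCR-D-OCR2 -O OCR-D-ALIGN"
}

alignment_calls OCR-D-DEWARP-LINE
```

Both recognizers read the same input file group so that the two annotations share one line segmentation, which the alignment step requires.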

Step 16: Post-correction

In this processing step, the recognized text is corrected by statistical error modelling, language modelling, and word modelling (dictionaries, morphology and orthography).

Note: Most tools benefit strongly from input which includes alternative OCR hypotheses. Currently, models for ocrd-cor-asv-ann-process are optimised for input from single OCR engines, whereas ocrd-cis-postcorrect expects input from multi-OCR alignment.

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cor-asv-ann-process | -P textequiv_level word<br>-P model_file /path/to/model/model.h5 | Pre-trained models can be found here; for model_file you need to pass the local path on your hard drive as parameter value. (Relative paths are resolved from the workspace directory or the environment variable CORASVANN_DATA.) There is no default model_file. | ocrd-cor-asv-ann-process -I OCR-D-OCR -O OCR-D-PROCESS -P textequiv_level word -P model_file /path/to/model/model.h5 |
| ocrd-cis-postcorrect | -P profilerPath /path/to/profiler.bash<br>-P profilerConfig ignored<br>-P nOCR 2<br>-P model /path/to/model/model.zip | The profilerConfig parameters can be specified in a JSON file. If you do not want to use a profiler, you can set the value for profilerConfig to ignored. In this case, your profiler.bash should look like this:<br>#!/bin/bash<br>cat > /dev/null<br>echo '{}'<br>For model you need to pass the local path on your hard drive as parameter value. There is no default model. | ocrd-cis-postcorrect -I OCR-D-ALIGN -O OCR-D-CORRECT -p postcorrect.json |
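The postcorrect.json file used in the call above is not shown in this guide. A plausible minimal version, assuming it simply carries the four parameters listed for ocrd-cis-postcorrect as top-level keys, could be generated together with the no-op profiler script as follows. The helper name write_postcorrect_config is hypothetical and the model path is a placeholder.

```shell
#!/bin/sh
# Write the no-op profiler script and a matching parameter file into a directory.
# Assumption: postcorrect.json holds the listed parameters as top-level keys.
write_postcorrect_config() {
    # $1 = target directory (hypothetical; use an absolute path)
    dir=$1
    cat > "$dir/profiler.bash" <<'EOF'
#!/bin/bash
cat > /dev/null
echo '{}'
EOF
    chmod +x "$dir/profiler.bash"
    cat > "$dir/postcorrect.json" <<EOF
{
  "profilerPath": "$dir/profiler.bash",
  "profilerConfig": "ignored",
  "nOCR": 2,
  "model": "/path/to/model/model.zip"
}
EOF
}

write_postcorrect_config "$(pwd)"
echo "Next: ocrd-cis-postcorrect -I OCR-D-ALIGN -O OCR-D-CORRECT -p postcorrect.json"
```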

Evaluation (Optional)

If Ground Truth data is available, the OCR can be evaluated.

Step 17: OCR Evaluation

In this processing step, the text output of the OCR or post-correction can be evaluated by aligning with ground truth text and measuring the error rates.

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-dinglehopper | | For page-wise visual comparison (2 file groups). First input group should point to the ground truth. | ocrd-dinglehopper -I OCR-D-GT,OCR-D-OCR -O OCR-D-EVAL |
| ocrd-cor-asv-ann-evaluate | -P metric historic-latin<br>-P confusion 20 | For document-wide aggregation (N file groups). First input group should point to the ground truth.<br>There is no output file group, it only uses logging. If you want to save the evaluation findings in a file, you could e.g. add 2> eval.txt at the end of your command (or use ocrd-make). | ocrd-cor-asv-ann-evaluate -I OCR-D-GT,OCR-D-OCR |

Generic Data Management (Optional)

OCR-D produces PAGE XML files which contain the recognized text as well as detailed information on the structure of the processed pages, the coordinates of the recognized elements etc. Optionally, the output can be converted to other formats, or copied verbatim (re-generating PAGE-XML).

Step 18: Adaptation of Coordinates

All OCR-D processors are required to relate coordinates to the original image for each page, and to keep the original image reference (Page/@imageFilename). However, sometimes it may be necessary to deviate from that strict requirement in order to get the overall workflow to work.

For example, if you have a page-level dewarping step, it is currently impossible to correctly relate to the original image’s coordinates for any segments annotated after that, because there is no descriptive annotation of the underlying coordinate transform in PAGE-XML. Therefore, it is better to replace the original image of the output PAGE-XML by the dewarped image before proceeding with the workflow. If the dewarped image has also been cropped or deskewed, then of course all existing coordinates are re-calculated accordingly as well.

Another use case is exporting PAGE-XML for tools that cannot apply cropping or deskewing, like LAREX or Transkribus.

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-segment-replace-original | | | ocrd-segment-replace-original -I OCR-D-SEG-LINE -O OCR-D-SUBST |
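For the page-level dewarping case described above, the replacement would happen directly after dewarping and before segmentation. A rough sketch of that call order follows, printed as a dry run. The helper name after_dewarp_calls is hypothetical, and continuing with ocrd-tesserocr-segment-region is only one possible choice.

```shell
#!/bin/sh
# Dry run: print the calls that follow a page-level dewarping step.
# after_dewarp_calls is a hypothetical helper; OCR-D-SUBST is the file group
# name used in the table above.
after_dewarp_calls() {
    # $1 = file group written by the dewarping step
    printf '%s\n' \
        "ocrd-segment-replace-original -I $1 -O OCR-D-SUBST" \
        "ocrd-tesserocr-segment-region -I OCR-D-SUBST -O OCR-D-SEG-REG"
}

after_dewarp_calls OCR-D-DEWARP-PAGE
```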

Step 19: Format Conversion

In this processing step the produced PAGE XML files can be converted to ALTO, PDF, hOCR or text files. Note that ALTO and hOCR can also be converted into different formats whereas the PDF version of PAGE XML OCR results is a widely accessible format that can be used as-is by expert and layman alike.

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-fileformat-transform | -P from-to "alto2.0 alto3.0" (or "alto2.0 alto3.1", "alto2.0 hocr", "alto2.1 alto3.0", "alto2.1 alto3.1", "alto2.1 hocr", "alto page", "alto text", "gcv hocr", "hocr alto2.0", "hocr alto2.1", "hocr text", "page alto", "page hocr", "page text") | As the value consists of two words, when using -P form it has to be enclosed in quotation marks.<br>If you want to save all OCR results in one file, you can use the following command: `cat OCR* > full.txt` | ocrd-fileformat-transform -I OCR-D-OCR -O OCR-D-ALTO |
| ocrd-pagetopdf | { "font": "AletheiaSans.ttf", "negative2zero": true, "multipage": "name_of_pdf", "pagelabel": "pageId", "textequiv_level": "word", "outlines": "line" } | font: font file name to use for rendering text<br>negative2zero: fix (invalid) negative coordinates<br>multipage: concatenate to multi-page PDF (empty for none)<br>pagelabel: multi-page PDF page labels<br>textequiv_level: render text on this hierarchy level<br>outlines: draw polygon outlines in the PDF (empty for none) | ocrd-pagetopdf -I OCR-D-OCR -O OCR-D-PDF -P textequiv_level word |
| ocrd-export-larex | | Create a file group with PAGE alongside image files (differing only in file name suffix) to accommodate LAREX' bookpath directory assumptions. | ocrd-export-larex -I OCR-D-OCR -O OCR-D-LAREX |
| ocrd-segment-extract-pages | -P mimetype image/png<br>-P transparency true | Get page images (cropped and deskewed as annotated; raw and binarized) and mask images (color-coded for regions) along with JSON files for region annotations (custom and COCO format). | ocrd-segment-extract-pages -I OCR-D-SEG-REGION -O OCR-D-IMG-PAGE,OCR-D-IMG-PAGE-BIN,OCR-D-IMG-PAGE-MASK |
| ocrd-segment-extract-regions | -P mimetype image/png<br>-P transparency true | Get region images (cropped, masked and deskewed as annotated) along with JSON files for region annotations (custom format). | ocrd-segment-extract-regions -I OCR-D-SEG-REGION -O OCR-D-IMG-REGION |
| ocrd-segment-extract-lines | -P mimetype image/png<br>-P transparency true | Get text line images (cropped, masked and deskewed as annotated) along with JSON files for line annotations (custom format). | ocrd-segment-extract-lines -I OCR-D-SEG-LINE -O OCR-D-IMG-LINE |
| ocrd-segment-from-masks | -P colordict '{ "#969696": "TableRegion", "#00FF00": "TextRegion:page-number", "#FFFF00": "TextRegion:heading", "#00FFFF": "GraphicRegion:logo", "#0000FF": "TextRegion:subject", "#FF0000": "TextRegion:catch-word", "#FF00FF": "TextRegion:footnote", "#646464": "TextRegion:paragraph" }' | Import mask images as region segmentation. If colordict is empty, defaults to PageViewer color scheme (also written by ocrd-segment-extract-pages). | ocrd-segment-from-masks -I OCR-D-SEG-PAGE,OCR-D-IMG-PAGE-MASK -O OCR-D-SEG-REGION |
| ocrd-segment-from-coco | | Import COCO format region segmentation (also written by ocrd-segment-extract-pages). | ocrd-segment-from-coco -I OCR-D-SEG-PAGE,OCR-D-SEG-COCO -O OCR-D-SEG-REGION |

Step 20: Archiving

After you have successfully processed your images, the results should be saved and archived. OLA-HD is a long-term archive system which works as a mixture between an archive system and a repository. For further details on OLA-HD see the extensive concept paper. You can also check out the prototype to make sure OLA-HD meets your needs and requirements. To use the prototype, specify http://141.5.98.232/api as the endpoint parameter in your call.

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-olahd-client | { "endpoint": "URL of your OLA-HD instance", "username": "X", "password": "*" } | the parameters should be written to a json file:<br>echo '{ "endpoint": "URL of your OLA-HD instance", "username": "X", "password": "*"}' > olahd.json | ocrd-olahd-client -I OCR-D-OCR -p olahd.json |

Step 21: Dummy Processing

Sometimes it can be useful to have a dummy processor, which takes the files in an input fileGrp and copies them to a new output fileGrp, re-generating the PAGE XML from the current namespace schema/model.

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-dummy | | | ocrd-dummy -I OCR-D-FILEGRP -O OCR-D-DUMMY |

Recommendations

In order to facilitate the usage of OCR-D and the configuration of workflows, we provide two workflows which can be used as a start for your OCR-D-tests. They were determined by testing the processors listed above on selected pages of some prints from the 17th and 18th century.

The results vary quite a lot from page to page. In most cases, segmentation is a problem.

Note that for our test pages, not all steps described above were needed to obtain the best results. Depending on your particular images, you might want to include those processors again for better results.

We are currently working on regression tests, which will soon allow us to provide more well-founded workflows to replace these interim solutions.

Best results for selected pages

The following workflow has produced best results for ‘simple’ pages (e.g. this page) (CER ~1%).

| Step | Processor | Parameter |
| --- | --- | --- |
| 1 | ocrd-cis-ocropy-binarize | |
| 2 | ocrd-anybaseocr-crop | |
| 3 | ocrd-skimage-binarize | -P method li |
| 4 | ocrd-skimage-denoise | -P level-of-operation page |
| 5 | ocrd-tesserocr-deskew | -P operation_level page |
| 7 | ocrd-cis-ocropy-segment | -P level-of-operation page |
| 9 | ocrd-tesserocr-deskew | |
| 13 | ocrd-cis-ocropy-dewarp | |
| 14 | ocrd-calamari-recognize | -P checkpoint /path/to/models/\*.ckpt.json |

Example with ocrd-process

ocrd process \
  "cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN" \
  "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
  "skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li" \
  "skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
  "tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
  "cis-ocropy-segment -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-REG -P level-of-operation page" \
  "tesserocr-deskew -I OCR-D-SEG-REG -O OCR-D-SEG-REG-DESKEW" \
  "cis-ocropy-dewarp -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-LINE-RESEG-DEWARP" \
  "calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint /path/to/models/\*.ckpt.json"

Note: (1) This workflow expects your images to be stored in a folder called OCR-D-IMG. If your images are saved in a different folder, you need to adjust -I OCR-D-IMG in the second line of the call above with the name of your folder, e.g. -I MAX (2) For the last processor in this workflow, ocrd-calamari-recognize, you need to specify your local path to the model on your hard drive as parameter value! The last line of the ocrd-process call above could e.g. look like this:

  "calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint /test/data/calamari_models/\*.ckpt.json"

All the other lines can just be copied and pasted.
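To avoid editing the workflow by hand, the two values that vary (the input file group and the checkpoint path) can be passed in as arguments. The sketch below prints the steps of the workflow above with those two values substituted. The function name workflow_steps is hypothetical, and MAX and /test/data/calamari_models are the example values from the note. The script only prints the steps and does not call ocrd process.

```shell
#!/bin/sh
# Print the steps of the workflow above with a custom input file group and
# checkpoint pattern (dry run). workflow_steps is a hypothetical helper.
workflow_steps() {
    # $1 = input file group, $2 = Calamari checkpoint pattern
    cat <<EOF
cis-ocropy-binarize -I $1 -O OCR-D-BIN
anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP
skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li
skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page
tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page
cis-ocropy-segment -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-REG -P level-of-operation page
tesserocr-deskew -I OCR-D-SEG-REG -O OCR-D-SEG-REG-DESKEW
cis-ocropy-dewarp -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-LINE-RESEG-DEWARP
calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint $2
EOF
}

workflow_steps MAX '/test/data/calamari_models/\*.ckpt.json'
```

Each printed line corresponds to one quoted argument of the ocrd process call shown above.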

Good results for slower processors

If your computer is not that powerful, you may try this workflow. It works fine for simple pages and also produces good results in a shorter time.

| Step | Processor | Parameter |
| --- | --- | --- |
| 1 | ocrd-cis-ocropy-binarize | |
| 2 | ocrd-anybaseocr-crop | |
| 3 | ocrd-skimage-binarize | -P method li |
| 4 | ocrd-skimage-denoise | -P level-of-operation page |
| 5 | ocrd-tesserocr-deskew | -P operation_level page |
| 7 | ocrd-tesserocr-segment-region | |
| 7a | ocrd-segment-repair | -P plausibilize true |
| 9 | ocrd-tesserocr-deskew | |
| 10 | ocrd-cis-ocropy-clip | |
| 11 | ocrd-tesserocr-segment-line | |
| 12 | ocrd-cis-ocropy-clip | -P level-of-operation line |
| 13 | ocrd-cis-ocropy-dewarp | |
| 14 | ocrd-tesserocr-recognize | -P textequiv_level glyph -P overwrite_words true -P model GT4HistOCR_50000000.997_191951 |

Example with ocrd-process

ocrd process \
  "cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN" \
  "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
  "skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li" \
  "skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
  "tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
  "tesserocr-segment-region -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-REG" \
  "segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -P plausibilize true" \
  "tesserocr-deskew -I OCR-D-SEG-REPAIR -O OCR-D-SEG-REG-DESKEW" \
  "cis-ocropy-clip -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-REG-DESKEW-CLIP" \
  "tesserocr-segment-line -I OCR-D-SEG-REG-DESKEW-CLIP -O OCR-D-SEG-LINE" \
  "cis-ocropy-clip -I OCR-D-SEG-LINE -O OCR-D-SEG-LINE-CLIP -P level-of-operation line" \
  "cis-ocropy-dewarp -I OCR-D-SEG-LINE-CLIP -O OCR-D-SEG-LINE-RESEG-DEWARP" \
  "tesserocr-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P textequiv_level glyph -P overwrite_words true -P model GT4HistOCR_50000000.997_191951"

Note: (1) This workflow expects your images to be stored in a folder called OCR-D-IMG. If your images are saved in a different folder, you need to adjust -I OCR-D-IMG in the second line of the call above with the name of your folder, e.g. -I my_images (2) For the last processor in this workflow, ocrd-tesserocr-recognize, the environment variable TESSDATA_PREFIX has to be set to point to the directory where the used models are stored if they are not in the default location.