ocrd.processor.base module

Processor base class and helper functions.

class ocrd.processor.base.Processor(workspace, ocrd_tool=None, parameter=None, input_file_grp='INPUT', output_file_grp='OUTPUT', page_id=None, show_resource=None, list_resources=False, show_help=False, show_version=False, dump_json=False, version=None)[source]

Bases: object

A processor is a tool that implements the uniform OCR-D command-line interface for run-time data processing. That is, it executes a single workflow step, or a combination of workflow steps, on the workspace (represented by local METS). It reads input files for all or requested physical pages of the input fileGrp(s), and writes output files for them into the output fileGrp(s). It may take a number of optional or mandatory parameters.

Instantiate, but do not process. Unless list_resources, show_resource, show_help, show_version or dump_json is true, set up for processing (parsing and validating parameters, entering the workspace directory).

show_help()[source]
show_version()[source]
verify()[source]

Verify that the input fulfills the processor’s requirements.

process()[source]

Process the workspace from the given input_file_grp(s) to the given output_file_grp(s) under the given parameter(s).

(This contains the main functionality and needs to be overridden by subclasses.)
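The override pattern described above can be illustrated with a minimal sketch. It deliberately avoids importing ocrd: StepBase and EchoStep are hypothetical stand-ins invented for this example, not part of the library, and the toy process() returns a string only so that its effect is visible.

```python
class StepBase:
    """Hypothetical stand-in for a processor base class; NOT the real ocrd API."""

    def __init__(self, input_file_grp="INPUT", output_file_grp="OUTPUT", parameter=None):
        self.input_file_grp = input_file_grp
        self.output_file_grp = output_file_grp
        self.parameter = parameter or {}

    def process(self):
        # The base class carries no processing logic of its own.
        raise NotImplementedError("subclasses must override process()")


class EchoStep(StepBase):
    """Toy subclass: reports what it would do instead of touching any files."""

    def process(self):
        # Read an (assumed) optional parameter, then describe the step.
        prefix = self.parameter.get("prefix", "")
        return f"{prefix}{self.input_file_grp} -> {self.output_file_grp}"
```

A real subclass would read its input files, compute results and write them to the output fileGrp inside process() rather than returning a description.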

add_metadata(pcgts)[source]

Add PAGE-XML MetadataItem describing the processing step and runtime parameters to ocrd_models.ocrd_page.PcGtsType pcgts.

resolve_resource(val)[source]

Resolve a resource name to an absolute file path using the algorithm described at https://ocr-d.de/en/spec/ocrd_tool#file-parameters

Parameters

val (string) – resource value to resolve

list_all_resources()[source]

List all resources found in the filesystem.

property input_files

List the input files (for single-valued input_file_grp).

For each physical page:

  • If there is a single PAGE-XML for the page, take it (and forget about all other files for that page)

  • Else if there is a single image file, take it (and forget about all other files for that page)

  • Otherwise raise an error (complaining that only PAGE-XML warrants having multiple images for a single page)

Algorithm: https://github.com/cisocrgroup/ocrd_cis/pull/57#issuecomment-656336593

Returns

A list of ocrd_models.ocrd_file.OcrdFile objects.
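The per-page selection rule for input_files can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: FileEntry is a hypothetical stand-in for OcrdFile, select_input_files is an invented name, PAGE-XML is recognised by the MIME type application/vnd.prima.page+xml, and more than one PAGE-XML per page is treated as an error (mirroring the note on zip_input_files).

```python
from collections import defaultdict
from typing import NamedTuple

# Assumption: PAGE-XML files are identified by this MIME type.
PAGE_MIMETYPE = "application/vnd.prima.page+xml"


class FileEntry(NamedTuple):
    """Hypothetical stand-in for an OcrdFile, reduced to three fields."""
    file_id: str
    page_id: str
    mimetype: str


def select_input_files(files):
    """Pick one file per physical page: a single PAGE-XML, else a single image."""
    by_page = defaultdict(list)
    for entry in files:
        by_page[entry.page_id].append(entry)

    selected = []
    for page_id, entries in by_page.items():
        page_xml = [e for e in entries if e.mimetype == PAGE_MIMETYPE]
        images = [e for e in entries if e.mimetype.startswith("image/")]
        if len(page_xml) == 1:
            # A single PAGE-XML wins; other files for the page are ignored.
            selected.append(page_xml[0])
        elif not page_xml and len(images) == 1:
            # No PAGE-XML: fall back to the single image.
            selected.append(images[0])
        else:
            # Assumption: any other combination is ambiguous and rejected.
            raise ValueError(f"ambiguous input files for page {page_id}")
    return selected
```

Pages keep the order in which they first appear in the input list, so the result is deterministic for a given list of files.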

zip_input_files(require_first=True, mimetype=None, on_error='skip')[source]

List tuples of input files (for multi-valued input_file_grp).

Processors that need multiple input file groups cannot use input_files. They must align (zip) input files across pages. This includes the case where not all pages are equally present in all file groups. It also requires making a consistent selection if there are multiple files per page.

Following the OCR-D functional model, this function tries to find a single PAGE file per page, or fall back to a single image file per page. In either case, multiple matches per page are an error (see error handling below). This default behaviour can be changed by using a fixed MIME type filter via mimetype. But still, multiple matching files per page are an error.

Single-page multiple-file errors are handled according to on_error:

  • if ‘skip’, then the page for the respective fileGrp will be silently skipped (as if there was no match at all)

  • if ‘first’, then the first matching file for the page will be silently selected (as if the first was the only match)

  • if ‘last’, then the last matching file for the page will be silently selected (as if the last was the only match)

  • if ‘abort’, then an exception will be raised.

Multiple matches for PAGE-XML will always raise an exception.

Keyword Arguments
  • require_first (boolean) – If true, then skip a page entirely whenever it is not available in the first input fileGrp.

  • mimetype (string) – If not None, filter by the specified MIME type (literal or regex prefixed by //). Otherwise prefer PAGE or image.
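The two forms of the mimetype filter (literal, or regex prefixed by //) can be illustrated as follows. This is a sketch only: mimetype_matches is a hypothetical helper, and the use of a full-string regex match is an assumption rather than something stated in this reference.

```python
import re

REGEX_PREFIX = "//"  # marker from the description above: "regex prefixed by //"


def mimetype_matches(pattern, mimetype):
    """Return True if mimetype satisfies pattern (literal or //-prefixed regex)."""
    if pattern.startswith(REGEX_PREFIX):
        # Assumption: the regex must match the whole MIME type string.
        return re.fullmatch(pattern[len(REGEX_PREFIX):], mimetype) is not None
    # Without the prefix the pattern is compared literally.
    return pattern == mimetype
```

For instance, the filter //image/.* would accept both image/png and image/tiff, whereas image/png accepts only that exact type.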

Returns

A list of ocrd_models.ocrd_file.OcrdFile tuples.
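The alignment and error handling described for zip_input_files can be sketched without the library. Everything here is illustrative: FileEntry, pick_for_page and zip_files are hypothetical names, each fileGrp is modelled as a plain list, the mimetype filter is left out, and using None for a page missing from a fileGrp is an assumption of this sketch.

```python
from typing import NamedTuple

# Assumption: PAGE-XML files are identified by this MIME type.
PAGE_MIMETYPE = "application/vnd.prima.page+xml"


class FileEntry(NamedTuple):
    """Hypothetical stand-in for an OcrdFile, reduced to three fields."""
    file_id: str
    page_id: str
    mimetype: str


def pick_for_page(candidates, on_error="skip"):
    """Choose one file among a single fileGrp's candidates for one page."""
    page_xml = [c for c in candidates if c.mimetype == PAGE_MIMETYPE]
    if len(page_xml) > 1:
        # Multiple PAGE-XML matches raise regardless of on_error.
        raise ValueError("multiple PAGE-XML files for one page")
    if page_xml:
        return page_xml[0]
    images = [c for c in candidates if c.mimetype.startswith("image/")]
    if len(images) <= 1:
        return images[0] if images else None
    if on_error == "skip":
        return None  # behave as if nothing matched
    if on_error == "first":
        return images[0]
    if on_error == "last":
        return images[-1]
    if on_error == "abort":
        raise ValueError("multiple image files for one page")
    raise ValueError(f"unknown on_error value: {on_error!r}")


def zip_files(groups, require_first=True, on_error="skip"):
    """Align files page by page; groups holds one list of FileEntry per fileGrp."""
    page_ids = []
    for group in groups:
        for entry in group:
            if entry.page_id not in page_ids:
                page_ids.append(entry.page_id)  # keep first-seen page order

    rows = []
    for page_id in page_ids:
        row = tuple(
            pick_for_page([e for e in group if e.page_id == page_id], on_error)
            for group in groups
        )
        if require_first and row[0] is None:
            continue  # page unavailable in the first fileGrp: drop it entirely
        rows.append(row)
    return rows
```

With on_error set to ‘skip’ and require_first left at its default, a page with several images in the first fileGrp disappears from the result, because the skipped entry counts as missing.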

ocrd.processor.base.generate_processor_help(ocrd_tool, processor_instance=None)[source]

Generate a string describing the full CLI of this processor including params.

Parameters
  • ocrd_tool (dict) – this processor’s tools section of the module’s ocrd-tool.json

  • processor_instance (object, optional) – the processor implementation (for adding any module/class/function docstrings)

ocrd.processor.base.run_cli(executable, mets_url=None, resolver=None, workspace=None, page_id=None, overwrite=None, log_level=None, input_file_grp=None, output_file_grp=None, parameter=None, working_dir=None)[source]

Create a workspace for mets_url and run MP CLI through it

ocrd.processor.base.run_processor(processorClass, ocrd_tool=None, mets_url=None, resolver=None, workspace=None, page_id=None, log_level=None, input_file_grp=None, output_file_grp=None, show_resource=None, list_resources=False, parameter=None, parameter_override=None, working_dir=None)[source]

Create a workspace for mets_url and run processor through it

Parameters

parameter (string) – URL to the parameter