Processor base class and helper functions.
Processor(workspace, ocrd_tool=None, parameter=None, input_file_grp='INPUT', output_file_grp='OUTPUT', page_id=None, show_resource=None, list_resources=False, show_help=False, show_version=False, dump_json=False, version=None)
A processor is a tool that implements the uniform OCR-D command-line interface for run-time data processing. That is, it executes a single workflow step, or a combination of workflow steps, on the workspace (represented by local METS). It reads input files for all or requested physical pages of the input fileGrp(s), and writes output files for them into the output fileGrp(s). It may take a number of optional or mandatory parameters.
Add PAGE-XML MetadataItem describing the processing step and runtime parameters to
List the input files (for single-valued
For each physical page:
- If there is a single PAGE-XML for the page, take it (and forget about all other files for that page).
- Else if there is a single image file, take it (and forget about all other files for that page).
- Otherwise raise an error (complaining that only PAGE-XML warrants having multiple images for a single page).
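The per-page selection rule above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `FileEntry` and `select_input_files` are hypothetical names, and the PAGE-XML MIME type constant is an assumption about the value used to recognise PAGE files. The sketch follows the three steps literally, so a page without exactly one PAGE-XML file falls through to the image check.

```python
from collections import defaultdict
from typing import NamedTuple

# Assumed MIME type for PAGE-XML files; adjust if your METS uses another value.
PAGE_MIMETYPE = "application/vnd.prima.page+xml"


class FileEntry(NamedTuple):
    """Hypothetical stand-in for a METS file entry (not the ocrd API)."""
    page_id: str
    mimetype: str
    file_id: str


def select_input_files(entries):
    """Pick one file per physical page: a single PAGE-XML, else a single image."""
    by_page = defaultdict(list)
    for entry in entries:
        by_page[entry.page_id].append(entry)

    selected = []
    for page_id, files in by_page.items():
        page_files = [f for f in files if f.mimetype == PAGE_MIMETYPE]
        image_files = [f for f in files if f.mimetype.startswith("image/")]
        if len(page_files) == 1:
            # A single PAGE-XML wins; all other files for the page are ignored.
            selected.append(page_files[0])
        elif len(image_files) == 1:
            # No unique PAGE-XML, but exactly one image: take the image.
            selected.append(image_files[0])
        else:
            raise ValueError(
                f"No unique input file for page {page_id}: only PAGE-XML "
                "warrants having multiple images for a single page"
            )
    return selected
```

For example, a page with one PAGE-XML file and one image yields the PAGE-XML file, while a page with two images and no PAGE-XML raises `ValueError`.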
A list of
List all resources found in the filesystem
(This contains the main functionality and needs to be overridden by subclasses.)
Resolve a resource name to an absolute file path with the algorithm in https://ocr-d.de/en/spec/ocrd_tool#file-parameters
val (string) – resource value to resolve
Verify that the input fulfills the processor’s requirements.
zip_input_files(require_first=True, mimetype=None, on_error='skip')
List tuples of input files (for multi-valued
Processors that expect or need multiple input file groups cannot use input_files. They must align (zip) input files across pages. This includes the case where not all pages are equally present in all file groups. It also requires making a consistent selection if there are multiple files per page.
Following the OCR-D functional model, this function tries to find a single PAGE file per page, or fall back to a single image file per page. In either case, multiple matches per page are an error (see error handling below). This default behaviour can be changed by using a fixed MIME type filter via mimetype. Even then, multiple matching files per page are an error.
Single-page multiple-file errors are handled according to on_error:
- if ‘skip’, then the page for the respective fileGrp will be silently skipped (as if there was no match at all)
- if ‘first’, then the first matching file for the page will be silently selected (as if the first was the only match)
- if ‘last’, then the last matching file for the page will be silently selected (as if the last was the only match)
- if ‘abort’, then an exception will be raised.
Multiple matches for PAGE-XML will always raise an exception.
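The on_error policy described above can be illustrated with a small stand-alone function. This is a sketch under assumptions, not the library's code: `resolve_matches` and its `is_page` flag are hypothetical, and `matches` stands for the candidate files of one page in one fileGrp.

```python
def resolve_matches(matches, on_error="skip", is_page=False):
    """Reduce the candidates for one page in one fileGrp to a single choice.

    Returns the chosen item, or None if the page is skipped or has no match.
    """
    if not matches:
        return None
    if len(matches) == 1:
        return matches[0]
    if is_page:
        # Multiple PAGE-XML matches always raise, regardless of on_error.
        raise ValueError("multiple PAGE-XML matches for a single page")
    if on_error == "skip":
        return None  # behave as if there was no match at all
    if on_error == "first":
        return matches[0]
    if on_error == "last":
        return matches[-1]
    if on_error == "abort":
        raise ValueError("multiple matching files for a single page")
    raise ValueError(f"unknown on_error value: {on_error!r}")
```

With two candidate images `["a", "b"]`, `'first'` selects `"a"`, `'last'` selects `"b"`, `'skip'` returns `None`, and `'abort'` raises.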
Keyword Arguments
require_first (boolean) – If true, then skip a page entirely whenever it is not available in the first input fileGrp.
mimetype (string) – If not None, filter by the specified MIME type (literal or regex prefixed by //). Otherwise prefer PAGE or image.
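The two keyword arguments can be illustrated with a stand-alone sketch. Both helpers are hypothetical and make assumptions the text does not state: that a `//` regex must match the whole MIME type string, and that a page missing from a file group is represented by `None` in the resulting tuple.

```python
import re


def mimetype_matches(candidate, mimetype):
    """Compare a file's MIME type against a literal or a '//'-prefixed regex."""
    if mimetype.startswith("//"):
        # Assumption: the regex has to match the entire MIME type.
        return re.fullmatch(mimetype[2:], candidate) is not None
    return candidate == mimetype


def zip_pages(groups, require_first=True):
    """Align files across file groups, page by page.

    `groups` is a list of dicts (one per fileGrp) mapping page id to file id.
    Pages missing from a group are filled with None (an assumption).
    """
    page_ids = []
    for group in groups:
        for page_id in group:
            if page_id not in page_ids:
                page_ids.append(page_id)

    tuples = []
    for page_id in page_ids:
        row = tuple(group.get(page_id) for group in groups)
        if require_first and row[0] is None:
            continue  # page is not available in the first fileGrp
        tuples.append(row)
    return tuples
```

Given `[{"p1": "a1", "p2": "a2"}, {"p2": "b2", "p3": "b3"}]`, page `p3` is dropped while `require_first` is true because the first group has no file for it.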
A list of
Generate a string describing the full CLI of this processor including params.
ocrd_tool (dict) – this processor’s tools section of the module’s
processor_instance (object, optional) – the processor implementation (for adding any module/class/function docstrings)
run_cli(executable, mets_url=None, resolver=None, workspace=None, page_id=None, overwrite=None, log_level=None, input_file_grp=None, output_file_grp=None, parameter=None, working_dir=None)
Create a workspace for mets_url and run MP CLI through it
run_processor(processorClass, ocrd_tool=None, mets_url=None, resolver=None, workspace=None, page_id=None, log_level=None, input_file_grp=None, output_file_grp=None, show_resource=None, list_resources=False, parameter=None, parameter_override=None, working_dir=None)
Create a workspace for mets_url and run processor through it
parameter (string) – URL to the parameter