ocrd.processor.base module

Processor base class and helper functions.

class ocrd.processor.base.Processor(workspace: Workspace | None, ocrd_tool=None, parameter=None, input_file_grp=None, output_file_grp=None, page_id=None, download_files=True, version=None)[source]

Bases: object

A processor is a tool that implements the uniform OCR-D command-line interface for run-time data processing.

That is, it executes a single workflow step, or a combination of workflow steps, on the workspace (represented by local METS). It reads input files for all or selected physical pages of the input fileGrp(s), computes additional annotation, and writes output files for them into the output fileGrp(s). It may take a number of optional or mandatory parameters.

Instantiate, but do not set up (neither for processing nor for other usage). If given, parse and validate parameter.

Parameters:

workspace (Workspace) – The workspace to process. If not None, then chdir to that directory. Deprecated since version 3.0: Should be None here, but then needs to be set before processing.

Keyword Arguments:
  • parameter (string) – JSON of the runtime choices for ocrd-tool parameters. Can be None even for processing, but then needs to be set before running.

  • input_file_grp (string) – comma-separated list of METS fileGrp used for input. Deprecated since version 3.0: Should be None here, but then needs to be set before processing.

  • output_file_grp (string) – comma-separated list of METS fileGrp used for output. Deprecated since version 3.0: Should be None here, but then needs to be set before processing.

  • page_id (string) – comma-separated list of METS physical page IDs to process (or empty for all pages). Deprecated since version 3.0: Should be None here, but then needs to be set before processing.

  • download_files (boolean) – Whether input files will be downloaded prior to processing, defaults to ocrd_utils.config.OCRD_DOWNLOAD_INPUT which is True by default
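As a rough illustration of what parsing and validating parameter could involve, the following sketch decodes a JSON string and fills in defaults from a parameter description. The helper parse_parameter and the TOOL_PARAMETERS dict are hypothetical and only mimic the idea; the actual validation is done against the ocrd-tool schema and may behave differently.

```python
import json

def parse_parameter(parameter_json, tool_parameters):
    """Parse a JSON parameter string and fill in defaults (illustrative only)."""
    given = json.loads(parameter_json) if parameter_json else {}
    unknown = set(given) - set(tool_parameters)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    result = {}
    for name, spec in tool_parameters.items():
        if name in given:
            result[name] = given[name]
        elif "default" in spec:
            result[name] = spec["default"]
        elif spec.get("required", False):
            raise ValueError(f"missing required parameter: {name}")
    return result

# Hypothetical parameter description, loosely modelled on an ocrd-tool entry
TOOL_PARAMETERS = {
    "dpi": {"type": "number", "default": 300},
    "model": {"type": "string", "required": True},
}

params = parse_parameter('{"model": "default.bin"}', TOOL_PARAMETERS)
print(params)
```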

max_instances: int = -1

maximum number of cached instances (ignored if negative), to be applied on top of OCRD_MAX_PROCESSOR_CACHE (i.e. whatever is smaller).

(Override this if you know how many instances fit into memory - GPU / CPU RAM - at once.)

max_workers: int = -1

maximum number of processor forks for page-parallel processing (ignored if negative), to be applied on top of OCRD_MAX_PARALLEL_PAGES (i.e. whatever is smaller).

(Override this if you know how many pages fit into processing units - GPU shaders / CPU cores - at once, or if your class already creates threads prior to forking, e.g. during setup.)

max_page_seconds: int = -1

maximum number of seconds that may be spent processing a single page (ignored if negative), to be applied on top of OCRD_PROCESSING_PAGE_TIMEOUT (i.e. whatever is smaller).

(Override this if you know how costly this processor may be, irrespective of image size or complexity of the page.)
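The "whatever is smaller" rule shared by max_instances, max_workers and max_page_seconds can be sketched as follows. The helper effective_limit is hypothetical, and it assumes the configured value is a positive number; any special meaning the configuration variables give to zero or negative values is not modelled here.

```python
def effective_limit(class_limit, config_limit):
    """Combine a class-level limit with a configured limit (sketch only)."""
    if class_limit < 0:
        # negative class attribute: ignored, the configured value applies
        return config_limit
    # otherwise the stricter (smaller) of the two wins
    return min(class_limit, config_limit)

print(effective_limit(-1, 4))  # class attribute left at its default
print(effective_limit(2, 4))   # class attribute is stricter
print(effective_limit(8, 4))   # configuration is stricter
```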

property metadata_filename: str

Relative location of the ocrd-tool.json file inside the package.

Used by metadata_location.

(Override if ocrd-tool.json is not in the root of the module, e.g. namespace/ocrd-tool.json or data/ocrd-tool.json).

property metadata_location: Path

Absolute path of the ocrd-tool.json file as distributed with the package.

Used by metadata_rawdict.

(Override if ocrd-tool.json is not distributed with the Python package.)

property metadata_rawdict: dict

Raw (unvalidated, unexpanded) ocrd-tool.json dict contents of the package.

Used by metadata.

(Override if ocrd-tool.json is not in a file.)

property metadata: dict

The ocrd-tool.json dict contents of the package, according to the OCR-D spec for processor tools.

After deserialisation, it also gets validated against the schema with all defaults expanded.

Used by ocrd_tool and version.

(Override if you want to provide metadata programmatically instead of a JSON file.)

property ocrd_tool: dict

The ocrd-tool.json dict contents of this processor tool. Usually the executable key of the tools part of metadata.

(Override if you do not want to use metadata lookup mechanism.)

property executable: str

The executable name of this processor tool. Taken from the runtime filename.

Used by ocrd_tool for lookup in metadata.

(Override if your entry-point name deviates from the executable name, or the processor gets instantiated from another runtime.)

property version: str

The program version of the package. Usually the version part of metadata.

(Override if you do not want to use metadata lookup mechanism.)
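The lookup chain described above (metadata_rawdict feeds metadata, which feeds ocrd_tool and version) might be pictured with the stand-in class below. MetadataSketch is not the real Processor class, the executable name and dict contents are made up, and schema validation with default expansion is deliberately left out.

```python
class MetadataSketch:
    """Stand-in mimicking the property chain; not the real Processor class."""

    @property
    def executable(self):
        return "ocrd-example-tool"  # hypothetical executable name

    @property
    def metadata_rawdict(self):
        # Overriding this property is the hook for metadata not kept in a file
        return {
            "version": "0.1.0",
            "tools": {
                "ocrd-example-tool": {
                    "executable": "ocrd-example-tool",
                    "parameters": {},
                }
            },
        }

    @property
    def metadata(self):
        # The real property also validates and expands defaults; omitted here
        return self.metadata_rawdict

    @property
    def ocrd_tool(self):
        # the executable key of the tools part of metadata
        return self.metadata["tools"][self.executable]

    @property
    def version(self):
        # the version part of metadata
        return self.metadata["version"]

tool = MetadataSketch()
print(tool.version, tool.ocrd_tool["executable"])
```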

property parameter: dict | None

the runtime parameter dict to be used by this processor

show_help(subcommand=None)[source]

Print a usage description including the standard CLI and all of this processor’s ocrd-tool parameters and docstrings.

show_version()[source]

Print information on this processor’s version and OCR-D version.

verify()[source]

Verify that input_file_grp and output_file_grp fulfill the processor’s requirements.

dump_json()[source]

Print ocrd_tool on stdout.

dump_module_dir()[source]

Print moduledir on stdout.

list_resources()[source]

Find all installed resource files in the search paths and print their path names.

setup() → None[source]

Prepare the processor for actual data processing, prior to changing to the workspace directory but after parsing parameters.

(Override this to load models into memory etc.)

shutdown() → None[source]

Bring down the processor after data processing, after changing back from the workspace directory but before exiting (or setting up with different parameters).

(Override this to unload models from memory etc.)
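The intended pairing of setup() and shutdown() around processing can be sketched with a stand-in class. LifecycleSketch, its model attribute and its process_page method are hypothetical and only illustrate the ordering; a real subclass would derive from Processor and load an actual model.

```python
class LifecycleSketch:
    """Stand-in for a Processor subclass; 'model' is a hypothetical attribute."""

    def __init__(self):
        self.model = None
        self.events = []

    def setup(self):
        # Load expensive resources once, before any page is processed
        self.model = {"weights": [0.1, 0.2, 0.3]}  # placeholder for a real model
        self.events.append("setup")

    def process_page(self, page_id):
        assert self.model is not None, "setup() must run before processing"
        self.events.append(f"process {page_id}")

    def shutdown(self):
        # Release resources after processing
        self.model = None
        self.events.append("shutdown")

proc = LifecycleSketch()
proc.setup()
try:
    for page_id in ["PHYS_0001", "PHYS_0002"]:
        proc.process_page(page_id)
finally:
    proc.shutdown()
print(proc.events)
```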

process() → None[source]

Process all files of the workspace from the given input_file_grp to the given output_file_grp for the given page_id (or all pages) under the given parameter.

(This contains the main functionality and needs to be overridden by subclasses.)

process_workspace(workspace: Workspace) → None[source]

Process all files of the given workspace, from the given input_file_grp to the given output_file_grp for the given page_id (or all pages) under the given parameter.

Delegates to process_workspace_submit_tasks() and process_workspace_handle_tasks().

(This will iterate over pages and files, calling process_page_file() and handling exceptions. It should be overridden by subclasses to handle cases like post-processing or computation across pages.)

process_workspace_submit_tasks(executor: DummyExecutor | ProcessPoolExecutor, max_seconds: int) → Dict[DummyFuture | Future, Tuple[str, List[OcrdFile | ClientSideOcrdFile | None]]][source]

Look up all input files of the given workspace from the given input_file_grp for the given page_id (or all pages), and schedule calling process_page_file() on them for each page via executor (enforcing a per-page time limit of max_seconds).

When running with OCRD_MAX_PARALLEL_PAGES>1 and the workspace via METS Server, the executor will fork this many parallel worker subprocesses, each processing one page at a time. (Interprocess communication is done via task and result queues.)

Otherwise, tasks are run sequentially in the current process.

Delegates to zip_input_files() to get the input files for each page, and then calls process_workspace_submit_page_task().

Returns a dict mapping the per-page tasks (i.e. futures submitted to the executor) to their corresponding pageId and input files.

process_workspace_submit_page_task(executor: DummyExecutor | ProcessPoolExecutor, max_seconds: int, input_file_tuple: List[OcrdFile | ClientSideOcrdFile | None]) → Tuple[DummyFuture | Future, str, List[OcrdFile | ClientSideOcrdFile | None]][source]

Ensure all input files for a single page are downloaded to the workspace, then schedule process_page_file() to be run on them via executor (enforcing a per-page time limit of max_seconds).

Delegates to process_page_file() (wrapped in _page_worker() to share the processor instance across forked processes).

Returns a tuple of:

  • the scheduled future object,

  • the corresponding pageId,

  • the corresponding input files.

process_workspace_handle_tasks(tasks: Dict[DummyFuture | Future, Tuple[str, List[OcrdFile | ClientSideOcrdFile | None]]]) → Tuple[int, int, Dict[str, int], int][source]

Look up scheduled per-page futures one by one, handle errors (exceptions) and gather results.

Enforces policies configured by the following environment variables:

  • OCRD_EXISTING_OUTPUT (abort/skip/overwrite)

  • OCRD_MISSING_OUTPUT (abort/skip/fallback-copy)

  • OCRD_MAX_MISSING_OUTPUTS (abort after all).

Returns a tuple of:

  • the number of successfully processed pages

  • the number of failed (i.e. skipped or copied) pages

  • a dict of the type and corresponding number of exceptions seen

  • the number of total requested pages (i.e. success+fail+existing).

Delegates to process_workspace_handle_page_task() for each page.

process_workspace_handle_page_task(page_id: str, input_files: List[OcrdFile | ClientSideOcrdFile | None], task: DummyFuture | Future) → bool | Exception[source]

Await a single page result and handle errors (exceptions), enforcing policies configured by the following environment variables:

  • OCRD_EXISTING_OUTPUT (abort/skip/overwrite)

  • OCRD_MISSING_OUTPUT (abort/skip/fallback-copy)

  • OCRD_MAX_MISSING_OUTPUTS (abort after all).

Returns:

  • true in case of success

  • false in case the output already exists

  • the exception in case of failure
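The submit-then-handle pattern of the process_workspace_* methods can be sketched with the standard library alone. This is a simplified illustration under several assumptions: a ThreadPoolExecutor stands in for the forking ProcessPoolExecutor, the job function process_page is hypothetical, an existing output is signalled by FileExistsError purely for the sake of the example, and neither the per-page time limit nor the OCRD_* policies are modelled.

```python
from concurrent.futures import ThreadPoolExecutor

def process_page(page_id):
    """Hypothetical per-page job standing in for process_page_file()."""
    if page_id == "PHYS_0002":
        raise FileExistsError("output already exists")
    if page_id == "PHYS_0003":
        raise ValueError("cannot process page")
    return page_id

def submit_tasks(executor, pages):
    # Map each submitted future to its page ID and input files
    return {executor.submit(process_page, page_id): (page_id, files)
            for page_id, files in pages.items()}

def handle_page_task(task):
    # True on success, False if the output exists, else the exception itself
    try:
        task.result()
        return True
    except FileExistsError:
        return False
    except Exception as err:
        return err

def handle_tasks(tasks):
    nr_succeeded, nr_failed, nr_errors = 0, 0, {}
    for task in tasks:
        outcome = handle_page_task(task)
        if outcome is True:
            nr_succeeded += 1
        elif outcome is not False:
            nr_failed += 1
            name = type(outcome).__name__
            nr_errors[name] = nr_errors.get(name, 0) + 1
    # total covers succeeded, failed and already existing pages
    return nr_succeeded, nr_failed, nr_errors, len(tasks)

pages = {
    "PHYS_0001": ["FILE_0001"],
    "PHYS_0002": ["FILE_0002"],
    "PHYS_0003": ["FILE_0003"],
    "PHYS_0004": ["FILE_0004"],
}
with ThreadPoolExecutor(max_workers=2) as executor:
    tasks = submit_tasks(executor, pages)
    summary = handle_tasks(tasks)
print(summary)
```

Keeping the future-to-page mapping in a dict is what lets the handling step report which page an exception belongs to.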

process_page_file(*input_files: OcrdFile | ClientSideOcrdFile | None) → None[source]

Process the given input_files of the workspace, representing one physical page (passed as one opened OcrdFile per input fileGrp) under the given parameter, and make sure the results get added accordingly.

(This uses process_page_pcgts(), but should be overridden by subclasses to handle cases like multiple output fileGrps, non-PAGE input etc.)

process_page_pcgts(*input_pcgts: OcrdPage | None, page_id: str | None = None) → OcrdPageResult[source]

Process the given input_pcgts of the workspace, representing one physical page (passed as one parsed OcrdPage per input fileGrp) under the given parameter, and return the resulting OcrdPageResult.

Optionally, add to the images attribute of the resulting OcrdPageResult instances of OcrdPageResultImage, which have required fields for pil (PIL.Image image data), file_id_suffix (used for generating IDs of the saved image) and alternative_image (reference of the ocrd_models.ocrd_page.AlternativeImageType for setting the filename of the saved image).

(This contains the main functionality and must be overridden by subclasses, unless it does not get called by some overridden process_page_file().)
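The shape of a process_page_pcgts() override that also returns a derived image might look like the sketch below. To keep it self-contained, PageResultSketch and PageResultImageSketch are stand-ins for OcrdPageResult and OcrdPageResultImage (the pcgts field name is an assumption), the image data and alternative image are placeholders, and the ".IMG-BIN" suffix is only an example value.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class PageResultImageSketch:
    """Stand-in for OcrdPageResultImage with its three required fields."""
    pil: Any                # would be a PIL.Image in practice
    file_id_suffix: str     # used for generating the ID of the saved image
    alternative_image: Any  # would be an AlternativeImageType in practice

@dataclass
class PageResultSketch:
    """Stand-in for OcrdPageResult."""
    pcgts: Any
    images: List[PageResultImageSketch] = field(default_factory=list)

def process_page_pcgts(*input_pcgts, page_id: Optional[str] = None):
    # One parsed page per input fileGrp; this sketch only uses the first
    pcgts = input_pcgts[0]
    result = PageResultSketch(pcgts)
    derived = b"derived image data"    # placeholder for real image data
    alt = {"comments": "binarized"}    # placeholder for the image reference
    result.images.append(PageResultImageSketch(derived, ".IMG-BIN", alt))
    return result

result = process_page_pcgts({"pcGtsId": "example"}, page_id="PHYS_0001")
print(len(result.images), result.images[0].file_id_suffix)
```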

add_metadata(pcgts: OcrdPage) → None[source]

Add PAGE-XML MetadataItemType MetadataItem describing the processing step and runtime parameters to OcrdPage pcgts.

resolve_resource(val)[source]

Resolve a resource name to an absolute file path with the algorithm in spec.

Parameters:

val (string) – resource value to resolve

show_resource(val)[source]

Resolve a resource name to a file path with the algorithm in spec, then print its contents to stdout.

Parameters:

val (string) – resource value to show

list_all_resources()[source]

List all resources found in the filesystem and matching content-type by filename suffix.

property module

The top-level module this processor belongs to.

property moduledir

The filesystem path of the module directory.

property input_files

List the input files (for single-valued input_file_grp).

For each physical page:

  • If there is a single PAGE-XML for the page, take it (and forget about all other files for that page)

  • Else if there is a single image file, take it (and forget about all other files for that page)

  • Otherwise raise an error (complaining that only PAGE-XML warrants having multiple images for a single page)

See algorithm

Returns:

A list of ocrd_models.ocrd_file.OcrdFile objects.
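The per-page selection rule listed above can be sketched on plain tuples. The helper select_input_file is hypothetical and represents each file as a (file ID, MIME type) pair instead of an OcrdFile; image files are recognised here simply by an "image/" MIME type prefix.

```python
PAGE_MIMETYPE = "application/vnd.prima.page+xml"

def select_input_file(page_id, files):
    """Pick one file for a page; files is a list of (file_id, mimetype)."""
    page_files = [f for f in files if f[1] == PAGE_MIMETYPE]
    image_files = [f for f in files if f[1].startswith("image/")]
    if len(page_files) == 1:
        # a single PAGE-XML wins, all other files for the page are ignored
        return page_files[0]
    if len(image_files) == 1:
        # otherwise fall back to a single image file
        return image_files[0]
    raise ValueError(f"no unique PAGE-XML or image file for page {page_id}")

with_page = [("IMG_1", "image/tiff"), ("IMG_1b", "image/png"),
             ("PAGE_1", PAGE_MIMETYPE)]
only_image = [("IMG_2", "image/tiff")]
two_images = [("IMG_3", "image/tiff"), ("IMG_3b", "image/png")]

print(select_input_file("PHYS_0001", with_page))
print(select_input_file("PHYS_0002", only_image))
```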

zip_input_files(require_first=True, mimetype=None, on_error='skip')[source]

List tuples of input files (for multi-valued input_file_grp).

Processors that expect/need multiple input file groups cannot use input_files. They must align (zip) input files across pages. This includes the case where not all pages are equally present in all file groups. It also requires making a consistent selection if there are multiple files per page.

Following the OCR-D functional model, this function tries to find a single PAGE file per page, or fall back to a single image file per page. In either case, multiple matches per page are an error (see error handling below). This default behaviour can be changed by using a fixed MIME type filter via mimetype. But still, multiple matching files per page are an error.

Single-page multiple-file errors are handled according to on_error:

  • if skip, then the page for the respective fileGrp will be silently skipped (as if there was no match at all)

  • if first, then the first matching file for the page will be silently selected (as if the first was the only match)

  • if last, then the last matching file for the page will be silently selected (as if the last was the only match)

  • if abort, then an exception will be raised.

Multiple matches for PAGE-XML will always raise an exception.

Keyword Arguments:
  • require_first (boolean) – If true, then skip a page entirely whenever it is not available in the first input fileGrp.

  • on_error (string) – How to handle multiple file matches per page.

  • mimetype (string) – If not None, filter by the specified MIME type (literal or regex prefixed by //). Otherwise prefer PAGE or image.

Returns:

A list of ocrd_models.ocrd_file.OcrdFile tuples.
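The alignment across file groups, together with on_error and require_first, can be sketched on plain dicts. The helper zip_pages is hypothetical: each fileGrp is given as a dict from page ID to the list of already matching file IDs, MIME type filtering is left out, and the special handling of multiple PAGE-XML matches is not modelled.

```python
def zip_pages(file_grps, require_first=True, on_error="skip"):
    """Align files per page across fileGrps (sketch only).

    file_grps: list of dicts (one per input fileGrp, in order),
    each mapping page_id to the list of matching file IDs.
    """
    page_ids = []
    for grp in file_grps:
        for page_id in grp:
            if page_id not in page_ids:
                page_ids.append(page_id)
    tuples = []
    for page_id in page_ids:
        row = []
        for grp in file_grps:
            matches = grp.get(page_id, [])
            if len(matches) > 1:
                if on_error == "skip":
                    matches = []            # as if there was no match at all
                elif on_error == "first":
                    matches = matches[:1]
                elif on_error == "last":
                    matches = matches[-1:]
                else:                       # "abort"
                    raise ValueError(f"multiple matches for page {page_id}")
            row.append(matches[0] if matches else None)
        if require_first and row[0] is None:
            continue                        # page missing in the first fileGrp
        tuples.append(tuple(row))
    return tuples

grp_a = {"P1": ["A1"], "P2": ["A2", "A2b"], "P3": ["A3"]}
grp_b = {"P1": ["B1"], "P3": ["B3"], "P4": ["B4"]}

print(zip_pages([grp_a, grp_b]))
print(zip_pages([grp_a, grp_b], require_first=False, on_error="last"))
```

Missing files are represented by None so that every tuple keeps one slot per input fileGrp.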

ocrd.processor.base.generate_processor_help(ocrd_tool, processor_instance=None, subcommand=None)[source]

Generate a string describing the full CLI of this processor including params.

Parameters:
  • ocrd_tool (dict) – this processor’s tools section of the module’s ocrd-tool.json

  • processor_instance – the processor implementation (for adding any module/class/function docstrings)

ocrd.processor.base.run_cli(executable, mets_url=None, resolver=None, workspace=None, page_id=None, overwrite=None, debug=None, log_level=None, log_filename=None, input_file_grp=None, output_file_grp=None, parameter=None, working_dir=None, mets_server_url=None)[source]

Open a workspace and run a processor on the command line.

If workspace is not None, reuse that. Otherwise, instantiate a Workspace for mets_url (and working_dir) by using ocrd.Resolver.workspace_from_url() (i.e. open or clone local workspace).

Run the processor CLI executable on the workspace, passing:

  • the workspace,

  • page_id

  • input_file_grp

  • output_file_grp

  • parameter (after applying any parameter_override settings)

(Will create output files and update the METS in the filesystem.)

Parameters:

executable (string) – Executable name of the module processor.
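To give an idea of the kind of command line involved, the sketch below assembles an argument list from the same pieces of information, using long options of the OCR-D command-line interface. The helper build_cli_args and the executable name are hypothetical; how run_cli itself builds and launches the command may differ.

```python
def build_cli_args(executable, mets_url=None, working_dir=None, page_id=None,
                   input_file_grp=None, output_file_grp=None, parameter=None,
                   overwrite=False):
    """Assemble an argv list for a processor executable (sketch only)."""
    args = [executable]
    options = [
        ("--mets", mets_url),
        ("--working-dir", working_dir),
        ("--page-id", page_id),
        ("--input-file-grp", input_file_grp),
        ("--output-file-grp", output_file_grp),
        ("--parameter", parameter),
    ]
    for flag, value in options:
        if value:
            # only pass options that were actually given
            args += [flag, value]
    if overwrite:
        args.append("--overwrite")
    return args

argv = build_cli_args(
    "ocrd-example-tool",  # hypothetical executable name
    mets_url="mets.xml",
    input_file_grp="OCR-D-IMG",
    output_file_grp="OCR-D-OUT",
    parameter='{"dpi": 300}',
)
print(argv)
```

Such a list could be handed to subprocess.run(); it is built as a list rather than a string so that the JSON parameter needs no shell quoting.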

ocrd.processor.base.run_processor(processorClass, mets_url=None, resolver=None, workspace=None, page_id=None, log_level=None, input_file_grp=None, output_file_grp=None, parameter=None, working_dir=None, mets_server_url=None, instance_caching=False)[source]

Instantiate a Pythonic processor, open a workspace, run the processor and save the workspace.

If workspace is not None, reuse that. Otherwise, instantiate a Workspace for mets_url (and working_dir) by using ocrd.Resolver.workspace_from_url() (i.e. open or clone local workspace).

Instantiate a Python object for processorClass, passing:

  • the workspace,

  • page_id

  • input_file_grp

  • output_file_grp

  • parameter (after applying any parameter_override settings)

Warning: Avoid setting the instance_caching flag to True. It may have unexpected side effects. This flag is used for an experimental feature we would like to adopt in future.

Run the processor on the workspace (creating output files in the filesystem).

Finally, write back the workspace (updating the METS in the filesystem).

Parameters:

processorClass (object) – Python class of the module processor.