ocrd.processor.base module
Processor base class and helper functions.
- class ocrd.processor.base.Processor(workspace: Workspace, ocrd_tool=None, parameter=None, input_file_grp=None, output_file_grp=None, page_id=None, resolve_resource=None, show_resource=None, list_resources=False, show_help=False, subcommand=None, show_version=False, dump_json=False, dump_module_dir=False, version=None)
Bases: object
A processor is a tool that implements the uniform OCR-D command-line interface for run-time data processing. That is, it executes a single workflow step, or a combination of workflow steps, on the workspace (represented by local METS). It reads input files for all or requested physical pages of the input fileGrp(s), and writes output files for them into the output fileGrp(s). It may take a number of optional or mandatory parameters.
Instantiate, but do not process. Unless list_resources, show_resource, show_help, show_version, dump_json or dump_module_dir is true, set up for processing (parsing and validating parameters, entering the workspace directory).
- Parameters:
workspace (Workspace) – The workspace to process. Can be None even for processing (esp. on multiple workspaces), but then needs to be set before running.
- Keyword Arguments:
ocrd_tool (string) – JSON of the ocrd-tool description for that processor. Can be None for processing, but needs to be set before running.
parameter (string) – JSON of the runtime choices for ocrd-tool parameters. Can be None even for processing, but then needs to be set before running.
input_file_grp (string) – comma-separated list of METS fileGrps used for input.
output_file_grp (string) – comma-separated list of METS fileGrps used for output.
page_id (string) – comma-separated list of METS physical page IDs to process (or empty for all pages).
resolve_resource (string) – If not None, then instead of processing, resolve given resource by name and print its full path to stdout.
show_resource (string) – If not None, then instead of processing, resolve given resource by name and print its contents to stdout.
list_resources (boolean) – If true, then instead of processing, find all installed resource files in the search paths and print their path names.
show_help (boolean) – If true, then instead of processing, print a usage description including the standard CLI and all of this processor’s ocrd-tool parameters and docstrings.
subcommand (string) – ‘worker’ or ‘server’, only used here for the right --help output
show_version (boolean) – If true, then instead of processing, print information on this processor’s version and OCR-D version. Exit afterwards.
dump_json (boolean) – If true, then instead of processing, print ocrd_tool on stdout.
dump_module_dir (boolean) – If true, then instead of processing, print moduledir on stdout.
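As a rough illustration of the string-valued arguments above, the following sketch decodes a comma-separated fileGrp or page list and a JSON parameter string. The helper names split_list and parse_parameter are hypothetical and not part of ocrd; the actual constructor may do more (for example validating against the ocrd-tool description).

```python
import json


def split_list(value):
    """Split a comma-separated string into a list (hypothetical helper).

    An empty string or None yields an empty list, which the documentation
    above describes as "all pages" in the case of page_id.
    """
    if not value:
        return []
    return value.split(",")


def parse_parameter(parameter):
    """Decode the JSON parameter string into a dict (hypothetical helper).

    None is treated as "no runtime choices given" in this sketch.
    """
    if parameter is None:
        return {}
    return json.loads(parameter)


input_groups = split_list("OCR-D-IMG,OCR-D-SEG")
pages = split_list("")
params = parse_parameter('{"dpi": 300}')
```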
- process() → None
Process the workspace from the given input_file_grp to the given output_file_grp for the given page_id under the given parameter.
(This contains the main functionality and needs to be overridden by subclasses.)
- add_metadata(pcgts)
Add a PAGE-XML MetadataItemType MetadataItem describing the processing step and runtime parameters to the PcGtsType pcgts.
- resolve_resource(val)
Resolve a resource name to an absolute file path with the algorithm in https://ocr-d.de/en/spec/ocrd_tool#file-parameters
- Parameters:
val (string) – resource value to resolve
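The general idea of resolving a name to an absolute path can be sketched as a lookup over an ordered list of directories. This is only an illustration: the function name resolve_in_dirs is hypothetical, the search directories are passed in by the caller, and the authoritative search order is the one defined in the linked specification, which this sketch does not reproduce.

```python
import os


def resolve_in_dirs(name, search_dirs):
    """Return the absolute path of the first existing match, or None.

    Hypothetical helper: absolute names are checked directly, relative
    names are tried against each directory in the given order.
    """
    if os.path.isabs(name):
        return name if os.path.exists(name) else None
    for directory in search_dirs:
        candidate = os.path.join(directory, name)
        if os.path.exists(candidate):
            return os.path.abspath(candidate)
    return None
```

Directories listed earlier take precedence, so a resource in the first directory shadows one with the same name in a later directory.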
- list_all_resources()
List all resources found in the filesystem and matching content-type by filename suffix
- property module
The top-level module this processor belongs to.
- property moduledir
The filesystem path of the module directory.
- property input_files
List the input files (for single-valued input_file_grp).
For each physical page:
If there is a single PAGE-XML for the page, take it (and forget about all other files for that page)
Else if there is a single image file, take it (and forget about all other files for that page)
Otherwise raise an error (complaining that only PAGE-XML warrants having multiple images for a single page)
Algorithm: https://github.com/cisocrgroup/ocrd_cis/pull/57#issuecomment-656336593
- Returns:
A list of ocrd_models.ocrd_file.OcrdFile objects.
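The per-page selection described for input_files can be sketched in plain Python as follows. This is an illustration of the three steps, not the library code: files are modelled as hypothetical (file_id, mimetype) tuples rather than OcrdFile objects, the function name is made up, and treating several PAGE-XML files on one page as an error is an assumption borrowed from the zip_input_files description.

```python
# MIME type commonly used for PAGE-XML files in OCR-D workspaces
PAGE_MIME = "application/vnd.prima.page+xml"


def select_file_for_page(page_id, files):
    """Pick the single input file for one page (hypothetical helper).

    files is a list of (file_id, mimetype) tuples for that page.
    """
    page_files = [f for f in files if f[1] == PAGE_MIME]
    image_files = [f for f in files if f[1].startswith("image/")]
    if len(page_files) == 1:
        # A single PAGE-XML wins; all other files for the page are ignored
        return page_files[0]
    if len(page_files) > 1:
        # Assumption: ambiguous PAGE-XML is an error
        raise ValueError(f"multiple PAGE-XML files for page {page_id}")
    if len(image_files) == 1:
        # No PAGE-XML, but exactly one image: use the image
        return image_files[0]
    raise ValueError(
        f"no unique input file for page {page_id}: only PAGE-XML "
        "warrants having multiple images for a single page"
    )
```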
- zip_input_files(require_first=True, mimetype=None, on_error='skip')
List tuples of input files (for multi-valued input_file_grp).
Processors that expect or need multiple input file groups cannot use input_files. They must align (zip) input files across pages. This includes the case where not all pages are equally present in all file groups. It also requires making a consistent selection if there are multiple files per page.
Following the OCR-D functional model, this function tries to find a single PAGE file per page, or fall back to a single image file per page. In either case, multiple matches per page are an error (see error handling below). This default behaviour can be changed by using a fixed MIME type filter via mimetype. But still, multiple matching files per page are an error.
Single-page multiple-file errors are handled according to on_error:
if skip, then the page for the respective fileGrp will be silently skipped (as if there was no match at all)
if first, then the first matching file for the page will be silently selected (as if the first was the only match)
if last, then the last matching file for the page will be silently selected (as if the last was the only match)
if abort, then an exception will be raised.
Multiple matches for PAGE-XML will always raise an exception.
- Keyword Arguments:
require_first (boolean) – If true, then skip a page entirely whenever it is not available in the first input fileGrp.
mimetype (string) – If not None, filter by the specified MIME type (literal or regex prefixed by //). Otherwise prefer PAGE or image.
- Returns:
A list of ocrd_models.ocrd_file.OcrdFile tuples.
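The alignment across file groups and the on_error handling can be sketched as below. This is a simplified model, not the library implementation: each fileGrp is a hypothetical dict mapping a page ID to the list of already-filtered candidate file IDs, the function name zip_candidates is made up, and representing a missing file by None and dropping pages with no file at all are assumptions of the sketch.

```python
def zip_candidates(groups, require_first=True, on_error="skip"):
    """Align per-page candidates across file groups (hypothetical helper).

    groups is a list of dicts, one per fileGrp, mapping page ID to a list
    of candidate file IDs. Returns one tuple per page, one slot per group.
    """
    if on_error not in ("skip", "first", "last", "abort"):
        raise ValueError(f"unknown on_error value: {on_error}")
    # Collect page IDs in order of first appearance across all groups
    page_ids = []
    for group in groups:
        for page_id in group:
            if page_id not in page_ids:
                page_ids.append(page_id)
    result = []
    for page_id in page_ids:
        row = []
        for group in groups:
            candidates = group.get(page_id, [])
            if len(candidates) > 1:
                if on_error == "abort":
                    raise ValueError(f"multiple files for page {page_id}")
                if on_error == "first":
                    candidates = candidates[:1]
                elif on_error == "last":
                    candidates = candidates[-1:]
                else:
                    # skip: behave as if this group had no match
                    candidates = []
            row.append(candidates[0] if candidates else None)
        if require_first and row[0] is None:
            continue
        if all(entry is None for entry in row):
            continue
        result.append(tuple(row))
    return result
```

With require_first left at its default, a page survives only if the first group contributes a file, which matches the case of a primary input group plus optional secondary ones.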
- ocrd.processor.base.generate_processor_help(ocrd_tool, processor_instance=None, subcommand=None)
Generate a string describing the full CLI of this processor including params.
- Parameters:
ocrd_tool (dict) – this processor’s tools section of the module’s ocrd-tool.json
processor_instance – the processor implementation (for adding any module/class/function docstrings)
- ocrd.processor.base.run_cli(executable, mets_url=None, resolver=None, workspace=None, page_id=None, overwrite=None, log_level=None, log_filename=None, input_file_grp=None, output_file_grp=None, parameter=None, working_dir=None, mets_server_url=None)
Open a workspace and run a processor on the command line.
If workspace is not None, reuse that. Otherwise, instantiate a Workspace for mets_url (and working_dir) by using ocrd.Resolver.workspace_from_url() (i.e. open or clone local workspace).
Run the processor CLI executable on the workspace, passing:
- the workspace,
- page_id
- input_file_grp
- output_file_grp
- parameter (after applying any parameter_override settings)
(Will create output files and update the METS in the filesystem.)
- Parameters:
executable (string) – Executable name of the module processor.
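How such a call might translate into a command line can be sketched as follows. The function build_cli_args is hypothetical, and the long option names are the ones commonly documented for the OCR-D command-line interface; they are an assumption here and should be checked against the installed version. The sketch only builds the argument list and does not start a process.

```python
def build_cli_args(executable, mets_url, working_dir=None, page_id=None,
                   input_file_grp=None, output_file_grp=None, parameter=None,
                   log_level=None, overwrite=False, mets_server_url=None):
    """Assemble an argument list for a processor CLI (hypothetical helper).

    Option names are assumed from the OCR-D CLI conventions; optional
    values are only added when they are set.
    """
    args = [executable, "--mets", mets_url]
    if working_dir:
        args += ["--working-dir", working_dir]
    if log_level:
        args += ["--log-level", log_level]
    if page_id:
        args += ["--page-id", page_id]
    if input_file_grp:
        args += ["--input-file-grp", input_file_grp]
    if output_file_grp:
        args += ["--output-file-grp", output_file_grp]
    if parameter:
        args += ["--parameter", parameter]
    if overwrite:
        args += ["--overwrite"]
    if mets_server_url:
        args += ["--mets-server-url", mets_server_url]
    return args


# Example: the resulting list could be handed to subprocess.run(args, check=True)
args = build_cli_args(
    "ocrd-dummy", "mets.xml",
    input_file_grp="OCR-D-IMG", output_file_grp="OCR-D-COPY",
)
```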
- ocrd.processor.base.run_processor(processorClass, mets_url=None, resolver=None, workspace=None, page_id=None, log_level=None, input_file_grp=None, output_file_grp=None, show_resource=None, list_resources=False, parameter=None, parameter_override=None, working_dir=None, mets_server_url=None, instance_caching=False)
Instantiate a Pythonic processor, open a workspace, run the processor and save the workspace.
If workspace is not None, reuse that. Otherwise, instantiate a Workspace for mets_url (and working_dir) by using ocrd.Resolver.workspace_from_url() (i.e. open or clone local workspace).
Instantiate a Python object for processorClass, passing:
- the workspace,
- page_id
- input_file_grp
- output_file_grp
- parameter (after applying any parameter_override settings)
Warning: Avoid setting the instance_caching flag to True. It may have unexpected side effects. This flag is used for an experimental feature we would like to adopt in future.
Run the processor on the workspace (creating output files in the filesystem).
Finally, write back the workspace (updating the METS in the filesystem).
- Parameters:
processorClass (object) – Python class of the module processor.
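The phrase "after applying any parameter_override settings" can be pictured as merging individual key/value overrides into the parameter dict before the processor is instantiated. The sketch below is an assumption about the shape of that step: apply_overrides is a hypothetical name, overrides are modelled as (key, raw string) pairs, and values are decoded as JSON where possible and kept as plain strings otherwise.

```python
import json


def apply_overrides(parameter, overrides):
    """Return a copy of parameter with overrides applied (hypothetical helper).

    overrides is an iterable of (key, raw_value) string pairs; later pairs
    win over earlier ones and over the original parameter values.
    """
    merged = dict(parameter)
    for key, raw in overrides:
        try:
            # Numbers, booleans, lists etc. are decoded from JSON
            merged[key] = json.loads(raw)
        except json.JSONDecodeError:
            # Anything that is not valid JSON is kept as a plain string
            merged[key] = raw
    return merged


base = {"dpi": 300, "model": "default"}
effective = apply_overrides(base, [("dpi", "600"), ("model", "fraktur")])
```

Returning a copy keeps the caller's original dict unchanged, which makes it easier to reuse the same base parameters for several runs.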