ocrd.processor.base module¶
Processor base class and helper functions.
- class ocrd.processor.base.Processor(workspace: Workspace | None, ocrd_tool=None, parameter=None, input_file_grp=None, output_file_grp=None, page_id=None, download_files=True, version=None)[source]¶
Bases: object
A processor is a tool that implements the uniform OCR-D command-line interface for run-time data processing.
That is, it executes a single workflow step, or a combination of workflow steps, on the workspace (represented by local METS). It reads input files for all or selected physical pages of the input fileGrp(s), computes additional annotation, and writes output files for them into the output fileGrp(s). It may take a number of optional or mandatory parameters.
Instantiate, but do not set up (neither for processing nor other usage). If given, do parse and validate parameter.
- Parameters:
workspace (Workspace) – The workspace to process. If not None, then chdir to that directory. Deprecated since version 3.0: Should be None here, but then needs to be set before processing.
- Keyword Arguments:
parameter (string) – JSON of the runtime choices for ocrd-tool parameters. Can be None even for processing, but then needs to be set before running.
input_file_grp (string) – comma-separated list of METS fileGrp used for input. Deprecated since version 3.0: Should be None here, but then needs to be set before processing.
output_file_grp (string) – comma-separated list of METS fileGrp used for output. Deprecated since version 3.0: Should be None here, but then needs to be set before processing.
page_id (string) – comma-separated list of METS physical page IDs to process (or empty for all pages). Deprecated since version 3.0: Should be None here, but then needs to be set before processing.
download_files (boolean) – Whether input files will be downloaded prior to processing, defaults to ocrd_utils.config.OCRD_DOWNLOAD_INPUT which is True by default
- max_instances: int = -1¶
maximum number of cached instances (ignored if negative), to be applied on top of OCRD_MAX_PROCESSOR_CACHE (i.e. whatever is smaller).
(Override this if you know how many instances fit into memory - GPU / CPU RAM - at once.)
- max_workers: int = -1¶
maximum number of processor forks for page-parallel processing (ignored if negative), to be applied on top of OCRD_MAX_PARALLEL_PAGES (i.e. whatever is smaller).
(Override this if you know how many pages fit into processing units - GPU shaders / CPU cores - at once, or if your class already creates threads prior to forking, e.g. during setup.)
- max_page_seconds: int = -1¶
maximum number of seconds that may be spent processing a single page (ignored if negative), to be applied on top of OCRD_PROCESSING_PAGE_TIMEOUT (i.e. whatever is smaller).
(Override this if you know how costly this processor may be, irrespective of image size or complexity of the page.)
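The "whatever is smaller" rule shared by max_instances, max_workers and max_page_seconds can be sketched as follows. This is a minimal illustration, not the library's code: effective_limit is a hypothetical helper, and it only models the two cases stated above (a negative class-level value is ignored, otherwise the smaller value applies). Any special meaning of particular configuration values is not covered.

```python
def effective_limit(class_limit: int, config_limit: int) -> int:
    """Combine a class-level limit with the configured (environment) limit.

    Hypothetical helper for illustration only.
    A negative class-level value means "not set" and is ignored;
    otherwise the smaller of the two values wins.
    """
    if class_limit < 0:
        return config_limit
    return min(class_limit, config_limit)
```

For instance, a subclass setting max_workers = 2 would be capped at 2 workers even if the configuration allowed 4, while a configuration of 1 would still win over a class-level value of 2.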
- property metadata_filename: str¶
Relative location of the ocrd-tool.json file inside the package.
Used by metadata_location.
(Override if ocrd-tool.json is not in the root of the module, e.g. namespace/ocrd-tool.json or data/ocrd-tool.json).
- property metadata_location: Path¶
Absolute path of the ocrd-tool.json file as distributed with the package.
Used by metadata_rawdict.
(Override if ocrd-tool.json is not distributed with the Python package.)
- property metadata_rawdict: dict¶
Raw (unvalidated, unexpanded) ocrd-tool.json dict contents of the package.
Used by metadata.
(Override if ocrd-tool.json is not in a file.)
- property metadata: dict¶
The ocrd-tool.json dict contents of the package, according to the OCR-D spec for processor tools.
After deserialisation, it also gets validated against the schema with all defaults expanded.
Used by ocrd_tool and version.
(Override if you want to provide metadata programmatically instead of a JSON file.)
- property ocrd_tool: dict¶
The ocrd-tool.json dict contents of this processor tool. Usually the executable key of the tools part of metadata.
(Override if you do not want to use the metadata lookup mechanism.)
- property executable: str¶
The executable name of this processor tool. Taken from the runtime filename.
Used by ocrd_tool for lookup in metadata.
(Override if your entry-point name deviates from the executable name, or the processor gets instantiated from another runtime.)
- property version: str¶
The program version of the package. Usually the version part of metadata.
(Override if you do not want to use the metadata lookup mechanism.)
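The chain of properties described above (metadata_filename, metadata_location, metadata_rawdict, metadata, then ocrd_tool and version) can be pictured with a toy model. The sketch below is an assumption-laden illustration, not the real Processor class: ToyProcessor, package_dir, the tool name ocrd-toy-tool and demo() are all hypothetical, and schema validation and default expansion are left out.

```python
import json
import tempfile
from pathlib import Path


class ToyProcessor:
    """Toy model of the metadata lookup chain (not the real Processor)."""

    def __init__(self, package_dir: Path, executable: str) -> None:
        # Hypothetical inputs: the real class locates the package itself
        # and takes the executable name from the runtime filename.
        self.package_dir = package_dir
        self._executable = executable

    @property
    def executable(self) -> str:
        return self._executable

    @property
    def metadata_filename(self) -> str:
        # Relative location of the JSON file inside the package.
        return "ocrd-tool.json"

    @property
    def metadata_location(self) -> Path:
        # Absolute path, derived from metadata_filename.
        return (self.package_dir / self.metadata_filename).resolve()

    @property
    def metadata_rawdict(self) -> dict:
        # Raw dict, derived from metadata_location.
        return json.loads(self.metadata_location.read_text(encoding="utf-8"))

    @property
    def metadata(self) -> dict:
        # The real property also validates and expands defaults (omitted here).
        return self.metadata_rawdict

    @property
    def ocrd_tool(self) -> dict:
        # The entry for this executable in the tools part of metadata.
        return self.metadata["tools"][self.executable]

    @property
    def version(self) -> str:
        # The version part of metadata.
        return self.metadata["version"]


def demo() -> tuple:
    """Write a tiny ocrd-tool.json to a temp dir and look it up."""
    content = {
        "version": "0.1.0",
        "tools": {"ocrd-toy-tool": {"description": "toy tool"}},
    }
    with tempfile.TemporaryDirectory() as tmp:
        package_dir = Path(tmp)
        (package_dir / "ocrd-tool.json").write_text(
            json.dumps(content), encoding="utf-8")
        proc = ToyProcessor(package_dir, "ocrd-toy-tool")
        return proc.version, proc.ocrd_tool["description"]
```

Because each property only depends on the one before it, overriding a single link (for example metadata_filename for a file in a subdirectory, or metadata for programmatically built metadata) leaves the rest of the chain untouched.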
- property parameter: dict | None¶
the runtime parameter dict to be used by this processor
- show_help(subcommand=None)[source]¶
Print a usage description including the standard CLI and all of this processor’s ocrd-tool parameters and docstrings.
- verify()[source]¶
Verify that input_file_grp and output_file_grp fulfill the processor’s requirements.
- list_resources()[source]¶
Find all installed resource files in the search paths and print their path names.
- setup() None[source]¶
Prepare the processor for actual data processing, prior to changing to the workspace directory but after parsing parameters.
(Override this to load models into memory etc.)
- shutdown() None[source]¶
Bring down the processor after data processing, after changing back from the workspace directory but before exiting (or setting up with different parameters).
(Override this to unload models from memory etc.)
- process() None[source]¶
Process all files of the workspace from the given input_file_grp to the given output_file_grp for the given page_id (or all pages) under the given parameter.
(This contains the main functionality and needs to be overridden by subclasses.)
- process_workspace(workspace: Workspace) None[source]¶
Process all files of the given workspace, from the given input_file_grp to the given output_file_grp for the given page_id (or all pages) under the given parameter.
Delegates to process_workspace_submit_tasks() and process_workspace_handle_tasks().
(This will iterate over pages and files, calling process_page_file() and handling exceptions. It should be overridden by subclasses to handle cases like post-processing or computation across pages.)
- process_workspace_submit_tasks(executor: DummyExecutor | ProcessPoolExecutor, max_seconds: int) Dict[DummyFuture | Future, Tuple[str, List[OcrdFile | ClientSideOcrdFile | None]]][source]¶
Look up all input files of the given workspace from the given input_file_grp for the given page_id (or all pages), and schedule calling process_page_file() on them for each page via executor (enforcing a per-page time limit of max_seconds).
When running with OCRD_MAX_PARALLEL_PAGES>1 and the workspace via METS Server, the executor will fork this many parallel worker subprocesses, each processing one page at a time. (Interprocess communication is done via task and result queues.)
Otherwise, tasks are run sequentially in the current process.
Delegates to zip_input_files() to get the input files for each page, and then calls process_workspace_submit_page_task().
Returns a dict mapping the per-page tasks (i.e. futures submitted to the executor) to their corresponding pageId and input files.
- process_workspace_submit_page_task(executor: DummyExecutor | ProcessPoolExecutor, max_seconds: int, input_file_tuple: List[OcrdFile | ClientSideOcrdFile | None]) Tuple[DummyFuture | Future, str, List[OcrdFile | ClientSideOcrdFile | None]][source]¶
Ensure all input files for a single page are downloaded to the workspace, then schedule process_page_file() to be run on them via executor (enforcing a per-page time limit of max_seconds).
Delegates to process_page_file() (wrapped in _page_worker() to share the processor instance across forked processes).
Returns a tuple of: the scheduled future object, the corresponding pageId, and the corresponding input files.
- process_workspace_handle_tasks(tasks: Dict[DummyFuture | Future, Tuple[str, List[OcrdFile | ClientSideOcrdFile | None]]]) Tuple[int, int, Dict[str, int], int][source]¶
Look up scheduled per-page futures one by one, handle errors (exceptions) and gather results.
Enforces policies configured by the following environment variables: OCRD_EXISTING_OUTPUT (abort/skip/overwrite), OCRD_MISSING_OUTPUT (abort/skip/fallback-copy), and OCRD_MAX_MISSING_OUTPUTS (abort after all).
Returns a tuple of: the number of successfully processed pages, the number of failed (i.e. skipped or copied) pages, a dict of the type and corresponding number of exceptions seen, and the number of total requested pages (i.e. success+fail+existing).
Delegates to process_workspace_handle_page_task() for each page.
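The submit-then-handle pattern described for process_workspace_submit_tasks() and process_workspace_handle_tasks() can be sketched with the standard concurrent.futures module. This is a simplified model under stated assumptions, not the library's implementation: submit_tasks and handle_tasks are hypothetical names, pages are plain dicts of page ID to file names, and per-page timeouts, existing outputs and the environment-variable policies are not modelled. The executor is passed in by the caller; a ThreadPoolExecutor is enough to try the sketch out, in place of the process-based executor named in the signatures above.

```python
from concurrent.futures import Executor, Future
from typing import Callable, Dict, List, Tuple

Tasks = Dict[Future, Tuple[str, List[str]]]


def submit_tasks(executor: Executor,
                 pages: Dict[str, List[str]],
                 process_page: Callable[..., None]) -> Tasks:
    """Schedule one task per page; map each future to (page ID, input files)."""
    tasks: Tasks = {}
    for page_id, input_files in pages.items():
        future = executor.submit(process_page, *input_files)
        tasks[future] = (page_id, input_files)
    return tasks


def handle_tasks(tasks: Tasks) -> Tuple[int, int, Dict[str, int], int]:
    """Await every future and gather simple statistics.

    Returns (succeeded, failed, exception counts by type name, total).
    """
    succeeded = 0
    failed = 0
    errors: Dict[str, int] = {}
    for future in tasks:
        try:
            future.result()  # re-raises any exception from the page task
            succeeded += 1
        except Exception as exc:
            failed += 1
            name = type(exc).__name__
            errors[name] = errors.get(name, 0) + 1
    return succeeded, failed, errors, len(tasks)
```

Keeping the page ID and input files next to each future is what lets the handling step report, skip or copy on a per-page basis once a task fails.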
- process_workspace_handle_page_task(page_id: str, input_files: List[OcrdFile | ClientSideOcrdFile | None], task: DummyFuture | Future) bool | Exception[source]¶
Await a single page result and handle errors (exceptions), enforcing policies configured by the following environment variables: OCRD_EXISTING_OUTPUT (abort/skip/overwrite), OCRD_MISSING_OUTPUT (abort/skip/fallback-copy), and OCRD_MAX_MISSING_OUTPUTS (abort after all).
Returns True in case of success, False in case the output already exists, or the exception in case of failure.
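The three-way return value (success, existing output, failure) can be illustrated with a small sketch. Everything here is an assumption made for illustration: handle_page_result, run_page and copy_input are hypothetical names, the policy strings are taken from the labels listed above rather than from the actual configuration values, and an already existing output is assumed to surface as FileExistsError. OCRD_MAX_MISSING_OUTPUTS is not modelled.

```python
from typing import Callable, Optional, Union


def handle_page_result(run_page: Callable[[], None],
                       existing_output: str = "skip",
                       missing_output: str = "skip",
                       copy_input: Optional[Callable[[], None]] = None
                       ) -> Union[bool, Exception]:
    """Run one page task and map its outcome to True / False / exception."""
    try:
        run_page()
        return True  # success
    except FileExistsError:
        # Assumption: an already existing output shows up as FileExistsError.
        if existing_output == "abort":
            raise
        return False  # output already exists, page is left as it is
    except Exception as exc:
        if missing_output == "abort":
            raise
        if missing_output == "fallback-copy" and copy_input is not None:
            copy_input()  # pass the input through as a stand-in output
        return exc  # failure is reported to the caller, not raised
```

Returning the exception instead of raising it lets the caller keep counting failures by type while the remaining pages are still processed.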
- process_page_file(*input_files: OcrdFile | ClientSideOcrdFile | None) None[source]¶
Process the given
input_files of the workspace, representing one physical page (passed as one opened OcrdFile per input fileGrp) under the given parameter, and make sure the results get added accordingly.
(This uses process_page_pcgts(), but should be overridden by subclasses to handle cases like multiple output fileGrps, non-PAGE input etc.)
- process_page_pcgts(*input_pcgts: OcrdPage | None, page_id: str | None = None) OcrdPageResult[source]¶
Process the given
input_pcgts of the workspace, representing one physical page (passed as one parsed OcrdPage per input fileGrp) under the given parameter, and return the resulting OcrdPageResult.
Optionally, add to the images attribute of the resulting OcrdPageResult instances of OcrdPageResultImage, which have required fields for pil (PIL.Image image data), file_id_suffix (used for generating IDs of the saved image) and alternative_image (reference of the ocrd_models.ocrd_page.AlternativeImageType for setting the filename of the saved image).
(This contains the main functionality and must be overridden by subclasses, unless it does not get called by some overridden process_page_file().)
- add_metadata(pcgts: OcrdPage) None[source]¶
Add PAGE-XML MetadataItemType MetadataItem describing the processing step and runtime parameters to OcrdPage pcgts.
- resolve_resource(val)[source]¶
Resolve a resource name to an absolute file path with the algorithm in spec
- Parameters:
val (string) – resource value to resolve
- show_resource(val)[source]¶
Resolve a resource name to a file path with the algorithm in spec, then print its contents to stdout.
- Parameters:
val (string) – resource value to show
- list_all_resources()[source]¶
List all resources found in the filesystem and matching content-type by filename suffix
- property module¶
The top-level module this processor belongs to.
- property moduledir¶
The filesystem path of the module directory.
- property input_files¶
List the input files (for single-valued input_file_grp).
For each physical page:
If there is a single PAGE-XML for the page, take it (and forget about all other files for that page)
Else if there is a single image file, take it (and forget about all other files for that page)
Otherwise raise an error (complaining that only PAGE-XML warrants having multiple images for a single page)
See algorithm
- Returns:
A list of ocrd_models.ocrd_file.OcrdFile objects.
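The per-page selection rule listed above can be sketched on plain data. This is an illustration only: select_page_file is a hypothetical helper operating on (file ID, MIME type) pairs rather than on OcrdFile objects, and it treats several PAGE-XML files for one page as an error, which is one reading of the rule.

```python
from typing import List, Tuple

# MIME type used for PAGE-XML files in OCR-D workspaces
PAGE_MIMETYPE = "application/vnd.prima.page+xml"


def select_page_file(candidates: List[Tuple[str, str]]) -> str:
    """Pick the one file ID representing a physical page.

    candidates: (file ID, MIME type) pairs found for that page.
    """
    page_xml = [fid for fid, mime in candidates if mime == PAGE_MIMETYPE]
    images = [fid for fid, mime in candidates if mime.startswith("image/")]
    if len(page_xml) == 1:
        # A single PAGE-XML wins; all other files for the page are ignored.
        return page_xml[0]
    if not page_xml and len(images) == 1:
        # No PAGE-XML at all: fall back to a single image.
        return images[0]
    raise ValueError(
        "ambiguous input for page: %d PAGE-XML and %d image files"
        % (len(page_xml), len(images)))
```

A page with one PAGE-XML and several derived images therefore resolves to the PAGE-XML, whereas a page with only two images is rejected as ambiguous.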
- zip_input_files(require_first=True, mimetype=None, on_error='skip')[source]¶
List tuples of input files (for multi-valued input_file_grp).
Processors that expect/need multiple input file groups cannot use input_files. They must align (zip) input files across pages. This includes the case where not all pages are equally present in all file groups. It also requires making a consistent selection if there are multiple files per page.
Following the OCR-D functional model, this function tries to find a single PAGE file per page, or fall back to a single image file per page. In either case, multiple matches per page are an error (see error handling below). This default behaviour can be changed by using a fixed MIME type filter via mimetype. But still, multiple matching files per page are an error.
Single-page multiple-file errors are handled according to on_error:
if skip, then the page for the respective fileGrp will be silently skipped (as if there was no match at all)
if first, then the first matching file for the page will be silently selected (as if the first was the only match)
if last, then the last matching file for the page will be silently selected (as if the last was the only match)
if abort, then an exception will be raised.
Multiple matches for PAGE-XML will always raise an exception.
- Keyword Arguments:
require_first (boolean) – If true, then skip a page entirely whenever it is not available in the first input fileGrp.
on_error (string) – How to handle multiple file matches per page.
mimetype (string) – If not None, filter by the specified MIME type (literal or regex prefixed by //). Otherwise prefer PAGE or image.
- Returns:
A list of ocrd_models.ocrd_file.OcrdFile tuples.
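The alignment across file groups, together with the on_error and require_first options, can be sketched on plain dictionaries. This is a simplified model, not the method itself: zip_pages is a hypothetical function, file groups are given as dicts mapping page IDs to lists of already filtered file IDs, and the MIME type filtering and the special handling of multiple PAGE-XML matches are left out.

```python
from typing import Dict, List, Optional, Tuple


def zip_pages(groups: Dict[str, Dict[str, List[str]]],
              require_first: bool = True,
              on_error: str = "skip") -> List[Tuple[Optional[str], ...]]:
    """Align files page by page across file groups.

    groups: fileGrp name -> page ID -> matching file IDs (in order).
    Returns one tuple per page, with None where a group has no file.
    """
    grp_names = list(groups)
    # Collect page IDs in order of first appearance across all groups.
    page_ids: List[str] = []
    for grp in grp_names:
        for page_id in groups[grp]:
            if page_id not in page_ids:
                page_ids.append(page_id)
    result: List[Tuple[Optional[str], ...]] = []
    for page_id in page_ids:
        row: List[Optional[str]] = []
        for grp in grp_names:
            matches = groups[grp].get(page_id, [])
            if len(matches) <= 1:
                row.append(matches[0] if matches else None)
            elif on_error == "skip":
                row.append(None)  # as if there was no match at all
            elif on_error == "first":
                row.append(matches[0])
            elif on_error == "last":
                row.append(matches[-1])
            elif on_error == "abort":
                raise ValueError(
                    "multiple files for page %s in %s" % (page_id, grp))
            else:
                raise ValueError("unknown on_error value: %s" % on_error)
        if require_first and row[0] is None:
            continue  # page not available in the first file group
        result.append(tuple(row))
    return result
```

With two groups where page P2 has two candidates in the first group and P3 exists only in the second, the default settings keep only the fully unambiguous pages that are present in the first group, while on_error='first' with require_first=False yields one tuple for every page, padded with None.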
- ocrd.processor.base.generate_processor_help(ocrd_tool, processor_instance=None, subcommand=None)[source]¶
Generate a string describing the full CLI of this processor including params.
- Parameters:
ocrd_tool (dict) – this processor’s tools section of the module’s ocrd-tool.json
processor_instance – the processor implementation (for adding any module/class/function docstrings)
- ocrd.processor.base.run_cli(executable, mets_url=None, resolver=None, workspace=None, page_id=None, overwrite=None, debug=None, log_level=None, log_filename=None, input_file_grp=None, output_file_grp=None, parameter=None, working_dir=None, mets_server_url=None)[source]¶
Open a workspace and run a processor on the command line.
If workspace is not None, reuse that. Otherwise, instantiate a Workspace for mets_url (and working_dir) by using ocrd.Resolver.workspace_from_url() (i.e. open or clone local workspace).
Run the processor CLI executable on the workspace, passing the workspace, page_id, input_file_grp, output_file_grp and parameter (after applying any parameter_override settings).
(Will create output files and update the METS in the filesystem.)
- Parameters:
executable (string) – Executable name of the module processor.
- ocrd.processor.base.run_processor(processorClass, mets_url=None, resolver=None, workspace=None, page_id=None, log_level=None, input_file_grp=None, output_file_grp=None, parameter=None, working_dir=None, mets_server_url=None, instance_caching=False)[source]¶
Instantiate a Pythonic processor, open a workspace, run the processor and save the workspace.
If workspace is not None, reuse that. Otherwise, instantiate a Workspace for mets_url (and working_dir) by using ocrd.Resolver.workspace_from_url() (i.e. open or clone local workspace).
Instantiate a Python object for processorClass, passing the workspace, page_id, input_file_grp, output_file_grp and parameter (after applying any parameter_override settings).
Warning: Avoid setting the instance_caching flag to True. It may have unexpected side effects. This flag is used for an experimental feature we would like to adopt in future.
Run the processor on the workspace (creating output files in the filesystem).
Finally, write back the workspace (updating the METS in the filesystem).
- Parameters:
processorClass (object) – Python class of the module processor.