ocrd.processor.base module¶
Processor base class and helper functions.
- class ocrd.processor.base.Processor(workspace: Workspace | None, ocrd_tool=None, parameter=None, input_file_grp=None, output_file_grp=None, page_id=None, download_files=True, version=None)[source]¶
Bases:
object
A processor is a tool that implements the uniform OCR-D command-line interface for run-time data processing.
That is, it executes a single workflow step, or a combination of workflow steps, on the workspace (represented by local METS). It reads input files for all or selected physical pages of the input fileGrp(s), computes additional annotation, and writes output files for them into the output fileGrp(s). It may take a number of optional or mandatory parameters.
Instantiate, but do not set up (neither for processing nor other usage). If given, do parse and validate parameter.
- Parameters:
workspace (Workspace) – The workspace to process. If not None, then chdir to that directory. Deprecated since version 3.0: Should be None here, but then needs to be set before processing.
- Keyword Arguments:
parameter (string) – JSON of the runtime choices for ocrd-tool parameters. Can be None even for processing, but then needs to be set before running.
input_file_grp (string) – comma-separated list of METS fileGrp used for input. Deprecated since version 3.0: Should be None here, but then needs to be set before processing.
output_file_grp (string) – comma-separated list of METS fileGrp used for output. Deprecated since version 3.0: Should be None here, but then needs to be set before processing.
page_id (string) – comma-separated list of METS physical page IDs to process (or empty for all pages). Deprecated since version 3.0: Should be None here, but then needs to be set before processing.
download_files (boolean) – Whether input files will be downloaded prior to processing, defaults to ocrd_utils.config.OCRD_DOWNLOAD_INPUT which is True by default.
- max_instances: int = -1¶
maximum number of cached instances (ignored if negative), to be applied on top of OCRD_MAX_PROCESSOR_CACHE (i.e. whatever is smaller).
(Override this if you know how many instances fit into memory - GPU / CPU RAM - at once.)
- max_workers: int = -1¶
maximum number of processor forks for page-parallel processing (ignored if negative), to be applied on top of OCRD_MAX_PARALLEL_PAGES (i.e. whatever is smaller).
(Override this if you know how many pages fit into processing units - GPU shaders / CPU cores - at once, or if your class already creates threads prior to forking, e.g. during setup.)
- max_page_seconds: int = -1¶
maximum number of seconds that may be spent processing a single page (ignored if negative), to be applied on top of OCRD_PROCESSING_PAGE_TIMEOUT (i.e. whatever is smaller).
(Override this if you know how costly this processor may be, irrespective of image size or complexity of the page.)
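The "whatever is smaller, ignored if negative" rule shared by max_instances, max_workers and max_page_seconds can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name effective_limit and the example values are hypothetical, and special cases (such as a global setting that itself means "unlimited") are deliberately left out.

```python
def effective_limit(class_limit: int, config_limit: int) -> int:
    """Combine a per-class limit with a global configuration limit.

    A negative class_limit means "no class-specific limit", so the
    global value is used unchanged; otherwise the smaller value wins.
    """
    if class_limit < 0:
        return config_limit
    return min(class_limit, config_limit)


# Hypothetical values: the global setting allows 8 parallel pages.
print(effective_limit(-1, 8))  # class did not override anything
print(effective_limit(2, 8))   # class knows only 2 pages fit at once
print(effective_limit(4, 1))   # global setting is stricter than the class
```

Under these assumptions, a subclass can only tighten the global setting, never loosen it.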
- property metadata_filename: str¶
Relative location of the ocrd-tool.json file inside the package.
Used by metadata_location.
(Override if ocrd-tool.json is not in the root of the module, e.g. namespace/ocrd-tool.json or data/ocrd-tool.json.)
- property metadata_location: Path¶
Absolute path of the ocrd-tool.json file as distributed with the package.
Used by metadata_rawdict.
(Override if ocrd-tool.json is not distributed with the Python package.)
- property metadata_rawdict: dict¶
Raw (unvalidated, unexpanded) ocrd-tool.json dict contents of the package.
Used by metadata.
(Override if ocrd-tool.json is not in a file.)
- property metadata: dict¶
The ocrd-tool.json dict contents of the package, according to the OCR-D spec for processor tools.
After deserialisation, it also gets validated against the schema with all defaults expanded.
Used by ocrd_tool and version.
(Override if you want to provide metadata programmatically instead of a JSON file.)
- property ocrd_tool: dict¶
The ocrd-tool.json dict contents of this processor tool. Usually the executable key of the tools part of metadata.
(Override if you do not want to use the metadata lookup mechanism.)
- property executable: str¶
The executable name of this processor tool. Taken from the runtime filename.
Used by ocrd_tool for lookup in metadata.
(Override if your entry-point name deviates from the executable name, or the processor gets instantiated from another runtime.)
- property version: str¶
The program version of the package. Usually the version part of metadata.
(Override if you do not want to use the metadata lookup mechanism.)
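The lookup chain described by these properties (metadata, then ocrd_tool via the executable name, and version) can be illustrated with plain dictionaries. The snippet below is only a sketch: the abbreviated JSON, the tool name ocrd-example-binarize and the helper functions are hypothetical, and schema validation with default expansion is omitted.

```python
import json

# Hypothetical, heavily abbreviated ocrd-tool.json contents.
RAW_METADATA = json.loads("""
{
  "version": "0.1.0",
  "tools": {
    "ocrd-example-binarize": {
      "executable": "ocrd-example-binarize",
      "description": "Example tool entry",
      "parameters": {}
    }
  }
}
""")


def lookup_tool(metadata: dict, executable: str) -> dict:
    """Return the tool section registered under the executable name."""
    return metadata["tools"][executable]


def lookup_version(metadata: dict) -> str:
    """Return the package version recorded in the metadata."""
    return metadata["version"]


tool = lookup_tool(RAW_METADATA, "ocrd-example-binarize")
print(tool["description"])
print(lookup_version(RAW_METADATA))
```

Overriding one of the properties in a subclass would replace the corresponding step of this chain while leaving the later steps untouched.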
- property parameter: dict | None¶
the runtime parameter dict to be used by this processor
- show_help(subcommand=None)[source]¶
Print a usage description including the standard CLI and all of this processor’s ocrd-tool parameters and docstrings.
- verify()[source]¶
Verify that input_file_grp and output_file_grp fulfill the processor’s requirements.
- list_resources()[source]¶
Find all installed resource files in the search paths and print their path names.
- setup() None [source]¶
Prepare the processor for actual data processing, prior to changing to the workspace directory but after parsing parameters.
(Override this to load models into memory etc.)
- shutdown() None [source]¶
Bring down the processor after data processing, after changing back from the workspace directory but before exiting (or setting up with different parameters).
(Override this to unload models from memory etc.)
- process() None [source]¶
Process all files of the workspace from the given input_file_grp to the given output_file_grp for the given page_id (or all pages) under the given parameter.
(This contains the main functionality and needs to be overridden by subclasses.)
- process_workspace(workspace: Workspace) None [source]¶
Process all files of the given workspace, from the given input_file_grp to the given output_file_grp for the given page_id (or all pages) under the given parameter.
Delegates to process_workspace_submit_tasks() and process_workspace_handle_tasks().
(This will iterate over pages and files, calling process_page_file() and handling exceptions. It should be overridden by subclasses to handle cases like post-processing or computation across pages.)
- process_workspace_submit_tasks(executor: DummyExecutor | ProcessPoolExecutor, max_seconds: int) Dict[DummyFuture | Future, Tuple[str, List[OcrdFile | ClientSideOcrdFile | None]]] [source]¶
Look up all input files of the given workspace from the given input_file_grp for the given page_id (or all pages), and schedule calling process_page_file() on them for each page via executor (enforcing a per-page time limit of max_seconds).
When running with OCRD_MAX_PARALLEL_PAGES>1 and the workspace via METS Server, the executor will fork this many parallel worker subprocesses, each processing one page at a time. (Interprocess communication is done via task and result queues.)
Otherwise, tasks are run sequentially in the current process.
Delegates to zip_input_files() to get the input files for each page, and then calls process_workspace_submit_page_task().
Returns a dict mapping the per-page tasks (i.e. futures submitted to the executor) to their corresponding pageId and input files.
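The general pattern of submitting one task per page and keeping a dict from futures to page information can be sketched with the standard library. This is an illustration under stated assumptions, not the library's code: process_page, submit_tasks and the page and file IDs are hypothetical, a ThreadPoolExecutor stands in for the forking executor, and the time limit is applied only when collecting results.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List, Tuple


def process_page(page_id: str, files: List[str]) -> str:
    """Stand-in for the per-page work; returns a hypothetical output ID."""
    return f"OUT_{page_id}"


def submit_tasks(
    executor: ThreadPoolExecutor, pages: Dict[str, List[str]]
) -> Dict[Future, Tuple[str, List[str]]]:
    """Schedule one task per page and remember which page each future belongs to."""
    tasks: Dict[Future, Tuple[str, List[str]]] = {}
    for page_id, files in pages.items():
        future = executor.submit(process_page, page_id, files)
        tasks[future] = (page_id, files)
    return tasks


pages = {"PHYS_0001": ["IMG_0001"], "PHYS_0002": ["IMG_0002"]}
with ThreadPoolExecutor(max_workers=2) as executor:
    tasks = submit_tasks(executor, pages)
    # Wait at most 10 seconds per page when collecting the results.
    results = {
        page_id: future.result(timeout=10)
        for future, (page_id, _files) in tasks.items()
    }
print(results)
```

Keeping the page ID and input files next to each future lets the caller report errors per page without having to inspect the task itself.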
- process_workspace_submit_page_task(executor: DummyExecutor | ProcessPoolExecutor, max_seconds: int, input_file_tuple: List[OcrdFile | ClientSideOcrdFile | None]) Tuple[DummyFuture | Future, str, List[OcrdFile | ClientSideOcrdFile | None]] [source]¶
Ensure all input files for a single page are downloaded to the workspace, then schedule process_page_file() to be run on them via executor (enforcing a per-page time limit of max_seconds).
Delegates to process_page_file() (wrapped in _page_worker() to share the processor instance across forked processes).
Returns a tuple of:
  - the scheduled future object,
  - the corresponding pageId,
  - the corresponding input files.
- process_workspace_handle_tasks(tasks: Dict[DummyFuture | Future, Tuple[str, List[OcrdFile | ClientSideOcrdFile | None]]]) Tuple[int, int, Dict[str, int], int] [source]¶
Look up scheduled per-page futures one by one, handle errors (exceptions) and gather results.
Enforces policies configured by the following environment variables:
  - OCRD_EXISTING_OUTPUT (abort/skip/overwrite)
  - OCRD_MISSING_OUTPUT (abort/skip/fallback-copy)
  - OCRD_MAX_MISSING_OUTPUTS (abort after all).
Returns a tuple of:
  - the number of successfully processed pages
  - the number of failed (i.e. skipped or copied) pages
  - a dict of the type and corresponding number of exceptions seen
  - the number of total requested pages (i.e. success+fail+existing).
Delegates to process_workspace_handle_page_task() for each page.
- process_workspace_handle_page_task(page_id: str, input_files: List[OcrdFile | ClientSideOcrdFile | None], task: DummyFuture | Future) bool | Exception [source]¶
Await a single page result and handle errors (exceptions), enforcing policies configured by the following environment variables:
  - OCRD_EXISTING_OUTPUT (abort/skip/overwrite)
  - OCRD_MISSING_OUTPUT (abort/skip/fallback-copy)
  - OCRD_MAX_MISSING_OUTPUTS (abort after all).
Returns
  - true in case of success
  - false in case the output already exists
  - the exception in case of failure
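How per-page outcomes (true, false, or an exception) might be folded into the four-part summary can be sketched as follows. This is a simplified illustration only: the function names, the done helper and the page IDs are hypothetical, each task is assumed to return a boolean directly, and the environment-variable policies (abort, skip, fallback-copy) are not modelled.

```python
from collections import Counter
from concurrent.futures import Future
from typing import Dict, Tuple, Union


def handle_page_task(task: Future) -> Union[bool, Exception]:
    """Return True on success, False if skipped, or the exception raised."""
    try:
        return bool(task.result())
    except Exception as exc:
        return exc


def handle_tasks(tasks: Dict[Future, str]) -> Tuple[int, int, Dict[str, int], int]:
    """Summarise outcomes as (succeeded, failed, exception counts, total)."""
    n_success = 0
    n_failed = 0
    seen: Counter = Counter()
    for task, _page_id in tasks.items():
        outcome = handle_page_task(task)
        if isinstance(outcome, Exception):
            n_failed += 1
            seen[type(outcome).__name__] += 1
        elif outcome:
            n_success += 1
        # A False outcome (output already exists) only counts towards the total.
    return n_success, n_failed, dict(seen), len(tasks)


def done(value=None, error=None) -> Future:
    """Build an already finished future, for demonstration purposes only."""
    future: Future = Future()
    if error is not None:
        future.set_exception(error)
    else:
        future.set_result(value)
    return future


tasks = {
    done(True): "PHYS_0001",
    done(False): "PHYS_0002",
    done(error=ValueError("bad page")): "PHYS_0003",
    done(True): "PHYS_0004",
}
print(handle_tasks(tasks))
```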
- process_page_file(*input_files: OcrdFile | ClientSideOcrdFile | None) None [source]¶
Process the given input_files of the workspace, representing one physical page (passed as one opened OcrdFile per input fileGrp) under the given parameter, and make sure the results get added accordingly.
(This uses process_page_pcgts(), but should be overridden by subclasses to handle cases like multiple output fileGrps, non-PAGE input etc.)
- process_page_pcgts(*input_pcgts: OcrdPage | None, page_id: str | None = None) OcrdPageResult [source]¶
Process the given input_pcgts of the workspace, representing one physical page (passed as one parsed OcrdPage per input fileGrp) under the given parameter, and return the resulting OcrdPageResult.
Optionally, add to the images attribute of the resulting OcrdPageResult instances of OcrdPageResultImage, which have required fields for pil (PIL.Image image data), file_id_suffix (used for generating IDs of the saved image) and alternative_image (reference of the ocrd_models.ocrd_page.AlternativeImageType for setting the filename of the saved image).
(This contains the main functionality and must be overridden by subclasses, unless it does not get called by some overridden process_page_file().)
- add_metadata(pcgts: OcrdPage) None [source]¶
Add PAGE-XML MetadataItemType MetadataItem describing the processing step and runtime parameters to OcrdPage pcgts.
- resolve_resource(val)[source]¶
Resolve a resource name to an absolute file path with the algorithm in spec
- Parameters:
val (string) – resource value to resolve
- show_resource(val)[source]¶
Resolve a resource name to a file path with the algorithm in spec, then print its contents to stdout.
- Parameters:
val (string) – resource value to show
- list_all_resources()[source]¶
List all resources found in the filesystem and matching content-type by filename suffix
- property module¶
The top-level module this processor belongs to.
- property moduledir¶
The filesystem path of the module directory.
- property input_files¶
List the input files (for single-valued input_file_grp).
For each physical page:
  - If there is a single PAGE-XML for the page, take it (and forget about all other files for that page)
  - Else if there is a single image file, take it (and forget about all other files for that page)
  - Otherwise raise an error (complaining that only PAGE-XML warrants having multiple images for a single page)
See algorithm
- Returns:
A list of ocrd_models.ocrd_file.OcrdFile objects.
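The per-page selection rule above can be sketched with plain data structures. This is a minimal illustration, not the library's implementation: FileEntry, select_input_files and the sample IDs are hypothetical, and treating more than one PAGE-XML per page as an error is an assumption made here.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple

PAGE_MIMETYPE = "application/vnd.prima.page+xml"


class FileEntry(NamedTuple):
    """Hypothetical stand-in for a METS file entry."""
    file_id: str
    page_id: str
    mimetype: str


def select_input_files(files: List[FileEntry]) -> List[FileEntry]:
    """Pick one file per page: a single PAGE-XML, else a single image."""
    by_page: Dict[str, List[FileEntry]] = defaultdict(list)
    for entry in files:
        by_page[entry.page_id].append(entry)
    selected = []
    for page_id, entries in by_page.items():
        page_xml = [e for e in entries if e.mimetype == PAGE_MIMETYPE]
        images = [e for e in entries if e.mimetype.startswith("image/")]
        if len(page_xml) == 1:
            selected.append(page_xml[0])
        elif not page_xml and len(images) == 1:
            selected.append(images[0])
        else:
            # Assumption: anything else (including several PAGE-XML) is an error.
            raise ValueError(f"ambiguous or missing input for page {page_id}")
    return selected


files = [
    FileEntry("PAGE_0001", "PHYS_0001", PAGE_MIMETYPE),
    FileEntry("IMG_0001_A", "PHYS_0001", "image/png"),
    FileEntry("IMG_0001_B", "PHYS_0001", "image/tiff"),
    FileEntry("IMG_0002", "PHYS_0002", "image/png"),
]
print([entry.file_id for entry in select_input_files(files)])
```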
- zip_input_files(require_first=True, mimetype=None, on_error='skip')[source]¶
List tuples of input files (for multi-valued input_file_grp).
Processors that expect/need multiple input file groups cannot use input_files. They must align (zip) input files across pages. This includes the case where not all pages are equally present in all file groups. It also requires making a consistent selection if there are multiple files per page.
Following the OCR-D functional model, this function tries to find a single PAGE file per page, or fall back to a single image file per page. In either case, multiple matches per page are an error (see error handling below). This default behaviour can be changed by using a fixed MIME type filter via mimetype. But still, multiple matching files per page are an error.
Single-page multiple-file errors are handled according to on_error:
  - if skip, then the page for the respective fileGrp will be silently skipped (as if there was no match at all)
  - if first, then the first matching file for the page will be silently selected (as if the first was the only match)
  - if last, then the last matching file for the page will be silently selected (as if the last was the only match)
  - if abort, then an exception will be raised.
Multiple matches for PAGE-XML will always raise an exception.
- Keyword Arguments:
require_first (boolean) – If true, then skip a page entirely whenever it is not available in the first input fileGrp.
on_error (string) – How to handle multiple file matches per page.
mimetype (string) – If not None, filter by the specified MIME type (literal or regex prefixed by //). Otherwise prefer PAGE or image.
- Returns:
A list of ocrd_models.ocrd_file.OcrdFile tuples.
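The alignment across file groups, together with the on_error and require_first options, can be sketched as follows. This is a simplified model rather than the library's code: pick, zip_pages and the dict-based group index are hypothetical, MIME type filtering is assumed to have happened beforehand, and the special handling of multiple PAGE-XML matches is left out.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical data model: per fileGrp, a mapping of page ID to matching file IDs.
GroupIndex = Dict[str, List[str]]


def pick(matches: List[str], on_error: str) -> Optional[str]:
    """Reduce the matches for one page in one fileGrp to a single file or None."""
    if not matches:
        return None
    if len(matches) == 1:
        return matches[0]
    if on_error == "skip":
        return None
    if on_error == "first":
        return matches[0]
    if on_error == "last":
        return matches[-1]
    if on_error == "abort":
        raise ValueError(f"multiple matching files: {matches}")
    raise ValueError(f"unknown on_error value: {on_error}")


def zip_pages(
    groups: List[GroupIndex], require_first: bool = True, on_error: str = "skip"
) -> List[Tuple[Optional[str], ...]]:
    """Align files across fileGrps, yielding one tuple per page."""
    page_ids: List[str] = []
    for group in groups:
        for page_id in group:
            if page_id not in page_ids:
                page_ids.append(page_id)
    rows = []
    for page_id in page_ids:
        row = tuple(pick(group.get(page_id, []), on_error) for group in groups)
        if require_first and row[0] is None:
            continue  # page is unavailable in the first fileGrp
        rows.append(row)
    return rows


groups = [
    {"PHYS_0001": ["A1"], "PHYS_0002": ["A2", "A2_ALT"]},
    {"PHYS_0001": ["B1"], "PHYS_0003": ["B3"]},
]
print(zip_pages(groups))
print(zip_pages(groups, require_first=False, on_error="first"))
```

Missing files are represented by None in the tuple, so every tuple has one slot per input fileGrp regardless of which groups actually contain the page.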
- ocrd.processor.base.generate_processor_help(ocrd_tool, processor_instance=None, subcommand=None)[source]¶
Generate a string describing the full CLI of this processor including params.
- Parameters:
ocrd_tool (dict) – this processor’s tools section of the module’s ocrd-tool.json
processor_instance – the processor implementation (for adding any module/class/function docstrings)
- ocrd.processor.base.run_cli(executable, mets_url=None, resolver=None, workspace=None, page_id=None, overwrite=None, debug=None, log_level=None, log_filename=None, input_file_grp=None, output_file_grp=None, parameter=None, working_dir=None, mets_server_url=None)[source]¶
Open a workspace and run a processor on the command line.
If workspace is not None, reuse that. Otherwise, instantiate a Workspace for mets_url (and working_dir) by using ocrd.Resolver.workspace_from_url() (i.e. open or clone local workspace).
Run the processor CLI executable on the workspace, passing:
  - the workspace,
  - page_id
  - input_file_grp
  - output_file_grp
  - parameter (after applying any parameter_override settings)
(Will create output files and update the METS in the filesystem.)
- Parameters:
executable (string) – Executable name of the module processor.
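Passing these values to a processor executable amounts to assembling a command line. The sketch below shows one way to do that, assuming the long option names of the OCR-D command-line interface (--working-dir, --mets, --page-id, --input-file-grp, --output-file-grp, --parameter, --overwrite). The helper build_cli_args, the executable name and the paths are hypothetical, and the snippet only builds the argument list without running anything.

```python
import shlex
from typing import List, Optional


def build_cli_args(
    executable: str,
    mets_url: str,
    working_dir: str,
    page_id: Optional[str] = None,
    input_file_grp: Optional[str] = None,
    output_file_grp: Optional[str] = None,
    parameter: Optional[str] = None,
    overwrite: bool = False,
) -> List[str]:
    """Assemble an argument list for a processor executable (illustrative only)."""
    args = [executable, "--working-dir", working_dir, "--mets", mets_url]
    if page_id:
        args += ["--page-id", page_id]
    if input_file_grp:
        args += ["--input-file-grp", input_file_grp]
    if output_file_grp:
        args += ["--output-file-grp", output_file_grp]
    if parameter:
        args += ["--parameter", parameter]
    if overwrite:
        args += ["--overwrite"]
    return args


args = build_cli_args(
    "ocrd-example-binarize",  # hypothetical executable name
    "mets.xml",
    "/data/workspace",        # hypothetical workspace directory
    input_file_grp="OCR-D-IMG",
    output_file_grp="OCR-D-BIN",
    parameter='{"threshold": 0.5}',
)
print(shlex.join(args))
```

Building a list rather than a single string avoids shell quoting problems with the JSON parameter value when the list is later handed to a subprocess call.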
- ocrd.processor.base.run_processor(processorClass, mets_url=None, resolver=None, workspace=None, page_id=None, log_level=None, input_file_grp=None, output_file_grp=None, parameter=None, working_dir=None, mets_server_url=None, instance_caching=False)[source]¶
Instantiate a Pythonic processor, open a workspace, run the processor and save the workspace.
If workspace is not None, reuse that. Otherwise, instantiate a Workspace for mets_url (and working_dir) by using ocrd.Resolver.workspace_from_url() (i.e. open or clone local workspace).
Instantiate a Python object for processorClass, passing:
  - the workspace,
  - page_id
  - input_file_grp
  - output_file_grp
  - parameter (after applying any parameter_override settings)
Warning: Avoid setting the instance_caching flag to True. It may have unexpected side effects. This flag is used for an experimental feature we would like to adopt in future.
Run the processor on the workspace (creating output files in the filesystem).
Finally, write back the workspace (updating the METS in the filesystem).
- Parameters:
processorClass (object) – Python class of the module processor.