ocrd_validators package

Validators for various OCR-D related data structures.

class ocrd_validators.ParameterValidator(ocrd_tool)[source]

Bases: JsonValidator

JsonValidator validating parameters against ocrd-tool.json.

Construct a ParameterValidator.

Parameters:

ocrd_tool (dict) – Parsed ocrd-tool.json.

validate(*args, **kwargs)[source]

Validate a parameter dict against a parameter schema from an ocrd-tool.json.

Parameters:
  • obj (dict)

  • schema (dict)
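As an illustration of what a "parameter schema from an ocrd-tool.json" can look like, the sketch below turns the parameters object of a tool into a JSON-Schema-style object schema. The helper name parameters_to_schema and the sample parameters dpi and level are hypothetical, and this is an assumption about the general approach, not the package's actual implementation.

```python
import copy


def parameters_to_schema(parameters):
    """Build an object schema from per-parameter descriptions (hypothetical helper)."""
    properties = copy.deepcopy(parameters)  # leave the caller's dict untouched
    required = []
    for name, spec in properties.items():
        # ocrd-tool.json marks `required` per parameter as a boolean, whereas
        # JSON Schema expects a list of required names at the object level.
        if spec.pop("required", False):
            required.append(name)
    return {
        "type": "object",
        "required": required,
        "additionalProperties": False,
        "properties": properties,
    }


# Hypothetical parameter descriptions, shaped like the `parameters` object of a tool.
tool_parameters = {
    "dpi": {"type": "number", "description": "Pixel density override", "required": True},
    "level": {"type": "string", "description": "Hierarchy level", "default": "line"},
}
schema = parameters_to_schema(tool_parameters)
print(schema["required"])
```

With the package installed, a parameter dict would presumably be checked with ParameterValidator(tool).validate(params), where tool is the parsed entry of a single tool.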

class ocrd_validators.WorkspaceValidator(resolver, mets_url, src_dir=None, skip=None, download=False, page_strictness='strict', page_coordinate_consistency='poly', include_fileGrp=None, exclude_fileGrp=None)[source]

Bases: object

Validator for OcrdMets <../ocrd_models/ocrd_models.ocrd_mets.html>.

Construct a new WorkspaceValidator.

Parameters:
  • resolver (Resolver)

  • mets_url (string)

  • src_dir (string)

  • skip (list)

  • download (boolean)

  • page_strictness ("strict"|"lax"|"fix"|"off") – how strictly to check multi-level TextEquiv consistency of PAGE XML files

  • page_coordinate_consistency ("poly"|"baseline"|"both"|"off") –

    check whether each segment’s coords are fully contained within its parent’s:

    • "poly": *Region/TextLine/Word/Glyph in Border/*Region/TextLine/Word

    • "baseline": Baseline in TextLine

    • "both": both poly and baseline checks

    • "off": no coordinate checks

  • include_fileGrp (list[str]) – filegrp whitelist

  • exclude_fileGrp (list[str]) – filegrp blacklist
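The coordinate consistency idea (each segment's coords contained within its parent's) can be sketched in plain Python. The version below is a deliberate simplification that only tests the child's vertices with a ray-casting point-in-polygon check, so it can miss edges that leave a concave parent. All function names and sample polygons are hypothetical, and the package may well use a geometry library instead.

```python
def point_in_polygon(point, polygon):
    """Ray casting: count how often a ray going right from the point crosses an edge."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def vertices_within(child, parent):
    """Simplified containment: every vertex of the child lies inside the parent."""
    return all(point_in_polygon(p, parent) for p in child)


# Hypothetical region and two candidate text lines (pixel coordinates).
region = [(0, 0), (100, 0), (100, 50), (0, 50)]
line_ok = [(10, 10), (90, 10), (90, 40), (10, 40)]
line_bad = [(10, 10), (120, 10), (120, 40), (10, 40)]

print(vertices_within(line_ok, region), vertices_within(line_bad, region))
```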

static check_file_grp(workspace, input_file_grp=None, output_file_grp=None, page_id=None, report=None)[source]

Return a report on whether the given input_file_grp(s) exist in workspace.mets and the given output_file_grp(s) do not. To be run before processing.

Parameters:
  • workspace (Workspace)

  • input_file_grp (list|string)

  • output_file_grp (list|string)

  • page_id (list|string)
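The rule behind this check (input groups must exist, output groups must not) can be sketched without a workspace. In the sketch below the set of existing fileGrp names is passed in directly, string arguments are assumed to be comma-separated, and page_id is not modelled. The helper check_groups is hypothetical and returns plain messages rather than a report object.

```python
def check_groups(existing, input_file_grp=None, output_file_grp=None):
    """Return a list of problems; an empty list means processing could start."""

    def as_list(value):
        if value is None:
            return []
        if isinstance(value, str):
            return value.split(",")  # assumption: comma-separated group names
        return list(value)

    problems = []
    for grp in as_list(input_file_grp):
        if grp not in existing:
            problems.append(f"Input fileGrp '{grp}' not found in METS")
    for grp in as_list(output_file_grp):
        if grp in existing:
            problems.append(f"Output fileGrp '{grp}' already in METS")
    return problems


existing_groups = {"OCR-D-IMG", "OCR-D-BIN"}
print(check_groups(existing_groups, "OCR-D-IMG", "OCR-D-SEG"))
print(check_groups(existing_groups, ["OCR-D-MISSING"], "OCR-D-BIN"))
```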

static validate(*args, **kwargs)[source]

Validates the workspace of a METS URL against the specs

Parameters:
  • resolver (ocrd.Resolver) – Resolver

  • mets_url (string) – URL of the METS file

  • src_dir (string, None) – Directory containing the METS file

  • skip (list) – Validation checks to omit. One or more of ‘mets_unique_identifier’, ‘mets_file_group_names’, ‘mets_files’, ‘pixel_density’, ‘dimension’, ‘url’, ‘multipage’, ‘page’, ‘page_xsd’, ‘mets_xsd’, ‘mets_fileid_page_pcgtsid’

  • download (boolean) – Whether to download remote file references temporarily during validation (like a processor would)

Returns:

report (ValidationReport) – Report on the validity
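A minimal sketch of a call with a skip list follows. The tuple of check names is copied from the skip parameter above. The helper names unknown_skips and validate_workspace are hypothetical, and the positional argument order passed to WorkspaceValidator.validate is assumed to mirror the constructor signature. The OCR-D imports are done lazily so the name check works without the packages installed.

```python
KNOWN_CHECKS = (
    "mets_unique_identifier", "mets_file_group_names", "mets_files",
    "pixel_density", "dimension", "url", "multipage", "page",
    "page_xsd", "mets_xsd", "mets_fileid_page_pcgtsid",
)


def unknown_skips(skip):
    """Return the entries of `skip` that are not documented check names."""
    return [name for name in skip if name not in KNOWN_CHECKS]


def validate_workspace(mets_url, skip=()):
    """Validate a workspace, refusing misspelled skip names (requires OCR-D packages)."""
    bad = unknown_skips(skip)
    if bad:
        raise ValueError(f"Unknown validation checks: {bad}")
    # Imported here so that the helper above stays usable without OCR-D installed.
    from ocrd import Resolver
    from ocrd_validators import WorkspaceValidator
    return WorkspaceValidator.validate(Resolver(), mets_url, skip=list(skip))


print(unknown_skips(["page", "pixel_densty"]))
```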

class ocrd_validators.PageValidator[source]

Bases: object

Validator for OcrdPage <../ocrd_models/ocrd_models.ocrd_page.html>.

static validate(filename=None, ocrd_page=None, ocrd_file=None, page_textequiv_consistency='strict', page_textequiv_strategy='first', check_baseline=True, check_coords=True)[source]

Validates a PAGE file for consistency by filename, OcrdFile or passing OcrdPage directly.

Parameters:
  • filename (string) – Path to PAGE

  • ocrd_page (OcrdPage) – OcrdPage instance

  • ocrd_file (OcrdFile) – OcrdFile instance wrapping OcrdPage

  • page_textequiv_consistency (string) – ‘strict’, ‘lax’, ‘fix’ or ‘off’

  • page_textequiv_strategy (string) – Currently only ‘first’

  • check_baseline (bool) – whether Baseline must be fully within TextLine/Coords

  • check_coords (bool) – whether *Region/TextLine/Word/Glyph must each be fully contained within Border/*Region/TextLine/Word, resp.

Returns:

report (ValidationReport) – Report on the validity
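The four page_textequiv_consistency values can be pictured with a small sketch that compares a parent's text to the concatenation of its children's text. The semantics used here are assumptions inferred from the option names: "strict" demands an exact match, "lax" ignores whitespace, "fix" replaces the parent text, and "off" skips the check. The helper check_textequiv and the separator argument are hypothetical.

```python
import re


def check_textequiv(parent_text, child_texts, separator, strictness="strict"):
    """Return (consistent, resulting_parent_text) under the assumed semantics."""
    joined = separator.join(child_texts)
    if strictness == "off":
        return True, parent_text
    if strictness == "fix":
        # Assumed behaviour: overwrite the parent with the concatenated children.
        return True, joined
    if strictness == "lax":
        def squeeze(text):
            return re.sub(r"\s+", "", text)
        return squeeze(parent_text) == squeeze(joined), parent_text
    if strictness == "strict":
        return parent_text == joined, parent_text
    raise ValueError(f"Unknown strictness: {strictness!r}")


# A line whose text lacks the space that joining its words would produce.
words = ["hello", "world"]
for mode in ("strict", "lax", "fix", "off"):
    print(mode, check_textequiv("helloworld", words, " ", mode))
```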

class ocrd_validators.OcrdToolValidator(schema, validator_class=<class 'jsonschema.validators.Draft6Validator'>)[source]

Bases: JsonValidator

JsonValidator validating against the ocrd-tool.json schema.

Construct a JsonValidator.

Parameters:
  • schema (dict)

  • validator_class (Draft6Validator|DefaultValidatingDraft6Validator)

static validate(obj, schema={'additionalProperties': False, 'description': 'Schema for tools by OCR-D MP', 'properties': {'dockerhub': {'description': 'DockerHub image', 'type': 'string'}, 'git_url': {'description': 'Github/Gitlab URL', 'format': 'url', 'type': 'string'}, 'tools': {'additionalProperties': False, 'patternProperties': {'ocrd-.*': {'additionalProperties': False, 'properties': {'categories': {'description': 'Tools belong to this categories, representing modules within the OCR-D project structure', 'items': {'enum': ['Image preprocessing', 'Layout analysis', 'Text recognition and optimization', 'Model training', 'Long-term preservation', 'Quality assurance'], 'type': 'string'}, 'type': 'array'}, 'description': {'description': 'Concise description what the tool does'}, 'executable': {'description': 'The name of the CLI executable in $PATH', 'type': 'string'}, 'input_file_grp': {'description': 'Input fileGrp@USE this tool expects by default', 'items': {'type': 'string'}, 'type': 'array'}, 'output_file_grp': {'description': 'Output fileGrp@USE this tool produces by default', 'items': {'type': 'string'}, 'type': 'array'}, 'parameters': {'description': 'Object describing the parameters of a tool. Keys are parameter names, values sub-schemas.', 'patternProperties': {'.*': {'additionalProperties': False, 'properties': {'additionalProperties': {'description': 'Whether an object value may contain properties not explicitly defined', 'type': 'boolean'}, 'cacheable': {'default': False, 'description': "If parameter is reference to file: Whether the file should be cached, e.g. because it is large and won't change.", 'type': 'boolean'}, 'content-type': {'default': 'application/octet-stream', 'description': 'The media type of resources this processor expects for this parameter. Most processors use files for resources (e.g.  `*.traineddata` for `ocrd-tesserocr-recognize`) while others use directories of files (e.g. `default` for `ocrd-eynollah-segment`).  
If a parameter requires directories, it must set `content-type` to `text/directory`.\n', 'type': 'string'}, 'default': {'description': 'Default value when not provided by the user'}, 'description': {'description': 'Concise description of syntax and semantics of this parameter'}, 'enum': {'description': 'List the allowed values if a fixed list.', 'type': 'array'}, 'exclusiveMaximum': {'description': 'Maximum value for number parameters, excluding the maximum', 'type': 'number'}, 'exclusiveMinimum': {'description': 'Minimum value for number parameters, excluding the minimum', 'type': 'number'}, 'format': {'description': 'Subtype, such as `float` for type `number` or `uri` for type `string`.'}, 'items': {'description': 'describe the items of an array further', 'type': 'object'}, 'maximum': {'description': 'Maximum value for number parameters, including the maximum', 'type': 'number'}, 'minimum': {'description': 'Minimum value for number parameters, including the minimum', 'type': 'number'}, 'multipleOf': {'description': 'For number values, those values must be multiple of this number', 'type': 'number'}, 'properties': {'description': 'Describe the properties of an object value', 'type': 'object'}, 'required': {'description': 'Whether this parameter is required', 'type': 'boolean'}, 'type': {'description': 'Data type of this parameter', 'enum': ['string', 'number', 'boolean', 'object', 'array'], 'type': 'string'}}, 'required': ['description', 'type'], 'type': 'object'}}, 'type': 'object'}, 'resource_locations': {'default': ['data', 'cwd', 'system', 'module'], 'description': 'The locations in the filesystem this processor supports for resource lookup', 'items': {'enum': ['data', 'cwd', 'system', 'module'], 'type': 'string'}, 'type': 'array'}, 'resources': {'description': 'Resources for this processor', 'items': {'additionalProperties': False, 'properties': {'description': {'description': 'A description of the resource', 'type': 'string'}, 'name': {'description': 'Name 
to store the resource as', 'type': 'string'}, 'parameter_usage': {'default': 'as-is', 'description': 'Defines how the parameter is to be used', 'enum': ['as-is', 'without-extension'], 'type': 'string'}, 'path_in_archive': {'default': '.', 'description': 'if type is archive, the resource is at this location in the archive', 'type': 'string'}, 'size': {'description': 'Size of the resource in bytes', 'type': 'number'}, 'type': {'default': 'file', 'description': 'Type of the URL', 'enum': ['file', 'directory', 'archive'], 'type': 'string'}, 'url': {'description': 'URLs of all components of this resource', 'type': 'string'}, 'version_range': {'default': '>= 0.0.1', 'description': 'Range of supported versions, syntax like in PEP 440', 'type': 'string'}}, 'required': ['url', 'description', 'name', 'size'], 'type': 'object'}, 'type': 'array'}, 'steps': {'description': 'This tool can be used at these steps in the OCR-D functional model', 'items': {'enum': ['preprocessing/characterization', 'preprocessing/optimization', 'preprocessing/optimization/cropping', 'preprocessing/optimization/deskewing', 'preprocessing/optimization/despeckling', 'preprocessing/optimization/dewarping', 'preprocessing/optimization/binarization', 'preprocessing/optimization/grayscale_normalization', 'recognition/text-recognition', 'recognition/font-identification', 'recognition/post-correction', 'layout/segmentation', 'layout/segmentation/text-nontext', 'layout/segmentation/region', 'layout/segmentation/line', 'layout/segmentation/word', 'layout/segmentation/classification', 'layout/analysis'], 'type': 'string'}, 'type': 'array'}}, 'required': ['description', 'steps', 'executable', 'categories', 'input_file_grp'], 'type': 'object'}}, 'type': 'object'}, 'version': {'description': 'Version of the tool, expressed as MAJOR.MINOR.PATCH.', 'pattern': '^[0-9]+\\.[0-9]+\\.[0-9]+$', 'type': 'string'}}, 'required': ['version', 'git_url', 'tools'], 'type': 'object'})[source]

Validate against ocrd-tool.json schema.
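To make the default schema above easier to read, here is a hedged sketch of a minimal document that satisfies its required keys, along with a much reduced check of only those keys, the version pattern and the tool name pattern. The tool name, URL and the helper quick_check are hypothetical. A real validation would use the full schema through OcrdToolValidator.validate.

```python
import re

TOP_LEVEL_REQUIRED = ("version", "git_url", "tools")
TOOL_REQUIRED = ("description", "steps", "executable", "categories", "input_file_grp")
VERSION_PATTERN = re.compile(r"^[0-9]+\.[0-9]+\.[0-9]+$")


def quick_check(ocrd_tool):
    """Check only required keys and two patterns; not a full schema validation."""
    problems = [f"missing top-level key: {key}"
                for key in TOP_LEVEL_REQUIRED if key not in ocrd_tool]
    if "version" in ocrd_tool:
        version = ocrd_tool["version"]
        if not (isinstance(version, str) and VERSION_PATTERN.match(version)):
            problems.append("version is not MAJOR.MINOR.PATCH")
    for name, tool in ocrd_tool.get("tools", {}).items():
        if not re.search(r"ocrd-.*", name):
            problems.append(f"unexpected tool name: {name}")
        problems += [f"{name}: missing key: {key}"
                     for key in TOOL_REQUIRED if key not in tool]
    return problems


# Hypothetical minimal ocrd-tool.json content.
sample = {
    "version": "1.0.0",
    "git_url": "https://example.org/ocrd_example",
    "tools": {
        "ocrd-example-binarize": {
            "executable": "ocrd-example-binarize",
            "description": "Hypothetical binarization tool",
            "categories": ["Image preprocessing"],
            "steps": ["preprocessing/optimization/binarization"],
            "input_file_grp": ["OCR-D-IMG"],
        }
    },
}
print(quick_check(sample))
```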

class ocrd_validators.OcrdResourceListValidator(schema, validator_class=<class 'jsonschema.validators.Draft6Validator'>)[source]

Bases: JsonValidator

JsonValidator validating against the resource_list.yml schema.

Construct a JsonValidator.

Parameters:
  • schema (dict)

  • validator_class (Draft6Validator|DefaultValidatingDraft6Validator)

static validate(obj, schema={'additionalProperties': False, 'patternProperties': {'^ocrd-.*': {'description': 'Resources for this processor', 'items': {'additionalProperties': False, 'properties': {'description': {'description': 'A description of the resource', 'type': 'string'}, 'name': {'description': 'Name to store the resource as', 'type': 'string'}, 'parameter_usage': {'default': 'as-is', 'description': 'Defines how the parameter is to be used', 'enum': ['as-is', 'without-extension'], 'type': 'string'}, 'path_in_archive': {'default': '.', 'description': 'if type is archive, the resource is at this location in the archive', 'type': 'string'}, 'size': {'description': 'Size of the resource in bytes', 'type': 'number'}, 'type': {'default': 'file', 'description': 'Type of the URL', 'enum': ['file', 'directory', 'archive'], 'type': 'string'}, 'url': {'description': 'URLs of all components of this resource', 'type': 'string'}, 'version_range': {'default': '>= 0.0.1', 'description': 'Range of supported versions, syntax like in PEP 440', 'type': 'string'}}, 'required': ['url', 'description', 'name', 'size'], 'type': 'object'}, 'type': 'array'}}, 'type': 'object'})[source]

Validate against resource_list.schema.yml schema.
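Read as data, the default schema above describes a mapping from processor names to lists of resource entries. The sketch below builds one such entry and checks only the key pattern and the four required fields. The processor name, URL, size and the helper check_resource_list are hypothetical.

```python
import re

RESOURCE_REQUIRED = ("url", "description", "name", "size")


def check_resource_list(resource_list):
    """Check processor key pattern and required resource fields only."""
    problems = []
    for processor, resources in resource_list.items():
        if not re.search(r"^ocrd-.*", processor):
            problems.append(f"unexpected key: {processor}")
            continue
        for index, resource in enumerate(resources):
            for key in RESOURCE_REQUIRED:
                if key not in resource:
                    problems.append(f"{processor}[{index}]: missing {key}")
    return problems


# Hypothetical resource list with a single entry.
resources = {
    "ocrd-example-recognize": [
        {
            "url": "https://example.org/models/example.traineddata",
            "description": "Hypothetical recognition model",
            "name": "example.traineddata",
            "size": 1234567,
            "type": "file",
        }
    ]
}
print(check_resource_list(resources))
```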

class ocrd_validators.OcrdZipValidator(resolver, path_to_zip)[source]

Bases: object

Validate conformance with BagIt and the OCR-D BagIt profile.

Parameters:
  • resolver (Resolver) – resolver

  • path_to_zip (string) – Path to the OCRD-ZIP file

validate(skip_checksums=False, skip_bag=False, skip_unzip=False, skip_delete=False, processes=2)[source]

Validate an OCRD-ZIP file for profile, bag and workspace conformance

Parameters:
  • skip_bag (boolean) – Whether to skip all checks of manifests and files

  • skip_checksums (boolean) – Whether to omit checksum checks but still check basic BagIt conformance

  • skip_unzip (boolean) – Whether the OCRD-ZIP is unzipped, i.e. a directory

  • skip_delete (boolean) – Whether to skip deleting the unpacked OCRD-ZIP dir after validation

  • processes (integer) – Number of processes used for checksum validation
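What skip_checksums leaves out can be pictured with a generic BagIt checksum check: each manifest line pairs a digest with a path relative to the bag, and validation recomputes the digest. The sketch below is sequential and stdlib-only. The manifest name manifest-sha512.txt, the helper verify_manifest and the sample payload are assumptions, and the package itself may delegate this work to a BagIt library and spread it over several processes.

```python
import hashlib
import tempfile
from pathlib import Path


def verify_manifest(bag_dir, manifest_name="manifest-sha512.txt"):
    """Return the relative paths whose SHA-512 digest differs from the manifest."""
    bag_dir = Path(bag_dir)
    mismatches = []
    manifest = (bag_dir / manifest_name).read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha512((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            mismatches.append(rel_path)
    return mismatches


# Build a tiny throwaway bag, check it, then tamper with the payload.
with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp)
    (bag / "data").mkdir()
    payload = bag / "data" / "mets.xml"
    payload.write_bytes(b"<mets/>")
    digest = hashlib.sha512(payload.read_bytes()).hexdigest()
    (bag / "manifest-sha512.txt").write_text(f"{digest}  data/mets.xml\n", encoding="utf-8")
    intact = verify_manifest(bag)
    payload.write_bytes(b"<mets>changed</mets>")
    tampered = verify_manifest(bag)

print(intact, tampered)
```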

class ocrd_validators.XsdValidator(schema_url)[source]

Bases: object

XML Schema validator.

Construct an XsdValidator.

Parameters:

schema_url (str) – URI of XML schema to validate against.

classmethod instance(schema_url)[source]

classmethod validate(schema_url, doc)[source]

Validate an XML document against a schema.

Parameters:
  • doc (etree.ElementTree|str|bytes)

  • schema_url (str) – URI of XML schema to validate against.
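The pairing of instance(schema_url) with a class-level validate suggests a per-schema cache, so that a schema is loaded once and reused. The source does not describe instance, so the sketch below is only one plausible reading. The classes SketchValidator and SketchMetsValidator, the placeholder URL and the string result are all hypothetical. It also shows how a subclass such as XsdMetsValidator could fix the schema and accept only doc.

```python
class SketchValidator:
    """Hypothetical stand-in for a schema validator with cached instances."""

    _instances = {}

    def __init__(self, schema_url):
        # A real validator would fetch and compile the schema here.
        self.schema_url = schema_url

    @classmethod
    def instance(cls, schema_url):
        """Return one shared validator per (class, schema URL)."""
        key = (cls, schema_url)
        if key not in cls._instances:
            cls._instances[key] = cls(schema_url)
        return cls._instances[key]

    @classmethod
    def validate(cls, schema_url, doc):
        return cls.instance(schema_url)._validate(doc)

    def _validate(self, doc):
        # Placeholder result; a real implementation would return a report.
        return f"validated {len(doc)} bytes against {self.schema_url}"


class SketchMetsValidator(SketchValidator):
    SCHEMA_URL = "https://example.org/mets.xsd"  # placeholder, not the real schema URL

    @classmethod
    def validate(cls, doc):
        # The subclass pins the schema, so callers pass only the document.
        return cls.instance(cls.SCHEMA_URL)._validate(doc)


print(SketchMetsValidator.validate(b"<mets/>"))
```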

class ocrd_validators.XsdMetsValidator(schema_url)[source]

Bases: XsdValidator

XML Schema validator.

Construct an XsdValidator.

Parameters:

schema_url (str) – URI of XML schema to validate against.

classmethod validate(doc)[source]

Validate an XML document against a schema

Parameters:

doc (etree.ElementTree|str|bytes)

class ocrd_validators.XsdPageValidator(schema_url)[source]

Bases: XsdValidator

XML Schema validator.

Construct an XsdValidator.

Parameters:

schema_url (str) – URI of XML schema to validate against.

classmethod validate(doc)[source]

Validate an XML document against a schema

Parameters:

doc (etree.ElementTree|str|bytes)

class ocrd_validators.ProcessingServerConfigValidator(schema, validator_class=<class 'jsonschema.validators.Draft6Validator'>)[source]

Bases: JsonValidator

JsonValidator validating against the schema for the Processing Server.

Construct a JsonValidator.

Parameters:
  • schema (dict)

  • validator_class (Draft6Validator|DefaultValidatingDraft6Validator)

static validate(obj, schema={'$defs': {'address': {'anyOf': [{'format': 'hostname'}, {'format': 'ipv4'}], 'type': 'string'}, 'credentials': {'additionalProperties': False, 'properties': {'password': {'type': 'string'}, 'username': {'type': 'string'}}, 'required': ['username', 'password'], 'type': 'object'}, 'port': {'maximum': 65535, 'minimum': 1, 'type': 'integer'}, 'ssh': {'additionalProperties': False, 'oneOf': [{'required': ['username', 'password']}, {'required': ['username', 'path_to_privkey']}], 'properties': {'password': {'type': 'string'}, 'path_to_privkey': {'description': 'Path to private key file', 'type': 'string'}, 'username': {'type': 'string'}}, 'type': 'object'}}, '$id': 'https://ocr-d.de/spec/web-api/config.schema.yml', '$schema': 'https://json-schema.org/draft/2020-12/schema', 'additionalProperties': False, 'description': 'Schema for the Processing Broker configuration file', 'properties': {'database': {'additionalProperties': False, 'description': 'Information about the MongoDB', 'properties': {'address': {'$ref': '#/$defs/address', 'description': 'The IP address or domain name of the machine where MongoDB is deployed'}, 'credentials': {'$ref': '#/$defs/credentials', 'description': 'The credentials for the MongoDB'}, 'port': {'$ref': '#/$defs/port', 'description': 'The port number of the MongoDB'}, 'skip_deployment': {'description': 'set to true to deploy database yourself', 'type': 'boolean'}, 'ssh': {'$ref': '#/$defs/ssh', 'description': 'Information required for an SSH connection'}}, 'required': ['address', 'port'], 'type': 'object'}, 'hosts': {'description': 'A list of hosts where Processing Servers will be deployed', 'items': {'additionalProperties': False, 'anyOf': [{'required': ['workers']}, {'required': ['servers']}], 'description': 'A host where one or many Processing Servers will be deployed', 'oneOf': [{'required': ['password']}, {'required': ['path_to_privkey']}], 'properties': {'address': {'$ref': '#/$defs/address', 'description': 
'The IP address or domain name of the target machine'}, 'password': {'type': 'string'}, 'path_to_privkey': {'description': 'Path to private key file', 'type': 'string'}, 'servers': {'description': 'List of processor servers that will be deployed', 'items': {'additionalProperties': False, 'properties': {'deploy_type': {'default': 'native', 'description': 'Should the processor server be deployed natively or with Docker', 'enum': ['native', 'docker'], 'type': 'string'}, 'name': {'description': 'Name of the processor', 'examples': ['ocrd-cis-ocropy-binarize', 'ocrd-olena-binarize'], 'pattern': '^ocrd-.*$', 'type': 'string'}, 'port': {'$ref': '#/$defs/port', 'description': 'The port number to be deployed on the host'}}, 'required': ['name', 'port'], 'type': 'object'}, 'minItems': 1, 'type': 'array'}, 'username': {'type': 'string'}, 'workers': {'description': 'List of processing workers that will be deployed', 'items': {'additionalProperties': False, 'properties': {'deploy_type': {'default': 'native', 'description': 'Should the processing worker be deployed natively or with Docker', 'enum': ['native', 'docker'], 'type': 'string'}, 'name': {'description': 'Name of the processor', 'examples': ['ocrd-cis-ocropy-binarize', 'ocrd-olena-binarize'], 'pattern': '^ocrd-.*$', 'type': 'string'}, 'number_of_instance': {'default': 1, 'description': 'Number of instances to be deployed', 'minimum': 1, 'type': 'integer'}}, 'required': ['name'], 'type': 'object'}, 'minItems': 1, 'type': 'array'}}, 'required': ['address', 'username'], 'type': 'object'}, 'type': 'array'}, 'internal_callback_url': {'description': 'optionally set the host for the internal_callback_url, for example "http://172.17.0.1:8080"', 'type': 'string'}, 'process_queue': {'additionalProperties': False, 'description': 'Information about the Message Queue', 'properties': {'address': {'$ref': '#/$defs/address', 'description': 'The IP address or domain name of the machine where the Message Queue is deployed'}, 
'credentials': {'$ref': '#/$defs/credentials', 'description': 'The credentials for the Message Queue'}, 'port': {'$ref': '#/$defs/port', 'description': 'The port number of the Message Queue'}, 'skip_deployment': {'description': 'set to true to deploy queue yourself', 'type': 'boolean'}, 'ssh': {'$ref': '#/$defs/ssh', 'description': 'Information required for an SSH connection'}}, 'required': ['address', 'port'], 'type': 'object'}, 'use_tcp_mets': {'description': 'optionally use tcp mets-server-instead of uds-mets-server', 'type': 'boolean'}}, 'required': ['process_queue'], 'type': 'object'})[source]

Validate against the schema for the Processing Server.
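A few constraints from the default schema above are easy to miss in its flattened form: process_queue is the only required top-level key, ports must be integers from 1 to 65535, each host needs exactly one of password or path_to_privkey (oneOf), and each host needs workers or servers (anyOf). The sketch below checks just those rules. The helper names and sample values are hypothetical, and it is no substitute for full schema validation.

```python
def check_port(port):
    """Integer in 1..65535 (booleans are not accepted as integers)."""
    return isinstance(port, int) and not isinstance(port, bool) and 1 <= port <= 65535


def check_host_auth(host):
    """oneOf: exactly one of `password` and `path_to_privkey` must be present."""
    return ("password" in host) != ("path_to_privkey" in host)


def check_config(config):
    """Check a handful of rules from the Processing Server config schema."""
    problems = []
    queue = config.get("process_queue")
    if queue is None:
        problems.append("missing required key: process_queue")
    else:
        for key in ("address", "port"):
            if key not in queue:
                problems.append(f"process_queue: missing {key}")
        if "port" in queue and not check_port(queue["port"]):
            problems.append("process_queue: port out of range")
    for index, host in enumerate(config.get("hosts", [])):
        if not check_host_auth(host):
            problems.append(f"hosts[{index}]: need exactly one of password/path_to_privkey")
        if "workers" not in host and "servers" not in host:
            problems.append(f"hosts[{index}]: need workers or servers")
    return problems


# Hypothetical configuration values.
config = {
    "process_queue": {"address": "localhost", "port": 5672},
    "hosts": [
        {
            "address": "localhost",
            "username": "ocrd",
            "password": "secret",
            "workers": [{"name": "ocrd-cis-ocropy-binarize"}],
        }
    ],
}
print(check_config(config))
```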

class ocrd_validators.OcrdNetworkMessageValidator(schema, validator_class=<class 'jsonschema.validators.Draft6Validator'>)[source]

Bases: JsonValidator

JsonValidator validating against the OCR-D network message schemas.

Construct a JsonValidator.

Parameters:
  • schema (dict)

  • validator_class (Draft6Validator|DefaultValidatingDraft6Validator)

static validate_message_processing(obj)[source]

static validate_message_result(obj)[source]

Submodules