ocrd_validators.page_validator module

API for validating OcrdPage.

exception ocrd_validators.page_validator.ConsistencyError(tag, ID, file_id, actual, expected)[source]

Bases: Exception

Exception representing a consistency error in textual transcription across levels of a PAGE-XML. (Element text strings must be the concatenation of their children’s text strings, joined by white space.)
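The rule in parentheses can be illustrated with a minimal sketch in plain Python. The helper name `is_consistent` and the representation of child elements as a flat list of strings are assumptions made for illustration; they are not part of ocrd_validators, which works on PAGE element objects.

```python
def is_consistent(parent_text, child_texts, joiner=" "):
    """Return True if parent_text equals the child texts joined by `joiner`.

    Hypothetical helper, for illustration only.
    """
    return parent_text == joiner.join(child_texts)


# A line whose text matches its words joined by a single space is consistent.
print(is_consistent("Hello world", ["Hello", "world"]))
# A line that lost the separating space is not.
print(is_consistent("Helloworld", ["Hello", "world"]))
```

A mismatch of this kind is what a ConsistencyError would report, with the parent's text as the actual value and the concatenation as the expected one.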

exception ocrd_validators.page_validator.CoordinateConsistencyError(tag, ID, file_id, outer, inner)[source]

Bases: Exception

Exception representing a consistency error in coordinate confinement across levels of a PAGE-XML. (Element coordinate polygons must be properly contained in their parents’ coordinate polygons.)

exception ocrd_validators.page_validator.CoordinateValidityError(tag, ID, file_id, points, reason='unknown')[source]

Bases: Exception

Exception representing a validity error of an element’s coordinates in PAGE-XML. (Element coordinate polygons must have at least 3 points, and must not self-intersect or be non-contiguous or be negative.)
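As a rough sketch, the two simplest of these conditions (point count and non-negative coordinates) can be checked without any geometry library. The function name `check_points` and the reason strings are hypothetical; the self-intersection and contiguity checks need real polygon geometry and are left out here.

```python
def check_points(points):
    """Return None if the points pass the basic checks, else a reason string.

    `points` is a list of (x, y) pairs. Hypothetical helper covering only
    the point-count and negativity rules.
    """
    if len(points) < 3:
        return "not enough points"
    if any(x < 0 or y < 0 for x, y in points):
        return "negative coordinates"
    return None


print(check_points([(0, 0), (10, 0), (10, 10)]))
print(check_points([(0, 0), (10, 0)]))
print(check_points([(0, 0), (-5, 0), (10, 10)]))
```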

class ocrd_validators.page_validator.PageValidator[source]

Bases: object

Validator for OcrdPage <../ocrd_models/ocrd_models.ocrd_page.html>.

static validate(*args, **kwargs)[source]

Validates a PAGE file for consistency, given either a filename, an OcrdFile, or an OcrdPage passed directly.

Parameters
  • filename (string) – Path to PAGE

  • ocrd_page (OcrdPage) – OcrdPage instance

  • ocrd_file (OcrdFile) – OcrdFile instance wrapping OcrdPage

  • page_textequiv_consistency (string) – ‘strict’, ‘lax’, ‘fix’ or ‘off’

  • page_textequiv_strategy (string) – Currently only ‘first’

  • check_baseline (bool) – whether Baseline must be fully within TextLine/Coords

  • check_coords (bool) – whether *Region/TextLine/Word/Glyph must each be fully contained within Border/*Region/TextLine/Word, resp.

Returns

report (ValidationReport) – Report on the validity

ocrd_validators.page_validator.compare_without_whitespace(a, b)[source]

Compare two strings, ignoring all whitespace.
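A comparison of this kind might be sketched as follows. The name `equal_ignoring_whitespace` is an assumption chosen to avoid implying that this is the library's own implementation.

```python
def equal_ignoring_whitespace(a, b):
    """Compare two strings after removing every whitespace character.

    Illustrative stand-in, not the library function itself.
    """
    # str.split() with no argument splits on any run of whitespace,
    # so joining the pieces drops spaces, tabs and newlines alike.
    return "".join(a.split()) == "".join(b.split())


print(equal_ignoring_whitespace("Hello world", "Hello\n  world"))
print(equal_ignoring_whitespace("Hello world", "Helloworld"))
print(equal_ignoring_whitespace("Hello world", "Hello word"))
```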

ocrd_validators.page_validator.concatenate(nodes, concatenate_with, page_textequiv_strategy, joins=None)[source]

Concatenate nodes textually according to https://ocr-d.github.io/page#consistency-of-text-results-on-different-levels

ocrd_validators.page_validator.get_text(node, page_textequiv_strategy='first')[source]

Get the first or most confident among text results (depending on page_textequiv_strategy). For the strategy best, return the string of the highest scoring result. For the strategy first, return the string of the lowest indexed result. If there are no scores/indexes, use the first result. If there are no results, return the empty string.
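The selection logic described above can be sketched on plain dictionaries. The keys `text`, `index` and `conf`, and the function name `pick_text`, are assumptions for illustration; the actual function operates on PAGE element objects rather than dicts.

```python
def pick_text(results, strategy="first"):
    """Pick one text string from a list of result dicts.

    Each dict may carry 'text', 'index' and 'conf' keys (hypothetical
    layout). Returns the empty string when there are no results.
    """
    if not results:
        return ""
    if strategy == "best":
        # Highest confidence wins; fall back to the first result
        # when no result carries a score.
        scored = [r for r in results if r.get("conf") is not None]
        chosen = max(scored, key=lambda r: r["conf"]) if scored else results[0]
    else:
        # Lowest index wins; fall back to the first result
        # when no result carries an index.
        indexed = [r for r in results if r.get("index") is not None]
        chosen = min(indexed, key=lambda r: r["index"]) if indexed else results[0]
    return chosen.get("text") or ""


results = [
    {"text": "foo", "index": 2, "conf": 0.9},
    {"text": "bar", "index": 1, "conf": 0.4},
]
print(pick_text(results, "first"))
print(pick_text(results, "best"))
```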

ocrd_validators.page_validator.make_line(line_points)[source]

Instantiate a LineString from a list of point pairs, or return an error string

ocrd_validators.page_validator.make_poly(polygon_points)[source]

Instantiate a Polygon from a list of point pairs, or return an error string

ocrd_validators.page_validator.page_get_reading_order(ro, rogroup)[source]

Add all elements from the given reading order group to the given dictionary.

Given a dict ro from layout element IDs to ReadingOrder element objects, and an object rogroup with additional ReadingOrder element objects, add all references to the dict, traversing the group recursively.
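The recursive traversal can be sketched with a small stand-in class. `RefNode`, its fields, and `collect_reading_order` are hypothetical names; the real function works on the ReadingOrder element classes of the PAGE model.

```python
from dataclasses import dataclass, field


@dataclass
class RefNode:
    """Hypothetical stand-in for a ReadingOrder element."""
    region_ref: str
    children: list = field(default_factory=list)


def collect_reading_order(ro, group):
    """Add every reference below `group` to the dict `ro`, recursively."""
    for child in group.children:
        ro[child.region_ref] = child
        if child.children:
            # Nested group: descend to pick up its references as well.
            collect_reading_order(ro, child)
    return ro


root = RefNode("root", [
    RefNode("r1"),
    RefNode("g1", [RefNode("r2"), RefNode("r3")]),
])
ro = collect_reading_order({}, root)
print(sorted(ro))
```

The dict is filled in place, so a caller can pass the same dict through several top-level groups.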

ocrd_validators.page_validator.set_text(node, text, page_textequiv_strategy)[source]

Set the first or most confident among text results (depending on page_textequiv_strategy). For the strategy best, set the string of the highest scoring result. For the strategy first, set the string of the lowest indexed result. If there are no scores/indexes, use the first result. If there are no results, add a new one.

ocrd_validators.page_validator.validate_consistency(node, page_textequiv_consistency, page_textequiv_strategy, check_baseline, check_coords, report, file_id, joinRelations=None, readingOrder=None, textLineOrder=None, readingDirection=None)[source]

Check whether the text results of an element are consistent with the text results of its child elements, and whether the coordinates of an element are fully within the coordinates of its parent element.