ocrd_validators.page_validator module
API for validating OcrdPage.
- exception ocrd_validators.page_validator.ConsistencyError(tag, ID, file_id, actual, expected)
Bases: Exception
Exception representing a consistency error in textual transcription across levels of a PAGE-XML. (Element text strings must be the concatenation of their children’s text strings, joined by white space.)
Construct a new ConsistencyError.
- Parameters:
tag (string) – Level of the inconsistent element (parent)
ID (string) – ID of the inconsistent element (parent)
file_id (string) – mets:id of the PAGE file
actual (string) – Value of parent’s TextEquiv[0]/Unicode
expected (string) – Concatenated values of children’s TextEquiv[0]/Unicode, joined by white-space
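The consistency rule can be illustrated with a minimal sketch in plain Python. It is not the module's implementation: the helper name find_inconsistency and the single-space default separator are assumptions made for illustration. Only the actual/expected pairing mirrors the parameters documented above.

```python
def find_inconsistency(parent_text, child_texts, separator=" "):
    """Return a dict with 'actual' and 'expected' if the parent text differs
    from its joined child texts, or None if the two agree.

    Hypothetical helper; the separator default is an assumption.
    """
    expected = separator.join(child_texts)
    if parent_text != expected:
        return {"actual": parent_text, "expected": expected}
    return None


# A line whose text matches its words joined by one space is consistent.
print(find_inconsistency("Hello world", ["Hello", "world"]))

# A line that drops a word is reported together with the expected text.
print(find_inconsistency("Hello", ["Hello", "world"]))
```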
- exception ocrd_validators.page_validator.CoordinateConsistencyError(tag, ID, file_id, outer, inner)
Bases: Exception
Exception representing a consistency error in coordinate confinement across levels of a PAGE-XML. (Element coordinate polygons must be properly contained in their parents’ coordinate polygons.)
Construct a new CoordinateConsistencyError.
- Parameters:
tag (string) – Level of the offending element (child)
ID (string) – ID of the offending element (child)
file_id (string) – mets:id of the PAGE file
outer (string) – Coordinate points of the parent
inner (string) – Coordinate points of the child
- exception ocrd_validators.page_validator.CoordinateValidityError(tag, ID, file_id, points, reason='unknown')
Bases: Exception
Exception representing a validity error of an element’s coordinates in PAGE-XML. (Element coordinate polygons must have at least 3 points, and must not self-intersect, be non-contiguous, or have negative coordinates.)
Construct a new CoordinateValidityError.
- Parameters:
tag (string) – Level of the offending element (child)
ID (string) – ID of the offending element (child)
points (string) – Coordinate points
reason (string) – description of the problem
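Two of the validity conditions (minimum point count and non-negative coordinates) can be checked without a geometry library. The following is a rough sketch, assuming the points string uses the PAGE format of space-separated x,y integer pairs. The function name check_points and the reason strings are hypothetical, and self-intersection and contiguity checks are deliberately left out.

```python
def check_points(points):
    """Return a reason string if the points string is invalid, else None.

    Hypothetical helper: covers only the point-count and sign conditions.
    """
    # "0,0 10,0 10,10" -> [(0, 0), (10, 0), (10, 10)]
    pairs = [tuple(int(v) for v in pair.split(",")) for pair in points.split()]
    if len(pairs) < 3:
        return "not enough points"
    if any(x < 0 or y < 0 for x, y in pairs):
        return "negative coordinates"
    return None


print(check_points("0,0 10,0 10,10"))
print(check_points("0,0 10,0"))
print(check_points("0,0 -5,0 10,10"))
```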
- ocrd_validators.page_validator.compare_without_whitespace(a, b)
Compare two strings, ignoring all whitespace.
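A whitespace-insensitive comparison of this kind can be sketched as follows. This is one possible way to express the idea, not necessarily how the module does it, and the name equal_ignoring_whitespace is made up for the example.

```python
def equal_ignoring_whitespace(a, b):
    """Return True if a and b are equal once all whitespace is removed."""
    # str.split() without arguments splits on any run of whitespace,
    # so joining the pieces drops spaces, tabs and newlines alike.
    return "".join(a.split()) == "".join(b.split())


print(equal_ignoring_whitespace("foo bar\n", "foobar"))
print(equal_ignoring_whitespace("foo", "bar"))
```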
- ocrd_validators.page_validator.page_get_reading_order(ro, rogroup)
Add all elements from the given reading order group to the given dictionary.
Given a dict ro from layout element IDs to ReadingOrder element objects, and an object rogroup with additional ReadingOrder element objects, add all references to the dict, traversing the group recursively.
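The recursive traversal can be pictured with a small self-contained sketch. The RegionRef and Group classes below are simplified stand-ins invented for this example, not the generated PAGE classes, and collect_reading_order is a hypothetical name. The real function works on OcrdPage ReadingOrder objects.

```python
from dataclasses import dataclass, field


@dataclass
class RegionRef:
    """Stand-in for a ReadingOrder reference to a layout element."""
    regionRef: str


@dataclass
class Group:
    """Stand-in for a ReadingOrder group holding references and subgroups."""
    id: str
    region_refs: list = field(default_factory=list)
    subgroups: list = field(default_factory=list)


def collect_reading_order(ro, group):
    """Add every reference in group (and its subgroups) to the dict ro."""
    for ref in group.region_refs:
        ro[ref.regionRef] = ref
    for sub in group.subgroups:
        collect_reading_order(ro, sub)


ro = {}
top = Group("g0", [RegionRef("r1")],
            [Group("g1", [RegionRef("r2"), RegionRef("r3")])])
collect_reading_order(ro, top)
print(sorted(ro))  # ['r1', 'r2', 'r3']
```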
- ocrd_validators.page_validator.make_poly(polygon_points)
Instantiate a Polygon from a list of point pairs, or return an error string.
- ocrd_validators.page_validator.make_line(line_points)
Instantiate a LineString from a list of point pairs, or return an error string.
- ocrd_validators.page_validator.validate_consistency(node, page_textequiv_consistency, page_textequiv_strategy, check_baseline, check_coords, report, file_id, joinRelations=None, readingOrder=None, textLineOrder=None, readingDirection=None)
Check whether the text results of an element are consistent with its child elements’ text results, and whether the coordinates of an element are fully within its parent element’s coordinates.
- ocrd_validators.page_validator.concatenate(nodes, concatenate_with, page_textequiv_strategy, joins=None)
Concatenate nodes textually according to https://ocr-d.github.io/page#consistency-of-text-results-on-different-levels
- ocrd_validators.page_validator.get_text(node, page_textequiv_strategy='first')
Get the first or most confident among text results (depending on page_textequiv_strategy). For the strategy best, return the string of the highest scoring result. For the strategy first, return the string of the lowest indexed result. If there are no scores/indexes, use the first result. If there are no results, return the empty string.
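The selection rule described above can be sketched in isolation. TextResult is a stand-in dataclass invented for this example (its text, index and conf fields are assumptions loosely modelled on PAGE TextEquiv), and pick_text is a hypothetical name. The real function operates on OcrdPage nodes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextResult:
    """Stand-in for one text result with optional index and confidence."""
    text: str
    index: Optional[int] = None
    conf: Optional[float] = None


def pick_text(results, strategy="first"):
    """Return the text chosen by the given strategy, or '' if no results."""
    if not results:
        return ""
    if strategy == "best":
        # Highest confidence wins; fall back to the first result if unscored.
        scored = [r for r in results if r.conf is not None]
        chosen = max(scored, key=lambda r: r.conf) if scored else results[0]
    else:
        # Lowest index wins; fall back to the first result if unindexed.
        indexed = [r for r in results if r.index is not None]
        chosen = min(indexed, key=lambda r: r.index) if indexed else results[0]
    return chosen.text


results = [TextResult("b", index=1, conf=0.9), TextResult("a", index=0, conf=0.5)]
print(pick_text(results))          # a
print(pick_text(results, "best"))  # b
```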
- ocrd_validators.page_validator.set_text(node, text, page_textequiv_strategy)
Set the first or most confident among text results (depending on page_textequiv_strategy). For the strategy best, set the string of the highest scoring result. For the strategy first, set the string of the lowest indexed result. If there are no scores/indexes, use the first result. If there are no results, add a new one.
- class ocrd_validators.page_validator.PageValidator
Bases: object
Validator for OcrdPage.
- static validate(filename=None, ocrd_page=None, ocrd_file=None, page_textequiv_consistency='strict', page_textequiv_strategy='first', check_baseline=True, check_coords=True)
Validate a PAGE file for consistency, given either by filename, as an OcrdFile, or as an OcrdPage passed directly.
- Parameters:
filename (string) – Path to PAGE
ocrd_page (OcrdPage) – OcrdPage instance
ocrd_file (OcrdFile) – OcrdFile instance wrapping OcrdPage
page_textequiv_consistency (string) – ‘strict’, ‘lax’, ‘fix’ or ‘off’
page_textequiv_strategy (string) – Currently only ‘first’
check_baseline (bool) – whether Baseline must be fully within TextLine/Coords
check_coords (bool) – whether *Region/TextLine/Word/Glyph must each be fully contained within Border/*Region/TextLine/Word, resp.
- Returns:
report (ValidationReport) – Report on the validity
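A typical call might look like the sketch below, which uses only the keyword arguments listed above. The wrapper name validate_page is hypothetical. The report attributes is_valid and errors are not documented on this page and should be treated as assumptions about ValidationReport. The example needs OCR-D core installed to actually run a validation.

```python
import sys


def validate_page(path, consistency="lax"):
    """Validate the PAGE file at path and return the resulting report.

    Hypothetical convenience wrapper around PageValidator.validate.
    """
    # Imported inside the function so this sketch can be loaded
    # even where OCR-D core is not installed.
    from ocrd_validators.page_validator import PageValidator

    return PageValidator.validate(
        filename=path,
        page_textequiv_consistency=consistency,  # 'strict', 'lax', 'fix' or 'off'
        check_baseline=True,
        check_coords=True,
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    report = validate_page(sys.argv[1])
    # is_valid and errors are assumed attribute names of ValidationReport.
    print("valid" if report.is_valid else "invalid")
    for error in report.errors:
        print(error)
```

Passing ocrd_page or ocrd_file instead of filename follows the same pattern, since validate accepts any one of the three.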