Kwalitee - OCR-D

GitHub Last update Number of contributors

cor-asv-ann

{"compliant_cli"=>true, "files"=>{"Dockerfile"=>nil, "README.md"=>"# cor-asv-ann\n    OCR post-correction with encoder-attention-decoder LSTMs\n\n[![CircleCI](https://circleci.com/gh/ASVLeipzig/cor-asv-ann.svg?style=svg)](https://circleci.com/gh/ASVLeipzig/cor-asv-ann)\n\n## Introduction\n\nThis is a tool for automatic OCR _post-correction_ (reducing optical character recognition errors) with recurrent neural networks. It uses sequence-to-sequence transduction on the _character level_ with a model architecture akin to neural machine translation, i.e. a stacked **encoder-decoder** network with attention mechanism. \n\nThe **attention model** always applies to full lines (in a _global_ configuration), and uses a linear _additive_ alignment model. (This transfers information between the encoder and decoder hidden layer states, and calculates a _soft alignment_ between input and output characters. It is imperative for character-level processing, because with a simple final-initial transfer, models tend to start \"forgetting\" the input altogether at some point in the line and behave like unconditional LM generators.)\n\n...FIXME: mention: \n- stacked architecture (with bidirectional bottom and attentional top), configurable depth/width\n- weight tying\n- underspecification and gap\n- confidence input and alternative input\n- CPU/GPU option\n- incremental training, LM transfer, shallow transfer\n- evaluation (CER, PPL)\n\n### Processing PAGE annotations\n\nWhen applied on PAGE-XML (as OCR-D workspace processor), this component also allows processing below the `TextLine` hierarchy level, i.e. on `Word` or `Glyph` level. For that it uses the soft alignment scores to calculate an optimal hard alignment path for characters, and thereby distributes the transduction onto the lower level elements (keeping their coordinates and other meta-data), while changing Word segmentation if necessary.\n\n...\n\n### Architecture\n\n...FIXME: show!\n\n### Input with confidence and/or alternatives\n\n...FIXME: explain!\n\n### Multi-OCR input\n\nnot yet!\n\n### Modes\n\nWhile the _encoder_ can always be run in parallel over a batch of lines and by passing the full sequence of characters in one tensor (padded to the longest line in the batch), which is very efficient with Keras backends like Tensorflow, a **beam-search** _decoder_ requires passing initial/final states character-by-character, with parallelism employed to capture multiple history hypotheses of a single line. However, one can also **greedily** use the best output only for each position (without beam search). And in doing so, another option is to feed back the softmax output directly into the decoder input instead of its argmax unit vector. This effectively passes the full probability distribution from state to state, which (not very surprisingly) can increase correction accuracy quite a lot – it can get as good as a medium-sized beam search results. This latter option also allows to run in parallel again, which is also much faster – consuming up to ten times less CPU time.\n\nThererfore, the backend function `lib.Sequence2Sequence.correct_lines` can operate the encoder-decoder network in either of the following modes:\n\n#### _fast_\n\nDecode greedily, but feeding back the full softmax distribution in batch mode.\n\n#### _greedy_\n\nDecode greedily, but feeding back the argmax unit vectors for each line separately.\n\n#### _default_\n\nDecode beamed, feeding back the argmax unit vectors for the best history/output hypotheses of each line. More specifically:\n\n> Start decoder with start-of-sequence, then keep decoding until\n> end-of-sequence is found or output length is way off, repeatedly.\n> Decode by using the best predicted output characters and several next-best\n> alternatives (up to some degradation threshold) as next input.\n> Follow-up on the N best overall candidates (estimated by accumulated\n> score, normalized by length and prospective cost), i.e. do A*-like\n> breadth-first search, with N equal `batch_size`.\n> Pass decoder initial/final states from character to character,\n> for each candidate respectively.\n> Reserve 1 candidate per iteration for running through `source_seq`\n> (as a rejection fallback) to ensure that path does not fall off the\n> beam and at least one solution can be found within the search limits.\n\n### Evaluation\n\nText lines can be compared (by aligning and computing a distance under some metric) across multiple inputs. (This would typically be GT and OCR vs post-correction.) This can be done both on plain text files (`cor-asv-ann-eval`) and PAGE-XML annotations (`ocrd-cor-asv-ann-evaluate`). \n\nDistances are accumulated (as micro-averages) as character error rate (CER) mean and stddev, but only on the character level.\n\nThere are a number of distance metrics available (all operating on grapheme clusters, not mere codepoints):\n- `Levenshtein`:  \n  simple unweighted edit distance (fastest, standard; GT level 3)\n- `NFC`:  \n  like `Levenshtein`, but apply Unicode normal form with canonical composition before (i.e. less than GT level 2)\n- `NFKC`:  \n  like `Levenshtein`, but apply Unicode normal form with compatibility composition before (i.e. less than GT level 2, except for `ſ`, which is already normalized to `s`)\n- `historic_latin`:  \n  like `Levenshtein`, but decomposing non-vocalic ligatures before and treating as equivalent (i.e. zero distances) confusions of certain semantically close characters often found in historic texts (e.g. umlauts with combining letter `e` as in `Wuͤſte` instead of  to `Wüſte`, `ſ` vs `s`, or quotation/citation marks; GT level 1)\n\n\n## Installation\n\nRequired Ubuntu packages:\n\n* Python (``python`` or ``python3``)\n* pip (``python-pip`` or ``python3-pip``)\n* virtualenv (``python-venv`` or ``python3-venv``)\n\nCreate and activate a virtualenv as usual.\n\nTo install Python dependencies:\n```shell\nmake deps\n```\nWhich is the equivalent of:\n```shell\npip install -r requirements.txt\n```\n\nTo install this module, then do:\n```shell\nmake install\n```\nWhich is the equivalent of:\n```shell\npip install .\n```\n\n## Usage\n\nThis packages has the following user interfaces:\n\n### command line interface `cor-asv-ann-train`\n\nTo be used with string arguments and plain-text files.\n\n...\n\n### command line interface `cor-asv-ann-eval`\n\nTo be used with string arguments and plain-text files.\n\n...\n\n### command line interface `cor-asv-ann-repl`\n\ninteractive\n\n...\n\n### [OCR-D processor](https://github.com/OCR-D/core) interface `ocrd-cor-asv-ann-process`\n\nTo be used with [PageXML](https://www.primaresearch.org/tools/PAGELibraries) documents in an [OCR-D](https://github.com/OCR-D/spec/) annotation workflow. Input could be anything with a textual annotation (`TextEquiv` on the given `textequiv_level`). \n\n...\n\n```json\n    \"ocrd-cor-asv-ann-process\": {\n      \"executable\": \"ocrd-cor-asv-ann-process\",\n      \"categories\": [\n        \"Text recognition and optimization\"\n      ],\n      \"steps\": [\n        \"recognition/post-correction\"\n      ],\n      \"description\": \"Improve text annotation by character-level encoder-attention-decoder ANN model\",\n      \"input_file_grp\": [\n        \"OCR-D-OCR-TESS\",\n        \"OCR-D-OCR-KRAK\",\n        \"OCR-D-OCR-OCRO\",\n        \"OCR-D-OCR-CALA\",\n        \"OCR-D-OCR-ANY\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-COR-ASV\"\n      ],\n      \"parameters\": {\n        \"model_file\": {\n          \"type\": \"string\",\n          \"format\": \"uri\",\n          \"content-type\": \"application/x-hdf;subtype=bag\",\n          \"description\": \"path of h5py weight/config file for model trained with cor-asv-ann-train\",\n          \"required\": true,\n          \"cacheable\": true\n        },\n        \"textequiv_level\": {\n          \"type\": \"string\",\n          \"enum\": [\"line\", \"word\", \"glyph\"],\n          \"default\": \"glyph\",\n          \"description\": \"PAGE XML hierarchy level to read/write TextEquiv input/output on\"\n        }\n      }\n    }\n```\n\n...\n\n### [OCR-D processor](https://github.com/OCR-D/core) interface `ocrd-cor-asv-ann-evaluate`\n\nTo be used with [PageXML](https://www.primaresearch.org/tools/PAGELibraries) documents in an [OCR-D](https://github.com/OCR-D/spec/) annotation workflow. Inputs could be anything with a textual annotation (`TextEquiv` on the line level), but at least 2. The first in the list of input file groups will be regarded as reference/GT.\n\n...\n\n```json\n    \"ocrd-cor-asv-ann-evaluate\": {\n      \"executable\": \"ocrd-cor-asv-ann-evaluate\",\n      \"categories\": [\n        \"Text recognition and optimization\"\n      ],\n      \"steps\": [\n        \"recognition/evaluation\"\n      ],\n      \"description\": \"Align different textline annotations and compute distance\",\n      \"parameters\": {\n        \"metric\": {\n          \"type\": \"string\",\n          \"enum\": [\"Levenshtein\", \"NFC\", \"NFKC\", \"historic_latin\"],\n          \"default\": \"Levenshtein\",\n          \"description\": \"Distance metric to calculate and aggregate: historic_latin for GT level 1, NFKC for GT level 2 (except ſ-s), Levenshtein for GT level 3\"\n        }\n      }\n    }\n```\n\n...\n\n## Testing\n\nnot yet!\n...\n", "ocrd-tool.json"=>"{\n  \"git_url\": \"https://github.com/ASVLeipzig/cor-asv-ann\",\n  \"version\": \"0.1.2\",\n  \"tools\": {\n    \"ocrd-cor-asv-ann-process\": {\n      \"executable\": \"ocrd-cor-asv-ann-process\",\n      \"categories\": [\n        \"Text recognition and optimization\"\n      ],\n      \"steps\": [\n        \"recognition/post-correction\"\n      ],\n      \"description\": \"Improve text annotation by character-level encoder-attention-decoder ANN model\",\n      \"input_file_grp\": [\n        \"OCR-D-OCR-TESS\",\n        \"OCR-D-OCR-KRAK\",\n        \"OCR-D-OCR-OCRO\",\n        \"OCR-D-OCR-CALA\",\n        \"OCR-D-OCR-ANY\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-COR-ASV\"\n      ],\n      \"parameters\": {\n        \"model_file\": {\n          \"type\": \"string\",\n          \"format\": \"uri\",\n          \"content-type\": \"application/x-hdf;subtype=bag\",\n          \"description\": \"path of h5py weight/config file for model trained with cor-asv-ann-train\",\n          \"required\": true,\n          \"cacheable\": true\n        },\n        \"textequiv_level\": {\n          \"type\": \"string\",\n          \"enum\": [\"line\", \"word\", \"glyph\"],\n          \"default\": \"glyph\",\n          \"description\": \"PAGE XML hierarchy level to read/write TextEquiv input/output on\"\n        }\n      }\n    },\n    \"ocrd-cor-asv-ann-evaluate\": {\n      \"executable\": \"ocrd-cor-asv-ann-evaluate\",\n      \"categories\": [\n        \"Text recognition and optimization\"\n      ],\n      \"steps\": [\n        \"recognition/evaluation\"\n      ],\n      \"description\": \"Align different textline annotations and compute distance\",\n      \"parameters\": {\n        \"metric\": {\n          \"type\": \"string\",\n          \"enum\": [\"Levenshtein\", \"NFC\", \"NFKC\", \"historic_latin\"],\n          \"default\": \"Levenshtein\",\n          \"description\": \"Distance metric to calculate and aggregate: historic_latin for GT level 1, NFKC for GT level 2 (except ſ-s), Levenshtein for GT level 3\"\n        },\n        \"confusion\": {\n          \"type\": \"number\",\n          \"format\": \"integer\",\n          \"minimum\": 0,\n          \"default\": 0,\n          \"description\": \"Count edits and show that number of most frequent confusions (non-identity) in the end.\"\n        }\n      }\n    }\n  }\n}\n", "setup.py"=>"# -*- coding: utf-8 -*-\n\"\"\"\nInstalls:\n    - cor-asv-ann-train\n    - cor-asv-ann-eval\n    - cor-asv-ann-repl\n    - ocrd-cor-asv-ann-process\n    - ocrd-cor-asv-ann-evaluate\n\"\"\"\nimport codecs\n\nfrom setuptools import setup, find_packages\n\ninstall_requires = open('requirements.txt').read().split('\\n')\n\nwith codecs.open('README.md', encoding='utf-8') as f:\n    README = f.read()\n\nsetup(\n    name='ocrd_cor_asv_ann',\n    version='0.1.2',\n    description='sequence-to-sequence translator for noisy channel error correction',\n    long_description=README,\n    author='Robert Sachunsky',\n    author_email='sachunsky@informatik.uni-leipzig.de',\n    url='https://github.com/ASVLeipzig/cor-asv-ann',\n    license='Apache License 2.0',\n    packages=find_packages(exclude=('tests', 'docs')),\n    install_requires=install_requires,\n    package_data={\n        '': ['*.json', '*.yml', '*.yaml'],\n    },\n    entry_points={\n        'console_scripts': [\n            'cor-asv-ann-train=ocrd_cor_asv_ann.scripts.train:cli',\n            'cor-asv-ann-eval=ocrd_cor_asv_ann.scripts.eval:cli',\n            'cor-asv-ann-repl=ocrd_cor_asv_ann.scripts.repl:cli',\n            'ocrd-cor-asv-ann-process=ocrd_cor_asv_ann.wrapper.cli:ocrd_cor_asv_ann_process',\n            'ocrd-cor-asv-ann-evaluate=ocrd_cor_asv_ann.wrapper.cli:ocrd_cor_asv_ann_evaluate',\n        ]\n    },\n)\n"}, "git"=>{"last_commit"=>"Fri Jan 24 00:58:56 2020 +0100", "latest_tag"=>"", "number_of_commits"=>"49", "url"=>"https://github.com/ASVLeipzig/cor-asv-ann.git"}, "name"=>"cor-asv-ann", "ocrd_tool"=>{"git_url"=>"https://github.com/ASVLeipzig/cor-asv-ann", "tools"=>{"ocrd-cor-asv-ann-evaluate"=>{"categories"=>["Text recognition and optimization"], "description"=>"Align different textline annotations and compute distance", "executable"=>"ocrd-cor-asv-ann-evaluate", "parameters"=>{"confusion"=>{"default"=>0, "description"=>"Count edits and show that number of most frequent confusions (non-identity) in the end.", "format"=>"integer", "minimum"=>0, "type"=>"number"}, "metric"=>{"default"=>"Levenshtein", "description"=>"Distance metric to calculate and aggregate: historic_latin for GT level 1, NFKC for GT level 2 (except ſ-s), Levenshtein for GT level 3", "enum"=>["Levenshtein", "NFC", "NFKC", "historic_latin"], "type"=>"string"}}, "steps"=>["recognition/evaluation"]}, "ocrd-cor-asv-ann-process"=>{"categories"=>["Text recognition and optimization"], "description"=>"Improve text annotation by character-level encoder-attention-decoder ANN model", "executable"=>"ocrd-cor-asv-ann-process", "input_file_grp"=>["OCR-D-OCR-TESS", "OCR-D-OCR-KRAK", "OCR-D-OCR-OCRO", "OCR-D-OCR-CALA", "OCR-D-OCR-ANY"], "output_file_grp"=>["OCR-D-COR-ASV"], "parameters"=>{"model_file"=>{"cacheable"=>true, "content-type"=>"application/x-hdf;subtype=bag", "description"=>"path of h5py weight/config file for model trained with cor-asv-ann-train", "format"=>"uri", "required"=>true, "type"=>"string"}, "textequiv_level"=>{"default"=>"glyph", "description"=>"PAGE XML hierarchy level to read/write TextEquiv input/output on", "enum"=>["line", "word", "glyph"], "type"=>"string"}}, "steps"=>["recognition/post-correction"]}}, "version"=>"0.1.2"}, "ocrd_tool_validate"=>"<report valid=\"false\">\n  [tools.ocrd-cor-asv-ann-evaluate] 'input_file_grp' is a required property\n  [tools.ocrd-cor-asv-ann-evaluate.parameters.confusion] Additional properties are not allowed ('minimum' was unexpected)\n  [tools.ocrd-cor-asv-ann-evaluate.steps.0] 'recognition/evaluation' is not one of ['preprocessing/characterization', 'preprocessing/optimization', 'preprocessing/optimization/cropping', 'preprocessing/optimization/deskewing', 'preprocessing/optimization/despeckling', 'preprocessing/optimization/dewarping', 'preprocessing/optimization/binarization', 'preprocessing/optimization/grayscale_normalization', 'recognition/text-recognition', 'recognition/font-identification', 'recognition/post-correction', 'layout/segmentation', 'layout/segmentation/text-nontext', 'layout/segmentation/region', 'layout/segmentation/line', 'layout/segmentation/word', 'layout/segmentation/classification', 'layout/analysis']\n</report>", "official"=>true, "org_plus_name"=>"ASVLeipzig/cor-asv-ann", "python"=>{"author"=>"Robert Sachunsky", "author-email"=>"sachunsky@informatik.uni-leipzig.de", "name"=>"ocrd_cor_asv_ann", "pypi"=>nil, "url"=>"https://github.com/ASVLeipzig/cor-asv-ann"}, "url"=>"https://github.com/ASVLeipzig/cor-asv-ann"}

cor-asv-fst

{"compliant_cli"=>true, "files"=>{"Dockerfile"=>nil, "README.md"=>"# cor-asv-fst\n    OCR post-correction with error/lexicon Finite State Transducers and\n    chararacter-level LSTM language models\n\n## Introduction\n\n\n## Installation\n\nRequired Ubuntu packages:\n\n* Python (``python`` or ``python3``)\n* pip (``python-pip`` or ``python3-pip``)\n* virtualenv (``python-virtualenv`` or ``python3-virtualenv``)\n\nCreate and activate a virtualenv as usual.\n\nTo install Python dependencies and this module, then do:\n```shell\nmake deps install\n```\nWhich is the equivalent of:\n```shell\npip install -r requirements.txt\npip install -e .\n```\n\nIn addition to the requirements listed in `requirements.txt`, the tool\nrequires the\n[pynini](http://www.opengrm.org/twiki/bin/view/GRM/Pynini)\nlibrary, which has to be installed from source.\n\n## Usage\n\nThe package has two user interfaces:\n\n### Command Line Interface\n\nThe package contains a suite of CLI tools to work with plaintext data (prefix:\n`cor-asv-fst-*`). The minimal working examples and data formats are described\nbelow. Additionally, each tool has further optional parameters - for a detailed\ndescription, call the tool with the `--help` option.\n\n#### `cor-asv-fst-train`\n\nTrain FST models. The basic invocation is as follows:\n\n```shell\ncor-asv-fst-train -l LEXICON_FILE -e ERROR_MODEL_FILE -t TRAINING_FILE\n```\n\nThis will create two transducers, which will be stored in `LEXICON_FILE` and\n`ERROR_MODEL_FILE`, respectively. As the training of the lexicon and the error\nmodel is done independently, any of them can be skipped by omitting the\nrespective parameter.\n\n`TRAINING_FILE` is a plain text file in tab-separated, two-column format\ncontaining a line of OCR-output and the corresponding ground truth line:\n\n```\n» Bergebt mir, daß ih niht weiß, wie\t»Vergebt mir, daß ich nicht weiß, wie\naus dem (Geiſte aller Nationen Mahrunq\taus dem Geiſte aller Nationen Nahrung\nKannſt Du mir die re<hée Bahn niché zeigen ?\tKannſt Du mir die rechte Bahn nicht zeigen?\nfrag zu bringen. —\ttrag zu bringen. —\nſie ins irdij<he Leben hinein, Mit leichtem,\tſie ins irdiſche Leben hinein. Mit leichtem,\n```\n\nEach line is treated independently. Alternatively to the above, the training\ndata may also be supplied as two files:\n\n```shell\ncor-asv-fst-train -l LEXICON_FILE -e ERROR_MODEL_FILE -i INPUT_FILE -g GT_FILE\n```\n\nIn this variant, `INPUT_FILE` and `GT_FILE` are both in tab-separated,\ntwo-column format, in which the first column is the line ID and the second the\nline:\n\n```\n>=== INPUT_FILE ===<\nalexis_ruhe01_1852_0018_022     ih denke. Aber was die ſelige Frau Geheimräth1n\nalexis_ruhe01_1852_0035_019     „Das fann ich niht, c’esl absolument impos-\nalexis_ruhe01_1852_0087_027     rend. In dem Augenbli> war 1hr niht wohl zu\nalexis_ruhe01_1852_0099_012     ür die fle ſich ſchlugen.“\nalexis_ruhe01_1852_0147_009     ſollte. Nur Über die Familien, wo man ſie einführen\n\n>=== GT_FILE ===<\nalexis_ruhe01_1852_0018_022     ich denke. Aber was die ſelige Frau Geheimräthin\nalexis_ruhe01_1852_0035_019     „Das kann ich nicht, c'est absolument impos—\nalexis_ruhe01_1852_0087_027     rend. Jn dem Augenblick war ihr nicht wohl zu\nalexis_ruhe01_1852_0099_012     für die ſie ſich ſchlugen.“\nalexis_ruhe01_1852_0147_009     ſollte. Nur über die Familien, wo man ſie einführen\n```\n\n#### `cor-asv-fst-process`\n\nThis tool applies a trained model to correct plaintext data on a line basis.\nThe basic invocation is:\n\n```shell\ncor-asv-fst-process -i INPUT_FILE -o OUTPUT_FILE -l LEXICON_FILE -e ERROR_MODEL_FILE (-m LM_FILE)\n```\n\n`INPUT_FILE` is in the same format as for the training procedure. `OUTPUT_FILE`\ncontains the post-correction results in the same format.\n\n`LM_FILE` is a `ocrd_keraslm` language model - if supplied, it is used for\nrescoring.\n\n#### `cor-asv-fst-evaluate`\n\nThis tool can be used to evaluate the post-correction results. The minimal\nworking invocation is:\n\n```shell\ncor-asv-fst-evaluate -i INPUT_FILE -o OUTPUT_FILE -g GT_FILE\n```\n\nAdditionally, the parameter `-M` can be used to select the evaluation measure\n(`Levenshtein` by default). The files should be in the same two-column format\nas described above.\n\n### [OCR-D processor](https://ocr-d.github.io/cli) interface `ocrd-cor-asv-fst-process`\n\nTo be used with [PageXML](https://github.com/PRImA-Research-Lab/PAGE-XML)\ndocuments in an [OCR-D](https://ocr-d.github.io) annotation workflow.\nInput files need a textual annotation (`TextEquiv`) on the given\n`textequiv_level` (currently _only_ `word`!).\n\n...\n\n```json\n  \"tools\": {\n    \"cor-asv-fst-process\": {\n      \"executable\": \"cor-asv-fst-process\",\n      \"categories\": [\n        \"Text recognition and optimization\"\n      ],\n      \"steps\": [\n        \"recognition/post-correction\"\n      ],\n      \"description\": \"Improve text annotation by FST error and lexicon model with character-level LSTM language model\",\n      \"input_file_grp\": [\n        \"OCR-D-OCR-TESS\",\n        \"OCR-D-OCR-KRAK\",\n        \"OCR-D-OCR-OCRO\",\n        \"OCR-D-OCR-CALA\",\n        \"OCR-D-OCR-ANY\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-COR-ASV\"\n      ],\n      \"parameters\": {\n        \"textequiv_level\": {\n          \"type\": \"string\",\n          \"enum\": [\"word\"],\n          \"default\": \"word\",\n          \"description\": \"PAGE XML hierarchy level to read TextEquiv input on (output will always be word level)\"\n        },\n        \"errorfst_file\": {\n          \"type\": \"string\",\n          \"format\": \"uri\",\n          \"content-type\": \"application/vnd.openfst\",\n          \"description\": \"path of FST file for error model\",\n          \"required\": true,\n          \"cacheable\": true\n        },\n        \"lexiconfst_file\": {\n          \"type\": \"string\",\n          \"format\": \"uri\",\n          \"content-type\": \"application/vnd.openfst\",\n          \"description\": \"path of FST file for lexicon model\",\n          \"required\": true,\n          \"cacheable\": true\n        },\n        \"pruning_weight\": {\n          \"type\": \"number\",\n          \"format\": \"float\",\n          \"description\": \"transition weight for pruning the hypotheses in each word window FST\",\n          \"default\": 5.0\n        },\n        \"rejection_weight\": {\n          \"type\": \"number\",\n          \"format\": \"float\",\n          \"description\": \"transition weight (per character) for unchanged input in each word window FST\",\n          \"default\": 1.5\n        },\n        \"keraslm_file\": {\n          \"type\": \"string\",\n          \"format\": \"uri\",\n          \"content-type\": \"application/x-hdf;subtype=bag\",\n          \"description\": \"path of h5py weight/config file for language model trained with keraslm\",\n          \"required\": true,\n          \"cacheable\": true\n        },\n        \"beam_width\": {\n          \"type\": \"number\",\n          \"format\": \"integer\",\n          \"description\": \"maximum number of best partial paths to consider during beam search in language modelling\",\n          \"default\": 100\n        },\n        \"lm_weight\": {\n          \"type\": \"number\",\n          \"format\": \"float\",\n          \"description\": \"share of the LM scores over the FST output confidences\",\n          \"default\": 0.5\n        }\n      }\n    }\n  }\n```\n\n...\n\n## Testing\n\n...\n", "ocrd-tool.json"=>"{\n  \"git_url\": \"https://github.com/ASVLeipzig/cor-asv-fst\",\n  \"version\": \"0.1.1\",\n  \"tools\": {\n    \"ocrd-cor-asv-fst-process\": {\n      \"executable\": \"ocrd-cor-asv-fst-process\",\n      \"categories\": [\n        \"Text recognition and optimization\"\n      ],\n      \"steps\": [\n        \"recognition/post-correction\"\n      ],\n      \"description\": \"Improve text annotation by FST error and lexicon model with character-level LSTM language model\",\n      \"input_file_grp\": [\n        \"OCR-D-OCR-TESS\",\n        \"OCR-D-OCR-KRAK\",\n        \"OCR-D-OCR-OCRO\",\n        \"OCR-D-OCR-CALA\",\n        \"OCR-D-OCR-ANY\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-COR-ASV\"\n      ],\n      \"parameters\": {\n        \"textequiv_level\": {\n          \"type\": \"string\",\n          \"enum\": [\"word\"],\n          \"default\": \"word\",\n          \"description\": \"PAGE XML hierarchy level to read TextEquiv input on (output will always be word level)\"\n        },\n        \"errorfst_file\": {\n          \"type\": \"string\",\n          \"format\": \"uri\",\n          \"content-type\": \"application/vnd.openfst\",\n          \"description\": \"path of FST file for error model\",\n          \"required\": true,\n          \"cacheable\": true\n        },\n        \"lexiconfst_file\": {\n          \"type\": \"string\",\n          \"format\": \"uri\",\n          \"content-type\": \"application/vnd.openfst\",\n          \"description\": \"path of FST file for lexicon model\",\n          \"required\": true,\n          \"cacheable\": true\n        },\n        \"pruning_weight\": {\n          \"type\": \"number\",\n          \"format\": \"float\",\n          \"description\": \"transition weight for pruning the hypotheses in each word window FST\",\n          \"default\": 5.0\n        },\n        \"rejection_weight\": {\n          \"type\": \"number\",\n          \"format\": \"float\",\n          \"description\": \"transition weight (per character) for unchanged input in each word window FST\",\n          \"default\": 1.5\n        },\n        \"keraslm_file\": {\n          \"type\": \"string\",\n          \"format\": \"uri\",\n          \"content-type\": \"application/x-hdf;subtype=bag\",\n          \"description\": \"path of h5py weight/config file for language model trained with keraslm\",\n          \"required\": true,\n          \"cacheable\": true\n        },\n        \"beam_width\": {\n          \"type\": \"number\",\n          \"format\": \"integer\",\n          \"description\": \"maximum number of best partial paths to consider during beam search in language modelling\",\n          \"default\": 100\n        },\n        \"lm_weight\": {\n          \"type\": \"number\",\n          \"format\": \"float\",\n          \"description\": \"share of the LM scores over the FST output confidences\",\n          \"default\": 0.5\n        }\n      }\n    }\n  }\n}\n", "setup.py"=>"# -*- coding: utf-8 -*-\n\"\"\"\nInstalls:\n    - cor-asv-fst-train\n    - cor-asv-fst-process\n    - cor-asv-fst-evaluate\n    - ocrd-cor-asv-fst-process\n\"\"\"\nimport codecs\n\nfrom setuptools import setup, find_packages\n\ninstall_requires = open('requirements.txt').read().split('\\n')\n\nwith codecs.open('README.md', encoding='utf-8') as f:\n    README = f.read()\n\nsetup(\n    name='ocrd_cor_asv_fst',\n    version='0.2.0',\n    description='OCR post-correction with error/lexicon Finite State '\n                'Transducers and character-level LSTMs',\n    long_description=README,\n    long_description_content_type='text/markdown',\n    author='Maciej Sumalvico, Robert Sachunsky',\n    author_email='sumalvico@informatik.uni-leipzig.de, '\n                 'sachunsky@informatik.uni-leipzig.de',\n    url='https://github.com/ASVLeipzig/cor-asv-fst',\n    license='Apache License 2.0',\n    packages=find_packages(exclude=('tests', 'docs')),\n    install_requires=install_requires,\n    package_data={\n        '': ['*.json', '*.yml', '*.yaml'],\n    },\n    test_suite='tests',\n    entry_points={\n        'console_scripts': [\n            'cor-asv-fst-train=ocrd_cor_asv_fst.scripts.train:main',\n            'cor-asv-fst-process=ocrd_cor_asv_fst.scripts.process:main',\n            'cor-asv-fst-evaluate=ocrd_cor_asv_fst.scripts.evaluate:main',\n            'ocrd-cor-asv-fst-process=ocrd_cor_asv_fst.wrapper.cli:ocrd_cor_asv_fst',\n        ]\n    }\n)\n"}, "git"=>{"last_commit"=>"Wed Jan 8 17:54:58 2020 +0100", "latest_tag"=>"", "number_of_commits"=>"178", "url"=>"https://github.com/ASVLeipzig/cor-asv-fst.git"}, "name"=>"cor-asv-fst", "ocrd_tool"=>{"git_url"=>"https://github.com/ASVLeipzig/cor-asv-fst", "tools"=>{"ocrd-cor-asv-fst-process"=>{"categories"=>["Text recognition and optimization"], "description"=>"Improve text annotation by FST error and lexicon model with character-level LSTM language model", "executable"=>"ocrd-cor-asv-fst-process", "input_file_grp"=>["OCR-D-OCR-TESS", "OCR-D-OCR-KRAK", "OCR-D-OCR-OCRO", "OCR-D-OCR-CALA", "OCR-D-OCR-ANY"], "output_file_grp"=>["OCR-D-COR-ASV"], "parameters"=>{"beam_width"=>{"default"=>100, "description"=>"maximum number of best partial paths to consider during beam search in language modelling", "format"=>"integer", "type"=>"number"}, "errorfst_file"=>{"cacheable"=>true, "content-type"=>"application/vnd.openfst", "description"=>"path of FST file for error model", "format"=>"uri", "required"=>true, "type"=>"string"}, "keraslm_file"=>{"cacheable"=>true, "content-type"=>"application/x-hdf;subtype=bag", "description"=>"path of h5py weight/config file for language model trained with keraslm", "format"=>"uri", "required"=>true, "type"=>"string"}, "lexiconfst_file"=>{"cacheable"=>true, "content-type"=>"application/vnd.openfst", "description"=>"path of FST file for lexicon model", "format"=>"uri", "required"=>true, "type"=>"string"}, "lm_weight"=>{"default"=>0.5, "description"=>"share of the LM scores over the FST output confidences", "format"=>"float", "type"=>"number"}, "pruning_weight"=>{"default"=>5.0, "description"=>"transition weight for pruning the hypotheses in each word window FST", "format"=>"float", "type"=>"number"}, "rejection_weight"=>{"default"=>1.5, "description"=>"transition weight (per character) for unchanged input in each word window FST", "format"=>"float", "type"=>"number"}, "textequiv_level"=>{"default"=>"word", "description"=>"PAGE XML hierarchy level to read TextEquiv input on (output will always be word level)", "enum"=>["word"], "type"=>"string"}}, "steps"=>["recognition/post-correction"]}}, "version"=>"0.1.1"}, "ocrd_tool_validate"=>"<report valid=\"true\">\n</report>", "official"=>true, "org_plus_name"=>"ASVLeipzig/cor-asv-fst", "python"=>{"author"=>"Maciej Sumalvico, Robert Sachunsky", "author-email"=>"sumalvico@informatik.uni-leipzig.de, sachunsky@informatik.uni-leipzig.de", "name"=>"ocrd_cor_asv_fst", "pypi"=>nil, "url"=>"https://github.com/ASVLeipzig/cor-asv-fst"}, "url"=>"https://github.com/ASVLeipzig/cor-asv-fst"}

ocrd_calamari

{"compliant_cli"=>true, "files"=>{"Dockerfile"=>"FROM ocrd/core:edge\nMAINTAINER OCR-D\nENV DEBIAN_FRONTEND noninteractive\nENV PYTHONIOENCODING utf8\nENV LC_ALL C.UTF-8\nENV LANG C.UTF-8\n\nWORKDIR /build\nCOPY Makefile .\nCOPY setup.py .\nCOPY ocrd-tool.json .\nCOPY requirements.txt .\nCOPY ocrd_calamari ocrd_calamari\n\nRUN make calamari/build\nRUN pip3 install .\n\nENTRYPOINT [\"/usr/local/bin/ocrd-calamari-recognize\"]\n\n", "README.md"=>"# ocrd_calamari\n\n> Recognize text using [Calamari OCR](https://github.com/Calamari-OCR/calamari).\n\n[![image](https://circleci.com/gh/OCR-D/ocrd_calamari.svg?style=svg)](https://circleci.com/gh/OCR-D/ocrd_calamari)\n[![image](https://img.shields.io/pypi/v/ocrd_calamari.svg)](https://pypi.org/project/ocrd_calamari/)\n[![image](https://codecov.io/gh/OCR-D/ocrd_calamari/branch/master/graph/badge.svg)](https://codecov.io/gh/OCR-D/ocrd_calamari)\n\n## Introduction\n\nThis offers a OCR-D compliant workspace processor for some of the functionality of Calamari OCR.\n\nThis processor only operates on the text line level and so needs a line segmentation (and by extension a binarized \nimage) as its input.\n\n## Installation\n\n### From PyPI\n\n```\npip install ocrd_calamari\n```\n\n### From Repo\n\n```sh\npip install .\n```\n\n## Install models\n\nDownload models trained on GT4HistOCR data:\n\n```\nmake gt4histocr-calamari\nls gt4histocr-calamari\n```\n\n## Example Usage\n\n~~~\nocrd-calamari-recognize -p test-parameters.json -m mets.xml -I OCR-D-SEG-LINE -O OCR-D-OCR-CALAMARI\n~~~\n\nWith `test-parameters.json`:\n~~~\n{\n    \"checkpoint\": \"/path/to/some/trained/models/*.ckpt.json\"\n}\n~~~\n\n## Development & Testing\nFor information regarding development and testing, please see\n[README-DEV.md](README-DEV.md).\n", "ocrd-tool.json"=>"{\n  \"git_url\": \"https://github.com/kba/ocrd_calamari\",\n  \"version\": \"0.0.3\",\n  \"tools\": {\n    \"ocrd-calamari-recognize\": {\n      \"executable\": \"ocrd-calamari-recognize\",\n      \"categories\": [\n        \"Text recognition and optimization\"\n      ],\n      \"steps\": [\n        \"recognition/text-recognition\"\n      ],\n      \"description\": \"Recognize lines with Calamari\",\n      \"input_file_grp\": [\n        \"OCR-D-SEG-LINE\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-OCR-CALAMARI\"\n      ],\n      \"parameters\": {\n        \"checkpoint\": {\n          \"description\": \"The calamari model files (*.ckpt.json)\",\n          \"type\": \"string\", \"format\": \"file\", \"cacheable\": true\n        },\n        \"voter\": {\n          \"description\": \"The voting algorithm to use\",\n          \"type\": \"string\", \"default\": \"confidence_voter_default_ctc\"\n        }\n      }\n    }\n  }\n}\n", "setup.py"=>"# -*- coding: utf-8 -*-\nfrom pathlib import Path\n\nfrom setuptools import setup, find_packages\n\nsetup(\n    name='ocrd_calamari',\n    version='0.0.3',\n    description='Calamari bindings',\n    long_description=Path('README.md').read_text(),\n    long_description_content_type='text/markdown',\n    author='Konstantin Baierer, Mike Gerber',\n    author_email='unixprog@gmail.com, mike.gerber@sbb.spk-berlin.de',\n    url='https://github.com/kba/ocrd_calamari',\n    license='Apache License 2.0',\n    packages=find_packages(exclude=('tests', 'docs')),\n    install_requires=Path('requirements.txt').read_text().split('\\n'),\n    package_data={\n        '': ['*.json', '*.yml', '*.yaml'],\n    },\n    entry_points={\n        'console_scripts': [\n            'ocrd-calamari-recognize=ocrd_calamari.cli:ocrd_calamari_recognize',\n        ]\n    },\n)\n"}, "git"=>{"last_commit"=>"Thu Jan 16 16:14:13 2020 +0100", "latest_tag"=>"v0.0.3", "number_of_commits"=>"84", "url"=>"https://github.com/OCR-D/ocrd_calamari.git"}, "name"=>"ocrd_calamari", "ocrd_tool"=>{"git_url"=>"https://github.com/kba/ocrd_calamari", "tools"=>{"ocrd-calamari-recognize"=>{"categories"=>["Text recognition and optimization"], "description"=>"Recognize lines with Calamari", "executable"=>"ocrd-calamari-recognize", "input_file_grp"=>["OCR-D-SEG-LINE"], "output_file_grp"=>["OCR-D-OCR-CALAMARI"], "parameters"=>{"checkpoint"=>{"cacheable"=>true, "description"=>"The calamari model files (*.ckpt.json)", "format"=>"file", "type"=>"string"}, "voter"=>{"default"=>"confidence_voter_default_ctc", "description"=>"The voting algorithm to use", "type"=>"string"}}, "steps"=>["recognition/text-recognition"]}}, "version"=>"0.0.3"}, "ocrd_tool_validate"=>"<report valid=\"true\">\n</report>", "official"=>false, "org_plus_name"=>"OCR-D/ocrd_calamari", "python"=>{"author"=>"Konstantin Baierer, Mike Gerber", "author-email"=>"unixprog@gmail.com, mike.gerber@sbb.spk-berlin.de", "name"=>"ocrd_calamari", "pypi"=>{"info"=>{"author"=>"Konstantin Baierer, Mike Gerber", "author_email"=>"unixprog@gmail.com, mike.gerber@sbb.spk-berlin.de", "bugtrack_url"=>nil, "classifiers"=>[], "description"=>"# ocrd_calamari\n\nRecognize text using [Calamari OCR](https://github.com/Calamari-OCR/calamari).\n\n## Introduction\n\nThis offers a OCR-D compliant workspace processor for some of the functionality of Calamari OCR.\n\nThis processor only operates on the text line level and so needs a line segmentation (and by extension a binarized \nimage) as its input.\n\n## Example Usage\n\n```sh\nocrd-calamari-recognize -p test-parameters.json -m mets.xml -I OCR-D-SEG-LINE -O OCR-D-OCR-CALAMARI\n```\n\nWith `test-parameters.json`:\n\n```json\n{\n    \"checkpoint\": \"/path/to/some/trained/models/*.ckpt.json\"\n}\n```\n\nTODO\n----\n\n* Support Calamari's \"extended prediction data\" output\n* Currently, the processor only supports a prediction using confidence voting of multiple models. While this is\n  superior, it makes sense to support single model prediction, too.\n\n\n", "description_content_type"=>"text/markdown", "docs_url"=>nil, "download_url"=>"", "downloads"=>{"last_day"=>-1, "last_month"=>-1, "last_week"=>-1}, "home_page"=>"https://github.com/kba/ocrd_calamari", "keywords"=>"", "license"=>"Apache License 2.0", "maintainer"=>"", "maintainer_email"=>"", "name"=>"ocrd-calamari", "package_url"=>"https://pypi.org/project/ocrd-calamari/", "platform"=>"", "project_url"=>"https://pypi.org/project/ocrd-calamari/", "project_urls"=>{"Homepage"=>"https://github.com/kba/ocrd_calamari"}, "release_url"=>"https://pypi.org/project/ocrd-calamari/0.0.3/", "requires_dist"=>["numpy", "tensorflow-gpu (==1.14.0)", "calamari-ocr (==0.3.5)", "setuptools (>=41.0.0)", "click", "ocrd (>=1.0.0b11)"], "requires_python"=>"", "summary"=>"Calamari bindings", "version"=>"0.0.3"}, "last_serial"=>6229919, "releases"=>{"0.0.1"=>[{"comment_text"=>"", "digests"=>{"md5"=>"a247c6638d77f7590453855f8414a97b", "sha256"=>"cf08ec027390519d465f6be861e5672b48e7b39b3d1f8e13e54cb401034355b6"}, "downloads"=>-1, "filename"=>"ocrd_calamari-0.0.1-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"a247c6638d77f7590453855f8414a97b", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>9320, "upload_time"=>"2019-10-26T20:18:11", "upload_time_iso_8601"=>"2019-10-26T20:18:11.044376Z", "url"=>"https://files.pythonhosted.org/packages/30/62/d8efee35233443d444fc49f7f89792979234c1d735285d599f989e63cee1/ocrd_calamari-0.0.1-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"1daa1956ba64485b65d9d69a149dcb6a", "sha256"=>"51a09088d677799258d8c796dbaba8a1b44a318d06c060314499f708fa37bdd4"}, "downloads"=>-1, "filename"=>"ocrd_calamari-0.0.1.tar.gz", "has_sig"=>false, "md5_digest"=>"1daa1956ba64485b65d9d69a149dcb6a", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>3884, "upload_time"=>"2019-10-26T20:18:13", "upload_time_iso_8601"=>"2019-10-26T20:18:13.643406Z", "url"=>"https://files.pythonhosted.org/packages/46/1a/b5f02d113aa7810cb773f0b586d1202c254d22e4bf3c6b829d937da2c1b0/ocrd_calamari-0.0.1.tar.gz"}], "0.0.2"=>[{"comment_text"=>"", "digests"=>{"md5"=>"24e8cab9e429576704a02890f6ebffb2", "sha256"=>"454164c6b1c063b76c5189ae596115499bffd6e944c896dee3b03f08852f5680"}, "downloads"=>-1, "filename"=>"ocrd_calamari-0.0.2-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"24e8cab9e429576704a02890f6ebffb2", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>5247, "upload_time"=>"2019-12-02T12:22:56", "upload_time_iso_8601"=>"2019-12-02T12:22:56.460224Z", "url"=>"https://files.pythonhosted.org/packages/39/53/c05186a309284a22d4f1f0399a5fb241d7b11fb0e5b94c33fa8ae229a6fc/ocrd_calamari-0.0.2-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"7a101d8f9626784f9e54af6dad37179d", "sha256"=>"39e0f5b334a735fb8fa20e5490dcd07a96a620bc785c8e2b31f64a23fa13a6fe"}, "downloads"=>-1, "filename"=>"ocrd_calamari-0.0.2.tar.gz", "has_sig"=>false, "md5_digest"=>"7a101d8f9626784f9e54af6dad37179d", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>3952, "upload_time"=>"2019-12-02T12:22:57", "upload_time_iso_8601"=>"2019-12-02T12:22:57.972949Z", "url"=>"https://files.pythonhosted.org/packages/9d/cc/de53bfd3c2b666cab5ef199c93902c85bb83ee03d923e9ef7abe87377857/ocrd_calamari-0.0.2.tar.gz"}], "0.0.3"=>[{"comment_text"=>"", "digests"=>{"md5"=>"7bb2ae998a57e2301011073fd532445e", "sha256"=>"4b6e0be66b0fdd9f64f5f02e8aac952c1e77f78b39fc4ed9c90f8c9f9a117967"}, "downloads"=>-1, "filename"=>"ocrd_calamari-0.0.3-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"7bb2ae998a57e2301011073fd532445e", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>9384, "upload_time"=>"2019-12-02T17:28:38", "upload_time_iso_8601"=>"2019-12-02T17:28:38.092102Z", "url"=>"https://files.pythonhosted.org/packages/23/85/34b1b520bd8ad7688915d5844caf20e89435fd17a3489963ceec14c06f14/ocrd_calamari-0.0.3-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"8a435811e11f37b47eec5a5f8a433e99", "sha256"=>"e57cea7935340bcf090e62642a38aa41b0bf68d31afe95ba9e42a18be53ca80d"}, "downloads"=>-1, "filename"=>"ocrd_calamari-0.0.3.tar.gz", "has_sig"=>false, "md5_digest"=>"8a435811e11f37b47eec5a5f8a433e99", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>3909, "upload_time"=>"2019-12-02T17:28:39", "upload_time_iso_8601"=>"2019-12-02T17:28:39.643369Z", "url"=>"https://files.pythonhosted.org/packages/32/15/e01d70177d89e9d0c0ec07ea8a2a31194f46154758788af781724c5b3354/ocrd_calamari-0.0.3.tar.gz"}]}, "urls"=>[{"comment_text"=>"", "digests"=>{"md5"=>"7bb2ae998a57e2301011073fd532445e", "sha256"=>"4b6e0be66b0fdd9f64f5f02e8aac952c1e77f78b39fc4ed9c90f8c9f9a117967"}, "downloads"=>-1, "filename"=>"ocrd_calamari-0.0.3-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"7bb2ae998a57e2301011073fd532445e", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>9384, "upload_time"=>"2019-12-02T17:28:38", "upload_time_iso_8601"=>"2019-12-02T17:28:38.092102Z", "url"=>"https://files.pythonhosted.org/packages/23/85/34b1b520bd8ad7688915d5844caf20e89435fd17a3489963ceec14c06f14/ocrd_calamari-0.0.3-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"8a435811e11f37b47eec5a5f8a433e99", "sha256"=>"e57cea7935340bcf090e62642a38aa41b0bf68d31afe95ba9e42a18be53ca80d"}, "downloads"=>-1, "filename"=>"ocrd_calamari-0.0.3.tar.gz", "has_sig"=>false, "md5_digest"=>"8a435811e11f37b47eec5a5f8a433e99", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>3909, "upload_time"=>"2019-12-02T17:28:39", "upload_time_iso_8601"=>"2019-12-02T17:28:39.643369Z", "url"=>"https://files.pythonhosted.org/packages/32/15/e01d70177d89e9d0c0ec07ea8a2a31194f46154758788af781724c5b3354/ocrd_calamari-0.0.3.tar.gz"}]}, "url"=>"https://github.com/kba/ocrd_calamari"}, "url"=>"https://github.com/OCR-D/ocrd_calamari"}

ocrd_im6convert

{"compliant_cli"=>true, "files"=>{"Dockerfile"=>"FROM ocrd/core\nMAINTAINER OCR-D\nENV DEBIAN_FRONTEND noninteractive\n\nENV PREFIX=/usr/local\n\nWORKDIR /build\nCOPY ocrd-im6convert .\nCOPY ocrd-tool.json .\nCOPY Makefile .\n\nRUN apt-get update && \\\n    apt-get -y install apt-utils && \\\n    apt-get -y install --no-install-recommends \\\n    ca-certificates \\\n    make\n\nRUN make deps-ubuntu install\n\nENV DEBIAN_FRONTEND teletype\n\n# no fixed entrypoint (e.g. also allow `convert` etc)\nCMD [\"/usr/local/bin/ocrd-im6convert\", \"--help\"]\n", "README.md"=>"# ocrd_imageconvert\n\n> Thin wrapper around convert(1)\n\n## Introduction\n\n[ImageMagick's](https://imagemagick.org) `convert` CLI contains a treasure trove of image operations. This wrapper aims to provide much of that as an [OCR-D compliant processor](https://ocr-d.github.io/CLI).\n\n## Installation\n\nThis module requires GNU make (for installation) and the ImageMagick command line tools (at runtime). On Ubuntu 18.04 (or similar), you can install them by running:\n\n    sudo apt-get install make\n    sudo make deps-ubuntu # or: apt-get install imagemagick\n\nMoreover, an installation of [OCR-D core](https://github.com/OCR-D/core) is needed:\n\n    make deps # or: pip install ocrd\n\nThis will install the Python package `ocrd` in your current environment. (Setting up a [venv](https://ocr-d.github.io/docs/guide#python-setup) is strongly recommended.)\n\nLastly, the provided shell script `ocrd-im6convert` works best when copied into your `PATH`, referencing its ocrd-tool.json under a known path. This can be done by running:\n\n    make install\n\nThis will copy the binary and JSON file under `$PREFIX`, which variable you can override to your needs. The default value is to use `PREFIX=$VIRTUAL_ENV` if you have already activated a venv, or `PREFIX=$PWD/.local` (i.e. under the current working directory).\n\n## Usage\n\nThis package provides `ocrd-im6convert` as a [OCR-D processor](https://ocr-d.github.com/cli) (command line interface). It uses the following parameters:\n\n```JSON\n    \"ocrd-im6convert\": {\n      \"executable\": \"ocrd-im6convert\",\n      \"categories\": [\"Image preprocessing\"],\n      \"steps\": [\"preprocessing/optimization\"],\n      \"description\": \"Convert and transform images\",\n      \"input_file_grp\": [\n        \"OCR-D-IMG\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-IMG\"\n      ],\n      \"parameters\": {\n        \"input-options\": {\n          \"type\": \"string\",\n          \"description\": \"e.g. -density 600x600 -wavelet-denoise 1%x0.1\",\n          \"default\": \"\"\n        },\n        \"output-format\": {\n          \"type\": \"string\",\n          \"description\": \"Desired media type of output\",\n          \"required\": true,\n          \"enum\": [\"image/tiff\", \"image/jp2\", \"image/png\"]\n        },\n        \"output-options\": {\n          \"type\": \"string\",\n          \"description\": \"e.g. -resample 300x300 -alpha deactivate -normalize -despeckle -noise 2 -negate -morphology close diamond\",\n          \"default\": \"\"\n        }\n      }\n    }\n```\n\nCf. [IM documentation](https://imagemagick.org/script/command-line-options.php) or man-page `convert(1)` for formats and options.\n\n### Example\n\n    ocrd-im6convert -I OCR-D-IMG -O OCR-D-IMG-SMALL -p '{ \"output-format\": \"image/png\", \"output-options\": \"-resize 24%\" }'\n\n(This downscales the images in the input file group `OCR-D-IMG` to 24% and stores them as PNG files under the output file group `OCR-D-IMG-SMALL`.)\n\n## Testing\n\nNone yet\n", "ocrd-tool.json"=>"{\n  \"git_url\": \"https://github.com/OCR-D/ocrd_im6convert\",\n  \"version\": \"0.0.2\",\n  \"tools\": {\n\n    \"ocrd-im6convert\": {\n      \"executable\": \"ocrd-im6convert\",\n      \"categories\": [\"Image preprocessing\"],\n      \"steps\": [\"preprocessing/optimization\"],\n      \"description\": \"Convert and transform images\",\n      \"input_file_grp\": [\n        \"OCR-D-IMG\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-IMG\"\n      ],\n      \"parameters\": {\n        \"input-options\": {\n          \"type\": \"string\",\n          \"description\": \"e.g. -density 600x600 -wavelet-denoise 1%x0.1\",\n          \"default\": \"\"\n        },\n        \"output-format\": {\n          \"type\": \"string\",\n          \"description\": \"Desired media type of output\",\n          \"required\": true,\n          \"enum\": [\"image/tiff\", \"image/jp2\", \"image/png\"]\n        },\n        \"output-options\": {\n          \"type\": \"string\",\n          \"description\": \"e.g. -resample 300x300 -alpha deactivate -normalize -despeckle -noise 2 -negate -morphology close diamond\",\n          \"default\": \"\"\n        }\n      }\n    }\n\n  }\n}\n", "setup.py"=>nil}, "git"=>{"last_commit"=>"Fri Dec 27 13:38:58 2019 +0100", "latest_tag"=>"v0.0.2", "number_of_commits"=>"27", "url"=>"https://github.com/OCR-D/ocrd_im6convert.git"}, "name"=>"ocrd_im6convert", "ocrd_tool"=>{"git_url"=>"https://github.com/OCR-D/ocrd_im6convert", "tools"=>{"ocrd-im6convert"=>{"categories"=>["Image preprocessing"], "description"=>"Convert and transform images", "executable"=>"ocrd-im6convert", "input_file_grp"=>["OCR-D-IMG"], "output_file_grp"=>["OCR-D-IMG"], "parameters"=>{"input-options"=>{"default"=>"", "description"=>"e.g. -density 600x600 -wavelet-denoise 1%x0.1", "type"=>"string"}, "output-format"=>{"description"=>"Desired media type of output", "enum"=>["image/tiff", "image/jp2", "image/png"], "required"=>true, "type"=>"string"}, "output-options"=>{"default"=>"", "description"=>"e.g. -resample 300x300 -alpha deactivate -normalize -despeckle -noise 2 -negate -morphology close diamond", "type"=>"string"}}, "steps"=>["preprocessing/optimization"]}}, "version"=>"0.0.2"}, "ocrd_tool_validate"=>"<report valid=\"true\">\n</report>", "official"=>false, "org_plus_name"=>"OCR-D/ocrd_im6convert", "url"=>"https://github.com/OCR-D/ocrd_im6convert"}

ocrd_keraslm

{"compliant_cli"=>true, "files"=>{"Dockerfile"=>nil, "README.md"=>"# ocrd_keraslm\n    character-level language modelling using Keras\n\n[![CircleCI](https://circleci.com/gh/OCR-D/ocrd_keraslm.svg?style=svg)](https://circleci.com/gh/OCR-D/ocrd_keraslm)\n\n## Introduction\n\nThis is a tool for statistical _language modelling_ (predicting text from context) with recurrent neural networks. It models probabilities not on the word level but the _character level_ so as to allow open vocabulary processing (avoiding morphology, historic orthography and word segmentation problems). It manages a vocabulary of mapped characters, which can be easily extended by training on more text. Above that, unmapped characters are treated with underspecification.\n\nIn addition to character sequences, (meta-data) context variables can be configured as extra input. \n\n### Architecture\n\nThe model consists of:\n\n0. an input layer: characters are represented as indexes from the vocabulary mapping, in windows of a number `length` of characters,\n1. a character embedding layer: window sequences are converted into dense vectors by looking up the indexes in an embedding weight matrix,\n2. a context embedding layer: context variables are converted into dense vectors by looking up the indexes in an embedding weight matrix, \n3. character and context vector sequences are concatenated,\n4. a number `depth` of hidden layers: each with a number `width` of hidden recurrent units of _LSTM cells_ (Long Short-term Memory) connected on top of each other,\n5. an output layer derived from the transposed character embedding matrix (weight tying): hidden activations are projected linearly to vectors of dimensionality equal to the character vocabulary size, then softmax is applied returning a probability for each possible value of the next character, respectively.\n\n![model graph depiction](model-graph.png \"graph with 1 context variable\")\n\nThe model is trained by feeding windows of text in index representation to the input layer, calculating output and comparing it to the same text shifted backward by 1 character, and represented as unit vectors (\"one-hot coding\") as target. The loss is calculated as the (unweighted) cross-entropy between target and output. Backpropagation yields error gradients for each layer, which is used to iteratively update the weights (stochastic gradient descent).\n\nThis is implemented in [Keras](https://keras.io) with [Tensorflow](https://www.tensorflow.org/) as backend. It automatically uses a fast CUDA-optimized LSTM implementation (Nividia GPU and Tensorflow installation with GPU support, see below), both in learning and in prediction phase, if available.\n\n\n### Modes of operation\n\nNotably, this model (by default) runs _statefully_, i.e. by implicitly passing hidden state from one window (batch of samples) to the next. That way, the context available for predictions can be arbitrarily long (above `length`, e.g. the complete document up to that point), or short (below `length`, e.g. at the start of a text). (However, this is a passive perspective above `length`, because errors are never back-propagated any further in time during gradient-descent training.) This is favourable to stateless mode because all characters can be output in parallel, and no partial windows need to be presented during training (which slows down).\n\nBesides stateful mode, the model can also be run _incrementally_, i.e. by explicitly passing hidden state from the caller. That way, multiple alternative hypotheses can be processed together. This is used for generation (sampling from the model) and alternative decoding (finding the best path through a sequence of alternatives).\n\n### Context conditioning\n\nEvery text has meta-data like time, author, text type, genre, production features (e.g. print vs typewriter vs digital born rich text, OCR version), language, structural element (e.g. title vs heading vs paragraph vs footer vs marginalia), font family (e.g. Antiqua vs Fraktura) and font shape (e.g. bold vs letter-spaced vs italic vs normal) etc. \n\nThis information (however noisy) can be very useful to facilitate stochastic modelling, since language has an extreme diversity and complexity. To that end, models can be conditioned on extra inputs here, termed _context variables_. The model learns to represent these high-dimensional discrete values as low-dimensional continuous vectors (embeddings), also entering the recurrent hidden layers (as a form of simple additive adaptation).\n\n### Underspecification\n\nIndex zero is reserved for unmapped characters (unseen contexts). During training, its embedding vector is regularised to occupy a center position of all mapped characters (all other contexts), and the hidden layers get to see it every now and then by random degradation. At runtime, therefore, some unknown character (some unknown context) represented as zero does not disturb follow-up predictions too much.\n\n\n## Installation\n\nRequired Ubuntu packages:\n\n* Python (``python`` or ``python3``)\n* pip (``python-pip`` or ``python3-pip``)\n* virtualenv (``python-virtualenv`` or ``python3-virtualenv``)\n\nCreate and activate a virtualenv as usual.\n\nIf you need a custom version of ``keras`` or ``tensorflow`` (like [GPU support](https://www.tensorflow.org/install/install_sources)), install them via `pip` now.\n\nTo install Python dependencies and this module, then do:\n```shell\nmake deps install\n```\nWhich is the equivalent of:\n```shell\npip install -r requirements.txt\npip install -e .\n```\n\nUseful environment variables are:\n- ``TF_CPP_MIN_LOG_LEVEL`` (set to `1` to suppress most of Tensorflow's messages\n- ``CUDA_VISIBLE_DEVICES`` (set empty to force CPU even in a GPU installation)\n\n\n## Usage\n\nThis packages has two user interfaces:\n\n### command line interface `keraslm-rate`\n\nTo be used with string arguments and plain-text files.\n\n```shell\nUsage: keraslm-rate [OPTIONS] COMMAND [ARGS]...\n\nOptions:\n  --help  Show this message and exit.\n\nCommands:\n  train                           train a language model\n  test                            get overall perplexity from language model\n  apply                           get individual probabilities from language model\n  generate                        sample characters from language model\n  print-charset                   Print the mapped characters\n  prune-charset                   Delete one character from mapping\n  plot-char-embeddings-similarity\n                                  Paint a heat map of character embeddings\n  plot-context-embeddings-similarity\n                                  Paint a heat map of context embeddings\n  plot-context-embeddings-projection\n                                  Paint a 2-d PCA projection of context embeddings\n```\n\nExamples:\n```shell\nkeraslm-rate train --width 64 --depth 4 --length 256 --model model_dta_64_4_256.h5 dta_komplett_2017-09-01/txt/*.tcf.txt\nkeraslm-rate generate -m model_dta_64_4_256.h5 --number 6 \"für die Wiſſen\"\nkeraslm-rate apply -m model_dta_64_4_256.h5 \"so schädlich ist es Borkickheile zu pflanzen\"\nkeraslm-rate test -m model_dta_64_4_256.h5 dta_komplett_2017-09-01/txt/grimm_*.tcf.txt\n```\n\n### [OCR-D processor](https://github.com/OCR-D/core) interface `ocrd-keraslm-rate`\n\nTo be used with [PageXML](https://www.primaresearch.org/tools/PAGELibraries) documents in an [OCR-D](https://github.com/OCR-D/spec/) annotation workflow. Input could be anything with a textual annotation (`TextEquiv` on the given `textequiv_level`). The LM rater could be used for both quality control (without alternative decoding, using only each first index `TextEquiv`) and part of post-correction (with `alternative_decoding=True`, finding the best path among `TextEquiv` indexes).\n\n```json\n  \"tools\": {\n    \"ocrd-keraslm-rate\": {\n      \"executable\": \"ocrd-keraslm-rate\",\n      \"categories\": [\n        \"Text recognition and optimization\"\n      ],\n      \"steps\": [\n        \"recognition/text-recognition\"\n      ],\n      \"description\": \"Rate elements of the text with a character-level LSTM language model in Keras\",\n      \"input_file_grp\": [\n        \"OCR-D-OCR-TESS\",\n        \"OCR-D-OCR-KRAK\",\n        \"OCR-D-OCR-OCRO\",\n        \"OCR-D-OCR-CALA\",\n        \"OCR-D-OCR-ANY\",\n        \"OCR-D-COR-CIS\",\n        \"OCR-D-COR-ASV\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-COR-LM\"\n      ],\n      \"parameters\": {\n        \"model_file\": {\n          \"type\": \"string\",\n          \"format\": \"uri\",\n          \"content-type\": \"application/x-hdf;subtype=bag\",\n          \"description\": \"path of h5py weight/config file for model trained with keraslm\",\n          \"required\": true,\n          \"cacheable\": true\n        },\n        \"textequiv_level\": {\n          \"type\": \"string\",\n          \"enum\": [\"region\", \"line\", \"word\", \"glyph\"],\n          \"default\": \"glyph\",\n          \"description\": \"PAGE XML hierarchy level to evaluate TextEquiv sequences on\"\n        },\n        \"alternative_decoding\": {\n          \"type\": \"boolean\",\n          \"description\": \"whether to process all TextEquiv alternatives, finding the best path via beam search, and delete each non-best alternative\",\n          \"default\": true\n        },\n        \"beam_width\": {\n          \"type\": \"number\",\n          \"format\": \"integer\",\n          \"description\": \"maximum number of best partial paths to consider during search with alternative_decoding\",\n          \"default\": 100\n        }\n      }\n    }\n  }\n```\n\nExamples:\n```shell\nmake deps-test # installs ocrd_tesserocr\nmake test/assets # downloads GT, imports PageXML, builds workspaces\nocrd workspace clone -a test/assets/kant_aufklaerung_1784/mets.xml ws1\ncd ws1\nocrd-tesserocr-segment-region -I OCR-D-IMG -O OCR-D-SEG-BLOCK\nocrd-tesserocr-segment-line -I OCR-D-SEG-BLOCK -O OCR-D-SEG-LINE\nocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS-WORD -p '{ \"textequiv_level\" : \"word\", \"model\" : \"Fraktur\" }'\nocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS-GLYPH -p '{ \"textequiv_level\" : \"glyph\", \"model\" : \"deu-frak\" }'\n# get confidences and perplexity:\nocrd-keraslm-rate -I OCR-D-OCR-TESS-WORD -O OCR-D-OCR-LM-WORD -p '{ \"model_file\": \"model_dta_64_4_256.h5\", \"textequiv_level\": \"word\", \"alternative_decoding\": false }'\n# also get best path:\nocrd-keraslm-rate -I OCR-D-OCR-TESS-GLYPH -O OCR-D-OCR-LM-GLYPH -p '{ \"model_file\": \"model_dta_64_4_256.h5\", \"textequiv_level\": \"glyph\", \"alternative_decoding\": true, \"beam_width\": 10 }'\n```\n\n## Testing\n\n```shell\nmake deps-test test\n```\nWhich is the equivalent of:\n```shell\npip install -r requirements_test.txt\ntest -e test/assets || test/prepare_gt.bash test/assets\ntest -f model_dta_test.h5 || keraslm-rate train -m model_dta_test.h5 test/assets/*.txt\nkeraslm-rate test -m model_dta_test.h5 test/assets/*.txt\npython -m pytest test $(PYTEST_ARGS)\n```\n\nSet `PYTEST_ARGS=\"-s --verbose\"` to see log output (`-s`) and individual test results (`--verbose`).\n", "ocrd-tool.json"=>"{\n  \"git_url\": \"https://github.com/OCR-D/ocrd_keraslm\",\n  \"version\": \"0.3.1\",\n  \"tools\": {\n    \"ocrd-keraslm-rate\": {\n      \"executable\": \"ocrd-keraslm-rate\",\n      \"categories\": [\n        \"Text recognition and optimization\"\n      ],\n      \"steps\": [\n        \"recognition/text-recognition\"\n      ],\n      \"description\": \"Rate elements of the text with a character-level LSTM language model in Keras\",\n      \"input_file_grp\": [\n        \"OCR-D-OCR-TESS\",\n        \"OCR-D-OCR-KRAK\",\n        \"OCR-D-OCR-OCRO\",\n        \"OCR-D-OCR-CALA\",\n        \"OCR-D-OCR-ANY\",\n        \"OCR-D-COR-CIS\",\n        \"OCR-D-COR-ASV\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-COR-LM\"\n      ],\n      \"parameters\": {\n        \"model_file\": {\n          \"type\": \"string\",\n          \"format\": \"uri\",\n          \"content-type\": \"application/x-hdf;subtype=bag\",\n          \"description\": \"path of h5py weight/config file for model trained with keraslm\",\n          \"required\": true,\n          \"cacheable\": true\n        },\n        \"textequiv_level\": {\n          \"type\": \"string\",\n          \"enum\": [\"region\", \"line\", \"word\", \"glyph\"],\n          \"default\": \"glyph\",\n          \"description\": \"PAGE XML hierarchy level to evaluate TextEquiv sequences on\"\n        },\n        \"alternative_decoding\": {\n          \"type\": \"boolean\",\n          \"description\": \"whether to process all TextEquiv alternatives, finding the best path via beam search, and delete each non-best alternative\",\n          \"default\": true\n        },\n        \"beam_width\": {\n          \"type\": \"number\",\n          \"format\": \"integer\",\n          \"description\": \"maximum number of best partial paths to consider during search with alternative_decoding\",\n          \"default\": 10\n        },\n        \"lm_weight\": {\n          \"type\": \"number\",\n          \"format\": \"float\",\n          \"description\": \"share of the LM scores over the input confidences\",\n          \"default\": 0.5\n        }\n      }\n    }\n  }\n}\n", "setup.py"=>"# -*- coding: utf-8 -*-\n\"\"\"\nInstalls:\n    - keraslm-rate\n    - ocrd-keraslm-rate\n\"\"\"\nimport codecs\n\nfrom setuptools import setup, find_packages\n\nwith codecs.open('README.md', encoding='utf-8') as f:\n    README = f.read()\n\nsetup(\n    name='ocrd_keraslm',\n    version='0.3.2',\n    description='character-level language modelling in Keras',\n    long_description=README,\n    long_description_content_type='text/markdown',\n    author='Robert Sachunsky, Konstantin Baierer, Kay-Michael Würzner',\n    author_email='sachunsky@informatik.uni-leipzig.de, unixprog@gmail.com, wuerzner@gmail.com',\n    url='https://github.com/OCR-D/ocrd_keraslm',\n    license='Apache License 2.0',\n    packages=find_packages(exclude=('tests', 'docs')),\n    install_requires=open('requirements.txt').read().split('\\n'),\n    extras_require={\n        'plotting': [\n            'sklearn',\n            'matplotlib',\n            ]\n    },\n    package_data={\n        '': ['*.json', '*.yml', '*.yaml'],\n    },\n    entry_points={\n        'console_scripts': [\n            'keraslm-rate=ocrd_keraslm.scripts.run:cli',\n            'ocrd-keraslm-rate=ocrd_keraslm.wrapper.cli:ocrd_keraslm_rate',\n        ]\n    },\n)\n"}, "git"=>{"last_commit"=>"Thu Jan 9 10:13:52 2020 +0100", "latest_tag"=>"0.3.1", "number_of_commits"=>"91", "url"=>"https://github.com/OCR-D/ocrd_keraslm.git"}, "name"=>"ocrd_keraslm", "ocrd_tool"=>{"git_url"=>"https://github.com/OCR-D/ocrd_keraslm", "tools"=>{"ocrd-keraslm-rate"=>{"categories"=>["Text recognition and optimization"], "description"=>"Rate elements of the text with a character-level LSTM language model in Keras", "executable"=>"ocrd-keraslm-rate", "input_file_grp"=>["OCR-D-OCR-TESS", "OCR-D-OCR-KRAK", "OCR-D-OCR-OCRO", "OCR-D-OCR-CALA", "OCR-D-OCR-ANY", "OCR-D-COR-CIS", "OCR-D-COR-ASV"], "output_file_grp"=>["OCR-D-COR-LM"], "parameters"=>{"alternative_decoding"=>{"default"=>true, "description"=>"whether to process all TextEquiv alternatives, finding the best path via beam search, and delete each non-best alternative", "type"=>"boolean"}, "beam_width"=>{"default"=>10, "description"=>"maximum number of best partial paths to consider during search with alternative_decoding", "format"=>"integer", "type"=>"number"}, "lm_weight"=>{"default"=>0.5, "description"=>"share of the LM scores over the input confidences", "format"=>"float", "type"=>"number"}, "model_file"=>{"cacheable"=>true, "content-type"=>"application/x-hdf;subtype=bag", "description"=>"path of h5py weight/config file for model trained with keraslm", "format"=>"uri", "required"=>true, "type"=>"string"}, "textequiv_level"=>{"default"=>"glyph", "description"=>"PAGE XML hierarchy level to evaluate TextEquiv sequences on", "enum"=>["region", "line", "word", "glyph"], "type"=>"string"}}, "steps"=>["recognition/text-recognition"]}}, "version"=>"0.3.1"}, "ocrd_tool_validate"=>"<report valid=\"true\">\n</report>", "official"=>false, "org_plus_name"=>"OCR-D/ocrd_keraslm", "python"=>{"author"=>"Robert Sachunsky, Konstantin Baierer, Kay-Michael Würzner", "author-email"=>"sachunsky@informatik.uni-leipzig.de, unixprog@gmail.com, wuerzner@gmail.com", "name"=>"ocrd_keraslm", "pypi"=>{"info"=>{"author"=>"Robert Sachunsky, Konstantin Baierer, Kay-Michael Würzner", "author_email"=>"sachunsky@informatik.uni-leipzig.de, unixprog@gmail.com, wuerzner@gmail.com", "bugtrack_url"=>nil, "classifiers"=>[], "description"=>"# ocrd_keraslm\n    character-level language modelling using Keras\n\n\n## Introduction\n\nThis is a tool for statistical _language modelling_ (predicting text from context) with recurrent neural networks. It models probabilities not on the word level but the _character level_ so as to allow open vocabulary processing (avoiding morphology, historic orthography and word segmentation problems). It manages a vocabulary of mapped characters, which can be easily extended by training on more text. Above that, unmapped characters are treated with underspecification.\n\nIn addition to character sequences, (meta-data) context variables can be configured as extra input. \n\n### Architecture\n\nThe model consists of:\n\n0. an input layer: characters are represented as indexes from the vocabulary mapping, in windows of a number `length` of characters,\n1. a character embedding layer: window sequences are converted into dense vectors by looking up the indexes in an embedding weight matrix,\n2. a context embedding layer: context variables are converted into dense vectors by looking up the indexes in an embedding weight matrix, \n3. character and context vector sequences are concatenated,\n4. a number `depth` of hidden layers: each with a number `width` of hidden recurrent units of _LSTM cells_ (Long Short-term Memory) connected on top of each other,\n5. an output layer derived from the transposed character embedding matrix (weight tying): hidden activations are projected linearly to vectors of dimensionality equal to the character vocabulary size, then softmax is applied returning a probability for each possible value of the next character, respectively.\n\n![model graph depiction](model-graph.png \"graph with 1 context variable\")\n\nThe model is trained by feeding windows of text in index representation to the input layer, calculating output and comparing it to the same text shifted backward by 1 character, and represented as unit vectors (\"one-hot coding\") as target. The loss is calculated as the (unweighted) cross-entropy between target and output. Backpropagation yields error gradients for each layer, which is used to iteratively update the weights (stochastic gradient descent).\n\nThis is implemented in [Keras](https://keras.io) with [Tensorflow](https://www.tensorflow.org/) as backend. It automatically uses a fast CUDA-optimized LSTM implementation (Nividia GPU and Tensorflow installation with GPU support, see below), both in learning and in prediction phase, if available.\n\n\n### Modes of operation\n\nNotably, this model (by default) runs _statefully_, i.e. by implicitly passing hidden state from one window (batch of samples) to the next. That way, the context available for predictions can be arbitrarily long (above `length`, e.g. the complete document up to that point), or short (below `length`, e.g. at the start of a text). (However, this is a passive perspective above `length`, because errors are never back-propagated any further in time during gradient-descent training.) This is favourable to stateless mode because all characters can be output in parallel, and no partial windows need to be presented during training (which slows down).\n\nBesides stateful mode, the model can also be run _incrementally_, i.e. by explicitly passing hidden state from the caller. That way, multiple alternative hypotheses can be processed together. This is used for generation (sampling from the model) and alternative decoding (finding the best path through a sequence of alternatives).\n\n### Context conditioning\n\nEvery text has meta-data like time, author, text type, genre, production features (e.g. print vs typewriter vs digital born rich text, OCR version), language, structural element (e.g. title vs heading vs paragraph vs footer vs marginalia), font family (e.g. Antiqua vs Fraktura) and font shape (e.g. bold vs letter-spaced vs italic vs normal) etc. \n\nThis information (however noisy) can be very useful to facilitate stochastic modelling, since language has an extreme diversity and complexity. To that end, models can be conditioned on extra inputs here, termed _context variables_. The model learns to represent these high-dimensional discrete values as low-dimensional continuous vectors (embeddings), also entering the recurrent hidden layers (as a form of simple additive adaptation).\n\n### Underspecification\n\nIndex zero is reserved for unmapped characters (unseen contexts). During training, its embedding vector is regularised to occupy a center position of all mapped characters (all other contexts), and the hidden layers get to see it every now and then by random degradation. At runtime, therefore, some unknown character (some unknown context) represented as zero does not disturb follow-up predictions too much.\n\n\n## Installation\n\nRequired Ubuntu packages:\n\n* Python (``python`` or ``python3``)\n* pip (``python-pip`` or ``python3-pip``)\n* virtualenv (``python-virtualenv`` or ``python3-virtualenv``)\n\nCreate and activate a virtualenv as usual.\n\nIf you need a custom version of ``keras`` or ``tensorflow`` (like [GPU support](https://www.tensorflow.org/install/install_sources)), install them via `pip` now.\n\nTo install Python dependencies and this module, then do:\n```shell\nmake deps install\n```\nWhich is the equivalent of:\n```shell\npip install -r requirements.txt\npip install -e .\n```\n\nUseful environment variables are:\n- ``TF_CPP_MIN_LOG_LEVEL`` (set to `1` to suppress most of Tensorflow's messages\n- ``CUDA_VISIBLE_DEVICES`` (set empty to force CPU even in a GPU installation)\n\n\n## Usage\n\nThis packages has two user interfaces:\n\n### command line interface `keraslm-rate`\n\nTo be used with string arguments and plain-text files.\n\n```shell\nUsage: keraslm-rate [OPTIONS] COMMAND [ARGS]...\n\nOptions:\n  --help  Show this message and exit.\n\nCommands:\n  train                           train a language model\n  test                            get overall perplexity from language model\n  apply                           get individual probabilities from language model\n  generate                        sample characters from language model\n  print-charset                   Print the mapped characters\n  prune-charset                   Delete one character from mapping\n  plot-char-embeddings-similarity\n                                  Paint a heat map of character embeddings\n  plot-context-embeddings-similarity\n                                  Paint a heat map of context embeddings\n  plot-context-embeddings-projection\n                                  Paint a 2-d PCA projection of context embeddings\n```\n\nExamples:\n```shell\nkeraslm-rate train --width 64 --depth 4 --length 256 --model model_dta_64_4_256.h5 dta_komplett_2017-09-01/txt/*.tcf.txt\nkeraslm-rate generate -m model_dta_64_4_256.h5 --number 6 \"für die Wiſſen\"\nkeraslm-rate apply -m model_dta_64_4_256.h5 \"so schädlich ist es Borkickheile zu pflanzen\"\nkeraslm-rate test -m model_dta_64_4_256.h5 dta_komplett_2017-09-01/txt/grimm_*.tcf.txt\n```\n\n### [OCR-D processor](https://github.com/OCR-D/core) interface `ocrd-keraslm-rate`\n\nTo be used with [PageXML](https://www.primaresearch.org/tools/PAGELibraries) documents in an [OCR-D](https://github.com/OCR-D/spec/) annotation workflow. Input could be anything with a textual annotation (`TextEquiv` on the given `textequiv_level`). The LM rater could be used for both quality control (without alternative decoding, using only each first index `TextEquiv`) and part of post-correction (with `alternative_decoding=True`, finding the best path among `TextEquiv` indexes).\n\n```json\n  \"tools\": {\n    \"ocrd-keraslm-rate\": {\n      \"executable\": \"ocrd-keraslm-rate\",\n      \"categories\": [\n        \"Text recognition and optimization\"\n      ],\n      \"steps\": [\n        \"recognition/text-recognition\"\n      ],\n      \"description\": \"Rate elements of the text with a character-level LSTM language model in Keras\",\n      \"input_file_grp\": [\n        \"OCR-D-OCR-TESS\",\n        \"OCR-D-OCR-KRAK\",\n        \"OCR-D-OCR-OCRO\",\n        \"OCR-D-OCR-CALA\",\n        \"OCR-D-OCR-ANY\",\n        \"OCR-D-COR-CIS\",\n        \"OCR-D-COR-ASV\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-COR-LM\"\n      ],\n      \"parameters\": {\n        \"model_file\": {\n          \"type\": \"string\",\n          \"format\": \"uri\",\n          \"content-type\": \"application/x-hdf;subtype=bag\",\n          \"description\": \"path of h5py weight/config file for model trained with keraslm\",\n          \"required\": true,\n          \"cacheable\": true\n        },\n        \"textequiv_level\": {\n          \"type\": \"string\",\n          \"enum\": [\"region\", \"line\", \"word\", \"glyph\"],\n          \"default\": \"glyph\",\n          \"description\": \"PAGE XML hierarchy level to evaluate TextEquiv sequences on\"\n        },\n        \"alternative_decoding\": {\n          \"type\": \"boolean\",\n          \"description\": \"whether to process all TextEquiv alternatives, finding the best path via beam search, and delete each non-best alternative\",\n          \"default\": true\n        },\n        \"beam_width\": {\n          \"type\": \"number\",\n          \"format\": \"integer\",\n          \"description\": \"maximum number of best partial paths to consider during search with alternative_decoding\",\n          \"default\": 100\n        }\n      }\n    }\n  }\n```\n\nExamples:\n```shell\nmake deps-test # installs ocrd_tesserocr\nmake test/assets # downloads GT, imports PageXML, builds workspaces\nocrd workspace clone -a test/assets/kant_aufklaerung_1784/mets.xml ws1\ncd ws1\nocrd-tesserocr-segment-region -I OCR-D-IMG -O OCR-D-SEG-BLOCK\nocrd-tesserocr-segment-line -I OCR-D-SEG-BLOCK -O OCR-D-SEG-LINE\nocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS-WORD -p '{ \"textequiv_level\" : \"word\", \"model\" : \"Fraktur\" }'\nocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS-GLYPH -p '{ \"textequiv_level\" : \"glyph\", \"model\" : \"deu-frak\" }'\n# get confidences and perplexity:\nocrd-keraslm-rate -I OCR-D-OCR-TESS-WORD -O OCR-D-OCR-LM-WORD -p '{ \"model_file\": \"model_dta_64_4_256.h5\", \"textequiv_level\": \"word\", \"alternative_decoding\": false }'\n# also get best path:\nocrd-keraslm-rate -I OCR-D-OCR-TESS-GLYPH -O OCR-D-OCR-LM-GLYPH -p '{ \"model_file\": \"model_dta_64_4_256.h5\", \"textequiv_level\": \"glyph\", \"alternative_decoding\": true, \"beam_width\": 10 }'\n```\n\n## Testing\n\n```shell\nmake deps-test test\n```\nWhich is the equivalent of:\n```shell\npip install -r requirements_test.txt\ntest -e test/assets || test/prepare_gt.bash test/assets\ntest -f model_dta_test.h5 || keraslm-rate train -m model_dta_test.h5 test/assets/*.txt\nkeraslm-rate test -m model_dta_test.h5 test/assets/*.txt\npython -m pytest test $(PYTEST_ARGS)\n```\n\nSet `PYTEST_ARGS=\"-s --verbose\"` to see log output (`-s`) and individual test results (`--verbose`).\n\n\n", "description_content_type"=>"text/markdown", "docs_url"=>nil, "download_url"=>"", "downloads"=>{"last_day"=>-1, "last_month"=>-1, "last_week"=>-1}, "home_page"=>"https://github.com/OCR-D/ocrd_keraslm", "keywords"=>"", "license"=>"Apache License 2.0", "maintainer"=>"", "maintainer_email"=>"", "name"=>"ocrd-keraslm", "package_url"=>"https://pypi.org/project/ocrd-keraslm/", "platform"=>"", "project_url"=>"https://pypi.org/project/ocrd-keraslm/", "project_urls"=>{"Homepage"=>"https://github.com/OCR-D/ocrd_keraslm"}, "release_url"=>"https://pypi.org/project/ocrd-keraslm/0.3.2/", "requires_dist"=>["ocrd (>=2.0)", "click", "keras (>=2.2.4)", "numpy", "tensorflow (<2.0)", "h5py", "networkx (>=2.0)", "sklearn; extra == 'plotting'", "matplotlib; extra == 'plotting'"], "requires_python"=>"", "summary"=>"character-level language modelling in Keras", "version"=>"0.3.2"}, "last_serial"=>6158523, "releases"=>{"0.3.1"=>[{"comment_text"=>"", "digests"=>{"md5"=>"0da1139d7b62ee27b9bb3af2b4e38929", "sha256"=>"f3ec82a615434e90028722586c6123e4a1887e36b0a57f06566a291892280e88"}, "downloads"=>-1, "filename"=>"ocrd_keraslm-0.3.1-py2.py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"0da1139d7b62ee27b9bb3af2b4e38929", "packagetype"=>"bdist_wheel", "python_version"=>"py2.py3", "requires_python"=>nil, "size"=>34192, "upload_time"=>"2019-10-25T22:53:09", "upload_time_iso_8601"=>"2019-10-25T22:53:09.567407Z", "url"=>"https://files.pythonhosted.org/packages/eb/ba/8f5f0f1801ea99221c772357e2c79d9935a88e89873924e557e24aea6c33/ocrd_keraslm-0.3.1-py2.py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"e8d597a8dbf64e45dcbf19196e73bbf8", "sha256"=>"665a9bf1d7bc46f497d71638b2d33608062edd16ac11b9cff05be56eacda53c9"}, "downloads"=>-1, "filename"=>"ocrd_keraslm-0.3.1.tar.gz", "has_sig"=>false, "md5_digest"=>"e8d597a8dbf64e45dcbf19196e73bbf8", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>32287, "upload_time"=>"2019-10-25T22:53:12", "upload_time_iso_8601"=>"2019-10-25T22:53:12.437293Z", "url"=>"https://files.pythonhosted.org/packages/79/0e/744edc5497d706ac558b90d8d85b2e52ad5fb6b794c6f9cb44fc0aaa341a/ocrd_keraslm-0.3.1.tar.gz"}], "0.3.2"=>[{"comment_text"=>"", "digests"=>{"md5"=>"9e8927b5ca560a990cb924c7a01e7280", "sha256"=>"45c4af95f531e3a2c9528e401d368dad10e4b8f9cdba9a67ef6f816afc682d3b"}, "downloads"=>-1, "filename"=>"ocrd_keraslm-0.3.2-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"9e8927b5ca560a990cb924c7a01e7280", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>34190, "upload_time"=>"2019-11-18T22:03:01", "upload_time_iso_8601"=>"2019-11-18T22:03:01.036117Z", "url"=>"https://files.pythonhosted.org/packages/15/10/690a290322b84e6c4cba17dbff7e0fb570916810371b1b48020f75504d49/ocrd_keraslm-0.3.2-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"7eb11946732e6410d4ba18dad3fbaf20", "sha256"=>"ba56b149a68c9f351052e62cc247d4074514a66c5dee99e7ef6a78cca497e5e9"}, "downloads"=>-1, "filename"=>"ocrd_keraslm-0.3.2.tar.gz", "has_sig"=>false, "md5_digest"=>"7eb11946732e6410d4ba18dad3fbaf20", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>32294, "upload_time"=>"2019-11-18T22:03:06", "upload_time_iso_8601"=>"2019-11-18T22:03:06.384019Z", "url"=>"https://files.pythonhosted.org/packages/0e/75/b3875f685ba4d02c8cce12b86200e139617acde417fab40df2e462d85673/ocrd_keraslm-0.3.2.tar.gz"}]}, "urls"=>[{"comment_text"=>"", "digests"=>{"md5"=>"9e8927b5ca560a990cb924c7a01e7280", "sha256"=>"45c4af95f531e3a2c9528e401d368dad10e4b8f9cdba9a67ef6f816afc682d3b"}, "downloads"=>-1, "filename"=>"ocrd_keraslm-0.3.2-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"9e8927b5ca560a990cb924c7a01e7280", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>34190, "upload_time"=>"2019-11-18T22:03:01", "upload_time_iso_8601"=>"2019-11-18T22:03:01.036117Z", "url"=>"https://files.pythonhosted.org/packages/15/10/690a290322b84e6c4cba17dbff7e0fb570916810371b1b48020f75504d49/ocrd_keraslm-0.3.2-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"7eb11946732e6410d4ba18dad3fbaf20", "sha256"=>"ba56b149a68c9f351052e62cc247d4074514a66c5dee99e7ef6a78cca497e5e9"}, "downloads"=>-1, "filename"=>"ocrd_keraslm-0.3.2.tar.gz", "has_sig"=>false, "md5_digest"=>"7eb11946732e6410d4ba18dad3fbaf20", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>32294, "upload_time"=>"2019-11-18T22:03:06", "upload_time_iso_8601"=>"2019-11-18T22:03:06.384019Z", "url"=>"https://files.pythonhosted.org/packages/0e/75/b3875f685ba4d02c8cce12b86200e139617acde417fab40df2e462d85673/ocrd_keraslm-0.3.2.tar.gz"}]}, "url"=>"https://github.com/OCR-D/ocrd_keraslm"}, "url"=>"https://github.com/OCR-D/ocrd_keraslm"}

ocrd_kraken

{"compliant_cli"=>false, "files"=>{"Dockerfile"=>"FROM ocrd/core\nMAINTAINER OCR-D\nENV DEBIAN_FRONTEND noninteractive\nENV PYTHONIOENCODING utf8\nENV LC_ALL C.UTF-8\nENV LANG C.UTF-8\n\nWORKDIR /build-ocrd\nCOPY setup.py .\nCOPY requirements.txt .\nRUN apt-get update && \\\n    apt-get -y install --no-install-recommends \\\n    ca-certificates \\\n    make \\\n    git\nCOPY ocrd_kraken ./ocrd_kraken\nRUN pip3 install --upgrade pip\nRUN pip3 install .\n\nENTRYPOINT [\"/bin/sh\", \"-c\"]\n", "README.md"=>"# ocrd_kraken\n\n> Wrapper for the kraken OCR engine\n\n[![image](https://travis-ci.org/OCR-D/ocrd_kraken.svg?branch=master)](https://travis-ci.org/OCR-D/ocrd_kraken)\n[![Docker Automated build](https://img.shields.io/docker/automated/ocrd/kraken.svg)](https://hub.docker.com/r/ocrd/kraken/tags/)\n[![image](https://circleci.com/gh/OCR-D/ocrd_kraken.svg?style=svg)](https://circleci.com/gh/OCR-D/ocrd_kraken)\n", "ocrd-tool.json"=>"{\n  \"git_url\": \"https://github.com/OCR-D/ocrd_kraken\",\n  \"version\": \"0.0.2\",\n  \"tools\": {\n    \"ocrd-kraken-binarize\": {\n      \"executable\": \"ocrd-kraken-binarize\",\n      \"input_file_grp\": \"OCR-D-IMG\",\n      \"output_file_grp\": \"OCR-D-IMG-BIN\",\n      \"categories\": [\n        \"Image preprocessing\"\n      ],\n      \"steps\": [\n        \"preprocessing/optimization/binarization\"\n      ],\n      \"description\": \"Binarize images with kraken\",\n      \"parameters\": {\n        \"level-of-operation\": {\n          \"type\": \"string\",\n          \"default\": \"page\",\n          \"enum\": [\"page\", \"block\", \"line\"]\n        }\n      }\n    },\n    \"ocrd-kraken-segment\": {\n      \"executable\": \"ocrd-kraken-segment\",\n      \"categories\": [\n        \"Layout analysis\"\n      ],\n      \"steps\": [\n        \"layout/segmentation/region\"\n      ],\n      \"description\": \"Block segmentation with kraken\",\n      \"parameters\": {\n        \"text_direction\": {\n          \"type\": \"string\",\n          \"description\": \"Sets principal text direction\",\n          \"enum\": [\"horizontal-lr\", \"horizontal-rl\", \"vertical-lr\", \"vertical-rl\"],\n          \"default\": \"horizontal-lr\"\n        },\n        \"script_detect\": {\n          \"type\": \"boolean\",\n          \"description\": \"Enable script detection on segmenter output\",\n          \"default\": false\n        },\n        \"maxcolseps\": {\"type\": \"number\", \"format\": \"integer\", \"default\": 2},\n        \"scale\": {\"type\": \"number\", \"format\": \"float\", \"default\": 0},\n        \"black_colseps\": {\"type\": \"boolean\", \"default\": false},\n        \"white_colseps\": {\"type\": \"boolean\", \"default\": false}\n      }\n    },\n    \"ocrd-kraken-ocr\": {\n      \"executable\": \"ocrd-kraken-ocr\",\n      \"categories\": [\"Text recognition and optimization\"],\n      \"steps\": [\n        \"recognition/text-recognition\"\n      ],\n      \"description\": \"OCR with kraken\",\n      \"parameters\": {\n        \"lines-json\": {\n          \"type\": \"string\",\n          \"format\": \"url\",\n          \"required\": \"true\",\n          \"description\": \"URL to line segmentation in JSON\"\n        }\n      }\n    }\n\n  }\n}\n", "setup.py"=>"# -*- coding: utf-8 -*-\n\"\"\"\nInstalls two binaries:\n\n    - ocrd-kraken-binarize\n    - ocrd-kraken-segment\n\"\"\"\nimport codecs\n\nfrom setuptools import setup, find_packages\n\nsetup(\n    name='ocrd_kraken',\n    version='0.1.1',\n    description='kraken bindings',\n    long_description=codecs.open('README.md', encoding='utf-8').read(),\n    long_description_content_type='text/markdown',\n    author='Konstantin Baierer, Kay-Michael Würzner',\n    author_email='unixprog@gmail.com, wuerzner@gmail.com',\n    url='https://github.com/OCR-D/ocrd_kraken',\n    license='Apache License 2.0',\n    packages=find_packages(exclude=('tests', 'docs')),\n    install_requires=[\n        'ocrd >= 1.0.0a4',\n        'kraken == 0.9.16',\n        'click >= 7',\n    ],\n    package_data={\n        '': ['*.json', '*.yml', '*.yaml'],\n    },\n    entry_points={\n        'console_scripts': [\n            'ocrd-kraken-binarize=ocrd_kraken.cli:ocrd_kraken_binarize',\n            'ocrd-kraken-segment=ocrd_kraken.cli:ocrd_kraken_segment',\n        ]\n    },\n)\n"}, "git"=>{"last_commit"=>"Mon Oct 21 20:52:26 2019 +0200", "latest_tag"=>"v0.1.1", "number_of_commits"=>"85", "url"=>"https://github.com/OCR-D/ocrd_kraken.git"}, "name"=>"ocrd_kraken", "ocrd_tool"=>{"git_url"=>"https://github.com/OCR-D/ocrd_kraken", "tools"=>{"ocrd-kraken-binarize"=>{"categories"=>["Image preprocessing"], "description"=>"Binarize images with kraken", "executable"=>"ocrd-kraken-binarize", "input_file_grp"=>"OCR-D-IMG", "output_file_grp"=>"OCR-D-IMG-BIN", "parameters"=>{"level-of-operation"=>{"default"=>"page", "enum"=>["page", "block", "line"], "type"=>"string"}}, "steps"=>["preprocessing/optimization/binarization"]}, "ocrd-kraken-ocr"=>{"categories"=>["Text recognition and optimization"], "description"=>"OCR with kraken", "executable"=>"ocrd-kraken-ocr", "parameters"=>{"lines-json"=>{"description"=>"URL to line segmentation in JSON", "format"=>"url", "required"=>"true", "type"=>"string"}}, "steps"=>["recognition/text-recognition"]}, "ocrd-kraken-segment"=>{"categories"=>["Layout analysis"], "description"=>"Block segmentation with kraken", "executable"=>"ocrd-kraken-segment", "parameters"=>{"black_colseps"=>{"default"=>false, "type"=>"boolean"}, "maxcolseps"=>{"default"=>2, "format"=>"integer", "type"=>"number"}, "scale"=>{"default"=>0, "format"=>"float", "type"=>"number"}, "script_detect"=>{"default"=>false, "description"=>"Enable script detection on segmenter output", "type"=>"boolean"}, "text_direction"=>{"default"=>"horizontal-lr", "description"=>"Sets principal text direction", "enum"=>["horizontal-lr", "horizontal-rl", "vertical-lr", "vertical-rl"], "type"=>"string"}, "white_colseps"=>{"default"=>false, "type"=>"boolean"}}, "steps"=>["layout/segmentation/region"]}}, "version"=>"0.0.2"}, "ocrd_tool_validate"=>"<report valid=\"false\">\n  [tools.ocrd-kraken-binarize.input_file_grp] 'OCR-D-IMG' is not of type 'array'\n  [tools.ocrd-kraken-binarize.output_file_grp] 'OCR-D-IMG-BIN' is not of type 'array'\n  [tools.ocrd-kraken-binarize.parameters.level-of-operation] 'description' is a required property\n  [tools.ocrd-kraken-segment] 'input_file_grp' is a required property\n  [tools.ocrd-kraken-segment.parameters.maxcolseps] 'description' is a required property\n  [tools.ocrd-kraken-segment.parameters.scale] 'description' is a required property\n  [tools.ocrd-kraken-segment.parameters.black_colseps] 'description' is a required property\n  [tools.ocrd-kraken-segment.parameters.white_colseps] 'description' is a required property\n  [tools.ocrd-kraken-ocr] 'input_file_grp' is a required property\n  [tools.ocrd-kraken-ocr.parameters.lines-json.required] 'true' is not of type 'boolean'\n</report>", "official"=>false, "org_plus_name"=>"OCR-D/ocrd_kraken", "python"=>{"author"=>"Konstantin Baierer, Kay-Michael Würzner", "author-email"=>"unixprog@gmail.com, wuerzner@gmail.com", "name"=>"ocrd_kraken", "pypi"=>{"info"=>{"author"=>"Konstantin Baierer, Kay-Michael Würzner", "author_email"=>"unixprog@gmail.com, wuerzner@gmail.com", "bugtrack_url"=>nil, "classifiers"=>[], "description"=>"# ocrd_kraken\n\n> Wrapper for the kraken OCR engine\n\n[![image](https://travis-ci.org/OCR-D/ocrd_kraken.svg?branch=master)](https://travis-ci.org/OCR-D/ocrd_kraken)\n[![Docker Automated build](https://img.shields.io/docker/automated/ocrd/kraken.svg)](https://hub.docker.com/r/ocrd/kraken/tags/)\n[![image](https://circleci.com/gh/OCR-D/ocrd_kraken.svg?style=svg)](https://circleci.com/gh/OCR-D/ocrd_kraken)\n\n\n", "description_content_type"=>"text/markdown", "docs_url"=>nil, "download_url"=>"", "downloads"=>{"last_day"=>-1, "last_month"=>-1, "last_week"=>-1}, "home_page"=>"https://github.com/OCR-D/ocrd_kraken", "keywords"=>"", "license"=>"Apache License 2.0", "maintainer"=>"", "maintainer_email"=>"", "name"=>"ocrd-kraken", "package_url"=>"https://pypi.org/project/ocrd-kraken/", "platform"=>"", "project_url"=>"https://pypi.org/project/ocrd-kraken/", "project_urls"=>{"Homepage"=>"https://github.com/OCR-D/ocrd_kraken"}, "release_url"=>"https://pypi.org/project/ocrd-kraken/0.1.1/", "requires_dist"=>["ocrd (>=1.0.0a4)", "kraken (==0.9.16)", "click (>=7)"], "requires_python"=>"", "summary"=>"kraken bindings", "version"=>"0.1.1"}, "last_serial"=>6008613, "releases"=>{"0.0.2"=>[{"comment_text"=>"", "digests"=>{"md5"=>"b065398af77f4804763665f50503e141", "sha256"=>"a0de30df5e8b7d9fe1ed3343a8fa3a413620828a2cdf46bcab8d77e864869d53"}, "downloads"=>-1, "filename"=>"ocrd_kraken-0.0.2-py2-none-any.whl", "has_sig"=>false, "md5_digest"=>"b065398af77f4804763665f50503e141", "packagetype"=>"bdist_wheel", "python_version"=>"py2", "requires_python"=>nil, "size"=>10691, "upload_time"=>"2019-01-04T13:42:30", "upload_time_iso_8601"=>"2019-01-04T13:42:30.728403Z", "url"=>"https://files.pythonhosted.org/packages/b4/52/aea22b8cfab48546e10118e0eb7e70dc108fe633af3e07194dfd04e00fb2/ocrd_kraken-0.0.2-py2-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"67b290066697cbaddb71a4ff92eeb9f5", "sha256"=>"805fb1aa976f9ee1275e347b1fee2413af3ea7cc8972af84464c6f4253ebdd6e"}, "downloads"=>-1, "filename"=>"ocrd_kraken-0.0.2-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"67b290066697cbaddb71a4ff92eeb9f5", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>9634, "upload_time"=>"2019-01-04T13:42:32", "upload_time_iso_8601"=>"2019-01-04T13:42:32.808242Z", "url"=>"https://files.pythonhosted.org/packages/06/00/a9843c2c73a086c1f66e28d6b0d64053ecd66995daddfb5c0f28e566c9f7/ocrd_kraken-0.0.2-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"351d10f31667ec43d9a117b9dd19e861", "sha256"=>"a6464f3559acfb36947687d4e2e70cd7cb7e655d70234696e2e7c1b07f99bab8"}, "downloads"=>-1, "filename"=>"ocrd_kraken-0.0.2.tar.gz", "has_sig"=>false, "md5_digest"=>"351d10f31667ec43d9a117b9dd19e861", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>5003, "upload_time"=>"2019-01-04T13:42:34", "upload_time_iso_8601"=>"2019-01-04T13:42:34.101144Z", "url"=>"https://files.pythonhosted.org/packages/32/bb/9e4299ec1d5f494e7bf14de447f361455f36ea0255181871ee937aae0528/ocrd_kraken-0.0.2.tar.gz"}], "0.1.0"=>[{"comment_text"=>"", "digests"=>{"md5"=>"67161c2e535ac409369978252333eb35", "sha256"=>"4e6b7e9d1930de1f0bd57dfd63f9418c4345842e7cc8fdd9b147e7d378b8fe51"}, "downloads"=>-1, "filename"=>"ocrd_kraken-0.1.0-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"67161c2e535ac409369978252333eb35", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>10442, "upload_time"=>"2019-02-28T09:37:43", "upload_time_iso_8601"=>"2019-02-28T09:37:43.225080Z", "url"=>"https://files.pythonhosted.org/packages/d6/4b/d7027ac27e1228cf9aa3ecd94e412b371b2a63ab2c93c1b77ad5414380c1/ocrd_kraken-0.1.0-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"f1ec0ad2a8e1d655410e4321c7dfae60", "sha256"=>"9bec610685e29d29e0614f2dfc300d201fbbff3f728140536031f14e4e65584c"}, "downloads"=>-1, "filename"=>"ocrd_kraken-0.1.0.tar.gz", "has_sig"=>false, "md5_digest"=>"f1ec0ad2a8e1d655410e4321c7dfae60", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>4121, "upload_time"=>"2019-02-28T09:37:44", "upload_time_iso_8601"=>"2019-02-28T09:37:44.655031Z", "url"=>"https://files.pythonhosted.org/packages/cb/35/7be3dd70b97e276ce2300dddf165bfc21c0e469c2626d7d531a07b8bf0fb/ocrd_kraken-0.1.0.tar.gz"}], "0.1.1"=>[{"comment_text"=>"", "digests"=>{"md5"=>"d6cc67071fe7db22ee35c58e6df6cb7c", "sha256"=>"4d6a4a969ad43711cd22febfe2cc63c966b48b033537f87b433ea8254bb86a1a"}, "downloads"=>-1, "filename"=>"ocrd_kraken-0.1.1-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"d6cc67071fe7db22ee35c58e6df6cb7c", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>10595, "upload_time"=>"2019-10-21T18:20:21", "upload_time_iso_8601"=>"2019-10-21T18:20:21.215930Z", "url"=>"https://files.pythonhosted.org/packages/20/af/393dbc0767398429e08adb761289656516ab18d4f65d8e5c81791c6cafdc/ocrd_kraken-0.1.1-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"22813065ca842796d8d53a2ae148b7c9", "sha256"=>"67cad5aa4ce098262051f84c2f98a5a03be4b62e8bc4c2af1654f00b41caae25"}, "downloads"=>-1, "filename"=>"ocrd_kraken-0.1.1.tar.gz", "has_sig"=>false, "md5_digest"=>"22813065ca842796d8d53a2ae148b7c9", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>4209, "upload_time"=>"2019-10-21T18:20:22", "upload_time_iso_8601"=>"2019-10-21T18:20:22.550782Z", "url"=>"https://files.pythonhosted.org/packages/bb/18/1c305cd6dc5b38880a3240bdca9f3ac53c2780a292b2a02812075ddddff7/ocrd_kraken-0.1.1.tar.gz"}]}, "urls"=>[{"comment_text"=>"", "digests"=>{"md5"=>"d6cc67071fe7db22ee35c58e6df6cb7c", "sha256"=>"4d6a4a969ad43711cd22febfe2cc63c966b48b033537f87b433ea8254bb86a1a"}, "downloads"=>-1, "filename"=>"ocrd_kraken-0.1.1-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"d6cc67071fe7db22ee35c58e6df6cb7c", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>10595, "upload_time"=>"2019-10-21T18:20:21", "upload_time_iso_8601"=>"2019-10-21T18:20:21.215930Z", "url"=>"https://files.pythonhosted.org/packages/20/af/393dbc0767398429e08adb761289656516ab18d4f65d8e5c81791c6cafdc/ocrd_kraken-0.1.1-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"22813065ca842796d8d53a2ae148b7c9", "sha256"=>"67cad5aa4ce098262051f84c2f98a5a03be4b62e8bc4c2af1654f00b41caae25"}, "downloads"=>-1, "filename"=>"ocrd_kraken-0.1.1.tar.gz", "has_sig"=>false, "md5_digest"=>"22813065ca842796d8d53a2ae148b7c9", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>4209, "upload_time"=>"2019-10-21T18:20:22", "upload_time_iso_8601"=>"2019-10-21T18:20:22.550782Z", "url"=>"https://files.pythonhosted.org/packages/bb/18/1c305cd6dc5b38880a3240bdca9f3ac53c2780a292b2a02812075ddddff7/ocrd_kraken-0.1.1.tar.gz"}]}, "url"=>"https://github.com/OCR-D/ocrd_kraken"}, "url"=>"https://github.com/OCR-D/ocrd_kraken"}

ocrd_ocropy

{"compliant_cli"=>true, "files"=>{"Dockerfile"=>"FROM ocrd/core\nMAINTAINER OCR-D\nENV DEBIAN_FRONTEND noninteractive\nENV PYTHONIOENCODING utf8\nENV LC_ALL C.UTF-8\nENV LANG C.UTF-8\n\nWORKDIR /build-ocrd\nCOPY setup.py .\nCOPY requirements.txt .\nCOPY README.md .\nRUN apt-get update && \\\n    apt-get -y install --no-install-recommends \\\n    ca-certificates \\\n    make \\\n    git\nCOPY ocrd_ocropy ./ocrd_ocropy\nRUN pip3 install --upgrade pip\nRUN make deps install\n\nENTRYPOINT [\"/bin/sh\", \"-c\"]\n", "README.md"=>"# ocrd_ocropy\n\n[![image](https://travis-ci.org/OCR-D/ocrd_ocropy.svg?branch=master)](https://travis-ci.org/OCR-D/ocrd_ocropy)\n\n[![Docker Automated build](https://img.shields.io/docker/automated/ocrd/ocropy.svg)](https://hub.docker.com/r/ocrd/ocropy/tags/)\n\n> Wrapper for the ocropy OCR engine\n", "ocrd-tool.json"=>"{\n  \"version\": \"0.0.1\",\n  \"git_url\": \"https://github.com/OCR-D/ocrd_ocropy\",\n  \"tools\": {\n    \"ocrd-ocropy-segment\": {\n      \"executable\": \"ocrd-ocropy-segment\",\n      \"categories\": [\"Image preprocessing\"],\n      \"steps\": [\"layout/segmentation/region\"],\n      \"description\": \"Segment page\",\n      \"input_file_grp\": [\"OCR-D-IMG-BIN\"],\n      \"output_file_grp\": [\"OCR-D-SEG-LINE\"],\n      \"parameters\": {\n        \"maxcolseps\":  {\"type\": \"number\", \"description\": \"has an effect\", \"default\": 3},\n        \"maxseps\":     {\"type\": \"number\", \"description\": \"has an effect\", \"default\": 0},\n        \"sepwiden\":    {\"type\": \"number\", \"description\": \"has an effect\", \"default\": 10},\n        \"csminheight\": {\"type\": \"number\", \"description\": \"has an effect\", \"default\": 10},\n        \"csminaspect\": {\"type\": \"number\", \"description\": \"has an effect\", \"default\": 1.1},\n        \"pad\":         {\"type\": \"number\", \"description\": \"has an effect\", \"default\": 3},\n        \"expand\":      {\"type\": \"number\", \"description\": \"has an effect\", \"default\": 3},\n        \"usegauss\":    {\"type\": \"boolean\",\"description\": \"has an effect\", \"default\": false},\n        \"threshold\":   {\"type\": \"number\", \"description\": \"has an effect\", \"default\": 0.2},\n        \"noise\":       {\"type\": \"number\", \"description\": \"has an effect\", \"default\": 8},\n        \"scale\":       {\"type\": \"number\", \"description\": \"has an effect\", \"default\": 0.0},\n        \"hscale\":      {\"type\": \"number\", \"description\": \"has an effect\", \"default\": 1.0},\n        \"vscale\":      {\"type\": \"number\", \"description\": \"has an effect\", \"default\": 1.0}\n      }\n    }\n  }\n}\n", "setup.py"=>"# -*- coding: utf-8 -*-\n\"\"\"\nInstalls one binary:\n\n    - ocrd-ocropy-segment\n\"\"\"\nimport codecs\n\nfrom setuptools import setup\n\nsetup(\n    name='ocrd_ocropy',\n    version='0.0.3',\n    description='ocropy bindings',\n    long_description=codecs.open('README.md', encoding='utf-8').read(),\n    long_description_content_type='text/markdown',\n    author='Konstantin Baierer',\n    author_email='unixprog@gmail.com, wuerzner@gmail.com',\n    url='https://github.com/OCR-D/ocrd_ocropy',\n    license='Apache License 2.0',\n    packages=['ocrd_ocropy'],\n    install_requires=[\n        'ocrd >= 1.0.0b8',\n        'ocrd-fork-ocropy >= 1.4.0a3',\n        'click'\n    ],\n    package_data={\n        '': ['*.json', '*.yml', '*.yaml'],\n    },\n    entry_points={\n        'console_scripts': [\n            'ocrd-ocropy-segment=ocrd_ocropy.cli:ocrd_ocropy_segment',\n        ]\n    },\n)\n"}, "git"=>{"last_commit"=>"Tue Jun 11 14:51:00 2019 +0200", "latest_tag"=>"v0.0.3", "number_of_commits"=>"66", "url"=>"https://github.com/OCR-D/ocrd_ocropy.git"}, "name"=>"ocrd_ocropy", "ocrd_tool"=>{"git_url"=>"https://github.com/OCR-D/ocrd_ocropy", "tools"=>{"ocrd-ocropy-segment"=>{"categories"=>["Image preprocessing"], "description"=>"Segment page", "executable"=>"ocrd-ocropy-segment", "input_file_grp"=>["OCR-D-IMG-BIN"], "output_file_grp"=>["OCR-D-SEG-LINE"], "parameters"=>{"csminaspect"=>{"default"=>1.1, "description"=>"has an effect", "type"=>"number"}, "csminheight"=>{"default"=>10, "description"=>"has an effect", "type"=>"number"}, "expand"=>{"default"=>3, "description"=>"has an effect", "type"=>"number"}, "hscale"=>{"default"=>1.0, "description"=>"has an effect", "type"=>"number"}, "maxcolseps"=>{"default"=>3, "description"=>"has an effect", "type"=>"number"}, "maxseps"=>{"default"=>0, "description"=>"has an effect", "type"=>"number"}, "noise"=>{"default"=>8, "description"=>"has an effect", "type"=>"number"}, "pad"=>{"default"=>3, "description"=>"has an effect", "type"=>"number"}, "scale"=>{"default"=>0.0, "description"=>"has an effect", "type"=>"number"}, "sepwiden"=>{"default"=>10, "description"=>"has an effect", "type"=>"number"}, "threshold"=>{"default"=>0.2, "description"=>"has an effect", "type"=>"number"}, "usegauss"=>{"default"=>false, "description"=>"has an effect", "type"=>"boolean"}, "vscale"=>{"default"=>1.0, "description"=>"has an effect", "type"=>"number"}}, "steps"=>["layout/segmentation/region"]}}, "version"=>"0.0.1"}, "ocrd_tool_validate"=>"<report valid=\"true\">\n</report>", "official"=>false, "org_plus_name"=>"OCR-D/ocrd_ocropy", "python"=>{"author"=>"Konstantin Baierer", "author-email"=>"unixprog@gmail.com, wuerzner@gmail.com", "name"=>"ocrd_ocropy", "pypi"=>{"info"=>{"author"=>"Konstantin Baierer", "author_email"=>"unixprog@gmail.com, wuerzner@gmail.com", "bugtrack_url"=>nil, "classifiers"=>[], "description"=>"# ocrd_ocropy\n\n[![image](https://travis-ci.org/OCR-D/ocrd_ocropy.svg?branch=master)](https://travis-ci.org/OCR-D/ocrd_ocropy)\n\n[![Docker Automated build](https://img.shields.io/docker/automated/ocrd/ocropy.svg)](https://hub.docker.com/r/ocrd/ocropy/tags/)\n\n> Wrapper for the ocropy OCR engine\n\n\n", "description_content_type"=>"text/markdown", "docs_url"=>nil, "download_url"=>"", "downloads"=>{"last_day"=>-1, "last_month"=>-1, "last_week"=>-1}, "home_page"=>"https://github.com/OCR-D/ocrd_ocropy", "keywords"=>"", "license"=>"Apache License 2.0", "maintainer"=>"", "maintainer_email"=>"", "name"=>"ocrd-ocropy", "package_url"=>"https://pypi.org/project/ocrd-ocropy/", "platform"=>"", "project_url"=>"https://pypi.org/project/ocrd-ocropy/", "project_urls"=>{"Homepage"=>"https://github.com/OCR-D/ocrd_ocropy"}, "release_url"=>"https://pypi.org/project/ocrd-ocropy/0.0.3/", "requires_dist"=>["ocrd (>=1.0.0b8)", "ocrd-fork-ocropy (>=1.4.0a3)", "click"], "requires_python"=>"", "summary"=>"ocropy bindings", "version"=>"0.0.3"}, "last_serial"=>4979689, "releases"=>{"0.0.1a1"=>[{"comment_text"=>"", "digests"=>{"md5"=>"955580b46dea69b4880f95f90076cfb3", "sha256"=>"1dc3926e7c28ecb52260c42d0b3b6b3cc3d2964b13ea994601219269c8072d89"}, "downloads"=>-1, "filename"=>"ocrd_ocropy-0.0.1a1-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"955580b46dea69b4880f95f90076cfb3", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>6462, "upload_time"=>"2019-03-19T17:02:48", "upload_time_iso_8601"=>"2019-03-19T17:02:48.327057Z", "url"=>"https://files.pythonhosted.org/packages/c7/ce/9f578c500afbffba6de78fb1fb0d881c23ddb794256a276e4277d5ad7c25/ocrd_ocropy-0.0.1a1-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"39723d9e4f1734de4a7f1fdd9e7008fc", "sha256"=>"fc72a46a9e3bc7fd601aa6c00992debe566f1838b95bbd61e8c746b3abd0d673"}, "downloads"=>-1, "filename"=>"ocrd_ocropy-0.0.1a1.tar.gz", "has_sig"=>false, "md5_digest"=>"39723d9e4f1734de4a7f1fdd9e7008fc", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>6105, "upload_time"=>"2019-03-19T17:02:50", "upload_time_iso_8601"=>"2019-03-19T17:02:50.204116Z", "url"=>"https://files.pythonhosted.org/packages/8f/a1/2030fb1c2c08cac624a7640daa6a12c3d115a52a9d7d66de5c6b427bbbde/ocrd_ocropy-0.0.1a1.tar.gz"}], "0.0.2"=>[{"comment_text"=>"", "digests"=>{"md5"=>"9a5b84192f6eb88c34a6e64528526d98", "sha256"=>"a1827b7fb49a27e297fb01ceea45c2272d996f498c576637e42d8008d28dfe9b"}, "downloads"=>-1, "filename"=>"ocrd_ocropy-0.0.2-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"9a5b84192f6eb88c34a6e64528526d98", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>10625, "upload_time"=>"2019-03-24T19:17:23", "upload_time_iso_8601"=>"2019-03-24T19:17:23.779614Z", "url"=>"https://files.pythonhosted.org/packages/7f/46/222d127fe28c522ab65448bd552f9b9b66ec6e5582f8cc7e2ee57f5450a5/ocrd_ocropy-0.0.2-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"9e83b8f7b5d686f6bcc032a8ca532ed6", "sha256"=>"d1e4cd90fff395e332814f51de1b46533ac88ea72f99f4502524c0c659572519"}, "downloads"=>-1, "filename"=>"ocrd_ocropy-0.0.2.tar.gz", "has_sig"=>false, "md5_digest"=>"9e83b8f7b5d686f6bcc032a8ca532ed6", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>5855, "upload_time"=>"2019-03-24T19:17:25", "upload_time_iso_8601"=>"2019-03-24T19:17:25.438144Z", "url"=>"https://files.pythonhosted.org/packages/89/18/c634cc95db36cfa523a75f3ae4e5ee3055b8bcf56969bc3231cdddb3d082/ocrd_ocropy-0.0.2.tar.gz"}], "0.0.3"=>[{"comment_text"=>"", "digests"=>{"md5"=>"8a0d325dd9a10aea746f05824d30ce5c", "sha256"=>"2eb914d948f0dcf543560e9c2cb13eccd8d96f335febef1753e108279d0fdc7e"}, "downloads"=>-1, "filename"=>"ocrd_ocropy-0.0.3-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"8a0d325dd9a10aea746f05824d30ce5c", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>10632, "upload_time"=>"2019-03-24T19:53:40", "upload_time_iso_8601"=>"2019-03-24T19:53:40.405082Z", "url"=>"https://files.pythonhosted.org/packages/7b/0a/dd552d4077fe60652b1fe30e0fe4363686838bc8b88aa852d080e667d370/ocrd_ocropy-0.0.3-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"69fe2b3b78a357940f17678bdc78a80b", "sha256"=>"f7b3f421f34d2cb4637b864709349ee508e859d1f512ce65be8bc3f2ab35374c"}, "downloads"=>-1, "filename"=>"ocrd_ocropy-0.0.3.tar.gz", "has_sig"=>false, "md5_digest"=>"69fe2b3b78a357940f17678bdc78a80b", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>5867, "upload_time"=>"2019-03-24T19:53:41", "upload_time_iso_8601"=>"2019-03-24T19:53:41.685748Z", "url"=>"https://files.pythonhosted.org/packages/6b/5a/d711492c2f10b241069361df84544145dab22654a173ac566645cec0bb9f/ocrd_ocropy-0.0.3.tar.gz"}]}, "urls"=>[{"comment_text"=>"", "digests"=>{"md5"=>"8a0d325dd9a10aea746f05824d30ce5c", "sha256"=>"2eb914d948f0dcf543560e9c2cb13eccd8d96f335febef1753e108279d0fdc7e"}, "downloads"=>-1, "filename"=>"ocrd_ocropy-0.0.3-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"8a0d325dd9a10aea746f05824d30ce5c", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>10632, "upload_time"=>"2019-03-24T19:53:40", "upload_time_iso_8601"=>"2019-03-24T19:53:40.405082Z", "url"=>"https://files.pythonhosted.org/packages/7b/0a/dd552d4077fe60652b1fe30e0fe4363686838bc8b88aa852d080e667d370/ocrd_ocropy-0.0.3-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"69fe2b3b78a357940f17678bdc78a80b", "sha256"=>"f7b3f421f34d2cb4637b864709349ee508e859d1f512ce65be8bc3f2ab35374c"}, "downloads"=>-1, "filename"=>"ocrd_ocropy-0.0.3.tar.gz", "has_sig"=>false, "md5_digest"=>"69fe2b3b78a357940f17678bdc78a80b", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>5867, "upload_time"=>"2019-03-24T19:53:41", "upload_time_iso_8601"=>"2019-03-24T19:53:41.685748Z", "url"=>"https://files.pythonhosted.org/packages/6b/5a/d711492c2f10b241069361df84544145dab22654a173ac566645cec0bb9f/ocrd_ocropy-0.0.3.tar.gz"}]}, "url"=>"https://github.com/OCR-D/ocrd_ocropy"}, "url"=>"https://github.com/OCR-D/ocrd_ocropy"}

ocrd_olena

{"compliant_cli"=>true, "files"=>{"Dockerfile"=>"# Patch and build Olena from Git, then\n# Install OCR-D wrapper for binarization\nFROM ocrd/core\n\nMAINTAINER OCR-D\n\nENV PREFIX=/usr/local\n\nWORKDIR /build-olena\nCOPY .gitmodules .\nCOPY Makefile .\nCOPY ocrd-tool.json .\nCOPY ocrd-olena-binarize .\n\nENV DEPS=\"g++ make automake git\"\nRUN apt-get update && \\\n    apt-get -y install --no-install-recommends $DEPS && \\\n    make deps-ubuntu && \\\n    git init && \\\n    git submodule add https://github.com/OCR-D/olena.git repo/olena && \\\n    git submodule add https://github.com/OCR-D/assets.git repo/assets && \\\n    make build-olena install clean-olena && \\\n    apt-get -y remove $DEPS && \\\n    apt-get -y autoremove && apt-get clean && \\\n    rm -fr /build-olena\n\nWORKDIR /data\nVOLUME /data\n\n#ENTRYPOINT [\"/usr/bin/ocrd-olena-binarize\"]\n#CMD [\"--help\"]\nCMD [\"/usr/bin/ocrd-olena-binarize\", \"--help\"]\n", "README.md"=>"# ocrd_olena\n\n> Binarize with Olena/scribo\n\n[![Build Status](https://travis-ci.org/OCR-D/ocrd_olena.svg?branch=master)](https://travis-ci.org/OCR-D/ocrd_olena)\n[![CircleCI](https://circleci.com/gh/OCR-D/ocrd_olena.svg?style=svg)](https://circleci.com/gh/OCR-D/ocrd_olena)\n[![Docker Automated build](https://img.shields.io/docker/automated/ocrd/core.svg)](https://hub.docker.com/r/ocrd/olena/tags/)\n\n## Requirements\n\n```\nmake deps-ubuntu\n```\n\n...will try to install the required packages on Ubuntu.\n\n## Installation\n\n```\nmake build-olena\n```\n\n...will download, patch and build Olena/scribo from source, and install locally (in VIRTUAL_ENV or in CWD/local).\n\n```\nmake install\n```\n\n...will do that, but additionally install `ocrd-binarize-olena` (the OCR-D wrapper).\n\n## Testing\n\n```\nmake test\n```\n\n...will clone the assets repository from Github, make a workspace copy, and run checksum tests for binarization on them.\n\n## Usage\n\nThis package has the following user interfaces:\n\n### command line interface `scribo-cli`\n\nConverts images in any format to netpbm (monochrome portable bitmap).\n\n```\nUsage: scribo-cli [version] [help] COMMAND [ARGS]\n\nList of available COMMAND argument:\n\n  Full Toolchains\n  ---------------\n\n\n   * On documents\n\n     doc-ppc\t       Common preprocessing before looking for text.\n\n     doc-ocr           Find and recognize text. Output: the actual text\n     \t\t       and its location.\n\n     doc-dia           Analyse the document structure and extract the\n     \t\t       text. Output: an XML file with region and text\n     \t\t       information.\n\n\n\n   * On pictures\n\n     pic-loc           Try to localize text if there's any.\n\n     pic-ocr           Localize and try to recognize text.\n\n\n\n  Tools\n  -----\n\n\n     * xml2doc\t       Convert the XML results of document toolchains\n       \t\t       into user documents (HTML, PDF...).\n\n\n  Algorithms\n  ----------\n\n\n   * Binarization\n\n     sauvola           Sauvola's algorithm.\n\n     sauvola-ms        Multi-scale Sauvola's algorithm.\n\n     sauvola-ms-fg     Extract foreground objects and run multi-scale\n                       Sauvola's algorithm.\n\n     sauvola-ms-split  Run multi-scale Sauvola's algorithm on each color\n                       component and merge results.\n\n---------------------------------------------------------------------------\nSee 'scribo-cli COMMAND --help' for more information on a specific command.\n```\n\nFor example:\n\n```sh\nscribo-cli sauvola-ms path/to/input.tif path/to/output.png --enable-negate-output\n```\n\n### [OCR-D processor](https://ocr-d.github.com/cli) interface `ocrd-olena-binarize`\n\nTo be used with [PageXML](https://github.com/PRImA-Research-Lab/PAGE-XML) documents in an [OCR-D](https://ocr-d.github.io) annotation workflow. Input could be any valid workspace with source images available. Currently covers the `Page` hierarchy level only. Uses either (the last) `AlternativeImage`, if any, or `imageFilename`, otherwise. Adds an `AlternativeImage` with the result of binarization for every page.\n\n```json\n    \"ocrd-olena-binarize\": {\n      \"executable\": \"ocrd-olena-binarize\",\n      \"description\": \"OLENA's binarization algos for OCR-D (on page-level)\",\n      \"categories\": [\n        \"Image preprocessing\"\n      ],\n      \"steps\": [\n        \"preprocessing/optimization/binarization\"\n      ],\n      \"input_file_grp\": [\n        \"OCR-D-SEG-BLOCK\",\n        \"OCR-D-SEG-LINE\",\n        \"OCR-D-SEG-WORD\",\n        \"OCR-D-IMG\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-SEG-BLOCK\",\n        \"OCR-D-SEG-LINE\",\n        \"OCR-D-SEG-WORD\"\n      ],\n      \"parameters\": {\n        \"impl\": {\n          \"description\": \"The name of the actual binarization algorithm\",\n          \"type\": \"string\",\n          \"required\": true,\n          \"enum\": [\"sauvola\", \"sauvola-ms\", \"sauvola-ms-fg\", \"sauvola-ms-split\", \"kim\", \"wolf\", \"niblack\", \"singh\", \"otsu\"]\n        },\n        \"win-size\": {\n          \"description\": \"Window size\",\n          \"type\": \"number\",\n          \"format\": \"integer\",\n          \"default\": 101\n        },\n        \"k\": {\n          \"description\": \"Sauvola's formulae parameter\",\n          \"format\": \"float\",\n          \"type\": \"number\",\n          \"default\": 0.34\n        }\n      }\n    }\n```\n\n## License\n\nCopyright 2018-2020 Project OCR-D\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\n   http://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.\n", "ocrd-tool.json"=>"{\n  \"version\": \"1.1.0\",\n  \"git_url\": \"https://github.com/OCR-D/ocrd_olena\",\n  \"tools\": {\n    \"ocrd-olena-binarize\": {\n      \"executable\": \"ocrd-olena-binarize\",\n      \"description\": \"OLENA's binarization algos for OCR-D (on page-level)\",\n      \"categories\": [\n        \"Image preprocessing\"\n      ],\n      \"steps\": [\n        \"preprocessing/optimization/binarization\"\n      ],\n      \"input_file_grp\": [\n        \"OCR-D-SEG-BLOCK\",\n        \"OCR-D-SEG-LINE\",\n        \"OCR-D-SEG-WORD\",\n        \"OCR-D-IMG\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-SEG-BLOCK\",\n        \"OCR-D-SEG-LINE\",\n        \"OCR-D-SEG-WORD\"\n      ],\n      \"parameters\": {\n        \"impl\": {\n          \"description\": \"The name of the actual binarization algorithm\",\n          \"type\": \"string\",\n          \"required\": true,\n          \"enum\": [\"sauvola\", \"sauvola-ms\", \"sauvola-ms-fg\", \"sauvola-ms-split\", \"kim\", \"wolf\", \"niblack\", \"singh\", \"otsu\"]\n        },\n        \"win-size\": {\n          \"description\": \"Window size\",\n          \"type\": \"number\",\n          \"format\": \"integer\",\n          \"default\": 101\n        },\n        \"k\": {\n          \"description\": \"Sauvola's formulae parameter\",\n          \"format\": \"float\",\n          \"type\": \"number\",\n          \"default\": 0.34\n        }\n      }\n    }\n  }\n}\n", "setup.py"=>nil}, "git"=>{"last_commit"=>"Wed Jan 8 18:20:03 2020 +0100", "latest_tag"=>"v1.1.1", "number_of_commits"=>"117", "url"=>"https://github.com/OCR-D/ocrd_olena.git"}, "name"=>"ocrd_olena", "ocrd_tool"=>{"git_url"=>"https://github.com/OCR-D/ocrd_olena", "tools"=>{"ocrd-olena-binarize"=>{"categories"=>["Image preprocessing"], "description"=>"OLENA's binarization algos for OCR-D (on page-level)", "executable"=>"ocrd-olena-binarize", "input_file_grp"=>["OCR-D-SEG-BLOCK", "OCR-D-SEG-LINE", "OCR-D-SEG-WORD", "OCR-D-IMG"], "output_file_grp"=>["OCR-D-SEG-BLOCK", "OCR-D-SEG-LINE", "OCR-D-SEG-WORD"], "parameters"=>{"impl"=>{"description"=>"The name of the actual binarization algorithm", "enum"=>["sauvola", "sauvola-ms", "sauvola-ms-fg", "sauvola-ms-split", "kim", "wolf", "niblack", "singh", "otsu"], "required"=>true, "type"=>"string"}, "k"=>{"default"=>0.34, "description"=>"Sauvola's formulae parameter", "format"=>"float", "type"=>"number"}, "win-size"=>{"default"=>101, "description"=>"Window size", "format"=>"integer", "type"=>"number"}}, "steps"=>["preprocessing/optimization/binarization"]}}, "version"=>"1.1.0"}, "ocrd_tool_validate"=>"<report valid=\"true\">\n</report>", "official"=>false, "org_plus_name"=>"OCR-D/ocrd_olena", "url"=>"https://github.com/OCR-D/ocrd_olena"}

ocrd_segment

{"compliant_cli"=>true, "files"=>{"Dockerfile"=>nil, "README.md"=>"# ocrd_segment\n\nThis repository aims to provide a number of [OCR-D-compliant processors](https://ocr-d.github.io/cli) for layout analysis and evaluation.\n\n## Installation\n\nIn your virtual environment, run:\n```bash\npip install .\n```\n\n## Usage\n\n  - extracting page images (including results from preprocessing like cropping, deskewing or binarization) along with region polygon coordinates and metadata:\n    - [ocrd-segment-extract-regions](ocrd_segment/extract_regions.py)\n  - extracting line images (including results from preprocessing like cropping, deskewing, dewarping or binarization) along with line polygon coordinates and metadata:\n    - [ocrd-segment-extract-lines](ocrd_segment/extract_lines.py)\n  - comparing different layout segmentations (input file groups N = 2, compute the distance between two segmentations, e.g. automatic vs. manual):\n    - [ocrd-segment-evaluate](ocrd_segment/evaluate.py) :construction: (very early stage)\n  - repairing layout segmentations (input file groups N >= 1, based on heuristics implemented using Shapely):\n    - [ocrd-segment-repair](ocrd_segment/repair.py) :construction: (much to be done)\n  - pattern-based segmentation (input file groups N=1, based on a PAGE template, e.g. from Aletheia, and some XSLT or Python to apply it to the input file group)\n    - `ocrd-segment-via-template` :construction: (unpublished)\n  - data-driven segmentation (input file groups N=1, based on a statistical model, e.g. Neural Network)  \n    - `ocrd-segment-via-model` :construction: (unpublished)\n\nFor detailed description on input/output and parameters, see [ocrd-tool.json](ocrd_segment/ocrd-tool.json)\n\n## Testing\n\nNone yet.\n", "ocrd-tool.json"=>"{\n  \"version\": \"0.0.1\",\n  \"git_url\": \"https://github.com/OCR-D/ocrd_segment\",\n  \"tools\": {\n    \"ocrd-segment-repair\": {\n      \"executable\": \"ocrd-segment-repair\",\n      \"categories\": [\"Layout analysis\"],\n      \"description\": \"Analyse and repair region segmentation\",\n      \"input_file_grp\": [\n        \"OCR-D-IMG\",\n        \"OCR-D-SEG-BLOCK\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-EVAL-BLOCK\"\n      ],\n      \"steps\": [\"layout/segmentation/region\"],\n      \"parameters\": {\n        \"sanitize\": {\n          \"type\": \"boolean\",\n          \"default\": false,\n          \"description\": \"Shrink and/or expand a region in such a way that it coordinates include those of all its lines\"\n        },\n        \"plausibilize\": {\n          \"type\": \"boolean\",\n          \"default\": false,\n          \"description\": \"Remove redundant (almost equal or almost contained) regions, and merge overlapping regions\"\n        },\n        \"plausibilize_merge_min_overlap\": {\n          \"type\": \"number\",\n          \"format\": \"float\",\n          \"default\": 0.90,\n          \"description\": \"When merging a region almost contained in another, require at least this ratio of area is shared with the other\"\n        }\n      }\n    },\n    \"ocrd-segment-extract-regions\": {\n      \"executable\": \"ocrd-segment-extract-regions\",\n      \"categories\": [\"Image preprocessing\"],\n      \"description\": \"Extract region segmentation as image+JSON\",\n      \"input_file_grp\": [\n        \"OCR-D-SEG-BLOCK\",\n        \"OCR-D-GT-SEG-BLOCK\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-IMG-CROP\"\n      ],\n      \"steps\": [\"layout/analysis\"],\n      \"parameters\": {\n        \"transparency\": {\n          \"type\": \"boolean\",\n          \"default\": true,\n          \"description\": \"Add alpha channels with segment masks to the images\"\n        }\n      }\n    },\n    \"ocrd-segment-extract-lines\": {\n      \"executable\": \"ocrd-segment-extract-lines\",\n      \"categories\": [\"Image preprocessing\"],\n      \"description\": \"Extract line segmentation as image+txt+JSON\",\n      \"input_file_grp\": [\n        \"OCR-D-SEG-LINE\",\n        \"OCR-D-GT-SEG-LINE\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-IMG-CROP\"\n      ],\n      \"steps\": [\"layout/analysis\"],\n      \"parameters\": {\n        \"transparency\": {\n          \"type\": \"boolean\",\n          \"default\": true,\n          \"description\": \"Add alpha channels with segment masks to the images\"\n        }\n      }\n    },\n    \"ocrd-segment-evaluate\": {\n      \"executable\": \"ocrd-segment-evaluate\",\n      \"categories\": [\"Layout analysis\"],\n      \"description\": \"Compare region segmentations\",\n      \"input_file_grp\": [\n        \"OCR-D-GT-SEG-BLOCK\",\n        \"OCR-D-SEG-BLOCK\"\n      ],\n      \"steps\": [\"layout/analysis\"],\n      \"parameters\": {\n      }\n    }\n  }\n}\n", "setup.py"=>"# -*- coding: utf-8 -*-\n\"\"\"\nInstalls:\n\n    - ocrd-segment-repair\n    - ocrd-segment-extract-pages\n    - ocrd-segment-extract-regions\n    - ocrd-segment-extract-lines\n    - ocrd-segment-evaluate\n\"\"\"\nimport codecs\n\nfrom setuptools import setup, find_packages\n\nsetup(\n    name='ocrd_segment',\n    version='0.0.2',\n    description='Page segmentation and segmentation evaluation',\n    long_description=codecs.open('README.md', encoding='utf-8').read(),\n    author='Konstantin Baierer, Kay-Michael Würzner, Robert Sachunsky',\n    author_email='unixprog@gmail.com, wuerzner@gmail.com, sachunsky@informatik.uni-leipzig.de',\n    url='https://github.com/OCR-D/ocrd_segment',\n    license='Apache License 2.0',\n    packages=find_packages(exclude=('tests', 'docs')),\n    install_requires=open('requirements.txt').read().split('\\n'),\n    package_data={\n        '': ['*.json', '*.yml', '*.yaml'],\n    },\n    entry_points={\n        'console_scripts': [\n            'ocrd-segment-repair=ocrd_segment.cli:ocrd_segment_repair',\n            'ocrd-segment-extract-pages=ocrd_segment.cli:ocrd_segment_extract_pages',\n            'ocrd-segment-extract-regions=ocrd_segment.cli:ocrd_segment_extract_regions',\n            'ocrd-segment-extract-lines=ocrd_segment.cli:ocrd_segment_extract_lines',\n            'ocrd-segment-evaluate=ocrd_segment.cli:ocrd_segment_evaluate',\n        ]\n    },\n)\n"}, "git"=>{"last_commit"=>"Thu Jan 16 10:42:42 2020 +0000", "latest_tag"=>"v0.0.2", "number_of_commits"=>"60", "url"=>"https://github.com/OCR-D/ocrd_segment.git"}, "name"=>"ocrd_segment", "ocrd_tool"=>{"git_url"=>"https://github.com/OCR-D/ocrd_segment", "tools"=>{"ocrd-segment-evaluate"=>{"categories"=>["Layout analysis"], "description"=>"Compare region segmentations", "executable"=>"ocrd-segment-evaluate", "input_file_grp"=>["OCR-D-GT-SEG-BLOCK", "OCR-D-SEG-BLOCK"], "parameters"=>{}, "steps"=>["layout/analysis"]}, "ocrd-segment-extract-lines"=>{"categories"=>["Image preprocessing"], "description"=>"Extract line segmentation as image+txt+JSON", "executable"=>"ocrd-segment-extract-lines", "input_file_grp"=>["OCR-D-SEG-LINE", "OCR-D-GT-SEG-LINE"], "output_file_grp"=>["OCR-D-IMG-CROP"], "parameters"=>{"transparency"=>{"default"=>true, "description"=>"Add alpha channels with segment masks to the images", "type"=>"boolean"}}, "steps"=>["layout/analysis"]}, "ocrd-segment-extract-regions"=>{"categories"=>["Image preprocessing"], "description"=>"Extract region segmentation as image+JSON", "executable"=>"ocrd-segment-extract-regions", "input_file_grp"=>["OCR-D-SEG-BLOCK", "OCR-D-GT-SEG-BLOCK"], "output_file_grp"=>["OCR-D-IMG-CROP"], "parameters"=>{"transparency"=>{"default"=>true, "description"=>"Add alpha channels with segment masks to the images", "type"=>"boolean"}}, "steps"=>["layout/analysis"]}, "ocrd-segment-repair"=>{"categories"=>["Layout analysis"], "description"=>"Analyse and repair region segmentation", "executable"=>"ocrd-segment-repair", "input_file_grp"=>["OCR-D-IMG", "OCR-D-SEG-BLOCK"], "output_file_grp"=>["OCR-D-EVAL-BLOCK"], "parameters"=>{"plausibilize"=>{"default"=>false, "description"=>"Remove redundant (almost equal or almost contained) regions, and merge overlapping regions", "type"=>"boolean"}, "plausibilize_merge_min_overlap"=>{"default"=>0.9, "description"=>"When merging a region almost contained in another, require at least this ratio of area is shared with the other", "format"=>"float", "type"=>"number"}, "sanitize"=>{"default"=>false, "description"=>"Shrink and/or expand a region in such a way that it coordinates include those of all its lines", "type"=>"boolean"}}, "steps"=>["layout/segmentation/region"]}}, "version"=>"0.0.1"}, "ocrd_tool_validate"=>"<report valid=\"true\">\n</report>", "official"=>false, "org_plus_name"=>"OCR-D/ocrd_segment", "python"=>{"author"=>"Konstantin Baierer, Kay-Michael Würzner, Robert Sachunsky", "author-email"=>"unixprog@gmail.com, wuerzner@gmail.com, sachunsky@informatik.uni-leipzig.de", "name"=>"ocrd_segment", "pypi"=>{"info"=>{"author"=>"Konstantin Baierer, Kay-Michael Würzner, Robert Sachunsky", "author_email"=>"unixprog@gmail.com, wuerzner@gmail.com, sachunsky@informatik.uni-leipzig.de", "bugtrack_url"=>nil, "classifiers"=>[], "description"=>"# ocrd_segment\n\nThis repository aims to provide a number of [OCR-D-compliant processors](https://ocr-d.github.io/cli) for layout analysis and evaluation.\n\n## Installation\n\nIn your virtual environment, run:\n```bash\npip install .\n```\n\n## Usage\n\n  - extracting page images (including results from preprocessing like cropping, deskewing or binarization) along with region polygon coordinates and metadata:\n    - [ocrd-segment-extract-regions](ocrd_segment/extract_regions.py)\n  - extracting line images (including results from preprocessing like cropping, deskewing, dewarping or binarization) along with line polygon coordinates and metadata:\n    - [ocrd-segment-extract-lines](ocrd_segment/extract_lines.py)\n  - comparing different layout segmentations (input file groups N = 2, compute the distance between two segmentations, e.g. automatic vs. manual):\n    - [ocrd-segment-evaluate](ocrd_segment/evaluate.py) :construction: (very early stage)\n  - repairing layout segmentations (input file groups N >= 1, based on heuristics implemented using Shapely):\n    - [ocrd-segment-repair](ocrd_segment/repair.py) :construction: (much to be done)\n  - pattern-based segmentation (input file groups N=1, based on a PAGE template, e.g. from Aletheia, and some XSLT or Python to apply it to the input file group)\n    - `ocrd-segment-via-template` :construction: (unpublished)\n  - data-driven segmentation (input file groups N=1, based on a statistical model, e.g. Neural Network)  \n    - `ocrd-segment-via-model` :construction: (unpublished)\n\nFor detailed description on input/output and parameters, see [ocrd-tool.json](ocrd_segment/ocrd-tool.json)\n\n## Testing\n\nNone yet.\n\n\n", "description_content_type"=>"", "docs_url"=>nil, "download_url"=>"", "downloads"=>{"last_day"=>-1, "last_month"=>-1, "last_week"=>-1}, "home_page"=>"https://github.com/OCR-D/ocrd_segment", "keywords"=>"", "license"=>"Apache License 2.0", "maintainer"=>"", "maintainer_email"=>"", "name"=>"ocrd-segment", "package_url"=>"https://pypi.org/project/ocrd-segment/", "platform"=>"", "project_url"=>"https://pypi.org/project/ocrd-segment/", "project_urls"=>{"Homepage"=>"https://github.com/OCR-D/ocrd_segment"}, "release_url"=>"https://pypi.org/project/ocrd-segment/0.0.2/", "requires_dist"=>["ocrd (>=1.0.0b19)", "click", "shapely"], "requires_python"=>"", "summary"=>"Page segmentation and segmentation evaluation", "version"=>"0.0.2"}, "last_serial"=>6235446, "releases"=>{"0.0.2"=>[{"comment_text"=>"", "digests"=>{"md5"=>"e9bc6112469e53afd56563d862000228", "sha256"=>"9b549066f46f26a147b726066712a423f9fcf64b8274dd8285447c564f361783"}, "downloads"=>-1, "filename"=>"ocrd_segment-0.0.2-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"e9bc6112469e53afd56563d862000228", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>14529, "upload_time"=>"2019-12-02T11:50:29", "upload_time_iso_8601"=>"2019-12-02T11:50:29.761485Z", "url"=>"https://files.pythonhosted.org/packages/90/34/4825c12fa6e8238ce350fc766f6aaa0d591705c8f426160eb59ec7513541/ocrd_segment-0.0.2-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"6b258735d218ef459887c4d8d23382c7", "sha256"=>"284557d2fd985bf4be93b4bbbe08ba3fc2668300f5c9694af6c93f0be7a7c1c9"}, "downloads"=>-1, "filename"=>"ocrd_segment-0.0.2.tar.gz", "has_sig"=>false, "md5_digest"=>"6b258735d218ef459887c4d8d23382c7", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>10335, "upload_time"=>"2019-12-02T11:50:34", "upload_time_iso_8601"=>"2019-12-02T11:50:34.482743Z", "url"=>"https://files.pythonhosted.org/packages/d0/e8/ab967b490f8cc4f70438b278530042a4eb5a9237941cd084fece279cb507/ocrd_segment-0.0.2.tar.gz"}]}, "urls"=>[{"comment_text"=>"", "digests"=>{"md5"=>"e9bc6112469e53afd56563d862000228", "sha256"=>"9b549066f46f26a147b726066712a423f9fcf64b8274dd8285447c564f361783"}, "downloads"=>-1, "filename"=>"ocrd_segment-0.0.2-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"e9bc6112469e53afd56563d862000228", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>14529, "upload_time"=>"2019-12-02T11:50:29", "upload_time_iso_8601"=>"2019-12-02T11:50:29.761485Z", "url"=>"https://files.pythonhosted.org/packages/90/34/4825c12fa6e8238ce350fc766f6aaa0d591705c8f426160eb59ec7513541/ocrd_segment-0.0.2-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"6b258735d218ef459887c4d8d23382c7", "sha256"=>"284557d2fd985bf4be93b4bbbe08ba3fc2668300f5c9694af6c93f0be7a7c1c9"}, "downloads"=>-1, "filename"=>"ocrd_segment-0.0.2.tar.gz", "has_sig"=>false, "md5_digest"=>"6b258735d218ef459887c4d8d23382c7", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>10335, "upload_time"=>"2019-12-02T11:50:34", "upload_time_iso_8601"=>"2019-12-02T11:50:34.482743Z", "url"=>"https://files.pythonhosted.org/packages/d0/e8/ab967b490f8cc4f70438b278530042a4eb5a9237941cd084fece279cb507/ocrd_segment-0.0.2.tar.gz"}]}, "url"=>"https://github.com/OCR-D/ocrd_segment"}, "url"=>"https://github.com/OCR-D/ocrd_segment"}

ocrd_tesserocr

{"compliant_cli"=>true, "files"=>{"Dockerfile"=>"FROM ocrd/core\nMAINTAINER OCR-D\nENV DEBIAN_FRONTEND noninteractive\nENV PYTHONIOENCODING utf8\n\nWORKDIR /build-ocrd\nCOPY setup.py .\nCOPY README.md .\nCOPY requirements.txt .\nCOPY requirements_test.txt .\nCOPY ocrd_tesserocr ./ocrd_tesserocr\nCOPY Makefile .\nRUN make deps-ubuntu && \\\n    apt-get install -y --no-install-recommends \\\n    g++ \\\n    tesseract-ocr-script-frak \\\n    tesseract-ocr-deu \\\n    && make deps install \\\n    && rm -rf /build-ocrd \\\n    && apt-get -y remove --auto-remove g++ libtesseract-dev make\n", "README.md"=>"# ocrd_tesserocr\n\n> Crop, deskew, segment into regions / tables / lines / words, or recognize with tesserocr\n\n[![image](https://circleci.com/gh/OCR-D/ocrd_tesserocr.svg?style=svg)](https://circleci.com/gh/OCR-D/ocrd_tesserocr)\n[![image](https://img.shields.io/pypi/v/ocrd_tesserocr.svg)](https://pypi.org/project/ocrd_tesserocr/)\n[![image](https://codecov.io/gh/OCR-D/ocrd_tesserocr/branch/master/graph/badge.svg)](https://codecov.io/gh/OCR-D/ocrd_tesserocr)\n[![Docker Automated build](https://img.shields.io/docker/automated/ocrd/tesserocr.svg)](https://hub.docker.com/r/ocrd/tesserocr/tags/)\n\n## Introduction\n\nThis offers [OCR-D](https://ocr-d.github.io) compliant workspace processors for (much of) the functionality of [Tesseract](https://github.com/tesseract-ocr) via its Python API wrapper [tesserocr](https://github.com/sirfz/tesserocr) . (Each processor is a step in the OCR-D functional model, and can be replaced with an alternative implementation. Data is represented within METS/PAGE.)\n\nThis includes image preprocessing (cropping, binarization, deskewing), layout analysis (region, table, line, word segmentation) and OCR proper. Most processors can operate on different levels of the PAGE hierarchy, depending on the workflow configuration. Image results are referenced (read and written) via `AlternativeImage`, text results via `TextEquiv`, deskewing via `@orientation`, cropping via `Border` and segmentation via `Region` / `TextLine` / `Word` elements with `Coords/@points`.\n\n## Installation\n\n### Required ubuntu packages:\n\n- Tesseract headers (`libtesseract-dev`)\n- Some Tesseract language models (`tesseract-ocr-{eng,deu,frk,...}` or script models (`tesseract-ocr-script-{latn,frak,...}`)\n- Leptonica headers (`libleptonica-dev`)\n\n### From PyPI\n\nThis is the best option if you want to use the stable, released version.\n\n---\n\n**NOTE**\n\nocrd_tesserocr requires **Tesseract >= 4.1.0**. The Tesseract packages\nbundled with **Ubuntu < 19.10** are too old. If you are on Ubuntu 18.04 LTS,\nplease enable [Alexander Pozdnyakov PPA](https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr) which\nhas up-to-date builds of Tesseract and its dependencies:\n\n```sh\nsudo add-apt-repository ppa:alex-p/tesseract-ocr\nsudo apt-get update\n```\n\n---\n\n```sh\nsudo apt-get install git python3 python3-pip libtesseract-dev libleptonica-dev tesseract-ocr-eng tesseract-ocr wget\npip install ocrd_tesserocr\n```\n\n### With docker\n\nThis is the best option if you want to run the software in a container.\n\nYou need to have [Docker](https://docs.docker.com/install/linux/docker-ce/ubuntu/)\n\n```sh\ndocker pull ocrd/tesserocr\n```\n\nTo run with docker:\n\n```\ndocker run -v path/to/workspaces:/data ocrd/tesserocr ocrd-tesserocrd-crop ...\n```\n\n\n### From git \n\nThis is the best option if you want to change the source code or install the latest, unpublished changes.\n\nWe strongly recommend to use [venv](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).\n\n```sh\ngit clone https://github.com/OCR-D/ocrd_tesserocr\ncd ocrd_tesserocr\nsudo make deps-ubuntu # or manually with apt-get\nmake deps        # or pip install -r requirements\nmake install     # or pip install .\n```\n\n## Usage\n\nSee docstrings and in the individual processors and [ocrd-tool.json](ocrd_tesserocr/ocrd-tool.json) descriptions.\n\nAvailable processors are:\n\n- [ocrd-tesserocr-crop](ocrd_tesserocr/crop.py)\n- [ocrd-tesserocr-deskew](ocrd_tesserocr/deskew.py)\n- [ocrd-tesserocr-binarize](ocrd_tesserocr/binarize.py)\n- [ocrd-tesserocr-segment-region](ocrd_tesserocr/segment_region.py)\n- [ocrd-tesserocr-segment-table](ocrd_tesserocr/segment_table.py)\n- [ocrd-tesserocr-segment-line](ocrd_tesserocr/segment_line.py)\n- [ocrd-tesserocr-segment-word](ocrd_tesserocr/segment_word.py)\n- [ocrd-tesserocr-recognize](ocrd_tesserocr/recognize.py)\n\n## Testing\n\n```sh\nmake test\n```\n\nThis downloads some test data from https://github.com/OCR-D/assets under `repo/assets`, and runs some basic test of the Python API as well as the CLIs.\n\nSet `PYTEST_ARGS=\"-s --verbose\"` to see log output (`-s`) and individual test results (`--verbose`).\n", "ocrd-tool.json"=>"{\n  \"version\": \"0.8.0\",\n  \"git_url\": \"https://github.com/OCR-D/ocrd_tesserocr\",\n  \"dockerhub\": \"ocrd/tesserocr\",\n  \"tools\": {\n    \"ocrd-tesserocr-deskew\": {\n      \"executable\": \"ocrd-tesserocr-deskew\",\n      \"categories\": [\"Image preprocessing\"],\n      \"description\": \"Detect script, orientation and skew angle for pages or regions\",\n      \"input_file_grp\": [\n        \"OCR-D-IMG\",\n        \"OCR-D-SEG-BLOCK\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-DESKEW-BLOCK\"\n      ],\n      \"steps\": [\"preprocessing/optimization/deskewing\"],\n      \"parameters\": {\n        \"dpi\": {\n          \"type\": \"number\",\n          \"format\": \"float\",\n          \"description\": \"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative\",\n          \"default\": -1\n        },\n        \"operation_level\": {\n          \"type\": \"string\",\n          \"enum\": [\"page\",\"region\"],\n          \"default\": \"region\",\n          \"description\": \"PAGE XML hierarchy level to operate on\"\n        },\n        \"min_orientation_confidence\": {\n          \"type\": \"number\",\n          \"format\": \"float\",\n          \"default\": 1.5,\n          \"description\": \"Minimum confidence score to apply orientation as detected by OSD\"\n        }\n      }\n    },\n    \"ocrd-tesserocr-recognize\": {\n      \"executable\": \"ocrd-tesserocr-recognize\",\n      \"categories\": [\"Text recognition and optimization\"],\n      \"description\": \"Recognize text in lines with Tesseract (using annotated derived images, or masking and cropping images from coordinate polygons)\",\n      \"input_file_grp\": [\n        \"OCR-D-SEG-BLOCK\",\n        \"OCR-D-SEG-LINE\",\n        \"OCR-D-SEG-WORD\",\n        \"OCR-D-SEG-GLYPH\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-OCR-TESS\"\n      ],\n      \"steps\": [\"recognition/text-recognition\"],\n      \"parameters\": {\n        \"dpi\": {\n          \"type\": \"number\",\n          \"format\": \"float\",\n          \"description\": \"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative\",\n          \"default\": -1\n        },\n        \"textequiv_level\": {\n          \"type\": \"string\",\n          \"enum\": [\"region\", \"line\", \"word\", \"glyph\"],\n          \"default\": \"word\",\n          \"description\": \"Lowest PAGE XML hierarchy level to add the TextEquiv results to; when below `region`, implicitly adds segmentation below the line level, but requires existing line segmentation\"\n        },\n        \"overwrite_words\": {\n          \"type\": \"boolean\",\n          \"default\": false,\n          \"description\": \"Remove existing layout and text annotation below the TextLine level (regardless of textequiv_level).\"\n        },\n        \"raw_lines\": {\n          \"type\": \"boolean\",\n          \"default\": false,\n          \"description\": \"Do not attempt additional segmentation (baseline+xheight+ascenders/descenders prediction) when using line images (i.e. when textequiv_level<region). Can increase accuracy for certain workflows. Disable when line segments/images may contain components of more than 1 line, or larger gaps/white-spaces.\"\n        },\n        \"char_whitelist\": {\n          \"type\": \"string\",\n          \"default\": \"\",\n          \"description\": \"Enumeration of character hypotheses (from the model) to allow exclusively; overruled by blacklist if set.\"\n        },\n        \"char_blacklist\": {\n          \"type\": \"string\",\n          \"default\": \"\",\n          \"description\": \"Enumeration of character hypotheses (from the model) to suppress; overruled by unblacklist if set.\"\n        },\n        \"char_unblacklist\": {\n          \"type\": \"string\",\n          \"default\": \"\",\n          \"description\": \"Enumeration of character hypotheses (from the model) to allow inclusively.\"\n        },\n        \"model\": {\n          \"type\": \"string\",\n          \"description\": \"tessdata model to apply (an ISO 639-3 language specification or some other basename, e.g. deu-frak or Fraktur)\"\n        }\n      }\n    },\n     \"ocrd-tesserocr-segment-region\": {\n      \"executable\": \"ocrd-tesserocr-segment-region\",\n      \"categories\": [\"Layout analysis\"],\n      \"description\": \"Segment page into regions with Tesseract\",\n      \"input_file_grp\": [\n        \"OCR-D-IMG\",\n        \"OCR-D-SEG-PAGE\",\n        \"OCR-D-GT-SEG-PAGE\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-SEG-BLOCK\"\n      ],\n      \"steps\": [\"layout/segmentation/region\"],\n      \"parameters\": {\n        \"dpi\": {\n          \"type\": \"number\",\n          \"format\": \"float\",\n          \"description\": \"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative\",\n          \"default\": -1\n        },\n        \"overwrite_regions\": {\n          \"type\": \"boolean\",\n          \"default\": true,\n          \"description\": \"remove existing layout and text annotation below the Page level\"\n        },\n        \"padding\": {\n          \"type\": \"number\",\n          \"format\": \"integer\",\n          \"description\": \"extend detected region rectangles by this many (true) pixels\",\n          \"default\": 0\n        },\n        \"crop_polygons\": {\n          \"type\": \"boolean\",\n          \"default\": false,\n          \"description\": \"annotate polygon coordinates instead of bounding box rectangles\"\n        },\n        \"find_tables\": {\n          \"type\": \"boolean\",\n          \"default\": true,\n          \"description\": \"recognise tables as table regions (textord_tabfind_find_tables)\"\n        }\n      }\n    },\n     \"ocrd-tesserocr-segment-table\": {\n      \"executable\": \"ocrd-tesserocr-segment-table\",\n      \"categories\": [\"Layout analysis\"],\n      \"description\": \"Segment table regions into cell text regions with Tesseract\",\n      \"input_file_grp\": [\n        \"OCR-D-SEG-BLOCK\",\n        \"OCR-D-GT-SEG-BLOCK\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-SEG-BLOCK\"\n      ],\n      \"steps\": [\"layout/segmentation/region\"],\n      \"parameters\": {\n        \"dpi\": {\n          \"type\": \"number\",\n          \"format\": \"float\",\n          \"description\": \"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative\",\n          \"default\": -1\n        },\n        \"overwrite_regions\": {\n          \"type\": \"boolean\",\n          \"default\": true,\n          \"description\": \"remove existing layout and text annotation below the region level\"\n        }\n      }\n     },\n     \"ocrd-tesserocr-segment-line\": {\n      \"executable\": \"ocrd-tesserocr-segment-line\",\n      \"categories\": [\"Layout analysis\"],\n      \"description\": \"Segment regions into lines with Tesseract\",\n      \"input_file_grp\": [\n        \"OCR-D-SEG-BLOCK\",\n        \"OCR-D-GT-SEG-BLOCK\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-SEG-LINE\"\n      ],\n      \"steps\": [\"layout/segmentation/line\"],\n      \"parameters\": {\n        \"dpi\": {\n          \"type\": \"number\",\n          \"format\": \"float\",\n          \"description\": \"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative\",\n          \"default\": -1\n        },\n        \"overwrite_lines\": {\n          \"type\": \"boolean\",\n          \"default\": true,\n          \"description\": \"remove existing layout and text annotation below the TextRegion level\"\n        }\n      }\n    },\n    \"ocrd-tesserocr-segment-word\": {\n      \"executable\": \"ocrd-tesserocr-segment-word\",\n      \"categories\": [\"Layout analysis\"],\n      \"description\": \"Segment lines into words with Tesseract\",\n      \"input_file_grp\": [\n        \"OCR-D-SEG-LINE\",\n        \"OCR-D-GT-SEG-LINE\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-SEG-WORD\"\n      ],\n      \"steps\": [\"layout/segmentation/word\"],\n      \"parameters\": {\n        \"dpi\": {\n          \"type\": \"number\",\n          \"format\": \"float\",\n          \"description\": \"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative\",\n          \"default\": -1\n        },\n        \"overwrite_words\": {\n          \"type\": \"boolean\",\n          \"default\": true,\n          \"description\": \"remove existing layout and text annotation below the TextLine level\"\n        }\n      }\n    },\n    \"ocrd-tesserocr-crop\": {\n      \"executable\": \"ocrd-tesserocr-crop\",\n      \"categories\": [\"Image preprocessing\"],\n      \"description\": \"Poor man's cropping via region segmentation\",\n      \"input_file_grp\": [\n\t\"OCR-D-IMG\"\n      ],\n      \"output_file_grp\": [\n\t\"OCR-D-SEG-PAGE\"\n      ],\n      \"steps\": [\"preprocessing/optimization/cropping\"],\n      \"parameters\" : {\n        \"dpi\": {\n          \"type\": \"number\",\n          \"format\": \"float\",\n          \"description\": \"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative\",\n          \"default\": -1\n        },\n        \"padding\": {\n          \"type\": \"number\",\n          \"format\": \"integer\",\n          \"description\": \"extend detected border by this many (true) pixels on every side\",\n          \"default\": 4\n        }\n      }\n    },\n    \"ocrd-tesserocr-binarize\": {\n      \"executable\": \"ocrd-tesserocr-binarize\",\n      \"categories\": [\"Image preprocessing\"],\n      \"description\": \"Binarize regions or lines with Tesseract's global Otsu\",\n      \"input_file_grp\": [\n        \"OCR-D-IMG\",\n        \"OCR-D-SEG-BLOCK\",\n        \"OCR-D-SEG-LINE\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-BIN-BLOCK\",\n        \"OCR-D-BIN-LINE\"\n      ],\n      \"steps\": [\"preprocessing/optimization/binarization\"],\n      \"parameters\": {\n        \"operation_level\": {\n          \"type\": \"string\",\n          \"enum\": [\"region\", \"line\"],\n          \"default\": \"region\",\n          \"description\": \"PAGE XML hierarchy level to operate on\"\n        }\n      }\n    }\n  }\n}\n", "setup.py"=>"# -*- coding: utf-8 -*-\n\"\"\"\nInstalls five executables:\n\n    - ocrd_tesserocr_recognize\n    - ocrd_tesserocr_segment_region\n    - ocrd_tesserocr_segment_table\n    - ocrd_tesserocr_segment_line\n    - ocrd_tesserocr_segment_word\n    - ocrd_tesserocr_crop\n    - ocrd_tesserocr_deskew\n    - ocrd_tesserocr_binarize\n\"\"\"\nimport codecs\n\nfrom setuptools import setup, find_packages\n\nsetup(\n    name='ocrd_tesserocr',\n    version='0.8.0',\n    description='Tesserocr bindings',\n    long_description=codecs.open('README.md', encoding='utf-8').read(),\n    long_description_content_type='text/markdown',\n    author='Konstantin Baierer, Kay-Michael Würzner, Robert Sachunsky',\n    author_email='unixprog@gmail.com, wuerzner@gmail.com, sachunsky@informatik.uni-leipzig.de',\n    url='https://github.com/OCR-D/ocrd_tesserocr',\n    license='Apache License 2.0',\n    packages=find_packages(exclude=('tests', 'docs')),\n    install_requires=open('requirements.txt').read().split('\\n'),\n    package_data={\n        '': ['*.json', '*.yml', '*.yaml'],\n    },\n    entry_points={\n        'console_scripts': [\n            'ocrd-tesserocr-recognize=ocrd_tesserocr.cli:ocrd_tesserocr_recognize',\n            'ocrd-tesserocr-segment-region=ocrd_tesserocr.cli:ocrd_tesserocr_segment_region',\n            'ocrd-tesserocr-segment-table=ocrd_tesserocr.cli:ocrd_tesserocr_segment_table',\n            'ocrd-tesserocr-segment-line=ocrd_tesserocr.cli:ocrd_tesserocr_segment_line',\n            'ocrd-tesserocr-segment-word=ocrd_tesserocr.cli:ocrd_tesserocr_segment_word',\n            'ocrd-tesserocr-crop=ocrd_tesserocr.cli:ocrd_tesserocr_crop',\n            'ocrd-tesserocr-deskew=ocrd_tesserocr.cli:ocrd_tesserocr_deskew',\n            'ocrd-tesserocr-binarize=ocrd_tesserocr.cli:ocrd_tesserocr_binarize',\n        ]\n    },\n)\n"}, "git"=>{"last_commit"=>"Fri Jan 24 15:20:03 2020 +0100", "latest_tag"=>"v0.8.0", "number_of_commits"=>"334", "url"=>"https://github.com/OCR-D/ocrd_tesserocr.git"}, "name"=>"ocrd_tesserocr", "ocrd_tool"=>{"dockerhub"=>"ocrd/tesserocr", "git_url"=>"https://github.com/OCR-D/ocrd_tesserocr", "tools"=>{"ocrd-tesserocr-binarize"=>{"categories"=>["Image preprocessing"], "description"=>"Binarize regions or lines with Tesseract's global Otsu", "executable"=>"ocrd-tesserocr-binarize", "input_file_grp"=>["OCR-D-IMG", "OCR-D-SEG-BLOCK", "OCR-D-SEG-LINE"], "output_file_grp"=>["OCR-D-BIN-BLOCK", "OCR-D-BIN-LINE"], "parameters"=>{"operation_level"=>{"default"=>"region", "description"=>"PAGE XML hierarchy level to operate on", "enum"=>["region", "line"], "type"=>"string"}}, "steps"=>["preprocessing/optimization/binarization"]}, "ocrd-tesserocr-crop"=>{"categories"=>["Image preprocessing"], "description"=>"Poor man's cropping via region segmentation", "executable"=>"ocrd-tesserocr-crop", "input_file_grp"=>["OCR-D-IMG"], "output_file_grp"=>["OCR-D-SEG-PAGE"], "parameters"=>{"dpi"=>{"default"=>-1, "description"=>"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative", "format"=>"float", "type"=>"number"}, "padding"=>{"default"=>4, "description"=>"extend detected border by this many (true) pixels on every side", "format"=>"integer", "type"=>"number"}}, "steps"=>["preprocessing/optimization/cropping"]}, "ocrd-tesserocr-deskew"=>{"categories"=>["Image preprocessing"], "description"=>"Detect script, orientation and skew angle for pages or regions", "executable"=>"ocrd-tesserocr-deskew", "input_file_grp"=>["OCR-D-IMG", "OCR-D-SEG-BLOCK"], "output_file_grp"=>["OCR-D-DESKEW-BLOCK"], "parameters"=>{"dpi"=>{"default"=>-1, "description"=>"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative", "format"=>"float", "type"=>"number"}, "min_orientation_confidence"=>{"default"=>1.5, "description"=>"Minimum confidence score to apply orientation as detected by OSD", "format"=>"float", "type"=>"number"}, "operation_level"=>{"default"=>"region", "description"=>"PAGE XML hierarchy level to operate on", "enum"=>["page", "region"], "type"=>"string"}}, "steps"=>["preprocessing/optimization/deskewing"]}, "ocrd-tesserocr-recognize"=>{"categories"=>["Text recognition and optimization"], "description"=>"Recognize text in lines with Tesseract (using annotated derived images, or masking and cropping images from coordinate polygons)", "executable"=>"ocrd-tesserocr-recognize", "input_file_grp"=>["OCR-D-SEG-BLOCK", "OCR-D-SEG-LINE", "OCR-D-SEG-WORD", "OCR-D-SEG-GLYPH"], "output_file_grp"=>["OCR-D-OCR-TESS"], "parameters"=>{"char_blacklist"=>{"default"=>"", "description"=>"Enumeration of character hypotheses (from the model) to suppress; overruled by unblacklist if set.", "type"=>"string"}, "char_unblacklist"=>{"default"=>"", "description"=>"Enumeration of character hypotheses (from the model) to allow inclusively.", "type"=>"string"}, "char_whitelist"=>{"default"=>"", "description"=>"Enumeration of character hypotheses (from the model) to allow exclusively; overruled by blacklist if set.", "type"=>"string"}, "dpi"=>{"default"=>-1, "description"=>"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative", "format"=>"float", "type"=>"number"}, "model"=>{"description"=>"tessdata model to apply (an ISO 639-3 language specification or some other basename, e.g. deu-frak or Fraktur)", "type"=>"string"}, "overwrite_words"=>{"default"=>false, "description"=>"Remove existing layout and text annotation below the TextLine level (regardless of textequiv_level).", "type"=>"boolean"}, "raw_lines"=>{"default"=>false, "description"=>"Do not attempt additional segmentation (baseline+xheight+ascenders/descenders prediction) when using line images (i.e. when textequiv_level<region). Can increase accuracy for certain workflows. Disable when line segments/images may contain components of more than 1 line, or larger gaps/white-spaces.", "type"=>"boolean"}, "textequiv_level"=>{"default"=>"word", "description"=>"Lowest PAGE XML hierarchy level to add the TextEquiv results to; when below `region`, implicitly adds segmentation below the line level, but requires existing line segmentation", "enum"=>["region", "line", "word", "glyph"], "type"=>"string"}}, "steps"=>["recognition/text-recognition"]}, "ocrd-tesserocr-segment-line"=>{"categories"=>["Layout analysis"], "description"=>"Segment regions into lines with Tesseract", "executable"=>"ocrd-tesserocr-segment-line", "input_file_grp"=>["OCR-D-SEG-BLOCK", "OCR-D-GT-SEG-BLOCK"], "output_file_grp"=>["OCR-D-SEG-LINE"], "parameters"=>{"dpi"=>{"default"=>-1, "description"=>"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative", "format"=>"float", "type"=>"number"}, "overwrite_lines"=>{"default"=>true, "description"=>"remove existing layout and text annotation below the TextRegion level", "type"=>"boolean"}}, "steps"=>["layout/segmentation/line"]}, "ocrd-tesserocr-segment-region"=>{"categories"=>["Layout analysis"], "description"=>"Segment page into regions with Tesseract", "executable"=>"ocrd-tesserocr-segment-region", "input_file_grp"=>["OCR-D-IMG", "OCR-D-SEG-PAGE", "OCR-D-GT-SEG-PAGE"], "output_file_grp"=>["OCR-D-SEG-BLOCK"], "parameters"=>{"crop_polygons"=>{"default"=>false, "description"=>"annotate polygon coordinates instead of bounding box rectangles", "type"=>"boolean"}, "dpi"=>{"default"=>-1, "description"=>"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative", "format"=>"float", "type"=>"number"}, "find_tables"=>{"default"=>true, "description"=>"recognise tables as table regions (textord_tabfind_find_tables)", "type"=>"boolean"}, "overwrite_regions"=>{"default"=>true, "description"=>"remove existing layout and text annotation below the Page level", "type"=>"boolean"}, "padding"=>{"default"=>0, "description"=>"extend detected region rectangles by this many (true) pixels", "format"=>"integer", "type"=>"number"}}, "steps"=>["layout/segmentation/region"]}, "ocrd-tesserocr-segment-table"=>{"categories"=>["Layout analysis"], "description"=>"Segment table regions into cell text regions with Tesseract", "executable"=>"ocrd-tesserocr-segment-table", "input_file_grp"=>["OCR-D-SEG-BLOCK", "OCR-D-GT-SEG-BLOCK"], "output_file_grp"=>["OCR-D-SEG-BLOCK"], "parameters"=>{"dpi"=>{"default"=>-1, "description"=>"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative", "format"=>"float", "type"=>"number"}, "overwrite_regions"=>{"default"=>true, "description"=>"remove existing layout and text annotation below the region level", "type"=>"boolean"}}, "steps"=>["layout/segmentation/region"]}, "ocrd-tesserocr-segment-word"=>{"categories"=>["Layout analysis"], "description"=>"Segment lines into words with Tesseract", "executable"=>"ocrd-tesserocr-segment-word", "input_file_grp"=>["OCR-D-SEG-LINE", "OCR-D-GT-SEG-LINE"], "output_file_grp"=>["OCR-D-SEG-WORD"], "parameters"=>{"dpi"=>{"default"=>-1, "description"=>"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative", "format"=>"float", "type"=>"number"}, "overwrite_words"=>{"default"=>true, "description"=>"remove existing layout and text annotation below the TextLine level", "type"=>"boolean"}}, "steps"=>["layout/segmentation/word"]}}, "version"=>"0.8.0"}, "ocrd_tool_validate"=>"<report valid=\"true\">\n</report>", "official"=>true, "org_plus_name"=>"OCR-D/ocrd_tesserocr", "python"=>{"author"=>"Konstantin Baierer, Kay-Michael Würzner, Robert Sachunsky", "author-email"=>"unixprog@gmail.com, wuerzner@gmail.com, sachunsky@informatik.uni-leipzig.de", "name"=>"ocrd_tesserocr", "pypi"=>{"info"=>{"author"=>"Konstantin Baierer, Kay-Michael Würzner, Robert Sachunsky", "author_email"=>"unixprog@gmail.com, wuerzner@gmail.com, sachunsky@informatik.uni-leipzig.de", "bugtrack_url"=>nil, "classifiers"=>[], "description"=>"# ocrd_tesserocr\n\n> Crop, deskew, segment into regions / lines / words, or recognize with tesserocr\n\n[![image](https://circleci.com/gh/OCR-D/ocrd_tesserocr.svg?style=svg)](https://circleci.com/gh/OCR-D/ocrd_tesserocr)\n[![image](https://img.shields.io/pypi/v/ocrd_tesserocr.svg)](https://pypi.org/project/ocrd_tesserocr/)\n[![image](https://codecov.io/gh/OCR-D/ocrd_tesserocr/branch/master/graph/badge.svg)](https://codecov.io/gh/OCR-D/ocrd_tesserocr)\n[![Docker Automated build](https://img.shields.io/docker/automated/ocrd/tesserocr.svg)](https://hub.docker.com/r/ocrd/tesserocr/tags/)\n\n## Introduction\n\nThis offers [OCR-D](https://ocr-d.github.io) compliant workspace processors for (much of) the functionality of [Tesseract](https://github.com/tesseract-ocr) via its Python API wrapper [tesserocr](https://github.com/sirfz/tesserocr) . (Each processor is a step in the OCR-D functional model, and can be replaced with an alternative implementation. Data is represented within METS/PAGE.)\n\nThis includes image preprocessing (cropping, binarization, deskewing), layout analysis (region, line, word segmentation) and OCR proper. Most processors can operate on different levels of the PAGE hierarchy, depending on the workflow configuration. Image results are referenced (read and written) via `AlternativeImage`, text results via `TextEquiv`, deskewing via `@orientation`, cropping via `Border` and segmentation via `Region` / `TextLine` / `Word` elements with `Coords/@points`.\n\n## Installation\n\n### Required ubuntu packages:\n\n- Tesseract headers (`libtesseract-dev`)\n- Some tesseract language models (`tesseract-ocr-{eng,deu,frk,...}` or script models (`tesseract-ocr-script-{latn,frak,...}`)\n- Leptonica headers (`libleptonica-dev`)\n\n### From PyPI\n\nThis is the best option if you want to use the stable, released version.\n\n---\n\n**NOTE**\n\nocrd_tesserocr requires **Tesseract >= 4.1.0**. The Tesseract packages\nbundled with **Ubuntu < 19.10** are too old. If you are on Ubuntu 18.04 LTS,\nplease enable [Alexander Pozdnyakov PPA](https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr) which\nhas up-to-date builds of Tesseract and its dependencies:\n\n```sh\nsudo add-apt-repository ppa:alex-p/tesseract-ocr\nsudo apt-get update\n```\n\n---\n\n```sh\nsudo apt-get install git python3 python3-pip libtesseract-dev libleptonica-dev tesseract-ocr-eng tesseract-ocr wget\npip install ocrd_tesserocr\n```\n\n### With docker\n\nThis is the best option if you want to run the software in a container.\n\nYou need to have [Docker](https://docs.docker.com/install/linux/docker-ce/ubuntu/)\n\n```sh\ndocker pull ocrd/tesserocr\n```\n\n### From git \n\nThis is the best option if you want to change the source code or install the latest, unpublished changes.\n\nWe strongly recommend to use [venv](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).\n\n```sh\ngit clone https://github.com/OCR-D/ocrd_tesserocr\ncd ocrd_tesserocr\nmake deps-ubuntu # or manually with apt-get\nmake deps        # or pip install -r requirements\nmake install     # or pip install .\n```\n\n## Usage\n\nSee docstrings and in the individual processors and [ocrd-tool.json](ocrd_tesserocr/ocrd-tool.json) descriptions.\n\nAvailable processors are:\n\n- [ocrd-tesserocr-crop](ocrd_tesserocr/crop.py)\n- [ocrd-tesserocr-deskew](ocrd_tesserocr/deskew.py)\n- [ocrd-tesserocr-binarize](ocrd_tesserocr/binarize.py)\n- [ocrd-tesserocr-segment-region](ocrd_tesserocr/segment_region.py)\n- [ocrd-tesserocr-segment-line](ocrd_tesserocr/segment_line.py)\n- [ocrd-tesserocr-segment-word](ocrd_tesserocr/segment_word.py)\n- [ocrd-tesserocr-recognize](ocrd_tesserocr/recognize.py)\n\n## Testing\n\nTo run with docker:\n\n```\ndocker run ocrd/tesserocr ocrd-tesserocrd-crop ...\n```\n\n## Testing\n\n```sh\nmake test\n```\n\nThis downloads some test data from https://github.com/OCR-D/assets under `repo/assets`, and runs some basic test of the Python API as well as the CLIs.\n\nSet `PYTEST_ARGS=\"-s --verbose\"` to see log output (`-s`) and individual test results (`--verbose`).\n\n## Development\n\nLatest changes that require pre-release of [ocrd >= 2.0.0](https://github.com/OCR-D/core/tree/edge) are kept in branch [`edge`](https://github.com/OCR-D/ocrd_tesserocr/tree/edge).\n\n\n", "description_content_type"=>"text/markdown", "docs_url"=>nil, "download_url"=>"", "downloads"=>{"last_day"=>-1, "last_month"=>-1, "last_week"=>-1}, "home_page"=>"https://github.com/OCR-D/ocrd_tesserocr", "keywords"=>"", "license"=>"Apache License 2.0", "maintainer"=>"", "maintainer_email"=>"", "name"=>"ocrd-tesserocr", "package_url"=>"https://pypi.org/project/ocrd-tesserocr/", "platform"=>"", "project_url"=>"https://pypi.org/project/ocrd-tesserocr/", "project_urls"=>{"Homepage"=>"https://github.com/OCR-D/ocrd_tesserocr"}, "release_url"=>"https://pypi.org/project/ocrd-tesserocr/0.7.0/", "requires_dist"=>["ocrd (>=2.0.0)", "click", "tesserocr (>=2.4.1)"], "requires_python"=>"", "summary"=>"Tesserocr bindings", "version"=>"0.7.0"}, "last_serial"=>6506849, "releases"=>{"0.1.0"=>[{"comment_text"=>"", "digests"=>{"md5"=>"e12ea0e2f580c6e152d334c470029dc2", "sha256"=>"64ec4e7a43ddaf199af7da8966996e260454dae4d30f79cb112149cddf5b8fd2"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.1.0-py2-none-any.whl", "has_sig"=>false, "md5_digest"=>"e12ea0e2f580c6e152d334c470029dc2", "packagetype"=>"bdist_wheel", "python_version"=>"py2", "requires_python"=>nil, "size"=>17089, "upload_time"=>"2018-08-31T14:13:24", "upload_time_iso_8601"=>"2018-08-31T14:13:24.592860Z", "url"=>"https://files.pythonhosted.org/packages/07/63/e617002f9c2013f8a9ce10baeab48acffc0dff3d21ab160ee67428e08ebd/ocrd_tesserocr-0.1.0-py2-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"ad528712e13eecf578b236a7ab8457cd", "sha256"=>"b2a7fd61a97bb222f2ac5a6f85b3d2ce43da843509993eef189f09b48f44027f"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.1.0-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"ad528712e13eecf578b236a7ab8457cd", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>15424, "upload_time"=>"2018-08-31T14:13:25", "upload_time_iso_8601"=>"2018-08-31T14:13:25.913866Z", "url"=>"https://files.pythonhosted.org/packages/4d/48/282d1d793137f1ec30118a9a0bd48534a6a8053bc74a830b6c4eb389653f/ocrd_tesserocr-0.1.0-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"d45fa7a24f23d22313e4314df42cf984", "sha256"=>"3fecd0a93d9a711552fbd2cf15af1f150f04f503f7b3f09d9c025267601bb42d"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.1.0.tar.gz", "has_sig"=>false, "md5_digest"=>"d45fa7a24f23d22313e4314df42cf984", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>9234, "upload_time"=>"2018-08-31T14:13:27", "upload_time_iso_8601"=>"2018-08-31T14:13:27.040863Z", "url"=>"https://files.pythonhosted.org/packages/eb/a7/66775daafba5937821fd643b6d1069570b262af3a48d701712d2a94350a2/ocrd_tesserocr-0.1.0.tar.gz"}], "0.1.1"=>[{"comment_text"=>"", "digests"=>{"md5"=>"fab719d99117d974ca24e63cdf6af83e", "sha256"=>"d474e372af4266ab4343570c47a448f9f68b3c002f970717663b64acabe1dbe4"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.1.1-py2-none-any.whl", "has_sig"=>false, "md5_digest"=>"fab719d99117d974ca24e63cdf6af83e", "packagetype"=>"bdist_wheel", "python_version"=>"py2", "requires_python"=>nil, "size"=>15461, "upload_time"=>"2018-08-31T14:18:51", "upload_time_iso_8601"=>"2018-08-31T14:18:51.905308Z", "url"=>"https://files.pythonhosted.org/packages/5c/95/7f29b87ff5be4fdd149400855862840de4681b669d3fda60a2ce8bf24127/ocrd_tesserocr-0.1.1-py2-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"cfef79e48dc96f865deff1b89fa28aa6", "sha256"=>"3c0f56fc2c88ec1ea2461eb0610763443b9af279c5260b08a1be079c92bed5c6"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.1.1-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"cfef79e48dc96f865deff1b89fa28aa6", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>15461, "upload_time"=>"2018-08-31T14:18:53", "upload_time_iso_8601"=>"2018-08-31T14:18:53.535866Z", "url"=>"https://files.pythonhosted.org/packages/da/23/fb5e1e125f1fda3b1069960426c5b40a9c5e12fe8f73ac29244888cf110b/ocrd_tesserocr-0.1.1-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"0dbecd3bc62199f7294a039c4c8557c3", "sha256"=>"2de460c4d3218ac6e3133b498c01ee7428770edcd60a02f65793ae4006f3db82"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.1.1.tar.gz", "has_sig"=>false, "md5_digest"=>"0dbecd3bc62199f7294a039c4c8557c3", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>9251, "upload_time"=>"2018-08-31T14:18:54", "upload_time_iso_8601"=>"2018-08-31T14:18:54.917641Z", "url"=>"https://files.pythonhosted.org/packages/31/73/c2044ae57f402e21947ceb97f574625cf534eccbf432f6916c419cf3d7e7/ocrd_tesserocr-0.1.1.tar.gz"}], "0.1.2"=>[{"comment_text"=>"", "digests"=>{"md5"=>"215dd5bba309954a15fc1be4919cd018", "sha256"=>"b2409adbb5c529b05eba8be5a9d1c7e11660dc2626bcaf61b407b617d5c7c99e"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.1.2-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"215dd5bba309954a15fc1be4919cd018", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>15453, "upload_time"=>"2018-09-03T13:14:20", "upload_time_iso_8601"=>"2018-09-03T13:14:20.618650Z", "url"=>"https://files.pythonhosted.org/packages/c1/ca/38355a461d8e29d7039391f5051be291d6a425b078783adb1ebb6ba10e55/ocrd_tesserocr-0.1.2-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"b59d049bbfc890edd7a17f3bd596b42a", "sha256"=>"fbde4fc1a5a0340507b6d96bd529a42162e732b7cca31e968b28f6a4fcdccd12"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.1.2.tar.gz", "has_sig"=>false, "md5_digest"=>"b59d049bbfc890edd7a17f3bd596b42a", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>9242, "upload_time"=>"2018-09-03T13:14:21", "upload_time_iso_8601"=>"2018-09-03T13:14:21.805810Z", "url"=>"https://files.pythonhosted.org/packages/1b/fe/b365c2ffddea53e616408f0213e45614ce3791ead2058df33a795ddc3d21/ocrd_tesserocr-0.1.2.tar.gz"}], "0.1.3"=>[{"comment_text"=>"", "digests"=>{"md5"=>"0f69aed68ca01cf1018b35d91227d74a", "sha256"=>"1549fbf8d314dc1f5ea20b45842e971a97b3c276f78d4d167a463432d5b77b18"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.1.3-py2-none-any.whl", "has_sig"=>false, "md5_digest"=>"0f69aed68ca01cf1018b35d91227d74a", "packagetype"=>"bdist_wheel", "python_version"=>"py2", "requires_python"=>nil, "size"=>17420, "upload_time"=>"2019-01-04T13:36:12", "upload_time_iso_8601"=>"2019-01-04T13:36:12.698851Z", "url"=>"https://files.pythonhosted.org/packages/18/7f/fd08ca819e6f3980220ac680b5c931080247544c2704963e518db6f7a3d0/ocrd_tesserocr-0.1.3-py2-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"bbc586d5a04c44b640d7782a84e2de83", "sha256"=>"1648df71d28a9b3388f1e701256037eb9023f149a17a22d0a9c2dec4a0510002"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.1.3-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"bbc586d5a04c44b640d7782a84e2de83", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>15729, "upload_time"=>"2019-01-04T13:36:14", "upload_time_iso_8601"=>"2019-01-04T13:36:14.276437Z", "url"=>"https://files.pythonhosted.org/packages/34/08/ea3ebc9476e1d28672e23b8d1332dbbc95ac9a3246cd7d02be2375995da6/ocrd_tesserocr-0.1.3-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"3f7f434d236449d567213324856c521a", "sha256"=>"6ec1b6c5cb4395f6f4e7356219e7019612fdcda685b511de7171dcaf4f39a439"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.1.3.tar.gz", "has_sig"=>false, "md5_digest"=>"3f7f434d236449d567213324856c521a", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>9442, "upload_time"=>"2019-01-04T13:36:15", "upload_time_iso_8601"=>"2019-01-04T13:36:15.802793Z", "url"=>"https://files.pythonhosted.org/packages/f3/10/d1b3c66b891193ccc07200d93391cbcfe9c4c5ea2bb1cac045e7d1cf1fa6/ocrd_tesserocr-0.1.3.tar.gz"}], "0.2.0"=>[{"comment_text"=>"", "digests"=>{"md5"=>"e5e19ec5b8786ef3ae8b456e8180b3da", "sha256"=>"f61661e4cba7b77336dcabc6117d1e4fa90357ec98f263eacfc2c836e3a477f4"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.2.0-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"e5e19ec5b8786ef3ae8b456e8180b3da", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>16547, "upload_time"=>"2019-02-28T10:12:21", "upload_time_iso_8601"=>"2019-02-28T10:12:21.318896Z", "url"=>"https://files.pythonhosted.org/packages/d1/94/606de830cdba1f81928dc42a71f7e58cc6510d6a8b0f9e945c01f56ee3e7/ocrd_tesserocr-0.2.0-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"9a06170c3773b520b13c9516b0497a33", "sha256"=>"05cc4be3ae1404afd45d8b9278d19fcd6a1ea86d376f52f571fefc4af4d96b86"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.2.0.tar.gz", "has_sig"=>false, "md5_digest"=>"9a06170c3773b520b13c9516b0497a33", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>10356, "upload_time"=>"2019-02-28T10:12:22", "upload_time_iso_8601"=>"2019-02-28T10:12:22.854225Z", "url"=>"https://files.pythonhosted.org/packages/50/1c/eda34c75846857877176db4f4f0564e8b7c979a872e4c2a521fa8c389fbb/ocrd_tesserocr-0.2.0.tar.gz"}], "0.2.1"=>[{"comment_text"=>"", "digests"=>{"md5"=>"43d7c9b609a3d2e27bcb05bd409cebbc", "sha256"=>"fd8c18ce5d170e766bccd34c2214e5de22ea13f795bc79642e8be2414c550f2a"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.2.1-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"43d7c9b609a3d2e27bcb05bd409cebbc", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>15963, "upload_time"=>"2019-04-16T14:58:44", "upload_time_iso_8601"=>"2019-04-16T14:58:44.123075Z", "url"=>"https://files.pythonhosted.org/packages/39/af/10f4d710bde5515131fc16ea3408670af8e786998a1e0f6d127e800fbc17/ocrd_tesserocr-0.2.1-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"b9d79ed8396cc81728525c6e66bc2883", "sha256"=>"40f4776bc548be14245de726e744f827742f02e568f6062cc465d6a585624cae"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.2.1.tar.gz", "has_sig"=>false, "md5_digest"=>"b9d79ed8396cc81728525c6e66bc2883", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>9534, "upload_time"=>"2019-04-16T14:58:45", "upload_time_iso_8601"=>"2019-04-16T14:58:45.820115Z", "url"=>"https://files.pythonhosted.org/packages/df/cc/fd5b999abcae94ff2116a25e31f593b95f0dda4486d89bd4e83d6671b805/ocrd_tesserocr-0.2.1.tar.gz"}], "0.2.2"=>[{"comment_text"=>"", "digests"=>{"md5"=>"df13430385faf1faeb9d8bca34e1ca08", "sha256"=>"7ccdeb2a24f9d93ec6668d02807a4f5fa31d88789a3101ad1fd4ea003128ca65"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.2.2-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"df13430385faf1faeb9d8bca34e1ca08", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>18334, "upload_time"=>"2019-05-20T10:24:06", "upload_time_iso_8601"=>"2019-05-20T10:24:06.855632Z", "url"=>"https://files.pythonhosted.org/packages/4e/5f/37ec32a07681542a1d34fa9764c76ef34d201a82489335d154d34e8b46b2/ocrd_tesserocr-0.2.2-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"d985dfeeedd9946a32e30ec079c3dac3", "sha256"=>"ad96c009bcf39b8f9e99f3e58b736ab385e5683935b9146ed9e39e8e8883b4c2"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.2.2.tar.gz", "has_sig"=>false, "md5_digest"=>"d985dfeeedd9946a32e30ec079c3dac3", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>10990, "upload_time"=>"2019-05-20T10:24:08", "upload_time_iso_8601"=>"2019-05-20T10:24:08.563041Z", "url"=>"https://files.pythonhosted.org/packages/38/53/c0186de6ad8429e6b8e0f5e5ac51a8a3d51a2c71bcb597a5879313bf2a2d/ocrd_tesserocr-0.2.2.tar.gz"}], "0.3.0"=>[{"comment_text"=>"", "digests"=>{"md5"=>"06790327b49f97d4ed656fb842b36511", "sha256"=>"09f23770905034ed00f7cb516a907288512a4d21305914b6e2dd7215b9138c6e"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.3.0-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"06790327b49f97d4ed656fb842b36511", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>34706, "upload_time"=>"2019-08-21T14:42:39", "upload_time_iso_8601"=>"2019-08-21T14:42:39.261053Z", "url"=>"https://files.pythonhosted.org/packages/b2/b5/8a890997a3f874498a1f596f3ebdb765daa181858a46cc5a66949945adf8/ocrd_tesserocr-0.3.0-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"40be922772cb0f0ad188aa4345bbad9a", "sha256"=>"11b6742c4c398ea800d0b17276f0efd8a91ccbd6f0c1df05d7046c3e401a33c8"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.3.0.tar.gz", "has_sig"=>false, "md5_digest"=>"40be922772cb0f0ad188aa4345bbad9a", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>22743, "upload_time"=>"2019-08-21T14:42:40", "upload_time_iso_8601"=>"2019-08-21T14:42:40.918776Z", "url"=>"https://files.pythonhosted.org/packages/f3/fa/10af8e05b04c55680b20582c18bed55ffa846bfa65948c6b6138252a8434/ocrd_tesserocr-0.3.0.tar.gz"}], "0.4.0"=>[{"comment_text"=>"", "digests"=>{"md5"=>"9d5ea4deb4c75bae31b7d44a4a8fdd0a", "sha256"=>"4822713547e696dbb327a80f9dd5bad705be4b7dc1f44fdef1d44f9e03c21c1d"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.4.0-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"9d5ea4deb4c75bae31b7d44a4a8fdd0a", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>37231, "upload_time"=>"2019-08-21T16:47:05", "upload_time_iso_8601"=>"2019-08-21T16:47:05.083051Z", "url"=>"https://files.pythonhosted.org/packages/ee/2b/483b44bf3180e81aa8a5bf7307ae47da4d1656e69dec1a704f9a8d558b88/ocrd_tesserocr-0.4.0-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"91e09cbc5208905353c22f07029db316", "sha256"=>"616bf420794ef71bcc372fa4c29775c48d6909d01b6849e2d0be83766cd0ed90"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.4.0.tar.gz", "has_sig"=>false, "md5_digest"=>"91e09cbc5208905353c22f07029db316", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>19943, "upload_time"=>"2019-08-21T16:47:06", "upload_time_iso_8601"=>"2019-08-21T16:47:06.605798Z", "url"=>"https://files.pythonhosted.org/packages/87/09/b994a5d7310f73b04b7dd840a5fbdd726da42b7980ac0a07595b6c56ef00/ocrd_tesserocr-0.4.0.tar.gz"}], "0.4.1"=>[{"comment_text"=>"", "digests"=>{"md5"=>"e634e1792d14a33a6bdde296483f0817", "sha256"=>"d21818eceac8bcdc1fdb38d4a58bfd1620cef8e7a5d0e6276afbd7695c2cac31"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.4.1-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"e634e1792d14a33a6bdde296483f0817", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>38864, "upload_time"=>"2019-10-31T14:58:27", "upload_time_iso_8601"=>"2019-10-31T14:58:27.102775Z", "url"=>"https://files.pythonhosted.org/packages/1d/78/93c90d9593f62546fea5e2ef9b5edbb5a47121582db724ca41f93830ec87/ocrd_tesserocr-0.4.1-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"3de4e2c8fcb66eb6a3cb32a1a1cd361b", "sha256"=>"bbf3843361c4807c5790790d8a8fc0a0325b2fb9817cd4fa70210659dde8c8cb"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.4.1.tar.gz", "has_sig"=>false, "md5_digest"=>"3de4e2c8fcb66eb6a3cb32a1a1cd361b", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>20535, "upload_time"=>"2019-10-31T14:58:28", "upload_time_iso_8601"=>"2019-10-31T14:58:28.641792Z", "url"=>"https://files.pythonhosted.org/packages/a7/2e/de857738105ed9f1888d3f6724c0c314404b67582652a91b060d25cff808/ocrd_tesserocr-0.4.1.tar.gz"}], "0.5.0"=>[{"comment_text"=>"", "digests"=>{"md5"=>"4a807653bdfacd7d22b6c303dc1ac04f", "sha256"=>"f3bca0adcb9fce640a010d38d7e1d04b4fc423ec0cc958ff3980afbf74a5711f"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.5.0-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"4a807653bdfacd7d22b6c303dc1ac04f", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>33343, "upload_time"=>"2019-10-26T18:40:17", "upload_time_iso_8601"=>"2019-10-26T18:40:17.958444Z", "url"=>"https://files.pythonhosted.org/packages/36/98/a6c6b46903a3b25b1740cde4aedaf62de6441ac887536e36ad24a3c3bf12/ocrd_tesserocr-0.5.0-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"b4885925db28012b94b5fa3c86d80e28", "sha256"=>"aaf012b2c6adcd9a34b6fa9351dcd16fed3ab848d4d8a563b3825f9b7103be42"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.5.0.tar.gz", "has_sig"=>false, "md5_digest"=>"b4885925db28012b94b5fa3c86d80e28", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>21170, "upload_time"=>"2019-10-26T18:40:19", "upload_time_iso_8601"=>"2019-10-26T18:40:19.386827Z", "url"=>"https://files.pythonhosted.org/packages/85/5b/7c5c21b78ccd00d49f7747ad5b2a381d9860aeed41fe545a24a361544837/ocrd_tesserocr-0.5.0.tar.gz"}], "0.5.1"=>[{"comment_text"=>"", "digests"=>{"md5"=>"8835763816200fbfec9b58670bd69d8f", "sha256"=>"18cef805014268db86fd6c32bca83069cdf536298fe8151f59f9197d255a9d14"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.5.1-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"8835763816200fbfec9b58670bd69d8f", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>38309, "upload_time"=>"2019-10-31T16:43:42", "upload_time_iso_8601"=>"2019-10-31T16:43:42.078476Z", "url"=>"https://files.pythonhosted.org/packages/06/84/b5aca7d06e31dcb91683ab60e154b73a8d0e1cb4d5ae22debf55922573df/ocrd_tesserocr-0.5.1-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"1c203160eddb792cdbd706ccbb5e35bb", "sha256"=>"7dd6a5fd556395deb58070d5f6196871a241d89434a26d0a0fc7e106404aa90a"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.5.1.tar.gz", "has_sig"=>false, "md5_digest"=>"1c203160eddb792cdbd706ccbb5e35bb", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>20350, "upload_time"=>"2019-10-31T16:43:43", "upload_time_iso_8601"=>"2019-10-31T16:43:43.864345Z", "url"=>"https://files.pythonhosted.org/packages/15/1f/ed95415ee91659222301aa77e4f8c27be33df8e258972059bc031a2c0e3b/ocrd_tesserocr-0.5.1.tar.gz"}], "0.6.0"=>[{"comment_text"=>"", "digests"=>{"md5"=>"0f1c539e4ffd53d67a3b891586c7be48", "sha256"=>"41d5309efc4f886569d47dede504cea5e14ffd8e27a33acb69e15c775d34f754"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.6.0-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"0f1c539e4ffd53d67a3b891586c7be48", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>37693, "upload_time"=>"2019-11-05T19:14:55", "upload_time_iso_8601"=>"2019-11-05T19:14:55.328581Z", "url"=>"https://files.pythonhosted.org/packages/89/a9/431c3ad62ac4612b6be3f5cad58b49910a9c00b5f28dd62f8d535ed0c0cf/ocrd_tesserocr-0.6.0-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"9c454a4d508b6d43a1551b517c125d5b", "sha256"=>"3a1aeff23dbf42cc8c003039cc8695cd4e01807245f935c9323e6df2832855a7"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.6.0.tar.gz", "has_sig"=>false, "md5_digest"=>"9c454a4d508b6d43a1551b517c125d5b", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>20588, "upload_time"=>"2019-11-05T19:14:57", "upload_time_iso_8601"=>"2019-11-05T19:14:57.128983Z", "url"=>"https://files.pythonhosted.org/packages/48/30/6c8253739ee61d4a42b6512be3fcfe0ce7190ff2835ee1210b1c483da025/ocrd_tesserocr-0.6.0.tar.gz"}], "0.7.0"=>[{"comment_text"=>"", "digests"=>{"md5"=>"c70cf04587dbacd64f10e58706852630", "sha256"=>"19e81e1ff8344c6766bf41e8968e14efceb2902c7bb4fd2b7c811b3697e0f589"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.7.0-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"c70cf04587dbacd64f10e58706852630", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>44435, "upload_time"=>"2020-01-23T14:31:55", "upload_time_iso_8601"=>"2020-01-23T14:31:55.259065Z", "url"=>"https://files.pythonhosted.org/packages/0d/74/404359c05892e1123e1e6cbbd07d237e11bf42f3aa75cf41db87f4920a42/ocrd_tesserocr-0.7.0-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"0bc1167c26f1fad3e0a1dfc79ebca1e4", "sha256"=>"640504e049c3ccfe046c912109ca0354fe414004c5afb1fc9e9bb6e0651509d6"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.7.0.tar.gz", "has_sig"=>false, "md5_digest"=>"0bc1167c26f1fad3e0a1dfc79ebca1e4", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>24991, "upload_time"=>"2020-01-23T14:31:56", "upload_time_iso_8601"=>"2020-01-23T14:31:56.649512Z", "url"=>"https://files.pythonhosted.org/packages/16/e7/f6f57abfef6c662cd4cde8f02f2f49639e4075211776e069543c2ca3d484/ocrd_tesserocr-0.7.0.tar.gz"}]}, "urls"=>[{"comment_text"=>"", "digests"=>{"md5"=>"c70cf04587dbacd64f10e58706852630", "sha256"=>"19e81e1ff8344c6766bf41e8968e14efceb2902c7bb4fd2b7c811b3697e0f589"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.7.0-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"c70cf04587dbacd64f10e58706852630", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>44435, "upload_time"=>"2020-01-23T14:31:55", "upload_time_iso_8601"=>"2020-01-23T14:31:55.259065Z", "url"=>"https://files.pythonhosted.org/packages/0d/74/404359c05892e1123e1e6cbbd07d237e11bf42f3aa75cf41db87f4920a42/ocrd_tesserocr-0.7.0-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"0bc1167c26f1fad3e0a1dfc79ebca1e4", "sha256"=>"640504e049c3ccfe046c912109ca0354fe414004c5afb1fc9e9bb6e0651509d6"}, "downloads"=>-1, "filename"=>"ocrd_tesserocr-0.7.0.tar.gz", "has_sig"=>false, "md5_digest"=>"0bc1167c26f1fad3e0a1dfc79ebca1e4", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>24991, "upload_time"=>"2020-01-23T14:31:56", "upload_time_iso_8601"=>"2020-01-23T14:31:56.649512Z", "url"=>"https://files.pythonhosted.org/packages/16/e7/f6f57abfef6c662cd4cde8f02f2f49639e4075211776e069543c2ca3d484/ocrd_tesserocr-0.7.0.tar.gz"}]}, "url"=>"https://github.com/OCR-D/ocrd_tesserocr"}, "url"=>"https://github.com/OCR-D/ocrd_tesserocr"}

ocrd_cis

{"compliant_cli"=>false, "files"=>{"Dockerfile"=>"FROM ocrd/core:latest\nENV VERSION=\"Mi 9. Okt 13:26:16 CEST 2019\"\nENV GITURL=\"https://github.com/cisocrgroup\"\nENV DOWNLOAD_URL=\"http://cis.lmu.de/~finkf\"\nENV DATA=\"/apps/ocrd-cis-post-correction\"\n\n# deps\nCOPY data/docker/deps.txt ${DATA}/deps.txt\nRUN apt-get update \\\n\t&& apt-get -y install --no-install-recommends $(cat ${DATA}/deps.txt)\n\n# locales\nRUN sed -i -e 's/# en_US.UTF-8 UTF-8/en_US.UTF-8 UTF-8/' /etc/locale.gen \\\n    && dpkg-reconfigure --frontend=noninteractive locales \\\n    && update-locale LANG=en_US.UTF-8\n\n# install the profiler\nRUN\tgit clone ${GITURL}/Profiler --branch devel --single-branch /tmp/profiler \\\n\t&& cd /tmp/profiler \\\n\t&& mkdir build \\\n\t&& cd build \\\n\t&& cmake -DCMAKE_BUILD_TYPE=release .. \\\n\t&& make compileFBDic trainFrequencyList profiler \\\n\t&& cp bin/compileFBDic bin/trainFrequencyList bin/profiler /apps/ \\\n\t&& cd / \\\n    && rm -rf /tmp/profiler\n\n# install the profiler's language backend\nRUN\tgit clone ${GITURL}/Resources --branch master --single-branch /tmp/resources \\\n\t&& cd /tmp/resources/lexica \\\n\t&& make FBDIC=/apps/compileFBDic TRAIN=/apps/trainFrequencyList \\\n\t&& mkdir -p /${DATA}/languages \\\n\t&& cp -r german latin greek german.ini latin.ini greek.ini /${DATA}/languages \\\n\t&& cd / \\\n\t&& rm -rf /tmp/resources\n\n# install ocrd_cis (python)\nCOPY Manifest.in Makefile setup.py ocrd-tool.json /tmp/build/\nCOPY ocrd_cis/ /tmp/build/ocrd_cis/\nCOPY bashlib/ /tmp/build/bashlib/\n# COPY . /tmp/ocrd_cis\nRUN cd /tmp/build \\\n\t&& make install \\\n\t&& cd / \\\n\t&& rm -rf /tmp/build\n\n# download ocr models and pre-trainded post-correction model\nRUN mkdir /apps/models \\\n\t&& cd /apps/models \\\n\t&& wget ${DOWNLOAD_URL}/model.zip >/dev/null 2>&1 \\\n\t&& wget ${DOWNLOAD_URL}/fraktur1-00085000.pyrnn.gz >/dev/null 2>&1 \\\n\t&& wget ${DOWNLOAD_URL}/fraktur2-00062000.pyrnn.gz >/dev/null 2>&1\n\nVOLUME [\"/data\"]\n", "README.md"=>"[![Language grade: Python](https://img.shields.io/lgtm/grade/python/g/cisocrgroup/ocrd_cis.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/cisocrgroup/ocrd_cis/context:python)\n[![Total alerts](https://img.shields.io/lgtm/alerts/g/cisocrgroup/ocrd_cis.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/cisocrgroup/ocrd_cis/alerts/)\n# ocrd_cis\n\n[CIS](http://www.cis.lmu.de) [OCR-D](http://ocr-d.de) command line\ntools for the automatic post-correction of OCR-results.\n\n## Introduction\n`ocrd_cis` contains different tools for the automatic post correction\nof OCR-results.  It contains tools for the training, evaluation and\nexecution of the post correction.  Most of the tools are following the\n[OCR-D cli conventions](https://ocr-d.github.io/cli).\n\nThere is a helper tool to align multiple OCR results as well as a\nversion of ocropy that works with python3.\n\n## Installation\nThere are multiple ways to install the `ocrd_cis` tools:\n * `make install` uses `pip` to install `ocrd_cis` (see below).\n * `make install-devel` uses `pip -e` to install `ocrd_cis` (see\n   below).\n * `pip install --upgrade pip ocrd_cis_dir`\n * `pip install -e --upgrade pip ocrd_cis_dir`\n\nIt is possible to install `ocrd_cis` in a custom directory using\n`virtualenv`:\n```sh\n python3 -m venv venv-dir\n source venv-dir/bin/activate\n make install # or any other command to install ocrd_cis (see above)\n # use ocrd_cis\n deactivate\n```\n\n## Usage\nMost tools follow the [OCR-D cli\nconventions](https://ocr-d.github.io/cli).  They accept the\n`--input-file-grp`, `--output-file-grp`, `--parameter`, `--mets`,\n`--log-level` command line arguments (short and long).  For some tools\n(most notably the alignment tool) expect a comma seperated list of\nmultiple input file groups.\n\nThe [ocrd-tool.json](ocrd_cis/ocrd-tool.json) contains a schema\ndescription of the parameter config file for the different tools that\naccept the `--parameter` argument.\n\n### ocrd-cis-post-correct.sh\nThis bash script runs the post correction using a pre-trained\n[model](http://cis.lmu.de/~finkf/model.zip).  If additional support\nOCRs should be used, models for these OCR steps are required and must\nbe configured in an according configuration file (see ocrd-tool.json).\n\nArguments:\n * `--parameter` path to configuration file\n * `--input-file-grp` name of the master-OCR file group\n * `--output-file-grp` name of the post-correction file group\n * `--log-level` set log level\n * `--mets` path to METS file in workspace\n\n### ocrd-cis-align\nAligns tokens of multiple input file groups to one output file group.\nThis tool is used to align the master OCR with any additional support\nOCRs.  It accepts a comma-separated list of input file groups, which\nit aligns in order.\n\nArguments:\n * `--parameter` path to configuration file\n * `--input-file-grp` comma seperated list of the input file groups;\n   first input file group is the master OCR\n * `--output-file-grp` name of the file group for the aligned result\n * `--log-level` set log level\n * `--mets` path to METS file in workspace\n\n### ocrd-cis-train.sh\nScript to train a model from a list of ground-truth archives (see\nocrd-tool.json) for the post correction.  The tool somewhat mimics the\nbehaviour of other ocrd tools:\n * `--mets` for the workspace\n * `--log-level` is passed to other tools\n * `--parameter` is used as configuration\n * `--output-file-grp` defines the output file group for the model\n\n### ocrd-cis-data\nHelper tool to get the path of the installed data files. Usage:\n`ocrd-cis-data [-jar|-3gs]` to get the path of the jar library or the\npath to th default 3-grams language model file.\n\n### ocrd-cis-wer\nHelper tool to calculate the word error rate aligned ocr files.  It\nwrites a simple JSON-formated stats file to the given output file group.\n\nArguments:\n * `--input-file-grp` input file group of aligned ocr results with\n   their respective ground truth.\n * `--output-file-grp` name of the file group for the stats file\n * `--log-level` set log level\n * `--mets` path to METS file in workspace\n\n### ocrd-cis-profile\nRun the profiler over the given files of the according the given input\nfile grp and adds a gzipped JSON-formatted profile to the output file\ngroup of the workspace.  This tools requires an installed [language\nprofiler](https://github.com/cisocrgroup/Profiler).\n\nArguments:\n * `--parameter` path to configuration file\n * `--input-file-grp` name of the input file group to profile\n * `--output-file-grp` name of the output file group where the profile\n   is stored\n * `--log-level` set log level\n * `--mets` path to METS file in the workspace\n\n### ocrd-cis-ocropy-train\nThe ocropy-train tool can be used to train LSTM models.\nIt takes ground truth from the workspace and saves (image+text) snippets from the corresponding pages.\nThen a model is trained on all snippets for 1 million (or the given number of) randomized iterations from the parameter file.\n```sh\nocrd-cis-ocropy-train \\\n  --input-file-grp OCR-D-GT-SEG-LINE \\\n  --mets mets.xml\n  --parameter file:///path/to/config.json\n```\n\n### ocrd-cis-ocropy-clip\nThe ocropy-clip tool can be used to remove intrusions of neighbouring segments in regions / lines of a workspace.\nIt runs a (ad-hoc binarization and) connected component analysis on every text region / line of every PAGE in the input file group, as well as its overlapping neighbours, and for each binary object of conflict, determines whether it belongs to the neighbour, and can therefore be clipped to white. It references the resulting segment image files in the output PAGE (as AlternativeImage).\n```sh\nocrd-cis-ocropy-clip \\\n  --input-file-grp OCR-D-SEG-LINE \\\n  --output-file-grp OCR-D-SEG-LINE-CLIP \\\n  --mets mets.xml\n  --parameter file:///path/to/config.json\n```\n\n### ocrd-cis-ocropy-resegment\nThe ocropy-resegment tool can be used to remove overlap between lines of a workspace.\nIt runs a (ad-hoc binarization and) line segmentation on every text region of every PAGE in the input file group, and for each line already annotated, determines the label of largest extent within the original coordinates (polygon outline) in that line, and annotates the resulting coordinates in the output PAGE.\n```sh\nocrd-cis-ocropy-resegment \\\n  --input-file-grp OCR-D-SEG-LINE \\\n  --output-file-grp OCR-D-SEG-LINE-RES \\\n  --mets mets.xml\n  --parameter file:///path/to/config.json\n```\n\n### ocrd-cis-ocropy-segment\nThe ocropy-segment tool can be used to segment regions into lines.\nIt runs a (ad-hoc binarization and) line segmentation on every text region of every PAGE in the input file group, and adds a TextLine element with the resulting polygon outline to the annotation of the output PAGE.\n```sh\nocrd-cis-ocropy-segment \\\n  --input-file-grp OCR-D-SEG-BLOCK \\\n  --output-file-grp OCR-D-SEG-LINE \\\n  --mets mets.xml\n  --parameter file:///path/to/config.json\n```\n\n### ocrd-cis-ocropy-deskew\nThe ocropy-deskew tool can be used to deskew pages / regions of a workspace.\nIt runs the Ocropy thresholding and deskewing estimation on every segment of every PAGE in the input file group and annotates the orientation angle in the output PAGE.\n```sh\nocrd-cis-ocropy-deskew \\\n  --input-file-grp OCR-D-SEG-LINE \\\n  --output-file-grp OCR-D-SEG-LINE-DES \\\n  --mets mets.xml\n  --parameter file:///path/to/config.json\n```\n\n### ocrd-cis-ocropy-denoise\nThe ocropy-denoise tool can be used to despeckle pages / regions / lines of a workspace.\nIt runs the Ocropy \"nlbin\" denoising on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage).\n```sh\nocrd-cis-ocropy-denoise \\\n  --input-file-grp OCR-D-SEG-LINE-DES \\\n  --output-file-grp OCR-D-SEG-LINE-DEN \\\n  --mets mets.xml\n  --parameter file:///path/to/config.json\n```\n\n### ocrd-cis-ocropy-binarize\nThe ocropy-binarize tool can be used to binarize, denoise and deskew pages / regions / lines of a workspace.\nIt runs the Ocropy \"nlbin\" adaptive thresholding, deskewing estimation and denoising on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage). (If a deskewing angle has already been annotated in a region, the tool respects that and rotates accordingly.) Images can also be produced grayscale-normalized.\n```sh\nocrd-cis-ocropy-binarize \\\n  --input-file-grp OCR-D-SEG-LINE-DES \\\n  --output-file-grp OCR-D-SEG-LINE-BIN \\\n  --mets mets.xml\n  --parameter file:///path/to/config.json\n```\n\n### ocrd-cis-ocropy-dewarp\nThe ocropy-dewarp tool can be used to dewarp text lines of a workspace.\nIt runs the Ocropy baseline estimation and dewarping on every line in every text region of every PAGE in the input file group and references the resulting line image files in the output PAGE (as AlternativeImage).\n```sh\nocrd-cis-ocropy-dewarp \\\n  --input-file-grp OCR-D-SEG-LINE-BIN \\\n  --output-file-grp OCR-D-SEG-LINE-DEW \\\n  --mets mets.xml\n  --parameter file:///path/to/config.json\n```\n\n### ocrd-cis-ocropy-recognize\nThe ocropy-recognize tool can be used to recognize lines / words / glyphs from pages of a workspace.\nIt runs the Ocropy optical character recognition on every line in every text region of every PAGE in the input file group and adds the resulting text annotation in the output PAGE.\n```sh\nocrd-cis-ocropy-recognize \\\n  --input-file-grp OCR-D-SEG-LINE-DEW \\\n  --output-file-grp OCR-D-OCR-OCRO \\\n  --mets mets.xml\n  --parameter file:///path/to/config.json\n```\n\n### Tesserocr\nInstall essential system packages for Tesserocr\n```sh\nsudo apt-get install python3-tk \\\n  tesseract-ocr libtesseract-dev libleptonica-dev \\\n  libimage-exiftool-perl libxml2-utils\n```\n\nThen install Tesserocr from: https://github.com/OCR-D/ocrd_tesserocr\n```sh\npip install -r requirements.txt\npip install .\n```\n\nDownload and move tesseract models from:\nhttps://github.com/tesseract-ocr/tesseract/wiki/Data-Files\nor use your own models and\nplace them into: /usr/share/tesseract-ocr/4.00/tessdata\n\n## Workflow configuration\n\nA decent pipeline might look like this:\n\n1. page-level cropping\n2. page-level binarization\n3. page-level deskewing\n4. page-level dewarping\n5. region segmentation\n6. region-level clipping\n7. region-level deskewing\n8. line segmentation\n9. line-level clipping or resegmentation\n10. line-level dewarping\n11. line-level recognition\n12. line-level alignment\n\nIf GT is used, steps 1, 5 and 8 can be omitted. Else if a segmentation is used in 5 and 8 which does not produce overlapping sections, steps 6 and 9 can be omitted.\n\n## Testing\nTo run a few basic tests type `make test` (`ocrd_cis` has to be\ninstalled in order to run any tests).\n\n## OCR-D workspace\n\n* Create a new (empty) workspace: `ocrd workspace init workspace-dir`\n* cd into `workspace-dir`\n* Add new file to workspace: `ocrd workspace add file -G group -i id\n  -m mimetype`\n\n## OCR-D links\n\n- [OCR-D](https://ocr-d.github.io)\n- [Github](https://github.com/OCR-D)\n- [Project-page](http://www.ocr-d.de/)\n- [Ground-truth](http://www.ocr-d.de/sites/all/GTDaten/IndexGT.html)\n", "ocrd-tool.json"=>"{\n\t\"git_url\": \"https://github.com/cisocrgroup/ocrd_cis\",\n\t\"version\": \"0.0.6\",\n\t\"tools\": {\n\t\t\"ocrd-cis-ocropy-binarize\": {\n\t\t\t\"executable\": \"ocrd-cis-ocropy-binarize\",\n\t\t\t\"categories\": [\n\t\t\t\t\"Image preprocessing\"\n\t\t\t],\n\t\t\t\"steps\": [\n\t\t\t\t\"preprocessing/optimization/binarization\",\n\t\t\t\t\"preprocessing/optimization/grayscale_normalization\",\n\t\t\t\t\"preprocessing/optimization/deskewing\"\n\t\t\t],\n\t\t\t\"input_file_grp\": [\n\t\t\t\t\"OCR-D-IMG\",\n\t\t\t\t\"OCR-D-SEG-BLOCK\",\n\t\t\t\t\"OCR-D-SEG-LINE\"\n\t\t\t],\n\t\t\t\"output_file_grp\": [\n\t\t\t\t\"OCR-D-IMG-BIN\",\n\t\t\t\t\"OCR-D-SEG-BLOCK\",\n\t\t\t\t\"OCR-D-SEG-LINE\"\n\t\t\t],\n\t\t\t\"description\": \"Binarize (and optionally deskew/despeckle) pages / regions / lines with ocropy\",\n\t\t\t\"parameters\": {\n\t\t\t\t\"method\": {\n\t\t\t\t\t\"type\": \"string\",\n\t\t\t\t\t\"enum\": [\"none\", \"global\", \"otsu\", \"gauss-otsu\", \"ocropy\"],\n\t\t\t\t\t\"description\": \"binarization method to use (only ocropy will include deskewing)\",\n\t\t\t\t\t\"default\": \"ocropy\"\n\t\t\t\t},\n\t\t\t\t\"grayscale\": {\n\t\t\t\t\t\"type\": \"boolean\",\n\t\t\t\t\t\"description\": \"for the ocropy method, produce grayscale-normalized instead of thresholded image\",\n\t\t\t\t\t\"default\": false\n\t\t\t\t},\n\t\t\t\t\"maxskew\": {\n\t\t\t\t\t\"type\": \"number\",\n\t\t\t\t\t\"description\": \"modulus of maximum skewing angle to detect (larger will be slower, 0 will deactivate deskewing)\",\n\t\t\t\t\t\"default\": 0.0\n\t\t\t\t},\n\t\t\t\t\"noise_maxsize\": {\n\t\t\t\t\t\"type\": \"number\",\n\t\t\t\t\t\"description\": \"maximum pixel number for connected components to regard as noise (0 will deactivate denoising)\",\n\t\t\t\t\t\"default\": 0\n\t\t\t\t},\n\t\t\t\t\"level-of-operation\": {\n\t\t\t\t\t\"type\": \"string\",\n\t\t\t\t\t\"enum\": [\"page\", \"region\", \"line\"],\n\t\t\t\t\t\"description\": \"PAGE XML hierarchy level granularity to annotate images for\",\n\t\t\t\t\t\"default\": \"page\"\n\t\t\t\t}\n\t\t\t}\n\t\t},\n\t\t\"ocrd-cis-ocropy-deskew\": {\n\t\t\t\"executable\": \"ocrd-cis-ocropy-deskew\",\n\t\t\t\"categories\": [\n\t\t\t\t\"Image preprocessing\"\n\t\t\t],\n\t\t\t\"steps\": [\n\t\t\t\t\"preprocessing/optimization/deskewing\"\n\t\t\t],\n\t\t\t\"input_file_grp\": [\n\t\t\t\t\"OCR-D-SEG-BLOCK\",\n\t\t\t\t\"OCR-D-SEG-LINE\"\n\t\t\t],\n\t\t\t\"output_file_grp\": [\n\t\t\t\t\"OCR-D-SEG-BLOCK\",\n\t\t\t\t\"OCR-D-SEG-LINE\"\n\t\t\t],\n\t\t\t\"description\": \"Deskew regions with ocropy (by annotating orientation angle and adding AlternativeImage)\",\n\t\t\t\"parameters\": {\n\t\t\t\t\"maxskew\": {\n\t\t\t\t\t\"type\": \"number\",\n\t\t\t\t\t\"description\": \"modulus of maximum skewing angle to detect (larger will be slower, 0 will deactivate deskewing)\",\n\t\t\t\t\t\"default\": 5.0\n\t\t\t\t},\n\t\t\t\t\"level-of-operation\": {\n\t\t\t\t\t\"type\": \"string\",\n\t\t\t\t\t\"enum\": [\"page\", \"region\"],\n\t\t\t\t\t\"description\": \"PAGE XML hierarchy level granularity to annotate images for\",\n\t\t\t\t\t\"default\": \"region\"\n\t\t\t\t}\n\t\t\t}\n\t\t},\n\t\t\"ocrd-cis-ocropy-denoise\": {\n\t\t\t\"executable\": \"ocrd-cis-ocropy-denoise\",\n\t\t\t\"categories\": [\n\t\t\t\t\"Image preprocessing\"\n\t\t\t],\n\t\t\t\"steps\": [\n\t\t\t\t\"preprocessing/optimization/despeckling\"\n\t\t\t],\n\t\t\t\"input_file_grp\": [\n\t\t\t\t\"OCR-D-IMG\",\n\t\t\t\t\"OCR-D-SEG-BLOCK\",\n\t\t\t\t\"OCR-D-SEG-LINE\"\n\t\t\t],\n\t\t\t\"output_file_grp\": [\n\t\t\t\t\"OCR-D-IMG-DESPECK\",\n\t\t\t\t\"OCR-D-SEG-BLOCK\",\n\t\t\t\t\"OCR-D-SEG-LINE\"\n\t\t\t],\n\t\t\t\"description\": \"Despeckle pages / regions / lines with ocropy\",\n\t\t\t\"parameters\": {\n\t\t\t\t\"noise_maxsize\": {\n\t\t\t\t\t\"type\": \"number\",\n\t\t\t\t\t\"format\": \"float\",\n\t\t\t\t\t\"description\": \"maximum size in points (pt) for connected components to regard as noise (0 will deactivate denoising)\",\n\t\t\t\t\t\"default\": 3.0\n\t\t\t\t},\n\t\t\t\t\"dpi\": {\n\t\t\t\t\t\"type\": \"number\",\n\t\t\t\t\t\"format\": \"float\",\n\t\t\t\t\t\"description\": \"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative\",\n\t\t\t\t\t\"default\": -1\n\t\t\t\t},\n\t\t\t\t\"level-of-operation\": {\n\t\t\t\t\t\"type\": \"string\",\n\t\t\t\t\t\"enum\": [\"page\", \"region\", \"line\"],\n\t\t\t\t\t\"description\": \"PAGE XML hierarchy level granularity to annotate images for\",\n\t\t\t\t\t\"default\": \"page\"\n\t\t\t\t}\n\t\t\t}\n\t\t},\n\t\t\"ocrd-cis-ocropy-clip\": {\n\t\t\t\"executable\": \"ocrd-cis-ocropy-clip\",\n\t\t\t\"categories\": [\n\t\t\t\t\"Layout analysis\"\n\t\t\t],\n\t\t\t\"steps\": [\n\t\t\t\t\"layout/segmentation/region\",\n\t\t\t\t\"layout/segmentation/line\"\n\t\t\t],\n\t\t\t\"input_file_grp\": [\n\t\t\t\t\"OCR-D-SEG-BLOCK\",\n\t\t\t\t\"OCR-D-SEG-LINE\"\n\t\t\t],\n\t\t\t\"output_file_grp\": [\n\t\t\t\t\"OCR-D-SEG-BLOCK\",\n\t\t\t\t\"OCR-D-SEG-LINE\"\n\t\t\t],\n\t\t\t\"description\": \"Clip text regions / lines at intersections with neighbours\",\n\t\t\t\"parameters\": {\n\t\t\t\t\"level-of-operation\": {\n\t\t\t\t\t\"type\": \"string\",\n\t\t\t\t\t\"enum\": [\"region\", \"line\"],\n\t\t\t\t\t\"description\": \"PAGE XML hierarchy level granularity to annotate images for\",\n\t\t\t\t\t\"default\": \"region\"\n\t\t\t\t},\n\t\t\t\t\"dpi\": {\n\t\t\t\t\t\"type\": \"number\",\n\t\t\t\t\t\"format\": \"float\",\n\t\t\t\t\t\"description\": \"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative\",\n\t\t\t\t\t\"default\": -1\n\t\t\t\t},\n\t\t\t\t\"min_fraction\": {\n\t\t\t\t\t\"type\": \"number\",\n\t\t\t\t\t\"format\": \"float\",\n\t\t\t\t\t\"description\": \"share of foreground pixels that must be retained by the largest label\",\n\t\t\t\t\t\"default\": 0.7\n\t\t\t\t}\n\t\t\t}\n\t\t},\n\t\t\"ocrd-cis-ocropy-resegment\": {\n\t\t\t\"executable\": \"ocrd-cis-ocropy-resegment\",\n\t\t\t\"categories\": [\n\t\t\t\t\"Layout analysis\"\n\t\t\t],\n\t\t\t\"steps\": [\n\t\t\t\t\"layout/segmentation/line\"\n\t\t\t],\n\t\t\t\"input_file_grp\": [\n\t\t\t\t\"OCR-D-SEG-LINE\"\n\t\t\t],\n\t\t\t\"output_file_grp\": [\n\t\t\t\t\"OCR-D-SEG-LINE\"\n\t\t\t],\n\t\t\t\"description\": \"Resegment lines with ocropy (by shrinking annotated polygons)\",\n\t\t\t\"parameters\": {\n\t\t\t\t\"dpi\": {\n\t\t\t\t\t\"type\": \"number\",\n\t\t\t\t\t\"format\": \"float\",\n\t\t\t\t\t\"description\": \"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative\",\n\t\t\t\t\t\"default\": -1\n\t\t\t\t},\n\t\t\t\t\"min_fraction\": {\n\t\t\t\t\t\"type\": \"number\",\n\t\t\t\t\t\"format\": \"float\",\n\t\t\t\t\t\"description\": \"share of foreground pixels that must be retained by the largest label\",\n\t\t\t\t\t\"default\": 0.8\n\t\t\t\t},\n\t\t\t\t\"extend_margins\": {\n\t\t\t\t\t\"type\": \"number\",\n\t\t\t\t\t\"format\": \"integer\",\n\t\t\t\t\t\"description\": \"number of pixels to extend the input polygons horizontally and vertically before intersecting\",\n\t\t\t\t\t\"default\": 3\n\t\t\t\t}\n\t\t\t}\n\t\t},\n\t\t\"ocrd-cis-ocropy-dewarp\": {\n\t\t\t\"executable\": \"ocrd-cis-ocropy-dewarp\",\n\t\t\t\"categories\": [\n\t\t\t\t\"Image preprocessing\"\n\t\t\t],\n\t\t\t\"steps\": [\n\t\t\t\t\"preprocessing/optimization/dewarping\"\n\t\t\t],\n\t\t\t\"description\": \"Dewarp line images with ocropy\",\n\t\t\t\"input_file_grp\": [\n\t\t\t\t\"OCR-D-SEG-LINE\"\n\t\t\t],\n\t\t\t\"output_file_grp\": [\n\t\t\t\t\"OCR-D-SEG-LINE\"\n\t\t\t],\n\t\t\t\"parameters\": {\n\t\t\t\t\"dpi\": {\n\t\t\t\t\t\"type\": \"number\",\n\t\t\t\t\t\"format\": \"float\",\n\t\t\t\t\t\"description\": \"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative\",\n\t\t\t\t\t\"default\": -1\n\t\t\t\t},\n\t\t\t\t\"range\": {\n\t\t\t\t\t\"type\": \"number\",\n\t\t\t\t\t\"format\": \"float\",\n\t\t\t\t\t\"description\": \"maximum vertical disposition or maximum margin (will be multiplied by mean centerline deltas to yield pixels)\",\n\t\t\t\t\t\"default\": 4.0\n\t\t\t\t},\n\t\t\t\t\"max_neighbour\": {\n\t\t\t\t\t\"type\": \"number\",\n\t\t\t\t\t\"format\": \"float\",\n\t\t\t\t\t\"description\": \"maximum rate of foreground pixels intruding from neighbouring lines (line will not be processed above that)\",\n\t\t\t\t\t\"default\": 0.05\n\t\t\t\t}\n\t\t\t}\n\t\t},\n\t\t\"ocrd-cis-ocropy-recognize\": {\n\t\t\t\"executable\": \"ocrd-cis-ocropy-recognize\",\n\t\t\t\"categories\": [\n\t\t\t\t\"Text recognition and optimization\"\n\t\t\t],\n\t\t\t\"steps\": [\n\t\t\t\t\"recognition/text-recognition\"\n\t\t\t],\n\t\t\t\"description\": \"Recognize text in (binarized+deskewed+dewarped) lines with ocropy\",\n\t\t\t\"input_file_grp\": [\n\t\t\t\t\"OCR-D-SEG-LINE\",\n\t\t\t\t\"OCR-D-SEG-WORD\",\n\t\t\t\t\"OCR-D-SEG-GLYPH\"\n\t\t\t],\n\t\t\t\"output_file_grp\": [\n\t\t\t\t\"OCR-D-OCR-OCRO\"\n\t\t\t],\n\t\t\t\"parameters\": {\n\t\t\t\t\"textequiv_level\": {\n\t\t\t\t\t\"type\": \"string\",\n\t\t\t\t\t\"enum\": [\"line\", \"word\", \"glyph\"],\n\t\t\t\t\t\"description\": \"PAGE XML hierarchy level granularity to add the TextEquiv results to\",\n\t\t\t\t\t\"default\": \"line\"\n\t\t\t\t},\n\t\t\t\t\"model\": {\n\t\t\t\t\t\"type\": \"string\",\n\t\t\t\t\t\"description\": \"ocropy model to apply (e.g. fraktur.pyrnn)\"\n\t\t\t\t}\n\t\t\t}\n\t\t},\n\t\t\"ocrd-cis-ocropy-rec\": {\n\t\t\t\"executable\": \"ocrd-cis-ocropy-rec\",\n\t\t\t\"categories\": [\n\t\t\t\t\"Text recognition and optimization\"\n\t\t\t],\n\t\t\t\"steps\": [\n\t\t\t\t\"recognition/text-recognition\"\n\t\t\t],\n\t\t\t\"description\": \"Recognize text snippets\",\n\t\t\t\"parameters\": {\n\t\t\t\t\"model\": {\n\t\t\t\t\t\"type\": \"string\",\n\t\t\t\t\t\"description\": \"ocropy model to apply (e.g. fraktur.pyrnn)\"\n\t\t\t\t}\n\t\t\t}\n\t\t},\n\t\t\"ocrd-cis-ocropy-segment\": {\n\t\t\t\"executable\": \"ocrd-cis-ocropy-segment\",\n\t\t\t\"categories\": [\n\t\t\t\t\"Layout analysis\"\n\t\t\t],\n\t\t\t\"steps\": [\n\t\t\t\t\"layout/segmentation/region\",\n\t\t\t\t\"layout/segmentation/line\"\n\t\t\t],\n\t\t\t\"input_file_grp\": [\n\t\t\t\t\"OCR-D-GT-SEG-BLOCK\",\n\t\t\t\t\"OCR-D-SEG-BLOCK\"\n\t\t\t],\n\t\t\t\"output_file_grp\": [\n\t\t\t\t\"OCR-D-SEG-LINE\"\n\t\t\t],\n\t\t\t\"description\": \"Segment pages into regions or regions into lines with ocropy\",\n\t\t\t\"parameters\": {\n\t\t\t\t\"dpi\": {\n\t\t\t\t\t\"type\": \"number\",\n\t\t\t\t\t\"format\": \"float\",\n\t\t\t\t\t\"description\": \"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative\",\n\t\t\t\t\t\"default\": -1\n\t\t\t\t},\n\t\t\t\t\"level-of-operation\": {\n\t\t\t\t\t\"type\": \"string\",\n\t\t\t\t\t\"enum\": [\"page\", \"region\"],\n\t\t\t\t\t\"description\": \"PAGE XML hierarchy level to read images from\",\n\t\t\t\t\t\"default\": \"region\"\n\t\t\t\t},\n\t\t\t\t\"maxcolseps\": {\n\t\t\t\t\t\"type\": \"number\",\n\t\t\t\t\t\"format\": \"integer\",\n\t\t\t\t\t\"default\": 2,\n\t\t\t\t\t\"description\": \"number of white/background column separators to try (when operating on the page level)\"\n\t\t\t\t},\n\t\t\t\t\"maxseps\": {\n\t\t\t\t\t\"type\": \"number\",\n\t\t\t\t\t\"format\": \"integer\",\n\t\t\t\t\t\"default\": 5,\n\t\t\t\t\t\"description\": \"number of black/foreground column separators to try, counted individually as lines (when operating on the page level)\"\n\t\t\t\t},\n\t\t\t\t\"overwrite_regions\": {\n\t\t\t\t\t\"type\": \"boolean\",\n\t\t\t\t\t\"default\": true,\n\t\t\t\t\t\"description\": \"remove any existing TextRegion elements (when operating on the page level)\"\n\t\t\t\t},\n\t\t\t\t\"overwrite_lines\": {\n\t\t\t\t\t\"type\": \"boolean\",\n\t\t\t\t\t\"default\": true,\n\t\t\t\t\t\"description\": \"remove any existing TextLine elements (when operating on the region level)\"\n\t\t\t\t},\n\t\t\t\t\"spread\": {\n\t\t\t\t\t\"type\": \"number\",\n\t\t\t\t\t\"format\": \"float\",\n\t\t\t\t\t\"default\": 2.4,\n\t\t\t\t\t\"description\": \"distance in points (pt) from the foreground to project text line (or text region) labels into the background\"\n\t\t\t\t}\n\t\t\t}\n\t\t},\n\t\t\"ocrd-cis-ocropy-train\": {\n\t\t\t\"executable\": \"ocrd-cis-ocropy-train\",\n\t\t\t\"categories\": [\n\t\t\t\t\"lstm ocropy model training\"\n\t\t\t],\n\t\t\t\"steps\": [\n\t\t\t\t\"training\"\n\t\t\t],\n\t\t\t\"description\": \"train model with ground truth from mets data\",\n\t\t\t\"parameters\": {\n\t\t\t\t\"textequiv_level\": {\n\t\t\t\t\t\"type\": \"string\",\n\t\t\t\t\t\"enum\": [\"line\", \"word\", \"glyph\"],\n\t\t\t\t\t\"default\": \"line\"\n\t\t\t\t},\n\t\t\t\t\"model\": {\n\t\t\t\t\t\"type\": \"string\",\n\t\t\t\t\t\"description\": \"load model or crate new one (e.g. fraktur.pyrnn)\"\n\t\t\t\t},\n\t\t\t\t\"ntrain\": {\n\t\t\t\t\t\"type\": \"integer\",\n\t\t\t\t\t\"description\": \"lines to train before stopping\",\n\t\t\t\t\t\"default\": 1000000\n\t\t\t\t},\n\t\t\t\t\"outputpath\": {\n\t\t\t\t\t\"type\": \"string\",\n\t\t\t\t\t\"description\": \"(existing) path for the trained model\"\n\t\t\t\t}\n\t\t\t}\n\t\t},\n\t\t\"ocrd-cis-align\": {\n\t\t\t\"executable\": \"ocrd-cis-align\",\n\t\t\t\"categories\": [\n\t\t\t\t\"Text recognition and optimization\"\n\t\t\t],\n\t\t\t\"steps\": [\n\t\t\t\t\"postprocessing/alignment\"\n\t\t\t],\n\t\t\t\"description\": \"Align multiple OCRs and/or GTs\"\n\t\t},\n\t\t\"ocrd-cis-wer\": {\n\t\t\t\"executable\": \"ocrd-cis-wer\",\n\t\t\t\"categories\": [\n\t\t\t\t\"evaluation\"\n\t\t\t],\n\t\t\t\"steps\": [\n\t\t\t\t\"evaluation\"\n\t\t\t],\n\t\t\t\"description\": \"calculate the word error rate for aligned page xml files\",\n\t\t\t\"parameters\": {\n\t\t\t\t\"testIndex\": {\n\t\t\t\t\t\"description\": \"text equiv index for the test/ocr tokens\",\n\t\t\t\t\t\"type\": \"integer\",\n\t\t\t\t\t\"default\": 0\n\t\t\t\t},\n\t\t\t\t\"gtIndex\": {\n\t\t\t\t\t\"type\": \"integer\",\n\t\t\t\t\t\"description\": \"text equiv index for the gt tokens\",\n\t\t\t\t\t\"default\": -1\n\t\t\t\t}\n\t\t\t}\n\t\t},\n\t\t\"ocrd-cis-jar\": {\n\t\t\t\"executable\": \"ocrd-cis-jar\",\n\t\t\t\"categories\": [\n\t\t\t\t\"Text recognition and optimization\"\n\t\t\t],\n\t\t\t\"steps\": [\n\t\t\t\t\"postprocessing/alignment\"\n\t\t\t],\n\t\t\t\"description\": \"Output path to the ocrd-cis.jar file\"\n\t\t},\n\t\t\"ocrd-cis-profile\": {\n\t\t\t\"executable\": \"ocrd-cis-profile\",\n\t\t\t\"categories\": [\n\t\t\t\t\"Text recognition and optimization\"\n\t\t\t],\n\t\t\t\"steps\": [\n\t\t\t\t\"postprocessing/alignment\"\n\t\t\t],\n\t\t\t\"description\": \"Add a correction suggestions and suspicious tokens (profile)\",\n\t\t\t\"parameters\": {\n\t\t\t\t\"executable\": {\n\t\t\t\t\t\"type\": \"string\",\n\t\t\t\t\t\"required\": true\n\t\t\t\t},\n\t\t\t\t\"backend\": {\n\t\t\t\t\t\"type\": \"string\",\n\t\t\t\t\t\"required\": true\n\t\t\t\t},\n\t\t\t\t\"language\": {\n\t\t\t\t    \"type\": \"string\",\n\t\t\t\t\t\"required\": false,\n\t\t\t\t\t\"default\": \"german\"\n\t\t\t\t},\n\t\t\t\t\"additionalLexicon\": {\n\t\t\t\t\t\"type\": \"string\",\n\t\t\t\t\t\"required\": false,\n\t\t\t\t\t\"default\": \"\"\n\t\t\t\t}\n\t\t\t}\n\t\t},\n\t\t\"ocrd-cis-train\": {\n\t\t\t\"executable\": \"ocrd-cis-train.sh\",\n\t\t\t\"categories\": [\n\t\t\t\t\"Text recognition and optimization\"\n\t\t\t],\n\t\t\t\"steps\": [\n\t\t\t\t\"postprocessing/alignment\"\n\t\t\t],\n\t\t\t\"description\": \"Train post correction model\",\n\t\t\t\"parameters\": {\n\t\t\t\t\"gtArchives\": {\n\t\t\t\t\t\"description\": \"List of ground truth archives\",\n\t\t\t\t\t\"type\": \"array\",\n\t\t\t\t\t\"required\": true,\n\t\t\t\t\t\"items\": {\n\t\t\t\t\t\t\"description\": \"Path (or URL) to a ground truth archive\",\n\t\t\t\t\t\t\"type\": \"string\"\n\t\t\t\t\t}\n\t\t\t\t},\n\t\t\t\t\"imagePreprocessingSteps\": {\n\t\t\t\t\t\"description\": \"List of image preprocessing steps\",\n\t\t\t\t\t\"type\": \"array\",\n\t\t\t\t\t\"required\": true,\n\t\t\t\t\t\"items\": {\n\t\t\t\t\t\t\"description\": \"Image preprocessing command that is evaled using the bash eval command (available parameters: $METS, $LOG_LEVEL, $XML_INPUT_FILE_GRP, $XML_OUTPUT_FILE_GRP, $IMG_OUTPUT_FILE_GRP, $IMG_INPUT_FILE_GRP, $PARAMETER)\",\n\t\t\t\t\t\t\"type\": \"string\"\n\t\t\t\t\t}\n\t\t\t\t},\n\t\t\t\t\"ocrSteps\": {\n\t\t\t\t\t\"description\": \"List of ocr steps\",\n\t\t\t\t\t\"type\": \"array\",\n\t\t\t\t\t\"required\": true,\n\t\t\t\t\t\"items\": {\n\t\t\t\t\t\t\"description\": \"OCR command that is evaled using the bash eval command (available parameters: $METS, $LOG_LEVEL, $XML_INPUT_FILE_GRP, $XML_OUTPUT_FILE_GRP, $PARAMETER)\",\n\t\t\t\t\t\t\"type\": \"string\"\n\t\t\t\t\t}\n\t\t\t\t},\n\t\t\t\t\"training\": {\n\t\t\t\t\t\"description\": \"Configuration of training command\",\n\t\t\t\t\t\"type\": \"object\",\n\t\t\t\t\t\"required\": [\n\t\t\t\t\t\t\"trigrams\",\n\t\t\t\t\t\t\"maxCandidate\",\n\t\t\t\t\t\t\"profiler\",\n\t\t\t\t\t\t\"leFeatures\",\n\t\t\t\t\t\t\"rrFeatures\",\n\t\t\t\t\t\t\"dmFeatures\"\n\t\t\t\t\t],\n\t\t\t\t\t\"properties\": {\n\t\t\t\t\t\t\"trigrams\": {\n\t\t\t\t\t\t\t\"description\": \"Path to character trigrams csv file (format: n,trigram)\",\n\t\t\t\t\t\t\t\"type\": \"string\",\n\t\t\t\t\t\t\t\"required\": true\n\t\t\t\t\t\t},\n\t\t\t\t\t\t\"maxCandidate\": {\n\t\t\t\t\t\t\t\"description\": \"Maximum number of considered profiler candidates per token\",\n\t\t\t\t\t\t\t\"type\": \"integer\",\n\t\t\t\t\t\t\t\"required\": true\n\t\t\t\t\t\t},\n\t\t\t\t\t\t\"filterClasses\": {\n\t\t\t\t\t\t\t\"description\": \"List of filtered feature classes\",\n\t\t\t\t\t\t\t\"required\": false,\n\t\t\t\t\t\t\t\"type\": \"array\",\n\t\t\t\t\t\t\t\"items\": {\n\t\t\t\t\t\t\t\t\"description\": \"Class name of feature class to filter\",\n\t\t\t\t\t\t\t\t\"type\": \"string\"\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t},\n\t\t\t\t\t\t\"profiler\": {\n\t\t\t\t\t\t\t\"description\": \"Profiler configuration\",\n\t\t\t\t\t\t\t\"type\": \"object\",\n\t\t\t\t\t\t\t\"required\": [\n\t\t\t\t\t\t\t\t\"path\",\n\t\t\t\t\t\t\t\t\"config\"\n\t\t\t\t\t\t\t],\n\t\t\t\t\t\t\t\"properties\": {\n\t\t\t\t\t\t\t\t\"path\": {\n\t\t\t\t\t\t\t\t\t\"description\": \"Path to the profiler executable\",\n\t\t\t\t\t\t\t\t\t\"required\": true,\n\t\t\t\t\t\t\t\t\t\"type\": \"string\"\n\t\t\t\t\t\t\t\t},\n\t\t\t\t\t\t\t\t\"config\": {\n\t\t\t\t\t\t\t\t\t\"description\": \"Path to the profiler language config file\",\n\t\t\t\t\t\t\t\t\t\"required\": true,\n\t\t\t\t\t\t\t\t\t\"type\": \"string\"\n\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t},\n\t\t\t\t\t\t\"leFeatures\": {\n\t\t\t\t\t\t\t\"description\": \"List of the lexicon extension features\",\n\t\t\t\t\t\t\t\"required\": true,\n\t\t\t\t\t\t\t\"type\": \"array\",\n\t\t\t\t\t\t\t\"items\": {\n\t\t\t\t\t\t\t\t\"description\": \"Feature configuration\",\n\t\t\t\t\t\t\t\t\"type\": \"object\",\n\t\t\t\t\t\t\t\t\"required\": [\n\t\t\t\t\t\t\t\t\t\"type\",\n\t\t\t\t\t\t\t\t\t\"name\"\n\t\t\t\t\t\t\t\t],\n\t\t\t\t\t\t\t\t\"properties\": {\n\t\t\t\t\t\t\t\t\t\"name\": {\n\t\t\t\t\t\t\t\t\t\t\"description\": \"Name of the feature\",\n\t\t\t\t\t\t\t\t\t\t\"type\": \"string\"\n\t\t\t\t\t\t\t\t\t},\n\t\t\t\t\t\t\t\t\t\"type\": {\n\t\t\t\t\t\t\t\t\t\t\"description\": \"Fully qualified java class name of the feature\",\n\t\t\t\t\t\t\t\t\t\t\"type\": \"string\"\n\t\t\t\t\t\t\t\t\t},\n\t\t\t\t\t\t\t\t\t\"class\": {\n\t\t\t\t\t\t\t\t\t\t\"description\": \"Class name of the feature\",\n\t\t\t\t\t\t\t\t\t\t\"type\": \"string\"\n\t\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t},\n\t\t\t\t\t\t\"rrFeatures\": {\n\t\t\t\t\t\t\t\"description\": \"List of the reranker features\",\n\t\t\t\t\t\t\t\"required\": true,\n\t\t\t\t\t\t\t\"type\": \"array\",\n\t\t\t\t\t\t\t\"items\": {\n\t\t\t\t\t\t\t\t\"description\": \"Feature configuration\",\n\t\t\t\t\t\t\t\t\"type\": \"object\",\n\t\t\t\t\t\t\t\t\"required\": [\n\t\t\t\t\t\t\t\t\t\"type\",\n\t\t\t\t\t\t\t\t\t\"name\"\n\t\t\t\t\t\t\t\t],\n\t\t\t\t\t\t\t\t\"properties\": {\n\t\t\t\t\t\t\t\t\t\"name\": {\n\t\t\t\t\t\t\t\t\t\t\"description\": \"Name of the feature\",\n\t\t\t\t\t\t\t\t\t\t\"type\": \"string\"\n\t\t\t\t\t\t\t\t\t},\n\t\t\t\t\t\t\t\t\t\"type\": {\n\t\t\t\t\t\t\t\t\t\t\"description\": \"Fully qualified java class name of the feature\",\n\t\t\t\t\t\t\t\t\t\t\"type\": \"string\"\n\t\t\t\t\t\t\t\t\t},\n\t\t\t\t\t\t\t\t\t\"class\": {\n\t\t\t\t\t\t\t\t\t\t\"description\": \"Class name of the feature\",\n\t\t\t\t\t\t\t\t\t\t\"type\": \"string\"\n\t\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t},\n\t\t\t\t\t\t\"dmFeatures\": {\n\t\t\t\t\t\t\t\"description\": \"List of the desicion maker features\",\n\t\t\t\t\t\t\t\"required\": true,\n\t\t\t\t\t\t\t\"type\": \"array\",\n\t\t\t\t\t\t\t\"items\": {\n\t\t\t\t\t\t\t\t\"description\": \"Feature configuration\",\n\t\t\t\t\t\t\t\t\"type\": \"object\",\n\t\t\t\t\t\t\t\t\"required\": [\n\t\t\t\t\t\t\t\t\t\"type\",\n\t\t\t\t\t\t\t\t\t\"name\"\n\t\t\t\t\t\t\t\t],\n\t\t\t\t\t\t\t\t\"properties\": {\n\t\t\t\t\t\t\t\t\t\"name\": {\n\t\t\t\t\t\t\t\t\t\t\"description\": \"Name of the feature\",\n\t\t\t\t\t\t\t\t\t\t\"type\": \"string\"\n\t\t\t\t\t\t\t\t\t},\n\t\t\t\t\t\t\t\t\t\"type\": {\n\t\t\t\t\t\t\t\t\t\t\"description\": \"Fully qualified java class name of the feature\",\n\t\t\t\t\t\t\t\t\t\t\"type\": \"string\"\n\t\t\t\t\t\t\t\t\t},\n\t\t\t\t\t\t\t\t\t\"class\": {\n\t\t\t\t\t\t\t\t\t\t\"description\": \"Class name of the feature\",\n\t\t\t\t\t\t\t\t\t\t\"type\": \"string\"\n\t\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t},\n\t\t\"ocrd-cis-post-correct\": {\n\t\t\t\"executable\": \"ocrd-cis-post-correct.sh\",\n\t\t\t\"categories\": [\n\t\t\t\t\"Text recognition and optimization\"\n\t\t\t],\n\t\t\t\"steps\": [\n\t\t\t\t\"postprocessing/alignment\"\n\t\t\t],\n\t\t\t\"description\": \"Post correct OCR results\",\n\t\t\t\"parameters\": {\n\t\t\t\t\"ocrSteps\": {\n\t\t\t\t\t\"description\": \"List of additional ocr steps\",\n\t\t\t\t\t\"type\": \"array\",\n\t\t\t\t\t\"required\": true,\n\t\t\t\t\t\"items\": {\n\t\t\t\t\t\t\"description\": \"OCR command that is evaled using the bash eval command (available parameters: $METS, $LOG_LEVEL, $XML_INPUT_FILE_GRP, $XML_OUTPUT_FILE_GRP, $PARAMETER)\",\n\t\t\t\t\t\t\"type\": \"string\"\n\t\t\t\t\t}\n\t\t\t\t},\n\t\t\t\t\"postCorrection\": {\n\t\t\t\t\t\"description\": \"Configuration of post correction command\",\n\t\t\t\t\t\"type\": \"object\",\n\t\t\t\t\t\"required\": [\n\t\t\t\t\t\t\"maxCandidate\",\n\t\t\t\t\t\t\"profiler\",\n\t\t\t\t\t\t\"model\",\n\t\t\t\t\t\t\"runLE\",\n\t\t\t\t\t\t\"runDM\"\n\t\t\t\t\t],\n\t\t\t\t\t\"properties\": {\n\t\t\t\t\t\t\"maxCandidate\": {\n\t\t\t\t\t\t\t\"description\": \"Maximum number of considered profiler candidates per token\",\n\t\t\t\t\t\t\t\"type\": \"integer\",\n\t\t\t\t\t\t\t\"required\": true\n\t\t\t\t\t\t},\n\t\t\t\t\t\t\"profiler\": {\n\t\t\t\t\t\t\t\"description\": \"Profiler configuration\",\n\t\t\t\t\t\t\t\"type\": \"object\",\n\t\t\t\t\t\t\t\"required\": [\n\t\t\t\t\t\t\t\t\"path\",\n\t\t\t\t\t\t\t\t\"config\"\n\t\t\t\t\t\t\t],\n\t\t\t\t\t\t\t\"properties\": {\n\t\t\t\t\t\t\t\t\"path\": {\n\t\t\t\t\t\t\t\t\t\"description\": \"Path to the profiler executable\",\n\t\t\t\t\t\t\t\t\t\"required\": true,\n\t\t\t\t\t\t\t\t\t\"type\": \"string\"\n\t\t\t\t\t\t\t\t},\n\t\t\t\t\t\t\t\t\"config\": {\n\t\t\t\t\t\t\t\t\t\"description\": \"Path to the profiler language config file\",\n\t\t\t\t\t\t\t\t\t\"required\": true,\n\t\t\t\t\t\t\t\t\t\"type\": \"string\"\n\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t},\n\t\t\t\t\t\t\"model\": {\n\t\t\t\t\t\t\t\"description\": \"Path to the post correction model file\",\n\t\t\t\t\t\t\t\"type\": \"string\",\n\t\t\t\t\t\t\t\"required\": true\n\t\t\t\t\t\t},\n\t\t\t\t\t\t\"runLE\": {\n\t\t\t\t\t\t\t\"description\": \"Do run the lexicon extension step for the post correction\",\n\t\t\t\t\t\t\t\"required\": true,\n\t\t\t\t\t\t\t\"type\": \"boolean\"\n\t\t\t\t\t\t},\n\t\t\t\t\t\t\"runDM\": {\n\t\t\t\t\t\t\t\"description\": \"Do run the ranking and the decision step for the post correction\",\n\t\t\t\t\t\t\t\"required\": true,\n\t\t\t\t\t\t\t\"type\": \"boolean\"\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n}\n", "setup.py"=>"\"\"\"\nInstalls:\n    - ocrd-cis-align\n    - ocrd-cis-training\n    - ocrd-cis-profile\n    - ocrd-cis-wer\n    - ocrd-cis-data\n    - ocrd-cis-ocropy-clip\n    - ocrd-cis-ocropy-denoise\n    - ocrd-cis-ocropy-deskew\n    - ocrd-cis-ocropy-binarize\n    - ocrd-cis-ocropy-resegment\n    - ocrd-cis-ocropy-segment\n    - ocrd-cis-ocropy-dewarp\n    - ocrd-cis-ocropy-recognize\n    - ocrd-cis-ocropy-train\n\"\"\"\n\nimport codecs\nfrom setuptools import setup\nfrom setuptools import find_packages\n\nwith codecs.open('README.md', encoding='utf-8') as f:\n    README = f.read()\n\nsetup(\n    name='ocrd_cis',\n    version='0.0.6',\n    description='CIS OCR-D command line tools',\n    long_description=README,\n    long_description_content_type='text/markdown',\n    author='Florian Fink, Tobias Englmeier, Christoph Weber',\n    author_email='finkf@cis.lmu.de, englmeier@cis.lmu.de, web_chris@msn.com',\n    url='https://github.com/cisocrgroup/ocrd_cis',\n    license='MIT',\n    packages=find_packages(),\n    include_package_data=True,\n    install_requires=[\n        'ocrd>=2.0.0',\n        'click',\n        'scipy',\n        'numpy>=1.17.0',\n        'pillow>=6.2.0',\n        'shapely',\n        'matplotlib>3.0.0',\n        'python-Levenshtein',\n        'calamari_ocr == 0.3.5'\n    ],\n    package_data={\n        '': ['*.json', '*.yml', '*.yaml', '*.csv.gz', '*.jar'],\n    },\n    scripts=[\n        'bashlib/ocrd-cis-lib.sh',\n        'bashlib/ocrd-cis-train.sh',\n        'bashlib/ocrd-cis-post-correct.sh',\n    ],\n    entry_points={\n        'console_scripts': [\n            'ocrd-cis-align=ocrd_cis.align.cli:ocrd_cis_align',\n            'ocrd-cis-profile=ocrd_cis.profile.cli:ocrd_cis_profile',\n            'ocrd-cis-wer=ocrd_cis.wer.cli:ocrd_cis_wer',\n            'ocrd-cis-data=ocrd_cis.data.__main__:main',\n            'ocrd-cis-ocropy-binarize=ocrd_cis.ocropy.cli:ocrd_cis_ocropy_binarize',\n            'ocrd-cis-ocropy-clip=ocrd_cis.ocropy.cli:ocrd_cis_ocropy_clip',\n            'ocrd-cis-ocropy-denoise=ocrd_cis.ocropy.cli:ocrd_cis_ocropy_denoise',\n            'ocrd-cis-ocropy-deskew=ocrd_cis.ocropy.cli:ocrd_cis_ocropy_deskew',\n            'ocrd-cis-ocropy-dewarp=ocrd_cis.ocropy.cli:ocrd_cis_ocropy_dewarp',\n            'ocrd-cis-ocropy-recognize=ocrd_cis.ocropy.cli:ocrd_cis_ocropy_recognize',\n            'ocrd-cis-ocropy-rec=ocrd_cis.ocropy.cli:ocrd_cis_ocropy_rec',\n            'ocrd-cis-ocropy-resegment=ocrd_cis.ocropy.cli:ocrd_cis_ocropy_resegment',\n            'ocrd-cis-ocropy-segment=ocrd_cis.ocropy.cli:ocrd_cis_ocropy_segment',\n            'ocrd-cis-ocropy-train=ocrd_cis.ocropy.cli:ocrd_cis_ocropy_train',\n        ]\n    },\n)\n"}, "git"=>{"last_commit"=>"Thu Jan 23 15:42:32 2020 +0100", "latest_tag"=>"", "number_of_commits"=>"436", "url"=>"https://github.com/cisocrgroup/ocrd_cis.git"}, "name"=>"ocrd_cis", "ocrd_tool"=>{"git_url"=>"https://github.com/cisocrgroup/ocrd_cis", "tools"=>{"ocrd-cis-align"=>{"categories"=>["Text recognition and optimization"], "description"=>"Align multiple OCRs and/or GTs", "executable"=>"ocrd-cis-align", "steps"=>["postprocessing/alignment"]}, "ocrd-cis-jar"=>{"categories"=>["Text recognition and optimization"], "description"=>"Output path to the ocrd-cis.jar file", "executable"=>"ocrd-cis-jar", "steps"=>["postprocessing/alignment"]}, "ocrd-cis-ocropy-binarize"=>{"categories"=>["Image preprocessing"], "description"=>"Binarize (and optionally deskew/despeckle) pages / regions / lines with ocropy", "executable"=>"ocrd-cis-ocropy-binarize", "input_file_grp"=>["OCR-D-IMG", "OCR-D-SEG-BLOCK", "OCR-D-SEG-LINE"], "output_file_grp"=>["OCR-D-IMG-BIN", "OCR-D-SEG-BLOCK", "OCR-D-SEG-LINE"], "parameters"=>{"grayscale"=>{"default"=>false, "description"=>"for the ocropy method, produce grayscale-normalized instead of thresholded image", "type"=>"boolean"}, "level-of-operation"=>{"default"=>"page", "description"=>"PAGE XML hierarchy level granularity to annotate images for", "enum"=>["page", "region", "line"], "type"=>"string"}, "maxskew"=>{"default"=>0.0, "description"=>"modulus of maximum skewing angle to detect (larger will be slower, 0 will deactivate deskewing)", "type"=>"number"}, "method"=>{"default"=>"ocropy", "description"=>"binarization method to use (only ocropy will include deskewing)", "enum"=>["none", "global", "otsu", "gauss-otsu", "ocropy"], "type"=>"string"}, "noise_maxsize"=>{"default"=>0, "description"=>"maximum pixel number for connected components to regard as noise (0 will deactivate denoising)", "type"=>"number"}}, "steps"=>["preprocessing/optimization/binarization", "preprocessing/optimization/grayscale_normalization", "preprocessing/optimization/deskewing"]}, "ocrd-cis-ocropy-clip"=>{"categories"=>["Layout analysis"], "description"=>"Clip text regions / lines at intersections with neighbours", "executable"=>"ocrd-cis-ocropy-clip", "input_file_grp"=>["OCR-D-SEG-BLOCK", "OCR-D-SEG-LINE"], "output_file_grp"=>["OCR-D-SEG-BLOCK", "OCR-D-SEG-LINE"], "parameters"=>{"dpi"=>{"default"=>-1, "description"=>"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative", "format"=>"float", "type"=>"number"}, "level-of-operation"=>{"default"=>"region", "description"=>"PAGE XML hierarchy level granularity to annotate images for", "enum"=>["region", "line"], "type"=>"string"}, "min_fraction"=>{"default"=>0.7, "description"=>"share of foreground pixels that must be retained by the largest label", "format"=>"float", "type"=>"number"}}, "steps"=>["layout/segmentation/region", "layout/segmentation/line"]}, "ocrd-cis-ocropy-denoise"=>{"categories"=>["Image preprocessing"], "description"=>"Despeckle pages / regions / lines with ocropy", "executable"=>"ocrd-cis-ocropy-denoise", "input_file_grp"=>["OCR-D-IMG", "OCR-D-SEG-BLOCK", "OCR-D-SEG-LINE"], "output_file_grp"=>["OCR-D-IMG-DESPECK", "OCR-D-SEG-BLOCK", "OCR-D-SEG-LINE"], "parameters"=>{"dpi"=>{"default"=>-1, "description"=>"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative", "format"=>"float", "type"=>"number"}, "level-of-operation"=>{"default"=>"page", "description"=>"PAGE XML hierarchy level granularity to annotate images for", "enum"=>["page", "region", "line"], "type"=>"string"}, "noise_maxsize"=>{"default"=>3.0, "description"=>"maximum size in points (pt) for connected components to regard as noise (0 will deactivate denoising)", "format"=>"float", "type"=>"number"}}, "steps"=>["preprocessing/optimization/despeckling"]}, "ocrd-cis-ocropy-deskew"=>{"categories"=>["Image preprocessing"], "description"=>"Deskew regions with ocropy (by annotating orientation angle and adding AlternativeImage)", "executable"=>"ocrd-cis-ocropy-deskew", "input_file_grp"=>["OCR-D-SEG-BLOCK", "OCR-D-SEG-LINE"], "output_file_grp"=>["OCR-D-SEG-BLOCK", "OCR-D-SEG-LINE"], "parameters"=>{"level-of-operation"=>{"default"=>"region", "description"=>"PAGE XML hierarchy level granularity to annotate images for", "enum"=>["page", "region"], "type"=>"string"}, "maxskew"=>{"default"=>5.0, "description"=>"modulus of maximum skewing angle to detect (larger will be slower, 0 will deactivate deskewing)", "type"=>"number"}}, "steps"=>["preprocessing/optimization/deskewing"]}, "ocrd-cis-ocropy-dewarp"=>{"categories"=>["Image preprocessing"], "description"=>"Dewarp line images with ocropy", "executable"=>"ocrd-cis-ocropy-dewarp", "input_file_grp"=>["OCR-D-SEG-LINE"], "output_file_grp"=>["OCR-D-SEG-LINE"], "parameters"=>{"dpi"=>{"default"=>-1, "description"=>"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative", "format"=>"float", "type"=>"number"}, "max_neighbour"=>{"default"=>0.05, "description"=>"maximum rate of foreground pixels intruding from neighbouring lines (line will not be processed above that)", "format"=>"float", "type"=>"number"}, "range"=>{"default"=>4.0, "description"=>"maximum vertical disposition or maximum margin (will be multiplied by mean centerline deltas to yield pixels)", "format"=>"float", "type"=>"number"}}, "steps"=>["preprocessing/optimization/dewarping"]}, "ocrd-cis-ocropy-rec"=>{"categories"=>["Text recognition and optimization"], "description"=>"Recognize text snippets", "executable"=>"ocrd-cis-ocropy-rec", "parameters"=>{"model"=>{"description"=>"ocropy model to apply (e.g. fraktur.pyrnn)", "type"=>"string"}}, "steps"=>["recognition/text-recognition"]}, "ocrd-cis-ocropy-recognize"=>{"categories"=>["Text recognition and optimization"], "description"=>"Recognize text in (binarized+deskewed+dewarped) lines with ocropy", "executable"=>"ocrd-cis-ocropy-recognize", "input_file_grp"=>["OCR-D-SEG-LINE", "OCR-D-SEG-WORD", "OCR-D-SEG-GLYPH"], "output_file_grp"=>["OCR-D-OCR-OCRO"], "parameters"=>{"model"=>{"description"=>"ocropy model to apply (e.g. fraktur.pyrnn)", "type"=>"string"}, "textequiv_level"=>{"default"=>"line", "description"=>"PAGE XML hierarchy level granularity to add the TextEquiv results to", "enum"=>["line", "word", "glyph"], "type"=>"string"}}, "steps"=>["recognition/text-recognition"]}, "ocrd-cis-ocropy-resegment"=>{"categories"=>["Layout analysis"], "description"=>"Resegment lines with ocropy (by shrinking annotated polygons)", "executable"=>"ocrd-cis-ocropy-resegment", "input_file_grp"=>["OCR-D-SEG-LINE"], "output_file_grp"=>["OCR-D-SEG-LINE"], "parameters"=>{"dpi"=>{"default"=>-1, "description"=>"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative", "format"=>"float", "type"=>"number"}, "extend_margins"=>{"default"=>3, "description"=>"number of pixels to extend the input polygons horizontally and vertically before intersecting", "format"=>"integer", "type"=>"number"}, "min_fraction"=>{"default"=>0.8, "description"=>"share of foreground pixels that must be retained by the largest label", "format"=>"float", "type"=>"number"}}, "steps"=>["layout/segmentation/line"]}, "ocrd-cis-ocropy-segment"=>{"categories"=>["Layout analysis"], "description"=>"Segment pages into regions or regions into lines with ocropy", "executable"=>"ocrd-cis-ocropy-segment", "input_file_grp"=>["OCR-D-GT-SEG-BLOCK", "OCR-D-SEG-BLOCK"], "output_file_grp"=>["OCR-D-SEG-LINE"], "parameters"=>{"dpi"=>{"default"=>-1, "description"=>"pixel density in dots per inch (overrides any meta-data in the images); disabled when negative", "format"=>"float", "type"=>"number"}, "level-of-operation"=>{"default"=>"region", "description"=>"PAGE XML hierarchy level to read images from", "enum"=>["page", "region"], "type"=>"string"}, "maxcolseps"=>{"default"=>2, "description"=>"number of white/background column separators to try (when operating on the page level)", "format"=>"integer", "type"=>"number"}, "maxseps"=>{"default"=>5, "description"=>"number of black/foreground column separators to try, counted individually as lines (when operating on the page level)", "format"=>"integer", "type"=>"number"}, "overwrite_lines"=>{"default"=>true, "description"=>"remove any existing TextLine elements (when operating on the region level)", "type"=>"boolean"}, "overwrite_regions"=>{"default"=>true, "description"=>"remove any existing TextRegion elements (when operating on the page level)", "type"=>"boolean"}, "spread"=>{"default"=>2.4, "description"=>"distance in points (pt) from the foreground to project text line (or text region) labels into the background", "format"=>"float", "type"=>"number"}}, "steps"=>["layout/segmentation/region", "layout/segmentation/line"]}, "ocrd-cis-ocropy-train"=>{"categories"=>["lstm ocropy model training"], "description"=>"train model with ground truth from mets data", "executable"=>"ocrd-cis-ocropy-train", "parameters"=>{"model"=>{"description"=>"load model or crate new one (e.g. fraktur.pyrnn)", "type"=>"string"}, "ntrain"=>{"default"=>1000000, "description"=>"lines to train before stopping", "type"=>"integer"}, "outputpath"=>{"description"=>"(existing) path for the trained model", "type"=>"string"}, "textequiv_level"=>{"default"=>"line", "enum"=>["line", "word", "glyph"], "type"=>"string"}}, "steps"=>["training"]}, "ocrd-cis-post-correct"=>{"categories"=>["Text recognition and optimization"], "description"=>"Post correct OCR results", "executable"=>"ocrd-cis-post-correct.sh", "parameters"=>{"ocrSteps"=>{"description"=>"List of additional ocr steps", "items"=>{"description"=>"OCR command that is evaled using the bash eval command (available parameters: $METS, $LOG_LEVEL, $XML_INPUT_FILE_GRP, $XML_OUTPUT_FILE_GRP, $PARAMETER)", "type"=>"string"}, "required"=>true, "type"=>"array"}, "postCorrection"=>{"description"=>"Configuration of post correction command", "properties"=>{"maxCandidate"=>{"description"=>"Maximum number of considered profiler candidates per token", "required"=>true, "type"=>"integer"}, "model"=>{"description"=>"Path to the post correction model file", "required"=>true, "type"=>"string"}, "profiler"=>{"description"=>"Profiler configuration", "properties"=>{"config"=>{"description"=>"Path to the profiler language config file", "required"=>true, "type"=>"string"}, "path"=>{"description"=>"Path to the profiler executable", "required"=>true, "type"=>"string"}}, "required"=>["path", "config"], "type"=>"object"}, "runDM"=>{"description"=>"Do run the ranking and the decision step for the post correction", "required"=>true, "type"=>"boolean"}, "runLE"=>{"description"=>"Do run the lexicon extension step for the post correction", "required"=>true, "type"=>"boolean"}}, "required"=>["maxCandidate", "profiler", "model", "runLE", "runDM"], "type"=>"object"}}, "steps"=>["postprocessing/alignment"]}, "ocrd-cis-profile"=>{"categories"=>["Text recognition and optimization"], "description"=>"Add a correction suggestions and suspicious tokens (profile)", "executable"=>"ocrd-cis-profile", "parameters"=>{"additionalLexicon"=>{"default"=>"", "required"=>false, "type"=>"string"}, "backend"=>{"required"=>true, "type"=>"string"}, "executable"=>{"required"=>true, "type"=>"string"}, "language"=>{"default"=>"german", "required"=>false, "type"=>"string"}}, "steps"=>["postprocessing/alignment"]}, "ocrd-cis-train"=>{"categories"=>["Text recognition and optimization"], "description"=>"Train post correction model", "executable"=>"ocrd-cis-train.sh", "parameters"=>{"gtArchives"=>{"description"=>"List of ground truth archives", "items"=>{"description"=>"Path (or URL) to a ground truth archive", "type"=>"string"}, "required"=>true, "type"=>"array"}, "imagePreprocessingSteps"=>{"description"=>"List of image preprocessing steps", "items"=>{"description"=>"Image preprocessing command that is evaled using the bash eval command (available parameters: $METS, $LOG_LEVEL, $XML_INPUT_FILE_GRP, $XML_OUTPUT_FILE_GRP, $IMG_OUTPUT_FILE_GRP, $IMG_INPUT_FILE_GRP, $PARAMETER)", "type"=>"string"}, "required"=>true, "type"=>"array"}, "ocrSteps"=>{"description"=>"List of ocr steps", "items"=>{"description"=>"OCR command that is evaled using the bash eval command (available parameters: $METS, $LOG_LEVEL, $XML_INPUT_FILE_GRP, $XML_OUTPUT_FILE_GRP, $PARAMETER)", "type"=>"string"}, "required"=>true, "type"=>"array"}, "training"=>{"description"=>"Configuration of training command", "properties"=>{"dmFeatures"=>{"description"=>"List of the desicion maker features", "items"=>{"description"=>"Feature configuration", "properties"=>{"class"=>{"description"=>"Class name of the feature", "type"=>"string"}, "name"=>{"description"=>"Name of the feature", "type"=>"string"}, "type"=>{"description"=>"Fully qualified java class name of the feature", "type"=>"string"}}, "required"=>["type", "name"], "type"=>"object"}, "required"=>true, "type"=>"array"}, "filterClasses"=>{"description"=>"List of filtered feature classes", "items"=>{"description"=>"Class name of feature class to filter", "type"=>"string"}, "required"=>false, "type"=>"array"}, "leFeatures"=>{"description"=>"List of the lexicon extension features", "items"=>{"description"=>"Feature configuration", "properties"=>{"class"=>{"description"=>"Class name of the feature", "type"=>"string"}, "name"=>{"description"=>"Name of the feature", "type"=>"string"}, "type"=>{"description"=>"Fully qualified java class name of the feature", "type"=>"string"}}, "required"=>["type", "name"], "type"=>"object"}, "required"=>true, "type"=>"array"}, "maxCandidate"=>{"description"=>"Maximum number of considered profiler candidates per token", "required"=>true, "type"=>"integer"}, "profiler"=>{"description"=>"Profiler configuration", "properties"=>{"config"=>{"description"=>"Path to the profiler language config file", "required"=>true, "type"=>"string"}, "path"=>{"description"=>"Path to the profiler executable", "required"=>true, "type"=>"string"}}, "required"=>["path", "config"], "type"=>"object"}, "rrFeatures"=>{"description"=>"List of the reranker features", "items"=>{"description"=>"Feature configuration", "properties"=>{"class"=>{"description"=>"Class name of the feature", "type"=>"string"}, "name"=>{"description"=>"Name of the feature", "type"=>"string"}, "type"=>{"description"=>"Fully qualified java class name of the feature", "type"=>"string"}}, "required"=>["type", "name"], "type"=>"object"}, "required"=>true, "type"=>"array"}, "trigrams"=>{"description"=>"Path to character trigrams csv file (format: n,trigram)", "required"=>true, "type"=>"string"}}, "required"=>["trigrams", "maxCandidate", "profiler", "leFeatures", "rrFeatures", "dmFeatures"], "type"=>"object"}}, "steps"=>["postprocessing/alignment"]}, "ocrd-cis-wer"=>{"categories"=>["evaluation"], "description"=>"calculate the word error rate for aligned page xml files", "executable"=>"ocrd-cis-wer", "parameters"=>{"gtIndex"=>{"default"=>-1, "description"=>"text equiv index for the gt tokens", "type"=>"integer"}, "testIndex"=>{"default"=>0, "description"=>"text equiv index for the test/ocr tokens", "type"=>"integer"}}, "steps"=>["evaluation"]}}, "version"=>"0.0.6"}, "ocrd_tool_validate"=>"<report valid=\"false\">\n  [tools.ocrd-cis-ocropy-rec] 'input_file_grp' is a required property\n  [tools.ocrd-cis-ocropy-train] 'input_file_grp' is a required property\n  [tools.ocrd-cis-ocropy-train.parameters.textequiv_level] 'description' is a required property\n  [tools.ocrd-cis-ocropy-train.parameters.ntrain.type] 'integer' is not one of ['string', 'number', 'boolean']\n  [tools.ocrd-cis-ocropy-train.categories.0] 'lstm ocropy model training' is not one of ['Image preprocessing', 'Layout analysis', 'Text recognition and optimization', 'Model training', 'Long-term preservation', 'Quality assurance']\n  [tools.ocrd-cis-ocropy-train.steps.0] 'training' is not one of ['preprocessing/characterization', 'preprocessing/optimization', 'preprocessing/optimization/cropping', 'preprocessing/optimization/deskewing', 'preprocessing/optimization/despeckling', 'preprocessing/optimization/dewarping', 'preprocessing/optimization/binarization', 'preprocessing/optimization/grayscale_normalization', 'recognition/text-recognition', 'recognition/font-identification', 'recognition/post-correction', 'layout/segmentation', 'layout/segmentation/text-nontext', 'layout/segmentation/region', 'layout/segmentation/line', 'layout/segmentation/word', 'layout/segmentation/classification', 'layout/analysis']\n  [tools.ocrd-cis-align] 'input_file_grp' is a required property\n  [tools.ocrd-cis-align.steps.0] 'postprocessing/alignment' is not one of ['preprocessing/characterization', 'preprocessing/optimization', 'preprocessing/optimization/cropping', 'preprocessing/optimization/deskewing', 'preprocessing/optimization/despeckling', 'preprocessing/optimization/dewarping', 'preprocessing/optimization/binarization', 'preprocessing/optimization/grayscale_normalization', 'recognition/text-recognition', 'recognition/font-identification', 'recognition/post-correction', 'layout/segmentation', 'layout/segmentation/text-nontext', 'layout/segmentation/region', 'layout/segmentation/line', 'layout/segmentation/word', 'layout/segmentation/classification', 'layout/analysis']\n  [tools.ocrd-cis-wer] 'input_file_grp' is a required property\n  [tools.ocrd-cis-wer.parameters.testIndex.type] 'integer' is not one of ['string', 'number', 'boolean']\n  [tools.ocrd-cis-wer.parameters.gtIndex.type] 'integer' is not one of ['string', 'number', 'boolean']\n  [tools.ocrd-cis-wer.categories.0] 'evaluation' is not one of ['Image preprocessing', 'Layout analysis', 'Text recognition and optimization', 'Model training', 'Long-term preservation', 'Quality assurance']\n  [tools.ocrd-cis-wer.steps.0] 'evaluation' is not one of ['preprocessing/characterization', 'preprocessing/optimization', 'preprocessing/optimization/cropping', 'preprocessing/optimization/deskewing', 'preprocessing/optimization/despeckling', 'preprocessing/optimization/dewarping', 'preprocessing/optimization/binarization', 'preprocessing/optimization/grayscale_normalization', 'recognition/text-recognition', 'recognition/font-identification', 'recognition/post-correction', 'layout/segmentation', 'layout/segmentation/text-nontext', 'layout/segmentation/region', 'layout/segmentation/line', 'layout/segmentation/word', 'layout/segmentation/classification', 'layout/analysis']\n  [tools.ocrd-cis-jar] 'input_file_grp' is a required property\n  [tools.ocrd-cis-jar.steps.0] 'postprocessing/alignment' is not one of ['preprocessing/characterization', 'preprocessing/optimization', 'preprocessing/optimization/cropping', 'preprocessing/optimization/deskewing', 'preprocessing/optimization/despeckling', 'preprocessing/optimization/dewarping', 'preprocessing/optimization/binarization', 'preprocessing/optimization/grayscale_normalization', 'recognition/text-recognition', 'recognition/font-identification', 'recognition/post-correction', 'layout/segmentation', 'layout/segmentation/text-nontext', 'layout/segmentation/region', 'layout/segmentation/line', 'layout/segmentation/word', 'layout/segmentation/classification', 'layout/analysis']\n  [tools.ocrd-cis-profile] 'input_file_grp' is a required property\n  [tools.ocrd-cis-profile.parameters.executable] 'description' is a required property\n  [tools.ocrd-cis-profile.parameters.backend] 'description' is a required property\n  [tools.ocrd-cis-profile.parameters.language] 'description' is a required property\n  [tools.ocrd-cis-profile.parameters.additionalLexicon] 'description' is a required property\n  [tools.ocrd-cis-profile.steps.0] 'postprocessing/alignment' is not one of ['preprocessing/characterization', 'preprocessing/optimization', 'preprocessing/optimization/cropping', 'preprocessing/optimization/deskewing', 'preprocessing/optimization/despeckling', 'preprocessing/optimization/dewarping', 'preprocessing/optimization/binarization', 'preprocessing/optimization/grayscale_normalization', 'recognition/text-recognition', 'recognition/font-identification', 'recognition/post-correction', 'layout/segmentation', 'layout/segmentation/text-nontext', 'layout/segmentation/region', 'layout/segmentation/line', 'layout/segmentation/word', 'layout/segmentation/classification', 'layout/analysis']\n  [tools.ocrd-cis-train] 'input_file_grp' is a required property\n  [tools.ocrd-cis-train.parameters.gtArchives] Additional properties are not allowed ('items' was unexpected)\n  [tools.ocrd-cis-train.parameters.gtArchives.type] 'array' is not one of ['string', 'number', 'boolean']\n  [tools.ocrd-cis-train.parameters.imagePreprocessingSteps] Additional properties are not allowed ('items' was unexpected)\n  [tools.ocrd-cis-train.parameters.imagePreprocessingSteps.type] 'array' is not one of ['string', 'number', 'boolean']\n  [tools.ocrd-cis-train.parameters.ocrSteps] Additional properties are not allowed ('items' was unexpected)\n  [tools.ocrd-cis-train.parameters.ocrSteps.type] 'array' is not one of ['string', 'number', 'boolean']\n  [tools.ocrd-cis-train.parameters.training] Additional properties are not allowed ('properties' was unexpected)\n  [tools.ocrd-cis-train.parameters.training.type] 'object' is not one of ['string', 'number', 'boolean']\n  [tools.ocrd-cis-train.parameters.training.required] ['trigrams', 'maxCandidate', 'profiler', 'leFeatures', 'rrFeatures', 'dmFeatures'] is not of type 'boolean'\n  [tools.ocrd-cis-train.steps.0] 'postprocessing/alignment' is not one of ['preprocessing/characterization', 'preprocessing/optimization', 'preprocessing/optimization/cropping', 'preprocessing/optimization/deskewing', 'preprocessing/optimization/despeckling', 'preprocessing/optimization/dewarping', 'preprocessing/optimization/binarization', 'preprocessing/optimization/grayscale_normalization', 'recognition/text-recognition', 'recognition/font-identification', 'recognition/post-correction', 'layout/segmentation', 'layout/segmentation/text-nontext', 'layout/segmentation/region', 'layout/segmentation/line', 'layout/segmentation/word', 'layout/segmentation/classification', 'layout/analysis']\n  [tools.ocrd-cis-post-correct] 'input_file_grp' is a required property\n  [tools.ocrd-cis-post-correct.parameters.ocrSteps] Additional properties are not allowed ('items' was unexpected)\n  [tools.ocrd-cis-post-correct.parameters.ocrSteps.type] 'array' is not one of ['string', 'number', 'boolean']\n  [tools.ocrd-cis-post-correct.parameters.postCorrection] Additional properties are not allowed ('properties' was unexpected)\n  [tools.ocrd-cis-post-correct.parameters.postCorrection.type] 'object' is not one of ['string', 'number', 'boolean']\n  [tools.ocrd-cis-post-correct.parameters.postCorrection.required] ['maxCandidate', 'profiler', 'model', 'runLE', 'runDM'] is not of type 'boolean'\n  [tools.ocrd-cis-post-correct.steps.0] 'postprocessing/alignment' is not one of ['preprocessing/characterization', 'preprocessing/optimization', 'preprocessing/optimization/cropping', 'preprocessing/optimization/deskewing', 'preprocessing/optimization/despeckling', 'preprocessing/optimization/dewarping', 'preprocessing/optimization/binarization', 'preprocessing/optimization/grayscale_normalization', 'recognition/text-recognition', 'recognition/font-identification', 'recognition/post-correction', 'layout/segmentation', 'layout/segmentation/text-nontext', 'layout/segmentation/region', 'layout/segmentation/line', 'layout/segmentation/word', 'layout/segmentation/classification', 'layout/analysis']\n</report>", "official"=>true, "org_plus_name"=>"cisocrgroup/ocrd_cis", "python"=>{"author"=>"Florian Fink, Tobias Englmeier, Christoph Weber", "author-email"=>"finkf@cis.lmu.de, englmeier@cis.lmu.de, web_chris@msn.com", "name"=>"ocrd_cis", "pypi"=>{"info"=>{"author"=>"Florian Fink, Tobias Englmeier, Christoph Weber", "author_email"=>"finkf@cis.lmu.de, englmeier@cis.lmu.de, web_chris@msn.com", "bugtrack_url"=>nil, "classifiers"=>[], "description"=>"[![Language grade: Python](https://img.shields.io/lgtm/grade/python/g/cisocrgroup/ocrd_cis.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/cisocrgroup/ocrd_cis/context:python)\n[![Total alerts](https://img.shields.io/lgtm/alerts/g/cisocrgroup/ocrd_cis.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/cisocrgroup/ocrd_cis/alerts/)\n# ocrd_cis\n\n[CIS](http://www.cis.lmu.de) [OCR-D](http://ocr-d.de) command line\ntools for the automatic post-correction of OCR-results.\n\n## Introduction\n`ocrd_cis` contains different tools for the automatic post correction\nof OCR-results.  It contains tools for the training, evaluation and\nexecution of the post correction.  Most of the tools are following the\n[OCR-D cli conventions](https://ocr-d.github.io/cli).\n\nThere is a helper tool to align multiple OCR results as well as a\nversion of ocropy that works with python3.\n\n## Installation\nThere are multiple ways to install the `ocrd_cis` tools:\n * `make install` uses `pip` to install `ocrd_cis` (see below).\n * `make install-devel` uses `pip -e` to install `ocrd_cis` (see\n   below).\n * `pip install --upgrade pip ocrd_cis_dir`\n * `pip install -e --upgrade pip ocrd_cis_dir`\n\nIt is possible to install `ocrd_cis` in a custom directory using\n`virtualenv`:\n```sh\n python3 -m venv venv-dir\n source venv-dir/bin/activate\n make install # or any other command to install ocrd_cis (see above)\n # use ocrd_cis\n deactivate\n```\n\n## Usage\nMost tools follow the [OCR-D cli\nconventions](https://ocr-d.github.io/cli).  They accept the\n`--input-file-grp`, `--output-file-grp`, `--parameter`, `--mets`,\n`--log-level` command line arguments (short and long).  For some tools\n(most notably the alignment tool) expect a comma seperated list of\nmultiple input file groups.\n\nThe [ocrd-tool.json](ocrd_cis/ocrd-tool.json) contains a schema\ndescription of the parameter config file for the different tools that\naccept the `--parameter` argument.\n\n### ocrd-cis-post-correct.sh\nThis bash script runs the post correction using a pre-trained\n[model](http://cis.lmu.de/~finkf/model.zip).  If additional support\nOCRs should be used, models for these OCR steps are required and must\nbe configured in an according configuration file (see ocrd-tool.json).\n\nArguments:\n * `--parameter` path to configuration file\n * `--input-file-grp` name of the master-OCR file group\n * `--output-file-grp` name of the post-correction file group\n * `--log-level` set log level\n * `--mets` path to METS file in workspace\n\n### ocrd-cis-align\nAligns tokens of multiple input file groups to one output file group.\nThis tool is used to align the master OCR with any additional support\nOCRs.  It accepts a comma-separated list of input file groups, which\nit aligns in order.\n\nArguments:\n * `--parameter` path to configuration file\n * `--input-file-grp` comma seperated list of the input file groups;\n   first input file group is the master OCR\n * `--output-file-grp` name of the file group for the aligned result\n * `--log-level` set log level\n * `--mets` path to METS file in workspace\n\n### ocrd-cis-train.sh\nScript to train a model from a list of ground-truth archives (see\nocrd-tool.json) for the post correction.  The tool somewhat mimics the\nbehaviour of other ocrd tools:\n * `--mets` for the workspace\n * `--log-level` is passed to other tools\n * `--parameter` is used as configuration\n * `--output-file-grp` defines the output file group for the model\n\n### ocrd-cis-data\nHelper tool to get the path of the installed data files. Usage:\n`ocrd-cis-data [-jar|-3gs]` to get the path of the jar library or the\npath to th default 3-grams language model file.\n\n### ocrd-cis-wer\nHelper tool to calculate the word error rate aligned ocr files.  It\nwrites a simple JSON-formated stats file to the given output file group.\n\nArguments:\n * `--input-file-grp` input file group of aligned ocr results with\n   their respective ground truth.\n * `--output-file-grp` name of the file group for the stats file\n * `--log-level` set log level\n * `--mets` path to METS file in workspace\n\n### ocrd-cis-profile\nRun the profiler over the given files of the according the given input\nfile grp and adds a gzipped JSON-formatted profile to the output file\ngroup of the workspace.  This tools requires an installed [language\nprofiler](https://github.com/cisocrgroup/Profiler).\n\nArguments:\n * `--parameter` path to configuration file\n * `--input-file-grp` name of the input file group to profile\n * `--output-file-grp` name of the output file group where the profile\n   is stored\n * `--log-level` set log level\n * `--mets` path to METS file in the workspace\n\n### ocrd-cis-ocropy-train\nThe ocropy-train tool can be used to train LSTM models.\nIt takes ground truth from the workspace and saves (image+text) snippets from the corresponding pages.\nThen a model is trained on all snippets for 1 million (or the given number of) randomized iterations from the parameter file.\n```sh\nocrd-cis-ocropy-train \\\n  --input-file-grp OCR-D-GT-SEG-LINE \\\n  --mets mets.xml\n  --parameter file:///path/to/config.json\n```\n\n### ocrd-cis-ocropy-clip\nThe ocropy-clip tool can be used to remove intrusions of neighbouring segments in regions / lines of a workspace.\nIt runs a (ad-hoc binarization and) connected component analysis on every text region / line of every PAGE in the input file group, as well as its overlapping neighbours, and for each binary object of conflict, determines whether it belongs to the neighbour, and can therefore be clipped to white. It references the resulting segment image files in the output PAGE (as AlternativeImage).\n```sh\nocrd-cis-ocropy-clip \\\n  --input-file-grp OCR-D-SEG-LINE \\\n  --output-file-grp OCR-D-SEG-LINE-CLIP \\\n  --mets mets.xml\n  --parameter file:///path/to/config.json\n```\n\n### ocrd-cis-ocropy-resegment\nThe ocropy-resegment tool can be used to remove overlap between lines of a workspace.\nIt runs a (ad-hoc binarization and) line segmentation on every text region of every PAGE in the input file group, and for each line already annotated, determines the label of largest extent within the original coordinates (polygon outline) in that line, and annotates the resulting coordinates in the output PAGE.\n```sh\nocrd-cis-ocropy-resegment \\\n  --input-file-grp OCR-D-SEG-LINE \\\n  --output-file-grp OCR-D-SEG-LINE-RES \\\n  --mets mets.xml\n  --parameter file:///path/to/config.json\n```\n\n### ocrd-cis-ocropy-segment\nThe ocropy-segment tool can be used to segment regions into lines.\nIt runs a (ad-hoc binarization and) line segmentation on every text region of every PAGE in the input file group, and adds a TextLine element with the resulting polygon outline to the annotation of the output PAGE.\n```sh\nocrd-cis-ocropy-segment \\\n  --input-file-grp OCR-D-SEG-BLOCK \\\n  --output-file-grp OCR-D-SEG-LINE \\\n  --mets mets.xml\n  --parameter file:///path/to/config.json\n```\n\n### ocrd-cis-ocropy-deskew\nThe ocropy-deskew tool can be used to deskew pages / regions of a workspace.\nIt runs the Ocropy thresholding and deskewing estimation on every segment of every PAGE in the input file group and annotates the orientation angle in the output PAGE.\n```sh\nocrd-cis-ocropy-deskew \\\n  --input-file-grp OCR-D-SEG-LINE \\\n  --output-file-grp OCR-D-SEG-LINE-DES \\\n  --mets mets.xml\n  --parameter file:///path/to/config.json\n```\n\n### ocrd-cis-ocropy-denoise\nThe ocropy-denoise tool can be used to despeckle pages / regions / lines of a workspace.\nIt runs the Ocropy \"nlbin\" denoising on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage).\n```sh\nocrd-cis-ocropy-denoise \\\n  --input-file-grp OCR-D-SEG-LINE-DES \\\n  --output-file-grp OCR-D-SEG-LINE-DEN \\\n  --mets mets.xml\n  --parameter file:///path/to/config.json\n```\n\n### ocrd-cis-ocropy-binarize\nThe ocropy-binarize tool can be used to binarize, denoise and deskew pages / regions / lines of a workspace.\nIt runs the Ocropy \"nlbin\" adaptive thresholding, deskewing estimation and denoising on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage). (If a deskewing angle has already been annotated in a region, the tool respects that and rotates accordingly.) Images can also be produced grayscale-normalized.\n```sh\nocrd-cis-ocropy-binarize \\\n  --input-file-grp OCR-D-SEG-LINE-DES \\\n  --output-file-grp OCR-D-SEG-LINE-BIN \\\n  --mets mets.xml\n  --parameter file:///path/to/config.json\n```\n\n### ocrd-cis-ocropy-dewarp\nThe ocropy-dewarp tool can be used to dewarp text lines of a workspace.\nIt runs the Ocropy baseline estimation and dewarping on every line in every text region of every PAGE in the input file group and references the resulting line image files in the output PAGE (as AlternativeImage).\n```sh\nocrd-cis-ocropy-dewarp \\\n  --input-file-grp OCR-D-SEG-LINE-BIN \\\n  --output-file-grp OCR-D-SEG-LINE-DEW \\\n  --mets mets.xml\n  --parameter file:///path/to/config.json\n```\n\n### ocrd-cis-ocropy-recognize\nThe ocropy-recognize tool can be used to recognize lines / words / glyphs from pages of a workspace.\nIt runs the Ocropy optical character recognition on every line in every text region of every PAGE in the input file group and adds the resulting text annotation in the output PAGE.\n```sh\nocrd-cis-ocropy-recognize \\\n  --input-file-grp OCR-D-SEG-LINE-DEW \\\n  --output-file-grp OCR-D-OCR-OCRO \\\n  --mets mets.xml\n  --parameter file:///path/to/config.json\n```\n\n### Tesserocr\nInstall essential system packages for Tesserocr\n```sh\nsudo apt-get install python3-tk \\\n  tesseract-ocr libtesseract-dev libleptonica-dev \\\n  libimage-exiftool-perl libxml2-utils\n```\n\nThen install Tesserocr from: https://github.com/OCR-D/ocrd_tesserocr\n```sh\npip install -r requirements.txt\npip install .\n```\n\nDownload and move tesseract models from:\nhttps://github.com/tesseract-ocr/tesseract/wiki/Data-Files\nor use your own models and\nplace them into: /usr/share/tesseract-ocr/4.00/tessdata\n\n## Workflow configuration\n\nA decent pipeline might look like this:\n\n1. page-level cropping\n2. page-level binarization\n3. page-level deskewing\n4. page-level dewarping\n5. region segmentation\n6. region-level clipping\n7. region-level deskewing\n8. line segmentation\n9. line-level clipping or resegmentation\n10. line-level dewarping\n11. line-level recognition\n12. line-level alignment\n\nIf GT is used, steps 1, 5 and 8 can be omitted. Else if a segmentation is used in 5 and 8 which does not produce overlapping sections, steps 6 and 9 can be omitted.\n\n## Testing\nTo run a few basic tests type `make test` (`ocrd_cis` has to be\ninstalled in order to run any tests).\n\n## OCR-D workspace\n\n* Create a new (empty) workspace: `ocrd workspace init workspace-dir`\n* cd into `workspace-dir`\n* Add new file to workspace: `ocrd workspace add file -G group -i id\n  -m mimetype`\n\n## OCR-D links\n\n- [OCR-D](https://ocr-d.github.io)\n- [Github](https://github.com/OCR-D)\n- [Project-page](http://www.ocr-d.de/)\n- [Ground-truth](http://www.ocr-d.de/sites/all/GTDaten/IndexGT.html)\n\n\n", "description_content_type"=>"text/markdown", "docs_url"=>nil, "download_url"=>"", "downloads"=>{"last_day"=>-1, "last_month"=>-1, "last_week"=>-1}, "home_page"=>"https://github.com/cisocrgroup/ocrd_cis", "keywords"=>"", "license"=>"MIT", "maintainer"=>"", "maintainer_email"=>"", "name"=>"ocrd-cis", "package_url"=>"https://pypi.org/project/ocrd-cis/", "platform"=>"", "project_url"=>"https://pypi.org/project/ocrd-cis/", "project_urls"=>{"Homepage"=>"https://github.com/cisocrgroup/ocrd_cis"}, "release_url"=>"https://pypi.org/project/ocrd-cis/0.0.7/", "requires_dist"=>["ocrd (>=2.0.0)", "click", "scipy", "numpy (>=1.17.0)", "pillow (>=6.2.0)", "matplotlib (>3.0.0)", "python-Levenshtein", "calamari-ocr (==0.3.5)"], "requires_python"=>"", "summary"=>"CIS OCR-D command line tools", "version"=>"0.0.7"}, "last_serial"=>6235442, "releases"=>{"0.0.6"=>[{"comment_text"=>"", "digests"=>{"md5"=>"a186d34dad8d16c13d12af2d0b6d889b", "sha256"=>"ac2ada13f48b301831e41cba1e9a86b8e10ac2e8f4036ecdda9eb3524e36461c"}, "downloads"=>-1, "filename"=>"ocrd_cis-0.0.6-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"a186d34dad8d16c13d12af2d0b6d889b", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>34044792, "upload_time"=>"2019-11-05T19:37:33", "upload_time_iso_8601"=>"2019-11-05T19:37:33.819139Z", "url"=>"https://files.pythonhosted.org/packages/f7/e0/5e3953c9243d05859e679bb83bef9c6f08e10fe0eef736fce90bc42657bc/ocrd_cis-0.0.6-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"5c8c3934a2a4fe764c112d8fd12a5ffc", "sha256"=>"97aea3f172a5eda7272113eb99d55fddda0a96069a20173ea17563d0532bbd55"}, "downloads"=>-1, "filename"=>"ocrd_cis-0.0.6.tar.gz", "has_sig"=>false, "md5_digest"=>"5c8c3934a2a4fe764c112d8fd12a5ffc", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>96645, "upload_time"=>"2019-11-05T19:37:38", "upload_time_iso_8601"=>"2019-11-05T19:37:38.406783Z", "url"=>"https://files.pythonhosted.org/packages/8a/a9/1fab502623c41529c13b4ecbedfe224f35843160ddcef4c527a18cfe73b8/ocrd_cis-0.0.6.tar.gz"}], "0.0.7"=>[{"comment_text"=>"", "digests"=>{"md5"=>"539c82850462be8013eb31938e7779cf", "sha256"=>"c3d5898c869ae8c88db28fd52907bcabf1ac0d5cd474f73a30a1ff06615c3dbe"}, "downloads"=>-1, "filename"=>"ocrd_cis-0.0.7-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"539c82850462be8013eb31938e7779cf", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>34044484, "upload_time"=>"2019-12-02T15:30:28", "upload_time_iso_8601"=>"2019-12-02T15:30:28.430896Z", "url"=>"https://files.pythonhosted.org/packages/38/c3/10637d7c51e3d6a0e5e5004476dcf2de093e1e3bec8452e241dcf1fa595c/ocrd_cis-0.0.7-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"7df03598c04d60203afb00c61ff836da", "sha256"=>"3629b49d32e1626830b6890f6d47793474fcb3232e4b12c43d5d3f38bb33f08d"}, "downloads"=>-1, "filename"=>"ocrd_cis-0.0.7.tar.gz", "has_sig"=>false, "md5_digest"=>"7df03598c04d60203afb00c61ff836da", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>96590, "upload_time"=>"2019-12-02T15:30:33", "upload_time_iso_8601"=>"2019-12-02T15:30:33.037095Z", "url"=>"https://files.pythonhosted.org/packages/b8/cb/3fdc4daee6b85b732913c012cf41cafaab708b367c3fd5883d0d8e99c1b1/ocrd_cis-0.0.7.tar.gz"}]}, "urls"=>[{"comment_text"=>"", "digests"=>{"md5"=>"539c82850462be8013eb31938e7779cf", "sha256"=>"c3d5898c869ae8c88db28fd52907bcabf1ac0d5cd474f73a30a1ff06615c3dbe"}, "downloads"=>-1, "filename"=>"ocrd_cis-0.0.7-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"539c82850462be8013eb31938e7779cf", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>34044484, "upload_time"=>"2019-12-02T15:30:28", "upload_time_iso_8601"=>"2019-12-02T15:30:28.430896Z", "url"=>"https://files.pythonhosted.org/packages/38/c3/10637d7c51e3d6a0e5e5004476dcf2de093e1e3bec8452e241dcf1fa595c/ocrd_cis-0.0.7-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"7df03598c04d60203afb00c61ff836da", "sha256"=>"3629b49d32e1626830b6890f6d47793474fcb3232e4b12c43d5d3f38bb33f08d"}, "downloads"=>-1, "filename"=>"ocrd_cis-0.0.7.tar.gz", "has_sig"=>false, "md5_digest"=>"7df03598c04d60203afb00c61ff836da", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>96590, "upload_time"=>"2019-12-02T15:30:33", "upload_time_iso_8601"=>"2019-12-02T15:30:33.037095Z", "url"=>"https://files.pythonhosted.org/packages/b8/cb/3fdc4daee6b85b732913c012cf41cafaab708b367c3fd5883d0d8e99c1b1/ocrd_cis-0.0.7.tar.gz"}]}, "url"=>"https://github.com/cisocrgroup/ocrd_cis"}, "url"=>"https://github.com/cisocrgroup/ocrd_cis"}

ocrd_anybaseocr

{"compliant_cli"=>false, "files"=>{"Dockerfile"=>"FROM ocrd/core\nMAINTAINER OCR-D\nENV DEBIAN_FRONTEND noninteractive\nENV PYTHONIOENCODING utf8\n\nWORKDIR /build-layouterkennung\nCOPY setup.py .\nCOPY requirements.txt .\nCOPY README.md .\nCOPY ocrd_anybaseocr ./ocrd_anybaseocr\nRUN pip3 install .\n", "README.md"=>"# Document Preprocessing and Segmentation\n\n[![CircleCI](https://circleci.com/gh/mjenckel/OCR-D-LAYoutERkennung.svg?style=svg)](https://circleci.com/gh/mjenckel/OCR-D-LAYoutERkennung)\n\n> Tools for preprocessing scanned images for OCR\n\n# Installing\n\nTo install anyBaseOCR dependencies system-wide:\n\n    $ sudo pip install .\n\nAlternatively, dependencies can be installed into a Virtual Environment:\n\n    $ virtualenv venv\n    $ source venv/bin/activate\n    $ pip install -e .\n\n#Tools\n\n## Binarizer\n\n### Method Behaviour \n This function takes a scanned colored /gray scale document image as input and do the black and white binarize image.\n \n #### Usage:\n```sh\nocrd-anybaseocr-binarize -m (path to METs input file) -I (Input group name) -O (Output group name) [-p (path to parameter file) -o (METs output filename)]\n```\n\n#### Example: \n```sh\nocrd-anybaseocr-binarize \\\n   -m mets.xml \\\n   -I OCR-D-IMG \\\n   -O OCR-D-PAGE-BIN\n```\n\n## Deskewer\n\n### Method Behaviour \n This function takes a document image as input and do the skew correction of that document.\n \n #### Usage:\n```sh\nocrd-anybaseocr-deskew -m (path to METs input file) -I (Input group name) -O (Output group name) [-p (path to parameter file) -o (METs output filename)]\n```\n\n#### Example: \n```sh\nocrd-anybaseocr-deskew \\\n  -m mets.xml \\\n  -I OCR-D-PAGE-BIN \\\n  -O OCR-D-PAGE-DESKEW\n```\n\n## Cropper\n\n### Method Behaviour \n This function takes a document image as input and crops/selects the page content area only (that's mean remove textual noise as well as any other noise around page content area)\n \n #### Usage:\n```sh\nocrd-anybaseocr-crop -m (path to METs input file) -I (Input group name) -O (Output group name) [-p (path to parameter file) -o (METs output filename)]\n```\n\n#### Example: \n```sh\nocrd-anybaseocr-crop \\\n   -m mets.xml \\\n   -I OCR-D-PAGE-DESKEW \\\n   -O OCR-D-PAGE-CROP\n```\n\n\n## Dewarper\n\n### Method Behaviour \n This function takes a document image as input and make the text line straight if its curved.\n \n #### Usage:\n```sh\nocrd-anybaseocr-dewarp -m (path to METs input file) -I (Input group name) -O (Output group name) [-p (path to parameter file) -o (METs output filename)]\n```\n\n\n#### Example: \n```sh\nCUDA_VISIBLE_DEVICES=0 ocrd-anybaseocr-dewarp \\\n   -m mets.xml \\\n   -I OCR-D-PAGE-CROP \\\n   -O OCR-D-PAGE-DEWARP\n```\n\n## Text/Non-Text Segmenter\n\n### Method Behaviour \n This function takes a document image as an input and separates the text and non-text part from the input document image.\n \n #### Usage:\n```sh\nocrd-anybaseocr-tiseg -m (path to METs input file) -I (Input group name) -O (Output group name) [-p (path to parameter file) -o (METs output filename)]\n```\n\n#### Example: \n```sh\nocrd-anybaseocr-tiseg \\\n   -m mets.xml \\\n   -I OCR-D-PAGE-CROP \\\n   -O OCR-D-PAGE-TISEG\n```\n\n## Textline Segmenter\n\n### Method Behaviour \n This function takes a cropped document image as an input and segment the image into textline images.\n \n #### Usage:\n```sh\nocrd-anybaseocr-textline -m (path to METs input file) -I (Input group name) -O (Output group name) [-p (path to parameter file) -o (METs output filename)]\n```\n\n#### Example: \n```sh\nocrd-anybaseocr-textline \\\n   -m mets.xml \\\n   -I OCR-D-PAGE-TISEG \\\n   -O OCR-D-PAGE-TL\n```\n\n## Block Segmenter\n\n### Method Behaviour \n This function takes raw document image as an input and segments the image into the different text blocks.\n \n #### Usage:\n```sh\nocrd-anybaseocr-block-segmenter -m (path to METs input file) -I (Input group name) -O (Output group name) [-p (path to parameter file) -o (METs output filename)]\n```\n\n#### Example: \n```sh\nocrd-anybaseocr-block-segmenter \\\n   -m mets.xml \\\n   -I OCR-IMG \\\n   -O OCR-D-PAGE-BLOCK\n```\n\n## Document Analyser\n\n### Method Behaviour \n This function takes all the cropped document images of a single book and its corresponding text regions as input and generates the logical structure on the book level.\n \n #### Usage:\n```sh\nocrd-anybaseocr-layout-analysis -m (path to METs input file) -I (Input group name) -O (Output group name) [-p (path to parameter file) -o (METs output filename)]\n```\n\n#### Example: \n```sh\nocrd-anybaseocr-layout-analysis \\\n   -m mets.xml \\\n   -I OCR-IMG \\\n   -O OCR-D-PAGE-BLOCK\n```\n\n\n## Testing\n\nTo test the tools, download [OCR-D/assets](https://github.com/OCR-D/assets). In\nparticular, the code is tested with the\n[dfki-testdata](https://github.com/OCR-D/assets/tree/master/data/dfki-testdata)\ndataset.\n\nRun `make test` to run all tests.\n\n## License\n\n\n```\n Licensed under the Apache License, Version 2.0 (the \"License\");\n you may not use this file except in compliance with the License.\n You may obtain a copy of the License at\n\n     http://www.apache.org/licenses/LICENSE-2.0\n\n Unless required by applicable law or agreed to in writing, software\n distributed under the License is distributed on an \"AS IS\" BASIS,\n WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n See the License for the specific language governing permissions and\n limitations under the License.\n ```\n", "ocrd-tool.json"=>"{\n  \"git_url\": \"https://github.com/mjenckel/LAYoutERkennung/\",\n  \"version\": \"0.0.1\",\n  \"tools\": {\n    \"ocrd-anybaseocr-binarize\": {\n      \"executable\": \"ocrd-anybaseocr-binarize\",\n      \"description\": \"Binarize images with the algorithm from ocropy\",\n      \"categories\": [\"Image preprocessing\"],\n      \"steps\": [\"preprocessing/optimization/binarization\"],\n      \"input_file_grp\": [\"OCR-D-IMG\"],\n      \"output_file_grp\": [\"OCR-D-IMG-BIN\"],\n      \"parameters\": {\n        \"nocheck\":         {\"type\": \"boolean\",                     \"default\": false, \"description\": \"disable error checking on inputs\"},\n        \"show\":            {\"type\": \"boolean\",                     \"default\": false, \"description\": \"display final results\"},\n        \"raw_copy\":        {\"type\": \"boolean\",                     \"default\": false, \"description\": \"also copy the raw image\"},\n        \"gray\":            {\"type\": \"boolean\",                     \"default\": false, \"description\": \"force grayscale processing even if image seems binary\"},\n        \"bignore\":         {\"type\": \"number\", \"format\": \"float\",   \"default\": 0.1,   \"description\": \"ignore this much of the border for threshold estimation\"},\n        \"debug\":           {\"type\": \"number\", \"format\": \"integer\", \"default\": 0,     \"description\": \"display intermediate results\"},\n        \"escale\":          {\"type\": \"number\", \"format\": \"float\",   \"default\": 1.0,   \"description\": \"scale for estimating a mask over the text region\"},\n        \"hi\":              {\"type\": \"number\", \"format\": \"float\",   \"default\": 90,    \"description\": \"percentile for white estimation\"},\n        \"lo\":              {\"type\": \"number\", \"format\": \"float\",   \"default\": 5,     \"description\": \"percentile for black estimation\"},\n        \"perc\":            {\"type\": \"number\", \"format\": \"float\",   \"default\": 80,    \"description\": \"percentage for filters\"},\n        \"range\":           {\"type\": \"number\", \"format\": \"integer\", \"default\": 20,    \"description\": \"range for filters\"},\n        \"threshold\":       {\"type\": \"number\", \"format\": \"float\",   \"default\": 0.5,   \"description\": \"threshold, determines lightness\"},\n        \"zoom\":            {\"type\": \"number\", \"format\": \"float\",   \"default\": 0.5,   \"description\": \"zoom for page background estimation, smaller=faster\"},\n        \"operation_level\": {\"type\": \"string\", \"enum\": [\"page\",\"region\", \"line\"], \"default\": \"page\",\"description\": \"PAGE XML hierarchy level to operate on\"}\n      }\n    },\n    \"ocrd-anybaseocr-deskew\": {\n      \"executable\": \"ocrd-anybaseocr-deskew\",\n      \"description\": \"Deskew images with the algorithm from ocropy\",\n      \"categories\": [\"Image preprocessing\"],\n      \"steps\": [\"preprocessing/optimization/deskewing\"],\n      \"input_file_grp\": [\"OCR-D-IMG-BIN\"],\n      \"output_file_grp\": [\"OCR-D-IMG-DESKEW\"],\n      \"parameters\": {\n        \"escale\":    {\"type\": \"number\", \"format\": \"float\",   \"default\": 1.0, \"description\": \"scale for estimating a mask over the text region\"},\n        \"bignore\":   {\"type\": \"number\", \"format\": \"float\",   \"default\": 0.1, \"description\": \"ignore this much of the border for threshold estimation\"},\n        \"threshold\": {\"type\": \"number\", \"format\": \"float\",   \"default\": 0.5, \"description\": \"threshold, determines lightness\"},\n        \"maxskew\":   {\"type\": \"number\", \"format\": \"float\",   \"default\": 1.0, \"description\": \"skew angle estimation parameters (degrees)\"},\n        \"skewsteps\": {\"type\": \"number\", \"format\": \"integer\", \"default\": 8,   \"description\": \"steps for skew angle estimation (per degree)\"},\n        \"debug\":     {\"type\": \"number\", \"format\": \"integer\", \"default\": 0,   \"description\": \"display intermediate results\"},\n        \"parallel\":  {\"type\": \"number\", \"format\": \"integer\", \"default\": 0,   \"description\": \"???\"},\n        \"lo\":        {\"type\": \"number\", \"format\": \"integer\", \"default\": 5,   \"description\": \"percentile for black estimation\"},\n        \"hi\":        {\"type\": \"number\", \"format\": \"integer\", \"default\": 90,   \"description\": \"percentile for white estimation\"},\n        \"operation_level\": {\"type\": \"string\", \"enum\": [\"page\",\"region\", \"line\"], \"default\": \"page\",\"description\": \"PAGE XML hierarchy level to operate on\"}\n      }\n    },\n    \"ocrd-anybaseocr-crop\": {\n      \"executable\": \"ocrd-anybaseocr-crop\",\n      \"description\": \"Image crop using non-linear processing\",\n      \"categories\": [\"Image preprocessing\"],\n      \"steps\": [\"preprocessing/optimization/cropping\"],\n      \"input_file_grp\": [\"OCR-D-IMG-DESKEW\"],\n      \"output_file_grp\": [\"OCR-D-IMG-CROP\"],\n      \"parameters\": {\n        \"colSeparator\":  {\"type\": \"number\", \"format\": \"float\", \"default\": 0.04, \"description\": \"consider space between column. 25% of width\"},\n        \"maxRularArea\":  {\"type\": \"number\", \"format\": \"float\", \"default\": 0.3, \"description\": \"Consider maximum rular area\"},\n        \"minArea\":       {\"type\": \"number\", \"format\": \"float\", \"default\": 0.05, \"description\": \"rular position in below\"},\n        \"minRularArea\":  {\"type\": \"number\", \"format\": \"float\", \"default\": 0.01, \"description\": \"Consider minimum rular area\"},\n        \"positionBelow\": {\"type\": \"number\", \"format\": \"float\", \"default\": 0.75, \"description\": \"rular position in below\"},\n        \"positionLeft\":  {\"type\": \"number\", \"format\": \"float\", \"default\": 0.4, \"description\": \"rular position in left\"},\n        \"positionRight\": {\"type\": \"number\", \"format\": \"float\", \"default\": 0.6, \"description\": \"rular position in right\"},\n        \"rularRatioMax\": {\"type\": \"number\", \"format\": \"float\", \"default\": 10.0, \"description\": \"rular position in below\"},\n        \"rularRatioMin\": {\"type\": \"number\", \"format\": \"float\", \"default\": 3.0, \"description\": \"rular position in below\"},\n        \"rularWidth\":    {\"type\": \"number\", \"format\": \"float\", \"default\": 0.95, \"description\": \"maximum rular width\"},\n        \"operation_level\": {\"type\": \"string\", \"enum\": [\"page\",\"region\", \"line\"], \"default\": \"page\",\"description\": \"PAGE XML hierarchy level to operate on\"}\n      }\n    },\n    \"ocrd-anybaseocr-dewarp\": {\n      \"executable\": \"ocrd-anybaseocr-dewarp\",\n      \"description\": \"dewarp image with anyBaseOCR\",\n      \"categories\": [\"Image preprocessing\"],\n      \"steps\": [\"preprocessing/optimization/dewarping\"],\n      \"input_file_grp\": [\"OCR-D-IMG-CROP\"],\n      \"output_file_grp\": [\"OCR-D-IMG-DEWARP\"],\n      \"parameters\": {\n        \"imgresize\":    { \"type\": \"string\",                      \"default\": \"resize_and_crop\", \"description\": \"run on original size image\"},\n        \"pix2pixHD\":    { \"type\": \"string\", \"default\":\"/home/ahmed/project/pix2pixHD\", \"description\": \"Path to pix2pixHD library\"},\n        \"model_name\":\t{ \"type\": \"string\", \"default\":\"models\", \"description\": \"name of dir with trained pix2pixHD model (latest_net_G.pth)\"},\n        \"checkpoint_dir\":   { \"type\": \"string\", \"default\":\"./\", \"description\": \"Path to where to look for dir with model name\"},\n        \"gpu_id\":       { \"type\": \"number\", \"format\": \"integer\", \"default\": 0,    \"description\": \"gpu id\"},\n        \"resizeHeight\": { \"type\": \"number\", \"format\": \"integer\", \"default\": 1024, \"description\": \"resized image height\"},\n        \"resizeWidth\":  { \"type\": \"number\", \"format\": \"integer\", \"default\": 1024, \"description\": \"resized image width\"},\n        \"operation_level\": {\"type\": \"string\", \"enum\": [\"page\",\"region\", \"line\"], \"default\": \"page\",\"description\": \"PAGE XML hierarchy level to operate on\"}\n      }\n    },\n    \"ocrd-anybaseocr-tiseg\": {\n      \"executable\": \"ocrd-anybaseocr-tiseg\",\n      \"input_file_grp\": [\"OCR-D-IMG-CROP\"],\n      \"output_file_grp\": [\"OCR-D-SEG-TISEG\"],\n      \"categories\": [\"Layout analysis\"],\n      \"steps\": [\"layout/segmentation/text-image\"],\n      \"description\": \"separate text and non-text part with anyBaseOCR\",\n      \"parameters\": {\n        \"operation_level\": {\"type\": \"string\", \"enum\": [\"page\",\"region\", \"line\"], \"default\": \"page\",\"description\": \"PAGE XML hierarchy level to operate on\"}\n      }\n    },\n    \"ocrd-anybaseocr-textline\": {\n      \"executable\": \"ocrd-anybaseocr-textline\",\n      \"input_file_grp\": [\"OCR-D-SEG-TISEG\"],\n      \"output_file_grp\": [\"OCR-D-SEG-LINE-ANY\"],\n      \"categories\": [\"Layout analysis\"],\n      \"steps\": [\"layout/segmentation/line\"],\n      \"description\": \"separate each text line\",\n      \"parameters\": {\n        \"minscale\":    {\"type\": \"number\", \"format\": \"float\", \"default\": 12.0, \"description\": \"minimum scale permitted\"},\n        \"maxlines\":    {\"type\": \"number\", \"format\": \"float\", \"default\": 300, \"description\": \"non-standard scaling of horizontal parameters\"},\n        \"scale\":       {\"type\": \"number\", \"format\": \"float\", \"default\": 0.0, \"description\": \"the basic scale of the document (roughly, xheight) 0=automatic\"},\n        \"hscale\":      {\"type\": \"number\", \"format\": \"float\", \"default\": 1.0, \"description\": \"non-standard scaling of horizontal parameters\"},\n        \"vscale\":      {\"type\": \"number\", \"format\": \"float\", \"default\": 1.7, \"description\": \"non-standard scaling of vertical parameters\"},\n        \"threshold\":   {\"type\": \"number\", \"format\": \"float\", \"default\": 0.2, \"description\": \"baseline threshold\"},\n        \"noise\":       {\"type\": \"number\", \"format\": \"integer\", \"default\": 8, \"description\": \"noise threshold for removing small components from lines\"},\n        \"usegauss\":    {\"type\": \"boolean\", \"default\": false, \"description\": \"use gaussian instead of uniform\"},\n        \"maxseps\":     {\"type\": \"number\", \"format\": \"integer\", \"default\": 2, \"description\": \"maximum black column separators\"},\n        \"sepwiden\":    {\"type\": \"number\", \"format\": \"integer\", \"default\": 10, \"description\": \"widen black separators (to account for warping)\"},\n        \"blackseps\":   {\"type\": \"boolean\", \"default\": false, \"description\": \"also check for black column separators\"},\n        \"maxcolseps\":  {\"type\": \"number\", \"format\": \"integer\", \"default\": 2, \"description\": \"maximum # whitespace column separators\"},\n        \"csminaspect\": {\"type\": \"number\", \"format\": \"float\", \"default\": 1.1, \"description\": \"minimum aspect ratio for column separators\"},\n        \"csminheight\": {\"type\": \"number\", \"format\": \"float\", \"default\": 6.5, \"description\": \"minimum column height (units=scale)\"},\n        \"pad\":         {\"type\": \"number\", \"format\": \"integer\", \"default\": 3, \"description\": \"padding for extracted lines\"},\n        \"expand\":      {\"type\": \"number\", \"format\": \"integer\", \"default\": 3, \"description\": \"expand mask for grayscale extraction\"},\n        \"parallel\":    {\"type\": \"number\", \"format\": \"integer\", \"default\": 0, \"description\": \"number of CPUs to use\"},\n        \"libpath\":     {\"type\": \"string\", \"default\": \".\", \"description\": \"Library Path for C Executables\"},\n        \"operation_level\": {\"type\": \"string\", \"enum\": [\"page\",\"region\", \"line\"], \"default\": \"page\",\"description\": \"PAGE XML hierarchy level to operate on\"}\n      }\n    },\n    \"ocrd-anybaseocr-layout-analysis\": {\n      \"executable\": \"ocrd-anybaseocr-layout-analysis\",\n      \"input_file_grp\": [\"OCR-D-IMG-CROP\"],\n      \"output_file_grp\": [\"OCR-D-SEG-LAYOUT\"],\n      \"categories\": [\"Layout analysis\"],\n      \"steps\": [\"layout/segmentation/text-image\"],\n      \"description\": \"Analysis of the input document\",\n      \"parameters\": {\n        \"batch_size\":         {\"type\": \"number\", \"format\": \"integer\", \"default\": 4, \"description\": \"Batch size for generating test images\"},\n        \"model_path\":         { \"type\": \"string\", \"default\":\"models/structure_analysis.h5\", \"required\": false, \"description\": \"Path to Layout Structure Classification Model\"},\n        \"class_mapping_path\": { \"type\": \"string\", \"default\":\"models/mapping_DenseNet.pickle\",\"required\": false, \"description\": \"Path to Layout Structure Classes\"}\n      }\n    },\n    \"ocrd-anybaseocr-block-segmentation\": {\n      \"executable\": \"ocrd-anybaseocr-block-segmentation\",\n      \"input_file_grp\": [\"OCR-D-IMG\"],\n      \"output_file_grp\": [\"OCR-D-BLOCK-SEGMENT\"],\n      \"categories\": [\"Layout analysis\"],\n      \"steps\": [\"layout/segmentation/text-image\"],\n      \"description\": \"Analysis of the input document\",\n      \"parameters\": {        \n        \"block_segmentation_model\":   { \"type\": \"string\",\"default\":\"mrcnn/\", \"required\": false, \"description\": \"Path to block segmentation Model\"},\n        \"block_segmentation_weights\": { \"type\": \"string\",\"default\":\"mrcnn/block_segmentation_weights.h5\",  \"required\": false, \"description\": \"Path to model weights\"},\n        \"operation_level\": {\"type\": \"string\", \"enum\": [\"page\",\"region\", \"line\"], \"default\": \"page\",\"description\": \"PAGE XML hierarchy level to operate on\"}\n      }       \n    }\n  }\n}\n", "setup.py"=>"# -*- coding: utf-8 -*-\nfrom setuptools import setup, find_packages\n\nsetup(\n    name='ocrd-anybaseocr',\n    version='0.0.1',\n    author=\"DFKI\",\n    author_email=\"Saqib.Bukhari@dfki.de, Mohammad_mohsin.reza@dfki.de\",\n    url=\"https://github.com/mjenckel/LAYoutERkennung\",\n    license='Apache License 2.0',\n    long_description=open('README.md').read(),\n    long_description_content_type='text/markdown',\n    install_requires=open('requirements.txt').read().split('\\n'),\n    packages=find_packages(exclude=[\"work_dir\", \"src\"]),\n    package_data={\n        '': ['*.json']\n    },\n    entry_points={\n        'console_scripts': [\n            'ocrd-anybaseocr-binarize           = ocrd_anybaseocr.cli.cli:ocrd_anybaseocr_binarize',\n            'ocrd-anybaseocr-deskew             = ocrd_anybaseocr.cli.cli:ocrd_anybaseocr_deskew',\n            'ocrd-anybaseocr-crop               = ocrd_anybaseocr.cli.cli:ocrd_anybaseocr_cropping',        \n            'ocrd-anybaseocr-dewarp             = ocrd_anybaseocr.cli.cli:ocrd_anybaseocr_dewarp',\n            'ocrd-anybaseocr-tiseg              = ocrd_anybaseocr.cli.cli:ocrd_anybaseocr_tiseg',\n            'ocrd-anybaseocr-textline           = ocrd_anybaseocr.cli.cli:ocrd_anybaseocr_textline',\n            'ocrd-anybaseocr-layout-analysis    = ocrd_anybaseocr.cli.cli:ocrd_anybaseocr_layout_analysis',\n            'ocrd-anybaseocr-block-segmentation = ocrd_anybaseocr.cli.cli:ocrd_anybaseocr_block_segmentation'\n        ]\n    },\n)\n"}, "git"=>{"last_commit"=>"Tue Dec 17 13:28:07 2019 +0100", "latest_tag"=>"", "number_of_commits"=>"111", "url"=>"https://github.com/OCR-D/ocrd_anybaseocr.git"}, "name"=>"ocrd_anybaseocr", "ocrd_tool"=>{"git_url"=>"https://github.com/mjenckel/LAYoutERkennung/", "tools"=>{"ocrd-anybaseocr-binarize"=>{"categories"=>["Image preprocessing"], "description"=>"Binarize images with the algorithm from ocropy", "executable"=>"ocrd-anybaseocr-binarize", "input_file_grp"=>["OCR-D-IMG"], "output_file_grp"=>["OCR-D-IMG-BIN"], "parameters"=>{"bignore"=>{"default"=>0.1, "description"=>"ignore this much of the border for threshold estimation", "format"=>"float", "type"=>"number"}, "debug"=>{"default"=>0, "description"=>"display intermediate results", "format"=>"integer", "type"=>"number"}, "escale"=>{"default"=>1.0, "description"=>"scale for estimating a mask over the text region", "format"=>"float", "type"=>"number"}, "gray"=>{"default"=>false, "description"=>"force grayscale processing even if image seems binary", "type"=>"boolean"}, "hi"=>{"default"=>90, "description"=>"percentile for white estimation", "format"=>"float", "type"=>"number"}, "lo"=>{"default"=>5, "description"=>"percentile for black estimation", "format"=>"float", "type"=>"number"}, "nocheck"=>{"default"=>false, "description"=>"disable error checking on inputs", "type"=>"boolean"}, "operation_level"=>{"default"=>"page", "description"=>"PAGE XML hierarchy level to operate on", "enum"=>["page", "region", "line"], "type"=>"string"}, "perc"=>{"default"=>80, "description"=>"percentage for filters", "format"=>"float", "type"=>"number"}, "range"=>{"default"=>20, "description"=>"range for filters", "format"=>"integer", "type"=>"number"}, "raw_copy"=>{"default"=>false, "description"=>"also copy the raw image", "type"=>"boolean"}, "show"=>{"default"=>false, "description"=>"display final results", "type"=>"boolean"}, "threshold"=>{"default"=>0.5, "description"=>"threshold, determines lightness", "format"=>"float", "type"=>"number"}, "zoom"=>{"default"=>0.5, "description"=>"zoom for page background estimation, smaller=faster", "format"=>"float", "type"=>"number"}}, "steps"=>["preprocessing/optimization/binarization"]}, "ocrd-anybaseocr-block-segmentation"=>{"categories"=>["Layout analysis"], "description"=>"Analysis of the input document", "executable"=>"ocrd-anybaseocr-block-segmentation", "input_file_grp"=>["OCR-D-IMG"], "output_file_grp"=>["OCR-D-BLOCK-SEGMENT"], "parameters"=>{"block_segmentation_model"=>{"default"=>"mrcnn/", "description"=>"Path to block segmentation Model", "required"=>false, "type"=>"string"}, "block_segmentation_weights"=>{"default"=>"mrcnn/block_segmentation_weights.h5", "description"=>"Path to model weights", "required"=>false, "type"=>"string"}, "operation_level"=>{"default"=>"page", "description"=>"PAGE XML hierarchy level to operate on", "enum"=>["page", "region", "line"], "type"=>"string"}}, "steps"=>["layout/segmentation/text-image"]}, "ocrd-anybaseocr-crop"=>{"categories"=>["Image preprocessing"], "description"=>"Image crop using non-linear processing", "executable"=>"ocrd-anybaseocr-crop", "input_file_grp"=>["OCR-D-IMG-DESKEW"], "output_file_grp"=>["OCR-D-IMG-CROP"], "parameters"=>{"colSeparator"=>{"default"=>0.04, "description"=>"consider space between column. 25% of width", "format"=>"float", "type"=>"number"}, "maxRularArea"=>{"default"=>0.3, "description"=>"Consider maximum rular area", "format"=>"float", "type"=>"number"}, "minArea"=>{"default"=>0.05, "description"=>"rular position in below", "format"=>"float", "type"=>"number"}, "minRularArea"=>{"default"=>0.01, "description"=>"Consider minimum rular area", "format"=>"float", "type"=>"number"}, "operation_level"=>{"default"=>"page", "description"=>"PAGE XML hierarchy level to operate on", "enum"=>["page", "region", "line"], "type"=>"string"}, "positionBelow"=>{"default"=>0.75, "description"=>"rular position in below", "format"=>"float", "type"=>"number"}, "positionLeft"=>{"default"=>0.4, "description"=>"rular position in left", "format"=>"float", "type"=>"number"}, "positionRight"=>{"default"=>0.6, "description"=>"rular position in right", "format"=>"float", "type"=>"number"}, "rularRatioMax"=>{"default"=>10.0, "description"=>"rular position in below", "format"=>"float", "type"=>"number"}, "rularRatioMin"=>{"default"=>3.0, "description"=>"rular position in below", "format"=>"float", "type"=>"number"}, "rularWidth"=>{"default"=>0.95, "description"=>"maximum rular width", "format"=>"float", "type"=>"number"}}, "steps"=>["preprocessing/optimization/cropping"]}, "ocrd-anybaseocr-deskew"=>{"categories"=>["Image preprocessing"], "description"=>"Deskew images with the algorithm from ocropy", "executable"=>"ocrd-anybaseocr-deskew", "input_file_grp"=>["OCR-D-IMG-BIN"], "output_file_grp"=>["OCR-D-IMG-DESKEW"], "parameters"=>{"bignore"=>{"default"=>0.1, "description"=>"ignore this much of the border for threshold estimation", "format"=>"float", "type"=>"number"}, "debug"=>{"default"=>0, "description"=>"display intermediate results", "format"=>"integer", "type"=>"number"}, "escale"=>{"default"=>1.0, "description"=>"scale for estimating a mask over the text region", "format"=>"float", "type"=>"number"}, "hi"=>{"default"=>90, "description"=>"percentile for white estimation", "format"=>"integer", "type"=>"number"}, "lo"=>{"default"=>5, "description"=>"percentile for black estimation", "format"=>"integer", "type"=>"number"}, "maxskew"=>{"default"=>1.0, "description"=>"skew angle estimation parameters (degrees)", "format"=>"float", "type"=>"number"}, "operation_level"=>{"default"=>"page", "description"=>"PAGE XML hierarchy level to operate on", "enum"=>["page", "region", "line"], "type"=>"string"}, "parallel"=>{"default"=>0, "description"=>"???", "format"=>"integer", "type"=>"number"}, "skewsteps"=>{"default"=>8, "description"=>"steps for skew angle estimation (per degree)", "format"=>"integer", "type"=>"number"}, "threshold"=>{"default"=>0.5, "description"=>"threshold, determines lightness", "format"=>"float", "type"=>"number"}}, "steps"=>["preprocessing/optimization/deskewing"]}, "ocrd-anybaseocr-dewarp"=>{"categories"=>["Image preprocessing"], "description"=>"dewarp image with anyBaseOCR", "executable"=>"ocrd-anybaseocr-dewarp", "input_file_grp"=>["OCR-D-IMG-CROP"], "output_file_grp"=>["OCR-D-IMG-DEWARP"], "parameters"=>{"checkpoint_dir"=>{"default"=>"./", "description"=>"Path to where to look for dir with model name", "type"=>"string"}, "gpu_id"=>{"default"=>0, "description"=>"gpu id", "format"=>"integer", "type"=>"number"}, "imgresize"=>{"default"=>"resize_and_crop", "description"=>"run on original size image", "type"=>"string"}, "model_name"=>{"default"=>"models", "description"=>"name of dir with trained pix2pixHD model (latest_net_G.pth)", "type"=>"string"}, "operation_level"=>{"default"=>"page", "description"=>"PAGE XML hierarchy level to operate on", "enum"=>["page", "region", "line"], "type"=>"string"}, "pix2pixHD"=>{"default"=>"/home/ahmed/project/pix2pixHD", "description"=>"Path to pix2pixHD library", "type"=>"string"}, "resizeHeight"=>{"default"=>1024, "description"=>"resized image height", "format"=>"integer", "type"=>"number"}, "resizeWidth"=>{"default"=>1024, "description"=>"resized image width", "format"=>"integer", "type"=>"number"}}, "steps"=>["preprocessing/optimization/dewarping"]}, "ocrd-anybaseocr-layout-analysis"=>{"categories"=>["Layout analysis"], "description"=>"Analysis of the input document", "executable"=>"ocrd-anybaseocr-layout-analysis", "input_file_grp"=>["OCR-D-IMG-CROP"], "output_file_grp"=>["OCR-D-SEG-LAYOUT"], "parameters"=>{"batch_size"=>{"default"=>4, "description"=>"Batch size for generating test images", "format"=>"integer", "type"=>"number"}, "class_mapping_path"=>{"default"=>"models/mapping_DenseNet.pickle", "description"=>"Path to Layout Structure Classes", "required"=>false, "type"=>"string"}, "model_path"=>{"default"=>"models/structure_analysis.h5", "description"=>"Path to Layout Structure Classification Model", "required"=>false, "type"=>"string"}}, "steps"=>["layout/segmentation/text-image"]}, "ocrd-anybaseocr-textline"=>{"categories"=>["Layout analysis"], "description"=>"separate each text line", "executable"=>"ocrd-anybaseocr-textline", "input_file_grp"=>["OCR-D-SEG-TISEG"], "output_file_grp"=>["OCR-D-SEG-LINE-ANY"], "parameters"=>{"blackseps"=>{"default"=>false, "description"=>"also check for black column separators", "type"=>"boolean"}, "csminaspect"=>{"default"=>1.1, "description"=>"minimum aspect ratio for column separators", "format"=>"float", "type"=>"number"}, "csminheight"=>{"default"=>6.5, "description"=>"minimum column height (units=scale)", "format"=>"float", "type"=>"number"}, "expand"=>{"default"=>3, "description"=>"expand mask for grayscale extraction", "format"=>"integer", "type"=>"number"}, "hscale"=>{"default"=>1.0, "description"=>"non-standard scaling of horizontal parameters", "format"=>"float", "type"=>"number"}, "libpath"=>{"default"=>".", "description"=>"Library Path for C Executables", "type"=>"string"}, "maxcolseps"=>{"default"=>2, "description"=>"maximum # whitespace column separators", "format"=>"integer", "type"=>"number"}, "maxlines"=>{"default"=>300, "description"=>"non-standard scaling of horizontal parameters", "format"=>"float", "type"=>"number"}, "maxseps"=>{"default"=>2, "description"=>"maximum black column separators", "format"=>"integer", "type"=>"number"}, "minscale"=>{"default"=>12.0, "description"=>"minimum scale permitted", "format"=>"float", "type"=>"number"}, "noise"=>{"default"=>8, "description"=>"noise threshold for removing small components from lines", "format"=>"integer", "type"=>"number"}, "operation_level"=>{"default"=>"page", "description"=>"PAGE XML hierarchy level to operate on", "enum"=>["page", "region", "line"], "type"=>"string"}, "pad"=>{"default"=>3, "description"=>"padding for extracted lines", "format"=>"integer", "type"=>"number"}, "parallel"=>{"default"=>0, "description"=>"number of CPUs to use", "format"=>"integer", "type"=>"number"}, "scale"=>{"default"=>0.0, "description"=>"the basic scale of the document (roughly, xheight) 0=automatic", "format"=>"float", "type"=>"number"}, "sepwiden"=>{"default"=>10, "description"=>"widen black separators (to account for warping)", "format"=>"integer", "type"=>"number"}, "threshold"=>{"default"=>0.2, "description"=>"baseline threshold", "format"=>"float", "type"=>"number"}, "usegauss"=>{"default"=>false, "description"=>"use gaussian instead of uniform", "type"=>"boolean"}, "vscale"=>{"default"=>1.7, "description"=>"non-standard scaling of vertical parameters", "format"=>"float", "type"=>"number"}}, "steps"=>["layout/segmentation/line"]}, "ocrd-anybaseocr-tiseg"=>{"categories"=>["Layout analysis"], "description"=>"separate text and non-text part with anyBaseOCR", "executable"=>"ocrd-anybaseocr-tiseg", "input_file_grp"=>["OCR-D-IMG-CROP"], "output_file_grp"=>["OCR-D-SEG-TISEG"], "parameters"=>{"operation_level"=>{"default"=>"page", "description"=>"PAGE XML hierarchy level to operate on", "enum"=>["page", "region", "line"], "type"=>"string"}}, "steps"=>["layout/segmentation/text-image"]}}, "version"=>"0.0.1"}, "ocrd_tool_validate"=>"<report valid=\"false\">\n  [tools.ocrd-anybaseocr-tiseg.steps.0] 'layout/segmentation/text-image' is not one of ['preprocessing/characterization', 'preprocessing/optimization', 'preprocessing/optimization/cropping', 'preprocessing/optimization/deskewing', 'preprocessing/optimization/despeckling', 'preprocessing/optimization/dewarping', 'preprocessing/optimization/binarization', 'preprocessing/optimization/grayscale_normalization', 'recognition/text-recognition', 'recognition/font-identification', 'recognition/post-correction', 'layout/segmentation', 'layout/segmentation/text-nontext', 'layout/segmentation/region', 'layout/segmentation/line', 'layout/segmentation/word', 'layout/segmentation/classification', 'layout/analysis']\n  [tools.ocrd-anybaseocr-layout-analysis.steps.0] 'layout/segmentation/text-image' is not one of ['preprocessing/characterization', 'preprocessing/optimization', 'preprocessing/optimization/cropping', 'preprocessing/optimization/deskewing', 'preprocessing/optimization/despeckling', 'preprocessing/optimization/dewarping', 'preprocessing/optimization/binarization', 'preprocessing/optimization/grayscale_normalization', 'recognition/text-recognition', 'recognition/font-identification', 'recognition/post-correction', 'layout/segmentation', 'layout/segmentation/text-nontext', 'layout/segmentation/region', 'layout/segmentation/line', 'layout/segmentation/word', 'layout/segmentation/classification', 'layout/analysis']\n  [tools.ocrd-anybaseocr-block-segmentation.steps.0] 'layout/segmentation/text-image' is not one of ['preprocessing/characterization', 'preprocessing/optimization', 'preprocessing/optimization/cropping', 'preprocessing/optimization/deskewing', 'preprocessing/optimization/despeckling', 'preprocessing/optimization/dewarping', 'preprocessing/optimization/binarization', 'preprocessing/optimization/grayscale_normalization', 'recognition/text-recognition', 'recognition/font-identification', 'recognition/post-correction', 'layout/segmentation', 'layout/segmentation/text-nontext', 'layout/segmentation/region', 'layout/segmentation/line', 'layout/segmentation/word', 'layout/segmentation/classification', 'layout/analysis']\n</report>", "official"=>true, "org_plus_name"=>"OCR-D/ocrd_anybaseocr", "python"=>{"author"=>"DFKI", "author-email"=>"Saqib.Bukhari@dfki.de, Mohammad_mohsin.reza@dfki.de", "name"=>"ocrd-anybaseocr", "pypi"=>{"info"=>{"author"=>"DFKI", "author_email"=>"Saqib.Bukhari@dfki.de, Mohammad_mohsin.reza@dfki.de", "bugtrack_url"=>nil, "classifiers"=>[], "description"=>"# Document Preprocessing and Segmentation\n\n[![CircleCI](https://circleci.com/gh/mjenckel/OCR-D-LAYoutERkennung.svg?style=svg)](https://circleci.com/gh/mjenckel/OCR-D-LAYoutERkennung)\n\n> Tools for preprocessing scanned images for OCR\n\n# Installing\n\nTo install anyBaseOCR dependencies system-wide:\n\n    $ sudo pip install .\n\nAlternatively, dependencies can be installed into a Virtual Environment:\n\n    $ virtualenv venv\n    $ source venv/bin/activate\n    $ pip install -e .\n\n#Tools\n\n## Binarizer\n\n### Method Behaviour \n This function takes a scanned colored /gray scale document image as input and do the black and white binarize image.\n\n #### Usage:\n```sh\nocrd-anybaseocr-binarize -m (path to METs input file) -I (Input group name) -O (Output group name) [-p (path to parameter file) -o (METs output filename)]\n```\n\n#### Example: \n```sh\nocrd-anybaseocr-binarize \\\n   -m mets.xml \\\n   -I OCR-D-IMG \\\n   -O OCR-D-PAGE-BIN\n```\n\n## Deskewer\n\n### Method Behaviour \n This function takes a document image as input and do the skew correction of that document.\n\n #### Usage:\n```sh\nocrd-anybaseocr-deskew -m (path to METs input file) -I (Input group name) -O (Output group name) [-p (path to parameter file) -o (METs output filename)]\n```\n\n#### Example: \n```sh\nocrd-anybaseocr-deskew \\\n  -m mets.xml \\\n  -I OCR-D-PAGE-BIN \\\n  -O OCR-D-PAGE-DESKEW\n```\n\n## Cropper\n\n### Method Behaviour \n This function takes a document image as input and crops/selects the page content area only (that's mean remove textual noise as well as any other noise around page content area)\n\n #### Usage:\n```sh\nocrd-anybaseocr-crop -m (path to METs input file) -I (Input group name) -O (Output group name) [-p (path to parameter file) -o (METs output filename)]\n```\n\n#### Example: \n```sh\nocrd-anybaseocr-crop \\\n   -m mets.xml \\\n   -I OCR-D-PAGE-DESKEW \\\n   -O OCR-D-PAGE-CROP\n```\n\n\n## Dewarper\n\n### Method Behaviour \n This function takes a document image as input and make the text line straight if its curved.\n\n #### Usage:\n```sh\nocrd-anybaseocr-dewarp -m (path to METs input file) -I (Input group name) -O (Output group name) [-p (path to parameter file) -o (METs output filename)]\n```\n\n\n#### Example: \n```sh\nCUDA_VISIBLE_DEVICES=0 ocrd-anybaseocr-dewarp \\\n   -m mets.xml \\\n   -I OCR-D-PAGE-CROP \\\n   -O OCR-D-PAGE-DEWARP\n```\n\n## Text/Non-Text Segmenter\n\n### Method Behaviour \n This function takes a document image as an input and separates the text and non-text part from the input document image.\n\n #### Usage:\n```sh\nocrd-anybaseocr-tiseg -m (path to METs input file) -I (Input group name) -O (Output group name) [-p (path to parameter file) -o (METs output filename)]\n```\n\n#### Example: \n```sh\nocrd-anybaseocr-tiseg \\\n   -m mets.xml \\\n   -I OCR-D-PAGE-CROP \\\n   -O OCR-D-PAGE-TISEG\n```\n\n## Textline Segmenter\n\n### Method Behaviour \n This function takes a cropped document image as an input and segment the image into textline images.\n\n #### Usage:\n```sh\nocrd-anybaseocr-textline -m (path to METs input file) -I (Input group name) -O (Output group name) [-p (path to parameter file) -o (METs output filename)]\n```\n\n#### Example: \n```sh\nocrd-anybaseocr-textline \\\n   -m mets.xml \\\n   -I OCR-D-PAGE-TISEG \\\n   -O OCR-D-PAGE-TL\n```\n\n## Block Segmenter\n\n### Method Behaviour \n This function takes raw document image as an input and segments the image into the different text blocks.\n\n #### Usage:\n```sh\nocrd-anybaseocr-block-segmenter -m (path to METs input file) -I (Input group name) -O (Output group name) [-p (path to parameter file) -o (METs output filename)]\n```\n\n#### Example: \n```sh\nocrd-anybaseocr-block-segmenter \\\n   -m mets.xml \\\n   -I OCR-IMG \\\n   -O OCR-D-PAGE-BLOCK\n```\n\n## Document Analyser\n\n### Method Behaviour \n This function takes all the cropped document images of a single book and its corresponding text regions as input and generates the logical structure on the book level.\n\n #### Usage:\n```sh\nocrd-anybaseocr-layout-analysis -m (path to METs input file) -I (Input group name) -O (Output group name) [-p (path to parameter file) -o (METs output filename)]\n```\n\n#### Example: \n```sh\nocrd-anybaseocr-layout-analysis \\\n   -m mets.xml \\\n   -I OCR-IMG \\\n   -O OCR-D-PAGE-BLOCK\n```\n\n\n## Testing\n\nTo test the tools, download [OCR-D/assets](https://github.com/OCR-D/assets). In\nparticular, the code is tested with the\n[dfki-testdata](https://github.com/OCR-D/assets/tree/master/data/dfki-testdata)\ndataset.\n\nRun `make test` to run all tests.\n\n## License\n\n\n```\n Licensed under the Apache License, Version 2.0 (the \"License\");\n you may not use this file except in compliance with the License.\n You may obtain a copy of the License at\n\n     http://www.apache.org/licenses/LICENSE-2.0\n\n Unless required by applicable law or agreed to in writing, software\n distributed under the License is distributed on an \"AS IS\" BASIS,\n WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n See the License for the specific language governing permissions and\n limitations under the License.\n ```\n\n\n", "description_content_type"=>"text/markdown", "docs_url"=>nil, "download_url"=>"", "downloads"=>{"last_day"=>-1, "last_month"=>-1, "last_week"=>-1}, "home_page"=>"https://github.com/mjenckel/LAYoutERkennung", "keywords"=>"", "license"=>"Apache License 2.0", "maintainer"=>"", "maintainer_email"=>"", "name"=>"ocrd-anybaseocr", "package_url"=>"https://pypi.org/project/ocrd-anybaseocr/", "platform"=>"", "project_url"=>"https://pypi.org/project/ocrd-anybaseocr/", "project_urls"=>{"Homepage"=>"https://github.com/mjenckel/LAYoutERkennung"}, "release_url"=>"https://pypi.org/project/ocrd-anybaseocr/0.0.1/", "requires_dist"=>["ocrd (>=2.0.0)", "opencv-python-headless (>=3.4)", "ocrd-fork-ocropy (>=1.4.0a3)", "ocrd-fork-pylsd (>=0.0.3)", "setuptools (>=41.0.0)", "torch (>=1.1.0)", "torchvision", "pandas", "keras", "tensorflow-gpu (==1.14.0)", "scikit-image"], "requires_python"=>"", "summary"=>"", "version"=>"0.0.1"}, "last_serial"=>6317222, "releases"=>{"0.0.1"=>[{"comment_text"=>"", "digests"=>{"md5"=>"e70acb5331cd2daece04bc114622ec39", "sha256"=>"021a114defc9702fa99988308277cab92bad1a95a8472395b8e38fde23569dc6"}, "downloads"=>-1, "filename"=>"ocrd_anybaseocr-0.0.1-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"e70acb5331cd2daece04bc114622ec39", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>95755, "upload_time"=>"2019-12-17T13:15:51", "upload_time_iso_8601"=>"2019-12-17T13:15:51.079869Z", "url"=>"https://files.pythonhosted.org/packages/d6/2c/9417ad5fb850c2eb52a86e822f64741d4df65831580104a68196e0c5cbcf/ocrd_anybaseocr-0.0.1-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"84203839fe06916bc281097251eba50f", "sha256"=>"077b3f59f09f1e315aee5fafbeef8184706d45c0a5863224f5ebef941b682281"}, "downloads"=>-1, "filename"=>"ocrd_anybaseocr-0.0.1.tar.gz", "has_sig"=>false, "md5_digest"=>"84203839fe06916bc281097251eba50f", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>77823, "upload_time"=>"2019-12-17T13:15:54", "upload_time_iso_8601"=>"2019-12-17T13:15:54.116419Z", "url"=>"https://files.pythonhosted.org/packages/15/cf/fc744aa2323538a7a980a44af16d86ab68feba42f78ba6069763e9ed125d/ocrd_anybaseocr-0.0.1.tar.gz"}]}, "urls"=>[{"comment_text"=>"", "digests"=>{"md5"=>"e70acb5331cd2daece04bc114622ec39", "sha256"=>"021a114defc9702fa99988308277cab92bad1a95a8472395b8e38fde23569dc6"}, "downloads"=>-1, "filename"=>"ocrd_anybaseocr-0.0.1-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"e70acb5331cd2daece04bc114622ec39", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>95755, "upload_time"=>"2019-12-17T13:15:51", "upload_time_iso_8601"=>"2019-12-17T13:15:51.079869Z", "url"=>"https://files.pythonhosted.org/packages/d6/2c/9417ad5fb850c2eb52a86e822f64741d4df65831580104a68196e0c5cbcf/ocrd_anybaseocr-0.0.1-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"84203839fe06916bc281097251eba50f", "sha256"=>"077b3f59f09f1e315aee5fafbeef8184706d45c0a5863224f5ebef941b682281"}, "downloads"=>-1, "filename"=>"ocrd_anybaseocr-0.0.1.tar.gz", "has_sig"=>false, "md5_digest"=>"84203839fe06916bc281097251eba50f", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>77823, "upload_time"=>"2019-12-17T13:15:54", "upload_time_iso_8601"=>"2019-12-17T13:15:54.116419Z", "url"=>"https://files.pythonhosted.org/packages/15/cf/fc744aa2323538a7a980a44af16d86ab68feba42f78ba6069763e9ed125d/ocrd_anybaseocr-0.0.1.tar.gz"}]}, "url"=>"https://github.com/mjenckel/LAYoutERkennung"}, "url"=>"https://github.com/OCR-D/ocrd_anybaseocr"}

ocrd_pc_segmentation

{"compliant_cli"=>true, "files"=>{"Dockerfile"=>"# WORK IN PROGRESS - NOT READY\nFROM ocrd/core\nVOLUME [\"/data\"]\nMAINTAINER OCR-D\nENV DEBIAN_FRONTEND noninteractive\nENV PYTHONIOENCODING utf8\n\nWORKDIR /build-ocrd\nCOPY setup.py .\nCOPY README.md .\nCOPY requirements.txt .\n#COPY requirements_test.txt .\nCOPY ocrd_pc_segmentation ./ocrd_pc_segmentation\nCOPY Makefile .\nRUN apt-get update && \\\n    apt-get -y install --no-install-recommends \\\n        build-essential \\\n    && make deps install \\\n    && apt-get -y remove --auto-remove build-essential\n", "README.md"=>"# page-segmentation module for OCRd\n\n## Introduction\n\nThis module implements a page segmentation algorithm based on a Fully\nConvolutional Network (FCN). The FCN creates a classification for each pixel in\na binary image. This result is then segmented per class using XY cuts.\n\n## Requirements\n\n- For GPU-Support: [CUDA](https://developer.nvidia.com/cuda-downloads) and\n  [CUDNN](https://developer.nvidia.com/cudnn)\n- other requirements are installed via Makefile / pip, see `requirements.txt`\n  in repository root.\n\n## Installation\n\nIf you want to use GPU support, set the environment variable `TENSORFLOW_GPU`,\notherwise leave it unset. Then:\n\n```bash\nmake deps\n```\n\nto install dependencies and\n\n```sh\nmake install\n```\n\nto install the package.\n\nBoth are python packages installed via pip, so you may want to activate\na virtalenv before installing.\n\n## Usage\n\n`ocrd-pc-segmentation` follows the [ocrd CLI](https://ocr-d.github.io/cli).\n\nIt expects a binary page image and produces region entries in the PageXML file.\n\n## Configuration\n\nThe following parameters are recognized in the JSON parameter file:\n\n- `overwrite_regions`: remove previously existing text regions\n- `xheight`: height of character \"x\" in pixels used during training.\n- `model`: pixel-classifier model path\n- `gpu_allow_growth`: required for GPU use with some graphic cards\n  (set to true, if you get CUDNN_INTERNAL_ERROR)\n- `resize_height`: scale down pixelclassifier output to this height before postprocessing. Independent of training / used model.\n  (performance / quality tradeoff, defaults to 300)\n\n## Testing\n\nThere is a simple CLI test, that will run the tool on a single image from the assets repository.\n\n`make test-cli`\n\n## Training\n\nTo train models for the pixel classifier, see [its README](https://github.com/ocr-d-modul-2-segmentierung/page-segmentation/blob/master/README.md)\n", "ocrd-tool.json"=>"{\n  \"git_url\": \"https://github.com/ocr-d-modul-2-segmentierung/segmentation-runner\",\n  \"version\": \"0.1.0\",\n  \"tools\": {\n    \"ocrd-pixelclassifier-segmentation\": {\n      \"executable\": \"ocrd-pc-segmentation\",\n      \"categories\": [\"Layout analysis\"],\n      \"description\": \"Segment page into regions using a pixel classifier based on a Fully Convolutional Network (FCN)\",\n      \"input_file_grp\": [\n        \"OCR-D-IMG-BIN\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-SEG-BLOCK\"\n      ],\n      \"steps\": [\"layout/segmentation/region\"],\n      \"parameters\": {\n        \"overwrite_regions\": {\n          \"type\": \"boolean\",\n          \"default\": true,\n          \"description\": \"remove existing layout and text annotation below the Page level\"\n        },\n        \"xheight\": {\n          \"type\": \"integer\",\n          \"description\": \"height of character x in pixels used during training\",\n          \"default\": 8\n        },\n        \"model\":  {\n          \"type\": \"string\",\n          \"description\": \"trained model for pixel classifier\",\n          \"default\": \"__DEFAULT__\"\n        },\n        \"gpu_allow_growth\": {\n          \"type\": \"boolean\",\n          \"default\": false,\n          \"description\": \"required for GPU use with some graphic cards (set to true, if you get CUDNN_INTERNAL_ERROR)\"\n\n        },\n        \"resize_height\": {\n          \"type\": \"integer\",\n          \"default\": 300,\n          \"description\": \"scale down pixelclassifier output to this height for postprocessing (performance/quality tradeoff). Independent of training.\"\n        }\n\n      }\n    }\n  }\n}\n", "setup.py"=>"# -*- coding: utf-8 -*-\nimport codecs\n\nfrom setuptools import setup, find_packages\n\nsetup(\n    name='ocrd_pc_segmentation',\n    version='0.1.3',\n    description='pixel-classifier based page segmentation',\n    long_description=codecs.open('README.md', encoding='utf-8').read(),\n    long_description_content_type='text/markdown',\n    author='Alexander Gehrke, Christian Reul, Christoph Wick',\n    author_email='alexander.gehrke@uni-wuerzburg.de, christian.reul@uni-wuerzburg.de, christoph.wick@uni-wuerzburg.de',\n    url='https://github.com/ocr-d-modul-2-segmentierung/segmentation-runner',\n    packages=find_packages(exclude=('tests', 'docs')),\n    install_requires=open(\"requirements.txt\").read().split(),\n    extras_require={\n        'tf_cpu': ['ocr4all_pixel_classifier[tf_cpu]>=0.0.1'],\n        'tf_gpu': ['ocr4all_pixel_classifier[tf_gpu]>=0.0.1'],\n    },\n    package_data={\n        '': ['*.json', '*.yml', '*.yaml'],\n    },\n    classifiers=[\n        \"Programming Language :: Python :: 3\",\n        \"License :: OSI Approved :: Apache Software License\",\n        \"Topic :: Scientific/Engineering :: Image Recognition\"\n\n    ],\n    entry_points={\n        'console_scripts': [\n            'ocrd-pc-segmentation=ocrd_pc_segmentation.cli:ocrd_pc_segmentation',\n        ]\n    },\n    data_files=[('', [\"requirements.txt\"])],\n    include_package_data=True,\n)\n"}, "git"=>{"last_commit"=>"Mon Jan 20 10:00:24 2020 +0100", "latest_tag"=>"v0.1.3", "number_of_commits"=>"29", "url"=>"https://github.com/ocr-d-modul-2-segmentierung/ocrd-pixelclassifier-segmentation.git"}, "name"=>"ocrd_pc_segmentation", "ocrd_tool"=>{"git_url"=>"https://github.com/ocr-d-modul-2-segmentierung/segmentation-runner", "tools"=>{"ocrd-pixelclassifier-segmentation"=>{"categories"=>["Layout analysis"], "description"=>"Segment page into regions using a pixel classifier based on a Fully Convolutional Network (FCN)", "executable"=>"ocrd-pc-segmentation", "input_file_grp"=>["OCR-D-IMG-BIN"], "output_file_grp"=>["OCR-D-SEG-BLOCK"], "parameters"=>{"gpu_allow_growth"=>{"default"=>false, "description"=>"required for GPU use with some graphic cards (set to true, if you get CUDNN_INTERNAL_ERROR)", "type"=>"boolean"}, "model"=>{"default"=>"__DEFAULT__", "description"=>"trained model for pixel classifier", "type"=>"string"}, "overwrite_regions"=>{"default"=>true, "description"=>"remove existing layout and text annotation below the Page level", "type"=>"boolean"}, "resize_height"=>{"default"=>300, "description"=>"scale down pixelclassifier output to this height for postprocessing (performance/quality tradeoff). Independent of training.", "type"=>"integer"}, "xheight"=>{"default"=>8, "description"=>"height of character x in pixels used during training", "type"=>"integer"}}, "steps"=>["layout/segmentation/region"]}}, "version"=>"0.1.0"}, "ocrd_tool_validate"=>"<report valid=\"false\">\n  [tools.ocrd-pixelclassifier-segmentation.parameters.xheight.type] 'integer' is not one of ['string', 'number', 'boolean']\n  [tools.ocrd-pixelclassifier-segmentation.parameters.resize_height.type] 'integer' is not one of ['string', 'number', 'boolean']\n</report>", "official"=>true, "org_plus_name"=>"ocr-d-modul-2-segmentierung/ocrd_pc_segmentation", "python"=>{"author"=>"Alexander Gehrke, Christian Reul, Christoph Wick", "author-email"=>"alexander.gehrke@uni-wuerzburg.de, christian.reul@uni-wuerzburg.de, christoph.wick@uni-wuerzburg.de", "name"=>"ocrd_pc_segmentation", "pypi"=>{"info"=>{"author"=>"Alexander Gehrke, Christian Reul, Christoph Wick", "author_email"=>"alexander.gehrke@uni-wuerzburg.de, christian.reul@uni-wuerzburg.de, christoph.wick@uni-wuerzburg.de", "bugtrack_url"=>nil, "classifiers"=>["License :: OSI Approved :: Apache Software License", "Programming Language :: Python :: 3", "Topic :: Scientific/Engineering :: Image Recognition"], "description"=>"# page-segmentation module for OCRd\n\n## Introduction\n\nThis module implements a page segmentation algorithm based on a Fully\nConvolutional Network (FCN). The FCN creates a classification for each pixel in\na binary image. This result is then segmented per class using XY cuts.\n\n## Requirements\n\n- For GPU-Support: [CUDA](https://developer.nvidia.com/cuda-downloads) and\n  [CUDNN](https://developer.nvidia.com/cudnn)\n- other requirements are installed via Makefile / pip, see `requirements.txt`\n  in repository root.\n\n## Installation\n\nIf you want to use GPU support, set the environment variable `TENSORFLOW_GPU`,\notherwise leave it unset. Then:\n\n```bash\nmake dep\n```\n\nto install dependencies and\n\n```sh\nmake install\n```\n\nto install the package.\n\nBoth are python packages installed via pip, so you may want to activate\na virtalenv before installing.\n\n## Usage\n\n`ocrd-pc-segmentation` follows the [ocrd CLI](https://ocr-d.github.io/cli).\n\nIt expects a binary page image and produces region entries in the PageXML file.\n\n## Configuration\n\nThe following parameters are recognized in the JSON parameter file:\n\n- `overwrite_regions`: remove previously existing text regions\n- `xheight`: height of character \"x\" in pixels used during training.\n- `model`: pixel-classifier model path\n- `gpu_allow_growth`: required for GPU use with some graphic cards\n  (set to true, if you get CUDNN_INTERNAL_ERROR)\n- `resize_height`: scale down pixelclassifier output to this height before postprocessing. Independent of training / used model.\n  (performance / quality tradeoff, defaults to 300)\n\n## Testing\n\nThere is a simple CLI test, that will run the tool on a single image from the assets repository.\n\n`make test-cli`\n\n## Training\n\nTo train models for the pixel classifier, see [its README](https://github.com/ocr-d-modul-2-segmentierung/page-segmentation/blob/master/README.md)\n\n\n", "description_content_type"=>"text/markdown", "docs_url"=>nil, "download_url"=>"", "downloads"=>{"last_day"=>-1, "last_month"=>-1, "last_week"=>-1}, "home_page"=>"https://github.com/ocr-d-modul-2-segmentierung/segmentation-runner", "keywords"=>"", "license"=>"", "maintainer"=>"", "maintainer_email"=>"", "name"=>"ocrd-pc-segmentation", "package_url"=>"https://pypi.org/project/ocrd-pc-segmentation/", "platform"=>"", "project_url"=>"https://pypi.org/project/ocrd-pc-segmentation/", "project_urls"=>{"Homepage"=>"https://github.com/ocr-d-modul-2-segmentierung/segmentation-runner"}, "release_url"=>"https://pypi.org/project/ocrd-pc-segmentation/0.1.3/", "requires_dist"=>["ocrd (>=2.0.0a1)", "click", "ocr4all-pixel-classifier (>=0.1.3)", "numpy", "ocr4all-pixel-classifier[tf_cpu] (>=0.0.1) ; extra == 'tf_cpu'", "ocr4all-pixel-classifier[tf_gpu] (>=0.0.1) ; extra == 'tf_gpu'"], "requires_python"=>"", "summary"=>"pixel-classifier based page segmentation", "version"=>"0.1.3"}, "last_serial"=>6169845, "releases"=>{"0.1.1"=>[{"comment_text"=>"", "digests"=>{"md5"=>"7cd68c8c55c0110fbfb6de61877fd60e", "sha256"=>"c22e9fad55a01f29bea78943c8ac93bc1a0780cbc6b606cbf81bac5f888d2294"}, "downloads"=>-1, "filename"=>"ocrd_pc_segmentation-0.1.1-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"7cd68c8c55c0110fbfb6de61877fd60e", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>7559195, "upload_time"=>"2019-11-18T12:31:31", "upload_time_iso_8601"=>"2019-11-18T12:31:31.585016Z", "url"=>"https://files.pythonhosted.org/packages/72/f6/5936ad2bdc878920ae26b448bd68eb580f04632b373d5fba62c79a8c8148/ocrd_pc_segmentation-0.1.1-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"8469f2af2217a526828000b4af13f7f0", "sha256"=>"9f908f54f86d85a10b5d1d339e9f964f1b2ade3b4032ee8dadeeaa474dc299b7"}, "downloads"=>-1, "filename"=>"ocrd_pc_segmentation-0.1.1.tar.gz", "has_sig"=>false, "md5_digest"=>"8469f2af2217a526828000b4af13f7f0", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>7547673, "upload_time"=>"2019-11-18T12:31:39", "upload_time_iso_8601"=>"2019-11-18T12:31:39.690671Z", "url"=>"https://files.pythonhosted.org/packages/d9/82/c3fee56b73554529fe319dd596df56758e5429b1d5ee4b8603d404f7c94e/ocrd_pc_segmentation-0.1.1.tar.gz"}], "0.1.2"=>[{"comment_text"=>"", "digests"=>{"md5"=>"aceed390bfeffbaf723ca96961ed5d7f", "sha256"=>"026be378afb3104e0f2367254da1da0f3ba212f5d4d5c8f6a7880b4eddc5b9a5"}, "downloads"=>-1, "filename"=>"ocrd_pc_segmentation-0.1.2-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"aceed390bfeffbaf723ca96961ed5d7f", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>7559196, "upload_time"=>"2019-11-19T13:53:55", "upload_time_iso_8601"=>"2019-11-19T13:53:55.306178Z", "url"=>"https://files.pythonhosted.org/packages/cc/d6/396ad6297c509445f03fddedc5efcd6f882ce5bb223c050157d675574858/ocrd_pc_segmentation-0.1.2-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"c982401d1a8ab607bf6ed1871df87826", "sha256"=>"e2dcd0b641accb8c6594d6dd24dcf1899c3cefbc033c5860b4ff72c20f1ad4ca"}, "downloads"=>-1, "filename"=>"ocrd_pc_segmentation-0.1.2.tar.gz", "has_sig"=>false, "md5_digest"=>"c982401d1a8ab607bf6ed1871df87826", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>7547671, "upload_time"=>"2019-11-19T13:54:12", "upload_time_iso_8601"=>"2019-11-19T13:54:12.122137Z", "url"=>"https://files.pythonhosted.org/packages/0c/47/46c39455cc4c5739e4599f7715c4b618193b561885aa302777fd7b11c1b5/ocrd_pc_segmentation-0.1.2.tar.gz"}], "0.1.3"=>[{"comment_text"=>"", "digests"=>{"md5"=>"6f80c4823630b6a94f3b013ec6eab69e", "sha256"=>"30442df84ae140871ed32549d7f0e5472f02783614bd4b627bceafdd540ca266"}, "downloads"=>-1, "filename"=>"ocrd_pc_segmentation-0.1.3-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"6f80c4823630b6a94f3b013ec6eab69e", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>7559081, "upload_time"=>"2019-11-20T16:45:32", "upload_time_iso_8601"=>"2019-11-20T16:45:32.354512Z", "url"=>"https://files.pythonhosted.org/packages/45/9c/3d1dc9c772ea9446f372837318f1e55b76c6a2cb1368579592c6b3fe9326/ocrd_pc_segmentation-0.1.3-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"16b1c95e3235cf1d9f2b971bc4684daf", "sha256"=>"b58ab36e89213735fcf0b9376ce97e342626fbf8892d302c5feb3dbd5b1c73a3"}, "downloads"=>-1, "filename"=>"ocrd_pc_segmentation-0.1.3.tar.gz", "has_sig"=>false, "md5_digest"=>"16b1c95e3235cf1d9f2b971bc4684daf", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>7547532, "upload_time"=>"2019-11-20T16:45:36", "upload_time_iso_8601"=>"2019-11-20T16:45:36.370537Z", "url"=>"https://files.pythonhosted.org/packages/3a/66/bad782febb7496d089df1520d08a241af4875d6d656e68d93cfaa4fa6cf2/ocrd_pc_segmentation-0.1.3.tar.gz"}]}, "urls"=>[{"comment_text"=>"", "digests"=>{"md5"=>"6f80c4823630b6a94f3b013ec6eab69e", "sha256"=>"30442df84ae140871ed32549d7f0e5472f02783614bd4b627bceafdd540ca266"}, "downloads"=>-1, "filename"=>"ocrd_pc_segmentation-0.1.3-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"6f80c4823630b6a94f3b013ec6eab69e", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>7559081, "upload_time"=>"2019-11-20T16:45:32", "upload_time_iso_8601"=>"2019-11-20T16:45:32.354512Z", "url"=>"https://files.pythonhosted.org/packages/45/9c/3d1dc9c772ea9446f372837318f1e55b76c6a2cb1368579592c6b3fe9326/ocrd_pc_segmentation-0.1.3-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"16b1c95e3235cf1d9f2b971bc4684daf", "sha256"=>"b58ab36e89213735fcf0b9376ce97e342626fbf8892d302c5feb3dbd5b1c73a3"}, "downloads"=>-1, "filename"=>"ocrd_pc_segmentation-0.1.3.tar.gz", "has_sig"=>false, "md5_digest"=>"16b1c95e3235cf1d9f2b971bc4684daf", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>7547532, "upload_time"=>"2019-11-20T16:45:36", "upload_time_iso_8601"=>"2019-11-20T16:45:36.370537Z", "url"=>"https://files.pythonhosted.org/packages/3a/66/bad782febb7496d089df1520d08a241af4875d6d656e68d93cfaa4fa6cf2/ocrd_pc_segmentation-0.1.3.tar.gz"}]}, "url"=>"https://github.com/ocr-d-modul-2-segmentierung/segmentation-runner"}, "url"=>"https://github.com/ocr-d-modul-2-segmentierung/ocrd_pc_segmentation"}

dinglehopper

{"compliant_cli"=>false, "files"=>{"Dockerfile"=>nil, "README.md"=>"dinglehopper\n============\n\ndinglehopper is an OCR evaluation tool and reads [ALTO](https://github.com/altoxml), [PAGE](https://github.com/PRImA-Research-Lab/PAGE-XML) and text files.\n\n[![Build Status](https://travis-ci.org/qurator-spk/dinglehopper.svg?branch=master)](https://travis-ci.org/qurator-spk/dinglehopper)\n\nGoals\n-----\n* Useful\n  * As a UI tool\n  * For an automated evaluation\n  * As a library\n* Unicode support\n\nInstallation\n------------\nIt's best to use pip, e.g.:\n~~~\nsudo pip install .\n~~~\n\nUsage\n-----\n~~~\ndinglehopper some-document.gt.page.xml some-document.ocr.alto.xml\n~~~\nThis generates `report.html` and `report.json`.\n\n\nAs a OCR-D processor:\n~~~\nocrd-dinglehopper -m mets.xml -I OCR-D-GT-PAGE,OCR-D-OCR-TESS -O OCR-D-OCR-TESS-EVAL\n~~~\nThis generates HTML and JSON reports in the `OCR-D-OCR-TESS-EVAL` filegroup.\n\n\n![dinglehopper displaying metrics and character differences](.screenshots/dinglehopper.png?raw=true)\n\nTesting\n-------\nUse `pytest` to run the tests in [the tests directory](qurator/dinglehopper/tests):\n~~~\nvirtualenv -p /usr/bin/python3 venv\n. venv/bin/activate\npip install -r requirements.txt\npip install pytest\npytest\n~~~\n", "ocrd-tool.json"=>"{\n  \"git_url\": \"https://github.com/qurator-spk/dinglehopper\",\n  \"tools\": {\n    \"ocrd-dinglehopper\": {\n      \"executable\": \"ocrd-dinglehopper\",\n      \"description\": \"Evaluate OCR text against ground truth with dinglehopper\",\n      \"input_file_grp\": [\n        \"OCR-D-GT-PAGE\",\n        \"OCR-D-OCR\"\n      ],\n      \"output_file_grp\": [\n        \"OCR-D-OCR-EVAL\"\n      ],\n      \"categories\": [\n        \"Quality assurance\"\n      ],\n      \"steps\": [\n        \"recognition/text-recognition\"\n      ]\n    }\n  }\n}\n", "setup.py"=>"from io import open\nfrom setuptools import find_packages, setup\n\nwith open('requirements.txt') as fp:\n    install_requires = fp.read()\n\nsetup(\n    name='dinglehopper',\n    author='Mike Gerber, The QURATOR SPK Team',\n    author_email='mike.gerber@sbb.spk-berlin.de, qurator@sbb.spk-berlin.de',\n    description='The OCR evaluation tool',\n    long_description=open('README.md', 'r', encoding='utf-8').read(),\n    long_description_content_type='text/markdown',\n    keywords='qurator ocr',\n    license='Apache',\n    namespace_packages=['qurator'],\n    packages=find_packages(exclude=['*.tests', '*.tests.*', 'tests.*', 'tests']),\n    install_requires=install_requires,\n    package_data={\n        '': ['*.json', 'templates/*'],\n    },\n    entry_points={\n      'console_scripts': [\n        'dinglehopper=qurator.dinglehopper.cli:main',\n        'ocrd-dinglehopper=qurator.dinglehopper.ocrd_cli:ocrd_dinglehopper',\n      ]\n    }\n)\n"}, "git"=>{"last_commit"=>"Tue Jan 14 13:22:42 2020 +0100", "latest_tag"=>"", "number_of_commits"=>"56", "url"=>"https://github.com/qurator-spk/dinglehopper.git"}, "name"=>"dinglehopper", "ocrd_tool"=>{"git_url"=>"https://github.com/qurator-spk/dinglehopper", "tools"=>{"ocrd-dinglehopper"=>{"categories"=>["Quality assurance"], "description"=>"Evaluate OCR text against ground truth with dinglehopper", "executable"=>"ocrd-dinglehopper", "input_file_grp"=>["OCR-D-GT-PAGE", "OCR-D-OCR"], "output_file_grp"=>["OCR-D-OCR-EVAL"], "steps"=>["recognition/text-recognition"]}}}, "ocrd_tool_validate"=>"<report valid=\"false\">\n  [] 'version' is a required property\n</report>", "official"=>false, "org_plus_name"=>"qurator-spk/dinglehopper", "python"=>{"author"=>"Mike Gerber, The QURATOR SPK Team", "author-email"=>"mike.gerber@sbb.spk-berlin.de, qurator@sbb.spk-berlin.de", "name"=>"dinglehopper", "pypi"=>nil, "url"=>"UNKNOWN"}, "url"=>"https://github.com/qurator-spk/dinglehopper"}

ocrd_typegroups_classifier

{"compliant_cli"=>true, "files"=>{"Dockerfile"=>nil, "README.md"=>"# ocrd_typegroups_classifier\n\n> Typegroups classifier for OCR\n\n## Installation\n\n### From PyPI\n\n```sh\npip3 install ocrd_typegroup_classifier\n```\n\n### From source\n\nIf needed, create a virtual environment for Python 3 (it was tested\nsuccessfully with Python 3.7), activate it, and install ocrd.\n\n```sh\nvirtualenv -p python3 ocrd-venv3\nsource ocrd-venv3/bin/activate\npip3 install ocrd\n```\n\nEnter in the folder containing the tool:\n\n```\ncd ocrd_typegroups_classifier/\n```\n\nInstall the module and its dependencies\n\n```\nmake install\n```\n\nFinally, run the test:\n\n```\nsh test/test.sh\n```\n\n** Important: ** The test makes sure that the system does work. For\nspeed reasons, a very small neural network is used and applied only to\nthe top-left corner of the image, therefore the quality of the results\nwill be of poor quality.\n\n## Models\n\nThe model classifier-1.tgc is based on a ResNet-18, with less neurons\nper layer than the usual model. It was briefly trained on 12 classes:\nAdornment, Antiqua, Bastarda, Book covers and other irrelevant data,\nEmpty Pages, Fraktur, Griechisch, Hebräisch, Kursiv, Rotunda, Textura,\nand Woodcuts - Engravings.\n\n## Heatmap Generation ##\nGiven a trained model, it is possible to produce heatmaps corresponding\nto classification results. Syntax:\n\n```\npython3 tools/heatmap.py ocrd_typegroups_classifier/models/classifier.tgc sample.jpg out\n```\n", "ocrd-tool.json"=>"{\n  \"version\": \"0.0.2\",\n  \"git_url\": \"https://github.com/seuretm/ocrd_typegroups_classifier\",\n  \"tools\": {\n    \"ocrd-typegroups-classifier\": {\n      \"executable\": \"ocrd-typegroups-classifier\",\n      \"description\": \"Classification of 15th century type groups\",\n      \"categories\": [\n        \"Text recognition and optimization\"\n      ],\n      \"steps\": [\n        \"recognition/font-identification\"\n      ],\n      \"input_file_grp\": [\"OCR-D-IMG\"],\n      \"parameters\": {\n        \"network\": {\n          \"description\": \"The file name of the neural network to use, including sufficient path information\",\n          \"type\": \"string\",\n          \"required\": true\n        },\n        \"stride\": {\n          \"description\": \"Stride applied to the CNN on the image. Should be between 1 and 224. Smaller values increase the computation time.\",\n          \"type\": \"number\",\n          \"format\": \"integer\",\n          \"default\": 112\n        }\n      }\n    }\n  }\n}\n", "setup.py"=>"# -*- coding: utf-8 -*-\nimport codecs\n\nfrom setuptools import setup, find_packages\n\nwith codecs.open('README.md', encoding='utf-8') as f:\n    README = f.read()\n\nsetup(\n    name='ocrd_typegroups_classifier',\n    version='0.0.2',\n    description='Typegroups classifier for OCR',\n    long_description=README,\n    long_description_content_type='text/markdown',\n    author='Matthias Seuret, Konstantin Baierer',\n    author_email='seuretm@users.noreply.github.com',\n    url='https://github.com/seuretm/ocrd_typegroups_classifier',\n    license='Apache License 2.0',\n    packages=find_packages(exclude=('tests', 'docs')),\n    include_package_data=True,\n    install_requires=open('requirements.txt').read().split('\\n'),\n    package_data={\n        '': ['*.json', '*.tgc'],\n    },\n    entry_points={\n        'console_scripts': [\n            'typegroups-classifier=ocrd_typegroups_classifier.cli.simple:cli',\n            'ocrd-typegroups-classifier=ocrd_typegroups_classifier.cli.ocrd_cli:cli',\n        ]\n    },\n)\n"}, "git"=>{"last_commit"=>"Thu Jan 16 11:38:59 2020 +0100", "latest_tag"=>"v0.0.2", "number_of_commits"=>"77", "url"=>"https://github.com/OCR-D/ocrd_typegroups_classifier.git"}, "name"=>"ocrd_typegroups_classifier", "ocrd_tool"=>{"git_url"=>"https://github.com/seuretm/ocrd_typegroups_classifier", "tools"=>{"ocrd-typegroups-classifier"=>{"categories"=>["Text recognition and optimization"], "description"=>"Classification of 15th century type groups", "executable"=>"ocrd-typegroups-classifier", "input_file_grp"=>["OCR-D-IMG"], "parameters"=>{"network"=>{"description"=>"The file name of the neural network to use, including sufficient path information", "required"=>true, "type"=>"string"}, "stride"=>{"default"=>112, "description"=>"Stride applied to the CNN on the image. Should be between 1 and 224. Smaller values increase the computation time.", "format"=>"integer", "type"=>"number"}}, "steps"=>["recognition/font-identification"]}}, "version"=>"0.0.2"}, "ocrd_tool_validate"=>"<report valid=\"true\">\n</report>", "official"=>true, "org_plus_name"=>"OCR-D/ocrd_typegroups_classifier", "python"=>{"author"=>"Matthias Seuret, Konstantin Baierer", "author-email"=>"seuretm@users.noreply.github.com", "name"=>"ocrd_typegroups_classifier", "pypi"=>{"info"=>{"author"=>"Matthias Seuret, Konstantin Baierer", "author_email"=>"seuretm@users.noreply.github.com", "bugtrack_url"=>nil, "classifiers"=>[], "description"=>"# ocrd_typegroups_classifier\n\n> Typegroups classifier for OCR\n\n## Installation\n\n### From PyPI\n\n```sh\npip3 install ocrd_typegroup_classifier\n```\n\n### From source\n\nIf needed, create a virtual environment for Python 3 (it was tested\nsuccessfully with Python 3.7), activate it, and install ocrd.\n\n```sh\nvirtualenv -p python3 ocrd-venv3\nsource ocrd-venv3/bin/activate\npip3 install ocrd\n```\n\nEnter in the folder containing the tool:\n\n```\ncd ocrd_typegroups_classifier/\n```\n\nInstall the module and its dependencies\n\n```\nmake install\n```\n\nFinally, run the test:\n\n```\nsh test/test.sh\n```\n\n** Important: ** The test makes sure that the system does work. For\nspeed reasons, a very small neural network is used and applied only to\nthe top-left corner of the image, therefore the quality of the results\nwill be of poor quality.\n\n## Models\n\nThe model classifier-1.tgc is based on a ResNet-18, with less neurons\nper layer than the usual model. It was briefly trained on 12 classes:\nAdornment, Antiqua, Bastarda, Book covers and other irrelevant data,\nEmpty Pages, Fraktur, Griechisch, Hebräisch, Kursiv, Rotunda, Textura,\nand Woodcuts - Engravings.\n\n## Heatmap Generation ##\nGiven a trained model, it is possible to produce heatmaps corresponding\nto classification results. Syntax:\n\n```\npython3 tools/heatmap.py ocrd_typegroups_classifier/models/classifier.tgc sample.jpg out\n```\n\n\n", "description_content_type"=>"text/markdown", "docs_url"=>nil, "download_url"=>"", "downloads"=>{"last_day"=>-1, "last_month"=>-1, "last_week"=>-1}, "home_page"=>"https://github.com/seuretm/ocrd_typegroups_classifier", "keywords"=>"", "license"=>"Apache License 2.0", "maintainer"=>"", "maintainer_email"=>"", "name"=>"ocrd-typegroups-classifier", "package_url"=>"https://pypi.org/project/ocrd-typegroups-classifier/", "platform"=>"", "project_url"=>"https://pypi.org/project/ocrd-typegroups-classifier/", "project_urls"=>{"Homepage"=>"https://github.com/seuretm/ocrd_typegroups_classifier"}, "release_url"=>"https://pypi.org/project/ocrd-typegroups-classifier/0.0.2/", "requires_dist"=>["ocrd (>=2.0.1)", "pandas", "scikit-image", "torch (>=1.4.0)", "torchvision (>=0.5.0)"], "requires_python"=>"", "summary"=>"Typegroups classifier for OCR", "version"=>"0.0.2"}, "last_serial"=>6465300, "releases"=>{"0.0.1"=>[{"comment_text"=>"", "digests"=>{"md5"=>"19437f8f76a7e346479a2bea163b164f", "sha256"=>"d469964e37069a2dab403bbf7400eec4ddabcf4ee83c86d6e88bda1bd96e9c1d"}, "downloads"=>-1, "filename"=>"ocrd_typegroups_classifier-0.0.1-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"19437f8f76a7e346479a2bea163b164f", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>26290742, "upload_time"=>"2019-11-29T15:27:55", "upload_time_iso_8601"=>"2019-11-29T15:27:55.449239Z", "url"=>"https://files.pythonhosted.org/packages/e6/1b/5d0e6967985a7e23d01f558677bd7de4385dacc0186e4896ad23cb4e2f0d/ocrd_typegroups_classifier-0.0.1-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"48c202c02d301243c8e9f365e9dcad1d", "sha256"=>"6b339f6b52cb62acc93f64d11637aa895a2cfbe7958df3391e4d6480d8c87d28"}, "downloads"=>-1, "filename"=>"ocrd_typegroups_classifier-0.0.1.tar.gz", "has_sig"=>false, "md5_digest"=>"48c202c02d301243c8e9f365e9dcad1d", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>15969, "upload_time"=>"2019-11-29T15:27:59", "upload_time_iso_8601"=>"2019-11-29T15:27:59.723574Z", "url"=>"https://files.pythonhosted.org/packages/6a/d0/620fd50f319ef68ec959b67d0c048bb0f1d602ca5cc0baa0ff46fd235382/ocrd_typegroups_classifier-0.0.1.tar.gz"}], "0.0.2"=>[{"comment_text"=>"", "digests"=>{"md5"=>"733fcd5009cf54a7349aa314bf9a6e47", "sha256"=>"75057c3c0c8be6f664f04c903ce3fd4337a5f87dea8c825a423e006a2c406a03"}, "downloads"=>-1, "filename"=>"ocrd_typegroups_classifier-0.0.2-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"733fcd5009cf54a7349aa314bf9a6e47", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>26294951, "upload_time"=>"2020-01-16T10:39:25", "upload_time_iso_8601"=>"2020-01-16T10:39:25.132553Z", "url"=>"https://files.pythonhosted.org/packages/bc/82/1b0976ef56d24962249dd9c4ff1c8dff259413cb52cc99bb08bbea15e1f8/ocrd_typegroups_classifier-0.0.2-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"4e597d6a3f75c4991392b11e88f89f40", "sha256"=>"8c9b0f8253a2b34985128201ff155329ce23a5094e21f5f162d9ffa12ce8230b"}, "downloads"=>-1, "filename"=>"ocrd_typegroups_classifier-0.0.2.tar.gz", "has_sig"=>false, "md5_digest"=>"4e597d6a3f75c4991392b11e88f89f40", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>15988, "upload_time"=>"2020-01-16T10:39:28", "upload_time_iso_8601"=>"2020-01-16T10:39:28.744590Z", "url"=>"https://files.pythonhosted.org/packages/13/6c/ad140f1e282941da373f19236cfffdc7b4dfe8190cef547175d33c3de8d9/ocrd_typegroups_classifier-0.0.2.tar.gz"}]}, "urls"=>[{"comment_text"=>"", "digests"=>{"md5"=>"733fcd5009cf54a7349aa314bf9a6e47", "sha256"=>"75057c3c0c8be6f664f04c903ce3fd4337a5f87dea8c825a423e006a2c406a03"}, "downloads"=>-1, "filename"=>"ocrd_typegroups_classifier-0.0.2-py3-none-any.whl", "has_sig"=>false, "md5_digest"=>"733fcd5009cf54a7349aa314bf9a6e47", "packagetype"=>"bdist_wheel", "python_version"=>"py3", "requires_python"=>nil, "size"=>26294951, "upload_time"=>"2020-01-16T10:39:25", "upload_time_iso_8601"=>"2020-01-16T10:39:25.132553Z", "url"=>"https://files.pythonhosted.org/packages/bc/82/1b0976ef56d24962249dd9c4ff1c8dff259413cb52cc99bb08bbea15e1f8/ocrd_typegroups_classifier-0.0.2-py3-none-any.whl"}, {"comment_text"=>"", "digests"=>{"md5"=>"4e597d6a3f75c4991392b11e88f89f40", "sha256"=>"8c9b0f8253a2b34985128201ff155329ce23a5094e21f5f162d9ffa12ce8230b"}, "downloads"=>-1, "filename"=>"ocrd_typegroups_classifier-0.0.2.tar.gz", "has_sig"=>false, "md5_digest"=>"4e597d6a3f75c4991392b11e88f89f40", "packagetype"=>"sdist", "python_version"=>"source", "requires_python"=>nil, "size"=>15988, "upload_time"=>"2020-01-16T10:39:28", "upload_time_iso_8601"=>"2020-01-16T10:39:28.744590Z", "url"=>"https://files.pythonhosted.org/packages/13/6c/ad140f1e282941da373f19236cfffdc7b4dfe8190cef547175d33c3de8d9/ocrd_typegroups_classifier-0.0.2.tar.gz"}]}, "url"=>"https://github.com/seuretm/ocrd_typegroups_classifier"}, "url"=>"https://github.com/OCR-D/ocrd_typegroups_classifier"}