Working with OCR-D-(Ground-Truth)-Repository

Upload bagit container from scratch to OCR-D(-GT)-Repository

Example to upload a scanned page to OCR-D-Repo.

Preparation: Create Workspace

Requirements: ocrd (Version 1.0.0) See Setup OCR-D Stack

Activate virtualenv

user@hostname:~$source ~/env-ocrd/bin/activate
(env-ocrd) user@hostname:~$

Initialize Workspace

(env-ocrd) user@hostname:~$ ocrd workspace init communist_manifesto
(env-ocrd) user@hostname:~$ cd communist_manifesto

Create Folder for Scanned Page

(env-ocrd) user@hostname:~/communist_manifesto$ mkdir OCR-D-IMG

Download Image (Google)

(env-ocrd) user@hostname:~/communist_manifesto$ wget https://upload.wikimedia.org/wikipedia/commons/thumb/1/18/Manifesto_of_the_Communist_Party.djvu/page15-2745px-Manifesto_of_the_Communist_Party.djvu.jpg -O OCR-D-IMG/OCR-D-IMG_0015.jpg

Add Image to Workspace

(env-ocrd) user@hostname:~/communist_manifesto$ ocrd workspace add -g P0015 -G OCR-D-IMG -i OCR-D-IMG_0015 -m image/jpg OCR-D-IMG/OCR-D-IMG_0015.jpg

Set Unique ID for Workspace

(env-ocrd) user@hostname:~/communist_manifesto$ ocrd workspace set-id 'communist_manifesto'

Validate Workspace

For some images, the resolution of the image is not set. To avoid validation errors, the resolution check is skipped. For further details see ‘ocrd workspace validate –help’.

(env-ocrd) user@hostname:~/communist_manifesto$ ocrd workspace validate --skip pixel_density mets.xml

Create BagIt Container

(env-ocrd) user@hostname:~/communist_manifesto$ cd ..
(env-ocrd) user@hostname:~/$ ocrd zip bag -i communist_manifesto -d communist_manifesto/

Validate BagIt Container

(env-ocrd) user@hostname:~/$ ocrd zip validate communist_manifesto.ocrd.zip
[...]
OK

Upload BagIt Container

user@hostname:~/$ curl -u ingest:GENERATED_PASSWORD -v -F "file=@communist_manifesto.ocrd.zip" http://localhost:8080/api/v1/metastore/bagit 
[...]
OK

Download all BagIt Containers

user@hostname:~/Download$ wget -O listOfContainers.json https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit

user@hostname:~/Download$ ocrdzips=$(cat listOfContainers.json | tr ",[]\"" "\n")

user@hostname:~/Download$ for addr in $ocrdzips
do
  wget $addr
  filename=$(basename -- "$addr")
  directory="${filename%.*}"

  mkdir $directory
  cd $directory
  unzip ../$filename
  cd ..
done

List all Documents (in Browser)

https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit The list shows all ingested documents with its

Download Document

https://ocr-d-repo.scc.kit.edu/api/v1/dataresources/71e19490-343a-4d68-a5a7-7cf4c725c843/data/arent_dichtercharaktere_1885.zip Download of the complete document as bagit container.

List all Files inside Document

https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/71e19490-343a-4d68-a5a7-7cf4c725c843/files All files of given resourceID referenced inside the mets.xml are listed here.

Download Single File

https://ocr-d-repo.scc.kit.edu/api/v1/dataresources/71e19490-343a-4d68-a5a7-7cf4c725c843/data/bagit/data/DEFAULT/DEFAULT_0002 Download/view single file (Tiff) of given resourceID, file group and fileID.

List Metadata

https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/71e19490-343a-4d68-a5a7-7cf4c725c843/metadata List metadata of the document (e.g.: title, author, year, identifier, languages, classifications) of given resourceID.

List Ground Truth Metadata

https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/71e19490-343a-4d68-a5a7-7cf4c725c843/groundtruth List all semantic labels of given resourceID.

Search Inside Repository

All searches will return a list of fitting resourceIDs. In order to further investigate the found resources, the listings above can be used.

Search via browser

https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit/search

Search on command line

Search for Semantic Label

https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/labeling?label=condition/acquisition/method-flaws/imaging/uneven-illumination Search for documents with e.g. uneven illumination.

Search for Documents Containing Multiple Semantic Labels at Once

https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/labeling?label=condition/acquisition/method-flaws/imaging/uneven-illumination,condition/acquisition/content-or-background/included-objects/preceeding-or-proceeding

Search for Documents with Classification ‘Fachtext’

https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/classification?class=Fachtext

Search for Documents with Language ‘deu’

https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/language?lang=deu

Search for Documents with Identifier ‘16488’

https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/identifier?identifier=16529

Search for Documents with Specific Identifier and Type

https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/identifier?identifier=urn:nbn:de:kobv:b4-200905196929&type=urn Search for document with specific identifier of a specific type. Possible types are:

user@hostname:~/Download$ allocrdzips=$(cat listOfAllContainers.json tr “,[]"” “\n”)

Get IDs of fitting containers

user@hostname:~/Download$ wget -O filteredList.json https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/classification?class=Belletristik

user@hostname:~/Download$ filteredIds=$(cat filteredList.json tr “,[]"” “\n”)

user@hostname:~/Download$ for bagitid in $filteredIds do for addr in $allocrdzips do if echo “$addr” | grep -q “$bagitid”; then wget $addr filename=$(basename – “$addr”) directory=”${filename%.*}”

  mkdir $directory
  cd $directory
  unzip ../$filename
  cd ..
fi   done done ```