ocrd.cli.workspace module

OCR-D CLI: workspace management

ocrd workspace

Managing workspaces

A workspace comprises a METS file and a directory as point of reference.

Operates on the file system directly or via a METS server (already running via some prior server start subcommand).

ocrd workspace [OPTIONS] COMMAND [ARGS]...

Options

-d, --directory <WORKSPACE_DIR>

Changes the workspace folder location [default: METS_URL directory or .]

-M, --mets-basename <mets_basename>

METS file basename. Deprecated, use --mets/--directory

-m, --mets <METS_URL>

The path/URL of the METS file [default: WORKSPACE_DIR/mets.xml]

-U, --mets-server-url <mets_server_url>

TCP host URI or UDS path of METS server

--backup

Backup mets.xml whenever it is saved.

Environment variables

WORKSPACE_DIR

Provide a default for -d
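
The defaults of -d and -m depend on each other: the directory falls back to the directory of the METS file (or .), and the METS path falls back to WORKSPACE_DIR/mets.xml. A minimal sketch of that resolution follows, assuming local paths only. `resolve_workspace` is a hypothetical helper written for illustration, not a function of ocrd, and the real CLI may treat URLs and edge cases differently.

```python
import os


def resolve_workspace(directory=None, mets_url=None, mets_basename="mets.xml"):
    """Hypothetical helper mirroring the documented defaults (local paths only)."""
    if directory is None:
        # Default: the directory containing the METS file, or "." if there is none
        directory = (os.path.dirname(mets_url) if mets_url else "") or "."
    if mets_url is None:
        # Default: WORKSPACE_DIR/mets.xml
        mets_url = os.path.join(directory, mets_basename)
    return directory, mets_url


print(resolve_workspace())
print(resolve_workspace(directory="ws"))
print(resolve_workspace(mets_url="data/ws/mets.xml"))
```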

add

Add a file or http(s) URL FNAME to METS in a workspace. If FNAME is not an http(s) URL and is not an existing workspace-local file, try to copy it into the workspace.

ocrd workspace add [OPTIONS] FNAME

Options

-G, --file-grp <FILE_GRP>

Required fileGrp USE

-i, --file-id <FILE_ID>

Required ID for the file

-m, --mimetype <TYPE>

Media type of the file. Guessed from extension if not provided

-g, --page-id <PAGE_ID>

ID of the physical page

-C, --check-file-exists

Whether to ensure FNAME exists

--ignore

Do not check whether file exists.

--force

If file with ID already exists, replace it. No effect if --ignore is set.

Arguments

FNAME

Required argument
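
The -m option is optional because the media type can be guessed from the file extension. The sketch below illustrates the idea with the standard library's `mimetypes` module. This is a stand-in: OCR-D may use its own extension table, and `guess_media_type` is a hypothetical name.

```python
import mimetypes


def guess_media_type(fname, explicit=None):
    """Return the explicit media type if given, else guess from the extension."""
    if explicit:
        return explicit
    guessed, _encoding = mimetypes.guess_type(fname)
    return guessed  # None if the extension is unknown


print(guess_media_type("OCR-D-IMG/page_0001.png"))
# An explicit value always wins over the guess, e.g. for PAGE-XML files
print(guess_media_type("OCR-D-SEG/page_0001.xml",
                       explicit="application/vnd.prima.page+xml"))
```

Passing the media type explicitly is the safer choice for files such as PAGE-XML, whose .xml extension alone does not identify the format.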

backup

Backing up and restoring workspaces - dev edition

ocrd workspace backup [OPTIONS] COMMAND [ARGS]...

add

Create a new backup

ocrd workspace backup add [OPTIONS]

list

List backups

ocrd workspace backup list [OPTIONS]

restore

Restore backup BAK

ocrd workspace backup restore [OPTIONS] BAK

Options

-f, --choose-first

Restore first matching version if more than one

Arguments

BAK

Required argument

undo

Restore the last backup

ocrd workspace backup undo [OPTIONS]

bulk-add

Add files in bulk to an OCR-D workspace.

FILE_GLOB can either be a shell glob expression to match file names, or a list of expressions or '-', in which case expressions are read from STDIN.

After globbing, --regex is matched against each expression resulting from FILE_GLOB, and can define named groups reusable in the --page-id, --file-id, --mimetype, --url, --source-path and --file-grp options, e.g. by referencing the group name 'grp' from the regex as '{{ grp }}'.

If the FILE_GLOB expressions do not denote the file names themselves (but arbitrary strings for --regex matching), then use --source-path to set the actual file paths to use. (This could involve fixed strings or group references.)

Examples:

    ocrd workspace bulk-add \
        --regex '(?P<fileGrp>[^/]+)/page_(?P<pageid>.*)\.[^.]+' \
        --page-id 'PHYS_{{ pageid }}' \
        --file-grp "{{ fileGrp }}" \
        path/to/files/*/*.*

    echo "path/to/src/file.xml SEG/page_p0001.xml" \
    | ocrd workspace bulk-add \
        --regex '(?P<src>.*?) (?P<fileGrp>.+?)/page_(?P<pageid>.*)\.(?P<ext>[^.]*)' \
        --file-id 'FILE_{{ fileGrp }}_{{ pageid }}' \
        --page-id 'PHYS_{{ pageid }}' \
        --file-grp "{{ fileGrp }}" \
        --local-filename '{{ fileGrp }}/FILE_{{ pageid }}.{{ ext }}' \
        -

    { echo PHYS_0001 BIN FILE_0001_BIN.IMG-wolf BIN/FILE_0001_BIN.IMG-wolf.png; \
      echo PHYS_0001 BIN FILE_0001_BIN BIN/FILE_0001_BIN.xml; \
      echo PHYS_0002 BIN FILE_0002_BIN.IMG-wolf BIN/FILE_0002_BIN.IMG-wolf.png; \
      echo PHYS_0002 BIN FILE_0002_BIN BIN/FILE_0002_BIN.xml; \
    } | ocrd workspace bulk-add -r '(?P<pageid>.*) (?P<filegrp>.*) (?P<fileid>.*) (?P<local_filename>.*)' \
      -G '{{ filegrp }}' -g '{{ pageid }}' -i '{{ fileid }}' -S '{{ local_filename }}' -

ocrd workspace bulk-add [OPTIONS] FILE_GLOB...

Options

-r, --regex <regex>

Required Regular expression matching the FILE_GLOB filesystem paths to define named captures usable in the other parameters

-m, --mimetype <mimetype>

Media type of the file. If not provided, guess from filename

-g, --page-id <page_id>

Physical page ID of the file

-i, --file-id <file_id>

ID of the file. If not provided, derive from fileGrp and filename

-u, --url <url>

Remote URL of the file

-l, --local-filename <local_filename>

Local filesystem path in the workspace directory (copied from source file if different)

-G, --file-grp <file_grp>

Required File group USE of the file

-n, --dry-run

Don’t actually do anything to the METS or filesystem, just preview

-S, --source-path <src_path_option>

File path to copy from (if different from FILE_GLOB values)

-I, --ignore

Disable checking for existing file entries (faster)

-f, --force

Replace existing file entries with the same ID (no effect when --ignore is also set)

-s, --skip

Skip files not matching --regex (instead of failing)

Arguments

FILE_GLOB

Required argument(s)
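
How the named groups of --regex feed the '{{ name }}' placeholders of the other options can be sketched in a few lines of Python. The regex is modelled on the first example (with the literal dot escaped). The `expand` helper, the sample paths, and the use of `re.search` are assumptions made for illustration. This is not the code path that bulk-add itself runs.

```python
import re

# Captures the fileGrp directory and the page number from a relative path
LINE_REGEX = re.compile(r"(?P<fileGrp>[^/]+)/page_(?P<pageid>.*)\.[^.]+")


def expand(template, groups):
    """Replace each '{{ name }}' placeholder with the captured group of that name."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: groups[m.group(1)], template)


# Hypothetical paths, as they might result from globbing
for path in ["BIN/page_0001.png", "SEG/page_0002.xml"]:
    groups = LINE_REGEX.search(path).groupdict()
    page_id = expand("PHYS_{{ pageid }}", groups)
    file_grp = expand("{{ fileGrp }}", groups)
    print(path, "->", page_id, file_grp)
```

Running a quick check like this on a few representative paths, or using --dry-run, helps to catch a regex that captures too much before anything is written to the METS.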

clean

Removes files and directories from the workspace that are not referenced by any mets:file.

PATH_GLOB can be a shell glob expression to match file names, directory names (recursively), or plain paths. All paths are resolved w.r.t. the workspace.

If no PATH_GLOB is specified, then all files and directories may match.

ocrd workspace clean [OPTIONS] [PATH_GLOB]...

Options

-n, --dry-run

Don’t actually do anything to the filesystem, just preview

-d, --directories

Remove untracked directories in addition to untracked files

Arguments

PATH_GLOB

Optional argument(s)
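
Conceptually, clean computes the difference between what is on disk and what the METS references. The sketch below shows that set difference on a temporary directory. The `untracked_files` helper and the file names are hypothetical, and keeping mets.xml by treating it as referenced is an assumption of this sketch.

```python
import tempfile
from pathlib import Path


def untracked_files(workspace_dir, referenced):
    """List files under workspace_dir whose relative path is not in `referenced`."""
    root = Path(workspace_dir)
    found = (p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())
    return sorted(path for path in found if path not in referenced)


with tempfile.TemporaryDirectory() as tmp:
    for name in ["mets.xml", "IMG/page_0001.png", "IMG/stray.tmp"]:
        target = Path(tmp, name)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    # Stand-in for the local paths referenced by mets:file entries
    referenced = {"mets.xml", "IMG/page_0001.png"}
    print(untracked_files(tmp, referenced))
```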

clone

Create a workspace from METS_URL and return the directory

METS_URL can be a URL, an absolute path or a path relative to $PWD. If METS_URL is not provided, use --mets accordingly. METS_URL can also be an OAI-PMH GetRecord URL wrapping a METS file.

ocrd workspace clone [OPTIONS] METS_URL [WORKSPACE_DIR]

Options

-f, --clobber-mets

Overwrite existing METS file

-a, --download

Download all files and change location in METS file after cloning

-Q, --exclude-file-grps <exclude_fileGrp>

fileGrps to exclude

-q, --include-file-grps <include_fileGrp>

fileGrps to include

-i, --file-id <FILTER>

ID

-g, --page-id <FILTER>

Page ID

-m, --mimetype <FILTER>

Media type to look for

-G, --file-grp <FILTER>

fileGrp USE

Arguments

METS_URL

Required argument

WORKSPACE_DIR

Optional argument

find

Find files.

(If any FILTER starts with //, then its remainder will be interpreted as a regular expression.)

ocrd workspace find [OPTIONS]

Options

-Q, --exclude-file-grps <exclude_fileGrp>

fileGrps to exclude

-q, --include-file-grps <include_fileGrp>

fileGrps to include

-i, --file-id <FILTER>

ID

-g, --page-id <FILTER>

Page ID

-m, --mimetype <FILTER>

Media type to look for

-G, --file-grp <FILTER>

fileGrp USE

-k, --output-field <output_field>

Output field. Repeat for multiple fields, will be joined with tab

Default:

'local_filename'

Options:

url | mimetype | page_id | pageId | file_id | ID | file_grp | fileGrp | basename | basename_without_extension | local_filename

--download

Download found files to workspace and change location in METS file

--undo-download

Remove all downloaded files from the METS and workspace

--keep-files

Do not remove downloaded files from the workspace with --undo-download

--wait <wait>

Wait this many seconds between download requests
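
The FILTER convention (literal value, or regular expression after a leading //) and the tab-joined output fields can be illustrated with a small stand-alone sketch. The records, the `matches` and `find` helpers, and the choice of matching the regex against the whole value are assumptions for illustration. The sketch does not read a METS file.

```python
import re

REGEX_PREFIX = "//"


def matches(value, flt):
    """True if `value` satisfies FILTER `flt` (literal, or regex after '//')."""
    if flt is None:
        return True
    if flt.startswith(REGEX_PREFIX):
        # Assumption: the regex has to match the whole value
        return re.fullmatch(flt[len(REGEX_PREFIX):], value) is not None
    return value == flt


# Hypothetical records standing in for mets:file entries
files = [
    {"file_id": "IMG_0001", "file_grp": "OCR-D-IMG", "local_filename": "OCR-D-IMG/0001.png"},
    {"file_id": "SEG_0001", "file_grp": "OCR-D-SEG", "local_filename": "OCR-D-SEG/0001.xml"},
]


def find(entries, file_grp=None, file_id=None, output_fields=("local_filename",)):
    """Yield one line per matching entry, with the output fields joined by a tab."""
    for entry in entries:
        if matches(entry["file_grp"], file_grp) and matches(entry["file_id"], file_id):
            yield "\t".join(entry[field] for field in output_fields)


for line in find(files, file_grp="//OCR-D-.*", output_fields=("file_id", "local_filename")):
    print(line)
```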

get-id

Get the METS ID, if any

ocrd workspace get-id [OPTIONS]

init

Create a workspace with an empty METS file in DIRECTORY or CWD.

ocrd workspace init [OPTIONS] [DIRECTORY]

Options

-f, --clobber-mets

Clobber mets.xml if it exists

Arguments

DIRECTORY

Optional argument

list-group

List fileGrp USE attributes

ocrd workspace list-group [OPTIONS]

list-page

List physical page IDs

(If any FILTER starts with //, then its remainder will be interpreted as a regular expression.)

ocrd workspace list-page [OPTIONS]

Options

-k, --output-field <output_field>

Output field. Repeat for multiple fields, will be joined with tab

Default:

'ID'

Options:

ID | ORDER | ORDERLABEL | LABEL | CONTENTIDS

-f, --output-format <output_format>

Output format

Options:

one-per-line | comma-separated | json

-D, --chunk-number <chunk_number>

Partition the return value into n roughly equally sized chunks

-C, --chunk-index <chunk_index>

Output the nth chunk of results, -1 for all of them.

-r, --page-id-range <page_id_range>

Restrict the pages to those matching the provided range, based on the @ID attribute. Separate start/end with ..

-R, --numeric-range <numeric_range>

Restrict the pages to those in the range, in numerical document order. Separate start/end with ..
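
The -D and -C options split the page list into roughly equal chunks, e.g. to distribute pages over parallel jobs. The sketch below shows one way to produce such a partition. It is not necessarily the scheme used by list-page, the helper names are hypothetical, and treating the chunk index as zero-based is an assumption.

```python
def partition(items, chunk_number):
    """Split `items` into contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(items), chunk_number)
    chunks, start = [], 0
    for index in range(chunk_number):
        # The first `extra` chunks take one additional item each
        end = start + size + (1 if index < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def select(items, chunk_number, chunk_index=-1):
    """Return one chunk by index, or all chunks if chunk_index is -1."""
    chunks = partition(items, chunk_number)
    return chunks if chunk_index == -1 else chunks[chunk_index]


pages = [f"PHYS_{n:04d}" for n in range(1, 8)]  # seven hypothetical page IDs
print(select(pages, 3))     # all three chunks
print(select(pages, 3, 0))  # only the first chunk
```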

merge

Merges this workspace with the workspace that contains METS_PATH

Pass a JSON string or file to --fileGrp-mapping, --fileId-mapping or --pageId-mapping in order to rename all fileGrp, file ID or page ID values, respectively.

The --file-id, --page-id, --mimetype and --file-grp options have the same semantics as in ocrd workspace find, see ocrd workspace find --help for an explanation.

ocrd workspace merge [OPTIONS] METS_PATH

Options

--overwrite, --no-overwrite

Overwrite on-disk file in case of file name conflicts with data from METS_PATH

--force, --no-force

Overwrite mets:file from --mets with mets:file from METS_PATH if IDs clash

--copy-files, --no-copy-files

Copy files as well

Default:

True

--fileGrp-mapping <filegrp_mapping>

JSON object mapping src to dest fileGrp

--fileId-mapping <fileid_mapping>

JSON object mapping src to dest file ID

--pageId-mapping <pageid_mapping>

JSON object mapping src to dest page ID

-Q, --exclude-file-grps <exclude_fileGrp>

fileGrps to exclude

-q, --include-file-grps <include_fileGrp>

fileGrps to include

-i, --file-id <FILTER>

ID

-g, --page-id <FILTER>

Page ID

-m, --mimetype <FILTER>

Media type to look for

-G, --file-grp <FILTER>

fileGrp USE

Arguments

METS_PATH

Required argument
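
The three mapping options each expect a JSON object from source to destination names. The sketch below builds such an object for --fileGrp-mapping and shows the renaming it describes. The group names are hypothetical, and leaving unmapped names unchanged is an assumption of this sketch.

```python
import json

# Hypothetical mapping from source to destination fileGrp names
filegrp_mapping = {"OCR-D-IMG": "OCR-D-IMG-OTHER", "OCR-D-SEG": "OCR-D-SEG-OTHER"}

# Serialized form, suitable as the option value
option_value = json.dumps(filegrp_mapping)
print(option_value)


def rename(value, mapping):
    """Return the mapped name, or the value unchanged if it has no mapping."""
    return mapping.get(value, value)


source_groups = ["OCR-D-IMG", "OCR-D-SEG", "OCR-D-OCR"]
print([rename(group, json.loads(option_value)) for group in source_groups])
```

Generating the JSON with `json.dumps` rather than writing it by hand avoids quoting mistakes when the value is passed through a shell.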

prune-files

Removes mets:files that point to non-existing local files

(If any FILTER starts with //, then its remainder will be interpreted as a regular expression.)

ocrd workspace prune-files [OPTIONS]

Options

-G, --file-grp <FILTER>

fileGrp USE

-m, --mimetype <FILTER>

Media type to look for

-g, --page-id <FILTER>

Page ID

-i, --file-id <FILTER>

ID

remove

Delete files (given by their ID attribute ID).

(If any ID starts with //, then its remainder will be interpreted as a regular expression.)

ocrd workspace remove [OPTIONS] [ID]...

Options

-k, --keep-file

Do not delete file from file system

-f, --force

Continue even if mets:file or file on file system does not exist

Arguments

ID

Optional argument(s)

remove-group

Delete fileGrps (given by their USE attribute GROUP).

(If any GROUP starts with //, then its remainder will be interpreted as a regular expression.)

ocrd workspace remove-group [OPTIONS] [GROUP]...

Options

-r, --recursive

Delete any files in the group before the group itself

-f, --force

Continue removing even if group or containing files not found in METS

-k, --keep-files

Do not delete files from file system

Arguments

GROUP

Optional argument(s)

rename-group

Rename fileGrp (USE attribute OLD to NEW).

ocrd workspace rename-group [OPTIONS] OLD NEW

Arguments

OLD

Required argument

NEW

Required argument

server

Control a METS server for this workspace

ocrd workspace server [OPTIONS] COMMAND [ARGS]...

start

Start a METS server

(For TCP backend, pass a network interface to bind to as the '-U/--mets-server-url' parameter.)

ocrd workspace server start [OPTIONS]

stop

Stop the METS server

ocrd workspace server stop [OPTIONS]

set-id

Set METS ID.

If one of the supported identifier mechanisms is used, will set this identifier.

Otherwise will create a new <mods:identifier type="purl">{{ ID }}</mods:identifier>.

ocrd workspace set-id [OPTIONS] ID

Arguments

ID

Required argument

update-page

Update the @ID, @ORDER, @ORDERLABEL, @LABEL or @CONTENTIDS attributes of the mets:div with @ID=PAGE_ID

ocrd workspace update-page [OPTIONS] PAGE_ID

Options

--set <ATTR VALUE>

Set mets:div ATTR to VALUE. Possible keys: ['ID', 'ORDER', 'ORDERLABEL', 'LABEL', 'CONTENTIDS']

--order <ORDER>

DEPRECATED - use --set ATTR VALUE

--orderlabel <ORDERLABEL>

DEPRECATED - use --set ATTR VALUE

--contentids <ORDERLABEL>

DEPRECATED - use --set ATTR VALUE

Arguments

PAGE_ID

Required argument

validate

Validate a workspace

METS_URL can be a URL, an absolute path or a path relative to $PWD. If not given, use --mets accordingly.

Check that the METS and its referenced file contents abide by the OCR-D specifications.

ocrd workspace validate [OPTIONS] [METS_URL]

Options

-a, --download

Download all files

-s, --skip <skip>

Tests to skip

Options:

imagefilename | dimension | pixel_density | page | url | page_xsd | mets_fileid_page_pcgtsid | mets_unique_identifier | mets_file_group_names | mets_files | mets_xsd

--page-textequiv-consistency, --page-strictness <page_textequiv_consistency>

How strictly to check PAGE multi-level textequiv consistency

Options:

strict | lax | fix | off

--page-coordinate-consistency <page_coordinate_consistency>

How strictly to check PAGE multi-level coordinate consistency

Options:

poly | baseline | both | off

Arguments

METS_URL

Optional argument

class ocrd.cli.workspace.WorkspaceCtx(directory, mets_url, mets_basename='mets.xml', mets_server_url=None, automatic_backup=False)[source]

Bases: object