ocrd.cli.workspace module¶
OCR-D CLI: workspace management
ocrd workspace¶
Managing workspaces
A workspace comprises a METS file and a directory as point of reference.
Operates on the file system directly or via a METS server (already running via some prior server start subcommand).
ocrd workspace [OPTIONS] COMMAND [ARGS]...
Options
- -d, --directory <WORKSPACE_DIR>¶
Changes the workspace folder location [default: METS_URL directory or .]
- -M, --mets-basename <mets_basename>¶
METS file basename. Deprecated, use --mets/--directory
- -m, --mets <METS_URL>¶
The path/URL of the METS file [default: WORKSPACE_DIR/mets.xml]
- -U, --mets-server-url <mets_server_url>¶
TCP host URI or UDS path of METS server
- --backup¶
Backup mets.xml whenever it is saved.
Environment variables
- WORKSPACE_DIR
Provide a default for -d
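Short options can mean different things at different levels: -m is --mets on ocrd workspace but --mimetype on ocrd workspace add. Their position relative to the subcommand name therefore matters. The following is a minimal sketch of assembling such a command line from Python, following the usage line above. The helper workspace_cmd and the values ws, OCR-D-IMG, FILE_0001 and page1.tif are hypothetical and serve only as illustration.

```python
import shlex


def workspace_cmd(subcommand, *args, directory=None, mets=None):
    """Build an argv list; group-level options precede the subcommand name."""
    argv = ["ocrd", "workspace"]
    if directory is not None:
        argv += ["-d", directory]   # -d/--directory of `ocrd workspace`
    if mets is not None:
        argv += ["-m", mets]        # here -m means --mets
    return argv + [subcommand, *args]


argv = workspace_cmd(
    "add",
    "-G", "OCR-D-IMG",
    "-i", "FILE_0001",
    "-m", "image/tiff",             # after `add`, -m means --mimetype
    "page1.tif",
    directory="ws",
)
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run as is; shlex.join is only used here to show the equivalent shell command.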
add¶
Add a file or http(s) URL FNAME to METS in a workspace. If FNAME is not an http(s) URL and is not a workspace-local existing file, try to copy to workspace.
ocrd workspace add [OPTIONS] FNAME
Options
- -G, --file-grp <FILE_GRP>¶
Required fileGrp USE
- -i, --file-id <FILE_ID>¶
Required ID for the file
- -m, --mimetype <TYPE>¶
Media type of the file. Guessed from extension if not provided
- -g, --page-id <PAGE_ID>¶
ID of the physical page
- -C, --check-file-exists¶
Whether to ensure FNAME exists
- --ignore¶
Do not check whether file exists.
- --force¶
If file with ID already exists, replace it. No effect if --ignore is set.
Arguments
- FNAME¶
Required argument
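The three cases named in the description of add (http(s) URL, existing workspace-local file, anything else) can be pictured with the sketch below. It is not OCR-D's implementation: the function classify_fname is hypothetical, and resolving relative paths against the workspace directory is an assumption made for the example.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def classify_fname(fname, workspace_dir):
    """Return 'url', 'local' or 'copy' for FNAME (illustrative only)."""
    if urlparse(fname).scheme in ("http", "https"):
        return "url"    # referenced remotely, nothing to copy
    ws = Path(workspace_dir).resolve()
    path = Path(fname)
    # Assumption: relative paths are interpreted relative to the workspace.
    candidate = (path if path.is_absolute() else ws / path).resolve()
    if candidate.exists() and ws in candidate.parents:
        return "local"  # already an existing file inside the workspace
    return "copy"       # would have to be copied into the workspace


with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp, "ws")
    ws.mkdir()
    (ws / "a.tif").write_bytes(b"")
    outside = Path(tmp, "b.tif")
    outside.write_bytes(b"")
    results = [
        classify_fname("https://example.org/a.tif", ws),
        classify_fname("a.tif", ws),
        classify_fname(str(outside), ws),
    ]
print(results)
```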
backup¶
Backing up and restoring workspaces - dev edition
ocrd workspace backup [OPTIONS] COMMAND [ARGS]...
add¶
Create a new backup
ocrd workspace backup add [OPTIONS]
list¶
List backups
ocrd workspace backup list [OPTIONS]
restore¶
Restore backup BAK
ocrd workspace backup restore [OPTIONS] BAK
Options
- -f, --choose-first¶
Restore first matching version if more than one
Arguments
- BAK¶
Required argument
undo¶
Restore the last backup
ocrd workspace backup undo [OPTIONS]
bulk-add¶
Add files in bulk to an OCR-D workspace.
FILE_GLOB can either be a shell glob expression to match file names, or a list of expressions or '-', in which case expressions are read from STDIN.
After globbing, --regex is matched against each expression resulting from FILE_GLOB, and can define named groups reusable in the --page-id, --file-id, --mimetype, --url, --source-path and --file-grp options, e.g. by referencing the group name 'grp' from the regex as '{{ grp }}'.
If the FILE_GLOB expressions do not denote the file names themselves (but arbitrary strings for --regex matching), then use --source-path to set the actual file paths to use. (This could involve fixed strings or group references.)
{ echo PHYS_0001 BIN FILE_0001_BIN.IMG-wolf BIN/FILE_0001_BIN.IMG-wolf.png;
  echo PHYS_0001 BIN FILE_0001_BIN BIN/FILE_0001_BIN.xml;
  echo PHYS_0002 BIN FILE_0002_BIN.IMG-wolf BIN/FILE_0002_BIN.IMG-wolf.png;
  echo PHYS_0002 BIN FILE_0002_BIN BIN/FILE_0002_BIN.xml;
} | ocrd workspace bulk-add -r '(?P<pageid>.*) (?P<filegrp>.*) (?P<fileid>.*) (?P<local_filename>.*)' \
  -G '{{ filegrp }}' -g '{{ pageid }}' -i '{{ fileid }}' -S '{{ local_filename }}' -
ocrd workspace bulk-add [OPTIONS] FILE_GLOB...
Options
- -r, --regex <regex>¶
Required Regular expression matching the FILE_GLOB filesystem paths to define named captures usable in the other parameters
- -m, --mimetype <mimetype>¶
Media type of the file. If not provided, guess from filename
- -g, --page-id <page_id>¶
physical page ID of the file
- -i, --file-id <file_id>¶
ID of the file. If not provided, derive from fileGrp and filename
- -u, --url <url>¶
Remote URL of the file
- -l, --local-filename <local_filename>¶
Local filesystem path in the workspace directory (copied from source file if different)
- -G, --file-grp <file_grp>¶
Required File group USE of the file
- -n, --dry-run¶
Don’t actually do anything to the METS or filesystem, just preview
- -S, --source-path <src_path_option>¶
File path to copy from (if different from FILE_GLOB values)
- -I, --ignore¶
Disable checking for existing file entries (faster)
- -f, --force¶
Replace existing file entries with the same ID (no effect when --ignore is set, too)
- -s, --skip¶
Skip files not matching --regex (instead of failing)
Arguments
- FILE_GLOB¶
Required argument(s)
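The interplay of --regex and the '{{ grp }}' references described for bulk-add can be sketched in a few lines of Python. This only illustrates the idea of named groups feeding templates; the helper expand is hypothetical, and using a whole-string match is an assumption rather than a statement about how the command matches.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def expand(template, groups):
    """Replace every '{{ name }}' with the value of the named group."""
    return PLACEHOLDER.sub(lambda m: groups[m.group(1)], template)


# Regex and input line taken from the STDIN example above.
REGEX = r"(?P<pageid>.*) (?P<filegrp>.*) (?P<fileid>.*) (?P<local_filename>.*)"
line = "PHYS_0001 BIN FILE_0001_BIN BIN/FILE_0001_BIN.xml"

match = re.fullmatch(REGEX, line)   # assumption: the whole line must match
groups = match.groupdict()

entry = {
    "file_grp": expand("{{ filegrp }}", groups),            # -G
    "page_id": expand("{{ pageid }}", groups),              # -g
    "file_id": expand("{{ fileid }}", groups),              # -i
    "source_path": expand("{{ local_filename }}", groups),  # -S
}
print(entry)
```

Templates may also mix fixed text and several groups, for instance expand("FILE_{{ filegrp }}_{{ pageid }}", groups).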
clean¶
Removes files and directories from the workspace that are not referenced by any mets:file.
PATH_GLOB can be a shell glob expression to match file names, directory names (recursively), or plain paths. All paths are resolved w.r.t. the workspace.
If no PATH_GLOB are specified, then all files and directories may match.
ocrd workspace clean [OPTIONS] [PATH_GLOB]...
Options
- -n, --dry-run¶
Don’t actually do anything to the filesystem, just preview
- -d, --directories¶
Remove untracked directories in addition to untracked files
Arguments
- PATH_GLOB¶
Optional argument(s)
clone¶
Create a workspace from METS_URL and return the directory
METS_URL can be a URL, an absolute path or a path relative to $PWD. If METS_URL is not provided, use --mets accordingly. METS_URL can also be an OAI-PMH GetRecord URL wrapping a METS file.
ocrd workspace clone [OPTIONS] METS_URL [WORKSPACE_DIR]
Options
- -f, --clobber-mets¶
Overwrite existing METS file
- -a, --download¶
Download all files and change location in METS file after cloning
- -Q, --exclude-file-grps <exclude_fileGrp>¶
fileGrps to exclude
- -q, --include-file-grps <include_fileGrp>¶
fileGrps to include
- -i, --file-id <FILTER>¶
ID
- -g, --page-id <FILTER>¶
Page ID
- -m, --mimetype <FILTER>¶
Media type to look for
- -G, --file-grp <FILTER>¶
fileGrp USE
Arguments
- METS_URL¶
Required argument
- WORKSPACE_DIR¶
Optional argument
find¶
Find files.
(If any FILTER starts with //, then its remainder will be interpreted as a regular expression.)
ocrd workspace find [OPTIONS]
Options
- -Q, --exclude-file-grps <exclude_fileGrp>¶
fileGrps to exclude
- -q, --include-file-grps <include_fileGrp>¶
fileGrps to include
- -i, --file-id <FILTER>¶
ID
- -g, --page-id <FILTER>¶
Page ID
- -m, --mimetype <FILTER>¶
Media type to look for
- -G, --file-grp <FILTER>¶
fileGrp USE
- -k, --output-field <output_field>¶
Output field. Repeat for multiple fields, will be joined with tab
- Default:
'local_filename'
- Options:
url | mimetype | page_id | pageId | file_id | ID | file_grp | fileGrp | basename | basename_without_extension | local_filename
- --download¶
Download found files to workspace and change location in METS file
- --undo-download¶
Remove all downloaded files from the METS and workspace
- --keep-files¶
Do not remove downloaded files from the workspace with --undo-download
- --wait <wait>¶
Wait this many seconds between download requests
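The FILTER convention noted for find (a leading // turns the remainder into a regular expression) can be pictured as follows. This is a sketch under assumptions, not the command's code: the function matches_filter is hypothetical, plain filters are compared literally here, and the regular expression is required to match the whole value.

```python
import re


def matches_filter(value, flt):
    """Apply one FILTER to one attribute value (illustrative only)."""
    if flt.startswith("//"):
        # Assumption: the pattern must match the whole value.
        return re.fullmatch(flt[2:], value) is not None
    return value == flt


# Hypothetical fileGrp names.
file_grps = ["OCR-D-IMG", "OCR-D-IMG-BIN", "OCR-D-OCR-TESS"]

literal = [g for g in file_grps if matches_filter(g, "OCR-D-IMG")]
regex = [g for g in file_grps if matches_filter(g, "//OCR-D-IMG.*")]
print(literal)
print(regex)
```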
get-id¶
Get METS id if any
ocrd workspace get-id [OPTIONS]
init¶
Create a workspace with an empty METS file in DIRECTORY or CWD.
ocrd workspace init [OPTIONS] [DIRECTORY]
Options
- -f, --clobber-mets¶
Clobber mets.xml if it exists
Arguments
- DIRECTORY¶
Optional argument
list-group¶
List fileGrp USE attributes
ocrd workspace list-group [OPTIONS]
list-page¶
List physical page IDs
(If any FILTER starts with //, then its remainder will be interpreted as a regular expression.)
ocrd workspace list-page [OPTIONS]
Options
- -k, --output-field <output_field>¶
Output field. Repeat for multiple fields, will be joined with tab
- Default:
'ID'
- Options:
ID | ORDER | ORDERLABEL | LABEL | CONTENTIDS
- -f, --output-format <output_format>¶
Output format
- Options:
one-per-line | comma-separated | json
- -D, --chunk-number <chunk_number>¶
Partition the return value into n roughly equally sized chunks
- -C, --chunk-index <chunk_index>¶
Output the nth chunk of results, -1 for all of them.
- -r, --page-id-range <page_id_range>¶
Restrict the pages to those matching the provided range, based on the @ID attribute. Separate start/end with ..
- -R, --numeric-range <numeric_range>¶
Restrict the pages to those in the range, in numerical document order. Separate start/end with ..
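The -D/--chunk-number and -C/--chunk-index options split the page list into n chunks and select one of them. One way to picture "roughly equally sized" is the sketch below. The functions partition and select are hypothetical, and contiguous chunks, sizes differing by at most one and a 0-based chunk index are all assumptions of the example.

```python
def partition(pages, n):
    """Split pages into n contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(pages), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(pages[start:end])
        start = end
    return chunks


def select(pages, chunk_number, chunk_index=-1):
    """Return one chunk, or every page when chunk_index is -1."""
    if chunk_index == -1:
        return list(pages)
    return partition(pages, chunk_number)[chunk_index]  # assumption: 0-based


pages = [f"PHYS_{i:04d}" for i in range(1, 8)]  # hypothetical page IDs
chunks = partition(pages, 3)
print([len(c) for c in chunks])
print(select(pages, 3, 1))
```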
merge¶
Merges this workspace with the workspace that contains METS_PATH
Pass a JSON string or file to --fileGrp-mapping, --fileId-mapping or --pageId-mapping in order to rename all fileGrp, file ID or page ID values, respectively.
The --file-id, --page-id, --mimetype and --file-grp options have the same semantics as in ocrd workspace find, see ocrd workspace find --help for an explanation.
ocrd workspace merge [OPTIONS] METS_PATH
Options
- --overwrite, --no-overwrite¶
Overwrite on-disk file in case of file name conflicts with data from METS_PATH
- --force, --no-force¶
Overwrite mets:file from --mets with mets:file from METS_PATH if IDs clash
- --copy-files, --no-copy-files¶
Copy files as well
- Default:
True
- --fileGrp-mapping <filegrp_mapping>¶
JSON object mapping src to dest fileGrp
- --fileId-mapping <fileid_mapping>¶
JSON object mapping src to dest file ID
- --pageId-mapping <pageid_mapping>¶
JSON object mapping src to dest page ID
- -Q, --exclude-file-grps <exclude_fileGrp>¶
fileGrps to exclude
- -q, --include-file-grps <include_fileGrp>¶
fileGrps to include
- -i, --file-id <FILTER>¶
ID
- -g, --page-id <FILTER>¶
Page ID
- -m, --mimetype <FILTER>¶
Media type to look for
- -G, --file-grp <FILTER>¶
fileGrp USE
Arguments
- METS_PATH¶
Required argument
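The mapping options of merge take a JSON object from source names to destination names. A minimal sketch of what such a mapping expresses, using only the JSON-string form, is shown below. The helper apply_mapping and the group names are hypothetical, and leaving unmapped values unchanged is an assumption.

```python
import json


def apply_mapping(values, mapping_json):
    """Rename values according to a JSON object of src -> dest names."""
    mapping = json.loads(mapping_json)
    # Assumption: values without an entry keep their original name.
    return [mapping.get(v, v) for v in values]


src_groups = ["OCR-D-IMG", "OCR-D-OCR"]  # hypothetical fileGrp names
renamed = apply_mapping(src_groups, '{"OCR-D-IMG": "ORIGINAL-IMG"}')
print(renamed)
```

The same JSON string is what would be passed on the command line, for example as the value of --fileGrp-mapping.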
prune-files¶
Removes mets:files that point to non-existing local files
(If any FILTER starts with //, then its remainder will be interpreted as a regular expression.)
ocrd workspace prune-files [OPTIONS]
Options
- -G, --file-grp <FILTER>¶
fileGrp USE
- -m, --mimetype <FILTER>¶
Media type to look for
- -g, --page-id <FILTER>¶
Page ID
- -i, --file-id <FILTER>¶
ID
remove¶
Delete files (given by their ID attribute ID).
(If any ID starts with //, then its remainder will be interpreted as a regular expression.)
ocrd workspace remove [OPTIONS] [ID]...
Options
- -k, --keep-file¶
Do not delete file from file system
- -f, --force¶
Continue even if mets:file or file on file system does not exist
Arguments
- ID¶
Optional argument(s)
remove-group¶
Delete fileGrps (given by their USE attribute GROUP).
(If any GROUP starts with //, then its remainder will be interpreted as a regular expression.)
ocrd workspace remove-group [OPTIONS] [GROUP]...
Options
- -r, --recursive¶
Delete any files in the group before the group itself
- -f, --force¶
Continue removing even if group or containing files not found in METS
- -k, --keep-files¶
Do not delete files from file system
Arguments
- GROUP¶
Optional argument(s)
rename-group¶
Rename fileGrp (USE attribute OLD to NEW).
ocrd workspace rename-group [OPTIONS] OLD NEW
Arguments
- OLD¶
Required argument
- NEW¶
Required argument
server¶
Control a METS server for this workspace
ocrd workspace server [OPTIONS] COMMAND [ARGS]...
start¶
Start a METS server
(For TCP backend, pass a network interface to bind to as the '-U/--mets-server-url' parameter.)
ocrd workspace server start [OPTIONS]
stop¶
Stop the METS server
ocrd workspace server stop [OPTIONS]
set-id¶
Set METS ID.
If one of the supported identifier mechanisms is used, will set this identifier.
Otherwise will create a new <mods:identifier type="purl">{{ ID }}</mods:identifier>.
ocrd workspace set-id [OPTIONS] ID
Arguments
- ID¶
Required argument
update-page¶
Update the @ID, @ORDER, @ORDERLABEL, @LABEL or @CONTENTIDS attributes of the mets:div with @ID=PAGE_ID
ocrd workspace update-page [OPTIONS] PAGE_ID
Options
- --set <ATTR VALUE>¶
Set mets:div ATTR to VALUE. Possible keys: ['ID', 'ORDER', 'ORDERLABEL', 'LABEL', 'CONTENTIDS']
- --order <ORDER>¶
DEPRECATED - use --set ATTR VALUE
- --orderlabel <ORDERLABEL>¶
DEPRECATED - use --set ATTR VALUE
- --contentids <ORDERLABEL>¶
DEPRECATED - use --set ATTR VALUE
Arguments
- PAGE_ID¶
Required argument
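Since --order, --orderlabel and --contentids are deprecated in favour of --set ATTR VALUE, a call site might build the command as sketched below. The helper update_page_cmd and the example values are hypothetical. The list of allowed keys is copied from the option description above, and the check merely mirrors it on the calling side.

```python
import shlex

# Keys listed in the description of --set.
ALLOWED_ATTRS = ("ID", "ORDER", "ORDERLABEL", "LABEL", "CONTENTIDS")


def update_page_cmd(page_id, attr, value):
    """Build argv for `ocrd workspace update-page --set ATTR VALUE PAGE_ID`."""
    if attr not in ALLOWED_ATTRS:
        raise ValueError(f"unsupported attribute: {attr}")
    return ["ocrd", "workspace", "update-page", "--set", attr, value, page_id]


argv = update_page_cmd("PHYS_0004", "ORDERLABEL", "iv")
print(shlex.join(argv))
```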
validate¶
Validate a workspace
METS_URL can be a URL, an absolute path or a path relative to $PWD. If not given, use --mets accordingly.
Check that the METS and its referenced file contents abide by the OCR-D specifications.
ocrd workspace validate [OPTIONS] [METS_URL]
Options
- -a, --download¶
Download all files
- -s, --skip <skip>¶
Tests to skip
- Options:
imagefilename | dimension | pixel_density | page | url | page_xsd | mets_fileid_page_pcgtsid | mets_unique_identifier | mets_file_group_names | mets_files | mets_xsd
- --page-textequiv-consistency, --page-strictness <page_textequiv_consistency>¶
How strictly to check PAGE multi-level textequiv consistency
- Options:
strict | lax | fix | off
- --page-coordinate-consistency <page_coordinate_consistency>¶
How strictly to check PAGE multi-level coordinate consistency
- Options:
poly | baseline | both | off
Arguments
- METS_URL¶
Optional argument