Data Dictionary¶
Additional TODO¶
- Do a git comparison between version 1.6.2 and version 2.0.0 and document all changes between the two versions.
Introduction¶
This guide, designed for a Web Curator Tool developer and contributor, explains and documents the database for the Web Curator Tool. The source for both code and documentation for the Web Curator Tool can be found at: https://www.webcuratortool.org/
For information on how to install and setup the Web Curator Tool, see the Web Curator Tool System Administrator Guide. For information about developing and contributing to the Web Curator Tool, see the Developer Guide. For information on using the Web Curator Tool, see the Web Curator Tool Quick Start Guide and the Web Curator Tool online help.
Contents of this document¶
Following this introduction, the Web Curator Tool Developer Guide includes the following sections:
- Changes - Covers changes between different versioned releases.
- Data model diagram - provides a diagram of the WCT data model.
- Data descriptions - Data descriptions for the data fields.
- Database descriptions - Descriptions for the tables and their fields.
- Generating primary keys - How to generate primary keys.
Changes¶
Changes since 2.0.0¶
- Placeholder for changes since version 2.0.0. This list should be updated after every feature/bug fix is merged into the master branch.
1.6.2 to 2.0.0¶
- Placeholder for changes between version 1.6.2 and 2.0.0.
Data model diagram¶
The data model diagram shows the relationships between the different tables.
Note that this diagram cannot be updated since we don’t have the original source file. Any significant updates to the tables and/or their relationships should result in the commissioning of a new diagram.
Data descriptions¶
Overview¶
This section describes the tables in the WCT database.
Field types¶
Field types in this document are indicative only, and may depend on the implementation.
The types used are:
- Boolean
- A Boolean value (true or false, or 0 or 1, depending on implementation).
- Text
- A free text field.
- Constrained text
- A text field whose contents are constrained to a limited set of values by the application (see Constrained text fields below).
- Timestamp
- A timestamp encoding a date and time.
- Primary key
- A unique internal identifier (see Generating primary keys).
- Secondary key
- A key from another table.
- Float
- A floating-point number.
- Number
- An integer number.
Constrained text fields¶
Some tables have fields that are constrained to a fixed set of values.
These fields will be implemented in the database as Text fields, but will appear to users as enumerations (usually in a drop-down menu).
In most cases, the set of possible values can be set in a configuration file (to support different requirements at different institutions).
In each case, a single value can be assigned.
Database descriptions¶
Targets, Groups and Schedules¶
ABSTRACT_TARGET¶
The ABSTRACT_TARGET table is used to store information that is common to both Targets and Groups.
The table is needed because the WCT can be instructed to “harvest” an entire Group at once, as though it were a Target. This means that the ABSTRACT_TARGET is used to contain or manage all profile and scheduling information.
Name | Type | Description |
---|---|---|
AT_OID | Primary key | |
AT_DESC | Text | An internal description of the Target or Group. |
AT_NAME | Text | The name of the Target or Group. |
AT_OWNER_ID | Foreign key | The owner of the Target or Group. |
AT_PROF_OVERRIDE_OID | Foreign key | The key of the profile override information for this Target or Group. |
AT_STATE | Integer | The state of the Target or Group. Values will be different for Targets than for Groups. Target values correspond to: Pending, Nominated, Rejected, Approved, Completed, Cancelled, Reinstated. |
AT_PROFILE_ID | Foreign key | Reference to the profile information for this Target. |
AT_OBJECT_TYPE | Integer | Identifies whether this is a Target (1) or a Group (0). |
AT_CREATION_DATE | Timestamp | The date and time the ABSTRACT_TARGET was created. |
AT_REFERENCE | Text | An external reference number (e.g. catalogue number). |
AT_PROFILE_NOTE | Text | Records notable aspects of the site that relate to the choice of harvest profile and overrides. |
AT_DUBLIN_CORE_OID | Foreign key | Reference to the Dublin Core metadata for this Target. |
AT_ACCESS_ZONE | Integer | Access Zone (enumerated field): 0 – Public (default), 1 – On Site, 2 - Restricted. |
AT_DISPLAY_TARGET | Boolean | Display this Target. |
AT_DISPLAY_NOTE | Text | Records an explanation of the Access Zone and Display Target choices. |
AT_DISPLAY_CHG_REASON | Text | Records the reason the AT_DISPLAY_TARGET Boolean was last changed. |
AT_RR_OID | Foreign key | Reference to the rejection reason for this Target. |
TARGET¶
TARGET contains information specific to Target objects.
Each Target is based on an ABSTRACT_TARGET, and takes its primary key from the ABSTRACT_TARGET primary key.
Name | Type | Description | |
---|---|---|---|
|
Reference to ABSTRACT_TARGET corresponding to the Target. | ||
T_RUN_ON_APPROVAL | Boolean | If true, then an additional Target Instance will be scheduled to begin one minute after the Target state is set to Approved. | |
T_EVALUATION_NOTE | Text | Records notable aspects of the site that relate to its evaluation. | |
T_SELECTION_DATE | Timestamp | The date the Target was formally selected. This should be set automatically to the date and time the Target state first changed to Approved. | |
T_SELECTION_NOTE | Text | Records information relating to the selection process, in particular reasons for the selection decision. | |
T_SELECTION_TYPE | Constrained text | Records the type of schedule that has been applied to the site. Example values: one-off, ad hoc, regular. | |
T_HARVEST_TYPE | Constrained | Records type of selective harvest approach has been applied to the site. Example values: subject, event, theme. | |
T_USE_AQA | Boolean | Records whether TIs derived from this Target should be marked for inclusion in the automated quality assurance (AQA) post harvest processes. | |
T_ALLOW_OPTIMIZE | Boolean | Flag to indicate whether harvest optimization is permitted for this target’s harvests. |
SEED¶
SEED contains the set of seed URLs corresponding to a Target.
Name | Type | Description |
---|---|---|
S_OID | Primary key | |
S_SEED | URL | The seed URL. |
S_TARGET_ID | Foreign Key | The key of the Target the key belongs to. |
S_PRIMARY | Boolean | Records whether the URL is marked as a primary URL in the user interface. |
SEED_HISTORY¶
SEED_HISTORY contains the set of seed URLs corresponding to a Target Instance when harvested. Population of this table can be turned off in wct_core.xml. Once written the content is not used again by WCT.
Name | Type | Description |
---|---|---|
SH_OID | Primary key | |
SH_TI_OID | Foreign Key | The key of the Target Instance the key belongs to. |
SH_SEED | URL | The seed URL. |
SH_PRIMARY | Boolean | Records whether the URL is marked as a primary URL in the user interface. |
TARGET_GROUP¶
TARGET_GROUP contains information specific to Group objects.
Each Group is based on an ABSTRACT_TARGET, and takes its primary key from the ABSTRACT_TARGET primary key.
Groups can usually act as logical groupings that indicate that a set of Targets share some property. For example, a set of Targets in the Elections 2005 Group might all be relevant to a particular general election. They can also act as functional groupings that simplify the management of Targets by allowing all the Targets in a Group to have a crawl scheduled for specific time. This means they share much of the functionality of a Target (specifically, the ability to schedule a harvest, with all the profile and scheduling data required).
Group membership is recorded in the GROUP_MEMBER table.
Name | Type | Description |
---|---|---|
TG_AT_OID | Primary key (Foreign key) | Reference to ABSTRACT_TARGET corresponding to the Group. |
TG_SIP_TYPE | Boolean | Controls whether the members are crawled as separate jobs or as a single combined job when the Group is crawled. |
TG_START_DATE | Date | The date on which the Group becomes relevant to its members. |
TG_END_DATE | Date | The date after which the Group ceases to be relevant to its members. |
TG_OWNERSHIP_METADATA | Text | Additional information describing the ownership of a Group, particularly for Groups that have multiple owners. |
TG_TYPE | Constrained text | Records the type of Group. Example values: collection, subject, thematic, event, functional. |
GROUP_MEMBER¶
GROUP_MEMBER records Group membership information.
Name | Type | Description |
---|---|---|
AT_OID | Primary key (Foreign key) | |
GM_CHILD_ID | Foreign key | The key of the child (member) Target or Group. |
GM_PARENT_ID | Foreign key | The key of the parent (containing) Group. |
SCHEDULE¶
A SCHEDULE contains information about the times that a harvest will be run.
Name | Type | Description |
---|---|---|
S_OID | Primary key | |
S_CRON | Text | The cron pattern this schedule is based on. |
S_START | Timestamp | The date the harvests are to commence. |
S_END | Timestamp | The date the harvests are to end. |
S_ABSTRACT_TARGET_ID | Foreign key | ID of the AbstractTarget to which this schedule belongs. |
S_TYPE | The type of the schedule. This is 0 for a custom schedule, or the ID number of a SchedulePattern from the wct-core.xml. | |
S_OWNER_OID | Foreign key | The key of the User who is the owner of this schedule. |
S_NEXT_SCHEDULE_TIME | Timestamp | The date of the next harvest initiated by this schedule. |
S_ABSTRACT_TARGET_ID | The key of the Target or Group this schedule is part of. | |
S_LAST_PROCESSED_DATE | Timestamp | The date that the background batch scheduling processing last processed this record – used to optimise batch processing. |
Target Instances and Harvest Results¶
TARGET_INSTANCE¶
TARGET_INSTANCE contains information specific to the Target Instances. Target Instances represent the harvests that have occurred, are occurring, or will occur for a Target
Name | Type | Description |
---|---|---|
TI_OID | Primary key | |
TI_VERSION | Number | Internal version number for optimistic locking. |
TI_SCHEDULE_ID | Foreign key | The key of the schedule that initiated this harvest. |
TI_TARGET_ID | Foreign key | The key of the ABSTRACT_TARGET that this Target Instance is derived from. |
TI_PRIORITY | Number | 0 = High Priority; 100 = Normal Priority; 1000 = Low priority. |
TI_SCHEDULED_TIME | Timestamp | The date and time the harvest is (or was) scheduled to begin. |
TI_STATE | The current state of the Target Instance. Values correspond to: Scheduled, Running, Paused, Aborted, Harvested, Rejected, Endorsed, Archived. | |
TI_BANDWIDTH_PERCENT | The proportion of the total available bandwidth that has been manually assigned to this crawl job (empty if the default bandwidth allocation has not been overridden). | |
TI_ALLOCATED_BANDWIDTH | Number | The actual amount of bandwidth assigned in Kilobytes per second. |
TI_START_TIME | Timestamp | For harvests that have started, the date and time the harvest actually did begin. |
TI_OWNER_ID | Foreign key | The key of the User who is the owner of this schedule. |
TI_DISPLAY_ORDER | Number | A number to assist with the ordering of results in the Target Instance search results screen. This number is tied to the state of the target instance. |
TI_PROF_OVERRIDE_OID | Foreign key | The key of the profile override information for this harvest. |
TI_PURGED | Boolean | True if the Harvest Results have been purged from the Digital Asset Store. |
TI_ARCHIVE_ID | Text | The ID returned by the Archive when the Harvest Result is “Submitted to Archive”, if any. |
TI_REFERENCE | Text | Duplicate of the TI_ARCHIVE_ID field. |
TI_HARVEST_SERVER | Text | The name of the harvest agent that ran this Target Instance. |
TI_DISPLAY_TARGET_INSTANCE | Boolean | Display this Target Instance. |
TI_DISPLAY_NOTE | Text | Records an explanation of the Display Target Instance choice. |
TI_FLAGGED | Boolean | Flag this target instance. |
TI_PROFILE_ID | Number | If this target instance is in a running state or later, this is the ID of the locked profile used to run the target instance. |
TI_ARCHIVED_TIME | Timestamp | The time that this target instance was archived or rejected. |
TI_FIRST_FROM_TARGET | Boolean | Is this the first TI created from a particular Target? |
TI_DISPLAY_CHG_REASON | Text | The reason the TI_DISPLAY_TARGET_INSTANCE Boolean was last changed. |
TI_USE_AQA | Boolean | Records whether the TI is marked for inclusion in the automated quality assurance (AQA) post harvest processes. |
HARVEST_RESULT¶
A HARVEST_RESULT is a set of files that represent the result of a harvest of a Target Instance. Note there can be several harvest results for each Target Instance (the first created by the crawler, the rest by QR tools).
Name | Type | Description |
---|---|---|
HR_OID | Primary key | |
HR_HARVEST_NO | Number | The sequence number of the result. Harvest Result #1 is always the original harvest. Harvest Result #2 can be created through the prune tool. |
HR_TARGET_INSTANCE_ID | Foreign key | The key of the Target Instance this harvest result belongs to. |
HR_PROVENANCE_NOTE | The provenance note of this Harvest Result. | |
HR_CREATED_DATE | Timestamp | The date the harvest result was created. |
HR_CREATED_BY_ID | Foreign key | The key of the User who created the Harvest Result. |
HR_STATE | Number | The endorsement state of the Harvest Result. Values correspond to: 1 = Endorsed; 2 = Rejected |
HR_INDEX | Number | An internal number for list management, this is mandatory for a Hibernate List. |
HR_DERIVED_FROM | Number | The list index of the harvest result that this harvest result is derived from. This is used in the case of a pruned harvest result. |
HR_RR_OID | Foreign key | Reference to the rejection reason for this Harvest Result. |
ARC_HARVEST_RESULT¶
ARC_HARVEST_RESULT associates each ARC file (ARC_HARVEST_FILE) with a Harvest Result (HARVEST_RESULT). This allows for flexibility in the future, despite having no data at present.
Name | Type | Description |
---|---|---|
AHRS_HARVEST_RESULT_OID | Primary key | |
HR_MODIFICATION_NOTE | This table holds a record of the modifications made to a harvest through the Prune Tool. | |
HMN_HR_OID | Foreign key | The key of the Harvest Result that this belongs to. |
HMN_INDEX | Number | The list index number (used to keep the order of the list). |
HMN_NOTE | Text | The text describing the modification. |
ARC_HARVEST_FILE¶
ARC_HARVEST_FILE contains information describing a single ARC file that is part of an ARC_HARVEST_RESULT.
Name | Type | Description |
---|---|---|
AHF_OID | Primary key | |
AHF_COMPRESSED | Boolean | Specifies whether the ARC file is compressed. |
AHF_NAME | Text | The ARC file name. |
AHF_ARC_HARVEST_RESULT_ID | Foreign key | The key of the ARC_HARVEST_RESULT this file belongs to. |
HARVEST_RESOURCE¶
HARVEST_RESOURCE contains information about each resource harvested.
Name | Type | Description |
---|---|---|
HRC_OID | Primary key | |
HRC_LENGTH | Number | The length of the resource in bytes. |
HRC_NAME | Text | The URI of the resource. |
HRC_HARVEST_RESULT_OID | Foreign key | The key of the HARVEST_RESULT this file belongs to. |
HRC_STATUS_CODE | Number | The HTTP status code of the resource (e.g. 200 = OK, 500 = Internal Server Error, etc.). |
ARC_HARVEST_RESOURCE¶
ARC_HARVEST_RESOURCE contains information about a harvested resource that is particular to the ARC format.
Name | Type | Description |
---|---|---|
AHRC_HARVEST_RESOURCE_OID | Primary key | |
AHRC_RESOURCE_LENGTH | Number | Not used – we currently rely on the HarvestResource’s length attribute. |
AHRC_RESOURCE_OFFSET | Number | The offset of this resource in the ARC file. |
AHRC_ARC_FILE_NAME | Text | The ARC file that contains this resource. |
AHRC_COMPRESSED_YN | Boolean | True if the ARC file is compressed; otherwise false. |
SIP_PART_ELEMENT | The SIP_PART_ELEMENT table is used internally to store parts of the SIP that must be created when a target instance’s harvest is started. This ensures that the details in the SIP remain consistent, even if the target instance’s data is changed between harvest and archive. | |
SPE_KEY | Text | A key indicating what part of the SIP this row represents. |
SPE_TARGET_INSTANCE_OID | Foreign Key | The key of the Target Instance to which this belongs. |
SPE_VALUE | Text / CLOB | The value of this part of the SIP. |
TARGET_INSTANCE_ORIG_SEED | This table holds the seeds of a target instance at the time the harvest was started. This is used internally to the WCT to ensure that the seeds stated in the SIP represent those at the time of the harvest, rather than those at the time of archiving (for example, if the seeds of the Target were changed after the harvest had started). | |
TIOS_TI_OID | Foreign key | The key of the Target Instance to which this belongs. |
TIOS_SEED | Text | The seed at the time of harvest. |
REJECTION_REASON¶
This table holds the reason for rejection that may be assigned to a Target or Harvest Result when it is rejected by the user. An administration page within WCT allows system administrators to set these up on a per agency basis.
Name | Type | Description |
---|---|---|
RR_OID | Primary key | |
RR_NAME | Text | A description of the reason for rejection. |
RR_AVAILABLE_FOR_TARGET | Boolean | Should this reason be applicable to Targets? |
RR_AVAILABLE_FOR_TI | Boolean | Should this reason be applicable to TIs? |
RR_AGC_OID | Foreign key | The owning Agency that this rejection reason belongs to. |
Harvest Authorisations¶
SITE¶
The SITE table contains high-level information about a Harvest Authorisation, and is used to group all the information applying to a specific harvest authorisation.
Note that the SITE table is badly named through historical accident.
Name | Type | Description |
---|---|---|
ST_OID | Primary key | |
ST_TITLE | Text | The name of the Harvest Authorisation record. |
ST_DESC | Text | A description of the authorisation record. |
ST_LIBRARY_ORDER_NO | Text | An external Order Number (e.g. Library Order Number). |
ST_NOTES | Text | |
ST_PUBLISHED | Boolean | Records whether the “Published” checkbox is ticked. |
ST_ACTIVE | Boolean | Records whether the harvest authorisation (and all associated permissions) is enabled or disabled. |
ST_OWNING_AGENCY_ID | Foreign Key | The owning agency for this site. |
URL_PATTERN¶
The URL_PATTERN table contains a URL or URL pattern.
The scope of each harvest authorisation (SITE) is defined by a set of URL patterns.
Name | Type | Description |
---|---|---|
UP_OID | Primary key | |
UP_PATTERN | Text | The URL or URL pattern. |
UP_SITE_ID | Foreign key | The key of the SITE this URL_PATTERN belongs to. |
AUTHORISING_AGENT¶
The AUTHORISING_AGENT table contains information about an entity contacted in relation to harvesting a website.
Name | Type | Description |
---|---|---|
AA_OID | Primary key | |
AA_NAME | Text | The name of the authorising agent. |
AA_ADRESS | Text | The full address of the authorising agent. |
AA_CONTACT | Text | The name of the individual contact for an organisation. |
AA_EMAIL | Text | The email address of the authorising agent. |
AA_PHONE_NUMBER | Text | The phone number of the authorising agent. |
AA_DESC | Text | A description of the authorising agent. |
SITE_AUTH_AGENCY¶
The SITE_AUTH_AGENCY table links each site with its list of authorising agencies. (Note this is a many-to-many relationship.)
Name | Type | Description |
---|---|---|
SA_SITE_ID | Primary key, Foreign key | The key of the SITE. |
SA_AGENT_ID | Primary key, Foreign key | The key of the AUTHORISING_AGENT. |
PERMISSION¶
The PERMISSION table contains information about a single permission that has been granted by an AUTHORISING_AGENT for a SITE.
Name | Type | Description |
---|---|---|
PE_OID | Primary key | |
PE_ACCESS_STATUS | Constrained | The access status of the permission. This value is constrained by the accessStatusList list in wct-core-lists.xml. |
PE_APPROVED_YN | Boolean | Not used. |
PE_AVAILABLE_YN` | Boolean | Not used. |
PE_COPYRIGHT_STATEMENT | Text | A passage of text that the publisher requires be displayed with the harvested material. |
PE_COPYRIGHT_URL | A URL (linking to a copyright statement) that the publisher requires to be displayed with the harvested material. | |
PE_CREATION_DATE | Timestamp | The date and time the permission record was created. |
PE_END_DATE | Timestamp | The date the permission information stored in this record expires (i.e. this permission only applies to harvests that occur between PE_START_DATE and PE_END_DATE). |
PE_NOTES | Text | As of release 1.6.0 used to hold Auth Agency Response. |
PE_OPEN_ACCESS_DATE | Timestamp | The date the rights over the harvested material expire and the material can be freely distributed. |
PE_PERMISSION_GRANTED_DATE | Timestamp | The date the permission was granted (or rejected). |
PE_PERMISSION_REQUESTED_DATE | Timestamp | The date the permission was requested. |
PE_SPECIAL_REQUIREMENTS | Text | A passage of text describing any special requirements for the use of the harvested material. |
PE_START_DATE | The date the permission information stored in this record expires (i.e. this permission only applies to harvests that occur between PE_START_DATE and PE_END_DATE). | |
PE_STATUS | Number | The current state of the Target Instance. Values correspond to: Pending, Requested, Approved, Rejected. |
PE_AUTH_AGENT_ID | Foreign key | The key of the AUTHORISING_AGENT who has authorised this permission record. |
PE_SITE_ID | Foreign key | The key of the Harvest Authorisation (i.e. SITE) that this permission applies to. |
PE_QUICK_PICK | Boolean | Records whether this permission appears in the Authorisation drop-down menu in the Seeds tab in the Target editing interface. |
PE_DISPLAY_NAME | Text | Label to use in the “Authorisation” drop-down menu in the Seeds tab in the Target editing interface (if PE_QUICK_PICK is set). |
PE_OWNING_AGENCY_ID | Foreign key | The key of the Agency that has been granted authorisation by this permission record. |
PE_FILE_REFERENCE | Text | An external reference number relating to this permission record (e.g. the file number of a permission letter). |
PERMISSION_URLPATTERN¶
The PERMISSION_URLPATTERN table links PERMISSION records to the URL_PATTERN records that apply to them. Each permission will apply to one or more URL Patterns.
Name | Type | Description |
---|---|---|
PU_URLPATTERN_ID | Primary key, Foreign key | The key of the URL Pattern. |
PU_PERMISSION_ID | Primary key, Foreign key | The key of the Permission record. |
PERMISSION_EXCLUSION¶
The PERMISSION_EXCLUSION table contains information about a URL pattern that has been excluded from a PERMISIION.
Name | Type | Description |
---|---|---|
PEX_OID | Primary key | |
PEX_REASON | Text | The reason for the exclusion. |
PEX_URL | Text | The URL or URL Pattern that has been excluded. |
PEX_PERMISSION_OID | Foreign key | The key of the permission that this is an exclusion to. |
PEX_INDEX | Number | Internal number for maintaining the order of elements in a list. |
SEED_PERMISSION¶
SEED_PERMISSION contains information about the associations between Seed URLs and the permission records that apply to them.
Name | Type | Description |
---|---|---|
SP_SEED_ID | Primary key, Foreign key | The key of a Seed URL. |
SP_PERMISSION_ID | Primary key, Foreign key | The key of a permission record that is linked to the Seed URL. |
URL_PERMISSION_MAPPING¶
URL_PERMISSION_MAPPING contains information about the associations between URL_PATTERNS and the permission records they apply to.
Name | Type | Description |
---|---|---|
UPM_OID | ||
UPM_PERMISSION_ID | The key of the permission record. | |
UPM_URL_PATTERN_ID | The key of a URL Pattern that is linked to this permission record. | |
UPM_DOMAIN | The most specific part of the domain, used for quick matching of seeds to permissions. For global patterns, this will be *. |
Profiles and profile overrides¶
PROFILE¶
PROFILE contains information describing a single Heritrix profile.
Name | Type | Description |
---|---|---|
P_OID | Primary key | |
P_VERSION | Number | Internal version number for optimistic locking. |
P_DESC | Text | A textual description of the profile. |
P_NAME | Text | The name of the profile. |
P_PROFILE_STRING | Text | The profile itself, stored as an XML document. |
P_PROFILE_LEVEL | Number | The level of the profile (controls which users may use the profile). |
P_STATUS | Number | The current status of the profile. |
P_DEFAULT | Boolean | Records whether this profile is the default profile for the Agency. |
P_AGENCY_OID | Foreign key | The key of the Agency that this profile belongs to. |
P_ORIG_OID | Number | The oid of the profile that this is a (usually locked) copy of. |
PROFILE_OVERRIDES¶
PROFILE_OVERRIDES contains information describing the overrides to a profile pertaining to a specific ABSTRACT_TARGET (or its Target Instances).
Name | Type | Description |
---|---|---|
PO_OID | Primary key | |
PO_EXCL_MIME_TYPES | Text | A list of MIME types to exclude from the harvest. |
PO_MAX_BYES | Number | The maximum quantity of data to download (in bytes). |
PO_MAX_DOCS | Number | The maximum number of documents to download. |
PO_MAX_HOPS | Number | The maximum distance to crawl (in Heritrix “hops”). |
PO_MAX_PATH_DEPTH | Number | The maximum distance to crawl (in path depth from the website root). |
PO_MAX_TIME_SEC | Number | The maximum time to spend on the harvest (in seconds). |
PO_ROBOTS_POLICY | Text | Specifies whether the obots.txt file should be consulted or ignored. Either ignore or classic. |
PO_OR_CREDENTIALS | Boolean | Specifies whether the Target has any credentials (i.e. usernames and passwords) stored in the PROFILE_CREDENTIALS and related tables. |
PO_OR_EXCL_MIME_TYPES | Boolean | Specifies whether the PO_EXCL_MIME_TYPES override is activated. |
PO_OR_EXCLUSION_URI | Boolean | Specifies whether the Target has any URL exclusions stored in the PO_EXCLUSION_URI table. |
PO_OR_INCLUSION_URI | Boolean | Specifies whether the Target has any URL inclusions stored in the PO_INCLUSION_URI table. |
PO_OR_MAX_BYTES | Boolean | Specifies whether the PO_MAX_BYES override is activated. |
PO_OR_MAX_DOCS | Boolean | Specifies whether the PO_MAX_DOCS override is activated. |
PO_OR_MAX_HOPS | Boolean | Specifies whether the PO_MAX_HOPS override is activated. |
PO_OR_MAX_PATH_DEPTH | Boolean | Specifies whether the PO_MAX_PATH_DEPTH override is activated. |
PO_OR_MAX_TIME_SEC | Boolean | Specifies whether the PO_MAX_TIME_SEC override is activated. |
PO_OR_ROBOTS_POLICY | Boolean | Specifies whether the PO_ROBOTS_POLICY override is activated. |
PO_EXCLUSION_URI¶
The PO_EXCLUSION_URI table contains information about a URL patterns that have been excluded from a PROFILE_OVERRIDE.
Name | Type | Description |
---|---|---|
PEU_IX | Primary key | |
PEU_PROF_OVER_OID | Foreign key | The key of the PROFILE_OVERRIDES that this exclusion applies to. |
PEU_FILTER | Text | The URL pattern excluded (a PERL regular expression). |
PO_INCLUSION_URI¶
The PO_INCLUSION_URI table contains information about a URL patterns that have been un-excluded from a PROFILE_OVERRIDE (i.e. patterns that are exceptions to exclusions in PO_EXCLUSION_URI).
Name | Type | Description |
---|---|---|
PEU_IX | Primary key | |
PEU_PROF_OVER_OID | Foreign key | The key of the PROFILE_OVERRIDES that this un-exclusion applies to. |
PEU_FILTER | Text | The URL pattern included (a PERL regular expression). |
PROFILE_CREDENTIALS¶
PROFILE_CREDENTIALS contains shared credential information used by both basic and form credentials.
Name | Type | Description |
---|---|---|
PC_OID | Primary key | |
PC_DOMAIN | Text | The Internet domain this credential applies to. |
PC_PASSWORD | Text | The password for this credential. |
PC_USERNAME | Text | The username for this credential. |
PC_PROFILE_OVERIDE_OID | Foreign key | The key of the PROFILE_OVERRIDES that these credentials apply to. |
PC_INDEX | Number | Internal number for maintaining the order of elements in a list. |
PROFILE_BASIC_CREDENTIALS¶
PROFILE_BASIC_CREDENTIALS is an extension of PROFILE_CREDENTIALS that contains credential information in basic credential format.
Name | Type | Description |
---|---|---|
PBC_PC_OID | Primary key, Foreign key | The key of the PROFILE_CREDENTIALS that this credential extends. |
PBC_REALM | Text | The realm this credential applies to. |
PROFILE_FORM_CREDENTIALS¶
PROFILE_FORM_CREDENTIALS is an extension of PROFILE_CREDENTIALS that contains credential information in “form” credential format.
Name | Type | Description |
---|---|---|
PRC_PC_OID | Primary key, Foreign key | The key of the PROFILE_CREDENTIALS that this credential extends. |
PFC_METHOD | Text | The method for submitting the form. |
PFC_LOGIN_URI | Text | The URL of the login form to use this credential against. |
PFC_PASSWORD_FIELD | Text | The name of the password field used in the form. |
PFC_USERNAME_FIELD | Text | The name of the username field used in the form. |
Audit trail¶
WCTAUDIT¶
WCTAUDIT records all auditable events.
Each row in the table records a single auditable action, including the user who performed the action, the date and time, the object the action was performed on (i.e. the subject), and any message.
Name | Type | Description |
---|---|---|
AUD_OID | Primary key | |
AUD_ACTION | Action | The auditable action performed. |
AUD_DATE | Timestamp | The date and time the action was performed. |
AUD_FIRSTNAME | Text | The first name of the user who performed the action. |
AUD_LASTNAME | Text | The last name of the user who performed the action. |
AUD_MESSAGE | Text | Additional text describing the action. |
AUD_SUBJECT_TYPE | Text | The type of the object that was acted on. |
AUD_USERNAME | Text | The username of the user who performed the action. |
AUD_USER_OID | Foreign key | The key of the user who performed the action. |
AUD_SUBJECT_OID | Foreign key | The key of the object that was acted on. |
AUD_AGENCY_OID | Foreign key | The key of the agency that the user who performed the action belongs to. |
WCT_LOGON_DURATION¶
WCT_LOGON_DURATION records the time and duration of all user sessions.
Each row in the table records a single user session.
Name | Type | Description |
---|---|---|
LOGDUR_OID | Primary key | |
LOGDUR_DURATION | Number | The duration of the user session in seconds. |
LOGDUR_LOGON_TIME | Timestamp | The date and time the user logged on to the WCT. |
LOGDUR_LOGOUT_TIME | Timestamp | The date and time the user logged out of the WCT. |
LOGDUR_USERNAME | Text | The username of the user. |
LOGDUR_USER_OID | Foreign key | The key of the user. |
LOGDUR_USER_REALNAME | Text | The full name of the user. |
LOGDUR_SESSION_ID | Text | The Apache Tomcat Session ID. |
Agencies, Roles and Users¶
AGENCY¶
AGENCY contains information describing an agency.
Name | Type | Description |
---|---|---|
AGC_OID | Primary key | |
AGC_NAME | Text | The name of the agency. |
AGC_ADDRESS | Text | The address of the agency. |
AGC_LOGO_URL | Text | A URL for the logo of the agency. |
AGC_URL | Text | The URL of the Agency |
AGC_EMAIL | Text | The agency email address. |
AGC_FAX | Text | The agency fax number. |
AGC_PHONE | Text | The agency phone number. |
AGC_SHOW_TASKS | Boolean | Whether the tasks list is shown on the notifications page for users in this agency. Default is true. |
WCTROLE¶
WCTROLE contains information about a role.
Each role is associated with a single agency. The privileges attached to the role are stored in the ROLE_PRIVILEGE table.
Name | Type | Description |
---|---|---|
ROL_OID | Primary key | |
ROL_DESCRIPTION | Text | Description of the role. |
ROL_NAME | Text | Name of the role. |
ROL_AGENCY_OID | Foreign key | The key of the agency that this role belongs to. |
ROLE_PRIVILEGE¶
ROLE_PRIVILEGE records the privileges, and the scope of privileges, associated with each role.
Each role can have any number of privileges associated with it. Privileges are identified by the PRV_CODE, a unique code used by the WCT to represent each privilege. These are codes are hard-coded in the WCT, where they are used to determine whether users can perform particular actions.
Name | Type | Description |
---|---|---|
PRV_OID | Primary key | |
PRV_CODE | Text | The code identifying the privilege being set. |
PRV_ROLE_OID | Foreign key | The key of the role this privilege is associated with. |
PRV_SCOPE | Number | The scope of the privilege as it applies to this role (i.e. whether the privilege applies to all data, agency data, or only the data owned by the user). 0 = All; 100 = Agency; 200 = Owner; 500 = None. |
WCTUSER¶
WCTUSER contains information describing the WCT users.
Name | Type | Description |
---|---|---|
USR_OID | Primary key | |
USR_ACTIVE | Boolean | Specifies whether the user is currently active or disabled. |
USR_ADDRESS | Text | The user’s physical or postal address. |
USR_EMAIL | Text | The user’s email address. |
USR_EXTERNAL_AUTH | Boolean | Specifies whether the user should be authenticated using an external LDAP service, or using the internal authentication system. |
USR_FIRSTNAME | Text | The user’s first name. |
USR_FORCE_PWD_CHANGE | Boolean | Specifies whether the user should be forced to reset their password next time they log on to the WCT. |
USR_LASTNAME | Text | The user’s last name. |
USR_PASSWORD | Text | The user’s (encrypted) password. |
USR_PHONE | Text | The user’s phone number. |
USR_TITLE | Text | The user’s title. |
USR_USERNAME | Text | The unique username identifying the user. |
USR_AGC_OID | Foreign key | The key of the Agency that the user belongs to. |
USR_DEACTIVATE_DATE | Timestamp | The date when the user was deactivated. |
USR_NOTIFICATIONS_BY_EMAIL | Boolean | True if the user wants to receive notifications by emails as well as to their WCT in-tray. |
USR_TASKS_BY_EMAIL | Boolean | True if the user wants to receive tasks by email as well as to their WCT in-tray. |
USR_NOTIFIY_ON_GENERAL | Boolean | True if the user wants to receive general notifications. |
USR_NOTIFY_ON_WARNINGS | Boolean | True if the user wants to receive notifications for warnings (such as memory warnings from the Harvest Agent). |
USER_ROLE¶
USER_ROLE contains information linking users and roles.
Each row contains a user key and a role key, indicating that the specified user has been assigned the specified role.
Name | Type | Description |
---|---|---|
URO_USR_OID | Foreign key | The key of the user. |
URO_ROL_OID | Foreign key | The key of the role. |
PERMISSION_TEMPLATE¶
PERMISSION_TEMPLATE contains information describing a permission request template.
Name | Type | Description |
---|---|---|
PRT_OID | Primary key | |
PRT_AGC_OID | Foreign key | The key of the Agency this template belongs to. |
PRT_TEMPLATE_TEXT | Text | The text of the permission letter template. |
PRT_TEMPLATE_NAME | Text | The name of the template. |
PRT_TEMPLATE_TYPE | Text | The type of template (either Print Template or Email Template). |
PRT_TEMPLATE_DESC | Text | The description of the template. |
PRT_TEMPLATE_SUBJECT | Text | The subject of the Email. |
PRT_TEMPLATE_OVERWRITE_FROM | Boolean | A flag used to control if the from field of the email is overwritten by the PRT_TEMPLATE_FROM. |
PRT_TEMPLATE_FROM | Text | The email address used in the sent from field. |
PRT_TEMPLATE_CC | Text | Email address(s) the emails are cc’d to. |
PRT_TEMPLATE_BCC | Text | Email address(s) the emails are bcc’d to. |
PRT_TEMPLATE_REPLY_TO | Email address used as the reply-to in permission emails. |
TASK¶
TASK contains information describing a WCT task.
When a task is created, it is not assigned to a user, but will be displayed (and emailed) to all the users in the same agency who have sufficient rights to perform the task. When one of these users “claims” the task, it will no longer be displayed to the other users. When the user completes the task, it will be removed from their task list and deleted.
Name | Type | Description |
---|---|---|
TSK_OID | Primary key | |
TSK_USR_OID | Foreign key | The key of the user who has claimed (or been assigned) the task, if any. |
TSK_MESSAGE | Text | The message describing the task. |
TSK_SENDER | Text | The email address of the sender of the task. |
TSK_SENT_DATE | Timestamp | The date and time the task was created. |
TSK_SUBJECT | Text | The subject line of the task, used in the InTray and in email notifications. |
TSK_PRIVILEGE | Text | The privilege code that a user must have in order to complete the task. This field identifies which users will see an unassigned task. |
TSK_AGC_OID | Foreign key | The key of the agency this task belongs to. |
TSK_MSG_TYPE | Text | A type code for the message. |
TSK_RESOURCE_OID | Foreign key | The key of the object this task will be performed on. |
TSK_RESOURCE_TYPE | Text | The type of object the TSK_RESOURCE_OID identifies. |
Other tables¶
ANNOTATIONS¶
The ANNOTATIONS table contains information about annotations. Annotations can be attached to many types of object, including Targets, target Instances, and Permissions.
Name | Type | Description |
---|---|---|
AN_OID | Primary key | |
AN_DATE | Timestamp | The date the annotation was created. |
AN_NOTE | Text | The text of the annotation. |
AN_USER_OID | Foreign key | The foreign key of the user who created the annotation. |
AN_OBJ_OID | Foreign key | The foreign key of the object that the annotation is attached to. |
AN_OBJ_TYPE | Number | Specifies the type of object that the annotation is attached to. |
AN_ALERTABLE | Boolean | Is this annotation to display a warning in the GUI. |
BANDWIDTH_RESTRICTIONS¶
The BANDWIDTH_RESTRICTIONS table records the bandwidth restrictions in place at different intervals.
Name | Type | Description |
---|---|---|
BR_OID | Primary key | |
BR_BANDWIDTH | Number | The bandwidth level for an interval. |
BR_DAY | Text | The day the interval applies to. |
BR_END_TIME | Timestamp | The end time of the interval. |
BR_START_TIME | Timestamp | The start time of the interval. |
BR_OPTIMIZATION_ALLOWED | Boolean | Whether harvest optimization is permitted during this restriction period. |
DUBLIN_CORE¶
The DUBLIN_CORE table records the Dublin Core metadata for a Target.
Name | Type | Description |
---|---|---|
DC_OID | Primary key | |
DC_CONTRIBUTOR | Text | Dublin Core metadata value. |
DC_COVERAGE | Text | Dublin Core metadata value. |
DC_CREATOR | Text | Dublin Core metadata value. |
DC_DESCRIPTION | Text | Dublin Core metadata value. |
DC_FORMAT | Text | Dublin Core metadata value. |
DC_IDENTIFIER | Text | Dublin Core metadata value. |
DC_IDENTIFIER_ISBN | Text | Dublin Core metadata value. |
DC_IDENTIFIER_ISSN | Text | Dublin Core metadata value. |
DC_LANGUAGE | Text | Dublin Core metadata value. |
DC_PUBLISHER | Text | Dublin Core metadata value. |
DC_RELATION | Text | Dublin Core metadata value. |
DC_SOURCE | Text | Dublin Core metadata value. |
DC_SUBJECT | Text | Dublin Core metadata value. |
DC_TITLE | Text | Dublin Core metadata value. |
DC_TYPE | Text | Dublin Core metadata value. |
HARVEST_STATUS¶
The HARVEST_STATUS table records information about a specific Heritrix Harvest.
Name | Type | Description |
---|---|---|
HS_OID | Primary key | |
HS_AVG_KB | Double | Average Kilobytes per second downloaded. |
HS_AVG_URI | Double | Average number of URLs per second downloaded. |
HS_DATA_AMOUNT | Number | Total data downloaded. |
HS_ELAPSED_TIME | Number | Elapsed time of the harvest. |
HS_JOB_NAME | Text | The identifier of the harvest job. |
HS_STATUS | Text | The status of the harvest. |
HS_URLS_DOWN | Number | The number of URLs downloaded. |
HS_URLS_FAILED | Number | The number of URLs that filed to download. |
HS_ALERTS | Number | The umber of alerts reported by the harvester during the crawl. |
HS_HRTX_VERSION | Text | Version of Heretrix used during harvest. |
HS_APP_VERSION | Text | Version of WCT used during harvest. |
NOTIFICATION¶
The NOTIFICATION table records information about notifications sent to users.
Name | Type | Description |
---|---|---|
NOT_OID | Primary key | |
NOT_MESSAGE | Text | The message text. |
NOT_USR_OID | Foreign key | The foreign key of the user who will receive the notification. |
NOT_SENDER | Text | The email address of the sender of the notification. |
NOT_SENT_DATE | Timestamp | The date the notification was sent. |
NOT_SUBJECT | Text | The subject line of the notification. |
ID_GENERATOR¶
The ID_GENERATOR table is used to generate globally unique identifiers for objects in the database/. See Generating primary keys below for details.
Name | Type | Description |
---|---|---|
IG_TYPE | Text | The object type (or types) that this range of Identifier numbers applies to. |
IG_VALUE | Number | The range of identifier numbers. |
FLAG¶
The FLAG table defines arbitrary flag groups that are used to progress target instances through the WCT workflow. Each flag is allocated a description and colour.
Name | Type | Description |
---|---|---|
F_OID | Primary key | Unique identifier for the flag. |
F_NAME | Text | Name for the flag group. |
F_RGB | Text | Colour for the flag. |
F_COMPLEMENT_RGB | Text | Complement colour for the flag (used to set a contrasting colour for the flag name). |
F_AGC_OID | Foreign key | The foreign key of the agency that owns the flag. |
INDICATOR_CRITERIA¶
The INDICATOR_CRITERIA table defines a template for the QA indicators. The template is used to initialise the indicators for a specific target instance (see the INDICATOR table).
Name | Type | Description |
---|---|---|
IC_OID | Primary key | Unique identifier for the indicator criteria. |
IC_NAME | Text | Name for the indicator. |
IC_DESCRIPTION | Text | Description for the indicator. |
IC_UPPER_LIMIT_PERCENTAGE | Number | Upper limit used to define the upper watermark for the indicator as a percentage (eg: +10%). |
IC_LOWER_LIMIT_PERCENTAGE | Number | Lower limit used to define the lower watermark for the indicator as a percentage (eg: -10%). |
IC_UPPER_LIMIT | Number | Absolute value for the indicator’s upper limit. |
IC_LOWER_LIMIT | Number | Absolute value for the indicator’s lower limit. |
IC_AGC_OID | Foreign key | The foreign key of the agency that owns the indicator criteria. |
IC_UNIT | Text | Unit of measurement for the indicator’s value used to format the value for display (byte, millisecond or integer). |
IC_SHOW_DELTA | Boolean | Displays the indictor delta compared with the reference crawl if true. |
IC_ENABLE_REPORT | Boolean | Hyperlinks the indicator when set to true and generates a report based on the contents of the INDICATOR_REPORT_LINE table. |
INDICATOR¶
The INDICATOR table defines the QA indicators for a specific target instance.
Name | Type | Description |
---|---|---|
I_OID | Primary key | Unique identifier for the indicator. |
I_IC_OID | Foreign key | The foreign key of the indicator criteria on which this indicator is based. |
I_TI_OID | Foreign key | The foreign key of the target instance that owns this indicator. |
I_NAME | Text | Name for the indicator. |
I_FLOAT_VALUE | Number | Value of the indicator. |
I_UPPER_LIMIT_PERCENTAGE | Number | Upper limit used to define the upper watermark for the indicator as a percentage (eg: +10%). |
I_LOWER_LIMIT_PERCENTAGE | Number | Lower limit used to define the lower watermark for the indicator as a percentage (eg: -10%). |
I_UPPER_LIMIT | Number | Absolute value for the indicator’s upper limit. |
I_LOWER_LIMIT | Number | Absolute value for the indicator’s lower limit. |
I_ADVICE | Text | The advice issued for this indicator (eg: Reject). |
I_JUSTIFICATION | Text | The justification for the advice reached for this indicator (eg: The content downloaded is 0KB). |
I_AGC_OID | Foreign key | The foreign key of the agency that owns the indicator criteria. |
I_UNIT | Text | Unit of measurement for the indicator’s value used to format the value for display (byte, millisecond or integer). |
I_SHOW_DELTA | Boolean | Displays the indicator delta compared with the reference crawl if true. |
I_INDEX | Number | Display order for the indicator. |
I_DATE | Date | Date on which the indicator was generated. |
INDICATOR_REPORT_LINE¶
The INDICATOR_REPORT_LINE table is used to compile a report of the subject of the indicator (eg: for the Missing URLs indicator, each record in the INDICATOR_REPORT_LINE table represents a missing URL for that indicator).
Name | Type | Description |
---|---|---|
IRL_OID | Foreign key | The foreign key of the indicator that owns this report line. |
IRL_LINE | Text | The indicator report line (eg: the missing URL for the Missing URLs indicator). |
IRL_INDEX | Number | Display order for the indicator report line. |
HEATMAP_CONFIG¶
HEATMAP_CONFIG contains the names and colors of the thresholds for the scheduling heat-map introduced in WCT version 1.6.1.
Name | Type | Description |
---|---|---|
HM_OID | Primary key | |
HM_NAME | Text | Name of the threshold, used as an identifier. |
HM_DISPLAY_NAME | Text | Display name of the threshold. |
HM_COLOR | Text | RGB color of the threshold. |
HM_THRESHOLD_LOWEST | Number | The lowest number of scheduled harvests on a given day to allow this indicator to be used. |
Generating primary keys¶
The WCT stores all primary keys as numbers.
Tables involved¶
The ID_GENERATOR table is used to track the reservation of ID values in a number of different key sets.
The ABSTRACT_TARGET, TARGET, GROUP, SEED and other important tables share a set of keys that are controlled by the ID_GENERATOR.IG_TYPE value of General, ensuring that their object IDs will never clash. Other objects have their own ID_GENERATOR to ensure that the ID numbers do not grow too quickly.
Reserving sequence numbers¶
If you want to insert new rows into WCT fields, you need to reserve a sequence number. To get a sequence number you need to:
Ensure that WCT is shutdown.
List the sequences available by running:
SELECT * FROM id_generator;
Select the sequence for the objects you want to create. If there is not a specific sequence, choose the General sequence.
Run the following, substituting the sequence name as appropriate, and note the values returned:
SELECT ig_value, ig_value * 32768 AS MIN_RES_VAL, ig_value * 32768 + 32767 AS MAX_RES_VAL FROM id_generator WHERE ig_type LIKE '%General%';
Now update the table to reserve your sequence numbers, using the same ID Generator Key as above, and the IG_VALUE returned by the above select statement:
UPDATE id_generator SET ig_value = ig_value+1 WHERE ig_type LIKE '%General%' AND ig_value = :IG_VALUE;
If the update statements reports one record updated, then you have successfully reserved the range between MIN_RES_VAL and MAX_RES_VAL. If the update reports no records updated, then you must repeat the process from step three as someone else may have reserved the numbers you were after.
Once you have all the numbers you need you can restart WCT.
Notes¶
Note that different object types may use different runs of numbers; for example ANNOTATION objects have IG_TYPE Annotation. Also note that the IG_TYPE field contents include some strange whitespace (hence the use of like in the SQL code above).
Every time a sequence is reserved, all 32,676 values are reserved, regardless of whether they get used or not.