User Manual¶

Introduction¶

About the Web Curator Tool¶

The Web Curator Tool is a tool for managing the selective web harvesting process. It is typically used at national libraries and other collecting institutions to preserve online documentary heritage.

Unlike previous tools, it is enterprise-class software, and is designed for non-technical users like librarians. The software was developed jointly by the National Library of New Zealand and the British Library, and has been released as free software for the benefit of the international collecting community.

About this document¶

This document is the Web Curator Tool User Manual. It describes how to use the Web Curator Tool through its web browser interface. It assumes your system administrator has already set up the Web Curator Tool.

The manual is divided into chapters, each of which deals with a different aspect of the tool. The chapters generally correspond to the major Web Curator Tool modules.

System administrators will find an Administrators Guide and other technical documentation on the Web Curator Tool website (https://www.webcuratortool.org/).

Where to find more information¶

The primary source for information on the Web Curator Tool is the website:

https://www.webcuratortool.org/

The GitHub project page includes links to download the tool.

Each page in the Web Curator Tool has a Help link in the top right corner that leads to the GitHub project page. From there you can navigate to the Web Curator Tool Wiki, which is also hosted on GitHub.

System Overview¶

Background¶

More and more of our documentary heritage is only available online, but the impermanence and dynamic nature of this content pose significant challenges to any collecting institution attempting to acquire it.

To solve these problems, the National Library of New Zealand and The British Library initiated a project to design and build a selective web harvesting tool, which has now been released to the collecting community as the Web Curator Tool.

Purpose and scope¶

The tool is designed to manage the selective web archiving process. It supports a harvesting workflow comprising a series of specialised tasks. The two main business processes supported are acquisition and description.

The Web Curator Tool supports:

Harvest Authorisation: obtaining permission to harvest web material and make it publicly accessible;
Selection, scoping and scheduling: deciding what to harvest, how, and when;
Description: adding basic Dublin Core metadata;
Harvesting: downloading the selected material from the internet;
Quality Review: ensuring the harvested material is of sufficient quality for archival purposes; and
Archiving: submitting the harvest results to a digital archive.

The scope of the tool is carefully defined to focus on web harvesting. It deliberately does not attempt to fulfil other enterprise functions:

it is not a digital repository or archive (an external repository or archive is required for storage and preservation)
it is not an access tool
it is not a cataloguing system (though it does provide some support for simple Dublin Core metadata)
it is not a document or records management system

Other, specialised tools can perform these functions more effectively and the Web Curator Tool has been designed to interoperate with such systems.

Essential terminology¶

Important terms used with the Web Curator Tool include:

Web Curator Tool or WCT - a tool for managing the selective web harvesting process.
Target - a portion of the web you want to harvest, such as a website or a set of web pages. Target information includes crawler configuration details and a schedule of harvest dates.
Target Instance - a single harvest of a Target that is scheduled to occur (or which has already occurred) at a specific date and time.
harvest or crawl - the process of exploring the internet and retrieving specific web pages.
harvest result - the files that are retrieved during a harvest.
seed or seed url - a starting URL for a harvest, usually the root address of a website. Most harvests start with a seed and include all pages “below” that seed.
harvest authorisation - formal approval for you to harvest web material. You normally need permission to harvest the website, and also to store it and make it accessible.
permission record - a specific record of a harvest authorisation, including the authorising agencies, the dates during which permissions apply and any restrictions on harvesting or access.
authorising agency - a person or organisation who authorises a harvest; often a website owner or copyright holder.
indicator - a quality assurance metric used to quantify the success of a harvest (e.g. the amount of content downloaded).
recommendation - the advice obtained by using one or more indicators to determine if a harvest successfully captured the content from a website.
automated QA - the automated quality assurance process that runs after a harvest completes and provides a recommendation.
flag - an arbitrary group created and assigned to one or more target instances.
reference crawl - a target instance that has been archived and marked as a baseline to which all future harvests will be compared for a specific target.
harvest optimisation - enables a harvest to run at the optimum time when there is available space in the schedule. The default is to look forward 12 hours (configurable).
heat map - a calendar ‘pop up’ that indicates the spread of scheduled harvests over a period of time.

Impact of the tool¶

The Web Curator Tool is used at the National Libraries of New Zealand and the Netherlands, and has had these impacts since it was introduced into the existing selective web archiving programme:

Harvesting has become the responsibility of librarians and subject experts. These users control the software handling the technical details of web harvesting through their web browsers, and are much less reliant on technical support people.
Many harvest activities previously performed manually are now automated, such as scheduling harvests and generating preservation metadata.
The institution’s ability to harvest websites for archival purposes has been improved, and a more efficient and effective workflow is in place. The new workflow ensures material is safely managed from before it is harvested until the time it enters a digital archive.
The harvested material is captured in ARC/WARC format which has strong storage and archiving characteristics.
The system epitomises best practice through its use of auditing, permission management, and preservation metadata.

How Does it Work?¶

The Web Curator Tool has the following major components:

The Control Centre

The Control Centre includes an access-controlled web interface where users control the tool.
It has a database of selected websites, with associated permission records and other settings, and maintains a harvest queue of scheduled harvests.

Harvest Agents

When the Control Centre determines that a harvest is ready to start, it delegates it to one of its associated harvest agents.
The harvest agent is responsible for crawling the website using the Heritrix web harvester, and downloading the required web content in accordance with the harvester settings.
Each installation can have more than one harvest agent, depending on the level of harvesting the organisation undertakes.

Digital Asset Store

When a harvest agent completes a harvest, the results are stored on the digital asset store.
The Control Centre provides a set of quality review tools that allow users to assess the harvest results stored in the digital asset store.
Successful harvests can then be submitted to a digital archive for long-term preservation.
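The flow between the three components can be pictured with a small Python sketch. It is an illustration only and does not reflect the tool's internal code: the class names `ControlCentre` and `HarvestAgent`, the round-robin choice of agent, and the dictionary standing in for the digital asset store are all assumptions made for this example.

```python
import heapq
from datetime import datetime


class HarvestAgent:
    """Stand-in for a harvest agent; a real agent runs a Heritrix crawl."""

    def __init__(self, name):
        self.name = name

    def harvest(self, seed):
        # Pretend to crawl and return the "files" that were retrieved.
        return [f"{seed} (captured by {self.name})"]


class ControlCentre:
    """Keeps a queue of scheduled harvests and delegates them to agents."""

    def __init__(self, agents, asset_store):
        self.agents = agents
        self.asset_store = asset_store  # hypothetical store: seed -> harvest result
        self.queue = []                 # heap of (scheduled time, seed)

    def schedule(self, when, seed):
        heapq.heappush(self.queue, (when, seed))

    def run_due(self, now):
        """Delegate every harvest whose scheduled time has arrived."""
        started = 0
        while self.queue and self.queue[0][0] <= now:
            _, seed = heapq.heappop(self.queue)
            # Simple round-robin choice of agent (an assumption).
            agent = self.agents[started % len(self.agents)]
            # The completed harvest result goes to the asset store.
            self.asset_store[seed] = agent.harvest(seed)
            started += 1
        return started


store = {}
centre = ControlCentre([HarvestAgent("agent-1")], store)
centre.schedule(datetime(2024, 1, 1, 9, 0), "http://www.example.com/")
centre.schedule(datetime(2024, 1, 2, 9, 0), "http://www.example.org/")
started = centre.run_due(datetime(2024, 1, 1, 12, 0))
```

Only the first harvest is due at the time `run_due` is called, so only its result reaches the store; the second stays in the queue until its scheduled time.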

Home Page¶

The Web Curator Tool Home Page is pictured below.

Figure 1. Home Page

The left-hand side of the homepage gives access to the functionality used in the selection and harvest process:

In Tray - view tasks that require action and notifications that display information, specific to the user

Harvest Authorisations - create and manage harvest authorisation requests

Targets - create and manage Targets and their schedules

Target Instances - view the harvests scheduled in the future and review the harvests that are complete

Groups - create and manage collections of Targets, for collating meta-information or harvesting together

The right-hand side of the homepage gives access to administrative functions:

Permission Request Templates - create templates for permission request letters

Reports - generate reports on system activity

Harvest Configuration - view the harvester status, and configure harvest profiles (such as how many documents to download, whether to compress them, delays to accommodate the hosting server, etc.)

Users, Roles, Agencies, Rejection Reasons, Indicators & flags - create and manage users, agencies, roles, privileges, rejection reasons, QA indicators and flags

The functions that display on the Web Curator Tool Home Page depend on the user’s privileges.

Harvest Authorisations¶

Introduction¶

When you harvest a website, you are making a copy of a published document. This means you must consider copyright law when you harvest material, and also when you preserve it and when you make it accessible to users.

The Web Curator Tool has a sophisticated harvest authorisation module for recording your undertakings to copyright holders. Before you can harvest web pages, you must first confirm you are authorised to do so. The Web Curator Tool will record this information in its audit trail so that the person or agency that authorised a particular harvest can always be found. If you do not record who has authorised the harvest, the Web Curator Tool will defer the harvest until you confirm you are authorised.

In most cases, getting “harvest authorisation” means you must get permission from the website owner before you start the harvest. The Web Curator Tool lets you create harvest authorisation records that record what website or document you have requested permission for, who has authorised you to perform the crawl, whether you have been granted permission, and any special conditions.

Some institutions, such as national libraries, operate under special legislation and do not need to seek permission to harvest websites in their jurisdiction. The Web Curator Tool supports these organisations by allowing them to create a record that covers all such cases. See the section on Legislative and other sources of authorisation below.

In other cases, your institution may decide to harvest a website before seeking permission, possibly because the target material is time-critical and it is in the public interest to capture it right away. In these cases, you must still record the entity who authorised the crawl, even if it is a person in your organisation, or even you yourself. This is also covered in the section on Legislative and other sources of authorisation below.

Commercial search engines often harvest websites without seeking permission from the owners. Remember that these services do not attempt to preserve the websites, or to republish them, so have different legal obligations.

Terminology and status codes¶

Terminology¶

Important terms used with the Harvest Authorisation module include:

harvest authorisation - formal approval for you to harvest web material. You normally need the copyright holder’s permission to harvest the website, and also to store it and make it accessible.
authorising agency - a person or organisation who authorises a harvest; often a website owner or copyright holder.
permission record - a specific record of a harvest authorisation, including the authorising agencies, the dates during which permissions apply and any restrictions on harvesting or access.
url pattern - a way of describing a URL or a set of URLs that a permission record applies to. For example, http://www.example.com/* is a pattern representing all the URLs on the website at www.example.com.

Permission record status codes¶

Each permission record has one of these status codes:

pending - the permission record has been created, but permission has not yet been requested.
requested - a request for permission has been sent to the authorising agency, but no response has been received.
approved - the authorising agency has granted permission.
rejected - the authorising agency has refused permission.
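The four status codes and the order in which a typical permission request moves through them can be written down as a small Python sketch. The names `PermissionStatus`, `USUAL_NEXT` and `is_usual_step` are hypothetical, and the transition table is an assumption drawn from the descriptions above; the tool does not necessarily enforce this order (a legislative permission, for example, may be recorded as approved straight away).

```python
from enum import Enum


class PermissionStatus(Enum):
    PENDING = "pending"
    REQUESTED = "requested"
    APPROVED = "approved"
    REJECTED = "rejected"


# Usual next steps in the request workflow (an assumption for illustration).
USUAL_NEXT = {
    PermissionStatus.PENDING: {PermissionStatus.REQUESTED},
    PermissionStatus.REQUESTED: {PermissionStatus.APPROVED, PermissionStatus.REJECTED},
    PermissionStatus.APPROVED: set(),
    PermissionStatus.REJECTED: set(),
}


def is_usual_step(current: PermissionStatus, new: PermissionStatus) -> bool:
    """Return True if moving from current to new follows the usual workflow."""
    return new in USUAL_NEXT[current]
```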

URL Patterns¶

URL Patterns are used to describe a portion of the internet that a harvest authorisation applies to.

In the simplest case, a URL can be used as a URL Pattern. In more complex cases, you can use the wildcard * at the start of the domain or end of the resource to match the permission to multiple URLs.

For example:

http://www.alphabetsoup.com/* - include all resources within the Alphabet Soup site (a standard permission granted directly by a company)
http://www.alphabetsoup.com/resource/* - include only the pages within the ‘resource’ section of the Alphabet Soup site
http://*.alphabetsoup.com/* - include all resources on all sub sites of the specified domain.
http://www.govt.nz/* - include all pages on the domain www.govt.nz
http://*.govt.nz/* - include all NZ Government sites
http://*.nz/* - include all sites in the *.nz domain space (this can be used to support a national permission based on government legislation)
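To see how such patterns behave, the wildcard rule can be sketched in a few lines of Python. This is a minimal sketch and not the Web Curator Tool's own matching code: the function name `pattern_matches` is hypothetical, and it assumes that `*` stands for any run of characters while everything else must match literally.

```python
import re


def pattern_matches(pattern: str, url: str) -> bool:
    """Return True if url is covered by a URL pattern.

    Assumption: '*' matches any run of characters (including none);
    all other characters must match exactly.
    """
    # Escape each literal piece, then join the pieces with '.*' for the wildcards.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.fullmatch(regex, url) is not None


in_site = pattern_matches("http://www.alphabetsoup.com/*",
                          "http://www.alphabetsoup.com/resource/a.html")
in_section = pattern_matches("http://www.alphabetsoup.com/resource/*",
                             "http://www.alphabetsoup.com/about.html")
in_govt = pattern_matches("http://*.govt.nz/*", "http://www.govt.nz/index.html")
in_nz = pattern_matches("http://*.nz/*", "http://www.alphabetsoup.com/")
```

Note that a naive wildcard like this lets a leading `*` run past the host name, so a production matcher would need to treat the domain and resource parts of the URL separately.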

How harvest authorisations work¶

Each harvest authorisation contains four major components:

A name and description for identifying the harvest authorisation, plus other general information such as an order number.
One or more authorising agencies, being the person or organisation who authorises the harvest. This is often a website owner or copyright holder. Some authorising agencies may be associated with more than one harvest authorisation.
A set of url patterns that describe the portion of the internet that the harvest authorisation applies to.
One or more permission records that record a specific permission requested from an authorising agency, including
- a set of URL patterns,
- the state of the request (pending, requested, approved, rejected),
- the time period the request applies to, and
- any special conditions or access restrictions (such as ‘only users in the Library can view the content’).

In most cases, only users with specific roles will be allowed to manage harvest authorisations. Unlike some other Web Curator Tool objects, harvest authorisations do not have an “owner” who is responsible for them.

Sample harvest authorisation¶

For example, to harvest web pages from ‘The Alphabet Soup Company’, you might create a harvest authorisation record called ‘Alphabet Soup’. This would include:

general information recording the company name and the library order number for this request:
- Name: ‘Alphabet Soup’
- Order Number: “AUTH 2007/03”
url patterns to identify the company’s three websites:
authorising agencies for the two organisations responsible for the content on these sites:
- The Alphabet Soup Company
- Food Incorporated.
permission records, linking each authorising agency with one or more URL patterns:
- The Alphabet Soup Company to approve restriction-free access, on an open-ended basis, to http://www.alphabetsoup.com/* and http://www2.alphabetsoup.com/*
- Food Incorporated to approve NZ-only access, for the period 1/1/2006 through 31/12/2006, to http://www.alphabetsoup.com/* and http://www2.alphabetsoup.com/*.
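The four components, and part of the sample above, can be pictured as a simple data model. The Python sketch below is illustrative only: the class and field names are hypothetical and are not the tool's actual database schema, and only the Alphabet Soup Company permission record is shown.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class AuthorisingAgency:
    name: str
    contact: str = ""


@dataclass
class PermissionRecord:
    agency: AuthorisingAgency
    url_patterns: List[str]
    status: str = "pending"            # pending, requested, approved or rejected
    start_date: Optional[date] = None
    end_date: Optional[date] = None    # None is used here to mean open-ended
    special_conditions: str = ""


@dataclass
class HarvestAuthorisation:
    name: str
    description: str = ""
    order_number: str = ""
    agencies: List[AuthorisingAgency] = field(default_factory=list)
    url_patterns: List[str] = field(default_factory=list)
    permissions: List[PermissionRecord] = field(default_factory=list)


# Part of the 'Alphabet Soup' sample expressed with these classes.
soup_company = AuthorisingAgency(name="The Alphabet Soup Company")
patterns = ["http://www.alphabetsoup.com/*", "http://www2.alphabetsoup.com/*"]
auth = HarvestAuthorisation(
    name="Alphabet Soup",
    order_number="AUTH 2007/03",
    agencies=[soup_company],
    url_patterns=patterns,
    permissions=[PermissionRecord(agency=soup_company, url_patterns=patterns,
                                  status="approved")],
)
```

Keeping the agency as a separate object mirrors the point made earlier that one authorising agency may be associated with more than one harvest authorisation.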

Harvest authorisation search page¶

The harvest authorisation search page lets you find and manage harvest authorisations.

Figure 2. Harvest Authorisations

At the top of the page are:

Fields to enter search criteria for existing harvest authorisation records (Identifier, Name, Authorising Agent, Order Number, Agency, URL Pattern, Permissions File Reference and Permissions Status), and a search button for launching a search.
A drop-down list for choosing the sort order of the returned results (name ascending, name descending, most recent record displayed first, oldest record displayed first).
A button to create new harvest authorisation requests.

Below that are search results. For each harvest authorisation record found, you can:

- View details

- Edit details

- Copy the harvest authorisation and make a new one.

- Generate a permission request letter.

The first time you visit this page, all the active harvest authorisations for your Agency are shown. You can then change the search parameters. On subsequent visits, the display is the same as the last harvest authorisation search.

All search pages that present results a page at a time let you change the page size from the default of 10 to 20, 50 or 100. Your preference is remembered across sessions in a cookie.

How to create a harvest authorisation¶

From the Harvest Authorisations search page:

Click create new.

The Create/Edit Harvest Authorisations page displays:

Figure 3. Create/Edit Harvest Authorisations

The page includes four tabs for adding or editing information on a harvest authorisation record:

General - general information about the request, such as a name, description and any notes
URLs - patterns of URLs for which you are seeking authorisation
Authorising Agencies - the persons and/or organisations from whom you are requesting authorisation
Permissions - details of the authorisation, such as dates and status.

Enter general information about the request¶

On the General tab, enter basic information about the authorisation request.

Required fields are marked with a red star. When the form is submitted, the system will validate your entries and let you know if you leave out any required information.

To add a note (annotation) to the record, type it in the Annotation text field and click add.

Enter URLs you want to harvest¶

Click the URL Patterns tab.

The URL Patterns tab includes a box for adding URL patterns and a list of added patterns.

Figure 4. URL Patterns tab

Enter a pattern for the URLs you are seeking permission to harvest, and click add. Repeat for additional patterns.

Enter agencies who grant permission¶

Click the Authorising Agencies tab.

The Authorising Agencies tab includes a list of authorising agencies and buttons to search for or create new agencies.

Figure 5. Authorising Agencies tab

To add a new agency, click create new.

The Create/Edit Agency page displays.

Figure 6. Create/Edit Agency

Enter the name, description, and contact information for the agency, and click Save.

The Authorising Agencies tab shows the added agency.

Create permissions record¶

Click the Permissions tab.

The Permissions tab includes a list of permissions requested showing the status, agent, dates, and URL pattern for each.

Figure 7. Permissions tab

The date requested column shows the date that a permission request (email or printed template) was generated.
To add a new permission, click create new.

The Create/Edit Permission page displays.

Figure 8. Create/Edit Permission

Select an agent, enter the dates you want to harvest, tick the URL patterns you want to harvest, enter special restrictions, etc., and click Save.

The Permissions tab redisplays, showing the added permission.

Click Save to save the harvest authorisation request.

The harvest authorisation search page will be displayed.

After adding or editing a harvest authorisation record, you must save before clicking another main function tab (e.g. Targets or Groups), or your changes will be lost.

How to send and/or print a permission request email¶

From the harvest authorisation search page, click the Generate a permission request letter icon next to the harvest authorisation request.
In the next screen choose the template from the dropdown list against the appropriate URL and click

The system generates and displays the letter or Email template (depending on the template chosen)

Figure 9. Email Permission Request Letter

Click to print or e-mail the letter to the agent.

(print-only templates will only allow you to print)

The system sends the letter and changes the permission status to ‘requested’.

Click Done.

The Harvest Authorisations search page redisplays.

How to view or update the status of a permission record¶

Once permission has been granted (or declined)¶

When the authorising agent responds to your request, follow steps 1 through 5 below to change the Status of the permission record to ‘approved’ (if permission is granted) or ‘rejected’ (if permission is declined).

The authorising agent may also specify special conditions, which should be recorded in the permission record at this point.

1. From the harvest authorisation search page, click the Edit details icon next to the harvest authorisation request that includes the permission for which you sent the request letter.

The General tab of the Create/Edit Harvest Authorisations page displays.

2. Click the Permissions tab.

The Permissions tab displays.

3. Click (View) or (Edit) next to the permission for which you sent the request letter.

The Create/Edit Permission page displays.

4. If editing, change the Status of the permission to ‘approved’ or ‘rejected’ as necessary, and click Save.

5. Click Save to close the Harvest Authorisation.

How to edit or view a harvest authorisation¶

Editing an existing authorisation is very similar to the process for creating a new record.

To start editing, go to the harvest authorisation search page, find the harvest authorisation you wish to edit, and click the Edit details icon in the Actions column. This will load the harvest authorisation into the editor. Note that some users will not have access to edit some (or any) harvest authorisations.

An alternative to editing a harvest authorisation is to click the View details icon to open the harvest authorisation viewer. Data cannot be changed from within the viewer. Once in the harvest authorisation viewer you may also switch to the editor using the ‘Edit’ button.

Legislative and other sources of authorisation¶

Some national libraries and other collecting institutions have a legislative mandate to harvest web material within their national jurisdiction, and do not need to request permission from individual copyright holders. In other cases, the library might rely on some other source of authority to harvest material, or may choose to harvest before permission is sought, then seek permission retroactively.

The Web Curator Tool requires that every Seed URL be linked to a permission record. When a library is specifically authorised to perform harvests by legislation, this can seem like a source of inefficiency, as no “permission” is really required.

However, the Web Curator Tool still requires a harvest authorisation record, so that the ultimate source of harvest authority is always documented and auditable.

When the tool is configured correctly, there should be no overhead in most cases, and very little overhead in other cases.

This is possible through two mechanisms. First, the use of broad URL Patterns allows us to create a permission record that is almost always automatically assigned to Seed URLs without requiring any user action. Second, the “Quick Pick” option in permission records makes the permission record an option in the menu used to associate seeds with permission records.
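The two mechanisms can be sketched in Python as follows. This is an illustration under stated assumptions rather than the tool's implementation: the dictionary layout and the function names `auto_assigned` and `quick_picks` are hypothetical, and the standard-library `fnmatchcase` is used as a stand-in for the tool's pattern matching (it also treats `?` and `[...]` as wildcards, which URL patterns may not).

```python
from fnmatch import fnmatchcase

# Hypothetical permission records, reduced to the fields needed here.
permissions = [
    {"display_name": "legal deposit legislation",
     "patterns": ["http://*.nz/*"], "quick_pick": True},
    {"display_name": "Alphabet Soup",
     "patterns": ["http://www.alphabetsoup.com/*"], "quick_pick": False},
]


def auto_assigned(seed, records):
    """Mechanism 1: records whose URL patterns already cover the seed."""
    return [r for r in records
            if any(fnmatchcase(seed, pattern) for pattern in r["patterns"])]


def quick_picks(records):
    """Mechanism 2: display names offered in the Quick Pick menu."""
    return [r["display_name"] for r in records if r["quick_pick"]]


nz_seed = [r["display_name"]
           for r in auto_assigned("http://www.natlib.govt.nz/", permissions)]
other_seed = auto_assigned("http://www.example.org/", permissions)
menu = quick_picks(permissions)
```

A seed in the .nz domain space is covered by the broad pattern without any user action, while a seed outside it matches nothing and would need a permission record chosen by hand, for instance from the Quick Pick menu.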

In practical terms, this means institutions can set up a single harvest authorisation that applies to all their harvesting of their national internet. It should be set up as follows:

general information should give the harvest authorisation a name that refers to the authorising legislation. For example:
- Name: “NZ e-legal deposit”
- Description: “All websites in the New Zealand domain acquired under legal deposit legislation”
url patterns should identify as much of the national internet as possible. For example:
- http://*.nz/*
an authorising agency should describe the government that provided the mandate to harvest. For example:
- Name: “New Zealand Government”
- Contact: “National Librarian”
- Address: “National Library of New Zealand, Wellington”
a permission record should link the authorising agency with the URL patterns, as for other permission records. Some points to note:
- Dates: these fields should specify the date the legislation took (or takes) effect, and are typically open-ended.
- Status: Approved.
- Special restrictions / Access status: if your legislation places any restrictions on how the material may be harvested or accessed, record them here.
- Quick Pick: Selected.
- Display Name: The name used in the “Quick Pick” menu, such as “legal deposit legislation”. The quick pick will show up in the seed tab of the Target record. See the Targets section for more information.