Implementation guidelines for NDE alignment of cultural heritage data management and publication infrastructure

Living Standard

This version:
https://netwerk-digitaal-erfgoed.github.io/cm-implementation-guidelines/
Editors:
(Dutch Digital Heritage Network)
(Dutch Digital Heritage Network)
(Dutch Digital Heritage Network)

Abstract

This document describes how IT suppliers can help Dutch cultural heritage institutions become more compliant with the principles for cooperation in the Dutch Digital Heritage Network (NDE).

1. Introduction

This document describes how IT suppliers can help Dutch cultural heritage institutions become more compliant with the principles for cooperation in the Dutch Digital Heritage Network (NDE). It is, and will continue to be, developed in full cooperation with suppliers. We ask IT suppliers of data management and publication software to implement the guidelines outlined in this document in their products. This will help Dutch heritage institutions make their information accessible to machines and humans on the web.

The goal of cooperation in the network is to increase the public value of cultural heritage information by improving its visibility, usability and sustainability. A crucial part of this challenge is to improve the capabilities of institutions to share their information in the network and to integrate digital heritage information from various sources to build services for their audiences.

This document aims to initiate an open discussion and cooperation with the technical partners in the network. The final goal is to document practical solutions that are within reach of the various IT systems used by the institutions and developed by the IT suppliers actively cooperating with NDE.

1.1. Relation with other documentation

This document builds upon the Dutch digital heritage reference architecture (DERA) (Dutch) and the National Digital Heritage Strategy (Dutch). Where appropriate, formal specifications of solutions developed in this document will be proposed to the Architecture Board of the DERA to become part of the long-term agreements within the cultural heritage domain.

1.2. Overview

The primary focus of the guidelines in this document is to improve the visibility and interoperability of the heritage collections. This document, therefore, follows the general practice for publishing Linked Data. As stated in the reference architecture (in Dutch), Linked Data is regarded as a key technology in the cultural heritage domain for integrating the vast number of information resources within the network.

The information resources should be self-contained with respect to persistent identifiers (see this section) and metadata (see this section). This metadata should contain as many terms as possible (see this section). Information resources can be combined and published in datasets. Each dataset should be described as a separate resource with a unique identifier and additional metadata describing the characteristics of the dataset (see this section).

1.3. Linked Open Data Principles

Publishing information resources as Linked Data can be done by transforming the internal data of the collection management system to a Linked Data format as described by the Resource Description Framework (RDF). These guidelines describe how to create interoperability with Linked Data; they do not prescribe which techniques must be used to store the data, let alone which technology the systems should use internally.

Describing information resources with the Resource Description Framework means three things, as introduced by Tim Berners-Lee in his TED talk.

1.3.1. All conceptual things should have a name starting with HTTP.

This means that every entity you want to publish an information resource for must have a web address, more precisely a so-called Uniform Resource Identifier (URI). Information resources could for example be the metadata about a painting, a creator, a depicted location or a dataset.

1.3.2. Looking up an HTTP name should return useful data about the thing in question in a standard format.

If you call the web address, the web server must serve the metadata in RDF format. The RDF expresses the metadata attributes, for instance the title of a painting or its date of creation.
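As a minimal sketch, using a hypothetical object URI and properties from Schema.org (the vocabulary recommended later in this document), the RDF returned for a painting could look like this in Turtle:

    @prefix schema: <https://schema.org/> .

    <https://example.org/id/object/12345>
        a schema:Painting ;
        schema:name "Portret van een vrouw"@nl ;
        schema:dateCreated "1642" .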

1.3.3. Anything else that that same thing has a relationship with through its data should also be given a name beginning with HTTP.

In the best metadata, the painting is related to, for instance, its creator using the URI of the creator (e.g. https://data.rkd.nl/artists/66219) instead of the label ("Rijn, Rembrandt van"). This means that you can follow the link and use the web address of the creator to obtain the data about the creator. This way you can, for instance, find the creator's years of birth and death ("1606-1669"). Linking data like this results in a "Web of Data".
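The following Turtle sketch illustrates this principle. The object URI and the choice of properties are illustrative; the creator URI is the RKD artist record mentioned above, although the RKD's actual data model may differ:

    @prefix schema: <https://schema.org/> .

    # The painting refers to its creator by URI, not by a text label.
    <https://example.org/id/object/12345>
        schema:creator <https://data.rkd.nl/artists/66219> .

    # Resolving the creator's URI yields data about the creator.
    <https://data.rkd.nl/artists/66219>
        a schema:Person ;
        schema:name "Rijn, Rembrandt van" ;
        schema:birthDate "1606" ;
        schema:deathDate "1669" .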

1.3.4. More information

This document is not intended as a complete introduction to Linked Data. You can learn the basics of Linked Data from the many introductions available elsewhere on the web.

2. Adding persistent identifiers

The first Linked Data principle, defined by Berners-Lee, is that every information resource needs a web address. This address identifies the resource that you want to share on the Web of Data. Because of this identifying role, such a web address is called a Uniform Resource Identifier.

2.1. Persistent Identifiers

Users of the data need to be able to trust that the identifier will not change. A Persistent Identifier (PID) is a unique identification code attached to a digital object and registered at an agreed location. Using a PID, you can ensure that a link to a digital object keeps working.

There are multiple PID systems, such as Archival Resource Keys (ARKs), Digital Object Identifiers (DOIs), Handle, OpenURL, Persistent Uniform Resource Locators (PURLs) and Uniform Resource Names (URNs). Each system has its own particular properties, strengths and weaknesses. The PID Guide helps heritage organisations learn and think about important PID subjects, and guides them into selecting a PID system.

IT suppliers should be able to offer integration with PID systems so that PIDs can be created, and should also be able to handle PIDs in the import/export process to maintain data durability.

2.2. Uniform Resource Identifier Strategies

As an institution, you must decide what your persistent URIs should look like. Here are some examples from other (heritage) institutions.
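As a purely hypothetical sketch (the domain and path patterns below are illustrative, not prescribed by NDE), a URI strategy often distinguishes the identifier of the thing itself from the document that describes it, and keeps implementation details such as file extensions or system names out of the path:

    https://data.example.org/id/object/12345     # the object itself (persistent identifier)
    https://data.example.org/doc/object/12345    # the document describing the object
    https://data.example.org/id/term/schilderij  # a term from the institution's own thesaurus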

3. Publishing collection information

3.1. Data model

After introducing URIs for the information resources you want to publish, you have to decide which set of properties you use for expressing the data. On the one hand, there is a more generic data model available that is easy to use and combine. On the other hand, a more precise but complicated domain-specific model can be used. If triples are published in both the generic and the domain model, they can all be included in the same information resource.

3.1.1. Generic data model

The main goal for publishing the object descriptions as Linked Data is to improve the data integration and visibility in the network. Because the heritage network spans institutions from the library, archive and museum domains, a generic data model that can support general visibility on the web is needed. To support broader visibility on the web outside the cultural heritage domain, data should be published with Schema.org. Schema.org is a joint initiative of three major search engines, Bing, Google and Yahoo, to set up a shared scheme to structure data. Vocabularies from Schema.org are developed through an open community process.

Thanks to Europeana, most institutions and collection management systems are familiar with Dublin Core (DC). Based on DC, a generic data model for cultural heritage institutions has been defined, the Europeana Data Model (EDM). Recent investigations have shown that well-formed Schema.org data can be transformed to EDM without significant loss of detail. Schema.org contains properties that are very similar to Dublin Core properties. A complete mapping between Schema.org and Dublin Core can be found here.
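As an illustration of the generic model (the URIs are hypothetical and the choice of properties is a sketch, not a normative profile), a basic object description in Schema.org could look like this in Turtle:

    @prefix schema: <https://schema.org/> .

    <https://example.org/id/object/12345>
        a schema:Painting ;
        schema:name "Portret van een vrouw"@nl ;
        schema:creator <https://data.rkd.nl/artists/66219> ;
        schema:dateCreated "1642" ;
        schema:image <https://example.org/images/12345.jpg> .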

Interesting links:

Some basic examples for:

3.1.2. Domain data model

The generic model described above will improve the general visibility of the cultural heritage objects, the terms describing the objects and the datasets in which the objects and terms are grouped. In many cases, this will lead to further exploration of these objects. For specific users or communities, a deeper understanding of the knowledge in the data will be important. Because of the open structure of Linked Data, nothing prevents institutions from simply adding another layer of description to a resource. See for example this object description. So institutions can add extra properties and class definitions to the Linked Data they publish. It is up to the institutions in the separate domains to agree on the use and implementation of these shared standards. Best practices and information about actual implementations will be shared in the network.

In general, CIDOC-CRM (and its derivative Linked Art) can be appropriate for museums. For archives, the RiC-O ontology seems promising, and for libraries, RDA Elements or BIBFRAME could be relevant. The Royal Library of the Netherlands uses Schema.org because this standard contains elaborate classes and properties for publications. How they use Schema.org as a domain model can be studied here.

See Awesome Humanities Ontologies for more options.

3.1.3. Rights

The DERA requires institutions to publish their metadata with an open license (for some institutions this requirement might still be a challenge). The actual access to the digital object, such as an image or audio file, can be restricted; additional license statements for use and reuse of the objects must be specified in the metadata. You can use the schema:license property to relate a license to an information resource. Remember that you have to relate a license to the dataset, the object metadata and the reproduction. The licenses might differ.
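A sketch of how this could look in Turtle, with hypothetical URIs and licenses chosen purely for illustration:

    @prefix schema: <https://schema.org/> .

    # License on the dataset
    <https://example.org/id/dataset/paintings>
        schema:license <https://creativecommons.org/publicdomain/zero/1.0/> .

    # License on the object metadata
    <https://example.org/id/object/12345>
        schema:license <https://creativecommons.org/publicdomain/zero/1.0/> .

    # License on the digital reproduction, which may be more restrictive
    <https://example.org/images/12345.jpg>
        schema:license <https://creativecommons.org/licenses/by-nc/4.0/> .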

3.2. RDF serialization

RDF - the framework to express data in triples - can be formatted in various serializations. Different serializations are useful in different situations.

The oldest type of serialization is RDF/XML. This format is complex and old-fashioned, but it can be very useful in XML ecosystems. The most readable serialization is Turtle; examples in this document are written in this form. Linked Data provided in N-Triples format can be processed easily by all Linked Data tooling. Please offer your data in at least N-Triples format. See for example this photo example that we introduced in Turtle format earlier.
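To illustrate the difference, here is the same (hypothetical) statement in Turtle:

    @prefix schema: <https://schema.org/> .

    <https://example.org/id/object/12345> schema:name "Portret van een vrouw"@nl .

and the equivalent in N-Triples, where every triple is written out in full on a single line, without prefixes:

    <https://example.org/id/object/12345> <https://schema.org/name> "Portret van een vrouw"@nl .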

JSON-LD is the youngest format and comprehensible to web developers who are used to JSON. The format is more difficult for Linked Data applications to process. We recommend only using JSON-LD inline in the landing page of an information resource (see this section).

In almost all programming languages, libraries are available to transform one RDF serialization into another. Check for libraries in your favourite programming language. If you create one type of RDF (e.g. RDF/XML) in your infrastructure, be sure to validate the syntax of your output by trying to transform it into another serialization. You could provide more than one serialization and offer the user a choice with 'content negotiation' (see this section).

Interesting links:

3.3. Publishing Linked Data

Linked Data can be published in several ways. In this section, we will discuss four methods with increasing levels of implementation complexity. User applications may prefer one of these methods to obtain the data for their purpose, so the publication methods should be offered in parallel.

3.3.1. Basic level: Data dumps

Although not a proper Linked Data implementation, publishing a file in RDF format supports basic data services. This section describes the requirements for such a data dump.

Most of the current collection management systems support XML-based export formats. In many cases, a conversion using XSLT Stylesheets is possible to generate a Linked Data dump. The system should support a workflow that can be operated by the system administrator to produce the dump. Remember that a new dump must be created if data is added or corrected. The dump should be described by a dataset description. The link pointing to the dump should be fit for automatic processing and therefore be accessible by HTTP clients, like curl.

A dump could be either a zip file containing files for every information resource or one big file containing all the triples. NDE prefers a data dump in one file containing N-Triples because files with N-Triples from various sources can be easily combined and easily cut into pieces for processing.

In some cases, conversion cannot be done by the collection management system itself and a separate publication functionality is required. This can be another system within the institution or a system run by a partner in the network (an aggregator or other service provider). Export in CSV format could for instance be converted into RDF with the LDWizard.

3.3.2. Web compliance level: resolvable URIs

Although dumps are sufficient to support basic services, proper Linked Data implementations should provide Linked Data per object, called a 'resource' in Linked Data terminology. Each resource in the collection should have a stable URI. This URI should be 'resolvable': the web server provides data about the resource in a format depending on the Accept header of the HTTP request. This functionality is called content negotiation, which enables browsers or other applications to ask for a specific language or representation of the available content. For Linked Data, this mechanism is used to differentiate between human-readable content (in HTML) and machine-readable content (in RDF).
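As a sketch of content negotiation in practice (the URI is hypothetical), a client such as curl can ask for a machine-readable or a human-readable representation of the same resource by setting the Accept header:

    # Request the machine-readable representation (Turtle)
    curl -L -H "Accept: text/turtle" https://data.example.org/id/object/12345

    # Request the human-readable representation (HTML)
    curl -L -H "Accept: text/html" https://data.example.org/id/object/12345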

To improve the overall discoverability on the web, it is recommended to add a representation of the information resource, with concepts and properties from Schema.org, to the landing page of the object on the website. Google recommends doing this by including the Linked Data as JSON-LD in the page. The JSON-LD could be provided separately from the webpage and included with a JavaScript function. See this example.
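A minimal sketch of such an inline JSON-LD block on the landing page (the URIs and values are hypothetical):

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@id": "https://example.org/id/object/12345",
      "@type": "Painting",
      "name": "Portret van een vrouw",
      "creator": { "@id": "https://data.rkd.nl/artists/66219" },
      "dateCreated": "1642"
    }
    </script>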

3.3.3. Advanced level: queryable Linked Data

Instead of building large centralized services, NDE envisions a distributed network of empowered organizations that share and reuse their digital information directly. This means that organizations, whether data providers or service providers, can easily query and combine data from several sources in the network. The first step in this direction is to make queryable Linked Data available. This can be done by implementing a SPARQL endpoint. Another, less demanding technology is the use of Triple Pattern Fragments. This offers an easy way of querying data and is much more scalable, but puts a larger burden on the client for processing the results.
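As a small sketch of what a SPARQL endpoint enables (the property choices assume the generic Schema.org model described earlier), a client could select all paintings with their names:

    PREFIX schema: <https://schema.org/>

    SELECT ?painting ?name WHERE {
      ?painting a schema:Painting ;
                schema:name ?name .
    }
    LIMIT 100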

Creating queryable Linked Data satisfies two needs.

Firstly, it enables users to create a dedicated query for selecting and harvesting specific data. This data could be combined in a thematic data service like Netwerk Oorlogsbronnen or Van Gogh Worldwide. A query-facility like a SPARQL endpoint could be used as a more sophisticated endpoint, in addition to an OAI-PMH endpoint or a custom API.

Secondly, this kind of querying could be used dynamically: whenever a user wants to combine data, various endpoints could be queried and results could be combined on the fly. This way of combining data, although a technological challenge, is already feasible in small-scale applications.

Further exploration in the area of decentralized querying has to take place and NDE invites the IT suppliers to work together on example implementations to determine what practical steps can be made.

3.3.4. Other considerations for Linked Data publication

Linked Data in the core of the collection management systems is not an explicit requirement for NDE. In the future, it will become more important to reorganize the internal data structure of the systems so that it aligns with the Linked Data principles. This will make the transformation to Linked Data more straightforward and will offer more opportunities to profit from the reuse of external Linked Data. NDE welcomes cooperation with IT suppliers considering moving in this direction.

The second area that needs further exploration is the handling of multiple layers of object descriptions, for example originating from different sources of other institutions, digital humanities research projects or crowdsourcing projects. Keeping track of the provenance of different versions or the ability to maintain and publish different (time) versions of the same resource descriptions (e.g. with Memento, Web Annotations) are other areas where exploration needs to be done. For these topics, NDE seeks cooperation with IT suppliers and with the CLARIAH-PLUS project, responsible for building research infrastructures in the Digital Humanities domain.

A third development is the concept of 'profile negotiation'. In addition to content negotiation, profile negotiation enables a user to request data according to a certain data model. In our case, the user could be able to choose between the generic data model or the domain data model.

4. Connecting sources

"Connected sources" are sources that refer to each other and use each other’s information. This is done using terms. Terms are descriptions of concepts or entities, such as subjects, people or places. Each term has a URI as an identifier. Because a URI is a unique web address it makes unambiguously clear which term is meant. Terms and their URIs are managed in terminology sources, such as thesauri, reference lists or classification systems. Examples are AAT, GeoNames or WO2 Thesaurus (in Dutch).

For connecting sources, institutions must assign URIs of terms to the descriptions of their heritage objects. There are various options for this, which are discussed below. The options increase in complexity – for example, they require changes to the collection management system – but also provide better functionality.

4.1. Using your own terminology source

An institution can create and use its own terminology source. Most of the time, a collection management system facilitates the development and maintenance of a thesaurus or a list of persons. These sources contain the terms that the collection manager can use when describing heritage objects. This makes it easy to relate terms. A limitation, however, is that the maintenance of one’s own source can be cumbersome. Another limitation is that the terms in the source may not be known to users outside the institution. This reduces the ability to connect sources from different institutions.

In the terminology management module of a collection management system, the user should be able to relate an internal term to an external one. One could for instance state that the internal term 'schilderij' is an exact match with the external term 'paintings' in the AAT. Other types of relations are possible; for instance, an internal term 'oil painting' can be a narrower term of the AAT term 'paintings' (see: this section). The process of relating terms from one (internal) terminology source to another (external) one is called alignment. Remember that such an external term is defined by a complete URI and should be stored as such, not as a bare identifier that needs to be augmented with a base URI.
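A sketch of such an alignment expressed in SKOS; the internal term URIs are hypothetical and the AAT URI is shown for illustration:

    @prefix skos: <http://www.w3.org/2004/02/skos/core#> .

    # Internal term aligned with the external AAT term for 'paintings'
    <https://example.org/id/term/schilderij>
        a skos:Concept ;
        skos:prefLabel "schilderij"@nl ;
        skos:exactMatch <http://vocab.getty.edu/aat/300033618> .

    # A narrower internal term, mapped to the broader AAT term
    <https://example.org/id/term/olieverfschilderij>
        a skos:Concept ;
        skos:prefLabel "olieverfschilderij"@nl ;
        skos:broadMatch <http://vocab.getty.edu/aat/300033618> .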

If a cultural heritage institution develops its own terminology source and uses this internal one instead of an external one, the terminology source apparently has added value for the field. For that reason, the collection management system must be able to publish the terminology source, with its alignments, as Linked Data (see: this section).

4.2. Using external terminology sources

Using terms that are not connected to the network of internationally standardized terminology sources reduces the ability to connect sources from different institutions. We recommend using as many standardized terminology sources as possible. Terms can be linked manually or imported in bulk. If you import a terminology source into the collection management system, a data curator can use a selection function to link to this terminology source. An imported terminology source must be updated frequently, so a more sustainable solution is to connect to a terminology source in real time on the web. The NDE facility Network of Terms can help you create one single connection for a larger collection of terminology sources.

4.2.1. Manually searching a terminology source

This is the simplest, low-tech approach to assign the URI of a term to the description of a heritage object. A collection manager goes to the online system of a terminology source and searches for a term, for example, 'paintings' in AAT. The manager then copies the URI of the term, goes back to her collection management system and pastes the URI in a designated field. This approach requires few changes to the collection management system: a field has to be created to store the URI. On the other hand, this approach is cumbersome: a manager has to perform some actions manually.

4.2.2. Adding terms in bulk

Instead of adding a (URI of a) term to an object description one by one, the curator could export the data from her system and add the terms and their URIs in bulk. This could be done with applications like OpenRefine. For this approach, the collection management system must be able to export the data, for instance in CSV format, and import the new data with its additional link to the object.

An existing field containing only a string, for instance 'schilderij', can be used to derive a proper URI for a term. By using search-and-replace, the link can be added in an extra field. This type of work is called reconciliation.

To reconcile against a terminology source, it needs to offer a reconciliation service. For sources that do not, you can use the Network of Terms reconciliation service.

4.2.3. Importing a terminology source

An institution can import one or more terminology sources into its collection management system. This makes it easy to use terms: the collection manager can search for terms in her management system, without having to switch to the system of the source, as is the case when searching terms manually (see this section). A limitation is that the manager does not use current data: an import takes place periodically – for example: once a day or month – so that changes to the terminology source are not immediately usable.

4.2.4. Searching a terminology source in real-time

An institution can query a terminology source in real-time in its collection management system. A collection manager first enters a search query in a search field. Next, her management system – in the background – queries the terminology source, retrieves the terms that correspond to the query and presents them to the manager. The manager then chooses the desired term and the management system stores its URI. This ensures that the institution works with current data. However, the implementation of this approach in the management system can entail some difficulties: a connection must be established with the source – or multiple connections if the institution wants to use multiple sources.

4.2.5. Using the Network of Terms

An institution can use the Network of Terms. The Network of Terms is a search engine for terms: it searches one or more terminology sources in real-time. The Network of Terms can be used in a collection management system. A collection manager first enters a search query in a search field. Next, her management system – in the background – queries the Network of Terms, retrieves the terms that correspond to the query and presents them to the manager. The manager then chooses the desired term and the management system stores its URI. This makes it easier to use multiple terminology sources: you only need to connect to the Network of Terms.

4.2.6. Indexing data from the terminology source

If (the URI of) a term is stored, collection managers might also want to store related data, for instance alternative labels, pseudonyms, or labels in different languages, in the collection management system. This helps the user of the collection management system to retrieve data from her system. As long as the web of data is not as decentralized as we would wish (see this section), collection management systems could offer to obtain the needed data by resolving the URI (see this section) and to store and index this extra data in the database. The related data must be synchronized periodically and the collection management user interface must have a function to force this synchronization.

4.3. Publishing terminology sources

"Published terminology sources" are sources that are important to the cooperating institutions in the Dutch Digital Heritage Network. It enables collection managers to find and use terms in the sources more easily when describing their heritage objects. And that makes digital heritage easier to find by users. Terminology sources should comply with these requirements (in Dutch).

A terminology source is a dataset with sustainable URIs as described in this section and described according to the guidelines in this section. A terminology source can be published as a dump, with resolvable URIs and in a SPARQL endpoint as described in this section, but your terminology source can only be incorporated into the Network of Terms for others to use if you create a SPARQL endpoint.

A terminology source can only be used if it meets a few conditions. Are you the manager of a source and are you considering making it available to the network? Then take the following aspects into account.

4.3.1. Authority

Terms are not just ordinary words: they are official notations. The manager of a terminology source guarantees their quality. For example, terms must be accurate and current. Another example: the manager ensures that her terms are permanently available and can be used for a long time. This makes a terminology source an authority on which other institutions can build.

4.3.2. Technology and data model

A terminology source is used by both humans and machines. For example, institutions' collection management systems must be able to connect to the source. To make this possible, the source must meet certain technological requirements. These requirements follow the Linked Data principles and best practices. This means, for example, that every term in the source gets a sustainable URI; when dereferencing this URI, metadata about the term is returned. This metadata must be expressed according to a specific RDF metadata model, for example, Simple Knowledge Organization System (SKOS) or Schema.org. SKOS contains various types of relations, e.g. skos:exactMatch, skos:closeMatch or skos:broader, to store an alignment between your terminology source and terms defined in other terminology sources.
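A sketch of how a single term in a published terminology source could be expressed in SKOS (all URIs are hypothetical):

    @prefix skos: <http://www.w3.org/2004/02/skos/core#> .

    <https://example.org/id/term/olieverfschilderij>
        a skos:Concept ;
        skos:inScheme <https://example.org/id/thesaurus> ;
        skos:prefLabel "olieverfschilderij"@nl ;
        skos:prefLabel "oil painting"@en ;
        skos:altLabel "schilderij in olieverf"@nl ;
        skos:broader <https://example.org/id/term/schilderij> .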

In addition, terms in the source must be searchable through a SPARQL endpoint. This enables a connection to the Network of Terms.

4.3.3. Cooperation

A terminology source is used by institutions. There is a good chance that these institutions do not use one, but several sources. To increase the ease of use of sources, collaboration between source managers and collection managers is essential. For example, about how collection managers can propose changes to terms: this could follow a uniform process. The collection management system should support these processes. Another example: two sources may define the same term (such as 'painting'). The sources should then jointly agree that these terms have the same meaning and refer to each other’s terms – this clarifies the relationship between both terms for collection managers.

5. Publishing dataset descriptions

To increase the findability of the datasets of heritage institutions, it is important to publish the dataset descriptions according to a well-documented, machine-readable model. When dataset descriptions are used, the findability and use of the datasets that heritage institutions make available will improve. Publishing dataset descriptions should follow these requirements, which contain a more elaborate description of the publishing process.

5.1. Publication model for dataset descriptions

NDE has created a data model for the description of datasets based on Schema.org. For datasets, the class https://schema.org/Dataset has been adopted.
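A minimal sketch of such a dataset description, shown here in Turtle (the URIs, titles and exact set of properties are illustrative; consult the NDE publication model for the normative requirements):

    @prefix schema: <https://schema.org/> .

    <https://example.org/id/dataset/paintings>
        a schema:Dataset ;
        schema:name "Paintings collection"@en ;
        schema:description "Object descriptions of the paintings in the collection."@en ;
        schema:license <https://creativecommons.org/publicdomain/zero/1.0/> ;
        schema:publisher <https://example.org/id/organisation/museum> ;
        schema:distribution [
            a schema:DataDownload ;
            schema:contentUrl <https://example.org/dumps/paintings.nt> ;
            schema:encodingFormat "application/n-triples"
        ] .

The same description can be serialized as JSON-LD, which is the preferred form for publication on the web, as discussed below.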

5.2. How to publish dataset descriptions

For good findability, every dataset description must be accessible via the web, must be readable for humans and machines, and use the publication model. The serialization of dataset descriptions can take several forms. JSON-LD is preferred.

A user can obtain a dataset description with a sustainable URI as described in this section and published as a resolvable URI as described in this section.

Most (automated) users expect to find the metadata inline in the human-readable webpage itself. Spiders of search engines such as Google might not follow linked JSON-LD files; even if the linked files are "injected" into the page via JavaScript, most spiders do not pick up on this. There are more serializations of RDF, such as RDF/XML and Turtle, but spiders of search engines such as Google currently only support microdata, RDFa and JSON-LD. Because search engine findability is an important driver, the use of inline JSON-LD is recommended. However, this does not prevent the publication of the dataset description directly (not inlined) in other serialization formats via content negotiation.

5.3. Spreading the knowledge about datasets

The choice of Schema.org as the basis for the publication model for dataset descriptions comes with the benefit that this model is known (and used) by the big search engines. By adding the dataset description pages to the sitemap of the website, search engines are likely to pick up the dataset descriptions and make them available in tools like Google Dataset Search.

By registering the datasets with the NDE Register, the network of Dutch heritage institutions can easily find relevant datasets to link with. More generic (open) dataset registers, like data.overheid.nl, can also be used to promote the datasets of the institutions. Collection management systems could provide functions to register datasets on various platforms.

6. International Image Interoperability Framework (IIIF)

The International Image Interoperability Framework, abbreviated as IIIF and pronounced as "triple-eye-eff" is a set of standards and APIs designed to provide a consistent and interoperable way of delivering, navigating, and manipulating digital images and audio/visual files across different platforms and organizations.

Implementing IIIF offers several benefits that can enhance the accessibility, usability, and discoverability of digital collections. IIIF does not prescribe any particular software, but determines how different pieces of software communicate with each other. If this communication is standardized across organizations, the same viewer software can be used for displaying different types of digitized media.

6.1. Image API

Depending on the implementation level, the IIIF Image API supports requests for specific parts, sizes and orientations of images based on a URL scheme. This can be compared to tiled web maps that split up large raster maps into small square tiles at various zoom levels. Viewers can request only those tiles needed for the current viewport and zoom level, and do not need to load the complete high-resolution image. Unlike maps, the coordinate system of IIIF is based on the original pixel dimensions of the image.
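To illustrate the URL scheme (the server and image identifier are hypothetical; the pattern follows the {region}/{size}/{rotation}/{quality}.{format} structure of the Image API):

    # Full image at maximum available size
    https://images.example.org/iiif/object-12345/full/max/0/default.jpg

    # A 512x512 pixel region starting at (1024,1024), scaled to 256 pixels wide
    https://images.example.org/iiif/object-12345/1024,1024,512,512/256,/0/default.jpg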

6.2. Presentation API

While the Image API provides access to individual images, it does not describe the relationship between groups of images, such as the individual images that make up the pages of a book. A IIIF manifest is a JSON-LD document that describes the structure and metadata of a digital object or collection of objects, such as images, audio, or video. It is based on the IIIF Presentation API and comprises several key components, including pages, canvases, table of contents (ranges), and annotations.
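A much-reduced sketch of a manifest for a single image, loosely following the structure of the Presentation API version 3 (all identifiers are hypothetical; see the IIIF specification for the full requirements):

    {
      "@context": "http://iiif.io/api/presentation/3/context.json",
      "id": "https://example.org/iiif/manifest/12345",
      "type": "Manifest",
      "label": { "nl": [ "Portret van een vrouw" ] },
      "items": [
        {
          "id": "https://example.org/iiif/canvas/12345-1",
          "type": "Canvas",
          "height": 3000,
          "width": 2000,
          "items": [
            {
              "id": "https://example.org/iiif/page/12345-1",
              "type": "AnnotationPage",
              "items": [
                {
                  "id": "https://example.org/iiif/annotation/12345-1",
                  "type": "Annotation",
                  "motivation": "painting",
                  "target": "https://example.org/iiif/canvas/12345-1",
                  "body": {
                    "id": "https://images.example.org/iiif/object-12345/full/max/0/default.jpg",
                    "type": "Image",
                    "format": "image/jpeg",
                    "height": 3000,
                    "width": 2000
                  }
                }
              ]
            }
          ]
        }
      ]
    }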

The ability to overlay any IIIF-compliant map image on top of a standard web map, with no specialized GIS software or file formats required (used to connect historical maps to modern ones, also known as georeferencing), is described in the Georeference Extension specification as an extension of the Presentation API.

Conformance

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

References

Normative References

[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://datatracker.ietf.org/doc/html/rfc2119