Requirements for Data

Living Standard

This version:
https://netwerk-digitaal-erfgoed.github.io/requirements-data/
Issue Tracking:
GitHub
Inline In Spec
Editors:
David de Boer (Netwerk Digitaal Erfgoed)

Abstract

This document specifies the generic data model to be used when publishing linked data in the heritage network. The model enables dataset consumers, including software developers building data-consuming applications, to use, understand and combine datasets from multiple sources, thus fulfilling the promise of linked data.

Please don’t rely on this document yet as it’s still under discussion and development.

1. Introduction

1.1. Goal

This document prescribes the generic data model to be used when publishing linked data in the heritage network. The model consists of a minimal set of classes and properties. It is based on:

By adhering to this model, dataset publishers ensure that their data is visible and can be consumed and combined with other datasets in the network.

1.2. Scope

These requirements are restricted in three ways:

  1. they apply only to the way published data is expressed, not how it is stored or managed internally;

  2. they prescribe a generic data model and leave the use of domain data models up to dataset publishers;

  3. they bear upon datasets, not their descriptions; for the latter see [NDE-DATASETS].

1.3. Examples

While RDF examples in this document are in the [JSON-LD] RDF serialization, publishers MAY use any RDF serialization format, such as [Turtle] or [N3].

2. Definitions

Data model

Set of classes and their properties that defines how data is expressed.

Generic data model

A simple, shared data model; the scope of this document. See also [NDE-ALIGNMENT]. Can be used alongside domain data models.

Domain data model

A domain-specific data model, such as CIDOC-CRM, Linked Art, RiC-O or RDA. Can be used alongside a generic data model. Adds precision at the cost of complexity. Out of this document’s scope.

3. Data model

3.1. Generic and domain data models

The purpose of generic data models is to integrate data in the heritage network and make it more visible. Domain models are usually more richly populated and provide consumers with more possibilities for further processing, for example in service platforms.

This document is limited to a set of classes and properties that together form the generic data model. For most datasets, this generic data model expresses only a subset of the data properties that are available. If done well, the generic data invites consumers to explore the data in more depth. To facilitate further exploration, publishers MAY use domain data models of their choosing alongside the generic data model. Examples are CIDOC-CRM, Linked Art, RiC-O and RDA.

3.2. Publication principles

With RDF, it’s perfectly fine to express the same data in multiple ways. Therefore, the generic and domain data models MAY coexist in the same information resource.

Combine generic (Schema.org) with domain-specific (Linked Art) modelling.
{
  "@context": {
    "schema": "https://schema.org/",
    "edm": "http://www.europeana.eu/schemas/edm/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dcterms": "http://purl.org/dc/terms/"
  },
  "@id": "https://literatuurmuseum.nl/id/123456789",
  "@type": ["schema:CreativeWork", "schema:VisualArtwork"],
  "schema:name": "Het fluitketeltje en andere versjes",
  "rdfs:label": "Het fluitketeltje en andere versjes",
  "schema:creator": {
    "@type": "schema:Person",
    "@id": "http://data.rkd.nl/artists/8342"
  },
  "dcterms:creator": {
    "@type": "dcterms:Agent",
    "@id": "http://data.rkd.nl/artists/8342"
  }
}

Alternatively, publishers MAY separate the generic data model by using profile-based content negotiation (see [DX-PROF-CONNEG]). To do so, publish a profile with URI https://netwerk-digitaal-erfgoed.github.io/requirements-data/:

Expose the generic data model in its own content-negotiated profile.
# Get the list of profiles.
GET /resource/a?profile=alt HTTP/1.1

# Server responds with a list of profiles that includes the NDE generic data model.
HTTP/1.1 200 OK
Content-Type: application/json

{
    "resource": "http://example.org/resource/a",
    "profiles": [
        {
            "token": "nde",
            "uri": "https://netwerk-digitaal-erfgoed.github.io/requirements-data/",
            "media_types": ["application/ld+json", "text/turtle"]
        },
        ...
    ]
}
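
A consumer could check whether a resource offers the generic data model by scanning such a profile list. The following is a minimal Python sketch and not part of this specification: the function name `find_nde_profile` and the sample response body are assumptions for illustration, and the JSON layout simply mirrors the example above.

```python
import json
from typing import Optional

# Profile URI defined in this section.
NDE_PROFILE_URI = "https://netwerk-digitaal-erfgoed.github.io/requirements-data/"


def find_nde_profile(body: str) -> Optional[dict]:
    """Return the profile entry for the NDE generic data model, or None.

    Assumes the response body follows the layout of the example above:
    a "profiles" list of objects that each carry a "uri" key.
    """
    listing = json.loads(body)
    for profile in listing.get("profiles", []):
        if profile.get("uri") == NDE_PROFILE_URI:
            return profile
    return None


# Hypothetical response body, modelled on the example above.
sample = json.dumps({
    "resource": "http://example.org/resource/a",
    "profiles": [
        {
            "token": "nde",
            "uri": NDE_PROFILE_URI,
            "media_types": ["application/ld+json", "text/turtle"],
        }
    ],
})

profile = find_nde_profile(sample)
```

If a profile entry is found, its `media_types` list tells the consumer which serializations it may request for that profile.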

3.3. Schema.org vocabulary

The generic data model presented in this document is designed as a [SCHEMA-ORG] application profile. The choice of Schema.org is substantiated in Implementation guidelines for NDE alignment § generic-data-model.

While the Schema.org website considers “both 'https://schema.org' and 'http://schema.org' (...) fine”, mixing the namespaces makes it harder to consume datasets.

Therefore, publishers MUST use the https://schema.org/ (HTTPS) namespace for Schema.org, not http://schema.org/ (HTTP).
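
Publishers migrating existing data could detect and rewrite HTTP namespace IRIs before publication. The sketch below is illustrative only: the helper names `uses_http_namespace` and `to_https_namespace` are assumptions, and the code operates on full IRIs as plain strings rather than on a parsed RDF graph.

```python
HTTP_NS = "http://schema.org/"
HTTPS_NS = "https://schema.org/"


def uses_http_namespace(iri: str) -> bool:
    """Return True if the IRI is in the HTTP Schema.org namespace."""
    return iri.startswith(HTTP_NS)


def to_https_namespace(iri: str) -> str:
    """Rewrite an HTTP Schema.org IRI to the HTTPS namespace.

    IRIs outside the Schema.org namespace are returned unchanged.
    """
    if uses_http_namespace(iri):
        return HTTPS_NS + iri[len(HTTP_NS):]
    return iri
```

For instance, `to_https_namespace("http://schema.org/Person")` yields `https://schema.org/Person`, while an IRI such as `http://purl.org/dc/terms/creator` is left untouched.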

3.4. Classes

Publishers MUST type each published resource as one or more of the following classes.

Publishers SHOULD use more fine-grained classes alongside these top-level classes.

A photograph is typed as both top-level schema:CreativeWork and the more specific schema:Photograph:
{
  "@context": "https://schema.org/",
  "@id": "https://example.com/dataset1/resource1",
  "@type": ["CreativeWork", "Photograph"]
}
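
The typing requirement can be checked mechanically. The following Python sketch assumes that the resource's types have already been expanded to full IRIs (for example by a JSON-LD processor) and takes the top-level classes from the headings of sections 3.4.1 to 3.4.6; the function name `has_top_level_class` is hypothetical.

```python
SCHEMA = "https://schema.org/"

# Top-level classes named in sections 3.4.1 to 3.4.6.
TOP_LEVEL_CLASSES = {
    SCHEMA + name
    for name in (
        "CreativeWork",
        "Event",
        "MediaObject",
        "Organization",
        "Person",
        "Place",
    )
}


def has_top_level_class(type_iris) -> bool:
    """Return True if at least one type is a top-level class.

    `type_iris` is an iterable of fully expanded type IRIs.
    """
    return any(type_iri in TOP_LEVEL_CLASSES for type_iri in type_iris)
```

A resource typed as both `https://schema.org/CreativeWork` and `https://schema.org/Photograph` passes this check; one typed only as `https://schema.org/Photograph` does not.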

3.4.1. CreativeWork

3.4.2. Event

3.4.3. MediaObject

Should we only have MediaObject or the more specific types VideoObject, AudioObject, ImageObject etc., too?

3.4.4. Organization

3.4.5. Person

3.4.6. Place

A place with required name and optional address:
{
  "@context": "https://schema.org/",
  "@id": "https://example.com/dataset/place",
  "@type": ["Place"],
  "name": "Anne Frank Huis",
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "Prinsengracht 263",
    "postalCode": "1016 GV",
    "addressLocality": "Amsterdam",
    "addressRegion": "Noord-Holland",
    "addressCountry": "Netherlands"
  }
}

3.5. Properties

This section describes how to express the data using a minimal set of properties (RDF predicates), and their ranges. The range of an RDF predicate is the set of allowed values for that predicate.

Publishers MUST express data with these properties.

3.5.1. Overview

| Property | Description | Range | Cardinality | Usage |
|---|---|---|---|---|
| schema:name | Name of a person, title of a book etc. | Language-tagged string | 1 | Required |
| schema:description | A description of the resource. | Language-tagged string | 0..n | Required (if available) |
| schema:image | An image of the resource. | URL? | 0..n | Required (if available) |
| schema:creator | The creator of the resource. | URI | 0..n | Required (if available) |
| schema:geo | The geographic coordinates or shape of the resource. | GeoCoordinates or GeoShape | 0..n | Required (if available) |
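
The cardinalities in the overview can be turned into a simple pre-publication check. The sketch below is illustrative and makes several assumptions: the resource is a compacted JSON-LD node that uses the Schema.org context (so properties appear as `name`, `description` and so on), and the names `CARDINALITY` and `cardinality_violations` are hypothetical.

```python
# Cardinalities copied from the overview: (minimum, maximum).
# A maximum of None means "unbounded" (the "n" in "0..n").
CARDINALITY = {
    "name": (1, 1),
    "description": (0, None),
    "image": (0, None),
    "creator": (0, None),
    "geo": (0, None),
}


def cardinality_violations(node: dict) -> list:
    """Return the properties whose value count is outside the allowed range."""
    violations = []
    for prop, (minimum, maximum) in CARDINALITY.items():
        values = node.get(prop, [])
        # In compacted JSON-LD a single value is usually not wrapped in a list.
        if not isinstance(values, list):
            values = [values]
        count = len(values)
        if count < minimum or (maximum is not None and count > maximum):
            violations.append(prop)
    return violations
```

Note that this only counts values; it cannot tell whether a "Required (if available)" property was omitted although the publisher holds the data.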

How to describe dates? Require only very weak schema:temporal?

3.5.2. Language

For each property with a literal value, the value’s language MUST be specified. The language MUST be expressed as a language code from [BCP47], such as ‘nl’ or ‘nl-NL’.

Specifying the language of the https://schema.org/name property:
{
  "@context": "https://schema.org/",
  "@id": "https://example.com/dataset1/resource1",
  "@type": ["CreativeWork"],
  "name": {
    "@language": "nl-NL",
    "@value": "Het fluitketeltje en andere versjes"
  }
}
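
A publisher could verify the language requirement on a compacted JSON-LD node before publication. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the regular expression is a deliberately simplified approximation of the language tag syntax, not a full implementation of the [BCP47] grammar.

```python
import re

# Simplified pattern: a two- or three-letter primary language subtag,
# optionally followed by further subtags. NOT the complete BCP 47 grammar.
LANGUAGE_TAG = re.compile(r"^[A-Za-z]{2,3}(-[A-Za-z0-9]{1,8})*$")


def is_language_tagged(value) -> bool:
    """Return True if value is a JSON-LD value object with a plausible language."""
    if not isinstance(value, dict) or "@value" not in value:
        return False
    language = value.get("@language")
    return isinstance(language, str) and LANGUAGE_TAG.match(language) is not None


def untagged_properties(node: dict, literal_properties) -> list:
    """Return those literal properties that have a value without a language."""
    missing = []
    for prop in literal_properties:
        if prop not in node:
            continue
        values = node[prop]
        if not isinstance(values, list):
            values = [values]
        if not all(is_language_tagged(value) for value in values):
            missing.append(prop)
    return missing
```

Applied to the example above with `["name"]` as the list of literal properties, the check reports no problems; a plain string value such as `"name": "Het fluitketeltje"` would be reported.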

3.5.3. name (required)

Use schema:name and/or rdfs:label?

Should we require a language-tagged string? E.g. person (Vincent van Gogh) or organization names (Van Gogh Museum) can be considered to be untagged.

3.5.4. description

3.5.5. image

3.5.6. license

Does license make sense on the level of individual resources? Or perhaps only on certain types of resources, such as media?

3.5.7. author

Use schema:author or schema:creator?

Where available, reference a person from the Network of Terms.

3.5.8. geo

A place with coordinates:
{
  "@context": "https://schema.org",
  "@type": "Place",
  "geo": {
    "@type": "GeoCoordinates",
    "latitude": "40.75",
    "longitude": "-73.98"
  }
}
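
Because coordinates may be serialized as strings, as in the example above, a consumer may want to convert and sanity-check them before use. The following Python sketch is illustrative only: the function name `parse_coordinates` is hypothetical, and the range limits are the conventional bounds for latitude and longitude in decimal degrees rather than a requirement stated in this document.

```python
def parse_coordinates(geo: dict) -> tuple:
    """Convert a GeoCoordinates node to a (latitude, longitude) pair of floats.

    Raises ValueError if a value is not numeric or lies outside the
    conventional ranges of -90..90 (latitude) and -180..180 (longitude).
    """
    latitude = float(geo["latitude"])
    longitude = float(geo["longitude"])
    if not -90.0 <= latitude <= 90.0:
        raise ValueError("latitude out of range: %r" % latitude)
    if not -180.0 <= longitude <= 180.0:
        raise ValueError("longitude out of range: %r" % longitude)
    return latitude, longitude
```

The function accepts both string and numeric values, so `{"latitude": "40.75", "longitude": "-73.98"}` and `{"latitude": 40.75, "longitude": -73.98}` produce the same result.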

4. Formal definition

A formal definition of the generic data model in [SHACL].
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <https://schema.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

_:CreativeWorkShape
    a sh:NodeShape ;
    sh:targetClass schema:CreativeWork ;
    sh:property
        _:NameProperty ,
        _:DescriptionProperty ,
        _:CreatorProperty .

_:NameProperty
    a sh:PropertyShape ;
    sh:path schema:name ;
    sh:datatype rdf:langString ;
    sh:minCount 1 .

_:DescriptionProperty
    a sh:PropertyShape ;
    sh:path schema:description ;
    sh:datatype rdf:langString ;
    sh:minCount 0 .

_:ImageProperty
    a sh:PropertyShape ;
    sh:path schema:image ;
    sh:class schema:ImageObject ;
    sh:minCount 0 .

_:CreatorProperty
    a sh:PropertyShape ;
    sh:path schema:creator ;
    sh:or (
              [ sh:class schema:Person ]
              [ sh:class schema:Organization ]
          ) ;
    sh:minCount 0 .

_:GeoCoordinatesShape
    a sh:NodeShape ;
    sh:targetClass schema:GeoCoordinates ;
    sh:property [
        sh:path schema:latitude ;
        sh:datatype xsd:float  ;
        sh:minCount 1 ;
        sh:maxCount  1 ;
    ] ,
    [
        sh:path schema:longitude  ;
        sh:datatype xsd:float ;
        sh:minCount 1 ;
        sh:maxCount  1 ;
    ] .

_:PlaceShape
    a sh:NodeShape ;
    sh:targetClass schema:Place ;
    sh:property [
        sh:path schema:geo ;
        sh:or (
            [ sh:class schema:GeoCoordinates ]
            [ sh:class schema:GeoShape ]
        ) ;
        sh:minCount 0 ;
        sh:maxCount 1 ;
    ] .

Conformance

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. [RFC2119] However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes.

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

References

Normative References

[BCP47]
A. Phillips, Ed.; M. Davis, Ed. Tags for Identifying Languages. September 2009. Best Current Practice. URL: https://www.rfc-editor.org/rfc/rfc5646
[DX-PROF-CONNEG]
Lars G. Svensson; Rob Atkinson; Nicholas Car. Content Negotiation by Profile. URL: https://w3c.github.io/dx-connegp/connegp/
[JSON-LD]
Manu Sporny; Gregg Kellogg; Markus Lanthaler. JSON-LD 1.0. 3 November 2020. REC. URL: https://www.w3.org/TR/json-ld/
[N3]
Tim Berners-Lee; Dan Connolly. Notation3 (N3): A readable RDF syntax. 14 January 2008. W3C Team Submission. URL: https://www.w3.org/TeamSubmission/2008/SUBM-n3-20080114/
[NDE-ALIGNMENT]
Sjors de Valk; Ivo Zandhuis; Bob Coret. Implementation guidelines for NDE alignment. URL: https://netwerk-digitaal-erfgoed.github.io/cm-implementation-guidelines/
[NDE-DATASETS]
David de Boer; Bob Coret. NDE Requirements for Datasets. Living Specification. URL: https://netwerk-digitaal-erfgoed.github.io/requirements-datasets/
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://datatracker.ietf.org/doc/html/rfc2119
[SCHEMA-ORG]
W3C Schema.org Community Group. Schema.org. 6.0. URL: https://schema.org/
[SHACL]
Holger Knublauch; Dimitris Kontokostas. Shapes Constraint Language (SHACL). URL: https://w3c.github.io/data-shapes/shacl/
[Turtle]
Eric Prud'hommeaux; Gavin Carothers. RDF 1.1 Turtle. URL: https://w3c.github.io/rdf-turtle/spec/
