Requirements for Datasets

Living Standard

This version:
https://netwerk-digitaal-erfgoed.github.io/requirements-datasets/
Issue Tracking:
GitHub
Inline In Spec
Editors:
David de Boer (Netwerk Digitaal Erfgoed)
Bob Coret (Netwerk Digitaal Erfgoed)

Abstract

This document describes requirements for publishing datasets. By following these requirements, publishers enable users to find and use their datasets.

This document is a working draft.

1. Introduction

This section is non-normative.

To enable datasets to be found and used, they must be published according to a well-documented, shared and machine-readable publication model.

This document describes such a model and its rules. When publishers make their datasets adhere to these rules, they enable consumers – both humans and machines – to use the published data in new and better ways.

The requirements prescribe the metadata that publishers should provide with their data.

1.1. Audience

This document is mainly geared towards two groups of readers.

Digital heritage collection managers can follow the requirements in this document to make their published datasets findable and usable, for instance through Google Dataset Search and the NDE Registry.

Suppliers of collection management systems can implement these requirements in their software to help collection managers using it to publish datasets in the correct format. These requirements are scoped to the online publication output of collection management systems; they do not prescribe how those systems should store data internally.

1.2. Context

While focused on digital heritage institutions in The Netherlands, this document is based on broader, international best practices for publishing datasets, including [DWBP-UCR], [DWBP] and [LD-BP].

These requirements incorporate a previous publication model, which provides more background on choices made here.

1.3. Code examples

RDF code examples are in the [SCHEMA-ORG] vocabulary, serialized as [JSON-LD].

While other vocabularies, such as [VOCAB-DCAT-2], can also be used, Schema.org has the advantage that search engines pick it up more readily. That improves findability, one of the main goals of publishing datasets on the web.

2. Definitions

Dataset

The description of a collection of data. A set of metadata that includes the dataset’s name and publisher. The data objects themselves are not part of the dataset proper but provided in its distributions.

For example: imagine a dataset of Van Gogh paintings called ‘Sunflowers’, which is published by the Van Gogh Museum under a specific license. These are all part of the dataset. The dataset also tells us the URLs of distributions where we can download or query the data. If we consult the data provided by one of these distributions, we’ve left the sphere of the dataset proper. That is to say, the data itself, which may include descriptions of paintings, persons and places, are not properties of the dataset.

Distribution

A channel through which a dataset is available, for example a CSV file download or a SPARQL endpoint.

Application Programming Interface (API)

TODO

Machine-readability

TODO

Publisher

An individual or organization that provides one or more datasets.

Is this a good translation for ‘bronhouder’? Add examples. This would probably be ‘heritage institution’ in DERA.

Consumer

An organization or an individual that uses one or more datasets that are provided by a publisher.

add examples

3. Conceptual model

Figure: conceptual model. A Dataset has distributions (DataDownload), a creator (Organization or Person) and a license (URL).

4. Requirements

4.1. Available in RDF

For data to be accessible to machines, it needs to be published in an RDF format. RDF formats include [JSON-LD], [N3] and [Turtle].

Publishers MUST make their dataset description available in RDF.

Both the Schema.org and DCAT vocabularies MAY be used; Schema.org is recommended.

Google recommends including the JSON-LD directly in the HTML source of web pages.

So, on your organization’s web page, for instance www.kb.nl, include:

<html>
  <head>
    <title>Koninklijke Bibliotheek</title>
    <script type="application/ld+json">
      {
        "@context": "https://schema.org/",
        "@type": "Organization",
        "name": "Koninklijke Bibliotheek",
        "url": "https://www.kb.nl/"
      }
    </script>
  </head>
  <body>
    Here continues the web page of the organization...
  </body>
</html>    

See also Data on the Web Best Practices Use Cases & Requirements §R-FormatMachineRead.
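
To check that an embedded description can actually be read back by a machine, a consumer-side sketch such as the following may help. It uses only the Python standard library; the class and function names are illustrative and not part of these requirements.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when a JSON-LD script element opens.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so collect them all.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Parse the buffered text as JSON once the script element closes.
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def extract_jsonld(html):
    """Return a list with one parsed JSON value per JSON-LD script element."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents
```

Feeding the example page above to `extract_jsonld` should yield a single dictionary whose `@type` is `Organization`. Invalid JSON inside the script element raises `json.JSONDecodeError`, which is a useful early warning for publishers.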

4.1.1. Durable identifiers

Consumers want to refer to datasets. They prefer to do so by linking to them.

Therefore, publishers MUST maintain a permanent and unique identifier for each dataset, as well as for their organization and dataset information. Publishers SHOULD use IRIs as identifiers.

See also Data on the Web Best Practices Use Cases & Requirements §R-UniqueIdentifier and Data on the Web Best Practices Use Cases & Requirements §R-PersistentIdentification.
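
As a rough illustration, the two properties that can be checked mechanically (the identifier is an absolute IRI, and no identifier is used twice) might be tested as follows. Restricting identifiers to the http and https schemes is an assumption of this sketch, not a rule from this document, and permanence cannot be verified by code at all.

```python
from collections import Counter
from urllib.parse import urlsplit


def is_absolute_http_iri(identifier):
    """True if the identifier has an http(s) scheme and a host part.

    Assumption of this sketch: only http and https IRIs are accepted.
    """
    parts = urlsplit(identifier)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def duplicate_ids(descriptions):
    """Return the @id values that occur more than once, sorted."""
    counts = Counter(d["@id"] for d in descriptions if d.get("@id"))
    return sorted(identifier for identifier, n in counts.items() if n > 1)
```

For instance, `is_absolute_http_iri("rise-alba")` is false because a relative reference has no scheme, so it cannot serve as a link target on its own.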

4.1.2. Information remains available

Datasets will be used by all kinds of consumers and their systems. For stability, users must be able to trust that the datasets will remain available so they can be consulted in the future.

Therefore, publishers MUST ensure that information remains available in the future.

4.2. Dataset information

Consumers want to consult information about the dataset to decide whether and how they want to use its data.

4.2.1. Basic information

Publishers MUST include basic information about the dataset, at the very minimum its [IRI] and name.

Basic dataset information:
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "@id": "http://data.bibliotheken.nl/id/dataset/rise-alba",
  "name": "Alba amicorum van de Koninklijke Bibliotheek"
}

4.2.2. License

Publishers MUST make known under which license the dataset can be used. This SHOULD be an open license that allows the data to be used by consumers, for example one of the Creative Commons licenses.

Specify a license for the dataset:
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "license": "http://creativecommons.org/publicdomain/zero/1.0/"
}

See also Data on the Web Best Practices Use Cases & Requirements §R-LicenseAvailable.

4.2.3. Creation, publication and modification dates

Publishers SHOULD make known when the dataset description was originally created, when it was published and when it was last updated.

Specify dataset description dates:
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "dateCreated": "2019-04-14",
  "datePublished": "2019-05-21",
  "dateModified": "2019-08-15"
}
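
A small sketch for catching malformed date values before publishing. It assumes plain calendar dates written as YYYY-MM-DD, as in the example above; Schema.org also permits datetime values, which this sketch would report as invalid. The function name is illustrative.

```python
from datetime import date

DATE_PROPERTIES = ("dateCreated", "datePublished", "dateModified")


def invalid_dates(dataset):
    """Return the date properties whose value cannot be parsed as a calendar date.

    Properties that are absent are skipped, since all three are recommended
    rather than required.
    """
    invalid = []
    for prop in DATE_PROPERTIES:
        value = dataset.get(prop)
        if value is None:
            continue
        try:
            date.fromisoformat(value)
        except (TypeError, ValueError):
            # TypeError: value is not a string; ValueError: not a valid date.
            invalid.append(prop)
    return invalid
```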

4.2.4. Version

A dataset description may change over time. Consumers, such as researchers, may want to determine which information was valid at a certain moment.

Therefore, publishers SHOULD keep historical versions of the dataset description accessible to users, in addition to publishing the current version.

It is up to the publisher to determine when to publish new versions.

See also Data on the Web Best Practices §dataVersioning.

4.2.5. More information

If more information is available, publishers SHOULD add it.

More information about the dataset:
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "description": "Alba amicorum van de Koninklijke Bibliotheek, een dataset gedefinieerd voor het Europeana Rise of Literacy project.",
  "url": "https://www.kb.nl/bronnen-zoekwijzers/kb-collecties/moderne-handschriften-vanaf-ca-1550/alba-amicorum",
  "keywords": [
    "alba amicorum"
  ]
}

See § 4.6.1 Dataset attributes for an overview of attributes.

4.3. Organization information

Users want to know where the dataset came from (provenance). Providing information about the organization behind the dataset (the publisher) answers that question.

Therefore, publishers MUST publish basic information about the organization. At the least, the organization’s name and URL MUST be provided.

An organization description:
{
  "@context": "https://schema.org/",
  "@type": "Organization",
  "name": "Koninklijke Bibliotheek",
  "url": "https://www.kb.nl/"
}

See § 4.6.2 Organization attributes for a full overview of organization attributes.

4.3.1. Dataset creator

The organization data is then included as the dataset’s creator:

A dataset with a creator. The creator has some optional attributes.
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "creator": {
    "@type": "Organization",
    "name": "Koninklijke Bibliotheek",
    "url": "https://www.kb.nl/",
    "alternateName": "KB",
    "sameAs": "https://ror.org/02w4jbg70"
  }
}

4.4. Distributions

Consumers that are interested in a dataset should be able to access the data in it. Distributions tell consumers where and how they can get the data.

Therefore, publishers SHOULD add at least one distribution. Each distribution MUST have at least a MIME format and the URL where the distribution can be accessed.

Distributions can be data dumps in RDF or some other format such as CSV, live SPARQL endpoints, OAI-PMH endpoints or other APIs.

A minimal definition of a SPARQL endpoint distribution. In the Schema.org vocabulary, each type of distribution is called a DataDownload, even if it is a query endpoint.
{
  "@context": "https://schema.org/",
  "@type": "DataDownload",
  "encodingFormat": "application/sparql-query",
  "contentUrl": "http://vocab.getty.edu/sparql"
}

The distributions are then included under the distribution attribute with the dataset.

A dataset with three distributions:
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "distribution": [
    {
      "@id": "http://vocab.getty.edu/aat/sparql",
      "@type": "DataDownload",
      "encodingFormat": "application/sparql-query",
      "contentUrl": "http://vocab.getty.edu/sparql"
    },
    {
      "@type": "DataDownload",
      "encodingFormat": "application/ld+json",
      "contentUrl": "http://data.bibliotheken.nl/id/dataset/rise-alba.json"
    },
    {
      "@type": "DataDownload",
      "encodingFormat": "text/csv",
      "contentUrl": "https://example.com/data.csv"
    }
  ]
}

See § 4.6.3 Distribution attributes for a full overview.
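
The rule that every distribution carries a MIME format and an access URL lends itself to an automated check. The following is a minimal sketch, assuming the dataset description has already been parsed into a Python dictionary; the function name and the message wording are illustrative.

```python
REQUIRED_DISTRIBUTION_PROPERTIES = ("encodingFormat", "contentUrl")


def distribution_problems(dataset):
    """List the required properties that are missing from each distribution."""
    distributions = dataset.get("distribution", [])
    # JSON-LD allows a single value instead of an array; normalize to a list.
    if isinstance(distributions, dict):
        distributions = [distributions]
    problems = []
    for index, distribution in enumerate(distributions):
        for prop in REQUIRED_DISTRIBUTION_PROPERTIES:
            if not distribution.get(prop):
                problems.append(f"distribution[{index}] is missing {prop}")
    return problems
```

An empty result means every listed distribution has both properties. A dataset without any distribution also yields an empty list, because adding a distribution is recommended rather than required.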

TODO Elaborate (and give examples of) the types of distribution (in schema:name), like SPARQL endpoint, OAI-PMH endpoint, LDF endpoint, data dump, OpenSearch API. Or introduce a proprietary property for the type (might be better than using the free-format name).

4.4.1. Developer documentation

If the distribution is a non-standard API, publishers MUST publish documentation on how to use that API.

Can we specify a Distribution property for this?

See also Data on the Web Best Practices Use Cases & Requirements §R-APIDocumented.

4.5. Data catalog

A collection of datasets is a data catalog. A catalog provides consumers with a complete overview of available datasets, which improves discoverability.

Therefore, publishers SHOULD provide a catalog.

Figure: a Catalog contains 1..n Datasets, and a Dataset is available via 1..n Distributions.

Catalog: a compiled collection of metadata about datasets.
Dataset: a general description of a collection of data, published or compiled by a single administrator.
Distribution: describes the channels through which a dataset is made available.

A catalog of available datasets:
{
  "@context": "https://schema.org/",
  "@type": "DataCatalog",
  "name": "Linked Data KB",
  "description": "Alle linked data zoals beschikbaar gesteld door de Koninklijke Bibliotheek.",
  "publisher": {
    "@type": "Organization",
    "name": "Koninklijke Bibliotheek",
    "url": "https://www.kb.nl/"
  },
  "dataset": [
    {
      "@type": "Dataset",
      "@id": "http://data.bibliotheken.nl/id/dataset/rise-alba",
      ...
    }, 
    { 
      ...
    }
  ]
}
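
Systems that already hold dataset descriptions could assemble such a catalog programmatically. The sketch below is one possible approach, not a prescribed one: the function name and parameters are hypothetical, and dropping the nested `@context` is a design choice made here so that the context is declared once at the top level.

```python
def build_catalog(name, description, publisher, datasets):
    """Wrap dataset descriptions (dictionaries) in a schema:DataCatalog."""
    if not datasets:
        # The catalog's dataset property has cardinality 1..n.
        raise ValueError("a catalog needs at least one dataset")
    return {
        "@context": "https://schema.org/",
        "@type": "DataCatalog",
        "name": name,
        "description": description,
        "publisher": publisher,
        # Copy each dataset without its own @context; the top-level one applies.
        "dataset": [
            {key: value for key, value in dataset.items() if key != "@context"}
            for dataset in datasets
        ],
    }
```

The result can be serialized with `json.dumps(catalog, indent=2, ensure_ascii=False)` and embedded in a web page in the same way as a single dataset description.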

4.6. Overview of attributes

This is an overview of required and recommended attributes.

4.6.1. Dataset attributes

schema:Dataset properties
Property | Description | Cardinality | Usage
@id | The [IRI] of the dataset. | 1 | Required
schema:name | See § 4.2.1 Basic information. | 1 | Required
schema:description | See § 4.2.5 More information. | 0..1 | Recommended
schema:creator | The owner of the dataset. | 1 | Required
schema:license | See § 4.2.2 License. | 1 | Required
schema:distribution | See § 4.4 Distributions. | 0..n | Recommended
schema:publisher | The publisher of the dataset. | 0..1 | Recommended
schema:dateCreated | See § 4.2.3 Creation, publication and modification dates. | 0..1 | Recommended
schema:datePublished | See § 4.2.3 Creation, publication and modification dates. | 0..1 | Recommended
schema:dateModified | See § 4.2.3 Creation, publication and modification dates. | 0..1 | Recommended
schema:version | See § 4.2.4 Version. | 0..1 | Recommended
schema:mainEntityOfPage | URL of a webpage where the dataset is described. | 0..1 | Recommended
schema:inLanguage | Language or languages in which the dataset is available. Use one of the language codes from [BCP47], such as "nl-NL". | 0..n | Recommended
schema:isBasedOnUrl | The URL of the dataset that this dataset is based on. | 0..1 | Recommended
schema:citation | A citation or reference for the dataset. | 0..1 | Recommended
schema:genre | The genre or genres that describe the dataset. | 0..n | Recommended
schema:keywords | One or more keywords that describe the dataset. | 0..n | Recommended
schema:spatialCoverage | Indicates the place(s) that are the focus of the dataset. | 0..1 | Recommended
schema:temporalCoverage | Indicates the period that the dataset applies to. | 0..1 | Recommended
schema:includedInDataCatalog | The URI of the data catalog in which the dataset is included. | 0..1 | Recommended
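
The rows marked Required in this overview can be turned into a simple completeness check. This is a minimal sketch that only tests for the presence of a non-empty value; it does not validate the values themselves, and the names used are illustrative.

```python
# The properties marked "Required" in the schema:Dataset overview.
REQUIRED_DATASET_PROPERTIES = ("@id", "name", "creator", "license")


def missing_dataset_properties(dataset):
    """Return the required properties that are absent or empty, in table order."""
    return [prop for prop in REQUIRED_DATASET_PROPERTIES if not dataset.get(prop)]
```

A description that contains only `@id` and `name`, like the basic example in § 4.2.1, would be reported as missing `creator` and `license`.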

4.6.2. Organization attributes

schema:Organization properties
Property | Description | Cardinality | Usage
schema:name | The organization’s full name. | 1 | Required
schema:url | A link to the organization’s website. | 1 | Required
schema:alternateName | Alternative names such as an abbreviation that the organization is known under. | 0..n | Recommended
schema:sameAs | Links to the organization in other databases. | 0..n | Recommended

4.6.3. Distribution attributes

schema:DataDownload properties
Property | Description | Cardinality | Usage
@id | The IRI where the distribution is available in a browser. | 1 | Required
schema:name | Name of the distribution, indicating the type of distribution. | 0..1 | Recommended
schema:description | A description of the distribution. | 0..1 | Recommended
schema:encodingFormat | The distribution’s MIME format, for example application/sparql-query for a SPARQL endpoint or application/ld+json for a data dump serialized as JSON-LD. See [IANA-MEDIA-TYPES] and MDN-reference for more information. | 1..n | Required
schema:contentUrl | The URL where the distribution can be directly accessed. | 1 | Required
schema:datePublished | Date (or datetime) the distribution was published. | 0..1 | Recommended
schema:dateModified | Date (or datetime) the distribution was last modified. | 0..1 | Recommended
schema:inLanguage | Language or languages in which the distribution is available. Use one of the language codes from [BCP47], such as "nl-NL". | 0..1 | Recommended
schema:license | See § 4.2.2 License. | 0..1 | Recommended
schema:contentSize | A measure indicating the size of the distribution. | 0..1 | Recommended

4.6.4. DataCatalog attributes

schema:DataCatalog properties
Property | Description | Cardinality | Usage
schema:name | The name of the data catalog. | 1 | Required
schema:description | A description of the data catalog. | 1 | Required
schema:publisher | The publisher of the data catalog. | 1 | Required
schema:dataset | The datasets in the catalog, including all dataset information. | 1..n | Required

4.6.5. Full example

A full dataset description that includes basic information (§ 4.2.1 Basic information), a license (§ 4.2.2 License), a creator (§ 4.3.1 Dataset creator) and distributions (§ 4.4 Distributions):
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "@id": "http://data.bibliotheken.nl/id/dataset/rise-alba",
  "name": "Alba amicorum van de Koninklijke Bibliotheek",
  "description": "Alba amicorum van de Koninklijke Bibliotheek, een dataset gedefinieerd voor het Europeana Rise of Literacy project.",
  "url": "https://www.kb.nl/bronnen-zoekwijzers/kb-collecties/moderne-handschriften-vanaf-ca-1550/alba-amicorum",
  "keywords": [
    "alba amicorum"
  ],
  "license": "http://creativecommons.org/publicdomain/zero/1.0/",
  "creator": {
    "@type": "Organization",
    "url": "https://www.kb.nl/",
    "name": "Koninklijke Bibliotheek",
    "sameAs": "https://ror.org/02w4jbg70"
  },
  "distribution": [
    {
      "@type": "DataDownload",
      "encodingFormat": "application/rdf+xml",
      "contentUrl": "http://data.bibliotheken.nl/id/dataset/rise-alba.rdf"
    },
    {
      "@type": "DataDownload",
      "encodingFormat": "text/turtle",
      "contentUrl": "http://data.bibliotheken.nl/id/dataset/rise-alba.ttl"
    },
    {
      "@type": "DataDownload",
      "encodingFormat": "application/n-triples",
      "contentUrl": "http://data.bibliotheken.nl/id/dataset/rise-alba.nt"
    },
    {
      "@type": "DataDownload",
      "encodingFormat": "application/ld+json",
      "contentUrl": "http://data.bibliotheken.nl/id/dataset/rise-alba.json"
    }
  ]
}

TODO Expand example based on required (and recommended) properties.

5. Tools

This section is non-normative.

Google’s Rich Results Test (previously Structured Data Testing Tool) can help with testing RDF metadata.

Reference SHACL.

Validation with NDE Register.

Conformance

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

References

Normative References

[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://tools.ietf.org/html/rfc2119

Informative References

[BCP47]
A. Phillips; M. Davis. Tags for Identifying Languages. September 2009. IETF Best Current Practice. URL: https://tools.ietf.org/html/bcp47
[DWBP]
Bernadette Farias Loscio; Caroline Burle; Newton Calegari. Data on the Web Best Practices. 31 January 2017. REC. URL: https://www.w3.org/TR/dwbp/
[DWBP-UCR]
Deirdre Lee; Bernadette Farias Loscio; Phil Archer. Data on the Web Best Practices Use Cases & Requirements. 24 February 2015. NOTE. URL: https://www.w3.org/TR/dwbp-ucr/
[IANA-MEDIA-TYPES]
Media Types. URL: https://www.iana.org/assignments/media-types/
[IRI]
M. Duerst; M. Suignard. Internationalized Resource Identifiers (IRIs). January 2005. Proposed Standard. URL: https://tools.ietf.org/html/rfc3987
[JSON-LD]
Manu Sporny; Gregg Kellogg; Markus Lanthaler. JSON-LD 1.0. 16 January 2014. REC. URL: https://www.w3.org/TR/json-ld/
[LD-BP]
Bernadette Hyland; Ghislain Auguste Atemezing; Boris Villazón-Terrazas. Best Practices for Publishing Linked Data. 9 January 2014. NOTE. URL: https://www.w3.org/TR/ld-bp/
[N3]
Tim Berners-Lee; Dan Connolly. Notation3 (N3): A readable RDF syntax. 14 January 2008. W3C Team Submission. URL: https://www.w3.org/TeamSubmission/2008/SUBM-n3-20080114/
[SCHEMA-ORG]
W3C Schema.org Community Group. Schema.org. 6.0. URL: https://schema.org/
[SPARQL11-OVERVIEW]
The W3C SPARQL Working Group. SPARQL 1.1 Overview. 21 March 2013. REC. URL: https://www.w3.org/TR/sparql11-overview/
[Turtle]
Eric Prud'hommeaux; Gavin Carothers. RDF 1.1 Turtle. 25 February 2014. REC. URL: https://www.w3.org/TR/turtle/
[VOCAB-DCAT-2]
Riccardo Albertoni; et al. Data Catalog Vocabulary (DCAT) - Version 2. 4 February 2020. REC. URL: https://www.w3.org/TR/vocab-dcat-2/