Requirements for Datasets

1. Introduction

This section is non-normative.

To enable datasets to be found and used, they must be described according to a well-documented, shared and machine-readable publication model.

This document describes such a model and its rules. When publishers make their dataset descriptions adhere to these rules, they enable consumers – both humans and machines – to use the datasets in new and better ways.

These requirements prescribe the metadata that publishers must provide for their dataset. This metadata tells consumers:

what the dataset is called and under what license it is published (§ 4.2 Dataset information);
which person/organization has published the dataset (§ 4.3 Creator/publisher information);
where the data can be found (§ 4.4 Distributions).

1.1. Audience

This document is mainly geared towards two groups of readers.

Digital heritage collection managers can follow the requirements in this document to make their published datasets findable and usable, for instance through Google Dataset Search and the NDE Dataset Register.

Suppliers of collection management systems can implement these requirements in their software to help collection managers using it to publish datasets in the correct format. These requirements are scoped to the online publication output of collection management systems; they do not prescribe how those systems should store data internally.

1.2. Context

While focused on digital heritage institutions in The Netherlands, this document is based on broader, international best practices for publishing datasets, including [DWBP-UCR], [DWBP] and [LD-BP].

These requirements incorporate a previous publication model, which provides more background on choices made here.

1.3. Code examples

RDF code examples are in the [SCHEMA-ORG] vocabulary, serialized as [JSON-LD].

While other vocabularies, such as [VOCAB-DCAT-2], can also be used, Schema.org has the advantage of being better picked up by search engines. This improves findability, one of the main goals of publishing datasets on the web.

2. Definitions

Dataset

A collection of metadata records. These are made available through the dataset’s distributions.

Dataset description

Metadata about the dataset, including the dataset’s name and publisher. This description must be distinguished from the metadata records themselves.

For example: imagine a dataset of Van Gogh paintings called ‘Sunflowers’, which is published by the Van Gogh Museum under a specific license. The name, license and publisher are all part of the dataset description. The dataset description also tells us the URLs of distributions where we can download or query the data. Using these distributions, we can access the metadata records themselves, which may include descriptions of paintings, persons, places etc. These are not part of the dataset description.

Data catalog

A collection of dataset descriptions.

Distribution

A channel through which a dataset is made available, either for downloading (such as a CSV file download or RDF dump), or for querying (such as a SPARQL endpoint).

Web API

An API that is available over HTTP, for example an OAI-PMH, OpenAPI or SPARQL endpoint.

Machine-readability

TODO

Publisher

An individual or organization that provides one or more datasets.

is this a good translation for ‘bronhouder’? And add examples. This would probably be ‘heritage institution’ in DERA.

Consumer

An organization or an individual that uses one or more datasets that are provided by a publisher.

3. Conceptual model

The model consists of four resource types: organizations publish datasets, which are available in distributions. Optionally, the datasets are grouped in data catalogs.

4. Requirements

4.1. Available in RDF

For machine-readable access to data, it needs to be published in an RDF format. RDF formats include [JSON-LD], [N3] and [Turtle].

Publishers MUST make their dataset description available in RDF.

Both the Schema.org and DCAT vocabularies MAY be used; Schema.org is recommended.

Google recommends including the JSON-LD directly in the HTML source of web pages.

So, on your organization’s web page, for instance www.kb.nl, include:

<html>
  <head>
    <title>Koninklijke Bibliotheek</title>
    <script type="application/ld+json">
      {
        "@context": "https://schema.org/",
        "@type": "Organization",
        "@id": "https://www.kb.nl",
        "name": "Koninklijke Bibliotheek"
      }
    </script>
  </head>
  <body>
    Here continues the web page of the organization...
  </body>
</html>
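To illustrate how a consumer might read such an embedded description, here is a minimal sketch in Python that uses only the standard library. The class and function names (JsonLdExtractor, extract_jsonld) are hypothetical and not part of these requirements. The sketch only parses the JSON; it does not perform any JSON-LD processing such as context expansion.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when a JSON-LD script element opens.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Parse the buffered text as JSON when the script element closes.
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def extract_jsonld(html):
    """Return a list with one parsed JSON document per JSON-LD script element."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents
```

Calling extract_jsonld on the HTML above would return a list with a single dictionary that describes the organization.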

4.1.1. Durable identifiers

Consumers want to refer to datasets. They prefer to do so by linking to them.

Therefore, publishers MUST maintain a permanent and unique identifier for each dataset. Publishers MUST use HTTP IRIs as identifiers.
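As a rough illustration, the syntactic part of this requirement can be checked with a few lines of Python. The function name is_http_iri is hypothetical, and the check is an assumption about what a basic validation could look like: it only tests for an http or https scheme and a host, and cannot tell whether an identifier is permanent or unique.

```python
from urllib.parse import urlsplit


def is_http_iri(identifier):
    """Return True if the identifier is an absolute IRI with an http(s) scheme and a host."""
    parts = urlsplit(identifier)
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

For example, is_http_iri("http://data.bibliotheken.nl/id/dataset/rise-alba") is true, while a relative reference or a URN is rejected.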

4.1.2. Information remains available

Datasets will be used by all kinds of consumers and their systems. For stability, users must be able to trust that the datasets will remain available so they can be consulted in the future.

Therefore, publishers MUST ensure information remains available in the future.

4.2. Dataset information

Consumers want to consult information about the dataset to decide whether and how they want to use its data. This information answers user questions such as:

What is the name of the dataset?
What is the dataset about? What kind of data does the dataset contain?
How recent is the data? When was the dataset last published?
How can I use the data? Are there any restrictions? Under which license is the data published?
Where can I get the data? In what formats?

4.2.1. Basic information

Publishers MUST include basic information about the dataset, at the very minimum its HTTP [IRI] and name.

Basic dataset information:

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "@id": "http://data.bibliotheken.nl/doc/dataset/rise-alba",
  "name": "Alba amicorum van de Koninklijke Bibliotheek"
}

4.2.2. License

Publishers MUST make known under which license the dataset can be used. This SHOULD be an open license that allows the data to be used by consumers, for example one of the Creative Commons licenses.

Specify a license for the dataset:

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "license": "http://creativecommons.org/publicdomain/zero/1.0/"
}

4.2.3. Creation, publication and modification dates

Publishers SHOULD make known when the dataset description was originally created, published and when it was last updated.

Specify dataset description dates:

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "dateCreated": "2019-04-14",
  "datePublished": "2019-05-21",
  "dateModified": "2019-08-15"
}
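A consumer could read these dates as follows. This is only a sketch: the function names are hypothetical, it handles plain dates in the YYYY-MM-DD form used in the example (not datetimes), and the chronological-order check is an assumed sanity check that these requirements do not prescribe.

```python
from datetime import date

DATE_KEYS = ("dateCreated", "datePublished", "dateModified")


def parse_dates(description):
    """Return the date properties that are present, parsed as datetime.date objects."""
    return {
        key: date.fromisoformat(description[key])
        for key in DATE_KEYS
        if key in description
    }


def dates_in_order(description):
    """Assumed sanity check: created <= published <= modified, for the dates present."""
    parsed = parse_dates(description)
    values = [parsed[key] for key in DATE_KEYS if key in parsed]
    return values == sorted(values)
```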

4.2.4. Version

A dataset description may change over time. Consumers, such as researchers, may want to determine which information was valid at a certain moment.

Therefore, publishers SHOULD not only publish the current version of the dataset description, but keep historical versions accessible to users.

It is up to the publisher to determine when to publish new versions.

4.2.5. More information

If more information is available, publishers SHOULD add it.

More information about the dataset.

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "description": "Alba amicorum van de Koninklijke Bibliotheek, een dataset gedefinieerd voor het Europeana Rise of Literacy project.",
  "url": "https://www.kb.nl/bronnen-zoekwijzers/kb-collecties/moderne-handschriften-vanaf-ca-1550/alba-amicorum",
  "keywords": [
    "alba amicorum"
  ]
}

See § 4.6.1 Dataset attributes for an overview of attributes.

4.3. Creator/publisher information

Users want to know where the dataset came from (provenance). The dataset’s creator and/or publisher is either a person or an organization. Providing information about the person/organization behind the dataset answers user questions such as:

Which person/organization has published this dataset? How reliable and credible does that make the dataset?
How can I contact the person/organization for questions or feedback?

Therefore, publishers MUST publish basic information about the person/organization. At the least, the organization’s name and HTTP IRI must be provided.

An organization description:

{
  "@context": "https://schema.org/",
  "@type": "Organization",
  "@id": "https://www.kb.nl",
  "name": "Koninklijke Bibliotheek",
  "alternateName": "KB"
}

See § 4.6.2 Organization attributes for a full overview of organization attributes.

A person description:

{
  "@context": "https://schema.org/",
  "@type": "Person",
  "@id": "https://example.com",
  "name": "Jan Jansen"
}

4.3.1. Dataset publisher

The person/organization data is then included as the dataset’s publisher:

A dataset with an organization as publisher. The publisher has some optional attributes.

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "publisher": {
    "@type": "Organization",
    "@id": "https://www.kb.nl",
    "name": "Koninklijke Bibliotheek",
    "alternateName": "KB",
    "contactPoint": {
      "@type": "ContactPoint",
      "name": "Dataset manager’s name",
      "email": "datasets@kb.nl",
      "telephone": "+31 6 12345678"
    }
  }
}

A dataset with a person as publisher. The publisher has some optional attributes.

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "publisher": {
    "@type": "Person",
    "@id": "https://example.com",
    "name": "Jan Jansen"
  }
}

4.4. Distributions

Consumers that are interested in a dataset should be able to access the data in it. Distributions tell consumers where and how they can get the data.

Therefore, publishers SHOULD add at least one distribution. Each distribution MUST have at least a MIME format and the URL where the distribution can be accessed.

Examples of distributions are data dumps in one or more RDF serializations, such as JSON-LD and Turtle, CSV files, SPARQL endpoints, OAI-PMH endpoints or other web APIs. All distributions of a dataset MUST contain broadly the same data.

A minimal definition of a SPARQL endpoint distribution. In the Schema.org vocabulary, each type of distribution is called a DataDownload, even if it is a query endpoint.

{
  "@context": "https://schema.org/",
  "@type": "DataDownload",
  "encodingFormat": "application/sparql-results+xml",
  "contentUrl": "http://vocab.getty.edu/sparql"
}

The distributions are then included under the distribution attribute with the dataset.

A dataset with three distributions.

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "distribution": [
    {
      "@id": "http://vocab.getty.edu/aat/sparql",
      "@type": "DataDownload",
      "encodingFormat": "application/sparql-results+json",
      "contentUrl": "http://vocab.getty.edu/sparql"
    },
    {
      "@type": "DataDownload",
      "encodingFormat": "application/ld+json",
      "contentUrl": "http://data.bibliotheken.nl/id/dataset/rise-alba.json"
    },
    {
      "@type": "DataDownload",
      "encodingFormat": "text/csv+gzip",
      "contentUrl": "https://example.com/data.csv.gz"
    }
  ]
}

See § 4.6.3 Distribution attributes for a full overview.
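The rule that each distribution needs a MIME format and a URL could be checked along these lines. This is a sketch under assumptions: the function name incomplete_distributions is hypothetical, and the dataset description is assumed to be a parsed JSON-LD document in the compact form used in the examples above.

```python
REQUIRED_DISTRIBUTION_KEYS = ("encodingFormat", "contentUrl")


def incomplete_distributions(dataset):
    """Return (index, missing_keys) pairs for distributions that lack a required property."""
    distributions = dataset.get("distribution", [])
    # A single distribution may be given as one object rather than a list.
    if isinstance(distributions, dict):
        distributions = [distributions]
    problems = []
    for index, distribution in enumerate(distributions):
        missing = [key for key in REQUIRED_DISTRIBUTION_KEYS if not distribution.get(key)]
        if missing:
            problems.append((index, missing))
    return problems
```

An empty result means every listed distribution carries both properties; it does not verify that the URL resolves or that the media type is registered.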

4.4.1. Developer documentation

Publishers SHOULD link to documentation URLs via the schema:usageInfo attribute. For data dumps, this should be documentation about the data model. For web APIs, this should be both documentation about the specific capabilities of the API (such as content-type support) and the generic specification of the protocol (such as OAI-PMH, SPARQL or REST), which also types the distribution.

Recommended URIs for typing
Protocol	URL specification
OAI-PMH	http://www.openarchives.org/pmh/
SPARQL	https://www.w3.org/TR/sparql11-query/
TPF	https://linkeddatafragments.org/specification/triple-pattern-fragments/
REST	https://developers.arcgis.com/rest/
WMS	http://www.opengeospatial.org/standards/wms
GraphQL	https://spec.graphql.org/

A distribution with a link to the generic documentation and the protocol specification (typing the distribution as SPARQL-endpoint).

{
  "@context": "https://schema.org/",
  "@type": "DataDownload",
  "encodingFormat": "application/sparql-results+json",
  "contentUrl": "http://vocab.getty.edu/sparql",
  "usageInfo": [ "https://www.w3.org/TR/sparql11-query/", "https://vocab.getty.edu/" ]
}
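A consumer could use the recommended URIs to work out which protocol a distribution speaks. The following sketch assumes an exact string match against the URLs in the table above; the names PROTOCOL_SPECS and detect_protocols are hypothetical.

```python
# Specification URLs copied from the table of recommended URIs for typing.
PROTOCOL_SPECS = {
    "http://www.openarchives.org/pmh/": "OAI-PMH",
    "https://www.w3.org/TR/sparql11-query/": "SPARQL",
    "https://linkeddatafragments.org/specification/triple-pattern-fragments/": "TPF",
    "https://developers.arcgis.com/rest/": "REST",
    "http://www.opengeospatial.org/standards/wms": "WMS",
    "https://spec.graphql.org/": "GraphQL",
}


def detect_protocols(distribution):
    """Return the protocol names whose specification URL appears in usageInfo."""
    usage = distribution.get("usageInfo", [])
    # usageInfo may be a single URL or a list of URLs.
    if isinstance(usage, str):
        usage = [usage]
    return [PROTOCOL_SPECS[url] for url in usage if url in PROTOCOL_SPECS]
```

For the distribution above, this yields a list containing only "SPARQL"; the second URL is treated as API-specific documentation because it matches no specification URL.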

4.5. Data catalog

A data catalog provides consumers with a complete overview of available dataset descriptions, which improves discoverability.

Therefore, publishers SHOULD provide a catalog.

A catalog of available datasets:

{
  "@context": "https://schema.org/",
  "@type": "DataCatalog",
  "@id": "http://data.bibliotheken.nl/id/datacatalog",
  "name": "Linked Data KB",
  "description": "Alle linked data zoals beschikbaar gesteld door de Koninklijke Bibliotheek.",
  "publisher": {
    "@type": "Organization",
    "@id": "https://www.kb.nl/",
    "name": "Koninklijke Bibliotheek"
  },
  "dataset": [
    {
      "@type": "Dataset",
      "@id": "http://data.bibliotheken.nl/id/dataset/rise-alba",
      ...
    }, 
    { 
      ...
    }
  ]
}

See § 4.6.4 DataCatalog attributes for a full overview of catalog attributes.

4.5.1. Pagination

Large data catalogs may be harder to process for clients.

Therefore, publishers SHOULD split large data catalogs into pages of at most 1,000 datasets, using the Hydra Core Vocabulary.

Each page MUST be a complete RDF document in itself. Related resources, such as the publishing organization, must be described on each page, even if that resource is the same on all pages.

A paginated catalog:

{
  "@context": [
    "https://schema.org/",
    {"hydra": "http://www.w3.org/ns/hydra/core#"}
  ],
  "@type": ["DataCatalog", "hydra:Collection"],
  "@id": "https://example.com/catalog",
  "name": "Paginated catalog of datasets",
  "description": "This catalog is paginated using the Hydra Core Vocabulary.",
  "publisher": {
    "@type": "Organization",
    "@id": "/publisher",
    "name": "Example Publisher"
  },
  "hydra:view": {
    "@id": "/catalog?page=1",
    "@type": "hydra:PartialCollectionView",
    "hydra:first": {"@id": "/catalog?page=1"},
    "hydra:next": {"@id": "/catalog?page=2"},
    "hydra:last": {"@id": "/catalog?page=498"}
  },
  "dataset": [
    {
      "@type": "Dataset",
      "@id": "https://example.com/dataset/1",
      ...
    },
    {
      "@type": "Dataset",
      "@id": "https://example.com/dataset/2",
      ...
    },
    ...
  ]
}
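A client could walk such a paginated catalog by following the hydra:next links. The sketch below makes several assumptions: fetch is a hypothetical callable, supplied by the caller, that returns the parsed JSON-LD document for a URL; the pages use the compact "hydra:" keys shown in the example (a real client would normally run a JSON-LD processor first); and relative @id values are resolved against the URL of the current page.

```python
from urllib.parse import urljoin


def iter_datasets(first_page_url, fetch):
    """Yield the dataset descriptions from every page, following hydra:next links."""
    url = first_page_url
    seen = set()
    # Stop when there is no next page, or when a page repeats (guards against loops).
    while url and url not in seen:
        seen.add(url)
        page = fetch(url)
        yield from page.get("dataset", [])
        next_ref = page.get("hydra:view", {}).get("hydra:next")
        url = urljoin(url, next_ref["@id"]) if next_ref else None
```

Because each page is a complete RDF document in itself, the client does not need to keep state from earlier pages beyond the URL of the next one.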

4.6. Overview of attributes

This is an overview of required and recommended attributes.

4.6.1. Dataset attributes

schema:Dataset properties
Property	Description	Cardinality	Usage
@id	The HTTP [IRI] of the dataset.	1	Required
schema:name	See § 4.2.1 Basic information.	1	Required
schema:description	See § 4.2.5 More information.	0..1	Recommended
schema:publisher	The publisher of the dataset.	1	Required
schema:license	See § 4.2.2 License.	1	Required
schema:distribution	See § 4.4 Distributions.	0..n	Recommended
schema:creator	The owner of the dataset.	0..n	Recommended
schema:funder	An organization that supports (sponsors) something through some kind of financial contribution.	0..n	Recommended
schema:dateCreated	See § 4.2.3 Creation, publication and modification dates.	0..1	Recommended
schema:datePublished	See § 4.2.3 Creation, publication and modification dates.	0..1	Recommended
schema:dateModified	See § 4.2.3 Creation, publication and modification dates.	0..1	Recommended
schema:version	See § 4.2.4 Version.	0..1	Recommended
schema:mainEntityOfPage	URL of a landing page where the dataset is described.	0..1	Recommended
schema:inLanguage	Language or languages in which the dataset is available. Use one of the language codes from [BCP47], such as "nl-NL".	0..n	Recommended
schema:isBasedOn	The URL of the dataset that this dataset is based on (previously schema:isBasedOnUrl).	0..1	Recommended
schema:citation	A citation or reference for the dataset.	0..1	Recommended
schema:genre	The genre or genres which describe the dataset.	0..n	Recommended
schema:keywords	The keyword or keywords which describe the dataset.	0..n	Recommended
schema:spatialCoverage	Indicates the place(s) which are the focus of the dataset.	0..n	Recommended
schema:temporalCoverage	Indicates the period that the dataset applies to.	0..n	Recommended
schema:includedInDataCatalog	The HTTP [IRI] of the data catalog(s) that the dataset belongs to.	0..n	Recommended
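The required attributes in this table lend themselves to a simple completeness check. The sketch below is an illustration only: the function name missing_required is hypothetical, it assumes a parsed dataset description in the compact form used in the examples, and it checks only that each required property is present and non-empty, not that its value is valid.

```python
# Properties marked "Required" in the schema:Dataset table, in compact JSON-LD form.
REQUIRED_DATASET_KEYS = ("@id", "name", "publisher", "license")


def missing_required(dataset):
    """Return the required properties that are absent or empty in a dataset description."""
    return [key for key in REQUIRED_DATASET_KEYS if not dataset.get(key)]
```

Applied to the basic example in § 4.2.1, which contains only an identifier and a name, this reports that the publisher and license are still missing.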

4.6.2. Organization attributes

schema:Organization properties
Property	Description	Cardinality	Usage
@id	The HTTP [IRI] of the organization.	1	Required
schema:name	The organization’s full name.	1	Required
schema:alternateName	Alternative names such as an abbreviation that the organization is known under.	0..n	Recommended
schema:sameAs	Links to the organization in other databases.	0..n	Recommended
schema:contactPoint	Where consumers can reach the organization for questions and suggestions about the dataset.	1	Recommended

4.6.3. Distribution attributes

When the distribution is compressed, the compression format (e.g. zip, gzip) should be added to the schema:encodingFormat (e.g. text/turtle+gzip).

schema:DataDownload properties
Property	Description	Cardinality	Usage
schema:contentUrl	The URL where the distribution can be directly accessed.	1	Required
schema:encodingFormat	The distribution’s MIME format, for example `application/sparql-query` for a SPARQL endpoint or `application/ld+json` for a data dump serialized as JSON-LD. Use a value from the [IANA-MEDIA-TYPES] list. The value should indicate the media type of the response of the schema:contentUrl when no Accept header is included in the request. It is recommended to provide a URL of documentation via the schema:usageInfo property that lists the supported encoding formats which can be requested via an Accept HTTP header.	1..n	Required
schema:name	Name of the distribution, depicting the type of distribution.	0..1	Recommended
schema:description	A description of the distribution.	0..1	Recommended
schema:datePublished	Date (or datetime) the distribution was published.	0..1	Recommended
schema:dateModified	Date (or datetime) the distribution was last modified.	0..1	Recommended
schema:inLanguage	Language or languages in which the distribution is available. Use one of the language codes from [BCP47], such as "nl-NL".	0..1	Recommended
schema:license	See § 4.2.2 License.	0..1	Recommended
schema:contentSize	A measure indicating the size of the distribution.	0..1	Recommended
schema:usageInfo	A link to the web API documentation; see § 4.4.1 Developer documentation.	0..1	Required if custom API
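The compression convention for schema:encodingFormat described at the start of this section can be handled with two small helpers. This is a sketch: the function names are hypothetical, and the list of recognized compression suffixes is an assumption limited to the two formats named in the text (zip and gzip).

```python
KNOWN_COMPRESSIONS = ("gzip", "zip")  # assumption: only the formats named in the text


def encoding_format(media_type, compression=None):
    """Build an encodingFormat value, appending the compression format when given."""
    return f"{media_type}+{compression}" if compression else media_type


def split_encoding_format(value):
    """Split an encodingFormat value into (media_type, compression or None)."""
    base, separator, suffix = value.rpartition("+")
    if separator and suffix in KNOWN_COMPRESSIONS:
        return base, suffix
    return value, None
```

Checking the suffix against a known list keeps structured-syntax suffixes such as the "+json" in application/ld+json from being mistaken for a compression format.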

4.6.4. DataCatalog attributes

schema:DataCatalog properties
Property	Description	Cardinality	Usage
@id	The HTTP [IRI] of the data catalog.	1	Required
schema:name	The name of the data catalog.	1	Required
schema:description	A description of the data catalog.	0..1	Recommended
schema:publisher	The publisher of the data catalog.	1	Required
schema:dataset	The datasets that are contained in the data catalog.	1..n	Required

4.6.5. Full example

A full dataset description that includes required and recommended attributes.

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "@id": "http://data.bibliotheken.nl/id/dataset/rise-alba",
  "name": "Alba amicorum van de Koninklijke Bibliotheek",
  "description": "Alba amicorum van de Koninklijke Bibliotheek, een dataset gedefinieerd voor het Europeana Rise of Literacy project.",
  "license": "http://creativecommons.org/publicdomain/zero/1.0/",
  "dateCreated": "2019-04-14",
  "datePublished": "2019-05-21",
  "dateModified": "2019-08-15",
  "keywords": [
    "alba amicorum"
  ],
  "includedInDataCatalog": "http://data.bibliotheken.nl/id/datacatalog",
  "inLanguage": [
    "nl-NL",
    "en-US"
  ],
  "mainEntityOfPage": "https://www.kb.nl/bronnen-zoekwijzers/kb-collecties/moderne-handschriften-vanaf-ca-1550/alba-amicorum",
  "publisher": {
    "@type": "Organization",
    "@id": "https://www.kb.nl",
    "name": "Koninklijke Bibliotheek",
    "alternateName": "KB",
    "sameAs": "https://ror.org/02w4jbg70"
  },
  "creator": {
    "@type": "Organization",
    "@id": "https://www.kb.nl",
    "name": "Koninklijke Bibliotheek",
    "alternateName": "KB",
    "sameAs": "https://ror.org/02w4jbg70",
    "contactPoint": {
      "@type": "ContactPoint",
      "name": "Dataset manager’s name",
      "email": "datasets@kb.nl",
      "telephone": "+31 6 12345678"
    }
  },
  "distribution": [
    {
      "@type": "DataDownload",
      "encodingFormat": "application/sparql-query",
      "contentUrl": "http://data.bibliotheken.nl/sparql"
    },
    {
      "@type": "DataDownload",
      "encodingFormat": "text/turtle+gzip",
      "contentUrl": "http://data.bibliotheken.nl/files/alba_20201016.ttl.gz",
      "name": "Datadump Alba Amicorum",
      "datePublished": "2019-05-21",
      "dateModified": "2019-08-15",
      "contentSize": "614K"
    }
  ]
}

5. Tools