1. Introduction
This section is non-normative.
To enable datasets to be found and used, they must be published according to a well-documented, shared and machine-readable publication model.
This document describes such a model and its rules. When publishers make their datasets adhere to these rules, they enable consumers – both humans and machines – to use the published data in new and better ways.
The requirements prescribe the metadata that publishers should provide with their data. This metadata tells consumers:
- what the dataset is called and under what license it is published (§ 4.2 Dataset information);
- which organization has published the dataset (§ 4.3 Organization information);
- where the data can be found (§ 4.4 Distributions).
1.1. Audience
This document is mainly geared towards two groups of readers.
Digital heritage collection managers can follow the requirements in this document to make their published datasets findable and usable, for instance through Google Dataset Search and the NDE Registry.
Suppliers of collection management systems can implement these requirements in their software, so that collection managers who use it can publish datasets in the correct format. These requirements are scoped to the online publication output of collection management systems; they do not prescribe how those systems should store data internally.
1.2. Context
While focused on digital heritage institutions in the Netherlands, this document is based on broader, international best practices for publishing datasets, including [DWBP-UCR], [DWBP] and [LD-BP].
These requirements incorporate a previous publication model, which provides more background on choices made here.
1.3. Code examples
RDF code examples are in the [SCHEMA-ORG] vocabulary, serialized as [JSON-LD].
While other vocabularies, such as [VOCAB-DCAT-2], can also be used, Schema.org has the advantage that it is picked up better by search engines. This improves findability, one of the main goals of publishing datasets on the web.
2. Definitions
- Dataset
  The description of a collection of data: a set of metadata that includes the dataset’s name and publisher. The data objects themselves are not part of the dataset proper but are provided in its distributions.

  For example: imagine a dataset of Van Gogh paintings called ‘Sunflowers’, which is published by the Van Gogh Museum under a specific license. These are all part of the dataset. The dataset also tells us the URLs of distributions where we can download or query the data. If we consult the data provided by one of these distributions, we’ve left the sphere of the dataset proper. That is to say, the data itself, which may include descriptions of paintings, persons and places, is not a property of the dataset.
- Data catalog
  A collection of datasets.
- Distribution
  A channel through which a dataset is available, for example a CSV file download or a SPARQL endpoint.
- Application Programming Interface (API)
- Machine-readability
- Publisher
  An individual or organization that provides one or more datasets.

  TODO Check whether this is a good translation for ‘bronhouder’, and add examples. This would probably be ‘heritage institution’ in DERA.
- Consumer
  An organization or an individual that uses one or more datasets that are provided by a publisher.
3. Conceptual model
The model consists of four resource types: organizations publish datasets, which are available in distributions. Optionally, the datasets are grouped in data catalogs.
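The relations between the four resource types can be sketched as plain data classes. This is an illustration only: the class and field names below are assumptions made for this sketch and are not part of the requirements, which express the model in Schema.org terms.

```python
from dataclasses import dataclass, field


@dataclass
class Organization:
    id: str    # HTTP IRI of the organization
    name: str


@dataclass
class Distribution:
    content_url: str      # where the data can be accessed
    encoding_format: str  # media type, for example "text/csv"


@dataclass
class Dataset:
    id: str
    name: str
    publisher: Organization  # organizations publish datasets
    distributions: list[Distribution] = field(default_factory=list)


@dataclass
class DataCatalog:
    id: str
    name: str
    publisher: Organization
    datasets: list[Dataset] = field(default_factory=list)  # optional grouping


kb = Organization("https://www.kb.nl", "Koninklijke Bibliotheek")
alba = Dataset(
    "http://data.bibliotheken.nl/id/dataset/rise-alba",
    "Alba amicorum van de Koninklijke Bibliotheek",
    publisher=kb,
    distributions=[
        Distribution("http://data.bibliotheken.nl/sparql", "application/sparql-query")
    ],
)
catalog = DataCatalog(
    "http://data.bibliotheken.nl/id/datacatalog", "Linked Data KB", kb, [alba]
)
```

Note that a dataset holds only metadata and pointers to distributions; the data itself is not modelled here.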
4. Requirements
4.1. Available in RDF
For data to be machine-readable, it needs to be published in an RDF format. RDF formats include [JSON-LD], [N3] and [Turtle].
Publishers MUST make their dataset description available in RDF.
Both the Schema.org and DCAT vocabularies MAY be used; Schema.org is recommended.
So, on your organization’s web page, for instance www.kb.nl, include:
    <html>
      <head>
        <title>Koninklijke Bibliotheek</title>
        <script type="application/ld+json">
          {
            "@context": "https://schema.org/",
            "@type": "Organization",
            "@id": "https://www.kb.nl",
            "name": "Koninklijke Bibliotheek"
          }
        </script>
      </head>
      <body>
        Here continues the web page of the organization...
      </body>
    </html>
See also Data on the Web Best Practices Use Cases & Requirements §R-FormatMachineRead.
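To illustrate how a consumer might read such an embedded description, the following minimal sketch extracts JSON-LD script blocks from an HTML page using only the Python standard library. The class name `JsonLdExtractor` and the inline `page` string are assumptions made for this example; a production consumer would more likely use a full JSON-LD processor.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


# Hypothetical page content, modelled on the organization example.
page = """<html><head><title>Koninklijke Bibliotheek</title>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Organization",
 "@id": "https://www.kb.nl", "name": "Koninklijke Bibliotheek"}
</script></head><body></body></html>"""

extractor = JsonLdExtractor()
extractor.feed(page)
extractor.close()
organization = extractor.documents[0]
```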
4.1.1. Durable identifiers
Consumers want to refer to datasets. They prefer to do so by linking to them.
Therefore, publishers MUST maintain a permanent and unique identifier for each dataset. Publishers MUST use HTTP IRIs as identifiers.
See also Data on the Web Best Practices Use Cases & Requirements §R-UniqueIdentifier and Data on the Web Best Practices Use Cases & Requirements §R-PersistentIdentification.
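A syntactic check for HTTP IRIs could look like the sketch below. The function name `is_http_iri` is an assumption for this example. The check covers only the form of the identifier; whether an identifier is permanent and unique cannot be verified this way and remains the publisher's responsibility.

```python
from urllib.parse import urlsplit


def is_http_iri(identifier: str) -> bool:
    """Return True if the identifier has an http(s) scheme and a host part."""
    try:
        parts = urlsplit(identifier)
    except ValueError:
        # urlsplit rejects some malformed inputs, such as broken IPv6 hosts.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


ok = is_http_iri("http://data.bibliotheken.nl/id/dataset/rise-alba")
not_http = is_http_iri("urn:isbn:9789000000000")
relative = is_http_iri("rise-alba")
```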
4.1.2. Information remains available
Datasets will be used by all kinds of consumers and their systems. For stability, users must be able to trust that the datasets will remain available so they can be consulted in the future.
Therefore, publishers MUST ensure that information remains available in the future.
4.2. Dataset information
Consumers want to consult information about the dataset to decide whether and how they want to use its data. This information answers user questions such as:
- What is the name of the dataset?
- What is the dataset about? What kind of data does the dataset contain?
- How recent is the data? When was the dataset last published?
- How can I use the data? Are there any restrictions? Under which license is the data published?
- Where can I get the data? In what formats?
4.2.1. Basic information
Publishers MUST include basic information about the dataset, at the very minimum its HTTP [IRI] and name.
    {
      "@context": "https://schema.org/",
      "@type": "Dataset",
      "@id": "http://data.bibliotheken.nl/doc/dataset/rise-alba",
      "name": "Alba amicorum van de Koninklijke Bibliotheek"
    }
4.2.2. License
Publishers MUST make known under which license the dataset can be used. This SHOULD be an open license that allows the data to be used by consumers, for example one of the Creative Commons licenses.
    {
      "@context": "https://schema.org/",
      "@type": "Dataset",
      "license": "http://creativecommons.org/publicdomain/zero/1.0/"
    }
See also Data on the Web Best Practices Use Cases & Requirements §R-LicenseAvailable.
4.2.3. Creation, publication and modification dates
Publishers SHOULD make known when the dataset description was originally created, published and when it was last updated.
    {
      "@context": "https://schema.org/",
      "@type": "Dataset",
      "dateCreated": "2019-04-14",
      "datePublished": "2019-05-21",
      "dateModified": "2019-08-15"
    }
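The date values in the example are plain ISO 8601 calendar dates. As a minimal sketch, a publisher could check that the dates it provides are well-formed before publishing. The function name `date_problems` is an assumption for this example, and the sketch deliberately accepts only `YYYY-MM-DD` style dates, not datetimes.

```python
from datetime import date

DATE_KEYS = ("dateCreated", "datePublished", "dateModified")


def date_problems(dataset: dict) -> list[str]:
    """Report date attributes that are present but not ISO 8601 dates."""
    problems = []
    for key in DATE_KEYS:
        value = dataset.get(key)
        if value is None:
            continue  # dates are recommended, not required
        try:
            date.fromisoformat(value)
        except (TypeError, ValueError):
            problems.append(f"{key} is not an ISO 8601 date: {value!r}")
    return problems


good = date_problems({
    "dateCreated": "2019-04-14",
    "datePublished": "2019-05-21",
    "dateModified": "2019-08-15",
})
bad = date_problems({"dateCreated": "14-04-2019"})
```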
4.2.4. Version
A dataset description may change over time. Consumers, such as researchers, may want to determine which information was valid at a certain moment.
Therefore, publishers SHOULD not only publish the current version of the dataset description, but keep historical versions accessible to users.
It is up to the publisher to determine when to publish new versions.
See also Data on the Web Best Practices §dataVersioning.
4.2.5. More information
If more information is available, publishers SHOULD add it.
    {
      "@context": "https://schema.org/",
      "@type": "Dataset",
      "description": "Alba amicorum van de Koninklijke Bibliotheek, een dataset gedefinieerd voor het Europeana Rise of Literacy project.",
      "url": "https://www.kb.nl/bronnen-zoekwijzers/kb-collecties/moderne-handschriften-vanaf-ca-1550/alba-amicorum",
      "keywords": [
        "alba amicorum"
      ]
    }
See § 4.6.1 Dataset attributes for an overview of attributes.
4.3. Organization information
Users want to know where the dataset came from (provenance). Providing information about the organization behind the dataset (the publisher) answers user questions such as:
- Which organization has published this dataset? How reliable and credible does that make the dataset?
- How can I contact the organization for questions or feedback?
Therefore, publishers MUST publish basic information about the organization. At the least, the organization’s name and HTTP IRI MUST be provided.
    {
      "@context": "https://schema.org/",
      "@type": "Organization",
      "@id": "https://www.kb.nl",
      "name": "Koninklijke Bibliotheek",
      "alternateName": "KB"
    }
See § 4.6.2 Organization attributes for a full overview of organization attributes.
4.3.1. Dataset publisher
The organization data is then included as the dataset’s publisher:
    {
      "@context": "https://schema.org/",
      "@type": "Dataset",
      "publisher": {
        "@type": "Organization",
        "@id": "https://www.kb.nl",
        "name": "Koninklijke Bibliotheek",
        "alternateName": "KB"
      }
    }
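For systems that generate these descriptions, nesting the organization under the dataset can be sketched as follows. The function name `with_publisher` is an assumption for this example; the sketch drops the organization's own `@context` because the nested node inherits the context of the enclosing dataset.

```python
def with_publisher(dataset: dict, organization: dict) -> dict:
    """Return a copy of the dataset description with the organization nested."""
    for key in ("@id", "name"):
        if not organization.get(key):
            raise ValueError(f"organization is missing {key}")
    publisher = {k: v for k, v in organization.items() if k != "@context"}
    return {**dataset, "publisher": publisher}


organization = {
    "@context": "https://schema.org/",
    "@type": "Organization",
    "@id": "https://www.kb.nl",
    "name": "Koninklijke Bibliotheek",
    "alternateName": "KB",
}
dataset = {"@context": "https://schema.org/", "@type": "Dataset"}
described = with_publisher(dataset, organization)
```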
4.4. Distributions
Consumers that are interested in a dataset should be able to access the data in it. Distributions tell consumers where and how they can get the data.
Therefore, publishers SHOULD add at least one distribution. Each distribution MUST have at least a MIME format and the URL where the distribution can be accessed.
Distributions can be data dumps in RDF or some other format such as CSV, live SPARQL endpoints, OAI-PMH endpoints or other APIs.
    {
      "@context": "https://schema.org/",
      "@type": "DataDownload",
      "encodingFormat": "application/sparql-query",
      "contentUrl": "http://vocab.getty.edu/sparql"
    }
The distributions are then included under the distribution attribute with the dataset.
    {
      "@context": "https://schema.org/",
      "@type": "Dataset",
      "distribution": [
        {
          "@id": "http://vocab.getty.edu/aat/sparql",
          "@type": "DataDownload",
          "encodingFormat": "application/sparql-query",
          "contentUrl": "http://vocab.getty.edu/sparql"
        },
        {
          "@type": "DataDownload",
          "encodingFormat": "application/ld+json",
          "contentUrl": "http://data.bibliotheken.nl/id/dataset/rise-alba.json"
        },
        {
          "@type": "DataDownload",
          "encodingFormat": "text/csv",
          "contentUrl": "https://example.com/data.csv"
        }
      ]
    }
See § 4.6.3 Distribution attributes for a full overview.
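The distribution rules in this section (at least one distribution is recommended; each distribution needs a format and a URL) can be checked with a sketch like the one below. The function name `distribution_problems` and the wording of the messages are assumptions for this example.

```python
def distribution_problems(dataset: dict) -> list[str]:
    """List distribution-related problems in a dataset description."""
    distributions = dataset.get("distribution", [])
    if isinstance(distributions, dict):
        # JSON-LD allows a single value instead of a one-element array.
        distributions = [distributions]

    problems = []
    if not distributions:
        problems.append("no distribution (at least one is recommended)")
    for index, distribution in enumerate(distributions):
        for key in ("encodingFormat", "contentUrl"):
            if not distribution.get(key):
                problems.append(f"distribution[{index}] is missing {key}")
    return problems


complete = distribution_problems({
    "@type": "Dataset",
    "distribution": [
        {"@type": "DataDownload", "encodingFormat": "text/csv",
         "contentUrl": "https://example.com/data.csv"},
    ],
})
incomplete = distribution_problems({
    "@type": "Dataset",
    "distribution": {"@type": "DataDownload", "encodingFormat": "text/csv"},
})
```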
TODO Elaborate on (and give examples of) the types of distribution (in schema:name), like SPARQL endpoint, OAI-PMH endpoint, LDF endpoint, data dump, OpenSearch API. Or introduce a proprietary property for the type (this might be better than using the free-format name).
4.4.1. Developer documentation
If the distribution is a non-standard API, publishers MUST publish human-readable documentation on how to use that API and link to the documentation URL in the schema:usageInfo attribute:
    {
      "@context": "https://schema.org/",
      "@type": "DataDownload",
      "encodingFormat": "application/json",
      "contentUrl": "https://example.com/custom-json-api",
      "usageInfo": "https://example.com/custom-json-api/docs"
    }
See also Data on the Web Best Practices Use Cases & Requirements §R-APIDocumented.
4.5. Data catalog
A data catalog provides consumers with a complete overview of available datasets, which improves discoverability.
Therefore, publishers SHOULD provide a catalog.
    {
      "@context": "https://schema.org/",
      "@type": "DataCatalog",
      "@id": "http://data.bibliotheken.nl/id/datacatalog",
      "name": "Linked Data KB",
      "description": "Alle linked data zoals beschikbaar gesteld door de Koninklijke Bibliotheek.",
      "publisher": {
        "@type": "Organization",
        "@id": "https://www.kb.nl/",
        "name": "Koninklijke Bibliotheek"
      },
      "dataset": [
        {
          "@type": "Dataset",
          "@id": "http://data.bibliotheken.nl/id/dataset/rise-alba",
          ...
        },
        { ... }
      ]
    }
See § 4.6.4 DataCatalog attributes for a full overview of catalog attributes.
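A catalog description can be generated from the dataset descriptions a publisher already has. The sketch below is one possible approach; the function name `build_catalog` is an assumption for this example, and listing only the `@id` and name of each dataset is a design choice of the sketch, not a requirement.

```python
import json


def build_catalog(catalog_iri: str, name: str, publisher: dict, datasets: list) -> dict:
    """Build a schema:DataCatalog description that lists the given datasets."""
    if not datasets:
        raise ValueError("a catalog needs at least one dataset")
    return {
        "@context": "https://schema.org/",
        "@type": "DataCatalog",
        "@id": catalog_iri,
        "name": name,
        "publisher": publisher,
        "dataset": [
            {"@type": "Dataset", "@id": d["@id"], "name": d["name"]}
            for d in datasets
        ],
    }


kb = {
    "@type": "Organization",
    "@id": "https://www.kb.nl",
    "name": "Koninklijke Bibliotheek",
}
alba = {
    "@type": "Dataset",
    "@id": "http://data.bibliotheken.nl/id/dataset/rise-alba",
    "name": "Alba amicorum van de Koninklijke Bibliotheek",
}
catalog = build_catalog(
    "http://data.bibliotheken.nl/id/datacatalog", "Linked Data KB", kb, [alba]
)
serialized = json.dumps(catalog, indent=2, ensure_ascii=False)
```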
4.6. Overview of attributes
This is an overview of required and recommended attributes.
4.6.1. Dataset attributes
Property | Description | Cardinality | Usage |
---|---|---|---|
@id | The HTTP [IRI] of the dataset. | 1 | Required |
schema:name | See § 4.2.1 Basic information. | 1 | Required |
schema:description | See § 4.2.5 More information. | 0..1 | Recommended |
schema:publisher | The publisher of the dataset. | 1 | Required |
schema:license | See § 4.2.2 License. | 1 | Required |
schema:distribution | See § 4.4 Distributions. | 0..n | Recommended |
schema:creator | The owner of the dataset. | 0..n | Recommended |
schema:dateCreated | See § 4.2.3 Creation, publication and modification dates. | 0..1 | Recommended |
schema:datePublished | See § 4.2.3 Creation, publication and modification dates. | 0..1 | Recommended |
schema:dateModified | See § 4.2.3 Creation, publication and modification dates. | 0..1 | Recommended |
schema:version | See § 4.2.4 Version. | 0..1 | Recommended |
schema:mainEntityOfPage | URL of a webpage where the dataset is described. | 0..1 | Recommended |
schema:inLanguage | Language or languages in which the dataset is available. Use one of the language codes from the [BCP47], such as "nl-NL". | 0..n | Recommended |
schema:isBasedOn | The URL of the dataset that this dataset is based on (previously schema:isBasedOnUrl). | 0..1 | Recommended |
schema:citation | A citation or reference for the dataset. | 0..1 | Recommended |
schema:genre | The genre or genres which describe the dataset. | 0..n | Recommended |
schema:keywords | The keyword or keywords which describe the dataset. | 0..n | Recommended |
schema:spatialCoverage | Indicates the place(s) which are the focus of the dataset. | 0..n | Recommended |
schema:temporalCoverage | Indicates the period that the dataset applies to. | 0..n | Recommended |
schema:includedInDataCatalog | The HTTP [IRI] of the data catalog(s) that the dataset belongs to. | 0..n | Recommended |
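As an illustration of how the table above can be applied, the following sketch reports which required and recommended attributes are absent from a dataset description. It assumes the description uses `https://schema.org/` as its `@context`, so that keys appear without the `schema:` prefix; the names `REQUIRED`, `RECOMMENDED` and `check_dataset` are assumptions for this example. It checks presence only and is no substitute for validation against the SHACL shapes.

```python
REQUIRED = ("@id", "name", "publisher", "license")
RECOMMENDED = (
    "description", "distribution", "creator", "dateCreated", "datePublished",
    "dateModified", "version", "mainEntityOfPage", "inLanguage", "isBasedOn",
    "citation", "genre", "keywords", "spatialCoverage", "temporalCoverage",
    "includedInDataCatalog",
)


def check_dataset(dataset: dict) -> tuple[list[str], list[str]]:
    """Return (missing required attributes, missing recommended attributes)."""
    missing_required = [key for key in REQUIRED if key not in dataset]
    missing_recommended = [key for key in RECOMMENDED if key not in dataset]
    return missing_required, missing_recommended


minimal = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "@id": "http://data.bibliotheken.nl/doc/dataset/rise-alba",
    "name": "Alba amicorum van de Koninklijke Bibliotheek",
}
missing_required, missing_recommended = check_dataset(minimal)
```

For the minimal description shown, the publisher and license are reported as missing required attributes.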
4.6.2. Organization attributes
Property | Description | Cardinality | Usage |
---|---|---|---|
@id | The HTTP [IRI] of the organization. | 1 | Required |
schema:name | The organization’s full name. | 1 | Required |
schema:alternateName | Alternative names such as an abbreviation that the organization is known under. | 0..n | Recommended |
schema:sameAs | Links to the organization in other databases. | 0..n | Recommended |
4.6.3. Distribution attributes
Property | Description | Cardinality | Usage |
---|---|---|---|
schema:contentUrl | The URL where the distribution can be directly accessed. | 1 | Required |
schema:encodingFormat | The distribution’s MIME format, for example application/sparql-query for a SPARQL endpoint or application/ld+json for a data dump serialized as JSON-LD. Use a value from the [IANA-MEDIA-TYPES] list. | 1..n | Required |
schema:name | Name of the distribution, depicting the type of distribution. | 0..1 | Recommended |
schema:description | A description of the distribution. | 0..1 | Recommended |
schema:datePublished | Date (or datetime) the distribution was published. | 0..1 | Recommended |
schema:dateModified | Date (or datetime) the distribution was last modified. | 0..1 | Recommended |
schema:inLanguage | Language or languages in which the dataset is available. Use one of the language codes from the [BCP47], such as "nl-NL". | 0..1 | Recommended |
schema:license | See § 4.2.2 License. | 0..1 | Recommended |
schema:contentSize | A measure indicating the size of the distribution. | 0..1 | Recommended |
schema:usageInfo | A link to the API documentation; see § 4.4.1 Developer documentation. | 0..1 | Required if custom API |
4.6.4. DataCatalog attributes
Property | Description | Cardinality | Usage |
---|---|---|---|
@id | The HTTP [IRI] of the data catalog. | 1 | Required |
schema:name | The name of the data catalog. | 1 | Required |
schema:description | A description of the data catalog. | 0..1 | Recommended |
schema:publisher | The publisher of the data catalog. | 1 | Required |
schema:dataset | The datasets that are contained in the data catalog. | 1..n | Required |
4.6.5. Full example
    {
      "@context": "https://schema.org/",
      "@type": "Dataset",
      "@id": "http://data.bibliotheken.nl/id/dataset/rise-alba",
      "name": "Alba amicorum van de Koninklijke Bibliotheek",
      "description": "Alba amicorum van de Koninklijke Bibliotheek, een dataset gedefinieerd voor het Europeana Rise of Literacy project.",
      "license": "http://creativecommons.org/publicdomain/zero/1.0/",
      "dateCreated": "2019-04-14",
      "datePublished": "2019-05-21",
      "dateModified": "2019-08-15",
      "keywords": [
        "alba amicorum"
      ],
      "includedInDataCatalog": "http://data.bibliotheken.nl/id/datacatalog",
      "inLanguage": [
        "nl-NL",
        "en-US"
      ],
      "mainEntityOfPage": "https://www.kb.nl/bronnen-zoekwijzers/kb-collecties/moderne-handschriften-vanaf-ca-1550/alba-amicorum",
      "publisher": {
        "@type": "Organization",
        "@id": "https://www.kb.nl",
        "name": "Koninklijke Bibliotheek",
        "alternateName": "KB",
        "sameAs": "https://ror.org/02w4jbg70"
      },
      "creator": {
        "@type": "Organization",
        "@id": "https://www.kb.nl",
        "name": "Koninklijke Bibliotheek",
        "alternateName": "KB",
        "sameAs": "https://ror.org/02w4jbg70"
      },
      "distribution": [
        {
          "@type": "DataDownload",
          "encodingFormat": "application/sparql-query",
          "contentUrl": "http://data.bibliotheken.nl/sparql"
        },
        {
          "@type": "DataDownload",
          "encodingFormat": "text/turtle",
          "contentUrl": "http://data.bibliotheken.nl/files/alba_20201016.ttl.gz",
          "name": "Datadump Alba Amicorum",
          "datePublished": "2019-05-21",
          "dateModified": "2019-08-15",
          "contentSize": "614K"
        }
      ]
    }
5. Tools
This section is non-normative.
Developers can use the NDE Register API to validate datasets and catalogs against the requirements described in this document. The [SHACL] shape graph used to validate datasets and catalogs is available at /shacl.
Google’s Rich Results Test (previously Structured Data Testing Tool) can help with testing RDF metadata in general.