Data on the Web Best Practices

This document provides best practices related to the publication and usage of data on the Web designed to help support a self-sustaining ecosystem. Data should be discoverable and understandable by humans and machines. Where data is used in some way, whether by the originator of the data or by an external party, such usage should also be discoverable and the efforts of the data publisher recognized. In short, following these best practices will facilitate interaction between publishers and consumers.

Namespaces used in the document
Prefix	Namespace IRI
cnt	http://www.w3.org/2011/content
dcat	http://www.w3.org/ns/dcat#
dct	http://purl.org/dc/terms/
dqv	http://www.w3.org/ns/dqv#
duv	http://www.w3.org/ns/duv#
foaf	http://xmlns.com/foaf/0.1/
oa	http://www.w3.org/ns/oa#
owl	http://www.w3.org/2002/07/owl#
pav	http://pav-ontology.github.io/pav/
prov	http://www.w3.org/ns/prov#
rdf	http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs	http://www.w3.org/2000/01/rdf-schema#
skos	http://www.w3.org/2004/02/skos/core#

The Best Practices

This section contains the best practices to be used by data publishers in order to help them and data consumers to overcome the different challenges faced when publishing and consuming data on the Web. One or more best practices were proposed for each one of the challenges, which are described in the section Data on the Web Challenges.

Each BP is related to one or more requirements from the Data on the Web Best Practices Use Cases & Requirements document document and the development of Data on the Web Best Practices was guided by these requirements, in such a way that each best practice should have at least one of these requirements as an evidence of its relevance.

Basic Example

John works for the Transport Agency of MyCity and he is in charge of the publication of data on the Web about bus stops as well as real time data about the traffic of the city. John decides to create two datasets: one for the bus stops and other one to offer real time traffic data.

When necessary RDF examples in Turtle syntax will be used to show the result of the application of some best practices.

Metadata

The Web is an open information space, where the absence of a specific context, such a company's internal information system, means that the provision of metadata is a fundamental requirement. Data will not be discoverable or reusable by anyone other than the publisher if insufficient metadata is provided. Metadata provides additional information that helps data consumers better understand the meaning of data, its structure, and to clarify other issues, such as rights and license terms, the organization that generated the data, data quality, data access methods and the update schedule of datasets.

Metadata can be used to help tasks such as dataset discovery and reuse, and can be assigned considering different levels of granularity from a single property of a resource to a whole dataset, or all datasets from a specific organization.

Metadata can be of different types. These types can be classified in different taxonomies, with different grouping criteria. For example, a specific taxonomy could define three metadata types according to descriptive, structural and administrative features. Descriptive metadata serves to identify a dataset, structural metadata serves to understand the structure in which the dataset is distributed and administrative metadata serves to provide information about the version, update schedule etc. A different taxonomy could define metadata types with a scheme according to tasks where metadata are used, for example, discovery and reuse.

Provide metadata

Metadata must be provided for both human users and computer applications.

Why

Providing metadata is a fundamental requirement when publishing data on the Web because data publishers and data consumers may be unknown to each other. Then, it is essential to provide information that helps human users and computer applications to understand the data as well as other important aspects that describes a dataset or a distribution.

Intended Outcome

The metadata should be coherent with the described resource (i.e. dataset, distribution or the data itself) and it must be possible for humans to interpret the metadata, which makes it human-readable metadata. It also should be possible for computer applications, notably user agents, to process the metadata, which makes it machine-readable metadata.

Possible Approach to Implementation

Possible approaches to provide human readable metadata:

to provide metadata as part of an HTML Web page
to provide metadata as a separate text file

Possible approaches to provide machine readable metadata:

machine readable metadata may be provided in a serialization format such as Turtle and JSON, or it can be embedded in the HTML page using [[HTML-RDFA]] or [[JSON-LD]]. If multiple formats are published separately, they should be served from the same URL using content negotiation. Maintenance of multiple formats is best achieved by generating each available format on the fly based on a single source of the metadata.
when defining machine readable metadata, reusing existing standard terms and popular vocabularies are strongly recommended. For example, Dublin Core Metadata (DCMI) terms [[DC-TERMS]] and Data Catalog Vocabulary [[VOCAB-DCAT]] should be used to provide descriptive metadata.

How to Test

Check if there are coherent metadata available, both human and machine readable, for the resource.

Check if the metadata is available in a valid machine readable format, without syntax error and processable by machine.

Evidence

Relevant requirements: R-MetadataAvailable, R-MetadataDocum, R-MetadataMachineRead

Benefits

Reuse
Comprehension
Discoverability
Processability

Provide descriptive metadata

The overall features of datasets and distributions must be described by metadata.

Why

Explicitly providing dataset descriptive information allows user agents to automatically discover datasets available on the Web and it allows humans to understand the nature of the dataset and its distributions.

Intended Outcome

Humans should be able to interpret the nature of the dataset and its distributions and user agents should be able to automatically discover datasets and distributions.

Possible Approach to Implementation

Descriptive metadata can include the following overall features of a dataset:

The title and a description of the dataset.
The keywords describing the dataset.
The date of publication of the dataset.
The entity responsible (publisher) for making the dataset available.
The contact point of the dataset.
The spatial coverage of the dataset.
The temporal period that the dataset covers.
The themes/categories covered by a dataset.

Descriptive metadata can include the following overall features of a distribution:

The title and a description of the distribution.
The date of publication of the distribution.
The media type of the distribution.

The machine readable version of the descriptive metadata can be provided using the vocabulary recommended by W3C to describe datasets, i.e. the Data Catalog Vocabulary [[VOCAB-DCAT]]. This provides a framework in which datasets can be described as abstract entities.

Machine-readable

The example below shows how to use [[VOCAB-DCAT]] to provide the machine readable discovery metadata for the bus stops dataset (:bus-stops-2015-05-05). The dataset has one CSV distribution (:bus-stops-2015-05-05.csv) that is also described using the [[VOCAB-DCAT]]. The dataset is classified under the domain represented by the relative URI :mobility . This domain may be defined as part of a set of domains identified by the URI :themes . To describe both concepts and schema concepts, John used SKOS . To express frequency of update an instance from the Content-Oriented Guidelines developed as part of the W3C Data Cube Vocabulary efforts was used. John chose to describe the spatial and temporal coverage of the example dataset using URIs from Geonames and the Interval dataset from data.gov.uk, respectively.

  :bus-stops-2015-05-05
      a dcat:Dataset ;
      dct:title "Bus stops of MyCity" ;
      dcat:keyword "transport","mobility","bus" ;
      dct:issued "2015-05-05"^^xsd:date ;
      dcat:contactPoint <http://data.mycity.example.com/public-transport/contact> ;
      dct:temporal <http://reference.data.gov.uk/id/year/2015> ;
      dct:spatial <http://www.geonames.org/3399415> ;
      dct:publisher :transport-agency-mycity ;
      dct:accrualPeriodicity <http://purl.org/linked-data/sdmx/2009/code#freq-A> ;
      dcat:theme :mobility ;
      dcat:distribution :bus-stops-2015-05-05.csv ;
      .

  :mobility
      a skos:Concept ;
      skos:inScheme :themes ;
      skos:prefLabel "Mobility" ;
      .

  :themes
      a skos:ConceptScheme ;
      skos:prefLabel "A set of domains to classify documents" ;
      .

  :bus-stops-2015-05-05.csv
      a dcat:Distribution ;
      dct:title "CSV distribution of bus-stops-2015-05-05 dataset" ;
      dct:description "CSV distribution of the bus stops dataset of MyCity" ;
      dcat:mediaType "text/csv" ;
      .

Human-readable

Example page with human-readable description of dataset is available.

How to Test

Check that the metadata for the dataset itself includes the overall features of the dataset.

Check if a user agent can automatically discover the dataset.

Evidence

Relevant requirements: R-MetadataAvailable, R-MetadataMachineRead, R-MetadataStandardized

Benefits

Reuse
Comprehension
Discoverability

Provide locale parameters metadata

Information about locale parameters (date, time, and number formats, language) should be described by metadata.

Why

Providing locale parameters metadata helps human users and computer applications to understand and to manipulate the data, improving the reuse of the data. Providing information about the locality for which the data is currently published aids data users in interpreting its meaning. Date, time, and number formats can have very different meanings, despite similar appearances. Making the language explicit allows users to determine how readily they can work with the data and may enable automated translation services.

Intended Outcome

It should be possible for human users and computer applications to interpret the meaning of dates, times and numbers accurately by referring to locale information.

Possible Approach to Implementation

Locale parameters metadata should include the following information:

The language(s) of the dataset.
The formats used for numeric values, dates and time.

The machine readable version of the discovery metadata may be provided according to the vocabulary recommended by W3C to describe datasets, i.e. the Data Catalog Vocabulary [[VOCAB-DCAT]].

Machine-readable

The example below shows the machine readable metadata for the bus stops dataset (:bus-stops-2015-05-05) with the inclusion of the locale parameters metadata. The property dct:language is used to declare the languages the dataset is published in. If the dataset is available in multiple languages, use multiple values for this property [[VOCAB-DCAT]]. The property dct:conformsTo is used to specify the standard adopted for date and time formats.

  
  :bus-stops-2015-05-05
      a dcat:Dataset ;
      dct:title "Bus stops of MyCity" ;
      dcat:keyword "transport","mobility","bus" ;
      dct:issued "2015-05-05"^^xsd:date ;
      dcat:contactPoint <http://data.mycity.example.com/public-transport/contact> ;
      dct:temporal <http://reference.data.gov.uk/id/year/2015> ;
      dct:spatial <http://www.geonames.org/3399415> ;
      dct:publisher :transport-agency-mycity ;
      dct:accrualPeriodicity <http://purl.org/linked-data/sdmx/2009/code#freq-A> ;
      dcat:theme :mobility ;
      dcat:distribution :bus-stops-2015-05-05.csv ;
      dct:language <http://id.loc.gov/vocabulary/iso639-1/en> ;
      dct:language <http://id.loc.gov/vocabulary/iso639-1/pt> ;
      dct:conformsTo "ISO 8601" ; 
      .

Human-readable

Example page with human-readable description of dataset is available.

How to Test

Check that the metadata for the dataset itself includes the language in which it is published and that all numeric, date, and time fields have locale metadata provided either with each field or as a general rule.

Evidence

Relevant requirements: R-FormatLocalize, R-MetadataAvailable

Benefits

Reuse
Comprehension

Provide structural metadata

Information about the schema and internal structure of a distribution must be described by metadata.

Why

Providing information about the internal structure of a distribution can be helpful when exploring or querying the dataset. Besides, structural metadata provides information that helps to understand the meaning of the data.

Intended Outcome

Humans should be able to interpret the schema of a dataset as well as software agents should be able to automatically process the structural metadata of the dataset distributions.

Possible Approach to Implementation

Human readable strucutral metadata usually provides the properties or columns of the dataset schema.

Machine readable structural metadata is available according to the format of a specific distribution and it may be provided within separate documents or embedded into the document. For more details see the links below.

Tabular data: see Model for Tabular Data and Metadata on the Web
JSON-LD: see JSON-LD 1.0
XML: see XML Schema

Machine-readable

John used the Model for Tabular Data and Metadata on the Web for publishing the CSV distribution of the bus stops dataset (:bus-stops-2015-05-05.csv). The example below presents the structural metadata for this distribution .

{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "url": "http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops-2015-05-05.csv",
  "dc:title": "CSV distribution of bus-stops-2015-05-05 dataset",
  "dcat:keyword": ["bus", "stop", "mobility"],
  "dc:publisher": {
    "schema:name": "Transport Agency of MyCity",
    "schema:url": {"@id": "http://example.org"}
  },
  "dc:license": {"@id": "http://opendefinition.org/licenses/cc-by/"},
  "dc:modified": {"@value": "2015-05-05", "@type": "xsd:date"},
  "tableSchema": {
    "columns": [{
      "name": "stop_id",
      "titles": ["Identifier"],
      "dc:description": "An identifier for the bus stop.",
      "datatype": "string",
      "required": true
    }, {
      "name": "stop_name",
      "titles": ["Name"],
      "dc:description": "The name of the bus stop.",
      "datatype": "string"
    },  {
      "name": "stop_desc",
      "titles": ["Description"],
      "dc:description": "A description for the bus stop.",
      "datatype": "string"
   {
      "name": "stop_lat",
      "titles": ["Latitude"],
      "dc:description": "The latitude of the bus stop.",
      "datatype": "number"
    },
    {
      "name": "stop_long",
      "titles": ["Longitude"],
      "dc:description": "The longitude of the bus stop.",
      "datatype": "number"
    },
    {
      "name": "zone_id",
      "titles": ["ZONE"],
      "dc:description": "An identifier for the zone where the bus stop is located.",
      "datatype": "string"
    },
    {
      "name": "stop_url",
      "titles": ["URL"],
      "dc:description": "URL that identifies the bus stop.",
      "datatype": "string"
    }],
    "primaryKey: "stop_id"
  }
}

Human-readable

Example page with human-readable structural metadata is available.

How to Test

Check if the structural metadata of the dataset is provided in a human-readable format.

Check if the machine-readable metadata of the distribution includes structural information about the data organization.

Evidence

Relevant requirements: R-MetadataAvailable

Benefits

Reuse
Comprehension
Processability

Data Licenses

A license is a very useful piece of information to be attached to data on the Web. According to the type of license adopted by the publisher, there might be more or fewer restrictions on sharing and reusing data. In the context of data on the Web, the license of a dataset can be specified within the data, or outside of it, in a separate document to which it is linked.

Provide data license information

Data license information should be available.

Why

The presence of license information is essential for data consumers to assess the usability of data. User agents, for example, may use the presence/absence of license information as a trigger for inclusion or exclusion of data presented to a potential consumer.

Intended outcome

It should be possible for humans to understand possible restrictions placed on the use of a distribution.

It should be possible for machines to automatically detect the data license of a distribution.

Possible Approach to Implementation

Data license information can be provided as a link to a human-readable license or as a link/embedded machine-readable license.

The machine readable version of the data license metadata may be provided using one of the following vocabularies that include properties for linking to a license:

Dublin Core [[DC-TERMS]] (dct:license)
Creative Commons [[CC-VOCAB]] (cc:license)
schema.org [[SCHEMA-ORG]] (schema:license)
XHTML [[XHTML-VOCAB]] (xhtml:license)

There are also a number of machine readable rights languages, including:

The Creative Commons Rights Expression Language [[ccREL]]
The Open Digital Rights Language [[ODRL]]
The Open Data Rights Statement Vocabulary [[ODRS]]

Machine-readable

The CSV distribution of the bus stops dataset (:bus-stops-2015-05-05.csv) will be published under the Creative Commons Attribution-ShareAlike 3.0 Unported license. The property dct:license is used to include this information as part of the distribution metadata.

  :bus-stops-2015-05-05.csv
      a dcat:Distribution ;
      dct:title "CSV distribution of bus-stops-2015-05-05 dataset" ;
      dct:description "CSV distribution of the bus stops dataset of MyCity" ;
      dcat:mediaType "text/csv" ;
      dct:license <http://creativecommons.org/licenses/by-sa/3.0/> ;
      .

Human-readable

Example page with human-readable data license information of the distribution.

How to Test

Check that the metadata for the dataset itself includes the data license information.

Check if a user agent can automatically detect the data license of the dataset.

Evidence

Relevant use cases: R-LicenseAvailable and R-MetadataMachineRead

Benefits

Reuse
Trust

Data Provenance

Data provenance becomes particularly important when data is shared between collaborators who might not have direct contact with one another either due to proximity or because the published data outlives the lifespan of the data provider projects or organizations.

The Web brings together business, engineering, and scientific communities creating collaborative opportunities that were previously unimaginable. The challenge in publishing data on the Web is providing an appropriate level of detail about its origin. The data producer may not necessarily be the data provider and so collecting and conveying this corresponding metadata is particularly important. Without provenance, consumers have no inherent way to trust the integrity and credibility of the data being shared. Data publishers in turn need to be aware of the needs of prospective consumer communities to know how much provenance detail is appropriate.

Provide data provenance information

Data provenance information should be available.

Why

Without accessible data provenance, data consumers will not know the origin or history of the published data.

Intended Outcome

It should be possible for humans to know the origin or history of the dataset.

It should be possible for machines to automatically process the provenance information about the dataset.

Possible Approach to Implementation

The machine readable version of the data provenance can be provided using an ontology recommended to describe provenance information, such as W3C's Provenance Ontology [[PROV-O]].

Machine-readable

The example below shows the machine readable metadata for the bus stops dataset (bus-stops) with the inclusion of the provenance metadata. The metadata specifies that John created the bus stops dataset. The property prov:actedOnBehalfOf is used to designate that John acted on behalf of the Transport Agency of MyCity.

  :bus-stops-2015-05-05
      a dcat:Dataset, prov:Entity ;
      dct:title "Bus stops of MyCity" ;
      dcat:keyword "transport", "mobility", "bus" ;
      dct:issued "2015-05-05"^^xsd:date ;
      dcat:contactPoint <http://data.mycity.example.com/public-transport/contact> ;
      dct:temporal <http://reference.data.gov.uk/id/year/2015> ;
      dct:spatial <http://www.geonames.org/3399415> ;
      dct:publisher :transport-agency-mycity ;
      dct:accrualPeriodicity <http://purl.org/linked-data/sdmx/2009/code#freq-A> ;
      dct:language <http://id.loc.gov/vocabulary/iso639-1/en> ;
      dct:creator :john ; 
      .

  :john
      a foaf:Person, prov:Agent ;
      foaf:givenName "John" ;
      foaf:mbox <mailto:john@mycitytransport.org> ;
      prov:actedOnBehalfOf :transport-agency-mycity ; 
      .
  :transport-agency-mycity
      a foaf:Organization, prov:Agent ;
      foaf:name "Transport Agency of Mycity" ;
      .

Human-readable

to be done.

How to Test

Check that the metadata for the dataset itself includes the provenance information about the dataset.

Check if a computer application can automatically process the provenance information about the dataset.

Evidence

Relevant requirements: R-ProvAvailable, R-MetadataAvailable

Benefits

Reuse
Comprehension
Trust

Data Quality

The quality of a dataset can have a big impact on the quality of applications that use it. As a consequence, the inclusion of data quality considerations in data publishing and consumption pipelines is of primary importance. Usually, the assessment of quality involves different kinds of quality dimensions, each representing groups of characteristics that are relevant to publishers and consumers. Measures and metrics are defined to assess the quality for each dimension [[DQV]]. There are heuristics designed to fit specific assessment situations that rely on quality indicators, namely, pieces of data content, pieces of data meta-information, and human ratings that give indications about the suitability of data for some intended use.

Provide data quality information

Data Quality information should be available.

Why

Data quality might seriously affect the suitability of data for specific applications, including applications very different from the purpose for which it was originally generated. Documenting data quality significantly eases the process of datasets selection, increasing the chances of reuse. Independently from domain-specific peculiarities, the quality of data should be documented and known quality issues should be explicitly stated in metadata.

Intended Outcome

It should be possible for humans to have access to information that describes the quality of the dataset and its distributions.

It should be possible for machines to automatically process the quality information about the dataset and its distributions.

Possible Approach to Implementation

The machine readable version of the dataset quality metadata may be provided according to the vocabulary that is being developed by the DWBP working group , i.e., the Data Quality Vocabulary [[DQV]].

Machine-readable

The example below shows the metadata for the CSV distribution of the bus stops dataset with the inclusion of the data quality metadata. The metadata was defined according to the Data Quality Vocabulary. Further examples can be found in the Data Quality Vocabulary document [[DQV]].

  :bus-stops-2015-05-05.csv
      a dcat:Distribution ;
      dcat:downloadURL <http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops-2015-05-05.csv> ;
      dct:title "CSV distribution of bus-stops-2015-05-05 dataset" ;
      dct:description "CSV distribution of the bus stops dataset of MyCity" ;
      dcat:mediaType "text/csv" ;
      dct:license <http://creativecommons.org/licenses/by-sa/3.0/> ;
      dqv:hasQualityMeasurement :measure1, :measure2  
      .
  :measure1
      a dqv:QualityMeasurement ;
      dqv:computedOn :bus-stops.csv ;
      dqv:metric :downloadURLAvailabilityMetric ;
      dqv:value "true"^^xsd:boolean 
      .
  :measure2
      a dqv:QualityMeasurement ;
      dqv:computedOn :bus-stops.csv ;
      dqv:metric :csvCompletenessMetric ;
      dqv:value "0.5"^^xsd:double 
      .

#definition of dimensions and metrics
  :availability
      a dqv:Dimension ;
      skos:prefLabel "Availability"@en ;
      skos:definition "Availability of a dataset is the extent to which data (or some portion of it) is present, obtainable and ready for use."@en ;
      dqv:inCategory :accessibility 
      .
  :completeness
      a dqv:Dimension ;
      skos:prefLabel "Completeness"@en ;
      skos:definition "Completeness refers to the degree to which all required information is present in a particular dataset."@en ;
      dqv:inCategory :intrinsicDimensions	
      .
  :downloadURLAvailabilityMetric
      a dqv:Metric ;
      skos:definition "It checks if dcat:downloadURL is available and if its value is dereferenceable."@en ;
      dqv:inDimension :availability
      .
  :csvComplitenessMetric
      a dqv:Metric ;
      skos:definition "Ratio between the number of objects represented in the cvs and the number of objects expected to be represented according to the declared dataset scope."@en ;
      dqv:inDimension :completeness
      .

Human-readable

Example page with human-readable data quality information.

How to Test

Check that the metadata for the dataset itself includes quality information about the dataset.

Check if a computer application can automatically process the quality information about the dataset.

Evidence

Relevant Requirements: R-QualityMetrics, R-DataMissingIncomplete R-QualityOpinions

Benefits

Reuse
Trust

Data Versioning

Datasets published on the Web may change over time. Some datasets are updated on a scheduled basis, and other datasets are changed as improvements in collecting the data make updates worthwhile. In order to deal with these changes, new versions of a dataset may be created. Unfortunately, there is no consensus about when changes to a dataset should cause it to be considered a different dataset altogether rather than a new version. In the following, we present some scenarios where most publishers would agree that the revision should be considered a new version of the existing dataset.

Scenario 1: a new bus stop is created and it should to be added to the dataset;
Scenario 2: an existing bus stop is removed and it should be deleted from the dataset;
Scenario 3: an error was identified in one of the existing bus stops stored in the dataset and this error must be corrected.

In general, multiple datasets that represent time series or spatial series, e.g. the same kind of data for different regions or for different years, are not considered multiple versions of the same dataset. In this case, each dataset covers a different set of observations about the world and should be treated as a new dataset. This is also the case with a dataset that collects data about weekly weather forecasts for a given city, where every week a new dataset is created to store data about that specific week.

Scenarios 1 and 2 might trigger a major version, whereas Scenario 3 would likely trigger only a minor version. But how you decide whether versions are minor or major is less important than that you avoid making changes without incrementing the version indicator. Even for small changes, it is important to keep track of the different dataset versions to make the dataset trustworthy. Publishers should remember that a given dataset may be in use by one or more data consumers, and they should take reasonable steps to inform those consumers when a new version is released. For real-time data, an automated timestamp can serve as a version identifier. For each dataset, the publisher should take a consistent, informative approach to versioning, so data consumers can understand and work with the changing data.

Provide a version indicator

An indication of the version number or date should be available for each dataset.

Why

Version information makes a revision of a dataset uniquely identifiable. Uniqueness can be used by data consumers to determine whether and how data has changed over time and to determine specifically which version of a dataset they are working with. Good data versioning enables consumers to understand if a newer version of a dataset is available. Explicit versioning allows for repeatability in research, enables comparisons, and prevents confusion. Using unique version numbers that follow a standardized approach can also set consumer expectations about how the versions differ.

Intended Outcome

It should be possible for data consumers to easily determine which version of a dataset they are working with. It should be possible to determine whether and how a given dataset differs from another of the same set.

Possible Approach to Implementation

The best method for providing versioning information will vary according to the context; however, there are some basic guidelines that can be followed, for example:

Include a unique version number or date as part of the metadata for the dataset.
Use a consistent numbering scheme with a meaningful approach to incrementing digits, such as [[SchemaVer]].
If the data is made available through an API, the URI used to request the latest version of the data should not change as the versions change, but it should be possible to request a specific version through the API.
Use Memento [[RFC7089]], or components thereof, to express temporal versioning of a dataset and to access the version that was operational at a given datetime. The Memento protocol aligns closely with the approach for assigning URIs to versions that is used for W3C specifications, described below.

The Web Ontology Language [[OWL2-QUICK-REFERENCE]] and the Provenance, Authoring and versioning Ontology [[PAV]] provides a number of annotation properties for version information.

Machine-readable

The example below shows the metadata for bus stops with the inclusion of the versioning metadata. The properties owl:versionInfo and pav:version are used to denote the version of the dataset.

  :bus-stops-2015-05-05 
      a dcat:Dataset ;
      dct:title "Bus stops of MyCity" ;
      dcat:keyword "transport","mobility","bus" ;
      dct:issued "2015-05-05"^^xsd:date ;
      dcat:contactPoint <http://data.mycity.example.com/public-transport/contact> ;
      dct:temporal <http://reference.data.gov.uk/id/year/2015> ;
      dct:spatial <http://www.geonames.org/3399415> ;
      dct:publisher :transport-agency-mycity ;
      dct:accrualPeriodicity <http://purl.org/linked-data/sdmx/2009/code#freq-A> ;
      dct:language <http://id.loc.gov/vocabulary/iso639-1/en> ;
      dct:creator :john ;
      owl:versionInfo "1.0" ; 
      pav:version "1.0" ; 
      .

The following example needs to be updated to use the URI http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops-2015-05-05.

Using Memento

Assume:

http://example.org/dataset is the “generic URI” at which the current version of a dataset is always available
http://example.org/bus-stops-002 is the versioned URI for the current dataset
http://example.org/bus-stops-001 is the versioned URI of the prior version of the dataset
http://example.org/bus-stops-000 is the versioned URI of the first version of the dataset

In the Memento protocol, the versioned URIs provide HTTP response header information to express their version datetime and their relation to the generic URI:

curl -I http://example.org/bus-stops-001

HTTP/1.1 200 OK
Memento-Datetime: Sun, 05 April 2015 00:00:00 GMT
Link: <http://example.org/bus-stops>; rel=“original”

The versioned URIs can provide a link to a TimeGate, which supports datetime negotiation as a means to determine which version of a dataset was operational at a given datetime:

curl -I http://example.org/bus-stops-001

HTTP/1.1 200 OK
Memento-Datetime: Sun, 05 April 2015 00:00:00 GMT
Link:<http://example.org/bus-stops>; rel=“original”,
<http://example.org/timegate/bus-stops>; rel=“timegate”

The generic URI can also provide a link to a TimeGate:

curl -i -H http://example.org/bus-stops

HTTP/1.1 200 OK
Link: <http://example.org/timegate/bus-stops>; rel=“timegate”

This is how a client determines which dataset version was operational on March 20 2015:

curl -I -H "Accept-Datetime: Fri, 20 Mar 2015  12:00:00 GMT" http://example.org/timegate/dataset

HTTP/1.1 302 Found
Vary: accept-datetime
Location: http://example.org/bus-stops-000
Link: <http://example.org/bus-stops> rel="original"

Human-readable

Example page with human-readable data versioning information.

How to Test

Check that a unique version number or date is provided with the metadata describing the dataset.

Evidence

Relevant requirements: R-DataVersion

Benefits

Reuse
Trust

Provide version history

A version history for the dataset should be available.

Why

In creating applications that use data, it can be helpful to understand the variability of that data over time. Interpreting the data is also enhanced by an understanding of its dynamics. Determining how the various versions of a dataset differ from each other is typically very laborious unless a summary of the differences is provided.

Intended Outcome

It should be possible for data consumers to understand how the dataset typically changes from version to version and how any two specific versions differ.

Possible Approach to Implementation

Provide a list of published versions and a description for each version that explains how it differs from the previous version. An API can expose a version history with a single dedicated URL that retrieves the latest version of the complete history.

Machine-readable

Suppose that a new bus stop was created. To keep bus-stops-2015-05-05 up to date, a new version of the dataset (bus-stops-2015-12-17) was created. bus-stops-2015-12-17 includes all the data from bus-stops-2015-05-05 plus the data about the new bus stop. The machine readable metadata for bus-stops-2015-12-17 is shown below.

  :bus-stops-2015-12-17
      a dcat:Dataset ;
      dct:title "Bus stops of MyCity" ;
      dcat:keyword "transport","mobility","bus" ;
      dct:issued "2015-12-17"^^xsd:date ;
      dcat:contactPoint <http://data.mycity.example.com/public-transport/contact> ;
      dct:temporal <http://reference.data.gov.uk/id/year/2015> ;
      dct:spatial <http://www.geonames.org/3399415> ;
      dct:publisher :transport-agency-mycity ;
      dct:accrualPeriodicity <http://purl.org/linked-data/sdmx/2009/code#freq-A> ;
      dct:language <http://id.loc.gov/vocabulary/iso639-1/en> ;
      dct:creator :john ;
       ...
      dct:isVersionOf :bus-stops-2015-05-05 ;
      pav:previousVersion: bus-stops-2015-05-05 ;
      rdfs:comment "The bus stops dataset was updated to reflect the creation of a new bus stop at 1115 Pearl Street." ;
      owl:versionInfo "1.1" ;
      pav:version "1.1" ; 
      .

The following example needs to be updated to use the URI http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops-2015-05-05.

Using Memento:

Assume:

http://example.org/bus-stops is the “generic URI” at which the current version of a dataset is always available
http://example.org/bus-stops-002 is the versioned URI for the current dataset
http://example.org/bus-stops-001 is the versioned URI of the prior version of the dataset
http://example.org/bus-stops-000 is the versioned URI of the first version of the dataset

The versioned URIs, the generic URI, and the TimeGate can provide a link to a TimeMap that provides an overview of all temporal versions of the dataset:

curl -I http://example.org/bus-stops-001

HTTP/1.1 200 OK
Date: Sat, 24 Oct 2015 16:23:30 GMT
Memento-Datetime: Sun, 05 April 2015 00:00:00 GMT
Link: <http://example.org/bus-stops>; rel=“original”,
 <http://example.org/timemap/bus-stops>; rel=“timemap”;
type="application/link-format"

curl -I http://example.org/timemap/bus-stops

HTTP/1.1 200 OK
Content-Type: application/link-format

<http://example.org/bus-stops>;rel="original”,
<http://example.org/timedate/bus-stops>;rel="timegate”,
<http://example.org/timedate/bus-stops>;rel="timemap”;
type="application/link-format",
<http://example.org/bus-stops-000>; rel=“first memento"; datetime="Thu,05 Mar 2015 00:00:00 GMT",
<http://example.org/bus-stops-002>; rel=“memento"; datetime=“Sun, 05 Apr2015 00:00:00 GMT"
<http://example.org/bus-stops-002>; rel=“last memento"; datetime="Tue,05 May 2015 00:00:00 GMT"

The versioned URI can provide information regarding relations with other dataset versions:

curl -I http://example.org/bus-stops-001

HTTP/1.1 200 OK
Memento-Datetime: Sun, 05 April 2015 00:00:00 GMT
Link: <http://example.org/bus-stops>; rel=“original”,
<http://example.org/bus-stops-000>; rel=“prev first memento";
datetime="Thu, 05 Mar 2015 00:00:00 GMT",
<http://example.org/bus-stops-002>; rel=“next last memento";
datetime="Tue, 05 May 2015 00:00:00 GMT"

Human-readable

Example page with human-readable data versioning history information.

How to Test

Check that a list of published versions is available as well as a change log describing precisely how each version differs from the previous one.

Evidence

Relevant requirements: R-DataVersion

Benefits

Reuse
Trust

Data Identifiers

Identifiers take many forms and are used extensively in every information system. Data discovery, usage and citation on the Web depends fundamentally on the use of HTTP (or HTTPS) URIs: globally unique identifiers that can be looked up by dereferencing them over the Internet [[RFC3986]]. It is perhaps worth emphasizing some key points about URIs in the current context.

URIs are 'dumb strings', that is, they carry no semantics. Their function is purely to identify a resource.
Although the previous point is accurate, it would be perverse for a URI such as http://example.com/dataset.csv to return anything other than a CSV file. Human readability is helpful.
When de-referenced (looked up), a single URI may offer the same resource in more than one format. http://example.com/dataset may offer the same data in, say, CSV, JSON and XML. The server returns the most appropriate format based on content negotiation .
One URI may redirect to another.
De-referencing a URI triggers a computer program to run on a server so that the URI acts as a call to an API. The server may therefore do something as simple as return a single, static file, or it may carry out complex processing. Precisely what processing is carried out, i.e. the software on the server, is completely independent of the URI itself.

Use persistent URIs as identifiers of datasets

Datasets must be identified by a persistent URI.

Why

Adopting a common identification system enables basic data identification and comparison processes by any stakeholder in a reliable way. They are an essential pre-condition for proper data management and reuse.

Developers may build URIs into their code and so it is important that those URIs persist and that they dereference to the same resource over time without the need for human intervention.

Intended Outcome

Datasets or information about datasets, must be discoverable and citable through time, regardless of the status, availability or format of the data.

Possible Approach to Implementation

To be persistent, URIs must be designed as such. This requires a different mindset to that used when creating a Web site designed for humans to navigate their way through. A lot has been written on this topic, see, for example, the European Commission's Study on Persistent URIs [[PURI]] which in turn links to many other resources.

Where a data publisher is unable or unwilling to manage a URI space directly for persistence, an alternative approach is to use a redirection service such as Permanent Identifiers for the Web or purl.org. These provide persistent URIs that can be redirected as required so that the eventual location can be ephemeral. The software behind such services is freely available so that it can be installed and managed locally if required.

Digital Object Identifiers (DOIs) offer a similar alternative. These identifiers are defined independently of any Web technology but can be appended to a 'URI stub.' DOIs are an important part of the digital infrastructure for research data and and libraries.

The URI http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops has several features that support persistence:

All names are subject to change over time but in choosing a domain name, it is reasonable for John to assume that MyCity will continue to exist and that it will continue to have a government. Therefore, while cases like Yugoslavia prove that even country names change and top level domains disappear (like .yu), a domain name based on the city's name is as persistent as any domain name can be.
By putting data on the data.mycity.example subdomain, John is creating a specific domain that can be managed independently of any particular department.
It is not safe to assume that a specific department will persist. The authorities in MyCity might very well decide that the Transport Agency should be merged with another to create the Transport and Environment Agency. It is right, therefore, not to include the name of the Transport Agency in the URI, but to include the task from which the data comes, in this case that of providing public transport.
Likewise, the path segments of /road and /bus take us further towards the specific dataset for which John is responsible.
The /dataset path segment is an indication that the URI identifies a dataset, rather than, say, a specific bus route.
Finally /bus-stops leads us to the dataset concerning bus stops in MyCity.
In DCAT terms, this would be the identifier for the dataset. Specific distributions of the dataset are likely to be identified by adding the relevant file extension to the URI, such as http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops.csv, http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops.json, http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops.ttl etc.

How to Test

Check that each dataset in question is identified using a URI that has been assigned under a controlled process as set out in the previous section. Ideally, the relevant Web site includes a description of the process and a credible pledge of persistence should the publisher no longer be able to maintain the URI space themselves.

Evidence

Relevant requirements: R-UniqueIdentifier, R-Citable

Benefits

Reuse
Linkability
Discoverability
Interoperability

Use persistent URIs as identifiers within datasets

Datasets should use and reuse other people's URIs as identifiers where possible.

Why

The power of the Web lies in the Network effect. The first telephone only became useful when the second telephone meant there was someone to call; the third telephone made both of them more useful yet. Data becomes more valuable if it refers to other people's data about the same thing, the same place, the same concept, the same event, the same person, and so on. That means using the same identifiers across datasets and making sure that your identifiers can be referred to by other datasets. When those identifiers are HTTP URIs, they can be looked up and more data discovered.

These ideas are at the heart of the 5 Stars of Linked Data where one data point links to another, and of Hypermedia where links may be to further data or to services (or more generally 'affordances') that act on or relate to the data in some way. Examples include a bug reporting mechanisms, processors, a visualization engine, a sensor, an actuator etc. In both Linked Data and Hypermedia, the emphasis is put on the ability for machines to traverse from one resource to another following links that express relationships.

That's the Web of Data.

Intended Outcome

That one data item can be related to others across the Web creating a global information space accessible to humans and machines alike.

Possible Approach to Implementation

This is a topic in itself and a general document such as this can only include superficial detail.

Developers know that very often the problem they're trying to solve will have already been solved by other people. In the same way, if you're looking for a set of identifiers for obvious things like countries, currencies, subjects, species, proteins, cities and regions, Nobel prize winners – someone's done it already. The steps described for discovering existing vocabularies [[LD-BP]] can readily be adapted.

ensure URI sets you use are published by a trusted group or organization;
ensure URI sets have persistent URIs.

If you can't find an existing set of identifiers that meet your needs then you'll need to create your own, following the patterns for URI persistence so that others will add value to your data by linking to it.

URIs can be long. In a dataset of even moderate size, storing each URI is likely to be repetitive and obviously wasteful. Instead, define locally unique identifiers for each element and provide data that allows them to be converted to globally unique URIs programmatically. The Metadata Vocabulary for Tabular Data [[tabular-metadata]] provides mechanisms for doing this within tabular data such as CSV files, in particular using URI template properties such as the about URL property.

The URI given as an example in the previous Best Practice (http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops) identifies a dataset. Much of the URI can be reused to identify bus stops, routes and the type of bus used on a given service. For example, a suitable persistent URI for the 'Airport - Bullfrog' route would be:

http://data.mycity.example.com/public-transport/road/bus/route/id/AB

This has the same initial structure as for the dataset but rather than /dataset it now includes the path segment /route so that humans can see that the type of thing identified is a bus route. The /id segment indicates that the URI identifies something that is not an information resource, i.e. something you cannot retrieve over the Internet, and /AB is the local identifier for the actual bus route. Dereferencing this URI should result in an HTTP 303 redirect to a similar URL such as http://data.mycity.example.com/public-transport/road/bus/route/doc/AB that describes, i.e. gives information about, the AB bus route (note the substitution of /doc for /id). Jeni Tennison's work on URLs in Data has more to say on this topic.

In offering this advice, it is recognized that URIs can be long. In a dataset of even moderate size, storing each URI is likely to be repetitive and obviously wasteful. Instead, define locally unique identifiers for each element (such as AB in this example) and provide data that allows them to be converted to globally unique URIs programmatically. The Metadata Vocabulary for Tabular Data [[tabular-metadata]] provides mechanisms for doing this within tabular data such as CSV files, in particular using URI template properties such as the about URL property.

How to Test

Check that within the dataset, references to things that don't change or that change slowly, such as countries, regions, organizations and people, as referred to by URIs or by short identifiers that can be appended to a URI stub. Ideally the URIs should resolve, however, they have value as globally scoped variables whether they resolve or not.

Evidence

Relevant requirements: R-UniqueIdentifier

Benefits

Reuse
Linkability
Discoverability
Interoperability

Assign URIs to dataset versions and series

URIs should be assigned to individual versions of datasets as well as the overall series.

Why

Like documents, many datasets fall into natural series or groups. For example:

noon temperature readings in central London 1850 to the present day;
today's noon temperature in London;
the temperature in London at noon on 3rd June 2015.

In different circumstances, it will be appropriate to refer separately to each of these examples (and many like them).

Intended Outcome

It should be possible to refer to a specific version of a dataset and to concepts such as a 'dataset series' and 'the latest version.'

Possible Approach to Implementation

The W3C provides a good example of how to do this. The (persistent) URI for this document is http://www.w3.org/TR/2015/WD-dwbp-20150224/. That identifier points to an immutable snapshot of the document on the day of its publication. The URI for the 'latest version' of this document is http://www.w3.org/TR/dwbp/ which is an identifier for a series of closely related documents that are subject to change over time. At the time of publication, these two URIs both resolve to this document. However, when the next version of this document is published, the 'latest version' URI will be changed to point to that.

This example needs to be rewritten.

To complete the London temperature example, one might imagine URIs as follows:

  http://weather.example.com/temperature/UK/London/noon
  http://weather.example.com/temperature/UK/London/noon/today
  http://weather.example.com/temperature/UK/London/noon/2015-06-03

How to Test

Check that each version of a dataset has its own URI, and that logical groups of datasets are also identifiable.

Evidence

Relevant requirements: R-UniqueIdentifier, R-Citable

Benefits

Reuse
Discoverability
Trust

Data Formats

The formats in which data is made available to consumers are a key aspect of making that data usable. The best, most flexible access mechanism in the world is pointless unless it serves data in formats that enable use and reuse. Below we detail best practices in selecting formats for your data, both at the level of files and that of individual fields. W3C encourages use of formats that can be used by the widest possible audience and processed most readily by computing systems. Source formats, such as database dumps or spreadsheets, used to generate the final published format, are out of scope. This document is concerned with what is actually published rather than internal systems used to generate the published data.

Use machine-readable standardized data formats

Data must be available in a machine-readable standardized data format that is adequate for its intended or potential use.

Why

As data becomes more ubiquitous, and datasets become larger and more complex, processing by computers becomes ever more crucial. Posting data in a format that is not machine readable places severe limitations on the continuing usefulness of the data. Data becomes useful when it has been processed and transformed into information.

Using non-standard data formats is costly and inefficient, and the data may lose meaning as it is transformed. On the other hand, standardized data formats enable interoperability as well as future uses, such as remixing or visualization, many of which cannot be anticipated when the data is first published. The use of non-proprietary data formats should also be considered since it increases the possibilities for use and reuse of data

Intended Outcome

It should be possible for machines to easily read and process data published on the Web.

It should be possible for data consumers to use computational tools typically available in the relevant domain to work with the data.

It should be possible for data consumers who wants to use or reuse the data to do so without investment in proprietary software.

Possible Approach to Implementation

Make data available in a machine readable standardized data format that is easily parseable including but not limited to CSV, XML, Turtle, NetCDF, JSON and RDF.

John decides to provide data about the bus stops of the route that goes from the airport to the central bus station of MyCity. The information provided includes, besides others, the name, the description, the latitude and the longitude of the bus stops.

   
  :busstops_airport_to_centralstation
      a dcat:Dataset ;
      dcat:distribution :busstops_airport_to_centralstation.csv
      .
   :busstops_airport_to_centralstation.csv
      a dcat:Distribution ;
      dcat:downloadURL <http://data.mycity.example.com/transport/bustops/airport_to_centralstation.csv> ;
      dct:title "CSV distribution of the bus stops of the route that goes from the airport to the central bus station of MyCity." ;
      dcat:mediaType "text/csv" ;
      dct:license <http://reference.data.gov.uk/id/open-government-licence>
      .

How to Test

Check that the data format conforms to a known machine-readable data format specification.

Evidence

Relevant requirements: R-FormatMachineRead, R-FormatStandardized R-FormatOpen

Benefits

Reuse
Processability

Provide data in multiple formats

Data should be available in multiple data formats.

Why

Providing data in more than one format reduces costs incurred in data transformation. It also minimizes the possibility of introducing errors in the process of transformation. If many users need to transform the data into a specific data format, publishing the data in that format from the beginning saves time and money and prevents errors many times over. Lastly it increases the number of tools and applications that can process the data.

Intended Outcome

It should be possible for data consumers to work with the data without transforming it.

Possible Approach to Implementation

Consider the data formats most likely to be needed by intended users, and consider alternatives that are likely to be useful in the future. Data publishers must balance the effort required to make the data available in many formats, but providing at least one alternative will greatly increase the usability of the data.

In order to reach a larger number of data consumers, John decided to also provide a XML distribution of bus stops of the route that goes from the airport to the central bus station of MyCity.

  :busstops_airport_to_centralstation
      a dcat:Dataset ;
      dcat:distribution :busstops_airport_to_centralstation.csv ;
      dcat:distribution :busstops_airport_to_centralstation.xml
      .
   :busstops_airport_to_centralstation.csv
      a dcat:Distribution ;
      dcat:downloadURL <http://data.mycity.example.com/transport/bustops/airport_to_centralstation.csv> ;
      dct:title "CSV distribution of the bus stops of the route that goes from the airport to the central bus station of MyCity." ;
      dcat:mediaType "text/csv" ;
      dct:license <http://reference.data.gov.uk/id/open-government-licence>
      .
    :busstops_airport_to_centralstation.xml
      a dcat:Distribution ;
      dcat:downloadURL <http://data.mycity.example.com/transport/bustops/airport_to_centralstation.xml> ;
      dct:title "XML distribution of the bus stops of the route that goes from the airport to the central bus station of MyCity." ;
      dcat:mediaType "text/xml" ;
      dct:license <http://reference.data.gov.uk/id/open-government-licence>
      .

How to Test

Check that the complete dataset is available in more than one data format.

Evidence

Relevant requirements: R-FormatMultiple

Benefits

Reuse
Processability

Data Vocabularies

Data is often represented in a structured and controlled way, making reference to a range of vocabularies, for example, by defining types of nodes and links in a data graph or types of values for columns in a table, such as the subject of a book, or a relationship “knows” between two persons. Additionally, the values used may come from a limited set of pre-existing values or resources: for example object types, roles of a person, countries in a geographic area, or possible subjects for books. Such vocabularies ensure a level of control, standardization and interoperability in the data. They can also serve to improve the usability of datasets. Say, a dataset contains a reference to a concept described in several languages. Such reference allows applications to localize their display of their search depending on the language of the user.

According to W3C, vocabularies define the concepts and relationships (also referred to as “terms” or “attributes”) used to describe and represent an area of concern. Vocabularies are used to classify the terms that can be used in a particular application, characterize possible relationships, and define possible constraints on using those terms. Several categories of vocabularies have been coined, for example, ontology, controlled vocabulary, thesaurus, taxonomy, code list, semantic network.

There is no strict division between the artifacts referred to by these names. “Ontology” tends however to denote the vocabularies of classes and properties that structure the descriptions of resources in (linked) datasets. In relational databases, these correspond to the names of tables and columns; in XML, they correspond to the elements defined by an XML Schema. Ontologies are the key building blocks for inference techniques on the Semantic Web. The first means offered by W3C for creating ontologies is the RDF Schema [[RDF-SCHEMA]] language. It is possible to define more expressive ontologies with additional axioms using languages such as those in The Web Ontology Language [[OWL2-OVERVIEW]].

On the other hand, “controlled vocabularies”, “concept schemes”, “knowledge organization systems” enumerate and define resources that can be employed in the descriptions made with the former kind of vocabulary. A concept from a thesaurus, say, “architecture”, will for example be used in the subject field for a book description (where “subject” has been defined in an ontology for books). For defining the terms in these vocabularies, complex formalisms are most often not needed. Simpler models have thus been proposed to represent and exchange them, such as the ISO 25964 data model [[ISO-25964]] or W3C's Simple Knowledge Organization System [[SKOS-PRIMER]].

Use standardized terms

Standardized terms should be used to provide data and metadata.

Why

Using standardized lists of codes other commonly used terms for data and metadata values as much as possible helps avoiding ambiguity and clashes between these values.

Intended Outcome

The benefit of using standardized code lists and other commonly used terms is to enhance interoperability and consensus among data publishers and consumers.

Possible Approach to Implementation

Values in datasets should refer as much as possible to standardization efforts or organizations, which defines terms or codes as a clear reference.

Organizations like the Open Geospatial Consortium (OGC), ISO, W3C, libraries and research data services, etc., provide list of codes, terminologies or even Linked Data vocabularies that can be used for this.

A key point is to make sure the dataset or its documentation provides enough (human- and machine-readable) context for the values, so that data consumers can retrieve and exploit the standardized meaning of these values. In the context of the Web, using unambiguous, Web-based identifiers for standardized values is an efficient way to do this.

1. The Library of Congress publishes lists of ISO 639 country codes as Linked Data (see [[ISO639-1-LOC]] for two-letter codes):

:bus-stops
    dct:language <http://id.loc.gov/vocabulary/iso639-1/en> .

2. The British Oceanographic Data Center Web Services publishes a reference list of URIs for types of tools to perform scientific measures, such as http://vocab.nerc.ac.uk/collection/L05/current/357/ for acoustic tracking systems:

:measurement-001 a prov:Activity ;
    prov:used :sensor-001 .
:sensor-001 a prov:Agent ;
    dct:type <http://vocab.nerc.ac.uk/collection/L05/current/357/> .

(note that real-life data should use more specific classes and properties than the PROV ones for typing a measurement activity, an instrument and the relationship between these)

3. Australia's Solid Earth and Environment Grid publish a reference list of URIs for geologic timescale elements from the International Commission on Stratigraphy's Chronostratigraphic Chart, such as http://resource.geosciml.org/classifier/ics/ischart/Paleozoic for the Paleozoic Era:

:dataset-005 a dcat:Dataset ;
    dct:temporal <http://resource.geosciml.org/classifier/ics/ischart/Paleozoic> .

4. Google maintains a General Transit Feed Specification that defines a format for publishing public transportation data. This format relies on a set of fields like route_short_name or route_type that are carefully defined and exposed to constant community feedback in order to facilitate consensus, which leads to easier adoption and greater interoperability. Definitions include specifications of coded values for some fields, as in the following extract for route_type codes:

0 - Tram, Streetcar, Light rail. Any light rail or street level system within a metropolitan area.
1 - Subway, Metro. Any underground rail system within a metropolitan area.
2 - Rail. Used for intercity or long-distance travel.

Note that in a non-linked data fashion, these fields and codes have no individual Web identifiers nor machine-readable semantics. Exploiting them thus requires implementers to parse the documentation and encode interpretations in each individual application consuming the data.

How to Test

Check that the terms or codes to be used are defined in a standard organization/working group of body such as IETF, OGC, W3C, etc.

Evidence

Relevant requirements: R-MetadataStandardized, R-MetadataDocum R-QualityComparable

Benefits

Reuse
Comprehension
Trust
Interoperability

Reuse vocabularies

Shared vocabularies should be used to encode data and metadata.

Why

Use of shared vocabularies captures and facilitates consensus in communities. Reusing existing vocabularies to encode datasets and metadata increases interoperability and reduces redundancies, encouraging reuse of these data. In particular, the use of shared vocabularies for metadata (especially structural, provenance, quality and versioning metadata) helps the automatic processing of both data and metadata.

Intended Outcome

Datasets and metadata sets are easier to be compared by humans or machines when they use the same vocabulary to describe metadata.

When two datasets or metadata sets use the same vocabulary, (automatic) processing tools designed for one can be more easily applied to the other. This greatly facilitates re-use of datasets.

Possible Approach to Implementation

The Vocabularies section of the W3C Best Practices for Publishing Linked Data [LD-BP] provides guidance on the discovery, evaluation and selection of existing vocabularies.

1. All examples in this document show how common vocabularies (PROV, SKOS, etc) can be reused to express data and metadata statements, instead of minting entirely new classes and properties.

2. The DCAT vocabulary to express metadata on datasets [VOCAB-DCAT] re-uses elements from Dublin Core, FOAF, SKOS and vCard, see its namespaces section. For example, reusing Dublin Core properties like dct:title instead of creating new ones (say, dcat:title) enables DCAT-based metadata to be consumed by any application that can read and manipulate Dublin Core statements.

3. In the digital culture sector, the data model for the initiative Europeana (EDM) also widely re-uses existing shared vocabularies like Dublin Core, FOAF, SKOS, etc. This has facilitated adoption of EDM by Europeana's data providers and helped position it as a best practice for similar initiatives in the same sector. For instance, metadata application profile from the Digital Public Library of America reuses EDM and thus the various vocabularies that EDM builds on. As a result, large amounts of digital culture data have become more interoperable within the sector. That data is also easier to reuse by consumers from other communities, who are not familiar with the traditional models and terminologies used by library, archives and museums.

How to Test

Using vocabulary repositories like the Linked Open Vocabularies repository or lists or services mentioned in technology-specific best practices such as the Best Practices for Publishing Linked Data [LD-BP], or the Core Initial Context for RDFa and JSON-LD, check that classes, properties, terms, elements or attributes used to represent a dataset do not replicate those defined by vocabularies used for other datasets.

Evidence

Relevant requirements: R-QualityComparable, R-VocabReference

Benefits

Reuse
Processability
Interoperability

Choose the right formalization level

When reusing a vocabulary, a data publisher should opt for a level of formal semantics that fit data and applications.

Why

Formal semantics help to establish precise specifications that support establishing the intended meaning of the vocabulary and the performance of complex tasks such as reasoning. On the other hand, complex vocabularies require more effort to produce and understand, which could hamper their reuse, as well as the comparison and linking of datasets exploiting them. Highly formalized data is also harder to exploit by inference engines: for example, using an OWL class in a position where a SKOS concept is enough, or using OWL classes with complex OWL axioms raises the formal complexity of the data according to the OWL Profiles [[OWL2-PROFILES]]. Data producers should therefore seek to identify the right level of formalization for particular domains, audiences and tasks, and maybe offer different formalization levels when one size does not fit all.

Intended Outcome

The data supports all application cases but should not be more complex to produce and reuse than necessary;

Possible Approach to Implementation

Identify the "role" played by the vocabulary for the datasets, say, providing classes and properties used to type resources and provide the predicates for RDF statements, or elements in an XML Schema, as opposed to providing simple concepts or codes that are used for representing attributes of the resources described in a dataset. When simpler data models are enough to convey the necessary semantics, represent vocabularies using them.

Even when a language with rich formal semantics like OWL is used to express a vocabulary, it is preferable that this vocabulary has a minimal ontological commitment, i.e. by featuring only the formal axioms that enable inferences and validation checks that have been explicitly identified as relevant for the domain or application at hand. The more axioms are used to specify a vocabulary, the narrower its usage is; unnecessary axioms unnecessarily constrain the reuse of a vocabulary across applications.

1. For expressing simple vocabularies like thesauri or code lists as Linked Data, a simpler data model like SKOS may be preferred over formal ontology languages like OWL; see for example how concept schemes and code lists are represented and used in the RDF Data Cube Recommendation [[VOCAB-DATA-CUBE]].

2. Designers of the SKOS ontology itself have minimized its ontological commitment by questioning all formal axioms that were suggested for its classes and properties. Often they have been rejected because their use, while beneficial to many applications, would have created formal inconsistencies for the data from other applications, making SKOS not usable at all for these. As an example, the property skos:broader was not defined as a transitive property, even though it would have fit the way hierarchical links between concepts are created for many thesauri [[SKOS-DESIGN]].

3. The vocabularies for Data Quality [[DQV]] and Data Usage [[DUV]] created by the W3C Working Group publishing this document have also sought to minimize the number of formal axioms involved in their definition. For instance, the property dqv:hasQualityMeasurement has no formal domain in the RDFS/OWL sense, even though it is expected to be most often used with resources that are of type dcat:Dataset or dcat:Distribution. This allows application designers to employ it for other types of entities, for which quality measurements would also be relevant but that were not in the focus of the design process for DQV.

4. The Schema.org schemas for publishing structured data on the Web have adopted an informative rather than normative approach for defining the types of objects these properties can be used with. For instance, the values of the property author are only "expected" to be of type Organization or Person. author "can be used" on the type CreativeWork but this is not a strict constraint.

How to Test

For formal knowledge representation languages, applying an inference engine on top of the data that uses a given vocabulary does not produce too many statements that are unnecessary for target applications.

Evidence

Relevant requirements: R-VocabReference, R-QualityComparable

Benefits

Reuse
Comprehension
Interoperability

Sensitive Data

To support best practices for publishing sensitive data, data publishers should identify all sensitive data, assess the exposure risk, determine the intended usage, data user audience and any related usage policies, obtain appropriate approval, and determine the appropriate security measures needed to taken to protect the data, which should also account for secure authentication and use of HTTPS.

Data publishers should preserve the privacy of individuals where the release of personal information would endanger safety (unintended accidents) or security (deliberate attack). Privacy information might include: full name, home address, mail address, national identification number, IP address (in some cases), vehicle registration plate number, driver's license number, face, fingerprints, or handwriting, credit card numbers, digital identity, date of birth, birthplace, genetic information, telephone number, login name, screen name, nickname, health records etc.

At times, because of sharing policies sensitive data may not be available in part or in its entirety. Data unavailability represents gaps that may affect the overall analysis of datasets. To account for unavailable data, data publishers should publish information about unavoidable data gaps.

Provide data unavailability reference

References to data that is not open, or available under different restrictions to the origin of the reference, should provide explanation about how the referred data can be accessed and who can access it.

Why

Publishing online documentation about unavailable data due to sensitivity issues provides a means for publishers to explicitly identify knowledge gaps. This provides a contextual explanation for consumer communities thus encouraging use of the data that is available.

Intended Outcome

Publishers should provide information about data that is referred to from the current dataset but that is unavailable or only available under different conditions.

Possible Approach to Implementation

Depending on the machine/human context there are a variety of ways to indicate data unavailability. Data publishers may publish an HTML document that gives a human-readable explanation for data unavailability. From a machine application interface perspective, appropriate HTTP status codes with customized human readable messages can be used. Examples of status codes include: 404 (file not found), 410 (permanently removed), 503 (service *providing data* unavailable).

How to Test

If the dataset includes references to other data that is unavailable, check whether an explanation is available in the metadata and/or description of it.

Evidence

Relevant requirements: R-AccessLevel

Benefits

Reuse
Trust

Data Access

Providing easy access to data on the Web enables both humans and machines to take advantage of the benefits of sharing data using the Web infrastructure. By default, the Web offers access using Hypertext Transfer Protocol (HTTP) methods. This provides access to data at an atomic transaction level. However, when data is distributed across multiple files or requires more sophisticated retrieval methods different approaches can be adopted to enable data access, including bulk download and APIs.

One approach is packaging data in bulk using non-proprietary file formats (for example tar files). Using this approach, bulk data is generally pre-processed server side where multiple files or directory trees of files are provided as one downloadable file. When bulk data is being retrieved from non-file system solutions, depending on the data user communities, the data publisher can offer APIs to support a series of retrieval operations representing a single transaction.

For data that is streaming to the Web in “real time” or “near real time”, data publishers should publish data or use APIs to enable immediate access to data, allowing access to critical time sensitive data, such as emergency information, weather forecasting data, or published system metrics. In general, APIs should be available to allow third parties to automatically search and retrieve data published on the Web.

On a further note, it can be observed that data on the Web is essentially about the description of entities identified by a unique, Web-based, identifier (an URI). Once the data is dumped and sent to an institute specialised in digital preservation the link with the Web is broken (dereferencing) but the role of the URI as a unique identifier still remains. In order to increase the usability of preserved dataset dumps it is relevant to maintain a list of these identifiers.

Provide bulk download

Data should be available for bulk download.

Why

When Web data is distributed across many URIs but might logically be organized as one container, accessing the data in bulk can be useful. Bulk access provides a consistent means to handle the data as one container. Individually accessing data over many retrievals can be cumbersome and, if used to reassemble the complete dataset, can lead to inconsistent approaches to handling the data.

Intended Outcome

It should be possible to download data on the Web in bulk. Data publishers should provide a way, through either a single-file download or a single API call, for consumers to access all the data. Large file transfers (which would require more time than a typical user would consider reasonable) should be enabled by dedicated file-transfer protocols. The bulk download should include the metadata describing the dataset. Discovery metadata [[VOCAB-DCAT]] should also be available outside the bulk downlaod.

Possible Approach to Implementation

Depending on the nature of the data and consumer needs, possible approaches could include the following:

Preprocessing a copy of the data into a compressed archive format and making the data accessible from one URI. This is particularly useful for handling data that changes infrequently.
Hosting an API that includes the ability to retrieve a bulk download in addition to dynamic queries. This approach is useful for capturing a complete snapshot of dynamic data.
For very large datasets, bulk file transfers can be enabled via means other than http, such as bbcp or GridFTP.

How to Test

Humans can retrieve complete copies of preprocessed bulk data through existing tools such as a browser via a single request.

Evidence

Relevant requirements: R-AccessBulk

Benefits

Reuse
Access

Provide Subsets for Large Datasets

If your dataset is large, enable users and applications to readily work with useful subsets of your data.

Why

Large datasets can be difficult to move from place to place. It can also be inconvenient for users to store or parse a large dataset. Users should not have to download a complete dataset if they only need a subset of it. Moreover, Web applications that tap into large datasets will perform better if their developers can take advantage of “lazy loading”, working with smaller pieces of a whole and pulling in new pieces only as needed. The ability to work with subsets of the data also enables offline processing to work more efficiently. Real-time applications benefit in particular, as they can update more quickly.

Intended Outcome

Both human users and applications should be able to access subsets of a dataset, rather than the entire thing, as needed. Subsetting approaches should aim for a high ratio of needed data to unneeded data for the largest number of users. Static datasets that users in the domain would consider to be large should be downloadable in smaller pieces. APIs should make slices or filtered subsets of the data available, the granularity depending on the needs of the domain and the demands of performance in a Web application.

Possible Approaches to Implementation

Consider the expected use cases for your dataset and determine what types of subsets are likely to be most useful. An API is usually the most flexible approach to serving subets of data, as it allows customization of what data is transferred, making the available subsets much more likely to provide the needed data--and little unneeded data--for any given situation. The granularity should be suitable for Web application access speeds. (An API call that returns within one second enables an application to deliver interactivity that feels natural. Data that takes more than ten seconds to deliver will likely cause users to suspect failure.)

Another way to subset a dataset is to simply split it into smaller units and make those units individually available for download or viewing.

It can also be helpful to mark up a dataset so that individual sections through the data (or even smaller pieces, if expected use cases warrant it) can be processed separately. One way to do that is by indicating “slices” with the RDF Data Cube Vocabulary.

1. Using an API: the MyCity transit API offers information about individual buses, bus stops, and time points. It can accept a query for all the timepoints for a single stop on a single bus route, or it can accept one for all the buses along one route at one time point, or it can accept one for all the time points for a single bus on a single route. These small slices through the data enable Web applications to update data quickly and deliver it to riders on demand, interactively.

2. Splitting the datset: the MyCity transit agency has a set of nicely formatted single-route schedules available in PDF format. Users can select the route they plan to ride from a menu and view or download the schedule for that route alone.

How to Test

Check that the full content of the dataset can be recovered by retrieval of multiple subsets.

Evidence

Relevant requirements: R-Citable, R-GranularityLevels, R-UniqueIdentifier, R-AccessRealTime

Benefits

Reuse
Linkability
Access
Processability

Use content negotiation for serving data available in multiple formats

It is recommended to use content negotiation for serving data available in multiple formats.

Why

It is possible to have data being served in a HTML page mixed with human-readable and machine-readable data. RDFa could be used to mix HTML content with semantic data.

But, in some cases this page is subject of scraping by some applications in order to get data available. When structured data is mixed with HTML, but it is possible to have a different representation with the same structured data, written in Turtle or JSON-LD, it is recommended to serve this page using Content Negotiation.

A dataset can also be served in different representations, and can be retrieved by using an API or by direct access to the resource URI. In those cases, HTTP Content Negotiation technique can be used.

Intended Outcome

It should be possible to serve the same resource with different representations.

Possible Approach to Implementation

A possible approach to implementation is to configure the Web server to deal with content negotiation of the requested resource.

The specific format of the resource's representation can be accessed by the URI or by the Content-type of the HTTP Request.

Different representations of the bus stops dataset can be served according to the specified content type of the HTTP Request:
Using cURL to get the content of http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops represented in CSV and in JSON-LD format.

curl -H "Accept: text/csv" http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops

curl -H "Accept: application/ld+json" http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops

How to test

Check the available representations of the resource and try to get them specifying the accepted content on the HTTP Request header.

Evidence

Relevant requirements:

Benefits

Reuse
Access

Provide real-time access

When data is produced in real-time, it should be available on the Web in real-time.

Why

The presence of real-time data on the Web enables access to critical time sensitive data, and encourages the development of real-time Web applications. Real-time access is dependent on real-time data producers making their data readily available to the data publisher. The necessity of providing real-time access for a given application will need to be evaluated on a case by case basis considering refresh rates, latency introduced by data post processing steps, infrastructure availability, and the data needed by consumers. In addition to making data accessible, data publishers may provide additional information describing data gaps, data errors and anomalies, and publication delays.

Intended Outcome

Data should be available at real time or near real time, where real-time means a range from milliseconds to a few seconds after the data creation, and near real time is a predetermined delay for expected data delivery.

Possible Approach to Implementation

Real-time data accessibility may be achieved through two means:

Push - as data is produced the producers communicates data to the data publisher either by disseminating data to the publisher or making storage available accessible to the data producer.
On-Demand (Pull) - available real-time data is made available upon request. In this case, data publishers will provide an API to facilitate these read-only requests.

In addition to data access, to ensure credibility providing access to error conditions, anomalies, and instrument "house keeping" data enhance real-time applications ability to interpret and convey real-time data quality to consumers.

How to Test

To adequately test real time data access, data will need to be tracked from the time it is initially collected to the time it is published and accessed. [[PROV-O]] can be used to describe these activities. Caution should be used when analyzing real-time access for systems that consist of multiple computer systems. For example, tests that rely on wall clock time stamps may reflect inconsistencies between the individual computer systems as opposed to data publication time latency.

Evidence

Relevant requirements: R-AccessRealTime

Benefits

Reuse
Access

Provide data up to date

Data must be available in an up-to-date manner and the update frequency made explicit.

Why

Data on the Web availability should closely coincide with data provided at creation time, collection time, or after it has been processed or changed. Carefully synchronizing data publication to the update frequency encourages data consumer confidence and reuse.

Intended Outcome

When new data is provided or data is updated, it must be published to coincide with the data changes.

Possible Approach to Implementation

Implement an API to enable data access. When data is provided by bulk access, new files with new data should be provided as soon as additional data is created or updated. Or, use technologies that are intended to expose data on the Web using interlinked resources, like Activity Streams or Atom.

If the site provides multiple data feeds providing multiple updates, the last update date time stamp should be co-located with each separate data feed. The international date format is recommended to avoid any ambiguity https://www.w3.org/International/questions/qa-date-format.

How to Test

Write test standard operating procedure for data publisher to keep test data on Web site up to date.

Following standard operating procedure:

Write test client to access published data.
Access data and save first copy locally.
Publish an updated version of data.
Access data and save second copy locally.
Compare first copy to second copy to verify change.

Evidence

Relevant requirements: R-AccessUptodate

Benefits

Reuse
Access

Data Access APIs

In the following, we present best practices related with the creation of APIs to provide data access.

Make Data Available through an API

Offer an API to serve data if you have the resources to do so.

Why

An API offers the greatest flexibility and processability for consumers of your data. It can enable real-time data usage, filtering on request, and the ability to work with the data at an atomic level. If your dataset is large, frequently updated, or highly complex, an API is likely to be the best option for publishing your data.

Intended Outcome

Developers will have programmatic access to the data for use in their own applications. Data can be updated without requiring effort on the part of consumers.

Possible Approach to Implementation

Creating an API is a little more involved than posting data for download. It requires some understanding of how to build a Web application. One need not necessarily build from scratch, however. If you use a data management platform, such as CKAN, you may be able to enable an existing API. Many Web development frameworks include support for APIs, and there are also frameworks written specifically for building custom APIs.

How to Test

It should be possible for Web applications to obtain specific data by querying a programmatic interface. Use a test client to simulate calls and responses, making sure that the performance is acceptable.

Evidence

Relevant requirements: R-AccessRealTime, R-AccessUpToDate

Benefits

Reuse
Processability
Interoperability
Access

To review the BP "Use an API" and possibly rewrite Possible Approach to Impelmentation section.

Use Web Standards as the foundation of your API

When designing APIs, it is recommended to use an architectural style that is founded on the technologies of the Web itself. Web standards such as URIs, HTTP verbs, HTTP response codes, MIME types, typed HTTP Links, and content negotiation can all help solve difficult problems and enable you to build a flexible and useful data service.

Why

APIs that are built on Web standards leverage the strengths of the Web. For example, using HTTP verbs as methods and URIs that map directly to individual resources helps to avoid tight coupling between requests and responses, making for an API that is easy to maintain and can readily be understood and used by many developers. The statelessness of the Web can be a strength in enabling quick scaling, and using hypermedia enables rich interactions with your API.

Intended Outcome

Developers who have some experience with APIs based on Web standards, such as REST, will have an initial understanding of how to use your API because it uses a standardized interface. Your API will also be easier to maintain.

Possible Approaches to Implementation

REST (REpresentational State Transfer) is an architectural style that, when used in a Web API, takes advantage of the architecture of the Web itself. A full discussion of how to build a RESTful API is beyond the scope of this document, but there are many resources and a strong community that can help in getting started. There are also many RESTful development frameworks available. If you are already using a Web development framework that supports building REST APIs, consider using that. If not, consider an API-only framework that uses REST.

Another aspect of implementation to consider is making a hypermedia API, one that responds with links rather than data alone. Links are what make the Web a web, and data APIs can be more useful and usable by including links in their responses. The links can offer additional resources, documentation, and navigation. Even for an API that does not meet all the constraints of REST, returning links in responses can make for a service that is rich and self-documenting.

An example response for information about a certain bus route from a hypermedia API might look like the following:

  {
    "code": "200",
    "text": "OK",
    "data": {
      "update_time": "2013-01-01T03:00:02Z",
      "route_id": "52",
      "route_name" "Lexington South"
      "route_description": "Lexington corridor south of Market",
      "route_type": "3",
    },
    "links": [
      {
        "href": "https://api.transit.mycity.org/v2/routes/52",
        "rel": "self",
        "type": "application/json",
        "method": "GET"
      },
      {
        "href": "https://api.transit.mycity.org/v2/routes",
        "rel": "collection",
        "type": "application/json",
        "method": "GET"
      }
      {
        "href": "https://api.transit.mycity.org//v2/schedules/52",
        "rel": "described-by",
        "type": "application/json",
        "method": "GET"
      },
      {
        "href": "https://api.transit.mycity.org//v2/maps/52",
        "rel": "described-by",
        "type": "application/json",
        "method": "GET"
      }
    ]
  }

How to Test

Check that the service avoids using http as a tunnel for calls to custom methods, and check that URIs do not contain method names.

Evidence

Relevant requirements: R-APIDocumented, R-UniqueIdentifier

Benefits

Reuse
Linkability
Interoperability
Discoverability
Access
Processability

Provide complete documentation for your API

Provide complete information on the Web about your API. Be sure to update documentation as you add features or make changes.

Why

Developers are the primary consumers of an API. In order to develop against it, they will need to understand how to use it. Providing comprehensive documentation in one place allows developers to code efficiently. Highlighting changes enables your users to take advantage of new features and adapt their code if needed.

Intended Outcome

The whole set of information related to the API—how to use it, notices of recent changes, contact information, and so on—should be easily browsable on the Web.

Developers should be able to obtain detailed information about each call to the API, including the parameters it takes and what it is expected to return.

The API should be self-documenting as well, so that calls return helpful information about errors and usage.

Recent changes to the API itself should be readily discoverable by users.

API users should be able to contact the maintainers with questions, suggestions, or bug reports.

Possible Approach to Implementation

A typical API reference provides a comprehensive list of the calls the API can handle, describing the purpose of each one, detailing the parameters it allows and what it returns, and giving one or more examples of its use. One nice trend in API documentation is to provide a form in which developers can enter specific calls for testing, to see what the API returns for their use case. There are now tools available for quickly creating this type of documentation, such as Swagger, io-docs, OpenApis, and others.

How to Test

Check that every call enabled by your API is described in your documentation. Make sure you provide details of what parameters are required or optional and what each call returns. The quality of documentation is also related to usage and feedback from developers. Try to get constant feedback from your users about the documentation.

Evidence

Relevant requirements: R-APIDocumented

Benefits

Reuse
Trust

Avoid Breaking Changes to Your API

Avoid changes to your API that break client code, and communicate any changes in your API to your developers when evolution happens.

Why

When developers implement a client for your API, they may rely on specific characteristics that you have built into it, such as the schema or the format of a response. Avoiding breaking changes in your API minimizes breakage to client code. Communicating changes when they do occur enables developers to take advantage of new features and, in the rare case of a breaking change, take action.

Intended Outcome

Developer code will continue to work. Developers will know of improvements you make and be able to make use of them. Breaking changes to your API will be rare, and if they occur, developers will have sufficient time and information to adapt their code. That will enable them to avoid breakage, enhancing trust.

Possible Approach to Implementation

When improving your API, focus on adding new calls or new, unrequired options rather than changing how existing calls work. Existing clients can ignore such changes and will continue functioning.

If using a fully RESTful style, you should be able to avoid changes that affect developers by keeping home resource URIs constant and changing only elements that your users do not code to directly. If you need to change your data in ways that are not compatible with the extension points that you initially designed, then a completely new design is called for, and that means changes that break client code. In that case, it’s best to implement the changes as a new REST API, with a different home resource URI.

If using an architectural style that does not allow you to make moderately significant changes without breaking client code, use versioning. Indicate the version in the response header. Version numbers should be reflected in your URIs or in request "accept" headers (using content negotiation). When versioning in URIs, include the version number as far to the left as possible. Keep the previous version available for developers whose code has not yet been adapted to the new version.

Changes to the API should be announced on your API documentation site. To notify users directly of changes, it's a good idea to create a mailing list and encourage developers to join. You can then announce changes there, and this provides a nice mechanism for feedback as well. It also allows your users to help each other.

How to Test

Release changes initially to a test version of your API before applying them to the production version. Invite developers to test their applications on the test version and provide feedback.

Evidence

Relevant requirements: R-PersistentIdentification, R-APIDocumented

Benefits

Trust
Interoperability

Data Preservation

This section describes best practices related to data preservation. Albeit being a closely related topic archiving is considered out of scope for this group and therefore not covered here.

Data Preservation Best Practices are still under discussion (see ISSUE-251 ). We're keen to hear comments about the following questions:

Do we need the BP "Use a trusted serialization format for preserved data dumps" or this should be covered under the BP about using standardized formats?
Do we need the BP "Update the status of identifiers" or this should be covered under versioning or unavailability?
Do we need to rewrite the BP Assess dataset coverage to make it more clear that datasets should have minimal dependencies on external entities that may not be preserved?

Assess dataset coverage

The coverage of a dataset should be assessed prior to its preservation.

Why

A chunk of Web data is by definition dependent on the rest of the global graph. This global context influences the meaning of the description of the resources found in the dataset. Ideally, the preservation of a particular dataset would involve preserving all its context. That is the entire Web of Data.

At ingestion time an evaluation of the linkage of Web data dataset dump to already preserved resources is assessed. The presence of all the vocabularies and target resources in uses is sought in a set of digital archives taking care of preserving Web data. Datasets for which very few of the vocabularies used and/or resources pointed out are already preserved somewhere should be flagged as being at risk.

Intended Outcome

It should be possible to appreciate the coverage and external dependencies of a given dataset.

Possible Approach to Implementation

The assessment can be performed by the digital preservation institute or the dataset depositor. It essentially consists in checking whether all the resources used are either already preserved somewhere or provided along with the new dataset considered for preservation.

A dataset targetted for preservation is made of the following triples:

 
  <http://data.mycity.example.com/public-transport/road/bus/route/ABtimetable> 
      a gtfs:Route ;
      gtfs:color "ff0000" ;
      gtfs:shortname "10" ;
      gtfs:longName "Airport - Bullfrog" ;
      gtfs:agency <http://data.mycity.example.com/transport-agency/DTA> ;
      gtfs:routeType ex:three ;
      ex:usualVehicleType dbpedia:Roumaster ;
      foaf:isPrimaryTopicOf ex:Airport_Bullfrog
      .

  <http://data.mycity.example.com/public-transport/road/bus/route/BFC> 
      a gtfs:Route ;
      gtfs:color "ffff00" ;
      gtfs:shortname "20" ;
      gtfs:longName "Bullfrog - Furnace Creek Resort" ;
      gtfs:agency <http://data.mycity.example.com/transport-agency/DTA> ;
      gtfs:routeType ex:three ;
      ex:usualVehicleType dbpedia:Articulated_bus ;
      foaf:isPrimaryTopicOf ex:Bullfrog_Furnace_Creek_Resort
      .

  …

Those triples make use of the "gtfs" vocabulary and a custom one defined in the testing domain name "ex". It also uses entities defined in "foaf", "dbpedia" and "ex". Although not formal standards, FOAF and GTFS [[GTFS]] are well established ontologies that are archived in several places on the Web (see, for instance, the LOV repository). Entities defined in DBpedia are also preserved through their Memento gateway and archived dumps of the dataset also exist. The risks associated to preserving the triple making use of those external resource is thus minimal. A bigger concern arises from the usage made of resources defined in "ex" which is a namespace that, by design, does not exist outside of the dataset. Unless the data describing "ex:usualVehicleType", "ex:Airport_Bullfrog" and "ex:Bullfrog_Furnace_Creek_Resort" is preserved alongside those triples their contextual meaning will be lost. This is particularly critical for "ex:usualVehicleType" as without it the relationship between the described route and the dbpedia resources will be unknown to a consuming application (however obvious it may be to a human).

Considering this assessment, a revised dataset including the definition of "ex:usualVehicleType" can be considered for preservation:

  <http://data.mycity.example.com/public-transport/road/bus/route/AB> a gtfs:Route;
      gtfs:color "ff0000" ;
      gtfs:shortname "10" ;
      gtfs:longName "Airport - Bullfrog" ;
      gtfs:agency <http://data.mycity.example.com/transport-agency/DTA> ;
      gtfs:routeType ex:three ;
      ex:usualVehicleType dbpedia:Roumaster ;
      foaf:isPrimaryTopicOf ex:Airport_Bullfrog
      .

  <http://data.mycity.example.com/public-transport/road/bus/route/BFC> 
      a gtfs:Route;
      gtfs:color "ffff00";
      gtfs:shortname "20";
      gtfs:longName "Bullfrog - Furnace Creek Resort";
      gtfs:agency <http://data.mycity.example.com/transport-agency/DTA>;
      gtfs:routeType ex:three;
      ex:usualVehicleType dbpedia:Articulated_bus;
      foaf:isPrimaryTopicOf ex:Bullfrog_Furnace_Creek_Resort
      .

  …

  # Custom vocabulary element
  ex:usualVehicleType 
      a rdf:Property ;
      rdfs:subPropertyOf gtfs:routeType ;
      rdfs:range gtfs:Bus.

This second, more complete, dataset is better suited for preservation as it is more self-describing and only makes use of external entities whose preservation is trusted.

How to Test

Datasets making references to portions of the Web of Data which are not preserved should receive a lower score than those using common resources.

Evidence

Relevant requirements:R-VocabReference

Benefits

Reuse
Trust

Use a trusted serialisation format for preserved data dumps

Data depositors willing to send a data dump for long term preservation must use a well established serialisation.

Why

Web data follows an abtract data model that can be expressed in different ways (RDF/XML, JSON-LD, ...). Using a well-established serialisation of this data increases its chances of reuse.

Institutes, such as national archives, that are engaged in digital preservation are tasked with monitoring file formats regularly for potential risk of obsolescence. Datasets which have been acquired in some format some years ago may have to be converted into another format in order to still be usable with more modern software (see [[ROSENTHAL]]). This task can be made more challenging, or even impossible, if non-standard serialisation formats are used by data depositors.

Intended Outcome

It should be possible to read and load the dataset into a computer for manipulation even if the original software that was used to create it is no longer available or supported.

Possible Approach to Implementation

Give preference to non-binary Web data serialisation formats that are available as open standards. For instance those provided by the W3C [[FORMATS]].

Those triples are serialised as RDf using the Turtle W3C recommendation. It is a text-based format which is supported by the majority of software able to process Web data. This format can thus be trusted for preservation.

 # Definition of a person
 ex:bob a ex:Staff;
     foaf:basedNear dbpedia:Cardiff;
     foaf:knows ex:john.

A custom-made serialisation of the same data such as the comma-delimited example that follows is an example of an inappropriate serialisation of RDF and is neither hrlpful nor good practice for preserving the dataset.

ex:bob,a,ex:staff;###,foaf:basedNear,dbpedia:Cardiff;###,foaf:knows,ex:john.

How to Test

Check that the dataset can be read by a standard text editor. Try to dereference the HTTP URIs present in the data dump using for example [[cURL]], confirming that the Content-Type header matches the format you expect to get.

Evidence

Relevant requirements:R-FormatStandardized

Benefits

Long-term availability of accessible data

Update the status of identifiers

Preserved resources should be linked through URIs with their "live" counterparts.

Why

URI dereferencing is a primary interface to data on the Web. Linking preserved datasets with the original URI informs the data consumer (which might be a computer programme) that there are other, more recent, versions and facilitates determining the status of these resources.

During its life cycle a dataset may undergo several modifications resulting in multiple versions. Although URIs assigned to things are not expected to change, the description of these resource will evolve over time. During this evolution, several snapshots could be made available for preservation and accessed as earlier versions of the current dataset.

Intended Outcome

A link is maintained between the URI of a resource, the most up-to-date description available for it, and preserved descriptions. If the dataset resource does not exist any more then the description should say so and refer to the last preserved description that was available.

Possible Approach to Implementation

There are a variety of HTTP status codes that could be put into use to relate the URI with its preserved description. In particular, 200, 410 and 303 can be used for different scenarios:

200 => there is a new description which contains pointers to the archived description
410 => the resource is no longer available but it has been removed under a controlled process cf. 404 which simply states that something is not available.
303 => the resource identified by this URI is no longer served here but there is a preserved description at a different location.

In addition to the status codes, HTTP Link headers can also be used to relate resources to their preserved descriptions.

The following example needs to be updated to use the URI http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops-2015-05-05.

One approach with a link header is to use the Memento protocol to give a link to a timegate providing access to the preserved descriptions of the resource:

curl -I http://example.org/bus-stops-001

HTTP/1.1 200 OK
Memento-Datetime: Sun, 05 April 2015 00:00:00 GMT
Link: http://example.org/bus-stops; rel=“original”, http://example.org/timegate/dataset; rel=“timegate”

Using HTTP status code the data consumer can be redirected to the most recent description of the entity. In the following example a request for the resource "http://example.org/bus-stops-001" is first redirected to the description "http://example.org/data/bus-stops-001" which, as it has been preserved and flagged as invalid, redirects the client to the newer description "http://example.org/newdata/bus-stops-001"

curl -L -I http://example.org/bus-stops-001

HTTP/1.1 303 See Other
Location: http://example.org/data/bus-stops-001
Link: http://example.org/newdata/bus-stops-001, rel="new"

HTTP/1.1 303 See Other
Location: http://example.org/newdata/bus-stops-001
Link: http://example.org/data/bus-stops-001, rel="previous"

HTTP/1.1 200 Ok

How to Test

Check that dereferencing the URI of a preserved dataset returns information about its current status and availability.

Evidence

Relevant requirements:R-AccessLevel, R-PersistentIdentification

Benefits

Reuse
Trust

Feedback

Publishing data on the Web enables data sharing on a large scale, providing data access to a wide range of audiences with different levels of expertise. Data publishers want to ensure that the data published is meeting the data consumer needs and user feedback is crucial. Feedback has benefits for both data publishers and data consumers, helping data publishers to improve the integrity of their published data, as well as to encourage the publication of new data. Feedback allows data consumers to have a voice describing usage experiences (e.g. applications using data), preferences and needs. When possible, feedback should also be publicly available for other data consumers to examine. Making feedback publicly available allows users to become aware of other data consumers, supports a collaborative environment, and allows user community experiences, concerns or questions are currently being addressed.

From a user interface perspective there are different ways to gather feedback from data consumers, including site registration, contact forms, quality ratings selection, surveys and comment boxes for blogging. From a machine perspective the data publisher can also record metrics on data usage or information about specific applications consumers are currently relying upon. Feedback such as this establishes a line of communication channel between data publishers and data consumers. In order to quantify and analyze usage feedback, it should be recorded in a machine-readable format. Publicly available feedback should be displayed in a human-readable form through the user interface.

This section provides some BP to be followed by data publishers in order to enable data consumers to provide feedback about the consumed data. This feedback can be for humans or machines.

Gather feedback from data consumers

Data publishers should provide a means for consumers to offer feedback.

Why

Providing feedback contributes to improving the quality of published data, may encourage publication of new data, helps data publishers understand data consumers needs better and, when feedback is made publicly available, enhances the consumers' collaborative experience.

Intended Outcome

It should be possible for data consumers to provide feedback and rate data in both human and machine-readable formats. The feedback should be Web accessible and it should provide a URL reference to the corresponding dataset.

Possible Approach to Implementation

Provide data consumers with one or more feedback mechanisms including, but not limited to: a registration form, contact form, point and click data quality rating buttons, or a comment box for blogging.

Collect feedback in machine-readable formats to represent the feedback and use a vocabulary to capture the semantics of the feedback information.

How to Test

Demonstrate how feedback can be collected from data consumers.
Verify that the feedback is persistently stored. If the feedback is made publicly available verify that a URL links back to the published data being referenced.
Check that the feedback format conforms to a known machine-readable format specification in current use among anticipated data users.

Evidence

Relevant requirements: R-UsageFeedback, R-QualityOpinions

Benefits

Reuse
Comprehension
Trust

Make feedback available

Feedback should be available for both human users and computer applications.

Why

Making feedback about datasets and distributions publicly available allows users to become aware of other data consumers, supports a collaborative environment, and allows user community experiences, concerns or questions are currently being addressed. Providing feedback in a machine-readable format allows computer applications to automatically collect and process feedback about datasets.

Intended Outcome

It should be possible for humans to have access to feedback on a dataset or distribution given by one or more data consumers.

It should be possible for machines to automatically process feedback about a dataset or distribution.

Possible Approach to Implementation

Feedback can be availabe as part of an HTML Web page, but it can also be provided in a machine-readable format according to the vocabulary to describe dataset usage [[DUV]].

  :bus-stops-2015-05-05
      a dcat:Dataset ;
      dct:title "Bus stops of MyCity" ;
      dcat:keyword "transport","mobility","bus" ;
      dct:issued "2015-05-05"^^xsd:date ;
      dcat:contactPoint <http://data.mycity.example.com/public-transport/contact>> ;
      dct:temporal <http://reference.data.gov.uk/id/year/2015> ;
      dct:spatial <http://www.geonames.org/3399415> ;
      dct:publisher :transport-agency-mycity ;
      dct:accrualPeriodicity <http://purl.org/linked-data/sdmx/2009/code#freq-A> ;
      dcat:theme :mobility ;
      dcat:distribution :bus-stops-2015-05-05.csv ;
      .

  :bus-stops-2015-05-05.csv
      a dcat:Distribution ;
      dct:title "CSV distribution of bus-stops-2015-05-05 dataset" ; 
      dct:description "CSV distribution of the bus stops dataset of MyCity" ;
      dcat:mediaType "text/csv" ;
      .

  :comment1Content a cnt:ContentAsText ;
      cnt:chars "This dataset is missing stop 3" ; 
      .

  :comment1
      a duv:UserFeedback ;
      oa:hasBody comment1Content ;
      oa:hasTarget :bus-stops-2015-05-05 ;
      dct:creator :localresident ;
      .

  :comment2Content a cnt:ContentAsText ;
      cnt:chars "Are tab delimited formats also available?" ;
      .

  :comment2
      a duv:UserFeedback ;
      oa:hasTarget :bus-stops-2015-05-05.csv ;
      oa:hasBody comment2Content ;
      dct:creator :localresident ;
      .

  :localresident
      a foaf:Person ;
      foaf:Name "Alan Law" ;
      .

How to Test

Check if a human consumer can access the feedback about the dataset or distribution and check if a computer application can automatically process the feedback.

Evidence

Relevant requirements: R-UsageFeedback, R-QualityOpinions

Benefits

Reuse
Trust

Data Enrichment

Data enrichment refers to a set of processes that can be used to enhance, refine or otherwise improve raw or previously processed data. This idea and other similar concepts contribute to making data a valuable asset for almost any modern business or enterprise.

This section provides some advice to be followed by data publishers in order to enrich data.

Enrich data by generating new data

Enrich your data by generating new data from the raw data when doing so will enhance its value.

Why

Enrichment can greatly enhance processability, particularly for unstructured data. Missing values can be filled in, and new attributes and measures can be added. Publishing more complete datasets enhances trust. Deriving additional values that are of general utility saves users time and encourages more kinds of reuse. There are many intelligent techniques that can be used to enrich data, making the dataset an even more valuable asset.

Intended Outcome

A dataset that has missing values should be enhanced if possible to fill in those values. Additional relevant measures or attributes should be added if they enhance utility. Unstructured data can be given structure in this way as well.

Because inference-based enrichment may introduce errors into the data, values generated by such techniques should be labeled as such, and it should be possible to retrieve any original values replaced by enrichment.

Whenever licensing permits, the code used to enrich the data should be made available along with the dataset. Sharing such code is particularly important for scientific data.

Possible Approaches to Implementation

Machine learning can be readily applied to the enrichment of data. Methods include those focused on data categorization, disambiguation, entity recognition, sentiment analysis, and topification, among others. After new data is extracted, it can be provided as part of any open data format.

New data values may be derivable as simply as performing a mathematical calculation across existing columns. Other examples include visual inspection to identify features in spatial data and cross-reference to external databases for demographic information.

How to test

Look for missing values in the dataset or additional fields likely to be needed by others. Check that any data added by inferential enrichment techniques is identified as such and that any replaced data is still available. Check that code used to enrich the data is available. Check whether the metadata being extracted is in accordance with human knowledge and readable by humans.

Evidence

Relevant requirements: R-DataEnrichment, R-FormatMachineRead, R-ProvAvailable

Benefits

Reuse
Comprehension
Trust
Processability

Provide Complementary Presentations

Enrich data by also presenting it in complementary, immediately informative ways, such as visualizations, tables, Web applications, or summaries.

Why

Data published online is meant to inform others about its subject. But only posting datasets for download or API access puts the burden on consumers to interpret it. The Web offers unparalleled opportunities for presenting data in ways that let users learn and explore without having to create their own tools.

Intended Outcome

Besides making datasets available for download, processing, and reuse, publishers should give human consumers immediate insight into the data by presenting it in ways that are readily understood. Data consumers should not have to create their own tools to understand the meaning of the data.

Possible Approaches to Implementation

One very simple way to provide immediate insight is to publish an analytical summary in an HTML page. Including summative data in graphs or tables can help users scan the summary and quickly understand the meaning of the data.

If you have the means to create interactive visualizations or Web applications that use the data, you can give consumers of your data greater ability to understand it and discover patterns in it. These approaches also demonstrate its suitability for processing and encourage reuse.

How to test

Check that the dataset is accompanied by some additional interpretive content that can be perceived without downloading the data or invoking an API.

Evidence

Relevant requirements: R-DataEnrichment

Benefits

Reuse
Comprehension
Access
Trust

Data Usage/Data Reuse

We're keen for comments about the title of this section. Which one is more suitable: Data Usage or Data Reuse?

Reusing data is another way of publishing data. It can take the form of combining existing data with other datasets, creating Web applications or visualizations, or repackaging the data in a new form, such as a translation. Data reusers have some responsibilities that are unique to that form of publishing on the Web. This section provides advice to be followed when reusing data.

Provide Feedback to the Original Publisher

When using data published by others, let them know that you are reusing their data. If you find an error or have suggestions or compliments, let them know.

Why

Publishers generally want to know whether the data they publish has been useful. Moreover, they may be required to report usage statistics in order to allocate resources to data publishing activities. Reporting your usage helps them justify putting effort toward data releases. Providing feedback repays the publishers for their efforts by directly helping them to improve their dataset for future users.

Intended Outcome

Better communication will make it easier for original publishers to determine how the data they post is being used, which in turn helps them justify publishing the data. Publishers will also be made aware of steps they can take to improve their data. This leads to more and better data for everyone.

Possible Approach to Implementation

When you begin using a dataset in a new product, make a note of the publisher’s contact information, the URI of the dataset you used, and the date on which you contacted them. This can be done in comments within your code where the dataset is used. Follow the publisher’s preferred route to provide feedback. If they do not provide a route, look for contact information for the Web site hosting the data.

# Calling the MyCity transit API, http://api.mycitytransit.example.org/docs/
# Published by MyCity Transit Agency,
# notified of our reuse by email to opendata@mycitytransit.example.org
# by Newton Calegari on 3/24/2016.

How to test

Check that you have a record of at least one communication informing the publisher of your use of the data.

Evidence

Relevant requirements: R-TrackDataUsages, R-UsageFeedback, R-QualityOpinions

Benefits

Reuse
Interoperability
Trust

Follow Licensing Terms

Find and follow the licensing requirements from the original publisher of the dataset.

Why

Licensing provides a legal framework for using someone else’s work product. By adhering to the original publisher’s requirements, you keep the relationship between yourself and the publisher friendly. You don’t need to worry about legal action from the original publisher if you are following their wishes. Understanding the initial license will help you determine what license to select for your reuse.

Intended Outcome

Data publishers will be able to trust that their work is being reused in accordance with their licensing requirements, which will make them more likely to continue to publish data. Reusers of data will themselves be able to properly license their derivative works.

Possible Approach to Implementation

Read the original license and adhere to its requirements. If the license calls for specific licensing of derivative works, choose your license to be compatible with that requirement. If no license is given, contact the original publisher and ask what the license is.

How to test

Read through the original license and check that your use of the data does not violate any of the terms.

Evidence

Relevant requirements: R-LicenseAvailable, R-LicenseLiability,

Benefits

Reuse
Trust

Cite the Original Publication

Indicate the source of your data in the metadata for your reuse of the data. If you provide a user interface, include the citation visibly in the interface.

Why

Data is only useful when it is trustworthy. Identifying the source is a major indicator of trustworthiness in two ways: first, the user can judge the trustworthiness of the data from the reputation of the source, and second, citing the source suggests that you yourself are trustworthy as a republisher. In addition to informing the end user, citing helps publishers by crediting their work. Publishers who make data available on the Web deserve acknowledgment and are more likely to continue to share data if they find they are credited. Citation also maintains provenance and helps still others to work with the data.

Intended Outcome

End users should be able to assess the trustworthiness of the data they see, and original publishers should be recognized for their efforts. The chain of provenance for data on the Web should be traceable back to its original publisher.

Possible Approach to Implementation

You can use the Dataset Usage Vocabulary to cite the original publication of the data in metadata.

  :bus-stops-2015-05-05 
      a dcat:Dataset ;
      dct:title "Bus stops of MyCity" ;
      dct:issued "2015-05-05"^^xsd:date ;
      prism:doi "10.3456/4567.21"^^xsd:string ;
      dct:creator :john ;
      owl:versionInfo "1.0" ; 
      pav:version "1.0" ;
      . 

  ex:bus-stops-memorandum
      a biro:BibliographicReference ;
      a fabio:Policy  ;
      dct:bibliographicCitation
      "Costello, E. Mayor (2016). City Bust Stops Memorandum 
      January, 2016. DOI:0.3456/4567.21"^^xsd:string ;
      biro:references :bus-stops-2015-05-05 ;
      .

You can cite the original source in a user interface by providing bibliographic text and a working link.

Data source: Costello, E. Mayor (2016) "City Bust Stops Memorandum". January, 2016. References dataset: http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops-2015-05-05.

How to test

Check that the original source of any reused data is cited in the metadata provided. Check that a human-readable citation is readily visible in any user interface.

Evidence

Relevant requirements: R-Citable, R-ProvAvailable, R-MetadataAvailable

Benefits

Reuse
Discoverability
Trust

Introduction

Audience

Scope

Context

Namespaces

Best Practices Template

The Best Practices

Basic Example

Metadata

Data Licenses

Data Provenance

Data Quality

Data Versioning

Data Identifiers

Data Formats

Data Vocabularies

Sensitive Data

Data Access

Data Access APIs

Data Preservation

Feedback

Data Enrichment

Data Usage/Data Reuse

Conclusions

Glossary

Data on the Web Challenges

Best Practices Benefits

Use Cases Requirements x Best Practices

Acknowledgements

Change history

Best Practice	Benefits
Provide metadata	Reuse Comprehension Discoverability Processability
Provide descriptive metadata	Reuse Comprehension Discoverability
Provide locale parameters metadata	Reuse Comprehension
Provide structural metadata	Reuse Comprehension Processability
Provide data license information	Reuse Trust
Provide data provenance information	Reuse Comprehension Trust
Provide data quality information	Reuse Trust
Provide a version indicator	Reuse Trust
Provide version history	Reuse Trust
Use persistent URIs as identifiers of datasets	Reuse Linkability Discoverability Interoperability
Use persistent URIs as identifiers within datasets	Reuse Linkability Discoverability Interoperability
Assign URIs to dataset versions and series	Reuse Discoverability Trust
Use machine-readable standardized data formats	Reuse Processability
Provide data in multiple formats	Reuse Processability
Use standardized terms	Reuse Comprehension Trust Interoperability
Reuse vocabularies	Reuse Processability Interoperability
Choose the right formalization level	Reuse Comprehension Interoperability
Provide data unavailability reference	Reuse Trust
Provide bulk download	Reuse Access
Provide Subsets for Large Datasets	Reuse Linkability Access Processability
Use content negotiation for serving data available in multiple formats	Reuse Access
Provide real-time access	Reuse Access
Provide data up to date	Reuse Access
Make Data Available through an API	Reuse Processability Interoperability Access
Use Web Standards as the foundation of your API	Reuse Linkability Interoperability Discoverability Access Processability
Provide complete documentation for your API	Reuse Trust
Avoid Breaking Changes to Your API	Trust Interoperability
Assess dataset coverage	Reuse Trust
Use a trusted serialisation format for preserved data dumps	Long-term availability of accessible data
Update the status of identifiers	Reuse Trust
Gather feedback from data consumers	Reuse Comprehension Trust
Make feedback available	Reuse Trust
Enrich data by generating new data	Reuse Comprehension Trust Processability
Provide Complementary Presentations	Reuse Comprehension Access Trust
Provide Feedback to the Original Publisher	Reuse Interoperability Trust
Follow Licensing Terms	Reuse Trust
Cite the Original Publication	Reuse Discoverability Trust

Requirement	Best Practices
R-MetadataAvailable	Provide metadata Provide descriptive metadata Provide locale parameters metadata Provide structural metadata Provide data provenance information Cite the Original Publication
R-MetadataDocum	Provide metadata Use standardized terms
R-MetadataMachineRead	Provide metadata Provide descriptive metadata Provide data license information
R-MetadataStandardized	Provide descriptive metadata Use standardized terms
R-FormatLocalize	Provide locale parameters metadata
R-LicenseAvailable	Provide data license information Follow Licensing Terms
R-ProvAvailable	Provide data provenance information Enrich data by generating new data Cite the Original Publication
R-QualityMetrics	Provide data quality information
R-DataMissingIncomplete	Provide data quality information
R-QualityOpinions	Provide data quality information Gather feedback from data consumers Make feedback available Provide Feedback to the Original Publisher
R-DataVersion	Provide a version indicator Provide version history
R-UniqueIdentifier	Use persistent URIs as identifiers of datasets Use persistent URIs as identifiers within datasets Assign URIs to dataset versions and series Provide Subsets for Large Datasets Use Web Standards as the foundation of your API
R-Citable	Use persistent URIs as identifiers of datasets Assign URIs to dataset versions and series Provide Subsets for Large Datasets Cite the Original Publication
R-FormatMachineRead	Use machine-readable standardized data formats Enrich data by generating new data
R-FormatStandardized	Use machine-readable standardized data formats Use a trusted serialisation format for preserved data dumps
R-FormatOpen	Use machine-readable standardized data formats
R-FormatMultiple	Provide data in multiple formats
R-QualityComparable	Use standardized terms Reuse vocabularies Choose the right formalization level
R-VocabReference	Reuse vocabularies Choose the right formalization level Assess dataset coverage
R-AccessLevel	Provide data unavailability reference Update the status of identifiers
R-AccessBulk	Provide bulk download
R-GranularityLevels	Provide Subsets for Large Datasets
R-AccessRealTime	Provide Subsets for Large Datasets Provide real-time access Make Data Available through an API
R-AccessUpToDate	Provide data up to date Make Data Available through an API
R-APIDocumented	Use Web Standards as the foundation of your API Provide complete documentation for your API Avoid Breaking Changes to Your API
R-PersistentIdentification	Avoid Breaking Changes to Your API Update the status of identifiers
R-UsageFeedback	Gather feedback from data consumers Make feedback available Provide Feedback to the Original Publisher
R-DataEnrichment	Enrich data by generating new data Provide Complementary Presentations
R-TrackDataUsages	Provide Feedback to the Original Publisher
R-LicenseLiability	Follow Licensing Terms