Consumer guidance from Jeni Tennison on 2011-11-20 (public-html-data-tf@w3.org from November 2011)

From: Jeni Tennison <jeni@jenitennison.com>
Date: Sun, 20 Nov 2011 23:01:31 +0000
To: HTML Data Task Force WG <public-html-data-tf@w3.org>
Message-Id: <890CD066-1ECC-482D-8956-0D93F10D53B3@jenitennison.com>
Hi,

I have attempted to put together some guidance for HTML data consumers at

  http://www.w3.org/wiki/Choosing_an_HTML_Data_Format#Consumers and
  http://www.w3.org/wiki/HTML_Data_Vocabularies

These are replicated below. This is a first draft and I'd be grateful for any feedback on it.

Cheers,

Jeni

--- http://www.w3.org/wiki/Choosing_an_HTML_Data_Format#Consumers ---

== Consumers ==

You will find it easier to consume and combine data published using a single format (syntax and vocabulary). To decide which to consume, you should first look at what formats your target publishers are currently using. It may be that these contain sufficient information for your application.

If the publishers whom you are targeting are already publishing using multiple formats, you may want to [[Mixing HTML Data Formats#Consuming_Pages_with_Multiple_Formats|consume from all those formats]] in order to maximise the data that you can collect while minimising the impact on the publishers who are providing that information. If you are consuming microdata and storing the results as RDF, you should [[Mapping_Microdata_to_RDF|follow a standard mapping]].

If current formats do not encode the information you need at the level of detail your application requires, publishers will be more likely to publish extra data for you to consume if you:

* [[HTML Data Vocabularies|extend existing common vocabularies]] they are already using
* consume data from a syntax they already use

If you cannot simply extend an existing vocabulary, you will need to create your own vocabulary and choose which syntaxes to support with that vocabulary.

=== Choosing a Syntax to Consume ===

When choosing a syntax, you should take the following considerations into account.

==== Tooling Considerations ====

Applications vary widely in terms of the tooling that they need. A script that runs in a publisher's page needs easy access to data through a DOM API. A crawler that creates a store of data from a set of distributed pages requires a server-side parser and good storage and querying support.

As a consumer, you will be led by the requirements you have for your application and the experience that you have with different technology sets. It's important, however, to also consider the experience and capabilities of the publishers that are providing you with data, and which formats they will find easy to publish given their tooling. You should also consider the ease with which you can provide support tools for the format, such as validators or previewers that make it easy for publishers to tell whether they have published data correctly within their pages.

==== Data Model Considerations ====

Microdata uses a JSON-based data model of a tree of objects which may be identified through a URI, with properties whose values are strings. microformats-2 uses a similar JSON-based data model of a tree of objects, but they do not have identifiers and their property values may be strings, URLs, date/times or structured HTML values. RDFa uses RDF as its data model, which is a graph of objects identified by URLs with properties whose values may be other objects, lists or literal values which can be tagged with a language or any datatype. These different models have different capabilities.

;Structured HTML values
:Under appropriate conditions, RDFa and microformats will use markup within the content of an element to provide a property value; in microdata, values never retain markup. If you wish to consume data that may contain markup &mdash; be it block structures such as multiple paragraphs, list items or tables, or inline markup such as emphases, links or ruby markup &mdash; you will need publishers to use RDFa or microformats to mark up that data. In RDFa, this is done by publishers adding <code>datatype="rdf:XMLLiteral"</code> to elements whose markup should be preserved. In microformats, the handling of the content of an element is determined by the property; in microformats-2, those that retain the HTML structure are named with an <code>e-*</code> prefix, such as <code>e-content</code>.
;Language support
:Microformats and RDFa use the language of the HTML elements in the page (from the <code>lang</code> attribute) to indicate the language of relevant values. In microdata, the vocabulary has to provide a separate mechanism to indicate a language (pending resolution of [http://www.w3.org/Bugs/Public/show_bug.cgi?id=14470 bug 14470]). If you are consuming information about the same things from pages that use different languages, or anticipate publishers using multiple languages in their pages to describe a particular entity, you can automatically pick up the language of the content of the page if publishers use microformats or RDFa. If you consume microdata, you need to provide specific properties in your vocabulary that publishers can use to indicate the language of the content.
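
As a rough sketch of these two differences, the fragments below mark up the same French-language review in RDFa and in microformats-2. The <code>http://example.org/vocab#</code> vocabulary, its <code>Review</code> type and its <code>title</code> and <code>body</code> properties are hypothetical and used only for illustration.

 <nowiki><div vocab="http://example.org/vocab#" typeof="Review" lang="fr">
  <span property="title">Un bon livre</span>
  <div property="body" datatype="rdf:XMLLiteral">
    <p>Un roman <em>remarquable</em>.</p>
  </div>
</div>

<div class="h-entry" lang="fr">
  <span class="p-name">Un bon livre</span>
  <div class="e-content">
    <p>Un roman <em>remarquable</em>.</p>
  </div>
</div></nowiki>

In the RDFa fragment, <code>title</code> is a plain value that picks up the language <code>fr</code> from the <code>lang</code> attribute, while <code>body</code> keeps its paragraph and emphasis markup. In the microformats-2 fragment, <code>e-content</code> plays the same role as <code>body</code>. The equivalent microdata markup would yield only the text of the review.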

==== Usability Considerations ====

Publishing data within HTML can be a challenge for publishers, simply because the structure of the data that they publish is not immediately visible within their pages. The publishers you are targeting will have different levels of skill and experience, which may influence your choice of syntax and the way in which you design your vocabulary. If you can, you should try to work closely with a few target publishers to better understand their requirements and constraints. Experimenting with marking up a few of their existing pages will often highlight issues with both syntax and vocabulary.

Some usability issues may be addressed by restricting the set of attributes that you document for publishers, or by restricting where in the page the markup may appear, to provide more consistency. For example:

* [http://www.w3.org/2010/02/rdfa/sources/rdfa-lite/Overview-src.html RDFa 1.1 Lite] is an authoring profile of RDFa 1.1 that is sufficient for most data publishing
* most microdata markup does not require <code>@itemid</code> or <code>@itemref</code>
* constraining data markup to the <code>head</code> of an HTML document can make it easier to author and protect it from templating changes, although it also runs the risk of getting out of sync with the content of the page, increases repetition, and is hard to use for anything but flat data structures
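
For illustration, the first fragment below uses only the <code>vocab</code>, <code>typeof</code> and <code>property</code> attributes from RDFa 1.1 Lite, and the second expresses the same data in microdata without <code>@itemid</code> or <code>@itemref</code>. The choice of schema.org's <code>Person</code> type, the name and the <code>example.org</code> URL are assumptions made for the example.

 <nowiki><div vocab="http://schema.org/" typeof="Person">
  <span property="name">Alice Example</span>,
  <a property="url" href="http://example.org/alice">home page</a>
</div>

<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Alice Example</span>,
  <a itemprop="url" href="http://example.org/alice">home page</a>
</div></nowiki>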

Profiling microdata and RDFa is useful for documentation, but consumers should still recognise and understand the full set of syntactic constructs described by the standards. This ensures that those publishers who find that they need the more advanced constructs to mark up their pages can do so, and means that publishers can use general-purpose tools and documentation rather than just those that you provide.

=== Good Consumption Practice ===

It is good practice for a consumer to provide tools that help publishers to see how the data within their pages is interpreted by the consumer and that highlight any errors in the markup, such as invalid values or missing required properties.

It is good practice for consumers to ignore markup that uses syntax or vocabularies that they do not understand. Properties and types in unrecognised vocabularies should be ignored by consumers.

TODO: More?



--- http://www.w3.org/wiki/HTML_Data_Vocabularies ---

Designing vocabularies is a complex craft, and this page does not cover all aspects of how to go about it. Instead, this page focuses on aspects of microdata, microformat and RDFa syntax and processing that should influence vocabulary design, and how to create vocabularies that can be used across multiple syntaxes, as part of the work of the [[Html-data-tf|HTML Data TF]]. There are several existing more general resources for vocabulary creators, such as:

* [http://microformats.org/wiki/process the microformats process]
* [http://www.w3.org/2001/sw/interest/webschema.html SWIG Web Schemas Task Force]

TODO: More?

== Extending Vocabularies ==

There are already many vocabularies in existence, particularly for common domains such as people, organisations, events, products, reviews, recipes and so on. Reusing these vocabularies benefits consumers because it saves design time and means they do not have to create supporting tools and materials such as validators, previewers or documentation. It also benefits publishers because it increases the likelihood that the data within their pages can be consumed by other useful tools. It is therefore good practice to extend existing vocabularies rather than creating new ones, where possible.

This section describes some of the issues that vocabulary authors who extend existing vocabularies need to be aware of.

=== Extending microformats ===

Microformats are developed using an iterative process whereby proposals for extensions are [http://microformats.org/wiki/process#Brainstorm_Proposals brainstormed] and eventually either accepted or rejected by the microformats community. It is not appropriate to create unilateral extensions to microformats. On the other hand, publishers should use semantic classes within their HTML, whether or not they are used within current microformats. Evidence of use of semantic classes within HTML pages is one input to the microformats standardisation process.

=== Extending RDF vocabularies ===

RDF vocabularies, which are used within RDFa, use IRIs for types and properties. Any resource in RDFa can be extended by adding new types to the <code>@typeof</code> attribute and/or adding new properties from different vocabularies. However, it is not general practice to allow RDF vocabularies themselves to be extended with new types or properties by third parties.
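
As a minimal sketch of extending a resource in this way, the fragment below gives a schema.org <code>Person</code> a second type and an extra property taken from a separate vocabulary. The <code>http://gov.example.org/uk/</code> vocabulary, its <code>MP</code> type and <code>constituency</code> property, and the <code>gov</code> prefix bound to it are hypothetical.

 <nowiki><p vocab="http://schema.org/" prefix="gov: http://gov.example.org/uk/" typeof="Person gov:MP">
  <span property="name">David Cameron</span> is the member of parliament for <span property="gov:constituency">Witney</span>.
</p></nowiki>

A consumer that only understands schema.org still sees a <code>Person</code> with a <code>name</code>, and can ignore the additional type and property.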

One pattern that is quite common is for one vocabulary to accept a string for a property, such as an address, and for an extension to provide more structure for that property. In this case, a useful pattern is to nest the more structured property inside the textual property within the HTML. For example:

 <nowiki><div property="location">
  <address property="http://example.org/address" vocab="http://example.org/" typeof="Address">
    <span property="name">The White House</span><br>
    <span property="street">1600 Pennsylvania Avenue NW</span><br>
    <span property="city">Washington</span>, <span property="state">DC</span> <span property="zip">20500</span>
  </address>
</div></nowiki>

This pattern also works for properties whose values are XML literals; in this case, the XML literal will include the RDFa markup.

=== Extending microdata vocabularies ===

Microdata items can have both properties that are scoped to the type of the item and properties that have absolute URLs. The acceptability of non-URL properties is determined by the vocabulary author of the type of the item; some vocabularies may define a set of acceptable properties, others say that any properties are acceptable. In all cases, however, it's possible to add properties to items if they are named with an absolute URL. Third parties who wish to extend an existing type with new properties should check the constraints of the type being extended to work out whether it's possible to use a non-URL property or not. Note that there is always a possibility, if you do use a non-URL property name, that your extension will conflict with an extension made by someone else; properties whose names are absolute URLs do not have this issue but are more verbose when used in markup.

Microdata does not allow items to have multiple types from different vocabularies. Some vocabularies, such as schema.org, may permit third parties to freely extend existing types within that vocabulary. In this case, items should be assigned both the supertype and the extension type within the <code>@itemtype</code> attribute. For example, schema.org describes a [http://schema.org/docs/extension.html method of extending its vocabulary] that involves identifying an appropriate supertype or superproperty and appending a <code>/</code> and then the name of a subtype or subproperty. Schema.org also permits anyone to create additional non-URL properties on these new types. To extend schema.org's types with a type for a member of parliament, a vocabulary author might use the URI <code>http://schema.org/Person/MP</code>, and mark up their page with

 <nowiki><p itemscope itemtype="http://schema.org/Person http://schema.org/Person/MP">
  <span itemprop="name">David Cameron</span> is the member of parliament for <span itemprop="constituency">Witney</span>.
</p></nowiki>

Here, both <code>http://schema.org/Person</code> and <code>http://schema.org/Person/MP</code> are given as types, and the non-URL <code>constituency</code> property is used despite it not being defined within the schema.org vocabulary.

Other microdata vocabularies do not enable third parties to extend the vocabulary. In these cases, third parties should use a URL property to specify the additional type for the item. For compatibility with RDF, we recommend using <code>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</code> for this property, and using a full URL for the type. An alternative to the example above that does not use the schema.org extension mechanism would be:

 <nowiki><p itemscope itemtype="http://schema.org/Person">
  <link itemprop="http://www.w3.org/1999/02/22-rdf-syntax-ns#type" href="http://gov.example.org/uk/MP">
  <span itemprop="name">David Cameron</span> is the member of parliament for <span itemprop="http://gov.example.org/uk/constituency">Witney</span>.
</p></nowiki>

More details about the use and limitations of this technique can be found on the [[Mixing_HTML_Data_Formats#Mixing_Vocabularies_in_microdata|section about using multiple vocabularies in microdata]].

The technique described for RDFa above, of nesting a property that contains more structure within a property that has less, can also be used with microdata content.
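
A sketch of what that might look like, mirroring the RDFa address example in microdata. The enclosing <code>Event</code> type and all of the <code>http://example.org/</code> URLs are hypothetical.

 <nowiki><div itemscope itemtype="http://example.org/Event">
  <span itemprop="name">Press briefing</span>
  <div itemprop="location">
    <address itemprop="http://example.org/address" itemscope itemtype="http://example.org/Address">
      <span itemprop="name">The White House</span><br>
      <span itemprop="street">1600 Pennsylvania Avenue NW</span><br>
      <span itemprop="city">Washington</span>, <span itemprop="state">DC</span> <span itemprop="zip">20500</span>
    </address>
  </div>
</div></nowiki>

A consumer that only understands the textual <code>location</code> property gets the text content of the <code>div</code>, while one that recognises the URL property <code>http://example.org/address</code> gets a structured item.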

== Designing Vocabularies ==

This section looks at the particular requirements of different HTML data syntaxes on vocabularies, and how to create vocabularies that can be used across HTML data syntaxes.

=== Syntax-Specific Requirements ===

Each HTML data syntax brings with it a set of constraints on both how vocabularies are designed and their documentation.

==== Microformats ====

The [http://microformats.org/wiki/microformats-2 microformats 2] page describes the constraints on the design of microformat vocabularies, and the [http://microformats.org/wiki/process microformats process] describes additional procedural guidelines on how to create a new microformat.

==== Microdata ====

Microdata vocabularies must define, within a specification for that vocabulary, processing rules to be followed by consumers of that vocabulary, using the terms given by the [http://dev.w3.org/html5/md/ microdata specification]. These include:

* what types the vocabulary includes
* which types support <code>@itemid</code> to provide global identifiers for items
* whether and how two items described using microdata should be considered a single item by a consumer (such as when they have the same <code>@itemid</code>) and if so, how two items within an HTML page should be merged
* whether URL values that have the same value as an <code>@itemid</code> should be treated the same as if the item had been nested within the page
* which non-URL properties ('''defined property names''') are permitted on each of those types, whether there are equivalent URL properties for them, and how properties will be merged if both are used
* how many and what types of values are allowed for each property, and what consumers should do if there are more or fewer values than required, how the values are parsed, and what happens when the values are of the wrong type
* whether items that are the value of a property must explicitly have a type or if this can be inferred by consumers
* what to do when an item has a property that it should not have
* whether type and property URLs can be dereferenced
* how consumers should recognise items belonging to the vocabulary (whether purely by <code>@itemtype</code> or through some other mechanism)

An example of a microdata vocabulary description is available for [http://www.heppnetz.de/ontologies/goodrelations/v1.html#microdata GoodRelations]. There are also example microdata vocabularies within the [http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#mdvocabs WHATWG version of the microdata specification].

Microdata does not support the use of the HTML <code>lang</code> attribute to provide language information for textual values; if this is important, a microdata vocabulary must provide a mechanism for supplying a language separately. This can be done by:

* having a property that indicates the language used in the data for the item; this only works if all the data uses the same language
* defining a LanguageString type that has properties for both content and language and specifying the use of items of that type as a value for any appropriate property
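
The second approach might look something like the sketch below, which gives a book two titles in different languages. The <code>http://example.org/</code> type URLs and the property names <code>title</code>, <code>content</code> and <code>language</code> are assumptions made for illustration; a real vocabulary would define its own.

 <nowiki><div itemscope itemtype="http://example.org/Book">
  <span itemprop="title" itemscope itemtype="http://example.org/LanguageString">
    <span itemprop="content">Les Misérables</span>
    <meta itemprop="language" content="fr">
  </span>
  <span itemprop="title" itemscope itemtype="http://example.org/LanguageString">
    <span itemprop="content">The Wretched</span>
    <meta itemprop="language" content="en">
  </span>
</div></nowiki>

Because each value carries its own language, this works even when the values of a single item are in different languages, at the cost of more verbose markup.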

Microdata does not support structured HTML values. Where these need to be captured, vocabularies can instead use URLs that reference fragments of HTML in the page. For example:

 <nowiki><link itemprop="breadcrumb" href="#breadcrumb">
<div id="breadcrumb">
  <a href="category/books.html">Books</a> >
  <a href="category/books-literature.html">Literature & Fiction</a> >
  <a href="category/books-classics">Classics</a>
</div></nowiki>

==== RDFa ====

RDFa is used to create RDF graphs, so vocabularies used within RDFa should bear in mind the constraints and conventions that commonly apply to RDF vocabularies. These include:

* types should be named using CapitalCamelCase, and properties using lowerCamelCase
* types and properties in the same vocabulary should share an IRI prefix &mdash; the vocabulary IRI &mdash; which should end in a <code>#</code> or a <code>/</code>; the local part of a type or property IRI, after this prefix, should be a valid [http://www.w3.org/TR/REC-xml-names/#NT-NCName NCName] so that it can be used within RDF/XML serialisations
* the IRIs used for types and properties should resolve into documentation and/or (through content negotiation) an [http://www.w3.org/TR/rdf-schema/ RDFS schema] or [http://www.w3.org/TR/owl-overview/ OWL ontology] that describes the types and properties
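
As an illustration of the naming conventions, the fragment below uses a hypothetical vocabulary whose IRI, <code>http://example.org/vocab#</code>, ends in <code>#</code>. The type and property names are invented for the example.

 <nowiki><div vocab="http://example.org/vocab#" typeof="BookReview">
  <span property="reviewBody">A remarkable novel.</span>
  <span property="starRating">5</span>
</div></nowiki>

Here the type IRI is <code>http://example.org/vocab#BookReview</code> and the property IRIs are <code>http://example.org/vocab#reviewBody</code> and <code>http://example.org/vocab#starRating</code>; each local part is a valid NCName.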

More guidelines and patterns for modelling using RDF are available within [http://patterns.dataincubator.org/book/modelling-patterns.html Linked Data Patterns].

=== Syntax-Neutral Vocabularies ===

Syntax-neutral vocabularies must have variants for each syntax that meet the requirements for the syntax as described above, but the capabilities of each variant do not have to be identical.

For example, a syntax-neutral review vocabulary could specify a required <code>reviewLanguage</code> property to give the language of a review in microdata, but say that if microformats or RDFa were used and the property were left unspecified, the language would be taken from the <code>lang</code> attribute in the HTML. Publishers who had content that included multiple languages in the review itself (which couldn't be represented using a property providing a language for the entire review) would be able to use microformats or RDFa to mark up the review.
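
A sketch of how the two variants of such a vocabulary might appear in a page, first in microdata and then in RDFa. The <code>http://example.org/</code> vocabulary and the <code>reviewBody</code> property are hypothetical.

 <nowiki><div itemscope itemtype="http://example.org/Review">
  <meta itemprop="reviewLanguage" content="fr">
  <p itemprop="reviewBody">Un roman remarquable.</p>
</div>

<div vocab="http://example.org/" typeof="Review" lang="fr">
  <p property="reviewBody">Un roman remarquable.</p>
</div></nowiki>

A consumer of this vocabulary would treat both fragments as describing a review in French: in the first case from the explicit property, in the second from the <code>lang</code> attribute.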

A number of measures make it easier to use vocabularies across syntaxes, and for consumers to combine data whichever syntax is used.

; Naming Conventions
: Adopt consistent names across syntaxes, even if the naming conventions of the syntaxes differ. For example, microformats use lowercase-hyphenated-names whereas RDF uses lowerCamelCase; all that is needed is a clear mapping between them. Although microdata allows defined property names to contain any character except <code>:</code> and <code>.</code>, non-URL properties should have names that are [http://www.w3.org/TR/REC-xml-names/#NT-NCName NCNames] so that they can be used in microformats and RDFa. Note that microdata's restrictions mean that <code>.</code>s should be avoided in these names.
; Entity Identity
: Microformats and microdata have a limited notion of entity identity: entities may have identifiers (in microdata, from the <code>@itemid</code> attribute) but these are not used within the data model to combine entities or link them together into graphs. Syntax-neutral vocabularies use the RDF concept of identity whereby entities with the same identifier are the same entity, and references to that entity's identifier serve to create a graph of entities. This should be reflected in the definition of the microdata variant of the vocabulary, which should allow <code>@itemid</code> on all items, and specify that consumers should combine and link to items to create a graph.
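
To make the entity identity guideline concrete, here is a sketch of microdata that relies on it. The <code>http://example.org/Person</code> type, its <code>name</code> and <code>knows</code> properties and the identifiers are all hypothetical.

 <nowiki><div itemscope itemtype="http://example.org/Person" itemid="http://example.org/people/alice">
  <span itemprop="name">Alice</span>
</div>

<div itemscope itemtype="http://example.org/Person" itemid="http://example.org/people/alice">
  <a itemprop="knows" href="http://example.org/people/bob">Bob</a>
</div>

<div itemscope itemtype="http://example.org/Person" itemid="http://example.org/people/bob">
  <span itemprop="name">Bob</span>
</div></nowiki>

If the microdata variant of the vocabulary is defined as suggested, a consumer would combine the first two items into a single entity because they share an <code>@itemid</code>, and would treat the URL value of <code>knows</code> as a link to the third item, giving a small graph rather than three unrelated trees.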

TODO: other guidelines?

== Good Practices ==

It is good practice for vocabulary creators to collaborate with others who are consuming or publishing information in the relevant domains in order to create a vocabulary that can be used widely across an industry.

It is good practice for vocabulary creators to make available a validation tool that enables publishers who use a vocabulary to check that their HTML pages contain data that is valid against that vocabulary.

It is good practice for vocabulary creators to make available test suites that enable implementers to check the behaviour of their implementations. These test suites should cover error handling as well as the correct interpretation of valid data.


-- 
Jeni Tennison
http://www.jenitennison.com
Received on Sunday, 20 November 2011 23:02:00 UTC