Guidance on publishing in multiple formats from Jeni Tennison on 2011-11-08 (public-html-data-tf@w3.org from November 2011)

From: Jeni Tennison <jeni@jenitennison.com>
Date: Tue, 8 Nov 2011 22:04:32 +0000
To: HTML Data Task Force WG <public-html-data-tf@w3.org>
Message-Id: <FB270935-476D-4C4E-8B5C-5648D96E25F2@jenitennison.com>
Hi,

We need to start bringing together the issues that we've discussed and resulting guidance in the form of some documents.

I've taken a first stab at writing up some guidance about publishing using multiple formats (syntaxes and vocabularies). There's a bit of text at

  http://www.w3.org/wiki/Choosing_an_HTML_Data_Format#Publishing_in_Multiple_Formats

which links to a page at

  http://www.w3.org/wiki/Mixing_HTML_Data_Formats

I have reproduced the text below. It would be great to hear any comments/suggestions that you have on it.

Cheers,

Jeni

--- 
http://www.w3.org/wiki/Choosing_an_HTML_Data_Format#Publishing_in_Multiple_Formats
---

=== Publishing in Multiple Formats ===

Publishing in multiple formats can be easy. For example, it may be that different consumers expect HTML data to appear in different places within the page, such as Facebook requiring Open Graph Protocol data to appear within the <code>head</code> of an HTML page, while schema.org markup appears in the <code>body</code> of the page. Or it may be that the items that you need to mark up on the page appear in different places -- events listed in a sidebar while company details are provided in a footer, for example.

Different formats and vocabularies can be used independently in these circumstances. Consumers of the data within your pages might read additional data if it is in a syntax that they recognise -- for example, a processor that recognises both RDFa and microdata will interpret all such markup in the page -- but they should ignore information that is in a vocabulary that they don't understand rather than giving an error.

Publishing can be harder when there are multiple consumers of information that require different formats. Techniques for mixing different syntaxes and vocabularies within a page are [http://www.w3.org/wiki/Mixing_HTML_Data_Formats provided on a separate page].

---
http://www.w3.org/wiki/Mixing_HTML_Data_Formats
---

This page examines how publishers can mix different HTML data formats within their pages, and how consumers can interpret the results, as part of the work of the [[Html-data-tf|HTML Data TF]].

== Mixing Vocabularies ==

Methods for marking up the same data in a page using different vocabularies in the same syntax vary by syntax.

=== Mixing Vocabularies in microformats ===

As microformats are simply indicated through classes, it's possible to mix several within the same set of content. An example is the [http://www.bbc.co.uk/worldservice/bangladeshboat/ BBC Bangladesh River Journey] page which includes hAtom and hCalendar:

 <nowiki><li class="hentry vevent xfolkentry postid-f2068841910">
  <h3 class="entry-title summary">
    <a href="http://www.flickr.com/photos/bangladeshboat/2068841910" title="The final picture (on Flickr)">The final picture</a>
  </h3>
  <div class="entry-content">
    <p class="photo">
      <a rel="bookmark" class="taggedlink url" href="http://www.flickr.com/photos/bangladeshboat/2068841910" title="The final picture (on Flickr)">
        <img src="http://farm3.static.flickr.com/2175/2068841910_1162a8086b_s.jpg" 
             alt="The final picture (on Flickr)" title="The final picture (on Flickr)" width="64" height="64" />
      </a>
    </p>
    <p class="description">As the BBC team prepare to disembark the boat, the sun sets overhead, and indeed on the trip itself.</p>
  </div>
  <ul class="meta">
    <li class="date"><abbr class="published dtstart" title="2007-11-26T02:11:51+06:00">2 days ago</abbr></li>
    <li class="location"><abbr class="geo point-22" title="+22.47157;+89.59534">Mongla, Bangladesh</abbr></li>
  </ul>
</li></nowiki>
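
Because the formats are signalled purely by class names, a consumer can work out which microformats apply to an element from its <code>class</code> attribute alone. The following is a minimal sketch of that idea rather than a real microformats parser; the function name <code>detectMicroformats</code> and the lookup table are assumptions made for illustration, and the table covers only a few root class names.

```javascript
// Root class names for a few well-known microformats.
// Illustrative subset only (an assumption of this sketch), not a complete registry.
const MICROFORMAT_ROOTS = new Map([
  ["hentry", "hAtom"],
  ["vevent", "hCalendar"],
  ["xfolkentry", "xFolk"],
  ["vcard", "hCard"],
  ["geo", "geo"],
]);

// Return the names of the microformats whose root class appears in a
// class attribute value, in the order in which they are first seen.
function detectMicroformats(classAttr) {
  const found = [];
  for (const token of classAttr.trim().split(/\s+/)) {
    const name = MICROFORMAT_ROOTS.get(token);
    if (name !== undefined && !found.includes(name)) {
      found.push(name);
    }
  }
  return found;
}

// The class attribute of the list item in the example above.
console.log(detectMicroformats("hentry vevent xfolkentry postid-f2068841910"));
```

Class names that are not in the table, such as <code>postid-f2068841910</code>, are simply skipped, which is what allows several microformats and unrelated styling classes to share one element.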

=== Mixing Vocabularies in RDFa ===

RDFa is designed to be used with multiple vocabularies:

* types and properties are given IRIs as names, so do not have to be disambiguated; IRIs do not have to be written out in full (see below)
* an entity can be assigned multiple types from different vocabularies by listing them within the <code>@typeof</code> attribute
* attributes that indicate properties (<code>@property</code>, <code>@rel</code> and <code>@rev</code>) can take multiple space-separated properties which may be from different vocabularies

Writing out IRIs in full can clutter HTML so RDFa provides four mechanisms to shorten IRIs:

* There are several built-in prefixes which can be used for popular vocabularies. These are listed as part of the [http://www.w3.org/2011/rdfa-context/rdfa-1.1.html RDFa 1.1 Core initial context]. Any IRI within one of these vocabularies can be abbreviated using the <code>prefix:name</code> notation.
* The <code>@prefix</code> attribute can be used to define additional prefixes for other vocabularies.
* The <code>@vocab</code> attribute defines a default vocabulary within its scope; any IRIs that begin with this vocabulary can be abbreviated to a short name (the remainder of the IRI after the vocabulary IRI).
* Namespace declarations (<code>xmlns:prefix</code> attributes) can also be used to define prefixes. '''This mechanism is deprecated and should not be used.'''

Note that if you use any of the last three mechanisms, the shortened IRIs can only be understood when they are within the scope of the relevant attributes. These can be easy to mislay when people copy and paste HTML from one place to another, or as the result of template changes in a content-management system. We therefore recommend that these attributes are avoided where possible &mdash; use the built-in prefixes or full IRIs in preference &mdash; and, where they are used, placed on elements that represent entities (those with <code>@about</code> or <code>@typeof</code> attributes) and repeated on each entity element rather than being inherited from an ancestor element.
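
To make the scoping issue concrete, the following sketch shows roughly how a shortened name depends on the prefixes and default vocabulary that happen to be in scope. It is a simplification and not the RDFa processing algorithm: the function names <code>parsePrefixAttr</code>, <code>expandName</code> and <code>expandList</code> are hypothetical, only two built-in prefixes are listed, and names that cannot be resolved are returned as <code>null</code>, whereas a real processor applies more detailed rules.

```javascript
// Two of the prefixes from the RDFa 1.1 initial context. This is only an
// illustrative subset; the initial context document is the authoritative list.
const BUILT_IN_PREFIXES = new Map([
  ["foaf", "http://xmlns.com/foaf/0.1/"],
  ["og", "http://ogp.me/ns#"],
]);

// Parse an @prefix value such as "ex: http://example.org/ns#" into a Map.
function parsePrefixAttr(value) {
  const tokens = value.trim().split(/\s+/).filter(Boolean);
  const prefixes = new Map();
  for (let i = 0; i + 1 < tokens.length; i += 2) {
    if (tokens[i].endsWith(":")) {
      prefixes.set(tokens[i].slice(0, -1), tokens[i + 1]);
    }
  }
  return prefixes;
}

// Expand one name using the prefixes and @vocab in scope.
// Returns null when nothing in scope explains the name (a simplification).
function expandName(name, scope = {}) {
  const prefixes = scope.prefixes || new Map();
  const vocab = scope.vocab === undefined ? null : scope.vocab;
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(name)) return name; // already a full IRI
  const colon = name.indexOf(":");
  if (colon > 0) {
    const prefix = name.slice(0, colon);
    // Locally declared prefixes take precedence over the built-in ones.
    const base = prefixes.has(prefix) ? prefixes.get(prefix) : BUILT_IN_PREFIXES.get(prefix);
    return base === undefined ? null : base + name.slice(colon + 1);
  }
  return vocab === null ? null : vocab + name;
}

// Expand every space-separated name in an attribute such as @typeof or @property.
function expandList(attrValue, scope = {}) {
  return attrValue.trim().split(/\s+/).filter(Boolean).map((n) => expandName(n, scope));
}

console.log(expandList("foaf:Person Event", { vocab: "http://schema.org/" }));
console.log(expandName("Event")); // no @vocab in scope, so the short name cannot be resolved
```

The second call illustrates the copy-and-paste problem: once the element carrying <code>@vocab</code> or <code>@prefix</code> is left behind, the same short name no longer resolves to anything.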

=== Mixing Vocabularies in microdata ===

microdata is designed such that each piece of information in a page is assigned types from a single vocabulary, though each entity may have multiple types and have properties from other vocabularies.

Properties in microdata are either short names (in which case they are scoped to the vocabulary of the types of the entity) or URLs. A URL property has no relationship to a given short name property unless that relationship is specified within the vocabulary that defines the properties.
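
The distinction between the two kinds of property name can be sketched as follows. This is an illustration under stated assumptions rather than part of any microdata API: the function names <code>isAbsoluteUrl</code> and <code>describeItemprop</code> are hypothetical, and the check for an absolute URL relies on the standard <code>URL</code> constructor, which may differ in edge cases from the definition used by the microdata specification.

```javascript
// True when the name parses as an absolute URL on its own.
function isAbsoluteUrl(name) {
  try {
    new URL(name);
    return true;
  } catch (err) {
    return false;
  }
}

// Classify each space-separated name in an @itemprop value.
// Short names are recorded together with the item types that scope them;
// URL names stand on their own regardless of the item's types.
function describeItemprop(itemprop, itemTypes) {
  return itemprop
    .trim()
    .split(/\s+/)
    .filter(Boolean)
    .map((name) =>
      isAbsoluteUrl(name)
        ? { kind: "url", name }
        : { kind: "short", name, scopedTo: itemTypes }
    );
}

console.log(
  describeItemprop("summary http://schema.org/name", [
    "http://microformats.org/profile/hcalendar#vevent",
  ])
);
```

Nothing in this sketch connects the short name <code>summary</code> to the URL <code>http://schema.org/name</code>; any such relationship has to come from the vocabulary definitions themselves.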

You might find that you need to target two consumers who each recognise items using types from different vocabularies. For example, you might want to target schema.org and use the vEvent vocabulary with the original HTML:

 <nowiki><a href="nba-miami-philidelphia-game3.html">
NBA Eastern Conference First Round Playoff Tickets:
 Miami Heat at Philadelphia 76ers - Game 3 (Home Game 1)
</a>

Thu, 04/21/16
8:00 p.m.

<a href="wells-fargo-center.html">
Wells Fargo Center
</a>
Philadelphia, PA

Priced from: $35
1938 tickets left</nowiki>

In this case there are three options available to you. The first, if consumers support it, is to use a different syntax for one of the vocabularies. For example, the vEvent vocabulary is only supported in microdata but schema.org can be consumed from either microdata or RDFa. Mixing syntaxes within a single page is rarely a good option but in some circumstances it may be preferable to the other workarounds described here.

==== Mixing Vocabularies using a Type Property ====

Some vocabularies may define a property through which types from that vocabulary can be assigned to items that are in a different vocabulary. For example, schema.org could define a <code>http://schema.org/type</code> property whose value is a URL, and state that any microdata item that has a schema.org type as a value for that property is recognised as being an item of that type. In this case, the types specified in the <code>@itemtype</code> attribute are the '''primary types''' of the entity and those specified through the property are the '''secondary types'''.

Alongside the assertion that property URLs that begin with <code>http://schema.org/</code> have the same semantics as short name properties on items with a schema.org type, this enables the schema.org vocabulary to be mixed in with an item marked up using vEvent:

'''Note that at time of writing schema.org does not specify a <code>http://schema.org/type</code> property and this example will not work.'''

 <nowiki><div itemscope itemtype="http://microformats.org/profile/hcalendar#vevent">
  <link itemprop="http://schema.org/type" href="http://schema.org/Event">
  <a itemprop="url http://schema.org/url" href="nba-miami-philidelphia-game3.html">
  NBA Eastern Conference First Round Playoff Tickets:
  <span itemprop="summary http://schema.org/name"> Miami Heat at Philadelphia 76ers - Game 3 (Home Game 1) </span>
  </a>

  <meta itemprop="dtstart http://schema.org/startDate" content="2016-04-21T20:00">
    Thu, 04/21/16
    8:00 p.m.

  <div itemprop="location">
    <div itemprop="http://schema.org/location" itemscope itemtype="http://schema.org/Place">
      <a itemprop="url" href="wells-fargo-center.html">
      Wells Fargo Center
      </a>
      <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
        <span itemprop="addressLocality">Philadelphia</span>,
        <span itemprop="addressRegion">PA</span>
      </div>
    </div>
  </div>

  <div itemprop="http://schema.org/offers" itemscope itemtype="http://schema.org/AggregateOffer">
    Priced from: <span itemprop="lowPrice">$35</span>
    <span itemprop="offerCount">1938</span> tickets left
  </div>
</div></nowiki>

Note in particular that the vEvent <code>location</code> property takes text while the schema.org <code>location</code> property takes structured information about the location. These are combined by having an element for the property which requires structured information nested within the property that requires text.

Also note that in this example the <code>http://schema.org/type</code> property is only used where necessary, on the entity which needs to be marked as an event in both vocabularies. Where possible, the schema.org type for an entity is provided explicitly through the <code>@itemtype</code> attribute.

This method of mixing vocabularies requires vocabularies to specify how consumers should recognise items of a particular type. It is recommended that vocabulary authors define an <code>@itemtype</code>-equivalent property, and that, for better integration with RDF tools, this property is <code>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</code>. (TODO: Issue about what to recommend here.)

The other disadvantage of this approach is that there is no support within the microdata API for retrieving items based on the value of a property. In the example above, it would be possible to retrieve the event using:

 document.getItems('http://microformats.org/profile/hcalendar#vevent')

but not through:

 document.getItems('http://schema.org/Event')

Scripts that extract microdata information using the DOM will be faster if they can use the primary types for an item, specified within the <code>@itemtype</code> attribute, so you should specify types accessed through scripts within <code>@itemtype</code> rather than through a property wherever possible.
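
A script that needs to find items by secondary type therefore has to inspect property values itself. The following sketch shows one way this might look once the items have been extracted into plain objects. It assumes a structure modelled loosely on the microdata JSON serialisation (a <code>type</code> array and a <code>properties</code> object whose values are arrays); the function name <code>findItemsByType</code> is hypothetical, and <code>http://schema.org/type</code> is the same hypothetical property used in the example above, which schema.org does not define.

```javascript
// Hypothetical type property, as in the example above; schema.org does not define it.
const TYPE_PROPERTY = "http://schema.org/type";

// Return the top-level items that have typeUrl either as a primary type
// (from @itemtype) or as a secondary type (a value of TYPE_PROPERTY).
// Nested items are not searched in this sketch.
function findItemsByType(items, typeUrl) {
  return items.filter((item) => {
    const primary = item.type || [];
    const props = item.properties || {};
    const secondary = Object.prototype.hasOwnProperty.call(props, TYPE_PROPERTY)
      ? props[TYPE_PROPERTY]
      : [];
    return primary.includes(typeUrl) || secondary.includes(typeUrl);
  });
}

// The event from the example above, reduced to a few properties.
const extractedItems = [
  {
    type: ["http://microformats.org/profile/hcalendar#vevent"],
    properties: {
      "http://schema.org/type": ["http://schema.org/Event"],
      url: ["nba-miami-philidelphia-game3.html"],
    },
  },
];

console.log(findItemsByType(extractedItems, "http://schema.org/Event").length);
```

Every item has to be visited and its properties examined, which is the extra cost that specifying the type in <code>@itemtype</code> avoids.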

==== Mixing Vocabularies using Repeated Content ====

The second method of supporting multiple vocabularies is to have the entity represented by two (or more) microdata items on the page. To enable dragging and dropping the data from these items, they should be nested inside each other. Properties can be set on the outer element using <code>link</code> and <code>meta</code> elements which are hidden from users, while the visible content of the page is marked up by the inner element.

 <nowiki><div itemscope itemtype="http://microformats.org/profile/hcalendar#vevent">
  <link itemprop="url" href="nba-miami-philidelphia-game3.html">
  <meta itemprop="summary" content="Miami Heat at Philadelphia 76ers - Game 3 (Home Game 1)">
  <meta itemprop="dtstart" content="2016-04-21T20:00">
  <meta itemprop="location" content="Wells Fargo Center, Philadelphia, PA">
  <div itemscope itemtype="http://schema.org/Event">
    <a itemprop="url" href="nba-miami-philidelphia-game3.html">
    NBA Eastern Conference First Round Playoff Tickets:
    <span itemprop="name"> Miami Heat at Philadelphia 76ers - Game 3 (Home Game 1) </span>
    </a>

    <meta itemprop="startDate" content="2016-04-21T20:00">
      Thu, 04/21/16
      8:00 p.m.

    <div itemprop="location" itemscope itemtype="http://schema.org/Place">
      <a itemprop="url" href="wells-fargo-center.html">
      Wells Fargo Center
      </a>
      <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
        <span itemprop="addressLocality">Philadelphia</span>,
        <span itemprop="addressRegion">PA</span>
      </div>
    </div>

    <div itemprop="offers" itemscope itemtype="http://schema.org/AggregateOffer">
      Priced from: <span itemprop="lowPrice">$35</span>
      <span itemprop="offerCount">1938</span> tickets left
    </div>
  </div>
</div></nowiki>

This method does not require any special properties to be defined in the vocabularies used to mark up the page, and the two items are directly assigned the relevant type and are thus accessible to scripts through the <code>document.getItems()</code> method.

The disadvantages of this method are that the page contains more items than there are entities (in the above example, two items representing the same event), and it requires repetition of data within the page.

== Mixing Syntaxes ==

A requirement to support a large range of consumers can mean that it becomes necessary to publish using not only multiple vocabularies but multiple syntaxes.

RDFa, microformats and microdata all share the same basic entity/attribute/value model, so in many cases it is possible to mirror attributes across the syntaxes. The following example shows the same content marked up with:

* hCalendar (microformat)
* schema.org (RDFa)
* vEvent (microdata)

 <nowiki><div class="vevent"
  itemscope itemtype="http://microformats.org/profile/hcalendar#vevent"
  about="_:event" vocab="http://schema.org/" typeof="Event">
  <a class="url" itemprop="url" rel="url" href="nba-miami-philidelphia-game3.html">
    <span about="_:event">
      NBA Eastern Conference First Round Playoff Tickets:
      <span class="summary" itemprop="summary" property="name"> Miami Heat at Philadelphia 76ers - Game 3 (Home Game 1) </span>
    </span>
  </a>

  <meta itemprop="dtstart" property="startDate" content="2016-04-21T20:00:00">
  <abbr class="dtstart" title="2016-04-21T20:00:00">
    Thu, 04/21/16
    8:00 p.m.
  </abbr>

  <div class="location" itemprop="location" rel="location">
    <div typeof="http://schema.org/Place">
      <a rel="url" href="wells-fargo-center.html">
        Wells Fargo Center
      </a>
      <div rel="address">
        <div typeof="http://schema.org/PostalAddress">
          <span property="addressLocality">Philadelphia</span>,
          <span property="addressRegion">PA</span>
        </div>
      </div>
    </div>
  </div>

  <div rel="offers">
    <div typeof="http://schema.org/AggregateOffer">
      Priced from: <span property="lowPrice">$35</span>
      <span property="offerCount">1938</span> tickets left
    </div>
  </div>
</div></nowiki>

It is particularly important to check pages in which syntaxes are mixed together using an appropriate validator for each format.

The following guidelines may help when creating pages in which different syntaxes are mixed together.

* microformats do not use <code>link</code> or <code>meta</code> elements within the content of the page and in some cases require particular elements to be used to encode information, such as using <code>abbr</code> to support the [http://microformats.org/wiki/datetime-design-pattern datetime-design-pattern] as illustrated by the <code>dtstart</code> property in the example above
* the following equivalencies between RDFa and microdata attributes generally hold true:
** <code>@itemid</code> = <code>@about</code>
** <code>@itemtype</code> = <code>@typeof</code> (+ <code>@vocab</code> to enable the use of short names for properties)
** <code>@itemprop</code> on an [http://dev.w3.org/html5/md/Overview.html#url-property-elements URL property element] = <code>@rel</code>
** <code>@itemprop</code> + <code>@itemscope</code> = <code>@rel</code> + a nested element with <code>@about</code>/<code>@typeof</code>
** <code>@itemprop</code> otherwise = <code>@property</code>
* when a <code>@rel</code> attribute is used on an <code>a</code> element, the content of that element then changes to talk about a new entity identified through the URL from the <code>@href</code> attribute; this is not true in microdata, where the entity a property relates to is usually the closest element with an <code>@itemscope</code> attribute. A workaround for this is to put an <code>@about</code> attribute naming a blank node (a name beginning with <code>_:</code>) on the ancestor entity element and then wrap the content of the <code>a</code> element in a <code>span</code> with the same <code>@about</code> attribute; the <code>url</code> and <code>name</code> properties in the example above show this in practice
* RDFa vocabularies are typically stricter in the range of values that they accept for properties that take dates and times; it is best to use the syntax <code>YYYY-MM-DD</code> for dates, <code>hh:mm:ss</code> for times and <code>YYYY-MM-DDThh:mm:ss</code> for dateTimes to be compliant with the [http://www.w3.org/TR/xmlschema-2/#dateTime XML Schema dates and times] which RDFa-based vocabularies will typically use
* the <code>@datatype</code> attribute might be required for some RDFa vocabularies/consumers; others will coerce values into the appropriate datatype based on the property itself. However, if a property takes a structured value, the property element must have <code>datatype="rdf:XMLLiteral"</code> for that structure to be preserved
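
The attribute equivalencies listed above can be summarised in code. The following is a sketch of those rules of thumb only, not a converter between the two syntaxes: the function name <code>toRdfaAttrs</code> and the shape of its result are assumptions made for illustration, and the list of URL property elements is taken from the microdata specification as understood at the time of writing.

```javascript
// Elements whose microdata property value is a URL taken from an attribute
// such as href or src.
const URL_PROPERTY_ELEMENTS = new Set([
  "a", "area", "audio", "embed", "iframe", "img",
  "link", "object", "source", "track", "video",
]);

// Given an element name and its microdata attributes, suggest RDFa attributes
// for the element itself and, where needed, for a nested entity element.
function toRdfaAttrs(tagName, md) {
  // Attributes that describe the entity: @itemid = @about, @itemtype = @typeof.
  const entity = {};
  if (md.itemid !== undefined) entity.about = md.itemid;
  if (md.itemtype !== undefined) entity.typeof = md.itemtype;

  if (md.itemprop === undefined) {
    return { element: entity, nested: null };
  }
  if (md.itemscope) {
    // @itemprop + @itemscope = @rel + a nested element with @about/@typeof.
    return { element: { rel: md.itemprop }, nested: entity };
  }
  if (URL_PROPERTY_ELEMENTS.has(tagName.toLowerCase())) {
    // @itemprop on a URL property element = @rel.
    return { element: { rel: md.itemprop }, nested: null };
  }
  // @itemprop otherwise = @property.
  return { element: { property: md.itemprop }, nested: null };
}

console.log(toRdfaAttrs("div", {
  itemprop: "address",
  itemscope: true,
  itemtype: "http://schema.org/PostalAddress",
}));
```

The sketch does not add <code>@vocab</code>, insert the blank-node workaround for <code>a</code> elements, or adjust date formats; those still need to be handled by hand as described in the guidelines above.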

== Consuming Pages with Multiple Formats ==

In attempting to provide information to multiple consumers, publishers may use several formats within a single page. Consumers should ignore data in vocabularies that they do not recognise, and only raise errors for unexpected properties in vocabularies that they do recognise.
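
A consumer following this advice might behave along the lines of the sketch below. It is an illustration under stated assumptions, not a prescribed algorithm: items are assumed to have been extracted into plain objects with a <code>type</code> array and a <code>properties</code> object, the function name <code>checkItem</code> is hypothetical, and the table of expected properties is invented for the example. Properties named with full URLs are treated as belonging to other vocabularies and are skipped.

```javascript
// Check one extracted item against the types a consumer knows about.
// knownTypes maps a type URL to the Set of short property names expected on it.
function checkItem(item, knownTypes) {
  const recognised = (item.type || []).filter((t) => knownTypes.has(t));
  if (recognised.length === 0) {
    // Nothing we understand: ignore the item rather than reporting an error.
    return { status: "ignored", errors: [] };
  }
  const allowed = new Set();
  for (const t of recognised) {
    for (const name of knownTypes.get(t)) allowed.add(name);
  }
  const errors = [];
  for (const name of Object.keys(item.properties || {})) {
    const looksLikeUrl = /^[a-z][a-z0-9+.-]*:/i.test(name);
    if (looksLikeUrl) continue; // property from another vocabulary: skip it
    if (!allowed.has(name)) errors.push("unexpected property: " + name);
  }
  return { status: "checked", errors };
}

// Hypothetical consumer that only understands a few schema.org Event properties.
const consumerKnows = new Map([
  ["http://schema.org/Event", new Set(["name", "url", "startDate", "location", "offers"])],
]);

console.log(checkItem(
  { type: ["http://microformats.org/profile/hcalendar#vevent"], properties: { summary: ["Game 3"] } },
  consumerKnows
));
```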

Consumers of HTML data may recognise several formats embedded within a given page, and even within the same part of a page. In these cases, consumers should merge the data from the different formats; in the example above, a consumer should recognise that the data in vEvent, hCalendar and schema.org is about a single event rather than interpreting it as three events, and merge property values so that the event ends up having a single URL rather than several. Different formats may provide information about different aspects of an entity to different levels of fidelity &mdash; in the example above, the schema.org RDFa provides extra details about the location of the event compared to the vEvent or hCalendar formats &mdash; and consumers should seek to use whatever gives them the most detailed information.
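
The merging step might be sketched as follows. This assumes the consumer has already decided, by whatever means, that a set of extracted items describes the same entity; making that decision is outside the scope of the sketch. The function name <code>mergeItems</code>, the plain-object item structure and the optional <code>aliases</code> table (which maps a property name in one vocabulary onto its counterpart in another) are all assumptions made for illustration, and only simple string values are de-duplicated.

```javascript
// Merge several extracted items that describe the same entity.
// aliases optionally renames properties, e.g. { summary: "name" }.
function mergeItems(items, aliases = {}) {
  const own = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);
  const merged = { type: [], properties: {} };
  for (const item of items) {
    // Collect each distinct type once.
    for (const t of item.type || []) {
      if (!merged.type.includes(t)) merged.type.push(t);
    }
    // Collect each distinct value once per (possibly renamed) property.
    for (const [name, values] of Object.entries(item.properties || {})) {
      const key = own(aliases, name) ? aliases[name] : name;
      if (!own(merged.properties, key)) merged.properties[key] = [];
      for (const v of values) {
        if (!merged.properties[key].includes(v)) merged.properties[key].push(v);
      }
    }
  }
  return merged;
}

// The same event as seen through vEvent microdata and schema.org RDFa.
const sameEvent = [
  {
    type: ["http://microformats.org/profile/hcalendar#vevent"],
    properties: { url: ["nba-miami-philidelphia-game3.html"], dtstart: ["2016-04-21T20:00:00"] },
  },
  {
    type: ["http://schema.org/Event"],
    properties: { url: ["nba-miami-philidelphia-game3.html"], startDate: ["2016-04-21T20:00:00"] },
  },
];

console.log(mergeItems(sameEvent, { dtstart: "startDate" }));
```

Choosing the most detailed value where formats disagree, such as preferring the structured schema.org location over the plain-text vEvent one, would need further vocabulary-specific logic that is not shown here.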


-- 
Jeni Tennison
http://www.jenitennison.com
Received on Tuesday, 8 November 2011 22:05:08 UTC