[whatwg] Extensible microdata attributes from Benjamin Hawkes-Lewis on 2011-04-27 (public-whatwg-archive@w3.org from April 2011)

From: Benjamin Hawkes-Lewis <bhawkeslewis@googlemail.com>
Date: Wed, 27 Apr 2011 14:06:27 +0100
Message-ID: <BANLkTinsDf6vfOvBEO5wHUVQmL7DXsP1vQ@mail.gmail.com>
On Wed, Apr 27, 2011 at 3:54 AM, Brett Zamir <brettz9 at yahoo.com> wrote:
> Thanks for the references. While this may be relevant for the likes of blogs
> and other documents whose requirements for semantic density are limited
> enough to allow such reshaping for practical effect and whose content is
> reshapeable by the content creator (as opposed to republishing of already
> completed books), for more semantically dense content, such as the types of
> classical documents marked up by TEI, it is simply not possible to expose
> text for each bit of semantic information or to generate new text to meet
> that need. And of course, even with microformats/microdata as it is now, the
> semantic content itself is not necessarily exposed just because text is
> visible on the page.
>
> The issue of discoverability is I think more related to how it will be
> consumed or may be consumed. And even if some pieces of information are less
> discoverable, it does not mean that they have no value. For such rich
> documents, a lot of attention is being paid to these texts since they are
> themselves considered important enough to be worth the time.
>
> If the Declaration of Independence of the United States was marked up with
> hidden information about prior emendations, their likely reasons, etc., or
> about suspected authors of particular passages, or the United Nations
> Declaration of Human Rights were marked up to indicate which countries have
> expressed reservations (qualifications) about which rights, while a browsing
> application or query tool ought to be able (optionally) to expose this hidden
> information, there is no automatic need for the markup to be polluted with
> extra (hidden) (and especially URI-based or other non-textual) tags when an
> attribute would suffice.
>
> For things that are truly important, there may be a great deal of care put
> into building up many layers which are meant to be peeled away, and it is
> worth allowing some of that information (particularly the non-textual
> information, e.g., the conditions of authorship, publisher, etc.),
> especially which the original publication did not expose, to be still
> selectively revealed to queries and deeper browsing.
>
> If a site like Wikisource (the online library sister project of Wikipedia's)
> would be able to offer such officially sanctioned semantic attributes,
> classic texts could become enhanced in this way over time, with the wiki
> exposing the hidden semantic information, which indeed may not be as
> important as the visible text, but with queries by interested users, any
> problems in encoding could be discovered just as well.

Your email challenges the principle of visible data on four different grounds:

   1. You note even proponents of visible data do not always show their data.
But the microformats community only endorse hidden metadata for annotating
human-friendly visible data (e.g. "mercredi prochain") with a machine-readable
equivalent (e.g. an ISO 8601 formatted date). They do not endorse hidden
metadata without visible equivalents against which it can be cross-checked.

   2. You imply editorial effort can offset the error-proneness of hidden
metadata. But the same extraordinary editorial effort would yield even greater
accuracy if it went towards creating visible data rather than hidden metadata.

   3. You claim tool-assisted queries by end-users against the hidden metadata
will reveal errors at the same rate as visible data. But this is doubtful, in
so far as many queries will obfuscate context whereas simply reading through the
text encourages serendipitous error discovery. For example, I could issue a
query asking what proportion of the Declaration of Independence is suspected to
be authored by John Adams. A percentage answer would not reveal the odd
misattributed passage. By contrast, if I'm a scholar of the Declaration and am
reading through the text and I happen to see a suspiciously Jeffersonian
passage visibly attributed to John Adams, I'm much more likely to notice the
error.

   4. You assert that it is not viable to make multiple layers of rich data
visible in a single view. I'd make the counterargument that on the web, unlike
in print, it is economical to dynamically construct different views and filters
of a document and its various visible data streams on the client, on the
server, or on some combination of the two. The HTML5
specification itself is a great example of this. The source text is kept in a
repository that stores changes to the text, along with date and rationale.
Multiple views of this source text are then generated serverside: the source
text is carved up into multiple draft specs for W3C and a single mammoth
specification for WHATWG. The HTML spec is provided in a browser-crashing
single document view and in a multipage view. On top of this, there is
clientside filtering in the form of an in-page control that can produce a web
author view by hiding technical text aimed at browser vendors.
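
As a rough illustration of the pattern in point 1, here is a minimal Python
sketch that pairs each piece of visible text with its machine-readable
equivalent so the two can be cross-checked. The markup fragment, the "dtstart"
class on an "abbr" element, and the date value are assumptions chosen for the
example, not taken from any real page; the class and function names are
hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical fragment: visible, human-friendly text annotated with a
# machine-readable ISO 8601 date in the title attribute.
SAMPLE = (
    '<p>Rendez-vous <abbr class="dtstart" title="2011-05-04">'
    "mercredi prochain</abbr>.</p>"
)


class AnnotatedTextCollector(HTMLParser):
    """Collect (visible text, machine value) pairs from abbr[title] elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._title = None  # machine value of the abbr currently open, if any
        self._text = []     # visible text seen inside that abbr

    def handle_starttag(self, tag, attrs):
        title = dict(attrs).get("title")
        if tag == "abbr" and title is not None:
            self._title = title
            self._text = []

    def handle_data(self, data):
        if self._title is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "abbr" and self._title is not None:
            visible = "".join(self._text).strip()
            self.pairs.append((visible, self._title))
            self._title = None


def annotated_pairs(html):
    """Return every (visible text, machine value) pair found in html."""
    collector = AnnotatedTextCollector()
    collector.feed(html)
    collector.close()
    return collector.pairs


if __name__ == "__main__":
    for visible, machine in annotated_pairs(SAMPLE):
        print(f"{visible!r} <-> {machine!r}")
```

Because every machine value arrives together with the text a reader actually
sees, a reviewer or a tool has something to check it against, which is exactly
what purely hidden metadata lacks.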

If you're keen on using the TEI vocabulary to meet the Wikisource use case,
there's no particular reason why you couldn't convert Wiki markup to TEI source
text, serve TEI directly over the web, and also generate various HTML views of
visible rich data from the TEI (for example, with XSLT). The Perseus project
uses TEI and HTML in combination a bit like that:

http://www.perseus.tufts.edu/hopper/text?doc=Perseus%3atext%3a1999.01.0199

But let's say you were determined to serve up a single HTML document with lots
of hidden metadata. None of microformats, microdata, and RDFa were designed to
do this. But both microdata and RDFa allow you to do so in a conforming manner
using the @content attribute. In WHATWG HTML, this is restricted to the "meta"
element, but the "meta" element is now allowed amidst body text so it can apply
to individual sections of the document, rather than just the whole document.
In W3C HTML+RDFa, the @content attribute is allowed on any element.

In other words, where your examples currently abuse the skinning layer
("display: none") to preserve logical text flow, they should actually be using
meta with @content instead; there is no need for "ugly hacks" even if the markup
becomes more verbose than you might like.
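
To make the meta-with-@content approach a little more concrete, here is a
minimal Python sketch of how a consumer might pull such hidden values out of
body text. The HTML fragment and the "suspectedAuthor" property name are
invented purely for illustration and belong to no real vocabulary; the class
and function names are likewise hypothetical. It uses only the standard
library's html.parser and is not a full microdata processor (it ignores item
scoping and nesting).

```python
from html.parser import HTMLParser

# Hypothetical fragment: a "meta" element placed amidst body text carries a
# value that is never rendered. The property name is an assumption.
SAMPLE = """
<div itemscope>
  <p>We hold these truths to be self-evident,
  <meta itemprop="suspectedAuthor" content="Thomas Jefferson">
  that all men are created equal.</p>
</div>
"""


class MetaContentCollector(HTMLParser):
    """Collect (itemprop, content) pairs from meta elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Only keep meta elements that carry both a property name and a value.
        if "itemprop" in attr and "content" in attr:
            self.pairs.append((attr["itemprop"], attr["content"]))


def hidden_properties(html):
    """Return every (itemprop, content) pair found on meta elements in html."""
    collector = MetaContentCollector()
    collector.feed(html)
    collector.close()
    return collector.pairs


if __name__ == "__main__":
    for name, value in hidden_properties(SAMPLE):
        print(f"{name}: {value}")
```

The logical text flow of the paragraph is untouched, and nothing depends on
"display: none" in a stylesheet to keep the value out of sight.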

Note HTML also has other extension points that are available, including dumping
data in script elements, dumping data in class attributes, and mixing XHTML and
other XML vocabularies in a compound document.

Beware that even where a conforming hidden metadata mechanism is provided,
consumers of such documents may well distrust hidden metadata that is not a
machine-readable equivalent to visible data. For example, Google say:

"In general, Google won't display content that is not visible to the user. In
other words, don't show content to users in one way, and use hidden text to
mark up information separately for search engines and web applications. You
should mark up the text that actually appears to your users when they visit
your web pages."

http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=146898

The web's poor experience with hidden metadata thus far suggests that consumers
are right to distrust it and that additional hidden metadata feature proposals
are likely to face an uphill struggle.

--
Benjamin Hawkes-Lewis
Received on Wednesday, 27 April 2011 06:06:27 UTC