[whatwg] Annotating structured data that HTML has no semantics for

One of the more elaborate use cases I collected from the e-mails sent in 
over the past few months was the following:

   USE CASE: Annotate structured data that HTML has no semantics for, and
   which nobody has annotated before, and may never again, for private use or
   use in a small self-contained community.

   SCENARIOS:
     * A group of users want to mark up their iguana collections so that they
       can write a script that collates all their collections and presents
       them in a uniform fashion.
     * A scholar and teacher wants other scholars (and potentially students)
       to be able to easily extract information about what he teaches to add
       it to their custom applications.
     * The list of specifications produced by W3C, for example, and various
       lists of translations, are produced by scraping source pages and
       outputting the result. This is brittle. It would be easier if the data
       was unambiguously obtainable from the source pages. This is a custom
       set of properties, specific to this community.
     * Chaals wants to make a list of the people who have translated W3C
       specifications or other documents, and then use this to search for
       people who are familiar with a given technology at least at some
       level, and happen to speak one or more languages of interest.
     * Chaals wants to have a reputation manager that can determine which of
       the many emails sent to the WHATWG list might be "more than usually
       valuable", and would like to seed this reputation manager from
       information gathered from the same source as the scraper that
       generates the W3C's TR/ page.
     * A user wants to write a script that finds the price of a book from an
       Amazon page.
     * Todd sells an HTML-based content management system, where all
       documents are processed and edited as HTML, sent from one editor to
       another, and eventually published and indexed. He would like to build
       up the editorial metadata used by the system within the HTML documents
       themselves, so that it is easier to manage and less likely to be lost.
     * Tim wants to make a knowledge base seeded from statements made in
       Spanish and English, e.g. from people writing down their thoughts
       about George W. Bush and George H.W. Bush, and has either convinced
       the people making the statements that they should use a common
       language-neutral machine-readable vocabulary to describe their
       thoughts, or has convinced some other people to come in after them and
       process the thoughts manually to get them into a computer-readable
       form.

   REQUIREMENTS:
     * Vocabularies can be developed in a manner that won't clash with future
       more widely-used vocabularies, so that those future vocabularies can
       later be used in a page making use of private vocabularies without
       making the earlier annotations ambiguous.
     * Using the data should not involve learning a plethora of new APIs,
       formats, or vocabularies (today it is possible, e.g., to get the price
       of an Amazon product, but it requires learning a new API; similarly
       it's possible to get information from sites consistently using 'class'
       values in a documented way, but doing so requires learning a new
       vocabulary).
     * Shouldn't require the consumer to write XSLT or server-side code to
       process the annotated data.
     * Machine-readable annotations shouldn't be on a separate page than
       human-readable annotations.
     * The information should be convertible into a dedicated form (RDF,
       JSON, XML) in a consistent manner, so that tools that use this
       information separate from the pages on which it is found have a
       standard way of conveying the information.
     * Should be possible for different parts of an item's data to be given
       in different parts of the page, for example two items described in the
       same paragraph. ("The two lamps are A and B. The first is $20, the
       second $30. The first is 5W, the second 7W.")
     * It should be possible to define globally-unique names, but the syntax
       should be optimised for a set of predefined vocabularies.
     * Adding this data to a page should be easy.
     * The syntax for adding this data should encourage the data to remain
       accurate when the page is changed.
     * The syntax should be resilient to intentional copy-and-paste
       authoring: people copying data into the page from a page that already
       has data should not have to know about any declarations far from the
       data.
     * The syntax should be resilient to unintentional copy-and-paste
       authoring: people copying markup from the page who do not know about
       these features should not inadvertently mark up their page with
       inapplicable data.
     * Any additional markup or data used to allow the machine to understand
       the actual information shouldn't be redundantly repeated (e.g. on each
       cell of a table, when setting it on the column is possible).
     * Parsing rules should be unambiguous.
     * Should not require changes to HTML5 parsing rules.
     * Creating a custom vocabulary should be relatively easy.
     * Distributed vocabulary development should be possible; it
       should not require coordination through a centralised system.
     * It should be possible to publish and re-use custom
       vocabularies.


Some of the scenarios for this use case required substantial additions to 
the spec, so before I go through the scenarios, I will describe how I 
ended up putting what I did into the spec.

Let's base this on a variant of the first scenario:

     * A group of users want to mark up their iguana collections so that they
       can write a script that collates all their collections and presents
       them in a uniform fashion.

I don't know anything about iguanas, but let's pretend it's about cats, 
which should be basically the same thing from a markup perspective. Let's 
say someone in the group has the following text on http://example.org/:

     <section>
      <h1>Hedral</h1>
      <p>Hedral is a male american domestic shorthair, with a fluffy black 
      fur with white paws and belly.</p>
      <img src="hedral.jpeg" alt="" title="Hedral, age 18 months" 
      class="photo">
     </section>

...and suppose that they want to make it so that the following information 
can be extracted from this part of this page:

   Cat name:    "Hedral"
   Description: "Hedral is a male american domestic shorthair, with a 
                 fluffy black fur with white paws and belly."
   Image:       "http://example.org/hedral.jpeg"

Let's presume there are several authors each with their own distinct and 
differently-structured pages who all want to cooperate so that a single 
script can be used to amalgamate the information stored in their pages to 
print a table summarising this information about their cats.

The simplest solution would be for them to define a basic vocabulary using 
class names and just have them extract the data based on these class 
names, a la Microformats. The markup above might become:

     <section>
      <h1 class="name">Hedral</h1>
      <p class="desc">Hedral is a male american domestic shorthair, with a 
      fluffy black fur with white paws and belly.</p>
      <img src="hedral.jpeg" alt="" title="Hedral, age 18 months" 
      class="photo img">
     </section>

This works well; a script can be written to just rip out the relevant 
data. Unfortunately it has a couple of problems:

 - it is likely that the class names will clash with other classes used by 
   other people, which would make it hard to handle situations where 
   different communities want to work together.

 - there is no way to group information about each cat together on a page 
   with multiple cats.

 - there is no way for a parser to know which classes are properties of 
   cats and which are just for styling (e.g. 'photo' used in this 
   example).
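
The third problem can be seen with a small script. The following is only a 
sketch in Python (standard library only), not part of any proposal: the 
ClassScraper name is hypothetical, the description text is shortened, and 
the input is assumed to be well-formed with explicit end tags. It pairs 
every class name it sees with the element's text (or src=""), since it has 
no way to tell which classes are data:

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they are handled at the start tag.
VOID = {"img", "meta", "br", "hr", "input", "link"}


class ClassScraper(HTMLParser):
    """Pairs every class name with the text (or src) of its element."""

    def __init__(self):
        super().__init__()
        self.found = []  # (class name, value) pairs
        self.open = []   # stack of [tag, class names, text chunks]

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag in VOID:
            for name in classes:
                self.found.append((name, attrs.get("src") or ""))
            return
        self.open.append([tag, classes, []])

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for entry in self.open:
            entry[2].append(data)

    def handle_endtag(self, tag):
        if self.open and self.open[-1][0] == tag:
            _, classes, chunks = self.open.pop()
            text = " ".join("".join(chunks).split())
            for name in classes:
                self.found.append((name, text))


PAGE = """
<section>
 <h1 class="name">Hedral</h1>
 <p class="desc">Hedral is a male american domestic shorthair.</p>
 <img src="hedral.jpeg" alt="" title="Hedral, age 18 months" class="photo img">
</section>
"""

scraper = ClassScraper()
scraper.feed(PAGE)
scraper.close()
for name, value in scraper.found:
    print(name, "=", value)
```

The styling class 'photo' is reported alongside 'name', 'desc' and 'img', 
and nothing in the markup lets the script filter it out without a 
hard-coded list of known property names.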

The first problem is easy to solve: we can just use more distinctive class 
names, e.g. com.damowmow.name, com.damowmow.desc, and com.damowmow.img:

     <section>
      <h1 class="com.damowmow.name">Hedral</h1>
      <p class="com.damowmow.desc">Hedral is a male american domestic 
      shorthair, with a fluffy black fur with white paws and belly.</p>
      <img src="hedral.jpeg" alt="" title="Hedral, age 18 months" 
      class="photo com.damowmow.img">
     </section>

The second is harder to solve. We could introduce a class name that says 
"this is a cat", much like the way Microformats do, but then parsers would 
need to know about the "root class names" (to use the Microformats term) 
to know where to begin collecting data.

The third problem is somewhat of a blocker, and is the most common 
complaint I heard from people about Microformats.


Another solution we could consider is RDFa:

     <section typeof="d:cat" xmlns:d="http://damowmow.com/">
      <h1 property="d:name">Hedral</h1>
      <p property="d:desc">Hedral is a male american domestic shorthair, 
      with a fluffy black fur with white paws and belly.</p>
      <img src="hedral.jpeg" alt="" title="Hedral, age 18 months" 
      class="photo" rel="d:img">
     </section>

This unfortunately also has a number of problems.

 - it uses prefixes, which most authors simply do not understand, and 
   which many implementors end up getting wrong (e.g. SearchMonkey 
   hard-coded certain prefixes in its first implementation, Google's 
   handling of RDF blocks for license declarations is all done with 
   regular expressions instead of actually parsing the namespaces, etc).
   Even if implemented right, namespaces still lead to flaky 
   copy-and-paste behaviour.

 - it sometimes uses rel="" and sometimes uses property="" and it's hard 
   to know when to use one or the other.

 - it introduces much more power than is necessary to solve this problem.

We can fix the first problem by using URLs instead, and using the 
xmlns:http trick to keep things working in RDFa processors:

     <section typeof="http://damowmow.com/cat" xmlns:http="http:">
      <h1 property="http://damowmow.com/name">Hedral</h1>
      <p property="http://damowmow.com/desc">Hedral is a male american 
      domestic shorthair, with a fluffy black fur with white paws and 
      belly.</p>
      <img src="hedral.jpeg" alt="" title="Hedral, age 18 months" 
      class="photo" rel="http://damowmow.com/img">
     </section>

This, though, is quite ugly. An alternative is to go back to the non-URI 
class names we had above. This doesn't break compatibility with the RDFa 
processors, because when there is no colon in the property="" or rel="" 
attributes, the RDFa processors just ignore the values (this is the "no 
prefix" mapping of CURIEs):

     <section typeof="com.damowmow.cat">
      <h1 property="com.damowmow.name">Hedral</h1>
      <p property="com.damowmow.desc">Hedral is a male american 
      domestic shorthair, with a fluffy black fur with white paws and 
      belly.</p>
      <img src="hedral.jpeg" alt="" title="Hedral, age 18 months" 
      class="photo" rel="com.damowmow.img">
     </section>

At this point though we might as well change rel="" to property="" and 
just say that on <img> it takes the value from src="", because that's 
going to be a lot more understandable, and won't affect RDFa processors 
(which at this point are ignoring the values):

     <section typeof="com.damowmow.cat">
      <h1 property="com.damowmow.name">Hedral</h1>
      <p property="com.damowmow.desc">Hedral is a male american 
      domestic shorthair, with a fluffy black fur with white paws and 
      belly.</p>
      <img src="hedral.jpeg" alt="" title="Hedral, age 18 months" 
      class="photo" property="com.damowmow.img">
     </section>

This has us past the first two problems listed above for RDFa, but we 
still have the third, if we were to adopt the rest of RDFa -- way too much 
power to solve this problem. So instead, let's assume we're sticking with 
just these two attributes, typeof="" and property="", in the "prefix-less" 
CURIE space. This gives us a very small subset of RDFa to start from.

Let's see if this works well for other ways of marking up cat data:

     <body typeof="com.damowmow.cat">
      <p>I love my cats. My oldest cat is <span
      property="com.damowmow.name">Silver</span>. <span
      property="com.damowmow.desc">Silver is <span
      property="com.damowmow.age">11</span> years old and refuses to eat
      alone, always waiting for either Yellow or Blue to eat with
      him.</span></p>
     </body>

This seems fine. This example shows nested properties, where one property 
(the age) is extracted from within the value of another (the description).


At this point, I would like to introduce another example, also taken from 
the scenarios above, to examine if this works well in other scenarios:

     * The list of specifications produced by W3C, for example, and various
       lists of translations, are produced by scraping source pages and
       outputting the result. This is brittle. It would be easier if the data
       was unambiguously obtainable from the source pages. This is a custom
       set of properties, specific to this community.

Let's presume that each spec has the following structure:

   Name:
   URL:
   Status:
     Level:
     Publication Date:
     Comments Deadline:
   Working Group:

This might be encoded as:

   <div typeof="org.w3.spec">
    <h1 property="org.w3.name">HTML5</h1>
    <a property="org.w3.url" href="/TR/html5">Current Version</a>
    <div property="org.w3.status">
     <p>Level: <span property="org.w3.level">WD</span>
     <p>Date: <span property="org.w3.pubdate">03/02/2009</span>
     <p>Deadline: <span property="org.w3.deadline">02/03/2009</span>
    </div>
    <p>Working Group: <span property="org.w3.wg">HTMLWG</span>
   </div>

There is a problem here. The original structure has "status" as a property 
with three sub-properties, but according to the "minimal RDFa" convention 
we've come up with so far, the markup here has "status" as containing a 
long string and the other properties are at the top level, like this:

   Name: "HTML5"
   URL: "/TR/html5"
   Status: "Level: WD Date: 03/02/2009 Deadline: 02/03/2009"
   Level: "WD"
   Publication Date: "03/02/2009"
   Comments Deadline: "02/03/2009"
   Working Group: "HTMLWG"

What we need is a way to say that a property is really a new set of 
name-value pairs. If we look at the current markup, we are using typeof="" 
for this, but typeof="" expects a type as its value, whereas here we 
don't need an explicit type. We could just say typeof="" can be used 
without a type:

   <div typeof="org.w3.spec">
    <h1 property="org.w3.name">HTML5</h1>
    <a property="org.w3.url" href="/TR/html5">Current Version</a>
    <div property="org.w3.status" typeof>
     <p>Level: <span property="org.w3.level">WD</span>
     <p>Date: <span property="org.w3.pubdate">03/02/2009</span>
     <p>Deadline: <span property="org.w3.deadline">02/03/2009</span>
    </div>
    <p>Working Group: <span property="org.w3.wg">HTMLWG</span>
   </div>

But that doesn't make much sense and isn't allowed in RDFa anyway.

Let's rename "typeof" to "item" instead:

   <div item="org.w3.spec">
    <h1 property="org.w3.name">HTML5</h1>
    <a property="org.w3.url" href="/TR/html5">Current Version</a>
    <div property="org.w3.status" item>
     <p>Level: <span property="org.w3.level">WD</span>
     <p>Date: <span property="org.w3.pubdate">03/02/2009</span>
     <p>Deadline: <span property="org.w3.deadline">02/03/2009</span>
    </div>
    <p>Working Group: <span property="org.w3.wg">HTMLWG</span>
   </div>

Now the result makes more sense:

   Name: "HTML5"
   URL: "/TR/html5"
   Status:
     Level: "WD"
     Publication Date: "03/02/2009"
     Comments Deadline: "02/03/2009"
   Working Group: "HTMLWG"
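
The nesting rule can be sketched in a few lines of Python. This is an 
illustration under stated assumptions rather than the spec's algorithm: 
the ItemParser class and the dict layout are hypothetical, the markup is a 
trimmed copy of the example with explicit </p> end tags (Python's 
html.parser does not infer omitted end tags), and void elements such as 
<img> are skipped:

```python
from html.parser import HTMLParser

VOID = {"img", "meta", "br", "hr", "input", "link"}


class ItemParser(HTMLParser):
    """Builds items as {"type": ..., "properties": {name: [values]}}."""

    def __init__(self):
        super().__init__()
        self.items = []  # top-level items
        self.stack = []  # one frame per open element

    def nearest_item(self):
        for frame in reversed(self.stack):
            if frame["item"] is not None:
                return frame["item"]
        return None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        owner = self.nearest_item()
        frame = {"tag": tag, "item": None, "prop": None,
                 "owner": owner, "text": []}
        if "item" in attrs:
            # A bare item attribute parses as None, i.e. an untyped item.
            frame["item"] = {"type": attrs["item"], "properties": {}}
            if owner is not None and attrs.get("property"):
                # The new item is itself the value of the property.
                owner["properties"].setdefault(
                    attrs["property"], []).append(frame["item"])
            else:
                self.items.append(frame["item"])
        elif attrs.get("property"):
            frame["prop"] = attrs["property"]
        if tag not in VOID:  # void elements are ignored in this sketch
            self.stack.append(frame)

    def handle_data(self, data):
        for frame in self.stack:
            if frame["prop"]:
                frame["text"].append(data)

    def handle_endtag(self, tag):
        if not self.stack or self.stack[-1]["tag"] != tag:
            return
        frame = self.stack.pop()
        if frame["prop"] and frame["owner"] is not None:
            text = " ".join("".join(frame["text"]).split())
            frame["owner"]["properties"].setdefault(
                frame["prop"], []).append(text)


SPEC = """
<div item="org.w3.spec">
 <h1 property="org.w3.name">HTML5</h1>
 <div property="org.w3.status" item>
  <p>Level: <span property="org.w3.level">WD</span></p>
  <p>Date: <span property="org.w3.pubdate">03/02/2009</span></p>
  <p>Deadline: <span property="org.w3.deadline">02/03/2009</span></p>
 </div>
 <p>Working Group: <span property="org.w3.wg">HTMLWG</span></p>
</div>
"""

parser = ItemParser()
parser.feed(SPEC)
parser.close()
spec = parser.items[0]
status = spec["properties"]["org.w3.status"][0]
print(spec["properties"]["org.w3.name"], status["properties"])
```

Because each property is attached to the nearest enclosing item, the level 
and the two dates end up on the untyped status item rather than on the 
spec itself.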

While we're at it, we should use the <time> element for encoding those 
dates, so that we don't have to work out if the deadline is before the 
publication date or if these are weird US-format dates:

   <div item="org.w3.spec">
    <h1 property="org.w3.name">HTML5</h1>
    <a property="org.w3.url" href="/TR/html5">Current Version</a>
    <div property="org.w3.status" item>
     <p>Level: <span property="org.w3.level">WD</span>
     <p>Date: <time property="org.w3.pubdate" datetime="2009-02-03">03/02/2009</time>
     <p>Deadline: <time property="org.w3.deadline" datetime="2009-03-02">02/03/2009</time>
    </div>
    <p>Working Group: <span property="org.w3.wg">HTMLWG</span>
   </div>

(So, <img> gets its content from src="", <time> gets its content from 
datetime="", and everything else gets its content from the element's own 
contents, serialised to text.)
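
As a rough sketch, that rule could be written as a single Python function. 
The property_value name, the attrs dict, and the base parameter are 
assumptions made for illustration. Resolving src="" against the page's 
address follows the earlier expectation that the image comes out as 
"http://example.org/hedral.jpeg". Falling back to the text when the 
attribute is missing is also an assumption:

```python
from urllib.parse import urljoin


def property_value(tag, attrs, text, base="http://example.org/"):
    """Pick the value of a property element.

    tag   -- lowercase element name
    attrs -- dict of the element's attributes
    text  -- the element's contents serialised to text
    base  -- address of the page, used to resolve relative URLs
    """
    if tag == "img" and attrs.get("src"):
        return urljoin(base, attrs["src"])
    if tag == "time" and attrs.get("datetime"):
        return attrs["datetime"]
    # Everything else: the element's own text, whitespace collapsed.
    return " ".join(text.split())


print(property_value("img", {"src": "hedral.jpeg"}, ""))
print(property_value("time", {"datetime": "2009-02-03"}, "03/02/2009"))
print(property_value("h1", {}, "  HTML5\n"))
```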


One of the requirements is the following:

     * Should be possible for different parts of an item's data to be given
       in different parts of the page, for example two items described in the
       same paragraph. ("The two lamps are A and B. The first is $20, the
       second $30. The first is 5W, the second 7W.")

Here we have two "items" (to use the terminology that the attribute name 
above implies), but they are mixed together.

   <p>The two lamps are A and B. The first is $20, the
   second $30. The first is 5W, the second 7W.</p>

The proposal so far doesn't handle this:

   <p item>The two lamps are <span
   property="com.example.name">A</span> and <span ???>B</span>. The
   first is <span property="com.example.price">$20</span>, the second
   <span ???>$30</span>. The first is <span
   property="com.example.power">5W</span>, the second <span
   ???>7W</span>.</p>

What do we put for the "???"s?

In RDFa we would have to give each of these items a name using about="" 
and then link them that way, but this has two problems: these items aren't 
named, nor is there an obvious name to give them, and it makes it hard to 
identify a single element responsible for introducing an item, which would 
make DOM interfaces to this somewhat more complex.

Microformats aren't especially useful here for inspiration as they do not 
solve this problem at all.

Instead, let us try using the regular "IDREF" functionality that HTML uses 
in a variety of other places, like <label for="">. For this we'll need a 
new attribute, but unfortunately we can't use about="" (which would be the 
obvious name to use), because that would conflict with RDFa, so instead 
we'll use subject="":

   <p>The two lamps are <span item id=a><span
   property="com.example.name">A</span></span> and <span item
   id=b><span property="com.example.name">B</span></span>. The first
   is <span subject="a" property="com.example.price">$20</span>, the
   second <span subject="b"
   property="com.example.price">$30</span>. The first is <span
   subject="a" property="com.example.power">5W</span>, the second
   <span subject="b" property="com.example.power">7W</span>.</p>

This is somewhat verbose, but I think this example represents pretty much 
the worst case scenario -- in most cases, most of the content would be in 
one place with just the odd other property elsewhere on the page.
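
To check that the subject="" approach is workable for a consumer, here is 
a minimal Python sketch of the lookup. It is an illustration, not the 
spec's algorithm: the SubjectParser and parse_items names are 
hypothetical, the id values are quoted, and properties that use subject="" 
are collected first and resolved once the whole document has been read, so 
a reference may point forwards or backwards:

```python
from html.parser import HTMLParser

VOID = {"img", "meta", "br", "hr", "input", "link"}


class SubjectParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.items = []    # items in document order: name -> [values]
        self.by_id = {}    # id="" value -> item
        self.stack = []    # one frame per open element
        self.pending = []  # (subject id, name, value), resolved later

    def handle_starttag(self, tag, attrs):
        if tag in VOID:  # void elements are ignored in this sketch
            return
        attrs = dict(attrs)
        frame = {"tag": tag, "item": None, "prop": attrs.get("property"),
                 "subject": attrs.get("subject"), "text": []}
        if "item" in attrs:
            frame["item"] = {}
            self.items.append(frame["item"])
            if attrs.get("id"):
                self.by_id[attrs["id"]] = frame["item"]
        self.stack.append(frame)

    def handle_data(self, data):
        for frame in self.stack:
            if frame["prop"]:
                frame["text"].append(data)

    def handle_endtag(self, tag):
        if not self.stack or self.stack[-1]["tag"] != tag:
            return
        frame = self.stack.pop()
        if not frame["prop"]:
            return
        value = " ".join("".join(frame["text"]).split())
        if frame["subject"]:
            self.pending.append((frame["subject"], frame["prop"], value))
            return
        # No subject="": the property belongs to the nearest ancestor item.
        for ancestor in reversed(self.stack):
            if ancestor["item"] is not None:
                ancestor["item"].setdefault(frame["prop"], []).append(value)
                break


def parse_items(html):
    parser = SubjectParser()
    parser.feed(html)
    parser.close()
    for subject, name, value in parser.pending:
        item = parser.by_id.get(subject)
        if item is not None:  # dangling references are dropped here
            item.setdefault(name, []).append(value)
    return parser.items


LAMPS = """
<p>The two lamps are <span item id="a"><span
property="com.example.name">A</span></span> and <span item
id="b"><span property="com.example.name">B</span></span>. The first
is <span subject="a" property="com.example.price">$20</span>, the
second <span subject="b" property="com.example.price">$30</span>.
The first is <span subject="a" property="com.example.power">5W</span>,
the second <span subject="b" property="com.example.power">7W</span>.</p>
"""

lamps = parse_items(LAMPS)
for lamp in lamps:
    print(lamp)
```

Each item is still introduced by exactly one element (the one carrying 
item), which is the property that about=""-style naming would have lost.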


With this proposal in mind, let's now go through the scenarios.

     * A group of users want to mark up their iguana collections so that they
       can write a script that collates all their collections and presents
       them in a uniform fashion.

Assuming iguanas aren't too different from cats, here's an example of what 
this could look like:

   Page 1:
   <section item="com.damowmow.cat">
    <h1 property="com.damowmow.name">Hedral</h1>
    <p property="com.damowmow.desc">Hedral is a male american domestic
    shorthair, with a fluffy black fur with white paws and belly.</p>
    <img property="com.damowmow.img" src="hedral.jpeg" alt="" title="Hedral, age 18 months">
   </section>

   Page 2:
   <body item="com.damowmow.cat">
    <p>I love my cats. My oldest cat is <span
    property="com.damowmow.name">Silver</span>. <span
    property="com.damowmow.desc">Silver is <span
    property="com.damowmow.age">11</span> years old and refuses to eat
    alone, always waiting for either Yellow or Blue to eat with
    him.</span></p>
   </body>

   Page 3:
   <h2>My Cats</h2>
   <dl>
    <dt>Schr&ouml;dinger
    <dd item="com.damowmow.cat">
     <meta property="com.damowmow.name" content="Schr&ouml;dinger">
     <meta property="com.damowmow.age" content="9">
     <p property="com.damowmow.desc">Orange male.
    <dt>Erwin
    <dd item="com.damowmow.cat">
     <meta property="com.damowmow.name" content="Lord Erwin">
     <meta property="com.damowmow.age" content="3">
     <p property="com.damowmow.desc">Siamese color-point.
     <img property="com.damowmow.img" alt="" src="/images/erwin.jpeg">
   </dl>

(Note the <meta>s in the last example -- since sometimes the information 
isn't visible, rather than requiring that people put it in and hide it 
with display:none, which has a rather poor accessibility story, I figured 
we could just allow <meta> anywhere, if it has a property="" attribute.)

Assuming a Perl library HTML::Microdata with the following API (which 
could be written on top of html5lib, for instance):

   # $doc is an HTML document instance
   my $microdata = new HTML::Microdata($doc);
   my @items = $microdata->items();      # returns a list of items
   my @x = $microdata->items('x');       # returns a list of items with that type
   my $properties = $item->properties(); # for one $item from such a list: a hashref of name-[value] pairs

...then the following could generate a table of all the cats:

   my @cats;
   # @docs is a list of pre-parsed document instances
   foreach my $doc (@docs) {
     my $data = new HTML::Microdata($doc);
     push(@cats, $data->items('com.damowmow.cat')); 
   }
   print "<table><thead><tr><th>Name<th>Age<th>Image<th>Description<tbody>";
   foreach my $cat (@cats) {
     my $name = '&mdash;';
     if (exists $cat->properties->{'com.damowmow.name'}) {
       $name = escapeHTML($cat->properties->{'com.damowmow.name'}->[0]);
     }
     my $age = '';
     if (exists $cat->properties->{'com.damowmow.age'}) {
       $age = escapeHTML($cat->properties->{'com.damowmow.age'}->[0]);
     }
     my $image = '';
     if (exists $cat->properties->{'com.damowmow.img'}) {
       my $src = escapeHTML($cat->properties->{'com.damowmow.img'}->[0]);
       $image = "<img src='$src' alt=''>";
     }
     my $description = '';
     if (exists $cat->properties->{'com.damowmow.desc'}) {
       $description = escapeHTML($cat->properties->{'com.damowmow.desc'}->[0]);
     }
     print "<tr><td>$name<td>$age<td>$image<td>$description";
   }
   print "</table>";

This seems pretty simple to use.


     * A scholar and teacher wants other scholars (and potentially students)
       to be able to easily extract information about what he teaches to add
       it to their custom applications.
     * The list of specifications produced by W3C, for example, and various
       lists of translations, are produced by scraping source pages and
       outputting the result. This is brittle. It would be easier if the data
       was unambiguously obtainable from the source pages. This is a custom
       set of properties, specific to this community.

These seem to be more or less the same.


     * Chaals wants to make a list of the people who have translated W3C
       specifications or other documents, and then use this to search for
       people who are familiar with a given technology at least at some
       level, and happen to speak one or more languages of interest.

It seems that, given a tool that can parse this microdata format from HTML 
pages, it would be reasonably straightforward to write a tool to search 
the data for particular patterns.


     * Chaals wants to have a reputation manager that can determine which of
       the many emails sent to the WHATWG list might be "more than usually
       valuable", and would like to seed this reputation manager from
       information gathered from the same source as the scraper that
       generates the W3C's TR/ page.

The reputation manager and e-mail processing is out of scope here, but the 
microdata format would address the problem of scraping the data in a less 
brittle manner than would be required today.


     * A user wants to write a script that finds the price of a book from an
       Amazon page.

This wouldn't be helped by the microdata solution proposed above, since 
Amazon apparently do not wish to expose the data in this fashion, but 
instead the Amazon API can be used for this purpose, so there is an 
alternative solution here.


     * Todd sells an HTML-based content management system, where all
       documents are processed and edited as HTML, sent from one editor to
       another, and eventually published and indexed. He would like to build
       up the editorial metadata used by the system within the HTML documents
       themselves, so that it is easier to manage and less likely to be lost.

It's unclear exactly what metadata this would involve, but the microdata 
solution described above seems like it would work for this.


     * Tim wants to make a knowledge base seeded from statements made in
       Spanish and English, e.g. from people writing down their thoughts
       about George W. Bush and George H.W. Bush, and has either convinced
       the people making the statements that they should use a common
       language-neutral machine-readable vocabulary to describe their
       thoughts, or has convinced some other people to come in after them and
       process the thoughts manually to get them into a computer-readable
       form.

The knowledge base aspect of this is out of scope for HTML, but if someone 
is able to convince people to mark up sentences in HTML, the microdata 
solution above would, as far as I can tell, provide sufficient flexibility 
to mark up such features.


The requirements were as follows:

     * Vocabularies can be developed in a manner that won't clash with future
       more widely-used vocabularies, so that those future vocabularies can
       later be used in a page making use of private vocabularies without
       making the earlier annotations ambiguous.

The use of reversed domain labels makes this possible without needing 
prefixes and without making the page unreadable, so this seems satisfied.


     * Using the data should not involve learning a plethora of new APIs,
       formats, or vocabularies (today it is possible, e.g., to get the price
       of an Amazon product, but it requires learning a new API; similarly
       it's possible to get information from sites consistently using 'class'
       values in a documented way, but doing so requires learning a new
       vocabulary).

This need isn't met. There would still be a plethora of vocabularies to 
learn. I don't see how to satisfy this requirement short of defining one 
very comprehensive vocabulary and requiring that only that vocabulary be 
used, but that seems like an impractical solution.


     * Shouldn't require the consumer to write XSLT or server-side code to
       process the annotated data.

This requirement isn't met. While XSLT itself isn't necessarily needed, 
there is still typically a need for code to process the parsed annotations 
to do something with them.

This requirement was later clarified to mean that it should be possible to 
write a single parser that isn't continually maintained in the way that 
the Microformats parsers have to be as new vocabularies are added. That 
_is_ handled by this new syntax, so hopefully that will be enough.


     * Machine-readable annotations shouldn't be on a separate page than
       human-readable annotations.

This requirement is met.


     * The information should be convertible into a dedicated form (RDF,
       JSON, XML) in a consistent manner, so that tools that use this
       information separate from the pages on which it is found have a
       standard way of conveying the information.

I've included a section on converting microdata blobs to JSON and a 
section on converting HTML documents, including microdata blobs, to RDF, 
so that there is a consistent representation of HTML documents in those 
formats. I haven't defined an XML form of the microdata syntax, since HTML 
can be expressed as XML already, and I'm not sure that a dedicated format 
for microdata would be particularly wieldy.
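
To give a feel for what a JSON form might look like, here is a small 
Python sketch. The output shape (an "items" array whose entries have an 
optional "type" and a "properties" object of value lists) is an assumption 
made for illustration and should not be read as the conversion the spec 
defines. The to_plain and items_to_json names are hypothetical, and the 
input is hand-written rather than parsed from a page:

```python
import json


def to_plain(item):
    """Turn one in-memory item into JSON-ready dicts and lists."""
    out = {}
    if item.get("type"):
        out["type"] = item["type"]
    out["properties"] = {
        # A value is either a string or a nested item (a dict).
        name: [to_plain(v) if isinstance(v, dict) else v for v in values]
        for name, values in item["properties"].items()
    }
    return out


def items_to_json(items):
    return json.dumps({"items": [to_plain(i) for i in items]},
                      indent=2, sort_keys=True)


# Hand-written stand-in for the parsed W3C spec example.
spec_item = {
    "type": "org.w3.spec",
    "properties": {
        "org.w3.name": ["HTML5"],
        "org.w3.status": [{
            "type": None,
            "properties": {"org.w3.level": ["WD"]},
        }],
        "org.w3.wg": ["HTMLWG"],
    },
}

document_json = items_to_json([spec_item])
print(document_json)
```

Keeping every property as a list, even when it has a single value, means 
consumers never have to special-case repeated properties.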


     * Should be possible for different parts of an item's data to be given
       in different parts of the page, for example two items described in the
       same paragraph. ("The two lamps are A and B. The first is $20, the
       second $30. The first is 5W, the second 7W.")

This was discussed above.


     * It should be possible to define globally-unique names, but the syntax
       should be optimised for a set of predefined vocabularies.

The syntax allows URIs and encourages use of Java-like identifiers, but 
some of the use cases that I haven't handled yet are about dedicated 
vocabularies for particular purposes, so I'll return to this soon.


     * Adding this data to a page should be easy.

Based on the examples I've written above, as well as a number of others 
I've written while developing this, it seems that this is a pretty easy 
syntax to use. It is easier than Microformats for people who already use 
"class" heavily, and it is easier than RDFa (it's a loose subset).


     * The syntax for adding this data should encourage the data to remain
       accurate when the page is changed.

The syntax encourages the use of inline metadata for this very purpose, 
while allowing hidden metadata for the cases where trying to shoe-horn 
microdata into visible form is counterproductive.


     * The syntax should be resilient to intentional copy-and-paste
       authoring: people copying data into the page from a page that already
       has data should not have to know about any declarations far from the
       data.

Since microdata doesn't use prefixes, it's quite resilient to copy/paste.


     * The syntax should be resilient to unintentional copy-and-paste
       authoring: people copying markup from the page who do not know about
       these features should not inadvertently mark up their page with
       inapplicable data.

This isn't really met. It's very easy to copy an element with a 
property="" or item="" attribute without realising that the markup says 
more than the person copying it intends. I'm not sure how to address this. 
This problem is also present in RDFa and Microformats, but my discussions 
with people who have worked with and on those technologies did not reveal 
any obvious solutions that didn't involve significantly reducing ease of 
authoring.


     * Any additional markup or data used to allow the machine to understand
       the actual information shouldn't be redundantly repeated (e.g. on each
       cell of a table, when setting it on the column is possible).

This isn't met at all with the current proposal. Unfortunately the only 
general solutions I could come up with that would allow this were 
selector-based, and in practice authors are still having trouble 
understanding how to use Selectors even with CSS.


     * Parsing rules should be unambiguous.

The parsing rules for microdata are clearly set out in the spec.


     * Should not require changes to HTML5 parsing rules.

The HTML5 parsing rules are untouched.


     * Creating a custom vocabulary should be relatively easy.

Creating a good vocabulary is always a lot of work, but in general the 
microdata syntax doesn't make it harder, at least. I haven't yet addressed 
the issue of validating the vocabulary, which is one of the use cases that 
I haven't yet dealt with.


     * Distributed vocabulary development should be possible; it
       should not require coordination through a centralised system.

This is handled through the use of URIs or domain-label-based identifiers.


     * It should be possible to publish and re-use custom
       vocabularies.

Nothing in the syntax seems to preclude this.


In conclusion:

To address this use case and its scenarios, I've added to HTML5 a simple 
syntax (three new attributes) based on RDFa. It doesn't have the full 
power of RDF, because that didn't seem to be necessary to address the use 
cases. It doesn't really have anything in common with Microformats; I 
didn't find the Microformats syntax to be very convenient. (This was also 
the experience with eRDF.)

I expect the syntax will need adjustments over the coming weeks to address 
issues that I overlooked. I look forward to such feedback.


A number of further use cases remain to be examined, including one
with scenarios regarding validating custom vocabularies and allowing
editors to provide help with custom vocabularies, and including
several regarding specific vocabularies such as events, contact
information, and bibliographic information. I will send further e-mail
next week as I address them.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Sunday, 10 May 2009 03:32:34 UTC