ISSUE-76: CHANGE PROPOSAL (Draft 4): Separate Microdata from HTML5 Specification

The following document outlines a Change Proposal to remove Microdata
from the HTML5 specification. The first draft of this document was
published on October 21st, 2009, due to a request by one of the Chairs
of the HTML WG. This e-mail is formatted to conform to step 2.b of the
Escalation Process section[1] of the HTML Working Group Decision Policy
document.

Changes in Draft 4
------------------
* Removed automatic publication of HTML+Microdata FPWD requirement

Changes in Draft 3
------------------
* Added statements noting that the publication of HTML+Microdata FPWD is
  an immediate result of accepting this change proposal in "Summary" and
  "Proposal Details"
* Added more items to "Microdata Cons" list in the "Rebuttal to
  Counter-Proposal" section

Changes in Draft 2
------------------
* Removed mention of RDFa except when used to convey
  development/deployment experiences or examples of "maturity" or
  "minimum/adequate support".
* Added Benefit of modularizing Microdata - adoption by languages other
  than HTML5
* Included Rebuttal to Counter-Proposal section

Summary
-------

There are currently two mechanisms under active development by the HTML
WG for embedding machine-readable semantics in HTML5 - RDFa and
Microdata. The HTML+RDFa spec was published in a separate document, as a
specification built on top of HTML5. The Microdata spec was published
inside of the HTML5 specification, while the discussion of whether or
not to include RDFa was still taking place.

While there are many points to be made for and against RDFa and
Microdata as technologies, the rationale for this proposal is not
concerned with those arguments. The pros-and-cons are, however, reviewed
at the bottom of the document in the "Rebuttal to Counter-Proposal"
section. This change proposal is concerned with the ramifications of
placing a new technology that has gained neither broad deployment
experience nor authoring feedback into the main HTML5 specification.

Primarily, this Change Proposal asserts that RDFa barely meets the
requirement of broad deployment experience and authoring feedback.
Microdata, having achieved very little implementation experience, no
deployment experience and very little authoring feedback to date,
should be considered an at-risk feature for HTML5 and should be
considered for removal into a separate specification.

This proposal argues that Microdata should be kept separate from the
HTML specification until it is clear to this Working Group that it has
become broadly deployed and heavily utilized by the HTML authoring
community. This has a number of benefits in the case that the technology
succeeds, as well as in the case where the technology fails.

Separating Microdata from the HTML5 specification has several
significant advantages and no significant disadvantages.

Rationale
---------

There are a number of basic premises related to separating Microdata
from the main HTML5 specification:

* Microdata may fail in the marketplace.
* It is more productive for philosophically divergent communities
(RDFa/Microdata) within a larger community (HTML WG) to have their own
work products during a period of active debate. Those complete work
products should only be presented to the larger group for consensus when
they reach maturity. Presenting them before the work is complete leads
to perma-thread discussions, as we have experienced for the past several
months.
* HTML+Microdata should be allowed to become a mature draft before
consensus on inclusion or dismissal is discussed in order to ensure the
proper technology is selected for semantic data markup.
* Having the Microdata specification separate from the HTML5
specification will allow Microdata to evolve independently from
HTML5 (during LC, and after REC).
* Microdata could be used in other markup languages to provide semantic
markup.

A number of potential conclusions can be drawn from the premises and
current state of affairs:

* If Microdata fails in the marketplace, in the long-term, it would be
advisable to allow it to fail without having a negative impact on the
HTML5 spec proper. Removing it from HTML5, many years from now, will be
difficult, if not impossible.
* The HTML+Microdata draft should be allowed to mature until W3C Last
Call before the discussion on whether or not to include it in the HTML5
specification. A productive way to enable that maturation process is to
separate the work into a separate document.
* If we don't separate the Microdata specification into a different work
product, the alternative may be to prematurely select a single
technology, before Microdata is allowed to mature and gather
implementation and deployment feedback.
* If Microdata is split into a modular Microdata specification, the
likelihood that it would be adopted by other markup languages (like
SVG, ODF, or DocBook) might increase because it will no longer be viewed
as an HTML5-only technology.

Proposal Details
----------------

This proposal requires removing all language discussing Microdata from
the HTML5 specification. It is advised, but not required, that the
language be placed into a separate specification. It has been
demonstrated that the Microdata specification can be cleanly migrated
into a separate HTML+Microdata specification:

http://html5.digitalbazaar.com/specs/microdata.html

The work to remove the Microdata language from the body of the HTML5
specification took roughly 8-10 hours for a single person to perform.

Impact
------

Negative Effects
* May produce less interest in and feedback on Microdata since it will
not be in the HTML5 spec proper.
* All Microdata attributes and behavior would be defined in a separate
specification.

Positive Effects
* Addresses the months-long Microdata vs. RDFa debate by employing the
"allowing many flowers to bloom" strategy instead of the "Mad Max" "two
enter, one leaves" fight-to-the-death strategy that has been driving the
debate.
* Demonstrates that multiple specs are capable of being layered on top
of HTML5. Proving that HTML5 can be extended through this Working Group
is an important milestone for other spec writers as well as W3C member
companies.
* Allows Microdata to organically mature at its own pace, largely
independently from HTML5.
* Allows Microdata to fail without affecting the main HTML5 specification.
* Changing Microdata in the future wouldn't require the HTML5
specification to be republished as a REC (a very costly process).
* Frees the WHATWG and HTMLWG to concentrate on making technical
progress in other areas.

Rebuttal to Counter-Proposal
----------------------------

> * All good specs which integrate with HTML5 should, ideally, be a part
> of HTML5.  Inclusiveness promotes greater attention to each part, and
> ensures that the language evolves in directions which are most
> helpful.  A spec which is separate from HTML5 may find the easiest way
> to resolve difficulties is to route around them, rather than altering
> or extending the HTML language itself, which may be the best option
> overall.

While there is nothing erroneous in the individual statements made in
the rationale above, they do not address how the rationale relates to
Microdata directly. The philosophy, if employed as "the best way to
implement specifications", is largely false and ignores the large body
of work that constitutes existing Internet and Web specifications. The
"many modular specifications" approach is how the IETF and W3C have
operated to date, and the Internet and Web still work fairly well.

Even if one were to assert the above philosophy as true, it is just one
possible philosophy among many that may be used to move us forward. Here
is another, equally convincing, strategy (in the spirit of the rationale
in the counter-proposal):

* All good specs which can be built upon HTML5 should, ideally, be
placed in a separate specification and vetted thoroughly by the HTML WG.
Modularity provides focused specifications and eases the burden on
implementers and authors when creating software or web pages that use
the features outlined in specifications. A spec which is built on top of
HTML5 SHOULD NOT be allowed to route around problems when the best
option would be to change the HTML5 spec proper - the W3C Review process
is in place to ensure that this is enforced. The W3C Review process has
done so for countless other specifications by inviting reviewers from
the HTML WG to review the extension specifications.

Example: HTML+RDFa is a separate specification built on top of HTML5
and has received a large amount of feedback and interest from the HTML
WG and WHATWG communities even though it resides in a separate
specification. This feedback has resulted in planned modifications,
corrections, a FPWD and 107 additional HTML+RDFa tests added to the
test suite.

The Point: The problem raised in the counter-proposal's rationale is
theoretical and is contradicted by the empirical evidence from both the
Microdata and RDFa discussions. If Microdata is split out and there is
no further interest in it, then it was never a "good spec". If Microdata
is split out and is a "good spec", it will enjoy an adequate number of
implementations, along with adequate feedback, review and testing,
before REC, much like <video>, <canvas>, and RDFa have in the past (when
they were not a part of any HTML specification).

> * A spec that is designed within HTML5 and one designed outside of it
> are qualitatively different (see Conway's Law).  One designed
> originally as part of the larger spec tends has a larger "surface
> area" alongside the rest of the spec, rather than limiting its
> interaction to a small number of channels.  This makes it harder to
> separate out (though Manu has already done that work) and makes it
> more vulnerable to incompatible changes in the larger spec.  Something
> which originated within the spec is best kept within the spec or
> dropped entirely; it should require strong reasoning to separate it
> out.

This rationale is also fairly theoretical - it effectively boils down to
"we might accidentally create bugs when changing a specification" and
"we need a good reason to separate Microdata from the HTML5
specification". That "good reason" and "strong reasoning" are provided
in the body of this change proposal.

While it is true that integrating language in a specification allows for
a "larger surface area", the argument is not persuasive because a decent
test suite should be able to catch most accidentally introduced bugs.
Where a particular technology is specified (in the HTML5 spec, or in a
separate Microdata spec) also has no bearing on the potential bugs or
on the construction of the test suite.

Example: The RDFa Test Suite is defined for XHTML1, HTML4 and HTML5,
even though each specification is in a separate document. It has been
fairly effective at catching RDFa Processor bugs and continues to be
expanded to cover newly discovered issues. Any incompatible changes in
the larger spec would be immediately visible when utilizing an updated
XHTML1, HTML4 or HTML5 parser - if the test suite is doing its job.

The Point: The way to ensure that a software system (HTML5+Microdata) is
operating correctly is to /test it thoroughly/ against a set of
specifications - not to ensure that all of the specification language is
in one document.

> * Many parts of HTML5 cannot be considered 'mature' and are in fact
> actively changing, and yet are still part of the spec.  It is expected
> that these sections, Microdata included, will receive implementation
> attention and experience, and will be amended or dropped as these
> experiences warrant.  Lack of maturity is not a reason for removal of
> any other part of the spec, and there is no distinguishing feature of
> Microdata that would warrant it being treated differently.

The term "mature" was intended to imply a number of attributes.
Namely, since the HTML5 spec is approaching Last Call at the W3C, it is
concerning when any feature has the following attributes - lack of
implementation experience, lack of feedback, corner-case bugs, vehement
disagreement, a published and competing W3C spec, lack of authoring
experience, and lack of deployment experience. When any of these
attributes are associated with a feature, it is certainly a reason to
consider postponing the inclusion of that feature into HTML5. A number
of these attributes are associated with Microdata, namely - lack of
implementation experience, vehement disagreement, lack of authoring
experience, a published and competing W3C REC spec, and lack of
deployment experience.

There are a number of individuals who believe that Microdata is not
"mature" enough to proceed to Last Call at this time. This group has
argued this point in detail, created Change Proposals, volunteered to
split the Microdata specification into a separate document, and
demonstrated that it is quite possible and fairly easy to split the
Microdata specification into a separate document.

The Point: The counter-proposal does not use the same definition of
"mature" as the original proposal. This rebuttal clarifies the meaning
of "mature" and asserts that we should be postponing features that have
a laundry list of issues associated with them.

> * Microdata does not appear to be in an extreme level of flux to
> warrant concerns of it holding up HTML5's progression in the standards
> process.  If it turns out to indeed limit the main spec it can be
> split out at that time, but at the moment this is nothing more than a
> theoretical concern.  In the other direction, it does not seem likely
> that implementations of Microdata will progress any quicker if it was
> a separate spec, and so HTML5 cannot be said to be slowing down
> Microdata's progress either.  In the event that Microdata does fail in
> the marketplace, it can simply be removed from the spec at that time;
> there does not seem to be any benefit in spending effort to make this
> action any simpler.

It takes years, if not decades, to "remove" features from the HTML
language, let alone see their use halted in browsers. It is far
better to be patient and take 2-4 years to ensure that a technology is
stable enough to become a part of the HTML language than it is to
prematurely insert it into a specification and publish it as part of HTML.

The Point: Asserting that features can be "simply removed" from the HTML
spec overlooks a recurring footnote in HTML's history. There are still
people publishing HTML4, warts and all.
Removing a feature of HTML is never a simple matter.

> * The purpose of the W3C is to advance the web, not to remain neutral
> in technological conflicts.  If one technology under the W3C's purview
> is better than a competing technology, it is our responsibility to
> actively decide in favor of it.  To do elsewise would be dereliction
> of our core duty to the web.  Microdata and RDFa are directly
> competing, as they accomplish virtually precisely the same thing;
> there is no good reason to use both on a page except for gratuitous
> proliferation of metadata embedding syntaxes.

Separating Microdata into a separate specification buys the technology
some time to get implementation, authoring and deployment feedback from
search companies, browser manufacturers and authors.

If Microdata is instead forced to compete head-to-head with RDFa now, we
run the danger of killing it off before it matures, and that would be a
shame. The mere existence of Microdata is driving the
RDFa community to adapt the best parts of Microdata/Microformats for its
own use and that is something that will make the Web a better place,
even if Microdata eventually fails in the marketplace (or vice-versa).

The Point: Microdata isn't ready to compete against RDFa on many levels
- namely, number of implementations, published W3C REC status,
implementation feedback, test suites, deployment feedback and authoring
feedback. Forcing it to do so, prematurely, will almost certainly kill
it before we can see if it is a workable solution.

> * The Microdata data model is extremely simple for simple, common
> cases, and is complex only in rare, complicated cases.  Its tree-based
> nature (as a set of nested name/value pairs) matches well with both
> the HTML language and XML and JSON data storage/interchange formats.
> The processing model is extremely simple and well-defined, and
> essentially trivial to implement.  The DOM API associated with it
> makes retrieving metadata from a page via a script in the page
> extremely simple, broadening the possible usages of Microdata beyond
> spiders and the like to actually being useful in applications.  It is,
> in short, a simple and intuitive metadata syntax in a field where
> neither adjective can typically be applied, backed up by user studies
> that directly informed its design.  Removing it from HTML5 would
> provide no benefit to authors or implementors, and would likely serve
> only to slow down the development and deployment of a useful tool for
> authors.

While it is true that Microdata's data model is extremely simple, it is
also true that it is too simple to accomplish many important
engineering, science and mathematics related tasks since it does not
have support for anything that involves open-ended measurements or
property data-typing of any kind.

There is no multi-language support in Microdata's data model, making it
impossible for web applications to determine the natural language of
text data. For example, there is no way to tag the word "chair" with a
language in Microdata. That word means something fairly benign in
English, but something entirely different in French. Translation
software would be very difficult to build on top of Microdata.

While its tree-based nature is a positive design criterion, Microdata's
processing model is no different from RDFa's in this respect, as RDFa
contains the same tree-based processing mechanism and can be easily
mapped to a tree-based data structure if needed.

Microdata's ability to serialize to XML and JSON data
storage/interchange formats is not a defining characteristic, as almost
any data structure can be mapped to XML and JSON.
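
The serialization point can be illustrated with a minimal sketch. It
assumes an item is modeled as a plain Python dict of nested name/value
pairs; the item, its property names and the to_xml helper are
hypothetical and are not taken from the Microdata or RDFa
specifications. Nothing in it is specific to Microdata, which is the
point being made:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical item: nested name/value pairs, as in any tree-shaped model.
item = {
    "name": "Example Person",
    "address": {"locality": "Example City", "region": "EX"},
}


def to_xml(name, value):
    """Map a nested dict of name/value pairs onto an XML element tree."""
    element = ET.Element(name)
    if isinstance(value, dict):
        # A nested dict becomes one child element per name/value pair.
        for child_name, child_value in value.items():
            element.append(to_xml(child_name, child_value))
    else:
        # A leaf value becomes the element's text content.
        element.text = str(value)
    return element


# The same generic tree serializes to both formats with no special support.
as_json = json.dumps(item, sort_keys=True)
as_xml = ET.tostring(to_xml("item", item), encoding="unicode")
```

Any tree of name/value pairs, whatever syntax it was extracted from,
could be passed through the same two calls.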

While the DOM API was a first for Microdata, it will be short-lived, as
a DOM API for RDFa is in the works for RDFa 1.1 and will reach REC years
before Microdata reaches REC - it will not be a defining characteristic
in seven months' time.

While it has been asserted a number of times that user studies were
performed to influence its design, the raw data for these studies have
never been provided for third-party analysis.

The W3C's Technical Architecture Group, the body that oversees the
overall system design for the Web, has asked that Microdata be removed
from the HTML5 specification. This request is partly based on
Microdata's design decision to not fully support the follow-your-nose
principle. This principle asserts that a User Agent should be able to
dereference the meaning of semantic predicates like "name", "desc", and
"title". User Agents that implement Microdata will have half-baked
support for the follow-your-nose principle, unlike RDFa, which has full
support for the follow-your-nose principle.

Semantic object validation is not supported in Microdata, which makes it
impossible for User Agents to understand whether or not the data that
they are working with is valid. This will inevitably make User Agents
more complicated, since Microdata pushes data validation far up the
application layer stack. RDFa has data validation, as well as vocabulary
term equivalency (via RDFS and OWL) support, built in.

There is nothing useful that Microdata does that RDFa doesn't already do
now, or will do in less than one year - long before Microdata reaches
REC status.

Additionally, Microdata does not have a firm commitment of
implementation support by the majority of any single industry (Google,
Yahoo - search), nor does it have the commitment to be included in any
high profile content management system (Drupal 7), nor does it have the
commitment of any major world government (The United Kingdom), nor a
scientific body (The Public Library of Science).

That said - it would be a mistake to kill off Microdata now... or in the
next 2 years. Giving Microdata the benefit of the doubt and the chance
to mature over the course of 1-2 years would ensure that the HTML WG
makes the proper decision when it comes to choosing a technology for the
Semantic Web. The place for that maturation is not the HTML5
specification proper for the reasons listed in this proposal. We should
be calculated in this decision and not allow "what we know" now to blind
us to "what could be" in the future.

The Point: Split Microdata out so it has a chance to mature - the
correct technological solution will become clear in time.

-- manu

[1]http://dev.w3.org/html5/decision-policy/decision-policy.html#change-proposal

-- 
Manu Sporny (skype: msporny, twitter: manusporny)
President/CEO - Digital Bazaar, Inc.

Received on Wednesday, 9 December 2009 20:16:41 UTC