[whatwg] Writing authoring tools and validators for custom microdata vocabularies

On May 20, 2009, at 10:27, Henri Sivonen wrote:

> However, in order to usefully apply RELAX NG or Schematron to a  
> microdata-based infoset, the infoset conversion should turn property  
> names into element names. Since XML places arbitrary limitations on  
> element names (and element content), this mapping would have exactly  
> the same complications as mapping microdata to RDF/XML.


Here's an attempt at mapping microdata to XML:

  * Have a root element (it doesn't matter what it's called) with  
attribute xml:lang that has the language of the root element of the  
HTML document.
  * Have a child of root with local name 'title', namespace  
'http://purl.org/dc/terms/' and content that is the content of the  
HTML <title>.
  * For each link relation in the document, have a child of root that  
has as its local name the ASCII-lowercased rel token (or  
ALTERNATE-STYLESHEET for an alternate stylesheet), namespace  
http://www.w3.org/1999/xhtml/vocab# and a no-namespace attribute 'url'  
that contains the absolutized href of the link relation.
  * For each <meta name content>, have a child of root with the value  
of the name attribute of the <meta> as local name, namespace  
http://www.w3.org/1999/xhtml/vocab# and the value of the content  
attribute as element content. If the language of the <meta> differs  
from that of root, add xml:lang with the <meta>'s language.
  * For cites, do the link thing analogously to how cites are handled  
in the RDF conversion.
  * For items and properties:
    - map the property name to an XML (namespace, local name) pair as  
follows and use the result as the element name for the 'property  
element':
      * If itemprop contains a colon: Locate the last # or /, whichever  
comes last, ignoring the last character of the URI. The part up to and  
including that character is the namespace URI and the part after it is  
the local name.
      * Otherwise: The namespace is http://www.w3.org/1999/xhtml/custom#  
and the itemprop token is the local name.
    - If value is a URL, put the URL value in an attribute called  
'url' on the property element.
    - If the value is itself an item, copy the value of its item  
attribute into a no-namespace attribute called 'type' on the property  
element.
    - Otherwise, put the string value in the content of the property  
element and put the language of the property on the xml:lang attribute  
of the property element if different from its nearest ancestor xml:lang.
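
The property-name rule above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the helper names split_property_name and property_element are made up here, the caller is assumed to already know whether the value is a URL, an item or a string, and the error raised for a colon-containing name with no usable # or / is my guess, since the rules above don't cover that case.

```python
import xml.etree.ElementTree as ET

CUSTOM_NS = "http://www.w3.org/1999/xhtml/custom#"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def split_property_name(itemprop):
    """Return a (namespace, local name) pair for an itemprop token."""
    if ":" not in itemprop:
        return CUSTOM_NS, itemprop
    # Search everything except the final character, so that a trailing
    # '#' or '/' is never chosen as the split point.
    head = itemprop[:-1]
    cut = max(head.rfind("#"), head.rfind("/"))
    if cut == -1:
        # Assumption: the mapping does not say what to do in this case.
        raise ValueError("no '#' or '/' to split on: %r" % itemprop)
    return itemprop[:cut + 1], itemprop[cut + 1:]


def property_element(itemprop, url=None, item_type=None, text=None, lang=None):
    """Build the 'property element' for one property (hypothetical helper)."""
    ns, local = split_property_name(itemprop)
    el = ET.Element("{%s}%s" % (ns, local))
    if url is not None:
        el.set("url", url)            # URL values go in a 'url' attribute
    elif item_type is not None:
        el.set("type", item_type)     # nested item: record its item attribute
    else:
        el.text = text or ""          # plain string value as element content
        if lang is not None:          # caller passes lang only if it differs
            el.set(XML_LANG, lang)
    return el
```

For example, split_property_name("http://purl.org/dc/terms/title") yields the namespace "http://purl.org/dc/terms/" and the local name "title". Note that ElementTree does not check that the local name is an NCName, so the breakage cases still have to be handled separately.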

I haven't actually tried it, but on the face of it this kind of mapping  
seems amenable to validation with RELAX NG schemas.

And, as mentioned before, this breaks when:
  1) The local name becomes non-NCName.
  2) textContent in HTML contains non-XML characters

Use the infoset coercion rules for those. However, the Uhhhhhh escape  
notation (U followed by six hex digits) can collide with real property  
names, because microdata property names aren't lowercased.
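
To make the collision concrete, here is a rough Python sketch of the Uhhhhhh-style escaping. It assumes an ASCII-only approximation of what is allowed in an NCName (the real production permits many non-ASCII characters), and coerce_local_name is a made-up name for illustration, not something defined by the spec.

```python
import re

# Assumption: ASCII-only approximation of the NCName character classes.
_START_OK = re.compile(r"[A-Za-z_]")
_REST_OK = re.compile(r"[A-Za-z0-9_.\-]")


def coerce_local_name(name):
    """Replace each disallowed character with 'U' plus six uppercase hex digits."""
    out = []
    for i, ch in enumerate(name):
        allowed = (_START_OK if i == 0 else _REST_OK).fullmatch(ch)
        out.append(ch if allowed else "U%06X" % ord(ch))
    return "".join(out)


# Two distinct property names that end up with the same element name,
# because an uppercase 'U' is legal in a microdata property name.
escaped = coerce_local_name("foo@bar")
literal = coerce_local_name("fooU000040bar")
```

Here both escaped and literal come out as "fooU000040bar". With HTML element names the parser lowercases ASCII letters, so an uppercase U cannot already be present, but nothing lowercases microdata property names.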

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/

Received on Wednesday, 20 May 2009 03:50:02 UTC