Re: [Bug 8611] Consider adding a full schema to HTML

On Tue, 5 Jan 2010, Joe D Williams wrote:
> >
> > XML has a different syntax than text/html HTML.
> 
> Are there such differences that expression by xml schema is only 
> possible for HTML5 in XHTML form?

As far as I can tell (which admittedly is not especially far), XML Schema 
is defined in terms of the XML Infoset, not in terms of the XML syntax. 
Therefore, anything that can be expressed in an XML Infoset can be 
syntax-checked by XML Schema.

HTML5 defines how to coerce the output of an HTML parser (namely, a DOM) 
into an Infoset for toolchains that do not support features beyond those 
defined by the XML and XML Infoset specifications:

   http://www.whatwg.org/specs/web-apps/current-work/multipage/the-end.html#coercing-an-html-dom-into-an-infoset

It's worth noting that the XML Infoset cannot distinguish every possible 
difference between XML documents. For instance, one cannot express the 
number of spaces between an XML element's attributes in the Infoset, so 
these two documents have the same Infoset despite having different XML 
serialisations:

   <test a="" b=""/>

   <test a=""  b=""/>

The same applies to text/html HTML5, as I described in my last e-mail.
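
For the XML case, the attribute-spacing point above can be checked with a 
short sketch. It assumes Python's standard xml.etree.ElementTree parser as a 
rough stand-in for an Infoset-level view (it is not a formal Infoset 
implementation), and infoset_view is a hypothetical helper name introduced 
only for this illustration:

```python
import xml.etree.ElementTree as ET


def infoset_view(xml_text):
    """Return (element name, sorted attributes) for the root element.

    This is only an approximation of the Infoset: it keeps the element
    name and attribute name/value pairs, which is all that matters for
    the two documents compared here.
    """
    root = ET.fromstring(xml_text)
    return (root.tag, sorted(root.attrib.items()))


one_space = '<test a="" b=""/>'
two_spaces = '<test a=""  b=""/>'

# The serialisations differ, but the parsed views do not.
same_bytes = one_space == two_spaces
same_infoset = infoset_view(one_space) == infoset_view(two_spaces)

print("same serialisation:", same_bytes)
print("same parsed view:  ", same_infoset)
```

Anything operating on the parsed view, as a schema validator does, has no 
way to tell these two inputs apart.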

There are also semantically relevant aspects of text/html that cannot be 
expressed in an XML Infoset, such as whether the document is in quirks 
mode, or what form element associations might exist that are not 
represented in the DOM. These are aspects that are mentioned by the 
coercion section cited above. Furthermore, there are semantically relevant 
aspects of text/html that cannot be expressed even by the DOM data 
structures, such as the functionality of <noscript> in the presence of 
script or in the absence of script. For validation purposes, these are 
handled in relatively complicated ways by the spec. As far as I can tell, 
there is no way to make straight XML Schema fully handle these features, 
as the information simply wouldn't be present in the Infoset.


> Is there structure, content models, or combinations of HTML5 that cannot 
> be modelled by xml schema?

Insofar as there are structures that cannot be modeled by the XML Infoset, 
yes. There may also be conformance requirements that cannot be fully 
expressed by XML Schema itself, but I'm not familiar enough with XML 
Schema to say whether this is the case or not. Henri might know. (It is 
the case that SGML DTDs, XML DTDs, RelaxNG, and Schematron all cannot 
fully express all the machine-checkable conformance requirements of HTML, 
so I would be surprised if it wasn't also the case for XML Schema.)

HTH,
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Wednesday, 6 January 2010 07:21:32 UTC