Re: XMLLiteral handling in RDFa in HTML from Sam Ruby on 2009-05-27 (public-html@w3.org from May 2009)

From: Sam Ruby <rubys@intertwingly.net>
Date: Wed, 27 May 2009 01:57:56 -0400
To: Manu Sporny <msporny@digitalbazaar.com>
CC: Toby Inkster <tai@g5n.co.uk>, RDFa mailing list <public-rdf-in-xhtml-tf@w3.org>, HTMLWG WG <public-html@w3.org>
Message-ID: <4A1CD664.5050704@intertwingly.net>
Manu Sporny wrote:
> Toby Inkster wrote:
>> On Mon, 2009-05-25 at 20:55 -0400, Manu Sporny wrote:
>>> So, thoughts on this issue?
>> I don't think that a big song and dance is needed over this. The issue
>> seems pretty simple to me. 
> 
> Hmm, I don't think it is that simple, and here's why...
> 
> If you have the following markup:
> 
> <div about="#foo" xmlns:dc="http://purl.org/dc/elements/1.1/">
>    <span property="dc:description"><br>para1</span>
> </div>
> 
> A SAX-based parser (such as Expat), parsing an XHTML document will fail
> to generate a triple due to a parser error. Even if you do some sort of
> self-healing and continue processing the document, the XMLLiteral should
> not be produced because the contents are not well-formed XML.
> 
> However, an HTML5lib-based parser would correct the input to the
> following before a purely DOM-based RDFa processor could see the
> contents of the SPAN element:
> 
> <div about="#foo" xmlns:dc="http://purl.org/dc/elements/1.1/">
>    <span property="dc:description"><br/>para1</span>
> </div>
> 
> which would then generate the following triple:
> 
> <#foo>
>    <http://purl.org/dc/elements/1.1/description>
>       '<br xmlns="http://www.w3.org/1999/xhtml"
> xmlns:dc="http://purl.org/dc/elements/1.1/" />para1'^^rdf:XMLLiteral .
> 
> So, we have the exact same markup generating two completely different
> sets of XMLLiteral triples. If one of our goals is to generate the same
> triples across different types of markup - we are failing to do so with
> the current set of processing rules.

It is worse than that.  Consider only the set of valid and well-formed
XHTML 1.1 documents.  Parsing any such document as text/html will
produce a DOM, but that DOM will not always be identical to the one
produced when the same source is parsed as application/xhtml+xml.

More info: http://wiki.whatwg.org/wiki/HTML_vs._XHTML

Most of the differences deal with things like titles, textarea, scripts,
and style elements.  Also, <![CDATA[...]]> ends up being treated as a
comment in HTML.
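
The divergence on the quoted <br> markup above can be reproduced in
miniature.  What follows is only a sketch using Python's standard
library, not html5lib or any real RDFa processor: literal_via_xml,
literal_via_html, PropertyCollector and the VOID set are names made up
for illustration, attributes on nested elements are dropped, and the
void-element list is deliberately incomplete.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

MARKUP = (
    '<div about="#foo" xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<span property="dc:description"><br>para1</span></div>'
)

# Small, incomplete list of HTML void elements; enough for this sketch.
VOID = {"br", "hr", "img", "input", "link", "meta"}


def literal_via_xml(markup):
    """Inner XML of the <span>, or None if the source is not well-formed."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None  # an XML parser stops here, so no triple is produced
    span = root.find("span")
    children = "".join(ET.tostring(child, encoding="unicode") for child in span)
    return (span.text or "") + children


class PropertyCollector(HTMLParser):
    """Collects the content of the first element carrying @property."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while inside the property-bearing element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag in VOID:
                self.parts.append(f"<{tag}/>")  # void element: self-close it
            else:
                self.parts.append(f"<{tag}>")   # attributes dropped (sketch)
                self.depth += 1
        elif dict(attrs).get("property"):
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID:
            self.depth -= 1
            if self.depth:
                self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def literal_via_html(markup):
    collector = PropertyCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.parts)
```

With the markup above, literal_via_xml(MARKUP) gives None because the
XML parser rejects the unclosed <br>, while literal_via_html(MARKUP)
gives a usable fragment.  Same bytes, two different outcomes.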

The first observation is that even the Microdata proposal in the current
HTML 5 specification doesn't meet the criteria specified above(*), as
titles which contain the strings "&amp;" or "&lt;" will produce
different triples when those documents are parsed as text/html vs
application/xhtml+xml.

As an aside: for purposes of this discussion, I suggest adopting the
approach of identifying content based on the MIME type.  My weblog, for
example, is only XHTML when served to browsers that support
application/xhtml+xml.  For all other browsers (e.g. Lynx, IE8), it is
simply HTML.  I also suggest dropping version numbers when referring to
HTML or XHTML.

>> Sometimes an RDFa parser, dealing with HTML,
>> will hit a situation where it needs to generate an XMLLiteral from
>> non-wellformed HTML. In these situations, it seems to me that we have a
>> choice of three potential "the parser MUST" actions, all of which are
>> roughly consistent with RDFa in XHTML:
>>
>> 1. The parser MUST ignore this triple altogether. A simple solution, and
>> it means that the HTML graph would be a subset of the XHTML graph. RDF
>> vocabularies are generally defined so that if a graph G is true, then
>> any graph H such that H is a subset of G is also true.
> 
> The XHTML parser can't ignore the triple due to a parser error, or if it
> corrects the parser error, shouldn't output the malformed XMLLiteral.
> 
> The HTML5lib parser will never see that the XMLLiteral was malformed.
> 
>> 2. The parser MUST add the triple to the graph as normal, but MUST NOT
>> set the literal's datatype to XMLLiteral. They could either leave the
>> literal as an untyped literal (that happened to have a lot of angled
>> brackets in it) or perhaps set it to some HTMLLiteral datatype of our
>> own concoction.
> 
> This would be a problem because the XML-based parser implementations
> would switch the datatype of the object to something like
> XMLCharacterStream, while the html5lib parser would output an XMLLiteral.
> 
> I don't believe that there is any such thing as a malformed XMLLiteral
> in HTML5... is there? Can anybody think of an example of an invalid
> XMLLiteral in an html5 parser?

<div about="#foo" xmlns:dc="http://purl.org/dc/elements/1.1/">
   <span property="dc:description"><a$></span>
</div>
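
To spell out why that example matters: an HTML tokenizer will accept
"a$" as a tag name, but "a$" is not a legal XML element name, so the
resulting element cannot be written out as a well-formed XMLLiteral.
A rough way to see this with Python's standard library (TagNames and
is_xml_element_name are hypothetical helpers, and html.parser is only a
stand-in for a real HTML5 parser):

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SOURCE = (
    '<div about="#foo" xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<span property="dc:description"><a$></span></div>'
)


class TagNames(HTMLParser):
    """Records every start-tag name the HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


def is_xml_element_name(name):
    """True if <name/> parses as XML.

    Caveat: prefixed names such as "dc:x" are rejected here only because
    the prefix is unbound in this tiny test document.
    """
    try:
        ET.fromstring(f"<{name}/>")
        return True
    except ET.ParseError:
        return False


tokenizer = TagNames()
tokenizer.feed(SOURCE)
tokenizer.close()

# Names the HTML side accepted that XML would refuse to serialize.
not_xml = [n for n in tokenizer.names if not is_xml_element_name(n)]
```

Here not_xml ends up holding just the "a$" element, which is the one
that cannot appear in an XMLLiteral.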

>> 3. The parser MUST coerce the HTML fragment into a well-formed (but not
>> necessarily valid) XHTML fragment. The HTML5 draft gives us decent
>> algorithms for doing this.
> 
> It does, but HTML5 has nothing to do with XHTML1.1 and XHTML2 - why
> should we apply HTML5's parsing rules to XHTML1.1 and XHTML2 documents?

Browsers will apply HTML parsing rules to XHTML1.1 documents served as
text/html.  This can affect the triples produced by jquery.rdfa.js.

> I don't think that this is something we can 'MUST' ourselves out of...
> relaxing the conformance requirements to not include XMLLiterals seems
> to be a mechanism that would:
> 
> a) Allow variance in IF and HOW XMLLiterals are generated - which will
> vary based on if a document is being parsed by a SAX-based XML parser in
> XHTML1.1, or a DOM-based Javascript parser in HTML5.
> b) Not automatically disqualify all DOM-based HTML5 implementations, or
> non-raw-stream-based XHTML1.1 implementations.
> 
> Although, even this approach bothers me quite a bit... as does getting
> rid of XMLLiterals altogether.

There is a set of documents which will produce the same RDF triples
independent of whether the document is processed as text/html vs
application/xhtml+xml.

1) I suggest that the syntaxes for RDFa in application/xhtml+xml vs RDFa
in text/html not be considered separately, but be developed together and
with an eye towards maximizing the set mentioned above.

2) (in the fullness of time) it would be helpful if there were a
validator which identified documents which cause different triples to be
produced.  If such a tool also reports other conformance issues, it
would be helpful to have an option to turn that reporting off, since for
many documents those issues will obscure the errors that actually affect
the production of triples.

3) Test cases should be produced with the goal of ensuring that parsers
looking to produce RDF triples (whether it be from microformats,
microdata, or RDFa) respect the MIME type of the document.
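
As a sketch of what "respect the MIME type" could mean for point 3, a
triple extractor might choose its parsing mode from the Content-Type
header before it looks at any markup.  The function name parsing_mode
and the "xml"/"html" labels are assumptions made for illustration; the
header parsing itself uses Python's standard email.message module.

```python
from email.message import Message


def parsing_mode(content_type_header):
    """Map a Content-Type header value to "xml" or "html".

    Parameters such as charset are ignored and the comparison is
    case-insensitive, as get_content_type() lowercases the result.
    """
    msg = Message()
    msg["Content-Type"] = content_type_header
    mime = msg.get_content_type()
    if mime == "application/xhtml+xml":
        return "xml"    # draconian XML parsing rules apply
    if mime == "text/html":
        return "html"   # forgiving HTML parsing rules apply
    raise ValueError(f"no RDFa parsing mode for {mime}")
```

A test case would then feed the same source through both modes and
record whether the triples agree.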

> -- manu

- Sam Ruby

(*) I don't know whether this is an oversight, or even a problem, but
when looking into the HTML5 draft, I couldn't find where itemprop
attributes have any effect on the RDF triples produced.
Received on Wednesday, 27 May 2009 06:06:26 UTC