Re: XMLLiteral handling in RDFa in HTML

Toby Inkster wrote:
> On Mon, 2009-05-25 at 20:55 -0400, Manu Sporny wrote:
>> So, thoughts on this issue?
> 
> I don't think that a big song and dance is needed over this. The issue
> seems pretty simple to me. 

Hmm, I don't think it is that simple, and here's why...

If you have the following markup:

<div about="#foo" xmlns:dc="http://purl.org/dc/elements/1.1/">
   <span property="dc:description"><br>para1</span>
</div>

A SAX-based parser (such as Expat) parsing this as an XHTML document will
fail to generate a triple because of a parser error. Even if you do some
sort of self-healing and continue processing the document, the XMLLiteral
should not be produced because the contents are not well-formed XML.
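
To make the first case concrete, here is a rough sketch using Python's
standard-library Expat binding. It is not taken from any RDFa processor;
the helper name is_well_formed is made up for illustration, and the
markup is the example above collapsed onto one line.

```python
import xml.parsers.expat

# The example markup from this message, with the unclosed <br>.
MARKUP = (
    '<div about="#foo" xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<span property="dc:description"><br>para1</span>'
    '</div>'
)


def is_well_formed(markup: str) -> bool:
    """Return True if Expat accepts the markup as well-formed XML.

    Hypothetical helper, for illustration only.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(markup, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed(MARKUP))
    print(is_well_formed(MARKUP.replace("<br>", "<br/>")))
```

Expat rejects the first form when it reaches </span> while <br> is still
open, so a processor built on it never gets as far as the literal.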

However, an html5lib-based parser would correct the input to the
following before a purely DOM-based RDFa processor could see the
contents of the SPAN element:

<div about="#foo" xmlns:dc="http://purl.org/dc/elements/1.1/">
   <span property="dc:description"><br/>para1</span>
</div>

which would then generate the following triple:

<#foo>
   <http://purl.org/dc/elements/1.1/description>
      '<br xmlns="http://www.w3.org/1999/xhtml"
xmlns:dc="http://purl.org/dc/elements/1.1/" />para1'^^rdf:XMLLiteral .

So, we have the exact same markup generating two completely different
results: no triple at all from one parser, and an XMLLiteral triple from
the other. If one of our goals is to generate the same triples across
different types of markup, we are failing to do so with the current set
of processing rules.
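
The HTML side of that divergence can be sketched as well. This is a toy
built on Python's standard-library html.parser, not html5lib and not the
HTML5 parsing algorithm: the class name VoidAwareSerializer and the
partial VOID_ELEMENTS set are assumptions made up for this example. It
only shows how a tag-soup parser that knows <br> is a void element hands
well-formed content to whatever runs after it.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Partial list of HTML void elements; enough for this example only.
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class VoidAwareSerializer(HTMLParser):
    """Re-serialize HTML as XML, self-closing void elements (toy sketch)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(
            f" {name}={quoteattr(value or '')}" for name, value in attrs
        )
        if tag in VOID_ELEMENTS:
            self.out.append(f"<{tag}{attr_text}/>")
        else:
            self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        # Void elements were already closed in handle_starttag.
        if tag not in VOID_ELEMENTS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data))


def repair(markup: str) -> str:
    serializer = VoidAwareSerializer()
    serializer.feed(markup)
    serializer.close()
    return "".join(serializer.out)


if __name__ == "__main__":
    source = (
        '<div about="#foo" xmlns:dc="http://purl.org/dc/elements/1.1/">'
        '<span property="dc:description"><br>para1</span></div>'
    )
    repaired = repair(source)
    print(repaired)
    ET.fromstring(repaired)  # an XML parser accepts the repaired form
```

Anything downstream of repair() only ever sees <br/>, which is why a
DOM-based processor has no way to know the original was malformed.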

> Sometimes an RDFa parser, dealing with HTML,
> will hit a situation where it needs to generate an XMLLiteral from
> non-wellformed HTML. In these situations, it seems to me that we have a
> choice of three potential "the parser MUST" actions, all of which are
> roughly consistent with RDFa in XHTML:
> 
> 1. The parser MUST ignore this triple altogether. A simple solution, and
> it means that the HTML graph would be a subset of the XHTML graph. RDF
> vocabularies are generally defined so that if a graph G is true, then
> any graph H such that H is a subset of G is also true.

The XHTML parser can't ignore the triple, because it hits a parser error
first; or, if it corrects the parser error, it shouldn't output the
malformed XMLLiteral.

The html5lib parser will never see that the XMLLiteral was malformed.

> 2. The parser MUST add the triple to the graph as normal, but MUST NOT
> set the literal's datatype to XMLLiteral. They could either leave the
> literal as an untyped literal (that happened to have a lot of angled
> brackets in it) or perhaps set it to some HTMLLiteral datatype of our
> own concoction.

This would be a problem because the XML-based parser implementations
would switch the datatype of the object to something like
XMLCharacterStream, while the html5lib parser would output an XMLLiteral.

I don't believe that there is any such thing as a malformed XMLLiteral
in HTML5... is there? Can anybody think of an example of an invalid
XMLLiteral in an HTML5 parser?

> 3. The parser MUST coerce the HTML fragment into a well-formed (but not
> necessarily valid) XHTML fragment. The HTML5 draft gives us decent
> algorithms for doing this.

It does, but HTML5 has nothing to do with XHTML1.1 and XHTML2 - why
should we apply HTML5's parsing rules to XHTML1.1 and XHTML2 documents?

I don't think that this is something we can 'MUST' ourselves out of...
relaxing the conformance requirements to not include XMLLiterals seems
to be a mechanism that would:

a) Allow variance in IF and HOW XMLLiterals are generated, which will
vary depending on whether a document is being parsed by a SAX-based XML
parser in XHTML1.1, or a DOM-based Javascript parser in HTML5.
b) Not automatically disqualify all DOM-based HTML5 implementations, or
non-raw-stream-based XHTML1.1 implementations.

Although, even this approach bothers me quite a bit... as does getting
rid of XMLLiterals altogether.

-- manu

-- 
Manu Sporny
President/CEO - Digital Bazaar, Inc.
blog: A Collaborative Distribution Model for Music
http://blog.digitalbazaar.com/2009/04/04/collaborative-music-model/

Received on Wednesday, 27 May 2009 02:07:00 UTC