Request for clarifications re XMLLiteral in RDFa (was: Re: XML Literals poll)

Hi Gregg,

Thanks for the comments. It's good to have more RDFa perspective on this. A few notes and requests for clarification below.

On 22 Nov 2011, at 17:47, Gregg Kellogg wrote:
>> Q1. Should the specs define a way to compare XML literals based on value?
>> 
>> In other words, in the same way that integers 7 and 007 have the same value, should <foo/> and <foo></foo> be defined as having the same value?
> 
> -1. The only interesting use of XML literals now, from my perspective, is capturing HTML markup in RDFa. Given that RDFa may be used with non-closing tags (i.e., not a valid XML infoset), this can't work reliably.

Why? The HTML5 draft contains an algorithm for parsing HTML fragments into an HTML DOM:
http://www.w3.org/TR/html5/the-end.html#parsing-html-fragments

And there's an algorithm for coercing an HTML DOM into an XML infoset:
http://www.w3.org/TR/html5/the-end.html#coercing-an-html-dom-into-an-infoset

On the face of it, this seems to give a reliable basis for value-based comparison, even when HTML tag-soup fragments are considered.
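
To make "value-based comparison" concrete, here is a minimal sketch for the XML side only. It assumes Python 3.8+ and uses the standard library's C14N 2.0 canonicalizer as a stand-in for whatever canonical form the specs would pick; the helper name same_xml_value is made up for illustration. It does not cover the HTML fragment parsing step, which would need an HTML5 parser in front of it.

```python
import xml.etree.ElementTree as ET


def same_xml_value(a: str, b: str) -> bool:
    """Compare two well-formed XML documents by canonical form, not by spelling.

    Hypothetical helper: C14N 2.0 is only an assumed stand-in for the
    value mapping the specs might define.
    """
    return ET.canonicalize(a) == ET.canonicalize(b)


# Empty-element syntax versus an explicit start/end tag pair.
print(same_xml_value("<foo/>", "<foo></foo>"))

# Different quote characters and attribute order.
print(same_xml_value("<a y='2' x=\"1\"/>", '<a x="1" y="2"></a>'))

# Genuinely different content.
print(same_xml_value("<foo/>", "<bar/>"))
```

The first two pairs differ only in spelling, in the same way as 7 and 007; the third pair differs in content.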

> To do so, you'd need to know if the content model was HTML or XML. This could potentially be addressed by introducing a new rdf:HTMLLiteral, but that seems a step in the wrong direction.

Why would you call this a step in the wrong direction? Given that HTML and XML have very different syntactic constraints, two separate datatypes seem like a natural approach to take.

>> Q4. Should *invalid XML* be allowed in the lexical space?
>> 
>> In other words, should "</bar !!!>"^^rdf:XMLLiteral be ill-typed (just like "AAA"^^xsd:integer) or well-typed (just like "</bar !!!>"^^xsd:string)?
> 
> +1. If we depend on authors only using "correct" markup, we'll invalidate many common cases, even where the HTML is incorrect. 

I note that the lexical form isn't necessarily what authors write – there's always a parser in between.
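
For reference, the ill-typed option in Q4 could be sketched roughly as follows. This is an illustration under assumptions, not what any spec says: it only tests well-formedness of the lexical form as XML content, it ignores any canonicalization requirement, and the names is_well_typed_xml_literal and "wrapper" are invented for the example.

```python
import xml.etree.ElementTree as ET


def is_well_typed_xml_literal(lexical: str) -> bool:
    """Return True if the lexical form parses as well-formed XML content.

    Hypothetical check: the lexical form may be a fragment (text, or
    several sibling elements), so it is wrapped in a dummy root element
    before parsing. A lexical form that closes the wrapper itself would
    fool this sketch.
    """
    try:
        ET.fromstring("<wrapper>" + lexical + "</wrapper>")
    except ET.ParseError:
        return False
    return True


print(is_well_typed_xml_literal("<foo/>"))
print(is_well_typed_xml_literal("some <b>bold</b> text"))
print(is_well_typed_xml_literal("</bar !!!>"))
```

Under this reading, "</bar !!!>" is rejected in the same way that "AAA" is rejected for xsd:integer, while anything a conforming parser hands over as well-formed content is accepted.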

> rdf:XMLLiteral should simply be a sub-datatype of xsd:string. RDFa uses rdf:XMLLiteral to trigger using innerHTML rather than innerText when extracting literal content.

But doesn't the name “XMLLiteral” imply that *something* about it – perhaps either the input or output – should be XML? Why would you expect that @datatype="rdf:XMLLiteral" accepts tag soup as input and deposits tag soup in the graph?

>> Q5. Should the specs say that RDF/XML parsers MUST canonicalize when handling parseType="literal"?
>> 
>> RDF/XML parsers are often implemented on top of an XML parser, and hence they don't have access to a low-level representation of the XML literal, e.g., did it use single or double quotes in the attributes, what order were the attributes in, or how many spaces were between them? If they don't canonicalize, then two different RDF/XML parsers would be pretty much guaranteed to parse the same RDF/XML file into different triples (or even different runs of the same parser over the same file could yield different triples).
> 
> -1. C14N is a pain. I'd remove any requirement that in-scope namespace definitions be added to top-level elements within the nodeset too.

I note that this would likely invalidate most existing RDF/XML content.

> I do think there's value in maintaining the in-scope @lang or @xml:lang as part of the literal, though.

(There is no @lang in RDF/XML, only @xml:lang.)

Best,
Richard

Received on Wednesday, 23 November 2011 22:17:49 UTC