Re: Request for clarifications re XMLLiteral in RDFa (was: Re: XML Literals poll) from Gregg Kellogg on 2011-11-23 (public-rdf-comments@w3.org from November 2011)

From: Gregg Kellogg <gregg@kellogg-assoc.com>
Date: Wed, 23 Nov 2011 18:04:30 -0500
To: Richard Cyganiak <richard@cyganiak.de>
CC: "public-rdf-comments@w3.org" <public-rdf-comments@w3.org>
Message-ID: <098423FF-B22D-4DC2-8F9F-EF4A72CF2D2C@greggkellogg.net>
On Nov 23, 2011, at 2:17 PM, Richard Cyganiak wrote:

> Hi Gregg,
> 
> Thanks for the comments. It's good to have more RDFa perspective on this. A few notes and requests for clarification below.
> 
> On 22 Nov 2011, at 17:47, Gregg Kellogg wrote:
>>> Q1. Should the specs define a way to compare XML literals based on value?
>>> 
>>> In other words, in the same way that integers 7 and 007 have the same value, should <foo/> and <foo></foo> be defined as having the same value?
>> 
>> -1. The only interesting use of XML literals now, from my perspective, is capturing HTML markup in RDFa. Given that RDFa may be used with non-closing tags (i.e., not a valid XML infoset), this can't work reliably.
> 
> Why? The HTML5 draft contains an algorithm for parsing HTML fragments into an HTML DOM:
> http://www.w3.org/TR/html5/the-end.html#parsing-html-fragments
> 
> And there's an algorithm for coercing an HTML DOM into an XML infoset:
> http://www.w3.org/TR/html5/the-end.html#coercing-an-html-dom-into-an-infoset
> 
> On the face of it this seems like it would give a reliable basis for value-based comparison even if HTML tag soup fragments are considered.

Understood, and I could live with this. However, not every environment will have an HTML parser to turn the lexical representation into a DOM representation. For example, Ruby does not include a native HTML parser; one is available as a plugin, but it can't be used in every circumstance.

In most cases, an L2V (lexical-to-value) mapping of the content is not important; it only becomes useful when querying or comparing graphs. Even then, comparing literals based on equivalent content seems fragile. I would hate to require implementations to perform this transformation in environments where it's not likely to be useful.
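To make the distinction concrete, here is a rough sketch of lexical comparison versus a value-style comparison for the <foo/> and <foo></foo> case from Q1. The helper name expandEmptyElements is made up for illustration, and the regular expression is a deliberately naive stand-in: it only rewrites empty-element tags and is in no way a canonicalization or an L2V mapping as any spec defines it. A real comparison would need a full XML or HTML parser, which is exactly the cost in question.

```javascript
// Hypothetical, deliberately naive normalizer: rewrites <name .../> as
// <name ...></name>. It ignores attribute order, quoting, namespaces and
// everything else a real canonicalization would have to handle.
function expandEmptyElements(xml) {
  return xml.replace(
    /<([A-Za-z_][\w.:-]*)(\s[^<>]*?)?\s*\/>/g,
    (match, name, attrs = "") => `<${name}${attrs.trimEnd()}></${name}>`
  );
}

const first = "<foo/>";
const second = "<foo></foo>";

// Comparing lexical forms is a plain string comparison.
const lexicallyEqual = first === second; // false

// Comparing "values" requires some transformation of both sides first.
const toyValueEqual =
  expandEmptyElements(first) === expandEmptyElements(second); // true
```

Even this toy version shows the asymmetry: the lexical comparison needs nothing beyond string equality, while anything value-based needs extra machinery on every comparison.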

Also, much HTML markup may be invalid, but browsers still do a good (too good, probably) job of displaying it anyway. Even if the content is invalid, it is still useful to treat it as a string and update an element's innerHTML, for example. The markup may be unreliable, but I don't think it's RDF's job to make it reliable.

That said, I don't think my implementations will ever really be able to drop such DOM conversion code, because of backwards compatibility issues. It's just that, going forward, it would be useful if this were a lighter-weight proposition. As an alternative, I would consider relaxing the Exclusive C14N requirements regarding namespace promotion. This is often not done correctly, or results in extra namespaces being handled. The advice for people running the RDFa test harness is to ignore failing tests that use XMLLiterals because of these problems.

>> To do so, you'd need to know if the content model was HTML or XML. This could potentially be addressed by introducing a new rdf:HTMLLiteral, but that seems a step in the wrong direction.
> 
> Why would you call this a step in the wrong direction? Given that HTML and XML have very different syntactic constraints, two separate datatypes seem like a natural approach to take?

RDFa 1.0 will always need to deal with XML Literals. We could potentially change this for RDFa 1.1, but it would still be different depending on the host language being used: XHTML, SVG and XML would probably continue to use XML Literal, while HTML4 and HTML5 could use a hypothetical HTML Literal. It's reasonably likely that two different implementations will do this in different ways, and you could imagine the same graph being serialized as XHTML or HTML, which would be equivalent, except for the differences in datatype. It would be much easier, and more likely to come out right, if we stuck with a single datatype (XML Literal), but relaxed the C14N rules to achieve greater interoperability.

OTOH, adding an optional transformation to an infoset could be useful in some cases; I would make this a MAY, or perhaps a SHOULD, but not a MUST.

>>> Q4. Should *invalid XML* be allowed in the lexical space?
>>> 
>>> In other words, should "</bar !!!>"^^rdf:XMLLiteral be ill-typed (just like "AAA"^^xsd:integer) or well-typed (just like "</bar !!!>"^^xsd:string)?
>> 
>> +1. If we depend on authors only using "correct" markup, we'll invalidate many common cases, even where the HTML is incorrect. 
> 
> I note that the lexical form isn't necessarily what authors write – there's always a parser in between.

I disagree that there's always a parser in between; if I write Turtle containing an XML Literal, this doesn't have to involve an HTML (or XML) tool chain. I commonly write my HTML by hand; of course, I try to do so correctly.

>> rdf:XMLLiteral should simply be a sub-datatype of xsd:string. RDFa uses rdf:XMLLiteral to trigger using innerHTML rather than innerText when extracting literal content.
> 
> But doesn't the name “XMLLiteral” imply that *something* about it – perhaps either the input or output – should be XML? Why would you expect that @datatype="rdf:XMLLiteral" accepts tag soup as input and deposits tag soup in the graph?

One thing XMLLiteral implies is that an RDFa processor should use the HTML/XML content to form the literal, not the innerText content. This is really what most people care about, IMO. Within a different context, say JSON-LD, it might be useful in an Ajax response to know that the result should update the text or HTML of the element; for example using jQuery $("#id").html(literal value) vs $("#id").text(literal value). None of this requires any reasoning over the literal value itself; I think that's a more common use of the datatype.
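As a rough illustration of that html-versus-text choice, the sketch below dispatches on the literal's datatype. The function name applyLiteral and the { value, datatype } shape of the literal are assumptions made for this example, not part of RDFa, JSON-LD or jQuery, and a plain object stands in for a DOM element so the snippet runs outside a browser.

```javascript
// IRI of the rdf:XMLLiteral datatype.
const RDF_XML_LITERAL =
  "http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral";

// Hypothetical helper: choose which property to update from the datatype
// alone. No parsing of, or reasoning over, the literal value is involved.
function applyLiteral(element, literal) {
  if (literal.datatype === RDF_XML_LITERAL) {
    element.innerHTML = literal.value; // markup would be interpreted
  } else {
    element.textContent = literal.value; // markup would be shown verbatim
  }
  return element;
}

// Stand-in for a DOM element (assumption: only these two properties matter).
const target = { innerHTML: "", textContent: "" };
applyLiteral(target, { value: "<em>hi</em>", datatype: RDF_XML_LITERAL });
// target.innerHTML is now "<em>hi</em>"; target.textContent is still "".
```

With jQuery, the two branches would correspond to the .html() and .text() calls mentioned above.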

>>> Q5. Should the specs say that RDF/XML parsers MUST canonicalize when handling parseType="literal"?
>>> 
>>> RDF/XML parsers are often implemented on top of an XML parser, and hence they don't have access to a low-level representation of the XML literal, e.g., did it use single or double quotes in the attributes, what order were the attributes in, or how many spaces were between them? If they don't canonicalize, then two different RDF/XML parsers would be pretty much guaranteed to parse the same RDF/XML file into different triples (or even different runs of the same parser over the same file could yield different triples).
>> 
>> -1. C14N is a pain. I'd remove any requirement that in-scope namespace definitions be added to top-level elements within the nodeset too.
> 
> I note that this would likely invalidate most existing RDF/XML content.

As with RDFa, I presume there will always be backwards compatibility issues. Within the context of RDF/XML, C14N may continue to be required, but why require it for Turtle and/or RDFa? I can't think of a real-world use case for this in those environments.

Gregg

>> I do think there's value in maintaining the in-scope @lang or @xml:lang as part of the literal, though.
> 
> (There is no @lang in RDF/XML)
> 
> Best,
> Richard
Received on Wednesday, 23 November 2011 23:05:27 UTC