ISSUE-13: History of rdf:XMLLiteral from Richard Cyganiak on 2011-11-10 (public-rdf-wg@w3.org from November 2011)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Thu, 10 Nov 2011 14:59:35 +0000
To: Andy Seaborne <andy.seaborne@epimorphics.com>
Cc: Jeremy Carroll <jeremy@topquadrant.com>, Ivan Herman <ivan@w3.org>, RDF Working Group WG <public-rdf-wg@w3.org>
Message-Id: <33ADDFC4-11E4-4C4E-BC04-BBFEDFE65F6B@cyganiak.de>

Resurrecting this very old thread…

On 10 Mar 2011, at 09:05, Andy Seaborne wrote:
> On 10/03/11 07:39, Jeremy Carroll wrote:
>> Or we canonicalize in the mapping to value space, hence in the equality
>> algorithm (design rejected in 2003 last call)
>
> Why was that? Was it just because RDF/XML was the only format, so it was not unreasonable to put it in the parsing step?

I read up a bit on the 2003 discussions and I think I have a rough picture of how we got to the current design.

The naïve design would have been to treat rdf:XMLLiteral just like xsd:string; no canonicalization or other fancy processing at all.

The problem is that RDF/XML implementors want to use DOM parsers, so the lexical form of an rdf:XMLLiteral would be obtained by re-serializing the part of the DOM tree that corresponds to the XML literal. DOM implementations often don't maintain the order of attributes or remember whether single or double quotes were used. So there's no way to guarantee that the lexical form is identical to what was in the input RDF/XML file.

To make RDF/XML implementable with DOM parsers, it was decided that the lexical form doesn't have to be *exactly* the same XML string as in the input RDF/XML file, but it could be any XML string that has the same canonicalization as the string from the input file.

But that just pushes the problem down one step: Now, if one RDF/XML file goes through two different processing pipelines, then the same XML literal might end up with two different lexical forms. Both pipelines might feed into the same OWL reasoner that wants to know if the XML literals are the same or not.

To solve that problem, the obvious solution was to make the value space of rdf:XMLLiteral not some simple XML string but the canonicalized form.

This approach went to Last Call in January 2003. To summarize:

1) RDF/XML file can contain arbitrary XML
2) RDF/XML parser can produce any XML that has same canonical form as 1)
3) Lexical form can be arbitrary XML
4) L2V mapping does canonicalization of 3)
5) Value space is canonicalized
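
As a rough illustration of this first design (a sketch under assumptions, not anything taken from the 2003 drafts): the lexical forms stay as whatever each parser produced, and a consumer has to apply the L2V mapping before it can compare two literals. The name l2v is made up for this example. Python's xml.etree.ElementTree.canonicalize (Python 3.8+) implements C14N 2.0, which is not necessarily the canonicalization the RDF specs reference, so it only stands in for the real thing here. It also only accepts a fragment with a single root element.

```python
from xml.etree.ElementTree import canonicalize

def l2v(lexical_form: str) -> str:
    """Hypothetical lexical-to-value mapping: canonicalize the XML string."""
    return canonicalize(lexical_form)

# Two lexical forms that different DOM-based pipelines might produce
# from the same input: attribute order and quote style differ.
form_a = '<span class="x" id="y">text</span>'
form_b = "<span id='y' class='x'>text</span>"

# The lexical forms differ as strings ...
same_lexical_form = form_a == form_b
# ... but they map to the same value once canonicalized.
same_value = l2v(form_a) == l2v(form_b)

print(same_lexical_form, same_value)
```

The point of the sketch is that the comparison only works if the consumer itself can parse and canonicalize XML.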

Two criticisms were raised against this approach in LC:

i) Anything working against the RDF graph representation (e.g., the OWL reasoner) now has to perform canonicalization before doing comparison. This means any OWL reasoner needs to include an XML parser and canonicalizer. The WebOnt WG hated this.

ii) It's weird that the output of an RDF/XML parser is required to have the same canonical form as its input, but isn't actually required to be *in* canonical form. This makes, among other things, graph comparison harder. The XML C14N people were confused by this.

So the decision was made to simply use canonicalization throughout. The fact that an RDF/XML parser by definition already includes an XML parser, and is in a good position to implement canonicalization (unlike, say, the OWL reasoner), was a big factor – and RDF/XML was of course the only game in town back then. The second Last Call in October 2003 thus had the following picture:

1) RDF/XML file can contain arbitrary XML
2) RDF/XML parser MUST canonicalize
3) Lexical form is canonicalized
4) L2V mapping is a 1:1 mapping
5) Value space is canonicalized
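
This second picture can be sketched roughly as follows, again as an illustration under assumptions rather than an actual implementation: the parser step canonicalizes once, and everything downstream compares lexical forms as plain strings. parse_xml_literal and literals_equal are hypothetical names, and Python's xml.etree.ElementTree.canonicalize (Python 3.8+, C14N 2.0) is only a stand-in for whatever canonicalization the spec requires.

```python
from xml.etree.ElementTree import canonicalize

def parse_xml_literal(raw_xml: str) -> str:
    """Hypothetical parser step: the lexical form is the canonical form."""
    return canonicalize(raw_xml)

def literals_equal(lexical_a: str, lexical_b: str) -> bool:
    """Downstream consumers need no XML machinery, only string comparison."""
    return lexical_a == lexical_b

# The same literal as it might appear in two differently written inputs.
lex_a = parse_xml_literal('<span class="x" id="y">text</span>')
lex_b = parse_xml_literal("<span id='y' class='x'>text</span>")

equal_downstream = literals_equal(lex_a, lex_b)

# Canonicalizing an already canonical form should change nothing,
# which is what allows the L2V mapping to be 1:1.
stable = parse_xml_literal(lex_a) == lex_a

print(equal_downstream, stable)
```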

This was seen as a major simplification of rdf:XMLLiterals.

It should be said that it was expected that XML literals would be commonplace in RDF content, to embed some markup into strings, deal with mixed-language literals, bidi, ruby markup and all these kinds of things. At some point there was a vision that

<dc:title>Fun with C<sub>2</sub>H<sub>5</sub>OH</dc:title>

would be just as common as

<dc:title>Fun with C2H5OH</dc:title>

and that many i18n requirements would be met by allowing XML literals. This seems to be the reason why XML literals are a built-in datatype in RDF (the only one! not even numbers are built-in!).

The solution that was standardized obviously completely failed to deliver in this regard, and W3C's I18n was rather unhappy with the final design, with two main points of criticism:

i) “foo” as a plain literal and “foo” declared as an XML fragment have different values (and abstract syntax representations) for no good reason.

ii) xml:lang in RDF/XML applies to plain literals but not to XML literals.

So much for the history – Jeremy, please correct me if I got anything wrong.

Now the question is: If we want to change anything about XML literals, then what problem exactly are we trying to solve?

Best,
Richard

Received on Thursday, 10 November 2011 15:00:17 UTC