[Fwd: Re: Ameliorating no change on XML Literal design]

and Martin's response

Brian

-------- Original Message --------
Subject: Re: Ameliorating no change on XML Literal design
Date: Thu, 17 Jul 2003 15:08:43 -0400
From: Martin Duerst <duerst@w3.org>
To: Brian McBride <bwm@hplb.hpl.hp.com>, RDF Core <w3c-rdf-core@w3.org>
CC: w3c-i18n-ig@w3.org

At 17:30 03/07/17 +0100, Brian McBride wrote:

>Martin further suggested that we consider changing the canonicalization 
>algorithm to omit the conversion to UTF-8.  I pointed out that the conversion 
>has the benefit of avoiding false equals between similar plain and XML 
>literals, but I agreed to raise it anyway.

Some more notes on what Brian and I talked about. I cannot guarantee
that everything makes sense, so please feel free to comment.

Brian said that in the current system, the lexical form of an XML literal
is a (non-canonicalized) string of characters, and the thing it denotes
is the UTF-8-encoded canonicalized version of that string.

This is the exact opposite of what happens in internationalization,
where the layering, in contrast to xml:lang, is quite extensively
explained in the Character Model. There, the physical/electronic/whatever
lower-level representation is in terms of octets or other code units,
and the higher-level (not necessarily highest-level, of course)
representation is in terms of characters.

The point that Brian mentions above is a valid one: we would not
like equality between a string of characters representing
XML markup and a string of characters that by chance looks like
markup to be introduced via a back door. Brian explained to me
that the denotation does not explicitly carry the datatype.
But still, it seems to me that the denotation "integer 11" and
the denotation "string '11'" should already be different.
Then it would be easy to solve this particular problem (and to
hopefully bring quite a bit more clarity into the distinction
between plain strings and strings with markup) by saying that
an XML literal denotes the XML fragment that is represented by
the string of characters resulting from the exclusive canonicalization
(without the step of UTF-8 encoding) of [the relevant input].

I.e. an XML literal denotes an XML fragment the same way an
integer denotes an integer.
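
To make the distinction concrete, here is a minimal Python sketch. It is
only an illustration under stated assumptions: the names canonicalize,
XMLFragment, denote_current and denote_proposed are hypothetical, and
canonicalize is an identity stand-in, not real exclusive canonicalization.

```python
from dataclasses import dataclass


def canonicalize(lexical: str) -> str:
    # Stand-in only: real exclusive canonicalization is not implemented here.
    return lexical


@dataclass(frozen=True)
class XMLFragment:
    # Hypothetical value type for the proposed denotation of an XML literal.
    canonical: str


def denote_current(lexical: str) -> bytes:
    # Current design as described above: UTF-8 octets of the canonical form.
    return canonicalize(lexical).encode("utf-8")


def denote_proposed(lexical: str) -> XMLFragment:
    # Proposal: an XML fragment identified by its canonical characters,
    # without the UTF-8 encoding step.
    return XMLFragment(canonicalize(lexical))


plain = "<b>11</b>"  # a plain literal denotes the character string itself

# In both designs the XML literal and the look-alike plain string differ.
print(denote_current("<b>11</b>") == plain)
print(denote_proposed("<b>11</b>") == plain)
# Two XML literals with the same canonical form denote the same value.
print(denote_proposed("<b>11</b>") == denote_proposed("<b>11</b>"))
# The analogy from the text: integer 11 versus string '11'.
print(11 == "11")
```

In this sketch the separation between plain strings and markup comes from
the type of the value rather than from an octet encoding, which is the
point of the suggestion above.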


Regards,    Martin.

Received on Thursday, 17 July 2003 17:51:34 UTC