Re: pfps-04 from Martin Duerst on 2003-07-24 (www-rdf-comments@w3.org from July to September 2003)

From: Martin Duerst <duerst@w3.org>
Date: Thu, 24 Jul 2003 16:06:09 -0400
To: Brian McBride <bwm@hplb.hpl.hp.com>, "Peter F. Patel-Schneider" <pfps@research.bell-labs.com>, jjc@hplb.hpl.hp.com
Cc: Pat Hayes <phayes@ai.uwf.edu>, www-rdf-comments@w3.org, w3c-i18n-ig@w3.org
Message-Id: <4.2.0.58.J.20030724143707.05770380@localhost>

Hello Brian, others,

At 16:54 03/07/24 +0100, Brian McBride wrote:
>On Thu, 2003-07-24 at 16:31, Peter F. Patel-Schneider wrote:

> > So the question boils down to whether octets and Unicode characters are
> > disjoint.
>
>I believe they are.  From
>
>   http://www.unicode.org/book/uc20ch1.html
>
>[[
>The character identified by a Unicode code value is an abstract entity,
>such as "LATIN CAPITAL LETTER A" or "BENGALI DIGIT 5".
>]]
>
>i.e. characters are distinct from their encodings.
>
>Martin, Jeremy: confirm?

I have looked at
http://www.w3.org/2001/sw/RDFCore/20030123-issues/#pfps-04
http://lists.w3.org/Archives/Public/www-rdf-comments/2003JanMar/0091.html

and wasn't sure why the question below is relevant for addressing issue pfps-04.

Based on a conversation with Brian that I had a week or two ago,
I suspect that it may be related to some technical issue of how
to distinguish between the values of plain literals, string, and
XML literals. Looking at
http://lists.w3.org/Archives/Public/www-rdf-comments/2003JulSep/0064.html
seems to confirm this suspicion:

 >>>>>>>>
Peter:
 > > > Therefore for the RDF entailment rules to be complete, no XML Literal can
 > > > have a character string as its denotation.

Brian:
 > > Right.  The denotation of an XML Literal is an octet sequence, as
 > > defined by the xml canonicalization spec, see the note in:
 > >
 > >
 > > http://www.w3.org/2001/sw/RDFCore/TR/WD-rdf-concepts-20030117/#section-XMLLiteral

Peter:
 > Unfortunately this does not answer the question.  Octet sequence is
 > undefined in http://www.w3.org/TR/2002/REC-xml-exc-c14n-20020718/.  At
 > least some places in this document appear to indicate that an octet
 > sequence is just a sequence of (Unicode?) characters.
 >>>>>>>>

The short and simple summary of the above discussion is:
"In order to be able to say that there is a difference between
plain text and XML, can we claim that plain text is sequences
of characters and XML is sequences of octets?"

My answer to the question that Brian asked is: Yes, octets and
Unicode characters are different. The Unicode standard certainly
explains that, as does the Character Model:
http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-Storage
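To make the difference concrete, here is a minimal sketch in Python. It is only an illustration and is not taken from the Unicode Standard or the Character Model; the helper name `describe` is made up for this example. It shows that the same two abstract characters (the ones named in the Unicode quotation above) map to different octet sequences depending on the encoding chosen:

```python
import unicodedata


def describe(text):
    """Return the character names of `text` plus its octets in two encodings.

    `describe` is a hypothetical helper written for this illustration only.
    """
    names = [unicodedata.name(ch) for ch in text]   # abstract characters
    utf8_octets = text.encode("utf-8")              # one possible octet sequence
    utf16_octets = text.encode("utf-16-be")         # another one, same characters
    return names, utf8_octets, utf16_octets


# U+0041 LATIN CAPITAL LETTER A followed by U+09EB BENGALI DIGIT FIVE
text = "A\u09eb"
names, utf8_octets, utf16_octets = describe(text)

print(names)               # the character names
print(utf8_octets.hex())   # octets under UTF-8
print(utf16_octets.hex())  # octets under UTF-16 (big-endian)
```

The character sequence is the same in both cases; only the octets differ, and decoding either octet sequence with the matching encoding gives back the same characters.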

But this is the wrong question to ask. It is totally inappropriate
to use different layers of an encoding model to make semantic
distinctions that are not related to this encoding model.
Although such a statement is not explicitly made in the Character
Model (because, frankly speaking, we didn't imagine that anybody
would come up with such an idea), it should be quite clear from
Section 3.5 Reference Processing Model
(http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-RefProcModel)
that this is very inappropriate.

It seems that the encoding to UTF-8, inherited by Exclusive XML
Canonicalization from Canonical XML, and very suitable as a
preparation for digital signing and encryption or for parser
testing, is what is causing the confusion. I will request a
clarification to that specification and will cc the RDF Core WG
on that request.

I am sure that a different and more appropriate way to make the
distinction can be found.

Regards,    Martin.

Received on Thursday, 24 July 2003 16:06:31 UTC