- From: Martin Duerst <duerst@w3.org>
- Date: Thu, 24 Jul 2003 16:06:09 -0400
- To: Brian McBride <bwm@hplb.hpl.hp.com>, "Peter F. " Patel-Schneider <pfps@research.bell-labs.com>, jjc@hplb.hpl.hp.com
- Cc: Pat Hayes <phayes@ai.uwf.edu>, www-rdf-comments@w3.org, w3c-i18n-ig@w3.org
Hello Brian, others,

At 16:54 03/07/24 +0100, Brian McBride wrote:
>On Thu, 2003-07-24 at 16:31, Peter F. Patel-Schneider wrote:
> > So the question boils down to whether octets and Unicode characters are
> > disjoint.
>
>I believe they are.  From
>
>  http://www.unicode.org/book/uc20ch1.html
>
>[[
>The character identified by a Unicode code value is an abstract entity,
>such as "LATIN CAPITAL LETTER A" or "BENGALI DIGIT 5".
>]]
>
>i.e. characters are distinct from their encodings.
>
>Martin, Jeremy: confirm?

I have looked at
  http://www.w3.org/2001/sw/RDFCore/20030123-issues/#pfps-04
  http://lists.w3.org/Archives/Public/www-rdf-comments/2003JanMar/0091.html
and wasn't sure why the question below is relevant for addressing
issue pfps-04.

Based on a conversation with Brian that I had a week or two ago,
I suspect that it may be related to some technical issue of how
to distinguish between the values of plain literals, string, and
XML literals. Looking at
  http://lists.w3.org/Archives/Public/www-rdf-comments/2003JulSep/0064.html
seems to confirm this suspicion:

>>>>>>>>
Peter:
> > > Therefore for the RDF entailment rules to be complete, no XML Literal can
> > > have a character string as its denotation.

Brian:
> > Right.  The denotation of an XML Literal is an octet sequence, as
> > defined by the xml canonicalization spec, see the note in:
> >
> > http://www.w3.org/2001/sw/RDFCore/TR/WD-rdf-concepts-20030117/#section-XMLLiteral

Peter:
> Unfortunately this does not answer the question.  Octet sequence is
> undefined in http://www.w3.org/TR/2002/REC-xml-exc-c14n-20020718/.  At
> least some places in this document appear to indicate that an octet
> sequence is just a sequence of (Unicode?) characters.
>>>>>>>>

(The short and simple summary of the above discussion is:
"In order to be able to say that there is a difference between
plain text and XML, can we claim that plain text is sequences of
characters and XML is sequences of octets?")
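
As an illustrative aside, the character/octet distinction quoted above can
be sketched in a few lines of Python. This is only a sketch using the Python
standard library; the variable names are hypothetical, and nothing here comes
from the RDF or canonicalization specifications. It shows one abstract
character with two different octet sequences, depending on the encoding
chosen.

```python
import unicodedata

# One abstract character, identified by its code point (U+09EB).
digit = "\u09eb"
digit_name = unicodedata.name(digit)      # "BENGALI DIGIT FIVE"

# Two different octet sequences for that same character.
as_utf8 = digit.encode("utf-8")           # three octets
as_utf16 = digit.encode("utf-16-be")      # two octets

# The character sequence has length 1; the octet sequences do not.
char_count = len(digit)
octet_counts = (len(as_utf8), len(as_utf16))

# Decoding either octet sequence recovers the same character.
round_trip_ok = (
    as_utf8.decode("utf-8") == digit
    and as_utf16.decode("utf-16-be") == digit
)
```

The octets are a property of the chosen encoding, not of the character,
which is why the same text can be stored as different octet sequences.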
My answer to the question that Brian asked is:
Yes, octets and Unicode characters are different.
The Unicode standard certainly explains that, as does the
Character Model:
  http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-Storage

But this is the wrong question to ask. It is totally inappropriate
to use different layers of an encoding model to make semantic
distinctions that are not related to this encoding model.

Although such a statement is not explicitly made in the Character
Model (because, frankly speaking, we didn't imagine that anybody
would come up with such an idea), it should be quite clear from
Section 3.5 Reference Processing Model
(http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-RefProcModel)
that this is very inappropriate.

It seems that the encoding to UTF-8, inherited by Exclusive XML
Canonicalization from Canonical XML, and very suitable as a
preparation for digital signing and encryption or for parser
testing, is confusing. I will request a clarification to that
specification and will cc the RDF Core WG on that request.

I am sure that a different and more appropriate way to make the
distinction can be found.

Regards,    Martin.
Received on Thursday, 24 July 2003 16:06:31 UTC