Re: pfps-04 (why the thread is germane to pfps-04) from pat hayes on 2003-07-29 (www-rdf-comments@w3.org from July to September 2003)

From: pat hayes <phayes@ihmc.us>
Date: Tue, 29 Jul 2003 00:46:24 -0500
To: Martin Duerst <duerst@w3.org>
Cc: "Peter F. Patel-Schneider" <pfps@research.bell-labs.com>, www-rdf-comments@w3.org, w3c-i18n-ig@w3.org, msm@w3.org
Message-Id: <p06001a39bb4bb0e9e63f@[10.0.100.23]>
>Hello Pat,
>
>I have copied one part of your mail from the middle to the top
>to discuss it first.
>
>>>However, I think it is absolutely inappropriate to solve this
>>>problem by saying that one of them is characters and the other
>>>is encoded in octets.
>>
>>We aren't saying that XML literals denote things that are encoded 
>>in octets: we are saying that XML literals denote the octets 
>>themselves.
>
>Sorry I wasn't precise enough. I think the reason for this is
>that it's just very difficult for me to think that XML fragments
>could denote octets.

Well, I also have some trouble figuring out what XML is supposed to 
refer to, I admit.

>The way this usually works is that the
>octets on the wire or on a disk denote characters, and some
>of these characters then in turn denote things such as start
>tags, element names, attribute names, attribute values, or
>character content, and the overall sequence then denotes an
>XML document or an XML fragment.

??? You must be using 'denotes' in a different way than I tend to 
think of it. You are here referring to a 
text-assembly/lexical-analysis/parsing process, right? Going up 
layers of encoding from bytes up to some kind of syntactically 
defined structure - in this case, XML.  I don't usually think of that 
as what denotation is about. Denotation starts when you have got the 
syntax worked out, then you ask what it *means*. Now, the cases 
we are considering here are weird precisely because when you ask what 
a string-typed literal means, you get right back to the syntax: the 
whole point of using text to denote strings is that the string in the 
text pretty much denotes itself. Hence the RDF plain literal 
semantics. If we could say that XML literals denoted themselves, I 
would have just *loved* that idea. We almost did at one time, in our 
innocence: at that time XML literals were just like plain literals 
except they had a kind of XML 'bit' which registered them as being 
XML instead of just being text: but they *were* text, in every other 
way: they denoted themselves, they were character strings, etc. (One 
difference was that if the character string of a plain literal 
weren't legal XML markup, nothing happens, but if the same is true of 
an XML literal then the literal itself behaves differently, e.g. it's 
not in the class rdf:XMLLiteral, things like that.) But that got 
rejected as being much too fine-grained, since all kinds of 
character-string differences (like whitespace in markup) would make 
literals be distinct that XML would consider indistinguishable.

>There are some specific cases where characters denote characters
>(in particular with escaping), or characters denote octets
>(escaping in some special cases such as URIs, and things
>such as base64), but they are exceptions.
>
>This just lets me wonder: If XML fragments denote octets, then
>what about the XML Schema base64Binary datatype? From XML Schema,
>part 2 (http://www.w3.org/TR/xmlschema-2/#base64Binary):
>
>>>>>
>3.2.16 base64Binary
>
>[Definition:]   base64Binary represents Base64-encoded arbitrary binary data.
>The ·value space· of base64Binary is the set of finite-length sequences of
>binary octets. For base64Binary data the entire binary stream is encoded using
>the Base64 Content-Transfer-Encoding defined in Section 6.8 of [RFC 2045].
>>>>>
>
>Are 'binary octets' different from 'octets'?

I have absolutely no idea. :-)

>At 17:01 03/07/27 -0500, pat hayes wrote:
>
>>>At 07:54 03/07/25 -0400, Peter F. Patel-Schneider wrote:
>
>>>  > Two XML literals are (now) equal in RDF precisely when their Exclusive
>>>>XML Canonicalizations are the same octet sequence.
>>>
>>>Okay. The equivalences would stay exactly the same if XML literals
>>>would be represented as character sequences rather than as octet
>>>sequences.
>>
>>'equal' here means 'denote the same thing', not 'is identical to'. 
>>Nobody is suggesting interfering with how literal strings are 
>>represented or encoded. We had to choose some criterion to refer to 
>>in order to establish questions of identity between referents.
>
>But why not just say that XML Literals are XML Literals to establish
>their identity? Or call them XML fragments, or text with markup, or
>whatever you think will work best.

What would YOU like them to be, in order to have them work best? 
Suppose they are text with markup. Now, consider
"<br />"^^rdf:XMLLiteral
"<br/>"^^rdf:XMLLiteral
are these equal or not? If text-with-markup is defined in terms of 
character sequences then they are not. So how is it defined, so as to 
make these be equal?
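
To make the contrast concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not the RDF definition: the standard library's canonicalize() implements C14N 2.0 rather than Exclusive XML Canonicalization, which I am assuming gives the same result for a namespace-free example like this one, and the choice of UTF-8 for the octets is likewise an assumption of the sketch.

```python
from xml.etree.ElementTree import canonicalize

# The two lexical forms from the example above.
a = "<br />"
b = "<br/>"

# Criterion 1: compare the literals as character sequences.
same_characters = (a == b)

# Criterion 2: compare the octets of a canonical form.
# Assumption: C14N 2.0 (what the stdlib offers) stands in for
# Exclusive XML Canonicalization here.
octets_a = canonicalize(a).encode("utf-8")
octets_b = canonicalize(b).encode("utf-8")
same_octets = (octets_a == octets_b)

print(same_characters, same_octets)
```

Under the first criterion the two literals differ; under the second they come out the same, since both canonicalize to a start tag followed by an end tag.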

>>>Apart from that, it is very important to make sure that the plain
>>>string "<br/>" (in XML written as "&lt;br/&gt;") is not the
>>>same as the XML markup "<br/>" (in XML written as "<br/>").
>>>So it is indeed important to make sure this question can easily
>>>be answered.
>>
>>If we were to specify that plain literals and XML literals both 
>>denote Unicode character sequences, then "<br/>" and 
>>"<br/>"^^rdf:XMLLiteral would be equal and neither of them would 
>>bear any RDF relationship to a literal whose character string was 
>>"&lt;br/&gt;" So it sounds like you want to say that XML values and 
>>Unicode character strings must be distinct; which is the situation 
>>we currently have.
>
>Let me again try to explain how I think this should have worked
>[Because we should have said that during last call, but missed it,
>we are explicitly not insisting on this point. I just want to
>make sure that we can eliminate misunderstandings]:
>
>>>>>
>XML Literals denote text (character content) with markup
>(start tags, end tags, empty tags, PIs, comments). XML
>Literals that contain only character content denote the
>same thing as plain literals with the same character
>sequence (and language information).

Well, OK, I agree that would be nice. But it seems to me that text 
with markup *is* text. If you can write it down as a sequence of 
characters, that's text. XML is text, by that criterion. If that's 
not the right criterion, then what is? Another way to ask the same 
question: what does it mean for two pieces of XML to be the same 
*considered as XML* that differs from them being the same *considered 
as text*?

I would be happy for XML literals to denote themselves, but if that 
means what I understand it to mean, then your qualification about 
'only character content' is beside the point: any XML literal will 
denote a character string, markup or no markup.

>  >>>>
>
>By this, "<br/>" denotes a sequence of five characters.
>"<br/>"^^rdf:XMLLiteral denotes an empty 'br' tag.

OK, but stop there. What *is* that thing? Does an empty 'br' tag 
count as a character in a character string? Or is this an entity in 
some abstract XML structural space? Where is this space defined? What 
kind of stuff does it have in it, and what sorts of structures do 
they have? Until we get questions like this straight, we can't begin 
to write formal semantics.

I guess this is the central question. We all know what XML *is*: it's 
text plus markup. But what does it *denote* ? What *kind* of thing 
does it denote, even? I don't know how to begin to answer that 
question.

>"&lt;br/&gt;"^^rdf:XMLLiteral again denotes a sequence
>of five characters, the same five characters as in the
>"<br/>" plain literal.

That works for examples where the XML markup resolves into 
XML-encoded Unicode text strings, but is that always true? What about 
attributes on tags with values.....??
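
A small sketch of the difference, using Python's standard XML parser. The <wrapper> element and the parse_fragment helper are hypothetical, introduced only so that a fragment can be parsed as a document; the href example is made up for illustration.

```python
import xml.etree.ElementTree as ET

def parse_fragment(lexical: str) -> ET.Element:
    # Hypothetical helper: wrap the fragment in a made-up <wrapper>
    # element so that bare character content parses as a document.
    return ET.fromstring(f"<wrapper>{lexical}</wrapper>")

escaped = parse_fragment("&lt;br/&gt;")   # escaped character content
markup = parse_fragment("<br/>")          # an empty 'br' element
with_attr = parse_fragment('<a href="http://example.org/">link</a>')

# Escaped content resolves to plain characters and no child elements.
escaped_text = "".join(escaped.itertext())
escaped_children = len(escaped)

# Real markup leaves no characters at all, only structure.
markup_text = "".join(markup.itertext())
markup_children = [child.tag for child in markup]

# An attribute value is not part of the character content either.
attr_text = "".join(with_attr.itertext())
attr_values = dict(with_attr[0].attrib)
```

The escaped case does reduce to the five characters of the plain literal, but the 'br' element and the attribute value have no counterpart in the character content, which is where the reduction to text strings stops working.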

>Even if you disagree that the latter two are the same,
>because you want to preserve the distinction between
>plain literals and the 'XML-ness' of text in XML
>literals, a slightly tweaked denotation should give
>you that distinction.

Maybe, but I would like to see the details.

>>The point is, we have a distinction between two kinds of literals. 
>>To put it crudely, a string (the literal string) can be labelled as 
>>'plain' in which case it (rather oddly) denotes itself, or as 
>>'XML-ish', in which case it might denote something else. The 
>>question is, what? The issue is not to do with how the literal 
>>itself is encoded or represented.
>
>I was at one point worrying about the actual representation,
>and still worry about that a bit, because some implementers
>might confuse these things. But I guess such confusion can
>never be completely avoided.
>
>Anyway, if XML Literals are labeled as XML-ish, it seems most
>natural to let them denote something XML-ish, rather than something
>octet-ish.

I think the problem we have is that we have to say that they denote 
*something*.  XML text as a character string was rejected as 
unworkable, as it would make tiny character differences ruin XML 
identities; so we looked for what the XML docs said was the root of 
XML syntactic identity: when are two pieces of XML "really" the same? 
And the best answer we (Jeremy) could find was the one we used. We 
couldn't find anything more XML-ish than this.
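
For what it's worth, the position described above can be sketched as a toy denotation function. Everything named here (Literal, denotation, the "string"/"octets" tags) is hypothetical and only for illustration, and the standard library's C14N 2.0 canonicalize() with UTF-8 encoding is assumed as a stand-in for the Exclusive Canonicalization the specs actually refer to.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union
from xml.etree.ElementTree import canonicalize

XML_LITERAL = "rdf:XMLLiteral"

@dataclass(frozen=True)
class Literal:
    lexical: str                    # the literal string as written
    datatype: Optional[str] = None  # None means a plain literal

def denotation(lit: Literal) -> Tuple[str, Union[str, bytes]]:
    """Toy model: return a tagged value standing for what the literal denotes."""
    if lit.datatype == XML_LITERAL:
        # XML literal: the octets of its canonical form (assumed C14N 2.0).
        return ("octets", canonicalize(lit.lexical).encode("utf-8"))
    # Plain literal: the character string denotes itself.
    return ("string", lit.lexical)

plain = Literal("<br/>")
xml_a = Literal("<br/>", XML_LITERAL)
xml_b = Literal("<br />", XML_LITERAL)

print(denotation(xml_a) == denotation(xml_b))  # same XML value
print(denotation(plain) == denotation(xml_a))  # string vs. octets
```

In this model the two XML literals denote the same value despite the whitespace difference, and neither is equal to the plain literal with the same characters.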

Pat

-- 
---------------------------------------------------------------------
IHMC	(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.	(850)202 4416   office
Pensacola			(850)202 4440   fax
FL 32501			(850)291 0667    cell
phayes@ihmc.us       http://www.ihmc.us/users/phayes
Received on Tuesday, 29 July 2003 01:46:28 UTC