Re: pfps-04 (why the thread is germane to pfps-04) from Martin Duerst on 2003-07-28 (www-rdf-comments@w3.org from July to September 2003)

From: Martin Duerst <duerst@w3.org>
Date: Mon, 28 Jul 2003 15:13:21 -0400
To: pat hayes <phayes@ihmc.us>
Cc: "Peter F. Patel-Schneider" <pfps@research.bell-labs.com>, bwm@hplb.hpl.hp.com, www-rdf-comments@w3.org, w3c-i18n-ig@w3.org, msm@w3.org
Message-Id: <4.2.0.58.J.20030728143701.0255ca80@localhost>

At 17:04 03/07/27 -0500, pat hayes wrote:
>>Hello Peter,
>>
>>At 09:27 03/07/25 -0400, Peter F. Patel-Schneider wrote:
>>>I believe that a complete theory of equality for XML literals resolves this
>>>comment.  I suggest that several test cases be added to the RDF test suite.
>>>
>>>The related issue of whether the value spaces of xsd:string and plain
>>>literals are disjoint also appears to be well on the way to resolution.
>>
>>Apart from the issue of language information (plain literals can take
>>language information, xsd:string can't), what is the reason for making
>>these two disjoint? We seem to get into a serious proliferation of
>>string-related datatypes that provide no useful distinction.
>
>True, but I guess my reaction to this is that apparently, this 
>proliferation exists, and RDF's job is not to try to put the world to rights,

I agree that it's not your job to solve other people's problems.
But with respect to plain literals, which are a pre-XML-Schema
RDF-internal creation, it doesn't seem inappropriate to ask
the question whether these are the same as anything similar
in the XML Schema type system. To make an analogy, assume that
RDF M&S had defined integers as another kind of literal.
When integrating this with XML Schema, it would seem natural
in such a case that this type be equated with the XML Schema
datatype integer.


>but to allow anyone to make any assertions they wish to about any topic 
>they wish to, as far as possible. If therefore there are people out there 
>who wish to distinguish "Hello World" as character string from "Hello 
>World"  as octet sequence from "Hello World"  as XML, or even "Hello 
>World"  as red from "Hello World"  as green, who are we to say that they 
>should not do so?

Obviously you already have said yes to some, and no to some others.
For example, there is currently no way to distinguish between
"Hello World" as XML and "Hello World" as octet sequence because
XML Literals denote octet sequences. Also, there is no clear way
to distinguish "Hello World" as red from "Hello World" as green.

So obviously you make decisions, and these decisions have
consequences for everybody.


>>In RDF, the simple text "Hello World" (without language information)
>>can be a plain literal, an xsd:string, and an XML literal.
>>What is the point of them all being different if there is no
>>observable difference?
>
>I am not sure what you mean by 'observable' in this context, or why that 
>is relevant. Identity does not rely on indistinguishability.

Yes, in many cases these are clearly different things.
For example, two copies of the same book can look completely
indistinguishable, yet they are definitely not identical.
On the other hand, we very much tend to assume that two
integers that are indistinguishable are identical.
So one question would be: What are strings closer to,
integers or books?


>In another message you insist that
>" it is very important to make sure that the plain
>string "<br/>" (in XML written as "&lt;br/&gt;") is not the
>same as the XML markup "<br/>" (in XML written as "<br/>")."
>which seems like an unobservable difference to me of exactly the same 
>kind. How something is written in XML is beside the point:

Are you sure about that? We are talking about XML literals,
so it very well seems relevant.
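
[Illustration, not part of the original message.] The two spellings
quoted above can be compared with a minimal Python sketch. It assumes
the standard library's xml.etree.ElementTree and an arbitrary wrapper
element named "p", neither of which appears in the thread; it only
shows what an XML parser reports for each spelling and says nothing
about how RDF should model the result.

```python
import xml.etree.ElementTree as ET

# The plain string "<br/>", written in XML with escapes.
# The wrapper element "p" is a hypothetical choice for this sketch.
escaped = ET.fromstring("<p>&lt;br/&gt;</p>")

# The XML markup "<br/>", written in XML as itself.
markup = ET.fromstring("<p><br/></p>")

# Escaped form: five characters of text content and no child elements.
print(repr(escaped.text), len(escaped))

# Markup form: no text content and one child element named "br".
print(repr(markup.text), len(markup), markup[0].tag)
```

After parsing, the escaped form carries character data and the markup
form carries an element, so the way each was written in XML does
determine what the parser hands on.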


>the sequence of 5 characters (less-than, lowercase-b, lowercase-r, 
>forward-slash, greater-than) is what it is.  What you seem to be insisting 
>on is that markup is not text; that indeed makes sense as a parsing 
>restriction when discussing XML.  But (with a passing bow to charmod) 
>characters are characters.

I know which section of charmod you refer to. That section is there
to make clear that parsing happens on the character layer, rather than
on the octet layer. With US-ASCII, the difference may not be very
visible, but looking at EBCDIC or JIS (iso-2022-jp) this difference
becomes quite important. In charmod, you will also find ample discussion
of escaping, explaining how character sequences can represent other
characters.
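
[Illustration, not part of the original message.] The difference
between the character layer and the octet layer can be sketched in
Python. The sketch assumes Python's built-in "cp037" codec as a
stand-in for EBCDIC and its "iso-2022-jp" codec; the helper
first_char_with_octet is a hypothetical name introduced here, and the
search range (CJK Unified Ideographs) is an arbitrary choice.

```python
def first_char_with_octet(octet, encoding, start=0x4E00, stop=0x9FFF):
    """Return (char, encoded) for the first character in the range whose
    encoding contains the given octet, or None if there is none."""
    for code_point in range(start, stop + 1):
        char = chr(code_point)
        try:
            data = char.encode(encoding)
        except UnicodeEncodeError:
            continue  # character not representable in this encoding
        if octet in data:
            return char, data
    return None

LT = b"<"  # the octet 0x3C, which is "<" in US-ASCII

# Same five characters, different octets depending on the encoding.
ascii_octets = "<br/>".encode("ascii")
ebcdic_octets = "<br/>".encode("cp037")
print(ascii_octets.hex(), ebcdic_octets.hex())

# In iso-2022-jp, the octet 0x3C can occur inside a two-octet kanji.
hit = first_char_with_octet(LT, "iso-2022-jp")
if hit is not None:
    char, data = hit
    # Octet-layer scan sees 0x3C; character-layer scan sees no "<".
    print(LT in data, "<" in data.decode("iso-2022-jp"))
```

A scanner that looks for 0x3C in the raw octets would miss the markup
in the EBCDIC data and would report spurious markup in the
iso-2022-jp data; decoding to characters first avoids both problems.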


>'<br/>' was a sequence of 5 characters before XML was invented, and it's 
>still the same sequence of 5 characters. When I'm editing XHTML, I will 
>treat this sequence differently when I see it in the code window than when 
>I see it in the design window, but it's the same 5 characters I am looking 
>at in each case.

The same argument would apply to integers represented as characters,
I guess, and other things that would use the same characters but would
not be integers.


>>>PS: Although the current situation may be technically satisfactory in this
>>>area, the pain in getting there suggests that a slightly different
>>>description of XML literals might be more useful, perhaps something along
>>>the line of making the value space of XML literals in RDF be some abstract
>>>set with equality defined as per exclusive XML canonicalization and
>>>explicitly determined to be disjoint from the value space of plain RDF
>>>literals and also from the XSD value spaces.  This would also probably make
>>>the XML guys much more happy.
>>
>>I have proposed something like this just a day or two ago. It would
>>definitely make I18N quite a bit happier, because it would not be
>>a straightforward violation of the Character Model, and would indeed
>>be much more in line with the XML spec.
>
>I guess we have been working under the tacit assumption that as far as 
>possible we *should* specify what our RDF-described domains actually are.

Yes, I think this is preferable to leaving things completely open.


>This abstract set trick does make the semantics easier to state, but all 
>it does operationally is to guarantee that identities *cannot* be 
>inferred. If something is in an abstract set then it is definitely not an 
>XML character sequence or octet sequence or XML markup, for example. Is 
>
>this really what i18n wants?

So the abstract set trick, if I understand correctly, would say that
XML Literals denote elements from an abstract set (let's call it the
XML-Literal-abstract-set), and therefore cannot be identical to any
other things such as character sequences or octet sequences or whatever?

This is not exactly what we would want (because, as discussed above,
some desirable identities are missed), but is definitely MUCH better
than saying that XML Literals denote sequences of octets.
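
[Illustration, not part of the original message.] The "abstract set"
idea discussed above might be sketched as follows. This is only a
sketch under stated assumptions: the class XMLLiteralValue and the
wrapper element "w" are hypothetical names, and the standard library's
xml.etree.ElementTree.canonicalize (Python 3.8 or later) implements
C14N 2.0, used here as a stand-in for the exclusive canonicalization
mentioned in the thread.

```python
from xml.etree.ElementTree import canonicalize


class XMLLiteralValue:
    """A value in a hypothetical abstract set of XML literal values."""

    def __init__(self, lexical_form):
        # Wrap in a dummy element so that text-only literals such as
        # "Hello World" are still well-formed input for canonicalize().
        self._canonical = canonicalize("<w>" + lexical_form + "</w>")

    def __eq__(self, other):
        # Equal only to other XML literal values with the same canonical
        # form; never equal to a character string or an octet sequence.
        if isinstance(other, XMLLiteralValue):
            return self._canonical == other._canonical
        return False

    def __hash__(self):
        return hash(("XMLLiteralValue", self._canonical))


a = XMLLiteralValue('<a y="2" x="1"/>')
b = XMLLiteralValue('<a x="1"   y="2"></a>')
hello = XMLLiteralValue("Hello World")

print(a == b)                    # same canonical form
print(hello == "Hello World")    # disjoint from character strings
print(hello == b"Hello World")   # disjoint from octet sequences
```

In this sketch, equality between XML literals is decided by
canonicalization, while no identity with strings or octets can ever
be inferred, which is the trade-off discussed above.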


Regards,     Martin.
Received on Monday, 28 July 2003 17:24:30 UTC