Re: pfps-04 (why the thread is germane to pfps-04) from Martin Duerst on 2003-07-29 (www-rdf-comments@w3.org from July to September 2003)

From: Martin Duerst <duerst@w3.org>
Date: Tue, 29 Jul 2003 19:21:26 -0400
To: pat hayes <phayes@ihmc.us>
Cc: "Peter F. Patel-Schneider" <pfps@research.bell-labs.com>, www-rdf-comments@w3.org, w3c-i18n-ig@w3.org, msm@w3.org
Message-Id: <4.2.0.58.J.20030729153637.068d5a38@localhost>
Hello Pat,

What I'm writing below is rather general, so please take
it as such.

At 00:46 03/07/29 -0500, pat hayes wrote:

>>Sorry I wasn't precise enough. I think the reason for this is
>>that it's just very difficult for me to think that XML fragments
>>could denote octets.
>
>Well, I also have some trouble figuring out what XML is supposed to refer 
>to, I admit.

I understand. XML is not a technology that tells you what it means,
it is just a technology that lets you build other things on top.
Overall, I think that's okay. Nobody is claiming that we completely
understand integers, either (otherwise we would know whether there
are infinitely many twin primes or not, ...). I think the
important thing is that RDF points in the right direction, so
that whoever builds on it can use their knowledge. So I think
that RDF (and also XML Schema Datatypes) assume that there is
some kind of common knowledge about integers, and that whatever
common knowledge there is is enough to make things work. I think
RDF can and should take a somewhat similar approach to XML Literals,
although I understand that there are more details to be hashed out
than for integers. 'Canonicalization' of integers is easier
than canonicalization of XML.
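
To make the integer side of that comparison concrete, here is a minimal
sketch in Python. The helper name canonical_integer is hypothetical, and
the rule it applies (no '+' sign, no leading zeros) is only my reading of
the canonical form for integers in XML Schema Datatypes, not text taken
from either spec.

```python
import re

# Lexical form assumed here: an optional sign followed by decimal digits.
_INTEGER = re.compile(r"[+-]?[0-9]+")


def canonical_integer(lexical: str) -> str:
    """Return a canonical lexical form for an integer literal (hypothetical helper)."""
    text = lexical.strip()  # surrounding whitespace is treated as insignificant
    if not _INTEGER.fullmatch(text):
        raise ValueError(f"not an integer literal: {lexical!r}")
    # int() drops a '+' sign and leading zeros; str() writes the value back out.
    return str(int(text))


print(canonical_integer("+007"))
print(canonical_integer("7"))
```

Both calls print the same string, so the two literals can be compared by
their canonical forms. The whole rule fits in one line of code, which is
not the case for XML.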


>??? You must be using 'denotes' in a different way than I tend to think of 
>it. You are here referring to a text-assembly/lexical-analysis/parsing 
>process, right? Going up layers of encoding from bytes up to some kind of 
>syntactically defined structure - in this case, XML.  I don't usually 
>think of that as what denotation is about. Denotation starts when you have 
>got the syntax worked out, then you ask what it *means*. Now, the 
>cases we are considering here are weird precisely because when you ask 
>what a string-typed literal means, you get right back to the syntax: the 
>whole point of using text to denote strings is that the string in the text 
>pretty much denotes itself. Hence the RDF plain literal semantics.

Yes, I think I understand what you mean. One more way to see this would
be to say that it may mean something more, but we don't know how to
formalize that, or we don't want to formalize it, so we just stop at
this level. In a very wide sense, this is similar for other datatypes.
For example, an integer or a decimal may stand for some temperature,
and it may mean 'very hot' or 'very cold' or something like this.

To get back to plain literals and XML, plain literals, as far as we know,
are just that, plain literals, simple character sequences, whatever
you call it. But for XML Literals, we know more. We know that there
is some special syntax, start tags, end tags,... While we do not
need to deal with the details of XML syntax except for canonicalization,
we can at least recognize the fact that XML is about XML syntax,
not just a string of characters or even just octets.


>If we could say that XML literals denoted themselves, I would have just 
>*loved* that idea. We almost did at one time, in our innocence: at that 
>time XML literals were just like plain literals except they had a kind of 
>XML 'bit' which registered them as being XML instead of just being text: 
>but they *were* text, in every other way: they denoted themselves, they 
>were character strings, etc. (One difference was that if the character 
>string of a plain literal weren't legal XML markup, nothing happens, but 
>if the same is true of an XML literal then the literal itself behaves 
>differently, e.g. it's not in the class rdf:XMLLiteral, things like that.) 
>But that got rejected as being much too fine-grained, since all kinds of 
>character-string differences (like whitespace in markup) would make 
>literals be distinct that XML would consider indistinguishable.

Yes. It would be ignoring much of the XML-ness to just say that
XML Literals are simple strings at their syntactic level.
XML Schema datatypes have the construct of a canonical lexical
form, and it's very appropriate to define this as being the
(exclusive) XML canonicalization (apart from the UTF-8 encoding,
because canonical lexical forms are on the character level, not
on the octet level).



>>At 17:01 03/07/27 -0500, pat hayes wrote:

>>But why not just say that XML Literals are XML Literals to establish
>>their identity? Or call them XML fragments, or text with markup, or
>>whatever you think will work best.
>
>What would YOU like them to be, in order to have them work best? Suppose 
>they are text with markup. Now, consider
>"<br />"^^rdf:XMLLiteral
>"<br/>"^^rdf:XMLLiteral
>are these equal or not? If text-with-markup is defined in terms of 
>character sequences then they are not. So how is it defined, so as to make 
>these be equal?

I think that the questions of identity and equality are very closely
related, but are not exactly the same. I think it should be okay to
'construct' identity for XML Literals from (a) a definition of equality
based on (exclusive) canonical XML and (b) the 'XML-ness' of XML Literals.
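
As a rough illustration of the equality half of this, the sketch below
compares the two literals from the question above, first as character
sequences and then by their canonical forms. It assumes Python 3.8 or
later; xml.etree.ElementTree.canonicalize implements C14N 2.0, which I
use here only as a stand-in for (exclusive) canonical XML, since the
standard library does not offer the exclusive variant.

```python
from xml.etree.ElementTree import canonicalize  # available from Python 3.8

a = "<br />"
b = "<br/>"

# Compared as plain character sequences, the two literals differ.
print(a == b)

# Compared by canonical form, they are the same piece of XML.
print(canonicalize(a) == canonicalize(b))

# The canonical form writes the empty element as a start/end tag pair.
print(canonicalize(a))
```

The whitespace inside the tag disappears in the canonical form, which is
the kind of difference that makes string comparison too fine-grained.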


>>By this, "<br/>" denotes a sequence of five characters.
>>"<br/>"^^rdf:XMLLiteral denotes an empty 'br' tag.
>
>OK, but stop there. What *is* that thing? Does an empty 'br' tag count as 
>a character in a character string? Or is this an entity in some abstract 
>XML structural space?

The latter.

>Where is this space defined?

In the XML 1.0 spec, and the Infoset spec (http://www.w3.org/TR/xml-infoset/).
The definition in the XML spec is very implicit, the definition in the
Infoset is for my taste a bit too explicit (repeating the words
'information item' all over the place). But it is probably the
right thing to use for the RDF semantics.

Anyway, where exactly is the concept of 'integer' defined?
The XML Schema spec just says "This results in the standard mathematical 
concept of the integer numbers." (http://www.w3.org/TR/xmlschema-2/#integer).
For most people (including me), that's good enough.


>What kind of stuff does it have in it, and what sorts of structures do 
>they have? Until we get questions like this straight, we can't begin to 
>write formal semantics.

For structures, the Infoset spec should be reasonably okay.


>I guess this is the central question. We all know what XML *is*: it's text 
>plus markup. But what does it *denote* ? What *kind* of thing does it 
>denote, even? I don't know how to begin to answer that question.

My answer would be 'syntactic structures as described in the
infoset'. My markup("....") construct in the mail to Brian
was just an easy abbreviation of this, to show where I thought
the denotations of plain literals and XML literals could
overlap.
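
A small sketch of the two readings, under the assumption that Python's
xml.etree.ElementTree is an acceptable rough analogue of an Infoset-style
structure (it is not an Infoset implementation, and the variable names
are only for illustration):

```python
import xml.etree.ElementTree as ET

text = "<br/>"

# Read as a plain literal: just a sequence of characters.
print(len(text), list(text))

# Read as XML: one element named 'br' with no attributes, children, or text.
element = ET.fromstring(text)
print(element.tag, element.attrib, len(element), element.text)
```

The first reading yields five characters; the second yields a single
empty element, which is the kind of thing I would want the XML Literal
to denote.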

Just a stupid question: Has the RDF Core WG considered
using the Infoset in any way? Do you think it would be
difficult to describe the mapping from (exclusive) canonical
XML to the Infoset? In my view, the RDF spec could do this
in a sentence or two, and if it turns out that this is not
possible, I guess it might point out serious deficiencies
in the XML 1.0 or Infoset specs.


>>>The point is, we have a distinction between two kinds of literals. To 
>>>put it crudely, a string (the literal string) can be labelled as 'plain' 
>>>in which case it (rather oddly) denotes itself, or as 'XML-ish', in 
>>>which case it might denote something else. The question is, what? The 
>>>issue is not to do with how the literal itself is encoded or represented.
>>
>>I was at one point worrying about the actual representation,
>>and still worry about that a bit, because some implementers
>>might confuse these things. But I guess such confusion can
>>never be completely avoided.
>>
>>Anyway, if XML Literals are labeled as XML-ish, it seems most
>>natural to let them denote something XML-ish, rather than something
>>octet-ish.
>
>I think the problem we have is that we have to say that they denote 
>*something*.  XML text as a character string was rejected as unworkable, 
>as it would make tiny character differences ruin XML identities; so we 
>looked for what the XML docs said was the root of XML syntactic identity: 
>when are two pieces of XML "really" the same? And the best answer we 
>(Jeremy) could find was the one we used. We couldn't find anything more 
>XML-ish than this.

I think for equivalence, (exclusive) canonical XML is the right
answer, because defining equivalence of two XML documents or fragments
was one of its design goals. It also works well for the abstract
syntax (minus UTF-8 encoding). For denotation, the Infoset seems
to be the best fit.


Regards,   Martin.
Received on Tuesday, 29 July 2003 19:22:03 UTC