Re: Summary of strings, markup, and language tagging in RDF (resend) from pat hayes on 2003-07-02 (w3c-rdfcore-wg@w3.org from July 2003)

From: pat hayes <phayes@ihmc.us>
Date: Wed, 2 Jul 2003 00:49:19 -0500
To: Martin Duerst <duerst@w3.org>
Cc: w3c-rdfcore-wg@w3.org
Message-Id: <p06001212bb281aec6b4c@[10.0.100.7]>
>Hello Graham, others,
>>>Can we please make sure that we separate syntax and semantics?
>>
>>I wasn't aware of conflating the two.  This issue seems to be 
>>entirely syntactic:  is a sequence of Unicode characters used to 
>>represent an XML document (and conforming to XML syntax) 
>>syntactically distinguished from any other sequence of Unicode 
>>characters?  (Hmmm... maybe the conflation here is between concrete 
>>syntax and abstract syntax -- I'm thinking of abstract syntax here.)
>
>First, we are not dealing with XML documents, we are dealing with
>XML fragments. Second, of course there is a distinction between
>an XML fragment that has actual markup and a string that does
>contain nothing but text. But this is not what we are talking about.
>The question is whether there is a difference between an 'XML fragment'
>that contains nothing else but just text and a simple string that
>contains nothing else but text. What I was saying was that there
>may be some syntactic differences (because there may be some need
>for escaping in the first case, but not in the second case),
>but there is no real difference. (I'll try to avoid the word
>'semantic' from now on.)

Ah, but some of us don't have that option.

If I follow you, you seem to be saying that as long as a character 
string has no XML markup in it, then its just text and there is no 
'real' difference between it thought of as XML or as a Unicode string 
or as an XSD:string.  But as soon as it contains actual XML markup, 
it becomes something different: there is a real difference in kind 
between '<this>is XML</this>' considered as an XML fragment and the 
mere string of characters which looks the same but isn't being 
treated as XML. If this is more or less right, it seems to require a 
three-way distinction between mere text, genuine XML, and what might 
be called faux XML, which is text that looks like XML but isn't. This 
is pretty much what we have in RDF at present: mere text is plain 
literals, genuine XML is XML typed literals (encoded using 
rdf:parseType='Literal' in RDFXML), and faux XML is a plain literal 
which happens to be in the lexical space of rdf:XMLliteral but isnt 
inside a typed literal (or maybe is typed with xsd:string).

>
>>As for the rest of what you say, I really don't want to get into 
>>encoding tricks here -- to me that is just another layer of 
>>complexity we don't need, and as such should be left to 
>>implementers to deal with in their own way.
>
>I fully agree. In the same way, if rdf:parseType='Literal' is irrelevant
>if there is no markup in the literal, then we should just say so and
>let implementations deal with it.
>
>>  That is, if the string
>>    "<a>Some text</a>"
>>is to be distinct from the XML document encoded as:
>>    "<a>Some text</a>"
>>then we should just say so and deal with the consequences.
>
>Yes, exactly. The former would turn out in RDF/XML as something
>like <foo>&lt;a&gt;Some text&lt;/a&gt;<foo>, the later would turn
>out as <foo rdf:parseType='Literal'><a>Some text</a></foo>.
>I think nobody in this discussion claims that these two should
>be the same. What we are discussing is the cases where there is
>only an XML fragment without markup. I.e. if the string
>     "Some text"
>is to be distinct from the XML fragment encoded as:
>     "Some text"
>then we should just say so. But very obviously, they are the same,
>so we should not claim they are different.

It is not obvious that they are the same. It is also not obvious that 
either of them is the same as "Some text" considered as an XSD 
string. I don't say its obvious that they are different either, mind 
you: only that nothing is OBVIOUS in this area.

>
>
>>Personally, I don't think XML should have this distinguished status 
>>in RDF.  If it's really necessary to distinguish an XML document 
>>literal in RDF, when why not use RDF facilities to do so?  e.g.
>>
>>    <ex:XMLDocument>
>>       <rdf:value rdf:parseType="Literal"><a>Some text</a></rdf:value>
>>    </ex:XMLDocument>
>>
>>as distinct from, say:
>>
>>    <ex:StringData>
>>       <rdf:value rdf:parseType="Literal"><a>Some text</a></rdf:value>
>>    </ex:StringData>
>
>First, this would be against RDF Model and Syntax. Second,
>as Jeremy pointed out, it would be against all the other
>decisions RDF Core has taken up to last call. Third, it
>would create even more different representations for what's
>exactly the same thing. There would be nothing but syntax
>differences between the following two:
>
><ex:XMLDocument>
>    <rdf:value rdf:parseType='Literal'>Some text</rdf:value>
></ex:XMLDocument>
>
><ex:StringData>
>    <rdf:value rdf:parseType='Literal'>Some text</rdf:value>
></ex:StringData>
>
>And fourth, the second one could easily be seen as yet another
>way to do CDATA Sections, see the parallel between
>
><ex:StringData>
>     <rdf:value rdf:parseType="Literal"><a>Some text</a></rdf:value>
></ex:StringData>
>
>and <![CDATA[<a>Some text</a>]]>.
>
>As I18N has worked hard to keep CDATA Sections out of the infoset,
>we wouldn't be pleased about this either :-(.
>
>>>For RDF to say that XML is *treated* as a string of Unicode characters
>>>is perfectly okay. For RDF to say that XML *is* nothing but a string
>>>of Unicode characters is a bad idea.
>>
>>I don't think the issue here is that RDF is or is not trying to say 
>>anything about what an XML document may be, but rather to decide 
>>whether or not RDF embodies special treatment of literals that 
>>happen to be XML documents.  My position being:  why shouldn't RDF 
>>adopt the same techniques for talking about XML documents that it 
>>uses for talking about any other kind of thing in the universe of 
>>discourse?
>
>So to play devil's advocate, why allow strings?

Because we were chartered to include the XSD datatypes. Honestly, 
that is the only sensible answer. If you want to ask pointed 
questions, why did XSD feel it necessary to define *strings* as a 
*datatype* in a markup language?

>Why not model them the
>RDF way as a sequence of characters?
>
>Seriously, XML fragments got into RDF because they are a natural
>extension of plain literals. The Web has brought us markup, and
>it has proven to be useful. Why go back to plain text if we don't
>have to? And XML fragments cover important internationalization needs,
>such as multilingual strings, ruby, bidirectionality, and so on.
>
>>>What is important is that the same semantic things, i.e.:
>>>- Text (without markup or language information)
>>>- Text with language information (but no markup)
>>>- Text with markup (but no language info)
>>>- Text with markup and language information
>>>are in each of the above cases recognized as being the same rather
>>>than being split up in a number of different things based on some
>>>representational details. On top of that, recognizing the continuity
>>>between the four variants above and making it easy to deal with
>>>this continuity would be a definite plus.
>>
>>Which all seems to be saying that there are different flavours of 
>>text for which consistent handling is required.  Which seems 
>>reasonable to me.  But what is confusing me is the suggestion that 
>>XML is, on one hand, just another flavour of text, yet is also 
>>something completely different.  I can't make coherent sense of 
>>this.
>
>Marked-up text is just another flavor of text. Of course text with
>markup and text without markup is not exactly the same, otherwise
>we wouldn't need markup in the first place.

Maybe the key is this notion of 'flavor'.  That reads wonderfully, 
and I THINK I see what you mean, but when writing a formal semantics 
we need to get more precise. Ive tried to make the general notion of 
hypertext precise on several occasions, but its beyond me. Can you 
tell me what it MEANS to be a different flavor of text?

>Also, an XML fragment that is just only text is just that, just only text.
>Anything that is just text can be an xml fragment.
>XML is text + markup.

And is the text in the 'text+markup' still 'just text' or is it 
somehow changed by the markup into something rich and strange? So is 
'chat' marked up as French *different* from the same text marked up 
as English? Yes, surely; but then the markup *changes* the text, so 
its not 'just only text' any more. So its rather disingenuous to 
write "text+markup": that addition isn't as simple as adding two 
numbers. Its more like applying a function: markup(text); and then 
you have to ask, what is the value space of that markup function? For 
HTML-style markup it is plausibly something like what you see on the 
screen , ie a 'document' in the Xerox sense (a pixellated screen 
image of some kind), though that doesnt really deal with URLs; for 
TeX it is clearly a printed document image.  But for XML more 
generally I have no idea what it could be.

>  An XML document has to have markup (the root element).
>An XML fragment does not have to have markup. So an XML fragment can
>be just text.
>
>>In its way, XML *is* a "representational detail", which happens to 
>>be used to represent many more things than just text.  I'm not sure 
>>what you mean by continuity in this case.
>
>'many more things than just text' may have two different senses.
>In one sense, it refers to the fact that XML is often used for
>representing (structured) data. In that case, it is probably better
>to 'convert' that XML to RDF, either explicitly or by using the
>fact that many XML formats, sometimes with a little bit of help
>such as parseType='Resource', can be interpreted as RDF. This is
>not really the topic of this discussion.
>The other sense may refer to the fact that markup adds value to
>text. It indeed does, but only if actually present.

BUt the text has to be something more than 'just text' if it is 
considered *capable* of having that extra 'value' added to it. Maybe 
we need a notion of 'text ready for markup' as something more than 
mere text, rather in the way that concepts take on extra significance 
when one considers them as existing in a modal context.

>Let me try another approach. RDF says that
>
><foo rdf:parseType="Resource">
>     <rdf:type>Dog</rdf:type>
>     <name>Fifi</name>
></foo>
>
>and
>
><foo>
>   <Dog>
>     <name>Fifi</name>
>   </Dog>
></foo>
>
>(modulo my syntax error) are the same, namely a thing foo with type
>Dog and name Fifi. Why would it then be so difficult for RDF to say that
>
>    <bar rdf:parseType="Literal">Fifi</bar>
>and
>    <bar>Fifi</bar>
>are also the same?

There is a technical reason why this would be nasty for RDF. Either 
it would require RDF parsers to recognize, and distinguish in the RDF 
graph, the cases where the object of the rdf:parseType="Literal" was 
or was not mere text or genuine XML markup (and I bet that this would 
lead to users complaining that their XML fragments were getting 
mislabelled) ; or else it would require an RDF reasoner to have 
redundant ways of expressing the same content (as a plain literal or 
as an XML typed literal) and be able to reason about their identity: 
such reasoning is to be avoided whenever possible as it leads to 
combinatorial explosions in simple reasoners.

>>This message is in danger of getting longer and longer... the more 
>>I think about what you seem to be asking for, the less I can see a 
>>coherent view of it.  So, in summary, I think we have two choices:
>>(a) XML has no distinguished status in the RDF abstract syntax.  (I 
>>like this, others don't)
>>(b) XML does have distinguished status, and we accept the 
>>consequences, warts and all.
>
>What's warty about saying that a text string without markup is the
>same as a text string without markup?

The fact that since they are encoded differently, SOMETHING has to do 
some work to figure out that they are supposed to mean the same thing 
(if indeed they are).

Pat

-- 
---------------------------------------------------------------------
IHMC	(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.	(850)202 4416   office
Pensacola			(850)202 4440   fax
FL 32501			(850)291 0667    cell
phayes@ihmc.us       http://www.ihmc.us/users/phayes
Received on Wednesday, 2 July 2003 01:49:23 UTC