Re: Summary of strings, markup, and language tagging in RDF (resend) from Martin Duerst on 2003-06-30 (w3c-rdfcore-wg@w3.org from June 2003)

From: Martin Duerst <duerst@w3.org>
Date: Mon, 30 Jun 2003 12:09:12 -0400
To: Graham Klyne <gk@ninebynine.org>, Dan Connolly <connolly@w3.org>
Cc: w3c-i18n-ig@w3.org, "Ralph R. Swick" <swick@w3.org>, misha.wolf@reuters.com, Tim Berners-Lee <timbl@w3.org>, w3c-rdfcore-wg@w3.org
Message-Id: <4.2.0.58.J.20030630112307.04caebe8@localhost>
Hello Graham, others,


At 09:42 03/06/30 +0100, Graham Klyne wrote:

>At 08:48 29/06/03 -0400, Martin Duerst wrote:

>>Obviously, to find out whether it is text with markup or text
>>without markup, one way is to look inside. Another way would be
>>to disallow rdf:parseType='Literal' on pure text strings.
>
>I think this possibility was mentioned in our discussion, but rejected on 
>the grounds of invalidating some (much?) existing RDF, and also making 
>life much harder for RDF writers.

I did not want to suggest that this possibility is a good solution.
I just tried to show what the issue was about: As long as there
is no markup in the 'XML'-Literal, there is really no semantic
distinction to plain literals.


>>>In discussion, I understood the request to be for:
>>>
>>>[[
>>><dc:title rdf:parseType='Literal'>
>>>   A Midsummer Night's Dream
>>></dc:title>
>>>]]
>>>
>>>to denote a plain string literal, but
>>>
>>>[[
>>><dc:title rdf:parseType='Literal'>
>>>   <em>A Midsummer Night's Dream</em>
>>></dc:title>
>>>]]
>>>
>>>to be a completely different kind of literal denoting an XML document in 
>>>some way (because of the presence of markup).
>>>
>>>(I originally read Martin's note to suggest that an XML document is 
>>>itself just a string of Unicode characters, not distinguished from 
>>>non-XML strings.  That is a position I could support but with which 
>>>others have expressed concerns.)
>>
>>Can we please make sure that we separate syntax and semantics?
>
>I wasn't aware of conflating the two.  This issue seems to be entirely 
>syntactic:  is a sequence of Unicode characters used to represent an XML 
>document (and conforming to XML syntax) syntactically distinguished from 
>any other sequence of Unicode characters?  (Hmmm... maybe the conflation 
>here is between concrete syntax and abstract syntax -- I'm thinking of 
>abstract syntax here.)

First, we are not dealing with XML documents, we are dealing with
XML fragments. Second, of course there is a distinction between
an XML fragment that has actual markup and a string that does
contain nothing but text. But this is not what we are talking about.
The question is whether there is a difference between an 'XML fragment'
that contains nothing else but just text and a simple string that
contains nothing else but text. What I was saying was that there
may be some syntactic differences (because there may be some need
for escaping in the first case, but not in the second case),
but there is no real difference. (I'll try to avoid the word
'semantic' from now on.)


>As for the rest of what you say, I really don't want to get into encoding 
>tricks here -- to me that is just another layer of complexity we don't 
>need, and as such should be left to implementers to deal with in their own way.

I fully agree. In the same way, if rdf:parseType='Literal' is irrelevant
if there is no markup in the literal, then we should just say so and
let implementations deal with it.


>  That is, if the string
>    "<a>Some text</a>"
>is to be distinct from the XML document encoded as:
>    "<a>Some text</a>"
>then we should just say so and deal with the consequences.

Yes, exactly. The former would turn out in RDF/XML as something
like <foo>&lt;a&gt;Some text&lt;/a&gt;<foo>, the later would turn
out as <foo rdf:parseType='Literal'><a>Some text</a></foo>.
I think nobody in this discussion claims that these two should
be the same. What we are discussing is the cases where there is
only an XML fragment without markup. I.e. if the string
     "Some text"
is to be distinct from the XML fragment encoded as:
     "Some text"
then we should just say so. But very obviously, they are the same,
so we should not claim they are different.


>Personally, I don't think XML should have this distinguished status in 
>RDF.  If it's really necessary to distinguish an XML document literal in 
>RDF, when why not use RDF facilities to do so?  e.g.
>
>    <ex:XMLDocument>
>       <rdf:value rdf:parseType="Literal"><a>Some text</a></rdf:value>
>    </ex:XMLDocument>
>
>as distinct from, say:
>
>    <ex:StringData>
>       <rdf:value rdf:parseType="Literal"><a>Some text</a></rdf:value>
>    </ex:StringData>

First, this would be against RDF Model and Syntax. Second,
as Jeremy pointed out, it would be against all the other
decisions RDF Core has taken up to last call. Third, it
would create even more different representations for what's
exactly the same thing. There would be nothing but syntax
differences between the following two:

<ex:XMLDocument>
    <rdf:value rdf:parseType='Literal'>Some text</rdf:value>
</ex:XMLDocument>

<ex:StringData>
    <rdf:value rdf:parseType='Literal'>Some text</rdf:value>
</ex:StringData>

And fourth, the second one could easily be seen as yet another
way to do CDATA Sections, see the parallel between

<ex:StringData>
     <rdf:value rdf:parseType="Literal"><a>Some text</a></rdf:value>
</ex:StringData>

and <![CDATA[<a>Some text</a>]]>.

As I18N has worked hard to keep CDATA Sections out of the infoset,
we wouldn't be pleased about this either :-(.


>>For RDF to say that XML is *treated* as a string of Unicode characters
>>is perfectly okay. For RDF to say that XML *is* nothing but a string
>>of Unicode characters is a bad idea.
>
>I don't think the issue here is that RDF is or is not trying to say 
>anything about what an XML document may be, but rather to decide whether 
>or not RDF embodies special treatment of literals that happen to be XML 
>documents.  My position being:  why shouldn't RDF adopt the same 
>techniques for talking about XML documents that it uses for talking about 
>any other kind of thing in the universe of discourse?

So to play devil's advocate, why allow strings? Why not model them the
RDF way as a sequence of characters?

Seriously, XML fragments got into RDF because they are a natural
extension of plain literals. The Web has brought us markup, and
it has proven to be useful. Why go back to plain text if we don't
have to? And XML fragments cover important internationalization needs,
such as multilingual strings, ruby, bidirectionality, and so on.


>>What is important is that the same semantic things, i.e.:
>>- Text (without markup or language information)
>>- Text with language information (but no markup)
>>- Text with markup (but no language info)
>>- Text with markup and language information
>>are in each of the above cases recognized as being the same rather
>>than being split up in a number of different things based on some
>>representational details. On top of that, recognizing the continuity
>>between the four variants above and making it easy to deal with
>>this continuity would be a definite plus.
>
>Which all seems to be saying that there are different flavours of text for 
>which consistent handling is required.  Which seems reasonable to me.  But 
>what is confusing me is the suggestion that XML is, on one hand, just 
>another flavour of text, yet is also something completely different.  I 
>can't make coherent sense of this.

Marked-up text is just another flavor of text. Of course text with
markup and text without markup is not exactly the same, otherwise
we wouldn't need markup in the first place.
Also, an XML fragment that is just only text is just that, just only text.
Anything that is just text can be an xml fragment.
XML is text + markup. An XML document has to have markup (the root element).
An XML fragment does not have to have markup. So an XML fragment can
be just text.


>In its way, XML *is* a "representational detail", which happens to be used 
>to represent many more things than just text.  I'm not sure what you mean 
>by continuity in this case.

'many more things than just text' may have two different senses.
In one sense, it refers to the fact that XML is often used for
representing (structured) data. In that case, it is probably better
to 'convert' that XML to RDF, either explicitly or by using the
fact that many XML formats, sometimes with a little bit of help
such as parseType='Resource', can be interpreted as RDF. This is
not really the topic of this discussion.
The other sense may refer to the fact that markup adds value to
text. It indeed does, but only if actually present.


Let me try another approach. RDF says that

<foo rdf:parseType="Resource">
     <rdf:type>Dog</rdf:type>
     <name>Fifi</name>
</foo>

and

<foo>
   <Dog>
     <name>Fifi</name>
   </Dog>
</foo>

(modulo my syntax error) are the same, namely a thing foo with type
Dog and name Fifi. Why would it then be so difficult for RDF to say that

    <bar rdf:parseType="Literal">Fifi</bar>
and
    <bar>Fifi</bar>
are also the same?


>This message is in danger of getting longer and longer... the more I think 
>about what you seem to be asking for, the less I can see a coherent view 
>of it.  So, in summary, I think we have two choices:
>(a) XML has no distinguished status in the RDF abstract syntax.  (I like 
>this, others don't)
>(b) XML does have distinguished status, and we accept the consequences, 
>warts and all.

What's warty about saying that a text string without markup is the
same as a text string without markup?


Regards,     Martin.
Received on Monday, 30 June 2003 14:27:46 UTC