Re: Summary of strings, markup, and language tagging in RDF (resend) from Jeremy Carroll on 2003-06-30 (w3c-rdfcore-wg@w3.org from June 2003)

From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
Date: Mon, 30 Jun 2003 13:41:42 +0100
To: Graham Klyne <gk@ninebynine.org>
CC: Martin Duerst <duerst@w3.org>, Dan Connolly <connolly@w3.org>, w3c-i18n-ig@w3.org, "Ralph R. Swick" <swick@w3.org>, misha.wolf@reuters.com, Tim Berners-Lee <timbl@w3.org>, w3c-rdfcore-wg@w3.org
Message-ID: <3F003006.7040907@hplb.hpl.hp.com>
Graham Klyne wrote:

> 
> At 08:48 29/06/03 -0400, Martin Duerst wrote:
> 
>> Hello Graham,
>>
>> At 18:53 03/06/27 +0100, Graham Klyne wrote:
>>
>>> Speaking for myself, and my understanding of our discussion...
>>>
>>> What I found "distasteful" was the suggestion that one would have to 
>>> look *inside* the content of a literal to figure out what type it is.
>>
>>
>> Obviously, to find out whether it is text with markup or text
>> without markup, one way is to look inside. Another way would be
>> to disallow rdf:parseType='Literal' on pure text strings.
> 
> 
> I think this possibility was mentioned in our discussion, but rejected 
> on the grounds of invalidating some (much?) existing RDF, and also 
> making life much harder for RDF writers.
> 


An example application is one I have which has a form which permits the 
user to include xhtml markup. The value of this form becomes embedded 
within an RDF document inside an rdf:parseType="Literal" element.

> 
>>> In discussion, I understood the request to be for:
>>>
>>> [[
>>> <dc:title rdf:parseType='Literal'>
>>>   A Midsummer Night's Dream
>>> </dc:title>
>>> ]]
>>>
>>> to denote a plain string literal, but
>>>
>>> [[
>>> <dc:title rdf:parseType='Literal'>
>>>   <em>A Midsummer Night's Dream</em>
>>> </dc:title>
>>> ]]
>>>
>>> to be a completely different kind of literal denoting an XML document 
>>> in some way (because of the presence of markup).
>>>
>>> (I originally read Martin's note to suggest that an XML document is 
>>> itself just a string of Unicode characters, not distinguished from 
>>> non-XML strings.  That is a position I could support but with which 
>>> others have expressed concerns.)
>>
>>


Martin:

>> Can we please make sure that we separate syntax and semantics?
> 
> 
> I wasn't aware of conflating the two.  This issue seems to be entirely 
> syntactic:  is a sequence of Unicode characters used to represent an XML 
> document (and conforming to XML syntax) syntactically distinguished from 
> any other sequence of Unicode characters?  (Hmmm... maybe the conflation 
> here is between concrete syntax and abstract syntax -- I'm thinking of 
> abstract syntax here.)
> 
> As for the rest of what you say, I really don't want to get into 
> encoding tricks here -- to me that is just another layer of complexity 
> we don't need, and as such should be left to implementers to deal with 
> in their own way.   That is, if the string
>    "<a>Some text</a>"
> is to be distinct from the XML document encoded as:
>    "<a>Some text</a>"
> then we should just say so and deal with the consequences.


The WG has taken such a position for quite a while now.
This has been motivated by the needs of applications which produce XML 
output and have to escape the non-XML strings but must not escape the known 
XML content.
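
A minimal Python sketch of that serializer requirement. The helper name serialize_property and its is_xml flag are hypothetical, for illustration only, and are not taken from any RDF toolkit: plain strings get escaped, while known XML content is written through unescaped under rdf:parseType="Literal".

```python
from xml.sax.saxutils import escape


def serialize_property(qname: str, value: str, is_xml: bool) -> str:
    """Hypothetical helper: emit one RDF/XML property element."""
    if is_xml:
        # Known XML content: written as-is and flagged as an XML literal.
        return f'<{qname} rdf:parseType="Literal">{value}</{qname}>'
    # Plain string: markup characters must be escaped on output.
    return f"<{qname}>{escape(value)}</{qname}>"


# The same characters, serialized once as a string and once as XML.
print(serialize_property("dc:title", "<a>Some text</a>", is_xml=False))
print(serialize_property("dc:title", "<a>Some text</a>", is_xml=True))
```

The writer cannot make this choice without knowing, from outside the literal's content, which of the two kinds it is holding.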

> 
> Personally, I don't think XML should have this distinguished status in 
> RDF.  If it's really necessary to distinguish an XML document literal in 
> RDF, then why not use RDF facilities to do so?  e.g.
> 
>    <ex:XMLDocument>
>       <rdf:value rdf:parseType="Literal"><a>Some text</a></rdf:value>
>    </ex:XMLDocument>
> 
> as distinct from, say:
> 
>    <ex:StringData>
>       <rdf:value rdf:parseType="Literal"><a>Some text</a></rdf:value>
>    </ex:StringData>
> 


Simply that this is not the design the WG took to last call. The design the 
WG took to last call had been examined by the RDFCore WG in detail, and had, 
at least at an earlier stage, been reviewed by the I18N WG.

I also note that the RDF Core WG considered and rejected such models for 
typed data literals (e.g. in which an integer might have been represented as

<ex:int>
    <rdf:value>1</rdf:value>
</ex:int>

to distinguish it from the string

<ex:string>
    <rdf:value>1</rdf:value>
</ex:string>

)


>> XML is defined as a syntax on a sequence of Unicode characters,
>> so treating it as such in a particular implementation,... is
>> possible. If you are a bit careful with escaping, you can store
>> text without markup in the same form. Other implementations are
>> easily possible (for example, one could observe that "<>" is illegal
>> in XML, and thus use "<>" to escape '<', and not escape &, and
>> use '""' to escape '"' in an attribute. This would no longer look
>> like XML, but would store the same information).
>>
>> For RDF to say that XML is *treated* as a string of Unicode characters
>> is perfectly okay. For RDF to say that XML *is* nothing but a string
>> of Unicode characters is a bad idea.
> 


The current phrasing in the editors' draft defers to the term "exclusive 
canonical XML":
http://www.w3.org/TR/2002/REC-xml-exc-c14n-20020718/#def-exclusive-canonical-XML
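
A rough illustration of why deferring to a canonical form is useful: two XML fragments that differ only in surface syntax compare equal once canonicalized. Note the assumption here: Python's standard library implements C14N 2.0 (xml.etree.ElementTree.canonicalize, Python 3.8+), not the Exclusive Canonicalization 1.0 cited above, so this sketch only stands in for the general idea.

```python
from xml.etree.ElementTree import canonicalize

# Same element, different quoting, spacing and empty-element syntax.
first = "<a  b='1' ></a>"
second = '<a b="1"/>'

# As raw character strings the two differ ...
print(first == second)

# ... but their canonical forms (C14N 2.0 here, an assumption) agree.
print(canonicalize(first) == canonicalize(second))
```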


> 
> I don't think the issue here is that RDF is or is not trying to say 
> anything about what an XML document may be, but rather to decide whether 
> or not RDF embodies special treatment of literals that happen to be XML 
> documents.  My position being:  why shouldn't RDF adopt the same 
> techniques for talking about XML documents that it uses for talking 
> about any other kind of thing in the universe of discourse?
> 


Which is what it does: it treats the embedded XML as a special sort of 
literal value, i.e. a typed literal. This seems an entirely consistent and 
coherent position.
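
A small sketch of what that means for comparison, assuming a hypothetical Literal class (not any toolkit's API) that pairs a lexical form with an optional datatype URI and omits language tags for brevity. The datatype URI is the one RDF uses for XML literals.

```python
from dataclasses import dataclass
from typing import Optional

RDF_XMLLITERAL = "http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral"


@dataclass(frozen=True)
class Literal:
    """Hypothetical model of an RDF literal (language tag omitted)."""
    lexical: str
    datatype: Optional[str] = None  # None means a plain literal


plain = Literal("<a>Some text</a>")
xml = Literal("<a>Some text</a>", RDF_XMLLITERAL)

# Same characters, different kinds of literal: they do not compare equal.
print(plain == xml)
```

The distinction is carried by the datatype, so nothing has to look inside the content to tell the two apart.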


>> What is important is that the same semantic things, i.e.:
>> - Text (without markup or language information)
>> - Text with language information (but no markup)
>> - Text with markup (but no language info)
>> - Text with markup and language information
>> are in each of the above cases recognized as being the same rather
>> than being split up in a number of different things based on some
>> representational details. On top of that, recognizing the continuity
>> between the four variants above and making it easy to deal with
>> this continuity would be a definite plus.
> 


There is certainly more work that should be done in the area of language in 
the semantic web, for instance RDF Core has considered Tex Texin's comment

http://lists.w3.org/Archives/Public/www-rdf-comments/2003JanMar/0460.html

concerning language ranges and realized that at present we offer no 
solution - but that that problem was outside our current charter. So we 
have created a new postponed issue as described in:

http://lists.w3.org/Archives/Public/www-rdf-comments/2003AprJun/0029.html

This would address the first two of Martin's list - but not the issue of 
markup. To me this looks like application space, in which semantic web 
application layers, that are currently not particularly described in W3C 
documents, get to call the shots.
The difference between an XML document and related strings is complex, and 
probably goes beyond the bounds of what can be systematically defined.

e.g.

If we are searching for instances of the word "pot" which of the following 
bits of XML should count as a match:

"<em>pot</em>"
"<pot/>"
"<eg eg:pot='h' xmlns:eg='http://eg.org/'/>"

etc.
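
To make the ambiguity concrete, here is a sketch in Python comparing two possible definitions of "match" over those three fragments. Both matching functions are assumptions chosen for illustration; neither is proposed as the right answer.

```python
import xml.etree.ElementTree as ET

candidates = [
    "<em>pot</em>",
    "<pot/>",
    "<eg eg:pot='h' xmlns:eg='http://eg.org/'/>",
]


def matches_raw(xml_text: str, word: str) -> bool:
    """Treat the XML as a plain character string."""
    return word in xml_text


def matches_text(xml_text: str, word: str) -> bool:
    """Look only at character data, ignoring element and attribute names."""
    root = ET.fromstring(xml_text)
    return word in "".join(root.itertext())


for c in candidates:
    print(c, matches_raw(c, "pot"), matches_text(c, "pot"))
```

The raw search matches all three; the text-content search matches only the first. Which behaviour is wanted depends on the application.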


> 
> Which all seems to be saying that there are different flavours of text 
> for which consistent handling is required.  Which seems reasonable to 
> me.  But what is confusing me is the suggestion that XML is, on one 
> hand, just another flavour of text, yet is also something completely 
> different.  I can't make coherent sense of this.


In the current WG solution XML is not just another flavour of text.


Jeremy
Received on Monday, 30 June 2003 08:48:46 UTC