Re: Summary of strings, markup, and language tagging in RDF (resend) from Jeremy Carroll on 2003-06-30 (w3c-rdfcore-wg@w3.org from June 2003)

From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
Date: Mon, 30 Jun 2003 20:46:14 +0100
To: Martin Duerst <duerst@w3.org>
CC: Graham Klyne <gk@ninebynine.org>, Dan Connolly <connolly@w3.org>, w3c-i18n-ig@w3.org, "Ralph R. Swick" <swick@w3.org>, misha.wolf@reuters.com, Tim Berners-Lee <timbl@w3.org>, w3c-rdfcore-wg@w3.org, reagle@w3.org
Message-ID: <3F009386.8020709@hplb.hpl.hp.com>
Martin Duerst wrote:
...

> 2) To have the RDF parser handle the fact that for plain text strings,
>    sometimes there may be an rdf:parseType="Literal", and sometimes not?

...

> 
> In my view, the best solution is clearly 2).
> 
> 
> By the way, I was just trying to check to what extent the actual RDF
> Model and Syntax spec is expressing the fact that its authors (or at
> least one of them, Ralph) thought that rdf:parseType="Literal" without
> any actual markup is the same as a plain literal.
> 
> Here is what I have found:
> 
>    3. If E is an empty element (no content), v is the resource whose
>       identifier is given by the resource attribute of E. If the content
>       of E contains no XML markup or if parseType="Literal" is specified
>       in the start tag of E then v is the content of E (a literal). 
> Otherwise,
>       the content of E must be another Description or container and v is 
> the
>       resource named by the (possibly implicit) ID or about of that 
> Description
>       or container.
> 
> This does not make any distinction WHATSOEVER between
>    <foo>literal text</foo>
> and
>    <foo rdf:parseType="Literal">literal text</foo>
> 
> Also, the definition of Literal does not distinguish between what's
> now called 'plain' and 'XML' literals:
> 
> Literal
>    The most primitive value type represented in RDF, typically a string of
>    characters. The content of a literal is not interpreted by RDF itself
>    and may contain additional XML markup. Literals are distinguished from
>    Resources in that the RDF model does not permit literals to be the 
> subject
>    of a statement.
> 
> If you have found evidence to the contrary, please tell me.


I agree with your reading of M&S (although I would defer to Brian or DaveB 
on this one), unfortunately that was not found workable. Applications 
needed to know whether the markup was an XML literal or not. In the absence 
of helpful advice from M&S some RDF applications returned effectively an 
additional bit of information indicating whether it was a 
parseType="Literal" or not.

RDF Core was chartered to fix bugs in M&S and this was an area where there 
were definitely bugs. e.g. the mathml example in M&S requires mechanisms 
that are not even hionted at, and we have not provided with clear, if 
somewhat difficult text, defering to exc-c14n.

So in brief, M&S was broken, and we were required to fix it.

...


>> The current phrasing in the editors draft defers to the term exclusive 
>> canonical XML:
>> http://www.w3.org/TR/2002/REC-xml-exc-c14n-20020718/#def-exclusive-canonica 
>> l-XML
> 
> 

Martin:

> Just before we forget it, at that place, 'exclusive canonicalization'
> is defined as follows:
> "The exclusive canonical form of a document subset is a physical 
> representation
> of the XPath node-set, as an octet sequence, produced by the method 
> described
> in this specification"
> 
> While the 'physical representation' may have been important for the people
> working on digital signatures, it seems definitely the wrong thing for RDF.
> I hope this can be fixed.



I agree its clunky - I don't believe it is cost effective to fix it.
RDF Core should be defering to an XML group as to appropriate 
representations of XML. We require that equality is well-defined. The only 
XML groups we found when we determined the main outline of this design two 
years ago was the c14n group. When they also penned exc-c14n it was clearly 
a better fit.


> 
> What is much more important, if using exclusive canonical XML means that
> the xml:lang context of the XML literal in the RDF document is ignored,
> then that's totally wrong. 


If that's totally wrong, then why is it not wrong for SOAP, or other 
applications of exc-c14n?
This seems to be a comment about exc-c14n rather than RDF.

>It:
> - has never been accepted by the I18N WG (RDF Core agreed with that)

agreed

> - is against the XML 1.0 Recommendation

in as much as exc-c14n is.

> - is against the RDF Model and Syntax Recommendation

M&S is somewhat vague, but I would concede this point.

> - is against the recent RDF last calls

yes.

> - is the opposite of what happens with plain literals, and therefore
>   highly confusing for users.

depends on the application.
I would suspect this is true for XHTML based XML literals, which I would 
view as the main application.
See below about confusion.

> 
> To make sure xml:lang is not thrown away for XML literals, there is
> no need to change exclusive canonical XML. 

We lose xml:lang by using exc-c14n out of the box ... viz:
[[
attributes in the XML namespace, such as xml:lang and xml:space are not 
imported into orphan nodes of the document subset
]]

Because of this, in the LC docs we had a complicated and confusing 
work-around that involved putting the xml-literal inside an <rdf-wrapper> 
tag, whose sole purpose was to hold the xml:lang attribute. It is certainly 
less confusing to have ditched all of that.

>As for plain literals,
> xml:lang can be carried separately.
> 


This is current behaviour.

> Maybe I wasn't clear enough above. What we are asking for is not that
> RDF provide a mechanism so that all the following four can be seen
> as one and the same thing.
> 
> 1) Text (without markup or language information)
> 2) Text with language information (but no markup)
> 3) Text with markup (but no language info)
> 4) Text with markup and language information
> 
> What we are asking for is just that all syntactic artefacts that fall 
> within
> any single of the above categories are treated the same, i.e. that in 
> addition
> to the four categories above, we don't create any spurious additional ones.
> 
> 
>> To me this looks like application space, in which semantic web 
>> application layers, that are currently not particularly subscribed in 

(typo: I meant described not subscribed)

>> W3C documents, get to call the shots.
> 
> 
> What you refer to, i.e. ignoring markup or ignoring (a suffix of) a 
> language
> tag *across* the categories above, can definitely go into application 
> space.
> What applications should not have to bother with is spurious differences
> between what is one and the same thing, i.e. *within* any of the four
> categories listed above.
> 

> 
> 
>> The different between an XML document and related strings is complex, 
>> and probably goes beyond the bounds of what can be systematically 
>> defined.
>>
>> e.g.
>>
>> If we are searching for instances of the word "pot" which of the 
>> following bits of XML should count as a match:
>>
>> "<em>pot</em>"
>> "<pot/>"
>> "<eg eg:pot='h' xmlns:eg='http://eg.org/'/>"
>>
>> etc.
> 
> 
> good question. But if we are searching for 'pot' in the following
> two examples:
>    <foo rdf:parseType='Literal'>pot</foo>
> and
>    <foo>pot</foo>
> would you ever expect an application to return one and not the other?
> 

I think you could ask the same question about searching for the string 
"100" and wondering whether the number written as "7.5100" is a match or 
not. I don't find the use of datatyping for XMLLiterals forced and unnatural.
     <foo rdf:dataType='&xsd;decimal'>7.5100</foo>
  and
     <foo>7.5100</foo>

If you really just want to treat the RDF/XML as text data use a text editor.



> 
> Regards,    Martin.

Jeremy
Received on Monday, 30 June 2003 15:46:46 UTC