Re: Summary of strings, markup, and language tagging in RDF (resend)

Hello Jeremy, others,

At 13:41 03/06/30 +0100, Jeremy Carroll wrote:

>Graham Klyne wrote:
>
>>At 08:48 29/06/03 -0400, Martin Duerst wrote:

>>>Obviously, to find out whether it is text with markup or text
>>>without markup, one way is to look inside. Another way would be
>>>to disallow rdf:parseType='Literal' on pure text strings.
>>
>>I think this possibility was mentioned in our discussion, but rejected on 
>>the grounds of invalidating some (much?) existing RDF, and also making 
>>life much harder for RDF writers.
>
>An example application is one I have which has a form which permits the 
>user to include xhtml markup. The value of this form becomes embedded 
>within an RDF document inside an rdf:parseType="Literal" element.

So given that other people and applications will also contribute
to this data, what's the best solution:

1) To have the producer (your application) check whether there is markup
    or not, and leave out rdf:parseType="Literal" if there is none?
    [I agree that this is not a good solution, because it's against
    established practice.]
2) To have the RDF parser handle the fact that for plain text strings,
    sometimes there may be an rdf:parseType="Literal", and sometimes not?
3) To have some indication in the schema saying that only
    rdf:parseType="Literal" can be used here?
4) To dump the problem on 'higher level applications'?

In my view, the best solution is clearly 2).
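
A minimal sketch of what 2) could look like on the parser side, in Python with the standard xml.etree.ElementTree module. The function name literal_kind and the labels "plain", "xml" and "node" are made up for illustration. The check only looks for child elements, so comments and processing instructions (which ElementTree drops by default) are not considered, and empty property elements with rdf:resource are not handled.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PARSE_TYPE = "{%s}parseType" % RDF_NS  # ElementTree's name for rdf:parseType


def literal_kind(prop):
    """Classify the content of a property element (hypothetical helper).

    Content without child elements is reported as "plain", whether or
    not rdf:parseType="Literal" is present on the element.
    """
    has_markup = len(prop) > 0  # len() counts child elements only
    if not has_markup:
        return "plain"
    if prop.get(PARSE_TYPE) == "Literal":
        return "xml"
    return "node"  # nested Description or container, not a literal


rdf_decl = 'xmlns:rdf="%s"' % RDF_NS
without_pt = ET.fromstring("<foo>literal text</foo>")
with_pt = ET.fromstring(
    '<foo %s rdf:parseType="Literal">literal text</foo>' % rdf_decl
)
print(literal_kind(without_pt), literal_kind(with_pt))
```

With this rule the producer can always write rdf:parseType="Literal", and the consumer still sees one kind of value for pure text.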


By the way, I was just trying to check to what extent the actual RDF
Model and Syntax spec is expressing the fact that its authors (or at
least one of them, Ralph) thought that rdf:parseType="Literal" without
any actual markup is the same as a plain literal.

Here is what I have found:

    3. If E is an empty element (no content), v is the resource whose
       identifier is given by the resource attribute of E. If the content
       of E contains no XML markup or if parseType="Literal" is specified
       in the start tag of E then v is the content of E (a literal). Otherwise,
       the content of E must be another Description or container and v is the
       resource named by the (possibly implicit) ID or about of that Description
       or container.

This does not make any distinction WHATSOEVER between
    <foo>literal text</foo>
and
    <foo rdf:parseType="Literal">literal text</foo>

Also, the definition of Literal does not distinguish between what's
now called 'plain' and 'XML' literals:

Literal
    The most primitive value type represented in RDF, typically a string of
    characters. The content of a literal is not interpreted by RDF itself
    and may contain additional XML markup. Literals are distinguished from
    Resources in that the RDF model does not permit literals to be the subject
    of a statement.

If you have found evidence to the contrary, please tell me.


>Martin:
>
>>>Can we please make sure that we separate syntax and semantics?
>>
>>I wasn't aware of conflating the two.  This issue seems to be entirely 
>>syntactic:  is a sequence of Unicode characters used to represent an XML 
>>document (and conforming to XML syntax) syntactically distinguished from 
>>any other sequence of Unicode characters?  (Hmmm... maybe the conflation 
>>here is between concrete syntax and abstract syntax -- I'm thinking of 
>>abstract syntax here.)
>>As for the rest of what you say, I really don't want to get into encoding 
>>tricks here -- to me that is just another layer of complexity we don't 
>>need, and as such should be left to implementers to deal with in their 
>>own way.   That is, if the string
>>    "<a>Some text</a>"
>>is to be distinct from the XML document encoded as:
>>    "<a>Some text</a>"
>>then we should just say so and deal with the consequences.
>
>
>The WG has taken such a position for quite a while now.
>This has been motivated by the needs of applications which produce XML 
>output and have to escape the non-XML strings and to not escape the known 
>XML content.

It is clear that applications need to know whether something is markup,
or is just characters that look like markup.



>>Personally, I don't think XML should have this distinguished status in 
>>RDF.  If it's really necessary to distinguish an XML document literal in 
>>RDF, then why not use RDF facilities to do so?  e.g.
>>    <ex:XMLDocument>
>>       <rdf:value rdf:parseType="Literal"><a>Some text</a></rdf:value>
>>    </ex:XMLDocument>
>>as distinct from, say:
>>    <ex:StringData>
>>       <rdf:value rdf:parseType="Literal"><a>Some text</a></rdf:value>
>>    </ex:StringData>
>
>
>Simply that this is not the design the WG took to last call. The design 
>the WG took to last call had been examined by the RDFCore WG in detail, 
>and had, at least at an earlier stage, been reviewed by the I18N WG.

I of course remember various discussions, in particular the one in Cannes.
But I do not remember that we ever agreed to treating
    <rdf:value rdf:parseType="Literal">Some text</rdf:value>
and
    <rdf:value>Some text</rdf:value>
as two completely different things, and I don't know which communication
might have given you the impression that we did.
If you think we have indeed agreed to this, please tell me when and how.


>The current phrasing in the editors draft defers to the term exclusive 
>canonical XML:
>http://www.w3.org/TR/2002/REC-xml-exc-c14n-20020718/#def-exclusive-canonical-XML

Just before we forget it, at that place, 'exclusive canonicalization'
is defined as follows:
"The exclusive canonical form of a document subset is a physical representation
of the XPath node-set, as an octet sequence, produced by the method described
in this specification"

While the 'physical representation' may have been important for the people
working on digital signatures, it seems definitely the wrong thing for RDF.
I hope this can be fixed.

What is much more important, if using exclusive canonical XML means that
the xml:lang context of the XML literal in the RDF document is ignored,
then that's totally wrong. It:
- has never been accepted by the I18N WG (RDF Core agreed with that)
- is against the XML 1.0 Recommendation
- is against the RDF Model and Syntax Recommendation
- is against the recent RDF last calls
- is the opposite of what happens with plain literals, and therefore
   highly confusing for users.

To make sure xml:lang is not thrown away for XML literals, there is
no need to change exclusive canonical XML. As for plain literals,
xml:lang can be carried separately.
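
A rough sketch of carrying xml:lang separately, again in Python with xml.etree.ElementTree. The names find_with_lang and xml_literal are hypothetical, and the serialization uses ET.tostring, which is NOT exclusive canonical XML; it only stands in for it here to show the language in scope being paired with the literal instead of being dropped.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def find_with_lang(elem, tag, inherited=None):
    """Return (element, language in scope) for the first element named tag."""
    # An empty xml:lang="" resets the language, hence the "or None".
    lang = elem.get(XML_LANG, inherited) or None
    if elem.tag == tag:
        return elem, lang
    for child in elem:
        found = find_with_lang(child, tag, lang)
        if found is not None:
            return found
    return None


def xml_literal(root, tag):
    """Return (serialized content, language) for the property named tag."""
    found = find_with_lang(root, tag)
    if found is None:
        return None
    elem, lang = found
    # ET.tostring() includes each child's tail text.
    content = (elem.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in elem
    )
    return content, lang


doc = ET.fromstring(
    '<doc xml:lang="en" '
    'xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    '<title rdf:parseType="Literal">A <em>big</em> pot</title>'
    "</doc>"
)
print(xml_literal(doc, "title"))
```

The language is found by walking down from the root, because the xml:lang that applies to the literal may be declared on any ancestor in the RDF document.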


>Which is what it does, it treats the embedded XML as a special sort of 
>literal value, i.e. a typed literal. This seems an entirely consistent and 
>coherent position.
>
>
>>>What is important is that the same semantic things, i.e.:
>>>- Text (without markup or language information)
>>>- Text with language information (but no markup)
>>>- Text with markup (but no language info)
>>>- Text with markup and language information
>>>are in each of the above cases recognized as being the same rather
>>>than being split up in a number of different things based on some
>>>representational details. On top of that, recognizing the continuity
>>>between the four variants above and making it easy to deal with
>>>this continuity would be a definite plus.
>
>
>There is certainly more work that should be done in the area of language 
>in the semantic web, for instance RDF Core has considered Tex Texin's comment
>
>http://lists.w3.org/Archives/Public/www-rdf-comments/2003JanMar/0460.html
>
>concerning language ranges and realized that at present we offer no 
>solution - but that that problem was outside our current charter. So we 
>have created a new postponed issue as described in:
>
>http://lists.w3.org/Archives/Public/www-rdf-comments/2003AprJun/0029.html
>
>This would address the first two of Martin's list - but not the issue of 
>markup.

Maybe I wasn't clear enough above. What we are asking for is not that
RDF provide a mechanism so that all the following four can be seen
as one and the same thing.

1) Text (without markup or language information)
2) Text with language information (but no markup)
3) Text with markup (but no language info)
4) Text with markup and language information

What we are asking for is just that all syntactic artefacts that fall within
any single one of the above categories are treated the same, i.e. that in addition
to the four categories above, we don't create any spurious additional ones.


>To me this looks like application space, in which semantic web application 
>layers, that are currently not particularly subscribed in W3C documents, 
>get to call the shots.

What you refer to, i.e. ignoring markup or ignoring (a suffix of) a language
tag *across* the categories above, can definitely go into application space.
What applications should not have to bother with is spurious differences
between what is one and the same thing, i.e. *within* any of the four
categories listed above.



>The difference between an XML document and related strings is complex, and 
>probably goes beyond the bounds of what can be systematically defined.
>
>e.g.
>
>If we are searching for instances of the word "pot" which of the following 
>bits of XML should count as a match:
>
>"<em>pot</em>"
>"<pot/>"
>"<eg eg:pot='h' xmlns:eg='http://eg.org/'/>"
>
>etc.

Good question. But if we are searching for 'pot' in the following
two examples:
    <foo rdf:parseType='Literal'>pot</foo>
and
    <foo>pot</foo>
would you ever expect an application to return one and not the other?
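
To make the point concrete, here is a small sketch in Python (xml.etree.ElementTree) of one possible search policy: match on character data only, ignoring element and attribute names. The function text_matches is hypothetical, and this policy is just an assumption for the example, not an answer to the general question above. Whatever policy an application picks, the two forms of 'pot' come out the same.

```python
import xml.etree.ElementTree as ET


def text_matches(xml_fragment, word):
    """True if word occurs in the character data of the fragment."""
    elem = ET.fromstring(xml_fragment)
    # itertext() yields text and tail strings, never tag or attribute names.
    return word in "".join(elem.itertext())


RDF_DECL = 'xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
with_parse_type = "<foo %s rdf:parseType='Literal'>pot</foo>" % RDF_DECL
without_parse_type = "<foo>pot</foo>"

print(text_matches(with_parse_type, "pot"))
print(text_matches(without_parse_type, "pot"))
```

Under this policy "<em>pot</em>" matches, while "<pot/>" and the eg:pot attribute example do not.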


Regards,    Martin.

Received on Monday, 30 June 2003 14:54:15 UTC