Re: Summary of strings, markup, and language tagging in RDF (resend)

At 20:46 03/06/30 +0100, Jeremy Carroll wrote:

>Martin Duerst wrote:
>...
>
>>2) To have the RDF parser handle the fact that for plain text strings,
>>    sometimes there may be an rdf:parseType="Literal", and sometimes not?
>
>...
>
>>In my view, the best solution is clearly 2).
>>
>>By the way, I was just trying to check to what extent the actual RDF
>>Model and Syntax spec is expressing the fact that its authors (or at
>>least one of them, Ralph) thought that rdf:parseType="Literal" without
>>any actual markup is the same as a plain literal.
>>Here is what I have found:
>>    3. If E is an empty element (no content), v is the resource whose
>>       identifier is given by the resource attribute of E. If the content
>>       of E contains no XML markup or if parseType="Literal" is specified
>>       in the start tag of E then v is the content of E (a literal).
>>       Otherwise, the content of E must be another Description or
>>       container and v is the resource named by the (possibly implicit)
>>       ID or about of that Description or container.
>>This does not make any distinction WHATSOEVER between
>>    <foo>literal text</foo>
>>and
>>    <foo rdf:parseType="Literal">literal text</foo>
>>Also, the definition of Literal does not distinguish between what's
>>now called 'plain' and 'XML' literals:
>>Literal
>>    The most primitive value type represented in RDF, typically a string of
>>    characters. The content of a literal is not interpreted by RDF itself
>>    and may contain additional XML markup. Literals are distinguished from
>>    Resources in that the RDF model does not permit literals to be the
>>    subject of a statement.
>>If you have found evidence to the contrary, please tell me.
>
>
>I agree with your reading of M&S (although I would defer to Brian or DaveB 
>on this one),

Good.


>unfortunately that was not found workable. Applications needed to know 
>whether the markup was an XML literal or not. In the absence of helpful 
>advice from M&S some RDF applications returned effectively an additional 
>bit of information indicating whether it was a parseType="Literal" or not.

I'm not sure I understand this. It is clear that applications need to know
whether markup originating in RDF/XML was part of an XML literal, or was
part of other RDF (e.g. parseType='Resource' or so). But this seems
self-evident and not at issue.

Assuming that applications get plain literals and XML literals as native
string datatypes, and assuming that the application doesn't want to escape
'&' and '<' for plain literals, it is also clear that applications need
to make a distinction in some cases. [The two assumptions above are both
reasonable implementation choices, but they are not the only choices.]
The distinction they need to make is whether something that looks
like XML markup in a literal (when passed to the application as a text string)
is actually XML markup, or is just a string that looks like XML.
For example, applications need to be able, in RDF/XML, to distinguish between
    <foo rdf:parseType="Literal">Hello <em>World</em>!</foo>
and
    <foo rdf:parseType="Literal">Hello &lt;em&gt;World&lt;/em&gt;!</foo>
(the latter being used, e.g., in an example explaining XML). But this does
NOT imply that applications need to distinguish between
    <foo rdf:parseType="Literal">Hello &lt;em&gt;World&lt;/em&gt;!</foo>
and
    <foo>Hello &lt;em&gt;World&lt;/em&gt;!</foo>
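
To make the distinction concrete, here is a minimal sketch in Python,
using only the standard xml.etree.ElementTree module. The helper names
(parse_property, has_markup, character_content) are made up for this
illustration and are not part of any RDF toolkit. The sketch checks
whether a property element contains real child elements, and what
character content the XML parser reports for it.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def parse_property(fragment):
    """Parse one property element; the wrapper only declares the rdf: prefix."""
    root = ET.fromstring(f'<wrapper xmlns:rdf="{RDF_NS}">{fragment}</wrapper>')
    return root[0]

def has_markup(elem):
    """True if the element content contains real child elements."""
    return len(elem) > 0

def character_content(elem):
    """Character data of the content, after entity references are resolved."""
    return "".join(elem.itertext())

# Real markup inside an XML literal.
markup = parse_property(
    '<foo rdf:parseType="Literal">Hello <em>World</em>!</foo>')
# Text that merely looks like markup, with parseType="Literal".
escaped_literal = parse_property(
    '<foo rdf:parseType="Literal">Hello &lt;em&gt;World&lt;/em&gt;!</foo>')
# The same text without parseType="Literal".
escaped_plain = parse_property(
    '<foo>Hello &lt;em&gt;World&lt;/em&gt;!</foo>')

print(has_markup(markup), has_markup(escaped_literal), has_markup(escaped_plain))
print(character_content(escaped_literal) == character_content(escaped_plain))
```

Only the first fragment has an em child element. The second and third
carry exactly the same characters, which is why I see no need to tell
them apart.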

So we can conclude that the fact that some RDF applications (I assume
this is more parsers or stores than actual applications) returned an
additional bit is not wrong. That the RDF Core WG decided to model this
additional bit by defining a new type is again not wrong.

The problem is that rather than limiting the distinction to those cases
where it was needed (actual markup vs. text that looks like markup), it
was based on some syntactical detail of RDF/XML, namely the presence
or absence of rdf:parseType="Literal", leading to unnecessary distinctions.


>RDF Core was chartered to fix bugs in M&S and this was an area where there 
>were definitely bugs.

I do not consider the fact that M&S describes
    <foo rdf:parseType="Literal">some text here</foo>
and
    <foo>some text here</foo>
as equivalent as a bug.
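
As a rough sketch of my reading of paragraph 3 of M&S quoted above
(not of any actual parser), the rule can be written down in a few
lines of Python. The function names ms_value and inner_content are
hypothetical, the serialization of nested markup is not canonical, and
the "Otherwise" branch is only stubbed out.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PARSE_TYPE = f"{{{RDF_NS}}}parseType"
RESOURCE = f"{{{RDF_NS}}}resource"

def inner_content(elem):
    """Content of E as a string (simple, non-canonical serialization)."""
    parts = [elem.text or ""]
    parts.extend(ET.tostring(child, encoding="unicode") for child in elem)
    return "".join(parts)

def ms_value(elem):
    """Classify the value v of property element E per the quoted rule."""
    if len(elem) == 0 and not elem.text:
        # Empty element: v is the resource named by the resource attribute.
        return ("resource", elem.get(RESOURCE))
    if len(elem) == 0 or elem.get(PARSE_TYPE) == "Literal":
        # No XML markup, or parseType="Literal": v is the content of E.
        return ("literal", inner_content(elem))
    # Otherwise: nested Description or container (not resolved in this sketch).
    return ("resource", None)

with_attr = ET.fromstring(
    f'<foo xmlns:rdf="{RDF_NS}" rdf:parseType="Literal">some text here</foo>')
without_attr = ET.fromstring('<foo>some text here</foo>')

print(ms_value(with_attr) == ms_value(without_attr))
```

Under this reading, both forms end up in the same branch and produce
the same literal.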


>e.g. the mathml example in M&S requires mechanisms that are not even 
>hinted at,

I can agree with saying that it requires quite some thought to come up
with an implementation that does what M&S specifies. But the fact that
we both agree that the current proposal conflicts with M&S in
several ways seems to be a clear indication that M&S wasn't as
undefined as it might seem.


>and we have not provided with clear, if somewhat difficult text, deferring 
>to exc-c14n.

I don't understand what you wanted to say here. Is it
"and we have now provided clear, ..." or
"and we are not provided with clear, ..."?


>So in brief, M&S was broken, and we were required to fix it.

I agree that M&S was not perfect. But I don't agree with fixing
what wasn't broken in the first place.


>...
>
>
>>>The current phrasing in the editors draft defers to the term exclusive 
>>>canonical XML:
>>>http://www.w3.org/TR/2002/REC-xml-exc-c14n-20020718/#def-exclusive-canonical-XML
>
>Martin:
>
>>Just before we forget it, at that place, 'exclusive canonicalization'
>>is defined as follows:
>>"The exclusive canonical form of a document subset is a physical 
>>representation of the XPath node-set, as an octet sequence, produced by
>>the method described in this specification"
>>While the 'physical representation' may have been important for the people
>>working on digital signatures, it seems definitely the wrong thing for RDF.
>>I hope this can be fixed.
>
>
>
>I agree it's clunky - I don't believe it is cost effective to fix it.

Stating that it is exclusive canonicalization, but in terms of characters,
not necessarily UTF-8, should not be too difficult to fix (it could be
done at CR). Referring to a specific octet representation in the day and
age of the Semantic Web just doesn't seem right.
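
A small Python sketch of why the octet-level definition is the wrong
level for RDF. The literal value below is a made-up example, and the
function name same_characters is hypothetical. The same character
sequence has different octet representations in different encodings,
so equality defined on octets depends on an encoding choice that
equality on characters does not need.

```python
def same_characters(octets_a, encoding_a, octets_b, encoding_b):
    """Compare two octet sequences as sequences of characters."""
    return octets_a.decode(encoding_a) == octets_b.decode(encoding_b)

# Hypothetical literal containing one non-ASCII character (o with diaeresis).
literal = "<em>W\u00f6rld</em>"

as_utf8 = literal.encode("utf-8")
as_utf16 = literal.encode("utf-16-be")

octets_equal = as_utf8 == as_utf16
characters_equal = same_characters(as_utf8, "utf-8", as_utf16, "utf-16-be")

print(octets_equal, characters_equal)
```

The octet comparison fails although both sequences represent the same
characters. Defining the canonical form as a character sequence avoids
tying RDF to UTF-8 or to any other particular encoding.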


>RDF Core should be deferring to an XML group as to appropriate 
>representations of XML.

I agree. XML 1.0 clearly says that XML documents are defined as sequences
of characters, not octets.


>We require that equality is well-defined. The only XML group we found 
>when we determined the main outline of this design two years ago was the 
>c14n group. When they also penned exc-c14n it was clearly a better fit.

I don't disagree that exc-c14n is overall a good fit for your purposes.
But that does not mean that you have to throw out the language information.


>>What is much more important, if using exclusive canonical XML means that
>>the xml:lang context of the XML literal in the RDF document is ignored,
>>then that's totally wrong.
>
>
>If that's totally wrong, then why is it not wrong for SOAP, or other 
>applications of exc-c14n?

exc-c14n clearly says under what conditions it can be used, so it
is an issue for the user to choose it or not depending on his/her
needs. Rather than saying "there is this exc-c14n, that seems about
right, so we are going to ignore xml:lang", it should be
"we need to preserve xml:lang, in a similar way as we do with plain
literals, so let's see how we can use exc-c14n the right way". Using
exc-c14n with an additional wrapper element would be one easy solution.

As for SOAP, I have not found any reference to exc-c14n in any of the
three SOAP 1.2 Recommendation documents just recently published.
Please tell me if I'm overlooking something.

I seem to remember that the question was discussed whether, in SOAP,
elements such as Envelope and Body should allow xml:lang. It was
decided that it was okay for these elements not to allow xml:lang,
for three reasons: the elements themselves did not contain any real
text; language information on that level could not be canceled
(remember that this was some time ago, when the solution xml:lang=""
was not yet agreed upon); and the structure of header and body was
coarse enough, and closer to the actual application, to require the
necessary language info to go there.

This is quite different from RDF, where we have very fine granularity
and an already well-established language inheritance (used for plain
literals).
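
To illustrate what language inheritance means here, a minimal Python
sketch follows. The function name in_scope_lang and the
http://example.org/ namespace are assumptions made up for the example.
The sketch looks up the xml:lang value in scope for an element by
walking up its ancestors, and then shows that serializing the content
of the XML literal on its own no longer carries that information.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
EX = "{http://example.org/}"  # hypothetical vocabulary namespace

def in_scope_lang(root, target):
    """Return the xml:lang value inherited by target, or None."""
    parents = {child: parent for parent in root.iter() for child in parent}
    node = target
    while node is not None:
        if XML_LANG in node.attrib:
            # xml:lang="" cancels any inherited language.
            return node.attrib[XML_LANG] or None
        node = parents.get(node)
    return None

doc = ET.fromstring(
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns:ex="http://example.org/" xml:lang="en">'
    '<rdf:Description rdf:about="http://example.org/doc">'
    '<ex:title>Hello World</ex:title>'
    '<ex:note rdf:parseType="Literal">Hello <em>World</em>!</ex:note>'
    '</rdf:Description>'
    '</rdf:RDF>'
)

title = doc.find(f".//{EX}title")
note = doc.find(f".//{EX}note")

print(in_scope_lang(doc, title), in_scope_lang(doc, note))

# The literal's markup, serialized in isolation, has lost the language.
orphan = ET.tostring(note[0], encoding="unicode")
print("xml:lang" in orphan)
```

Both the plain literal and the XML literal inherit "en" in the
document, but only the plain literal keeps it if the XML literal is
reduced to its orphaned content.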


>This seems to be a comment about exc-c14n rather than RDF.
>
>>It:
>>- has never been accepted by the I18N WG (RDF Core agreed with that)
>
>agreed
>
>>- is against the XML 1.0 Recommendation
>
>in as much as exc-c14n is.

see above.

>>- is against the RDF Model and Syntax Recommendation
>
>M&S is somewhat vague, but I would concede this point.

M&S is somewhat vague in that it allows applications to
consider or ignore xml:lang. But it didn't say anything
about pick-and-choose.


>>- is against the recent RDF last calls
>
>yes.
>
>>- is the opposite of what happens with plain literals, and therefore
>>   highly confusing for users.
>
>depends on the application.
>I would suspect this is true for XHTML based XML literals, which I would 
>view as the main application.
>See below about confusion.
>
>>To make sure xml:lang is not thrown away for XML literals, there is
>>no need to change exclusive canonical XML.
>
>We lose xml:lang by using exc-c14n out of the box ... viz:
>[[
>attributes in the XML namespace, such as xml:lang and xml:space are not 
>imported into orphan nodes of the document subset
>]]
>
>Because of this, in the LC docs we had a complicated and confusing 
>work-around that involved putting the xml-literal inside an <rdf-wrapper> 
>tag, whose sole purpose was to hold the xml:lang attribute. It is 
>certainly less confusing to have ditched all of that.

If the only purpose of the wrapper was to hold the xml:lang attribute,
then I think a solution similar to the one for plain literals
should also work for XML literals.


>>As for plain literals,
>>xml:lang can be carried separately.
>
>This is current behaviour.

Yes, I know. Is there any reason that the same solution cannot
be used for XML literals, if it turns out that <rdf-wrapper> is
too clumsy?
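
What I have in mind could look roughly like the following Python
sketch. The class Literal and its field names are hypothetical and do
not come from any RDF specification or toolkit. The point is only that
the language tag is carried next to the lexical form, in the same way
for both kinds of literals, and takes part in equality.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Literal:
    """A literal value with its language carried separately."""
    lexical: str                 # character content (canonical form for XML)
    lang: Optional[str] = None   # in-scope xml:lang, if any
    is_xml: bool = False         # True only if the content is real markup

# Plain literal with language, as is current behaviour.
plain_en = Literal("Hello World", lang="en")

# XML literals: same lexical form, language kept alongside, no wrapper.
xml_en = Literal("Hello <em>World</em>!", lang="en", is_xml=True)
xml_fr = Literal("Hello <em>World</em>!", lang="fr", is_xml=True)

print(xml_en == xml_fr)
print(xml_en == Literal("Hello <em>World</em>!", lang="en", is_xml=True))
```

The generated equality compares all three fields, so two XML literals
with the same markup but different languages are different values, and
no wrapper element is needed in the lexical form.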



Regards,     Martin.

Received on Tuesday, 1 July 2003 14:36:23 UTC