Re: Test cases: XML Literal value space and exclusive canonicalization from Martin Duerst on 2003-08-04 (w3c-rdfcore-wg@w3.org from August 2003)

From: Martin Duerst <duerst@w3.org>
Date: Mon, 04 Aug 2003 10:55:01 -0400
To: Dave Beckett <dave.beckett@bristol.ac.uk>
Cc: www-rdf-comments@w3.org, pat hayes <phayes@ihmc.us>, Benja Fallenstein <b.fallenstein@gmx.de>, Jeremy Carroll <jjc@hplb.hpl.hp.com>, w3c-rdfcore-wg@w3.org, w3c-i18n-ig@w3.org, msm@w3.org
Message-Id: <4.2.0.58.J.20030804100402.04c9d8e8@localhost>

Hello Dave,

Many thanks for your quick and detailed reply!


At 12:12 03/08/04 +0100, Dave Beckett wrote:

>On Sun, 03 Aug 2003 17:36:46 -0400
>Martin Duerst <duerst@w3.org> wrote:
>
> > This message is prompted by some details in the recent discussion
> > about XML Literals between Pat Hayes and Benja Fallenstein.
> > I have tried to express this as much as possible as test cases.
> >
> >
> > There are three somewhat related issues:
> > A) Lexical space of XML Literals vs. allowed syntax in elements
> >     with rdf:parseType="Literal".
> > B) Allowed syntax with rdf:dataType="&rdf;XMLLiteral"
> > C) Context information for rdf:parseType="Literal"
> >
> > First to A):
> >
> > Two recent messages from Pat Hayes say that the lexical space
> > of XML Literals and the value space is in 1:1 correspondence:
> >
> > http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2003Aug/0026.html
> >  >>>>
> > "Note that the XML values of well-typed XML literals are in precise
> > 1:1 correspondence with the XML literal strings of such literals, but
> > are not themselves character strings."
> >  >>>>
> >
> > http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2003Jul/0452.html
> >  >>>>
> > The lexical-to-value mapping is a 1:1 mapping from the lexical space
> > onto the value space. The value of the lexical-to-value mapping
> >  >>>>
>
>Those are about questions in the RDF graph.
>
>The RDF graph is an abstract syntax of triples, and is separate
>from the RDF/XML syntax which is the concrete one.

Okay. I was trying to ask this because I assume that in all
cases except XML Literals, the syntax allowed in RDF/XML is
that defined by the lexical space of the datatype (modulo
XML character escaping). Is this the case?


> > This lets me ask the following test-based questions:
>
>Which you do, in the RDF/XML syntax; however, we
>usually pose questions about details of the graph in our
>test format for the graph, N-Triples.
>
> > Do the following two RDF/XML documents entail the same graph?

That seems to be an RDF/XML question.

>   <http://example.org/foo> <http://example.org/bar> "<br></br>"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral> .
>
>Same triple.

Okay, good.


>The "content of an element" is not in the graph (there are no elements
>in the abstract syntax) and is not the lexical form

I now understand that for XML Literals. What about all the other
datatype literals?

>(the concrete syntax
>has no lexical forms either, they are in the graph).
>
>Both RDF/XML examples are legal and give the same graph.
>
> > Also, please clarify, wherever necessary in the specs, that
> > the content of an element marked with rdf:parseType="Literal"
> > is not the literal value of the XML Literal, and make sure
> > that this is covered by an appropriate test case.
>
>Given that exc-C14N produces octets and we wanted a unicode string,

Yes, good point, for the abstract syntax, octets would be
highly confusing.


>I am rewriting that part of the RDF/XML syntax document.
>
>This particular part of <br/> exc-canonicalizing to octets equivalent to
>the Unicode "<br></br>" doesn't happen to be tested in our test cases,
>but we are not providing an exc-C14N test suite.  I can add it.

I agree that it would be a bad idea to try to provide an exc-C14N
test suite. I think it would be good to add an example like this
just to document how RDF/XML syntax, lexical value, and so on,
are related, and in particular, that they are not exactly the same.


>They are the same triple.  XML Canonicalization happens in mapping from
>the concrete syntax to the abstract.
>
>So that means there is no problem with A).

Very good, many thanks for the confirmation.
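
The point settled under A) can be illustrated with a short Python sketch.
It is only an approximation: Python's xml.etree.ElementTree.canonicalize
(Python 3.8 or later) implements Canonical XML 2.0, not the exclusive
canonicalization the RDF specs refer to, and the helper name lexical_form
is made up for this illustration. For a fragment as simple as <br/> it
yields the same string discussed above.

```python
import xml.etree.ElementTree as ET


def lexical_form(xml_fragment: str) -> str:
    """Return a canonical Unicode string for a small XML fragment.

    Stand-in only: this uses Canonical XML 2.0 from the standard library,
    not exclusive canonicalization (exc-C14N).
    """
    return ET.canonicalize(xml_fragment)


# Two different spellings in the concrete syntax ...
a = lexical_form("<br/>")
b = lexical_form("<br></br>")

# ... map to one and the same Unicode string.
assert a == b == "<br></br>"

# A canonicalizer that emits octets would give the UTF-8 encoding of that
# string; the abstract syntax wants the Unicode string itself.
octets = a.encode("utf-8")
assert octets.decode("utf-8") == a
```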


> > Now to B)

>(Aside: here and below, rdf:datatype is the correct term)

Oh, sorry, my mistake.

> > Now let's change this to:
> >
> > <rdf:Description>
> >    <eg:prop rdf:parseType="Literal"><br/></eg:prop>
> >    <eg:prop rdf:dataType="&rdf;XMLLiteral">&lt;br/></eg:prop>
> > </rdf:Description>
>
>Produces different triples in the graph, the lexical forms of them are
>the Unicode strings:
>   "<br></br>"
>   "<br/>"
>
> > Given the discussion under A), it seems to me that the most
> > plausible result of this is that the first line produces a
> > triple, but the second line is illegal, because the string
> > "<br/>" isn't canonicalized ...
>
>"Illegal" is vague.  It is legal XML, legal RDF/XML.  However,
>in the graph it might be an ill-formed XML literal (PatH will
>have the right term).

Okay. Are there examples of other 'ill-formed' literals in the
test suite? If yes, it may be appropriate to add this one.
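
To make the two properties from B) concrete, here is a rough Python sketch.
Several things in it are assumptions added so that the snippet parses on its
own: the namespace declarations on rdf:Description, the &rdf; entity written
out as the full namespace URI, and the attribute spelled rdf:datatype. The
function is_canonical is a hypothetical helper, and it relies on Canonical
XML 2.0 from the Python standard library as a stand-in for exclusive
canonicalization.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Assumed wrapper: declarations added and &rdf; expanded for self-containment.
DOC = f"""<rdf:Description xmlns:rdf="{RDF}" xmlns:eg="http://example.org/">
  <eg:prop rdf:parseType="Literal"><br/></eg:prop>
  <eg:prop rdf:datatype="{RDF}XMLLiteral">&lt;br/></eg:prop>
</rdf:Description>"""

root = ET.fromstring(DOC)
first, second = list(root)

# First property: the content is markup, so the XML parser sees a child
# element; an RDF/XML parser would canonicalize it on the way to the graph.
assert first.text is None and first[0].tag == "br"

# Second property: the content is character data; the string "<br/>" goes
# into the graph unchanged as the lexical form.
assert len(second) == 0 and second.text == "<br/>"


def is_canonical(lexical: str) -> bool:
    """Hypothetical check: does the string survive canonicalization unchanged?"""
    return ET.canonicalize(lexical) == lexical


assert is_canonical("<br></br>")      # what parseType="Literal" yields
assert not is_canonical(second.text)  # "<br/>" is not in canonical form
```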


> > As I have explained in
> > http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2003Jul/0410.html,
> > I would prefer to make rdf:dataType="&rdf;XMLLiteral" in
> > the RDF/XML syntax illegal, to make things easier for the
> > parser.
>
>We didn't want to require people to have an XML parser
>for handling RDF's abstract syntax so all the XML checking
>belongs in the mapping from RDF/XML to the triples.
>
>It might make sense to forbid rdf:datatype with the URI of rdf:XMLLiteral
>for the reason you give - to make things easier for the parser.  Do
>you feel it makes things easier for the user too?

Here are some thoughts I have gone through:
- It makes things somewhat different for software writing RDF/XML:
   It can't just write out all types with rdf:datatype. But this is
   probably a desirable effect.
- For users, there are really a lot of users out there, and it's
   not very easy to say in general. But in my view, it very much helps
   them understand XML Literals if they see these literals always
   at the same level of escaping. Most people seem to get confused
   very quickly with different escaping levels. Using only
   rdf:parseType='Literal' would mean that the basic escaping is
   the same in the RDF/XML syntax, in the abstract syntax, and,
   as far as I understand, in most implementations. I think
   that is a serious benefit.
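
The two escaping levels mentioned above can be shown with a small Python
sketch. The element name prop is a placeholder without namespaces or RDF
attributes, chosen only to keep the example short; it is not taken from any
spec or test case.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Lexical form of the XML literal as it appears in the graph.
literal = "<br></br>"

# With rdf:parseType="Literal" the markup is written at the same escaping
# level as in the graph.
as_parse_type = f"<prop>{literal}</prop>"

# With rdf:datatype the same literal has to be written as character data,
# which adds one level of escaping.
as_datatype = f"<prop>{escape(literal)}</prop>"
assert as_datatype == "<prop>&lt;br&gt;&lt;/br&gt;</prop>"

# The XML parser sees an element in the first case and plain text in the
# second; only the text round-trips to the literal without canonicalization.
assert ET.fromstring(as_parse_type)[0].tag == "br"
assert ET.fromstring(as_datatype).text == literal
```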


>If we do ban it, that would mean no problem with B), yes?

Yes, that means that problem B is gone.


> > The third issue, C), is about context information for
> > rdf:parseType="Literal". The following two test documents
> > illustrate the situation:
>
>What is context?

Sorry for not being clear enough. By context, I meant everything
outside the actual element content that represents the literal
value. In particular the xmlns:eg2="http://example.com/"
prefix declaration in the first example.


> > <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
> >           xmlns:eg="http://example.org/"
> >           xmlns:eg2="http://example.com/">
> >   <rdf:Description rdf:about="http://example.org/foo">
> >     <eg:bar rdf:parseType="Literal"><eg2:br/></eg:bar>
> >   </rdf:Description>
> > </rdf:RDF>
> >
> > <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
> >           xmlns:eg="http://example.org/">
> >   <rdf:Description rdf:about="http://example.org/foo">
> >     <eg:bar rdf:parseType="Literal"><eg2:br
> > xmlns:eg2="http://example.com/"></eg2:br></eg:bar>
> >   </rdf:Description>
> > </rdf:RDF>
> >
> > My reading of the current spec is that both examples produce
> > the same graph, and that the canonicalization (and therefore,
> > according to the discussion above, the literal value) of
> > the literal in the graph is:
> >
> > "<eg2:br xmlns:eg2="http://example.com/"></eg2:br>"
> >
> > If this is not true, please tell me what happens in the
> > above case.
>
>The whitespace is different in your examples and is significant.
>Assuming that is a mistake, then apart from that, both lexical values
>are as given above.

There is a linebreak instead of a space after <eg2:br in the
second <rdf:RDF> piece. Is that what you meant? This was introduced
by my mailer, which doesn't like long lines. Canonicalization should
turn that back into a space again. Whitespace is significant in
element content, for good reasons, but not inside start tags.

My understanding is that the XML Literal in both cases will
come out as:

"<eg2:br xmlns:eg2="http://example.com/"></eg2:br>"^^
<http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral>

(I have added a linebreak after ^^ just to make sure that
no other ones get added)

You seem to agree.
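
As a rough check of this reading, the following Python sketch canonicalizes
the element content of both variants: one with the eg2 prefix declared on an
outer element, one with it declared on the literal's own element and a
linebreak inside the start tag. The helper literal_from_content and the
wrapper element w are made up for the illustration, and the standard
library's Canonical XML 2.0 is used in place of exclusive canonicalization,
so this is a sketch under those assumptions rather than what an RDF/XML
parser has to do.

```python
import xml.etree.ElementTree as ET


def literal_from_content(content: str, in_scope: dict) -> str:
    """Canonicalize element content, given prefixes declared outside it.

    The content is wrapped in a throwaway element <w> that carries the
    in-scope declarations; the wrapper is stripped again afterwards.
    """
    decls = "".join(f' xmlns:{p}="{u}"' for p, u in in_scope.items())
    canonical = ET.canonicalize(f"<w{decls}>{content}</w>")
    # Declarations not used by <w> itself are not written on it.
    assert canonical.startswith("<w>") and canonical.endswith("</w>")
    return canonical[len("<w>"):-len("</w>")]


# Variant 1: eg2 is declared far away, on the outer rdf:RDF element.
first = literal_from_content(
    "<eg2:br/>",
    {"eg": "http://example.org/", "eg2": "http://example.com/"},
)

# Variant 2: eg2 is declared on the element itself; the linebreak inside
# the start tag is not significant.
second = literal_from_content(
    '<eg2:br\nxmlns:eg2="http://example.com/"></eg2:br>',
    {"eg": "http://example.org/"},
)

expected = '<eg2:br xmlns:eg2="http://example.com/"></eg2:br>'
assert first == second == expected
```

In both variants the namespace declaration ends up on the element that uses
it, which is the sense in which the context is internalized in the literal.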

> > This example shows that while in the literal value
> > (based on canonicalization), the context (in particular
> > namespace declarations) is internalized as described by
> > Pat, in the RDF/XML syntax, this does not have to be
> > the case.
>
>I don't understand this point or see what the problem is here.
>What document must we change to fix it?

My guess is that currently, no document needs to change.
But I wanted to make sure this was the case, and there were
no misunderstandings about canonicalization and context
(i.e. in an RDF/XML context, namespace prefix declarations
could be far away from the actual literals where they apply;
once canonicalized, that's no longer the case).


Regards,    Martin.
Received on Monday, 4 August 2003 11:50:48 UTC