- From: Martin Duerst <duerst@w3.org>
- Date: Mon, 30 Jun 2003 13:09:41 -0400
- To: Jeremy Carroll <jjc@hplb.hpl.hp.com>, Graham Klyne <gk@ninebynine.org>
- Cc: Dan Connolly <connolly@w3.org>, w3c-i18n-ig@w3.org, "Ralph R. Swick" <swick@w3.org>, misha.wolf@reuters.com, Tim Berners-Lee <timbl@w3.org>, w3c-rdfcore-wg@w3.org, reagle@w3.org
Hello Jeremy, others,

At 13:41 03/06/30 +0100, Jeremy Carroll wrote:
>Graham Klyne wrote:
>
>>At 08:48 29/06/03 -0400, Martin Duerst wrote:
>>>Obviously, to find out whether it is text with markup or text
>>>without markup, one way is to look inside. Another way would be
>>>to disallow rdf:parseType='Literal' on pure text strings.
>>
>>I think this possibility was mentioned in our discussion, but rejected on
>>the grounds of invalidating some (much?) existing RDF, and also making
>>life much harder for RDF writers.
>
>An example application is one I have which has a form which permits the
>user to include xhtml markup. The value of this form becomes embedded
>within an RDF document inside an rdf:parseType="Literal" element.

So given that other people and applications will also contribute
to this data, what's the best solution:

1) To have the producer (your application) check whether there is
   markup or not, and leave out rdf:parseType="Literal" if there
   is none? [I agree that this is not a good solution, because
   it's against established practice.]
2) To have the RDF parser handle the fact that for plain text
   strings, sometimes there may be an rdf:parseType="Literal",
   and sometimes not?
3) To have some indication in the schema saying that only
   rdf:parseType="Literal" can be used here?
4) To dump the problem on 'higher level applications'?

In my view, the best solution is clearly 2).

By the way, I was just trying to check to what extent the actual
RDF Model and Syntax spec is expressing the fact that its authors
(or at least one of them, Ralph) thought that rdf:parseType="Literal"
without any actual markup is the same as a plain literal.
Here is what I have found:

    3. If E is an empty element (no content), v is the resource whose
       identifier is given by the resource attribute of E. If the content
       of E contains no XML markup or if parseType="Literal" is specified
       in the start tag of E then v is the content of E (a literal).
       Otherwise, the content of E must be another Description or
       container and v is the resource named by the (possibly implicit)
       ID or about of that Description or container.

This does not make any distinction WHATSOEVER between

<foo>literal text</foo>
and
<foo rdf:parseType="Literal">literal text</foo>

Also, the definition of Literal does not distinguish between
what's now called 'plain' and 'XML' literals:

    Literal
       The most primitive value type represented in RDF, typically a
       string of characters. The content of a literal is not interpreted
       by RDF itself and may contain additional XML markup. Literals are
       distinguished from Resources in that the RDF model does not permit
       literals to be the subject of a statement.

If you have found evidence to the contrary, please tell me.

>Martin:
>
>>>Can we please make sure that we separate syntax and semantics?
>>
>>I wasn't aware of conflating the two. This issue seems to be entirely
>>syntactic: is a sequence of Unicode characters used to represent an XML
>>document (and conforming to XML syntax) syntactically distinguished from
>>any other sequence of Unicode characters? (Hmmm... maybe the conflation
>>here is between concrete syntax and abstract syntax -- I'm thinking of
>>abstract syntax here.)
>>As for the rest of what you say, I really don't want to get into encoding
>>tricks here -- to me that is just another layer of complexity we don't
>>need, and as such should be left to implementers to deal with in their
>>own way. That is, if the string
>> "<a>Some text</a>"
>>is to be distinct from the XML document encoded as:
>> "<a>Some text</a>"
>>then we should just say so and deal with the consequences.
>
>
>The WG has taken such a position for quite a while now.
>This has been motivated by the needs of applications which produce XML
>output and have to escape the non-XML strings and to not escape the known
>XML content.

It is clear that applications need to know whether something is
markup, or is just characters that look like markup.

>>Personally, I don't think XML should have this distinguished status in
>>RDF. If it's really necessary to distinguish an XML document literal in
>>RDF, then why not use RDF facilities to do so? e.g.
>> <ex:XMLDocument>
>> <rdf:value rdf:parseType="Literal"><a>Some text</a></rdf:value>
>> </ex:XMLDocument>
>>as distinct from, say:
>> <ex:StringData>
>> <rdf:value rdf:parseType="Literal"><a>Some text</a></rdf:value>
>> </ex:StringData>
>
>
>Simply that this is not the design the WG took to last call. The design
>the WG took to last call had been examined by the RDFCore WG in detail,
>and had, at least at an earlier stage, been reviewed by the I18N WG.

I of course remember various discussions, in particular the one
in Cannes. But I do not at all remember that we ever might have
agreed to treating

<rdf:value rdf:parseType="Literal">Some text</rdf:value>
and
<rdf:value>Some text</rdf:value>

as two completely different things, and I don't know which
communication we might have had that might have given you the
impression we agreed to it. If you think we have indeed agreed
to this, please tell me when and how.

>The current phrasing in the editors draft defers to the term exclusive
>canonical XML:
>http://www.w3.org/TR/2002/REC-xml-exc-c14n-20020718/#def-exclusive-canonical-XML

Just before we forget it, at that place, 'exclusive canonicalization'
is defined as follows:

"The exclusive canonical form of a document subset is a physical
representation of the XPath node-set, as an octet sequence,
produced by the method described in this specification"

While the 'physical representation' may have been important for
the people working on digital signatures, it seems definitely
the wrong thing for RDF. I hope this can be fixed.

What is much more important, if using exclusive canonical XML
means that the xml:lang context of the XML literal in the
RDF document is ignored, then that's totally wrong. It:
- has never been accepted by the I18N WG (RDF Core agreed with that)
- is against the XML 1.0 Recommendation
- is against the RDF Model and Syntax Recommendation
- is against the recent RDF last calls
- is the opposite of what happens with plain literals, and
  therefore highly confusing for users.

To make sure xml:lang is not thrown away for XML literals, there
is no need to change exclusive canonical XML. As for plain literals,
xml:lang can be carried separately.

>Which is what it does, it treats the embedded XML as a special sort of
>literal value, i.e. a typed literal. This seems an entirely consistent and
>coherent position.
>
>
>>>What is important is that the same semantic things, i.e.:
>>>- Text (without markup or language information)
>>>- Text with language information (but no markup)
>>>- Text with markup (but no language info)
>>>- Text with markup and language information
>>>are in each of the above cases recognized as being the same rather
>>>than being split up in a number of different things based on some
>>>representational details. On top of that, recognizing the continuity
>>>between the four variants above and making it easy to deal with
>>>this continuity would be a definite plus.
>
>
>There is certainly more work that should be done in the area of language
>in the semantic web, for instance RDF Core has considered Tex Texin's comment
>
>http://lists.w3.org/Archives/Public/www-rdf-comments/2003JanMar/0460.html
>
>concerning language ranges and realized that at present we offer no
>solution - but that that problem was outside our current charter. So we
>have created a new postponed issue as described in:
>
>http://lists.w3.org/Archives/Public/www-rdf-comments/2003AprJun/0029.html
>
>This would address the first two of Martin's list - but not the issue of
>markup.

Maybe I wasn't clear enough above. What we are asking for is not
that RDF provide a mechanism so that all the following four can
be seen as one and the same thing.

1) Text (without markup or language information)
2) Text with language information (but no markup)
3) Text with markup (but no language info)
4) Text with markup and language information

What we are asking for is just that all syntactic artefacts that
fall within any single one of the above categories are treated the
same, i.e. that in addition to the four categories above, we don't
create any spurious additional ones.

>To me this looks like application space, in which semantic web application
>layers, that are currently not particularly subscribed in W3C documents,
>get to call the shots.

What you refer to, i.e. ignoring markup or ignoring (a suffix of)
a language tag *across* the categories above, can definitely go
into application space. What applications should not have to bother
with is spurious differences between what is one and the same thing,
i.e. *within* any of the four categories listed above.

>The difference between an XML document and related strings is complex, and
>probably goes beyond the bounds of what can be systematically defined.
>
>e.g.
>
>If we are searching for instances of the word "pot" which of the following
>bits of XML should count as a match:
>
>"<em>pot</em>"
>"<pot/>"
>"<eg eg:pot='h' xmlns:eg='http://eg.org/'/>"
>
>etc.

Good question. But if we are searching for 'pot' in the following
two examples:

<foo rdf:parseType='Literal'>pot</foo>
and
<foo>pot</foo>

would you ever expect an application to return one and not the other?

Regards,    Martin.
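
For illustration only, the rule 3 quoted from the RDF Model and Syntax
spec can be sketched in Python and applied to the two 'pot' examples
above. This is a rough sketch, not a conforming RDF parser: the helper
names parse_property and property_value are hypothetical, the resource
attribute is assumed to be written as rdf:resource, and nested
Description or container content is not resolved.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def parse_property(fragment):
    # Hypothetical helper: wrap the fragment so the rdf: prefix is declared.
    wrapped = f'<wrap xmlns:rdf="{RDF_NS}">{fragment}</wrap>'
    return ET.fromstring(wrapped)[0]


def property_value(element):
    """Simplified reading of the quoted rule 3 for a property element E."""
    parse_type = element.get(f"{{{RDF_NS}}}parseType")
    has_markup = len(element) > 0
    has_text = bool((element.text or "").strip())

    if not has_markup and not has_text:
        # Empty element: v is the resource named by the resource attribute
        # (assumed here to be rdf:resource).
        return ("resource", element.get(f"{{{RDF_NS}}}resource"))

    if parse_type == "Literal" or not has_markup:
        # v is the content of E (a literal), with or without markup.
        content = (element.text or "") + "".join(
            ET.tostring(child, encoding="unicode") for child in element
        )
        return ("literal", content)

    # Otherwise the content is a nested Description or container;
    # this sketch does not resolve it.
    return ("resource", None)


plain = property_value(parse_property("<foo>pot</foo>"))
marked = property_value(parse_property("<foo rdf:parseType='Literal'>pot</foo>"))
print(plain == marked)
```

Under this reading both property elements end up in the same branch
and carry the same content, so a search for 'pot' has no basis for
returning one and not the other.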
Received on Monday, 30 June 2003 14:54:15 UTC