Re: first pass parseType="Literal" text for primer from Martin Duerst on 2003-07-28 (w3c-rdfcore-wg@w3.org from July 2003)

From: Martin Duerst <duerst@w3.org>
Date: Mon, 28 Jul 2003 17:20:26 -0400
To: Graham Klyne <gk@ninebynine.org>, rdf core <w3c-rdfcore-wg@w3.org>, i18n <w3c-i18n-ig@w3.org>
Message-Id: <4.2.0.58.J.20030728151746.02542e38@localhost>

Hello Graham,

At 11:03 03/07/28 +0100, Graham Klyne wrote:

>[I'm going to try and respond to your message insofar as it affects 
>RDFcore deliberations first, and later I'll engage some of the other 
>points, so that disinterested parties can skip that.]
>
>At heart, I think the problem is something like this.  Consider some RDF 
>properties (your examples):
>
>1. <title>Why the &lt;FONT&gt; Tag is Bad</title>
>
>By my understanding, the *value* of this 'title' property is:
>
>   "Why the <FONT> Tag is Bad"
>
>I.e. the '<' and '>' are uninterpreted parts of the string value.

Yes.


>2. <title rdf:parseType='Literal'>Why the &lt;FONT&gt; Tag is Bad</title>
>
>I take the value of this 'title' property to be:
>
>   "Why the &lt;FONT&gt; Tag is Bad"^^rdf:XMLLiteral
>
>where the '&' characters are interpreted as XML character entities.

The '&' are interpreted as XML entity start characters (this sounds
more like an SGML term than an XML term, but anyway). The '&lt;'
is, or is expected to be, interpreted as a less-than character.
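
As a minimal sketch of this point, assuming Python's standard
xml.etree.ElementTree parser stands in for the XML layer of an RDF/XML
processor (the choice of parser is an assumption, not something this
thread specifies), the character data of example (1) comes out with the
escapes already resolved:

```python
import xml.etree.ElementTree as ET

# Example (1) from the message, as it would appear in RDF/XML.
element = ET.fromstring("<title>Why the &lt;FONT&gt; Tag is Bad</title>")

# The XML parser resolves the predefined entity references, so the
# character data holds literal '<' and '>' characters.
title_text = element.text
print(title_text)  # prints: Why the <FONT> Tag is Bad
```

This only shows what the XML parser hands on. How an RDF processor then
represents the value for parseType="Literal" is the question under
discussion.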


>(So, yes, 1 and 2 are two different things.  I believe this to be a 
>necessary distinction...)

Their representations are different. But why do their denotations
have to be different?


>3.  <title rdf:parseType='Literal'>The <strong>Best</strong>Cruise 
>Vacations for Dummies</title>
>
>Has the value
>
>   "The <strong>Best</strong>Cruise Vacations for Dummies"^^rdf:XMLLiteral
>
>Here, compared with case (1), the '<' and '>' characters are to be treated 
>as introducing XML markup.

Yes, of course.


>So, by your own examples, we have a need to represent simple strings like 
>(1) that contain XML markup characters that are not to be interpreted as 
>such.  On the other hand, we have examples like (2) and (3) where the XML 
>markup characters *are* to be interpreted.
>
>So to my "horse has bolted" comment... I think there is now a lot of 
>software and data out there that assumes that plain literals are simply 
>sequences of characters, without any provision for interpretation.  Where 
>'<' and '&' appear, they may be treated simply as characters, with no 
>attempt to interpret them.

Yes, correct.


>Within this scheme, I assert there is *no way* to require that XML 
>interpretation of these characters be applied without marking the 
>distinction, which is what the XML datatype provides.

But note that we are not speaking about changing the interpretation
of something by changing from plain literal to XML literal, we are
speaking about two different representations ((1) and (2)) that
could/should denote the same string of characters.


>So we find things like this:
>
>   <description>Note the lovely XHTML&amp;CSS in that one.</description>
>
>(which I've extracted from some actual RSS/RDF that I found on the web today)
>which is presented as:
>
>   "Note the lovely XHTML&CSS in that one."
>
>(I also found examples with '<' and '>' as part of the literal value, 
>uninterpreted.)

Very nice examples.


>The only way we're going to be able to represent this kind of data, *and* to 
>handle markup in the same uniform framework, is to completely revisit the 
>design of RDF literal data so that a lexical form is not just a sequence 
>of Unicode characters, and is self-denoting.  To change that would be a 
>late-stage fundamental change to the design with who-knows-what kinds of 
>repercussion.

There is no need to change this self-denotedness for plain literals.
And the denotation of XML Literals is somewhat under discussion
anyway.


>So we have a design that says you have to decide whether or not markup 
>characters in the lexical form are to be interpreted -- by presence of the 
>XMLLiteral datatype, signalled by parseType=Literal.  The proposed text 
>suggests a couple of migration paths:
>
>(1) For new data representing human-readable text, use the XMLLiteral 
>datatype.  If there is no markup, then fine.  If uninterpreted markup 
>characters are present, they must be escaped in the lexical form.  Markup 
>can be included and used to encode all kinds of I18N information, using 
>designs that are being developed with active input from I18N experts.

This seems plausible in theory, but doesn't take into account
the reality of distributed applications on the Web. One of the
very clear advantages of Web technology, including RDF, is that
it allows you to start small, and expand. Start small in this
case often means to start with plain literals.


>(2) Develop software that can appropriately handle both plain literals and 
>XML literals.  This seems a plausible approach where there are large 
>amounts of existing data (all those RSS feeds?).

This is definitely a plausible approach as long as plain literals
and XML literals are close enough in structure. But by removing
the language information from XML literals and forcing that to
be added internally to the XML Literal with a dummy tag, this
becomes very difficult.


>A third approach that I'll offer is to develop standard inference tools to 
>filter the data:  for properties known to have human-readable text values, 
>one might use a rule like:
>
>   ?a my:prop ?b .
>   WHERE plainLiteral(?b) =>
>     ?a my:prop XHTMLEncode(?b) .
>
>(This kind of capability could, I'm fairly sure, be easily added to a tool 
>like CWM, and I do expect to be able to do this kind of thing with the 
>inference tools I'm working on.)

This is another possibility. But what exactly is XHTMLEncode?
And what happens if your Ontology says that my:prop is functional?


>In conclusion, I find that I agree with your fundamental concerns as 
>illustrated by your examples, but not with the conclusions you draw.  If 
>the requirement for a seamless path from simple plain text to text with 
>markup had been articulated early in the process, we might have been able 
>to come up with a design for literals that met it.  There is RDF data out 
>there that uses markup characters in non-markup ways, which argues most 
>strongly against saying that markup characters in plain literals must be 
>interpreted as such (as do your own examples).  To change a fundamental 
>element of RDF design at this late stage would in my view carry a high 
>risk of introducing a new set of problems.

There is no need for any change in the fundamental design for plain
literals. It would be nice if we could change XML Literals so that
XML Literals without markup (except character escapes in their
usual representation) are equivalent to the corresponding plain
literal (i.e. after unescaping the escapes). Although we believe
that that is a reasonable interpretation of M&S (voiced, e.g.,
by one of its two coauthors), we acknowledge that we should have
become aware of this point earlier, at the latest at last call,
and we do therefore not plan to object on this point.
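
The equivalence wished for above can be sketched as follows. This is an
illustration only, assuming Python's xml.etree.ElementTree; the function
name plain_equivalent and the wrapper element are hypothetical, and
comments and processing instructions are outside the sketch's scope:

```python
import xml.etree.ElementTree as ET


def plain_equivalent(xml_lexical_form):
    """Return the plain string an XML literal would correspond to if it
    contains no element markup, otherwise None."""
    try:
        # Wrap the content so that bare text parses as a document.
        root = ET.fromstring("<wrapper>" + xml_lexical_form + "</wrapper>")
    except ET.ParseError:
        return None
    if len(root) != 0:
        # Child elements are present, so there is real markup.
        return None
    # The parser has already unescaped '&lt;', '&amp;' and so on.
    return root.text or ""


print(plain_equivalent("Why the &lt;FONT&gt; Tag is Bad"))
# prints: Why the <FONT> Tag is Bad
```

For the literal in example (3), which contains a <strong> element, the
function returns None because no plain literal corresponds to it.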


>And we do have a design that is less than seamless, but workable, with 
>respect to existing data, and which does provide a seamless option for new 
>RDF data designs.  Most of the problems you have described can, I believe, 
>be overcome by good implementation design.  I don't claim the current 
>design couldn't be improved, but I don't see that any of its flaws are fatal.

We claim that the current treatment of language information on XML
Literals is significantly less than seamless, and that the 'option'
for seamless new designs won't actually work out in practice.
Because the change of language information treatment has been
made after last call, and in explicit contradiction of prior
agreements, we continue to insist that this be changed back.


>...
>
>[Miscellaneous chit-chat follows - not especially relevant to WG business]
>
>At 12:46 27/07/03 -0400, Martin Duerst wrote:
>>Hello Graham,
>>
>>At 11:35 03/07/24 +0100, Graham Klyne wrote:

>>>My browser can't handle the extra characters,
>>
>>What browser do you have? I'm rather sure that it can handle
>>these characters (otherwise you have a really old and crappy
>>browser and should upgrade asap). The only thing you may need
>>to do is to install some fonts. Please contact me privately
>>for details.
>
>Mozilla 1.2.1

That's definitely a decent browser. On what OS?


>>So can you tell me how a tool would infer that
>>the XML Literal "<dummy xml:lang='en'>Moby Dick</dummy>"
>>and the plain literal "Moby Dick"@en are the same, if
>>'dummy' can be anything? It would be much easier to do
>>this if the XML Literal was "Moby Dick"@en^^XML
>>(or whatever the actual notation would be).
>
>The inference would need to know about the <dummy> property ... which is 
>why I suggest using inference rather than a one-off conversion tool.

How would inference know which <dummy> property to use?
Wouldn't things work much better if it didn't have to know
which <dummy> property to use?
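
To make the difficulty concrete, here is a rough sketch, assuming
Python's xml.etree.ElementTree, of a tool that ignores the wrapper
element's name and looks only at xml:lang. The function name
unwrap_language_wrapper is hypothetical, and whether ignoring the
element name is acceptable is exactly the open question:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def unwrap_language_wrapper(lexical_form):
    """Return (text, lang) if the literal is a single text-only element
    carrying xml:lang, whatever the element is called; else None."""
    try:
        root = ET.fromstring(lexical_form)
    except ET.ParseError:
        return None
    lang = root.get(XML_LANG)
    if lang is None or len(root) != 0:
        return None
    return (root.text or "", lang)


result = unwrap_language_wrapper("<dummy xml:lang='en'>Moby Dick</dummy>")
other = unwrap_language_wrapper("<span xml:lang='en'>Moby Dick</span>")
print(result)  # prints: ('Moby Dick', 'en')
```

Both wrappers give the same pair, so the comparison with "Moby Dick"@en
works only because the sketch decides on its own that the element name
carries no meaning.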


>>And now the RDF Core WG tries to solve that issue by claiming that
>>XML Literals are sequences of octets! Sorry, but I don't want to
>>call such a backwards layer violation progress.
>
>I don't particularly like that either, but in the circumstances we had a 
>choice of:
>
>(a) using the existing spec with as little adjustment as possible
>(b) inventing a new spec
>(c) adapting the existing spec, by adding additional interpretive text
>
>Choice (a) seemed reasonable at the time, given the requirements we were 
>considering.
>
>Personally, I find this whole issue of dealing with canonicalization to be 
>over-complicated and not particularly beneficial.  Even when two different 
>XML documents represent different infosets, they may still contain the 
>same information, so I don't see the benefit of C14N to be very 
>great.  But others clearly think this is worthwhile, and who am I to say 
>they're wrong?

I don't have anything against C14N per se. I just don't think that
something that was intended for things such as cryptography should
be reused in its entirety for abstract modeling.


Regards,    Martin.
Received on Monday, 28 July 2003 17:24:26 UTC