Re: first pass parseType="Literal" text for primer from Graham Klyne on 2003-07-28 (w3c-rdfcore-wg@w3.org from July 2003)

From: Graham Klyne <gk@ninebynine.org>
Date: Mon, 28 Jul 2003 11:03:46 +0100
To: Martin Duerst <duerst@w3.org>, rdf core <w3c-rdfcore-wg@w3.org>, i18n <w3c-i18n-ig@w3.org>
Message-Id: <5.1.0.14.2.20030728093126.00b7ae40@127.0.0.1>
[I'm going to try and respond to your message insofar as it affects RDFcore 
deliberations first, and later I'll engage some of the other points, so 
that disinterested parties can skip that.]

At heart, I think the problem is something like this.  Consider some RDF 
properties (your examples):

1. <title>Why the &lt;FONT&gt; Tag is Bad</title>

By my understanding, the *value* of this 'title' property is:

   "Why the <FONT> Tag is Bad"

I.e. the '<' and '>' are uninterpreted parts of the string value.

2. <title rdf:parseType='Literal'>Why the &lt;FONT&gt; Tag is Bad</title>

I take the value of this 'title' property to be:

   "Why the &lt;FONT&gt; Tag is Bad"^^rdf:XMLLiteral

where the '&' characters are interpreted as introducing XML entity references.

(So, yes, 1 and 2 are two different things.  I believe this to be a 
necessary distinction...)

3.  <title rdf:parseType='Literal'>The <strong>Best</strong> Cruise 
Vacations for Dummies</title>

Has the value

   "The <strong>Best</strong> Cruise Vacations for Dummies"^^rdf:XMLLiteral

Here, compared with case (1), the '<' and '>' characters are to be treated 
as introducing XML markup.

So, by your own examples, we have a need to represent simple strings like 
(1) that contain XML markup characters that are not to be interpreted as 
such.  On the other hand, we have examples like (2) and (3) where the XML 
markup characters *are* to be interpreted.
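
As a rough illustration of the three cases above, here is a minimal Python sketch. The helper name literal_value is hypothetical, and the serialization is a simplification: a real RDF/XML parser applies exclusive canonicalization to parseType="Literal" content, which this does not attempt.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def literal_value(xml_text):
    """Return (lexical form, datatype) for one property element.

    Hypothetical helper: datatype is None for a plain literal and
    "rdf:XMLLiteral" when rdf:parseType="Literal" is present.
    """
    elem = ET.fromstring(xml_text)
    if elem.get(f"{{{RDF_NS}}}parseType") == "Literal":
        # Keep the content as markup: re-escape the leading text and
        # re-serialize each child element (including its tail text).
        parts = [escape(elem.text or "")]
        for child in elem:
            parts.append(ET.tostring(child, encoding="unicode"))
        return "".join(parts), "rdf:XMLLiteral"
    # Plain literal: the XML parser has already resolved the entity
    # references, so '<' and '>' are ordinary characters of the string.
    return elem.text or "", None

RDF_DECL = 'xmlns:rdf="%s"' % RDF_NS
case1 = literal_value("<title>Why the &lt;FONT&gt; Tag is Bad</title>")
case2 = literal_value(
    "<title %s rdf:parseType='Literal'>"
    "Why the &lt;FONT&gt; Tag is Bad</title>" % RDF_DECL)
case3 = literal_value(
    "<title %s rdf:parseType='Literal'>"
    "The <strong>Best</strong> Cruise Vacations for Dummies</title>" % RDF_DECL)
```

Cases (1) and (2) come out as different lexical forms with different datatypes, which is the distinction being argued for here.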

So to my "horse has bolted" comment... I think there is now a lot of 
software and data out there that assumes that plain literals are simply 
sequences of characters, without any provision for interpretation.  Where 
'<' and '&' appear, they may be treated simply as characters, with no 
attempt to interpret them.  Within this scheme, I assert there is *no way* 
to require that XML interpretation of these characters be applied without 
marking the distinction, which is what the XML datatype provides.

So we find things like this:

   <description>Note the lovely XHTML&amp;CSS in that one.</description>

(which I've extracted from some actual RSS/RDF that I found on the web today)
which is presented as:

   "Note the lovely XHTML&CSS in that one."

(I also found examples with '<' and '>' as part of the literal value, 
uninterpreted.)

The only way we're going to be able to represent this kind of data, *and* to 
handle markup in the same uniform framework, is to completely revisit the 
design of RDF literal data so that a lexical form is no longer just a 
self-denoting sequence of Unicode characters.  To change that would be a 
late-stage fundamental change to the design with who-knows-what kinds of 
repercussions.

So we have a design that says you have to decide whether or not markup 
characters in the lexical form are to be interpreted -- by presence of the 
XMLLiteral datatype, signalled by parseType=Literal.  The proposed text 
suggests a couple of migration paths:

(1) For new data representing human-readable text, use the XMLLiteral 
datatype.  If there is no markup, then fine.  If uninterpreted markup 
characters are present, they must be escaped in the lexical form.  Markup 
can be included and used to encode all kinds of I18N information, using 
designs that are being developed with active input from I18N experts.

(2) Develop software that can appropriately handle both plain literals and 
XML literals.  This seems a plausible approach where there are large amounts 
of existing data (all those RSS feeds?).

A third approach that I'll offer is to develop standard inference tools to 
filter the data:  for properties known to have human-readable text values, 
one might use a rule like:

   ?a my:prop ?b .
   WHERE plainLiteral(?b) =>
     ?a my:prop XHTMLEncode(?b) .

(This kind of capability could, I'm fairly sure, be easily added to a tool 
like CWM, and I do expect to be able to do this kind of thing with the 
inference tools I'm working on.)
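
To make the rule above concrete, here is a minimal Python sketch over an in-memory set of triples. It is not CWM and not any existing inference tool; the Literal class, xhtml_encode, and infer_xml_literals are hypothetical names, and language tags are ignored for brevity.

```python
from dataclasses import dataclass
from typing import Optional
from xml.sax.saxutils import escape

XML_LITERAL = "rdf:XMLLiteral"

@dataclass(frozen=True)
class Literal:
    lexical: str
    datatype: Optional[str] = None  # None marks a plain literal

def xhtml_encode(lit):
    """Escape markup characters so the text is a valid XML literal form."""
    return Literal(escape(lit.lexical), XML_LITERAL)

def infer_xml_literals(triples, text_props):
    """For each (?a, prop, ?b) with ?b a plain literal and prop known to
    carry human-readable text, add (?a, prop, XHTMLEncode(?b))."""
    inferred = set()
    for s, p, o in triples:
        if p in text_props and isinstance(o, Literal) and o.datatype is None:
            inferred.add((s, p, xhtml_encode(o)))
    return set(triples) | inferred

graph = {("ex:item", "my:prop",
          Literal("Note the lovely XHTML&CSS in that one."))}
closure = infer_xml_literals(graph, {"my:prop"})
```

The original plain-literal triple is kept and the XML literal form is added alongside it, so existing consumers of the data are not disturbed.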

In conclusion, I find that I agree with your fundamental concerns as 
illustrated by your examples, but not with the conclusions you draw.  If 
the requirement for a seamless path from simple plain text to text with 
markup had been articulated early in the process, we might have been able 
to come up with a design for literals that met it.  There is RDF data out 
there that uses markup characters in non-markup ways, which argues most 
strongly against saying that markup characters in plain literals must be 
interpreted as such (as do your own examples).  To change a fundamental 
element of RDF design at this late stage would in my view carry a high risk 
of introducing a new set of problems.

And we do have a design that is less than seamless, but workable, with 
respect to existing data, and which does provide a seamless option for new 
RDF data designs.  Most of the problems you have described can, I believe, 
be overcome by good implementation design.  I don't claim the current 
design couldn't be improved, but I don't see that any of its flaws are fatal.

...

[Miscellaneous chit-chat follows - not especially relevant to WG business]

At 12:46 27/07/03 -0400, Martin Duerst wrote:
>Hello Graham,
>
>At 11:35 03/07/24 +0100, Graham Klyne wrote:
>
>>At 14:39 23/07/03 -0400, Martin Duerst wrote:
>
>>>Many people designing 'RDF Applications' will start out with e.g.
>>><Title> being a plain literal. Later, they may discover that there
>>>are cases where they would need markup. But with the current design,
>>>they would have to go back and change all the <Title>s from plain
>>>literals to XML Literals. The way RDF is supposed to work, this
>>>will just not work out. So the needs for micro-markup, in particular
>>>for internationalization, will very sadly just be ignored if we
>>>don't change the design.
>>
>>While I can appreciate that having a seamless path from simple text to 
>>marked-up text would be nice, I feel that this particular horse has 
>>already bolted.
>
>What do you mean?

(see above)

>>I think there's a lot of RDF "out there" that is based on simple plain 
>>literals, which would be damaged if some plain text were to be 
>>reinterpreted as markup.
>
>Sorry, there is a very serious difference between plain text being
>reinterpreted as markup (which is a bad thing), and literals with
>markup being added alongside literals without markup in the same
>application.
>
>What we don't want to happen is the value of
>
><title>Why the &lt;FONT&gt; Tag is Bad</title>
>
>to suddenly be interpreted as XML (and therefore, in this case,
>become non-well-formed). What we want is to be able to add
>another title, e.g.
>
><title rdf:parseType='Literal'>The <strong>Best</strong>
>Cruise Vacations for Dummies</title>
>
>without having to go back and change all the previous titles to e.g.
>
><title rdf:parseType='Literal'>Why the &lt;FONT&gt; Tag is Bad</title>
>
>thereby creating all kinds of confusion because RDF applications
>have been told that
>    <title>Why the &lt;FONT&gt; Tag is Bad</title>
>and
>    <title rdf:parseType='Literal'>Why the &lt;FONT&gt; Tag is Bad</title>
>are two different things.

(I agree with this:  see above)

>>I'll also note that one RDF-based design, FOAF, has been used with 
>>Japanese names based simply on the current form of plain literals:
>>   http://kanzaki.com/docs/sw/foaf.html
>
>I have looked at that previously. The association between names
>and natural language is quite complex. For example, my last name
>is clearly German, but my first name is very international.
>
>Also, pronunciations (readings) are very important for Japanese names,
>but are usually given in separate fields (e.g. separate properties)
>rather than e.g. with Ruby Annotation markup. Such separate
>properties are currently still missing, but I had some discussions
>about them recently with Dan Brickley.
>
>So this is not really a good example to show the need for
>inline markup.


>>My browser can't handle the extra characters,
>
>What browser do you have? I'm rather sure that it can handle
>these characters (otherwise you have a really old and crappy
>browser and should upgrade asap). The only thing you may need
>to do is to install some fonts. Please contact me privately
>for details.

Mozilla 1.2.1

>>so I cannot comment how well it works.  There is some discussion of this at:
>>   http://rdfweb.org/pipermail/rdfweb-dev/2003-June/011202.html
>>
>>But the point I really wanted to make is that I think, in RDF, the 
>>migration need not be so painful.  Even if plain literals cannot handle 
>>all the markup, standard RDF inference tools should be able to recognize 
>>the simple form and infer a more flexible form as and when such is needed.
>
>So can you tell me how a tool would infer that
>the XML Literal "<dummy xml:lang='en'>Moby Dick</dummy>"
>and the plain literal "Moby Dick"@en are the same, if
>'dummy' can be anything? It would be much easier to do
>this if the XML Literal was "Moby Dick"@en^^XML
>(or whatever the actual notation would be).

The inference would need to know about the <dummy> property ... which is 
why I suggest using inference rather than a one-off conversion tool.
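
A sketch of what "knowing about <dummy>" could look like, in Python. The function as_plain_literal and the wrapper_tags parameter are hypothetical; the only assumption is that the tool has been told which element names are pure wrappers carrying nothing but xml:lang.

```python
import xml.etree.ElementTree as ET

# ElementTree's expanded name for the predefined xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def as_plain_literal(xml_lexical, wrapper_tags):
    """Map an XML literal such as <dummy xml:lang='en'>Moby Dick</dummy>
    to a (text, language) pair, or None if no equivalence can be inferred."""
    try:
        root = ET.fromstring(xml_lexical)
    except ET.ParseError:
        return None  # not a single well-formed element
    if root.tag not in wrapper_tags or len(root) > 0:
        return None  # unknown wrapper, or real inline markup inside
    lang = root.get(XML_LANG)
    if lang is None:
        return None
    return (root.text or "", lang)

match = as_plain_literal("<dummy xml:lang='en'>Moby Dick</dummy>", {"dummy"})
unknown = as_plain_literal("<dummy xml:lang='en'>Moby Dick</dummy>", {"span"})
marked_up = as_plain_literal(
    "<dummy xml:lang='en'>Moby <b>Dick</b></dummy>", {"dummy"})
```

Without the wrapper name being supplied, the function declines to infer anything, which is the limitation Martin points out.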

>>I grant that's not a good basis for designing all new applications, and 
>>recommendations of the kind Brian has suggested [1] should encourage the 
>>use of an appropriate form other than simple plain-text literals, such as 
>>using parseType=Literal, for any value that may reasonably need to be 
>>multilingual text.
>
>I think recommendations may help. But we need something better than that.
>If it's necessary for people to start out with XML Literals if they want
>to have a chance to at any time in the future use markup, then that's wrong.
>But the text currently being worked on seems to suggest exactly that.

It may not be ideal, but I don't see that it's "wrong".

>>I think a comprehensive handling of multilingual text may need features 
>>more comprehensive than just XML literals, and I think that datatyping 
>>provides the way forward.  I think we are here discussing the details of 
>>features which are not ultimately going to solve these problems, whatever 
>>choices we make.  I don't know what the final solution may look like, but 
>>I could imagine something like a "multilingual text" datatype whose 
>>lexical forms are a well thought-out framework for handling all manner of 
>>textual values, isolated (by RDF) from the kinds of problems that Pat 
>>raised in his "tiger by the tail" message [2].
>
>I find the discussion of further solutions interesting in its own right.
>I'm sure we would have been open for such discussions e.g. at the Tech
>Plenary in Cannes, or in future work.
>But I do not think it should be an excuse for arbitrary inconsistencies in
>the current design. It will be much easier to add new things for handling
>multilingual texts if the current design is clear and flexible.

I guess we see different "arbitrary inconsistencies".  My experience in 
software design leads me to believe that overloading a single design 
element to serve incompatible purposes inevitably leads to inconsistencies 
that have to be covered over somehow.  Better to factor out the design 
elements.

>I'll get back to Pat's message to answer some of his points, too.
>
>
>>I found some discussion from the original WG [3] where one of the 
>>alternative options seems to do with different namespaces,
>
>As I proposed this solution, I can just say that it isn't what you
>may think it is. It is clearly different from what Ralph mentions
>about namespaces in his mail [5]. It just proposed that instead of
>labeling each 'XML Literal' with parseType="Literal", one would simply
>look at the document and note that some of the namespaces were used for
>RDF (properties, e.g. Dublin Core,...), whereas others would be
>used in XML Literals (starting with XHTML,...). There would simply
>be a global declaration saying which namespaces would be used which
>way (and different prefixes could be used with the same namespace
>to make a distinction). Given the current climate against attributes
>with global consequences in RDF/XML, it may have been a good thing
>that we didn't go that way.
>
>
>>but that was problematic (apparently) due to interactions with other XML 
>>applications.  I observe that the RDF datatype mechanism provides a 
>>similar effect while avoiding those unwanted interactions, in that it 
>>provides a very specific way to say how the literal text is to be 
>>interpreted.  I think [4] is also worthy of review, in that it describes 
>>the problems being addressed, even if the solution proposed was not 
>>really workable.
>
>Yes indeed. Please note that although when we comment, we may propose
>solutions, we do not want to constrain the WG in charge on the actual
>solution taken.
>[In the current context, this means that we do not care whether a 'wrapper'
>solution is adopted, whether XML Literals are a 'third category', whether
>the inconsistencies pointed out in the semantics document are simply
>carefully fixed, or whether there is another solution.]
>
>
>>In summary, I think the issues of multilingual text representation should 
>>be distanced from the RDF core, not bound up with it, for the long-term 
>>advantage of both.
>
>Before discussing distancing or binding up, we would greatly appreciate if
>it were not messed up!

Well, that's why I think that "distancing" is a Good Thing... because it 
allows things to be fixed in one place without breaking something else.

>>(While I'm digging, [5] appears to be Ralph's original "parseType" 
>>proposal.  I note that this is a purely syntactic approach, as issues of 
>>what it actually represents are explicitly ducked.)
>
>And now the RDF Core WG tries to solve that issue by claiming that
>XML Literals are sequences of octets! Sorry, but I don't want to
>call such a backwards layer violation progress.

I don't particularly like that either, but in the circumstances we had a 
choice of:

(a) using the existing spec with as little adjustment as possible
(b) inventing a new spec
(c) adapting the existing spec, by adding additional interpretive text

Choice (a) seemed reasonable at the time, given the requirements we were 
considering.

Personally, I find this whole issue of dealing with canonicalization to be 
over-complicated and not particularly beneficial.  Even when two different 
XML documents represent different infosets, they may still contain the same 
information, so I don't see the benefit of C14N to be very great.  But 
others clearly think this is worthwhile, and who am I to say they're wrong?

#g
--

>>[3] http://lists.w3.org/Archives/Member/w3c-rdf-syntax-wg/1998Oct/0085.html
>>(member-only archive)
>>
>>[4] 
>>http://www.w3.org/International/Group/1998/10/NOTE-i18n-rev-rdfms-19981023
>>
>>[5] http://lists.w3.org/Archives/Member/w3c-rdf-syntax-wg/1998Oct/0064.html
>
>-------------------
>Graham Klyne
><GK@NineByNine.org>
>PGP: 0FAA 69FF C083 000B A2E9  A131 01B9 1C7A DBCA CB5E
Received on Monday, 28 July 2003 07:03:27 UTC