Re: I18N Issue alternative: collapsing plain and xml literals from pat hayes on 2003-09-08 (w3c-rdfcore-wg@w3.org from September 2003)

From: pat hayes <phayes@ihmc.us>
Date: Mon, 8 Sep 2003 12:45:22 -0700
To: Brian McBride <bwm@hplb.hpl.hp.com>
Cc: Martin Duerst <duerst@w3.org>, "Richard Ishida" <ishida@w3.org>, i18n <w3c-i18n-ig@w3.org>, w3c-rdfcore-wg@w3.org
Message-Id: <p06001f03bb828c5fa46d@[192.168.1.2]>
(sorry, sent twice: forgot some of the CCs the first time).

>After discussing this informally over lunch, Danbri asked me to send 
>it to the list to make our consideration of it explicit.
>
>This is an alternative design for literals.  The  idea is to drop 
>the rdf:XMLLiteral datatype and allow plain literals to contain 
>markup.

Allow, or require? That is, if they happen to contain the symbol '<', 
is that *required* to be considered to be XML markup?

>  Two test cases illustrate:
>
><rdf:Description>
>   <eg:prop>foo &lt;br /&gt; bar</eg:prop>
></rdf:Description
>
>parses to:
>
>_:a eg:prop "foo &lt;br /&gt; bar" .

I take it that this is intended to illustrate that without the 
parseType, the literal string is rendered exactly as it is (?)
How about
<rdf:Description>
   <eg:prop><<</eg:prop>
</rdf:Description>

does that parse to

:_a eg:prop "<<" .

?


>
><rdf:Description>
>   <eg:prop rdf:parseType="Literal"><br /></eg:prop>
></rdf:Description>
>
>parses to:
>
>_:a eg:prop "foo <br></br> bar" .

?? Eh? Where do the foo and bar come from?

>
>The definition of a plain literal changes.  The lexical space of 
>plain literal becomes the lexical space of rdf:XMLLiteral, i.e. is 
>restricted to (the unicode representation of) canonicalised well 
>formed balanced xml markup.

That is unacceptable right there. Applications may want to have plain 
literals that are not XML, eg the Reuters applications where literals 
are used to capture free-text paragraphs.

>  The denotation of a plain literal remains - it is a sequence of 
>unicode characters - permitting string comparison for equality 
>testing.

?? So this amounts to a proposal to get rid of plain literals, in 
effect, and to just not mention the 'XMLLiteral' type explicitly?

>
>Advantages:
>
>I think this provides everything that Martin has been asking for:
>
>   - no discontinuity between plain and xml literals

Indeed, but I do not want us to concede this point to Martin. He is 
WRONG about this, and we should refuse to let him (or i18n) bully us 
into conceding this issue.  Plain text is not the same as XML without 
markup; that view only makes sense in a completely XML-centric view 
of the entire world of lexicography and notation. Most of the world's 
languages and notations are not dialects of XML. SCL, for one, is not 
XML without markup. Virtually every piece of program code ever 
written is not XML without markup. The mathematical statement (I 
quote) "2<3" is not XML without markup, and it certainly isn't XML 
with markup.  And "2&lt;3" is just gibberish.

>   - lang on mixed content
>   - no use of datatypes
>
>Disadvantages:
>
>- a bigger change than alternatives
>- builds XML into the core of the RDF model

Which is a fatal objection. If RDF does not outlive XML then it will 
have been a failure.

>- breaks current implementations (but see below)
>
>Ameliorating the Disadvantages - implementation strategy
>
>The above design says that e.g. "<" is not in the lexical space of 
>plain literals

Wait a minute. If plain literals do not have a datatype, then this 
'lexical space' terminology is meaningless. Are they typed or not?

>, and many (all?) current implementations will store
>"<" in their representation of a graph.  The point to note is that 
>implementations are free to represent literals any way they please. 
>Thus "<" is just the way this implementation represents the literal 
>"&lt;".

But that is INSANE. Here I am wanting to write an RDF ontology about 
mathematical expressions, say, and I want to refer to pieces of 
mathematical text like "2<3" (two is less than three). But now I 
can't do that because this is short for "2&lt;3", which is completely 
meaningless.

The basic point is that the RDF machinery is not intended to be 
restricted to only REFER to XML text. It is required to be encodable 
in XML, but it is that already. This proposal makes it impossible to 
refer to non-XML text.

>The implementation does need to distinguish between markup and plain text.

No, the SEMANTICS needs to distinguish them.

>To do this, it adds a single bit to literals to indicate whether 
>they are stored in escaped or unescaped form.  The above example was 
>in unescaped form, which cannot represent markup.  To represent 
>markup, the literal must be be stored in escaped form.
>
>Literal comparison becomes more complex - literals stored in 
>unescaped form should first be escaped and then canonicalized. 
>Various optimization strategies can be employed here.
>
>By this strategy, It may be possible to argue that this approach 
>does not break current implementations of plain literals.  It simply 
>makes clearer what xml literals are.

It completely destroys the idea of a plain literal.

I think that my 'wet fish' proposal is better than this, if we have 
to accede to Martin.  I would prefer not to accede, since Martin has 
not responded to the technical objections adequately, and has not 
given actual technical arguments for his requests. They amount to 
statements of opinion about the proper role of XML in semiotics, 
opinions with which one may have legitimate disagreement.

Pat
-- 
---------------------------------------------------------------------
IHMC	(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.	(850)202 4416   office
Pensacola			(850)202 4440   fax
FL 32501			(850)291 0667    cell
phayes@ihmc.us       http://www.ihmc.us/users/phayes
Received on Monday, 8 September 2003 15:45:17 UTC