Re: I18N Issue alternative: collapsing plain and xml literals from pat hayes on 2003-09-12 (w3c-rdfcore-wg@w3.org from September 2003)

From: pat hayes <phayes@ihmc.us>
Date: Fri, 12 Sep 2003 10:03:52 -0700
To: Brian McBride <bwm@hplb.hpl.hp.com>
Cc: w3c-rdfcore-wg@w3.org
Message-Id: <p06001f09bb864b3e6085@[192.168.1.2]>
<snips everywhere>
<key point marked ***>
>>
>>That is unacceptable right there. Applications may want to have 
>>plain literals that are not XML, eg the Reuters applications where 
>>literals are used to capture free-text paragraphs.
>
>I don't see a problem.   XML can represent free-text paragraphs.

Well, XML is itself encoded as text, so I fail to see the importance 
of the distinction. The point is not that XML cannot *represent* free 
text, but that there is free text around which just plain *is* not 
XML. Examples are legion: they include for example the use of the 
less-than symbol to mean less than (ubiquitous in mathematics), the 
use of '<=>' to mean 'iff', and '=>' to mean 'implies' (both commonly 
used ASCII conventions) and the use of enclosing '<  >' to indicate 
n-tuples (also ubiquitous in mathematics). It is just unacceptable to 
rule out such common and well-established uses of the very limited 
ASCII  character set.  If I want to write 2<3, I do NOT want to have 
to write 2&lt;3. That, unlike the natural rendering of 'two is less 
than three', is gibberish.  Now, if some processor wishes to encode 
my text that way, then fine; I have no problem with that, as long as 
my text gets reconstituted for human eyes at the other end. But I 
would very much like any RDF ontology which refers to my text to 
refer to the actual text, not to the gibberish. Some other processor 
might encode it differently, so the ontology should refer to the 
*actual* text; particularly if it's an ontology which is concerned 
with textual matters.
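
To illustrate the point about different processors encoding the same text differently, here is a minimal Python sketch. It assumes the standard-library `xml.sax.saxutils.escape` and `xml.etree.ElementTree` as stand-ins for "some processor"; the `decode` helper and the `wrapper` element name are hypothetical, introduced only for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The text the author actually wrote.
actual_text = "2<3"

# Two different, equally legal XML encodings of that same text.
entity_form = escape(actual_text)   # uses the predefined entity &lt;
char_ref_form = "2&#60;3"           # uses a numeric character reference

def decode(xml_fragment):
    """Hypothetical helper: recover the text from XML-encoded element content."""
    return ET.fromstring("<wrapper>" + xml_fragment + "</wrapper>").text

# The encodings differ from each other and from the actual text...
assert entity_form != char_ref_form != actual_text
# ...but both reconstitute to the same actual text at the other end.
assert decode(entity_form) == decode(char_ref_form) == actual_text
```

Under these assumptions, only the decoded string is stable across processors, which is the thing an ontology about the text would want to refer to.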

>>
>>>  The denotation of a plain literal remains - it is a sequence of 
>>>unicode characters - permitting string comparison for equality 
>>>testing.
>>
>>
>>?? So this amounts to a proposal to get rid of plain literals, in 
>>effect, and to just not mention the 'XMLLiteral' type explicitly?
>
>Pretty much.  Replace plain literals with XMLLiterals, since plain 
>literals are a subset of XMLLiterals (modulo appropriate escaping).

? But they aren't. For example, "<<" and "2<3" are not legal XML. 
Most of the world's mathematical texts are not legal XML, 
particularly the ones encoded in some variety of TeX.
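
The claim that these strings are not legal XML can be checked mechanically. The following is a rough sketch using Python's standard `xml.etree.ElementTree` parser; the function name `is_well_formed_content` and the `wrapper` element are assumptions made for the example, not part of any RDF or XML specification.

```python
import xml.etree.ElementTree as ET

def is_well_formed_content(text):
    """Return True if `text` can appear verbatim as XML element content."""
    try:
        ET.fromstring("<wrapper>" + text + "</wrapper>")
        return True
    except ET.ParseError:
        return False

# Plain text from the examples above: rejected by an XML parser.
assert not is_well_formed_content("2<3")
assert not is_well_formed_content("<<")

# Text without any special characters, and the escaped form: accepted.
assert is_well_formed_content("two is less than three")
assert is_well_formed_content("2&lt;3")
```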

>I've tried to be careful not to describe it as a proposal.  This is 
>an alternative design.  I'm not proposing it, just describing it.

Fair enough.

>>>
>>>Advantages:
>>>
>>>I think this provides everything that Martin has been asking for:
>>>
>>>   - no discontinuity between plain and xml literals
>>
>>
>>Indeed, but I do not want us to concede this point to Martin. He is 
>>WRONG about this, and we should refuse to let him (or i18n) bully 
>>us into conceding this issue.  Plain text is not the same as XML 
>>without markup; that view only makes sense in a completely 
>>XML-centric view of the entire world of lexicography and notation. 
>>Most of the world's languages and notations are not dialects of 
>>XML. SCL, for one, is not XML without markup. Virtually every piece 
>>of program code ever written is not XML without markup. The 
>>mathematical statement (I quote) "2<3" is not XML without markup, 
>>and it certainly isn't XML with markup.  And "2&lt;3" is just 
>>gibberish.
>
>"2&lt;3" is an encoding of "2<3".

It is *in XML*, right. I am tempted to say at this point, so what? It 
isn't in TeX, for example.

>  I fear I haven't got this right in what I wrote, and I fear even 
>more to get into model theory, however, what if:
>
>   "2&lt;3" denotes "2<3" in all interpretations.
>
>Now I have a problem with what does
>
>   "<br></br>" denote.   Given the above, it can't be the xsd:string, 
>undermining something I wrote earlier.  Martin previously suggested 
>it might denote something like  seq(markup("<br>"), markup("</br>")).

I seem to have missed that suggestion. Is it in the archive? I would 
like to know more about what it means exactly. If those 'markup's are 
function calls, what is the range of the function? If 'seq' means a 
sequence, then where is the i18n advantage of using markup to encode 
R2L scripts like Hebrew or Arabic?

>I was trying to avoid getting into defining a new value space for 
>literals to allow for distinguishing markup, hence the attempt to 
>keep the denotation as strings.  I think I'm beginning to see that 
>does not work.

***
I've been struggling with this also. There is a real problem here, and 
to me it is the fatal objection to Martin's wanting text to be the 
non-markedup subset of XML. The issue is, what does XML denote? What 
*is* marked-up text, exactly? We already had this problem, as you 
know, and we resolved it by inventing 'XML values' which are roughly 
the same as XML infosets or Xpath nodesets, ie they are certainly 
highly structured objects, rather like a parse tree. In any case, 
they are certainly not pieces of text, or character strings.

I would like to pose a challenge to Martin. If he thinks that XML 
without markup is plain text, then can he please give us a (sketch of 
a) mathematically coherent account of what it is that XML text 
denotes, so defined that
(1) it works for any piece of XML,
(2) XML without markup denotes itself (ie a character string),
(3) a well-defined piece of any piece of XML denotes some kind of 
coherently describable substructure of whatever the larger piece of 
XML denotes, and
(4)  "2&lt;3" denotes "2<3".

***
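
The gap between character strings and structured XML values can be seen in a small Python sketch. It uses `xml.etree.ElementTree` trees as a loose approximation of the 'XML values' mentioned above (an assumption: they are not the same thing as infosets or XPath nodesets), and the `parse` helper and `wrapper` element are hypothetical names.

```python
import xml.etree.ElementTree as ET

def parse(fragment):
    """Hypothetical helper: parse a fragment as the content of a wrapper element."""
    return ET.fromstring("<wrapper>" + fragment + "</wrapper>")

# Two different character strings...
long_form, short_form = "<br></br>", "<br/>"
assert long_form != short_form

# ...that parse to the same tree-like structure: one empty child element.
tree_a, tree_b = parse(long_form), parse(short_form)
assert [child.tag for child in tree_a] == ["br"]
assert ET.tostring(tree_a) == ET.tostring(tree_b)

# Markup-free content, by contrast, parses to a bare string with no children.
plain = parse("2&lt;3")
assert len(plain) == 0 and plain.text == "2<3"
```

In this sketch string equality and structural equality come apart as soon as markup appears, which is why conditions (2) and (3) of the challenge are hard to satisfy together.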

>Regarding not conceding this point to Martin, I think where Martin 
>is coming from is something like the following view:
>
>  - simple sequences of characters are sufficient to represent all 
>expressions in common western languages, but not in all languages.

Isn't that what Unicode is designed to handle? That is, sequences of 
Unicode characters *are* able to represent all expressions in all 
languages, I thought, even if the normal-form orderings do get a 
little complicated for some scripts.
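
As a small illustration of both halves of that remark, the Python sketch below uses the standard `unicodedata` module. The sample strings are chosen for the example only; the Hebrew word is stored as an ordinary sequence of code points, and the accented letter shows why normal forms matter when comparing strings.

```python
import unicodedata

# A Hebrew word as a plain sequence of Unicode characters, no markup needed.
# Each character carries its own right-to-left directionality property.
hebrew = "\u05e9\u05dc\u05d5\u05dd"
assert all(unicodedata.bidirectional(ch) == "R" for ch in hebrew)

# The same visible character can be spelled as two different sequences...
composed = "\u00e9"      # e with acute accent, one code point
decomposed = "e\u0301"   # e followed by a combining acute accent
assert composed != decomposed

# ...so string comparison is only reliable after normalization.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```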

>  - where simple sequences of characters are allowed in formal 
>languages, such as programming languages and the like they are 
>sufficient to represent expressions in common western languages but 
>not in all languages.

The *formal* languages are international, and indeed are used within 
text of many non-western languages. They require new orthography, but 
they require that in western languages also.  Algebra was invented in 
a non-western culture which used, and uses, a non-western orthography.

>  This represents discrimination by the dominant technical community 
>against non-western languages.

"Discrimination" my arse. This comment has so much pseudo-political 
cultural baggage attached to it that I so profoundly disagree with 
that I had better not comment further; but I will strongly resist any 
attempt to allow such a manifestly political agenda to pollute or warp a 
technical specification.

>  - to avoid such discrimination, wherever simple sequences of 
>characters are allowed in a formal language, xml markup should be 
>allowed so that expressions in other languages are also permitted.

Total crap. First, XML itself is encoded as a sequence of characters, 
so any problems with sequences will arise in any case. Second, 
"2&lt;3" is just as much gibberish in Japanese or Tongan as it is in 
English. Third, markup itself does not provide internationalization. 
Fourth, XML *is* encoded as a sequence of characters, so a plain 
literal can already encode XML (so what is the fuss about?); and 
fifth, to my mind most important, allowing text-with-markup 
introduces a lot more stuff as well as just being able to indicate 
things like glyph orderings. XML markup lets you incorporate datatype 
values into text, for one thing.  So allowing code-with-markup 
*completely* alters the basic model of text as, well, text. Literals 
would now have to denote things that are nothing whatever like 
strings: the nearest model of what they denote would be something 
like infosets or Xpath nodesets; and it's not even clear that this 
would be enough, actually, if someone wants to get clever about 
normal forms or matching literals.

>[...]
>
>>
>>But that is INSANE.
>
>Really.  As an implementor I can choose to represent literals using 
>a sequence of f*rts in morse code if it suits my purpose, right?

You can choose, yes. But what is crazy here is that in this picture, 
we have implementors using a format which obviously means something 
that is obviously understood by everyone, but the formal story is 
that it is not allowed to mean that: it must mean something else; but 
that something else, miraculously, means what the implementation 
meant in the first place. Or well, maybe it does, if someone can ever 
come up with an account of what it actually means. So what have we 
gained by this weird dance around the blindingly obvious assumption 
that a string is a string is a string? If your implementation allows 
things like "2<3" then it (the implementation) is going to break when 
it tries to encode markup in any case.

>[...]
>
>>
>>It completely destroys the idea of a plain literal.
>
>Replaces it with XML literal, but xml literals can represent 
>anything a plain literal can represent (modulo lang tag).

COBOL can represent anything that a plain literal can represent. So 
why not propose using COBOL to encode plain literals?

Pat
-- 
---------------------------------------------------------------------
IHMC	(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.	(850)202 4416   office
Pensacola			(850)202 4440   fax
FL 32501			(850)291 0667    cell
phayes@ihmc.us       http://www.ihmc.us/users/phayes
Received on Friday, 12 September 2003 13:03:46 UTC