- From: pat hayes <phayes@ihmc.us>
- Date: Fri, 12 Sep 2003 10:03:52 -0700
- To: Brian McBride <bwm@hplb.hpl.hp.com>
- Cc: w3c-rdfcore-wg@w3.org
<snips everywhere>
<key point marked ***>

>>That is unacceptable right there. Applications may want to have
>>plain literals that are not XML, eg the Reuters applications where
>>literals are used to capture free-text paragraphs.
>
>I don't see a problem. XML can represent free-text paragraphs.

Well, XML is itself encoded as text, so I fail to see the importance of the distinction. The point is not that XML cannot *represent* free text, but that there is free text around which just plain *is* not XML. Examples are legion: they include for example the use of the less-than symbol to mean less than (ubiquitous in mathematics), the use of '<=>' to mean 'iff', and '=>' to mean 'implies' (both commonly used ASCII conventions) and the use of enclosing '< >' to indicate n-tuples (also ubiquitous in mathematics). It is just unacceptable to rule out such common and well-established uses of the very limited ASCII character set.

If I want to write 2<3, I do NOT want to have to write 2&lt;3. That, unlike the natural rendering of 'two is less than three', is gibberish. Now, if some processor wishes to encode my text that way, then fine; I have no problem with that, as long as my text gets reconstituted for human eyes at the other end. But I would very much like any RDF ontology which refers to my text to refer to the actual text, not to the gibberish. Some other processor might encode it differently, so the ontology should refer to the *actual* text; particularly if it's an ontology which is concerned with textual matters.

>>>The denotation of a plain literal remains - it is a sequence of
>>>unicode characters - permitting string comparison for equality
>>>testing.
>>
>>
>>?? So this amounts to a proposal to get rid of plain literals, in
>>effect, and to just not mention the 'XMLLiteral' type explicitly?
>
>Pretty much. Replace plain literals with XMLLiterals, since plain
>literals are a subset of XMLLiterals (modulo appropriate escaping).

?
But they aren't.
For example, "<<" and "2<3" are not legal XML. Most of the world's mathematical texts are not legal XML, particularly the ones encoded in some variety of TeX.

>I've tried to be careful not to describe it as a proposal. This is
>an alternative design. I'm not proposing it, just describing it.

Fair enough.

>>>
>>>Advantages:
>>>
>>>I think this provides everything that Martin has been asking for:
>>>
>>>  - no discontinuity between plain and xml literals
>>
>>
>>Indeed, but I do not want us to concede this point to Martin. He is
>>WRONG about this, and we should refuse to let him (or i18n) bully
>>us into conceding this issue. Plain text is not the same as XML
>>without markup; that view only makes sense in a completely
>>XML-centric view of the entire world of lexicography and notation.
>>Most of the world's languages and notations are not dialects of
>>XML. SCL, for one, is not XML without markup. Virtually every piece
>>of program code ever written is not XML without markup. The
>>mathematical statement (I quote) "2<3" is not XML without markup,
>>and it certainly isn't XML with markup. And "2&lt;3" is just
>>gibberish.
>
>"2&lt;3" is an encoding of "2<3".

It is *in XML*, right. I am tempted to say at this point, so what? It isn't in TeX, for example.

>I fear I haven't got this right in what I wrote, and I fear even
>more to get into model theory, however, what if:
>
>  "2&lt;3" denotes "2<3" in all interpretations.
>
>Now I have a problem with what does
>
>  "<br></br>" denote. Given the above, it can't be the xsd:string,
>undermining something I wrote earlier. Martin previously suggested
>it might denote something like seq(markup("<br>"), markup("</br>")).

I seem to have missed that suggestion. Is it in the archive? I would like to know more about what it means exactly. If those 'markup's are function calls, what is the range of the function? If 'seq' means a sequence, then where is the i18n advantage of using markup to encode R2L scripts like Hebrew or Arabic?
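
The well-formedness point above can be checked mechanically. What follows is a minimal Python sketch, not anything taken from the RDF or XML specifications: the helper name is_well_formed_content and the <wrapper> element are assumptions made purely for illustration. It uses only the standard library's xml.etree.ElementTree parser and the escape/unescape helpers from xml.sax.saxutils.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape


def is_well_formed_content(text: str) -> bool:
    """Return True if `text` parses as the content of a wrapper element.

    The <wrapper> element is a hypothetical container used only so that
    the parser has a single root element to work with.
    """
    try:
        ET.fromstring(f"<wrapper>{text}</wrapper>")
        return True
    except ET.ParseError:
        return False


plain = "2<3"
encoded = escape(plain)  # the XML encoding of the plain text

# The plain text is not well-formed XML content; its encoding is.
print(is_well_formed_content(plain))
print(is_well_formed_content(encoded))

# Parsing or unescaping the encoded form recovers the original text.
parsed_text = ET.fromstring(f"<wrapper>{encoded}</wrapper>").text
print(parsed_text == plain, unescape(encoded) == plain)

# Markup parses to structure (a child element), not to a character string.
root = ET.fromstring("<wrapper><br></br></wrapper>")
print(root.text, len(root), root[0].tag)
```

Under these assumptions the character string "2<3" and its encoding "2&lt;3" are different strings, only one of which is acceptable to an XML parser, while "<br></br>" comes back from the parser as a tree node and not as text.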
>I was trying to avoid getting into defining a new value space for
>literals to allow for distinguishing markup, hence the attempt to
>keep the denotation as strings. I think I'm beginning to see that
>does not work.

*** I've been struggling with this also. There is a real problem here, and to me it is the fatal objection to Martin's wanting text to be the non-markedup subset of XML. The issue is, what does XML denote? What *is* marked-up text, exactly? We already had this problem, as you know, and we resolved it by inventing 'XML values' which are roughly the same as XML infosets or Xpath nodesets, ie they are certainly highly structured objects, rather like a parse tree. In any case, they are certainly not pieces of text, or character strings.

I would like to pose a challenge to Martin. If he thinks that XML without markup is plain text, then can he please give us a (sketch of a) mathematically coherent account of what it is that XML text denotes, so defined that (1) it works for any piece of XML, (2) XML without markup denotes itself (ie a character string), (3) a well-defined piece of any piece of XML denotes some kind of coherently describable substructure of whatever the larger piece of XML denotes, and (4) "2&lt;3" denotes "2<3". ***

>Regarding not conceding this point to Martin, I think where Martin
>is coming from is something like the following view:
>
>  - simple sequences of characters are sufficient to represent all
>expressions in common western languages, but not in all languages.

Isn't that what Unicode is designed to handle? That is, sequences of Unicode characters *are* able to represent all expressions in all languages, I thought, even if the normal-form orderings do get a little complicated for some scripts.

>  - where simple sequences of characters are allowed in formal
>languages, such as programming languages and the like they are
>sufficient to represent expressions in common western languages but
>not in all languages.
The *formal* languages are international, and indeed are used within text of many non-western languages. They require new orthography, but they require that in western languages also. Algebra was invented in a non-western culture which used, and uses, a non-western orthography.

>This represents discrimination by the dominant technical community
>against non-western languages.

"Discrimination" my arse. This comment has so much pseudo-political cultural baggage attached to it that I so profoundly disagree with that I had better not comment further; but I will strongly resist any attempt to allow such a manifestly political agenda to pollute or warp a technical specification.

>  - to avoid such discrimination, wherever simple sequences of
>characters are allowed in a formal language, xml markup should be
>allowed so that expressions in other languages are also permitted.

Total crap. First, XML itself is encoded as a sequence of characters, so any problems with sequences will arise in any case. Second, "2&lt;3" is just as much gibberish in Japanese or Tongan as it is in English. Third, markup itself does not provide internationalization. Fourth, XML *is* encoded as a sequence of characters, so a plain literal can already encode XML (so what is the fuss about?); and fifth, to my mind most important, allowing text-with-markup introduces a lot more stuff as well as just being able to indicate things like glyph orderings. XML markup lets you incorporate datatype values into text, for one thing. So allowing code-with-markup *completely* alters the basic model of text as, well, text. Literals would now have to denote things that are nothing whatever like strings: the nearest model of what they denote would be something like infosets or Xpath nodesets; and it's not even clear that this would be enough, actually, if someone wants to get clever about normal forms or matching literals.

>[...]
>
>>
>>But that is INSANE.
>
>Really.
>As an implementor I can choose to represent literals using
>a sequence of f*rts in morse code if it suits my purpose, right?

You can choose, yes. But what is crazy here is that in this picture, we have implementors using a format which obviously means something that is obviously understood by everyone, but the formal story is that it is not allowed to mean that: it must mean something else; but that something else, miraculously, means what the implementation meant in the first place. Or well, maybe it does, if someone can ever come up with an account of what it actually means. So what have we gained by this weird dance around the blindingly obvious assumption that a string is a string is a string?

If your implementation allows things like "2<3" then it (the implementation) is going to break when it tries to encode markup in any case.

>[...]
>
>>
>>It completely destroys the idea of a plain literal.
>
>Replaces it with XML literal, but xml literals can represent
>anything a plain literal can represent (modulo lang tag).

COBOL can represent anything that a plain literal can represent. So why not propose using COBOL to encode plain literals?

Pat
--
---------------------------------------------------------------------
IHMC                    (850)434 8903 or (650)494 3973   home
40 South Alcaniz St.    (850)202 4416   office
Pensacola               (850)202 4440   fax
FL 32501                (850)291 0667   cell
phayes@ihmc.us          http://www.ihmc.us/users/phayes
Received on Friday, 12 September 2003 13:03:46 UTC