- From: David G. Durand <dgd@cs.bu.edu>
- Date: Sat, 19 Oct 1996 17:55:12 -0400
- To: W3C SGML Working Group <w3c-sgml-wg@w3.org>
At 2:56 PM 10/19/96, Michael Sperberg-McQueen wrote: >On Sat, 19 Oct 1996 15:15:27 -0400 David G. Durand said: >>>Instead of a declaration like >>> >>> <!ENTITY auml SDATA "[auml ]"> >> >>How do I declare a character like "TENGWAR VOWEL LETTER A" or "ME >>GLYPH GARDINER 201" (Middle Egyptian). It's just 宊 or >>something. >> >>My gut reaction to that is #$*#&$#?! > >I'm not sure what you're arguing here. Would > > <!ENTITY a.teng SDATA "[a.teng ]"> > <!ENTITY a.teng SDATA "[TENGWAR VOWEL LETTER A]"> > <!ENTITY a.teng SDATA "[U+5B8A]"> > >be (a) equally likely to be understood by all XML processors and (b) >materially less likely to elicit the reaction &expletive;? Yes, because I can be told, by the application, that "An unknown character was found in the data stream, here's its name." Otherwise, I get the information from the application: "An unknown character was found, here's its number (in an officially undefined section of unicode)". In one case, the name may help me to understand the reference, or at least give me a starting point for research, in the other, I have no further recourse. >If the ERB decides question C.5 the way I hope we will, then we can >at least say > > <!ENTITY auml "&u00E4;"><!-- auml = a umlaut (dec 228) --> > <!ENTITY a.teng "&u5B8A;"><!-- a.teng = Tengwar vowel A > (decimal 23434) --> Something like this is essential, but it addresses a different problem. The problem with the above is that hiding critical information in comments is not a good prescription for portability. >which I, for one, prefer, since I don't have to be logged on to a >system with a shell-level hex-to-decimal calculator, and it's >easier to proofread the declarations against Unicode documentation. > >(But to be honest I suspect that a.teng belongs at 57344 (U+E000) or >thereabouts in the Private Use Area.) (Sure, I just made the number up by hitting my numeric keypad) That sure doesn't help. U+E000 is almost, but not quite, utterly uninformative. Tengwar vowel mark A, may help a human, ig not an application. > >Come on, now. The question of including some entities automatically >is on the table, and has been addressed by several postings, most in >favor, as far as I can remember. The proposals I've heard so far are >(a) just lt, gt, and amp (possibly augmented by declarations for ' and >"), (b) the above plus ISOLat1, and (c) everything in HTML (which >version?). If you are hoping for automatic inclusion of Tengwar and >other similar scripts, I predict disappointment. Or even Middle >Egyptian, for that matter. If you think we should decide to include >them automatically without worrying about the form of their declaration, >then I respectfully but vigorously disagree. No, my point was that the set of possible character references is an open set, and that numbers are never preferable to names, absent a well-defined place to look up the number. (Private use does not solve that problem, since all I can do is look it up and be told that it's officially undefined.) Hey, I can point you to a unicode definition of where TENGWAR LETTER A might appear in the private use set. But even if I knew nothing of unicode that description is a lot more useful than any arbitrary code value. If we applied the proposed SDATA removal policy to programming languages we could replace annoying variable names with convenient numbers, and improve the bandwidth of programmers. If we have unique labels for things, why bother to make them meaningful? >On the whole, I don't see how retaining or excising SDATA has any >effect on the ease or difficulty of declaring things like TENGWAR >VOWEL A and ME GLYPH GARDINER 201 in useful ways. And so I don't >see how these examples bear upon the decision to do without SDATA >in version 1.0 of XML. A comment that the parser strips that tells me the name of a character is measurably less useful that an application available string containing the same information. And a lot more For instance, I might put URLs of .gif renditions of characters in my SDATA entities, instead of character names. I don't think that we should standardize the meaning of SDATA strings, but that we need the hook to attach meaningful data to undefined refrences. For instance, many of the less-common Asian characters might have meaningful string that could document the ideogram, even for someone who only has a JIS font. > extended metaphor about the promised land deleted... This is all very well but we do have to get to the promised land. I don't want to end up like the Red 'Lectroids in Buckaroo Banzai, whose triumphal return home was terminated with the words: "We are not in the eighth dimension. We are over New Jersey." If we could at least make it to Seattle, I'd be happier. -- David _________________________________________ David Durand dgd@cs.bu.edu \ david@dynamicDiagrams.com Boston University Computer Science \ Sr. Analyst http://www.cs.bu.edu/students/grads/dgd/ \ Dynamic Diagrams --------------------------------------------\ http://dynamicDiagrams.com/ MAPA: mapping for the WWW \__________________________ http://www.dynamicdiagrams.com/services_map_main.html
Received on Saturday, 19 October 1996 17:50:41 UTC