- From: Michael Sperberg-McQueen <U35395@UICVM.UIC.EDU>
- Date: Wed, 27 Nov 96 20:00:39 CST
- To: W3C SGML Working Group <w3c-sgml-wg@w3.org>
On Wed, 27 Nov 1996 19:34:42 -0500 Terry Allen said: >What does [2] Character include that [3] BaseChar excludes? >Or, what is a short description in plain language of BaseChar? >Am I right in imagining that unassigned code points are >excluded from BaseChar? What is a correct SGML declaration >for XML? You stress that your query is purely for information and that private replies are fine, but since I can't imagine anyone not being fascinated by character-set issues, I'll reply to the list ... BaseChar is, very very roughly, the set of letters without diacritics, or, more precisely -- but not *wholly* precisely since I don't have the Unicode manual at home and have not memorized its terminology of character classification --, the set of alphabetics minus the set of combining forms. (Since the set subtraction was done by hand rather than by dbms manipulation of the Unicode data file, it needs checking. But the intent was Alpha \ Combining.) So 3 BaseChar excludes several things included in 2 Character: (a) punctuation, (b) numerals both decimal and others -- it's startling to realize Unicode has two sets of 'characters' devoted to Roman numerals in upper and lower case (looks like a serious case of legacy compatibility, but what legacy system has characters for i, ii, iii, iv, v, ...?), (c) ideographs, (d) combining characters, (e) various special-purpose things one can read about in the Unicode book, (f) code points unassigned in the current version of ISO 10646/Unicode, and (g) other stuff I have surely forgotten. A correct SGML declaration for XML ought to use the extended naming rules to specify name characters and name-start characters; it should also assign case properly to those characters which have case. (Bleagh.) There may be a canned declaration with the desired effect already in existence; I don't know. The one in the version of 14 November is, as has been pointed out variously, in error, since it declares ASCII, not 10646, as the document character set. (It got included because it was the one which actually worked on Jon's machine, but that was because he has an ASCII machine rather than a Unicode machine.) What an SGML declaration should do about the unassigned code points, I don't know. Is this where, after all this time, I actually get to (? have to?) learn what NONSGML and SHUNCHAR are supposed to mean? If anyone with an interest in Unicode is willing to confirm that my translation of the Unicode recommendations for identifier formation into SGML terms actually makes sense, I'd be very glad. In particular, I changed the way combining characters were used, in a way which may be harmless, but which won't be harmless if combining characters really have to be legal in identifiers after digits and other characters not in BaseChar. While I'm on this general topic, may I ask anyone still reading this if you can recommend a good description of the character encoding model which I gather WG8 has now formulated and accepted? I hear people saying things about the SGML character-set declarations (e.g. that they have nothing to do with the way characters are represented in the document being parsed) which I don't believe I ever read in 8879, and I'd like a chance to catch up. I've looked on the WG8 server and Charles Goldfarb's server, without seeing anything that looked promising. Can anyone help me out on this? Many thanks. -C. M. Sperberg-McQueen
Received on Wednesday, 27 November 1996 21:33:31 UTC