Re: Request for charset clarification from Michael Sperberg-McQueen on 1996-11-28 (w3c-sgml-wg@w3.org from November 1996)

From: Michael Sperberg-McQueen <U35395@UICVM.UIC.EDU>
Date: Wed, 27 Nov 96 20:00:39 CST
To: W3C SGML Working Group <w3c-sgml-wg@w3.org>
Message-Id: <199611280233.VAA19611@www10.w3.org>
On Wed, 27 Nov 1996 19:34:42 -0500 Terry Allen said:
>What does [2] Character include that [3] BaseChar excludes?
>Or, what is a short description in plain language of BaseChar?
>Am I right in imagining that unassigned code points are
>excluded from BaseChar?  What is a correct SGML declaration
>for XML?

You stress that your query is purely for information and that private
replies are fine, but since I can't imagine anyone not being fascinated
by character-set issues, I'll reply to the list ...

BaseChar is, very very roughly, the set of letters without diacritics,
or, more precisely -- but not *wholly* precisely since I don't have the
Unicode manual at home and have not memorized its terminology of
character classification --, the set of alphabetics minus the set of
combining forms.  (Since the set subtraction was done by hand rather
than by dbms manipulation of the Unicode data file, it needs checking.
But the intent was Alpha \ Combining.)  So 3 BaseChar excludes several
things included in 2 Character:  (a) punctuation, (b) numerals both
decimal and others -- it's startling to realize Unicode has two sets of
'characters' devoted to Roman numerals in upper and lower case (looks
like a serious case of legacy compatibility, but what legacy system has
characters for i, ii, iii, iv, v, ...?), (c) ideographs, (d) combining
characters, (e) various special-purpose things one can read about in the
Unicode book, (f) code points unassigned in the current version of ISO
10646/Unicode, and (g) other stuff I have surely forgotten.

A correct SGML declaration for XML ought to use the extended naming
rules to specify name characters and name-start characters; it should
also assign case properly to those characters which have case.
(Bleagh.)  There may be a canned declaration with the desired effect
already in existence; I don't know.  The one in the version of 14
November is, as has been pointed out variously, in error, since it
declares ASCII, not 10646, as the document character set.  (It got
included because it was the one which actually worked on Jon's machine,
but that was because he has an ASCII machine rather than a Unicode
machine.)

What an SGML declaration should do about the unassigned code points, I
don't know.  Is this where, after all this time, I actually get to (?
have to?) learn what NONSGML and SHUNCHAR are supposed to mean?

If anyone with an interest in Unicode is willing to confirm that my
translation of the Unicode recommendations for identifier formation into
SGML terms actually makes sense, I'd be very glad.  In particular, I
changed the way combining characters were used, in a way which may be
harmless, but which won't be harmless if combining characters really
have to be legal in identifiers after digits and other characters not in
BaseChar.

While I'm on this general topic, may I ask anyone still reading this if
you can recommend a good description of the character encoding model
which I gather WG8 has now formulated and accepted?  I hear people
saying things about the SGML character-set declarations (e.g. that they
have nothing to do with the way characters are represented in the
document being parsed) which I don't believe I ever read in 8879, and
I'd like a chance to catch up.  I've looked on the WG8 server and
Charles Goldfarb's server, without seeing anything that looked
promising.  Can anyone help me out on this?  Many thanks.

-C. M. Sperberg-McQueen
Received on Wednesday, 27 November 1996 21:33:31 UTC