questions on XML sgml decl's charsets



Looking at the SGML decl in the appendix of the XML draft, I have
a few questions.  (I'll admit up front that if there is anything
that boggles my mind more than BOSs, it's character sets.)

The decl therein seems to define a document baseset from 0 to 255 
and a syntax baseset from 646 (isn't that from 0 to 255?).

Anyway, even though I cannot figure out document from syntax basesets
from charsets from descsets, it doesn't look like that SGML decl
allows characters above 255, so where does Unicode come in?  Am
I getting confused between encodings and character sets again or what?

In the syntax, I see we shun no characters whereas most sgml decls
shun some set of control characters at least.  What's the rationale
here for no shun characters?

The syntax baseset refers to some 1983 version of 646.  I note that
the recent "I18n-ization of HTML" RFC 2070 talks about changing the HTML 2.0
declaration's syntax character set declaration:

   Another change was made from the HTML 2.0 SGML declaration, in the
   belief that the latter did not express its authors' true intent. The
   syntax character set declaration was changed from ISO 646.IRV:1983 to
   the newer ISO 646.IRV:1991, the latter, but not the former, being
   identical with US-ASCII.

That document also shows a baseset/descset declaration of:

     BASESET "ISO Registration Number 177//CHARSET
              ISO/IEC 10646-1:1993 UCS-4 with implementation level 3
              //ESC 2/5 2/15 4/6"
     DESCSET  0   9     UNUSED
              9   2     9
              11  2     UNUSED
              13  1     13
              14  18    UNUSED
              32  95    32
              127 1     UNUSED
              128 32    UNUSED
              160 2147483486 160

which is UCS-4.  I gather from other sources that a UCS-2 declaration
might look something like:

  BASESET "ISO Registration Number 176//CHARSET
    ISO/IEC 10646-1:1993 UCS-2 with implementation level 3//ESC 2/5 2/15 4/5"

  DESCSET
      0    9      UNUSED
      9    2      9
      11   2      UNUSED
      13   1      13
      14   18     UNUSED
      32   95     32
      127  1      UNUSED
      128  32     32
      160  65376  160

but that's not what I see in the XML draft.
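
For what it's worth, here is a minimal Python sketch of one reading of those DESCSET rows: each row is taken as (first described number, how many numbers, first base set number or UNUSED). The function names are made up for illustration, and the single-row 8-bit table is a hypothetical stand-in for "0 to 255", not the actual declaration from the XML draft.

```python
def parse_descset(text):
    """Parse whitespace-separated DESCSET rows into (start, count, target) triples.

    target is None for UNUSED, otherwise the first base set number.
    """
    tokens = text.split()
    rows = []
    for i in range(0, len(tokens), 3):
        start, count, target = tokens[i:i + 3]
        rows.append((int(start), int(count),
                     None if target == "UNUSED" else int(target)))
    return rows


def lookup(rows, code):
    """Return the base set number for code, or None if UNUSED or not described."""
    for start, count, target in rows:
        if start <= code < start + count:
            return None if target is None else target + (code - start)
    return None


def highest_described(rows):
    """Highest number that maps to a base set character."""
    return max(start + count - 1
               for start, count, target in rows if target is not None)


# The UCS-2 rows as quoted above.
UCS2_ROWS = parse_descset("""
    0    9      UNUSED
    9    2      9
    11   2      UNUSED
    13   1      13
    14   18     UNUSED
    32   95     32
    127  1      UNUSED
    128  32     32
    160  65376  160
""")

# ASSUMPTION: a hypothetical one-row table covering only 0 to 255.
EIGHT_BIT_ROWS = parse_descset("0 256 0")

if __name__ == "__main__":
    for name, rows in (("UCS-2", UCS2_ROWS), ("8-bit", EIGHT_BIT_ROWS)):
        print(name, "highest described number:", highest_described(rows))
        print(name, "describes 0x20AC:", lookup(rows, 0x20AC) is not None)
```

On this reading, a table that stops at 255 has no row for anything higher, which is the mismatch I am asking about.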

And how does all this square with the ERCS stuff WG8 recently approved?

