
Re: questions on XML sgml decl's charsets



>>I consider this a serious flaw in the spec.
>
>I think it's kind of unavoidable, since SGML character handling is such a
>mess (in practice) that I doubt we could find any 16 bit solution that
>would work on all systems -- possibly even any single declaration that
>would work on all systems. The elegant idea of a declaration as a document
>specific data specification has long been replaced by the ugly practice of
>the declaration as dependent on both processor and input, in my experience.

I should note that for HTML I18N (somewhat before BCTF) I took the
view that the document character set defines the *characters* that
are available to the parser, not the character *numbers* (i.e. the
character numbers are nothing but shorthand names for the actual
characters).

This works out well, because it doesn't constrain the parser
implementation by requiring that it represent things with bit
combinations of a given width.

><soapbox>Of course, this is a new way to see the basic point that dealing
>with character numbers at all is inherently fragile. We could solve this
>problem as well by structured use of SDATA (as we have made structured use
>of PI).</soapbox>

Part of the global glyph/character repository idea I have is
basically just that: you refer to characters by *name* rather than
by number.

>I'm being clueless again, but isn't there a way out for UTF-8 encoded
>files

I think it's important not to confuse encodings with coded character
sets. You should be able to have any encoding you like as input, and
all numeric character references should still refer to the same
character. For example, if you have &#13117; (SQUARE POINTO) in your
UTF-8 document, and you do a blind encoding conversion to Shift-JIS,
then that numeric character reference should still refer to SQUARE
POINTO.
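
To make that concrete, here is a rough sketch in Python. It assumes the
document character set is ISO 10646/Unicode, so a reference number maps
straight to a code point; resolve_refs is a made-up helper for
illustration, not part of any real parser. The reference itself is plain
ASCII, so the blind conversion leaves it untouched, and it resolves to
the same character whichever encoding the bytes arrived in.

```python
import re
import unicodedata


def resolve_refs(text):
    """Replace decimal numeric character references with their characters.

    Assumption: the document character set is Unicode, so the number in
    the reference is used directly as a code point.
    """
    return re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), text)


# A document with some katakana and a numeric character reference.
source = "ポイント: &#13117;"

# The same document in two encodings; the second is a blind conversion.
utf8_bytes = source.encode("utf-8")
sjis_bytes = utf8_bytes.decode("utf-8").encode("shift_jis")

# Decode each byte stream with its own encoding, then resolve references
# against the (shared) document character set.
from_utf8 = resolve_refs(utf8_bytes.decode("utf-8"))
from_sjis = resolve_refs(sjis_bytes.decode("shift_jis"))

print(utf8_bytes != sjis_bytes)          # the byte streams differ
print(from_utf8 == from_sjis)            # the resolved text does not
print(unicodedata.name(from_utf8[-1]))   # name of the referenced character
```

The encoding only decides how bytes become characters; the reference is
looked up after that step, which is why it cannot depend on the encoding.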

