Re: Request for charset clarification
Michael Sperberg-McQueen <U35395@UICVM.UIC.EDU> wrote:
>What an SGML declaration should do about the unassigned code points, I
>don't know. Is this where, after all this time, I actually get to (?
>have to?) learn what NONSGML and SHUNCHAR are supposed to mean?
Don't sweat SHUNCHAR; it doesn't really make sense--definitely doesn't
make sense in the new model. I predict it'll disappear.
NONSGML applies to character numbers (the numerals that occur within
numeric character references), not characters. Non-SGML character
numbers are those that are marked UNUSED in the document character
set parameter of the SGML declaration. They may be used in numeric
character references, but what character they represent must be
communicated to the system using some system-dependent means outside
of the document.
>While I'm on this general topic, may I ask anyone still reading this if
>you can recommend a good description of the character encoding model
>which I gather WG8 has now formulated and accepted? I hear people
>saying things about the SGML character-set declarations (e.g. that they
>have nothing to do with the way characters are represented in the
>document being parsed) which I don't believe I ever read in 8879, and
>I'd like a chance to catch up. I've looked on the WG8 server and
>Charles Goldfarb's server, without seeing anything that looked
>promising. Can anyone help me out on this? Many thanks.
See my paper "The SGML Character Model", in the SGML '96 Conference
Proceedings, pp 681ff.
Basically: The point of view to take is that an SGML parser is, in
the abstract, processing a stream of abstract characters--which are
of necessity represented somehow in a computer system, but SGML does
not prescribe how they must be represented. The document character
set maps abstract characters to integers so that the parser will
know what character is intended by a numeric character reference. In
the process, the characters to which numbers are assigned become
"SGML characters", the characters that are permitted in the document.
Other numbers are specified as non-SGML; they can be used in numeric
character references but what character they represent is not described
in the document.
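To make that concrete, here is a rough sketch in Python of how a parser
might resolve numeric character references against a document character
set. Everything in it is an assumption made for illustration: the names
DOC_CHARSET, system_lookup, resolve_char_ref and expand_numeric_refs are
hypothetical, the ASCII-like assignments are just an example, and the
fallback hook merely stands in for the "system-dependent means outside of
the document".

```python
import re

# Hypothetical document character set: character number -> abstract character.
# Any number missing from the table is treated as UNUSED, i.e. non-SGML.
DOC_CHARSET = {n: chr(n) for n in range(32, 127)}  # assumed ASCII-like assignments
DOC_CHARSET.update({9: "\t", 10: "\n", 13: "\r"})


def system_lookup(number):
    """Stand-in for the system-dependent means outside the document."""
    raise LookupError(f"no out-of-band meaning for non-SGML number {number}")


def resolve_char_ref(number, charset=DOC_CHARSET, fallback=system_lookup):
    """Return the abstract character that a character number denotes."""
    if number in charset:
        # An SGML character: the document itself says what it is.
        return charset[number]
    # A non-SGML number: the document does not say, so ask the system.
    return fallback(number)


def expand_numeric_refs(text, charset=DOC_CHARSET, fallback=system_lookup):
    """Replace each decimal reference of the form &#N; with its character."""
    return re.sub(
        r"&#(\d+);",
        lambda m: resolve_char_ref(int(m.group(1)), charset, fallback),
        text,
    )
```

With this table, expand_numeric_refs("A&#66;C") gives "ABC", while a
reference such as &#200; is resolved only if the caller supplies a
fallback that knows what number 200 means on that system.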
The document character set *may* describe the representation of
characters at the interface of the entity manager and parser--or
the two may use some other representation (possibly hard-wired). There
are pros and cons to both implementation strategies.
The document character set *may* also describe the storage representation
used to store the data from which entities are assembled by the entity
manager. Or the entity manager may be told by the storage system how
it is representing characters. Not-too-smart entity managers may not
be able to handle more than one storage representation, or may not be
able to handle a representation other than that defined by the document
character set.
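A small Python sketch of the storage side, again under stated
assumptions: read_entity and HardWiredEntityManager are hypothetical
names, and Python's built-in codecs stand in for whatever the storage
system reports about its representation. The point is only that the
parser sees the same abstract characters whichever storage
representation was used.

```python
def read_entity(raw, storage_encoding):
    """Hypothetical entity manager step: stored bytes -> abstract characters.

    storage_encoding is what the storage system says it used; it is not
    taken from the document character set.
    """
    return raw.decode(storage_encoding)


class HardWiredEntityManager:
    """A not-too-smart entity manager that knows one representation only."""

    ENCODING = "latin-1"  # assumed hard-wired choice, for illustration

    def read(self, raw, storage_encoding=None):
        if storage_encoding not in (None, self.ENCODING):
            raise ValueError(f"cannot handle {storage_encoding!r}")
        return raw.decode(self.ENCODING)


# The same abstract characters stored in two different representations.
text = "caf\u00e9"
as_latin1 = text.encode("latin-1")  # 4 bytes
as_utf8 = text.encode("utf-8")      # 5 bytes

same = read_entity(as_latin1, "latin-1") == read_entity(as_utf8, "utf-8")
```

Here same is True even though the two byte sequences differ, whereas
HardWiredEntityManager refuses anything but its one built-in
representation.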
Hope this helps