- From: Dave Peterson <davep@acm.org>
- Date: Wed, 27 Nov 1996 22:25:00 -0500
- To: w3c-sgml-wg@w3.org
Michael Sperberg-McQueen <U35395@UICVM.UIC.EDU> wrote: >What an SGML declaration should do about the unassigned code points, I >don't know. Is this where, after all this time, I actually get to (? >have to?) learn what NONSGML and SHUNCHAR are supposed to mean? Don't sweat SHUNCHAR; it doesn't really make sense--definitely doesn't make sense in the new model. I predict it'll disappear. NONSGML applies to character numbers (the numerals that occur within numeric character references), not characters. Non-SGML character numbers are those that are marked UNUSED in the document character set parameter of the SGML declaration. They may be used in numeric character references, but what character they represent must be communicated to the system using some system-dependent means outside of the document. >While I'm on this general topic, may I ask anyone still reading this if >you can recommend a good description of the character encoding model >which I gather WG8 has now formulated and accepted? I hear people >saying things about the SGML character-set declarations (e.g. that they >have nothing to do with the way characters are represented in the >document being parsed) which I don't believe I ever read in 8879, and >I'd like a chance to catch up. I've looked on the WG8 server and >Charles Goldfarb's server, without seeing anything that looked >promising. Can anyone help me out on this? Many thanks. See my paper "The SGML Character Model", in the SGML '96 Conference Proceedings, pp 681ff. Basically: The point of view to take is that an SGML parser is, in the abstract, processing a stream of abstract characters--which are of necessity represented somehow in a computer system, but SGML does not prescribe how they must be represented. The document character set maps abstract characters to integers so that the parser will know what character intended by a numeric character reference. In the process, the characters to which numbers are assigned become "SGML characters", the characters that are permitted in the document. Other numbers are specified non-SGML; they can be used in numeric character references but what character they represent is not described in the document. The document character set *may* describe the representation of characters at the interface of the entity manager and parser--or they may use some other representation (possibly hard-wired). There are pros and cons to both implementation strategies. The document character set *may* also describe the storage representation used to store the data from which entities are assembled by the entity manager. Or the entity manager may be told by the storage system how it is representing characters. Not-to-smart entity managers may not be able to handle more than one storage representation, or may not be able to handle a representation other than that defined by the document character set. Hope this helps Dave Peterson SGMLWorks! davep@acm.org
Received on Wednesday, 27 November 1996 22:26:35 UTC