- From: Steven J. DeRose <sjd@ebt.com>
- Date: Tue, 14 Jan 1997 10:03:23 -0500
- To: Joe English <jenglish@crl.com>, w3c-sgml-wg@www10.w3.org
>I strongly suspect that most SGML declarations include
>"SHUNCHAR CONTROLS etc." only because few people ever
>write an SGML declaration from scratch; most just copy
>an existing one and modify it, leaving the parts they don't
>understand untouched. Since few people understand SHUNCHAR
>(I am not among them, BTW), that clause has survived largely
>unmutated from the ancestral declaration in ISO 8879.
>
>Let me ask the converse question: is there a rationale
>*for* specifying shunned characters?

So far as I can determine, SGML provides no semantics for shunned characters. All it says about them (other than defining the syntax for specifying them in the SGML declaration) is:

----------
4.297 shunned character (number): A character number, identified by a concrete syntax, that should be avoided in documents employing the syntax because some systems might erroneously treat it as a control character.

[note the 'should' here -- what happens if you *do* find one?]

----------
13.1.2 ... A shunned character must be identified as a non-SGML character, unless it is a significant SGML character.

NOTES:

a) For example, in figure 8, characters numbered 9, 10, and 13, which are shunned characters, are nevertheless not assigned as non-SGML characters because they are function characters.

[there is an interesting interaction here, since 10 and 13 are assigned to RS and RE, even though RS and RE are not characters in the system character set at all (they are internal constructs inserted by the entity manager, which may or may not have any relation to 10 and 13 as LF and CR in files. Only on DOS is the correspondence close, and there it is not exact).]

b) If the document uses two concrete syntaxes, the shunned characters of both are subject to this requirement.

----------
13.4.2 Shunned Character Number Identification ...

NONE means there are no shunned character numbers.

CONTROLS means that any character number that the system character set considers to be the coded representation of a control character, and not a graphic character, is a shunned character.

Each specified character number is identified as a shunned character number.

[this would be difficult to fulfill for a character set like DIS 10646, in which any 4-byte value with *any* single byte in the ranges 0-31 or 128-159 was considered a control character (hundreds of millions of distinct characters, if I remember correctly).]

NOTE - Character numbers in this parameter need not (and should not) be changed when a document is translated to another character set.

[this mystifies me. since the characters 'should' not occur, this clause 'should' be moot; but if they *do* occur, a translation would very likely need to change them (it might even put graphical characters where the prior character set had controls, in which case it *must* change them).]

---------

That's all I could find. Since there is no statement of what an SGML system should do if it does find such characters,
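
As a rough illustration of the one checkable requirement quoted above (13.1.2: a shunned character must be non-SGML unless it is significant), here is a minimal Python sketch. It assumes a single-byte system character set whose control characters are the numbers 0-31 and 128-159; the function names and the example sets are hypothetical and are not defined anywhere in ISO 8879.

```python
# Sketch only: names and example sets are illustrative assumptions,
# not anything specified by ISO 8879.

def control_numbers():
    """Numbers an assumed single-byte system set treats as control characters."""
    return set(range(0, 32)) | set(range(128, 160))


def check_shunned(shunned, non_sgml, significant):
    """Return shunned numbers that break 13.1.2.

    A shunned number is acceptable if it is declared non-SGML or if it is a
    significant SGML character (for example, a function character).
    """
    return sorted(n for n in shunned
                  if n not in non_sgml and n not in significant)


shunned = control_numbers()               # what CONTROLS would select here
significant = {9, 10, 13}                 # function characters, as in note (a)
non_sgml = shunned - significant - {12}   # deliberately omit 12 to show a failure

print(check_shunned(shunned, non_sgml, significant))  # [12]
```

The check says nothing about what a parser should do on meeting such a character in a document, which is exactly the gap discussed above.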
Received on Tuesday, 14 January 1997 10:06:02 UTC