
Re: questions on XML sgml decl's charsets




>I strongly suspect that most SGML declarations include
>"SHUNCHAR CONTROLS etc." only because few people ever
>write an SGML declaration from scratch; most just copy
>an existing one and modify it, leaving the parts they don't
>understand untouched.  Since few people understand SHUNCHAR
>(I am not among them, BTW), that clause has survived largely
>unmutated from the ancestral declaration in ISO 8879.
>
>Let me ask the converse question: is there a rationale
>*for* specifying shunned characters?

So far as I can determine, SGML provides no semantics for shunned
characters. All it says about them (other than defining the syntax for
specifying them in the SGML declaration) is:

----------

4.297 shunned character (number): A character number, identified by a
concrete syntax, that should be avoided in documents employing the syntax
because some systems might erroneously treat it as a control character.


[note the 'should' here -- what happens if you *do* find one?]

----------

13.1.2 ...

A shunned character must be identified as a non-SGML character, unless it is
a significant SGML character.
  
NOTES:

a) For example, in figure 8, characters numbered 9, 10, and 13, which are
shunned characters, are nevertheless not assigned as non-SGML characters
because they are function characters.

[There is an interesting interaction here, since 10 and 13 are assigned to RS
and RE even though RS and RE are not characters in the system character set
at all. They are internal constructs inserted by the entity manager, which
may or may not have any relation to 10 and 13 as LF and CR in files. Only on
DOS is the correspondence close, and even there it is not exact.]

b) If the document uses two concrete syntaxes, the shunned characters of
both are subject to this requirement.
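
[The rule in 13.1.2 amounts to a set difference. Here is a minimal Python
sketch of it. The helper name required_non_sgml is hypothetical, and the
shunned set 0-31 is an assumption chosen for illustration; only the function
characters 9, 10, and 13 come from note (a).]

```python
# Hypothetical helper: not part of ISO 8879 or of any SGML toolkit.
def required_non_sgml(shunned, significant):
    """Return the shunned character numbers that must be declared non-SGML.

    Per 13.1.2, a shunned character must be identified as non-SGML
    unless it is a significant SGML character.
    """
    return sorted(set(shunned) - set(significant))


# Assumption for illustration only: the shunned set is character numbers 0-31.
shunned = range(32)

# From note (a): 9, 10, and 13 are function characters, hence significant.
function_chars = {9, 10, 13}

non_sgml = required_non_sgml(shunned, function_chars)
print(non_sgml)
```

[If a document used two concrete syntaxes, note (b) would mean passing the
union of both shunned sets as the first argument.]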

----------

13.4.2 Shunned Character Number Identification
...  
NONE means there are no shunned character numbers.

CONTROLS means that any character number that the system character set
considers to be the coded representation of a control character, and not a
graphic character, is a shunned character. Each specified character number
is identified as a shunned character number.

[This would be difficult to fulfill for a character set like DIS 10646, in
which any 4-byte value with *any* single byte in the ranges 0-31 or 128-159
was considered a control character (hundreds of millions of distinct
characters, if I remember correctly).]
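
[To make the byte-range rule concrete, here is a rough Python sketch of the
test as I have described it. The function name has_control_byte is
hypothetical, and the rule itself is my recollection of the DIS rather than a
quotation from it.]

```python
# Hypothetical predicate illustrating the rule as recalled above;
# it is not taken from the text of DIS 10646.
def has_control_byte(code):
    """True if any of the four bytes of a 32-bit value is in 0-31 or 128-159."""
    octets = code.to_bytes(4, "big")  # requires 0 <= code < 2**32
    return any(b <= 31 or 128 <= b <= 159 for b in octets)


# Every byte is outside both control ranges.
print(has_control_byte(0x20202041))

# The low byte is 0x0A, which falls in 0-31.
print(has_control_byte(0x2020200A))
```

[Under CONTROLS, every value for which this test is true would be a shunned
character number, which is why an explicit enumeration is impractical.]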

NOTE - Character numbers in this parameter need not (and should not) be
changed when a document is translated to another character set.

[This mystifies me. Since the characters 'should' not occur, this clause
'should' be moot; but if they *do* occur, a translation would very likely
need to change them (it might even put graphic characters where the prior
character set had controls, in which case it *must* change them).]

----------

That's all I could find. Since there is no statement of what an SGML system
should do if it does find such characters,