Re: Comments on 31 March spec

On Thu, 3 Apr 1997 08:25:59 -0500 Martin Bryan said:
>In 1.5, why is production [2] the only one to use the two character
>forms of character references (e.g. #x0d) rather than the 4 character
>form (e.g.  #x00ad) used elsewhere?

Because production 2 is also the only production to need hex character
references of *more* than four digits, and it seemed (to me, at least)
pointless to normalize 0d to 000d when we could not normalize 7FFFFFFF
to a four-digit value.  By the definition of the notation, #x0d and
#x000d (and #xd!) are all synonymous, but if everyone who sees this
assumes we must be trying to convey something of occult significance,
we might as well normalize to 000d etc.
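A minimal sketch in Python of the point about leading zeros, assuming the
'#xN' notation simply names the integer value of the hex digits N. The
helper name hex_ref_value is hypothetical and is not taken from the spec.

```python
def hex_ref_value(ref: str) -> int:
    """Return the code point named by a '#xN' form (hypothetical helper)."""
    if not ref.startswith("#x"):
        raise ValueError("expected the '#xN' form")
    return int(ref[2:], 16)

# Leading zeros do not change the value, so these three forms agree.
forms = ["#xd", "#x0d", "#x000d"]
values = {hex_ref_value(f) for f in forms}

# A value that needs eight hex digits cannot be padded down to four.
big = hex_ref_value("#x7FFFFFFF")
needs_more_than_four_digits = big > 0xFFFF
```

All three short forms collapse to a single value, while 7FFFFFFF has no
four-digit spelling at all.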

>It should be made clear in 2.7 that there is no way in which you can
>enter ]]> in a CDATA section as ]]> will only be recognized outside
>of such sections.

Isn't this clear enough already?  It does follow directly from the
explicit statement that only ']]>' is recognized.  We cannot hope to
list explicitly every consequence of every rule; there needs to be some
reliance on having a reader capable of seeing that if only CDEnd is
recognized, then Reference is not recognized (particularly since this
inference is confirmed by the parenthetical remark about lt and amp).
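A small Python sketch of that inference, using the standard library's
xml.etree.ElementTree parser on a present-day XML document rather than
the 31 March draft syntax. The helper cdata_content is hypothetical and
only illustrates that the first ']]>' closes the section.

```python
import xml.etree.ElementTree as ET

# Inside a CDATA section, references and tags are not recognized:
# the text comes back exactly as typed.
doc = "<a><![CDATA[&lt; &amp; <b>]]></a>"
text = ET.fromstring(doc).text

def cdata_content(s: str) -> str:
    """Return the content of the first CDATA section in s (hypothetical helper)."""
    start = s.index("<![CDATA[") + len("<![CDATA[")
    end = s.index("]]>", start)  # the first ']]>' always ends the section
    return s[start:end]

# An attempt to put ']]>' inside the section just ends it early.
truncated = cdata_content("<![CDATA[a ]]> b]]>")
```

Since nothing but ']]>' is recognized, there is no escape mechanism that
could smuggle a literal ']]>' into the section.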

>In 2.8 the second paragraph ends with a hanging sentence, viz:

Not in my copy.  Are you sure you're not falling victim to the
Netscape bug that displays some lines as white space if you scroll
by small increments?

>In 3.3 the sentence reading:
>> At user option, an XML processor may issue a warning if
>> attributes are declared for an entity type not itself declared, but
>> this is not an error.
>should have "entity type" changed to "element

Thank you.  My typo.

>For 3.4, under what circumstances is SkipLit valid if ignored marked
>sections can only contain complete markup declarations?

Where does it say that ignored marked sections can contain only complete
markup declarations?  8879 doesn't say that, and neither do we -- if we
did, that would suggest that a validating parser would have to parse
ignored sections completely, which I think we don't want to do.  The
rule for ignored sections has, I concede, become rather involved as a
result of trying to ensure that all conditional sections -- and the DTD
-- begin and end at the same locations, regardless of the values of
their controlling parameter entities.  I think the current rules achieve
that end; I don't think any simpler rules do.  Thanks to James for
working it out (and for providing the example that illustrated the
very real danger).

>For 4.3.3 shouldn't a statement be added that EncodingPI must be encoded
>in UTF-8 or be preceded by #xFEFF if encoded in UCS-2? (Allowing it to
>be encoded in any other way would give interoperability problems.)

I'm not sure I follow.  Para 2 of 4.3.3 begins "Entities encoded in
UCS-2 must begin with the Byte Order Mark ...", so it seems to me that
what you are suggesting for UCS-2 is already required.
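As a rough illustration of the Byte Order Mark requirement, here is a
Python sketch that checks the first two bytes of an entity. The function
name ucs2_byte_order is hypothetical, and Python's UTF-16 codecs stand in
for UCS-2 (an assumption; the two agree for characters in the basic
plane).

```python
import codecs

def ucs2_byte_order(data: bytes):
    """Return 'big', 'little', or None if no Byte Order Mark is present."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    return None

# #xFEFF written first, then the document, in each byte order.
big_endian = "\ufeff<doc/>".encode("utf-16-be")
little_endian = "\ufeff<doc/>".encode("utf-16-le")
no_mark = b"<doc/>"
```

A processor that sees neither mark can conclude the entity is not UCS-2
and fall back on other evidence, such as the encoding declaration.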

The EncodingPI itself is *not* required to be encoded in UTF-8; that
suggestion was made last fall, and failed to generate consensus.
The EncodingPI is written in the same encoding as the rest of the file,
because one of the main advantages of in-file headers of this sort
is that they can be maintained directly by users, without reliance on
anything more elaborate than an editor that understands the encoding
in use in the file.  Anyone who has struggled for years, as I have,
with system administrators too ignorant to understand character
set issues and too busy to learn, and too wise to let me 'fix' the
system routines myself, will appreciate the importance of letting the
data be labeled by users who know what it is, rather than by system
routines that don't.
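A brief Python sketch of an in-file label written in the file's own
encoding. It uses the encoding declaration of present-day XML 1.0 and the
standard xml.etree.ElementTree parser, not the EncodingPI syntax of the
31 March draft, so treat it as an analogy only.

```python
import xml.etree.ElementTree as ET

# The declaration and the content are both stored as ISO-8859-1 bytes;
# any editor that understands that encoding can maintain the label.
source = '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'
data = source.encode("iso-8859-1")

# The parser reads the label from the bytes and decodes accordingly.
text = ET.fromstring(data).text
```

The label can be read before the encoding is known because the
declaration itself uses only ASCII characters, which ISO-8859-1 stores as
the same bytes.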

Thanks for your corrections.

-C. M. Sperberg-McQueen

Received on Thursday, 3 April 1997 10:10:39 UTC