&, %, Literals, QuotedCData
1) Regarding the five deadly entities, I am concerned that they don't
actually appear in any productions, just in prose. We just fell over
on processing torture.xml for that reason. If they're IN, they should
REALLY be IN, i.e., in the grammar, presumably in production 64.
2) Regarding productions 15 and 16, the definitions of Literal and
QuotedCData, the more I look at them the less I like them. The
rationale behind forbidding '&' in PCData is (presumably) so that syntax
errors (e.g. ". . . Liggett&:Myers are for the high-jump . . .") will get
caught.
Why not then catch this?
<!ENTITY foo 'this %oops: will not get caught, ever!'>
Or this, at the point of declaration instead of every time it's used?
<!ENTITY baz 'Another &#?$? typo'>
Or this?
<foo x="my SGML &mistake">
I think we'd be MUCH better off if 15 and 16 were changed as follows:
Literal := ... [^"&%] ... [^'&%]
QuotedCData := ... [^"&<] ... [^'&<]
[Actually, I think we could lose the '<' from the exclusion with
benefit, as well]
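To make the effect of the proposed production 15 concrete, here is a rough
Python sketch of a check in which '&' and '%' are legal inside a Literal only
as the start of a well-formed reference. It is not part of the draft: the
reference syntax (&name; &#NNN; %name;) and the name check_literal are
assumptions made purely for illustration.

```python
import re

# Assumed reference syntax, for illustration only: &name;  &#NNN;  %name;
_NAME = r"[A-Za-z_][A-Za-z0-9._-]*"
_REFERENCE = rf"&(?:#[0-9]+|{_NAME});"
_PE_REFERENCE = rf"%{_NAME};"


def _literal_body(quote):
    # Ordinary characters exclude the delimiter, '&' and '%';
    # '&' and '%' may appear only as the start of a complete reference.
    return re.compile(rf"(?:[^{quote}&%]|{_REFERENCE}|{_PE_REFERENCE})*")


_BODY = {'"': _literal_body('"'), "'": _literal_body("'")}


def check_literal(literal):
    """Return True if a quoted literal satisfies the sketched production."""
    if len(literal) < 2 or literal[0] not in _BODY or literal[-1] != literal[0]:
        return False
    return _BODY[literal[0]].fullmatch(literal[1:-1]) is not None
```

Under these assumptions the three typos above ('%oops:', '&#?$?' and
'&mistake') are all rejected at the point where the literal is read, while a
literal containing only complete references is accepted.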
The minimal negative consequence of this would be that you would be
REQUIRED to use a reference (a character reference if the built-ins go
away, or in Literals) to introduce an & into an entity or attribute value.
In other words, your examples from torture.xml would look like this:
<!ENTITY s2a '&quot;Don&apos;t!&quot; he cried.' >
<!ENTITY s2b '&quot;Don&#38;apos;t!&#38;quot; he cried.' >
<!ENTITY s2bref '&s2b;' >
<!ENTITY s2c '"Don&apos;t&u-0021;" he cried.' >
<!ENTITY s3 "despite differences in physical structure." >
<!ENTITY s3ref "&s3;" >
An alternative which would keep the examples simple, but require more
prose in section 4.3, would be to still exclude '&', but allow
Reference, rather than CharRef, in Literals, and NOT TO SUBSTITUTE FOR
EntityRef in the context of parsing a Literal.
I'd prefer either of these to the status quo.
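For the alternative, a rough Python sketch of what "not substituting for
EntityRef" might look like when a Literal is parsed. This is only my reading
of the idea, not text from the draft: the function name literal_value, the
decimal &#NNN; character reference syntax, and the choice to expand character
references immediately are all assumptions.

```python
import re

# Assumed syntax, for illustration only: &#NNN; (CharRef) and &name; (EntityRef)
_REF = re.compile(r"&(?:#([0-9]+)|([A-Za-z_][A-Za-z0-9._-]*));")


def literal_value(body):
    """Expand character references; keep entity references as written."""
    out = []
    pos = 0
    while pos < len(body):
        ch = body[pos]
        if ch != "&":
            out.append(ch)
            pos += 1
            continue
        m = _REF.match(body, pos)
        if m is None:
            # A bare '&' is still excluded, so typos are caught here.
            raise ValueError(f"bare '&' at offset {pos}")
        if m.group(1) is not None:
            out.append(chr(int(m.group(1))))  # CharRef: substitute now
        else:
            out.append(m.group(0))            # EntityRef: leave for later
        pos = m.end()
    return "".join(out)
```

With this reading, the body of s3ref stays "&s3;" in the stored value and is
only expanded when s3ref itself is referenced, while a bare '&' is still an
error at the point of declaration.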