RE: Tidy (oct22) failed to parse comments

On Sat, 20 Nov 1999, Dave Raggett wrote:

> SGML/XML says:
>
>   good     <!---->
>   bad      <!----->
>   bad      <!------>
>   bad      <!------->
>   good     <!-------->
>
> weird isn't it!
>
> I will adjust the parser to trim trailing hyphens to the
> nearest legal number.

I believe this would be insufficient for XML. XML's comment syntax is a
subset of SGML/HTML's. Production 15 in XML 1.0 says:

   Comment ::=  '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'

and the text says:

   For compatibility, the string "--" (double-hyphen) must not occur
   within comments.

This means that the characters between the opening <!-- and the closing -->
cannot contain two consecutive hyphens. Nor can they end in a hyphen: the BNF
requires every hyphen to be followed by a non-hyphen character, although the
text fails to mention it.

So for XML (as opposed to SGML/HTML):

   <!---->      good (empty comment)
   <!----->     bad (trailing hyphen)
   <!------>    bad (consecutive hyphens, trailing hyphen)
   <!------->   bad (consecutive hyphens, trailing hyphen)
   <!-------->  bad (consecutive hyphens, trailing hyphen)
   <!--- -->    good
   <!-- - - --> good
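
As a rough illustration, the XML rule can be checked in a few lines of
Python. This is only a sketch: the name is_valid_xml_comment is made up
here, and it ignores the limits the Char production places on which
characters are allowed at all.

```python
def is_valid_xml_comment(text: str) -> bool:
    """Check a whole comment against XML 1.0 production 15.

    Hypothetical helper; does not validate the Char range itself.
    """
    # The shortest legal comment is "<!---->" (7 characters), so anything
    # shorter cannot have both delimiters without them overlapping.
    if len(text) < 7 or not text.startswith("<!--") or not text.endswith("-->"):
        return False
    body = text[4:-3]
    # Every hyphen must be followed by a non-hyphen character, which is the
    # same as: no "--" anywhere in the body and no hyphen at its end.
    return "--" not in body and not body.endswith("-")
```

Run over the examples above, it should accept <!---->, <!--- --> and
<!-- - - --> and reject the other four.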

For XML, Tidy could fix these by examining the characters between the <!--
and the -->: within each run of two or more consecutive hyphens, replace the
first, third, etc. hyphen with a space, and then replace any trailing hyphen
with a space. This should preserve much of the visual effect intended by
people who use consecutive hyphens as dividers.
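
A minimal Python sketch of that repair, not Tidy's actual code. The name
fix_comment_body is hypothetical, and I am assuming that a lone hyphen (as
in "well-known") is left alone unless it is the last character of the
comment body.

```python
import re


def _space_out(match: re.Match) -> str:
    # Replace the 1st, 3rd, 5th, ... hyphen of the run with a space.
    run = match.group(0)
    return "".join(" " if i % 2 == 0 else "-" for i in range(len(run)))


def fix_comment_body(body: str) -> str:
    """Rewrite the text between <!-- and --> so it satisfies XML's rule.

    Hypothetical helper; takes the body only, without the delimiters.
    """
    # Step 1: break up every run of two or more consecutive hyphens.
    fixed = re.sub(r"-{2,}", _space_out, body)
    # Step 2: a hyphen at the very end would touch the closing "-->".
    if fixed.endswith("-"):
        fixed = fixed[:-1] + " "
    return fixed
```

Since each hyphen is swapped for a space rather than deleted, the body
keeps its length, so divider lines stay the same width.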

If you wanted to avoid a special case for XML, perhaps Tidy could make all
comments conform to XML's stricter syntax. (The extra latitude allowed by
SGML/HTML is small enough and obscure enough that I wonder if anyone would
miss it.)

Randy

Received on Saturday, 20 November 1999 19:12:15 UTC