RE: iso-2022 HTML -> XML

Tomohisa Yazaki wrote:
>
> For the first problem, I added cleanComment() method to
> Clean.java. This method accepts a Node object and removes all comment
> nodes which include "--". To make this patch work, I fixed a bug that
> doctype.next.prev is not set in setXHTMLDocType() and fixDocType() in
> Lexer.java. Removing the whole comment is a rough way, replacing "--"
> to something else may be a better way.

How about inserting a space between consecutive hyphens instead?  In other
words, put a space after every "-" that is followed by another "-", so that
"--" becomes "- -".  This is how XSLT fixes up illegal comments [1].
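
A rough sketch of that idea in Java, in case it helps.  The class and
method names here are made up for illustration and are not part of
Clean.java or Lexer.java.  It uses a lookahead rather than a plain
replace("--", "- -"), because the plain replace turns "---" into "- --",
which still contains "--".

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the existing Tidy sources.
final class CommentHyphenFixer {
    // Matches a hyphen that is immediately followed by another hyphen.
    // The lookahead does not consume the second hyphen, so runs of three
    // or more hyphens are handled as well.
    private static final Pattern HYPHEN_BEFORE_HYPHEN = Pattern.compile("-(?=-)");

    // Returns the comment text with a space inserted after every hyphen
    // that is followed by another hyphen.
    static String separateHyphens(String commentText) {
        return HYPHEN_BEFORE_HYPHEN.matcher(commentText).replaceAll("- ");
    }

    public static void main(String[] args) {
        System.out.println(separateHyphens("a -- b"));   // a - - b
        System.out.println(separateHyphens("a --- b"));  // a - - - b
    }
}
```

This only deals with hyphens inside the comment text; it does not touch a
single hyphen at the very end.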

Also, XML does not allow the text of a comment to end with "-" (the rule is
implicit in the Comment production of the BNF [2]).  This may be illegal in
HTML, too.  It depends on how you interpret the HTML 4.0 spec [3].  Dave
probably knows.  For example,

   <!-- Illegal in XML, maybe in HTML, too --->

Note the three hyphens at the end.
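
Here is a small Java sketch of a check for both hyphen rules, applied to
the text between "<!--" and "-->".  The names are hypothetical, and
appending a space after a trailing hyphen is just one possible repair, an
assumption on my part rather than something the patch already does.  It
looks only at hyphens, not at the other character restrictions in XML.

```java
// Hypothetical helper, not part of the existing Tidy sources.
final class XmlCommentCheck {
    // True if the comment text obeys the two hyphen rules discussed here:
    // it contains no "--" and it does not end with "-".
    static boolean hasLegalHyphens(String text) {
        return !text.contains("--") && !text.endsWith("-");
    }

    // One possible repair for the trailing case: append a space so the
    // final hyphen is no longer adjacent to the closing "-->".
    static String padTrailingHyphen(String text) {
        return text.endsWith("-") ? text + " " : text;
    }

    public static void main(String[] args) {
        // Text of the example comment above, without "<!--" and "-->".
        String body = " Illegal in XML, maybe in HTML, too -";
        System.out.println(hasLegalHyphens(body));                     // false
        System.out.println(hasLegalHyphens(padTrailingHyphen(body)));  // true
    }
}
```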

[1] http://www.w3.org/TR/WD-xslt#section-Creating-Comments
[2] http://www.w3.org/TR/1998/REC-xml-19980210.html#sec-comments
[3] http://www.w3.org/TR/REC-html40/intro/sgmltut.html#h-3.2.4

Randy

Received on Thursday, 2 September 1999 12:57:10 UTC