Re: special characters in comments getting 'mangled' from Bjoern Hoehrmann on 2002-10-11 (html-tidy@w3.org from October to December 2002)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Fri, 11 Oct 2002 21:04:35 +0200
To: Dave Raggett <dsr@w3.org>
Cc: html-tidy@w3.org
Message-ID: <3ddd1cbe.166180945@smtp.bjoern.hoehrmann.de>

* Dave Raggett wrote:
>> >> >the comment looks something like
>> >> ><!-- <o:tag>Coulomb's law</o:tag> -->
>> >> >Except that the ' is a chr 146.  In other words, a 'smart apostrophe' or
>> >> >'curly apostrophe'

>> >> >This character is getting changed to something else. In my text editor it
>> >> >indicates it is a chr 25.

>> To convert 0x92 to 0x19 is a bug, yes.
>
>Why?  If you are converting a broken document (invalid characters)
>into a valid document with the equivalent Unicode characters and
>a Unicode character set, surely this is in direct alignment with
>the goals of HTML Tidy?

Again, a reduced test case looks like this:

  <!--<U+2019>-->

It's encoded using Windows-1252, so a hex dump looks like this:

  3C     21     2D     2D     92     2D     2D     3E
  ---------------------------------------------------
  <      !      -      -   <U+2019>  -      -      >
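
To check the mapping in the dump, here is a minimal Python sketch. It assumes Python's built-in "cp1252" codec as the Windows-1252 decoder; the variable names are only for illustration and have nothing to do with Tidy's code.

```python
# Bytes of the reduced test case, copied from the hex dump above.
raw = bytes.fromhex("3C 21 2D 2D 92 2D 2D 3E")

# Decode as Windows-1252; byte 0x92 should come out as U+2019.
text = raw.decode("cp1252")

# Windows-1252 is a single-byte encoding, so bytes and characters pair up 1:1.
for byte, char in zip(raw, text):
    print(f"{byte:02X} -> U+{ord(char):04X}")
```

The fifth line of output pairs byte 92 with U+2019, which is the character a correct transcoding would have to preserve.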

After running Tidy on that document without any special configuration
options, it becomes

  3C     21     2D     2D     19     2D     2D     3E
  ---------------------------------------------------
  <      !      -      -   <U+0019>  -      -      >

U+0019 is a C0 control character ("END OF MEDIUM") and it is an invalid
character in HTML ("UNUSED" in the SGML declaration). That's a bug. I am
not against transcoding or replacing the U+2019 in the source document,
but Tidy does not do this at the moment; it just mangles the character.
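
As a rough illustration of why the output is invalid, the following sketch flags control characters that the HTML 4 SGML declaration marks UNUSED. The function name unused_code_points is hypothetical, and the check is simplified: it covers only the C0 range (except tab, LF and CR), DEL, and the C1 range, not the other UNUSED ranges such as surrogates. It says nothing about how Tidy itself works.

```python
def unused_code_points(text):
    """Return (index, code point) pairs for control characters that the
    HTML 4 SGML declaration marks UNUSED (simplified check, see above)."""
    allowed_controls = {0x09, 0x0A, 0x0D}  # tab, line feed, carriage return
    hits = []
    for index, char in enumerate(text):
        cp = ord(char)
        is_c0 = cp < 0x20 and cp not in allowed_controls
        is_del_or_c1 = 0x7F <= cp <= 0x9F
        if is_c0 or is_del_or_c1:
            hits.append((index, cp))
    return hits


# The two byte sequences from the hex dumps, decoded as Windows-1252.
expected = bytes.fromhex("3C212D2D922D2D3E").decode("cp1252")
mangled = bytes.fromhex("3C212D2D192D2D3E").decode("cp1252")

print(unused_code_points(expected))
print(unused_code_points(mangled))
```

The correctly decoded input yields no hits, while the output Tidy produced is reported with U+0019 at index 4.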

Received on Friday, 11 October 2002 15:04:29 UTC