- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Fri, 11 Oct 2002 21:04:35 +0200
- To: Dave Raggett <dsr@w3.org>
- Cc: html-tidy@w3.org
* Dave Raggett wrote:
>> >> >the comment looks something like
>> >> ><!-- <o:tag>Coulomb's law</o:tag> -->
>> >> >Except that the ' is a chr 146. In other words, a 'smart apostrophe' or
>> >> >'curly apostrophe'
>> >> >This character is getting changed to something else. In my text editor it
>> >> >indicates it is a chr 25.
>> To convert 0x92 to 0x19 is a bug, yes.
>
>Why? If you are converting a broken document (invalid characters)
>into a valid document with the equivalent Unicode characters and
>a Unicode character set, surely this is in direct alignment with
>the goals of HTML Tidy?
Again, a reduced test case looks like this:
<!--<U+2019>-->
It's encoded using Windows-1252, hence a hex dump looks like this:
3C 21 2D 2D 92       2D 2D 3E
---------------------------------------------------
<  !  -  -  <U+2019> -  -  >
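For anyone who wants to reproduce the dump, here is a minimal sketch in Python. It assumes Python's "cp1252" codec as a stand-in for Windows-1252; the variable names are just for illustration and have nothing to do with Tidy itself.

```python
# Build the reduced test case: a comment containing only U+2019
# (RIGHT SINGLE QUOTATION MARK).
test_case = "<!--\u2019-->"

# Encode it as Windows-1252 (assumption: Python's "cp1252" codec).
encoded = test_case.encode("cp1252")

# Render the bytes as a space-separated hex dump.
hex_dump = " ".join(f"{b:02X}" for b in encoded)
print(hex_dump)  # 3C 21 2D 2D 92 2D 2D 3E
```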
After running Tidy on that document without any special configuration
options, it becomes:
3C 21 2D 2D 19       2D 2D 3E
---------------------------------------------------
<  !  -  -  <U+0019> -  -  >
U+0019 is a C0 control character ("END OF MEDIUM") and it is an invalid
character in HTML ("UNUSED" in the SGML declaration). That's a bug. I am
not against transcoding or replacing the U+2019 in the source document,
but Tidy does not do this at the moment; it just mangles the character.
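To make the difference concrete, here is a rough Python sketch comparing the observed output with what transcoding or replacing could look like. It is only an illustration: it assumes Python's "cp1252" codec for Windows-1252, picks UTF-8 and a plain ASCII apostrophe as example targets, and says nothing about how Tidy arrives at 0x19 internally.

```python
import unicodedata

# Input bytes from the reduced test case, and the bytes observed after Tidy.
source = bytes.fromhex("3C 21 2D 2D 92 2D 2D 3E")
observed = bytes.fromhex("3C 21 2D 2D 19 2D 2D 3E")

# Decoding the input as Windows-1252 yields U+2019, a punctuation character.
decoded = source.decode("cp1252")
print(unicodedata.name(decoded[4]))      # RIGHT SINGLE QUOTATION MARK
print(unicodedata.category(decoded[4]))  # Pf (final quote punctuation)

# The observed output contains U+0019 instead, which is a control character.
mangled = observed.decode("cp1252")
print(unicodedata.category(mangled[4]))  # Cc (control)

# Option 1 (illustrative): transcode, here to UTF-8.
transcoded = decoded.encode("utf-8")
print(transcoded.hex(" ").upper())       # 3C 21 2D 2D E2 80 99 2D 2D 3E

# Option 2 (illustrative): replace U+2019 with an ASCII apostrophe.
replaced = decoded.replace("\u2019", "'").encode("ascii")
print(replaced)                          # b"<!--'-->"
```

Either option keeps a printable character in the comment; the observed output does not.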
Received on Friday, 11 October 2002 15:04:29 UTC