- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Fri, 11 Oct 2002 21:04:35 +0200
- To: Dave Raggett <dsr@w3.org>
- Cc: html-tidy@w3.org
* Dave Raggett wrote:
>> >> >the comment looks something like
>> >> ><!-- <o:tag>Coulomb's law</o:tag> -->
>> >> >Except that the ' is a chr 146. In other words, a 'smart apostropy' or
>> >> >'curly apostropy'
>> >> >This character is getting changed to something else. In my text editor it
>> >> >indicates it is a chr 25.
>> To convert 0x92 to 0x19 is a bug, yes.
>
>Why? If you are converting a broken document (invalid characters)
>into a valid document with the equivalent Unicode characters and
>a Unicode character set, surely this is in direct alignment with
>the goals of HTML Tidy?

Again, a reduced test case looks like this

  <!--<U+2019>-->

It's encoded using Windows-1252, hence a hex dump looks like

  3C 21 2D 2D    92    2D 2D 3E
  ---------------------------------------------------
  <  !  -  -  <U+2019> -  -  >

After running Tidy on that document without any special configuration
option it becomes

  3C 21 2D 2D    19    2D 2D 3E
  ---------------------------------------------------
  <  !  -  -  <U+0019> -  -  >

U+0019 is a c0 control character ("END OF MEDIUM") and it is an invalid
character in HTML ("UNUSED" in the SGML declaration). That's a bug. I am
not against transcoding or replacing the U+2019 in the source document,
but Tidy does not do this at the moment, it just mangles the character.
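
The byte-level argument above can be checked with a short Python sketch. It is an illustration only and does not reproduce Tidy's code: the helper `is_unused_c0` is a hypothetical name, and the remark that 0x19 matches the low byte of U+2019 is an observation about the numbers, not a confirmed explanation of what Tidy does internally.

```python
# Reduced test case from the message: a comment holding one curly
# apostrophe, stored as Windows-1252 bytes.
source = bytes.fromhex("3C 21 2D 2D 92 2D 2D 3E")

# Decoding as Windows-1252 maps byte 0x92 to U+2019.
text = source.decode("cp1252")

# Output bytes reported in the message after running Tidy.
mangled = bytes.fromhex("3C 21 2D 2D 19 2D 2D 3E")

# Observation only: 0x19 equals the low byte of U+2019. Whether Tidy
# really truncates the code point this way is an assumption.
low_byte = ord("\u2019") & 0xFF


def is_unused_c0(byte: int) -> bool:
    """Hypothetical helper: C0 controls other than TAB, LF and CR."""
    return byte < 0x20 and byte not in (0x09, 0x0A, 0x0D)


# Transcoding to UTF-8 would preserve the character instead of mangling it.
as_utf8 = text.encode("utf-8")

print(f"decoded character: U+{ord(text[4]):04X}")
print(f"low byte of U+2019: 0x{low_byte:02X}")
print("mangled byte is an unused C0 control:", is_unused_c0(mangled[4]))
print("UTF-8 bytes:", " ".join(f"{b:02X}" for b in as_utf8))
```

The last line shows what a transcoding step, as opposed to the mangling described above, would emit for the same comment.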
Received on Friday, 11 October 2002 15:04:29 UTC