* Dave Raggett wrote:
>> >> >the comment looks something like
>> >> ><!-- <o:tag>Coulomb's law</o:tag> -->
>> >> >Except that the ' is a chr 146. In other words, a 'smart apostrophe' or
>> >> >'curly apostrophe'
>> >> >This character is getting changed to something else. In my text editor it
>> >> >indicates it is a chr 25.
>> To convert 0x92 to 0x19 is a bug, yes.
>
>Why? If you are converting a broken document (invalid characters)
>into a valid document with the equivalent Unicode characters and
>a Unicode character set, surely this is in direct alignment with
>the goals of HTML Tidy?

Again, a reduced test case looks like this:

  <!--<U+2019>-->

It's encoded using Windows-1252, hence a hex dump looks like

  3C 21 2D 2D 92       2D 2D 3E
  ---------------------------------------------------
  <  !  -  -  <U+2019> -  -  >

After running Tidy on that document without any special configuration
option, it becomes

  3C 21 2D 2D 19       2D 2D 3E
  ---------------------------------------------------
  <  !  -  -  <U+0019> -  -  >

U+0019 is a C0 control character ("END OF MEDIUM") and it is an invalid
character in HTML ("UNUSED" in the SGML declaration). That's a bug. I am
not against transcoding or replacing the U+2019 in the source document,
but Tidy does not do this at the moment; it just mangles the character.

Received on Friday, 11 October 2002 15:04:29 GMT
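
For anyone who wants to reproduce the byte-level observation in the message, here is a minimal Python sketch. It is not Tidy's code and makes no claim about how Tidy works internally. The variable names are hypothetical, and the two byte strings are copied from the hex dumps in the message. It only checks what the message states: byte 0x92 is U+2019 in Windows-1252, the reported output byte 0x19 is a control character, and a real transcoding to UTF-8 would look different.

```python
import unicodedata

# Reduced test case from the message, as Windows-1252 bytes.
source = bytes.fromhex("3C 21 2D 2D 92 2D 2D 3E")
# Bytes reported in the message after running Tidy.
mangled = bytes.fromhex("3C 21 2D 2D 19 2D 2D 3E")

# Decoding the input as Windows-1252 turns 0x92 (chr 146) into U+2019.
decoded = source.decode("cp1252")
apostrophe = decoded[4]
print(f"input char:  U+{ord(apostrophe):04X}", unicodedata.name(apostrophe))

# The reported output holds 0x19 (chr 25) at the same position.
control = mangled.decode("cp1252")[4]
print(f"output char: U+{ord(control):04X}", "category", unicodedata.category(control))

# What a faithful transcoding to UTF-8 would produce instead (illustration only).
utf8 = decoded.encode("utf-8")
print("as UTF-8:   ", utf8.hex(" ").upper())
```

One observation, offered only as a guess: 0x19 is the low byte of 0x2019, which would be consistent with the code point being truncated to 8 bits somewhere. Nothing in the message confirms that this is the cause.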