- From: Dave Raggett <dsr@w3.org>
- Date: Sat, 12 Oct 2002 11:52:02 +0100 (BST)
- To: Bjoern Hoehrmann <derhoermi@gmx.net>
- cc: html-tidy@w3.org
On Fri, 11 Oct 2002, Bjoern Hoehrmann wrote: > * Dave Raggett wrote: > >> >> >the comment looks something like > >> >> ><!-- <o:tag>Coulomb's law</o:tag> --> > >> >> >Except that the ' is a chr 146. In other words, a 'smart apostropy' or > >> >> >'curly apostropy' > > >> >> >This character is getting changed to something else. In my text editor it > >> >> >indicates it is a chr 25. > > >> To convert 0x92 to 0x19 is a bug, yes. > > > >Why? If you are converting a broken document (invalid characters) > >into a valid document with the equivalent Unicode characters and > >a Unicode character set, surely this is in direct alignment with > >the goals of HTML Tidy? > > Again, a reduced test case looks like this > > <!--<U+2019>--> > > It's encoded using Windows-1252, hence a hex dump looks like > > 3C 21 2D 2D 92 2D 2D 3E > --------------------------------------------------- > < ! - - <U+2019> - - > > > After running Tidy on that document without any special configuration > option it becomes > > 3C 21 2D 2D 19 2D 2D 3E > --------------------------------------------------- > < ! - - <U+0019> - - > > > U+0019 is a c0 control character ("END OF MEDIUM") and it is an invalid > character in HTML ("UNUSED" in the SGML declaration). That's a bug. I am > not against transcoding or replacing the U+2019 in the source document, > but Tidy does not do this at the moment, it just mangles the character. Agreed - if both input and output character sets are Windows-1252 then Tidy shouldn't mess with valid Windows-1252 characters. That said, I think it is bad practice to use Windows-1252 and we should set the default to a portable character set based on Unicode. Linux and Mac user's aren't happy with folks delivering documents using Windows-1252 special characters when these characters have equivalent Unicode code points. It is also the case that Windows-1252 TrueType fonts include the Unicode glyph tables so that using the Unicode code points works fine on Windows. -- Dave Raggett <dsr@w3.org> or <dave.raggett@openwave.com> W3C lead for voice/multimodal. http://www.w3.org/People/Raggett tel/fax: +44 1225 866240 (or 867351) +44 771 213 7629 (GSM)
Received on Saturday, 12 October 2002 06:53:27 UTC