W3C home > Mailing lists > Public > html-tidy@w3.org > October to December 2002

Re: special characters in comments getting 'mangled'

From: Dave Raggett <dsr@w3.org>
Date: Sat, 12 Oct 2002 11:52:02 +0100 (BST)
To: Bjoern Hoehrmann <derhoermi@gmx.net>
cc: html-tidy@w3.org
Message-ID: <Pine.LNX.4.44.0210121147340.1777-100000@hazel>

On Fri, 11 Oct 2002, Bjoern Hoehrmann wrote:

> * Dave Raggett wrote:
> >> >> >the comment looks something like
> >> >> ><!-- <o:tag>Coulomb's law</o:tag> -->
> >> >> >Except that the ' is a chr 146.  In other words, a 'smart apostropy' or
> >> >> >'curly apostropy'
> 
> >> >> >This character is getting changed to something else. In my text editor it
> >> >> >indicates it is a chr 25.
> 
> >> To convert 0x92 to 0x19 is a bug, yes.
> >
> >Why?  If you are converting a broken document (invalid characters)
> >into a valid document with the equivalent Unicode characters and
> >a Unicode character set, surely this is in direct alignment with
> >the goals of HTML Tidy?
> 
> Again, a reduced test case looks like this
> 
>   <!--<U+2019>-->
> 
> It's encoded using Windows-1252, hence a hex dump looks like
> 
>   3C     21     2D     2D     92     2D     2D     3E
>   ---------------------------------------------------
>   <      !      -      -   <U+2019>  -      -      >
> 
> After running Tidy on that document without any special configuration
> option it becomes
> 
>   3C     21     2D     2D     19     2D     2D     3E
>   ---------------------------------------------------
>   <      !      -      -   <U+0019>  -      -      >
> 
> U+0019 is a c0 control character ("END OF MEDIUM") and it is an invalid
> character in HTML ("UNUSED" in the SGML declaration). That's a bug. I am
> not against transcoding or replacing the U+2019 in the source document,
> but Tidy does not do this at the moment, it just mangles the character.


Agreed - if both input and output character sets are Windows-1252
then Tidy shouldn't mess with valid Windows-1252 characters.

That said, I think it is bad practice to use Windows-1252 and we
should set the default to a portable character set based on Unicode.

Linux and Mac user's aren't happy with folks delivering documents
using Windows-1252 special characters when these characters have
equivalent Unicode code points. It is also the case that 
Windows-1252 TrueType fonts include the Unicode glyph tables
so that using the Unicode code points works fine on Windows. 

-- 
 Dave Raggett <dsr@w3.org> or <dave.raggett@openwave.com>
 W3C lead for voice/multimodal. http://www.w3.org/People/Raggett 
 tel/fax: +44 1225 866240 (or 867351) +44 771 213 7629 (GSM)
Received on Saturday, 12 October 2002 06:53:27 UTC

This archive was generated by hypermail 2.3.1 : Wednesday, 5 February 2014 23:39:48 UTC