Re: Macintosh charset blowing up?

On Sun, 6 Jan 2002, Martin Duerst wrote:

> >I suggest a quick-hack fix for this, that I've added to Page Valet:
> >
> >if ( charset matches /^mac(intosh|roman)/i ) {
> >   message("charset not supported; treating it as UTF-8") ;
> >   charset = "UTF-8" ;
> >}
>
> It seems that most of the characters are supported;
> would be a pity to give up completely.
>
> Also, treating something as UTF-8 while it's clearly not
> is a really bad idea.

OK, that would probably be a bad idea for the W3C validator.
OTOH, printing a warning message "charset not correctly supported"
would seem like a good idea.

In the case of Page Valet, I needed a more drastic measure, because
the symptom of the problem was that OpenSP generated broken XML
(an opening "<" was eaten up by the null byte).  But yes, I'll
be looking for a better fix - perhaps

if ( charset is macintosh ) {
  entify the offending bytes ; // accept a performance hit :-(
}
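
To make that idea concrete, here is a minimal Python sketch. It is not
Page Valet's actual code. It assumes the "offending bytes" are the ones
above 0x7F, and it uses Python's built-in mac_roman codec in place of
iconv. The function name entify_high_bytes is hypothetical.

```python
def entify_high_bytes(data: bytes, charset: str = "mac_roman") -> bytes:
    """Return pure-ASCII bytes for `data`.

    Assumption: every byte above 0x7F is "offending". Each such byte is
    decoded with `charset` and rewritten as a numeric character
    reference (for example &#233;), so the parser only ever sees ASCII.
    """
    text = data.decode(charset)
    # xmlcharrefreplace turns each non-ASCII character into &#NNN;
    return text.encode("ascii", errors="xmlcharrefreplace")


if __name__ == "__main__":
    # 0x8E is e-acute in MacRoman
    print(entify_high_bytes(b"<p>caf\x8e</p>"))
```

The performance hit is the extra decode/encode pass over the whole
document. Note that this sketch leaves null bytes alone, so they would
still need separate handling.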

BTW, treating it as UTF-8 and emitting a warning is also the fallback
behaviour when iconv fails because the charset is explicitly unsupported.
Probably not good, but I'm not sure how best to deal with it.
For 8-bit charsets, wholesale entification would be an option,
but how does one know if an unknown charset is 8-bit?
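
For reference, the fallback described above looks roughly like this
Python sketch. It is an illustration only: it uses Python's codecs
module where the real code calls iconv, and the function name
decode_with_fallback is hypothetical. It does not answer the 8-bit
question.

```python
import codecs
import warnings


def decode_with_fallback(data: bytes, charset: str):
    """Decode `data`, returning (text, charset_actually_used).

    If the declared charset is unknown to the converter, warn and
    treat the input as UTF-8, as described above.
    """
    try:
        codecs.lookup(charset)  # raises LookupError for unknown names
    except LookupError:
        warnings.warn(
            f"charset {charset!r} not supported; treating it as UTF-8"
        )
        # errors="replace" keeps going on byte sequences that are not
        # valid UTF-8, substituting U+FFFD for each bad sequence
        return data.decode("utf-8", errors="replace"), "utf-8"
    return data.decode(charset, errors="replace"), charset
```

Returning the charset actually used lets the caller report the
substitution to the user rather than hiding it.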

-- 
Nick Kew

Site Valet - the mark of Quality on the Web.
<URL:http://valet.webthing.com/>

Received on Monday, 7 January 2002 16:46:26 UTC