Re: Added charset -> iconv conf file; checking UTF-8

On 18.06.01 at 19:32, Martin Duerst <duerst@w3.org> wrote:

>I just added a new file that maps IANA 'charset' parameters to iconv
>parameters.

We may have to rethink our configuration strategy; several things aren't
possible with the current format (which was a 5 minute hack one night I was
feeling clever ;D) and we're starting to accumulate a lot of $foo_db.
Either use one of the myriad Config::* modules from CPAN, roll our own, or
extend the current format. Namespace issues can be solved simply by
stuffing all of them in a global $CFG hash-ref or in $File->{CFG} (I'm not
sure whether there's any point to the latter).
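
To make the single-namespace idea concrete, here is a rough sketch in
Python (not the validator's actual Perl). It assumes each config file is a
list of whitespace-separated "key value" lines with "#" comments; that
format, the table names, and the sample entries are all hypothetical.

```python
def parse_table(text):
    """Parse 'key value' lines into a dict, skipping blanks and # comments."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 1)
        # A key with no value maps to the empty string.
        table[parts[0]] = parts[1] if len(parts) > 1 else ""
    return table


def load_config(sources):
    """Collect every named table under one dict, in the spirit of a global $CFG."""
    return {name: parse_table(text) for name, text in sources.items()}


# Hypothetical table names and entries, inlined as strings for the example.
CFG = load_config({
    "charsets": "utf-8 UTF-8\niso-8859-1 ISO-8859-1  # hypothetical entries\n",
    "doctypes": "",
})
```

Every lookup then goes through one name, e.g. CFG["charsets"]["utf-8"],
instead of a separate variable per table.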


>The special 'windows-xxxx' code is gone.

Good riddance! :-)


>I also added a very thorough (but fast) check of UTF-8 byte patterns.

Why check output from iconv()? If it's not correct, it should be fixed in
libiconv, not in check. Perhaps rework it so we only check UTF-8 input?

BTW, ISTR that a similar check is possible for UTF-16; is there any point
to checking that or should we just recode it into UTF-8?
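
A minimal sketch of what I mean by "only check UTF-8 input", in Python
rather than the Perl in check. It leans on Python's strict UTF-8 decoder
instead of a hand-written byte-pattern test, and the function names are
made up for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 according to the strict decoder."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def to_utf8(data: bytes, charset: str) -> bytes:
    """Check UTF-8 input directly; recode anything else and trust the converter."""
    if charset.lower() in ("utf-8", "utf8"):
        if not is_valid_utf8(data):
            raise ValueError("input is not well-formed UTF-8")
        return data
    # Other charsets (UTF-16 included) are simply recoded into UTF-8.
    return data.decode(charset).encode("utf-8")
```

With this split, the output of the conversion step is never re-checked;
only bytes that claim to be UTF-8 already go through the validity test.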


>http://cvs.w3.org/Team/validator/httpd/cgi-bin/check.diff?r1=1.116&r2=1.117&cvsroot=Public

Uhm, perhaps you meant to write
<URL:http://dev.w3.org/cvsweb/validator/httpd/cgi-bin/check.diff?r1=1.116&r2=1.117>
so that us mere mortals can play too? :-)


>I'm thinking about what to do with 'unknown'. Throwing it out altogether
>would be best, but maybe this would create too much opposition.

I'm leaning towards saying that there is no such thing as an "unknown"
charset. There are only known charsets that we can handle and invalid
charsets (we may of course have limitations or bugs, but there are no
"unknown" charsets).

Put another way, I think it would be better to assume ISO-8859-1 (cf.
HTTP), punt ("Don't know that charset, can't validate"), or assume UNICODE
(either UTF-8 or UTF-16 and try to guess which by BOM or number of null
bytes in first 30% of file); in that order. I think assuming ISO-Latin-1 is
the "correct" behaviour, but since that is a bit controversial just punting
and putting up an error is an acceptable "compromise".
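
For the "assume UNICODE and guess" option, the BOM / null-byte heuristic
could look roughly like the Python sketch below. The 30% sample comes from
the paragraph above; the 20% null-byte threshold and the function name are
assumptions picked for illustration, not tested values.

```python
import codecs


def guess_unicode_encoding(data: bytes,
                           sample_fraction: float = 0.3,
                           null_threshold: float = 0.2) -> str:
    """Guess 'utf-8' or 'utf-16' from a BOM, else from null bytes in a sample."""
    if not data:
        return "utf-8"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM: look at the first part of the file. Mostly-Latin text in
    # UTF-16 has a null byte in roughly every other position.
    sample = data[: max(1, int(len(data) * sample_fraction))]
    null_ratio = sample.count(b"\x00") / len(sample)
    return "utf-16" if null_ratio >= null_threshold else "utf-8"
```

Note that this only helps for text that is mostly ASCII or Latin 1; UTF-16
text in other scripts may contain few null bytes and would be guessed as
UTF-8.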


>Anyway, I don't want pages with an 'unknown' charset to get
>"Congratulations!".

No, definitely not, but I'm wary of spitting out errors for the (big
majority of) pages that are served without a Content-Type (in HTTP or a
META equivalent). Fair enough, a lot of them are using MacRoman or
Windows-1252 instead of ISO-8859-1, but these are minor misunderstandings
and understandable in light of the Latin 1 defaulting in HTTP and the
difficulty in changing it for most users.

I really *don't* want to encourage using the quick fix -- META -- because
that one is a mistake IMO, and should never have been introduced, much less
propagated by XML.

Much better to ignore these -- any significant errors will be caught as
"Non-SGML char", the rest will be smart quotes that show up funny on other
platforms -- and concentrate on the ones that create actual problems (i.e.
anything that needs more than Latin 1 for the basic language, including
human language and technical writing).


OTOH, I'm open to other views. Given that you evidently have far greater
experience with these issues than I have, I'd like to hear your take on
this, Martin. Maybe Björn would like to chime in too? Anyone else? Masa?




BTW, I was supposed to send the below two weeks ago, but things got a
little crazy. :-)

I'm not an IRC person, but after using it as a channel for discussing the
Validator with Gerald I'm forced to admit that it can be a pretty effective
addition to email. Since there seems to be a bit of interest in
contributing to the development of the validator it'd be good if y'all
would consider stopping by #validator from time to time. Gerald complains
that you (Martin) don't use IRC much, but you may consider this a prod to
get you moving. :-)

Nick, you also expressed interest in following the development a bit
closer, didn't you? Björn? Liam? #validator lives on the private server
irc.w3.org:6665. Gerald is usually there (though not always quite "there"
;D) and I try to check in at least every few days (the joys of unmetered
Internet access ;D).

Received on Monday, 18 June 2001 09:53:03 UTC