Re: Added charset -> iconv conf file; checking UTF-8

From: Terje Bless (link@tss.no)
Date: Mon, Jun 18 2001

    Date: Mon, 18 Jun 2001 15:32:19 +0200
    From: Terje Bless <link@tss.no>
    To: Martin Duerst <duerst@w3.org>
    cc: www-validator@w3.org
    Message-ID: <20010618155244-b01010704-ed46cd6c-0910-010c@192.168.1.6>
    Subject: Re: Added charset -> iconv conf file; checking UTF-8 
    
    On 18.06.01 at 19:32, Martin Duerst <duerst@w3.org> wrote:
    
    >I just added a new file that maps IANA 'charset' parameters to iconv
    >parameters.
    
    We may have to rethink our configuration strategy; several things aren't
    possible with the current format (which was a five-minute hack one night
    when I was feeling clever ;D) and we're starting to accumulate a lot of
    $foo_db variables. We could use one of the myriad Config::* modules from
    CPAN, roll our own, or extend the current format. Namespace issues can be
    solved simply by stuffing all of them into a global $CFG hash-ref or into
    $File->{CFG} (I'm not sure whether there's any point to the latter).
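
    A minimal sketch of the "one global config hash" idea, written in Python
    purely for illustration (the validator itself is Perl). The names CFG,
    charset_to_iconv and iconv_name are hypothetical, and the table entries
    are examples only, not the contents of the actual conf file.

```python
# Hypothetical sketch: a single global CFG mapping that holds every lookup
# table, instead of one separate *_db variable per table.
CFG = {
    # IANA 'charset' label (lowercased) -> encoding name handed to iconv.
    # Example entries only; the real conf file is not reproduced here.
    "charset_to_iconv": {
        "utf-8": "UTF-8",
        "iso-8859-1": "ISO-8859-1",
        "shift_jis": "SHIFT_JIS",
    },
}


def iconv_name(charset, cfg=CFG):
    """Return the iconv name for a charset label, or None if it is unmapped."""
    # Charset labels are case-insensitive, so normalise before the lookup.
    return cfg["charset_to_iconv"].get(charset.strip().lower())
```

    Keeping every table under one top-level name means a new table only adds
    a key, not another global.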
    
    
    >The special 'windows-xxxx' code is gone.
    
    Good riddance! :-)
    
    
    >I also added a very thorough (but fast) check of UTF-8 byte patterns.
    
    Why check the output from iconv()? If it's not correct, it should be fixed
    in libiconv, not in check. Perhaps rework it so we only check UTF-8 input?
    
    BTW, ISTR that a similar check is possible for UTF-16; is there any point
    to checking that or should we just recode it into UTF-8?
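
    For reference, a byte-pattern check of this kind might look roughly like
    the sketch below. It is not the code from the diff under discussion: it
    is a Python illustration that assumes the present-day definition of
    well-formed UTF-8 (at most four bytes per character, no overlong forms,
    no surrogates), and the function name is_valid_utf8 is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data consists only of well-formed UTF-8 sequences."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead <= 0x7F:
            # Plain ASCII byte.
            i += 1
            continue
        # For each lead byte: number of trailing bytes, plus the allowed
        # range of the FIRST trailing byte (this is what rules out overlong
        # forms, surrogates and code points above U+10FFFF).
        if 0xC2 <= lead <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif lead == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF  # excludes overlong 3-byte forms
        elif lead == 0xED:
            need, lo, hi = 2, 0x80, 0x9F  # excludes surrogates D800..DFFF
        elif 0xE1 <= lead <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif lead == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF  # excludes overlong 4-byte forms
        elif 0xF1 <= lead <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif lead == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F  # nothing above U+10FFFF
        else:
            # 0x80..0xC1 and 0xF5..0xFF can never start a sequence.
            return False
        if i + need >= n:
            return False  # sequence is truncated at the end of the input
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, need + 1):
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True
```

    Run only on input that claims to be UTF-8, such a check costs a single
    pass over the bytes.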
    
    
    >http://cvs.w3.org/Team/validator/httpd/cgi-bin/check.diff?r1=1.116&r2=1.117&cvsroot=Public
    
    Uhm, perhaps you meant to write
    <URL:http://dev.w3.org/cvsweb/validator/httpd/cgi-bin/check.diff?r1=1.116&r2=1.117>
    so that us mere mortals can play too? :-)
    
    
    >I'm thinking about what to do with 'unknown'. Throwing it out altogether
    >would be best, but maybe this would create too much opposition.
    
    I'm leaning towards saying that there is no such thing as an "unknown"
    charset. There are only known charsets that we can handle and invalid
    charsets (we may of course have limitations or bugs, but there are no
    "unknown" charsets).
    
    Put another way, I think it would be better to do one of the following, in
    order of preference: assume ISO-8859-1 (cf. HTTP); punt ("Don't know that
    charset, can't validate"); or assume Unicode (either UTF-8 or UTF-16,
    guessing which by the BOM or by the number of null bytes in the first 30%
    of the file). I think assuming ISO Latin 1 is the "correct" behaviour, but
    since that is a bit controversial, just punting and putting up an error is
    an acceptable "compromise".
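
    The third option (guessing between UTF-8 and UTF-16) could be sketched as
    below, in Python for illustration only. The function name and the
    null_threshold cut-off are assumptions that do not come from this thread;
    only the BOM test and the "null bytes in the first 30%" idea do. The
    sketch also ignores UTF-32, whose little-endian BOM begins with the same
    two bytes as the UTF-16 one.

```python
import codecs


def guess_unicode_flavour(data: bytes,
                          sample_fraction: float = 0.3,
                          null_threshold: float = 0.1) -> str:
    """Guess "UTF-8" or "UTF-16" for input already assumed to be Unicode."""
    # 1. A byte order mark settles the question outright.
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16"
    # 2. Otherwise count null bytes in the leading part of the file.
    #    Mostly-Latin text in UTF-16 has a zero byte in nearly every
    #    character, while UTF-8 text normally has none.
    sample = data[: max(1, int(len(data) * sample_fraction))]
    # null_threshold is a hypothetical cut-off chosen for this sketch.
    if sample and sample.count(0) / len(sample) >= null_threshold:
        return "UTF-16"
    return "UTF-8"
```

    The null-byte heuristic is weak for UTF-16 text that is mostly
    non-Latin, which is one more argument for preferring the first two
    options.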
    
    
    >Anyway, I don't want pages with an 'unknown' charset to get
    >"Congratulations!".
    
    No, definitely not, but I'm wary of spitting out errors for the (big
    majority of) pages that are served without a Content-Type (in HTTP or a
    META equivalent). Fair enough, a lot of them are using MacRoman or
    Windows-1252 instead of ISO-8859-1, but these are minor misunderstandings
    and understandable in light of the Latin 1 defaulting in HTTP and the
    difficulty in changing it for most users.
    
    I really *don't* want to encourage using the quick fix -- META -- because
    that one is a mistake IMO, and should never have been introduced, much
    less propagated by XML.
    
    Much better to ignore these -- any significant errors will be caught as
    "Non-SGML char", the rest will be smart quotes that show up funny on other
    platforms -- and concentrate on the ones that create actual problems (i.e.
    anything that needs more than Latin 1 for the basic language, including
    human language and technical writing).
    
    
    OTOH, I'm open to other views. Given that you evidently have far greater
    experience with these issues than I have, I'd like to hear your take on
    this, Martin. Maybe Björn would like to chime in too? Anyone else? Masa?
    
    
    
    
    BTW, I was supposed to send the below two weeks ago, but things got a
    little crazy. :-)
    
    I'm not an IRC person, but after using it as a channel for discussing the
    Validator with Gerald I'm forced to admit that it can be a pretty effective
    addition to email. Since there seems to be a bit of interest in
    contributing to the development of the validator it'd be good if y'all
    would consider stopping by #validator from time to time. Gerald complains
    that you (Martin) don't use IRC much, but you may consider this a prod to
    get you moving. :-)
    
    Nick, you also expressed interest in following the development a bit
    closer, didn't you? Björn? Liam? #validator lives on the private server
    irc.w3.org:6665. Gerald is usually there (though not always quite "there"
    ;D) and I try to check in at least every few days (the joys of unmetered
    Internet access ;D).