Re: Better internationalization of validator

From: Bjoern Hoehrmann (derhoermi@gmx.net)
Date: Mon, Jun 04 2001

    From: Bjoern Hoehrmann <derhoermi@gmx.net>
    To: Martin Duerst <duerst@w3.org>
    Cc: Terje Bless <link@tss.no>, W3C Validator <www-validator@w3.org>
    Date: Tue, 05 Jun 2001 04:32:30 +0200
    Message-ID: <kvfohtono067gtvg0ov2uver9m19nfkeom@4ax.com>
    Subject: Re: Better internationalization of validator
    
    * Martin Duerst wrote:
    >> >- Make sure that only the byte sequences legal in an encoding
    >> >   are accepted. (including the top item on the todo list)
    
    I've written these regexes, though I'm not happy with them:
    
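        # US-ASCII: every byte must be in the 7-bit range.
        sub is_valid_us_ascii { shift =~ /^[\x00-\x7f]*$/ }

        # UTF-8 (loose): only checks that each lead byte is followed by the
        # right number of continuation bytes, so overlong forms and
        # surrogates are still accepted.
        sub is_valid_utf8
        {
            shift =~ /^(?:[\xC2-\xDF][\x80-\xBF]{1} |
                          [\xE0-\xEF][\x80-\xBF]{2} |
                          [\xF0-\xF7][\x80-\xBF]{3} |
                          [\xF8-\xFB][\x80-\xBF]{4} |
                          [\xFC-\xFD][\x80-\xBF]{5} |
                          [\x00-\x7f])*$/x;
        }

        # ISO-8859-1: bytes in the range \x80-\x9F are rejected.
        sub is_valid_latin1
        {
            shift =~ /^[\x00-\x7f\xA0-\xFF]*$/
        }

        # Windows-1252: no check yet, every byte sequence is accepted.
        sub is_valid_windows_1252 { 1 }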
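
    The Windows-1252 stub could be tightened a little. A minimal sketch,
    assuming that the five bytes with no character assigned in the published
    cp1252 mapping table (0x81, 0x8D, 0x8F, 0x90, 0x9D) should count as
    invalid; the name is_valid_windows_1252_strict is hypothetical and not
    part of the validator:

```perl
use strict;
use warnings;

# Hypothetical replacement for the is_valid_windows_1252 stub.
# Assumption: bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D are invalid because
# the cp1252 mapping table assigns no character to them.
sub is_valid_windows_1252_strict
{
    my ($bytes) = @_;
    return $bytes !~ /[\x81\x8D\x8F\x90\x9D]/ ? 1 : 0;
}

print is_valid_windows_1252_strict("Bj\xF6rn") ? "ok\n" : "invalid\n";
print is_valid_windows_1252_strict("a\x81b")   ? "ok\n" : "invalid\n";
```

    Some converters map those five bytes to C1 control characters instead
    of rejecting them, so whether this is the right notion of "valid" is a
    policy decision.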
    
    >>I've been wanting to do this but 1) I haven't found any good ways to do it
    
    I don't really like that either.
    
    >The conversion to UTF-8 should give you that, if you catch the right
    >errors (and use a converter that tells you, but Text::Iconv should
    >be able to do that (http://www.perldoc.com/cpan/Text/Iconv.html#ERRORS),
    >though I haven't tested it yet).
    
    Text::Iconv doesn't do so in a manner usable for the validator, e.g.
    
      use Text::Iconv;
      use Data::Dumper;
      my $c = Text::Iconv->new('utf-8' => 'cp850');
      $c->raise_error(1);
      eval { $c->convert("Björn") }; # ö is CP850 encoded
    
    We'll have
    
      $@ ::= 'Character not from source char set: Illegal byte sequence '.
             'at - line 5.'
      $! ::= 'Illegal byte sequence'
    
    and Text::Iconv stops further parsing and conversion.
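
    If all the validator needs is a yes/no answer plus a message, the
    "convert and catch the error" idea can be sketched as below. This is
    only an illustration: it uses Perl's core Encode module, which
    postdates this message, instead of Text::Iconv, and check_encoding is
    a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns (1, undef) if $bytes decodes cleanly as
# $charset, or (0, $error_message) otherwise.
sub check_encoding
{
    my ($charset, $bytes) = @_;
    my $copy = $bytes;   # decode() may modify its argument when CHECK is set
    my $ok = eval { Encode::decode($charset, $copy, Encode::FB_CROAK); 1 };
    return $ok ? (1, undef) : (0, $@);
}

# \x94 is the CP850 byte for o-umlaut, which is not valid UTF-8 here.
my ($ok, $err) = check_encoding('UTF-8', "Bj\x94rn");
print $ok ? "valid\n" : "invalid: $err";
```

    Like the Text::Iconv example, this stops at the first bad sequence, so
    it reports that the input is invalid but not every place where it is.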
    
    >>and 2) I have yet to see a good definition of "valid"
    >
    >Good point.
    
    Why? The relevant RFC should be very clear about what, e.g., a valid
    UTF-8 byte sequence is.
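
    As an illustration of how much a precise definition matters, here is
    a sketch of a stricter check. It assumes the UTF-8 definition in
    RFC 3629, which postdates this message and is narrower than what the
    regex earlier in this message accepts: no overlong forms, no
    surrogates, nothing above U+10FFFF. The name is_valid_utf8_strict is
    hypothetical.

```perl
use strict;
use warnings;

# Hypothetical strict variant of is_valid_utf8.
# Assumption: "valid" means well-formed according to RFC 3629.
sub is_valid_utf8_strict
{
    my ($bytes) = @_;
    return $bytes =~ /\A(?:
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # 2 bytes, no overlong forms
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # 3 bytes, no overlong forms
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3 bytes, straight
        |  \xED[\x80-\x9F][\x80-\xBF]        # 3 bytes, no surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # 4 bytes, planes 1 to 3
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4 bytes, planes 4 to 15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # 4 bytes, plane 16
    )*\z/x ? 1 : 0;
}

print is_valid_utf8_strict("Bj\xC3\xB6rn") ? "ok\n" : "invalid\n";
print is_valid_utf8_strict("\xC0\xAF")     ? "ok\n" : "invalid\n";
```

    The second call passes an overlong two-byte form of "/", which a check
    based only on lead byte and continuation byte counts would have to
    reject separately.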
    
    >> >- <meta ... charset over multiple lines.
    >>
    >>I've been meaning to take *all* that code out back and shoot it for a while
    >>now. It's been postponed because it's rather drastic and needs some serious
    >>testing to avoid snafus and I'm desperately short on time ATM. The New Deal
    >>is to use HTML::Parser for all such tasks (i.e. DOCTYPE sniffing and such).
    >
    >Any good docu available on html::parser?
    
    The manual should be quite good; if not, Google helps.
    
    >If it does a similar job to
    >what the validator currently does, it may be okay. But does it allow
    >to add new doctypes,...?
    
    HTML::Parser is no validator; it's just a generic parser for SGML and
    XML documents. It doesn't know much about syntax, structure, semantics,
    etc. If it encounters a start tag, it reports a start tag; if it
    encounters PCDATA, it reports PCDATA. OK, it does know about HTML
    entities and character references, though.
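
    To show what "reports a start tag" means for the <meta ... charset
    case, here is a rough sketch using the HTML::Parser version 3 handler
    interface. The helper name sniff_meta_charset and the charset regex
    are assumptions for illustration, not validator code.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the charset named in the first
# <meta http-equiv="Content-Type" content="...; charset=..."> tag,
# or undef if there is none.
sub sniff_meta_charset
{
    my ($html) = @_;
    my $charset;

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ($tagname, $attr) = @_;
                return if defined $charset or $tagname ne 'meta';
                return unless lc($attr->{'http-equiv'} // '') eq 'content-type';
                ($charset) = ($attr->{content} // '')
                    =~ /charset\s*=\s*["']?([\w.:-]+)/i;
            },
            'tagname, attr',
        ],
    );
    $parser->parse($html);
    $parser->eof;
    return $charset;
}

# The tag is spread over several lines on purpose.
my $doc = qq{<head><meta http-equiv="Content-Type"\n}
        . qq{  content="text/html;\n  charset=iso-8859-1"></head>};
my $found = sniff_meta_charset($doc);
print defined $found ? "charset: $found\n" : "no charset found\n";
```

    Because the parser hands over the attributes already split out, a tag
    that spans several lines needs no special handling in the handler.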
    -- 
    Björn Höhrmann { mailto:bjoern@hoehrmann.de } http://www.bjoernsworld.de
    am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de
    25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/