Re: Better internationalization of validator from Bjoern Hoehrmann on 2001-06-05 (www-validator@w3.org from June 2001)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Tue, 05 Jun 2001 04:32:30 +0200
To: Martin Duerst <duerst@w3.org>
Cc: Terje Bless <link@tss.no>, W3C Validator <www-validator@w3.org>
Message-ID: <kvfohtono067gtvg0ov2uver9m19nfkeom@4ax.com>

* Martin Duerst wrote:
>> >- Make sure that only the byte sequences legal in an encoding
>> >   are accepted. (including the top item on the todo list)

I've written these regexes, but they are incomplete:

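    # US-ASCII: only bytes 0x00..0x7F.
    sub is_valid_us_ascii { shift =~ /^[\x00-\x7f]*$/ }

    # UTF-8: checks lead/continuation byte structure only; it does not
    # reject overlong forms such as \xE0\x80\x80.
    sub is_valid_utf8
    {
        shift =~ /^(?:[\xC2-\xDF][\x80-\xBF]{1} |
                      [\xE0-\xEF][\x80-\xBF]{2} |
                      [\xF0-\xF7][\x80-\xBF]{3} |
                      [\xF8-\xFB][\x80-\xBF]{4} |
                      [\xFC-\xFD][\x80-\xBF]{5} |
                      [\x00-\x7f])*$/x;
    }

    # ISO-8859-1: rejects the C1 control range 0x80..0x9F.
    sub is_valid_latin1
    {
        shift =~ /^[\x00-\x7f\xA0-\xFF]*$/
    }

    # Windows-1252: placeholder, accepts everything.
    sub is_valid_windows_1252 { 1 }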

>>I've been wanting to do this but 1) I haven't found any good ways to do it

I don't really like that either.

>The conversion to UTF-8 should give you that, if you catch the right
>errors (and use a converter that tells you, but Text::Iconv should
>be able to do that (http://www.perldoc.com/cpan/Text/Iconv.html#ERRORS),
>though I haven't tested it yet).

Text::Iconv doesn't do so in a manner usable for the validator, e.g.

  use Text::Iconv;
  use Data::Dumper;
  my $c = Text::Iconv->new('utf-8' => 'cp850');
  $c->raise_error(1);
  eval { $c->convert("Bj\x94rn") }; # \x94 is "ö" encoded in CP850

We'll have

  $@ ::= 'Character not from source char set: Illegal byte sequence '.
         'at - line 5.'
  $! ::= 'Illegal byte sequence'

and Text::Iconv stops further parsing and conversion.
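
For contrast, here is a minimal sketch of a scan that does not stop at
the first illegal sequence but records its byte offset and carries on.
It is not part of the validator; the helper name utf8_error_offsets is
hypothetical, and the byte ranges are taken from RFC 3629, which is an
assumption on my part since that RFC postdates this message.

```perl
use strict;
use warnings;

# One well-formed UTF-8 character (byte ranges assumed from RFC 3629).
my $utf8_char = qr/
      [\x00-\x7F]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Hypothetical helper: returns every byte offset at which no well-formed
# character starts. Scanning resumes one byte after each error.
sub utf8_error_offsets {
    my ($bytes) = @_;
    my @offsets;
    pos($bytes) = 0;
    while (pos($bytes) < length $bytes) {
        next if $bytes =~ /\G$utf8_char+/gc;   # skip a well-formed run
        push @offsets, pos($bytes);            # remember the bad byte
        pos($bytes) = pos($bytes) + 1;         # resynchronise and go on
    }
    return @offsets;
}
```

Called on the string from the example above, "Bj\x94rn", this reports
a single error at offset 2, and the rest of the input is still checked.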

>>and 2) I have yet to see a good definition of "valid"
>
>Good point.

Why? The relevant RFC should be very clear about what, e.g., a valid
UTF-8 byte sequence is.
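
To illustrate how such a definition can be turned into a check, here is
a sketch that decodes each sequence and applies the rules one at a time:
shortest form only, no surrogates, nothing above U+10FFFF. The rules are
assumed from RFC 3629 (which is newer than this message), and the name
is_strict_utf8 is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $bytes is well-formed UTF-8 under the
# RFC 3629 rules assumed above.
sub is_strict_utf8 {
    my ($bytes) = @_;
    my @b = unpack 'C*', $bytes;
    my $i = 0;
    while ($i < @b) {
        my $lead = $b[$i++];
        next if $lead < 0x80;                          # plain ASCII

        # Number of continuation bytes and smallest code point that
        # legitimately needs a sequence of this length.
        my ($need, $min);
        if    (($lead & 0xE0) == 0xC0) { ($need, $min) = (1, 0x80) }
        elsif (($lead & 0xF0) == 0xE0) { ($need, $min) = (2, 0x800) }
        elsif (($lead & 0xF8) == 0xF0) { ($need, $min) = (3, 0x10000) }
        else                           { return 0 }    # stray tail or 0xF8..0xFF

        my $cp = $lead & (0x7F >> ($need + 1));        # payload bits of the lead
        for (1 .. $need) {
            return 0 if $i >= @b;                      # truncated sequence
            my $tail = $b[$i++];
            return 0 unless ($tail & 0xC0) == 0x80;    # not a continuation byte
            $cp = ($cp << 6) | ($tail & 0x3F);
        }
        return 0 if $cp < $min;                        # overlong form
        return 0 if $cp >= 0xD800 && $cp <= 0xDFFF;    # UTF-16 surrogate
        return 0 if $cp > 0x10FFFF;                    # outside Unicode
    }
    return 1;
}
```

Unlike a pure lead/continuation pattern, this rejects "\xE0\x80\x80"
(an overlong encoding) and any 5- or 6-byte sequence.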

>> >- <meta ... charset over multiple lines.
>>
>>I've been meaning to take *all* that code out back and shoot it for a while
>>now. It's been postponed because it's rather drastic and needs some serious
>>testing to avoid snafus and I'm desperately short on time ATM. The New Deal
>>is to use HTML::Parser for all such tasks (i.e. DOCTYPE sniffing and such).
>
>Any good docu available on html::parser?

The manual should be quite good; if not, Google helps.

>If it does a similar job to
>what the validator currently does, it may be okay. But does it allow
>to add new doctypes,...?

HTML::Parser is not a validator; it's just a generic parser for SGML
and XML documents. It doesn't know much about syntax, structure,
semantics, etc. If it encounters a start tag, it reports a start tag;
if it encounters PCDATA, it reports PCDATA. It does know about HTML
entities and character references, though.
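
As a rough sketch of what "reports a start tag" means in practice, the
following uses a start-tag handler to pick the charset out of a <meta>
element, even when the tag is spread over several lines. It assumes the
version 3 handler interface of HTML::Parser; the helper name
sniff_meta_charset and the charset pattern are illustrative only, not
validator code.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns the charset named in the first
# <meta http-equiv="Content-Type" ...> start tag, or undef if none.
sub sniff_meta_charset {
    my ($html) = @_;
    my $charset;
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                # Tag and attribute names arrive lowercased by default.
                my ($tag, $attr) = @_;
                return if defined $charset or $tag ne 'meta';
                return unless lc($attr->{'http-equiv'} || '') eq 'content-type';
                ($charset) = ($attr->{content} || '')
                    =~ /charset\s*=\s*["']?([^\s;"']+)/i;
            },
            'tagname, attr',
        ],
    );
    $p->parse($html);
    $p->eof;
    return $charset;
}
```

The parser itself makes no judgement about whether the <meta> element
is allowed where it appears; it only hands the tag name and attributes
to the handler.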
-- 
Björn Höhrmann { mailto:bjoern@hoehrmann.de } http://www.bjoernsworld.de
am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de
25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/

Received on Monday, 4 June 2001 22:31:34 UTC