- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Tue, 05 Jun 2001 04:32:30 +0200
- To: Martin Duerst <duerst@w3.org>
- Cc: Terje Bless <link@tss.no>, W3C Validator <www-validator@w3.org>
* Martin Duerst wrote:
>> >- Make sure that only the byte sequences legal in an encoding
>> >  are accepted. (including the top item on the todo list)

I've written these regexes, but

  sub is_valid_us_ascii { shift =~ /^[\x00-\x7f]*$/ }

  sub is_valid_utf8 {
    shift =~ /^(?:[\xC2-\xDF][\x80-\xBF]{1} |
                  [\xE0-\xEF][\x80-\xBF]{2} |
                  [\xF0-\xF7][\x80-\xBF]{3} |
                  [\xF8-\xFB][\x80-\xBF]{4} |
                  [\xFC-\xFD][\x80-\xBF]{5} |
                  [\x00-\x7f])*$/x;
  }

  sub is_valid_latin1 { shift =~ /^[\x00-\x7f\xA0-\xFF]*$/ }

  sub is_valid_windows_1252 { 1 }

>>I've been wanting to do this but 1) I haven't found any good ways to do it

I don't really like that either.

>The conversion to UTF-8 should give you that, if you catch the right
>errors (and use a converter that tells you, but Text::Iconv should
>be able to do that (http://www.perldoc.com/cpan/Text/Iconv.html#ERRORS),
>though I haven't tested it yet).

Text::Iconv doesn't do so in a manner usable for the validator, e.g.

  use Text::Iconv;
  use Data::Dumper;
  my $c = Text::Iconv->new('utf-8' => 'cp850');
  $c->raise_error(1);
  eval { $c->convert("Björn") }; # ö is CP850 encoded

We'll have

  $@ ::= 'Character not from source char set: Illegal byte sequence '.
         'at - line 5.'
  $! ::= 'Illegal byte sequence'

and Text::Iconv stops further parsing and conversion.

>>and 2) I have yet to see a good definition of "valid"
>
>Good point.

Why? The relevant RFC should be very clear about what, e.g., a valid UTF-8
byte sequence is.

>> >- <meta ... charset over multiple lines.
>>
>>I've been meaning to take *all* that code out back and shoot it for a while
>>now. It's been postponed because it's rather drastic and needs some serious
>>testing to avoid snafus and I'm desperately short on time ATM. The New Deal
>>is to use HTML::Parser for all such tasks (i.e. DOCTYPE sniffing and such).
>
>Any good docu available on html::parser?

The manual should be quite good; if not, Google helps.

>If it does a similar job to
>what the validator currently does, it may be okay. But does it allow
>to add new doctypes,...?

HTML::Parser is no validator; it's just a generic parser for SGML and XML
documents. It doesn't know much about syntax, structure, semantics etc. If
it encounters a start tag, it reports a start tag; if it encounters PCDATA,
it reports PCDATA. OK, it does know about HTML entities and character
references, though.
-- 
Björn Höhrmann { mailto:bjoern@hoehrmann.de } http://www.bjoernsworld.de
am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de
25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
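To illustrate what "only the byte sequences legal in an encoding" could mean for UTF-8, here is a minimal sketch of a stricter check than is_valid_utf8 above. It is not from the original message: the name is_strict_utf8 is hypothetical, and the sketch assumes the tighter definition of UTF-8 later given in RFC 3629 (no overlong forms, no surrogates, nothing above U+10FFFF), rather than the 5- and 6-byte forms the regex above still accepts.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the validator): returns 1 if the byte
# string consists only of well-formed UTF-8 sequences, else 0.
# Assumes the RFC 3629 definition of UTF-8.
sub is_strict_utf8 {
    my ($bytes) = @_;
    return $bytes =~ /\A(?:
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # 2-byte, no overlongs
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, no overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte, general case
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, no overlongs
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte, general case
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, up to U+10FFFF
    )*\z/x ? 1 : 0;
}
```

With this sketch, is_strict_utf8("Bj\x94rn") returns 0, because \x94 (the CP850 ö from the Text::Iconv example) is a lone continuation byte, whereas is_strict_utf8("Bj\xC3\xB6rn") returns 1. Unlike the Text::Iconv approach, a regex like this only gives a yes/no answer and does not report where the first illegal byte occurs.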
Received on Monday, 4 June 2001 22:31:34 UTC