- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Tue, 05 Jun 2001 04:32:30 +0200
- To: Martin Duerst <duerst@w3.org>
- Cc: Terje Bless <link@tss.no>, W3C Validator <www-validator@w3.org>
* Martin Duerst wrote:
>> >- Make sure that only the byte sequences legal in an encoding
>> > are accepted. (including the top item on the todo list)
I've written these regexes but
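# Each sub returns true if its argument contains only byte sequences
# that are legal in the named encoding.
sub is_valid_us_ascii { shift =~ /^[\x00-\x7f]*$/ }

sub is_valid_utf8
{
    # A lead byte followed by the matching number of continuation bytes,
    # up to the six-byte forms. Overlong forms and surrogates still pass.
    shift =~ /^(?:[\xC2-\xDF][\x80-\xBF]{1} |
                  [\xE0-\xEF][\x80-\xBF]{2} |
                  [\xF0-\xF7][\x80-\xBF]{3} |
                  [\xF8-\xFB][\x80-\xBF]{4} |
                  [\xFC-\xFD][\x80-\xBF]{5} |
                  [\x00-\x7f])*$/x;
}

sub is_valid_latin1
{
    # The C1 range 0x80-0x9F is treated as illegal.
    shift =~ /^[\x00-\x7f\xA0-\xFF]*$/
}

# Accepts any input.
sub is_valid_windows_1252 { 1 }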
>>I've been wanting to do this but 1) I haven't found any good ways to do it
I don't really like that either.
>The conversion to UTF-8 should give you that, if you catch the right
>errors (and use a converter that tells you, but Text::Iconv should
>be able to do that (http://www.perldoc.com/cpan/Text/Iconv.html#ERRORS),
>though I haven't tested it yet).
Text::Iconv doesn't do so in a manner usable for the validator, e.g.
use Text::Iconv;
use Data::Dumper;
my $c = Text::Iconv->new('utf-8' => 'cp850');
$c->raise_error(1);
eval { $c->convert("Björn") }; # ö is CP850 encoded
We'll have
$@ ::= 'Character not from source char set: Illegal byte sequence '.
'at - line 5.'
$! ::= 'Illegal byte sequence'
and Text::Iconv stops further parsing and conversion.
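For comparison, here is a minimal sketch of how a regex-based check could report where the first illegal byte sits instead of just failing. The helper name first_invalid_offset and its $unit argument (a qr// pattern matching one legal character) are made up for this example and are not part of the validator or of Text::Iconv.

```perl
use strict;
use warnings;

# Hypothetical helper: $unit is a qr// pattern that matches exactly one
# legal character (one or more bytes) in the encoding being checked.
# Returns the offset of the first illegal byte, or undef if all is legal.
sub first_invalid_offset {
    my ($bytes, $unit) = @_;
    $bytes =~ /\A(?:$unit)*/;   # always matches, possibly with zero length
    my $end = $+[0];            # end of the longest legal prefix
    return $end == length($bytes) ? undef : $end;
}

# Same byte ranges as is_valid_latin1; 0x94 is the CP850 byte for "ö".
my $latin1_unit = qr/[\x00-\x7F\xA0-\xFF]/;
my $offset = first_invalid_offset("Bj\x94rn", $latin1_unit);
print defined $offset ? "illegal byte at offset $offset\n" : "ok\n";
```

This only finds the first problem. To report further ones, a caller would have to skip the offending byte and run the check again on the remainder.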
>>and 2) I have yet to see a good definition of "valid"
>
>Good point.
Why? The relevant RFC should be very clear about what, e.g., a valid UTF-8
byte sequence is.
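To illustrate how precisely this can be pinned down, here is a sketch of a stricter pattern. It assumes the well-formed byte sequence rules of RFC 3629, which was published after this thread and obsoletes RFC 2279: at most four bytes per character, no overlong forms, no surrogates, nothing above U+10FFFF. The name is_well_formed_utf8 is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical stricter check, assuming the RFC 3629 rules.
# Returns 1 if $bytes is well-formed UTF-8, 0 otherwise.
sub is_well_formed_utf8 {
    my ($bytes) = @_;
    return $bytes =~ /\A(?:
          [\x00-\x7F]                          # US-ASCII
        | [\xC2-\xDF][\x80-\xBF]               # two bytes, C0 and C1 excluded
        | \xE0[\xA0-\xBF][\x80-\xBF]           # three bytes, no overlong forms
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # three bytes, general case
        | \xED[\x80-\x9F][\x80-\xBF]           # three bytes, no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}        # four bytes, no overlong forms
        | [\xF1-\xF3][\x80-\xBF]{3}            # four bytes, general case
        | \xF4[\x80-\x8F][\x80-\xBF]{2}        # four bytes, up to U+10FFFF
    )*\z/x ? 1 : 0;
}

print is_well_formed_utf8("Bj\xC3\xB6rn") ? "ok\n" : "not UTF-8\n";
print is_well_formed_utf8("Bj\x94rn")     ? "ok\n" : "not UTF-8\n";
```

The first call is given "ö" as the UTF-8 bytes C3 B6, the second as the single CP850 byte 94, which is a stray continuation byte as far as UTF-8 is concerned.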
>> >- <meta ... charset over multiple lines.
>>
>>I've been meaning to take *all* that code out back and shoot it for a while
>>now. It's been postponed because it's rather drastic and needs some serious
>>testing to avoid snafus and I'm desperately short on time ATM. The New Deal
>>is to use HTML::Parser for all such tasks (i.e. DOCTYPE sniffing and such).
>
>Any good docu available on html::parser?
The manual should be quite good; if not, Google helps.
>If it does a similar job to
>what the validator currently does, it may be okay. But does it allow
>to add new doctypes,...?
HTML::Parser is not a validator; it's just a generic parser for SGML and
XML documents and doesn't know much about syntax, structure, semantics,
etc. If it encounters a start tag, it reports a start tag; if it
encounters PCDATA, it reports PCDATA. It does know about HTML entities
and character references, though.
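As a rough sketch of what that event reporting looks like in practice, the following uses the version 3 handler API of HTML::Parser to pick the charset out of a <meta> element, even when the tag is spread over several lines. The helper name sniff_meta_charset and the charset regex are assumptions made for this example, not code from the validator.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the charset named in the first
# <meta http-equiv="Content-Type" ...> start tag, or undef if none is found.
sub sniff_meta_charset {
    my ($html) = @_;
    my $charset;
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ($tag, $attr) = @_;   # tag and attribute names arrive lowercased
                return if defined $charset or $tag ne 'meta';
                return unless lc($attr->{'http-equiv'} || '') eq 'content-type';
                ($charset) = ($attr->{content} || '')
                    =~ /charset\s*=\s*["']?([^\s;"']+)/i;
            },
            'tagname, attr',
        ],
    );
    $p->parse($html);
    $p->eof;
    return $charset;
}

my $doc = qq{<html><head><meta\n http-equiv="Content-Type"\n}
        . qq{ content="text/html;\n charset=iso-8859-1"></head></html>};
my $found = sniff_meta_charset($doc);
print defined $found ? "charset: $found\n" : "no charset found\n";
```

The parser hands over the whole start tag with its attributes already split out, so line breaks inside the tag need no special handling in the handler.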
--
Björn Höhrmann { mailto:bjoern@hoehrmann.de } http://www.bjoernsworld.de
am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de
25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
Received on Monday, 4 June 2001 22:31:34 UTC