Re: Better internationalization of validator

Hello Bjoern,

At 04:32 01/06/05 +0200, Bjoern Hoehrmann wrote:
>* Martin Duerst wrote:
> >> >- Make sure that only the byte sequences legal in an encoding
> >> >   are accepted. (including the top item on the todo list)
>
>I've written these regexes but
>
>     sub is_valid_us_ascii { shift =~ /^[\x00-\x7f]*$/ }
>
>     sub is_valid_utf8
>     {
>         shift =~ /^(?:[\xC2-\xDF][\x80-\xBF]{1} |
>                       [\xE0-\xEF][\x80-\xBF]{2} |
>                       [\xF0-\xF7][\x80-\xBF]{3} |
>                       [\xF8-\xFB][\x80-\xBF]{4} |
>                       [\xFC-\xFD][\x80-\xBF]{5} |
>                       [\x00-\x7f])*$/x;
>     }

There would be a few more things to check, see e.g. sub CheckUTF8
in http://dev.w3.org/cvsweb/charlint/charlint.pl.
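
To illustrate the kind of extra checks meant here: the regex quoted above
accepts overlong forms such as \xE0\x80\x80 and encoded surrogates such as
\xED\xA0\x80. A minimal sketch of a stricter pattern follows. The name
is_strict_utf8 is hypothetical, the upper limit of U+10FFFF is an assumption
(stricter than the 5- and 6-byte forms the quoted regex allows), and this is
not a transcription of charlint's CheckUTF8.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the byte string is well-formed UTF-8.
# Assumption: nothing above U+10FFFF is allowed, so no 5- or 6-byte forms.
sub is_strict_utf8 {
    my ($octets) = @_;
    return $octets =~ /\A(?:
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # 2-byte, no overlong C0 or C1
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # 3-byte, excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # 3-byte, excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # 4-byte, excluding overlongs
        | [\xF1-\xF3][\x80-\xBF]{3}          # straight 4-byte
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # 4-byte, up to U+10FFFF
    )*\z/x ? 1 : 0;
}
```

Whether a validator should also reject noncharacters or some control
characters is a separate policy question that this pattern does not address.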


>
>     sub is_valid_latin1
>     {
>         shift =~ /^[\x00-\x7f\xA0-\xFF]*$/
>     }
>
>     sub is_valid_windows_1252 { 1 }
>
> >>I've been wanting to do this but 1) I haven't found any good ways to do it
>
>I don't really like that, either.

I agree that doing it this way adds too much code for not enough benefit.


> >The conversion to UTF-8 should give you that, if you catch the right
> >errors (and use a converter that tells you, but Text::Iconv should
> >be able to do that (http://www.perldoc.com/cpan/Text/Iconv.html#ERRORS),
> >though I haven't tested it yet).
>
>Text::Iconv doesn't do so in a manner usable for the validator, e.g.
>
>   use Text::Iconv;
>   use Data::Dumper;
>   my $c = Text::Iconv->new('utf-8' => 'cp850');

The syntax (with the => going in the wrong direction) looks a bit strange here.


>   $c->raise_error(1);
>   eval { $c->convert("Björn") }; # ö is CP850 encoded

Thanks for this. I was wondering how to get more out of
Text::Iconv. The description at
http://www.perldoc.com/cpan/Text/Iconv.html didn't say
anything about raise_error. The same description is also at
http://www.perldoc.com/cpan/Locale/Iconv.html, but I prefer
this to be Text::Iconv, because it shouldn't depend on locale.

Anyway, any pointers to better descriptions are highly appreciated.


>We'll have
>
>   $@ ::= 'Character not from source char set: Illegal byte sequence '.
>          'at - line 5.'
>   $! ::= 'Illegal byte sequence'
>
>and Text::Iconv stops further parsing and conversion.

Well, as usual, the error messages aren't terribly user friendly,
but that's a general validator problem. Also, in many cases, it's
better to stop after one error than to produce hundreds of errors.
And for XML, it's actually the only right thing to do, please
see http://www.w3.org/TR/REC-xml#charencoding:

 >>>>
It is a fatal error if an XML entity is determined (via default,
encoding declaration, or higher-level protocol) to be in a certain
encoding but contains octet sequences that are not legal in that
encoding.
 >>>>

So I'm still wondering why you say:

>Text::Iconv doesn't do so in a manner usable for the validator, e.g.

What is not suitable, exactly?
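
For comparison, the stop-at-the-first-error behaviour can be sketched with a
converter that raises an exception on illegal input. This sketch uses the
Encode module rather than Text::Iconv, which is an assumption about what is
installed; check_encoding is a hypothetical helper name, and the exact error
text varies between versions.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode $octets as $charset and stop at the first
# illegal byte sequence, returning a hash ref that describes the outcome.
sub check_encoding {
    my ($charset, $octets) = @_;
    my $copy = $octets;    # decode() may modify its input when CHECK is set
    my $text = eval { Encode::decode($charset, $copy, Encode::FB_CROAK()) };
    return { ok => 0, error => "$@" } unless defined $text;
    return { ok => 1, text => $text };
}
```

Calling check_encoding('UTF-8', "Bj\x94rn"), where \x94 is the CP850 byte
for "ö", gives ok => 0, which a validator could report as one fatal error
instead of continuing through the rest of the document.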


> >>and 2) I have yet to see a good definition of "valid"
> >
> >Good point.
>
>Why? The relevant RFC should be very clear what e.g. a valid UTF-8 byte
>sequence is.

Even for UTF-8, it's not that easy. Can it start with (the equivalent
of) a BOM? The RFC doesn't say yes, and doesn't say no.
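
Whichever way that question is answered, the check itself is small. As a
sketch, assuming the validator decides to tolerate a leading BOM (a policy
choice, not something taken from the RFC), the three bytes EF BB BF can be
split off before the rest is validated. The name split_utf8_bom is
hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether the byte string starts with the
# UTF-8 encoded form of U+FEFF and return the remaining bytes.
sub split_utf8_bom {
    my ($octets) = @_;
    if (substr($octets, 0, 3) eq "\xEF\xBB\xBF") {
        return (1, substr($octets, 3));    # BOM present, rest follows
    }
    return (0, $octets);                   # no BOM, input unchanged
}
```

A validator that decides the other way could treat a true first return value
as an error instead of stripping the bytes.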


Regards,   Martin.

Received on Tuesday, 5 June 2001 03:34:02 UTC