Re: Better internationalization of validator

* Martin Duerst wrote:
>> >The conversion to UTF-8 should give you that, if you catch the right
>> >errors (and use a converter that tells you, but Text::Iconv should
>> >be able to do that (http://www.perldoc.com/cpan/Text/Iconv.html#ERRORS),
>> >though I haven't tested it yet).
>>
>>Text::Iconv doesn't do so in a manner usable for the validator, e.g.
>>
>>   use Text::Iconv;
>>   my $c = Text::Iconv->new('utf-8' => 'cp850');
>
>The syntax (with the => going in the wrong direction) looks a bit strange here.

That's Perl's "fat comma". It is equivalent to the comma operator (','),
except that => also forces a bareword on its left side to be treated as
a string; see

  `perldoc perlop`/"Comma Operator"
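
To illustrate the point about the fat comma, here is a minimal sketch.
The variable names are made up for the example and have nothing to do
with Text::Iconv itself:

```perl
use strict;
use warnings;

# Both lists hold the same two elements; => merely reads better in pairs.
my @with_comma = ('utf-8', 'cp850');
my @with_arrow = ('utf-8' => 'cp850');

# A bareword identifier to the left of => is quoted automatically,
# so "from" and "to" need no quotes here, even under "use strict".
my %pair = (from => 'utf-8', to => 'cp850');

print "same list\n" if "@with_comma" eq "@with_arrow";
print "$pair{from} -> $pair{to}\n";
```

So the arrow in the constructor call is only a visual hint that the two
arguments belong together; it does not imply a direction.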

>>   $c->raise_error(1);
>>   eval { $c->convert("Bj\x94rn") }; # \x94 is "ö" encoded in CP850
>
>Thanks for this. I was wondering how to get more out of
>Text::Iconv. The description at
>http://www.perldoc.com/cpan/Text/Iconv.html didn't say
>anything about raise_error. The same description is also at
>http://www.perldoc.com/cpan/Locale/Iconv.html, but I prefer
>this to be Text::Iconv, because it shouldn't depend on locale.
>
>Anyway, any pointers to better descriptions are highly appreciated.

http://search.cpan.org/doc/MPIOTR/Text-Iconv-1.1/Iconv.pm or just

  `perldoc Text::Iconv`

on a properly configured system.

>>>[Text::Iconv stops on encoding errors]
>>Text::Iconv doesn't do so in a manner usable for the validator, e.g.
>
>What is not suitable, exactly?

Assume someone has a UTF-8 encoded document, opens it in Windows
'Notepad', inserts two sentences containing lots of characters with
diaereses, and then tries to validate the document. The validator would
refuse to validate it. Assuming it's an XHTML document, the user might
go to the reported location, write '&ouml;' instead of 'ö', pass the
document through the validator again, and so on. This gets very
frustrating after a while. Yes, XML 1.0 says this must be treated as a
fatal error, but applications are allowed to search for further errors
in the document. If one had to run the document through the validator
once for every little error, no one would use it anymore.
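
What I have in mind is roughly the following: scan the raw bytes once
and collect every offending position, instead of dying at the first
one. This is only a sketch. The helper name utf8_error_offsets is made
up, and the pattern assumes the strict definition of well-formed UTF-8
(no overlong forms, no surrogates, nothing above U+10FFFF); it does not
use Text::Iconv at all.

```perl
use strict;
use warnings;

# One well-formed UTF-8 sequence (assumption: strict definition, i.e. no
# overlong forms, no surrogates, nothing above U+10FFFF).
my $well_formed = qr/
      [\x00-\x7F]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Hypothetical helper: return the byte offset of every byte that does not
# start a well-formed sequence, so all errors can be reported in one run.
sub utf8_error_offsets {
    my ($bytes) = @_;
    my @offsets;
    pos($bytes) = 0;
    while (pos($bytes) < length $bytes) {
        # /gc keeps the current position when the match fails
        next if $bytes =~ /\G$well_formed/gc;
        push @offsets, pos($bytes);
        pos($bytes) = pos($bytes) + 1;   # skip the bad byte and resynchronise
    }
    return @offsets;
}

# Two CP850 encoded "ö" bytes (\x94) inside otherwise plain ASCII text
my @bad = utf8_error_offsets("Bj\x94rn and H\x94hrmann");
print "invalid byte at offset $_\n" for @bad;   # offsets 2 and 11
```

With a list like this the validator could show all encoding problems at
once and still refuse to go any further, which is as far as I can see
what XML 1.0 permits.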

>> >>and 2) I have yet to see a good definition of "valid"
>> >
>> >Good point.
>>
>>Why? The relevant RFC should be very clear what e.g. a valid UTF-8 byte
>>sequence is.
>
>Even for UTF-8, it's not that easy. Can it start with (the equivalent
>of) a BOM? The RFC doesn't say yes, and doesn't say no.

Unicode papers say yes, XML 1.0 says yes.
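
In practice that means a consumer should simply tolerate the signature.
A minimal sketch, assuming the input is available as a raw byte string;
the helper name strip_utf8_bom is made up for the example:

```perl
use strict;
use warnings;

# Hypothetical helper: remove a leading UTF-8 encoded BOM (U+FEFF, i.e. the
# bytes EF BB BF) and report whether one was present.
sub strip_utf8_bom {
    my ($bytes) = @_;
    my $had_bom = ($bytes =~ s/\A\xEF\xBB\xBF//) ? 1 : 0;
    return ($bytes, $had_bom);
}

my ($body, $had_bom) = strip_utf8_bom("\xEF\xBB\xBF<?xml version=\"1.0\"?>");
print $had_bom ? "BOM removed\n" : "no BOM\n";
```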
-- 
Björn Höhrmann { mailto:bjoern@hoehrmann.de } http://www.bjoernsworld.de
am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de
25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/

Received on Tuesday, 5 June 2001 05:45:38 UTC