Re: Better internationalization of validator

From: Bjoern Hoehrmann (derhoermi@gmx.net)
Date: Tue, Jun 05 2001

    From: Bjoern Hoehrmann <derhoermi@gmx.net>
    To: Martin Duerst <duerst@w3.org>
    Cc: Terje Bless <link@tss.no>, W3C Validator <www-validator@w3.org>
    Date: Tue, 05 Jun 2001 11:46:30 +0200
    Message-ID: <736phtg65f9i5han27s7f9fggin29oivdu@4ax.com>
    Subject: Re: Better internationalization of validator
    
    * Martin Duerst wrote:
    >> >The conversion to UTF-8 should give you that, if you catch the right
    >> >errors (and use a converter that tells you, but Text::Iconv should
    >> >be able to do that (http://www.perldoc.com/cpan/Text/Iconv.html#ERRORS),
    >> >though I haven't tested it yet).
    >>
    >>Text::Iconv doesn't do so in a manner usable for the validator, e.g.
    >>
    >>   use Text::Iconv;
    >>   my $c = Text::Iconv->new('utf-8' => 'cp850');
    >
    >The syntax (with the => going in the wrong direction) looks a bit strange here.
    
    That's Perl's "fat comma". It is equivalent to the comma operator (','),
    except that => forces a word on its left side to be treated as a string; see
    
      `perldoc perlop`/"Comma Operator"
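
    As a minimal sketch of that equivalence (the variable names are only
    for illustration), both lists below hold the same two elements. Note
    that 'utf-8' in the example above still needs its quotes, because the
    hyphen means it is not a plain word:

```perl
use strict;
use warnings;

# Ordinary comma: both elements have to be quoted explicitly.
my @with_comma = ('from', 'to');

# Fat comma: the plain word on the left is quoted automatically,
# so this is accepted even under "use strict".
my @with_arrow = (from => 'to');

print "same list\n" if "@with_comma" eq "@with_arrow";
```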
    
    >>   $c->raise_error(1);
    >>   eval { $c->convert("Bj\x94rn") }; # \x94 is 'ö' encoded in CP850
    >
    >Thanks for this. I was wondering how to get more out of
    >Text::Iconv. The description at
    >http://www.perldoc.com/cpan/Text/Iconv.html didn't say
    >anything about raise_error. The same description is also at
    >http://www.perldoc.com/cpan/Locale/Iconv.html, but I prefer
    >this to be Text::Iconv, because it shouldn't depend on locale.
    >
    >Anyway, any pointers to better descriptions are highly appreciated.
    
    http://search.cpan.org/doc/MPIOTR/Text-Iconv-1.1/Iconv.pm or just
    
      `perldoc Text::Iconv`
    
    on a properly configured system.
    
    >>>[Text::Iconv stops on encoding errors]
    >>Text::Iconv doesn't do so in a manner usable for the validator, e.g.
    >
    >What is not suitable, exactly?
    
    Assume someone has a UTF-8 encoded document, opens it in Windows
    'Notepad', inserts two sentences containing lots of characters with
    diaereses, and then tries to validate the document. The validator
    would refuse to validate it. Assuming it's an XHTML document, the user
    might go to the reported location, write &ouml; instead of 'ö', and
    pass the document through the validator again, and so on. This gets
    very frustrating after a while. Yes, XML 1.0 says this must be treated
    as a fatal error, but applications are allowed to search for further
    errors in the document. If one had to run the document through the
    validator once for every little error, no one would use it anymore.
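
    To illustrate what "search for further errors" could look like, here
    is a rough sketch in plain Perl that collects the offset of every
    malformed byte instead of stopping at the first one. It does not use
    Text::Iconv; the helper name utf8_error_offsets is made up, and the
    byte ranges are an assumption taken from the stricter definition of
    well-formed UTF-8 (as later written down in RFC 3629), which is
    narrower than what RFC 2279 allows:

```perl
use strict;
use warnings;

# One well-formed UTF-8 sequence, matched against a byte string.
# Assumption: the strict ranges (no overlongs, no surrogates, max U+10FFFF).
my $utf8_char = qr/
      [\x00-\x7F]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Hypothetical helper: return the byte offsets of all malformed bytes.
sub utf8_error_offsets {
    my ($bytes) = @_;
    my @offsets;
    pos($bytes) = 0;
    while (pos($bytes) < length $bytes) {
        # A well-formed sequence at the current position: skip over it.
        next if $bytes =~ /\G$utf8_char/gc;
        # Otherwise record the offending byte and resynchronise
        # at the next byte, so later errors are reported too.
        push @offsets, pos($bytes);
        pos($bytes) = pos($bytes) + 1;
    }
    return @offsets;
}

# UTF-8 "ö" (C3 B6) followed by two Windows-1252 bytes (F6, FC).
my @errors = utf8_error_offsets("Bj\xC3\xB6rn H\xF6hrmann, Dageb\xFCll");
print "malformed bytes at offsets: @errors\n";
```

    With something along these lines the user would get one report that
    lists both stray bytes, rather than one validator run per character.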
    
    >> >>and 2) I have yet to see a good definition of "valid"
    >> >
    >> >Good point.
    >>
    >>Why? The relevant RFC should be very clear what e.g. a valid UTF-8 byte
    >>sequence is.
    >
    >Even for UTF-8, it's not that easy. Can it start with (the equivalent
    >of) a BOM? The RFC doesn't say yes, and doesn't say no.
    
    Unicode papers say yes, XML 1.0 says yes.
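
    If the validator accepts a leading BOM, it has to recognise the bytes
    EF BB BF at the very start of the document. A minimal sketch, with a
    made-up helper name (strip_utf8_bom), not taken from the validator's
    code:

```perl
use strict;
use warnings;

# Hypothetical helper: remove a leading UTF-8 signature (EF BB BF), if
# present, and report whether one was found.
sub strip_utf8_bom {
    my ($bytes) = @_;
    my $had_bom = ($bytes =~ s/\A\xEF\xBB\xBF//) ? 1 : 0;
    return ($bytes, $had_bom);
}

my ($text, $had_bom) = strip_utf8_bom("\xEF\xBB\xBF<?xml version=\"1.0\"?>");
print "document starts with a BOM\n" if $had_bom;
```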
    -- 
    Björn Höhrmann { mailto:bjoern@hoehrmann.de } http://www.bjoernsworld.de
    am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de
    25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/