Re: Better internationalization of validator

From: Martin Duerst (duerst@w3.org)
Date: Tue, Jun 05 2001

    Message-Id: <4.2.0.58.J.20010605185326.040ec1f0@sh.w3.mag.keio.ac.jp>
    Date: Tue, 05 Jun 2001 22:24:14 +0900
    To: Bjoern Hoehrmann <derhoermi@gmx.net>
    From: Martin Duerst <duerst@w3.org>
    Cc: Terje Bless <link@tss.no>, W3C Validator <www-validator@w3.org>
    Subject: Re: Better internationalization of validator
    
    At 11:46 01/06/05 +0200, Bjoern Hoehrmann wrote:
    
    > >>Text::Iconv doesn't do so in a manner usable for the validator, e.g.
    > >>
    > >>   use Text::Iconv;
    > >>   my $c = Text::Iconv->new('utf-8' => 'cp850');
    > >
    > >The syntax (with the => going in the wrong direction) looks a bit
    > >strange here.
    >
    >That's Perl's "fat comma", it's equivalent to the comma operator (',')
    >besides the fact that => forces the left side to be a string, see
    >
    >   `perldoc perlop`/"Comma Operator"
    
    It should still go the other way round:
      $converter = Text::Iconv->new("fromcode", "tocode");
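
To make the argument order concrete, here is a minimal sketch in Python (not the validator's Perl, and not Text::Iconv itself). The helper name `convert` is hypothetical; it only mirrors the (fromcode, tocode) order used by iconv-style converters.

```python
def convert(data: bytes, fromcode: str, tocode: str) -> bytes:
    """Decode `data` from `fromcode`, then re-encode it as `tocode`."""
    return data.decode(fromcode).encode(tocode)


utf8_bytes = "ö".encode("utf-8")  # two bytes in UTF-8

# Source encoding first, target encoding second.
cp850_bytes = convert(utf8_bytes, "utf-8", "cp850")

# Swapping the arguments does not fail here, it silently produces mojibake,
# because cp850 assigns a character to each of the two UTF-8 bytes.
swapped = convert(utf8_bytes, "cp850", "utf-8")
```

With the arguments swapped, no error is raised for this input, which is why the order is worth getting right.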
    
    
    > >>>[Text::Iconv stops on encoding errors]
    > >>Text::Iconv doesn't do so in a manner usable for the validator, e.g.
    > >
    > >What is not suitable, exactly?
    >
    >Assume someone got a UTF-8 encoded document, opened it in Windows
    >'Notepad', inserted two sentences containing lots of characters with
    >diaeresis and now tries to validate the document. The validator would
    >then refuse to validate it. Assuming it's an XHTML document, the user
    >might go to the mentioned location, write &ouml; instead of 'ö', and
    >pass the document through the validator again, and so on. This will get
    >very frustrating after some time.
    
    Well, I would assume that there will be some 'learning effect'.
    Also, we can try to make the error message more easily understandable.
    
    
    >Yes, XML 1.0 says one must treat this
    >as fatal error but applications are allowed to search for further errors
    >in the document.
    
    Yes. We can easily make a list of which lines contain errors.
    But it is difficult to do anything more when we have found errors,
    because a line with an error is just 'eaten up' by Text::Iconv.
    I just did a test with http://www.w3.org/2001/01/xml-latin1.html
    (first point on http://validator.w3.org/todo.html), with a version
    of 'check' that actively converts from us-ascii to utf-8.
    What I got in the source listing (and thus what was fed to
    the validator) is:
    
    Source Listing
    
    Below is the source input I used for this validation:
    
    1: <?xml version="1.0"?>
    2: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    3:     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    4: <html xmlns="http://www.w3.org/1999/xhtml">
    5:
    6: <head>
    7:   <title>test</title>
    8:   <meta http-equiv="Content-Type" content="text/html" />
    9: </head>
    10:
    11: <body>
    12:
    13: <h1>test</h1>
    14:
    15:
    16:
    17: <address>
    18:   Gerald Oskoboiny<br />
    19:   <a href="Mail">gerald@w3.org</a><br />
    20:   $Date: 2001/01/05 00:30:45 $
    21: </address>
    22:
    23: </body>
    24:
    25: </html>
    
    Line 15 (<p>here is a non-us-ascii character:  </p>) has just been
    swallowed by Text::Iconv. The document as it stands is still
    valid, but only by chance, because the <p> and the </p> are on
    the same line.
    
    So writing out the line numbers of the lines that caused errors
    for Text::Iconv is a good idea, but feeding the converted output
    into the next step is a bad idea, because the results could be
    misleading.
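
Listing the lines that fail conversion, without passing the damaged output along, could look roughly like this. It is a Python sketch rather than the Perl used in 'check'; the function name is made up for illustration, and splitting on the newline byte assumes an ASCII-compatible encoding (so not UTF-16, for example).

```python
from typing import List


def lines_with_encoding_errors(data: bytes, encoding: str) -> List[int]:
    """Return the 1-based numbers of the lines that do not decode cleanly."""
    bad_lines = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode(encoding)
        except UnicodeDecodeError:
            # Record the line number; the line itself is not dropped or altered.
            bad_lines.append(number)
    return bad_lines


# Hypothetical three-line input; line 2 contains the Latin-1 byte 0xE9.
sample = b"<h1>test</h1>\n<p>caf\xe9</p>\n<address>x</address>\n"
bad = lines_with_encoding_errors(sample, "us-ascii")
```

The caller gets only line numbers back, so the original bytes stay untouched and nothing is silently removed from the source listing.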
    
    It would be nice to get more detailed results (e.g. where on
    a line the problem is), but it seems difficult to get that
    from Text::Iconv. One thing to do might be to split up the
    input, but that's not easy to do in a general way that is
    guaranteed to split on character (and not only byte) boundaries.
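
For illustration only, here is how both ideas might look in Python, whose decoder reports the byte offset of the first error. This is not something Text::Iconv is known to offer, and the function names are hypothetical. The splitting helper handles only UTF-8 and assumes well-formed input, so it does not solve the general case mentioned above.

```python
from typing import List, Optional


def first_error_offset(line: bytes, encoding: str = "utf-8") -> Optional[int]:
    """Return the byte offset of the first undecodable byte, or None."""
    try:
        line.decode(encoding)
    except UnicodeDecodeError as err:
        return err.start
    return None


def split_utf8(data: bytes, size: int) -> List[bytes]:
    """Split well-formed UTF-8 into chunks of about `size` bytes,
    never cutting inside a multi-byte character."""
    chunks = []
    pos = 0
    while pos < len(data):
        end = min(pos + size, len(data))
        # Bytes of the form 10xxxxxx are continuation bytes: back up past them.
        while pos < end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        if end == pos:
            # `size` is smaller than this character: take the whole character.
            end = pos + 1
            while end < len(data) and (data[end] & 0xC0) == 0x80:
                end += 1
        chunks.append(data[pos:end])
        pos = end
    return chunks


offset = first_error_offset(b"<p>caf\xe9</p>", "us-ascii")
pieces = split_utf8("aöb".encode("utf-8"), 2)
```

The boundary test relies on a property specific to UTF-8 (continuation bytes are recognizable on their own); other multi-byte encodings would each need their own rule.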
    
    
    > >>Why? The relevant RFC should be very clear what e.g. a valid UTF-8 byte
    > >>sequence is.
    > >
    > >Even for UTF-8, it's not that easy. Can it start with (the equivalent
    > >of) a BOM? The RFC doesn't say yes, and doesn't say no.
    >
    >Unicode papers say yes, XML 1.0 says yes.
    
    For XML, that's only in a non-normative appendix.
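
Whichever way the BOM question is settled, detecting the UTF-8 signature is simple. The following is a small Python sketch, not code from the validator, and the function name is hypothetical. Whether to accept, strip, or reject a leading BOM remains the policy decision discussed above.

```python
import codecs
from typing import Tuple


def strip_utf8_bom(data: bytes) -> Tuple[bytes, bool]:
    """Return (data without a leading UTF-8 BOM, whether a BOM was present)."""
    if data.startswith(codecs.BOM_UTF8):  # the three bytes EF BB BF
        return data[len(codecs.BOM_UTF8):], True
    return data, False


body, had_bom = strip_utf8_bom(b"\xef\xbb\xbf<?xml version=\"1.0\"?>")
```

Returning the flag alongside the data lets the caller warn about the BOM instead of dropping it silently.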
    
    Regards,   Martin.