Re: Better internationalization of validator

At 11:46 01/06/05 +0200, Bjoern Hoehrmann wrote:

> >>Text::Iconv doesn't do so in a manner usable for the validator, e.g.
> >>
> >>   use Text::Iconv;
> >>   my $c = Text::Iconv->new('utf-8' => 'cp850');
> >
> >The syntax (with the => going in the wrong direction) looks a bit
> >strange here.
>
>That's Perl's "fat comma"; it's equivalent to the comma operator (','),
>except that => forces the left side to be a string, see
>
>   `perldoc perlop`/"Comma Operator"

It should still go the other way round:
  $converter = Text::Iconv->new("fromcode", "tocode");
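
For the validator we would of course convert towards utf-8, not away
from it. A minimal sketch of what I have in mind (the encoding names
and the input string are only placeholders):

  use Text::Iconv;

  Text::Iconv->raise_error(0);   # convert() returns undef on error
  # Arguments are (fromcode, tocode); the '=>' is just Perl's
  # "fat comma", i.e. a comma that quotes its left-hand side.
  my $converter = Text::Iconv->new('iso-8859-1', 'utf-8');
  my $output = $converter->convert("caf\xE9");   # latin-1 input
  warn "conversion failed\n" unless defined $output;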


> >>>[Text::Iconv stops on encoding errors]
> >>Text::Iconv doesn't do so in a manner usable for the validator, e.g.
> >
> >What is not suitable, exactly?
>
>Assume someone got a UTF-8 encoded document, opened it in Windows
>'Notepad', inserted two sentences containing lots of characters with
>diaeresis, and now tries to validate the document. The validator would
>then refuse to validate it. Assuming it's an XHTML document, the user
>might go to the mentioned location, write &ouml; instead of 'ö', and
>pass the document through the validator again, and so on. This will
>get very frustrating after some time.

Well, I would assume that there will be some 'learning effect'.
Also, we can try to make the error message easier to understand.


>Yes, XML 1.0 says one must treat this
>as a fatal error, but applications are allowed to search for further
>errors in the document.

Yes. We can easily make a list of which lines contain errors.
But it is difficult to do anything more once we have found errors,
because a line with an error is simply 'eaten up' by Text::Iconv.
I just did a test with http://www.w3.org/2001/01/xml-latin1.html
(first point on http://validator.w3.org/todo.html), with a version
of 'check' that actively converts from us-ascii to utf-8.
What I got in the source listing (and thus what was fed to
the validator) is:

Source Listing

Below is the source input I used for this validation:

1: <?xml version="1.0"?>
2: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
3:     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
4: <html xmlns="http://www.w3.org/1999/xhtml">
5:
6: <head>
7:   <title>test</title>
8:   <meta http-equiv="Content-Type" content="text/html" />
9: </head>
10:
11: <body>
12:
13: <h1>test</h1>
14:
15:
16:
17: <address>
18:   Gerald Oskoboiny<br />
19:   <a href="Mail">gerald@w3.org</a><br />
20:   $Date: 2001/01/05 00:30:45 $
21: </address>
22:
23: </body>
24:
25: </html>

Line 15 (<p>here is a non-us-ascii character:  </p>) has just been
swallowed by Text::Iconv. The document as it stands is still
valid, but only by chance, because the <p> and the </p> are on
the same line.

So writing out the line numbers of the lines that caused errors
in Text::Iconv is a good idea, but feeding the (incomplete) converted
result into the next step is a bad idea, because the result would
risk being misleading.
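
A rough sketch of that first half (collecting the bad line numbers
without pretending the converted output is complete; $charset and
@lines stand in for the real input handling):

  use Text::Iconv;

  Text::Iconv->raise_error(0);            # undef instead of croaking
  my $conv = Text::Iconv->new($charset, 'utf-8');
  my (@converted, @bad_lines);
  my $lineno = 0;
  for my $line (@lines) {
      $lineno++;
      my $out = $conv->convert($line);
      if (defined $out) {
          push @converted, $out;
      } else {
          push @bad_lines, $lineno;       # report this line as bad
      }
  }
  # Tell the user about @bad_lines; don't hand @converted to the
  # parser as if it were the whole document.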

It would be nice to get more detailed results (e.g. where on
a line the problem is), but it seems difficult to get that
from Text::Iconv. One thing to do might be to split up the
input, but that's not easy to do in a general way that is
guaranteed to split on character (and not just byte) boundaries.
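
One crude way around the boundary problem, sketched here, would be to
probe successively longer byte prefixes of an offending line: the
longest prefix that still converts ends exactly before the first bad
(or cut-off) sequence, so we get a byte offset without knowing the
character boundaries in advance. Quadratic in the line length, but
lines are usually short ($conv is again a Text::Iconv object with
raise_error turned off):

  # Byte offset of the first conversion failure in $line.
  sub error_offset {
      my ($conv, $line) = @_;
      my $good = 0;
      for my $n (1 .. length $line) {
          $good = $n if defined $conv->convert(substr($line, 0, $n));
      }
      return $good;   # bytes preceding the first offending sequence
  }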


> >>Why? The relevant RFC should be very clear what e.g. a valid UTF-8 byte
> >>sequence is.
> >
> >Even for UTF-8, it's not that easy. Can it start with (the equivalent
> >of) a BOM? The RFC doesn't say yes, and doesn't say no.
>
>Unicode papers say yes, XML 1.0 says yes.

For XML, that's only in a non-normative appendix.
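
If we do decide to accept it, stripping the UTF-8 encoded BOM (the
three bytes 0xEF 0xBB 0xBF) before conversion would at least be easy,
e.g.:

  $data =~ s/\A\xEF\xBB\xBF//;   # drop a leading UTF-8 BOM, if any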

Regards,   Martin.
