- From: Martin Duerst <duerst@w3.org>
- Date: Tue, 05 Jun 2001 22:24:14 +0900
- To: Bjoern Hoehrmann <derhoermi@gmx.net>
- Cc: Terje Bless <link@tss.no>, W3C Validator <www-validator@w3.org>
At 11:46 01/06/05 +0200, Bjoern Hoehrmann wrote:

> >>Text::Iconv doesn't do so in a manner usable for the validator, e.g.
> >>
> >>  use Text::Iconv;
> >>  my $c = Text::Iconv->new('utf-8' => 'cp850');
> >
> >The syntax (with the => going in the wrong direction) looks a bit
> >strange here.
>
>That's Perl's "fat comma"; it's equivalent to the comma operator (',')
>except that => forces its left side to be a string, see
>
>  `perldoc perlop`/"Comma Operator"

It should still go the other way round:

    $converter = Text::Iconv->new("fromcode", "tocode");

> >>>[Text::Iconv stops on encoding errors]
> >>Text::Iconv doesn't do so in a manner usable for the validator, e.g.
> >
> >What is not suitable, exactly?
>
>Assume someone got a UTF-8 encoded document, opened it in Windows
>'Notepad', inserted two sentences containing lots of characters with
>diaeresis, and now tries to validate the document. The validator would
>then refuse to validate it. Assuming it's an XHTML document, the user
>might go to the mentioned location, write &ouml; instead of 'ö', and
>pass the document through the validator again, and so on. This will
>get very frustrating after some time.

Well, I would assume that there will be some 'learning effect'. Also,
we can try to make the error message more easily understandable.

>Yes, XML 1.0 says one must treat this as a fatal error, but
>applications are allowed to search for further errors in the document.

Yes. We can easily make a list of which lines contain errors (a rough
sketch of this is in the P.S. below). But it is difficult to do
anything more once we have found errors, because a line with an error
is just 'eaten up' by Text::Iconv.

I just did a test with http://www.w3.org/2001/01/xml-latin1.html
(first point on http://validator.w3.org/todo.html), with a version of
'check' that actively converts from us-ascii to utf-8. What I got in
the source listing (and thus what was fed to the validator) is:

Source Listing

Below is the source input I used for this validation:

   1: <?xml version="1.0"?>
   2: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
   3:    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
   4: <html xmlns="http://www.w3.org/1999/xhtml">
   5:
   6: <head>
   7: <title>test</title>
   8: <meta http-equiv="Content-Type" content="text/html" />
   9: </head>
  10:
  11: <body>
  12:
  13: <h1>test</h1>
  14:
  15:
  16:
  17: <address>
  18: Gerald Oskoboiny<br />
  19: <a href="Mail">gerald@w3.org</a><br />
  20: $Date: 2001/01/05 00:30:45 $
  21: </address>
  22:
  23: </body>
  24:
  25: </html>

Line 15 (<p>here is a non-us-ascii character: </p>) has just been
swallowed by Text::Iconv. The document as it stands is still valid,
but only by chance, because the <p> and the </p> were on the same
line. So writing out the line numbers of the lines that caused errors
for Text::Iconv is a good idea, but feeding the converted result into
the next step is a bad idea, because the outcome would be in danger
of being misleading.

It would be nice to get more detailed results (e.g. where on a line
the problem is), but it seems difficult to get that from Text::Iconv.
One thing to do might be to split up the input, but that's not easy
to do in a general way that ensures splitting on character (and not
only byte) boundaries (see the second sketch in the P.S.).

> >>Why? The relevant RFC should be very clear what e.g. a valid UTF-8
> >>byte sequence is.
> >
> >Even for UTF-8, it's not that easy. Can it start with (the
> >equivalent of) a BOM? The RFC doesn't say yes, and doesn't say no.
>
>Unicode papers say yes, XML 1.0 says yes.

For XML, that's only in a non-normative appendix.

Regards,    Martin.
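P.S. Here is a minimal sketch of the 'list of lines with errors' idea.
It assumes the Text::Iconv interface as used above (with raise_error
off, convert() returns undef on failure), and it assumes line-by-line
conversion is safe, which holds for UTF-8 because the newline byte
never occurs inside a multibyte sequence, but not for all encodings:

    use Text::Iconv;
    Text::Iconv->raise_error(0);      # return undef instead of croaking
    my $c = Text::Iconv->new('us-ascii', 'utf-8');
    my (@converted, @bad);
    my $lineno = 0;
    while (my $line = <STDIN>) {
        $lineno++;
        my $out = $c->convert($line);
        if (defined $out) { push @converted, $out }
        else              { push @bad, $lineno }   # this line was 'eaten'
    }
    print STDERR "Undecodable lines: @bad\n" if @bad;

As said above, reporting @bad is useful, but handing @converted to the
parser would silently drop the bad lines.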
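As for splitting the input on character (and not only byte)
boundaries: I see no general way to do this for arbitrary encodings,
but for UTF-8 specifically the trailing bytes of a multibyte sequence
are always in the range 0x80-0xBF, so a block of bytes can be trimmed
back to a boundary before conversion. A rough, untested sketch:

    # Trim a block of bytes back to a UTF-8 character boundary.
    # Returns the convertible part plus a 'carry' to prepend to the
    # next block; a complete final character may also land in the
    # carry, which is harmless.
    sub trim_to_char_boundary {
        my ($buf) = @_;
        my $carry = '';
        # strip a trailing lead byte (0xC0-0xFF) and its continuation
        # bytes (0x80-0xBF), which may form an incomplete sequence
        $carry = $1 if $buf =~ s/([\xC0-\xFF][\x80-\xBF]*)\z//;
        return ($buf, $carry);
    }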
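And on the BOM question: if we do decide to accept one, the UTF-8 form
of U+FEFF is the byte sequence 0xEF 0xBB 0xBF, so stripping it before
conversion is a one-liner:

    $data =~ s/\A\xEF\xBB\xBF//;    # drop a leading UTF-8 signature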
Received on Tuesday, 5 June 2001 09:25:00 UTC