Re: link checker and IRIs from Bjoern Hoehrmann on 2004-08-28 (public-qa-dev@w3.org from August 2004)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Sat, 28 Aug 2004 09:24:28 +0200
To: Martin Duerst <duerst@w3.org>
Cc: public-qa-dev@w3.org
Message-ID: <41301f66.192872796@smtp.bjoern.hoehrmann.de>

* Martin Duerst wrote:
>I'm wondering where you got this last phrase from. The error
>recovery strategy in HTML 4.0 is very much compatible with
>IRIs (maybe with the exception of the IDN part, which wasn't
>imaginable at that time, but once the reference in HTML 4.0
>to RFC 2396 is updated to RFC 2396bis, that problem is
>solved, too).

For example, section 3.1, step 1, variants A and B in draft-duerst-iri-09
require NFC normalization, which would yield results that do not
comply with the suggestions in the HTML 4.01 Recommendation. I am fine
with implementing the suggestion in the HTML 4.01 Recommendation, if the
linkchecker points out that successful retrieval for such resources
depends on error recovery behavior that only a few user agents implement.
But that would not conform to the IRI internet draft. So these look very
much incompatible to me.
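
To make the difference concrete, here is a minimal Python sketch, not link
checker code. It assumes the HTML 4.01 suggestion (Appendix B.2.1) means
"encode as UTF-8, then %HH-escape the non-ASCII bytes", and that the IRI
draft variants in question apply NFC before that step. The function names
and the example.org URI are made up for illustration.

```python
import unicodedata
from urllib.parse import quote


def html401_escape(ref: str) -> str:
    """Percent-escape the UTF-8 bytes of non-ASCII characters only."""
    return "".join(ch if ord(ch) < 128 else quote(ch, safe="") for ch in ref)


def nfc_then_escape(ref: str) -> str:
    """Normalize to NFC first, then escape as above (assumed IRI variant)."""
    return html401_escape(unicodedata.normalize("NFC", ref))


# "e" followed by U+0301 COMBINING ACUTE ACCENT (a decomposed e-acute)
decomposed = "http://example.org/cafe\u0301"

print(html401_escape(decomposed))   # http://example.org/cafe%CC%81
print(nfc_then_escape(decomposed))  # http://example.org/caf%C3%A9
```

The two results name different resources as far as a server is concerned,
which is the incompatibility I mean.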

>I don't understand how this statement and the one just above fit
>together. You say that that document doesn't contain a broken
>link, but the link checker still should say it is broken.

No, I said that it should not contain one, i.e., the author should fix
it.

>I remember well that Mozilla implemented the right behavior after
>I put out the first test. Opera did the same. If some more tests,
>and the link checker, can help getting Mozilla back on track, that
>would be great.

Please make sure that these "tests" clearly point out that the
document is non-conforming and attempts to "test" for informational
error recovery suggestions. I already see users confused by HTML Tidy
correctly pointing out that such documents are non-conforming; if we
update the Markup Validator later this year to do the same, I do not
want to get bug reports for it backed by some "W3C tests".

I am also not sure whether Mozilla will implement different behavior
any time soon; there are many sites that would break if it did. That's
also why Microsoft backed out much of this behavior during the IE5 beta
cycle.

>There may be some edge cases that don't work out, but in general,
>these things usually work out. We'll see.

I am not sure whether it is a good idea to publish software with
bugs; finding and fixing them later is costly most of the time.

>For what I'm planning for the link checker at the moment, I'm not
>sure that will become a module. But it's possible to think about
>how to move that code, or similar code,

This also helps with testing and documenting the code. Feel free to post
here if you would like some help writing the modules or publishing them on
CPAN. Maybe you could join one of our meetings to discuss details?

>>It would also be good if you could implement any transcoding stuff, etc.
>>in a way compatible with perlunicode, setting the UTF-8 flag etc.
>
>Is it possible to do that in a way that doesn't depend on Perl versions?

That depends on what you are trying to achieve. For the Markup Validator
we will require Perl 5.8.0 soon, which should not have any relevant
problems in this regard, but I do not know whether this would be okay for
checklink. It should be possible to use your modules only if Perl 5.8 is
available.

>Thanks for the pointer. I just tested with a shift_jis page, and
>things looked okay. Could you give me the URI of the page that
>produced the errors described in your mail?

It was Olivier's message, actually, and he mentions http://www.google.co.jp.
Try <http://validator.w3.org/check?uri=http%3A%2F%2Fwww.google.co.jp> in the
Validator (i.e., validate the validation results) and you should get

  Sorry, I am unable to validate this document because on lines 297,
  429, 437, 473, 502, 523 it contained one or more bytes that I cannot
  interpret as utf-8 (in other words, the bytes found are not valid
  values in the specified Character Encoding). Please check both the
  content of the file and the character encoding indication.
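
For illustration only, a rough Python sketch of how such a list of line
numbers can come about when shift_jis bytes end up in output that is
labelled utf-8. This is not the Validator's actual code; the helper name
and the sample text are assumptions.

```python
from typing import List


def lines_with_invalid_utf8(data: bytes) -> List[int]:
    """Return 1-based numbers of lines that fail strict UTF-8 decoding."""
    bad = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            bad.append(number)
    return bad


# Hypothetical page: ASCII markup around shift_jis-encoded Japanese text.
sample = "<title>\n日本語\n</title>\n".encode("shift_jis")
print(lines_with_invalid_utf8(sample))  # [2]
```

Pure ASCII lines pass either way, which is why only some lines are reported.
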
Received on Saturday, 28 August 2004 07:25:11 UTC