Re: link checker and IRIs from Martin Duerst on 2004-09-02 (public-qa-dev@w3.org from September 2004)

From: Martin Duerst <duerst@w3.org>
Date: Thu, 02 Sep 2004 11:04:45 +0900
To: Bjoern Hoehrmann <derhoermi@gmx.net>
Cc: public-qa-dev@w3.org
Message-Id: <4.2.0.58.J.20040902103316.046cd748@localhost>
At 09:24 04/08/28 +0200, Bjoern Hoehrmann wrote:
>* Martin Duerst wrote:
> >I'm wondering where you got this last phrase from. The error
> >recovery strategy in HTML 4.0 is very much compatible with
> >IRIs (maybe with the exception of the IDN part, which wasn't
> >imaginable at that time, but once the reference in HTML 4.0
> >to RFC 2396 is updated to RFC 2396bis, that problem is
> >solved, too).
>
>For example, section 3.1, step 1, variant A and B in draft-duerst-iri-09
>require NFC normalization which would yield in results that do not
>comply with the suggestions in the HTML 4.01 Recommendation.

In general, you are right. But we have explicitly excluded
windows-1258 (Vietnamese) from the validator to avoid normalization
problems, and for the rest of the encodings, it shouldn't be a big deal.
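
To make the normalization concern concrete, here is a minimal sketch in
Python (chosen only for illustration; it is not the validator's Perl code).
It assumes a string in which a base letter is followed by a combining
accent, which is the kind of sequence a conversion from windows-1258 can
produce, and compares the UTF-8 %-escaping with and without NFC. The
sample string and variable names are assumptions made for the example.

```python
import unicodedata
from urllib.parse import quote

# "e with circumflex" followed by a combining acute accent (illustrative
# input, not taken from any real page).
decomposed = "\u00ea\u0301"

# NFC composes the pair into the single character U+1EBF.
composed = unicodedata.normalize("NFC", decomposed)

# quote() %-escapes the UTF-8 bytes of its argument by default.
escaped_as_is = quote(decomposed)  # no normalization step
escaped_nfc = quote(composed)      # normalization applied first

print(escaped_as_is)
print(escaped_nfc)
```

The two escaped forms differ, so a server that compares paths byte by
byte would treat them as different resources. For encodings whose
conversion to Unicode already yields composed characters, both paths
give the same result.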


>I am fine
>with implementing the suggestion in the HTML 4.01 Recommendation, if the
>linkchecker points out that successful retrieval for such resources
>depends on error recovery behavior that only few user agents implement.

I'm unhappy with 'only a few user agents implement', because I don't
think it's true. But it's probably a question of whether you think
the glass is half empty or half full.


>But that would not conform to the IRI internet draft.

Where/how/why?


> >I don't understand how this statement and the one just above fit
> >together. You say that that document doesn't contain a broken
> >link, but the link checker still should say it is broken.
>
>No, I said that it should not contain one, i.e., the author should fix
>it.
>
> >I remember well that Mozilla implemented the right behavior after
> >I put out the first test. Opera did the same. If some more tests,
> >and the link checker, can help getting Mozilla back on track, that
> >would be great.
>
>Please make sure that these "tests" clearly point out that the
>document is non-conforming and attempts to "test" for informational
>error recovery suggestions. I already see users confused by HTML Tidy
>correctly pointing out that such documents are non-conforming,

I think Tidy can do some really good work. It can offer an option
to convert from e.g. Latin-1 to the corresponding %-escaping for
authors who can't fix servers that serve resources under
Latin-1 paths. It can offer an option to downgrade to %-escaping
using UTF-8 for use on old browsers. It can offer an option for
converting IDNs to punycode for use on old browsers. But all
these options should be off by default. We shouldn't make it
more difficult than necessary for people to move in the right
direction.
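
As a rough illustration of the three conversions, the following Python
sketch shows what each one would produce. It is not Tidy code and does
not describe any existing Tidy option; the function names, the sample
path, and the sample host name are all hypothetical.

```python
from urllib.parse import quote


def escape_path(path: str, charset: str) -> str:
    """%-escape the non-ASCII characters of a path using the given charset."""
    return quote(path, safe="/", encoding=charset)


def host_to_punycode(host: str) -> str:
    """Convert an internationalized host name to its ASCII (punycode) form."""
    # Python's built-in "idna" codec implements the 2003 IDNA rules.
    return host.encode("idna").decode("ascii")


path = "/r\u00e9sum\u00e9.html"  # hypothetical path containing e-acute

latin1_escaped = escape_path(path, "latin-1")  # for servers with Latin-1 paths
utf8_escaped = escape_path(path, "utf-8")      # UTF-8 downgrade for old browsers
ascii_host = host_to_punycode("b\u00fccher.example")

print(latin1_escaped)
print(utf8_escaped)
print(ascii_host)
```

The first two results name different byte sequences on the server, which
is why the choice of charset has to match what the server actually
expects.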


>if we
>update the Markup Validator later this year to do the same,

I would not want that to happen. For one, these attributes are
CDATA.


>I do not
>want to get bug reports for it backed by some "W3C tests".

I think we should work on this looking forward, rather than looking
backwards. The spec labels this as error behavior, because at that
time, that was all we could do.


>I am also not sure whether Mozilla will implement different behavior
>any time soon, there are many sites that would break if it did.

Oh, so it's okay to have behavior that doesn't conform to any
spec at all, not even an error handling provision, but it's not okay
to test for this error handling provision?

There is another test, at
http://www.w3.org/2001/08/iri-test/resumeHtmlImgSrcBase.html,
that clearly shows that Mozilla is WRONG. And that test result
is 100% following the spec. That test doesn't work for the link
checker, because the link checker will find the red "WRONG" image,
and be happy with it.

Another case: Mozilla implements IDNs. Of course, according to
the HTML spec, that's wrong, because URIs don't allow non-ASCII
characters. Should we flag all that as wrong? Or do we want to
set up tests that make sure that all the browsers move in the
same, right, direction, and tests that motivate developers to
move in that direction? Should we lead the Web to its full
potential, or wait for more breakage to accumulate?


>That's
>also why Microsoft backed out much of this behavior during the IE5 beta
>cycle.

They don't activate the "URIs as UTF-8" option for East Asian
versions of IE, as far as I understand.


> >There may be some edge cases that don't work out, but in general,
> >these things usually work out. We'll see.
>
>I am not sure whether it is a good idea to publish software with
>bugs, finding and fixing them later is costly most of the time.

It would help a lot if you could come up with a realistic edge
case (an actual example) where things wouldn't work.


> >For what I'm planning for the link checker at the moment, I'm not
> >sure that will become a module. But it's possible to think about
> >how to move that code, or similar code,
>
>This also helps testing and documenting the code, feel free to post here
>if you would like some help writing the modules or publishing them on
>CPAN. Maybe you could join one of our meetings to discuss details?

I'd be glad to, but I just saw that I missed one very recently.
I guess I'll ask Olivier to give me a heads up for the next one.

I have looked into the modularization question a bit more
yesterday on the bus. One thing I might do is to take all the
charset code in 'check' and move it to a contiguous location.
The really tough job for modularization is to find a good way
to separate functionality and error reporting. I have some ideas,
but nothing firm yet.


> >>It would also be good if you could implement any transcoding stuff, etc.
> >>in a way compatible with perlunicode, setting the UTF-8 flag etc.

I have started working on that. I had to test what Perl actually meant
by 'illegal UTF-8'. My findings were that it didn't complain about
3-byte surrogates, and allowed characters >U+10FFFF. But otherwise,
I have found that using Encode, I can reduce the code quite a bit.
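
For comparison, the two cases mentioned above (a surrogate written as
three bytes, and a code point beyond U+10FFFF) can be written out as
byte sequences and run through a strict decoder. The sketch below uses
Python's built-in UTF-8 decoder purely as an example of a strict check;
it says nothing about how Perl or Encode behave, and the helper name
is_strict_utf8 is made up for the example.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


surrogate = b"\xed\xa0\x80"       # U+D800 written as a 3-byte sequence
beyond_max = b"\xf4\x90\x80\x80"  # would encode U+110000, above U+10FFFF
valid = "\u65e5\u672c".encode("utf-8")  # ordinary, well-formed text

for label, sample in [("surrogate", surrogate),
                      ("beyond U+10FFFF", beyond_max),
                      ("valid", valid)]:
    print(label, is_strict_utf8(sample))
```

A check that wants to reject exactly these cases needs either a decoder
that is strict in this sense or explicit tests for the two byte patterns.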

Unfortunately, I got stuck with some very weird phenomenon:
Many Japanese pages (shift_jis and others) work very well with my
new code, but the Google JP page just won't transcode, see
http://qa-dev.w3.org/wmvs/duerst/check?uri=http%3A%2F%2Fwww.google.co.jp&charset=%28detect+automatically%29&doctype=%28detect+automatically%29&ss=1.

I have verified that the transcoding code is actually used, that
the resulting lines have the UTF8 flag set, and also that the
pattern of readable characters and garbage that you can see
is still shift_jis. Any advice on what to test next is highly
appreciated!
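
One check that fits the symptom described (character data whose visible
pattern is still shift_jis) is to test whether the bytes were at some
point read as Latin-1 instead of being decoded from shift_jis. The
Python sketch below only illustrates that idea; it is a guess at one
possible cause, not a diagnosis, and the helper name
recover_if_mislabelled and the sample text are assumptions.

```python
from typing import Optional


def recover_if_mislabelled(text: str, charset: str) -> Optional[str]:
    """Undo a Latin-1 misreading of bytes that were really in `charset`.

    Returns the recovered text, or None if `text` cannot be explained
    as such a misreading.
    """
    try:
        # Map each character back to the single byte it came from,
        # then decode those bytes with the intended charset.
        return text.encode("latin-1").decode(charset)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


original = "\u65e5\u672c\u8a9e"          # sample Japanese text
raw = original.encode("shift_jis")       # bytes as served
misread = raw.decode("latin-1")          # character data, but not transcoded

print(recover_if_mislabelled(misread, "shift_jis") == original)
print(recover_if_mislabelled(original, "shift_jis"))
```

If the round trip recovers readable text for the failing page but
returns None for the pages that work, the transcoding step was probably
bypassed or applied with the wrong source encoding for that page.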


> >Is it possible to do that in a way that doesn't depend on Perl versions?
>
>That depends on what you are trying to achieve, for the Markup Validator
>we will require Perl 5.8.0 soon which should not have any relevant
>problem in this regard, but I do not know whether this would be okay for
>checklink. It should be possible to use your modules only if Perl 5.8 is
>available.

The perl versioning problem I alluded to was with Perl 5.8.0 and 5.8.1,
but it looks like it's not relevant to us.


> >Thanks for the pointer. I just tested with a shift_jis page, and
> >things looked okay. Could you give me the URI of the page that
> >produced the errors described in your mail?
>
>Olivier's message actually, and he mentions http://www.google.co.jp, try
><http://validator.w3.org/check?uri=http%3A%2F%2Fwww.google.co.jp> in the
>Validator (i.e., validate the validation results) and you should get
>
>   Sorry, I am unable to validate this document because on lines 297,
>   429, 437, 473, 502, 523 it contained one or more bytes that I cannot
>   interpret as utf-8 (in other words, the bytes found are not valid
>   values in the specified Character Encoding). Please check both the
>   content of the file and the character encoding indication.

After talking with Olivier, I understood that it was the validation
of the validation result for google.co.jp, not the validation of
the original site, that showed the problem.


Regards,     Martin.
Received on Thursday, 2 September 2004 02:09:30 UTC