Re: link checker and IRIs

Hello Bjoern,

At 07:48 04/08/27 +0200, Bjoern Hoehrmann wrote:
>* Martin Duerst wrote:
> >I'm planning to work a bit on the link checker in the next few days,
> >to make it comply with the IRI spec.
>
>My understanding is that checklink only supports HTML and XHTML 1.x
>documents,

Yes. But I think it would be fairly easy to extend it to parse
other formats, such as SVG.
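
For instance, collecting the link targets from an SVG document
could look roughly like this (a sketch only, assuming XML::LibXML
is available; not actual checklink code):

  use XML::LibXML;

  # Collect candidate link targets from an SVG document: any
  # (xlink:)href attribute, wherever it appears.
  my $doc = XML::LibXML->new->parse_file($svg_file);
  my @targets = map { $_->value }
      $doc->findnodes('//@*[local-name()="href"]');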


>these document types prohibit anything but RFC 2396 URI
>References and, from HTML 4.0 onwards, suggest a poorly implemented
>error recovery strategy which is incompatible with the IRI processing
>model,

I'm wondering where you got this last phrase from. The error
recovery strategy in HTML 4.0 is very much compatible with
IRIs (maybe with the exception of the IDN part, which wasn't
imaginable at that time, but once the reference in HTML 4.0
to RFC 2396 is updated to RFC 2396bis, that problem is
solved, too).
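
For reference, the recovery that HTML 4.0 suggests (appendix
B.2.1) is: treat the reference as a sequence of characters, encode
it as UTF-8, and percent-escape each resulting non-ASCII byte.
That is essentially the same mapping the IRI spec defines for
converting an IRI to a URI. As a rough sketch (my own illustration,
not text from either spec):

  use Encode qw(encode);

  # Map a character-based reference (an IRI) to a pure-ASCII URI:
  # encode as UTF-8, then %-escape every non-ASCII byte.
  sub recover_reference {
      my $octets = encode('UTF-8', shift);
      $octets =~ s/([^\x00-\x7f])/sprintf('%%%02X', ord $1)/ge;
      return $octets;
  }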


>so I am not quite sure what you are proposing here. Maybe you could
>give some more details on what you have in mind?
>
> >The link checker, at:
> >http://validator.w3.org/checklink?uri=http%3A%2F%2Fwww.w3.org%2F2001%2F08%2Firi-test%2Flinkcheck%2FresumeHtmlImgSrcBase.html&hide_type=all&depth=&check=Check
> >claims that there is a broken link (which there shouldn't be).
>
>I agree that there should not be a broken link in that document.

Great!


>I do
>not agree that the link checker should not say that it is broken,

I don't understand how this statement and the one just above fit
together. You say that the document doesn't contain a broken link,
but that the link checker should still say it is broken.


>it
>clearly is, both from a conformance perspective and from a user
>agent support perspective; the link checker should clearly indicate
>that this is the case so that the author can fix the document. Mozilla
>Firefox, for example, fails the "test"; I think it is important to most
>authors that their documents work in Firefox.

Well, most authors want their stuff to work in most browsers. For
the example above, IE, Opera, and Safari work, but Mozilla doesn't.
Earlier versions of Mozilla worked, too, but then for some obscure
reason it switched back to the old 'take it as bytes' model.
I remember well that Mozilla implemented the right behavior after
I put out the first test; Opera did the same. If some more tests,
and the link checker, can help get Mozilla back on track, that
would be great.


> >What I'm planning to do is to convert downloaded pages in the link checker
> >to UTF-8 (assuming I can find out what the encoding is). This will be
> >very similar to the validator. The difference is that the link checker
> >will only complain about missing 'charset' information if that information
> >is actually relevant for linkchecking (i.e. in particular if there are
> >links containing non-ASCII characters).
>
>I am not sure how it is possible to determine whether this information
>is relevant, since you need to transcode the document in order to tell
>whether there are non-ASCII characters, and for transcoding you need to
>know the original encoding.

There may be some edge cases where this fails, but in general it
works out. We'll see.
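
The trick, roughly: in any ASCII-compatible encoding, a document
whose raw bytes are all below 0x80 cannot contain non-ASCII
characters, so for link extraction the 'charset' is then irrelevant.
Encodings that are not ASCII-compatible (UTF-16, EBCDIC-based ones)
are the edge cases. A minimal sketch, with made-up names, of what I
have in mind:

  use Encode qw(decode);

  sub decode_for_linkcheck {
      my ($octets, $charset) = @_;
      # All bytes below 0x80: pure ASCII under any ASCII-compatible
      # encoding, so a missing 'charset' does not matter here.
      return $octets unless $octets =~ /[^\x00-\x7f]/;
      # Non-ASCII bytes present: now the charset really matters;
      # fall back to Latin-1 as a guess if none was declared.
      warn "missing charset is relevant for link checking\n"
          unless defined $charset;
      return decode($charset || 'ISO-8859-1', $octets);
  }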


>Is there any chance you could implement whatever you had in mind here
>as new stand-alone Perl modules, either in the W3C::* namespace or
>probably even better in the more general CPAN namespaces (HTML::, URI::,
>etc.)? It seems these would mostly be of a more general nature and
>likely to be re-used by other tools; that's quite difficult to do with
>inline code, and checklink is already > 2000 lines of code, so we should
>try to avoid adding significantly more to it.

I was myself quite frightened of the checklink code up to a few days
ago. I'm quite a bit less so now, after having looked through it a few
times on the bus. [I don't claim in any way that I understand it yet.]
For what I'm planning for the link checker at the moment, I'm not
sure it will become a module. But it's possible to think about
how to move that code, or similar code, into a module later.


>It would also be good if you could implement any transcoding stuff, etc.
>in a way compatible with perlunicode, setting the UTF-8 flag etc.

Is it possible to do that in a way that doesn't depend on Perl versions?
Last time I looked into this area, Jungshik pointed me to some very
nasty version dependencies, and when I asked on the perl-unicode
list for advice, nobody had a solution.
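
For the record, what I understand the perlunicode-compatible way to
be, assuming Perl 5.8 or later (on 5.6 the Encode module isn't
available, which is exactly the kind of version dependency I mean):

  use Encode qw(decode);

  # decode() turns raw octets into a character string with the
  # UTF-8 flag set, so regexes, length(), etc. then operate on
  # characters rather than bytes. FB_CROAK dies on malformed input.
  my $chars = decode($charset, $octets, Encode::FB_CROAK);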


>The
>MarkUp Validator currently does not do this and thus tends to generate
>garbage in error messages, see
>
>   http://lists.w3.org/Archives/Public/www-validator/2004Apr/0129.html
>
>for an example.

Thanks for the pointer. I just tested with a shift_jis page, and
things looked okay. Could you give me the URI of the page that
produced the errors described in your mail?


Regards,    Martin.

Received on Saturday, 28 August 2004 01:31:18 UTC