Re: link checker and IRIs

* Martin Duerst wrote:
>I'm planning to work a bit on the link checker in the next few days,
>to make it comply with the IRI spec.

My understanding is that checklink only supports HTML and XHTML 1.x
documents. These document types prohibit anything but RFC 2396 URI
references and, from HTML 4.0 on, suggest an error recovery strategy
that is poorly implemented and incompatible with the IRI processing
model, so I am not quite sure what you are proposing here. Maybe you
could give some more details on what you have in mind?
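
For reference, the recovery HTML 4.01 suggests in appendix B.2.1
amounts to roughly the following (a minimal sketch in Perl; actual
user agents deviate from it in various ways, which is part of the
problem):

  use Encode qw(encode);

  # HTML 4.01, appendix B.2.1: represent each non-ASCII character
  # in UTF-8 and escape the resulting octets as %HH.
  sub escape_non_ascii {
      my $chars  = shift;                    # character string
      my $octets = encode('UTF-8', $chars);  # UTF-8 octets
      $octets =~ s/([^\x00-\x7F])/sprintf('%%%02X', ord $1)/ge;
      return $octets;
  }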

>The link checker, at:
>http://validator.w3.org/checklink?uri=http%3A%2F%2Fwww.w3.org%2F2001%2F08%2Firi-test%2Flinkcheck%2FresumeHtmlImgSrcBase.html&hide_type=all&depth=&check=Check
>claims that there is a broken link (which there shouldn't be).

I agree that there should not be a broken link in that document. I do
not agree that the link checker should refrain from saying that it is
broken: it clearly is, both from a conformance perspective and from a
user agent support perspective, and the link checker should clearly
indicate this so that the author can fix the document. Mozilla
Firefox, for example, fails the "test", and I think it is important
to most authors that their documents work in Firefox.

>What I'm planning to do is to convert downloaded pages in the link checker
>to UTF-8 (assuming I can find out what the encoding is). This will be
>very similar to the validator. The difference is that the link checker
>will only complain about missing 'charset' information if that information
>is actually relevant for linkchecking (i.e. in particular if there are
>links containing non-ASCII characters).

I am not sure how it would be possible to determine whether this
information is relevant: you need to transcode the document in order
to tell whether there are non-ASCII characters, and for transcoding
you need to know the original encoding.
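
To illustrate the chicken-and-egg problem (a sketch; the naive octet
test below is only meaningful for ASCII-compatible encodings):

  # Naive test: does the raw document contain octets above 0x7F?
  sub looks_non_ascii { return $_[0] =~ /[^\x00-\x7F]/ }

  # This breaks down as soon as the encoding is not ASCII-compatible:
  # in UTF-16BE a character like U+4E2D is the octets 0x4E 0x2D, both
  # below 0x80, and in EBCDIC plain ASCII letters map to octets above
  # 0x7F, so the test yields both false negatives and false positives
  # unless the charset is already known.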

Is there any chance you could implement whatever you have in mind here
as new stand-alone Perl modules, either in the W3C::* namespace or,
probably even better, in the more general CPAN namespaces (HTML::,
URI::, etc.)? These would seem to be of a mostly general nature and
likely to be re-used by other tools, which is quite difficult with
inline code, and checklink is already more than 2000 lines of code;
we should try to avoid adding significantly more code to it.
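
As a rough illustration of the kind of interface I mean (the module
name below is made up, just to sketch the shape):

  package W3C::LinkChecker::Transcode;  # hypothetical name
  use strict;
  use warnings;
  use Encode ();

  # Decode raw HTTP response octets into a Perl character string;
  # returns undef if the charset is unknown or the octets malformed.
  sub to_characters {
      my ($octets, $charset) = @_;
      return undef unless defined $charset;
      my $enc = Encode::find_encoding($charset) or return undef;
      return eval { $enc->decode($octets, Encode::FB_CROAK) };
  }

  1;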

It would also be good if you could implement any transcoding code and
the like in a way compatible with perlunicode, setting the UTF-8 flag
and so on. The MarkUp Validator currently does not do this and thus
tends to generate garbage in its error messages; see

  http://lists.w3.org/Archives/Public/www-validator/2004Apr/0129.html

for an example.
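
Concretely, something along these lines (a sketch, assuming the
charset has already been determined elsewhere):

  use Encode qw(decode);

  # $octets, $charset: the raw response body and its charset,
  # determined earlier. Decoding yields a character string; Perl
  # sets the UTF-8 flag as needed, and regular expressions, substr()
  # and so on then operate on characters rather than octets.
  my $content = decode($charset, $octets, Encode::FB_CROAK);

  # Encode explicitly on output instead of leaking raw octets into
  # the error messages, which is what produces the garbage.
  binmode STDOUT, ':encoding(UTF-8)';
  print $content;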

Received on Friday, 27 August 2004 05:49:08 UTC