- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Fri, 27 Aug 2004 07:48:19 +0200
- To: Martin Duerst <duerst@w3.org>
- Cc: public-qa-dev@w3.org
* Martin Duerst wrote:
>I'm planning to work a bit on the link checker in the next few days,
>to make it comply with the IRI spec.

My understanding is that checklink only supports HTML and XHTML 1.x
documents; these document types prohibit anything but RFC 2396 URI
references and, starting with HTML 4.0, suggest a poorly implemented
error recovery strategy which is incompatible with the IRI processing
model, so I am not quite sure what you are proposing here. Maybe you
could give some more details on what you have in mind?

>The link checker, at:
>http://validator.w3.org/checklink?uri=http%3A%2F%2Fwww.w3.org%2F2001%2F08%2F
>iri-test%2Flinkcheck%2FresumeHtmlImgSrcBase.html&hide_type=all&depth=&check=
>Check
>claims that there is a broken link (which there shouldn't be).

I agree that there should not be a broken link in that document. I do
not agree that the link checker should not say that it is broken: it
clearly is, both from a conformance perspective and from a user agent
support perspective, and the link checker should clearly indicate that
this is the case so that the author can fix the document. Mozilla
Firefox, for example, fails the "test", and I think it is important to
most authors that their documents work in Firefox.

>What I'm planning to do is to convert downloaded pages in the link
>checker to UTF-8 (assuming I can find out what the encoding is). This
>will be very similar to the validator. The difference is that the link
>checker will only complain about missing 'charset' information if that
>information is actually relevant for linkchecking (i.e. in particular
>if there are links containing non-ASCII characters).

I am not sure how it is possible to determine whether this information
is relevant, since you need to transcode the document in order to tell
whether there are non-ASCII characters, and for transcoding you need to
know the original encoding.

Is there any chance you could implement whatever you have in mind here
as new stand-alone Perl modules, either in the W3C::* namespace or,
probably even better, in the more general CPAN namespaces (HTML::,
URI::, etc.)? These would be of a more general nature and likely to be
re-used by other tools; that is quite difficult to do with inline code,
and checklink is already > 2000 lines of code, so we should try to
avoid adding significantly more code to it.

It would also be good if you could implement any transcoding, etc. in a
way compatible with perlunicode, setting the UTF-8 flag and so on. The
MarkUp Validator currently does not do this and thus tends to generate
garbage in error messages, see
http://lists.w3.org/Archives/Public/www-validator/2004Apr/0129.html
for an example.
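To make that more concrete, something along these lines is roughly
what I would expect such a module to provide. This is only a sketch,
not actual checklink code; the function names are made up and the
details would of course depend on what you have in mind:

  use strict;
  use warnings;
  use Encode qw(decode);
  use URI::Escape qw(uri_escape_utf8);

  # Turn the downloaded octets into a Perl character string. $charset
  # would come from the Content-Type header or a <meta> element;
  # decode() sets Perl's internal UTF-8 flag, so later regular
  # expressions and error messages operate on characters rather than
  # bytes (see perlunicode).
  sub decode_document {
    my ($octets, $charset) = @_;
    return decode($charset, $octets);
  }

  # Charset information is only relevant for link checking if an
  # extracted reference actually contains non-ASCII characters.
  sub reference_has_non_ascii {
    my ($reference) = @_;
    return $reference =~ /[^\x00-\x7f]/;
  }

  # Map an IRI reference to a URI reference: percent-encode the UTF-8
  # octets of any non-ASCII character, leaving ASCII, including
  # reserved characters and existing %-escapes, untouched.
  sub iri_to_uri {
    my ($iri) = @_;
    $iri =~ s/([^\x00-\x7f])/uri_escape_utf8($1)/ge;
    return $iri;
  }

(Whether the "is it relevant" check can be made before the document
has been transcoded is of course exactly the problem I describe
above.)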
Received on Friday, 27 August 2004 05:49:08 UTC