Re: URIs parsing in checklink from olivier Thereaux on 2008-03-07 (public-qa-dev@w3.org from March 2008)

From: olivier Thereaux <ot@w3.org>
Date: Fri, 7 Mar 2008 12:24:10 -0500
To: Dominique Hazael-Massieux <dom@w3.org>
Cc: public-qa-dev <public-qa-dev@w3.org>
Message-Id: <B07C8820-02FA-4441-AA08-9E4BC692722C@w3.org>

Hello,

Quick answer for now, mostly as I don't claim to know all the nooks  
and crannies of checklink… Ville will probably have more authoritative  
answers.

On Mar 7, 2008, at 09:16 , Dominique Hazael-Massieux wrote:
> The mobileOK checker currently has a pretty crude algorithm when parsing
> HTML pages and CSS style sheets to resolve URIs that it finds in there:
> if the URI matches the syntax in the RFC, it proceeds; otherwise, it
> reports an error.

I don't think the link checker reports any error in URI syntax at  
parse time. As far as I can remember, we use a subclassed HTML::Parser  
and get the content of a few key attributes, and pass that to the  
checker's list of links to check.
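
To illustrate the general shape of that approach, here is a minimal sketch. It is in Python rather than Perl, it is not checklink's actual code, and the set of attributes in LINK_ATTRS is an assumption made for the example. Note that nothing is validated at parse time: raw attribute values are simply collected.

```python
from html.parser import HTMLParser

# Assumption: a small, illustrative set of link-bearing attributes.
# The list the real link checker uses is not given in this thread.
LINK_ATTRS = {"href", "src"}


class LinkCollector(HTMLParser):
    """Collect raw values of link-bearing attributes, with no syntax check."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        for name, value in attrs:
            if name in LINK_ATTRS and value is not None:
                self.links.append(value)


def collect_links(html_text):
    """Return every collected attribute value, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

A value such as "a b.png", which is not a syntactically valid URI reference, would be collected like any other and only dealt with later.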

> So, I have a question and a suggestion:
> * the question is: how does the link checker parse URIs? I assume it
> needs to do so when making relative URIs absolute, as well as when
> doing HEAD/GET requests?

We rely on the Perl URI library for that. In particular, I think  
checklink mostly uses the new_abs() routine from http://search.cpan.org/dist/URI/URI.pm
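
As a rough illustration of what resolving a link against a base URI means, here is a sketch using Python's urllib.parse.urljoin as a stand-in for new_abs(). The helper name make_absolute and the example URIs are assumptions for the example, and the two libraries may differ on edge cases.

```python
from urllib.parse import urljoin


def make_absolute(link, base):
    """Resolve a possibly relative link against the URI of the containing page."""
    # urljoin takes the base first, then the reference to resolve.
    return urljoin(base, link)


# A relative reference is resolved against the base path.
print(make_absolute("c.css", "http://example.org/a/b.html"))
# An already absolute link is returned unchanged.
print(make_absolute("http://www.w3.org/", "http://example.org/a/"))
```

Resolution like this is purely mechanical; it does not by itself report whether the original attribute value matched the RFC syntax.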

> * the suggestion is: maybe the link checker should warn its users
> about links that don't match what the RFC requires?

Have you got some test cases for that? I'd like to add them to the  
link checker's test suite - and have a better idea of how it handles  
them.

> (of course, this probably opens up some dreaded cans of worms about
> URIs, IRIs and canonicalization)

Oh yes. :) Especially since the HTML spec refers to URIs (not IRIs)  
normatively, and we regularly get a mini-flamewar on the subject...

-- 
olivier

Received on Friday, 7 March 2008 17:24:22 UTC