Re: URIs parsing in checklink

Hello,

Just a quick answer for now, since I don't claim to know all the nooks
and crannies of checklink… Ville will probably have more authoritative
answers.

On Mar 7, 2008, at 09:16 , Dominique Hazael-Massieux wrote:
> The mobileOK checker has currently a pretty crude algorithm when
> parsing HTML pages and CSS style sheets to resolve URIs that it finds
> in there: if the URI matches the syntax in the RFC, it proceeds,
> otherwise, it reports an error.

I don't think the link checker reports any URI syntax errors at
parse time. As far as I can remember, we subclass HTML::Parser,
extract the values of a few key attributes, and add them to the
checker's list of links to check.
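
To illustrate the idea (collect attribute values, do no syntax checking while parsing), here is a rough sketch. It is not checklink's code: it uses Python's standard html.parser as a stand-in for the Perl HTML::Parser subclass, and the attribute set LINK_ATTRS and the names LinkCollector and collect_links are made up for the example.

```python
from html.parser import HTMLParser

# Hypothetical set of attributes to collect; checklink's real list may differ.
LINK_ATTRS = {"href", "src"}


class LinkCollector(HTMLParser):
    """Collects raw link attribute values without validating them."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for
        # attributes written without a value.
        for name, value in attrs:
            if name in LINK_ATTRS and value is not None:
                self.links.append(value)


def collect_links(html):
    """Return the link attribute values found in html, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

Note that a syntactically dubious value is collected as is; any complaint about it would have to come from a later step.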

> So, I have a question and a suggestion:
> * the question is: how does the link checker parse URIs? I assume it
> needs to do so when making relative URIs absolute, as well as when
> doing HEAD/GET requests?

We rely on the Perl URI library for that. In particular, I think
checklink mostly uses the new_abs() routine from the URI module:
http://search.cpan.org/dist/URI/URI.pm
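
For readers who want to see what making a relative URI absolute looks like, here is a minimal sketch. It is an assumption-laden analogue, not the checker's code: it uses Python's urllib.parse.urljoin rather than the Perl URI module, and the helper name make_absolute and the example.org addresses are invented for illustration.

```python
from urllib.parse import urljoin


def make_absolute(link, base):
    """Resolve a possibly relative link against the URI of the page it was found in."""
    # urljoin takes the base first and the reference second.
    return urljoin(base, link)


if __name__ == "__main__":
    base = "http://www.example.org/a/b.html"
    for link in ("../c.html", "d.html#frag", "http://other.example/x"):
        print(link, "->", make_absolute(link, base))
```

Links that are already absolute come back unchanged, so the same call can be applied to every collected link.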

> * the suggestion is: maybe the link checker should warn its users
> about links that don't match what the RFC requires?

Have you got some test cases for that? I'd like to add them to the
link checker's test suite, and get a better idea of how it handles
them.

> (of course, this probably opens up some dreaded cans of worms about
> URIs, IRIs and canonicalization)


Oh yes. :) Especially since the HTML spec refers to URIs (not IRIs)
normatively, and we regularly get a mini-flamewar on the subject...

-- 
olivier

Received on Friday, 7 March 2008 17:24:22 UTC