[checklink] list of potential solutions to timing issues

Re: http://lists.w3.org/Archives/Public/public-qa-dev/2005Mar/0009.html
Here is a list (re)visiting some of the options for improving the 
situation with the link checker's RobotUA implementation, and the 
timeouts that it seems to cause with some UAs.

I am including here a mixed bag of ideas, most of them bad, some of 
them outrageous, but with a bit of luck we can find one in the lot 
which isn't so terrible, or make a bad one better, or...

Cat 1: make the link checker faster
* RobotUA has a minimum latency of 1s between requests, so we can't 
make one W3C::UserAgent instance be much faster, but we could use 
several. I assume this is what the following comment means:
[[
my $ua = W3C::UserAgent->new($AGENT); # @@@ TODO: admin address
# @@@ make number of keep-alive connections customizable
]]
Having a configurable number could make sense. We could also spawn one 
W3C::UserAgent per target host (would require changes in how and when 
the parsing of the links is done, I suppose?)
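
To make the per-host idea a bit more concrete, here is a rough sketch. 
It is written in Python rather than checklink's Perl, and it is not 
based on the actual W3C::UserAgent code. It groups the links by host 
and gives each host its own worker, which keeps a 1s delay between its 
own requests, so that different hosts are checked in parallel while 
each one is still treated politely. The names group_by_host, check_host 
and check_all are made up for illustration, and the fetch function is 
assumed to be supplied by the caller.

```python
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def group_by_host(links):
    """Map each host name to the list of links that point at it."""
    groups = defaultdict(list)
    for link in links:
        groups[urlsplit(link).hostname or ""].append(link)
    return dict(groups)


def check_host(links, fetch, delay=1.0):
    """Check one host's links in order, sleeping `delay` seconds between requests."""
    results = {}
    for i, link in enumerate(links):
        if i:
            time.sleep(delay)
        results[link] = fetch(link)
    return results


def check_all(links, fetch, delay=1.0, max_workers=8):
    """Run one rate-limited worker per host; hosts are processed concurrently."""
    groups = group_by_host(links)
    results = {}
    if not groups:
        return results
    workers = min(max_workers, len(groups))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for part in pool.map(lambda ls: check_host(ls, fetch, delay), groups.values()):
            results.update(part)
    return results
```

With this shape the total time is roughly bounded by the host with the 
most links, not by the total number of links, as long as there are 
enough workers for all hosts.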

Cat 2: "cheat" and pretend we are sending some output when we're not.
(these are variants of our current hack spitting out spaces in summary 
mode)
* I thought that this hack could be changed to output HTML comments now 
and then instead of a space every time a link is processed. Doesn't 
feel like a very good solution in any case.
* "Summary only" could have the verbose output in a display:none. Would 
defeat the purpose in non-css-happy agents, though.
* we could also admit defeat and remove the summary option altogether...
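
For what it's worth, the "HTML comments now and then" variant could 
look something like the sketch below (Python, for illustration only, 
not checklink code). The function name keepalive_chunks, the every 
parameter and the wording of the comment are all assumptions.

```python
def keepalive_chunks(links, check, every=10):
    """Check each link, yielding an HTML comment after every `every` links."""
    for n, link in enumerate(links, start=1):
        check(link)
        if n % every == 0:
            # Invisible in rendered output, but keeps bytes flowing to the UA.
            yield f"<!-- {n} links checked -->\n"
```

The caller would write each yielded chunk to the response and flush it 
right away; whether that is enough to stop a given UA from timing out 
is of course not guaranteed.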

Cat 3: "tell me when you're done"
- use js to redirect to the results page when done
- use server push (i.e., as far as I can remember, serve as MIME 
multipart with each multipart boundary triggering a refresh in many, 
but not all, UAs)
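
As a reminder of what server push looks like on the wire, here is a 
small sketch (Python, for illustration only). It assumes the 
multipart/x-mixed-replace content type, where each new part replaces 
the previous one in UAs that support it. The boundary string and the 
function names are made up.

```python
BOUNDARY = "checklink-push"  # hypothetical boundary string


def push_header(boundary=BOUNDARY):
    """HTTP header announcing a stream of parts that replace one another."""
    return f"Content-Type: multipart/x-mixed-replace; boundary={boundary}\r\n\r\n"


def push_part(html, boundary=BOUNDARY):
    """One part of the stream: a complete HTML page for the UA to display."""
    return f"--{boundary}\r\nContent-Type: text/html\r\n\r\n{html}\r\n"


def push_stream(pages, boundary=BOUNDARY):
    """Yield the header, one part per page, then the closing boundary."""
    yield push_header(boundary)
    for html in pages:
        yield push_part(html, boundary)
    yield f"--{boundary}--\r\n"
```

The first part would be a "please wait" page and the last one the full 
results, with each chunk flushed as soon as it is produced.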

Two problems with the solutions above: 1- the basic mechanism won't 
work in all UAs, and 2- nothing tells us that the UA will not time out 
and give up before whatever mechanism we use eventually sends the 
"ready" signal.

Cat 4: Change the model
... and accept that a real-time CGI needing a few minutes to complete 
its task is perhaps not appropriate.
In this category come a bunch of asynchronous solutions where the user 
gives checklink a point of contact for sending results (by mail, or 
SOAP, or...) or where checklink gives the user a URI where the checking 
results will eventually be published (we would have to expire these 
with 410 Gone after a while, or be confronted with disk space issues).
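
The expiry part could be as simple as the sketch below (Python, for 
illustration only). The class name ResultStore, the one-week default 
lifetime and the in-memory storage are all assumptions; a real version 
would presumably keep the reports on disk.

```python
import time


class ResultStore:
    """Keeps published results for `ttl` seconds, then reports them as gone."""

    def __init__(self, ttl=7 * 24 * 3600, clock=time.time):
        self.ttl = ttl
        self.clock = clock
        self.published = {}   # result id -> (publication time, report)
        self.expired = set()  # ids whose report has been deleted

    def publish(self, result_id, report):
        """Store a finished report under the id used in its results URI."""
        self.published[result_id] = (self.clock(), report)

    def get(self, result_id):
        """Return (HTTP status, body) for a request on a results URI."""
        if result_id in self.published:
            created, report = self.published[result_id]
            if self.clock() - created < self.ttl:
                return 200, report
            # Too old: free the space but remember that it once existed.
            del self.published[result_id]
            self.expired.add(result_id)
        if result_id in self.expired:
            return 410, "Gone"
        return 404, "Not Found"
```

Remembering the expired ids is what lets us answer 410 for a report 
that used to exist and 404 for a URI we never handed out.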

One more idea in this category: supposing that we can give the link 
checker
- a cache of currently processed queries
- a buffer where the result table is being built
Then checklink's output could be

processing query, n links left to process, please reload in (estimated 
time) X seconds
   [ include current state of result table ]
And then when the request is complete, include the full table.
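
A very rough sketch of that last idea, in Python for illustration only 
and not modelled on checklink's actual code. The Job class, the jobs 
dictionary, handle_request and the one-second-per-link estimate are all 
assumptions; something else (not shown) would do the checking and call 
record() as results come in.

```python
class Job:
    """State of one link-checking query, kept between reloads."""

    def __init__(self, links, seconds_per_link=1.0):
        self.pending = list(links)   # links not yet checked
        self.rows = []               # result table built so far
        self.seconds_per_link = seconds_per_link

    def record(self, link, status):
        """Store the result for one link and take it off the pending list."""
        self.pending.remove(link)
        self.rows.append((link, status))

    def render(self):
        """Text sent back for the current request: progress note, then table."""
        lines = []
        if self.pending:
            wait = int(len(self.pending) * self.seconds_per_link)
            lines.append(
                f"processing query, {len(self.pending)} links left to process, "
                f"please reload in {wait} seconds"
            )
        lines += [f"{link}\t{status}" for link, status in self.rows]
        return "\n".join(lines)


jobs = {}  # cache of currently processed queries, keyed by the checked URI


def handle_request(uri, links):
    """Return the current output for `uri`, creating the job on first request."""
    if uri not in jobs:
        jobs[uri] = Job(links)
    return jobs[uri].render()
```

Each request returns immediately with whatever is in the buffer, so no 
single response has to stay open for minutes.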


Needless to say I haven't found anything that satisfies me in the above 
yet... I feel dirty. :)

-- 
olivier

Received on Friday, 11 March 2005 07:06:07 UTC