
checklink --print-uris

From: Dan Jacobson <jidanni@jidanni.org>
Date: Mon, 22 Dec 2003 08:29:49 +0800
To: www-validator@w3.org
Message-ID: <87oeu1d79e.fsf@jidanni.org>

[Note: as shown at the bottom, it appears parallel fetching will satisfy me.]

What I want to do with checklink:
find . -name \*.html|
xargs checklink --just-print-the-urls-that-need-to-be-checked>url-list

pppd isp #56 kbit/s modem link
< url-list ssh my.account.on.networked.machine \
checklink --just-check-that-these-urls-exist

or something like that. You see, with its much greater connectivity,
my.account.on.networked.machine could produce the results in moments,
or I could nohup it and get the results next call.

You see, running
find . -name \*.html|xargs checklink
wastes costly modem time, and doing
ssh my.account.on.networked.machine \
nohup checklink --recursive-or-whatever http://jidanni.org/

would eat unnecessarily into my precious bandwidth allotment at the
website host company, when indeed all my pages are right here on my PC
offline.

All in all, I'm saying there should be a way
to separate link gathering from link checking.

Wait, all I need to do is perhaps:
ssh my.account.on.networked.machine <<\!
sed 's/.*/<a href="&">x<\/a>/'<<\EOF |checklink
url1
url2...
EOF
!

Also, I must first extract all the URLs from my pages...

Hold on, if apt-get can have a --print-uris, why can't checklink have
a --just-print-urls-we-would-have-checked a/k/a --print-uris?

Maybe such a URI list could also have, commented out, the pages in
which the URIs were found:
#nurd.html
http://turd.oo/
http://turd.oo/blaa
#verd.html
http://hey.vern/ernest
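
A rough sketch of how the two halves could be done by hand today, not
a checklink feature. The function names print_uris and uris_to_html
are made up for illustration. It assumes a grep that supports -o (GNU
and BSD grep do), sees only absolute http/https links written as
lowercase href="..." with double quotes, and ignores relative links.

```shell
# print_uris DIR: list the absolute http(s) links in DIR's .html files,
# each page's links preceded by a "#page" comment line.
# (Hypothetical helper, not part of checklink.)
print_uris() {
    find "$1" -name '*.html' | sort | while IFS= read -r page
    do
        printf '#%s\n' "$page"
        # Only double-quoted, lowercase href attributes are matched.
        grep -o 'href="https\{0,1\}://[^"]*"' "$page" |
            sed 's/^href="//; s/"$//' | sort -u
    done
}

# uris_to_html: read such a list on stdin, drop "#" comment lines and
# blank lines, and wrap each URL in a link so a checker has a page to read.
# URLs are copied verbatim, exactly as they were written in the pages.
uris_to_html() {
    printf '<html><head><title>links</title></head><body>\n'
    grep -v -e '^#' -e '^$' | sed 's/.*/<a href="&">x<\/a>/'
    printf '</body></html>\n'
}
```

Usage might look like "print_uris ~/jidanni.org >url-list" on the PC,
then "uris_to_html <url-list >links.html" on the networked machine,
with links.html handed to checklink as an ordinary local file.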

Wait, let me try parallel fetching. Indeed, it finished faster:
set -e
unset ftp_proxy http_proxy
p=cl$$- #per-run prefix for the batch files
w=~/jidanni.org/ #local directory
cd /var/tmp #avoid nohup.out droppings
find "$w" -name '*.html'|split -l 6 - "$p" #6 file names per batch file
for i in "$p"*[abc]? #split's two-letter suffixes, aa..cz
do
#	< $i xargs checklink -n >$i.out&
	nohup sh -c "xargs checklink -n <$i >$i.out&" #nohup for emacs' compile mode :-(
done
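
The same batching could perhaps be had without the temporary split
files. This is only a sketch, assuming an xargs with the -P, -n and -0
extensions and a find with -print0 (GNU and BSD have them, POSIX does
not require them). The name check_parallel and the limit of 4 jobs are
made up here. Unlike the loop above, all output goes to one stream, so
reports from different batches may interleave.

```shell
# check_parallel DIR [CMD...]: run CMD (default: checklink -n) on DIR's
# .html files, 6 files per invocation, at most 4 invocations at a time.
# (Hypothetical helper; the job limit of 4 is an arbitrary choice.)
check_parallel() {
    dir=$1
    shift
    [ $# -gt 0 ] || set -- checklink -n
    find "$dir" -name '*.html' -print0 | xargs -0 -n 6 -P 4 "$@"
}
```

For example, "check_parallel ~/jidanni.org >report" would collect
everything in one file, and "check_parallel ~/jidanni.org echo" shows
how the file names get grouped without fetching anything.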
Received on Monday, 22 December 2003 00:41:07 GMT
