Re: Over 1500 invalid pages at www.w3.org

On Mon, 15 Oct 2001, Bjoern Hoehrmann wrote:

> * Nick Kew wrote:
> >Site Valet reports 5322 HTML pages at W3C, so that's nearly 30% invalid.
> 
> I think there are way more pages than 5322, even if you count only the
> publicly available pages.

Yes there are - it's still spidering them.  GetAgent will never
send more than one hit per minute to any one server, so it cannot
fetch more than 1440 (60 x 24) www.w3.org docs in a day.
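
For anyone checking the arithmetic, here is a minimal sketch of where the
1440 ceiling comes from. The function name is made up for illustration and
is not GetAgent's actual code; the only input taken from above is the
one-hit-per-minute limit.

```python
# Hypothetical helper (not part of GetAgent): the most requests a spider
# can send to one server per day when it waits at least
# min_interval_seconds between hits.
def max_requests_per_day(min_interval_seconds: int) -> int:
    seconds_per_day = 24 * 60 * 60
    return seconds_per_day // min_interval_seconds

# One hit per minute, as described above.
print(max_requests_per_day(60))  # 1440
```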

I took a more detailed look at the database after posting, and found it
had about 25000 www.w3.org URLs flagged as unvisited (though many of them
are non-HTML, so it'll only send a HEAD request to verify them).  I've no
doubt there will be more as it follows links in further pages.
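
As a rough sketch of what that backlog means in time: assuming every
unvisited URL costs one hit against the one-per-minute limit (HEAD
requests included, which is my reading of the limit rather than a measured
figure), and ignoring any new links discovered along the way, the estimate
works out as below. The function name is hypothetical.

```python
import math

# Hypothetical estimate, not output from Site Valet: days needed to visit
# a backlog of URLs on one server at a fixed per-day request ceiling.
def days_to_clear(backlog_urls: int, requests_per_day: int) -> int:
    return math.ceil(backlog_urls / requests_per_day)

# About 25000 unvisited URLs at 1440 requests per day.
print(days_to_clear(25000, 1440))  # 18
```

That is a lower bound, since the backlog grows as further pages are
spidered.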

The point of citing the number when I did is that it gives a proportion:
nearly 30% of a (substantial) sample proved to be invalid.
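
The proportion can be checked with a couple of lines. This is only a
sketch: 1500 is the floor implied by the subject line ("over 1500"), not
an exact count, so the true figure is somewhat higher.

```python
# Illustrative check only: 1500 is a lower bound taken from the subject
# line, and 5322 is the number of HTML pages visited so far.
def invalid_fraction(invalid_pages: int, pages_checked: int) -> float:
    return invalid_pages / pages_checked

print(f"{invalid_fraction(1500, 5322):.1%}")  # 28.2%
```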

Actually I just re-read Gerald's post that induced me to start spidering
www.w3.org, and I find I misread what he wrote in the first place :-(

-- 
Nick Kew

Site Valet - the essential service for anyone with a website.
<URL:http://valet.webthing.com/>

Received on Monday, 15 October 2001 13:01:29 UTC