- From: Bert Bos <bert@w3.org>
- Date: Tue, 23 Oct 2007 16:13:35 +0200
- To: www-style@w3.org
I decided to check what <link rel=stylesheet> points to in practice and wrote a simple spider. I ran it on a random sample from Yahoo. (Thanks, Yahoo, for a very useful service! Why don't other search engines provide the same?)

MIME type                     #   (1)   (2)
-------------------------------------------
"text/css"                 1234   94%   99%
"text/plain"                  1    0%    0%
"text/html"                  13    1%    1%
"application/octet-stream"    1    0%    0%
other                         0    0%    0%
none                          0    0%    0%
fail                         60    5%    NA
-------------------------------------------
TOTAL                      1309  100%  100%

(1) Percentage of all tested <link> elements, including broken links
(2) Percentage of the successful requests

Some notes:

a) I downloaded 2000 random pages and extracted all the links with rel="stylesheet" or rel="alternate stylesheet". Some pages didn't have such links, others had several. As the table shows, I found 1309 such links. I didn't check whether the pages were valid, and my parser isn't as smart as some browsers', so I may have missed links (or found spurious ones) if the pages were particularly bad. My parser also can't find links that are hidden in JavaScript. However, the goal wasn't to check how many links there are, but what the links point to. A few missed links probably don't change the picture.

b) I ran the spider a few more times to better understand the "text/plain" links. There aren't many, but they seem to come mostly from incorrectly configured servers, because, looking at the contents of the files, most are clearly CSS. I had expected this number to be much higher. The low number (0% after rounding) means that MIME types can be relied upon, which is good news.

c) Fewer than half of the "text/html" links are incorrectly labeled CSS files. The rest are HTML pages. Many of those seem to be either error pages (but returned with status 200 instead of 404) or redirects to some HTML page in a (misguided) attempt to avoid returning 404s.

d) I didn't find any XSL or XSLT links. That may be because most such links are in <?xml-stylesheet?> processing instructions instead of <link> elements.

e) The random pages come from Yahoo[1]. I haven't tried analyzing what bias the sample has.

The spider[2] itself is a Bash program. It uses curl to make the HTTP requests and wls[3] to extract links.

[1] http://random.yahoo.com/fast/ryl
[2] http://www.w3.org/People/Bos/Robots/countstylemimetypes.sh
[3] http://www.w3.org/Tools/HTML-XML-utils/

Bert
-- 
  Bert Bos                                ( W 3 C ) http://www.w3.org/
  http://www.w3.org/people/bos                               W3C/ERCIM
  bert@w3.org                             2004 Rt des Lucioles / BP 93
  +33 (0)4 92 38 76 92            06902 Sophia Antipolis Cedex, France
Received on Tuesday, 23 October 2007 14:13:44 UTC
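
As an illustration of how the two percentage columns in the table relate to the raw counts, here is a minimal Python sketch. It is not the Bash spider from the message; the function name `summarize`, the convention of passing `None` for a failed request, and the mapping of an empty Content-Type to "none" are all assumptions made for this example.

```python
from collections import Counter


def summarize(outcomes):
    """Tally Content-Type values and compute the two percentage columns.

    `outcomes` holds one entry per tested <link>: the Content-Type header
    as a string, or None when the request failed (hypothetical convention).
    Returns {mime: (count, pct_of_all, pct_of_successful)}.
    """
    counts = Counter()
    for value in outcomes:
        if value is None:
            counts["fail"] += 1
            continue
        # Drop parameters such as "; charset=utf-8" and normalize case.
        mime = value.split(";", 1)[0].strip().lower()
        # Assumption: an empty header value is counted as "none".
        counts[mime or "none"] += 1

    total = sum(counts.values())
    successful = total - counts["fail"]
    rows = {}
    for mime, n in counts.items():
        pct_all = round(100 * n / total)
        # Column (2) is not applicable to failures or when nothing succeeded.
        if mime == "fail" or successful == 0:
            pct_ok = None
        else:
            pct_ok = round(100 * n / successful)
        rows[mime] = (n, pct_all, pct_ok)
    return rows


if __name__ == "__main__":
    # Counts taken from the table in the message.
    sample = (["text/css"] * 1234 + ["text/plain"] + ["text/html"] * 13
              + ["application/octet-stream"] + [None] * 60)
    for mime, row in sorted(summarize(sample).items()):
        print(mime, row)
```

With the counts from the table, "text/css" comes out at 94% of all 1309 links and 99% of the 1249 successful requests, which matches columns (1) and (2).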