A little statistics on linked style sheets

I decided to check what <link rel=stylesheet> points to in practice and 
wrote a simple spider. I ran it on a random sample from Yahoo. (Thanks, 
Yahoo, for a very useful service! Why don't other search engines 
provide the same?)

    MIME type                     #   (1)   (2)
    -------------------------------------------
    "text/css"                 1234   94%   99%
    "text/plain"                  1    0%    0%
    "text/html"                  13    1%    1%
    "application/octet-stream"    1    0%    0%
    other                         0    0%    0%
    none                          0    0%    0%
    fail                         60    5%    NA
    -------------------------------------------
    TOTAL                      1309  100%  100%

    (1) Percentage of all tested <link> elements, including broken links
    (2) Percentage of the successful requests
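
The two percentage columns can be recomputed from the counts as a sanity check. What follows is a minimal Python sketch, not part of the spider; the names `counts`, `percent` and `table_rows` are made up for illustration, and rounding to the nearest integer is an assumption that happens to reproduce the figures in the table.

```python
# Counts copied from the table above; names are illustrative only.
counts = {
    "text/css": 1234,
    "text/plain": 1,
    "text/html": 13,
    "application/octet-stream": 1,
    "other": 0,
    "none": 0,
    "fail": 60,
}

total = sum(counts.values())         # all tested <link> elements
successful = total - counts["fail"]  # requests that did not fail


def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


def table_rows(counts):
    """Yield (label, count, column 1, column 2); column 2 is None for 'fail'."""
    for label, n in counts.items():
        col1 = percent(n, total)
        col2 = None if label == "fail" else percent(n, successful)
        yield label, n, col1, col2


for label, n, col1, col2 in table_rows(counts):
    col2_text = "NA" if col2 is None else f"{col2}%"
    print(f"{label:28} {n:5} {col1:4}% {col2_text:>5}")
```

Column (1) divides by all 1309 links, column (2) by the 1249 that could be fetched, which is why "text/css" reads 94% in one and 99% in the other.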

Some notes:

a) I downloaded 2000 random pages and extracted all the links with 
rel="stylesheet" or rel="alternate stylesheet". Some pages didn't have such 
links, others had several. As the table shows, I found 1309 such links. 
I didn't check if the pages were valid and my parser isn't as smart as 
some browsers', so I may have missed links (or found spurious ones) if 
the pages were particularly bad. My parser also can't find links that 
are hidden in JavaScript. However, the goal wasn't to check how many 
links there are, but what the links point to. A few missed links 
probably don't change the picture.
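
The extraction step in note a) can be approximated as follows. This is a hedged Python sketch using the standard library's `html.parser`, not the wls tool the spider actually uses; the class and function names are hypothetical. Like the parser described above, it sees only links present in the markup, not ones generated by JavaScript.

```python
from html.parser import HTMLParser


class StylesheetLinkParser(HTMLParser):
    """Collect href values of <link> elements whose rel includes 'stylesheet'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        if tag != "link":
            return
        attr = dict(attrs)
        # rel is a space-separated token list, so "alternate stylesheet"
        # matches as well as plain "stylesheet".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "stylesheet" in rel_tokens and attr.get("href"):
            self.hrefs.append(attr["href"])


def stylesheet_links(html_text):
    """Return the href of every style sheet link found in html_text."""
    parser = StylesheetLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs
```

The returned values may be relative URLs; they would still need to be resolved against the page's address (for example with `urllib.parse.urljoin`) before being requested.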

b) I ran the spider a few more times to better understand the 
"text/plain" links. There aren't many, and most seem to come from 
incorrectly configured servers: judging by their contents, most of the 
files are clearly CSS. I had expected this number to be much higher. 
The low share (1 link in 1309, which rounds to 0%) means that MIME 
types can be relied upon, which is good news.

c) Fewer than half of the "text/html" links are incorrectly labeled CSS 
files. The rest are HTML pages. Many of those seem to be either error 
pages (returned with status 200 instead of 404) or redirects to some 
HTML page in a (misguided) attempt to avoid returning 404s.

d) I didn't find any XSL or XSLT links. That may be because most such 
links are in <?xml-stylesheet?> processing instructions instead of 
<link> elements.

e) The random pages come from Yahoo[1]. I haven't tried analyzing what 
bias the sample has. The spider[2] itself is a Bash program. It uses 
curl to make the HTTP requests and wls[3] to extract links.

[1] http://random.yahoo.com/fast/ryl
[2] http://www.w3.org/People/Bos/Robots/countstylemimetypes.sh
[3] http://www.w3.org/Tools/HTML-XML-utils/



Bert
-- 
  Bert Bos                                ( W 3 C ) http://www.w3.org/
  http://www.w3.org/people/bos                               W3C/ERCIM
  bert@w3.org                             2004 Rt des Lucioles / BP 93
  +33 (0)4 92 38 76 92            06902 Sophia Antipolis Cedex, France

Received on Tuesday, 23 October 2007 14:13:44 UTC