- From: Bert Bos <bert@w3.org>
- Date: Tue, 23 Oct 2007 16:13:35 +0200
- To: www-style@w3.org
I decided to check what <link rel=stylesheet> points to in practice and
wrote a simple spider. I ran it on a random sample from Yahoo. (Thanks,
Yahoo, for a very useful service! Why don't other search engines
provide the same?)
MIME type                       #   (1)   (2)
---------------------------------------------
"text/css"                   1234   94%   99%
"text/plain"                    1    0%    0%
"text/html"                    13    1%    1%
"application/octet-stream"      1    0%    0%
other                           0    0%    0%
none                            0    0%    0%
fail                           60    5%    NA
---------------------------------------------
TOTAL                        1309  100%  100%
(1) Percentage of all tested <link> elements, including broken links
(2) Percentage of the successful requests
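
The two percentage columns can be recomputed from the raw counts. The following is a minimal sketch in Python (not part of the spider); the function name `percentages` is only illustrative, and it assumes percentages are rounded to the nearest whole number:

```python
# Counts copied from the table; "fail" marks requests that did not succeed.
counts = {
    "text/css": 1234,
    "text/plain": 1,
    "text/html": 13,
    "application/octet-stream": 1,
    "other": 0,
    "none": 0,
    "fail": 60,
}


def percentages(counts):
    """Return {label: (pct_of_all, pct_of_successful)} as rounded integers.

    pct_of_successful is None for the "fail" row, shown as NA in the table.
    """
    total = sum(counts.values())
    successful = total - counts["fail"]
    result = {}
    for label, n in counts.items():
        pct_all = round(100 * n / total)  # column (1)
        pct_ok = None if label == "fail" else round(100 * n / successful)  # column (2)
        result[label] = (pct_all, pct_ok)
    return result


table = percentages(counts)
for label, (pct_all, pct_ok) in table.items():
    shown = "NA" if pct_ok is None else f"{pct_ok}%"
    print(f"{label:28} {counts[label]:5} {pct_all:4}% {shown:>5}")
```

Column (1) divides by all 1309 links, column (2) by the 1249 requests that succeeded.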
Some notes:
a) I downloaded 2000 random pages and extracted all the links with
rel="stylesheet" or rel="alternate stylesheet". Some pages didn't have
such links; others had several. As the table shows, I found 1309 such links.
I didn't check if the pages were valid and my parser isn't as smart as
some browsers', so I may have missed links (or found spurious ones) if
the pages were particularly bad. My parser also can't find links that
are hidden in JavaScript. However, the goal wasn't to check how many
links there are, but what the links point to. A few missed links
probably don't change the picture.
b) I ran the spider a few more times to better understand
the "text/plain" links. There aren't many, but they seem to be mostly
from incorrectly configured servers, because looking at the contents of
the files, most are clearly CSS. I had expected this number to be much
higher. The low number (0%) means that MIME types can be relied upon,
which is good news.
c) Fewer than half of the "text/html" links are incorrectly labeled CSS
files. The rest are HTML pages. Many of those seem to be either error
pages (but returned with status 200 instead of 404) or redirects to
some HTML page in a (misguided) attempt to avoid returning 404s.
d) I didn't find any XSL or XSLT links. That may be because most such
links are in <?xml-stylesheet?> processing instructions instead of
<link> elements.
e) The random pages come from Yahoo[1]. I haven't tried analyzing what
bias the sample has. The spider[2] itself is a Bash program. It uses
curl to make the HTTP requests and wls[3] to extract links.
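
The two core steps from notes a) and e), extracting the stylesheet links and sorting each response into a row of the table, can be sketched as follows. This is an illustrative Python sketch, not the Bash spider itself: the names `StylesheetLinks`, `classify` and `KNOWN` are hypothetical, the HTTP request is left out, and it assumes the Content-Type header value (or None if the header is absent) has already been obtained for each link.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Media types that get their own row in the table; everything else is "other".
KNOWN = ("text/css", "text/plain", "text/html", "application/octet-stream")


class StylesheetLinks(HTMLParser):
    """Collect absolute URLs of <link> elements whose rel includes "stylesheet"."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel is a space-separated, case-insensitive list of keywords, so
        # "stylesheet" and "alternate stylesheet" both match.
        rel_tokens = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if "stylesheet" in rel_tokens and href:
            self.links.append(urljoin(self.base_url, href))


def classify(content_type):
    """Map a Content-Type header value (or None) to a row label of the table."""
    if not content_type or not content_type.strip():
        return "none"
    # Drop parameters such as "; charset=UTF-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type if media_type in KNOWN else "other"


page = """<html><head>
<link rel="stylesheet" href="main.css">
<link rel="Alternate StyleSheet" href="/alt.css" title="Alt">
<link rel="icon" href="favicon.ico">
</head></html>"""

parser = StylesheetLinks("http://example.org/dir/page.html")
parser.feed(page)
print(parser.links)
print(classify("text/css; charset=UTF-8"))
```

Like the parser described in note a), this sketch only sees links present in the markup; links generated by JavaScript are missed. A failed request would be counted as "fail" by the caller rather than passed to `classify`.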
[1] http://random.yahoo.com/fast/ryl
[2] http://www.w3.org/People/Bos/Robots/countstylemimetypes.sh
[3] http://www.w3.org/Tools/HTML-XML-utils/
Bert
--
Bert Bos                        ( W 3 C ) http://www.w3.org/
http://www.w3.org/people/bos    W3C/ERCIM
bert@w3.org                     2004 Rt des Lucioles / BP 93
+33 (0)4 92 38 76 92            06902 Sophia Antipolis Cedex, France
Received on Tuesday, 23 October 2007 14:13:44 UTC