- From: Brian Wilson <bloo@blooberry.com>
- Date: Sat, 08 Mar 2008 01:08:32 +0100
- To: www-validator@w3.org
Nikita The Spider The Spider wrote:
> On Thu, Mar 6, 2008 at 2:18 PM, Frank Ellermann
> <nobody@xyzzy.claranet.de> wrote:
>> Nikita The Spider The Spider wrote:
>>
>> >>> And the following doctypes:
>> > 2840: <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
>> > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
>> > 830: <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
>> [...]
>> >>> 34: <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.1//EN"
>> > "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
>> [...]
>> >>> And the following media types:
>> > 5121: text/html
>>
>> Not a single application/xhtml+xml, XHTML 1.0 is alive and kicking.
>
> I get a few of them, though not many. For instance, another sample of
> 300 sites/265872 pages gives this distribution of media types:
> 264246: text/html
> 1636: application/xhtml+xml

MAMA (the name of my tool) found ~420 URLs out of about 3.5 million
tried with xhtml 1.0 and application/xhtml+xml. Not nearly as many as
your above URL space found.

> I think Nikita may see fewer application/xhtml+xml pages than are in
> the wild. One reason is that her user agent string is simply "Nikita
> the Spider (http://NikitaTheSpider.com/)" -- no "mozilla-compatible"
> or other strings aimed at influencing code that sniffs user agents.
> The other reason is that she sends an Accept header of "*/*" which of
> course doesn't exclude application/xhtml+xml, but neither does it
> explicitly mention it. I would guess that many servers that
> conditionally send application/xhtml+xml explicitly check for the
> presence of that string in the Accept header and if it isn't present
> fall back to text/html.

I'd like to check for UA and other types of HTTP request header
discrimination in a future crawl. I think it could yield some
interesting results. I chose to emulate the Opera HTTP request headers
as closely as possible, since the original goal of the project was to
get an Opera's eye view of what the web looked like. It may be
informative to do the same with "different eyes" (request headers).
Opera's headers explicitly include "application/xhtml+xml", and as you
can see that didn't seem to overly skew the result content-type upwards
in the result set.

-Brian
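
For illustration, the server-side check guessed at in the quoted text (look for the literal string in the Accept header, otherwise fall back to text/html) might look roughly like the sketch below. The function name choose_media_type and the sample header values are hypothetical, not taken from any particular server, crawler, or browser, and the sketch deliberately ignores q-values, which a real content negotiation implementation would have to honor.

```python
def choose_media_type(accept_header: str) -> str:
    """Return application/xhtml+xml only when the Accept header names it explicitly.

    Hypothetical sketch of the naive check described above; q-values are ignored.
    """
    # Split on commas, drop parameters such as ";q=0.9", and normalise case.
    offered = {
        part.split(";", 1)[0].strip().lower()
        for part in accept_header.split(",")
    }
    if "application/xhtml+xml" in offered:
        return "application/xhtml+xml"
    # "*/*" does not exclude XHTML, but it does not mention it either,
    # so a server written this way falls back to text/html.
    return "text/html"


# A crawler sending only "*/*" would be served text/html.
print(choose_media_type("*/*"))
# A header that lists the type explicitly would be served XHTML.
print(choose_media_type("text/html, application/xhtml+xml;q=0.9, */*;q=0.1"))
```

Under that assumption, a crawler that sends only "*/*" would never see the application/xhtml+xml variant from such a server, while one that lists the type explicitly would.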
Received on Saturday, 8 March 2008 00:08:57 UTC