Re: Bug 85/4494 (keeping track of validation statistics for various purposes)

Nikita The Spider The Spider wrote:
> On Thu, Mar 6, 2008 at 2:18 PM, Frank Ellermann
> <nobody@xyzzy.claranet.de> wrote:
>> Nikita The Spider The Spider wrote:
>>
>>> And the following doctypes:
>>>   2840: <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
>>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
>>>    830: <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
>> [...]
>>
>>>     34: <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.1//EN"
>>> "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
>> [...]
>>
>>> And the following media types:
>>>   5121: text/html
>>
>>  Not a single application/xhtml+xml, XHTML 1.0 is alive and kicking.
> 
> I get a few of them, though not many. For instance, another sample of
> 300 sites/265872 pages gives this distribution of media types:
>     264246: text/html
>      1636: application/xhtml+xml

MAMA (the name of my tool) found only ~420 URLs, out of about 3.5 million
tried, that paired XHTML 1.0 with application/xhtml+xml. That is a much
smaller share than in your sample above.
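
As a rough sanity check on those two figures, the shares can be worked out
directly. This is only a sketch: the MAMA numbers are approximate, and the
MAMA count covers only XHTML 1.0 pages served as application/xhtml+xml, so
the two percentages are not strictly like-for-like.

```python
# Counts quoted in this thread; the MAMA figures are approximate.
nikita_xhtml, nikita_pages = 1636, 265872
mama_xhtml, mama_urls = 420, 3_500_000


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


nikita_share = share(nikita_xhtml, nikita_pages)
mama_share = share(mama_xhtml, mama_urls)
ratio = nikita_share / mama_share

print(f"Nikita sample: {nikita_share:.3f}% application/xhtml+xml")
print(f"MAMA crawl:    {mama_share:.3f}% XHTML 1.0 + application/xhtml+xml")
print(f"Ratio:         about {ratio:.0f}x")
```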

> I think Nikita may see fewer application/xhtml+xml pages than are in
> the wild. One reason is that her user agent string is simply "Nikita
> the Spider (http://NikitaTheSpider.com/)" -- no "mozilla-compatible"
> or other strings aimed at influencing code that sniffs user agents.
> The other reason is that she sends an Accept header of "*/*" which of
> course doesn't exclude application/xhtml+xml, but neither does it
> explicitly mention it. I would guess that many servers that
> conditionally send application/xhtml+xml explicitly check for the
> presence of that string in the Accept header and if it isn't present
> fall back to text/html.

I'd like to check for discrimination based on the UA string and other
HTTP request headers in a future crawl; I think it could yield some
interesting results. I chose to emulate Opera's HTTP request headers as
closely as possible, since the original goal of the project was to get
an Opera's-eye view of what the web looked like. It may be informative
to do the same with "different eyes" (different request headers).
Opera's headers explicitly include "application/xhtml+xml", and as you
can see, that didn't noticeably push the share of application/xhtml+xml
responses upwards in the result set.
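
A check for that kind of header discrimination could look roughly like the
sketch below: request the same URL once per header profile and compare the
media types that come back. The profile names, the Accept strings, and the
helper functions (media_type, probe, discriminates) are all hypothetical;
they are not the headers or code that MAMA or Nikita actually use.

```python
import urllib.request

# Hypothetical header profiles (assumptions, not any real crawler's headers).
PROFILES = {
    "wildcard": {"Accept": "*/*"},
    "xhtml-aware": {
        "Accept": "application/xhtml+xml, text/html;q=0.9, */*;q=0.1",
    },
}


def media_type(content_type):
    """Return the bare media type from a Content-Type value, lowercased."""
    return (content_type or "").split(";", 1)[0].strip().lower()


def probe(url, profiles=PROFILES, opener=urllib.request.urlopen):
    """Fetch url once per profile; map profile name -> media type served."""
    served = {}
    for name, headers in profiles.items():
        request = urllib.request.Request(url, headers=headers)
        with opener(request) as response:
            served[name] = media_type(response.headers.get("Content-Type"))
    return served


def discriminates(served):
    """True if different profiles were answered with different media types."""
    return len(set(served.values())) > 1
```

Calling probe("http://example.org/") performs one real request per profile;
the opener parameter exists so that a stub can be substituted when testing
without network access. A User-Agent entry could be added to each profile in
the same way to look for UA sniffing.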

-Brian

Received on Saturday, 8 March 2008 00:08:57 UTC