Re: Bug 85/4494 (keeping track of validation statistics for various purposes)

On Thu, Mar 6, 2008 at 2:18 PM, Frank Ellermann
<nobody@xyzzy.claranet.de> wrote:
>
>  Nikita The Spider The Spider wrote:
>
>
> > And the following doctypes:
>  >   2840: <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
>  > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
>  >    830: <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
>  [...]
>
> >     34: <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.1//EN"
>  > "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
>  [...]
>
> > And the following media types:
>  >   5121: text/html
>
>  Not a single application/xhtml+xml, XHTML 1.0 is alive and kicking.

I get a few of them, though not many. For instance, another sample of
300 sites/265872 pages gives this distribution of media types:
    264246: text/html
      1636: application/xhtml+xml


I think Nikita may see fewer application/xhtml+xml pages than are in
the wild, for two reasons. First, her user agent string is simply
"Nikita the Spider (http://NikitaTheSpider.com/)" -- no
"mozilla-compatible" or other strings aimed at influencing code that
sniffs user agents. Second, she sends an Accept header of "*/*",
which of course doesn't exclude application/xhtml+xml but doesn't
explicitly mention it either. I would guess that many servers that
conditionally send application/xhtml+xml check for that exact string
in the Accept header and fall back to text/html if it isn't present.
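
To illustrate the kind of server-side check I'm guessing at, here is a
rough Python sketch. The function name choose_media_type and its logic
are purely hypothetical and not taken from any real server; it only
shows how an explicit-mention test would treat an Accept header of
"*/*".

```python
def choose_media_type(accept_header):
    """Pick a media type the way a naive sniffing server might (hypothetical)."""
    # Split the Accept header into media ranges and drop parameters
    # such as ";q=0.9". Quality values are deliberately ignored here.
    media_ranges = [
        part.split(";")[0].strip().lower()
        for part in (accept_header or "").split(",")
    ]
    # Only an explicit mention of application/xhtml+xml counts.
    # A wildcard such as "*/*" permits it but does not name it,
    # so the server falls back to text/html.
    if "application/xhtml+xml" in media_ranges:
        return "application/xhtml+xml"
    return "text/html"


# Nikita's Accept header versus a browser-style header that names the type.
print(choose_media_type("*/*"))
print(choose_media_type("text/html,application/xhtml+xml,*/*;q=0.8"))
```

With a check like this, a spider sending only "*/*" would always be
served text/html, so such pages would be counted as text/html in the
statistics.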

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more

Received on Friday, 7 March 2008 02:13:02 UTC