Re: Bug 85/4494 (keeping track of validation statistics for various purposes)

On Sat, Mar 1, 2008 at 10:35 AM, Nikita The Spider The Spider
<nikitathespider@gmail.com> wrote:
>
> On Wed, Feb 27, 2008 at 10:09 PM, Brian Wilson <bloo@blooberry.com> wrote:
>  >
>  >  Nikita The Spider The Spider wrote:
>  >  > On Feb 6, 2008 12:17 PM, Brian Wilson <bloo@blooberry.com> wrote:
>  >  >> On Wed, 6 Feb 2008, olivier Thereaux wrote:
>  >  >>
>  >  >>> * stats on the documents themselves. Doctype, mime type, charset.
>  >  >>> Ideally, whether charset is in HTTP, XML decl, meta. There are
>  >  >>> existing studies about these, but another study made on a different
>  >  >>> sample would bring more perspective.

Hi Brian et al,
Here's a sample of the data I can pull out of Nikita. These are the
aggregate stats for 50 sites that Nikita crawled late last year. I
truncated the list of validation messages for brevity since this is
just an example. Is this the kind of data you're interested in?

Nikita deals with a wide variety of sites, and some are much bigger
than others. The smallest have just a few pages and the largest have
tens of thousands. I can already see that this is skewing the stats --
1755 of the validation errors were for the duplicate ID "undefined_2",
which probably all came from one site. I guess version 2 of this
statistics collector would base its numbers on a random sample of,
say, 50 pages from each site, ignoring all sites with fewer than 50
pages.
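
For what it's worth, the per-site sampling idea could look roughly like
the Python sketch below. This is only an illustration: the function
name sample_pages, the pages_by_site mapping and the seed parameter
are made up here and are not taken from Nikita's code.

```python
import random

def sample_pages(pages_by_site, sample_size=50, seed=None):
    """Return {site: [pages]} holding a fixed-size random sample per site.

    Sites with fewer than sample_size pages are skipped entirely, so
    every remaining site contributes the same number of pages.
    """
    rng = random.Random(seed)  # a seed makes the sample reproducible
    sampled = {}
    for site, pages in pages_by_site.items():
        if len(pages) < sample_size:
            continue  # too few pages to sample from; ignore this site
        # random.sample picks without replacement, so no page repeats
        sampled[site] = rng.sample(list(pages), sample_size)
    return sampled
```

With equal-sized samples, one very large site can no longer dominate
the totals the way the "undefined_2" errors do in the listing.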


In 5120 pages, Nikita found these 2111 errors:
     6276 (2.97%): required attribute "alt" not specified
     3750 (1.78%): required attribute "ALT" not specified
     3636 (1.72%): reference to entity "cat2" for which no system
                   identifier could be generated
     2746 (1.30%): reference to entity "node" for which no system
                   identifier could be generated
     1976 (0.94%): required attribute "TYPE" not specified
     1755 (0.83%): ID "undefined_2" already defined
     1427 (0.68%): document type does not allow element "META" here
     1274 (0.60%): end tag for element "td" which is not open
     1256 (0.59%): end tag for "tr" omitted, but OMITTAG NO was specified
     1250 (0.59%): end tag for element "tr" which is not open
     1221 (0.58%): end tag for "table" omitted, but OMITTAG NO was specified
     1205 (0.57%): an attribute value specification must be an
                   attribute value literal unless SHORTTAG YES is specified
     1197 (0.57%): end tag for "td" omitted, but OMITTAG NO was specified
      958 (0.45%): reference to entity "cat1" for which no system
                   identifier could be generated
      935 (0.44%): reference not terminated by REFC delimiter
      925 (0.44%): reference to external entity in attribute value

-- truncated for brevity --
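
A listing like the one above can be produced with a simple tally. The
sketch below assumes each percentage is that message's share of all
error messages collected (not of pages); the function name
format_message_stats and its parameters are hypothetical, not Nikita's
actual code.

```python
from collections import Counter

def format_message_stats(messages, top=None):
    """Return 'count (pct%): message' lines, most frequent first.

    messages is an iterable with one entry per reported error; top
    limits the output to the N most common messages (None means all).
    """
    counts = Counter(messages)
    total = sum(counts.values())
    lines = []
    for message, count in counts.most_common(top):
        pct = 100.0 * count / total  # share of all errors, in percent
        lines.append(f"{count:9d} ({pct:.2f}%): {message}")
    return lines
```

Note that the tally is case-sensitive, which is why "alt" and "ALT"
would show up as separate entries.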

 And the following encodings:
     3495: iso-8859-1
     2543: utf-8
       50: windows-1251
       31: iso-8859-15
       10: windows-1252

 From the following encoding sources:
     4585: META HTTP-equiv tag
     1310: HTTP response header
      234: Fallback to default

 And the following doctypes:
     2840: <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
           "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
      830: <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
           "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
      400: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
           "http://www.w3.org/TR/html4/loose.dtd">
      158: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
       84: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
       67: <!DOCTYPE HTML PUBLIC "http://www.w3.org/W3C//DTD HTML 4.01 Transitional//EN">
       38: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
           "http://www.w3c.org/TR/1999/REC-html401-19991224/loose.dtd">
       34: <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.1//EN"
           "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
       31: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
       11: <!DOCTYPE HTML PUBLIC "-//w3c//dtd html 4.0 transitional//en">
        2: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//NO"
           "http://www.w3.org/TR/html4/loose.dtd">
        1: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN">
        1: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
           "http://www.w3.org/TR/html4/strict.dtd">
        0: None
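
For illustration only, doctype counts of this kind could be gathered
with something like the sketch below. The names extract_doctype and
count_doctypes are hypothetical, and the regular expression is a
simplification that assumes the declaration has no internal subset
(which would contain a ">" of its own).

```python
import re
from collections import Counter

# Simplified: matches from "<!DOCTYPE" up to the first ">"
_DOCTYPE_RE = re.compile(r"<!DOCTYPE\b[^>]*>", re.IGNORECASE)

def extract_doctype(html_text):
    """Return the doctype declaration on one line, or "None" if absent."""
    match = _DOCTYPE_RE.search(html_text)
    if match is None:
        return "None"
    # Collapse line breaks and runs of spaces so that declarations
    # differing only in whitespace are counted together.
    return " ".join(match.group(0).split())

def count_doctypes(documents):
    """Tally doctypes over an iterable of HTML strings."""
    return Counter(extract_doctype(doc) for doc in documents)
```

Case is preserved on purpose, so an all-lowercase variant is still
counted separately from the canonical form.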

 And the following media types:
     5121: text/html




-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more

Received on Thursday, 6 March 2008 15:08:45 UTC