- From: Nikita The Spider The Spider <nikitathespider@gmail.com>
- Date: Thu, 6 Mar 2008 10:08:33 -0500
- To: "Brian Wilson" <bloo@blooberry.com>
- Cc: www-validator@w3.org
On Sat, Mar 1, 2008 at 10:35 AM, Nikita The Spider The Spider <nikitathespider@gmail.com> wrote:
>
> On Wed, Feb 27, 2008 at 10:09 PM, Brian Wilson <bloo@blooberry.com> wrote:
> >
> > Nikita The Spider The Spider wrote:
> > > On Feb 6, 2008 12:17 PM, Brian Wilson <bloo@blooberry.com> wrote:
> > >> On Wed, 6 Feb 2008, olivier Thereaux wrote:
> > >>
> > >>> * stats on the documents themselves. Doctype, mime type, charset.
> > >>> Ideally, whether charset is in HTTP, XML decl, meta. There are
> > >>> existing studies about these, but another study made on a different
> > >>> sample would bring more perspective.

Hi Brian et al,

Here's a sample of the data I can pull out of Nikita. These are the aggregate stats for 50 sites that Nikita crawled late last year. I truncated the list of validation messages for brevity since this is just an example. Are these the kind of data in which you're interested?

Nikita deals with a wide variety of sites, and some are much bigger than others. The smallest sites have just a few pages and the largest have tens of thousands of pages. I can already see that this is skewing the stats -- 1755 of the validation errors were for the specific ID attribute "undefined_2" which probably all came from one site. I guess version 2 of this statistics collector program would base its data for each site on a random sample of, say, 50 pages from each site, ignoring all sites with fewer than 50 pages.
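The per-site sampling idea could look roughly like the sketch below. This is only an illustration, not Nikita's actual code: the function name `sample_pages_per_site`, the `pages_by_site` mapping (site name to list of page URLs), and the `seed` parameter are all assumptions made for the example.

```python
import random


def sample_pages_per_site(pages_by_site, sample_size=50, seed=None):
    """Return a dict mapping each site to a random sample of its pages.

    Hypothetical helper: pages_by_site maps a site name to a list of
    page identifiers (e.g. URLs). Sites with fewer than sample_size
    pages are left out entirely, so every remaining site contributes
    the same number of pages to the aggregate statistics.
    """
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    sampled = {}
    for site, pages in pages_by_site.items():
        if len(pages) < sample_size:
            continue  # ignore sites that are too small
        # random.sample draws without replacement
        sampled[site] = rng.sample(pages, sample_size)
    return sampled


if __name__ == "__main__":
    # Made-up crawl data, purely for illustration
    crawl = {
        "tiny.example": ["/page%d" % i for i in range(7)],
        "huge.example": ["/page%d" % i for i in range(20000)],
    }
    for site, pages in sample_pages_per_site(crawl, seed=2008).items():
        print(site, len(pages))
```

With equal-sized samples, one very large site (like the one that presumably produced all the "undefined_2" errors) can no longer dominate the totals, at the cost of dropping the smallest sites from the study.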
In 5120 pages, Nikita found these 2111 errors:
6276 (2.97)%: required attribute "alt" not specified
3750 (1.78)%: required attribute "ALT" not specified
3636 (1.72)%: reference to entity "cat2" for which no system identifier could be generated
2746 (1.30)%: reference to entity "node" for which no system identifier could be generated
1976 (0.94)%: required attribute "TYPE" not specified
1755 (0.83)%: ID "undefined_2" already defined
1427 (0.68)%: document type does not allow element "META" here
1274 (0.60)%: end tag for element "td" which is not open
1256 (0.59)%: end tag for "tr" omitted, but OMITTAG NO was specified
1250 (0.59)%: end tag for element "tr" which is not open
1221 (0.58)%: end tag for "table" omitted, but OMITTAG NO was specified
1205 (0.57)%: an attribute value specification must be an attribute value literal unless SHORTTAG YES is specified
1197 (0.57)%: end tag for "td" omitted, but OMITTAG NO was specified
958 (0.45)%: reference to entity "cat1" for which no system identifier could be generated
935 (0.44)%: reference not terminated by REFC delimiter
925 (0.44)%: reference to external entity in attribute value
-- truncated for brevity --

And the following encodings:
3495: iso-8859-1
2543: utf-8
50: windows-1251
31: iso-8859-15
10: windows-1252

From the following encoding sources:
4585: META HTTP-equiv tag
1310: HTTP response header
234: Fallback to default

And the following doctypes:
2840: <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
830: <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
400: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
158: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
84: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
67: <!DOCTYPE HTML PUBLIC "http://www.w3.org/W3C//DTD HTML 4.01 Transitional//EN">
38: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3c.org/TR/1999/REC-html401-19991224/loose.dtd">
34: <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
31: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
11: <!DOCTYPE HTML PUBLIC "-//w3c//dtd html 4.0 transitional//en">
2: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//NO" "http://www.w3.org/TR/html4/loose.dtd">
1: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN">
1: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
0: None

And the following media types:
5121: text/html

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
Received on Thursday, 6 March 2008 15:08:45 UTC