- From: Nikita The Spider The Spider <nikitathespider@gmail.com>
- Date: Thu, 6 Mar 2008 10:08:33 -0500
- To: "Brian Wilson" <bloo@blooberry.com>
- Cc: www-validator@w3.org
On Sat, Mar 1, 2008 at 10:35 AM, Nikita The Spider The Spider
<nikitathespider@gmail.com> wrote:
>
> On Wed, Feb 27, 2008 at 10:09 PM, Brian Wilson <bloo@blooberry.com> wrote:
> >
> > Nikita The Spider The Spider wrote:
> > > On Feb 6, 2008 12:17 PM, Brian Wilson <bloo@blooberry.com> wrote:
> > >> On Wed, 6 Feb 2008, olivier Thereaux wrote:
> > >>
> > >>> * stats on the documents themselves. Doctype, mime type, charset.
> > >>> Ideally, whether charset is in HTTP, XML decl, meta. There are
> > >>> existing studies about these, but another study made on a different
> > >>> sample would bring more perspective.
Hi Brian et al,
Here's a sample of the data I can pull out of Nikita. These are the
aggregate stats for 50 sites that Nikita crawled late last year. I
truncated the list of validation messages for brevity since this is
just an example. Are these the kind of data in which you're
interested?
Nikita deals with a wide variety of sites, and some are much bigger
than others. The smallest sites have just a few pages and the largest
have tens of thousands of pages. I can already see that this is
skewing the stats -- 1755 of the validation errors were for the
specific ID attribute "undefined_2" which probably all came from one
site. I guess version 2 of this statistics collector program would
base its data for each site on a random sample of, say, 50 pages from
each site, ignoring all sites with fewer than 50 pages.
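The per-site sampling idea for "version 2" could look something like the sketch below. This is only an illustration under assumptions, not Nikita's actual code: the names `sample_sites`, `tally_errors`, `pages_by_site` and `errors_by_page` are hypothetical, and the crawl results are assumed to be available as plain dictionaries.

```python
import random
from collections import Counter

SAMPLE_SIZE = 50  # the "say, 50 pages" figure suggested above


def sample_sites(pages_by_site, sample_size=SAMPLE_SIZE, seed=None):
    """Pick sample_size random pages per site; skip sites that are too small.

    pages_by_site is a hypothetical mapping of site name -> list of page URLs.
    """
    rng = random.Random(seed)
    sampled = {}
    for site, pages in pages_by_site.items():
        if len(pages) < sample_size:
            continue  # ignore sites with fewer than sample_size pages
        sampled[site] = rng.sample(pages, sample_size)
    return sampled


def tally_errors(sampled, errors_by_page):
    """Count validation messages over the sampled pages only.

    errors_by_page is a hypothetical mapping of page URL -> list of messages.
    """
    counts = Counter()
    for pages in sampled.values():
        for page in pages:
            counts.update(errors_by_page.get(page, []))
    return counts
```

With equal-sized samples, every site that qualifies contributes the same number of pages, so one large site repeating a single error can no longer dominate the totals.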
In 5120 pages, Nikita found roughly 211,000 errors. The most common were:
6276 (2.97)%: required attribute "alt" not specified
3750 (1.78)%: required attribute "ALT" not specified
3636 (1.72)%: reference to entity "cat2" for which no system
identifier could be generated
2746 (1.30)%: reference to entity "node" for which no system
identifier could be generated
1976 (0.94)%: required attribute "TYPE" not specified
1755 (0.83)%: ID "undefined_2" already defined
1427 (0.68)%: document type does not allow element "META" here
1274 (0.60)%: end tag for element "td" which is not open
1256 (0.59)%: end tag for "tr" omitted, but OMITTAG NO was specified
1250 (0.59)%: end tag for element "tr" which is not open
1221 (0.58)%: end tag for "table" omitted, but OMITTAG NO was specified
1205 (0.57)%: an attribute value specification must be an
attribute value literal unless SHORTTAG YES is specified
1197 (0.57)%: end tag for "td" omitted, but OMITTAG NO was specified
958 (0.45)%: reference to entity "cat1" for which no system
identifier could be generated
935 (0.44)%: reference not terminated by REFC delimiter
925 (0.44)%: reference to external entity in attribute value
-- truncated for brevity --
And the following encodings:
3495: iso-8859-1
2543: utf-8
50: windows-1251
31: iso-8859-15
10: windows-1252
From the following encoding sources:
4585: META HTTP-equiv tag
1310: HTTP response header
234: Fallback to default
And the following doctypes:
2840: <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
830: <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0
Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
400: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01
Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
158: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
84: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
67: <!DOCTYPE HTML PUBLIC "http://www.w3.org/W3C//DTD HTML 4.01
Transitional//EN">
38: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01
Transitional//EN"
"http://www.w3c.org/TR/1999/REC-html401-19991224/loose.dtd">
34: <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
31: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
11: <!DOCTYPE HTML PUBLIC "-//w3c//dtd html 4.0 transitional//en">
2: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01
Transitional//NO" "http://www.w3.org/TR/html4/loose.dtd">
1: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN">
1: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
0: None
And the following media types:
5121: text/html
--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
Received on Thursday, 6 March 2008 15:08:45 UTC