- From: Henry S. Thompson <ht@inf.ed.ac.uk>
- Date: Mon, 17 Sep 2012 22:43:45 +0100
- To: liam@w3.org
- Cc: public-xml-core-wg <public-xml-core-wg@w3.org>
Liam R E Quin writes:
> On Fri, 2012-09-14 at 10:15 +0100, Henry S. Thompson wrote:
>> Does anyone have, or have a pointer to, a database of URIs of XML
>> documents visible on the Web? I don't have the time right now to
>> find, install and configure a crawler, but if someone else has already
>> done so. . .
>
> The University of Amsterdam XML on the Web Corpus is available -- I
> mentioned it in my Balisage talk at
> http://www.w3.org/2012/Talks/08-quin-xml-web-corpus/
Thanks!
Some preliminary results:
I looked in some detail at the first ~100MB == 3389 files.
392 of these files had one or more xml-stylesheet PIs (377 had 1, 15
had 2).
Of the 407 PIs:
321 type= text/xsl
82 type= text/css
2 type= text/xml href= ...xsl
1 type= text/html href= ...xsl
1 [no type=] href= ...css
All 407 PIs had href= ...
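
For anyone wanting to reproduce a tally like the one above, here is a
minimal sketch in Python. It is not the script used to produce these
numbers. It assumes each file has already been decoded to a string, and
it uses regular expressions rather than a parser so that files which are
not well-formed can still be counted. The names stylesheet_pis and
tally_types are made up for illustration. Note that a regex will also
match PIs inside comments or CDATA sections, which a real parser would
not report.

```python
import re
from collections import Counter

# Matches an xml-stylesheet processing instruction and captures its data.
PI_RE = re.compile(r"<\?xml-stylesheet\s+(.*?)\?>", re.DOTALL)
# Matches pseudo-attributes such as type="text/xsl" or href='style.css'.
ATTR_RE = re.compile(r"""([A-Za-z_][\w.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def stylesheet_pis(text):
    """Return one dict of pseudo-attributes per xml-stylesheet PI in text."""
    pis = []
    for match in PI_RE.finditer(text):
        attrs = {}
        for name, double_quoted, single_quoted in ATTR_RE.findall(match.group(1)):
            attrs[name] = double_quoted or single_quoted
        pis.append(attrs)
    return pis


def tally_types(documents):
    """Count PIs by their type= value across an iterable of document strings."""
    counts = Counter()
    for text in documents:
        for attrs in stylesheet_pis(text):
            counts[attrs.get("type", "[no type=]")] += 1
    return counts
```

Calling tally_types on the decoded contents of the files gives a Counter
keyed by type= value, with PIs lacking the pseudo-attribute grouped under
"[no type=]".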
Of the 392 files with PIs, 40 were not well-formed (that is, 10.2%),
with the following problems as reported by rxp [1]:
Error: Document ends too soon
Error: EOE in PI [3 of these]
Error: Expected ; after entity name, but got = [4 of these]
Error: Expected > at end of entity declaration, but got -
Error: Expected name, but got & for entity
Error: Expected whitespace or tag end in start tag
Error: Input error: Illegal UTF-8 byte 2 <0x20>
Error: Input error: Illegal UTF-8 byte 2 <0x20>
Error: Input error: Illegal UTF-8 byte 2 <0x2e>
Error: Input error: Illegal UTF-8 byte 2 <0x65>
Error: Input error: Illegal UTF-8 start byte <0xa0>
Error: Input error: Illegal character <0x0> [11 of these]
Error: Mismatched end tag: expected </abbr>, got </a>
Error: Unknown declared encoding GB2312
Error: Unknown declared encoding ISO8859-1
Error: Unknown declared encoding TIS-620
Error: Unknown declared encoding gb2312
Error: Unknown declared encoding uft-8 [2 of these]
Error: Unknown declared encoding windows-1251 [2 of these]
Error: Unknown declared encoding windows-1252 [2 of these]
Error: Unknown declared encoding x-user-defined
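
The well-formedness check above was done with rxp. As a rough stand-in
for readers without rxp, the sketch below uses Python's bundled expat
binding instead. This is a swapped-in parser, not the tool used here:
its error messages are worded differently, and the set of encodings it
accepts is not the same as rxp's, so the counts it yields on the same
files may differ. The function name well_formedness_error is
hypothetical.

```python
import xml.parsers.expat


def well_formedness_error(data):
    """Return None if data (bytes) is well-formed XML, else expat's message."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as err:
        return str(err)
    return None
```

Passing the raw bytes of each file, rather than a decoded string, lets
the parser apply the encoding declaration itself.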
The document elements of the 40 were as follows:
20 feed
6 rss
5 html
1 urlset
1 ttarmb
1 response
1 encart
1 doc
1 article
1 Materia
1 LISTE
1 Auftragsbuch
No surprises in the first two lines :-).
Of these 40, the 'feed's, the Materia and the Auftragsbuch had only
css stylesheets, the rest (rss, html and the other singletons) had only xsl.
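
Document element names can be recovered even from many files that are
not well-formed, as long as the error occurs after the first start tag.
The following is one possible sketch, again using Python's expat binding
as an assumption rather than the method used for the table above: it
stops parsing as soon as the first start tag is reported. The names
document_element and _FoundRoot are invented for this example.

```python
import xml.parsers.expat


class _FoundRoot(Exception):
    """Raised from the handler to stop parsing at the document element."""


def document_element(data):
    """Return the document element's name as written, or None if not reached."""
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        # The first start tag is the document element; abort the parse here.
        raise _FoundRoot(name)

    parser.StartElementHandler = on_start
    try:
        parser.Parse(data, True)
    except _FoundRoot as found:
        return found.args[0]
    except xml.parsers.expat.ExpatError:
        # The input broke before any start tag was seen.
        return None
    return None
```

Because no namespace separator is passed to ParserCreate, prefixed names
such as rdf:RDF come back exactly as they appear in the file.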
Looking again at all 392 files (whether or not they are well-formed),
for the 320 which reference an xsl stylesheet (or, in one case, 2) we
find the following document elements:
116 urlset
69 rss
15 html
13 feed
6 ead
3 timetable
3 rdf:RDF
2 sitemapindex
2 page_xml
2 page
2 j:jelly
2 ficha
2 document
2 doc
2 dataroot
2 article
2 TIPOPUBLICACION
1 each [a further 75 distinct tags]
whereas for the 82 which reference a css stylesheet we find
48 feed
24 rss
2 rdf:RDF
2 html
2 document
1 history
1 Materia
1 Auftragsbuch
1 ASSOCIAZIONI
The subset of the above in which _both_ xsl and css are referenced has
11 rss
2 feed
So, net-net: xml-stylesheet is used by approximately 11.6% of the
first 3389 files in the Amsterdam XML Corpus [2] (and since Liam has
shown that not all of those are actually XML, that gives us a _lower_
bound estimate), with the distribution being 305:69:13:5 == 78%:18%:3%:1%
for xsl-only:css-only:both:other.
This is, it seems to me, relevant input to the HTML5 editors. . .
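
The percentages quoted in the net-net summary can be re-derived from the
counts given in this message. The short Python check below uses only
those counts; the variable names are illustrative.

```python
total_files = 3389      # files examined
files_with_pi = 392     # files with at least one xml-stylesheet PI
not_well_formed = 40    # of those, files rxp rejected
split = {"xsl-only": 305, "css-only": 69, "both": 13, "other": 5}

# The four categories should account for every file with a PI.
assert sum(split.values()) == files_with_pi

usage_pct = round(100 * files_with_pi / total_files, 1)
broken_pct = round(100 * not_well_formed / files_with_pi, 1)
split_pct = {kind: round(100 * n / files_with_pi) for kind, n in split.items()}

print(usage_pct, broken_pct, split_pct)
```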
On another front, 20 of the 3389 began with a byte-order mark. 19 of
the 20 were well-formed.
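
A byte-order mark can be detected by looking at the first few bytes of
each file. The message above does not say which encodings the 20 BOMs
belonged to, so the sketch below simply checks for the common Unicode
BOMs using the constants in Python's codecs module. The function name
detect_bom is hypothetical.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def detect_bom(data):
    """Return the encoding implied by a leading BOM in data (bytes), or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```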
ht
[1] http://www.ltg.ed.ac.uk/~richard/rxp.html
[2] http://data.politicalmashup.nl/xmlweb/
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]
Received on Monday, 17 September 2012 21:44:24 UTC