- From: Henry S. Thompson <ht@inf.ed.ac.uk>
- Date: Mon, 17 Sep 2012 22:43:45 +0100
- To: liam@w3.org
- Cc: public-xml-core-wg <public-xml-core-wg@w3.org>
Liam R E Quin writes:

> On Fri, 2012-09-14 at 10:15 +0100, Henry S. Thompson wrote:
>> Does anyone have, or have a pointer to, a database of URIs of XML
>> documents visible on the Web?  I don't have the time right now to
>> find, install and configure a crawler, but if someone else has already
>> done so. . .
>
> The University of Amsterdam XML on the Web Corpus is available -- I
> mentioned it in my Balisage talk at
> http://www.w3.org/2012/Talks/08-quin-xml-web-corpus/

Thanks!

Some preliminary results:

I looked in some detail at the first ~100MB == 3389 files.

392 of these files had one or more xml-stylesheet PIs (377 had 1, 15
had 2).

Of the 407 PIs:

  321 type= text/xsl
   82 type= text/css
    2 type= text/xml   href= ...xsl
    1 type= text/html  href= ...xsl
    1 [no type=]       href= ...css

All 407 PIs had href= ...

Of the 392 files with PIs, 40 were not well-formed (that is, 10.2%),
with the following problems as reported by rxp [1]:

  Error: Document ends too soon
  Error: EOE in PI [3 of these]
  Error: Expected ; after entity name, but got = [4 of these]
  Error: Expected > at end of entity declaration, but got -
  Error: Expected name, but got & for entity
  Error: Expected whitespace or tag end in start tag
  Error: Input error: Illegal UTF-8 byte 2 <0x20>
  Error: Input error: Illegal UTF-8 byte 2 <0x20>
  Error: Input error: Illegal UTF-8 byte 2 <0x2e>
  Error: Input error: Illegal UTF-8 byte 2 <0x65>
  Error: Input error: Illegal UTF-8 start byte <0xa0>
  Error: Input error: Illegal character <0x0> [11 of these]
  Error: Mismatched end tag: expected </abbr>, got </a>
  Error: Unknown declared encoding GB2312
  Error: Unknown declared encoding ISO8859-1
  Error: Unknown declared encoding TIS-620
  Error: Unknown declared encoding gb2312
  Error: Unknown declared encoding uft-8 [2 of these]
  Error: Unknown declared encoding windows-1251 [2 of these]
  Error: Unknown declared encoding windows-1252 [2 of these]
  Error: Unknown declared encoding x-user-defined

The document elements of the 40 were as follows:

   20 feed
    6 rss
    5 html
    1 urlset
    1 ttarmb
    1 response
    1 encart
    1 doc
    1 article
    1 Materia
    1 LISTE
    1 Auftragsbuch

No surprises in the first two lines :-).

Of these 40, the 'feed's, the Materia and the Auftragsbuch had only
css stylesheets, the rest (rss, html and the other singletons) had
only xsl.

Looking again at all 392 files (whether or not they are well-formed),
for the 320 which reference an xsl stylesheet (or, in one case, 2) we
find the following document elements:

  116 urlset
   69 rss
   15 html
   13 feed
    6 ead
    3 timetable
    3 rdf:RDF
    2 sitemapindex
    2 page_xml
    2 page
    2 j:jelly
    2 ficha
    2 document
    2 doc
    2 dataroot
    2 article
    2 TIPOPUBLICACION
    1 [a further 75 distinct tags]

whereas for the 82 which reference a css stylesheet we find

   48 feed
   24 rss
    2 rdf:RDF
    2 html
    2 document
    1 history
    1 Materia
    1 Auftragsbuch
    1 ASSOCIAZIONI

The subset of the above which occur when _both_ xsl and css are
referenced have

   11 rss
    2 feed

So, net-net: xml-stylesheet is used by approximately 11.6% of the
first 3389 files in the Amsterdam XML Corpus [2] (and since Liam has
shown that not all of those are actually XML, that gives us a _lower_
bound estimate), with the distribution being

  305:69:13:5 == 78%:18%:3%:1%

for

  xsl-only:css-only:both:other

This is, it seems to me, relevant input to the HTML5 editors. . .

On another front, 20 of the 3389 began with a byte-order mark.  19 of
the 20 were well-formed.

ht

[1] http://www.ltg.ed.ac.uk/~richard/rxp.html
[2] http://data.politicalmashup.nl/xmlweb/
-- 
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]
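
For anyone wanting to repeat the PI tally on another slice of the corpus, here is a minimal sketch in Python. It is not the procedure used for the numbers above (the message does not say how the PIs were extracted); it assumes each file has already been read into a string, and the names stylesheet_pis and tally_types are hypothetical. It matches with regular expressions rather than an XML parser so that files which are not well-formed can still be counted, at the cost of also matching PI-like text inside comments or CDATA sections.

```python
import re
from collections import Counter

# Matches an xml-stylesheet processing instruction and captures its data.
PI_RE = re.compile(r"<\?xml-stylesheet\s+(.*?)\?>", re.DOTALL)
# Matches pseudo-attributes such as type="text/xsl" or href='a.css'.
ATTR_RE = re.compile(r"""([A-Za-z][\w.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def stylesheet_pis(text):
    """Return one dict of pseudo-attributes per xml-stylesheet PI in text."""
    pis = []
    for match in PI_RE.finditer(text):
        attrs = {}
        for name, double_quoted, single_quoted in ATTR_RE.findall(match.group(1)):
            # findall yields '' for the alternative that did not match.
            attrs[name] = double_quoted or single_quoted
        pis.append(attrs)
    return pis


def tally_types(documents):
    """Count PIs by their type pseudo-attribute across an iterable of strings."""
    counts = Counter()
    for text in documents:
        for attrs in stylesheet_pis(text):
            counts[attrs.get("type", "[no type=]")] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample inputs, not files from the corpus.
    samples = [
        '<?xml version="1.0"?>\n<?xml-stylesheet type="text/xsl" href="a.xsl"?>\n<rss/>',
        "<?xml-stylesheet href='b.css'?><feed>",
    ]
    for pi_type, n in tally_types(samples).most_common():
        print(n, pi_type)
```

Keying the tally on the type pseudo-attribute alone mirrors the table of 407 PIs above; splitting by the href suffix as well would need one more field in the counter key.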
Received on Monday, 17 September 2012 21:44:24 UTC