Re: XML on the web

Liam R E Quin writes:

> On Fri, 2012-09-14 at 10:15 +0100, Henry S. Thompson wrote:
>> Does anyone have, or have a pointer to, a database of URIs of XML
>> documents visible on the Web?  I don't have the time right now to
>> find, install and configure a crawler, but if someone else has already
>> done so. . .
>
> The University of Amsterdam XML on the Web Corpus is available -- I
> mentioned it in my Balisage talk at
> http://www.w3.org/2012/Talks/08-quin-xml-web-corpus/

Thanks!

Some preliminary results:

I looked in some detail at the first ~100MB == 3389 files.

392 of these files had one or more xml-stylesheet PIs (377 had 1, 15
had 2).

Of the 407 PIs:

 321 type= text/xsl
  82 type= text/css
   2 type= text/xml  href= ...xsl
   1 type= text/html href= ...xsl
   1 [no type=]      href= ...css

All 407 PIs had href= ...
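
A tally like the one above could be reproduced roughly as follows.  This
is only a sketch, not the script actually used: the names stylesheet_pis
and tally_types are hypothetical, and it uses regular expressions rather
than an XML parser so that files which are not well-formed still get
counted.  It does not check that a PI actually sits in the prolog.

```python
import re
from collections import Counter

# Matches an xml-stylesheet processing instruction and captures its data.
# Assumption: the PI is matched anywhere in the text, not only in the prolog.
PI_RE = re.compile(r"<\?xml-stylesheet\s+(.*?)\?>", re.DOTALL)

# Matches pseudo-attributes such as type="text/xsl" or href='a.css'.
ATTR_RE = re.compile(r"""([A-Za-z_][\w.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def stylesheet_pis(text):
    """Return one dict of pseudo-attributes per xml-stylesheet PI in text."""
    pis = []
    for match in PI_RE.finditer(text):
        attrs = {}
        for name, double_quoted, single_quoted in ATTR_RE.findall(match.group(1)):
            attrs[name] = double_quoted or single_quoted
        pis.append(attrs)
    return pis


def tally_types(texts):
    """Count the type= values over all xml-stylesheet PIs in the given texts."""
    counts = Counter()
    for text in texts:
        for attrs in stylesheet_pis(text):
            counts[attrs.get("type", "[no type=]")] += 1
    return counts
```

Calling tally_types on the decoded contents of each file yields a
Counter keyed by the type= value, with "[no type=]" for PIs lacking one.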

Of the 392 files with PIs, 40 were not well-formed (that is, 10.2%),
with the following problems as reported by rxp [1]:

   Error: Document ends too soon
   Error: EOE in PI [3 of these]
   Error: Expected ; after entity name, but got = [4 of these]
   Error: Expected > at end of entity declaration, but got -
   Error: Expected name, but got & for entity
   Error: Expected whitespace or tag end in start tag
   Error: Input error: Illegal UTF-8 byte 2 <0x20>
   Error: Input error: Illegal UTF-8 byte 2 <0x20>
   Error: Input error: Illegal UTF-8 byte 2 <0x2e>
   Error: Input error: Illegal UTF-8 byte 2 <0x65>
   Error: Input error: Illegal UTF-8 start byte <0xa0>
   Error: Input error: Illegal character <0x0> [11 of these]
   Error: Mismatched end tag: expected </abbr>, got </a>
   Error: Unknown declared encoding GB2312
   Error: Unknown declared encoding ISO8859-1
   Error: Unknown declared encoding TIS-620
   Error: Unknown declared encoding gb2312
   Error: Unknown declared encoding uft-8 [2 of these]
   Error: Unknown declared encoding windows-1251 [2 of these]
   Error: Unknown declared encoding windows-1252 [2 of these]
   Error: Unknown declared encoding x-user-defined
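
The messages above come from rxp [1].  For anyone without rxp to hand, a
comparable check can be sketched with the expat bindings bundled with
Python.  This swaps in a different parser, so the messages will not match
rxp's, and some declared encodings may be accepted or rejected
differently.  The function name well_formedness_error is hypothetical.

```python
import xml.parsers.expat


def well_formedness_error(data):
    """Return None if data (bytes) is well-formed XML, else an error message."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final (only) chunk of input.
        parser.Parse(data, True)
    except (xml.parsers.expat.ExpatError, LookupError, ValueError) as err:
        # ExpatError covers syntax problems; LookupError and ValueError are
        # caught as a precaution for encodings Python cannot handle here.
        return str(err)
    return None
```

Running this over the raw bytes of each file and counting the non-None
results gives a figure comparable to the 40 of 392 reported above.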

The document elements of the 40 were as follows:

     20 feed
      6 rss
      5 html
      1 urlset
      1 ttarmb
      1 response
      1 encart
      1 doc
      1 article
      1 Materia
      1 LISTE
      1 Auftragsbuch

No surprises in the first two lines :-).  

Of these 40, the 'feed's, the Materia and the Auftragsbuch had only
css stylesheets; the rest (rss, html and the other singletons) had only xsl.

Looking again at all 392 files (whether or not they are well-formed),
for the 320 which reference an xsl stylesheet (or, in one case, 2) we
find the following document elements:

    116 urlset
     69 rss
     15 html
     13 feed
      6 ead
      3 timetable
      3 rdf:RDF
      2 sitemapindex
      2 page_xml
      2 page
      2 j:jelly
      2 ficha
      2 document
      2 doc
      2 dataroot
      2 article
      2 TIPOPUBLICACION
      1 [a further 75 distinct tags]

whereas for the 82 which reference a css stylesheet we find

     48 feed
     24 rss
      2 rdf:RDF
      2 html
      2 document
      1 history
      1 Materia
      1 Auftragsbuch
      1 ASSOCIAZIONI

The files from the above which reference _both_ xsl and css have

     11 rss
      2 feed

So, net-net: xml-stylesheet is used by approximately 11.6% of the
first 3389 files in the Amsterdam XML Corpus [2] (and since Liam has
shown that not all of those are actually XML, that gives us a _lower_
bound estimate), with the distribution being 305:69:13:5 == 78%:18%:3%:1%
for xsl-only:css-only:both:other.
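
For anyone who wants to check the arithmetic, the percentages follow
directly from the counts given above.  The variable names are just for
illustration.

```python
# Counts taken from the figures reported above.
counts = {"xsl-only": 305, "css-only": 69, "both": 13, "other": 5}
files_examined = 3389

files_with_pi = sum(counts.values())            # 392
usage_pct = 100 * files_with_pi / files_examined

# Share of each category among the files that have an xml-stylesheet PI,
# rounded to the nearest whole percent.
shares = {name: round(100 * n / files_with_pi) for name, n in counts.items()}

print(f"{usage_pct:.1f}% of files use xml-stylesheet")
print(shares)
```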

This is, it seems to me, relevant input to the HTML5 editors. . .

On another front, 20 of the 3389 began with a byte-order mark.  19 of
the 20 were well-formed.
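
Spotting a byte-order mark only needs the first few bytes of each file.
A minimal sketch, assuming any of the usual Unicode BOMs is of interest
(the figures above do not say which BOMs were found); detect_bom is a
hypothetical name:

```python
import codecs

# UTF-32 is listed before UTF-16 because the UTF-32LE BOM begins with the
# same two bytes as the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def detect_bom(data):
    """Return the encoding implied by a leading BOM in data (bytes), or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```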

ht

[1] http://www.ltg.ed.ac.uk/~richard/rxp.html
[2] http://data.politicalmashup.nl/xmlweb/
-- 
       Henry S. Thompson, School of Informatics, University of Edinburgh
      10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
                Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
                       URL: http://www.ltg.ed.ac.uk/~ht/
 [mail from me _always_ has a .sig like this -- mail without it is forged spam]

Received on Monday, 17 September 2012 21:44:24 UTC