- From: Henry S. Thompson <ht@inf.ed.ac.uk>
- Date: Thu, 14 Feb 2008 17:28:33 +0000
- To: public-xml-core-wg <public-xml-core-wg@w3.org>
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
So I did a little experiment: using a moderately random collection of
http: URIs created elsewhere for a web language corpus, I looked at
10404 web pages.  Of these, 1520 pages, from 758 distinct hosts,
contained references (22408 in total) to URIs which included #
followed by a digit.  Of those references, 2939 were local, i.e. of the
form "#[0-9]...".  Of the 13024 _unique_ fragments, 11453 were actually
integers, i.e. of the form #[0-9]+, and a further 66 were decimals,
i.e. of the form #[0-9]+\.[0-9]*
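
For concreteness, the counting described above could be sketched roughly
as follows.  This is not the script that produced the numbers; the names
classify_fragment and summarize are made up for illustration, and it
assumes that "unique" means unique fragment strings regardless of the
URI they were attached to, which the description does not pin down.

```python
import re
from collections import Counter
from urllib.parse import urldefrag

# Patterns taken from the prose: integers and decimals after the '#'.
INTEGER = re.compile(r"[0-9]+")
DECIMAL = re.compile(r"[0-9]+\.[0-9]*")
DIGIT_INITIAL = re.compile(r"[0-9]")


def classify_fragment(fragment):
    """Classify a fragment identifier (the text after '#')."""
    if INTEGER.fullmatch(fragment):
        return "integer"
    if DECIMAL.fullmatch(fragment):
        return "decimal"
    if DIGIT_INITIAL.match(fragment):
        return "other-digit-initial"
    return "not-digit-initial"


def summarize(references):
    """Count digit-initial fragment references in a list of URI references."""
    digit_refs = []
    for ref in references:
        base, fragment = urldefrag(ref)
        if DIGIT_INITIAL.match(fragment):
            digit_refs.append((base, fragment))
    # A reference is "local" when nothing precedes the '#'.
    local = sum(1 for base, _ in digit_refs if base == "")
    # Assumption: uniqueness is taken over the fragment string alone.
    unique = {fragment for _, fragment in digit_refs}
    kinds = Counter(classify_fragment(f) for f in unique)
    return {
        "references": len(digit_refs),
        "local": local,
        "unique": len(unique),
        "kinds": dict(kinds),
    }
```

Feeding summarize() the href values harvested from each page would give
the per-category totals; per-page and per-host counts would need the
page URI carried alongside each reference.
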

I then refetched the same pages (139 didn't make it the second time,
so the total was down to 10265), and found 992 pages, from 636
distinct hosts, which contained anchors (<a href= or <a id=) whose
values began with a digit.
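
A rough sketch of that second pass, using Python's standard html.parser,
might look like the following.  Again this is only an illustration, not
the code actually used: DigitAnchorFinder and find_digit_anchors are
hypothetical names, the attribute list defaults to the two attributes
named above, and the test is simply "the attribute value starts with an
ASCII digit", which is an assumption about how the check was made.

```python
import re
from html.parser import HTMLParser

DIGIT_INITIAL = re.compile(r"[0-9]")


class DigitAnchorFinder(HTMLParser):
    """Collect <a> attribute values that begin with a digit."""

    def __init__(self, attributes):
        super().__init__()
        self.attributes = tuple(attributes)
        self.hits = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value.
            if name in self.attributes and value and DIGIT_INITIAL.match(value):
                self.hits.append((name, value))


def find_digit_anchors(html, attributes=("href", "id")):
    """Return (attribute, value) pairs on <a> whose value starts with a digit."""
    finder = DigitAnchorFinder(attributes)
    finder.feed(html)
    finder.close()
    return finder.hits
```

A page would count towards the total if find_digit_anchors() returned a
non-empty list for it; passing attributes=("name", "id") would restrict
the check to attributes that define an anchor rather than refer to one.
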

So, there are a _lot_ of ostensibly broken fragments and anchors out
there.

No, I did not check what percentage of the data was XML; I'll do that.

ht
- -- 
 Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
                     Half-time member of W3C Team
    2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
            Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
                   URL: http://www.ltg.ed.ac.uk/~ht/
[mail really from me _always_ has this .sig -- mail without it is forged spam]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)
iD8DBQFHtHpBkjnJixAXWBoRAtAsAJ4tq3a5og9CS5zrOdOdEkiGsKGPWwCfYt8Z
vZ++bT7RBvZw1KTIuEY/6Rk=
=cbLR
-----END PGP SIGNATURE-----
Received on Thursday, 14 February 2008 17:28:44 UTC