- From: Philip Taylor <pjt47@cam.ac.uk>
- Date: Mon, 31 Dec 2007 14:12:23 +0000
- To: Karl Dubost <karl@w3.org>
- CC: HTMLWG List <public-html@w3.org>
Karl Dubost wrote:
> 1. Would it be possible to extract the metaname GENERATOR of these pages
> containing the u. I have seen for example:
>
> <meta name="GENERATOR" content="Microsoft FrontPage 6.0">
> <meta name="GENERATOR" content="Microsoft FrontPage 4.0">
> <meta name="GENERATOR" content="Mozilla/4.61 [en] (Win98; I)
> [Netscape]">
> <meta name="generator" content="WordPress 2.2.1" />

Using a slightly different collection of pages (still from dmoz.org, but
twice as many and downloaded a few weeks ago), with some analysis to
(hopefully) avoid misleading results, I get

http://philip.html5.org/data/underline-generators.txt

Some possible conclusions:

* 15K pages isn't enough to get particularly accurate results here - the
  statistical noise hides most of the differences that might exist. If
  anyone cares much, I could quite easily repeat this with maybe four
  times as many pages (so the results should be twice as accurate).

* Pages [that are listed on dmoz.org] with some generator are more than
  twice as likely to contain <u>, compared to pages without a generator.

* Pages from Yahoo PageBuilder are much more likely to contain <u> than
  pages from any other generator.

* Pages from Microsoft FrontPage are more likely to contain <u> than
  typical pages with generators.

* Pages from WordPress and TYPO3 are less likely to contain <u> than
  typical pages with generators.

* Pages from Adobe GoLive are less likely to contain <u> than pages with
  no generator at all.

and that's about it.

> 2. What are the contextual constructs of u: nested and nesting? For
> example I have seen a page in your sample where it is used for spam but
> not only for spam.
>
> http://www.wordtree.com/
> <p> This ends the coupon part, but there are some other
> things to think about. To return to the beginning of this Wordtree
> website, click <a href="#top">here</a>
> <u style="display: none">
> <a href="http://www.i-mortgage-rates.com/">mortgage
> calculator mortgage rates</a>
> </u>:
> </p>
>
> The contextual construct for this one would be p/u/a

I'm not quite sure what you're asking for here... The parent element and
not any further ancestors (i.e. "p/u..." rather than
"html/body/table/tbody/tr/td/div/center/font/p/u...")? The child
element(s) and not any further descendants? What about (non-whitespace)
text-node children? What about <u> elements with multiple children?

I don't see how this data would be used, so I'm not sure exactly what
would be useful to collect or how to summarise it and present it. But I
think it should be fairly quick if I knew what was actually wanted :-)

-- 
Philip Taylor
pjt47@cam.ac.uk
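
For illustration only, here is one possible reading of the "p/u/a" construct discussed above, sketched in Python. This is not the tool used to produce the data in this message. It makes several assumptions that the message deliberately leaves open: only the direct parent and the direct child elements of each <u> are recorded, any non-whitespace text-node child is reported as "#text", and multiple children are joined with "+". The names UContextParser, summarise and u_contexts are hypothetical, and the standard-library html.parser used here does not apply the error recovery a browser would.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not stay on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}


class UContextParser(HTMLParser):
    """Record the parent and direct children of every <u> element."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open elements as [tag, record-or-None]
        self.records = []  # one dict per <u> seen, in document order

    def handle_starttag(self, tag, attrs):
        # If the innermost open element is a <u>, this tag is its child.
        if self.stack and self.stack[-1][1] is not None:
            self.stack[-1][1]["children"].append(tag)
        record = None
        if tag == "u":
            parent = self.stack[-1][0] if self.stack else None
            record = {"parent": parent, "children": [], "has_text": False}
            self.records.append(record)
        if tag not in VOID:
            self.stack.append([tag, record])

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element;
        # stray end tags with no match are ignored.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Assumption: only non-whitespace text directly inside <u> counts.
        if data.strip() and self.stack and self.stack[-1][1] is not None:
            self.stack[-1][1]["has_text"] = True


def summarise(record):
    """Format one record as parent/u/children, e.g. 'p/u/a'."""
    kids = (["#text"] if record["has_text"] else []) + record["children"]
    path = [record["parent"] or "#root", "u"]
    if kids:
        path.append("+".join(kids))
    return "/".join(path)


def u_contexts(html):
    """Return one summary string per <u> element in the given markup."""
    parser = UContextParser()
    parser.feed(html)
    parser.close()
    return [summarise(r) for r in parser.records]
```

Under these assumptions the wordtree.com fragment quoted above comes out as "p/u/a", since the only non-whitespace content of the <u> is a single <a>. Feeding the results for a whole page collection into collections.Counter would be one way to summarise them, but whether that is the summary actually wanted is exactly the open question in the message.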
Received on Monday, 31 December 2007 14:12:48 UTC