Re: Underline element.

Karl Dubost wrote:
> 1. Would it be possible to extract the metaname GENERATOR of these pages 
> containing the u. I have seen for example:
> 
>    <meta name="GENERATOR" content="Microsoft FrontPage 6.0">
>    <meta name="GENERATOR" content="Microsoft FrontPage 4.0">
>    <meta name="GENERATOR" content="Mozilla/4.61 [en] (Win98; I) [Netscape]">
>    <meta name="generator" content="WordPress 2.2.1" />

Using a slightly different collection of pages (still from dmoz.org, but 
twice as many, and downloaded a few weeks ago), and with some analysis to 
(hopefully) avoid misleading results, I get the data at 
http://philip.html5.org/data/underline-generators.txt
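
For illustration only (this is not the script behind the data linked 
above), here is a minimal sketch of how a page's generator values and 
the presence of <u> could be collected, assuming Python's standard 
html.parser is tolerant enough for the pages in question. The names 
GeneratorAndUnderlineScanner and scan_page are made up for this example.

```python
from html.parser import HTMLParser


class GeneratorAndUnderlineScanner(HTMLParser):
    """Collects <meta name="generator"> values and notes any <u> element."""

    def __init__(self):
        super().__init__()
        self.generators = []  # content values, in document order
        self.has_u = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values,
        # so the name value is lowercased here to match "GENERATOR" too.
        attr_map = dict(attrs)
        if tag == "u":
            self.has_u = True
        elif tag == "meta" and (attr_map.get("name") or "").lower() == "generator":
            self.generators.append(attr_map.get("content") or "")


def scan_page(html_text):
    """Hypothetical helper: returns (generator values, page contains <u>)."""
    scanner = GeneratorAndUnderlineScanner()
    scanner.feed(html_text)
    scanner.close()
    return scanner.generators, scanner.has_u
```

Grouping pages by the returned generator string and counting how many 
in each group have has_u set gives the kind of per-generator comparison 
discussed here. A real run would also need to normalise version 
numbers and cope with pages that declare several generators.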

Some possible conclusions:

* 15K pages isn't enough to get particularly accurate results here: the 
statistical noise hides most of the differences that might exist. If 
anyone cares much, I could quite easily repeat this with maybe four 
times as many pages (sampling error shrinks roughly with the square 
root of the sample size, so the results should be about twice as 
accurate).

* Pages [that are listed on dmoz.org] with some generator are more than 
twice as likely to contain <u>, compared to pages without a generator.

* Pages from Yahoo PageBuilder are much more likely to contain <u> than 
pages from any other generator.

* Pages from Microsoft FrontPage are more likely to contain <u> than 
typical pages with generators.

* Pages from WordPress and TYPO3 are less likely to contain <u> than 
typical pages with generators.

* Pages from Adobe GoLive are less likely to contain <u> than pages with 
no generator at all.

and that's about it.
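
As a rough check on the "four times as many pages, twice as accurate" 
estimate in the first point, here is a small sketch using the usual 
standard error of a sample proportion, sqrt(p * (1 - p) / n). The 
proportion p = 0.1 is a made-up placeholder, not a figure from the 
data, and the page counts are only approximate.

```python
import math


def proportion_standard_error(p, n):
    """Standard error of a proportion p estimated from n independent pages."""
    return math.sqrt(p * (1.0 - p) / n)


# Placeholder proportion of pages containing <u>; NOT taken from the data.
assumed_p = 0.1

se_small = proportion_standard_error(assumed_p, 15_000)
se_large = proportion_standard_error(assumed_p, 60_000)

print(f"standard error with 15K pages: {se_small:.5f}")
print(f"standard error with 60K pages: {se_large:.5f}")
print(f"ratio: {se_small / se_large:.2f}")
```

The ratio does not depend on the assumed proportion, only on the 
factor by which the sample grows. The per-generator subsets are much 
smaller than the whole collection, so their error bars are 
correspondingly wider.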


> 2. What are the contextual constructs of u: nested and nesting? For 
> example I have seen a page in your sample where it is used for spam but 
> not only for spam.
> 
>    http://www.wordtree.com/
>    <p>&nbsp; This ends the coupon part, but there are some other
>       things to think about. To return to the beginning of this Wordtree
>       website, click <a href="#top">here</a>
>       <u style="display: none">
>          <a href="http://www.i-mortgage-rates.com/">mortgage
>          calculator mortgage rates</a>
>       </u>:
>    </p>
> 
>   The contextual construct for this one would be p/u/a

I'm not quite sure what you're asking for here. Do you want just the 
parent element and not any further ancestors (i.e. "p/u..." rather than 
"html/body/table/tbody/tr/td/div/center/font/p/u...")? Just the child 
element(s) and not any further descendants? Should (non-whitespace) 
text-node children be counted? And how should <u> elements with 
multiple children be handled?

I don't see how this data would be used, so I'm not sure exactly what 
would be useful to collect or how to summarise and present it. But I 
think it should be fairly quick once I know what is actually wanted :-)
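
To make the question concrete, here is one possible reading as a 
sketch: record only the immediate parent and each immediate child of 
every <u>, with "#text" standing in for non-whitespace text children 
and one entry per child when there are several. This is an assumption 
about what is wanted, not a settled design. It uses Python's 
html.parser with a simple open-element stack, which does not apply 
HTML5 tree-building rules (no implied tbody, for instance), and all 
the names in it are made up for the example.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have content, so they are not pushed on the stack.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "param", "source", "track", "wbr",
}


class UnderlineContextCounter(HTMLParser):
    """Counts "parent/u/child" constructs for direct children of <u>."""

    def __init__(self):
        super().__init__()
        self.open_tags = []        # stack of currently open element names
        self.contexts = Counter()  # "parent/u/child" -> occurrences

    def _record_child(self, child):
        # Only count nodes whose immediate parent is a <u>.
        if self.open_tags and self.open_tags[-1] == "u":
            parent = self.open_tags[-2] if len(self.open_tags) >= 2 else "#root"
            self.contexts[f"{parent}/u/{child}"] += 1

    def handle_starttag(self, tag, attrs):
        self._record_child(tag)
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip():
            self._record_child("#text")


def count_underline_contexts(html_text):
    """Hypothetical helper: returns a Counter of "parent/u/child" strings."""
    counter = UnderlineContextCounter()
    counter.feed(html_text)
    counter.close()
    return counter.contexts
```

Run on the wordtree.com fragment quoted above, this yields a single 
"p/u/a" entry. An empty <u> produces no entry at all under this 
reading, which may or may not be what is wanted.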

-- 
Philip Taylor
pjt47@cam.ac.uk

Received on Monday, 31 December 2007 14:12:48 UTC