Re: Re: several messages about New Vocabularies in text/html from Ian Hickson on 2008-04-03 (public-html@w3.org from April 2008)

From: Ian Hickson <ian@hixie.ch>
Date: Thu, 3 Apr 2008 22:34:51 +0000 (UTC)
To: Neil Soiffer <Neils@dessci.com>
Cc: public-html@w3.org, www-math@w3.org
Message-ID: <Pine.LNX.4.62.0804032212190.18949@hixie.dreamhostps.com>

On Thu, 3 Apr 2008, Neil Soiffer wrote:
>
> Unfortunately, I don't think your data is valid.  As others have asked, 
> do your numbers include xhtml pages?

Yes.

If we look just at application/xhtml+xml, the following elements with 
MathML2 local names were found. I've included rough counts but these 
numbers are very approximate -- there simply weren't enough XHTML pages 
in my sample to get good numbers. Maybe a bigger set of pages could get 
more reliable results.

  math         about 2000 pages
  mi      
  mo    
  mrow         about 1900 pages
  mn    
  mfrac        about 1300 pages
  msup    
  mtext   
  msub         about 1100 pages
  annotation   about 800 pages
  mstyle    
  msqrt    
  mtable   
  mtd    
  mtr    
  msubsup      about 500 pages
  munderover    
  mover        about 200 pages
  munder    
  ci           about 150 pages
  apply    
  cn   

> Eg, did your search include [1] from an online MIT course on calculus?

I have no idea which pages specifically it included. It was a sample of 
seven billion documents, weighted by some metric of importance that is 
intended to exclude "spam" pages and to favour pages that people are more 
likely to be interested in.

> Also, it is clear you missed some MathML in HTML pages.

Naturally. It was merely a scan of a sample of seven billion pages, not a 
scan of the entire Web, which would be prohibitively expensive.

> As I remarked when I presented my numbers, the wolfram.com website has a 
> large number of pages with content MathML.

It has a large number of pages with text that represents MathML content; 
it doesn't actually contain any MathML content itself as far as I can 
tell. Pages including escaped markup like this were not counted.
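
To illustrate the distinction, here is a minimal sketch. It is not the code 
used for the scan; the function name and the three sample documents are 
made up for illustration. It parses XHTML as XML and counts elements in the 
MathML namespace by local name. Escaped markup is just character data to 
the parser, so it contributes nothing, while a prefix such as mml: makes no 
difference to the local name.

```python
import xml.etree.ElementTree as ET
from collections import Counter

MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def count_mathml_elements(xhtml_source):
    """Count elements in the MathML namespace, keyed by local name."""
    counts = Counter()
    root = ET.fromstring(xhtml_source)
    for el in root.iter():
        # ElementTree spells qualified names as "{namespace}localname".
        if el.tag.startswith("{" + MATHML_NS + "}"):
            counts[el.tag.split("}", 1)[1]] += 1
    return counts


# Real MathML elements, default namespace declared on <math>.
REAL = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<math xmlns="http://www.w3.org/1998/Math/MathML">
<mfrac><mi>a</mi><mi>b</mi></mfrac>
</math>
</body></html>"""

# The same markup, escaped: to the parser this is only text inside <pre>.
ESCAPED = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<pre>&lt;math&gt;&lt;mfrac&gt;&lt;mi&gt;a&lt;/mi&gt;&lt;mi&gt;b&lt;/mi&gt;&lt;/mfrac&gt;&lt;/math&gt;</pre>
</body></html>"""

# Real MathML again, this time using an mml: prefix.
PREFIXED = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:mml="http://www.w3.org/1998/Math/MathML"><body>
<mml:math><mml:semantics><mml:mi>x</mml:mi></mml:semantics></mml:math>
</body></html>"""

if __name__ == "__main__":
    for name, doc in (("real", REAL), ("escaped", ESCAPED), ("prefixed", PREFIXED)):
        print(name, dict(count_mathml_elements(doc)))
```

A plain-text search for "mfrac" or "mml:semantics" would match all three 
documents; a count of parsed elements matches only the first and third.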

> If I do a search on +mfrac +mi +mo +mml:semantics [note the mml: 
> namespace prefix, which I didn't include in my previous searches]
> 
> Google says that there are "about 7,440" hits.  If I just look for
> mml:semantics, the number is 19,300.  That's more than the numbers you
> found.

Sure, a Google search is searching orders of magnitude more documents than 
I scanned.

> This search seems to turn up hits that are virtually all MathML
> "data", not pages discussing it.

Actually they all also contain escaped MathML, which is really just text, 
which is why Google finds them. That isn't MathML, though it can be, as 
you say, copied and pasted into MathML processors.

I don't see any evidence to suggest that people using the <semantics> 
element are more likely to write their pages in this weird "escaped MathML 
content inside text/html pages" manner than people who don't use 
<semantics>. Evidence to that effect is what would be needed to invalidate 
the results of the study. (i.e. you should show that the sample is somehow 
biased towards or against its conclusion, not that the sample is not 
complete, which is trivially true.)

Anyway, this is mostly moot as the proposal that is being converged on 
does support Content MathML in text/html:

   http://wiki.whatwg.org/wiki/New_Vocabularies_Solution

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Thursday, 3 April 2008 22:35:37 UTC