Re: several messages about New Vocabularies in text/html from Neil Soiffer on 2008-04-03 (www-math@w3.org from April 2008)

From: Neil Soiffer <Neils@dessci.com>
Date: Thu, 3 Apr 2008 14:51:28 -0700
To: "Ian Hickson" <ian@hixie.ch>
Cc: "David Carlisle" <davidc@nag.co.uk>, "Sam Ruby" <rubys@us.ibm.com>, "Henri Sivonen" <hsivonen@iki.fi>, "Simon Pieters" <simonp@opera.com>, "Bruce Miller" <bruce.miller@nist.gov>, jg307@cam.ac.uk, public-html@w3.org, www-math@w3.org, "Julian Reschke" <julian.reschke@gmx.de>
Message-ID: <d98bce170804031451j5f17e6b5i4dacf6c1641f8f21@mail.gmail.com>
I think it is great that you brought some numbers to the table.
Unfortunately, I don't think your data is valid.  As others have asked, do
your numbers include XHTML pages?  I would think they represent a majority
of the pages with MathML in them.  It appears that XHTML is dark to search
engines (there seems to be an obvious analogy to dark matter, but I'm sure
that has been made before).  E.g., did your search include [1] from an online
MIT course on calculus?

Also, it is clear you missed some MathML in HTML pages.  As I remarked when
I presented my numbers, the wolfram.com website has a large number of pages
with content MathML.  In fact, functions.wolfram.com alone has 307,715
pages with content MathML embedded in a semantics tag [2] (far more than the
4,000 or 5,000 you found).  These are presented as data for easy cut and
paste, so you probably did not see them when searching for tags.  Nonetheless,
this is how at least one site tried to deal with the lack of MathML support
in HTML, and it is representative of what they would put out if they could.

If I do a search on
+mfrac +mi +mo +mml:semantics
[note the mml: namespace prefix, which I didn't include in my previous
searches]

Google says that there are "about 7,440" hits.  If I just look for
mml:semantics, the number is 19,300.  That's more than the numbers you
found.  This search seems to turn up hits that are virtually all MathML
"data", not pages discussing it.  Most appear to be pages with MathML only
on them, usually from a journal of some sort.  I assume that these are
stored in some CMS and assembled onto a page dynamically.  An example of
this kind of page is [3].  These are the pages that I mentioned that have a
semantics element, but no annotation or annotation-xml element.

With the vast data you have at Google, I hope you can go back and figure out
how to get more accurate numbers.  I'm sure you can get access to info that
I can't.

Neil Soiffer
Senior Scientist
Design Science, Inc.
www.dessci.com
~ Makers of Equation Editor, MathType, MathPlayer and MathFlow ~



[1]
http://ocw.mit.edu/ans7870/18/18.013a/textbook/MathML/chapter01/section02.xhtml
-- note that they present both HTML and XHTML versions of the course, so if
you do a search for text on that page, make sure your hit is the xhtml
version.

[2]  http://functions.wolfram.com/alphabeticalIndex.html -- I based my count
on the values here -- drill down on them and you will find what I am
referring to.

[3] http://www.physmathcentral.com/1754-0410/1/7/mathml/M15




On Wed, Apr 2, 2008 at 4:56 PM, Ian Hickson <ian@hixie.ch> wrote:

>
> I did some research on actual usage of MathML on the Web.
>
> I scanned about 7 billion pages, and in each page, after parsing it with
> an HTML5 parser, looked for elements that, after stripping any leading
> prefix, had an element with the local name "math" and had, in addition,
> one of the following:
>
> * At least one of the following: maction maligngroup malignmark menclose
>  merror mfenced mfrac mglyph mi mlabeledtr mmultiscripts mn mo mover
>  mpadded mphantom mprescripts mroot mrow ms mspace msqrt mstyle msub
>  msubsup msup mtable mtd mtext mtr munder munderover none
>
> * At least two of the following: abs and apply approx arccos arccosh
>  arccot arccoth arccsc arccsch arcsec arcsech arcsin arcsinh arctan
>  arctanh arg bvar card cartesianproduct ceiling ci cn codomain complexes
>  compose condition conjugate cos cosh cot coth csc csch csymbol curl
>  declare degree determinant diff divergence divide domain
>  domainofapplication emptyset eq equivalent eulergamma exists exp
>  exponentiale factorial factorof false floor fn forall gcd geq grad gt
>  ident image imaginary imaginaryi implies in infinity int integers
>  intersect interval inverse lambda laplacian lcm leq limit list ln log
>  logbase lowlimit lt matrix matrixrow max mean median min minus mode
>  moment momentabout naturalnumbers neq not notanumber notin notprsubset
>  notsubset or otherwise outerproduct partialdiff pi piece piecewise plus
>  power primes product prsubset quotient rationals real reals reln rem
>  root scalarproduct sdev sec sech selector sep set setdiff sin sinh
>  subset sum tan tanh tendsto times transpose true union uplimit variance
>  vector vectorproduct xor
>
> * At least one of the following: annotation annotation-xml semantics
>
> The results I found are as follows:
>
> 200000 pages containing one or more from the first list above
>       (pages using Presentational MathML).
>
>  50000 pages containing only <math> and none of the above (or at most 1
>       from the Content MathML list above).
>
>  5000 pages containing two or more from the second list above
>       (pages using Content MathML).
>
>  4000 pages containing at least one from the third list above.
>
>  3000 pages containing at least one from the first list above and two
>       from the second list above (containing both Presentational and
>       Content MathML).
>
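
As a rough illustration of the counting rule quoted above, here is a minimal
Python sketch. The element lists are copied from the quoted message; the
function names and label strings are hypothetical, and it assumes that "at
least two" means two distinct element names, which the message does not say.

```python
# Element lists copied from the quoted message. Function names and label
# strings are hypothetical, chosen only for this illustration.
PRESENTATION = frozenset("""
maction maligngroup malignmark menclose merror mfenced mfrac mglyph mi
mlabeledtr mmultiscripts mn mo mover mpadded mphantom mprescripts mroot
mrow ms mspace msqrt mstyle msub msubsup msup mtable mtd mtext mtr munder
munderover none
""".split())

CONTENT = frozenset("""
abs and apply approx arccos arccosh arccot arccoth arccsc arccsch arcsec
arcsech arcsin arcsinh arctan arctanh arg bvar card cartesianproduct
ceiling ci cn codomain complexes compose condition conjugate cos cosh cot
coth csc csch csymbol curl declare degree determinant diff divergence
divide domain domainofapplication emptyset eq equivalent eulergamma exists
exp exponentiale factorial factorof false floor fn forall gcd geq grad gt
ident image imaginary imaginaryi implies in infinity int integers
intersect interval inverse lambda laplacian lcm leq limit list ln log
logbase lowlimit lt matrix matrixrow max mean median min minus mode
moment momentabout naturalnumbers neq not notanumber notin notprsubset
notsubset or otherwise outerproduct partialdiff pi piece piecewise plus
power primes product prsubset quotient rationals real reals reln rem
root scalarproduct sdev sec sech selector sep set setdiff sin sinh
subset sum tan tanh tendsto times transpose true union uplimit variance
vector vectorproduct xor
""".split())

ANNOTATION = frozenset({"annotation", "annotation-xml", "semantics"})


def local_name(tag):
    """Strip any leading prefix, e.g. 'mml:mfrac' -> 'mfrac'."""
    return tag.rsplit(":", 1)[-1].lower()


def classify(tags):
    """Return the set of labels for one page, given its element names."""
    names = {local_name(t) for t in tags}
    if "math" not in names:
        return set()  # pages without a <math> element are not counted
    labels = set()
    if names & PRESENTATION:
        labels.add("presentation")
    # Assumption: "at least two" means two distinct element names.
    if len(names & CONTENT) >= 2:
        labels.add("content")
    if names & ANNOTATION:
        labels.add("annotation")
    # A <math> page matching none of the lists falls in the "only <math>" row.
    return labels or {"math-only"}
```

Under this reading, a page containing only <math> and <rem> lands in the
"only <math>" row, and a page using <rem> and <image> without any <math>
element is not counted at all.
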
> This suggests that Content MathML is nowhere near as frequently used
> as has been previously suggested.
>
> The most common MathML elements in the sample were:
>
>   ELEMENT   ROUGH COUNT   PERCENTAGE
>   rem          8500000      0.122%
>   image        1000000      0.014%
>   set           450000      0.006%
>   abs           400000      0.005%
>   root          300000      0.004%
>   math          250000      0.003%
>   mi            250000      0.003%
>   true          200000      0.003%
>   mo            200000      0.003%
>   none          200000      0.003%
>   ms            200000      0.002%
>   mrow          200000      0.002%
>   mn            200000      0.002%
>   list          150000      0.002%
>   sec           150000      0.002%
>   mfrac         150000      0.002%
>   msub          150000      0.002%
>   product       100000      0.001%
>
> (The <rem>, <image>, <set>, <abs>, and <root> elements are the reason
> why the sample needed _two_ Content MathML elements to count as MathML
> -- those elements, it turns out, are common in other contexts. <image>,
> for example, is a synonym for <img> in HTML.)
>
> This study could probably be done in various different ways. In
> particular, I didn't do anything to check namespaces, which could be a
> better indicator of MathML content. I counted pages, rather than sites,
> thus biasing towards large publishers instead of smaller ones. I didn't
> check that the MathML elements used on a page were descendants of the
> <math> element on the page. I didn't check that the prefixes matched
> throughout.
>
> These factors add together to bias the numbers towards the wide use of
> unnamespaced and Presentational-MathML-only MathML on, in particular,
> freepatentsonline.com. On the other hand, that's a whole lot of MathML
> that we would instantly be supporting if we added this to HTML5, so maybe
> it's not an unfair bias.
>
>
>
> On Wed, 2 Apr 2008, David Carlisle wrote:
> >
> > > Yes, people keep saying that, but I've yet to see a detailed proposal
> > > that is workable. I've tried coming up with many different ideas, but
> > > all had some fatal flaw that wouldn't work on the Web.
> >
> > Since people have been placing content mathml (and openmath and other)
> > annotations on the web for the last ten years or so, it clearly is
> > possible to make this work on the web, it may not work in html5 as
> > currently specified, but I understand that one of the aims of html5 is
> > to codify existing practice and allow things that work now to keep
> > working.
>
> Well, MathML in XHTML will of course continue to be fully supported. We
> are only talking about MathML in text/html, which up until now has never
> been a valid or defined practice.
>
>
> > HTML since forever has had rules that allow unknown elements to be
> > parsed (with a default rendering of ignoring the element and processing
> > the content). The html parser has never had to "know" anything about
> > them, has it?
>
> I'm not sure to what you refer here. Before HTML5, HTML has not had any
> defined error handling parsing rules, browsers just made it up as they
> went along, based on reverse-engineering each other.
>
>
> On Wed, 2 Apr 2008, Sam Ruby wrote:
> > >
> > >    http://wiki.whatwg.org/wiki/Extensions
> >
> > I have now contributed to that page.  Feel free to identify where the
> > proposal is not detailed enough or to identify any flaws that may, or
> > may not, prove fatal.
>
> The proposal seems to be "do what Microsoft documented in their namespaces
> whitepaper as being the IE8 Beta 1 behaviour". However, the whitepaper
> doesn't actually say what the processing model is, and IE8 beta 1 doesn't
> seem to implement anything like what the whitepaper implies should happen
> anyway.
>
> If you could describe in your own words what the processing model you are
> proposing is, that would be something I could evaluate.
>
>
> On Wed, 2 Apr 2008, Henri Sivonen wrote:
> >
> > Could you please elaborate why the following won't work? In particular,
> > would the following break such a large mass of pages as to Break The
> > Web? (Especially if the rendering rules for MathML are adjusted so that
> > text children of <math> are rendered like text children of an HTML
> > <span>.)
> >
> > The following elements are defined as 'namespace-sensitive':
> >  <html>
> >  <svg>
> >  <math>
> >  <foreignObject>
> >  <annotation-xml encoding="application/xhtml+xml">
> >  <annotation-xml encoding="OpenMath">.
> >
> > Namespace-sensitive elements have two namespace URIs associated with
> > them: self and scope.
> >
> > Thus:
> >  <html>
> >    self: http://www.w3.org/1999/xhtml
> >    scope: http://www.w3.org/1999/xhtml
> >  <svg>
> >    self: http://www.w3.org/2000/svg
> >    scope: http://www.w3.org/2000/svg
> >  <math>
> >    self: http://www.w3.org/1998/Math/MathML
> >    scope: http://www.w3.org/1998/Math/MathML
> >  <foreignObject>
> >    self: http://www.w3.org/2000/svg
> >    scope: http://www.w3.org/1999/xhtml
> >  <annotation-xml encoding="application/xhtml+xml">
> >    self: http://www.w3.org/1998/Math/MathML
> >    scope: http://www.w3.org/1999/xhtml
> >  <annotation-xml encoding="OpenMath">
> >    self: http://www.w3.org/1998/Math/MathML
> >    scope: http://www.openmath.org/OpenMath
> >
> > The namespace of an element node to be inserted is determined as follows:
> >  1) If the node to be inserted is a namespace-sensitive element, use the
> >     value for 'self' in the above list and abort these steps.
> >  2) Let 'node' be the current node on the stack of open elements.
> >  3) If 'node' is a namespace-sensitive element, use the value for 'scope'
> >     in the above list and abort these steps.
> >  4) Let 'node' be the next node on the stack of open elements towards the
> >     root element.
> >  5) Go back to step 3.
> >
> > (Of course, the repeated stack walking should be optimized away.)
> >
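
The five steps quoted above can be sketched in Python roughly as follows.
This is only an illustration of the proposal as written: the function and
table names are hypothetical, the open-element stack is modelled as a plain
list of (name, attributes) pairs, the encoding attribute is compared by exact
match, and the fallback to the XHTML namespace when no namespace-sensitive
element is on the stack is an assumption (the proposal does not cover that
case).

```python
XHTML = "http://www.w3.org/1999/xhtml"
SVG = "http://www.w3.org/2000/svg"
MATHML = "http://www.w3.org/1998/Math/MathML"
OPENMATH = "http://www.openmath.org/OpenMath"

# (lower-cased tag name, required encoding or None) -> (self, scope)
NAMESPACE_SENSITIVE = {
    ("html", None): (XHTML, XHTML),
    ("svg", None): (SVG, SVG),
    ("math", None): (MATHML, MATHML),
    ("foreignobject", None): (SVG, XHTML),
    ("annotation-xml", "application/xhtml+xml"): (MATHML, XHTML),
    ("annotation-xml", "OpenMath"): (MATHML, OPENMATH),
}


def sensitive_entry(name, attrs):
    """Return (self, scope) if the element is namespace-sensitive, else None."""
    key = name.lower()
    # Only annotation-xml is distinguished by its encoding attribute.
    encoding = attrs.get("encoding") if key == "annotation-xml" else None
    return NAMESPACE_SENSITIVE.get((key, encoding))


def namespace_for(name, attrs, open_elements):
    """Resolve the namespace of an element about to be inserted.

    open_elements is a list of (name, attrs) pairs ordered from the root
    element to the current node.
    """
    entry = sensitive_entry(name, attrs)
    if entry is not None:
        return entry[0]  # step 1: use 'self'
    # Steps 2-5: walk from the current node towards the root element.
    for open_name, open_attrs in reversed(open_elements):
        entry = sensitive_entry(open_name, open_attrs)
        if entry is not None:
            return entry[1]  # step 3: use 'scope'
    return XHTML  # assumption: not specified in the proposal
```

For example, with the stack html, body, math, an <mfrac> start tag resolves
to the MathML namespace, while a <p> inside svg, foreignObject resolves to
the XHTML namespace.
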
> > The /> empty element syntax should be supported on start tag tokens
> > (node popped immediately) whose namespace doesn't resolve to
> > http://www.w3.org/1999/xhtml according to the above rule.
> >
> > When the stack is pushed/popped, the namespace of the current node must
> > be inspected. If it is http://www.w3.org/1999/xhtml, the tokenizer must
> > be set not to support CDATA sections. Otherwise, the tokenizer must be
> > set to support CDATA sections.
>
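
The two tokenizer rules quoted above (the /> syntax and CDATA sections) both
depend only on whether a namespace is the XHTML one. A minimal Python sketch,
with hypothetical names throughout:

```python
from typing import NamedTuple

XHTML = "http://www.w3.org/1999/xhtml"
SVG = "http://www.w3.org/2000/svg"


class TokenizerSettings(NamedTuple):
    # Whether CDATA sections are recognised by the tokenizer.
    cdata_sections: bool


def settings_for(current_node_namespace):
    """Settings to apply whenever the stack of open elements changes."""
    return TokenizerSettings(cdata_sections=current_node_namespace != XHTML)


def self_closes(start_tag_namespace, has_trailing_slash):
    """True if a start tag written with /> should be popped immediately."""
    return has_trailing_slash and start_tag_namespace != XHTML
```

So <circle/> in the SVG namespace would close immediately, while <br/> or
<div/> in the XHTML namespace would be handled as before.
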
> This is an interesting proposal, far more concrete than anything anyone
> else has proposed so far. Thank you.
>
> It doesn't work because it breaks the handling of pages that exist today
> that use the elements you list above. For example, take this page:
>
>
> http://www.cip.es/aecan/ver_anuncio.asp?idioma=Aleman&cod_anuncio=ARC100&acceso=Busqueda
>
> ...which contains this markup:
>
>   <td width="27%" bgcolor="#FFFFFF">0 <math>m<sup>2</sup></td>
>
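
To see what the scoped rule quoted earlier would presumably do with this
markup, here is a small self-contained Python sketch. It uses a reduced table
with only the two namespace-sensitive elements the example needs, and the
elements assumed to sit on the stack above <td> are a guess, since the
message shows only the cell. All names are hypothetical.

```python
XHTML = "http://www.w3.org/1999/xhtml"
MATHML = "http://www.w3.org/1998/Math/MathML"

# Reduced table: for <html> and <math>, 'self' and 'scope' are the same.
SCOPE = {"html": XHTML, "math": MATHML}


def namespace_for(name, open_elements):
    """Scoped lookup: own namespace first, then the nearest sensitive ancestor."""
    if name in SCOPE:
        return SCOPE[name]
    for ancestor in reversed(open_elements):
        if ancestor in SCOPE:
            return SCOPE[ancestor]
    return XHTML  # assumption: fallback when nothing on the stack matches


# Stack at the point where the <sup> start tag is seen. The <math> element
# is never closed in the markup, so it is still open here.
# Assumption: the elements above <td> are a typical table structure.
stack = ["html", "body", "table", "tbody", "tr", "td", "math"]

print(namespace_for("sup", stack) == MATHML)       # prints True
print(namespace_for("sup", stack[:-1]) == XHTML)   # prints True
```

In other words, the <sup> that browsers treat as an HTML superscript today
would become an unknown element in the MathML namespace under the proposal.
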
>
> It also fails in the case where someone (author A) using a new browser
> writes a page that uses this feature, and then someone (author B) using an
> old browser copies and pastes from A's page into his page, accidentally
> including a stray <svg> tag or <math> tag. His page looks fine to most
> users, but to the users of the new browser, the page is now horked.
>
>
> On Wed, 2 Apr 2008, Simon Pieters wrote:
> >
> > Until I see actual pages that contain non-MathML in <math> or non-SVG in
> > <svg>, I'm not convinced that Henri's scoped parsing proposal[1] doesn't
> > work. Do you perhaps have such data at hand so I can take a look and be
> > convinced? :-)
>
> Most pages that use <math> when not using MathML seem to put LaTeX-like
> markup inside the element. Here are some that put elements in <math>,
> though:
>
>   http://www.emis.de/journals/FPM/eng/k00/k001/k00126h.htm
>   http://www.freepatentsonline.com/EP0693743.html
>   http://apmath.kku.ac.kr/~seokko/notes/mathcon.htm
>   http://www.ioffe.rssi.ru/cp866/journals/jtf/2003/12/page-1.html.ru
>   http://www.kougensha.net/blosxom/blosxom.cgi/tech/freebsd/index.html
>
>
>
> > If there are a non-trivial number of pages that have HTML elements in
> > <math> or <svg> (not nested in <foreignObject>/<annotation-xml>), then
> > wouldn't it be possible to special-case HTML elements in <math>/<svg>
> > and let the rest be handled as "unknown" elements in the MathML/SVG
> > namespaces (so that, e.g., <math><foo><b> is interpreted as
> > <mml:math><mml:foo><html:b>)?
>
> This wouldn't work well for SVG, where we have name clashes already (i.e.
> where some element names are used in both SVG and HTML).
>
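
The special-casing idea and the name-clash objection can be illustrated with
a short Python sketch. The element sets below are small illustrative subsets,
not complete lists, and the function name is hypothetical.

```python
XHTML = "http://www.w3.org/1999/xhtml"
SVG = "http://www.w3.org/2000/svg"
MATHML = "http://www.w3.org/1998/Math/MathML"

# Illustrative subsets only, not the full HTML or SVG element lists.
HTML_NAMES = frozenset(
    {"a", "b", "i", "p", "span", "sub", "sup", "script", "style", "title"}
)
SVG_NAMES = frozenset(
    {"a", "circle", "defs", "g", "path", "rect", "script", "style", "text", "title"}
)


def namespace_in_foreign_subtree(name, foreign_namespace):
    """Special-case known HTML names; treat everything else as foreign."""
    return XHTML if name in HTML_NAMES else foreign_namespace


# <math><foo><b>: <foo> stays in MathML, <b> is pulled back into HTML.
print(namespace_in_foreign_subtree("foo", MATHML) == MATHML)  # prints True
print(namespace_in_foreign_subtree("b", MATHML) == XHTML)     # prints True

# Names used by both vocabularies cannot be told apart by name alone.
clashes = sorted(HTML_NAMES & SVG_NAMES)
print(clashes)  # prints ['a', 'script', 'style', 'title']
```

With this rule an SVG <a> or <title> inside <svg> would be treated as the
HTML element of the same name, which is the clash described above.
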
>
> > Also, on a slightly different note, I think that for copy-pastability of
> > SVG in text/html, the parser needs to make /> self-close elements, since
> > e.g. <circle> can have contents (e.g. animation stuff, I think) and Sam
> > Ruby said that some tools emit <defs/> and <g/>. [2]
>
> Yes. I'm not convinced that we'll be able to get the ability to copy and
> paste image/svg+xml content into text/html.
>
>
> On Wed, 2 Apr 2008, Henri Sivonen wrote:
> >
> > The existing content landscape for <svg> may be very different from
> > random junk in <math> out there, since cargo-cult semanticists may come
> > up with <math> on their own but <svg> is unlikely to occur without
> > trying to do SVG. So while scope plus HTML blacklist may be the best
> > option for MathML subtrees, scope plus camelCase-fixing whitelist may be
> > the most robust solution for SVG subtrees.
>
> I'm not sure exactly what you mean here.
>
> I will see about doing a more detailed study to examine the feasibility of
> what you propose (especially for the SVG side).
>
>
> > Finally, breaking a handful of legacy pages isn't yet a "fatal" flaw.
>
> I believe it is.
>
>
> On Wed, 2 Apr 2008, Bruce Miller wrote:
> >
> > _Surely_, no one out there is writing HTML using <whatevertag/> when
> > they _don't_ mean to close the element?!?!?! (rolling my eyes :> )
>
> Yeah, it's used all over the place actually, with the pages relying on the
> tag not closing.
>
>
> On Wed, 2 Apr 2008, David Carlisle wrote:
> >
> > It's odd that earlier in the thread we were told that proper
> > handling of html5 would require a real html5 parser (of which several
> > ought to be available) but in the same thread there is the repeated
> > requirement that html5 "work" with the existing html4 parsers. (Which
> > presumably doesn't go as far as saying what the HTML spec (by reference
> > to sgml) says it should do for />, which is to treat the > as character
> > data.)
>
> In practice what HTML5 defines today is pretty much what HTML4 browsers
> implement.
>
> --
> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
>
Received on Thursday, 3 April 2008 21:52:26 UTC