Re: several messages about New Vocabularies in text/html from Ian Hickson on 2008-04-02 (public-html@w3.org from April 2008)

From: Ian Hickson <ian@hixie.ch>
Date: Wed, 2 Apr 2008 23:56:43 +0000 (UTC)
To: David Carlisle <davidc@nag.co.uk>, Sam Ruby <rubys@us.ibm.com>, Henri Sivonen <hsivonen@iki.fi>, Simon Pieters <simonp@opera.com>, Bruce Miller <bruce.miller@nist.gov>
Cc: Neil Soiffer <Neils@dessci.com>, jg307@cam.ac.uk, public-html@w3.org, www-math@w3.org, Julian Reschke <julian.reschke@gmx.de>
Message-ID: <Pine.LNX.4.62.0804022211480.24456@hixie.dreamhostps.com>

I did some research on actual usage of MathML on the Web.

I scanned about 7 billion pages and, after parsing each page with an 
HTML5 parser, looked for pages that, after stripping any leading prefix 
from element names, had an element with the local name "math" and had, 
in addition, one of the following:

* At least one of the following: maction maligngroup malignmark menclose 
  merror mfenced mfrac mglyph mi mlabeledtr mmultiscripts mn mo mover 
  mpadded mphantom mprescripts mroot mrow ms mspace msqrt mstyle msub 
  msubsup msup mtable mtd mtext mtr munder munderover none

* At least two of the following: abs and apply approx arccos arccosh 
  arccot arccoth arccsc arccsch arcsec arcsech arcsin arcsinh arctan 
  arctanh arg bvar card cartesianproduct ceiling ci cn codomain complexes 
  compose condition conjugate cos cosh cot coth csc csch csymbol curl 
  declare degree determinant diff divergence divide domain 
  domainofapplication emptyset eq equivalent eulergamma exists exp 
  exponentiale factorial factorof false floor fn forall gcd geq grad gt 
  ident image imaginary imaginaryi implies in infinity int integers 
  intersect interval inverse lambda laplacian lcm leq limit list ln log 
  logbase lowlimit lt matrix matrixrow max mean median min minus mode 
  moment momentabout naturalnumbers neq not notanumber notin notprsubset 
  notsubset or otherwise outerproduct partialdiff pi piece piecewise plus 
  power primes product prsubset quotient rationals real reals reln rem 
  root scalarproduct sdev sec sech selector sep set setdiff sin sinh 
  subset sum tan tanh tendsto times transpose true union uplimit variance 
  vector vectorproduct xor

* At least one of the following: annotation annotation-xml semantics
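
As a rough illustration only (this is not the code used for the scan), the 
classification described above could be sketched in Python as follows. The 
function names and labels are hypothetical, and the sketch assumes that "at 
least two" means two distinct element names and that the parser has already 
produced a list of tag names for the page.

```python
# Element lists copied from the message; everything else here is illustrative.
PRESENTATION = set("""
maction maligngroup malignmark menclose merror mfenced mfrac mglyph mi
mlabeledtr mmultiscripts mn mo mover mpadded mphantom mprescripts mroot mrow
ms mspace msqrt mstyle msub msubsup msup mtable mtd mtext mtr munder
munderover none
""".split())

CONTENT = set("""
abs and apply approx arccos arccosh arccot arccoth arccsc arccsch arcsec
arcsech arcsin arcsinh arctan arctanh arg bvar card cartesianproduct ceiling
ci cn codomain complexes compose condition conjugate cos cosh cot coth csc
csch csymbol curl declare degree determinant diff divergence divide domain
domainofapplication emptyset eq equivalent eulergamma exists exp exponentiale
factorial factorof false floor fn forall gcd geq grad gt ident image
imaginary imaginaryi implies in infinity int integers intersect interval
inverse lambda laplacian lcm leq limit list ln log logbase lowlimit lt matrix
matrixrow max mean median min minus mode moment momentabout naturalnumbers
neq not notanumber notin notprsubset notsubset or otherwise outerproduct
partialdiff pi piece piecewise plus power primes product prsubset quotient
rationals real reals reln rem root scalarproduct sdev sec sech selector sep
set setdiff sin sinh subset sum tan tanh tendsto times transpose true union
uplimit variance vector vectorproduct xor
""".split())

ANNOTATION = {"annotation", "annotation-xml", "semantics"}


def local_name(tag):
    """Strip any leading prefix, e.g. 'mml:mrow' -> 'mrow'."""
    return tag.rsplit(":", 1)[-1].lower()


def classify(tag_names):
    """Return a set of labels for one page, given its element names."""
    names = {local_name(t) for t in tag_names}
    if "math" not in names:
        return set()                      # page is not counted at all
    labels = set()
    if names & PRESENTATION:
        labels.add("presentation")
    if len(names & CONTENT) >= 2:         # assumption: two distinct names
        labels.add("content")
    if names & ANNOTATION:
        labels.add("annotation")
    return labels or {"math-only"}
```

Requiring two Content MathML names keeps a page with a <math> element and a 
single stray <image> or <rem> out of the Content MathML bucket.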

The results I found are as follows:

200000 pages containing one or more from the first list above 
       (pages using Presentational MathML).

 50000 pages containing only <math> and none of the above (or at most 1 
       from the Content MathML list above).

  5000 pages containing two or more from the second list above 
       (pages using Content MathML).

  4000 pages containing at least one from the third list above.

  3000 pages containing at least one from the first list above and two 
       from the second list above (containing both Presentational and 
       Content MathML).

This suggests that Content MathML is nowhere near as frequently used 
as has been previously suggested.

The most common MathML elements in the sample were:

   ELEMENT   ROUGH COUNT   PERCENTAGE
   rem          8500000      0.122%
   image        1000000      0.014%
   set           450000      0.006%
   abs           400000      0.005%
   root          300000      0.004%
   math          250000      0.003%
   mi            250000      0.003%
   true          200000      0.003%
   mo            200000      0.003%
   none          200000      0.003%
   ms            200000      0.002%
   mrow          200000      0.002%
   mn            200000      0.002%
   list          150000      0.002%
   sec           150000      0.002%
   mfrac         150000      0.002%
   msub          150000      0.002%
   product       100000      0.001%

(The <rem>, <image>, <set>, <abs>, and <root> elements are the reason
why the sample needed _two_ Content MathML elements to count as MathML
-- those elements, it turns out, are common in other contexts. <image>, 
for example, is a synonym for <img> in HTML.)

This study could probably be done in various different ways. In 
particular, I didn't do anything to check namespaces, which could be a 
better indicator of MathML content. I counted pages, rather than sites, 
thus biasing towards large publishers instead of smaller ones. I didn't 
check that the MathML elements used on a page were descendants of the 
<math> element on the page. I didn't check that the prefixes matched 
throughout.

These factors add together to bias the numbers towards the wide use of 
unnamespaced and Presentational-MathML-only MathML on, in particular, 
freepatentsonline.com. On the other hand, that's a whole lot of MathML 
that we would instantly be supporting if we added this to HTML5, so maybe 
it's not an unfair bias.

  

On Wed, 2 Apr 2008, David Carlisle wrote:
> 
> > Yes, people keep saying that, but I've yet to see a detailed proposal 
> > that is workable. I've tried coming up with many different ideas, but 
> > all had some fatal flaw that wouldn't work on the Web.
> 
> Since people have been placing content mathml (and openmath and other) 
> annotations on the web for the last ten years or so, it clearly is 
> possible to make this work on the web, it may not work in html5 as 
> currently specified, but I understand that one of the aims of html5 is 
> to codify existing practice and allow things that work now to keep 
> working.

Well, MathML in XHTML will of course continue to be fully supported. We 
are only talking about MathML in text/html, which up until now has never 
been a valid or defined practice.


> HTML since forever has had rules that allow unknown elements to be 
> parsed (with a default rendering of ignoring the element and processing 
> the content) The html parser has never had to "know" anything about them 
> has it?

I'm not sure to what you refer here. Before HTML5, HTML has not had any 
defined error handling parsing rules; browsers just made it up as they 
went along, based on reverse-engineering each other.


On Wed, 2 Apr 2008, Sam Ruby wrote:
> >
> >    http://wiki.whatwg.org/wiki/Extensions
> 
> I have now contributed to that page.  Feel free to identify where the 
> proposal is not detailed enough or to identify any flaws that may, or 
> may not, prove fatal.

The proposal seems to be "do what Microsoft documented in their namespaces 
whitepaper as being the IE8 Beta 1 behaviour". However, the whitepaper 
doesn't actually say what the processing model is, and IE8 beta 1 doesn't 
seem to implement anything like what the whitepaper implies should happen 
anyway.

If you could describe in your own words what the processing model you are 
proposing is, that would be something I could evaluate.


On Wed, 2 Apr 2008, Henri Sivonen wrote:
> 
> Could you please elaborate why the following won't work? In particular, 
> would the following break such a large mass of pages as to Break The 
> Web? (Especially if the rendering rules for MathML are adjusted so that 
> text children of <math> are rendered like text children of an HTML 
> <span>.)
> 
> The following elements are defined as 'namespace-sensitive':
>  <html>
>  <svg>
>  <math>
>  <foreignObject>
>  <annotation-xml encoding="application/xhtml+xml">
>  <annotation-xml encoding="OpenMath">.
> 
> Namespace-sensitive elements have two namespace URIs associated with 
> them: self and scope.
> 
> Thus:
>  <html>
>    self: http://www.w3.org/1999/xhtml
>    scope: http://www.w3.org/1999/xhtml
>  <svg>
>    self: http://www.w3.org/2000/svg
>    scope: http://www.w3.org/2000/svg
>  <math>
>    self: http://www.w3.org/1998/Math/MathML
>    scope: http://www.w3.org/1998/Math/MathML
>  <foreignObject>
>    self: http://www.w3.org/2000/svg
>    scope: http://www.w3.org/1999/xhtml
>  <annotation-xml encoding="application/xhtml+xml">
>    self: http://www.w3.org/1998/Math/MathML
>    scope: http://www.w3.org/1999/xhtml
>  <annotation-xml encoding="OpenMath">
>    self: http://www.w3.org/1998/Math/MathML
>    scope: http://www.openmath.org/OpenMath
> 
> The namespace of an element node to be inserted is determined as follows:
>  1) If the node to be inserted is a namespace-sensitive element, use the
> value for 'self' in the above list and abort these steps.
>  2) Let 'node' be the current node on the stack of open elements.
>  3) If 'node' is a namespace-sensitive element, use the value for 'scope' in
> the above list and abort these steps.
>  4) Let 'node' be the next node on the stack of open elements towards the root
> element.
>  5) Go back to step 3.
> 
> (Of course, the repeated stack walking should be optimized away.)
> 
> The /> empty element syntax should be supported on start tag tokens (node
> popped immediately) whose namespace doesn't resolve to
> http://www.w3.org/1999/xhtml according to the above rule.
> 
> When the stack is pushed/popped, the namespace of the current node must 
> be inspected. If it is http://www.w3.org/1999/xhtml, the tokenizer must 
> be set not to support CDATA sections. Otherwise, the tokenizer must be 
> set to support CDATA sections.
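
For readers following along, the namespace-selection steps quoted above could 
be sketched roughly as below. This is only an illustration of the quoted 
rules, not code from any parser: the function names are hypothetical, the 
stack is modelled as a list of (name, attributes) pairs from root to current 
node, names are compared case-insensitively, and falling back to the XHTML 
namespace when no namespace-sensitive element is open is an assumption.

```python
XHTML = "http://www.w3.org/1999/xhtml"
SVG = "http://www.w3.org/2000/svg"
MATHML = "http://www.w3.org/1998/Math/MathML"
OPENMATH = "http://www.openmath.org/OpenMath"

# (self, scope) pairs taken from the quoted table.
SENSITIVE = {
    "html": (XHTML, XHTML),
    "svg": (SVG, SVG),
    "math": (MATHML, MATHML),
    "foreignobject": (SVG, XHTML),
}
# <annotation-xml> is only namespace-sensitive for these encodings.
ANNOTATION_SCOPE = {
    "application/xhtml+xml": XHTML,
    "OpenMath": OPENMATH,
}


def sensitive_entry(name, attrs):
    """Return (self, scope) for a namespace-sensitive element, else None."""
    name = name.lower()
    if name == "annotation-xml":
        scope = ANNOTATION_SCOPE.get(attrs.get("encoding"))
        return (MATHML, scope) if scope else None
    return SENSITIVE.get(name)


def namespace_for(name, attrs, open_elements):
    """Namespace for a node about to be inserted (steps 1 to 5)."""
    entry = sensitive_entry(name, attrs)
    if entry:                                   # step 1: use 'self'
        return entry[0]
    for el_name, el_attrs in reversed(open_elements):   # steps 2 to 5
        entry = sensitive_entry(el_name, el_attrs)
        if entry:
            return entry[1]                     # use 'scope'
    return XHTML                                # assumption: default


stack = [("html", {}), ("body", {}), ("svg", {}), ("foreignObject", {})]
print(namespace_for("p", {}, stack))            # XHTML, via foreignObject scope
print(namespace_for("circle", {}, stack[:3]))   # SVG, via svg scope
```

A real implementation would cache the current scope rather than walk the 
stack for every start tag, as the quoted text notes.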

This is an interesting proposal, far more concrete than anything anyone 
else has proposed so far. Thank you.

It doesn't work because it breaks the handling of pages that exist today 
that use the elements you list above. For example, take this page:

   http://www.cip.es/aecan/ver_anuncio.asp?idioma=Aleman&cod_anuncio=ARC100&acceso=Busqueda

...which contains this markup:

   <td width="27%" bgcolor="#FFFFFF">0 <math>m<sup>2</sup></td>
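
To make the breakage concrete, here is a small, hedged sketch that applies 
the scoping rule to that markup. It is a toy tag scanner rather than an HTML5 
parser; the regular expression and function name are hypothetical, and only 
<html> and <math> are modelled (their 'self' and 'scope' values coincide, so 
one table suffices).

```python
import re

XHTML = "http://www.w3.org/1999/xhtml"
MATHML = "http://www.w3.org/1998/Math/MathML"

# Simplification: only the namespace-sensitive elements needed here.
SCOPES = {"html": XHTML, "math": MATHML}

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9-]*)[^>]*>")


def assign_namespaces(markup):
    """Return [(element name, namespace)] under the scoped-parsing rule."""
    stack, result = [], []
    for match in TAG.finditer(markup):
        closing, name = match.group(1), match.group(2).lower()
        if closing:
            # Pop up to and including the nearest matching open element.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i] == name:
                    del stack[i:]
                    break
            continue
        if name in SCOPES:
            ns = SCOPES[name]
        else:
            # Nearest namespace-sensitive ancestor decides; default to XHTML.
            ns = next((SCOPES[n] for n in reversed(stack) if n in SCOPES), XHTML)
        result.append((name, ns))
        stack.append(name)
    return result


markup = '<td width="27%" bgcolor="#FFFFFF">0 <math>m<sup>2</sup></td>'
for name, ns in assign_namespaces(markup):
    print(name, ns)
```

Under this rule the <sup> inside the unclosed <math> lands in the MathML 
namespace, so it would no longer be treated as the HTML superscript element 
the author intended.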


It also fails in the case where someone (author A) using a new browser 
writes a page that uses this feature, and then someone (author B) using an 
old browser copies and pastes from A's page into his page, accidentally 
including a stray <svg> tag or <math> tag. His page looks fine to most 
users, but to the users of the new browser, the page is now horked.


On Wed, 2 Apr 2008, Simon Pieters wrote:
> 
> Until I see actual pages that contain non-MathML in <math> or non-SVG in 
> <svg>, I'm not convinced that Henri's scoped parsing proposal[1] doesn't 
> work. Do you perhaps have such data at hand so I can take a look and be 
> convinced? :-)

Most pages that use <math> when not using MathML seem to put LaTeX-like 
markup inside the element. Here are some that put elements in <math>, 
though:

   http://www.emis.de/journals/FPM/eng/k00/k001/k00126h.htm
   http://www.freepatentsonline.com/EP0693743.html
   http://apmath.kku.ac.kr/~seokko/notes/mathcon.htm
   http://www.ioffe.rssi.ru/cp866/journals/jtf/2003/12/page-1.html.ru
   http://www.kougensha.net/blosxom/blosxom.cgi/tech/freebsd/index.html



> If there are a non-trivial amount of pages that have HTML elements in 
> <math> or <svg> (not nested in <foreignObject>/<annotation-xml>), then 
> wouldn't it be possible to special-case HTML elements in <math>/<svg> 
> and let the rest be handled as "unknown" elements in the MathML/SVG 
> namespaces (so that, e.g., <math><foo><b> is interpreted as 
> <mml:math><mml:foo><html:b>)?

This wouldn't work well for SVG, where we have name clashes already (i.e. 
where some element names are used in both SVG and HTML).
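
A minimal sketch of the special-casing idea and of the clash, assuming small 
illustrative subsets of the HTML and SVG element names (the sets and function 
name below are hypothetical, not a complete list from either specification):

```python
# Illustrative subsets only; real vocabularies are much larger.
HTML_NAMES = {"a", "b", "i", "p", "script", "span", "style", "sub", "sup",
              "table", "td", "title", "tr"}
SVG_NAMES = {"a", "circle", "defs", "g", "path", "rect", "script", "style",
             "text", "title"}


def namespace_with_html_special_case(name, scope):
    """Inside <math>/<svg>: known HTML names go to HTML, the rest to scope."""
    return "html" if name in HTML_NAMES else scope


# <math><foo><b> -> mml:math, mml:foo, html:b, as in the quoted example.
print(namespace_with_html_special_case("foo", "mathml"))   # mathml
print(namespace_with_html_special_case("b", "mathml"))     # html

# Names present in both vocabularies cannot be resolved by name alone.
clashes = sorted(HTML_NAMES & SVG_NAMES)
print(clashes)                                             # ['a', 'script', 'style', 'title']
print(namespace_with_html_special_case("a", "svg"))        # html
```

With this rule an <a> inside <svg> would be sent to the HTML namespace even 
when the author meant the SVG element of the same name.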


> Also, on a slightly different note, I think that for copy-pastability of 
> SVG in text/html, the parser needs to make /> self-close elements, since 
> e.g. <circle> can have contents (e.g. animation stuff, I think) and Sam 
> Ruby said that some tools emit <defs/> and <g/>. [2]

Yes. I'm not convinced that we'll be able to get the ability to copy and 
paste image/svg+xml content into text/html.


On Wed, 2 Apr 2008, Henri Sivonen wrote:
> 
> The existing content landscape for <svg> may be very different from 
> random junk in <math> out there, since cargo-cult semanticists may come 
> up with <math> on their own but <svg> is more unlikely to occur without trying to 
> do SVG. So while scope plus HTML blacklist may be the best option for 
> MathML subtrees, scope plus camelCase-fixing whitelist may be the most 
> robust solution for SVG subtrees.

I'm not sure exactly what you mean here.

I will see about doing a more detailed study to examine the feasibility of 
what you propose (especially for the SVG side).


> Finally, breaking a handful of legacy pages isn't yet a "fatal" flaw.

I believe it is.


On Wed, 2 Apr 2008, Bruce Miller wrote:
> 
> _Surely_, no one out there is writing HTML using <whatevertag/> when 
> they _don't_ mean to close the element?!?!?! (rolling my eyes :> )

Yeah, it's used all over the place actually, with the pages relying on the 
tag not closing.


On Wed, 2 Apr 2008, David Carlisle wrote:
> 
> It's odd that earlier in the thread we were told that proper 
> handling of html5 would require a real html5 parser (of which several 
> ought to be available) but in the same thread there is the repeated 
> requirement that html5 "work" with the existing html4 parsers. (Which 
> presumably doesn't go as far as saying what the HTML spec (by reference 
> to sgml) says it should do for /> which is to treat the > as character 
> data.)

In practice what HTML5 defines today is pretty much what HTML4 browsers 
implement.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Wednesday, 2 April 2008 23:57:56 UTC