- From: Ian Hickson <ian@hixie.ch>
- Date: Wed, 2 Apr 2008 23:56:43 +0000 (UTC)
- To: David Carlisle <davidc@nag.co.uk>, Sam Ruby <rubys@us.ibm.com>, Henri Sivonen <hsivonen@iki.fi>, Simon Pieters <simonp@opera.com>, Bruce Miller <bruce.miller@nist.gov>
- Cc: Neil Soiffer <Neils@dessci.com>, jg307@cam.ac.uk, public-html@w3.org, www-math@w3.org, Julian Reschke <julian.reschke@gmx.de>
I did some research on actual usage of MathML on the Web.
I scanned about 7 billion pages and, after parsing each page with an
HTML5 parser, looked for pages that, after stripping any leading prefix
from element names, had an element with the local name "math" and had,
in addition, one of the following:
* At least one of the following: maction maligngroup malignmark menclose
merror mfenced mfrac mglyph mi mlabeledtr mmultiscripts mn mo mover
mpadded mphantom mprescripts mroot mrow ms mspace msqrt mstyle msub
msubsup msup mtable mtd mtext mtr munder munderover none
* At least two of the following: abs and apply approx arccos arccosh
arccot arccoth arccsc arccsch arcsec arcsech arcsin arcsinh arctan
arctanh arg bvar card cartesianproduct ceiling ci cn codomain complexes
compose condition conjugate cos cosh cot coth csc csch csymbol curl
declare degree determinant diff divergence divide domain
domainofapplication emptyset eq equivalent eulergamma exists exp
exponentiale factorial factorof false floor fn forall gcd geq grad gt
ident image imaginary imaginaryi implies in infinity int integers
intersect interval inverse lambda laplacian lcm leq limit list ln log
logbase lowlimit lt matrix matrixrow max mean median min minus mode
moment momentabout naturalnumbers neq not notanumber notin notprsubset
notsubset or otherwise outerproduct partialdiff pi piece piecewise plus
power primes product prsubset quotient rationals real reals reln rem
root scalarproduct sdev sec sech selector sep set setdiff sin sinh
subset sum tan tanh tendsto times transpose true union uplimit variance
vector vectorproduct xor
* At least one of the following: annotation annotation-xml semantics
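
As a rough illustration of the classification just described, here is a
minimal Python sketch. It is not the code used for the scan. The names
classify and local_name are hypothetical, the element sets are abbreviated
(the full lists are given above), and it assumes that "at least two" means
two distinct element names, which the description above does not spell out.

```python
# Hypothetical sketch of the page classification described above.
# The sets are abbreviated; the full element lists are in the text.
PRESENTATION = {"mi", "mn", "mo", "mrow", "mfrac", "msub", "msup",
                "msqrt", "mtable", "none"}
CONTENT = {"apply", "ci", "cn", "plus", "times",
           "rem", "image", "set", "abs", "root"}
ANNOTATION = {"annotation", "annotation-xml", "semantics"}


def local_name(tag):
    """Strip any leading prefix, e.g. 'mml:mfrac' -> 'mfrac'."""
    return tag.rsplit(":", 1)[-1].lower()


def classify(tags):
    """Return the set of labels for a page, given its element names."""
    names = {local_name(t) for t in tags}
    if "math" not in names:
        return set()  # pages without <math> are not counted at all
    labels = set()
    if names & PRESENTATION:
        labels.add("presentation")
    # Assumption: two *distinct* Content MathML names are required.
    if len(names & CONTENT) >= 2:
        labels.add("content")
    if names & ANNOTATION:
        labels.add("annotation")
    if not labels:
        labels.add("math-only")  # <math> plus at most one Content name
    return labels
```

For example, classify(["td", "math", "image"]) falls into the "math-only"
bucket, because a lone <image> is not enough to count as Content MathML.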
The results I found are as follows:
200000 pages containing one or more from the first list above
(pages using Presentational MathML).
50000 pages containing only <math> and none of the above (or at most 1
from the Content MathML list above).
5000 pages containing two or more from the second list above
(pages using Content MathML).
4000 pages containing at least one from the third list above.
3000 pages containing at least one from the first list above and two
from the second list above (containing both Presentational and
Content MathML).
This suggests that Content MathML is nowhere near as frequently used
as has been previously suggested.
The most common MathML elements in the sample were:
ELEMENT   ROUGH COUNT   PERCENTAGE
rem           8500000       0.122%
image         1000000       0.014%
set            450000       0.006%
abs            400000       0.005%
root           300000       0.004%
math           250000       0.003%
mi             250000       0.003%
true           200000       0.003%
mo             200000       0.003%
none           200000       0.003%
ms             200000       0.002%
mrow           200000       0.002%
mn             200000       0.002%
list           150000       0.002%
sec            150000       0.002%
mfrac          150000       0.002%
msub           150000       0.002%
product        100000       0.001%
(The <rem>, <image>, <set>, <abs>, and <root> elements are the reason
why the sample needed _two_ Content MathML elements to count as MathML
-- those elements, it turns out, are common in other contexts. <image>,
for example, is a synonym for <img> in HTML.)
This study could probably be done in various different ways. In
particular, I didn't do anything to check namespaces, which could be a
better indicator of MathML content. I counted pages, rather than sites,
thus biasing towards large publishers instead of smaller ones. I didn't
check that the MathML elements used on a page were descendants of the
<math> element on the page. I didn't check that the prefixes matched
throughout.
These factors add together to bias the numbers towards the wide use of
unnamespaced and Presentational-MathML-only MathML on, in particular,
freepatentsonline.com. On the other hand, that's a whole lot of MathML
that we would instantly be supporting if we added this to HTML5, so maybe
it's not an unfair bias.
On Wed, 2 Apr 2008, David Carlisle wrote:
>
> > Yes, people keep saying that, but I've yet to see a detailed proposal
> > that is workable. I've tried coming up with many different ideas, but
> > all had some fatal flaw that wouldn't work on the Web.
>
> Since people have been placing content mathml (and openmath and other)
> annotations on the web for the last ten years or so, it clearly is
> possible to make this work on the web, it may not work in html5 as
> currently specified, but I understand that one of the aims of html5 is
> to codify existing practice and allow things that work now to keep
> working.
Well, MathML in XHTML will of course continue to be fully supported. We
are only talking about MathML in text/html, which up until now has never
been a valid or defined practice.
> HTML since forever has had rules that allow unknown elements to be
> parsed (with a default rendering of ignoring the element and processing
> the content) The html parser has never had to "know" anything about them
> has it?
I'm not sure what you are referring to here. Before HTML5, HTML had no
defined error-handling rules for parsing; browsers just made it up as
they went along, based on reverse-engineering each other.
On Wed, 2 Apr 2008, Sam Ruby wrote:
> >
> > http://wiki.whatwg.org/wiki/Extensions
>
> I have now contributed to that page. Feel free to identify where the
> proposal is not detailed enough or to identify any flaws that may, or
> may not, prove fatal.
The proposal seems to be "do what Microsoft documented in their namespaces
whitepaper as being the IE8 Beta 1 behaviour". However, the whitepaper
doesn't actually say what the processing model is, and IE8 beta 1 doesn't
seem to implement anything like what the whitepaper implies should happen
anyway.
If you could describe in your own words what the processing model you are
proposing is, that would be something I could evaluate.
On Wed, 2 Apr 2008, Henri Sivonen wrote:
>
> Could you please elaborate why the following won't work? In particular,
> would the following break such a large mass of pages as to Break The
> Web? (Especially if the rendering rules for MathML are adjusted so that
> text children of <math> are rendered like text children of an HTML
> <span>.)
>
> The following elements are defined as 'namespace-sensitive':
> <html>
> <svg>
> <math>
> <foreignObject>
> <annotation-xml encoding="application/xhtml+xml">
> <annotation-xml encoding="OpenMath">.
>
> Namespace-sensitive elements have two namespace URIs associated with
> them: self and scope.
>
> Thus:
> <html>
> self: http://www.w3.org/1999/xhtml
> scope: http://www.w3.org/1999/xhtml
> <svg>
> self: http://www.w3.org/2000/svg
> scope: http://www.w3.org/2000/svg
> <math>
> self: http://www.w3.org/1998/Math/MathML
> scope: http://www.w3.org/1998/Math/MathML
> <foreignObject>
> self: http://www.w3.org/2000/svg
> scope: http://www.w3.org/1999/xhtml
> <annotation-xml encoding="application/xhtml+xml">
> self: http://www.w3.org/1998/Math/MathML
> scope: http://www.w3.org/1999/xhtml
> <annotation-xml encoding="OpenMath">
> self: http://www.w3.org/1998/Math/MathML
> scope: http://www.openmath.org/OpenMath
>
> The namespace of an element node to be inserted is determined as follows:
> 1) If the node to be inserted is a namespace-sensitive element, use the
> value for 'self' in the above list and abort these steps.
> 2) Let 'node' be the current node on the stack of open elements.
> 3) If 'node' is a namespace-sensitive element, use the value for 'scope' in
> the above list and abort these steps.
> 4) Let 'node' be the next node on the stack of open elements towards the root
> element.
> 5) Go back to step 3.
>
> (Of course, the repeated stack walking should be optimized away.)
>
> The /> empty element syntax should be supported on start tag tokens (node
> popped immediately) whose namespace doesn't resolve to
> http://www.w3.org/1999/xhtml according to the above rule.
>
> When the stack is pushed/popped, the namespace of the current node must
> be inspected. If it is http://www.w3.org/1999/xhtml, the tokenizer must
> be set not to support CDATA sections. Otherwise, the tokenizer must be
> set to support CDATA sections.
This is an interesting proposal, far more concrete than anything anyone
else has proposed so far. Thank you.
It doesn't work because it breaks the handling of pages that exist today
that use the elements you list above. For example, take this page:
http://www.cip.es/aecan/ver_anuncio.asp?idioma=Aleman&cod_anuncio=ARC100&acceso=Busqueda
...which contains this markup:
<td width="27%" bgcolor="#FFFFFF">0 <math>m<sup>2</sup></td>
It also fails in the case where someone (author A) using a new browser
writes a page that uses this feature, and then someone (author B) using an
old browser copies and pastes from A's page into his page, accidentally
including a stray <svg> tag or <math> tag. His page looks fine to most
users, but to the users of the new browser, the page is now horked.
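
To make the first failure concrete, here is a minimal Python sketch of the
namespace rule quoted above, applied to the stack of open elements for the
<td>...<math>m<sup> markup. It is a sketch under stated assumptions, not a
real parser: namespace_for and key are hypothetical names, elements are
modelled as (name, encoding) pairs, tag names are compared in lowercase as
an HTML tokenizer would emit them, and the fallback for a stack with no
namespace-sensitive element is an assumption the proposal does not cover.

```python
XHTML = "http://www.w3.org/1999/xhtml"
SVG = "http://www.w3.org/2000/svg"
MATHML = "http://www.w3.org/1998/Math/MathML"
OPENMATH = "http://www.openmath.org/OpenMath"

# (self, scope) for each namespace-sensitive element, keyed by
# (lowercased tag name, encoding attribute or None).
NAMESPACE_SENSITIVE = {
    ("html", None): (XHTML, XHTML),
    ("svg", None): (SVG, SVG),
    ("math", None): (MATHML, MATHML),
    ("foreignobject", None): (SVG, XHTML),
    ("annotation-xml", "application/xhtml+xml"): (MATHML, XHTML),
    ("annotation-xml", "OpenMath"): (MATHML, OPENMATH),
}


def key(name, encoding=None):
    """Build a lookup key; only annotation-xml is keyed by encoding."""
    name = name.lower()
    return (name, encoding if name == "annotation-xml" else None)


def namespace_for(new_element, open_elements):
    """Resolve the namespace of new_element given the stack of open
    elements (last entry is the current node)."""
    k = key(*new_element)
    if k in NAMESPACE_SENSITIVE:           # step 1: use 'self'
        return NAMESPACE_SENSITIVE[k][0]
    for node in reversed(open_elements):   # steps 2-5: walk to the root
        k = key(*node)
        if k in NAMESPACE_SENSITIVE:
            return NAMESPACE_SENSITIVE[k][1]  # use 'scope'
    return XHTML  # assumption: not specified by the proposal


# Stack of open elements when <sup> is seen in the markup above.
stack = [("html", None), ("body", None), ("table", None), ("tbody", None),
         ("tr", None), ("td", None), ("math", None)]
sup_namespace = namespace_for(("sup", None), stack)
```

Under this rule sup_namespace is the MathML namespace, so the <sup> that
existing browsers treat as an HTML superscript would stop being one.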
On Wed, 2 Apr 2008, Simon Pieters wrote:
>
> Until I see actual pages that contain non-MathML in <math> or non-SVG in
> <svg>, I'm not convinced that Henri's scoped parsing proposal[1] doesn't
> work. Do you perhaps have such data at hand so I can take a look and be
> convinced? :-)
Most pages that use <math> when not using MathML seem to put LaTeX-like
markup inside the element. Here are some that put elements in <math>,
though:
http://www.emis.de/journals/FPM/eng/k00/k001/k00126h.htm
http://www.freepatentsonline.com/EP0693743.html
http://apmath.kku.ac.kr/~seokko/notes/mathcon.htm
http://www.ioffe.rssi.ru/cp866/journals/jtf/2003/12/page-1.html.ru
http://www.kougensha.net/blosxom/blosxom.cgi/tech/freebsd/index.html
> If there are a non-trivial amount of pages that have HTML elements in
> <math> or <svg> (not nested in <foreignObject>/<annotation-xml>), then
> wouldn't it be possible to special-case HTML elements in <math>/<svg>
> and let the rest be handled as "unknown" elements in the MathML/SVG
> namespaces (so that, e.g., <math><foo><b> is interpreted as
> <mml:math><mml:foo><html:b>)?
This wouldn't work well for SVG, where we have name clashes already (i.e.
where some element names are used in both SVG and HTML).
> Also, on a slightly different note, I think that for copy-pastability of
> SVG in text/html, the parser needs to make /> self-close elements, since
> e.g. <circle> can have contents (e.g. animation stuff, I think) and Sam
> Ruby said that some tools emit <defs/> and <g/>. [2]
Yes. I'm not convinced that we'll be able to support copying and
pasting image/svg+xml content into text/html.
On Wed, 2 Apr 2008, Henri Sivonen wrote:
>
> The existing content landscape for <svg> may be very different from
> random junk in <math> out there, since cargo-cult semanticists may come
> up with <math> on their own but <svg> is more unlikely to occur without trying to
> do SVG. So while scope plus HTML blacklist may be the best option for
> MathML subtrees, scope plus camelCase-fixing whitelist may be the most
> robust solution for SVG subtrees.
I'm not sure exactly what you mean here.
I will see about doing a more detailed study to examine the feasibility of
what you propose (especially for the SVG side).
> Finally, breaking a handful of legacy pages isn't yet a "fatal" flaw.
I believe it is.
On Wed, 2 Apr 2008, Bruce Miller wrote:
>
> _Surely_, no one out there is writing HTML using <whatevertag/> when
> they _don't_ mean to close the element?!?!?! (rolling my eyes :> )
Yeah, it's used all over the place actually, with the pages relying on the
tag not closing.
On Wed, 2 Apr 2008, David Carlisle wrote:
>
> It's odd that earlier in the thread we were told that proper
> handling of html5 would require a real html5 parser (of which several
> ought to be available) but in the same thread there is the repeated
> requirement that html5 "work" with the existing html4 parsers. (Which
> presumably doesn't go as far as saying what the HTML spec (by reference
> to sgml) says it should do for /> which is to treat the > as character
> data.)
In practice what HTML5 defines today is pretty much what HTML4 browsers
implement.
--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'
Received on Wednesday, 2 April 2008 23:57:56 UTC