- From: Ian Hickson <ian@hixie.ch>
- Date: Wed, 2 Apr 2008 23:56:43 +0000 (UTC)
- To: David Carlisle <davidc@nag.co.uk>, Sam Ruby <rubys@us.ibm.com>, Henri Sivonen <hsivonen@iki.fi>, Simon Pieters <simonp@opera.com>, Bruce Miller <bruce.miller@nist.gov>
- Cc: Neil Soiffer <Neils@dessci.com>, jg307@cam.ac.uk, public-html@w3.org, www-math@w3.org, Julian Reschke <julian.reschke@gmx.de>
I did some research on actual usage of MathML on the Web. I scanned
about 7 billion pages and, after parsing each page with an HTML5
parser, looked for pages that had an element whose local name, after
stripping any leading prefix, was "math" and that had, in addition, one
of the following:

 * At least one of the following:
     maction maligngroup malignmark menclose merror mfenced mfrac mglyph
     mi mlabeledtr mmultiscripts mn mo mover mpadded mphantom
     mprescripts mroot mrow ms mspace msqrt mstyle msub msubsup msup
     mtable mtd mtext mtr munder munderover none

 * At least two of the following:
     abs and apply approx arccos arccosh arccot arccoth arccsc arccsch
     arcsec arcsech arcsin arcsinh arctan arctanh arg bvar card
     cartesianproduct ceiling ci cn codomain complexes compose condition
     conjugate cos cosh cot coth csc csch csymbol curl declare degree
     determinant diff divergence divide domain domainofapplication
     emptyset eq equivalent eulergamma exists exp exponentiale factorial
     factorof false floor fn forall gcd geq grad gt ident image
     imaginary imaginaryi implies in infinity int integers intersect
     interval inverse lambda laplacian lcm leq limit list ln log logbase
     lowlimit lt matrix matrixrow max mean median min minus mode moment
     momentabout naturalnumbers neq not notanumber notin notprsubset
     notsubset or otherwise outerproduct partialdiff pi piece piecewise
     plus power primes product prsubset quotient rationals real reals
     reln rem root scalarproduct sdev sec sech selector sep set setdiff
     sin sinh subset sum tan tanh tendsto times transpose true union
     uplimit variance vector vectorproduct xor

 * At least one of the following:
     annotation annotation-xml semantics

The results I found are as follows:

   200000 pages containing one or more from the first list above (pages
          using Presentational MathML).
    50000 pages containing only <math> and none of the above (or at most
          1 from the Content MathML list above).
     5000 pages containing two or more from the second list above (pages
          using Content MathML).
     4000 pages containing at least one from the third list above.
     3000 pages containing at least one from the first list above and
          two from the second list above (containing both Presentational
          and Content MathML).

This suggests that Content MathML is nowhere near as frequently used as
has been previously suggested.

The most common MathML elements in the sample were:

   ELEMENT    ROUGH COUNT   PERCENTAGE
   rem            8500000       0.122%
   image          1000000       0.014%
   set             450000       0.006%
   abs             400000       0.005%
   root            300000       0.004%
   math            250000       0.003%
   mi              250000       0.003%
   true            200000       0.003%
   mo              200000       0.003%
   none            200000       0.003%
   ms              200000       0.002%
   mrow            200000       0.002%
   mn              200000       0.002%
   list            150000       0.002%
   sec             150000       0.002%
   mfrac           150000       0.002%
   msub            150000       0.002%
   product         100000       0.001%

(The <rem>, <image>, <set>, <abs>, and <root> elements are the reason
why the sample needed _two_ Content MathML elements to count as MathML
-- those elements, it turns out, are common in other contexts. <image>,
for example, is a synonym for <img> in HTML.)

This study could probably be done in various different ways. In
particular, I didn't do anything to check namespaces, which could be a
better indicator of MathML content. I counted pages, rather than sites,
thus biasing towards large publishers instead of smaller ones. I didn't
check that the MathML elements used on a page were descendants of the
<math> element on the page. I didn't check that the prefixes matched
throughout.

These factors add together to bias the numbers towards the wide use of
unnamespaced and Presentational-MathML-only MathML on, in particular,
freepatentsonline.com. On the other hand, that's a whole lot of MathML
that we would instantly be supporting if we added this to HTML5, so
maybe it's not an unfair bias.

On Wed, 2 Apr 2008, David Carlisle wrote:
>
> > Yes, people keep saying that, but I've yet to see a detailed proposal
> > that is workable.
> > I've tried coming up with many different ideas, but
> > all had some fatal flaw that wouldn't work on the Web.
>
> Since people have been placing content mathml (and openmath and other)
> annotations on the web for the last ten years or so, it clearly is
> possible to make this work on the web, it may not work in html5 as
> currently specified, but I understand that one of the aims of html5 is
> to codify existing practice and allow things that work now to keep
> working.

Well, MathML in XHTML will of course continue to be fully supported. We
are only talking about MathML in text/html, which up until now has never
been a valid or defined practice.

> HTML since forever has had rules that allow unknown elements to be
> parsed (with a default rendering of ignoring the element and processing
> the content) The html parser has never had to "know" anything about them
> has it?

I'm not sure to what you refer here. Before HTML5, HTML did not have any
defined error-handling parsing rules; browsers just made it up as they
went along, based on reverse-engineering each other.

On Wed, 2 Apr 2008, Sam Ruby wrote:
>
> > > http://wiki.whatwg.org/wiki/Extensions
>
> I have now contributed to that page. Feel free to identify where the
> proposal is not detailed enough or to identify any flaws that may, or
> may not, prove fatal.

The proposal seems to be "do what Microsoft documented in their
namespaces whitepaper as being the IE8 Beta 1 behaviour". However, the
whitepaper doesn't actually say what the processing model is, and IE8
beta 1 doesn't seem to implement anything like what the whitepaper
implies should happen anyway.

If you could describe in your own words what the processing model you
are proposing is, that would be something I could evaluate.

On Wed, 2 Apr 2008, Henri Sivonen wrote:
>
> Could you please elaborate why the following won't work? In particular,
> would the following break such a large mass of pages as to Break The
> Web?
> (Especially if the rendering rules for MathML are adjusted so that
> text children of <math> are rendered like text children of an HTML
> <span>.)
>
> The following elements are defined as 'namespace-sensitive':
>   <html>
>   <svg>
>   <math>
>   <foreignObject>
>   <annotation-xml encoding="application/xhtml+xml">
>   <annotation-xml encoding="OpenMath">.
>
> Namespace-sensitive elements have two namespace URIs associated with
> them: self and scope.
>
> Thus:
>   <html>
>     self:  http://www.w3.org/1999/xhtml
>     scope: http://www.w3.org/1999/xhtml
>   <svg>
>     self:  http://www.w3.org/2000/svg
>     scope: http://www.w3.org/2000/svg
>   <math>
>     self:  http://www.w3.org/1998/Math/MathML
>     scope: http://www.w3.org/1998/Math/MathML
>   <foreignObject>
>     self:  http://www.w3.org/2000/svg
>     scope: http://www.w3.org/1999/xhtml
>   <annotation-xml encoding="application/xhtml+xml">
>     self:  http://www.w3.org/1998/Math/MathML
>     scope: http://www.w3.org/1999/xhtml
>   <annotation-xml encoding="OpenMath">
>     self:  http://www.w3.org/1998/Math/MathML
>     scope: http://www.openmath.org/OpenMath
>
> The namespace of an element node to be inserted is determined as follows:
> 1) If the node to be inserted is a namespace-sensitive element, use the
>    value for 'self' in the above list and abort these steps.
> 2) Let 'node' be the current node on the stack of open elements.
> 3) If 'node' is a namespace-sensitive element, use the value for 'scope' in
>    the above list and abort these steps.
> 4) Let 'node' be the next node on the stack of open elements towards the root
>    element.
> 5) Go back to step 3.
>
> (Of course, the repeated stack walking should be optimized away.)
>
> The /> empty element syntax should be supported on start tag tokens (node
> popped immediately) whose namespace doesn't resolve to
> http://www.w3.org/1999/xhtml according to the above rule.
>
> When the stack is pushed/popped, the namespace of the current node must
> be inspected. If it is http://www.w3.org/1999/xhtml, the tokenizer must
> be set not to support CDATA sections. Otherwise, the tokenizer must be
> set to support CDATA sections.

This is an interesting proposal, far more concrete than anything anyone
else has proposed so far. Thank you.

It doesn't work because it breaks the handling of pages that exist today
that use the elements you list above. For example, take this page:

   http://www.cip.es/aecan/ver_anuncio.asp?idioma=Aleman&cod_anuncio=ARC100&acceso=Busqueda

...which contains this markup:

   <td width="27%" bgcolor="#FFFFFF">0 <math>m<sup>2</sup></td>

It also fails in the case where someone (author A) using a new browser
writes a page that uses this feature, and then someone (author B) using
an old browser copies and pastes from A's page into his page,
accidentally including a stray <svg> tag or <math> tag. His page looks
fine to most users, but to the users of the new browser, the page is now
horked.

On Wed, 2 Apr 2008, Simon Pieters wrote:
>
> Until I see actual pages that contain non-MathML in <math> or non-SVG in
> <svg>, I'm not convinced that Henri's scoped parsing proposal[1] doesn't
> work. Do you perhaps have such data at hand so I can take a look and be
> convinced? :-)

Most pages that use <math> when not using MathML seem to put LaTeX-like
markup inside the element.
Here are some that put elements in <math>, though:

   http://www.emis.de/journals/FPM/eng/k00/k001/k00126h.htm
   http://www.freepatentsonline.com/EP0693743.html
   http://apmath.kku.ac.kr/~seokko/notes/mathcon.htm
   http://www.ioffe.rssi.ru/cp866/journals/jtf/2003/12/page-1.html.ru
   http://www.kougensha.net/blosxom/blosxom.cgi/tech/freebsd/index.html

> If there is a non-trivial number of pages that have HTML elements in
> <math> or <svg> (not nested in <foreignObject>/<annotation-xml>), then
> wouldn't it be possible to special-case HTML elements in <math>/<svg>
> and let the rest be handled as "unknown" elements in the MathML/SVG
> namespaces (so that, e.g., <math><foo><b> is interpreted as
> <mml:math><mml:foo><html:b>)?

This wouldn't work well for SVG, where we have name clashes already
(i.e. where some element names are used in both SVG and HTML).

> Also, on a slightly different note, I think that for copy-pastability of
> SVG in text/html, the parser needs to make /> self-close elements, since
> e.g. <circle> can have contents (e.g. animation stuff, I think) and Sam
> Ruby said that some tools emit <defs/> and <g/>. [2]

Yes. I'm not convinced that we'll be able to get the ability to copy and
paste image/svg+xml content into text/html.

On Wed, 2 Apr 2008, Henri Sivonen wrote:
>
> The existing content landscape for <svg> may be very different from
> random junk in <math> out there, since cargo-cult semanticists may come
> up with <math> on their own but <svg> is more unlikely to occur without
> trying to do SVG. So while scope plus HTML blacklist may be the best
> option for MathML subtrees, scope plus camelCase-fixing whitelist may be
> the most robust solution for SVG subtrees.

I'm not sure exactly what you mean here. I will see about doing a more
detailed study to examine the feasibility of what you propose
(especially for the SVG side).

> Finally, breaking a handful of legacy pages isn't yet a "fatal" flaw.

I believe it is.
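
For reference, the namespace-selection steps in the scoped parsing
proposal quoted above can be sketched roughly as follows. This is only a
minimal Python sketch under stated assumptions, not part of any spec or
parser: the names lookup() and namespace_for() are made up for
illustration, the stack of open elements is modelled as a list of
(tag name, encoding attribute) pairs, tag names and encodings are
compared case-insensitively, and an empty stack falls back to the XHTML
namespace. None of those details come from the proposal itself.

```python
XHTML = "http://www.w3.org/1999/xhtml"
SVG = "http://www.w3.org/2000/svg"
MATHML = "http://www.w3.org/1998/Math/MathML"
OPENMATH = "http://www.openmath.org/OpenMath"

# (self, scope) pairs copied from the table in the proposal.
# Keys are (lowercased tag name, lowercased encoding or None).
SENSITIVE = {
    ("html", None): (XHTML, XHTML),
    ("svg", None): (SVG, SVG),
    ("math", None): (MATHML, MATHML),
    ("foreignobject", None): (SVG, XHTML),
    ("annotation-xml", "application/xhtml+xml"): (MATHML, XHTML),
    ("annotation-xml", "openmath"): (MATHML, OPENMATH),
}


def lookup(name, encoding=None):
    """Return (self, scope) for a namespace-sensitive element, else None."""
    name = name.lower()
    if name == "annotation-xml":
        # Only the two listed encodings make annotation-xml sensitive.
        key = (name, (encoding or "").lower())
    else:
        key = (name, None)
    return SENSITIVE.get(key)


def namespace_for(name, open_elements, encoding=None):
    """Pick the namespace for an element about to be inserted.

    open_elements is the stack of open elements, root first, as a list
    of (tag name, encoding attribute or None) pairs.
    """
    # Step 1: a namespace-sensitive element uses its own 'self' value.
    entry = lookup(name, encoding)
    if entry is not None:
        return entry[0]
    # Steps 2-5: walk from the current node towards the root and use the
    # 'scope' of the first namespace-sensitive element found.
    for open_name, open_encoding in reversed(open_elements):
        entry = lookup(open_name, open_encoding)
        if entry is not None:
            return entry[1]
    # Assumption: with no sensitive ancestor at all, default to XHTML.
    return XHTML


# The legacy markup cited above: <sup> arrives while <math> is still open.
legacy_stack = [("html", None), ("body", None), ("td", None), ("math", None)]
print(namespace_for("sup", legacy_stack))
```

Under these assumptions the last line prints the MathML namespace URI,
which illustrates the objection raised above: the <sup> in that legacy
<td> would stop being an HTML <sup>.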

On Wed, 2 Apr 2008, Bruce Miller wrote:
>
> _Surely_, no one out there is writing HTML using <whatevertag/> when
> they _dont_ mean to close the element?!?!?! (rolling my eyes :> )

Yeah, it's used all over the place actually, with the pages relying on
the tag not closing.

On Wed, 2 Apr 2008, David Carlisle wrote:
>
> It's odd that earlier in the thread we were told that proper
> handling of html5 would require a real html5 parser (of which several
> ought to be available) but in the same thread there is the repeated
> requirement that html5 "work" with the existing html4 parsers. (Which
> presumably doesn't go as far as saying what the HTML spec (by reference
> to sgml) says it should do for /> which is to treat the > as character
> data.

In practice what HTML5 defines today is pretty much what HTML4 browsers
implement.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Wednesday, 2 April 2008 23:57:56 UTC