Re: Exploring new vocabularies for HTML from James Graham on 2008-03-31 (public-html@w3.org from March 2008)

From: James Graham <jg307@cam.ac.uk>
Date: Mon, 31 Mar 2008 10:39:32 +0100
To: David Carlisle <davidc@nag.co.uk>
CC: hsivonen@iki.fi, public-html@w3.org, www-math@w3.org
Message-ID: <47F0B154.4000903@cam.ac.uk>
David Carlisle wrote:
>> The right way to do either is to run an HTML5 parser.
> 
> I don't see how that is likely to happen while the "html parser" is
> simply that, with so many hard coded rules for html elements.
> If the parsing was abstracted away from html and then some schema
> language was used to specify html5 in terms of that abstraction,
> perhaps other languages could at least consider whether they wanted to
> offer lax "html-style" parsing in addition to xml. This is essentially
> how John Cowan's tag soup works. Now it may be that you've looked at
> existing behaviour and decided the only way to model that is build in
> special rules everywhere, if that's the case, so be it, but that
> severely limits the usefulness of such a parser in a non-html context.

I really don't understand why you think that running an HTML parser to
build an in-memory representation of the HTML, in the same in-memory
format as that used for XML, is the wrong way to import HTML content into
an application that currently imports only XML.
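
As a rough illustration of that import path, here is a minimal sketch in
Python. It assumes the standard library's html.parser as a stand-in for a
real HTML5 parser (it does not implement the HTML5 parsing algorithm), and
the names TreeBuilder, html_to_tree and the "root" wrapper element are
hypothetical. The point is only that the result is an ordinary ElementTree,
the same structure an XML import would produce.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Turn HTML parse events into an ElementTree (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("root")  # hypothetical wrapper element
        self.stack = [self.root]        # currently open elements

    def handle_starttag(self, tag, attrs):
        # Boolean attributes arrive with a value of None; store "" instead.
        attrib = {name: value or "" for name, value in attrs}
        elem = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close back to the nearest open element with this name;
        # stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text after a closed child belongs in that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_tree(source):
    """Parse an HTML string and return the root of an ElementTree."""
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root


tree = html_to_tree("<p>Hello <b>world</b><br>bye")
print(ET.tostring(tree, encoding="unicode"))
```

Code that already walks ElementTree nodes for XML input can consume the
returned tree unchanged; only the front end differs.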

>> We can ask browsers to use the XML serialization for clipboard export
>> on platforms that have pre-existing deployed XML-based clipboard  
>> flavor for MathML
> 
> yes and you would also need to ask all editing systems not to generate
> <math>1+2=3</math> so that what they produce could be used as mathml
> without having to pass it to a browser and cut it out. The simplest way
> to ensure that editors don't produce such corruption is not to imply
> that it is legal in the first place. It offers very little benefit to
> anyone, and massive opportunities for incompatibility with the past and
> corruption of data (where the system does not imply the element
> structure the author expected) in the future.

The supposed benefit is not to MathML editors but to authors using text 
editors. I have tried writing MathML-in-XHTML using only a text editor 
and the experience was painful to say the least. I found that the 
verbosity made it difficult to enter and then difficult to fix when I 
had made a mistake. The sensible solution might have been to use
something like itex2MML to keep the source equations in human-readable
form, but that would have involved keeping two separate representations
of the document, with all the problems that entails.

In my experience the verbosity of MathML is a serious problem and 
impediment to authoring. However, I'm not sure that introducing a whole 
slew of rules for tag inference is the right approach. I think authors 
have a hard time understanding where tags can be inferred and I think, 
with the exception of tbody (which I think is actually a case of authors 
not understanding the table model well enough to realise a tbody is needed),
the tag inference of HTML 4 is used only by the most expert authors. Any 
system that allows authors to write half of their content in a 
tag-inferred form but requires the other half to be written out fully, 
according to the limitations of the inference scheme, is going to be 
very difficult to grasp in full.

An alternative to the tag inference idea would be to make (optional) use 
of the wiki-serialization of MathML previously discussed. Specifically 
we could allow either a <math> subtree containing normal MathML but with 
text/html compatible error handling, or a <wikimath> (strawman name) 
element that took only the human-editable serialization and converted it 
to a MathML DOM tree in-memory. This would make predicting the DOM tree 
for style and scripting harder but would make editing easy enough to 
more than make up for those problems.
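
To make the idea concrete, here is a minimal sketch in Python of such a
conversion. The input syntax is purely hypothetical (it is not the wiki
serialization discussed earlier, and not itex): numbers become <mn>,
runs of letters become <mi>, and any other non-space character becomes
<mo>, all inside a single <mrow>. The function name plain_to_mathml is
likewise made up, and the MathML namespace is omitted for brevity.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical token rules: number -> mn, identifier -> mi, other char -> mo.
TOKEN = re.compile(
    r"\s*(?:(?P<mn>\d+(?:\.\d+)?)|(?P<mi>[A-Za-z]+)|(?P<mo>\S))"
)


def plain_to_mathml(source):
    """Build an in-memory MathML tree from a flat, human-editable string."""
    math = ET.Element("math")
    row = ET.SubElement(math, "mrow")
    for match in TOKEN.finditer(source):
        kind = match.lastgroup  # "mn", "mi" or "mo"
        ET.SubElement(row, kind).text = match.group(kind)
    return math


tree = plain_to_mathml("1+2=3")
print(ET.tostring(tree, encoding="unicode"))
```

Five characters of input produce five token elements plus two wrappers,
which gives some idea of the verbosity an author is spared. A real
serialization would also need grouping, fractions, scripts and so on,
and that is where predicting the resulting tree becomes harder.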

> 
> David
> 


-- 
"Mixed up signals
Bullet train
People snuffed out in the brutal rain"
--Conor Oberst
Received on Monday, 31 March 2008 09:40:27 UTC