Re: What problem is this task force trying to solve and why? from David Carlisle on 2011-01-04 (public-html-xml@w3.org from January 2011)

From: David Carlisle <davidc@nag.co.uk>
Date: Tue, 04 Jan 2011 14:34:20 +0000
To: Henri Sivonen <hsivonen@iki.fi>
Cc: public-html-xml@w3.org
Message-ID: <4D232FEC.3040301@nag.co.uk>
Henri,

Thanks for responding at length to everyone's points.
It seems a bit adversarial at present, with everyone basically querying 
the html5 design aims and you defending them, but I hope that it isn't 
really that way, just an initial phase where people are stating their 
initial positions. It's not clear to me yet what I expect the outcome to 
be: changes to html5, changes to xml, or changes to neither but better 
documenting/understanding where the differences lie and how to bridge 
the gaps when needed.

>
>> This behaviour really has no justification.
>
> It sure has. Hixie ran an analysis over a substantial quantity of Web
> pages in Google's index and found existing text/html content that
> contained an <svg> tag or a <math> tag. The justification is making
> the algorithm not break a substantial quantity of pages like that.

"no justification" was perhaps overstating it but I think that this is 
very weak justification, especially as it isn't even preserving the old 
behaviour. If math-encoded-as-html was previously wrapped in a math tag 
there was presumably some reason for that wrapping (styling with css, 
accessing it with javascript, or something similar), none of which would 
still work, although admittedly the fallback behaviour is improved by 
the current html5 model. But this must be such a small (and I would 
guess vanishingly small with an html5 doctype) fraction of math tag usage. 
Whenever you add a new feature to html5 there is a possibility of 
breaking existing pages that used that markup previously, even though 
that markup had no predefined meaning. If you guarantee to never break 
anything by adding new elements, you guarantee never to add anything.


> If you want to contest Hixie's findings, I suggest running and
> publishing your own analysis of a substantial corpus of Web content.
> (By running the current HTML parsing algorithm instrumented to count
> how often <math> and <svg> are seen and how often the break-out
> behavior is triggered.) Unfortunately, people who don't have access
> to Google's or Bing's index are not in as good a position to run such
> analyses as Hixie, but running an analysis over dotbot data
> (http://www.dotnetdotcom.org/) goes a long way further than mere
> conjecture.

I could do, but what's a lot? How many pages do I need to find before 
it is significant: 10, 100, 10000, 1000000?

Also as I say, even when the breakout behaviour is triggered, that is 
not at all evidence that the existing behaviour is being preserved 
unless you also check that no css or javascript is assuming the html is 
inside the math element.
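
To make the kind of count under discussion concrete, here is a rough 
sketch. It is only an approximation: it uses Python's html.parser 
tokeniser rather than an instrumented html5 parser, the list of 
break-out element names is a subset chosen for illustration, and 
integration points such as annotation-xml are ignored. The names 
count_foreign and ForeignCounter are made up for this example.

```python
from html.parser import HTMLParser

# Illustrative subset of the HTML element names that cause a break-out
# from foreign content; the html5 parsing algorithm lists more.
BREAKOUT_TAGS = {"b", "blockquote", "br", "center", "code", "div", "em",
                 "h1", "h2", "h3", "i", "img", "li", "ol", "p", "pre",
                 "span", "strong", "table", "ul"}
FOREIGN_ROOTS = {"math", "svg"}


class ForeignCounter(HTMLParser):
    """Counts <math>/<svg> start tags and likely break-outs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # current nesting depth inside math/svg
        self.roots_seen = 0   # number of <math> or <svg> start tags
        self.breakouts = 0    # HTML start tags seen while inside math/svg

    def handle_starttag(self, tag, attrs):
        if tag in FOREIGN_ROOTS:
            self.roots_seen += 1
            self.depth += 1
        elif self.depth and tag in BREAKOUT_TAGS:
            self.breakouts += 1
            self.depth = 0    # approximate the parser returning to HTML content

    def handle_endtag(self, tag):
        if tag in FOREIGN_ROOTS and self.depth:
            self.depth -= 1


def count_foreign(markup):
    """Return (roots_seen, breakouts) for one document's markup."""
    counter = ForeignCounter()
    counter.feed(markup)
    counter.close()
    return counter.roots_seen, counter.breakouts
```

For example, count_foreign("<math><p>x</p></math>") gives (1, 1). Note 
that a count like this says nothing about whether css or javascript on 
the page depends on the html being inside the math element, which is 
the point made above.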

>
> (To be clear, I haven't independently verified Hixie's findings, but
> I presume them to be true in the absence of evidence to the
> contrary.)
>

> Which tools? Is the plural really justified or is this about one
> Emacs mode?

well personally I tend to use emacs, so I'm not aware of what parser 
other html tools use to drive context dependent support, but certainly 
I've seen tools with buttons to run locally installed (nsgmls based) 
validators on the generated documents.


> The way legacy content comes into being is this: 1) A person writes
> some markup. 2) The person mashes the markup until it looks good
> enough in his/her browser. 3) The person publishes the markup. 4) The
> person stops maintaining the markup.
>
> At step #1, the person might *think* (s)he wrote something that
> results in an element with no children. However, at step #2 (s)he
> might have concluded that the rendering that actually arises from
> subsequent markup causing the element to have children looks "right".
> When that happens, the markup published at step #3 could break if
> browsers started actually making the element have no children.

I agree that this is a problem, but as James Clark commented earlier, 
such quirks could have been constrained to a parsing mode that was not 
used for <!doctype html>. The fact that you're trying to make quirks 
mode affect just css rather than parsing, as far as possible, is not 
altogether unreasonable, but it clearly could have been done 
differently.



> More precisely, my (I'm hesitant to claim this as a general HTML5
> world view) world view says that using vocabularies that the
> receiving software doesn't understand is a worse way of communicating
> than using vocabularies that the receiving software understands. (And
> if you ship a JavaScript or XSLT program to the recipient, the
> interface of communication to consider isn't the input to your
> program but its output. For example, if you serve FooML
> with <?xml-stylesheet?> that transforms it to HTML, you are
> effectively communicating with the recipient in HTML--not in FooML.)

I'd agree with that, but here the input to javascript (or xslt or 
whatever) isn't the whole document served as application/xml; it's a 
fragment of a text/html document inside annotation-xml or svg foreign 
content (or for that matter an html div). The current html5 rules make 
generating such a fragment in a way that can be safely parsed very much 
harder than it could have been, in particular because it can't just be 
deferred to an html5 serialiser. Even if an html5 serialiser were added 
to xslt, for example, it would be unlikely (I would guess) to be able to 
do anything to avoid these local name clashes in foreign content. So the 
generation itself will have to be programmed in each case to avoid this. 
To return to the example that I gave earlier: while annotating an 
expression with docbook should be trivial, annotating in a way that may 
be parsed by html5 is rather harder.


> Likewise, since only HTML, MathML and SVG are being integrated--not
> arbitrary other vocabularies--only HTML, MathML and SVG children of
> annotation-xml are catered for.

This is of course true, and I suppose in a way all the comments that you 
have fielded from different people are just special cases of queries 
where this restriction is seen as unnecessary.



> It turns out that we already have that! It's called XHTML5 and the
> mode switch is the Content-Type: application/xhtml+xml HTTP header.
> Even better than some yet-to-be-defined HTML.next mode, it's already
> supported by the latest versions of the top browsers (if you count
> IE9 as the latest version of IE).


Traditionally of course using application/xhtml+xml has been problematic 
due to lack of support in IE (and that will still be the case in 
practice for some time as it takes a while for old IEs to die).

But even if we assume IE9, there are many other reasons why text/html is 
more convenient. I have a blog on Blogger, for example, and as far as I 
know I can't control the mime type used for pages there. Many other 
content management systems are similarly (and not at all unreasonably) 
text/html based. There is (and will continue to be) a natural desire to 
use the xml toolset to generate content that is served as text/html.
Currently this is rather fragile and error prone, unless you use a 
dedicated html serialiser at the end of the chain. It's probably not 
much worse in fact than html4, but it could (perhaps) have been made 
better if that had been part of the html5 design criteria (which I 
suspect it wasn't).  It seems to be very common to use xhtml syntax on 
pages served as text/html:


http://www.w3.org/

for example or

http://www.drupal.org.uk/

or ...

Currently this is just an error waiting to happen (for example try 
mouse-ing over paragraphs in

http://www.w3.org/TR/2009/REC-MathML2-20090303/chapter1.html#intro.notation

)

Making such pages work rather than documenting how to avoid doing that 
seems a reasonable aim even if, as I noted at the start, the final 
conclusion of the discussion here is that the aim is not achievable.
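
As an illustration of how such pages might be checked, here is a 
minimal sketch. It assumes the error in question is xhtml-style 
self-closing syntax on a non-void html element (for example 
<a name="x"/>), which a text/html parser treats as an ordinary start 
tag so that the element stays open. It uses Python's html.parser, which 
is not an html5 parser, and it would also flag self-closing tags inside 
svg or math, where that syntax is honoured. The function name 
find_suspect_self_closing is made up for this example.

```python
from html.parser import HTMLParser

# Elements that never have content in html, so a trailing "/" is harmless.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "param", "source", "track", "wbr"}


class SelfClosingFinder(HTMLParser):
    """Collects xhtml-style self-closing tags on non-void elements."""

    def __init__(self):
        super().__init__()
        self.suspects = []

    def handle_startendtag(self, tag, attrs):
        # html.parser calls this for tags written as <tag .../>.
        if tag not in VOID_ELEMENTS:
            self.suspects.append(tag)


def find_suspect_self_closing(markup):
    """Return the names of non-void elements written with "/>" syntax."""
    finder = SelfClosingFinder()
    finder.feed(markup)
    finder.close()
    return finder.suspects
```

For example, find_suspect_self_closing('<p><a name="x"/>text</p><br/>') 
reports the a element but not the br.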


David

Received on Tuesday, 4 January 2011 14:36:51 UTC