parsing html, xhtml, xml (svg and mathml) serialisations into a DOM from Ben Boyle on 2008-04-02 (public-html@w3.org from April 2008)

From: Ben Boyle <benjamins.boyle@gmail.com>
Date: Wed, 2 Apr 2008 21:49:46 +1000
To: "HTML WG" <public-html@w3.org>
Message-ID: <5f37426b0804020449u2348971cq7c760939f79b746e@mail.gmail.com>

Oh no, not another tangent/thread on this topic! My apologies... but I
couldn't work out which one to reply to. Plus I first have a question
that is quite separate. Something I don't quite understand about
"HTML5" yet.

I understand there are two possible serialisations: html and xhtml (xml).
I understand xhtml parsing is xml parsing, with draconian error
handling (which is not really a new thing).
I understand html parsing is a new thing (replacing sgml parsing),
documenting what browsers must do to produce a valid DOM, including
handling of non-conforming markup. This html parsing will also support
(i.e. we will document) syntax that is not well-formed xml (what we
know and love as "classic html") and it shall be considered to be
conforming.

I am not quite sure where the line between the two is... there is (for
me) a blurry grey area around well-formed xhtml source code, which
could be parsed - successfully - as either xhtml OR html (presumably
producing the same dom?). How will a UA decide whether to use html
parsing? Is it triggered by doctype, mime type, xmlns or something
else? This question may seem moot (if the same dom is produced, who
cares?) ... until one introduces an error into the markup.
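
To make the "until one introduces an error" point concrete, here is a minimal
sketch in Python (standard library only). It feeds the same markup to a strict
xml parser and to a lenient html tokenizer. The helper names (TagCollector,
parse_as_xml, parse_as_html) are made up for illustration, and Python's
html.parser is only a forgiving tokenizer: it does not build a DOM and does not
implement the HTML5 parsing algorithm, so treat it as a stand-in for the idea
of error recovery, not as what a browser does.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records start tags and text, never raises on bad markup."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def parse_as_xml(source):
    """Draconian handling: element names, or None at the first well-formedness error."""
    try:
        root = ET.fromstring(source)
    except ET.ParseError:
        return None
    return [element.tag for element in root.iter()]


def parse_as_html(source):
    """Lenient handling: report whatever tags and text could be recovered."""
    collector = TagCollector()
    collector.feed(source)
    collector.close()
    return collector.tags, collector.text


well_formed = "<p>Hello <em>world</em></p>"
broken = "<p>Hello <em>world</p>"  # the </em> end tag is missing

print(parse_as_xml(well_formed))   # both parsers agree on well-formed input
print(parse_as_html(well_formed))
print(parse_as_xml(broken))        # the xml parser gives up entirely
print(parse_as_html(broken))       # the lenient parser still yields content
```

With the well-formed input both approaches see the same elements; with the
broken input the xml side yields nothing at all, while the lenient side still
recovers the tags and the text.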

The reason I ask (well aside from just wanting to understand it
better) is that I was discussing the math/svg in html serialisations
thing, and the fact that html5 does define and support xhtml/xml
parsing (there is still confusion over this - people think "html5"
represents W3C abandoning xhtml activity) ... and I was describing one
of the options being looser html-style parsing of svg/mathml markup -
parsing that embraces error recovery rather than draconian error
handling.

My mate's question was: "so there'd be like a switch, so I could opt
into super-god-mode parsing?"

I thought it was interesting. Something I would be interested in.
Being able to choose between html or xml "well-formedness". Being able
to choose between draconian error handling and html error recovery.
Because I would really like to author - to the best of my ability -
well-formed and valid xhtml+math+svg BUT I would prefer to have
browsers present those documents using error recovery rather than
draconian error handling ... so if I make (or import) any mistakes,
well, something is still presented. And maybe I don't want to get hung
up on whether a bit of "classic html" syntax works its way into the
mix. I'd rather focus on making the content clear and easy to read,
and the navigation sensible, than worry too much about markup syntax.

I don't know if this is useful to the current discussion, but there you have it.
And it does raise the question: could the work undertaken to define
"html to dom" parsing be applicable to parsing all xml (e.g. on the
server side, to send an html document through XSLT... the
html parser could produce the required dom without requiring the
source document be reworked into xml well-formedness first) ... but
that's probably a much bigger question better asked later.

cheers
Ben

Received on Wednesday, 2 April 2008 11:50:25 UTC