Re: Revised HTML/XML Task Force Report from Eric J. Bowman on 2011-07-18 (www-tag@w3.org from July 2011)

From: Eric J. Bowman <eric@bisonsystems.net>
Date: Sun, 17 Jul 2011 21:23:09 -0600
To: Robin Berjon <robin@berjon.com>
Cc: Larry Masinter <masinter@adobe.com>, "www-tag@w3.org List" <www-tag@w3.org>
Message-Id: <20110717212309.c1f4009b.eric@bisonsystems.net>

Robin Berjon wrote:
>
> >> 
> >> Likewise polyglot can be useful in some cases. But it's not a
> >> general solution today, and if we're going to commit to the cost
> >> of making it possible for it to be a general solution then it
> >> better bring value that makes that work worth it compared to the
> >> existing solution which is to just grab a parser off GitHub.
> > 
> > Disagree.  HTML wrapped in Atom, serialized as XML to allow XPath
> > access into the wrapped content, is quite a common design pattern
> > and requires the markup to be polyglot.
> 
> The first thing I'd like to note here is that this is not the use
> case that we were discussing. We had been looking at "How can an XML
> toolchain be used to consume HTML?" while this is "How can islands of
> HTML be embedded in XML?" It's an interesting use case though, but I
> think that which solution to pick boils down to how general you need
> it to be.
> 

It covers both cases; my thinking is that with general polyglot, I'm not
concerned about being limited to an XML-parsable subset of HTML, or any
other unintended consequences. (I think the polyglot document could
advise using a line or two of JS to style <noscript> to display:none,
instead of advising against the use of <noscript> which is a required
accessibility checkpoint.)

What I'm doing is transforming an Atom document w/ embedded HTML into
an HTML document.  I need to be able to validate that output document
against my RELAX NG schema for QA/QC purposes, which is defeated if
there's an HTML5 parser in the loop performing silent error correction.
How do I detect errors which throw no flags and which, when corrected,
lead to unexpected (but valid) output?  Adding an HTML parser *defeats*
my toolchain; it seems easier to fix markup-generation errors in my
system by spotting the problem directly, without having to
reverse-engineer the HTML parser to get a handle on them.
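
To illustrate the point above, here is a minimal sketch, assuming
Python's standard-library ElementTree as a stand-in for the XML
toolchain (it supports only a subset of XPath, and it checks
well-formedness, not a RELAX NG schema).  The sample entries and the
function names are hypothetical.  It shows path access into XHTML
wrapped in Atom, and a strict parser flagging a generation error
instead of silently repairing it:

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
XHTML = "http://www.w3.org/1999/xhtml"
NS = {"atom": ATOM, "h": XHTML}

# Hypothetical Atom entry wrapping polyglot (XML-well-formed) HTML.
GOOD_ENTRY = f"""<entry xmlns="{ATOM}">
  <title>Example</title>
  <content type="xhtml">
    <div xmlns="{XHTML}">
      <p class="summary">Hello <br/>world</p>
    </div>
  </content>
</entry>"""

# Same entry with a markup-generation error: <br> is never closed.
BAD_ENTRY = GOOD_ENTRY.replace("<br/>", "<br>")


def summary_paragraphs(entry_xml):
    """Parse strictly as XML and select paragraphs by path expression."""
    root = ET.fromstring(entry_xml)  # raises ET.ParseError if not well-formed
    paras = root.findall("atom:content/h:div/h:p[@class='summary']", NS)
    return ["".join(p.itertext()) for p in paras]


def is_well_formed(entry_xml):
    """Report the error rather than correcting it."""
    try:
        ET.fromstring(entry_xml)
        return True
    except ET.ParseError:
        return False
```

With a strict parser, the unclosed <br> in BAD_ENTRY surfaces as an
error at the point of generation; an error-correcting HTML parser would
instead produce some tree, and the original mistake would go unreported.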

>
> For instance, I have a decent amount of HTML that contains fragments
> like the following:
> 
>     <script type='application/foo-template' id='foo'><foo><dahut/></foo></script>
> 
> I don't see much of an easy way of making that polyglot.
>

I've never understood this design pattern.  I mean, I know it's out
there, but it seems like a hack to overload the <script> tag in order
to avoid the existing solution for embedding one markup language within
another, which is namespaces and application/xhtml+xml (or
application/xml for IE < 9).

I understand HTML5 has rejected namespaces (I disagree with both the
decision and the flawed assumptions behind it), un-solving this problem
and forcing it to be solved in the polyglot spec (which is why it
should be a TR not a Note).  I have a couple of just-as-hacky ideas, so
I don't see it as insoluble, if a use-case comes up.
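
As a rough sketch of the namespace approach mentioned above: the
template markup from the quoted <script> example could be embedded
directly in an XHTML document under its own namespace.  The namespace
URI, the "t" prefix, and the function name below are assumptions made
for illustration, and this only applies when the document is processed
as XML, not parsed as text/html.  Python's standard-library ElementTree
is used as the XML processor:

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
FOO = "http://example.org/ns/foo"  # hypothetical namespace for the template vocabulary

# The <foo><dahut/></foo> template, embedded via namespaces
# instead of being hidden inside a <script> element.
DOC = f"""<html xmlns="{XHTML}" xmlns:t="{FOO}">
  <head><title>Template host</title></head>
  <body>
    <t:foo id="foo"><t:dahut/></t:foo>
  </body>
</html>"""


def find_template(doc_xml, template_id):
    """Return the embedded template element with the given id, or None."""
    root = ET.fromstring(doc_xml)
    for el in root.iter(f"{{{FOO}}}foo"):
        if el.get("id") == template_id:
            return el
    return None
```

The template is then ordinary markup in the tree, reachable by the same
tools as the rest of the document.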

> 
> > I've been advocating the polyglot approach for a long time, now
> > (just not calling it that).  My advice to everyday Web developers
> > is to use XHTML 1.1 and application/xhtml+xml to develop sites, and
> > a few lines of XSLT to convert to HTML 4.01 or XHTML 1.0 for
> > publication as text/html.
> 
> I think that it depends heavily on the type of site that one is
> developing. I used polyglot+XSLT on some sites that are primarily
> static and content oriented. I've found it to be less useful for
> application-oriented sites. YMMV.
> 

It does.  There's no reason not to prototype any application under the
strictest rules available; far too many application-oriented sites are
chock full of @style and otherwise mix content with presentation, but
not because it's technically required by application orientation.  The
reason to validate a prototype against schema which disallow @style
(and such) is to enforce separation of concerns, regardless of the type
of project, for the sake of long-term maintenance ease.
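
A minimal sketch of that kind of check, assuming Python's standard
library only: this is a simple stand-in for a schema rule that
disallows @style, not an actual RELAX NG validation, and the function
name and sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# Hypothetical prototype pages, both well-formed XHTML.
CLEAN = f'<html xmlns="{XHTML}"><body><p class="note">ok</p></body></html>'
MIXED = (
    f'<html xmlns="{XHTML}"><body>'
    '<p style="color:red">x</p>'
    '<div><span style="font-weight:bold">y</span></div>'
    "</body></html>"
)


def style_violations(doc_xml):
    """Return local names of elements carrying an inline style attribute,
    in document order."""
    root = ET.fromstring(doc_xml)
    return [el.tag.split("}")[-1] for el in root.iter() if "style" in el.attrib]
```

An empty result means the prototype keeps presentation out of the
markup; anything else names the offending elements.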

>
> > I don't see where an HTML parser needs to enter into it, except of
> > course for the browser
> 
> But the use case you've described is very specific. I'm certainly not
> saying that people should be forced to use an HTML parser whenever
> they see something that vaguely smells of HTML. If polyglot's limits
> work for you, then I don't think there actually is a problem to
> solve. I'm simply saying that trying to lift the current limits on
> polyglotism is a major undertaking, and one in which I see very
> limited value.
> 

In the battle of duelling opinions, putting HTML parsers in front of
XML toolchains is also a major undertaking, with a missing value
proposition when those toolchains are expecting to operate on
unadulterated input with strict error reporting.  The act of validation
shouldn't result in transformation (one of the things I don't like
about XSD), yet that seems unavoidable if an HTML parser is used.

> 
> >  Polyglot makes sense, as I'm hardly alone in using Atom as
> > a wrapper for HTML content, serialized as XHTML so I don't lose
> > XPath access into that content.
> 
> Which is fine, but only works because processing the subset of HTML
> that you are using as XML doesn't break it.
> 

It's precisely that language which worries me.  What subset of HTML has
been designed which willfully breaks XML processing?  Is it really
necessary to do so, making HTML crippleware when embedded in Atom unless
it's escaped?  This wasn't the case before...

> 
> But you're describing a setup that works today, so I'm having
> trouble figuring out what problem you're complaining about.
> 

I want my setup to also work tomorrow, which I'm not convinced it will,
depending on how much of HTML5 can't be serialized as XML, particularly
if the task force is promoting the addition of silent error correction
at the front of validation sequences -- I'm not happy with sloppy output
which requires error correction, but how can I avoid it?  IOW, where the TF
report speaks of adding an HTML processor to my toolchain, I read
"you'll be needing a new toolchain" which includes HTML5 lint instead
of just a validator, which seems silly for angle-bracket markup most of
which has been around for a long time as XML and can be validated using
multiple technologies (DTD/XSD/RELAX NG) already.

I'm not "complaining" about anything, just pushing back against the
notion that polyglot has "limited applicability" -- by that logic, the
fact that so few HTML documents are valid makes validation of limited
applicability.  Which may be the view of the TF, given that borking
validation isn't addressed as a concern of using HTML parsers in XML
toolchains in the report?

-Eric
Received on Monday, 18 July 2011 03:23:47 UTC