Re: Publication of specifications as HTML5 from Aryeh Gregor on 2011-08-19 (spec-prod@w3.org from July to September 2011)

From: Aryeh Gregor <ayg@aryeh.name>
Date: Fri, 19 Aug 2011 11:42:13 -0400
To: Ian Jacobs <ij@w3.org>, David Carlisle <davidc@nag.co.uk>, Richard Ishida <ishida@w3.org>
Cc: Karl Dubost <karl+w3c@la-grange.net>, Doug Schepers <schepers@w3.org>, Spec Prod <spec-prod@w3.org>, Philippe Le Hegaret <plh@w3.org>
Message-ID: <CAKA+Ax=rHJFU1ezxfbJpJvx4EjjfDAvH9BrMoK-EdFSwt0=rYw@mail.gmail.com>
On Thu, Aug 18, 2011 at 11:20 PM, Ian Jacobs <ij@w3.org> wrote:
> I had understood "conforms to http://www.w3.org/TR/html-polyglot/"
>
> For XML processors.

Polyglot is not targeted at XML processors.  The idea of a polyglot
document is that the same file should work the same in a *browser*
whether it's served as text/html or an XML MIME type.  In practice,
however, this isn't useful, because all browsers support text/html, so
there's no need to serve with two MIME types.

If we're concerned about non-browser XML processors, we shouldn't need
polyglot.  All we should need is to make an XML serialization of the
spec available, or just make a text/html-to-XML converter available.
Then existing XML toolchains could process the document by just adding
one extra conversion step.  If you have html5lib installed, a
text/html-to-XML converter should take <10 lines to write and take a
negligible amount of time to run, less than fetching the file from the
network.
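
To make the "<10 lines" estimate concrete, here is a rough sketch of such a converter. It assumes the third-party html5lib package and its "etree" tree builder; the function name html_to_xml is only illustrative, and this is not a vetted tool:

```python
import xml.etree.ElementTree as ET


def html_to_xml(html_text):
    """Parse text/html with html5lib and reserialize the tree as XML."""
    # Assumption: html5lib is installed. It is imported here so that the
    # third-party dependency is explicit at the point of use.
    import html5lib

    # With the "etree" tree builder, parse() returns the root <html>
    # element, with element names in the XHTML namespace.
    root = html5lib.parse(html_text, treebuilder="etree",
                          namespaceHTMLElements=True)
    return ET.tostring(root, encoding="unicode")
```

An existing XML toolchain would then read the output of html_to_xml() instead of the original file. Some text/html constructs have no direct XML equivalent, so a real converter would likely need to handle a few edge cases beyond this sketch.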

The key difference here is that a polyglot document tries to be both
text/html and XML in the *same file*, *and* tries to produce the same
DOM (or almost) when parsed either way.  This is actually very
nontrivial, and it's not necessary if we only want to support XML
processing.

On Fri, Aug 19, 2011 at 7:09 AM, David Carlisle <davidc@nag.co.uk> wrote:
> What may (or may not?) be needed are content model restrictions on using
> or not using new "html5" structural features. Could a normative version
> of the spec use canvas for example?

This question is not specific to the HTML markup.  A spec could also
conceivably use CSS or JavaScript that's not supported by all
browsers, like localStorage or such.  It could even use features that
are in RECs but aren't universally supported.  For instance, you could
write a page that works perfectly in any browser that supports HTML
4.01 and CSS 2.1, but which is totally unreadable in IE6 and 7.
That's about 13% of browsers by market share that can't read the page
(using Wikimedia's statistics).  Likewise, HTML5 uses some Unicode
characters that display as boxes on my computer -- that doesn't break
any standard, but it's arguably a bad idea anyway, and certainly would
be if it were confusing.

I think we have to be pragmatic here and judge on a case-by-case
basis, based on real-world UA behavior rather than nominal maturity
levels.  The goal of a specification is to be read and understood,
after all.  As long as the markup used is such that it will be clearly
and accurately understood by pretty much any CSS-supporting browser
people are going to use -- say without JavaScript or plugins -- that
should be okay.

So if the spec author wants to include an example, which is clearly
marked as an example, which uses <canvas> and says "If your browser
supports <canvas>, you'll see a smiley face here:", such that if the
browser doesn't support <canvas> it instead displays fallback text
like "Your browser does not support canvas :(", then I think that's
not a problem.  Depending on <canvas> (or any other JS) for normative
text is obviously a non-starter, and also a bad idea if it's not
really clear what's happening in non-supporting browsers.

But all this is only realistically decidable on a case-by-case basis.
It should just be a corollary of "specifications have to be clearly
written".  I think it's quite a separate question from what formats we
should allow to begin with.  Obviously W3C specs should be published
in HTML+CSS+JS, not PDF or Flash or anything, nor using nonstandard
extensions.  But I don't see a reason to restrict the exact versions
used, provided they're standard or being standardized and the features
work in practice.

On Fri, Aug 19, 2011 at 7:25 AM, Richard Ishida <ishida@w3.org> wrote:
> [1] there are additional rules for polyglot documents to ensure that the
> document works as XML and HTML (for example, no XML declaration allowed,
> therefore encoding can only be utf-8 (or utf-16 but that was excluded from
> polyglot)).  So it's not just xml well-formedness. Having said that, I don't
> think there are many additional rules to worry about. That's what the
> polyglot spec describes: http://www.w3.org/TR/html-polyglot/

It's actually very hard to produce real polyglot documents
automatically.  For instance, there is no markup that will produce a
script tag with a single Text child that contains < or & that will
work in both text/html and XML.  <script><</script> works in
text/html, but is not XML.  <script>&lt;</script> works in XML, but
produces a different DOM as text/html ("&lt;" is treated as four
literal characters instead of one entity).  In practice you have to
use hacks like <script>/*<![CDATA[*/</*]]>*/</script> that more or
less work the same but don't actually produce the same DOM.  So we
should not be talking about polyglot unless we *really* mean polyglot,
rather than just "let's make a text/html-to-XML converter available".
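
The mismatch can be checked with nothing but the Python standard library.  This is a rough sketch: html.parser is not a full HTML5 parser, but like one it treats <script> content as raw text, and the helper names below are made up for illustration:

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class ScriptText(HTMLParser):
    """Collect the text inside <script>, roughly as a text/html parser sees it."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script content arrives as raw text: entities are not decoded.
        if self.in_script:
            self.chunks.append(data)


def script_text_as_html(markup):
    parser = ScriptText()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


def script_text_as_xml(markup):
    # Returns None when the markup is not well-formed XML.
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None
    return "".join(root.itertext())


for markup in ("<script><</script>",
               "<script>&lt;</script>",
               "<script>/*<![CDATA[*/</*]]>*/</script>"):
    print(repr(markup), repr(script_text_as_html(markup)),
          repr(script_text_as_xml(markup)))
```

For each of the three variants the two parsers disagree: the first is rejected as XML, the second yields the four characters "&lt;" as text/html but a single "<" as XML, and the CDATA hack yields different script text on each side.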

> [2] there are features of HTML5 that are not yet widely supported.  I think
> that what's needed is a defined subset of HTML5 for editors to use that
> reflects what is currently supported on major browsers.  That subset should
> imo be revised as soon as new features become supported by major
> browsers, eg. the dir=auto value will hopefully be supported soon, but it
> isn't yet.  It also assumes a decision that we are happy that people may
> struggle with 'non-major' browsers that may not yet support html5 features,
> and may have to view with a different browser.  It also requires defining
> what constitutes a 'major' browser.

As noted, this is not specific to HTML5 -- it even applies to things
that are in CSS2.1 and haven't changed since CSS2.  I don't think we
can make a precise list; it should be more like guidelines whose
interpretation can change over time.
Received on Friday, 19 August 2011 15:43:06 UTC