Re: NU’s polyglot possibilities (Was: The non-polyglot elephant in the room) from Alex Russell on 2013-01-25 (public-html@w3.org from January 2013)

From: Alex Russell <slightlyoff@google.com>
Date: Thu, 24 Jan 2013 19:44:54 -0500
To: David Sheets <kosmo.zb@gmail.com>
Cc: "Michael[tm] Smith" <mike@w3.org>, Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>, public-html WG <public-html@w3.org>, "www-tag@w3.org List" <www-tag@w3.org>
Message-ID: <CANr5HFUMhNYzsZw2K-DG3jfLNY6iK5mBYq4N2wCHVYqyQjHpFg@mail.gmail.com>
On Thu, Jan 24, 2013 at 6:29 PM, David Sheets <kosmo.zb@gmail.com> wrote:

> On Thu, Jan 24, 2013 at 2:14 PM, Alex Russell <slightlyoff@google.com>
> wrote:
> > I find myself asking (without an obvious answer): who benefits from the
> > creation of polyglot documents?
>
> Polyglot consumers benefit from only needing an HTML parser *or* an
> XML parser for a single representation.
>

That's just a tautology. "People who wish to consume a set of documents
known to be in a single encoding only need one decoder". It doesn't
illuminate any of the questions about the boundaries between
producers/consumers that I posed.


> Polyglot producers benefit from only needing to produce a single
> representation for both HTML and XML consumers.


What's the value to them in this? Yes, producers want to enable wide
consumption of their content, but nearly every computer sold can parse both
HTML and XML with off-the-shelf software. The marginal gain is...what?
Again, is this about production in a closed system or between
systems/groups/organizations? If the content is valuable, it is consumers
who invariably adapt. This is how the incentives and rewards in
time-delayed consumption are aligned.

Keep in mind that Postel's Law isn't a nice-to-have, it's a description of
what invariably happens when any system hits scale. We've seen this over and
over and over again, including in XML parsing. Real-world RSS pipelines
deal with all manner of invalid and malformed documents, not because it's
fun, but because to not do so forecloses opportunities that are valuable.
Consumers who find more value in strictness than in leniency will deal with
a relatively small set of producers.
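
A minimal sketch of what that leniency tends to look like in a feed
pipeline, assuming Python's standard library and a plain (non-namespaced)
RSS document. The names feed_titles and TitleCollector are made up for
illustration; real pipelines do far more recovery than this:

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TitleCollector(HTMLParser):
    """Tolerant fallback: collects the text inside <title> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.titles[-1] += data


def feed_titles(doc):
    """Return (titles, mode): strict XML first, lenient scan on failure."""
    try:
        root = ET.fromstring(doc)
        return [el.text or "" for el in root.iter("title")], "strict"
    except ET.ParseError:
        collector = TitleCollector()
        collector.feed(doc)
        collector.close()
        return collector.titles, "lenient"
```

A consumer that stops at the ParseError gets nothing from a feed with one
unescaped ampersand; the one with the fallback still gets the titles.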

> > If it's a closed ecosystem in which it's clear that all documents are XML
> > (but which might be sent to the "outside" as HTML), then I don't understand
> > why that ecosystem doesn't protect its borders by transforming HTML
> > documents (via an HTML parser->DOM->XML serialization) to XML.
>
> Why can't the publisher decide to allow both HTML and XML
> interpretation? Why does sending documents to the "outside" as HTML
> mean that they can no longer be well-formed XML?


Why must they be the same set of bytes?
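
To make the "transform at the border" option concrete, here is a rough
sketch of an HTML parser -> tree -> XML serialization step, assuming
Python's standard library. html_to_xml and TreeBuilder are hypothetical
names, html.parser is not a conforming HTML5 parser, and the XHTML
namespace declaration is left out, so treat this as an illustration of the
shape of the tool rather than something to deploy:

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML void elements never take an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Builds an ElementTree from tag events (assumes one root element)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = None
        self.stack = []  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "disabled") become empty strings.
        attrib = {k: (v if v is not None else "") for k, v in attrs}
        if self.stack:
            el = ET.SubElement(self.stack[-1], tag, attrib)
        else:
            el = ET.Element(tag, attrib)
            if self.root is None:
                self.root = el
        if tag not in VOID:
            self.stack.append(el)

    def handle_endtag(self, tag):
        if tag in VOID or tag not in [el.tag for el in self.stack]:
            return  # stray end tag: ignore it
        while self.stack:
            if self.stack.pop().tag == tag:
                break  # also closes any unclosed children

    def handle_data(self, data):
        if not self.stack:
            return
        parent = self.stack[-1]
        if len(parent):  # text after a child goes in that child's tail
            parent[-1].tail = (parent[-1].tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(html):
    """Parse HTML leniently and re-serialize it as well-formed XML."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    if builder.root is None:
        raise ValueError("no elements found")
    return ET.tostring(builder.root, encoding="unicode")
```

With something like this at the border, the bytes published as HTML and
the bytes handed to the XML system are different, and neither side needs
to care.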

I don't see you arguing that there's value to be gained for either
producers (who can just publish under one contract or both with a tool) or
consumers who are facing a corpus of content that could be any of: html,
xml, polyglot. Until they know what their content is, it's no easier to
make assumptions about how "easy" parsing it will be.

So let me rephrase: in the wild, what hope do we have that polyglot markup
will make life easier for interchange of documents between parties (not
just inside closed systems)?


> > Other possible users/producers seem even less compelling: if there's an open
> > ecosystem of documents that admit both HTML and XML, then it's always going
> > to be necessary for consuming software to support HTML parsing (and likely
> > also XML parsing).
>
> No. Only one of HTML or XML is necessary, AFAICT. Why do you *need* an
> HTML parser to consume a polyglot document?
>

How do you know it's polyglot?

I think that's the rub: you can advertise you're XML (with a .xml extension
or mimetype) or you can advertise easily that you're HTML. But what's the
contract for "I'm polyglot!"?
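
As a sketch of the only signal a consumer typically has, assuming it
dispatches on the Content-Type header (pick_parser is a hypothetical
helper, and the type list is illustrative rather than exhaustive):

```python
XML_TYPES = {"application/xhtml+xml", "application/xml", "text/xml"}


def pick_parser(content_type):
    """Map a Content-Type header value to 'html', 'xml', or 'unknown'."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "text/html":
        return "html"
    if media_type in XML_TYPES or media_type.endswith("+xml"):
        return "xml"
    return "unknown"
```

No branch in there can return "polyglot": the label says HTML or it says
XML, and the consumer picks its parser accordingly.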


> You only need both if you are a general-purpose browser. Lots of
> consuming software is not a general-purpose browser.


That argument is strange for any # of reasons, but I'll try to be concise:

   1. Browsers need both because they'll encounter both types of content in
   the wild
   2. They wish to be able to consume both types of content because it's
   valuable for them to be able to do so; "Site X doesn't work in Browser Y"
   being a prime way to lose market share.
   3. Non-browser systems that wish to consume many forms of content
   include many parsers. Image processing software, email systems,
   spreadsheets...I could go on and on.

It's clear by inspection that this has nothing to do with "being a
browser". It's about wanting to consume content from multiple corpuses. And
that's an incentive that is nearly always serviced (in the short term) on
the client.

> > If it's a world of HTML consumers that would like to
> > access XML documents...well, just publish as (legacy) XHTML, no?
>
> Generic legacy XHTML is not compatible with modern HTML. Defining the
> intersection is the point of polyglot.


Perhaps I don't understand...what modern HTML features defeat XHTML? Is it
only validation? If so...to be blunt, who cares?


> > What am I missing? Under what conditions can the expectations of producers
> > and consumers of polyglot documents be simplified by the addition of
> > polyglot markup to their existing world/toolchain?
>
> It is simpler to manage a single representation than two separate but
> similar representations (consider that the author may not have control
> of their HTTP publication).
>
> Use case:
> 1. Browsing documents in third-party-managed repo with HTML browser.
>

And that's a repo of....what? HTML documents? What's the contract that
publisher is conforming to?


> 2. Save a polyglot doc after viewing.
> 3. Put polyglot doc into XML system -- it works!
>

And assuming the XML system has a transforming front-end (the way browsers
do), that'd work for HTML too. Or is this really about extending HTML via
XML extensions?


> The "addition" of polyglot is an extra set of invariants on the
> document that reduces some consumers' burden. Of course, if it doesn't
> suit you, you are free to not use it.
>
> David
>
> > On Thu, Jan 24, 2013 at 4:57 AM, Michael[tm] Smith <mike@w3.org> wrote:
> >>
> >> Leif Halvard Silli <xn--mlform-iua@målform.no>, 2013-01-24 01:23 +0100:
> >>
> >> > Michael[tm] Smith, Mon, 21 Jan 2013 23:47:40 +0900:
> >> > > In the simplest implementation, the validator would need to
> >> > > automatically parse and validate the document twice
> >> >
> >> > 1 Could you do that? Just guide the user through two steps:
> >> >   HTML-validation + XHTML-validation?
> >>
> >> Of course doable. But I think it should be clear from my previous messages
> >> that regardless of how feasible it is, I'm not interested in implementing
> >> it. I don't want to put time into adding a feature that's intended to help
> >> users more easily create conforming Polyglot documents, because I don't
> >> think it's a good idea to encourage authors to create Polyglot documents.
> >>
> >> >   The second step could also
> >> >   produce a comparison of the DOM produced by the two steps.
> >>
> >> That would require the validator to construct a DOM from the document.
> >> Twice. The validator by design currently doesn't do any DOM construction
> >> at all. It does streaming processing of documents, using SAX events.
> >>
> >> Anyway, with respect, I hope you can understand that I'm not very
> >> interested in continuing a discussion of hypothetical functional details
> >> for a feature that I'm not planning to ever implement.
> >>
> >> > 2 But if the author uses a good, XHTML5-aware authoring tool that
> >> >   keeps the code well-formed, then a *single* validation as
> >> >   text/html should already bring you quite far.
> >>
> >> True I guess, if you're actually serving the document as text/html.
> >>
> >> But really what would get you even farther if you're using XML tools to
> >> create your documents is to not try to check them as text/html at all but
> >> instead serve them with an XML mime type, in which case the validator will
> >> parse them as XML instead of text/html, and everything will work fine.
> >>
> >> Anyway, yeah, if somebody is manually using XML tools to create their
> >> documents then I would think they'd already know whether they're
> >> well-formed, and they'd not need to use the validator to tell them whether
> >> they're well-formed or not. But of course a lot of documents on the Web
> >> are not created manually that way but instead dynamically generated out of a
> >> CMS, and many CMSes that are capable of serving up XML don't always get it
> >> right and can produce non-well-formed XML.
> >>
> >> All that said, I don't know why anybody who's serving a document as
> >> text/html would normally care much, at the point where it's being served
> >> (as opposed to the point where it's being created and preprocessed or
> >> whatever), whether it's XML-well-formed or not.
> >>
> >> > 3 Finally, one very simple thing: polyglot dummy code! The NU
> >> >   validator’s Text Field contains a HTML5 dummy that validates,
> >> >   but only as HTML, since the namespace isn't declared. Bug
> >> >   20712 proposes to add a dummy for the XHTML5 presets as well.[1]
> >> > [1] https://www.w3.org/Bugs/Public/show_bug.cgi?id=20712
> >>
> >> Yeah, I suppose it's worth having the dummy document include the namespace
> >> declaration if you've selected one of the XHTML presets. I'll get around
> >> to adding it at some point, if Henri doesn't first.
> >>
> >> >   Such a dummy no doubt serves as a teachable moment for many. And
> >> >   as long as you just add the namespace and otherwise keep the
> >> >   current dummy document intact, it would also, without banging
> >> >   it into anyone’s head, be a polyglot example.
> >>
> >> True that simple document would be a conforming polyglot instance, but I
> >> doubt most users would realize it as such, or care. The value of it would
> >> just be for the simple user convenience of not needing to manually add the
> >> namespace declaration in order to avoid the error message you get now.
> >>
> >>   --Mike
> >>
> >> --
> >> Michael[tm] Smith http://people.w3.org/mike
> >>
> >
>
Received on Friday, 25 January 2013 00:45:56 UTC