Re: NU's polyglot possibilities (Was: The non-polyglot elephant in the room) from David Sheets on 2013-01-25 (public-html@w3.org from January 2013)

From: David Sheets <kosmo.zb@gmail.com>
Date: Thu, 24 Jan 2013 20:46:54 -0800
To: Alex Russell <slightlyoff@google.com>
Cc: "Michael[tm] Smith" <mike@w3.org>, Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>, public-html WG <public-html@w3.org>, "www-tag@w3.org List" <www-tag@w3.org>
Message-ID: <CAAWM5TwA2K9iY_YtuO=jYg4Eq8UPEGPd7q4=rcbZyWgTD_JB-w@mail.gmail.com>
On Thu, Jan 24, 2013 at 4:44 PM, Alex Russell <slightlyoff@google.com> wrote:
> On Thu, Jan 24, 2013 at 6:29 PM, David Sheets <kosmo.zb@gmail.com> wrote:
>>
>> On Thu, Jan 24, 2013 at 2:14 PM, Alex Russell <slightlyoff@google.com>
>> wrote:
>> > I find myself asking (without an obvious answer): who benefits from the
>> > creation of polyglot documents?
>>
>> Polyglot consumers benefit from only needing an HTML parser *or* an
>> XML parser for a single representation.
>
> That's just a tautology. "People who wish to consume a set of documents
> known to be in a single encoding only need one decoder". It doesn't
> illuminate any of the questions about the boundaries between
> producers/consumers that I posed.

"People who wish to consume a set of documents known to simultaneously
be in multiple equivalent encodings only need one of several
decoders."

That doesn't appear tautological to me. Check your cardinality. The
Axiom of Choice comes to mind.

>> Polyglot producers benefit from only needing to produce a single
>> representation for both HTML and XML consumers.
>
> What's the value to them in this? Yes, producers want to enable wide
> consumption of their content, but nearly every computer sold can parse both
> HTML and XML with off-the-shelf software. The marginal gain is...what?

1. Smaller library dependency in software consumers
2. Wider interoperability with deployed systems
3. Choice of internal data model, choice of parse strategy

> Again, is this about production in a closed system or between
> systems/groups/organizations?

Nothing is closed. Communication requires two parties. It should not
be assumed that those parties co-operate. This applies even in a
"closed" system. Send and receive systems evolve independently. Your
distinction lacks a difference.

> If the content is valuable, it is consumers who invariably adapt.

Free software is often valuable but consumers do not "invariably
adapt" due to practical barriers. In much the same way, publishers may
have user bases that are best served by providing additional
guarantees on the well-formedness (resp. ease-of-use) of their
documents.

> This is how the incentives and rewards in time-delayed consumption are aligned.

Is it? Your market is perfectly efficient? Your market has perfect
information? Consumers experience no switching costs? Nobody has
lock-in or legacy? No field deployment? No corporate hegemony games?
Are you advocating O(N) where N = number of consumer adaptations
instead of O(1) where 1 = producer adaptation?

Or perhaps you regard O(N) = O(1) because the agency of the *average*
End User has been reduced to a choice between a handful of
general-purpose browsers?

> Keep in mind that Postel's Law isn't a nice-to-have, it's a description of
> what invariably happens when any system hits scale.

Great! Why are you advocating censoring guidance on how to be more
conservative in what you emit? We have "hit scale" and, for some
publishers, that includes allowing for consumers which only understand
XML or only understand HTML.

"Be conservative in what you send, liberal in what you accept."

It takes both halves to make it work.

> We've seen this over and over
> and over again, including in XML parsing. Real-world RSS pipelines deal with
> all manner of invalid and mis-formed documents, not because it's fun, but
> because to not do so forecloses opportunities that are valuable. Consumers
> who find more value in strictness than in leniency will deal with a
> relatively small set of producers.

Conversely, producers who produce interoperable product will have the
largest set of potential consumers. Why encourage ignorance by
producers who wish to serve both simple and sophisticated consumers?

The harder you push for censorship of an up-to-date and official
polyglot standard and support a forked, authoritarian standardization
process, the fewer producers will be able to create polyglot
documents. Why not let publishers decide instead of trying to
legislate interoperability out of existence?

>> > If it's a closed ecosystem in which it's clear that all documents are
>> > XML
>> > (but which might be sent to the "outside" as HTML), then I don't
>> > understand
>> > why that ecosystem doesn't protect its borders by transforming HTML
>> > documents (via an HTML parser->DOM->XML serialization) to XML.
>>
>> Why can't the publisher decide to allow both HTML and XML
>> interpretation? Why does sending documents to the "outside" as HTML
>> mean that they can no longer be well-formed XML?
>
> Why must they be the same set of bytes?

I already addressed this below (authors may not have HTTP control,
fewer representations to manage, the text/html parser doesn't care).
Why *can't* they be the same sequence of bytes?

> I don't see you arguing that there's value to be gained for either producers
> (who can just publish under one contract or both with a tool) or consumers
> who are facing a corpus of content that could be any of: html, xml,
> polyglot. Until they know what their content is, it's no easier to make
> assumptions about how "easy" parsing it will be.

You don't see what you don't *want* to see.

1. Producers can provide a single artifact that satisfies multiple
constituencies.
2. Consumers may not be general-purpose devices that "face a corpus of
content that could be any of html, xml, polyglot".

The fact that you have put polyglot as a third category next to HTML
and XML is telling. Polyglot is simultaneously in *both* categories
and does not need special consideration by any kind of consumer which
is the whole point.

Consider compressed archives or CDs or transport formats without
content negotiation. There is value in being able to point to a blob
and say "it satisfies all of A's constraints and all of B's
constraints".

> So let me rephrase: in the wild, what hope do we have that polyglot markup
> will make life easier for interchange of documents between parties (not just
> inside closed systems)?

"Here is a hypermedia document."
"Oh, what format is it?"
"Just try it."
|---> tries text/html "It works!"
|---> tries application/xhtml+xml "It works!"

Compare to:

"Here is a hypermedia document."
"Oh, what format is it?"
"text/html"
"Darn, I wanted to use XML tools on it."

or

"Here is a hypermedia document."
"Oh, what format is it?"
"application/xhtml+xml"
"Eww... it uses namespaces and foreign vocabularies and weird entities
and IE can't understand it and I can't figure out how to edit it and
stay well-formed. :-("

Parties interchange. One party (producer) wants to make their document
as widely consumable as possible. They read a concise description of
how to do that. They implement the recommendations easily. Now their
product is more valuable to some population and they don't have
multiple, potentially diverging, artifacts to manage.
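
As a rough illustration of the "just try it" exchange above, here is a
minimal Python sketch that feeds the same string to an HTML-style parser
and to an XML parser and reads the title from each. The sample document
and the helper names (TitleGrabber, title_via_html, title_via_xml) are
made up for this illustration. Note that Python's html.parser is a
lenient tag tokenizer and not a conforming text/html parser, so it only
approximates an HTML-only consumer.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical sample document, written to be well-formed XML
# while staying readable by an HTML parser.
POLYGLOT = (
    '<!DOCTYPE html>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">\n'
    '<head><meta charset="utf-8"/><title>Hello</title></head>\n'
    '<body><p>One artifact, two parsers.<br/></p></body>\n'
    '</html>\n'
)

XHTML_NS = {"h": "http://www.w3.org/1999/xhtml"}


class TitleGrabber(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def title_via_html(doc):
    """Read the title as an HTML-only consumer might."""
    parser = TitleGrabber()
    parser.feed(doc)
    parser.close()
    return parser.title


def title_via_xml(doc):
    """Read the title as an XML-only consumer might."""
    root = ET.fromstring(doc)
    return root.find("h:head/h:title", XHTML_NS).text


if __name__ == "__main__":
    print(title_via_html(POLYGLOT))
    print(title_via_xml(POLYGLOT))
```

Neither function needs to know that the input is "polyglot"; each just
applies the one parser it has.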

>> > Other possible users/producers seem even less compelling: if there's an
>> > open
>> > ecosystem of documents that admit both HTML and XML, then it's always
>> > going
>> > to be necessary for consuming software to support HTML parsing (and
>> > likely
>> > also XML parsing).
>>
>> No. Only one of HTML or XML is necessary, AFAICT. Why do you *need* an
>> HTML parser to consume a polyglot document?
>
> How do you know it's polyglot?

Why do you need to know?

> I think that's the rub: you can advertise you're XML (with a .xml extension
> or mimetype) or you can advertise easily that you're HTML. But what's the
> contract for "I'm polyglot!"?

Any metadata assertion indicating that the representation is either
text/html or application/xhtml+xml will be correct by design.

This ambivalence may satisfy many polyglot producers. I believe that
standard embedded metadata to declare alternative interpretations is
useful and will aid tooling.
<https://www.w3.org/Bugs/Public/show_bug.cgi?id=20767>

>> You only need both if you are a general-purpose browser. Lots of
>> consuming software is not a general-purpose browser.
>
> That argument is strange for any # of reasons, but I'll try to be concise:

Your rebuttal should demonstrate how it is "strange". No need for
labeling it as such...

> 1. Browsers need both because they'll encounter both types of content in the
>    wild.
> 2. They wish to be able to consume both types of content because it's valuable
>    for them to be able to do so; "Site X doesn't work in Browser Y" being a
>    prime way to lose market share.
> 3. Non-browser systems that wish to consume many forms of content include many
>    parsers. Image processing software, email systems, spreadsheets...I could go
>    on and on.

Please don't. These tools are general domain viewers: image viewers,
email viewers, spreadsheet viewers. In our case, we are discussing
hypermedia documents and we happen to call general viewers for the
hypermedia document domain "browsers".

It turns out that there are other classes of consumers than this kind
of general end user software system. Some of them are quite general
and only include one parser: an XML parser.

Furthermore, existence of polyglot does not affect your general
browsers *at all* because no matter which language these systems
interpret, they should be able to interpret polyglot.

> It's clear by inspection that this has nothing to do with "being a browser".

Nothing? Browsers want to consume content from multiple corpuses and
are nearly required by End Users to contain interpreters for text/html
and application/xhtml+xml. So your assertion is neither true nor clear
by inspection.

> It's about wanting to consume content from multiple corpuses. And that's an
> incentive that is nearly always serviced (in the short term) on the client.

"We have the resources to handle the complexity so you shouldn't
attempt to produce content for those who do not have the resources to
handle the complexity."?

There are multiple corpuses which are XML only.
There are multiple corpuses which are HTML only.
There are interpreters for XML only.
There are interpreters for HTML only (html5lib).

Why are general-purpose, complexity-laden applications the only class
of consumers that publishers should consider?

This is especially true when the cost of polyglot publication for a
publisher is minimal because they already use XML on their back-end.
Why should they have to choose between losing invariants and
corrupting their output or giving up their automation?

If your features include "can interpret most anything" then it should
come as no surprise that you require multiple parsers.

>> > If it's a world of HTML consumers that would like to
>> > access XML documents...well, just publish as (legacy) XHTML, no?
>>
>> Generic legacy XHTML is not compatible with modern HTML. Defining the
>> intersection is the point of polyglot.
>
> Perhaps I don't understand...what modern HTML features defeat XHTML?

I did not realize that HTML and XHTML were doing battle. There is no
reason they cannot coexist peacefully save politics and power play.

Logically, it is not HTML features that foil XHTML but XHTML features
that HTML cannot interpret or that HTML interprets in divergent ways.
See:

<http://dev.w3.org/html5/html-xhtml-author-guide/#namespaces>
<http://dev.w3.org/html5/html-xhtml-author-guide/#disallowed-attributes>
<http://dev.w3.org/html5/html-xhtml-author-guide/#named-entity-references>
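
To make the named-entity case concrete, here is a small Python sketch
using only the standard library. It assumes a consumer with a plain,
non-validating XML parser and no DTD loaded; the helper name xml_accepts
and the two sample fragments are invented for this illustration. A
named reference such as &nbsp; is understood on the HTML side but is an
undefined entity to the bare XML parser, while the numeric form is
accepted by both.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical fragments: same intended text, two spellings of U+00A0.
NAMED = "<p>a&nbsp;b</p>"
NUMERIC = "<p>a&#160;b</p>"


def xml_accepts(fragment):
    """Return True if a DTD-less XML parse of the fragment succeeds."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    # HTML-style entity decoding knows both spellings.
    print(html.unescape("a&nbsp;b") == html.unescape("a&#160;b"))
    # The XML parser only accepts the numeric spelling.
    print(xml_accepts(NAMED), xml_accepts(NUMERIC))
```

This is the kind of divergence that a description of the intersection
can steer authors around, here by preferring the numeric reference.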

Additionally, as HTML keeps "Living" (perhaps "Evolving" would be
better?), it might adopt expressions without XML counterparts.

See: <http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2013-January/038632.html>

> Is it only validation?

No, it is divergence of meaning.

> If so...to be blunt, who cares?

Yeah, automated checking is worthless. I'm glad I can ignore all those
pointless compiler warnings.

"If I create valid XHTML, why does it have to be invalid HTML?"
"It doesn't."
"Why didn't someone tell me?"
"They tried."
"Why didn't I hear about it?"
"Google didn't see the point."
"Oh."

>> > What am I missing? Under what conditions can the expectations of
>> > producers
>> > and consumers of polyglot documents be simplified by the addition of
>> > polyglot markup to their existing world/toolchain?
>>
>> It is simpler to manage a single representation than two separate but
>> similar representations (consider that the author may not have control
>> of their HTTP publication).
>>
>> Use case:
>> 1. Browsing documents in third-party-managed repo with HTML browser.
>
> And that's a repo of....what? HTML documents?

A repo of polyglot documents which happen to be HTML documents (by definition).

> What's the contract that publisher is conforming to?

The publisher is conforming to both text/html and application/xhtml+xml.

>> 2. Save a polyglot doc after viewing.
>> 3. Put polyglot doc into XML system -- it works!
>
> And assuming the XML system has a transforming front-end (the way browsers
> do), that'd work for HTML too.

This assumption is exactly the problem. As author, I did not wish to
require my consumers to have a transforming front-end. Perhaps the XML
system is not designed for XHTML but many different XML vocabularies
of which XHTML is one.

Why recommend publication of corrupt documents that require extra
software to clean? Why assume that every transforming front-end will
interpret my broken documents in identical ways?

If the publisher does not want to foist that burden onto their
consumers, why not help them understand the intersection of text/html
and application/xhtml+xml?

> Or is this really about extending HTML via XML extensions?

I think it's really about extending the leverage of Google and its
Mozilla protectorate over the public resource of the World Wide Web.

Why spend resources on "legacy" standards that some developers and
publishers want? Why encourage a 10+ year ecosystem of existing software
that is no longer en vogue? We should just rebuild it all... again.
It'll only take a decade and we have lots of money and influence now!
All the cool kids are using JSON these days, anyway.

How do the votes work out? Does each browser vendor get votes
equivalent to three-fifths of their user base?

Oops... maybe provocative speculation as to others' motives isn't a
very wise idea. I apologize for my inappropriate behavior.

Sincerely,

David William Wallace Sheets
Received on Friday, 25 January 2013 04:47:29 UTC