Re: NU’s polyglot possibilities (Was: The non-polyglot elephant in the room) from Alex Russell on 2013-01-25 (public-html@w3.org from January 2013)

From: Alex Russell <slightlyoff@google.com>
Date: Fri, 25 Jan 2013 14:48:52 -0500
To: David Sheets <kosmo.zb@gmail.com>
Cc: "Michael[tm] Smith" <mike@w3.org>, Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>, public-html WG <public-html@w3.org>, "www-tag@w3.org List" <www-tag@w3.org>
Message-ID: <CANr5HFXa8ic1dvUqMrg7TfsxHnWo+C9OSJnTEZ1AbUyDKxNQJA@mail.gmail.com>
On Thu, Jan 24, 2013 at 11:46 PM, David Sheets <kosmo.zb@gmail.com> wrote:

> On Thu, Jan 24, 2013 at 4:44 PM, Alex Russell <slightlyoff@google.com> wrote:
> > On Thu, Jan 24, 2013 at 6:29 PM, David Sheets <kosmo.zb@gmail.com> wrote:
> >>
> >> On Thu, Jan 24, 2013 at 2:14 PM, Alex Russell <slightlyoff@google.com> wrote:
> >> > I find myself asking (without an obvious answer): who benefits from the
> >> > creation of polyglot documents?
> >>
> >> Polyglot consumers benefit from only needing an HTML parser *or* an
> >> XML parser for a single representation.
> >
> > That's just a tautology. "People who wish to consume a set of documents
> > known to be in a single encoding only need one decoder". It doesn't
> > illuminate any of the questions about the boundaries between
> > producers/consumers that I posed.
>
> "People who wish to consume a set of documents known to simultaneously
> be in multiple equivalent encodings only need one of several
> decoders."
>
> That doesn't appear tautological to me. Check your cardinality. The
> Axiom of Choice comes to mind.


It appears to me that you've skipped a step ahead of answering my question
and are dismissing it on an assumption I'm not making (hence you think it's
not a tautology).

You posit a group of consumers who have one preference or another (a hard
preference, at that) and wish me to treat this binary-separable group as
uniform. You then posit a producer who would like to address this group of
consumers. You further wish me (AFAICT) to assume that these
demanding consumers are fully aware of the polyglot nature of the
producer's content through unspecified means.

What I'm asking is this: does this happen in the real world? Under what
circumstances? How frequently? On the open web (where I expect that the
contracts about what is and isn't XML are even more important), or inside
closed systems and organizations? I don't see that the TAG has any duty to
the latter, so it's an honest question.

My personal experience leads me away from assuming that this is common. I'm
looking for countering evidence in order to be able to form an informed
opinion. So the question is open (ISTM): who are the consumers that do not
adapt to publishers?

I observe many consumers that adapt and few producers who do (particularly
granted the time-shifted nature of produced content and the availability of
more transistors every year).


> >> Polyglot producers benefit from only needing to produce a single
> >> representation for both HTML and XML consumers.
> >
> > What's the value to them in this? Yes, producers want to enable wide
> > consumption of their content, but nearly every computer sold can parse both
> > HTML and XML with off-the-shelf software. The marginal gain is...what?
>
> 1. Smaller library dependency in software consumers
>

But evidence suggests that valuable content is transformed by eager
producers, not rejected. Consuming code that yields more value (can consume
more content) does better in the market. How is the value manifested for
users of this code? And are we supposed to assume that disk space is more
limited year-over-year (vs the historical trend)?


> 2. Wider interoperability with deployed systems
>

But that hinges on assuming that consumers do not adapt, but rather that
producers do (and retroactively!?)


> 3. Choice of internal data model, choice of parse strategy


Who is this valuable to? And isn't that value preserved by transformation?
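
To make "transformation" concrete: below is a minimal sketch of an HTML
parser -> tree -> XML serialization step, assuming Python's standard library
(html.parser and xml.etree.ElementTree). The names HTMLToTree and html_to_xml
are hypothetical, html.parser is a lenient tokenizer rather than the HTML5
parsing algorithm, and the sketch assumes well-nested input with a single
root element.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML void elements never have an end tag, so close them immediately.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class HTMLToTree(HTMLParser):
    """Feed html.parser events into an ElementTree builder (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (<input disabled>) arrive with value None.
        self.builder.start(tag, {k: (k if v is None else v) for k, v in attrs})
        if tag in VOID:
            self.builder.end(tag)

    def handle_endtag(self, tag):
        if tag not in VOID:  # void elements were already closed above
            self.builder.end(tag)

    def handle_data(self, data):
        self.builder.data(data)


def html_to_xml(html_text):
    """Return an XML serialization of a well-nested HTML fragment."""
    parser = HTMLToTree()
    parser.feed(html_text)
    parser.close()
    return ET.tostring(parser.builder.close(), encoding="unicode")


if __name__ == "__main__":
    # Unquoted attributes and an unclosed <br> are fine for the HTML side.
    print(html_to_xml("<p>Tom &amp; Jerry<br>next line <img src=cat.png alt=Cat></p>"))
```

Whatever data model the consumer prefers on the XML side is still available
after this step; the cost is carrying the extra parser.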


> > Again, is this about production in a closed system or between
> > systems/groups/organizations?
>
> Nothing is closed. Communication requires two parties. It should not
> be assumed that those parties co-operate. This applies even in a
> "closed" system. Send and receive systems evolve independently. Your
> distinction lacks a difference.


I don't think it does. Cooperating parties are more likely to settle on
stricter, more complete contracts (even if only through shared, unstated
assumptions). Parties further away in space and time must find ways to
adapt. I'm noting that this has led most systems that scale beyond one
"sphere of control" to be more forgiving about what they accept over time,
not less.

Here at Google we run MASSIVE systems that communicate over very fiddly
protocols. We can do this because we control the entire ecosystem in which
these systems live...in theory. But even as we've evolved them, we've found
that we must build multiple parsers into our binaries for even
"restrictive" data encodings. It just seems to happen, no matter intention
or policy.


> > If the content is valuable, it is consumers who invariably adapt.
>
> Free software is often valuable but consumers do not "invariably
> adapt" due to practical barriers. In much the same way, publishers may
> have user bases that are best served by providing additional
> guarantees on the well-formedness (resp. ease-of-use) of their
> documents.


I'm trying to understand what the real-world costs are. Free software isn't
comparable, as it's not content per se. A book or movie might be. Does the
free software make it easier to read the book or movie? That's the analog.


> > This is how the incentives and rewards in time-delayed consumption are
> > aligned.
>
> Is it? Your market is perfectly efficient?


Of course not.


> Your market has perfect information?


My questions are all about how information-deprived consumers will get
through the day.


> Consumers experience no switching costs? Nobody has
> lock-in or legacy? No field deployment? No corporate hegemony games?
> Are you advocating O(N) where N = number of consumers adaptations
> instead of O(1) where 1 = producer adaptation?
>
> Or perhaps you regard O(N) = O(1) because the agency of the *average*
> End User has been reduced to a choice between a handful of
> general-purpose browsers?


I think at this point you've convinced me that you're not interested in
answering the question and, perhaps frustratingly for both of us, helped me
understand that Polyglot isn't a real-world concern (although, do feel free
to convince me otherwise with better arguments and data...I'm keenly
interested to see them).


> > Keep in mind that Postel's Law isn't a nice-to-have, it's a description of
> > what invariably happens when any system hits scale.
>
> Great! Why are you advocating censoring how to be more conservative in
> what you emit? We have "hit scale" and, for some publishers, that
> includes allowing for consumers which only understand XML or only
> understand HTML.
>
> "Be conservative in what you send, liberal in what you accept."
>
> It takes both halves to make it work.


I was inaccurate. The first half of the law *is* a nice-to-have (by
definition). The second is a description of what happens when systems hit
scale, invariably. I should have been clearer. Apologies.

Regards


> > We've seen this over and over
> > and over again, including in XML parsing. Real-world RSS pipelines deal with
> > all manner of invalid and mis-formed documents, not because it's fun, but
> > because to not do so forecloses opportunities that are valuable. Consumers
> > who find more value in strictness than in leniency will deal with a
> > relatively small set of producers.
>
> Conversely, producers who produce interoperable product will have the
> largest set of potential consumers. Why encourage ignorance by
> producers who wish to serve both simple and sophisticated consumers?
>
> The harder you push for censorship of an up-to-date and official
> polyglot standard and support a forked, authoritarian standardization
> process, the fewer producers will be able to create polyglot
> documents. Why not let publishers decide instead of trying to
> legislate interoperability out of existence?
>
> >> > If it's a closed ecosystem in which it's clear that all documents are XML
> >> > (but which might be sent to the "outside" as HTML), then I don't understand
> >> > why that ecosystem doesn't protect its borders by transforming HTML
> >> > documents (via an HTML parser->DOM->XML serialization) to XML.
> >>
> >> Why can't the publisher decide to allow both HTML and XML
> >> interpretation? Why does sending documents to the "outside" as HTML
> >> mean that they can no longer be well-formed XML?
> >
> > Why must they be the same set of bytes?
>
> I already addressed this below (authors may not have HTTP control,
> fewer representations to manage, the text/html parser doesn't care).
> Why *can't* they be the same sequence of bytes?
>
> > I don't see you arguing that there's value to be gained for either producers
> > (who can just publish under one contract or both with a tool) or consumers
> > who are facing a corpus of content that could be any of: html, xml,
> > polyglot. Until they know what their content is, it's no easier to make
> > assumptions about how "easy" parsing it will be.
>
> You don't see what you don't *want* to see.
>
> 1. Producers can provide a single artifact that satisfies multiple
> constituencies.
> 2. Consumers may not be general-purpose devices that "face a corpus of
> content that could be any of html, xml, polyglot".
>
> The fact that you have put polyglot as a third category next to HTML
> and XML is telling. Polyglot is simultaneously in *both* categories
> and does not need special consideration by any kind of consumer which
> is the whole point.
>
> Consider compressed archives or CDs or transport formats without
> content negotiation. There is value in being able to point to a blob
> and say "it satisfies all of A's constraints and all of B's
> constraints".
>
> > So let me rephrase: in the wild, what hope do we have that polyglot markup
> > will make life easier for interchange of documents between parties (not just
> > inside closed systems)?
>
> "Here is a hypermedia document."
> "Oh, what format is it?"
> "Just try it."
> |---> tries text/html "It works!"
> |---> tries application/xhtml+xml "It works!"
>
> Compare to:
>
> "Here is a hypermedia document."
> "Oh, what format is it?"
> "text/html"
> "Darn, I wanted to use XML tools on it."
>
> or
>
> "Here is a hypermedia document."
> "Oh, what format is it?"
> "application/xhtml+xml"
> "Eww... it uses namespaces and foreign vocabularies and weird entities
> and IE can't understand it and I can't figure out how to edit it and
> stay well-formed. :-("
>
> Parties interchange. One party (producer) wants to make their document
> as widely consumable as possible. They read a concise description of
> how to do that. They implement the recommendations easily. Now their
> product is more valuable to some population and they don't have
> multiple, potentially diverging, artifacts to manage.
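
As an illustration of the "just try it" exchange quoted above, here is a
minimal sketch, assuming Python's standard library, that hands one set of
bytes to an XML parser and to an HTML tokenizer and compares the element
names each one reports. The sample document POLYGLOT and the helper names
are hypothetical, and html.parser only stands in for a text/html consumer;
it does not implement the HTML5 parsing algorithm.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical sample document, written to be well-formed XML and usable as HTML.
POLYGLOT = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head><meta charset="utf-8"/><title>Hello</title></head>
<body><p>One set of bytes,<br/>two parsers.</p></body>
</html>"""


class StartTags(HTMLParser):
    """Record element names in the order html.parser sees them open."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <br/>.
        self.names.append(tag)


def names_via_html(text):
    parser = StartTags()
    parser.feed(text)
    parser.close()
    return parser.names


def names_via_xml(text):
    root = ET.fromstring(text)
    # ElementTree reports tags as "{namespace}local"; keep the local part.
    return [el.tag.rsplit("}", 1)[-1] for el in root.iter()]


if __name__ == "__main__":
    # True when both consumers see the same elements in the same order.
    print(names_via_html(POLYGLOT) == names_via_xml(POLYGLOT))
```

Agreement on element names is only a smoke test; it says nothing about the
divergences in namespaces, attributes, or entities that the thread goes on
to discuss.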
>
> >> > Other possible users/producers seem even less compelling: if there's an open
> >> > ecosystem of documents that admit both HTML and XML, then it's always going
> >> > to be necessary for consuming software to support HTML parsing (and likely
> >> > also XML parsing).
> >>
> >> No. Only one of HTML or XML is necessary, AFAICT. Why do you *need* an
> >> HTML parser to consume a polyglot document?
> >
> > How do you know it's polyglot?
>
> Why do you need to know?
>
> > I think that's the rub: you can advertise you're XML (with a .xml extension
> > or mimetype) or you can advertise easily that you're HTML. But what's the
> > contract for "I'm polyglot!"?
>
> Any metadata assertion indicating that the representation is either
> text/html or application/xhtml+xml will be correct by design.
>
> This ambivalence may satisfy many polyglot producers. I believe that
> standard embedded metadata to declare alternative interpretations is
> useful and will aid tooling.
> <https://www.w3.org/Bugs/Public/show_bug.cgi?id=20767>
>
> >> You only need both if you are a general-purpose browser. Lots of
> >> consuming software is not a general-purpose browser.
> >
> > That argument is strange for any # of reasons, but I'll try to be concise:
>
> Your rebuttal should demonstrate how it is "strange". No need for
> labeling it as such...
>
> > Browsers need both because they'll encounter both types of content in the
> > wild.
> > They wish to be able to consume both types of content because it's valuable
> > for them to be able to do so; "Site X doesn't work in Browser Y" being a
> > prime way to lose market share.
> > Non-browser systems that wish to consume many forms of content include many
> > parsers. Image processing software, email systems, spreadsheets...I could go
> > on and on.
>
> Please don't. These tools are general domain viewers: image viewers,
> email viewers, spreadsheet viewers. In our case, we are discussing
> hypermedia documents and we happen to call such a general hypermedia
> document domain viewers "browsers".
>
> It turns out that there are other classes of consumers than this kind
> of general end user software system. Some of them are quite general
> and only include one parser: an XML parser.
>
> Furthermore, existence of polyglot does not affect your general
> browsers *at all* because no matter which language these systems
> interpret, they should be able to interpret polyglot.
>
> > It's clear by inspection that this has nothing to do with "being a browser".
>
> Nothing? Browsers want to consume content from multiple corpuses and
> are nearly required by End Users to contain interpreters for text/html
> and application/xhtml+xml. So your assertion is neither true nor clear
> by inspection.
>
> > It's about wanting to consume content from multiple corpuses. And that's an
> > incentive that is nearly always serviced (in the short term) on the client.
>
> "We have the resources to handle the complexity so you shouldn't
> attempt to produce content for those who do not have the resources to
> handle the complexity."?
>
> There are multiple corpuses which are XML only.
> There are multiple corpuses which are HTML only.
> There are interpreters for XML only.
> There are interpreters for HTML only (html5lib).
>
> Why are general-purpose, complexity-laden applications the only class
> of consumers that publishers should consider?
>
> This is especially true when the cost of polyglot publication for a
> publisher is minimal because they already use XML on their back-end.
> Why should they have to choose between losing invariants and
> corrupting their output or giving up their automation?
>
> If your features include "can interpret most anything" then it should
> come as no surprise that you require multiple parsers.
>
> >> > If it's a world of HTML consumers that would like to
> >> > access XML documents...well, just publish as (legacy) XHTML, no?
> >>
> >> Generic legacy XHTML is not compatible with modern HTML. Defining the
> >> intersection is the point of polyglot.
> >
> > Perhaps I don't understand...what modern HTML features defeat XHTML?
>
> I did not realize that HTML and XHTML were doing battle. There is no
> reason they cannot coexist peacefully save politics and power play.
>
> Logically, it is not HTML features that foil XHTML but XHTML features
> that HTML cannot interpret or that HTML interprets in divergent ways.
> See:
>
> <http://dev.w3.org/html5/html-xhtml-author-guide/#namespaces>
> <http://dev.w3.org/html5/html-xhtml-author-guide/#disallowed-attributes>
> <http://dev.w3.org/html5/html-xhtml-author-guide/#named-entity-references>
>
> Additionally, as HTML keeps "Living" (perhaps "Evolving" would be
> better?), it might adopt expressions without XML counterparts.
>
> See: <http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2013-January/038632.html>
>
> > Is it only validation?
>
> No, it is divergence of meaning.
>
> > If so...to be blunt, who cares?
>
> Yeah, automated checking is worthless. I'm glad I can ignore all those
> pointless compiler warnings.
>
> "If I create valid XHTML, why does it have to be invalid HTML?"
> "It doesn't."
> "Why didn't someone tell me?"
> "They tried."
> "Why didn't I hear about it?"
> "Google didn't see the point."
> "Oh."
>
> >> > What am I missing? Under what conditions can the expectations of producers
> >> > and consumers of polyglot documents be simplified by the addition of
> >> > polyglot markup to their existing world/toolchain?
> >>
> >> It is simpler to manage a single representation than two separate but
> >> similar representations (consider that the author may not have control
> >> of their HTTP publication).
> >>
> >> Use case:
> >> 1. Browsing documents in third-party-managed repo with HTML browser.
> >
> > And that's a repo of....what? HTML documents?
>
> A repo of polyglot documents which happen to be HTML documents (by
> definition).
>
> > What's the contract that publisher is conforming to?
>
> The publisher is conforming to both text/html and application/xhtml+xml.
>
> >> 2. Save a polyglot doc after viewing.
> >> 3. Put polyglot doc into XML system -- it works!
> >
> > And assuming the XML system has a transforming front-end (the way browsers
> > do), that's work for HTML too.
>
> This assumption is exactly the problem. As author, I did not wish to
> require my consumers to have a transforming front-end. Perhaps the XML
> system is not designed for XHTML but many different XML vocabularies
> of which XHTML is one.
>
> Why recommend publication of corrupt documents that require extra
> software to clean? Why assume that every transforming front-end will
> interpret my broken documents in identical ways?
>
> If the publisher does not want to foist that burden onto their
> consumers, why not help them understand the intersection of text/html
> and application/xhtml+xml?
>
> > Or is this really about extending HTML via XML extensions?
>
> I think it's really about extending the leverage of Google and its
> Mozilla protectorate over the public resource of the World Wide Web.
>
> Why spend resources on "legacy" standards that some developers and
> publishers want? Why encourage 10+ year ecosystem of existing software
> that is no longer en vogue? We should just rebuild it all... again.
> It'll only take a decade and we have lots of money and influence now!
> All the cool kids are using JSON these days, anyway.
>
> How do the votes work out? Does each browser vendor get votes
> equivalent to three-fifths of their user base?
>
> Oops... maybe provocative speculation as to others' motives isn't a
> very wise idea. I apologize for my inappropriate behavior.
>
> Sincerely,
>
> David William Wallace Sheets
>
Received on Friday, 25 January 2013 19:49:50 UTC