
(wrong string)'s polyglot possibilities (Was: The non-polyglot elephant in the room)

From: Alex Russell <slightlyoff@google.com>
Date: Fri, 25 Jan 2013 17:11:56 -0500
Message-ID: <CANr5HFX-hvaJyfiGWkoZa1aU+UDMpwZ1-Y4JW=R-HeYMdexxNA@mail.gmail.com>
To: David Sheets <kosmo.zb@gmail.com>
Cc: "Michael[tm] Smith" <mike@w3.org>, Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>, public-html WG <public-html@w3.org>, "www-tag@w3.org List" <www-tag@w3.org>
On Fri, Jan 25, 2013 at 4:16 PM, David Sheets <kosmo.zb@gmail.com> wrote:

> On Fri, Jan 25, 2013 at 11:48 AM, Alex Russell <slightlyoff@google.com>
> wrote:
> > On Thu, Jan 24, 2013 at 11:46 PM, David Sheets <kosmo.zb@gmail.com>
> wrote:
> >>
> >> On Thu, Jan 24, 2013 at 4:44 PM, Alex Russell <slightlyoff@google.com>
> >> wrote:
> >> > On Thu, Jan 24, 2013 at 6:29 PM, David Sheets <kosmo.zb@gmail.com>
> >> > wrote:
> >> >>
> >> >> On Thu, Jan 24, 2013 at 2:14 PM, Alex Russell <
> slightlyoff@google.com>
> >> >> wrote:
> >> >> > I find myself asking (without an obvious answer): who benefits from
> >> >> > the
> >> >> > creation of polyglot documents?
> >> >>
> >> >> Polyglot consumers benefit from only needing an HTML parser *or* an
> >> >> XML parser for a single representation.
> >> >
> >> > That's just a tautology. "People who wish to consume a set of
> documents
> >> > known to be in a single encoding only need one decoder". It doesn't
> >> > illuminate any of the questions about the boundaries between
> >> > producers/consumers that I posed.
> >>
> >> "People who wish to consume a set of documents known to simultaneously
> >> be in multiple equivalent encodings only need one of several
> >> decoders."
> >>
> >> That doesn't appear tautological to me. Check your cardinality. The
> >> Axiom of Choice comes to mind.
> >
> > It appears to me that you've skipped a step ahead of answering my
> question
> > and are dismissing it on an assumption I'm not making (hence you think
> it's
> > not a tautology).
>
> Let us find our common misunderstanding and resolve it.
>
> > You posit a group of consumers who have one preference or another (a hard
> > preference, at that) and wish me to treat this binary-separable group as
> > uniform. You then posit a producer who would like to address this group
> of
> > consumers. You further wish me (AFAICT) to assume that these
> > demanding consumers are fully aware of the polyglot nature of the
> producer's
> > content through unspecified means.
>
> Suppose you are publishing technical documentation. You already have a
> toolchain constructed to ensure very specific invariants on your
> output documents. Your consumers are savvy and may wish to script
> against your documentation. You perceive that for a small cost
> (reading polyglot spec and tweaking to emit it), you can simplify
> consumption for your user base.
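A minimal sketch of such a tweak, assuming a Python toolchain that already builds an element tree (the helper and the abbreviated void-element list are illustrative, not taken from the polyglot spec):

```python
# Sketch: serialize an existing document tree so that the output is
# simultaneously valid HTML and well-formed XML ("polyglot" markup).
# Standard library only; the helper name is illustrative.
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# HTML void elements must be self-closed for XML; "<br />" (with the
# space) is also conforming HTML, so the form satisfies both parsers.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "param", "source", "track", "wbr"}

def polyglot(el):
    tag = el.tag.lower()                     # XML is case-sensitive; pick one case
    attrs = "".join(f" {k.lower()}={quoteattr(v)}" for k, v in el.attrib.items())
    if tag in VOID:
        return f"<{tag}{attrs} />"           # legal in both syntaxes
    inner = escape(el.text or "") + "".join(
        polyglot(child) + escape(child.tail or "") for child in el)
    return f"<{tag}{attrs}>{inner}</{tag}>"  # non-void: always an explicit end tag

doc = ET.fromstring('<p class="note">A break<br/>here</p>')
print(polyglot(doc))  # <p class="note">A break<br />here</p>
```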


This works with a single producer and consumer who have a fixed contract.
That's sort of the definition of a closed system...and it's not the web.
Why aren't they just publishing as one or the other? And if the tweaks are
so small (but necessary), why isn't this a job for software? Consumers who
want to process more than a single producer's content either have to:

   1. Have a reliable way to know that what they consume isn't going to be
   broken (as HTML in XML parsing is)
   2. Have a way of consuming a superset of any individual publisher's
   formats

Either would work, but polyglot precludes #1 on the basis that #2 shouldn't have
to happen, against all the evidence of how this sort of thing is sorted out
every day by real world software.
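For what it's worth, strategy #2 is routine; a sketch of such a superset consumer in standard-library Python (the text-extraction fallback merely stands in for a real lenient HTML parser):

```python
# Sketch of strategy #2: try the strict XML parse first, fall back to
# lenient HTML recovery when that fails. Standard library only.
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

class TextFallback(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def consume(document):
    try:
        root = ET.fromstring(document)   # strict: rejects non-well-formed input
        return ("xml", "".join(root.itertext()))
    except ET.ParseError:
        p = TextFallback()               # lenient: never rejects
        p.feed(document)
        return ("html", "".join(p.chunks))

print(consume("<p>well-formed</p>"))     # ('xml', 'well-formed')
print(consume("<p>sloppy<br>markup"))    # ('html', 'sloppymarkup')
```

The strict path rejects anything that is not well-formed, so the consumer never needs to know in advance whether a given producer honors polyglot invariants.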


> > What I'm asking is this: does this happen in the real world?
>
> Yes.
>
> > Under what circumstances?
>
> Structured document repositories
> Legal case files
> Digital archives
> Database views
> Email repositories
> Software specifications
> Anything projecting well-defined data structures into HTML
>

So "programs writing programs for programs".


> > How frequently?
>
> Every time a programmatic producer wishes to serve an XML consumer and
> an HTML consumer with fewer special cases.
>
> > On the open web (where I expect that the
> > contract about what is and isn't XML are even more important), or inside
> > closed systems and organizations?
>
> When you publicly publish something and declare your intent, you are
> on the "open web".


I think you'll struggle to get most W3C members to accept that definition.


>  > I don't see that the TAG has any duty to the latter, so it's an honest
> question.
>
> Even "closed" systems export data and use off-the-shelf browsers.
> Furthermore, many of these "closed" systems will be opening up in the
> future. The TAG has a responsibility to guide publishers and
> implementors who wish to support W3C standard formats in their systems
> that do or may interact with the web.


Our job is not to sell the web to a possible new audience -- it doesn't
need our help and we're the last group I can imagine being effective as
salespeople -- it's to help publishers understand how the rules work so
that they can join it and to help spec authors make sure the rules are sane
in the long-run.


> > My personal experience leads me away from assuming that this is common.
>
> Mine as well. I didn't realize that only the most common case deserves
> attention. What is your threshold for consideration?
>
> > I'm looking for countering evidence in order to be able to form an
> informed
> > opinion. So the question is open (ISTM): who are the consumers that do
> not
> > adapt to publishers?
>
> Why cannot publishers decide to publish content with maximal compatibility?
>

Why can't I publish a binary stream of bytes that's both a PNG and a BMP?

I'm honestly trying to understand the real-world harm in giving up on
polyglot. So far I don't sense that there's much to be lost that can't be
easily won again through common and well-understood strategies -- the sorts
of things that browsers and all sorts of other off-the-shelf software
already do.


> If niche publishers assume that consumers will adapt, they may find
> that the hassle of adaptation has hurt their reach.
>

What hassle? Seriously, if you're consuming from a single fixed producer
*you know what you're getting* and can build your software accordingly.
From the producer's side, of course you're going to publish for the maximum
reach and capability *across the existing population of consumers*. If
transcoding is needed and can be automated (which it can here)...why is
this an issue?


> If it costs a publisher 1 hour of labor to tweak their systems to
> output polyglot and this offers their consumers access to a new
> ecosystem of tools and libraries, is it not worth it?
>

If they could spend that hour slotting in a transcoder that publishes in
the other one, addressing that same new market, is it not worth it?
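Such a transcoder is indeed a small job; a hedged sketch in standard-library Python (a production version would sit on a full HTML5 parser, and the class name is illustrative):

```python
# Sketch of the "slot in a transcoder" option: lenient HTML in,
# well-formed XML out, so one source serves both audiences.
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

VOID = {"br", "img", "hr", "meta", "link", "input"}  # abbreviated list

class Transcoder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out, self.stack = [], []
    def handle_starttag(self, tag, attrs):
        a = "".join(f" {k}={quoteattr(v or '')}" for k, v in attrs)
        if tag in VOID:
            self.out.append(f"<{tag}{a}/>")  # void elements self-close in XML
        else:
            self.out.append(f"<{tag}{a}>")
            self.stack.append(tag)
    def handle_endtag(self, tag):
        if tag in self.stack:                # ignore stray end tags
            while self.stack:                # close implicitly-open elements
                open_tag = self.stack.pop()
                self.out.append(f"</{open_tag}>")
                if open_tag == tag:
                    break
    def handle_data(self, data):
        self.out.append(escape(data))
    def result(self):
        while self.stack:                    # close anything left open
            self.out.append(f"</{self.stack.pop()}>")
        return "".join(self.out)

t = Transcoder()
t.feed("<p>one<br>two</p>")
print(t.result())  # <p>one<br/>two</p>
```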


> Should each consumer adapt individually? Should the producer generate
> and disseminate 2x the documents for XML vs. HTML consumers? A subset
> of the syntax and semantics are provably compatible.
>
> Suppose a niche publisher has 10 consumers. It costs the publisher k
> to ensure polyglot invariants on their product. It costs each consumer
> in excess of k to wire together a lenient parser. How is that
> efficient?
>
> I don't understand: how does polyglot burden you?


That's not the bar to be met. The question is: what's the value to the web
of demanding that we add it as a constraint on the development of HTML?


> How is it
> detrimental? If there is detriment, does it exceed the harmless desire
> of some producers to produce maximally compatible content?
>
> > I observe many consumers that adapt and few producers who do
> (particularly
> > granted the time-shifted nature of produced content and the availability
> of
> > more transistors every year).
>
> And so we must reinforce the status quo by vetoing publication of
> guidelines for maximal compatibility?


I'm not saying what I *wish* would happen, I'm saying this *does* happen
over and above the objections of system authors who loathe the additional
complexity and all the rest.


> >> >> Polyglot producers benefit from only needing to produce a single
> >> >> representation for both HTML and XML consumers.
> >> >
> >> > What's the value to them in this? Yes, producers want to enable wide
> >> > consumption of their content, but nearly every computer sold can parse
> >> > both
> >> > HTML and XML with off-the-shelf software. The marginal gain is...what?
> >>
> >> 1. Smaller library dependency in software consumers
> >
> > But evidence suggests that valuable content is transformed by eager
> > producers, not rejected. Consuming code that yields more value (can
> consume
> > more content) does better in the market.
>
> A significant fraction of consuming code is not on the market.
>
> > How is the value manifested for users of this code?
>
> Invariants are preserved and can be relied on.
>
> Interpreted languages typically provide invariants regarding machine
> security that native executables do not. Declarative representations
> provide invariants regarding interpretation (termination) that
> imperative representations do not.
>
> Likewise, adherence to XML's syntax provides guarantees that
> interpretability by an HTML parser does not. This guarantee has value
> for consumers in the form of broader choice and faster time to
> construct consuming software.


So this is about welcoming our XML overlords?

I think that ship sailed (and sank).


>  > And are we supposed to assume that disk space is more
> > limited year-over-year (vs the historical trend)?
>
> "Smaller" in terms of complexity/cost/availability. It is cheaper for
> consumers to select one of XML or HTML vs. HTML only by definition. If
> the consumer wants to do front-end transformation as you describe,
> they now require BOTH XML and HTML which is larger and more
> complicated than either parser in isolation.
>
> >> 2. Wider interoperability with deployed systems
> >
> > But that hinges on assuming that consumers do not adapt, but rather that
> > producers do (and retroactively!?)
>
> Why should deployed consumers be forced to adapt if a producer can
> anticipate this need? I don't understand what you mean by
> retroactively in this context.
>
> >> 3. Choice of internal data model, choice of parse strategy
> >
> > Who is this valuable to? And isn't that value preserved by
> transformation?
>
> It is valuable to some consumer implementors. Requiring transformation
> denies consumers the ability to choose their transformation parse
> strategy which in turn denies consumers the ability to choose their
> intermediate representation. If your target repository gives you
> invariants, why increase the complexity of your intake system?
>
> >> > Again, is this about production in a closed system or between
> >> > systems/groups/organizations?
> >>
> >> Nothing is closed. Communication requires two parties. It should not
> >> be assumed that those parties co-operate. This applies even in a
> >> "closed" system. Send and receive systems evolve independently. Your
> >> distinction lacks a difference.
> >
> > I don't think it does. Cooperating parties are more likely to settle on
> > stricter, more complete contracts (even if only through shared, unstated
> > assumptions). Parties further away in space and time must find ways to
> > adapt.
>
> Producers can anticipate consumers' needs in the future. Not all
> producers are careless about document quality.
>
> > I'm noting that this has led most systems that scale beyond one
> > "sphere of control" to be more forgiving about what they accept over
> time,
> > not less.
>
> I do not deny this. This does not legitimize enforcing ignorance of
> maximally compatible techniques for those producers who wish to use
> them.
>
> > Here at Google we run MASSIVE systems that communicate over very fiddly
> > protocols.
>
> That's nice.
>
> > We can do this because we control the entire ecosystem in which
> > these systems live...in theory. But even as we've evolved them, we've
> found
> > that we must build multiple parsers into our binaries for even
> "restrictive"
> > data encodings. It just seems to happen, no matter intention or policy.
>
> I understand this systemic tendency. Ideally, a consumer delegates
> this parsing concern to a module that handles discrepancies. Judging
> by the complexity of the complete HTML5 vs. the XML parser, XML
> parsers are easier to construct even with their quirks and input
> discrepancies. This suggests that XML parsers will be more widely
> available, of higher quality, and more standard in their output. Do
> you have evidence to suggest otherwise?
>
> >> > If the content is valuable, it is consumers who invariably adapt.
> >>
> >> Free software is often valuable but consumers do not "invariably
> >> adapt" due to practical barriers. In much the same way, publishers may
> >> have user bases that are best served by providing additional
> >> guarantees on the well-formedness (resp. ease-of-use) of their
> >> documents.
> >
> > I'm trying to understand what the real-world costs are.
>
> Costs of what? Adopting polyglot? They are minimal in my experience.
> Cost of consumer adaptation due to arbitrary documents? They are
> higher than the cost of consumer adaptation to well-formed documents
> as well-formed documents are a subset of arbitrary documents.
>
> > Free software isn't comparable, as it's not content per se.
>
> I disagree. Free software provides a description of some useful
> computation or coordination. Similarly, many classes of content
> provide descriptions of structured data to be consumed by both humans
> and machines. You appear to be advocating for making it harder to
> produce structured documents to be easily consumed by machines
> (programmed by time-constrained humans).
>
> > A book or movie might be. Does the free software make it easier to read
> the book or movie? That's the analog.
>
> Is the body of US law a "book" or "free software"? It appears to share
> traits with both. Would publication of US law in a strict format make
> it easier or harder to consume by people and bots?
>
> >> > This is how the incentives and rewards in time-delayed consumption are
> >> > aligned.
> >>
> >> Your market has perfect information?
> >
> > My questions are all about how information-deprived consumers will get
> through the day.
>
> And mine are all about why you feel it is necessary to deprive
> producers of the knowledge to provide maximally compatible content to
> those voracious consumers.
>
> >> Consumers experience no switching costs? Nobody has
> >> lock-in or legacy? No field deployment? No corporate hegemony games?
> >> Are you advocating O(N) where N = number of consumers adaptations
> >> instead of O(1) where 1 = producer adaptation?
> >>
> >> Or perhaps you regard O(N) = O(1) because the agency of the *average*
> >> End User has been reduced to a choice between a handful of
> >> general-purpose browsers?
> >
> > I think at this point you've convinced me that you're not interested in
> > answering the question
>
> That's odd. My answering of your question should demonstrate my
> interest. If my answer is not to your satisfaction or you do not
> understand some tenet on which I base my answer, that is a different
> matter.
>
> > and, perhaps frustratingly for both of us, helped me
> > understand that Polyglot isn't a real-world concern
>
> Maximal compatibility is not a concern?
>
> > (although, do feel free
> > to convince me otherwise with better arguments and data...I'm keenly
> > interested to see them).
>
> I cannot force you to understand the utility of high-quality content.
> What kind of data and arguments would you find more convincing? How
> much evidence is necessary before it becomes OK to tell people about
> techniques for maximal compatibility?
>
> >> > Keep in mind that Postel's Law isn't a nice-to-have, it's a
> description
> >> > of
> >> > what invariably happens when any system hits scale.
> >>
> >> Great! Why are you advocating censoring how to be more conservative in
> >> what you emit? We have "hit scale" and, for some publishers, that
> >> includes allowing for consumers which only understand XML or only
> >> understand HTML.
> >>
> >> "Be conservative in what you send, liberal in what you accept."
> >>
> >> It takes both halves to make it work.
> >
> > I was inaccurate. The first half of the law *is* a nice-to-have (by
> > definition). The second is a description of what happens when systems hit
> > scale, invariably. I should have been clearer. Apologies.
>
> The first half of the law is what producers who want to be maximally
> compatible do. These producers take communication failure very
> seriously and, to them, it is not simply "nice-to-have", it is a
> requirement. Just because general consumers must accept liberal input
> does not legitimize denying producers the information required to
> produce conservative output.
>
> David
>
Received on Friday, 25 January 2013 22:12:54 UTC

This archive was generated by hypermail 2.3.1 : Monday, 29 September 2014 09:39:36 UTC