Re: NU's polyglot possibilities (Was: The non-polyglot elephant in the room) from David Sheets on 2013-01-25 (www-tag@w3.org from January 2013)

From: David Sheets <kosmo.zb@gmail.com>
Date: Fri, 25 Jan 2013 13:16:44 -0800
To: Alex Russell <slightlyoff@google.com>
Cc: "Michael[tm] Smith" <mike@w3.org>, Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>, public-html WG <public-html@w3.org>, "www-tag@w3.org List" <www-tag@w3.org>
Message-ID: <CAAWM5Ty8s0wArhbNdaDmuVwX3MfkQknhDzCguMf324qyJG=A9g@mail.gmail.com>
On Fri, Jan 25, 2013 at 11:48 AM, Alex Russell <slightlyoff@google.com> wrote:
> On Thu, Jan 24, 2013 at 11:46 PM, David Sheets <kosmo.zb@gmail.com> wrote:
>>
>> On Thu, Jan 24, 2013 at 4:44 PM, Alex Russell <slightlyoff@google.com>
>> wrote:
>> > On Thu, Jan 24, 2013 at 6:29 PM, David Sheets <kosmo.zb@gmail.com>
>> > wrote:
>> >>
>> >> On Thu, Jan 24, 2013 at 2:14 PM, Alex Russell <slightlyoff@google.com>
>> >> wrote:
>> >> > I find myself asking (without an obvious answer): who benefits from
>> >> > the
>> >> > creation of polyglot documents?
>> >>
>> >> Polyglot consumers benefit from only needing an HTML parser *or* an
>> >> XML parser for a single representation.
>> >
>> > That's just a tautology. "People who wish to consume a set of documents
>> > known to be in a single encoding only need one decoder". It doesn't
>> > illuminate any of the questions about the boundaries between
>> > producers/consumers that I posed.
>>
>> "People who wish to consume a set of documents known to simultaneously
>> be in multiple equivalent encodings only need one of several
>> decoders."
>>
>> That doesn't appear tautological to me. Check your cardinality. The
>> Axiom of Choice comes to mind.
>
> It appears to me that you've skipped a step ahead of answering my question
> and are dismissing it on an assumption I'm not making (hence you think it's
> not a tautology).

Let us find our common misunderstanding and resolve it.

> You posit a group of consumers who have one preference or another (a hard
> preference, at that) and wish me to treat this binary-separable group as
> uniform. You then posit a producer who would like to address this group of
> consumers. You further wish me (AFAICT) to assume that these
> demanding consumers are fully aware of the polyglot nature of the producer's
> content through unspecified means.

Suppose you are publishing technical documentation. You already have a
toolchain constructed to ensure very specific invariants on your
output documents. Your consumers are savvy and may wish to script
against your documentation. You perceive that for a small cost
(reading the polyglot spec and tweaking your toolchain to emit it), you can simplify
consumption for your user base.

> What I'm asking is this: does this happen in the real world?

Yes.

> Under what circumstances?

Structured document repositories
Legal case files
Digital archives
Database views
Email repositories
Software specifications
Anything projecting well-defined data structures into HTML

> How frequently?

Every time a programmatic producer wishes to serve an XML consumer and
an HTML consumer with fewer special cases.

> On the open web (where I expect that the
> contract about what is and isn't XML is even more important), or inside
> closed systems and organizations?

When you publicly publish something and declare your intent, you are
on the "open web".

> I don't see that the TAG has any duty to the latter, so it's an honest question.

Even "closed" systems export data and use off-the-shelf browsers.
Furthermore, many of these "closed" systems will be opening up in the
future. The TAG has a responsibility to guide publishers and
implementors who wish to support W3C standard formats in their systems
that do or may interact with the web.

> My personal experience leads me away from assuming that this is common.

Mine as well. I didn't realize that only the most common case deserves
attention. What is your threshold for consideration?

> I'm looking for countering evidence in order to be able to form an informed
> opinion. So the question is open (ISTM): who are the consumers that do not
> adapt to publishers?

Why cannot publishers decide to publish content with maximal compatibility?

If niche publishers assume that consumers will adapt, they may find
that the hassle of adaptation has hurt their reach.

If it costs a publisher 1 hour of labor to tweak their systems to
output polyglot and this offers their consumers access to a new
ecosystem of tools and libraries, is it not worth it?

Should each consumer adapt individually? Should the producer generate
and disseminate 2x the documents for XML vs. HTML consumers? A subset
of the syntax and semantics are provably compatible.

Suppose a niche publisher has 10 consumers. It costs the publisher k
to ensure polyglot invariants on their product. It costs each consumer
in excess of k to wire together a lenient parser. How is that
efficient?

I don't understand: how does polyglot burden you? How is it
detrimental? If there is detriment, does it exceed the harmless desire
of some producers to produce maximally compatible content?

> I observe many consumers that adapt and few producers who do (particularly
> granted the time-shifted nature of produced content and the availability of
> more transistors every year).

And so we must reinforce the status quo by vetoing publication of
guidelines for maximal compatibility?

>> >> Polyglot producers benefit from only needing to produce a single
>> >> representation for both HTML and XML consumers.
>> >
>> > What's the value to them in this? Yes, producers want to enable wide
>> > consumption of their content, but nearly every computer sold can parse
>> > both
>> > HTML and XML with off-the-shelf software. The marginal gain is...what?
>>
>> 1. Smaller library dependency in software consumers
>
> But evidence suggests that valuable content is transformed by eager
> producers, not rejected. Consuming code that yields more value (can consume
> more content) does better in the market.

A significant fraction of consuming code is not on the market.

> How is the value manifested for users of this code?

Invariants are preserved and can be relied on.

Interpreted languages typically provide invariants regarding machine
security that native executables do not. Declarative representations
provide invariants regarding interpretation (termination) that
imperative representations do not.

Likewise, adherence to XML's syntax provides guarantees that
interpretability by an HTML parser does not. This guarantee has value
for consumers in the form of broader choice and faster time to
construct consuming software.
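
As a minimal sketch of that choice, assuming a hypothetical polyglot
document (the POLYGLOT string and the helper names below are invented
for illustration, not taken from any real publisher), a consumer can
extract the same headings with either parser from Python's standard
library. Note that html.parser is a lenient tokenizer rather than a
full HTML5 tree builder, so this only illustrates the idea.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical document written to be well-formed XML and parseable as HTML.
POLYGLOT = (
    '<!DOCTYPE html>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">'
    '<head><meta charset="utf-8"/><title>Spec</title></head>'
    '<body><h1 id="intro">Introduction</h1><p>First<br/>second</p></body>'
    '</html>'
)

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def headings_via_xml(text):
    """Consumer A: a strict XML parser; fails loudly if not well-formed."""
    root = ET.fromstring(text)
    return [(el.get("id"), "".join(el.itertext()))
            for el in root.iter(XHTML_NS + "h1")]


class HeadingCollector(HTMLParser):
    """Consumer B: a lenient HTML tokenizer collecting (id, text) of h1."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._in_h1 = False
        self._current_id = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self._current_id = dict(attrs).get("id")
            self._text = []

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self.headings.append((self._current_id, "".join(self._text)))
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self._text.append(data)


def headings_via_html(text):
    collector = HeadingCollector()
    collector.feed(text)
    collector.close()
    return collector.headings


if __name__ == "__main__":
    # Both consumers read the same bytes and should agree on the result.
    print(headings_via_xml(POLYGLOT))
    print(headings_via_html(POLYGLOT))
```

Either function alone is enough for a consumer; neither needs the
other parser installed or a transformation step in front of it.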

> And are we supposed to assume that disk space is more
> limited year-over-year (vs the historical trend)?

"Smaller" in terms of complexity/cost/availability. It is cheaper for
consumers to select one of XML or HTML vs. HTML only by definition. If
the consumer wants to do front-end transformation as you describe,
they now require BOTH XML and HTML which is larger and more
complicated than either parser in isolation.

>> 2. Wider interoperability with deployed systems
>
> But that hinges on assuming that consumers do not adapt, but rather that
> producers do (and retroactively!?)

Why should deployed consumers be forced to adapt if a producer can
anticipate this need? I don't understand what you mean by
retroactively in this context.

>> 3. Choice of internal data model, choice of parse strategy
>
> Who is this valuable to? And isn't that value preserved by transformation?

It is valuable to some consumer implementors. Requiring transformation
denies consumers the ability to choose their transformation parse
strategy which in turn denies consumers the ability to choose their
intermediate representation. If your target repository gives you
invariants, why increase the complexity of your intake system?

>> > Again, is this about production in a closed system or between
>> > systems/groups/organizations?
>>
>> Nothing is closed. Communication requires two parties. It should not
>> be assumed that those parties co-operate. This applies even in a
>> "closed" system. Send and receive systems evolve independently. Your
>> distinction lacks a difference.
>
> I don't think it does. Cooperating parties are more likely to settle on
> stricter, more complete contracts (even if only through shared, unstated
> assumptions). Parties further away in space and time must find ways to
> adapt.

Producers can anticipate consumers' needs in the future. Not all
producers are careless about document quality.

> I'm noting that this has led most systems that scale beyond one
> "sphere of control" to be more forgiving about what they accept over time,
> not less.

I do not deny this. This does not legitimize enforcing ignorance of
maximally compatible techniques for those producers who wish to use
them.

> Here at Google we run MASSIVE systems that communicate over very fiddly
> protocols.

That's nice.

> We can do this because we control the entire ecosystem in which
> these systems live...in theory. But even as we've evolved them, we've found
> that we must build multiple parsers into our binaries for even "restrictive"
> data encodings. It just seems to happen, no matter intention or policy.

I understand this systemic tendency. Ideally, a consumer delegates
this parsing concern to a module that handles discrepancies. Judging
by the complexity of a complete HTML5 parser vs. an XML parser, XML
parsers are easier to construct even with their quirks and input
discrepancies. This suggests that XML parsers will be more widely
available, of higher quality, and more standard in their output. Do
you have evidence to suggest otherwise?

>> > If the content is valuable, it is consumers who invariably adapt.
>>
>> Free software is often valuable but consumers do not "invariably
>> adapt" due to practical barriers. In much the same way, publishers may
>> have user bases that are best served by providing additional
>> guarantees on the well-formedness (resp. ease-of-use) of their
>> documents.
>
> I'm trying to understand what the real-world costs are.

Costs of what? Adopting polyglot? They are minimal in my experience.
Costs of consumer adaptation to arbitrary documents? They are
higher than the costs of consumer adaptation to well-formed documents,
as well-formed documents are a subset of arbitrary documents.

> Free software isn't comparable, as it's not content per se.

I disagree. Free software provides a description of some useful
computation or coordination. Similarly, many classes of content
provide descriptions of structured data to be consumed by both humans
and machines. You appear to be advocating for making it harder to
produce structured documents to be easily consumed by machines
(programmed by time-constrained humans).

> A book or movie might be. Does the free software make it easier to read the book or movie? That's the analog.

Is the body of US law a "book" or "free software"? It appears to share
traits with both. Would publication of US law in a strict format make
it easier or harder to consume by people and bots?

>> > This is how the incentives and rewards in time-delayed consumption are
>> > aligned.
>>
>> Your market has perfect information?
>
> My questions are all about how information-deprived consumers will get through the day.

And mine are all about why you feel it is necessary to deprive
producers of the knowledge to provide maximally compatible content to
those voracious consumers.

>> Consumers experience no switching costs? Nobody has
>> lock-in or legacy? No field deployment? No corporate hegemony games?
>> Are you advocating O(N) where N = number of consumer adaptations
>> instead of O(1) where 1 = producer adaptation?
>>
>> Or perhaps you regard O(N) = O(1) because the agency of the *average*
>> End User has been reduced to a choice between a handful of
>> general-purpose browsers?
>
> I think at this point you've convinced me that you're not interested in
> answering the question

That's odd. My answering of your question should demonstrate my
interest. If my answer is not to your satisfaction or you do not
understand some tenet on which I base my answer, that is a different
matter.

> and, perhaps frustratingly for both of us, helped me
> understand that Polyglot isn't a real-world concern

Maximal compatibility is not a concern?

> (although, do feel free
> to convince me otherwise with better arguments and data...I'm keenly
> interested to see them).

I cannot force you to understand the utility of high-quality content.
What kind of data and arguments would you find more convincing? How
much evidence is necessary before it becomes OK to tell people about
techniques for maximal compatibility?

>> > Keep in mind that Postel's Law isn't a nice-to-have, it's a description
>> > of what invariably happens when any system hits scale.
>>
>> Great! Why are you advocating censoring how to be more conservative in
>> what you emit? We have "hit scale" and, for some publishers, that
>> includes allowing for consumers which only understand XML or only
>> understand HTML.
>>
>> "Be conservative in what you send, liberal in what you accept."
>>
>> It takes both halves to make it work.
>
> I was inaccurate. The first half of the law *is* a nice-to-have (by
> definition). The second is a description of what happens when systems hit
> scale, invariably. I should have been clearer. Apologies.

The first half of the law is what producers who want to be maximally
compatible do. These producers take communication failure very
seriously and, to them, it is not simply "nice-to-have", it is a
requirement. Just because general consumers must accept liberal input
does not legitimize denying producers the information required to
produce conservative output.

David
Received on Friday, 25 January 2013 21:17:13 UTC