Re: NU's polyglot possibilities (Was: The non-polyglot elephant in the room) from David Sheets on 2013-01-26 (public-html@w3.org from January 2013)

From: David Sheets <kosmo.zb@gmail.com>
Date: Fri, 25 Jan 2013 16:36:26 -0800
To: Alex Russell <slightlyoff@google.com>
Cc: "Michael[tm] Smith" <mike@w3.org>, Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>, public-html WG <public-html@w3.org>, "www-tag@w3.org List" <www-tag@w3.org>
Message-ID: <CAAWM5Tyr-6XXEyfa-rijiwuSODXGv103UYAzmoGKMt7g4voP_A@mail.gmail.com>
On Fri, Jan 25, 2013 at 2:11 PM, Alex Russell <slightlyoff@google.com> wrote:
>
> On Fri, Jan 25, 2013 at 4:16 PM, David Sheets <kosmo.zb@gmail.com> wrote:
>>
>> On Fri, Jan 25, 2013 at 11:48 AM, Alex Russell <slightlyoff@google.com>
>> wrote:
>> > On Thu, Jan 24, 2013 at 11:46 PM, David Sheets <kosmo.zb@gmail.com>
>> > wrote:
>> >>
>> >> On Thu, Jan 24, 2013 at 4:44 PM, Alex Russell <slightlyoff@google.com>
>> >> wrote:
>> >> > On Thu, Jan 24, 2013 at 6:29 PM, David Sheets <kosmo.zb@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> On Thu, Jan 24, 2013 at 2:14 PM, Alex Russell
>> >> >> <slightlyoff@google.com>
>> >> >> wrote:
>> >> >> > I find myself asking (without an obvious answer): who benefits
>> >> >> > from the creation of polyglot documents?
>> >> >>
>> >> >> Polyglot consumers benefit from only needing an HTML parser *or* an
>> >> >> XML parser for a single representation.
>> >> >
>> >> > That's just a tautology. "People who wish to consume a set of
>> >> > documents known to be in a single encoding only need one decoder".
>> >> > It doesn't illuminate any of the questions about the boundaries
>> >> > between producers/consumers that I posed.
>> >>
>> >> "People who wish to consume a set of documents known to simultaneously
>> >> be in multiple equivalent encodings only need one of several
>> >> decoders."
>> >>
>> >> That doesn't appear tautological to me. Check your cardinality. The
>> >> Axiom of Choice comes to mind.
>> >
>> > It appears to me that you've skipped a step ahead of answering my
>> > question and are dismissing it on an assumption I'm not making (hence
>> > you think it's not a tautology).
>>
>> Let us find our common misunderstanding and resolve it.
>>
>> > You posit a group of consumers who have one preference or another (a
>> > hard preference, at that) and wish me to treat this binary-separable
>> > group as uniform. You then posit a producer who would like to address
>> > this group of consumers. You further wish me (AFAICT) to assume that
>> > these demanding consumers are fully aware of the polyglot nature of
>> > the producer's content through unspecified means.
>>
>> Suppose you are publishing technical documentation. You already have a
>> toolchain constructed to ensure very specific invariants on your
>> output documents. Your consumers are savvy and may wish to script
>> against your documentation. You perceive that for a small cost
>> (reading polyglot spec and tweaking to emit it), you can simplify
>> consumption for your user base.
>
> This works with a single producer and consumer who have a fixed contract.

This works with any number of producers and consumers who have a
"fixed contract". For simplicity, let's call this "fixed contract" a
"standard".

> That's sort of the definition of a closed system...and it's not the web.

Any strictly standardized communication format is a closed system? The
internet isn't standardized? The web is closed? Clearly in every
large-scale system some emitters will be in error and some consumers
will be lenient. That doesn't obviate the need for standards or excuse
lack of quality control.

> Why aren't they just publishing as one or the other?

Why must they pick between broad compatibility and automation if both
are possible, trivially, in a single representation?
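
A minimal sketch of what that single representation buys a consumer, assuming Python's standard library stands in for both kinds of parser. The sample document, the function names and the `HTML_ONLY` variant are made up for illustration, the sample has not been validated against the polyglot draft, and `html.parser` is a lenient tokenizer rather than a full HTML5 tree builder:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical document written in the polyglot style: void elements are
# self-closed, the XHTML namespace is declared, no named entities are used.
POLYGLOT = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head><meta charset="utf-8"/><title>Release notes</title></head>
<body><p>First line<br/>second line</p></body>
</html>"""

# Same content with an unclosed <br>: acceptable as HTML, not well-formed XML.
HTML_ONLY = POLYGLOT.replace("<br/>", "<br>")


def title_via_xml(doc):
    """Read the title using only an XML parser."""
    root = ET.fromstring(doc)
    return root.find(f".//{XHTML_NS}title").text


class _TitleGrabber(HTMLParser):
    """Collect the text inside <title> using only an HTML tokenizer."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def title_via_html(doc):
    """Read the title using only the HTML tokenizer."""
    grabber = _TitleGrabber()
    grabber.feed(doc)
    grabber.close()
    return grabber.title


def is_well_formed_xml(doc):
    """Report whether an XML parser accepts the document at all."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False
```

Both `title_via_xml(POLYGLOT)` and `title_via_html(POLYGLOT)` read the same title from the same bytes, so the consumer picks whichever parser it already has. `is_well_formed_xml(HTML_ONLY)` is false, which is the breakage an XML-only consumer hits when the producer makes no such promise.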

> And if the tweaks are so small (but necessary), why isn't this a job for software?

The tweaks are small because of a shared heritage which allows
significant intersection in conforming representations. After
publication of a Polyglot Recommendation, new systems which elect to
conform will not need tweaking.

Why do people write portable C? Why not write platform-specific C and
then write some software to make the small tweaks?

> Consumers who want to process more than a single producer's content either have to:
>
> 1. Have a reliable way to know that what they consume isn't going to be broken
>    (as HTML in XML parsing is)
> 2. Have a way of consuming a superset of any individual publisher's formats
>
> Either work, but polyglot precludes #1 on the basis that #2 shouldn't have
> to happen, against all the evidence of how this sort of thing is sorted out
> every day by real world software.

I think you are mistaken in your belief that polyglot precludes #1.
This is like saying that writing portable C makes a strictly
conforming C compiler impossible or worthless.

There is lots of real world software working every day that supports a
superset of any individual publisher's format in various XML
vocabularies. You have some evidence of consuming systems that strive
to be maximally general. This evidence does not negate the evidence
that there are systems that produce and consume content that strictly
adheres to standards. Not everyone is an HTML parser implementor or
has easy access to a plug-in HTML parser. Not everyone wants or needs
to deal with broken representations. Not everyone holds their
consumers in such contempt as to force them to adopt HTML parsers.

Can the web have sub-communities using document standards or is it
Google's "good enough" way only?

Should W3C remain silent on how their standards interact?

>> > What I'm asking is this: does this happen in the real world?
>>
>> Yes.
>>
>> > Under what circumstances?
>>
>> Structured document repositories
>> Legal case files
>> Digital archives
>> Database views
>> Email repositories
>> Software specifications
>> Anything projecting well-defined data structures into HTML
>
> So "programs writing programs for programs".

HTML documents are programs now? I thought you were just arguing that
they shared nothing with free software?

And is the concept of "programs writing data for programs for humans"
so foreign to require indignation? Doesn't this describe essentially
every standard data format ever devised?

>> > How frequently?
>>
>> Every time a programmatic producer wishes to serve an XML consumer and
>> an HTML consumer with fewer special cases.
>>
>> > On the open web (where I expect that the
>> > contract about what is and isn't XML are even more important), or inside
>> > closed systems and organizations?
>>
>> When you publicly publish something and declare your intent, you are
>> on the "open web".
>
> I think you'll struggle to get most W3C members to accept that definition.

What definition do you suggest "most W3C members" would accept for
"open web"? Does "open web" exclude some transports? Some formats?
Perhaps we have different ideas on what "open" means.

>> > I don't see that the TAG has any duty to the latter, so it's an honest
>> > question.
>>
>> Even "closed" systems export data and use off-the-shelf browsers.
>> Furthermore, many of these "closed" systems will be opening up in the
>> future. The TAG has a responsibility to guide publishers and
>> implementors who wish to support W3C standard formats in their systems
>> that do or may interact with the web.
>
> Our job is not to sell the web to a possible new audience -- it doesn't need
> our help and we're the last group I can imagine being effective as
> salespeople

You are responding to a figment. I mentioned nothing of sales or
marketing. The publishers and implementors are already sold and "wish
to support W3C standard formats in their systems that do or may
interact with the web".

> -- it's to help publishers understand how the rules work so that
> they can join it and to help spec authors make sure the rules are sane in
> the long-run.

I believe that we agree here.

Do you feel that the polyglot document does not help publishers
understand the (X)HTML syntax?

I believe that the polyglot document serves precisely this purpose.

Do you feel that the polyglot document hurts long-term viability of
the standards?

I believe that the polyglot document decreases fragmentation and
guides spec authors to more sane rules.

Do you feel that the unelected, top-down structure of HTML
standardization should be given greater leeway to further fragment
implementations and introduce special cases? On what grounds?

>> > My personal experience leads me away from assuming that this is common.
>>
>> Mine as well. I didn't realize that only the most common case deserves
>> attention. What is your threshold for consideration?
>>
>> > I'm looking for countering evidence in order to be able to form an
>> > informed opinion. So the question is open (ISTM): who are the
>> > consumers that do not adapt to publishers?
>>
>> Why cannot publishers decide to publish content with maximal
>> compatibility?
>
> Why can't I publish a binary stream of bytes that's both a PNG and a BMP?

You probably can, but probably not in a way that conforms to both
formats simultaneously. These formats are much farther apart than
HTML and XHTML. The HTML5 specification defines both HTML and XHTML
syntaxes in a single document with many overlapping concepts.

What do you find objectionable about publishers leveraging this fact?

> I'm honestly trying to understand the real-world harm in giving up on
> polyglot.

What is the real-world harm in giving up on XML?

Wasted labor, fragmented syntax, requisite reimplementation, no
suitable replacement...

> So far I don't sense that there's much to be lost that can't be
> easily won again through common and well-understood strategies -- the sorts
> of things that browsers and all sorts of other off-the-shelf software
> already do.

You are trading away a very cheap improvement that yields simplicity
benefits to some consumers for an expensive, global improvement and
the general adoption of data format fatalism and software to support
it. Why are you taking away publishers' choice?

>> If niche publishers assume that consumers will adapt, they may find
>> that the hassle of adaptation has hurt their reach.
>
> What hassle? Seriously, if you're consuming from a single fixed producer
> *you know what you're getting* and can build your software accordingly.

Until the producer changes their output and your hacky regexes don't
work or your assumption about their page structure becomes invalid. If
several producers instead say "here is a standard method that we
encourage you to use and share among our community" and they then
stick to this promise (through, say, publishing software that enforces
it), who are you to tell them "no"?

> From
> the producer's side, of course you're going to publish for the maximum reach
> and capability *across the existing population of consumers*. If transcoding
> is needed and can be automated (which it can here)...why is this an issue?

When the publisher's need to serve their consumers can be guaranteed
to be met by a single format, why necessitate transcoding?

>> If it costs a publisher 1 hour of labor to tweak their systems to
>> output polyglot and this offers their consumers access to a new
>> ecosystem of tools and libraries, is it not worth it?
>
> If they could spend that hour slotting in a transcoder that publishes in the
> other one, addressing that same new market, is it not worth it?

Why increase the number of representations published? Why deal with a
transcoder? If they can have a single pipeline that serves all their
users' needs, why force them to support multiple representations?

>> Should each consumer adapt individually? Should the producer generate
>> and disseminate 2x the documents for XML vs. HTML consumers? A subset
>> of the syntax and semantics are provably compatible.
>>
>> Suppose a niche publisher has 10 consumers. It costs the publisher k
>> to ensure polyglot invariants on their product. It costs each consumer
>> in excess of k to wire together a lenient parser. How is that
>> efficient?
>>
>> I don't understand: how does polyglot burden you?
>
> That's the the bar to be met. The question is: what's the value to the web
> of demanding that we add it as a constraint on the development of HTML?

How does it unreasonably constrain the development of HTML?

What's the value to the web of throwing away thousands of man-years of
effort into XML tooling?

What's the value to the web of mandating an increasingly baroque
language with an increasingly idiosyncratic and complex parser?

Who benefits from mandating complexity barriers? Is it the independent
developer or the public corporation?

>> How is it
>> detrimental? If there is detriment, does it exceed the harmless desire
>> of some producers to produce maximally compatible content?
>>
>> > I observe many consumers that adapt and few producers who do
>> > (particularly granted the time-shifted nature of produced content and
>> > the availability of more transistors every year).
>>
>> And so we must reinforce the status quo by vetoing publication of
>> guidelines for maximal compatibility?
>
> I'm not saying what I *wish* would happen, I'm saying this *does* happen
> over and above the objections of system authors who loathe the additional
> complexity and all the rest.

And you are using this generalization of "the wild" to justify a
formal stance of "striving for quality is pointless so we shouldn't
tell producers how" which will undermine existing communities with
minimal resources which, for purposes of self-preservation, already
favor adherence to standards.

Is there any place for those who wish to adhere to standards? Why does
HTML5 specify authoring constraints that are stricter than what
conformant HTML5 parsers will accept?

How are you so certain that we must dissuade producers from publishing
polyglot documents? What clear and present danger does Recommendation
of polyglot present to the technical architecture of the WWW?

>> >> >> Polyglot producers benefit from only needing to produce a single
>> >> >> representation for both HTML and XML consumers.
>> >> >
>> >> > What's the value to them in this? Yes, producers want to enable wide
>> >> > consumption of their content, but nearly every computer sold can
>> >> > parse both HTML and XML with off-the-shelf software. The marginal
>> >> > gain is...what?
>> >>
>> >> 1. Smaller library dependency in software consumers
>> >
>> > But evidence suggests that valuable content is transformed by eager
>> > producers, not rejected. Consuming code that yields more value (can
>> > consume more content) does better in the market.
>>
>> A significant fraction of consuming code is not on the market.
>>
>> > How is the value manifested for users of this code?
>>
>> Invariants are preserved and can be relied on.
>>
>> Interpreted languages typically provide invariants regarding machine
>> security that native executables do not. Declarative representations
>> provide invariants regarding interpretation (termination) that
>> imperative representations do not.
>>
>> Likewise, adherence to XML's syntax provides guarantees that
>> interpretability by an HTML parser does not. This guarantee has value
>> for consumers in the form of broader choice and faster time to
>> construct consuming software.
>
> So this is about welcoming our XML overlords?

You are free not to use anything related to XML.

That said, I'd rather have the *option* of XML overlords than
corporate overlords deciding which guidelines for interoperability
between standards may be published.

> I think that ship sailed (and sank).

Perhaps in the mass market that is true. Many niche communities still
find significant value in a standard set of useful tools, derision by
pop culture programmers notwithstanding.

Do you have a replacement for XML you would like to offer? What harm
will you experience if polyglot becomes a REC?

David
Received on Saturday, 26 January 2013 00:36:58 UTC