
Re: The non-polyglot elephant in the room

From: David Sheets <kosmo.zb@gmail.com>
Date: Mon, 21 Jan 2013 14:16:29 -0800
Message-ID: <CAAWM5Twqeyy4zxVCHPSUzLVz30x0Bi6fNV7mWC0HAwhyFQabgw@mail.gmail.com>
To: Kingsley Idehen <kidehen@openlinksw.com>
Cc: www-tag@w3.org
On Mon, Jan 21, 2013 at 1:25 PM, Kingsley Idehen <kidehen@openlinksw.com> wrote:
> On 1/21/13 4:15 PM, David Sheets wrote:
>>
>> On Mon, Jan 21, 2013 at 11:47 AM, Kingsley Idehen
>> <kidehen@openlinksw.com> wrote:
>>>
>>> On 1/21/13 2:19 PM, Melvin Carvalho wrote:
>>>
>>> On 21 January 2013 20:13, Anne van Kesteren <annevk@annevk.nl> wrote:
>>>>
>>>> On Mon, Jan 21, 2013 at 7:24 PM, Kingsley Idehen
>>>> <kidehen@openlinksw.com>
>>>> wrote:
>>>>>
>>>>> Please correct me if my characterization is wrong, but it appears to me
>>>>> that this entire affair is about content-type (mime type) squatting i.e.,
>>>>> trying to squeeze (X)HTML into content-type: text/html. If this is true,
>>>>> why on earth would such an endeavor be encouraged by the W3C or its TAG?
>>
>> How is the definition of *a valid subset of text/html* squatting?
>
>
> Is XHTML now a subset of HTML? Is (X)HTML a subset of HTML? As I stated, as
> part of my open comments, what am I missing in my characterization?

It's not clear to me that they have that relation. There does,
however, exist a set of documents that are valid as both HTML and
XHTML.

>>>> Maybe because XML is listed quite prominently under "What is Web
>>>> architecture?" in http://www.w3.org/2004/10/27-tag-charter.html though
>>>> I would consider that particular part of the charter misguided. (It's
>>>> also not at all practiced these days.)
>>
>> This is plainly false. Existence of new XML vocabularies demonstrates
>> practice. It cannot also be true that it is "not at all practiced
>> these days".
>>
>>> This is a good point, imho.  In 2004 it was perhaps reasonable to make a
>>> 'bet' on XML.  However, favouring any one particular serialization
>>> potentially lacks future proofing.  However, favouring the principles
>>> behind XML, such as namespacing etc., makes more sense.
>>
>> Fragmentation is not future-proof.
>>
>>> Wikipedia has a reasonably nice write up on this topic:
>>>
>>> http://en.wikipedia.org/wiki/Comparison_of_data_serialization_formats
>>>
>>>
>>> At this juncture though, my main question is about XHTML or (X)HTML (the
>>> polyglot) being squeezed into content-type designation: text/html. In
>>> reality we have two content types with distinct characteristics which
>>> thereby entails two distinct content-types: text/html (for HTML) and
>>> application/xhtml+xml (for XHTML).
>>>
>>> Put differently, there is no content-type for the (X)HTML polyglot. Thus,
>>> we have the struggle right now which is all about trying to make text/html
>>> the designated content-type for the aforementioned polyglot.
>>
>> I was under the impression that an explicit goal of standardizing the
>> HTML5 parser was so that HTML consumers and producers could rely on a
>> de jure interpretation of nonsensical markup. While many consider
>> XML's restrictions nonsensical, it is prima facie absurd that
>> champions of HTML5's apologetic parser refuse to consider the subset
>> of HTML5 that is also valid XHTML5 as clearly important to a
>> population of authors.
>
> So this is the key point of contention i.e., XHTML5 (unlike other XHTML
> incarnations) is a genuine subset of HTML.

I don't believe they have this relation. There is a set of documents
that satisfies both standards, however.

>> From my perspective, anti-polyglot proponents advocate global
>> text/html interpretation of nearly everything *except* XHTML.
>
> Can you point me to an example? I ask primarily for clarity.

I'm not sure which assertion you'd like an example for so I've made
some guesses.

<http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#parsing>
describes a very lenient HTML parser which defines an interpretation
for many strange text/html documents ("Hello, <I><b>world</i>!").
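As a rough illustration (using Python's permissive stdlib tokenizer, a
hypothetical stand-in for a full HTML5 tree builder), a lenient parser
happily emits events for that misnested markup rather than rejecting it:

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Record start/end/data events without complaining about misnesting."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

parser = TagLogger()
parser.feed("Hello, <I><b>world</i>!")
parser.close()
print(parser.events)
# The misnested </i> (with <b> still open) and the never-closed <b>
# are reported as-is; a conforming HTML5 tree builder goes further and
# repairs the tree via the adoption agency algorithm.
```

Note the difference in ambition: this tokenizer merely tolerates the
input, while the spec's tree construction stage assigns it one
well-defined DOM.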

Advocacy for this lenient parser has been prevalent for several years.
The same standard provides guidance on writing "good" HTML which does
not take advantage of all of the HTML parser's quirks. This is a
subset of HTML.

For evidence of resistance to the standard definition of the subset of
HTML that is also XHTML with the same meaning, you need only look to
this and previous threads on this topic.

>> XHTML is
>> stricter than HTML and polyglot serializations *should* exist for any
>> DOM (at least one would hope, what with the complexity burden of a
>> fully conformant HTML parser).
>>
>> Are there legitimate technical architecture objections to specifying
>> the set intersection of XHTML and HTML expressions?
>
>
> Potentially, once you attempt to write parsers for HTML5 resources that
> include Microdata and/or RDFa structured data islands.

How does the definition of a mutually compatible subset complicate
HTML5 parsers that include Microdata/RDFa? By definition, the
polyglot subset must work in existing HTML5 parsers. If HTML5 semantic
markup cannot share syntax with XHTML, the document cannot be
serialized in a polyglot fashion. What have I missed?
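To make the last point concrete with a small, hypothetical example:
markup that leans on HTML5-only syntax (unquoted attributes, optional
close tags) is rejected by an XML parser, so a document written that
way has no polyglot serialization as-is, while the equivalent markup
in the intersection parses fine:

```python
import xml.etree.ElementTree as ET

# Legal HTML5, but not well-formed XML: unquoted attribute value
# and unclosed <li> elements.
html_only = "<ul id=menu><li>One<li>Two</ul>"

try:
    ET.fromstring(html_only)
except ET.ParseError as err:
    print("XML parser rejects it:", err)

# The same content re-serialized in the polyglot intersection:
polyglot = '<ul id="menu"><li>One</li><li>Two</li></ul>'
tree = ET.fromstring(polyglot)
print([li.text for li in tree])  # ['One', 'Two']
```

The content is identical; only the serialization decides whether an
XML consumer can read it at all.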

>>
>> I believe that there are many who would be interested in such
>> guidelines who are typically underrepresented in these discussions.
>>
>> I am genuinely confused by arguments which appear to encourage liberal
>> emission and deride conservative emission. Are web standards no longer
>> concerned with robustness? HTML's new parser specification appears to
>> disagree...
>
> Once there's clarification on the issue of HTML and XHTML5 subset, the
> problems will become clear. All you have to do is attempt to use or write a
> parser for structured data (MicroData, Microformats, RDFa) embedded in an
> HTML5 document.

I began just this task and have since put it on hold due to the
complexity of the HTML5 parsing algorithm and personal time
constraints. If a document is both valid HTML5 and valid XHTML5, how is
handling this content harder than just handling HTML5 content?

There may certainly be important classes of documents for which no
polyglot serialization is possible. Unique HTML features, unique XHTML
features, and the HTML/XHTML overlap are all important for author
education. Aesthetically, I would like as few special cases as
possible in both HTML and XHTML, and greater syntax compatibility
between them; but with a faction advocating extensive bugwards
compatibility and showing disinterest in unification, I'm not holding
my breath.

> In my experience, undue burden is being pushed on the developers of parsers.

Which parser developers does a polyglot spec burden? A polyglot
document should be parsable by both HTML5 and XML parsers without
modification.
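A sketch of that claim, with stdlib parsers standing in for
production HTML5 and XML parsers: one polyglot string, fed unmodified
to both sides.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# A minimal polyglot fragment: lowercase tags, quoted attributes,
# explicit close tags, void element written as <br />.
doc = '<p id="greet">Hello,<br />world</p>'

# XML side: must be well-formed.
root = ET.fromstring(doc)
assert root.tag == "p" and root.attrib["id"] == "greet"

# HTML side: the same characters tokenize cleanly; the trailing slash
# on <br /> is simply ignored, as an HTML5 parser would ignore it.
class Collector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

c = Collector()
c.feed(doc)
c.close()
print(c.tags)  # ['p', 'br']
```

A real polyglot document carries further obligations (for example, the
XHTML namespace on the root element); this only sketches the "one byte
stream, two parsers" idea.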
Received on Monday, 21 January 2013 22:17:23 UTC
