[Bug 11909] The principles of Polyglot Markup - validity? well-formed? DOM-equality? from bugzilla@jessica.w3.org on 2011-01-29 (public-html-bugzilla@w3.org from January 2011)

From: <bugzilla@jessica.w3.org>
Date: Sat, 29 Jan 2011 12:50:32 +0000
To: public-html-bugzilla@w3.org
Message-Id: <E1PjAG8-0001Us-3U@jessica.w3.org>
http://www.w3.org/Bugs/Public/show_bug.cgi?id=11909

--- Comment #5 from Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no> 2011-01-29 12:50:31 UTC ---
Following the discussion with David, I would reformulate and expand my
suggested principles section like so:

]] 
Section I: Principle and base rules

HTML-compatible XHTML documents are, syntactically, XML documents that are
authored according to conditions set by the HTML DOM and scripted according
to the limitations defined by XML. In such documents the HTML-parser is
triggered to use the rendering mode most equivalent to XML (no-quirks mode),
and the same CSS can be used in both XML-mode and HTML-mode. Thus
HTML-compatibility means equivalence in the fields of DOM, CSS and scripting,
irrespective of HTML-parsing or XML-parsing. Conformance (validity) of an
HTML-compatible XHTML document is governed by the HTML-standard that the author
has followed - this document exemplifies how to create HTML5-conforming
polyglot markup.

The above leads to the following sentences about what HTML-compatible XHTML is:

Polyglot Markup

1) is about how to replicate HTML's automatic DOM in XML;
2) follows a subset of well-formed XML where,
    HTML-conformance notwithstanding, it is the resulting 
    HTML DOM  which defines the XML-syntax rules. 
3) is scripted according to the rules of XML (no document.write)
4) triggers no-quirks mode in HTML parsers, since this is the mode
    most equivalent to XML-mode rendering, both with regard to
    DOM and CSS;
5) has some exceptions w.r.t. DOM-equivalence on attribute
    level due to some required XML namespace attributes.
6) rules out some HTML-elements because they are impossible
    to replicate in an XML parser;
7) results in the same encoding and the same language in both 
    HTML-mode and XML-mode.
8) is validated for conformance according to an applicable 
    HTML-standard - the HTML-conformance rules affect the
    DOM exceptions, i.e. which inequalities are
    tolerable.
9) does not need to be XML-valid. XML-validity requires a
    DTD, but HTML (in particular HTML5) seeks to avoid DTDs
    as they have no effect in HTML-parsers. [DTD-authoring
    advice goes here.]

<!-- then I would outline those sentences/principles before, finally, describing
HTML5-conforming polyglots: -->

== 1. Replicating HTML's automatic DOM in XML ==

Extra rules from HTML's point of view - but also from XML's point of view: 
In HTML, it is permitted to drop lots of syntax, as it gets auto-created in the
DOM. In XML there is no such automation, so the code must be written
explicitly. Thus one must use the "</p>", and one must use <html>, <body>,
<head>, <colgroup> etc.  [Provide a list of the automated DOM-productions that
HTML offers - this list can be updated as HTML6 is specced and so on.]

Extra rule from HTML's point of view: Attribute normalization belongs here. 

Links to relevant sections in XML1 and HTML5.
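
As a rough illustration of this principle, the following Python sketch parses
a document with the standard library's XML parser and reports elements that an
HTML parser would create automatically but that are absent from the markup.
The names `missing_explicit_elements` and `q`, and the choice of elements to
check (including <tbody>, which the text above only covers by "etc."), are
assumptions made for the example and not part of the proposed specification.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def q(name):
    """Return the namespace-qualified tag name used by ElementTree."""
    return "{%s}%s" % (XHTML_NS, name)


def missing_explicit_elements(source):
    """List elements that HTML would auto-create but that are not written out.

    Raises ET.ParseError if the source is not well-formed XML, for
    example when a "</p>" end tag has been omitted.
    """
    root = ET.fromstring(source)
    missing = []
    if root.tag != q("html"):
        missing.append("html")
    for name in ("head", "body"):
        # find() with a plain tag name only looks at direct children.
        if root.find(q(name)) is None:
            missing.append(name)
    for table in root.iter(q("table")):
        # A <col> directly under <table> gets a <colgroup> wrapper in HTML.
        if table.find(q("col")) is not None:
            missing.append("colgroup")
        # A <tr> directly under <table> gets a <tbody> wrapper in HTML.
        if table.find(q("tr")) is not None:
            missing.append("tbody")
    return missing
```

Calling the function on a document that omits <head> and places <col> and
<tr> directly inside <table> should report those three omissions, whereas a
document that omits "</p>" is rejected by the XML parser before any check runs.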

== 2. Subset of well-formed XML - governed by the resulting HTML-DOM  ==

Describe exceptions from XML's POV: when <foo/> can be used and when
<foo></foo> must be used. Etc. Without mixing conformance into the issue.

Describe the (most important) extra rules from HTML's POV: escaping '<' and '&'
etc.
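
To make the <foo/> versus <foo></foo> distinction concrete, here is a small
Python sketch that scans the source text for the two mistakes an HTML parser
would interpret differently from an XML parser: self-closing syntax on a
non-void element, and an end tag on a void element. The function name
`syntax_problems`, the regular expressions and the abridged `VOID` set are
assumptions for illustration; the authoritative list of void elements is the
one in the applicable HTML standard. The scan is deliberately naive: it does
not skip comments, CDATA sections or foreign (SVG/MathML) content.

```python
import re

# Abridged, illustrative set of void elements (assumption; see the HTML spec).
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}

# <name .../> : a start tag that uses XML's self-closing syntax.
SELF_CLOSED = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*?/>")
# </name> : an explicit end tag.
END_TAG = re.compile(r"</([A-Za-z][A-Za-z0-9]*)\s*>")


def syntax_problems(source):
    """Return (kind, element name) pairs for tags that are not polyglot-safe."""
    problems = []
    for match in SELF_CLOSED.finditer(source):
        if match.group(1) not in VOID:
            # HTML ignores the "/" here, so the element would stay open.
            problems.append(("self-closed non-void", match.group(1)))
    for match in END_TAG.finditer(source):
        if match.group(1) in VOID:
            # HTML does not expect an end tag for a void element.
            problems.append(("end tag on void", match.group(1)))
    return problems
```

For the input `<p>a<br/>b</p><div/><br></br>` the sketch should flag `<div/>`
and `</br>` while accepting `<br/>` and `</p>`.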

== 3. Scripting ==

document.write is forbidden - etc.

== 4. No-quirks mode ==

Only no-quirks doctypes are permitted, or else the page is rendered differently
in HTML vs XML. For the same reason, a doctype that triggers no-quirks mode is
required (except inside the @srcdoc attribute). Also, say that in some legacy
HTML-parsers, <?xml version="1.0" ?> triggers quirks mode. The same also
happens (in IE6, IE7, IE8) if there is a <!--comment--> before the DOCTYPE.
No-quirks mode is an absolute requirement. If legacy user agents with such
behavior are not an issue, then neither the XML declaration nor such comments
are a problem (however, HTML-conformance rules may forbid them).
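
The checks described in this section could be sketched in Python roughly as
follows. The function name `prolog_warnings` and the warning strings are
hypothetical, and the sketch only looks at what precedes the DOCTYPE: it does
not try to decide whether a given doctype is itself a no-quirks doctype, and
it does not handle the @srcdoc exception.

```python
import re

DOCTYPE = re.compile(r"<!DOCTYPE\s", re.IGNORECASE)


def prolog_warnings(source):
    """Return warnings about constructs that may trigger quirks mode."""
    match = DOCTYPE.search(source)
    if match is None:
        # Without any DOCTYPE, HTML parsers fall back to quirks mode.
        return ["no DOCTYPE"]
    warnings = []
    before = source[:match.start()]
    if "<?xml" in before:
        # Reported to trigger quirks mode in some legacy HTML parsers.
        warnings.append("XML declaration before DOCTYPE")
    if "<!--" in before:
        # Reported to trigger quirks mode in IE6, IE7 and IE8.
        warnings.append("comment before DOCTYPE")
    return warnings
```

A caller that does not care about legacy user agents could simply ignore the
two "before DOCTYPE" warnings and treat only "no DOCTYPE" as an error.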

== 5. Equality exceptions ==

xml:lang, xmlns etc. are permitted even though they result in a different DOM.
Justification: required by XML. Unless these differences were accepted,
polyglots would not be possible.

== 6. Banning of some HTML elements ==

Some HTML elements can't be used in XML, e.g. noscript, plaintext, etc.

== 7.  Internationalization ==

Polyglot Markup needs both xml:lang and lang, or else we get a language
difference. Polyglot Markup should use UTF-8, for such and such reasons: can be
detected by XML-parser, HTML5-conformance permits it and more. Polyglot Markup
which isn't UTF-8 or UTF-16 could use <?xml version="1.0" encoding="ISO-8859-1"
?>, however this could lead to non-polyglottness (quirks-mode) in some legacy
parsers as well as non-validity in HTML5 - if this is an issue, then - for
non-UTF8/16 encodings - authors *must* use an external HTTP header to set the
encoding. Polyglot Markup RECOMMENDS UTF-8.
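
The lang/xml:lang requirement lends itself to a mechanical check. The Python
sketch below, which assumes the document is already well-formed XML, lists the
elements on which only one of the two attributes is present or on which their
values differ. The function name `language_mismatches` is hypothetical, and
comparing the values case-insensitively is an assumption of this example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def language_mismatches(source):
    """Return local names of elements whose lang and xml:lang disagree."""
    root = ET.fromstring(source)
    mismatches = []
    for element in root.iter():
        lang = element.get("lang")
        xml_lang = element.get(XML_LANG)
        if lang is None and xml_lang is None:
            continue  # neither attribute set: nothing to compare
        if lang is None or xml_lang is None or lang.lower() != xml_lang.lower():
            # Strip the "{namespace}" prefix ElementTree puts on tag names.
            mismatches.append(element.tag.split("}")[-1])
    return mismatches
```

An element carrying only lang="nb" would be reported, since an XML parser
would not see that language, while lang="en" together with xml:lang="EN"
would pass.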

== 8. Validation according to an HTML-standard ==

This specification does not say which HTML-standard to validate against, but
defines general rules. However, HTML5 is the basis for our thinking.
HTML5-validation is the only validation we are aware of which properly takes
the DOM into account - other validation services, such as XHTML1.0 validation
by W3C, are known for not taking the DOM into account. (That said,
HTML5-validation follows many rules that are not at all related to the DOM.)

== 9. XML-validity ==

XML-validity is only an issue if the DOCTYPE contains a DTD. Some advice about
how to author a DTD, if one is wanted - say that @id should be CDATA and so on.
Say that @id in a polyglot is CDATA, and thus not subject to XML 1.0's name
production.

Section II: HTML5-specific examples

<!-- Here most of what is already in the spec can be used. -->

[[

Received on Saturday, 29 January 2011 12:50:33 UTC