- From: CVS User egraff <cvsmail@w3.org>
- Date: Fri, 16 May 2014 16:08:25 +0000
- To: public-html-commits@w3.org
Update of /sources/public/html5/html-polyglot In directory roscoe:/tmp/cvs-serv28057 Added Files: CR-html-polyglot-20140530.html Log Message: Initial draft --- /sources/public/html5/html-polyglot/CR-html-polyglot-20140530.html 2014/05/16 16:08:25 NONE +++ /sources/public/html5/html-polyglot/CR-html-polyglot-20140530.html 2014/05/16 16:08:25 1.1 <!DOCTYPE html> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" lang="en-US" > <head> <title>Polyglot Markup: A robust profile of the HTML5 vocabulary</title> <meta charset="utf-8" /> <script class="remove" src="http://www.w3.org/Tools/respec/respec-w3c-common" async=""></script> <script class="remove"> var respecConfig = { specStatus: "CR", shortName: "html-polyglot", publishDate: "2014-05-07", previousPublishDate: "2010-10-19", // previousDiffURI: "http://htmlwg.org/heartbeat/WD-html-polyglot-20131008/", previousMaturity: "WD", edDraftURI: "http://dev.w3.org/html5/html-polyglot/html-polyglot.html", crEnd: "2014-08-30", implementationReportURI: "http://www.webplatform.org", editors: [ { name: "Eliot Graff", company: "Microsoft Corporation" }, { name: "Leif H. 
Silli", company: "<small>&</small>ᴍᴇᴛᴏᴅɪᴜꜱ ᴅᴀ"} ], wg: "HTML working group", wgURI: "http://www.w3.org/html/wg/", wgPublicList: "public-html", wgPatentURI: "http://www.w3.org/2004/01/pp-impl/40318/status" }; </script> <style>table.simple tr>*:first-child{text-align:right;} table.simple th code{color:yellow;font-weight:bold;font-size:larger;} table.simple [colspan="2"]{text-align:center;} table.simple [colspan="3"]{text-align:center;} ul.inline-list {white-space:normal} ul.inline-list li {display:inline;} ul.inline-list li:after {content:",";} ul.inline-list li:last-child:after {content:"";} </style> </head> <body> <section id="abstract"> A document that uses <a title="polyglot markup">polyglot markup</a> is a document that is a stream of bytes that parses into identical document trees (with some exceptions, as noted in the <a href="#introduction">Introduction</a>) when processed either as HTML or when processed as XML. Polyglot markup that meets a well-defined set of constraints is interpreted as compatible, regardless of whether it is processed as HTML or as XHTML, per the HTML5 specification. Polyglot markup uses a specific DOCTYPE, namespace declarations, and a specific case—normally lower case but occasionally camel case—for element and attribute names. Polyglot markup uses lower case for certain attribute values. Further constraints include those on void elements, named entity references, and the use of scripts and style. <!--End section: Abstract--> </section> <section id="sotd"> <p> This specification summarizes design guidelines for authors who wish their XHTML or HTML documents to be conforming whether parsed as HTML or as XML. The document is intended to be useful to web authors, in particular those who want to serve receivers without concern for whether they have XML or HTML parsers available. Such concerns may, for instance, arise in content syndication or when receivers are on legacy systems. 
HTML polyglots <a href="http://www.w3.org/html/wg/drafts/html/master/document-metadata.html#charset-0">facilitate migration to and from XHTML</a>, including transition from XML 1.x to HTML5, and this document serves to accurately specify the requirements of a UTF-8 based profile for such documents. </p> <p> No recommendation is made in this document or by the W3C regarding whether or not to publish polyglot content. In general, authors are encouraged to publish HTML content using HTML5 syntax and media types (either HTML syntax and <code>text/html</code>, or XHTML syntax and <code>application/xhtml+xml</code>). </p> <p> This document is not a specification for user agents and creates no obligations on user agents. Note that this document does not define how HTML5-conforming user agents should process HTML documents. Nor does it define the meaning of the Internet Media Type <code>text/html</code>. For user agent guidance and for these definitions, see [[!HTML5]] and [[!RFC2854]]. </p> <p> Please submit bugs for this document by using the W3C's public bug database (<a href="http://www.w3.org/Bugs/Public/"> http://www.w3.org/Bugs/Public/</a>) with the product set to <kbd>HTML WG</kbd> and the component set to <kbd>HTML/XHTML Compatibility Authoring Guide (ed: Eliot Graff)</kbd>. If you cannot access the bug database, submit comments by email to the mailing list noted below. </p> <!--End section: Status of This Document--> </section> <section id="conformance"></section> <!-- note: for principle section In <a>polyglot markup</a>, the strings that XML and HTML interpret differently are considered <dfn>ambiguous strings</dfn> and MUST NOT be used except when they are explicitly permitted (such as for the ambiguous namespace prefix <code>xml:</code>, which is permitted as the prefix for the <code>lang</code> attribute in the XML namespace – <code>xml:lang</code>).
--> <section id="introduction" class="informative"><h2>Introduction</h2> <p>It is sometimes valuable to be able to serve HTML5 documents that are also well-formed XML documents. An author may, for example, use XML tools to generate a document, and they and others may process the document using XML tools. The language used to create documents that can be parsed by both HTML and XML parsers is called <a title="polyglot markup">polyglot markup</a>. <a title="polyglot markup">Polyglot markup</a> is the overlap language of documents that are both HTML5 documents and XML documents. It is recommended that these documents be served as either <code>text/html</code> (if the content is transmitted to an HTML-aware user agent) or <code>application/xhtml+xml</code> (if the content is transmitted to an XHTML-aware user agent). Other permissible MIME types are <code>text/xml</code>, <code>application/xml</code>, and any MIME type whose subtype ends with the four characters "<code>+xml</code>". [[!XML-MT]]</p> <!--end general--> <section id="scope"> <h3>Scope</h3> <p>Polyglot markup is a <em><a href="#dfn-robust-syntax">robust</a></em> – but entirely <em>optional</em> – profile of the HTML vocabulary. Not all web content needs to be authored in <a>polyglot markup</a>; it is primarily an option for authors wanting increased <a href="#dfn-robust-syntax">robustness</a> of their documents. <a title="polyglot markup">Polyglot markup</a> works best, and can be a beneficial option, in controlled environments and for authoring tools.</p> <p><a title="polyglot markup">Polyglot markup</a> is ideal for publishing when there's a strong desire to serve both HTML and XML tool chains without simultaneously having to maintain dual copies of the content: one in HTML and a second in XHTML. In addition, a single <a>polyglot markup</a> output requires less infrastructure than producing both HTML and XHTML output for the same content.
<a title="polyglot markup">Polyglot markup</a> is also beneficial when lightweight processes, such as quick testing or even hand-authoring, are applied to content intended to be published both as HTML and XHTML, especially if that content is not sent through a tool chain.</p> <p class="note">XML-based HTML tools or systems intended for the most general contexts of use cannot <strong><em>depend</em></strong> on polyglot input: for maximum flexibility, such tools should use an HTML parser that produces an XML-compatible DOM or event stream.</p> <!--End section: Scope--> </section> <section id="robust"> <h3>Robustness</h3> <p>The goal of <a title="polyglot markup">polyglot markup</a> is a syntax that is <a href="#dfn-robust-syntax">robust</a> the way the Web Content Accessibility Guidelines (WCAG) 2.0 describes it: <q cite="http://www.w3.org/TR/WCAG20/#ensure-compat">Maximize compatibility with current and future user agents, including assistive technologies.</q> [[WCAG20]] </p> <p>Authors need not understand the benefits of <a href="#dfn-robust-syntax">robustness</a> in order to benefit from the syntax of polyglot markup. However, in order to promote its benefits, it is necessary to understand that <a title="polyglot markup">polyglot markup</a> does not add semantics, and as such is not any more or less semantic than other flavors of HTML. Polyglot markup does, however, work to <em>preserve</em> semantics, including during the authoring process. Polyglot markup also does not ensure accessibility, as it does not add any accessibility requirements that other relevant specifications have not already added.
But <a>polyglot markup</a> can work to <em>preserve</em> accessibility through adherence to required practices.</p> <p>Polyglot markup approaches <a href="#dfn-robust-syntax">robustness</a> by defining constraints on the serialization of a DOM tree in a manner that is likely to retain semantics when that serialization is reparsed using a variety of parsers, be they full-featured and bug-free HTML5 parsers, somewhat HTML-aware parsers, or even XML parsers.</p> <p> For the most part, <a title="polyglot markup">polyglot markup</a> is just a pure deduction of the validity constraints and syntax requirements that HTML and XHTML each dictate, many of which took "polyglotness" into consideration when they were added to HTML5. However, for reasons of <a href="#dfn-robust-syntax">robustness</a>, this specification sometimes goes further than the principle of the lowest common denominator would have required.</p> <p> For instance, included in the set of constraints on the serialization is the requirement to use the UTF-8 encoding. While not the only theoretical possibility, the choice of UTF-8 as the sole option is justified by the underlying principle of <a href="#dfn-robust-syntax">robustness</a>. For example, if someone opted to use the <code>KOI8-R</code> encoding, then, as a side-effect of HTML-conformance and XML well-formedness requirements, the author would be forced to rely on a higher protocol (such as MIME <code>Content-Type</code>) in order to support XML parsers. By requiring UTF-8, that side-effect is avoided.</p> <p>Using <a href="#dfn-robust-syntax">robust</a> syntax can enable documents to be parsed more reliably in less capable parsers. But even if the document can be expected to be parsed and validated by tools that fully conform to HTML5, <a title="polyglot markup">polyglot markup</a> adds <a href="#dfn-robust-syntax">robustness</a>.
As an example, when serialized as HTML, the closing tag for the <code>p</code> element is entirely optional and will be inferred if not present. But the inclusion of closing tags, as required by XML and thus by <a title="polyglot markup">polyglot markup</a>, causes no harm beyond a minor increase in transfer size (an increase often mitigated by compression), and it allows validators to detect situations where the implicit closing rules don't match what the author intended. </p> <p class="note"> Note that XML-based polyglot markup syntax is not the only way to increase <a href="#dfn-robust-syntax">robustness</a>. For instance, an HTML validator or an authoring tool could require all tags to be closed even if this is not required by the HTML syntax. </p> <!--End section: robust--> </section> <!-- end intro--> </section> <section id="syntax"> <h2>Syntax</h2> <section id="principles"><h3>Principles</h3> <p> <dfn id="dfn-polyglot-markup">Polyglot markup</dfn> results in: </p> <ul> <li>a valid HTML document. [[!HTML5]]</li> <li>a <a href="http://www.w3.org/TR/2008/PER-xml-20080205/#sec-well-formed">well-formed XML</a> document. [[!XML10]]</li> <li>identical DOMs when processed as HTML and when processed as XML, with some notable exceptions: HTML and XML parsers generate different DOMs for some <code>xml</code> (<code>xml:lang</code>, <code>xml:space</code>, and <code>xml:base</code>), <code>xmlns</code> (<code>xmlns=""</code> and <code>xmlns:xlink=""</code>), and <code>xlink</code> (such as <code>xlink:href</code>) attributes. XML requires and HTML5 permits these attributes in certain locations and the attributes are preserved by HTML parsers. These exceptions must not break the requirement to be a valid HTML document.
</li> </ul> <p><a>Polyglot Markup</a> specifies a <dfn id="dfn-robust-syntax">Robust Syntax</dfn>, by which it is meant a syntax that maximizes support and minimizes authoring choice.</p> <p>Support is maximized:</p> <ul> <li>by supporting both HTML and XML parsing;</li> <li>by utilizing code that, as far as possible, results in DOM-equivalent parsing in generic as well as specialized parsers, including challenged parsers of various kinds;</li> <li>because the code is ready to be reused/repurposed/reedited/reparsed in any authoring tool or parser.</li></ul> <p>Authoring choices are minimized:</p> <ul><li>through strict syntax requirements partly dictated by the polyglot approach and partly motivated by the robust approach.</li> </ul> <p> <a title="polyglot markup">Polyglot markup</a> is not constrained: </p> <ul> <li>to be <a href="http://www.w3.org/TR/2008/PER-xml-20080205/#dt-valid">valid XML</a>. [[!XML10]]</li> <li>by conformance to any XML DTD.</li> </ul> <p> <a title="polyglot markup">Polyglot markup</a> is scripted according to the rules of XML (does not use <code>document.write</code>, for example) and excludes HTML elements that are impossible to replicate in an XML parser (does not use the <code>noscript</code> element, for example). <a title="polyglot markup">Polyglot markup</a> triggers non-quirks mode in HTML parsers, as non-quirks mode is closest to XML-mode rendering, in regard to both DOM and CSS. <a title="polyglot markup">Polyglot markup</a> results in the same encoding and the same language in both HTML-mode and XML-mode. </p> <p> <a title="polyglot markup">Polyglot markup</a>, itself being valid HTML5, supports extensibility as it is defined in <a href="http://www.w3.org/TR/html5/infrastructure.html#extensibility">Section 2.2.3 Extensibility</a> of HTML5, so long as the extension does not violate the rules of <a>polyglot markup</a>.
[[!HTML5]] In addition, being well formed XML, <a>polyglot markup</a> can be extended when it is served as <code>application/xhtml+xml</code>. </p> </section> <!--End section: Syntax--> </section> <section id="writing"><h2>Writing HTML documents</h2> <section id="PI-and-xml" class="section"> <h3>Processing instructions and the XML declaration</h3> <p> Processing instructions and the XML declaration are both forbidden in <a>polyglot markup</a>. </p> <!--End section: Processing Instructions and the XML Declaration--> </section> <section id="character-encoding" class="section"> <h3>Specifying a document’s character encoding</h3> <p> <a title="polyglot markup">Polyglot markup</a> uses the UTF-8 character encoding, the only character encoding for which both HTML and XML require support. HTML requires UTF-8 to be explicitly declared to avoid <a href="http://www.w3.org/TR/html5/semantics.html#charset">fallback to a legacy encoding</a>. [[!HTML5]] </p> <p> For XML, UTF-8 is an <a href="http://www.w3.org/TR/2008/REC-xml-20081126/#charencoding">encoding default</a>. Documents served with an XML content type therefore do not need to use any of the HTML encoding declaration methods, although if the document might be interpreted as <code>text/html</code> it SHOULD do so. 
</p> <p> <a title="polyglot markup">Polyglot markup</a> declares the UTF-8 character encoding in the following ways, which may be used separately or in combination (but note that there can only be a <em>single</em> <a title="HTML encoding declaration">HTML encoding declaration</a>): </p> <ul> <li>Within the document <ul> <li>By using the Byte Order Mark (BOM) character</li> <li>By using the <dfn>HTML encoding declaration</dfn> <ul><li><strong>either</strong> in its <code>charset</code> attribute form: <code><meta charset="UTF-8"/></code></li> <li><strong>or</strong> in its alternative form: <code><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/></code></li> </ul> </li> </ul> </li> <li>Outside the document <ul> <li>By adding <code>"charset=utf-8"</code> to the MIME/HTTP Content-Type header [[!HTTP11]], as the following examples show in HTML and XML, respectively: </li> </ul> <pre class="example"> <code>Content-type: text/html; charset=utf-8</code> </pre> <pre class="example"> <code>Content-type: application/xhtml+xml; charset=utf-8</code> </pre> Note that, when serving polyglot documents as XML, <code>charset=UTF-8</code> can safely be omitted, due to the UTF-8 encoding default of XML: <pre class="example"> <code>Content-type: application/xhtml+xml</code> </pre> </li> </ul> <p class="note"> Both XML and HTML parsers are required to support the byte order mark. The HTML encoding declaration has no effect in XML. When the HTML encoding declaration is the only encoding declaration, the encoding default from XML makes XML parsers treat content as UTF-8. </p> <p> The <a href="http://www.w3.org/International/questions/qa-html-encoding-declarations">W3C Internationalization (i18n) Group recommends</a> that one always include a visible encoding declaration in an HTML document, because it helps developers, testers, or translation production managers to check the encoding of a document visually. 
</p> <!--End section: Specifying a Document's Character Encoding--> </section> <section id="doctype" class="section"> <h3>The DOCTYPE</h3> <p> <a title="polyglot markup">Polyglot markup</a> uses a document type declaration (DOCTYPE) specified by <a href="http://www.w3.org/TR/html5/syntax.html#the-doctype">section 8.1.1</a> of [[!HTML5]]. In addition, the DOCTYPE conforms to the following rules: </p> <ul> <li>The string <code>DOCTYPE</code> is in uppercase letters.</li> <li>The string <code>SYSTEM</code>, if present, is in uppercase letters.</li> <li>The string <code>PUBLIC</code>, if present, is in uppercase letters.</li> <li>A Formal Public Identifier (FPI), if present, is a case-sensitive match of the registered FPI to which it points.</li> <li>A URI, if present in the document type declaration, is a case-sensitive match of the URI to which it points. <ul> <li>If the URI is the string <code>about:legacy-compat</code>, <a>polyglot markup</a> includes the string in lowercase letters, as required by HTML5.</li> <li>If the URI is an http URL, the URI points to the correct resource, using case-sensitive letters.</li> </ul> </li> </ul> <p class="note"> For valid XML the document element named in the document type declaration must exactly match the top-level element of the document, including in case. This rule is relaxed for well-formed, rather than valid, XML documents. Because XHTML requires a lower-case <code>html</code> element, Polyglot documents SHOULD use lower-case <code>html</code> for the element named in the DOCTYPE declaration. Bear in mind that a customized XHTML DTD with element and entity declarations inside the document type definition subset within the document, or one that points to an alternate DTD, may have special case requirements. </p> <p> Note that using <code>about:legacy-compat</code> in XML may yield unpredictable parsing results, depending on the XML processing pipeline. 
</p> <p> <a title="polyglot markup">Polyglot markup</a> does not use document type declarations for HTML4, HTML3, or HTML2, regardless of whether they contain a URI or not and regardless of their effect in HTML5 parsers, as these document type declarations are not compatible with XHTML. </p> <!--End section: The DOCTYPE--> </section> <section id="namespaces" class="section"> <h3>Namespaces</h3> <p> The following rules apply to namespaces used in <a>polyglot markup</a>. </p> <section id="element-level-namespaces" class="section"> <h4>Element-level namespaces</h4> <p> [[!HTML5]] introduces undeclared (native) default namespaces for the root HTML element, <code>html</code>, the root SVG element, <code>svg</code>, and the root MathML element, <code>math</code>. <a title="polyglot markup">Polyglot markup</a> declares the following default namespaces, when the markup languages are included in the document, to maintain XML compatibility [[!XML10]]:</p> <ul class="inline-list"> <li><code><html xmlns="http://www.w3.org/1999/xhtml"></code></li> <li><code><math xmlns="http://www.w3.org/1998/Math/MathML"></code></li> <li><code><svg xmlns="http://www.w3.org/2000/svg"></code></li> </ul> <p> <a title="polyglot markup">Polyglot markup</a> declares the default namespaces on the root HTML element, <code>html</code>, the root SVG element, <code>svg</code>, and the root MathML element <code>math</code>, and on any HTML elements used as children of SVG or MathML elements. <a title="polyglot markup">Polyglot markup</a> does not declare any other default or prefixed element namespace, because [[!HTML5]] does not natively support the declaring of any other default or prefixed element namespace. </p> <!-- End section, "Element-Level Namespaces" --> </section> <section id="attribute-level-namespaces" class="section"> <h4>Attribute-level namespaces</h4> <p> [[!HTML5]] introduces undeclared (native) support for attributes in the XLink namespace and with the prefix <code>xlink:</code>. 
To maintain XML-compatibility, <a title="polyglot markup">polyglot markup</a> explicitly declares the XLink namespace (<code>xmlns:xlink="http://www.w3.org/1999/xlink"</code>). [[!XML10]]</p> <p>For conformance with the HTML specification’s conformance rules, the declaration has to take place in each foreign content section where it is used, typically on such a section’s root element (e.g. on the <code>svg</code> start tag for an SVG section and on the <code>math</code> start tag for a MathML section), since the declaration must occur before using any of the <code>xlink:</code> prefixed attributes: </p> <ul class="inline-list"> <li><code>xlink:actuate</code></li> <li><code>xlink:arcrole</code></li> <li><code>xlink:href</code></li> <li><code>xlink:role</code></li> <li><code>xlink:show</code></li> <li><code>xlink:title</code></li> <li><code>xlink:type</code></li> </ul> <p> The <code>xml:</code> namespace prefix used in <code>xml:base</code>, <code>xml:lang</code>, <code>xml:space</code>, and <code>xml:id</code> does not need to be declared in XML documents, and therefore <a>polyglot markup</a> does not declare this prefix via <code>xmlns</code>. The prefix is implicitly declared in XML and is automatically applied to the appropriate attributes in HTML. See CSS namespaces [[!CSS3NAMESPACE]] for how to use CSS selectors with these attributes. </p> <p> For more about the issues related to attribute selectors and namespaces, with and without prefixes, see the section on <a href="#scripting-and-styling-polyglot-markup">Scripting and styling polyglot markup</a>.
</p> <!-- End section, "Attribute-Level Namespaces" --> </section> <!--End section: Namespaces--> </section> <section id="elements" class="section"> <h3>Element syntax</h3> <p><a title="polyglot markup">Polyglot markup</a> conforms to the following rules regarding elements.</p> <section id="required-elements" class="section"> <h6>Required elements and tags</h6> <p> <a title="polyglot markup">Polyglot markup</a> does not employ <a>optional tags</a>. HTML5’s concept of <dfn>optional tags</dfn> – missing start tags and/or end tags – covers <a href="http://www.w3.org/TR/html5/syntax.html#optional-tags"> elements that the HTML parser itself automatically adds to the DOM</a> if the code doesn’t contain the tags for them. Because XML does not have such a feature that adds missing start and/or end tags to the DOM, omitting a tag in <a>polyglot markup</a> is equivalent to producing a document that is not well-formed or, if both tags are omitted, equivalent to not adding the element at all. </p> <p>The fact that <a>polyglot markup</a> doesn’t operate with optional tags may create surprises for an author not used to adding the <code>tbody</code> tags in their markup, for example, or to someone accustomed to omitting the end tag of the <code>p</code> element. However, the requirement to be well-formed with regard to tags is a key feature of <a>polyglot markup</a> that makes the code <a href="#dfn-robust-syntax">robust</a> against subpar parsers and authoring surprises. </p> <section id="minimal-polyglot-html-document"> <h4>A minimal HTML document</h4> <p> Every <a>polyglot markup</a> document therefore contains an <code>html</code>, <code>head</code>, <code>title</code>, and <code>body</code> element. The <code>html</code> element is the root element. The <code>head</code> and <code>body</code> elements are children of the <code>html</code> element. The <code>title</code> element is a child of the <code>head</code> element. 
Therefore, the following is the most basic <a>polyglot markup</a> document. </p> <pre class="example highlight"><!DOCTYPE html> <html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang=""> <head> <title></title> </head> <body> </body> </html> </pre> <!--End section: A minimal HTML document --> </section> <section id="required-tags-exampls"> <h4>Required element examples</h4> <p> Whenever it uses a <code>tr</code> element, <a>polyglot markup</a> always wraps the <code>tr</code> element inside a <code>tbody</code>, <code>thead</code>, or <code>tfoot</code> element. In HTML, if a group of one or more adjacent <code>tr</code> elements is not explicitly wrapped inside a <code>tbody</code>, <code>thead</code>, or <code>tfoot</code> element, the HTML parser creates and wraps a new <code>tbody</code> element around the <code>tr</code> elements. XML parsers do not create the <code>tbody</code> element, thus offering the potential for creating different DOMs. [973 lines skipped]
Received on Friday, 16 May 2014 16:08:28 UTC