
html5/spec Overview.html,1.1095,1.1096

From: Ian Hickson via cvs-syncmail <cvsmail@w3.org>
Date: Wed, 23 Jul 2008 02:02:58 +0000
To: public-html-commits@w3.org
Message-Id: <E1KLTgw-0000UW-AP@lionel-hutz.w3.org>

Update of /sources/public/html5/spec
In directory hutz:/tmp/cvs-serv1671

Modified Files:
	Overview.html 
Log Message:
Provide a way to mutate the DOM into an infoset. (Bug 5808) (credit: hs) (whatwg r1907)

Index: Overview.html
===================================================================
RCS file: /sources/public/html5/spec/Overview.html,v
retrieving revision 1.1095
retrieving revision 1.1096
diff -u -d -r1.1095 -r1.1096
--- Overview.html	23 Jul 2008 01:05:20 -0000	1.1095
+++ Overview.html	23 Jul 2008 02:02:55 -0000	1.1096
@@ -2018,6 +2018,9 @@
         </ul>
 
        <li><a href="#the-end"><span class=secno>8.2.6 </span>The end</a>
+
+       <li><a href="#coercing"><span class=secno>8.2.7 </span>Coercing an
+        HTML DOM into an infoset</a>
       </ul>
 
      <li><a href="#namespaces"><span class=secno>8.3 </span>Namespaces</a>
@@ -51069,6 +51072,130 @@
 /parser/htmlparser/src/nsElementTable.cpp, line 1901 - // Ex: <H1><LI><H1><LI>. Inner LI has the potential of getting nested
 -->
 
+  <h4 id=coercing><span class=secno>8.2.7 </span>Coercing an HTML DOM into an
+   infoset</h4>
+
+  <p>When an application uses an <a href="#html-0">HTML parser</a> in
+   conjunction with an XML pipeline, it is possible that the constructed DOM
+   is not compatible with the XML toolchain in certain subtle ways. For
+   example, an XML toolchain might not be able to represent attributes with
+   the name <code title="">xmlns</code>, since they conflict with the
+   Namespaces in XML syntax. <a href="#references">[XMLNS]</a>
+
+  <p>There is also some data that the <a href="#html-0">HTML parser</a>
+   generates that isn't included in the DOM itself.
+
+  <p>To let tools apply a consistent set of adjustments to the output of
+   their <a href="#html-0">HTML parser</a> for compatibility with
+   the rest of their XML toolchain, this section documents a set of mutations
+   and conventions that will convert the output of the <a href="#html-0">HTML
+   parser</a> for any arbitrary input into an XML Infoset that doesn't have
+   any problematic characteristics.
+
+  <p>Tools that cannot convey the out-of-band information using out-of-band
+   mechanisms, or that cannot convey the DOM exactly as prescribed by this
+   specification, may either ignore the offending information or DOM feature,
+   or may represent it internally in the DOM using the conventions described
+   below.
+
+  <p>These conventions are not conforming HTML, and user agents must not
+   output such syntax outside of their XML pipeline.
+
+  <dl>
+   <dt>The <code>DocumentType</code> node's <code title="">name</code>, <code
+    title="">publicId</code>, and <code title="">systemId</code> attributes
+
+   <dd>If the XML API being used doesn't support DOCTYPEs, tools may drop
+    DOCTYPEs altogether or create a set of three attributes on the root
+    element, named <code title="">__doctype_name__</code>, <code
+    title="">__doctype_publicid__</code>, and <code
+    title="">__doctype_systemid__</code>, respectively, whose values are the
+    values that would have been put on the <code>DocumentType</code> node.
+
+   <dt>The document being set to <i><a href="#no-quirks">no quirks
+    mode</a></i>, <i><a href="#limited1">limited quirks mode</a></i>, or
+    <i><a href="#quirks">quirks mode</a></i>
+
+   <dd>To convey this information, create an attribute <code
+    title="">__mode__</code> on the root element, with values "noquirks",
+    "limitedquirks", or "quirks" respectively.
+
+   <dt>Elements that have a namespace without appropriate <code
+    title="">xmlns</code> attributes being in scope
+
+   <dd>Construct the DOM as if appropriate namespace declarations were in
+    scope.
+
+   <dt>Elements whose names contain U+003A COLON (:) characters or characters
+    that cannot be represented in XML element names
+
+   <dd>Drop the element and all its children, or replace any offending
+    characters with a U+005F LOW LINE (_) character.
+
+   <dt>Attributes named <code title="">xmlns</code> whose values match the
+    namespace of the element node
+
+   <dd>Construct the DOM as if these were default namespace declarations.
+
+   <dt>Attributes named <code title="">xmlns:xlink</code> whose values match
+    the <a href="#xlink">XLink namespace</a>, on elements whose namespace is
+    not the <a href="#html-namespace0">HTML namespace</a>
+
+   <dd>Construct the DOM as if these were namespace prefix declarations.
+
+   <dt>Other attributes whose names are <code title="">xmlns</code> or start
+    with <code title="">xmlns:</code>
+
+   <dd>Drop the attributes or add two U+005F LOW LINE (_) characters to the
+    start of the attributes' names and replace any U+003A COLON (:)
+    characters with a U+005F LOW LINE (_) character.
+
+   <dt>Other attributes in no namespace whose names contain U+003A COLON (:)
+    characters
+
+   <dt>Attributes whose names contain characters that cannot be represented
+    in XML attribute names
+
+   <dd>Drop the attributes or replace any offending characters with a U+005F
+    LOW LINE (_) character, dropping any attributes where doing this would
+    cause an attribute name clash.
+
+   <dt>Form controls being associated with forms that aren't their nearest
+    ancestor (use of the <a href="#form-element"><code>form</code> element
+    pointer</a>)
+
+   <dd>Create an attribute <code title="">__formid__</code> on the form, with
+    a value unique amongst <code title="">__formid__</code> attributes in the
+    document, and create an attribute <code title="">__form__</code> on the
+    form control, whose value matches the unique identifier given to the
+    form.
+
+   <dt>Any U+000C FORM FEED (FF) character
+
+   <dd>Replace the character with a U+0020 SPACE character.
+
+   <dt>Any other literal non-XML character
+
+   <dd>Replace the character with a U+FFFD REPLACEMENT CHARACTER.
+
+   <dt>A comment that contains two adjacent U+002D HYPHEN-MINUS characters
+    (--)
+
+   <dd>Insert a U+0020 SPACE character between them.
+  </dl>
+
+  <p>Tools that use these conventions should guard against documents that
+   include markup that clashes with them by always dropping all attributes in
+   the document that start with two U+005F LOW LINE (_) characters.
+
+  <p class=note>These conventions apply <em>after</em> the <a
+   href="#html-0">HTML parser</a>'s rules have been applied. For example, a
+   <code title="">&lt;a::></code> start tag will be closed by a <code
+   title="">&lt;/a::></code> end tag, and never by a <code
+   title="">&lt;/a__></code> end tag, even if the user agent is using the
+   rules above to then generate an actual element in the DOM with the name
+   <code title="">a__</code> for that start tag.
+
   <h3 id=namespaces><span class=secno>8.3 </span>Namespaces</h3>
 
   <p>The <dfn id=html-namespace0>HTML namespace</dfn> is:
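
Illustrative note (not part of the commit): the name, attribute, character, and comment rules added in section 8.2.7 can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the specification's algorithm. The function names are hypothetical, attributes are modelled as a plain dict of no-namespace names to values, the name character ranges are taken from XML 1.0 (Fifth Edition) with the colon excluded, and whenever the text offers a choice between dropping and renaming, the sketch renames.

```python
import re

HTML_NS = "http://www.w3.org/1999/xhtml"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Assumption: XML 1.0 (Fifth Edition) NameStartChar / NameChar, minus ":".
_NAME_START = (
    "A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    "\u037F-\u1FFF\u200C\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
_NAME_CHAR = _NAME_START + "0-9.\u00B7\u0300-\u036F\u203F\u2040\\-"
_BAD_START = re.compile("[^" + _NAME_START + "]")
_BAD_CHAR = re.compile("[^" + _NAME_CHAR + "]")
# Anything outside the XML 1.0 Char production.
_NON_XML = re.compile(
    "[^\t\n\r\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def coerce_name(name):
    """Replace colons and other non-XML-name characters with "_"."""
    return _BAD_START.sub("_", name[:1]) + _BAD_CHAR.sub("_", name[1:])


def coerce_text(text):
    """FORM FEED becomes SPACE; other non-XML characters become U+FFFD."""
    return _NON_XML.sub("\uFFFD", text.replace("\f", " "))


def coerce_comment(data):
    """Separate adjacent hyphens with a space until no "--" remains."""
    data = coerce_text(data)
    while "--" in data:
        data = data.replace("--", "- -")
    return data


def coerce_attributes(attrs, element_ns=HTML_NS):
    """Return a new dict of attribute names that are safe for XML.

    Matching namespace declarations are omitted, since the DOM is built
    as if they were declarations rather than ordinary attributes.
    """
    kept, renamed = {}, []
    for name, value in attrs.items():
        if name.startswith("__"):
            continue  # guard against markup that clashes with the conventions
        if name == "xmlns" and value == element_ns:
            continue  # treated as a default namespace declaration
        if (name == "xmlns:xlink" and value == XLINK_NS
                and element_ns != HTML_NS):
            continue  # treated as a namespace prefix declaration
        if name == "xmlns" or name.startswith("xmlns:"):
            renamed.append((coerce_name("__" + name), value))
        elif coerce_name(name) == name:
            kept[name] = value
        else:
            renamed.append((coerce_name(name), value))
    for name, value in renamed:
        # A renamed attribute that would clash with an existing name is
        # dropped. Assumption: of two clashing renamed attributes, the
        # first one wins.
        if name not in kept:
            kept[name] = value
    return kept
```

For example, coerce_name("a::") yields "a__", which matches the element name used in the note at the end of the new section. Attribute values and text nodes would separately go through coerce_text.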
Received on Wednesday, 23 July 2008 02:03:33 GMT
