- From: Ian Hickson via cvs-syncmail <cvsmail@w3.org>
- Date: Wed, 23 Jul 2008 02:02:58 +0000
- To: public-html-commits@w3.org
Update of /sources/public/html5/spec
In directory hutz:/tmp/cvs-serv1671

Modified Files:
 Overview.html
Log Message:
Provide a way to mutate the DOM into an infoset. (Bug 5808) (credit: hs) (whatwg r1907)

Index: Overview.html
===================================================================
RCS file: /sources/public/html5/spec/Overview.html,v
retrieving revision 1.1095
retrieving revision 1.1096
diff -u -d -r1.1095 -r1.1096
--- Overview.html 23 Jul 2008 01:05:20 -0000 1.1095
+++ Overview.html 23 Jul 2008 02:02:55 -0000 1.1096
@@ -2018,6 +2018,9 @@
 </ul>
 <li><a href="#the-end"><span class=secno>8.2.6 </span>The end</a>
+
+ <li><a href="#coercing"><span class=secno>8.2.7 </span>Coercing an
+ HTML DOM into an infoset</a>
 </ul>
 <li><a href="#namespaces"><span class=secno>8.3 </span>Namespaces</a>
@@ -51069,6 +51072,130 @@
 /parser/htmlparser/src/nsElementTable.cpp, line 1901 - // Ex: <H1><LI><H1><LI>. Inner LI has the potential of getting nested -->
+ <h4 id=coercing><span class=secno>8.2.7 </span>Coercing an HTML DOM into an
+ infoset</h4>
+
+ <p>When an application uses an <a href="#html-0">HTML parser</a> in
+ conjunction with an XML pipeline, it is possible that the constructed DOM
+ is not compatible with the XML tool chain in certain subtle ways. For
+ example, an XML toolchain might not be able to represent attributes with
+ the name <code title="">xmlns</code>, since they conflict with the
+ Namespaces in XML syntax. <a href="#references">[XMLNS]</a>
+
+ <p>There is also some data that the <a href="#html-0">HTML parser</a>
+ generates that isn't included in the DOM itself.
+
+ <p>To allow tools to apply a consistent set of adjustments to the output of
+ their <a href="#html-0">HTML parser</a> to allow for compatibility with
+ the rest of their XML toolchain, this section documents a set of mutations
+ and conventions that will convert the output of the <a href="#html-0">HTML
+ parser</a> for any arbitrary input into an XML Infoset that doesn't have
+ any problematic characteristics.
+
+ <p>Tools that cannot convey the out-of-band information using out-of-band
+ mechanisms, or that cannot convey the DOM exact as prescribed by this
+ specification, may either ignore the offending information or DOM feature,
+ or may represent it internally in the DOM using the conventions described
+ below.
+
+ <p>These conventions are not conforming HTML, and user agents must not
+ output such syntax outside of their XML pipeline.
+
+ <dl>
+ <dt>The <code>DocumentType</code> node's <code title="">name</code>, <code
+ title="">publicId</code>, and <code title="">systemId</code> attributes
+
+ <dd>If the XML API being used doesn't support DOCTYPEs, tools may drop
+ DOCTYPEs altogether or create a set of three attributes on the root
+ element, named <code title="">__doctype_name__</code>, <code
+ title="">__doctype_publicid__</code>, and <code
+ title="">__doctype_systemid__</code>, respectively, whose values are the
+ values that would have been put on the <code>DocumentType</code> node.
+
+ <dt>The document being set to <i><a href="#no-quirks">no quirks
+ mode</a></i>, <i><a href="#limited1">limited quirks mode</a></i>, or
+ <i><a href="#quirks">quirks mode</a></i>
+
+ <dd>To convey this information, create an attribute <code
+ title="">__mode__</code> on the root element, with values "noquirks",
+ "limitedquirks", or "quirks" respectively.
+
+ <dt>Elements that have a namespace without appropriate <code
+ title="">xmlns</code> attributes being in scope
+
+ <dd>Construct the DOM as if appropriate namespace declarations were in
+ scope.
+
+ <dt>Elements whose names contain U+003A COLON (:) characters or characters
+ that cannot be represented in XML element names
+
+ <dd>Drop the element and all its children, or replace any offending
+ characters with a U+005F LOW LINE (_) character.
+
+ <dt>Attributes named <code title="">xmlns</code> whose values match the
+ namespace of the element node
+
+ <dd>Construct the DOM as if these were default namespace declarations.
+
+ <dt>Attributes named <code title="">xmlns:xlink</code> whose values match
+ the <a href="#xlink">XLink namespace</a>, on elements whose namespace is
+ not the <a href="#html-namespace0">HTML namespace</a>
+
+ <dd>Construct the DOM as if these were namespace prefix declarations.
+
+ <dt>Other attributes whose names are <code title="">xmlns</code> or start
+ with <code title="">xmlns:</code>
+
+ <dd>Drop the attributes or add two U+005F LOW LINE (_) characters to the
+ start of the attributes' names and replace any U+003A COLON (:)
+ characters with a U+005F LOW LINE (_) character.
+
+ <dt>Other attributes in no namespace whose names contain U+003A COLON (:)
+ characters
+
+ <dt>Attributes whose names contain characters that cannot be represented
+ in XML attribute names
+
+ <dd>Drop the attributes or replace any offending characters with a U+005F
+ LOW LINE (_) character, dropping any attributes where doing this would
+ cause an attribute name clash.
+
+ <dt>Form controls being associated with forms that aren't their nearest
+ ancestor (use of the <a href="#form-element"><code>form</code> element
+ pointer</a>
+
+ <dd>Create an attribute <code title="">__formid__</code> on the form, with
+ a value unique amongst <code title="">__formid__</code> attributes in the
+ document, and create an attribute <code title="">__form__</code> on the
+ form control, whose value matches the unique identifier given to the
+ form.
+
+ <dt>Any U+000C FORM FEED (FF) character
+
+ <dd>Replace the character with a U+0020 SPACE character.
+
+ <dt>Any other literal non-XML character
+
+ <dd>Replace the character with a U+FFFD REPLACEMENT CHARACTER.
+
+ <dt>A comment that contains two adjacent U+002D HYPHEN-MINUS characters
+ (--).
+
+ <dd>Insert a U+0020 SPACE character between them.
+ </dl>
+
+ <p>Tools that use these conventions should guard against documents that
+ include markup that clashes with them by always dropping all attributes in
+ the document that start with two U+005F LOW LINE (_) characters.
+
+ <p class=note>These conventions apply <em>after</em> the <a
+ href="#html-0">HTML parser</a>'s rules have been applied. For example, a
+ <code title=""><a::></code> start tag will be closed by a <code
+ title=""></a::></code> end tag, and never by a <code
+ title=""></a__></code> end tag, even if the user agent is using the
+ rules above to then generate an actual element in the DOM with the name
+ <code title="">a__</code> for that start tag.
+
 <h3 id=namespaces><span class=secno>8.3 </span>Namespaces</h3>
 <p>The <dfn id=html-namespace0>HTML namespace</dfn> is:
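
The attribute-name, character, and comment conventions added in section 8.2.7 can be illustrated with a short sketch. This is a non-normative illustration in Python and is not part of the commit. The function names (coerce_attribute_name, coerce_text, coerce_comment) are hypothetical, the XML name check is simplified to ASCII name characters, and only the "replace" alternative of each rule is shown. The cases where xmlns attributes are treated as real namespace declarations, and the dropping of attributes whose coerced names clash, are left out.

```python
import re

# Assumption: a simplified, ASCII-only stand-in for the XML Name productions.
_NAME_START = re.compile(r"[A-Za-z_]")
_NAME_CHAR = re.compile(r"[A-Za-z0-9_.\-]")


def coerce_attribute_name(name):
    """Return the coerced attribute name, or None if the attribute is dropped."""
    # Guard rule: attributes from the input that already start with two
    # LOW LINE characters are dropped so they cannot clash with generated ones.
    if name.startswith("__"):
        return None
    # xmlns / xmlns:* attributes that are not treated as declarations:
    # add two LOW LINE characters and replace any colon with LOW LINE.
    if name == "xmlns" or name.startswith("xmlns:"):
        return "__" + name.replace(":", "_")
    # Other names: replace colons and unrepresentable characters with "_".
    chars = []
    for i, ch in enumerate(name):
        ok = (_NAME_START if i == 0 else _NAME_CHAR).match(ch)
        chars.append(ch if ok else "_")
    return "".join(chars)


def _is_xml_char(cp):
    # Char production of XML 1.0.
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)


def coerce_text(text):
    """Replace FORM FEED with SPACE and other non-XML characters with U+FFFD."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0x0C:
            out.append(" ")
        elif _is_xml_char(cp):
            out.append(ch)
        else:
            out.append("\ufffd")
    return "".join(out)


def coerce_comment(data):
    """Insert a SPACE between every pair of adjacent HYPHEN-MINUS characters."""
    while "--" in data:
        data = data.replace("--", "- -")
    return data
```

A tool would apply coerce_attribute_name to names coming from the input document only. The attributes the conventions themselves generate, such as __mode__ or __formid__, are added after that guard step.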
Received on Wednesday, 23 July 2008 02:03:33 UTC