[Bug 5808] New: Define a way to coerce HTML5 parser output to an XML 1.0 4th ed. + Namespaces 1.0 infoset

http://www.w3.org/Bugs/Public/show_bug.cgi?id=5808

           Summary: Define a way to coerce HTML5 parser output to an XML 1.0
                    4th ed. + Namespaces 1.0 infoset
           Product: HTML WG
           Version: unspecified
          Platform: PC
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Spec proposals
        AssignedTo: dave.null@w3.org
        ReportedBy: hsivonen@iki.fi
         QAContact: public-html-bugzilla@w3.org
                CC: ian@hixie.ch, mike@w3.org, public-html@w3.org


There's now a canned answer for anyone who argues that XHTML works better with
the 'XML toolchain' than HTML5: "Just put an HTML5 parser at the start of your
XML pipeline."

There's a slight problem though: The HTML5 parser algorithm can output a
document tree that is not an XML 1.0 4th ed. + Namespaces 1.0 infoset. This
poses a problem if a processing pipeline serializes to XML and expects a later
stage to reparse using a conforming XML 1.0 4th ed. + Namespaces 1.0 parser or
if a component in the pipeline (e.g. the XOM library) performs early checks.
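
As a rough illustration of the reparse problem (a sketch only; the helper name
is_well_formed is made up for this example, and the standard-library expat
parser stands in for "a conforming XML 1.0 parser"), the following shows two
kinds of content that an HTML5 parser can put in a tree but that fail to
reparse once written out verbatim as XML:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the XML parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Ordinary content survives the round trip.
plain = "<p>a b</p>"
# A form feed (U+000C) is allowed in HTML text but is not an XML character.
form_feed = "<p>a\x0cb</p>"
# HTML comments may contain "--"; XML comments may not.
double_hyphen = "<p><!-- a -- b --></p>"

for sample in (plain, form_feed, double_hyphen):
    print(repr(sample), is_well_formed(sample))
```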

Therefore, every HTML5 parser writer who wishes to provide a full-featured
general-purpose HTML5 parser needs to come up with a coercion from an HTML5 DOM
onto an XML 1.0 4th ed. + Namespaces 1.0 infoset.

I suggest documenting a mapping.

Here's a list of problems with proposed solutions:

 * The document mode isn't part of the infoset: Optionally communicate it as
out-of-infoset-band data. Instruct apps to use the standards mode when the mode
is not communicated.
 * The form pointer isn't part of the infoset: Make communicating the form
pointer optional. Allow communicating it as out-of-infoset-band data. When the
form element is not an ancestor of the form control, allow a UUID id attribute
to be generated on the form element and a form attribute referencing it to be
generated on the form control.
 * Some XML APIs treat the doctype as syntactic sugar: Make representing the
document type information item optional.
 * Attributes with the local name "xmlns" or a local name starting with
"xmlns:" are not permitted attribute information items: Drop on the floor.
 * Namespace declarations are not attribute information items: Drop on the
floor. (Optionally synthesize namespace information items for XLink and SVG or
MathML on <svg> and <math> nodes, respectively, and XHTML namespace information
items on HTML elements (including the root) that do not have an HTML element as
the parent.)
 * Form feed is not an XML character (either literally or as a character
reference expansion): turn into a space.
 * The input stream contains a literal non-XML character other than form feed:
turn into a REPLACEMENT CHARACTER.
 * A comment contains "--": Replace with "- -".
 * A name is not an NCName: Use the original name on the tree builder stack for
matching, but use an escaped name in the output. The escaping function must
escape each non-NCName to a unique NCName, and the result must have at least
one upper case ASCII character but must not match any known SVG camelCase name.
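
The character, comment, attribute and name items above could be sketched
roughly as follows. This is only an illustration under stated assumptions, not
a proposed normative algorithm: all function names are made up for the example;
the NCName check is deliberately limited to ASCII (so it escapes more than
strictly necessary); the "U" plus six hex digits escape is just one scheme that
satisfies the uniqueness and upper-case requirements, relying on the tokenizer
having already lower-cased names; and the trailing-hyphen handling in comments
is an extra precaution that is not in the list above.

```python
import re
import string

# Anything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NON_XML_CHAR = re.compile(
    r"[^\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

# ASCII-only approximation of NCName (assumption made for this sketch).
_NAME_START = set(string.ascii_letters + "_")
_NAME_REST = set(string.digits + "-.")


def coerce_text(text):
    """Turn form feed into a space, other non-XML characters into U+FFFD."""
    text = text.replace("\f", " ")
    return _NON_XML_CHAR.sub("\uFFFD", text)


def coerce_comment(data):
    """Replace "--" with "- -" until none is left ("---" needs two passes)."""
    data = coerce_text(data)
    while "--" in data:
        data = data.replace("--", "- -")
    # Extra precaution, not in the list above: an XML comment must not end
    # with "-" either, because that would serialize as "--->".
    if data.endswith("-"):
        data += " "
    return data


def escape_name(name):
    """Escape each disallowed character as "U" plus six upper-case hex digits."""
    out = []
    for i, ch in enumerate(name):
        allowed = ch in _NAME_START or (i > 0 and ch in _NAME_REST)
        out.append(ch if allowed else "U%06X" % ord(ch))
    return "".join(out)


def coerce_attributes(attrs):
    """Drop xmlns / xmlns:* attributes; escape the names of the rest."""
    return {
        escape_name(name): coerce_text(value)
        for name, value in attrs.items()
        if name != "xmlns" and not name.startswith("xmlns:")
    }
```

For example, escape_name("foo<bar") gives "fooU00003Cbar", and
coerce_attributes({"xmlns:foo": "y", "id": "z"}) keeps only the id attribute.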



Received on Thursday, 26 June 2008 12:53:30 UTC