[Bug 5809] New: Mitigate data loss when conforming documents are coerced to XML 1.0

http://www.w3.org/Bugs/Public/show_bug.cgi?id=5809

           Summary: Mitigate data loss when conforming documents are coerced
                    to XML 1.0
           Product: HTML WG
           Version: unspecified
          Platform: PC
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Spec proposals
        AssignedTo: dave.null@w3.org
        ReportedBy: hsivonen@iki.fi
         QAContact: public-html-bugzilla@w3.org
                CC: ian@hixie.ch, mike@w3.org, public-html@w3.org


Over in bug 5808 I suggested a way to coerce the output of the HTML5 parsing
algorithm into XML.

It's theoretically impure for conforming documents to trigger coercions that
aren't mostly harmless. I, therefore, suggest narrowing the conformance
definition accordingly.

 * The document mode isn't part of the infoset: Optionally communicate as
out-of-infoset-band data. Instruct apps to use the standards mode when not
communicated.

Mostly harmless.

 * The form pointer isn't part of the infoset: Make communicating the form
pointer optional. Allow communicating it as out-of-infoset-band data. When the
form element is not an ancestor of the form control, allow a UUID id attribute
to be generated on the form element and allow a form attribute to be generated
on the form control.

Mostly harmless.

 * Some XML APIs treat the doctype as syntactic sugar: Make representing the
document type information item optional.

Mostly harmless.

 * Attributes with the local name "xmlns" or a local name starting with
"xmlns:" are not permitted attribute information items: Drop on the floor.

Mostly harmless. However, in the case of <embed>, this theoretically loses
conforming data. These attributes could be excluded from what is permitted on
<embed> as plug-in parameters.
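
As a minimal sketch of the "drop on the floor" rule above, assuming attributes
are held in a plain dict keyed by local name (the function name and the dict
representation are hypothetical, not part of the proposal):

```python
def drop_xmlns_attributes(attrs: dict) -> dict:
    """Return a copy of attrs without names that cannot be attribute
    information items: "xmlns" itself and anything starting with "xmlns:"."""
    return {
        name: value
        for name, value in attrs.items()
        if name != "xmlns" and not name.startswith("xmlns:")
    }
```

On <embed> this would silently discard a plug-in parameter named, say,
"xmlns:foo", which is the theoretical loss noted above.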

 * Namespace declarations are not attribute information items: Drop on the
floor. (Optionally synthesize namespace information items for XLink and SVG or
MathML on <svg> and <math> nodes, respectively, and XHTML namespace information
items on HTML elements (including root) that do not have an HTML element as the
parent.)

Mostly harmless.

 * Form feed is not an XML character (either literally or as a character
reference expansion): turn into a space.

Mostly harmless.

 * The input stream contains a literal non-XML character other than form feed:
turn into a REPLACEMENT CHARACTER.

Mostly harmless, but these might as well be defined as non-conforming.
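
The two character rules above could be sketched roughly as follows. The ranges
in is_xml_char follow the Char production of XML 1.0; the function names are
hypothetical and the sketch works on an already decoded string rather than on
the parser's input stream:

```python
def is_xml_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def coerce_text(text: str) -> str:
    """Turn form feed into a space and any other non-XML character
    into U+FFFD REPLACEMENT CHARACTER."""
    out = []
    for ch in text:
        if ch == "\f":
            out.append(" ")
        elif is_xml_char(ord(ch)):
            out.append(ch)
        else:
            out.append("\ufffd")
    return "".join(out)
```

For example, coerce_text("a\fb") gives "a b", while a literal U+0000 or U+FFFE
comes out as U+FFFD.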

 * A comment contains "--": Replace with "- -".

Mostly harmless.
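
A rough sketch of the comment rule, with a hypothetical function name. Note
that a plain search-and-replace of "--" with "- -" would turn "---" into
"- --", so the sketch inserts a space after every hyphen that is directly
followed by another hyphen. Padding a trailing hyphen is an assumption that
goes beyond the rule above; it is there because an XML comment may not end in
"-" either.

```python
import re


def coerce_comment(data: str) -> str:
    """Rewrite comment data so that it never contains "--"."""
    # Put a space after each hyphen that is immediately followed by a hyphen.
    fixed = re.sub(r"-(?=-)", "- ", data)
    # Assumption beyond the proposal: avoid a trailing "-", which would
    # produce "--->" when the comment is serialized.
    if fixed.endswith("-"):
        fixed += " "
    return fixed
```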

 * A name is not an NCName: Use the original name on the tree builder stack for
matching, but use an escaped name in the output. The escaping function must
escape each non-NCName to a unique NCName, and the result must have at least
one upper case ASCII character but must not match any known SVG camelCase name.

This is data loss in theory even if not in probable practice. Attributes that
are actually used on <embed> are NCNames anyway, so forbidding non-NCNames
wouldn't break anything. Forbidding data-* from forming a non-NCName would
still leave a countably infinite space of names, and authors are likely to use
printable ASCII anyway.
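
One possible shape for such an escaping function, as a sketch only. The
"U" plus six hex digits scheme, the function name, and the ASCII-only
approximation of NCName are all assumptions made for illustration: the sketch
escapes every character outside the ASCII NCName set, so it over-escapes
legitimate non-ASCII name characters. It also assumes the names coming out of
the parser contain no upper case ASCII apart from the known SVG camelCase
names, which is what keeps escaped names from colliding with unescaped ones.

```python
import string

# ASCII-only approximation of the NCName character classes (an assumption;
# the real productions also admit many non-ASCII characters).
_START = set(string.ascii_letters + "_")
_REST = _START | set(string.digits + "-.")


def escape_name(name: str) -> str:
    """Escape each character that is not allowed at its position as
    "U" followed by its code point in six upper case hex digits."""
    return "".join(
        ch if ch in (_START if i == 0 else _REST) else "U%06X" % ord(ch)
        for i, ch in enumerate(name)
    )
```

With this, "xlink:href" becomes "xlinkU00003Ahref" and "1abc" becomes
"U000031abc". Every escaped name contains at least one "U", and the fixed
width of the escape keeps distinct inputs distinct under the assumption above.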



Received on Thursday, 26 June 2008 13:13:25 UTC