Review of DOM 3 Core + Load&Save

Hello DOM WG

The i18n WG has reviewed the recently published DOM 3 Core and Load&Save
working drafts:
http://www.w3.org/TR/2003/WD-DOM-Level-3-Core-20030226/
http://www.w3.org/TR/2003/WD-DOM-Level-3-LS-20030226/

We understand that this is still very much work in progress, but thought
that submitting comments now could be productive.  In particular, many of
our comments can serve as hints as to where the specs need clarification,
while others raise potential issues that do not appear to be in your issues
lists and therefore may not be on your radar.  We plan to re-review the
specs when they reach Last Call, but it is probably better to raise issues
we know about now instead of later when the spec is considered done.

-- 
François Yergeau
for the i18n WG


DOM 3 Core
==========

C1) Document interface, "actualEncoding" and "encoding" attributes: it is
not quite clear what these two are, especially how they differ.  Also, the
effect, if any, of setting them (they are not readonly) is not clear.  Same
issues with interface Entity.


C2) Document interface, "standalone" attribute: this is said to match the
[standalone] property of the infoset, but is boolean whereas the infoset
property can have 3 values: "yes", "no" and "no value".  Either the datatype
should be changed or (my preference) it should be specified that this is
true when the infoset says "yes", false otherwise.
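
As an illustration only, the mapping preferred above could be sketched in Python as follows.  The function name and the use of None to represent "no value" are assumptions made for this example; neither comes from the DOM or infoset specs.

```python
from typing import Optional


def standalone_to_boolean(infoset_value: Optional[str]) -> bool:
    """Map the infoset [standalone] property to a boolean.

    Assumption: the infoset's "no value" is represented here as None.
    """
    # Only an explicit "yes" maps to true; "no" and "no value" map to false.
    return infoset_value == "yes"
```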


C3) Document interface, "strictErrorChecking" attribute: shouldn't this be a
parameter of DOMConfiguration?


C4) Document interface, "version" attribute: presumably, this controls the
error checking that is done on names in e.g. createAttribute(), which raises
INVALID_CHARACTER_ERR if the specified name contains an illegal character.
Which rules (1.0 or 1.1) apply if "version" is null?  Shouldn't "version"
default to "1.0" if not specified?


C5) Document interface, "version" attribute:  what happens if version is set
from "1.1" to "1.0" and the document already contains names that are not
legal in 1.0?  Is this controlled by "strictErrorChecking"?


C6) Document interface, "adoptNode()" method: what happens if a 1.0 document
adopts a node containing names not legal in 1.0 (e.g. from a 1.1 document)?
By analogy with createAttribute() and friends, this should throw an
INVALID_CHARACTER_ERR exception, as importNode() already does in that
case.


C7) Document interface, "createAttribute()" method and several others:
should specify that the rules that decide whether an INVALID_CHARACTER_ERR
exception is thrown depend on the "version" attribute.  Same comment for
Element.setAttribute() and Element.setAttributeNS().


C8) Document interface, "createCDATASection()" method: what happens if the
"data" argument contains the string "]]>"?  Is this controlled by
"strictErrorChecking"?  Impacts Load&Save.

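To make the C8 question concrete, here is a hedged Python sketch of one behaviour a serializer could adopt: splitting the data so that no single CDATA section contains "]]>".  The function names are hypothetical, and the draft does not say that this (rather than raising an error) is the intended behaviour.

```python
from typing import List


def split_cdata(data: str) -> List[str]:
    """Split data into pieces, none of which contains "]]>"."""
    pieces = data.split("]]>")
    result = []
    for i, piece in enumerate(pieces):
        if i > 0:
            piece = ">" + piece    # second half of a split terminator
        if i < len(pieces) - 1:
            piece = piece + "]]"   # first half of a split terminator
        result.append(piece)
    return result


def serialize_cdata(data: str) -> str:
    """Serialize data as one or more adjacent CDATA sections."""
    return "".join("<![CDATA[" + p + "]]>" for p in split_cdata(data))
```

With this sketch, the data "a]]>b" would be written as two adjacent sections, which a parser reads back as the original text but as two nodes rather than one.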

C9) Document interface, "createComment()" method: what happens if the "data"
argument contains the string "--"?  Is this controlled by
"strictErrorChecking"?  Impacts Load&Save.


C10) Document interface, "createTextNode()" method: what happens if the
"data" argument contains characters not allowed by the XML Char production?
Is this controlled by "strictErrorChecking"?  Impacts Load&Save.  Same
question for setting Node.textContent.
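
For reference, the check in question is small.  The sketch below tests a string against the Char production of XML 1.0; the function names are hypothetical, and whether the DOM should perform such a check (and under which setting) is exactly what is being asked.

```python
from typing import Optional


def is_xml10_char(ch: str) -> bool:
    """Return True if ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def first_illegal_char(data: str) -> Optional[int]:
    """Return the index of the first character outside Char, or None."""
    for index, ch in enumerate(data):
        if not is_xml10_char(ch):
            return index
    return None
```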


C11) Document interface, "normalizeDocument()" method: doesn't mention
character-normalization.


C12) Node interface, "normalize()" method: this should also perform
character normalization, perhaps conditional on the config of the containing
Document.


C13) CharacterData interface: are the various methods supposed to maintain
normalization?  Under the control of the config of the containing Document?
Of "strictErrorChecking"?


C14) Attr interface, last paragraph before Note before IDL definition
contains the term "character entity reference", which is not defined
anywhere.  This whole paragraph is rather unclear: the reader comes away not
knowing what the value of an attribute is supposed to be or not to be.


C15) Attr interface, "value" attribute: what happens if the attribute
contains a reference to an entity for which no definition is available?
Same question for getAttribute() and getAttributeNS() in the Element
interface.


C16) DOMLocator interface, "offset" attribute: there should be two
attributes, one for byte offset and the other for character offset (or
alternatively another attribute that says whether "offset" is byte or
character), since the application may not be able to determine if the source
was bytes or characters.
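
A small Python sketch shows how the two offsets diverge as soon as the source contains a non-ASCII character.  The helper name and the sample string are made up for the example.

```python
def byte_offset(text: str, char_offset: int, encoding: str = "utf-8") -> int:
    """Return the byte offset corresponding to a character offset."""
    # Encode only the prefix that precedes the position of interest.
    return len(text[:char_offset].encode(encoding))


source = "naïve<"
char_offset = source.index("<")  # position of "<" counted in characters
```

Here the "<" sits at character offset 5, but at byte offset 6 in UTF-8 and byte offset 10 in UTF-16, so a bare "offset" cannot be interpreted without knowing which kind it is.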


C17) Substitute IRI for URI throughout.


C18) DOMConfiguration interface, "cdata-sections" and "entities" parameters:
it doesn't make sense to default to keeping CDATA sections but not entity
references.  The former are mere syntactic sugar with no structural role
(hint: they do not exist in the infoset) while the latter are part of the
physical structure of XML documents.  At least change "cdata-sections"
default to false.


C19) DOMConfiguration interface, "normalize-characters" parameter: it is not
quite clear what exactly this setting does and when.  Change "Perform the
W3C Text Normalization of the characters [CharModel] in the document." to
"The characters in the document are fully-normalized according to the rules
defined in [CharModel] supplemented by the definitions of relevant
constructs from Section 2.13 of [XML1.1]."  This reflects both a change of
terminology in CharModel and the necessity of taking into account the
relevant constructs defined in XML 1.1 (as per the provisions of CharModel).
Since CharModel says that text SHOULD be normalized, the default for this
should be true, the user having the chance to set it to false after careful
consideration of the consequences (see definition of SHOULD in RFC2119).
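
As a rough illustration of what is at stake, the Python sketch below checks and applies Unicode Normalization Form C using the standard unicodedata module.  Note the assumption: this covers only the NFC part of the story, whereas full-normalization as defined in [CharModel] adds further conditions on the constructs that make up the text, so this is an approximation and not an implementation of the parameter.  The function name is hypothetical.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


precomposed = "\u00e9"   # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
normalized = unicodedata.normalize("NFC", decomposed)
```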


C20) Entity interface: the 4th paragraph starts "XML does not mandate that a
non-validating XML processor read and process entity declarations made in
the external subset or declared in external parameter entities. "  The last
occurrence of "external" is superfluous and somewhat misleading, since
non-validating processors are not obligated to read even *internal*
parameter entities.


C21) The references to Unicode 2.0 and ISO/IEC 10646 need to be updated.
Both are obsolete and unavailable. There is no apparent reason not to use
current versions or, better, version-less references (see Charmod section
9).


DOM 3 Load&Save
===============

LS1) The "schemaType" arg of DOMImplementationLS.createDOMBuilder()
specifies an "absolute URI representing the type of the schema language used
during the load of a Document".  That URI is used solely for matching (à la
XML namespace), not for resolving, and should be an absolute *IRI*.  The
identity matching rules (e.g. character for character, %e9 == %E9 or not,
etc.) should be specified.  This also applies to the "schema-type" parameter
of DOMConfiguration in DOM 3 Core.
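
The difference between candidate matching rules can be shown with a short Python sketch.  Both function names and the example identifiers are hypothetical; the point is only that the two rules give different answers, so the spec has to pick one.

```python
import re


def match_char_for_char(a: str, b: str) -> bool:
    """Strict comparison, as used for XML namespace names."""
    return a == b


def match_ignoring_escape_case(a: str, b: str) -> bool:
    """Comparison that treats %e9 and %E9 as the same escape."""
    def upper_escapes(s: str) -> str:
        # Uppercase the hex digits of each %HH escape, nothing else.
        return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), s)
    return upper_escapes(a) == upper_escapes(b)


first = "http://example.org/sch%e9ma"
second = "http://example.org/sch%E9ma"
```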


LS2) The effect of the "certified" parameter of DOMConfiguration in
DOMBuilder is not clearly defined.  Its interaction with the
"normalize-characters" parameter defined in Core should be clarified.
Actually, "certified" should be a property of DOMInputSource, not of
DOMBuilder.  It is in fact a source that can be certified (or not), not a
parser.  And certification may be different for a main document and for the
external entities it pulls in during parse.


LS3) It should be specified clearly somewhere what
"normalize-characters=true" in the config of a DOMBuilder means:
non-certified input will be verified for full-normalization and the load
will fail with an error if it is not.  The default value of
"normalize-characters" must be true in DOMBuilder, at least when loading XML
1.1 documents, in order to satisfy the prescriptions of [XML1.1] and
[CharModel].  In particular, the DocumentLS.load() and loadXML() methods
automatically do the wrong thing and have no way to do the right thing if
the default is false.


LS4) The "unknown-characters" parameter of DOMConfiguration in DOMBuilder is
correctly designed but poorly named.  Suggest
"ignore-unknown-denormalizations".  Same remark for same-named parameter in
DOMWriter.


LS5) Substitute IRI for URI throughout.


LS6) In interface DOMInputSource, the role of the "publicId" attribute is
not clear at all.  It is not mentioned in the paragraph above the IDL
definition that describes how the source of input is determined (nor is the
"stringData" attribute mentioned there).  The role of the "encoding"
attribute is mentioned in too many places.


LS7) In the discussion of interface DOMWriter (above the IDL definition), it
would be nice if character references were specified to be hexadecimal
(preferred) or decimal, with the choice determined by the spec rather than
left implementation-dependent.  Similarly (still within DOMWriter), it would be
better to specify serialization of attribute values to be always in quotes
(or apostrophes, you choose), with escaping as necessary.  Requiring
serializers to examine the value and choose quotes or apostrophes based on
content seems like useless work.
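
A deterministic serializer along these lines is easy to sketch in Python.  The function names are hypothetical, and the choices shown (uppercase hexadecimal references, double quotes as the delimiter) are just one of the options mentioned above.

```python
def char_ref(ch: str) -> str:
    """Return a hexadecimal character reference for ch."""
    return "&#x{:X};".format(ord(ch))


def serialize_attribute(name: str, value: str) -> str:
    """Serialize an attribute, always delimited by double quotes."""
    escaped = (
        value.replace("&", "&amp;")   # must come first
        .replace("<", "&lt;")
        .replace('"', "&quot;")       # the only quote that needs escaping
    )
    # Escape whitespace that attribute-value normalization would alter.
    for ch in "\t\n\r":
        escaped = escaped.replace(ch, char_ref(ch))
    return '{}="{}"'.format(name, escaped)
```

Because the delimiter is fixed, the serializer never has to inspect the value to choose between quotes and apostrophes; apostrophes simply pass through unescaped.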


LS8) In the paragraph (still within DOMWriter) discussing the effect of
"normalize-characters", change "...is W3C Text normalized according to the
rules defined in [CharModel]." to "...is fully-normalized according to the
rules defined in [CharModel] supplemented by the definitions of relevant
constructs from Section 2.13 of [XML1.1]."  This reflects both a change of
terminology in CharModel and the necessity of taking into account the
relevant constructs defined in XML 1.1 (as per the provisions of CharModel).


LS9) In the description of "encoding" in DOMWriter, it is said that
encoding info can be gleaned from e.g.  "actualEncoding" from the Document.
What about "encoding" from the Document?  If both are set, which wins?


LS10) It would be nice to be more specific about what happens when
"encoding" is either "UTF-16" or "UTF-32".  The implementation has to choose
between big-endian and little-endian; the DOM spec could say which or say
"implementation-dependent" explicitly.  Then, for UTF-16, the implementation
can choose to output a BOM and no encoding declaration (or a declaration
that says "UTF-16"), or to output no BOM but an encoding declaration that
says either "UTF-16BE" or "UTF-16LE".  We have no specific recommendation 
to make at this point, but think that the spec should specify more precisely
what is supposed to happen.
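
The two UTF-16 options described above can be sketched in Python as follows.  The function names are hypothetical and big-endian is chosen arbitrarily for the second case; this illustrates the alternatives and is not a recommendation for either.

```python
import codecs


def with_bom(text: str, big_endian: bool = True) -> bytes:
    """Serialize as UTF-16 with a BOM and no encoding declaration."""
    if big_endian:
        return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def without_bom(text: str) -> bytes:
    """Serialize without a BOM, declaring the byte order explicitly."""
    declaration = '<?xml version="1.0" encoding="UTF-16BE"?>'
    return (declaration + text).encode("utf-16-be")
```

The first output starts with the bytes FE FF (or FF FE for little-endian), while the second starts directly with the bytes of "<" in big-endian order.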


LS11) In DOMWriter, there should be a way to specify the version of XML
under which serialization is performed.  While it seems possible to set the
Document.version attribute, this has the side effect of changing the DOM in
memory and more seriously is not practical when serializing other than the
whole document.


LS12) Unless the Core guarantees this never happens (cf. C5 above), it needs
to be specified what happens when a node containing names (e.g. element
names) legal only in XML 1.1 is serialized using 1.0 rules: DOMException of
type INVALID_CHARACTER_ERR?  Error event sent to errorHandler?  If the
latter, details?


LS13) The asymmetry between DOMBuilder and DOMWriter is bothersome.  Why
isn't there a DOMOutputSink to parallel DOMInputSource?  Why isn't there a
DOMWriter.writeURI() to parallel DOMBuilder.parseURI()?  Saving to an HTTP
(with PUT), FTP or mailto URI appears to make a lot of sense.


LS14) The references to Unicode 2.0 and ISO/IEC 10646 need to be updated.
Both are obsolete and unavailable. There is no apparent reason not to use
current versions or, better, version-less references (see Charmod section
9).

Received on Wednesday, 12 March 2003 09:00:56 UTC