Re: i18n reviews of DOM 3 Core and Load&Save from Philippe Le Hegaret on 2003-08-29 (www-dom@w3.org from July to September 2003)

From: Philippe Le Hegaret <plh@w3.org>
Date: 29 Aug 2003 14:01:41 -0400
To: Francois Yergeau <FYergeau@alis.com>
Cc: "'www-dom@w3.org'" <www-dom@w3.org>
Message-Id: <1062180101.3416.48.camel@jfouffa.w3.org>

On Fri, 2003-08-08 at 11:39, Francois Yergeau wrote:
> These are the review comments on the last calls for DOM 3 Core and
> Load&Save, from the i18n WG.
> 
> DOM 3 Core
> http://www.w3.org/TR/2003/WD-DOM-Level-3-Core-20030609/
> 
> 
> C1) Document interface, "actualEncoding" and "xmlEncoding" attributes: This
> is very much improved since the previous version, but unfortunately still
> not totally clear.  Since the DOM stores documents in UTF-16 exclusively,
> these attributes must necessarily refer to the encoding of a serialized
> document that is parsed to create the DOM tree; since xmlEncoding is not
> read-only, it can also be set programmatically, but that shouldn't change
> its semantics.  Semantics is precisely where some dark spots remain.
> actualEncoding is pretty clearly defined as the actual encoding of the
> parsed document, supposedly gleaned from the parser. xmlEncoding is then
> said to be taken from the XML declaration, but Appendix C.1.1 says that
> xmlEncoding is supposed to come from the infoset's [character encoding
> scheme] property.  The latter is defined as "The name of the character
> encoding scheme in which the document entity is expressed", matching the
> semantics of actualEncoding, not those of an encoding label read from the
> XML declaration.  So the meaning of xmlEncoding remains pretty murky.

xmlEncoding is now read-only and represents only what was found in the
XML declaration, if any. actualEncoding is the encoding that was used to
load the document, again if any. For the Save module, it should be
clarified that the XML declaration, if generated, will carry whatever
encoding was used for the serialization, which is not necessarily
actualEncoding or xmlEncoding.
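
As a rough illustration of the distinction (not DOM Level 3 code), here
is a minimal Python sketch using the standard library's xml.dom.minidom.
It assumes that minidom's `encoding` attribute on the Document, which
holds the label read from the XML declaration, is an acceptable stand-in
for xmlEncoding, and that the `encoding` argument of toxml() plays the
role of the encoding chosen at serialization time. Neither is the API
under review.

```python
from xml.dom import minidom

# Source bytes encoded and labelled as ISO-8859-1 (0xE9 is U+00E9 there).
source = b'<?xml version="1.0" encoding="ISO-8859-1"?><a>\xe9</a>'
doc = minidom.parseString(source)

# Label found in the XML declaration at parse time
# (assumed analogue of xmlEncoding; a minidom detail, not DOM 3).
declared = doc.encoding

# Serialize as UTF-8. The generated XML declaration carries the encoding
# chosen for this write, not the label found when the document was loaded.
output = doc.toxml(encoding="utf-8")

print(declared)
print(output)
```

The tree itself is untouched by the write; only the serialized bytes and
the generated declaration reflect the output encoding.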

> One wonders why there are actually 2 attributes, since there is only one
> encoding of interest: that of the document that was parsed to create the DOM
> tree.

Both are kept for completeness in representing the original document,
since a higher-level protocol could override the encoding specified in
the document.

>  If the intent was to enable DOM users to control encoding during
> later serialization, this is defeated by the order of priorities specified
> in DOMSerializer.write(): actualEncoding precedes xmlEncoding.  The former
> being read-only, the user has no control.

The control of the encoding at serialization is done using
DOMOutput.encoding, since a write operation should be accomplished
without having to modify the DOM tree.

> C3) Document interface, "renameNode()" method: should specify, like
> createAttribute() and others, that an INVALID_CHARACTER_ERR exception can be
> thrown, depending on the "xmlVersion" attribute.

Correct. Fixed.

> C4) Node interface, "normalize()" method: this should also perform character
> normalization, perhaps conditional to the config of the containing Document.
> This method's business in life is to concatenate Text nodes; concatenation
> is one of the well-known cases that actually *produces* character
> denormalization.  It would be silly to have a method called normalize()
> which actually denormalizes, so any denormalizations caused by concatenation
> should be repaired as part of the method's normal functioning.  Backward
> compatibility can probably be addressed by making the repairs conditional on
> xmlVersion or the config of the containing document or both.

normalize() is a DOM Level 1 method. The name is unfortunate since it
collides with character normalization, but we cannot change its
semantics or rename it. This explains the introduction of
normalizeDocument() instead of reusing normalize() on Document nodes.
Another example of a naming discrepancy is our namespaceURI and the
[namespace name] Infoset property.

> Also, it should be specified that this method is sensitive to the value of
> the "cdata-sections" config parameter.

Only normalizeDocument is sensitive to the configuration parameters.
Changing the behavior of normalize() could break DOM Level 1 and 2
applications.
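
To make the C4 concern concrete, here is a small Python sketch using
xml.dom.minidom and unicodedata. It is only an illustration under the
assumption that minidom's normalize() behaves like the DOM Level 1
method (merging adjacent Text nodes and nothing more); the final
character normalization step is done by the application, not by the
DOM.

```python
import unicodedata
from xml.dom import minidom

doc = minidom.Document()
root = doc.createElement("p")
doc.appendChild(root)

# Two adjacent Text nodes: "e" followed by U+0301 COMBINING ACUTE ACCENT.
root.appendChild(doc.createTextNode("e"))
root.appendChild(doc.createTextNode("\u0301"))
before = len(root.childNodes)

# DOM normalize(): concatenates adjacent Text nodes, no character work.
root.normalize()
after = len(root.childNodes)
merged = root.firstChild.data

# The concatenated text is not in NFC; composing it is a separate step.
is_nfc = unicodedata.normalize("NFC", merged) == merged
composed = unicodedata.normalize("NFC", merged)

print(before, after, is_nfc, len(merged), len(composed))
```

The merged string still has two code points, while its NFC form has one
(U+00E9), which is the denormalization the review comment describes.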

> C6) DOMLocator interface, "offset" attribute: there should be two
> attributes, one for byte offset and the other for character offset (or
> alternatively another attribute that says whether "offset" is byte or
> character), since the application may not be able to determine if the source
> was bytes or characters.

We split the offset attribute into byteOffset and utf16Offset (since the
DOM deals with UTF-16 code units).
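
The two offsets generally differ for the same position. The Python
sketch below shows how; `locator_offsets` is a hypothetical helper
written for this illustration, not part of any DOM binding, and it
assumes the byte source is UTF-8 unless told otherwise.

```python
def locator_offsets(text, char_index, encoding="utf-8"):
    """Return (byte_offset, utf16_offset) of text[char_index].

    byte_offset counts bytes of the prefix in the given source encoding;
    utf16_offset counts UTF-16 code units of the same prefix.
    """
    prefix = text[:char_index]
    byte_offset = len(prefix.encode(encoding))
    # Each UTF-16 code unit is two bytes; "utf-16-le" adds no BOM.
    utf16_offset = len(prefix.encode("utf-16-le")) // 2
    return byte_offset, utf16_offset


# "a", U+00E9, U+1D11E (outside the BMP, so a surrogate pair), then "b".
sample = "a\u00e9\U0001D11Eb"
offsets = locator_offsets(sample, sample.index("b"))
print(offsets)
```

For this sample the position of "b" is 7 bytes into the UTF-8 source but
4 UTF-16 code units into the text.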

> 
> C7) DOMConfiguration interface, "cdata-sections"  parameter: this should
> default to false.  CDATA sections are mere syntactic sugar with no
> structural role (hint: they do not exist in the infoset), they do not
> deserve to be preserved by default.

The parse methods of the LS module don't load CDATA sections by default
(the "infoset" parameter is true by default, which implies that the
"cdata-sections" default is false for the parse methods). So unless an
application adds CDATASection nodes during manipulations, the
"cdata-sections" parameter won't change anything in the tree. And if the
application does add CDATASection nodes to the tree, or the parse
operation was requested to preserve the CDATA sections, then they should
be preserved by default since the application explicitly asked to get
them.

> C8) DOMConfiguration interface, "check-character-normalization" parameter:
> it is not clear *when* this setting has any effect (i.e. what methods of
> what interfaces it affects).

[there is a pending action item on the Core editors to clarify that only
normalizeDocument is affected by the DOMConfiguration parameters]

>  Since Charmodel says that text SHOULD be
> checked, the default for this should be true, the user having the chance to
> set it to false after careful consideration of the consequences (see
> definition of SHOULD in RFC2119).

The parameter check-character-normalization is optional so the default
cannot be true. Applications can certainly check if the parameter is
activated, or can be activated, using the methods defined on the
DOMConfiguration object.

> C9) The reference to Unicode 3.0 should be updated to Unicode 4.0, ISBN
> 0-321-18578-1.

Done. For the record, the Character Model of the Web (August 2003
version) still links to Unicode Version 3.0. Is that intentional?
http://www.w3.org/TR/2003/WD-charmod-20030822/#unicode

Philippe
Received on Friday, 29 August 2003 14:02:25 UTC