RE: i18n reviews of DOM 3 Core and Load&Save from Francois Yergeau on 2003-09-23 (www-dom@w3.org from July to September 2003)

From: Francois Yergeau <FYergeau@alis.com>
Date: Tue, 23 Sep 2003 16:17:32 -0400
To: "'www-dom@w3.org'" <www-dom@w3.org>
Cc: "'w3c-i18n-ig@w3.org'" <w3c-i18n-ig@w3.org>
Message-ID: <F7D4BDA0E5A1D14B99D32C022AEB73660EB394@alis-2k.alis.domain>
This is the reply from the i18n WG on the answers received to our comments
(http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0109.html).

We gathered answers from the following messages:
From Philippe Le Hégaret:
http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0163.html
http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0219.html
http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0220.html

From Johnny Stenback:
http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0206.html
http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0207.html
http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0208.html
http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0209.html
http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0210.html
http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0212.html
http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0213.html
http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0215.html

We also acknowledge a satisfactory answer to our C2
(http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0143.html), which
was actually a question.

We're still missing answers on our C10 and LS5.


> > DOM 3 Core
> > http://www.w3.org/TR/2003/WD-DOM-Level-3-Core-20030609/
> > 
> > 
> > C1) Document interface, "actualEncoding" and "xmlEncoding" 
> attributes: This
> > is very much improved since the previous version, but 
> unfortunately still
> > not totally clear.  Since the DOM stores documents in 
> UTF-16 exclusively,
> > these attributes must necessarily refer to the encoding of 
> a serialized
> > document that is parsed to create the DOM tree; since 
> xmlEncoding is not
> > read-only, it can also be set programmatically, but that 
> shouldn't change
> > its semantics.  Semantics is precisely where some dark spots remain.
> > actualEncoding is pretty clearly defined as the actual 
> encoding of the
> > parsed document, supposedly gleaned from the parser. 
> xmlEncoding is then
> > said to be taken from the XML declaration, but Appendix 
> C.1.1 says that
> > xmlEncoding is supposed to come from the infoset's 
> [character encoding
> > scheme] property.  The latter is defined as "The name of 
> the character
> > encoding scheme in which the document entity is expressed", 
> matching the
> > semantics of actualEncoding, not those of an encoding label 
> read from the
> > XML declaration.  So the meaning of xmlEncoding remains 
> pretty murky.
> 
> xmlEncoding is now read-only, and only represents what was 
> found in the
> XML declaration, if any. actualEncoding was the encoding used to load
> the document, again if any. For the Save module, it should be 
> clarified
> that the XML declaration, if generated, will get whatever encoding was
> used for the serialization, and not actualEncoding or xmlEncoding
> necessarily.

Fine.

> > One wonders why there are actually 2 attributes, since 
> there is only one
> > encoding of interest: that of the document that was parsed 
> to create the DOM tree.
> 
> For the completeness of the representation of the original document,
> since some higher protocols could override the encoding 
> specified in the document.

But then the encoding found in the XML decl would be of no interest.  "For
completeness" doesn't seem to meet any reasonable 80/20 criterion of
usefulness, implementation effort would seem to be better spent elsewhere.
Nevertheless, the status quo (as fixed above) is acceptable.


> > C3) Document interface, "renameNode()" method: should specify, like
> > createAttribute() and others, that an INVALID_CHARACTER_ERR 
> exception can be
> > thrown, depending on the "xmlVersion" attribute.
> 
> correct. fixed.

OK, thanks.


> > C4) Node interface, "normalize()" method: this should also 
> perform character
> > normalization, perhaps conditional to the config of the 
> containing Document.
> > This method's business in life is to concatenate Text 
> nodes; concatenation
> > is one of the well-known cases that actually *produces* character
> > denormalization.  It would be silly to have a method called 
> normalize()
> > which actually denormalizes, so any denormalizations caused 
> by concatenation
> > should be repaired as part of the method's normal 
> functioning.  Backward
> > compatibility can probably be addressed by making the 
> repairs conditional on
> > xmlVersion or the config of the containing document or both.
> 
> normalize() is a DOM Level 1 method. The name is unfortunate since it
> collides character normalization but we cannot change its semantics or
> rename it. This explains the introduction of normalizeDocument(),
> instead of reusing normalize() on Document nodes. An other example of
> discrepancy with names is our namespaceURI and the [namespace name]
> Infoset property.

This doesn't seem to address the comment.  Backward compatibility is
certainly an issue, but not necessarily a show-stopper: there are numerous
instances of "Modified in DOM Level 3" in the spec.  We did offer some ideas
for addressing compatibility, they may be dead-on-arrival but we would like
to understand why.


> > Also, it should be specified that this method is sensitive 
> to the value of
> > the "cdata-sections" config parameter.
> 
> Only normalizeDocument is sensitive to the configuration parameters.

Good to know, thanks.


> > C5) CharacterData interface: are the various methods  
> supposed to maintain
> > character normalization?  Under the control of the config 
> of the containing
> > Document? Of "strictErrorChecking"?
> > 
> > The config parameters "check-character-normalization" and
> > "normalize-characters" appear to be pertinent, but neither their
> > descriptions nor the descriptions of the CharacterData.* 
> methods say that
> > they have any effect for these methods. 
>
> As an update, the following text was added on the description of the
> DOMString type:
> [[
> If the normalization happened at load time, or the method
> Document.normalizeDocument() was invoked (in both cases, the parameter
> "normalize-characters" needs to be true), characters are
> fully-normalized according to the rules defined in [CharModel]
> supplemented by the definitions of relevant constructs from 
> Section 2.13
> of [XML 1.1]. Note that, with the exception of
> Document.normalizeDocument(), manipulating characters using 
> DOM methods
> does not guarantee to preserve a fully-normalized text.
> ]]

Good enough.  English is a bit poor perhaps, one wonders what is *the*
normalization, some rewording would be appreciated.


> > C6) DOMLocator interface, "offset" attribute: there should be two
> > attributes, one for byte offset and the other for character 
> offset (or
> > alternatively another attribute that says whether "offset" 
> is byte or
> > character), since the application may not be able to 
> determine if the source
> > was bytes or characters.
> 
> We split the attribute offset into byteOffset and utf16Offset 
> (since the DOM deals with utf16 units).

Fine, thanks.


> > C7) DOMConfiguration interface, "cdata-sections"  
> parameter: this should
> > default to false.  CDATA sections are mere syntactic sugar with no
> > structural role (hint: they do not exist in the infoset), 
> they do not
> > deserve to be preserved by default.
> 
> The parse methods of the LS module don't load CDATA sections 
> by default
> (the "infoset" parameter is true by default, this implies that
> cdata-sections default is false for the parse methods). So unless an
> application adds CDATASection nodes during manipulations, the
> "cdata-sections" parameter won't change anything in the tree. 
> And if the
> application do add CDATASection nodes in the tree, or the parse
> operation was requested to preserve the cdata sections, then 
> they should
> be preserved by default since the application explicitly asked to get
> them.

OK, this makes sense.  Wasn't very clear, though!

 
> > C8) DOMConfiguration interface, 
> "check-character-normalization" parameter:
> > it is not clear *when* this setting has any effect (i.e. 
> what methods of
> > what interfaces it affects).
> 
> [there is a pending action item on the Core editors to 
> clarify that only
> normalizeDocument is affected by the DOMConfiguration parameters]

...unless specified elsewhere, as in Load&Save!


> >  Since Charmodel says that text SHOULD be
> > checked, the default for this should be true, the user 
> having the chance to
> > set it to false after careful consideration of the consequences (see
> > definition of SHOULD in RFC2119).
> 
> The parameter check-character-normalization is optional so the default
> cannot be true. Applications can certainly check if the parameter is
> activated, or can be activated, using the methods defined on the
> DOMConfiguration object.

The optional character of check-character-normalization is the first wrong
(the other being that "true" is not the default).  As argued in our comment,
a DOM user cannot do the right thing (check normalization unless careful
consideration of the consequences...) if the normalization checking
functionality is absent.  The DOM is missing functionality essential for
things as ssimple and basic as string matching.  We object to that.


> > C9) The reference to Unicode 3.0 should be updated to 
> Unicode 4.0, ISBN
> > 0-321-18578-1.
> 
> done. For the record, the Character Model of the Web (August 2003
> version) still links to Unicode Version 3.0. Is it intentional?

:-(  No, oversight.  Fixed now (in our editorial copy).




> > DOM 3 Load&Save
> > http://www.w3.org/TR/2003/WD-DOM-Level-3-LS-20030619/
> >
> > LS1) Interface DOMParser: character normalization checking 
> is now controlled
> > by the "check-character-normalization" parameter of 
> DOMCOnfiguration defined
> > in Core. The fact that the "true" value (do check) is 
> marked as [optional]
> > (not the default, not even required to implement) is not acceptable.
> > Whereas Charmod says that normalization SHOULD be checked, 
> users are not
> > even able to check if the "true" value is not implemented.  
> Furthermore, the
> > DocumentLS.load() and loadXML() methods automatically do 
> the wrong thing and
> > have no way to do the right thing if the default is false.
> 
> Users *are* able to check if the "true" value is implemented or not. 

True, we missed that, sorry.

This doesn't address our issue that making implementation optional is not
acceptable.  See C8 above.

> Using the DOMConfiguration object, a user can call 
> config.canSetParameter("check-character-normalization", 
> true), and that 
> will tell them if the implementation supports character normalization 
> checking. The DOM Level 3 Load and Save (and DOM Level 3 
> Core) specs do 
> not *require* that implementations *must* support character 
> normalization.
> 
> As for DocumentLS, that interface is no longer part of the LS spec.

No objection to that.


> > LS2) Interface DOMParser: There should be an error type 
> defined for failure
> > to check normalization (sugg. 
> "normalization-checking-failure") in addition
> > to the existing "unknown-character-denormalization".
> 
> Fixed. The following has been added to the spec:
> 
> [[
>   "check-character-normalization-failure" [error]
>      Raised if the paramter "check-character-normalization" is set to
>      true and a character is encoutered that is not normalized.
> ]]

Good in principle, but editorially a bit loose.  A character in general
cannot be "not normalized", this is a property of sequences (strings).
Please reword to:

[[
  "check-character-normalization-failure" [error]
     Raised if the paramter "check-character-normalization" is set to
     true and a string is encoutered that fails normalization checking.
]]

or something similar.


> > LS3) In the discussion of interface DOMSerializer (above the IDL
> > definition), it would be nice if character references were 
> specified to be
> > hexadecimal (preferred) or decimal.  One way or the other 
> determined by the
> > spec, not implementation-dependent.  Similarly (still 
> within DOMSerializer),
> > it would be better to specify serialization of attribute 
> values to be always
> > in quotes (or apostrophes, you choose), with escaping as necessary.
> 
> The DOM WG discussed this before, and the WG has always 
> decided against 
> doing this. If you want canonicalized output, set the 
> "canonical-form" 
> parameter, if not, you'll get implementation dependent output.

Hmmmm...  Reluctantly accepted.  Given the apparently zero implementation
burden of choosing one way or the other in the spec, one wonders why the WG
resists this.  Of course, the benefit is not great either, but given the
rather severe under-specification of serializing anything but Documents and
Entities, any amount of predictability would seem desirable...

We would appreciate a at least some text encouraging implementers to use hex
for character references, since that is what all character encoding
standards use.


> > LS6) In DOMSerializer, method writeURI(): there is no way 
> to control the
> > encoding that will be used to output.  The method itself 
> doesn't have a
> > parameter, and the order of priorities is 
> Document.actualEncoding followed
> > by Document.xmlEncoding. Document.actualEncoding being 
> read-only, the user
> > has no way to specify the output encoding, except if by chance
> > Document.actualEncoding is null.  There should be an 
> additional  "encoding"
> > parameter (nullable, to fall back to actualEncoding and 
> xmlEncoding) to the
> > method.
> 
> DOMSerializer.writeURI() is merely a convenience method (and is now 
> defined as such), if you need to pass encoding information 
> when writing 
> to a IRI, use DOMSerializer.write() and set the encoding on 
> the DOMOutput.

OK, as long as the behaviour w/r to encoding is sufficiently specified
(which seems to be the case now).


> > LS7) In DOMSerializer, method writeURI(): the name writeURI 
> is a little
> > unfortunate, it seems to imply that a URI is written, not 
> that it is written *to*.
> 
> Agreed. writeURI() will be renamed to writeToURI().

Good, thanks.


> > LS8) It should be specified that DOMSerializers MUST be 
> able to serialize in
> > UTF-8 and both byte-orders of UTF-16, to close the loop 
> with XML parsers
> > which are obligated to read these.
> 
> The DOM WG decided against requiring support for all of those 
> encodings, 
> but it did decide to require support for one or more of those 
> encodins.

While this is sufficient for strict interoperability, it is not for
compatibility of code.  If there is not at least one required encoding, it
is not possible to write a DOM program that will work over any DOM
implementation.  We insist that at least UTF-8 be required.  Furthermore,
since XML 1.0 did it back in 1998, it cannot be so onerous to require all 3.
Please reconsider.


> > LS9) In DocumentLS.load(), it is said that 'the parameters 
> used in the
> > DOMParser interface are assumed to have their default 
> values with the
> > exception that the parameters "entities", "normalize-characters",
> > "check-character-normalization" are set to "false".', which 
> is strange as
> > the last 2 of these parameters do default to false anyway.
> > "check-character-normalization" should default to true (see 
> other comment).
> 
> After consideing this issue, and the other issues that we've 
> got against 
> DocumentLS and ElementLS, the DOM WG decided to drop these interfaces 
> from the LS specification.

We have no objection to this resolution.


> > LS10) The reference to Unicode 3.0 should be updated to 
> Unicode 4.0, ISBN
> > 0-321-18578-1.
> 
> Done.

Thanks.

-- 
François Yergeau
Received on Tuesday, 23 September 2003 16:17:43 UTC