Re: i18n reviews of DOM 3 Core and Load&Save

Francois Yergeau wrote:
> This is the reply from the i18n WG on the answers received to our comments
> (http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0109.html).
> 
> We gathered answers from the following messages:
> From Philippe Le Hégaret:
> http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0163.html
> http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0219.html
> http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0220.html
> 
> From Johnny Stenback:
> http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0206.html
> http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0207.html
> http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0208.html
> http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0209.html
> http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0210.html
> http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0212.html
> http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0213.html
> http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0215.html
> 
> We also acknowledge a satisfactory answer to our C2
> (http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0143.html), which
> was actually a question.
> 
> We're still missing answers on our C10 and LS5.

LS5 was dealt with and the spec was updated accordingly last week (or 
the week before), but due to a cut n' paste error in the issues list, I 
forgot to reply to the mailing list with the resolution of that issue.

Reply now at:

   http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0236.html

Sorry for that oversight.

> 
> 
> 
>>>DOM 3 Core
>>>http://www.w3.org/TR/2003/WD-DOM-Level-3-Core-20030609/
>>>
>>>
>>>C1) Document interface, "actualEncoding" and "xmlEncoding" 
>>
>>attributes: This
>>
>>>is very much improved since the previous version, but 
>>
>>unfortunately still
>>
>>>not totally clear.  Since the DOM stores documents in 
>>
>>UTF-16 exclusively,
>>
>>>these attributes must necessarily refer to the encoding of 
>>
>>a serialized
>>
>>>document that is parsed to create the DOM tree; since 
>>
>>xmlEncoding is not
>>
>>>read-only, it can also be set programmatically, but that 
>>
>>shouldn't change
>>
>>>its semantics.  Semantics is precisely where some dark spots remain.
>>>actualEncoding is pretty clearly defined as the actual 
>>
>>encoding of the
>>
>>>parsed document, supposedly gleaned from the parser. 
>>
>>xmlEncoding is then
>>
>>>said to be taken from the XML declaration, but Appendix 
>>
>>C.1.1 says that
>>
>>>xmlEncoding is supposed to come from the infoset's 
>>
>>[character encoding
>>
>>>scheme] property.  The latter is defined as "The name of 
>>
>>the character
>>
>>>encoding scheme in which the document entity is expressed", 
>>
>>matching the
>>
>>>semantics of actualEncoding, not those of an encoding label 
>>
>>read from the
>>
>>>XML declaration.  So the meaning of xmlEncoding remains 
>>
>>pretty murky.
>>
>>xmlEncoding is now read-only, and only represents what was 
>>found in the
>>XML declaration, if any. actualEncoding was the encoding used to load
>>the document, again if any. For the Save module, it should be 
>>clarified
>>that the XML declaration, if generated, will get whatever encoding was
>>used for the serialization, and not actualEncoding or xmlEncoding
>>necessarily.
> 
> 
> Fine.
> 
> 
>>>One wonders why there are actually 2 attributes, since 
>>
>>there is only one
>>
>>>encoding of interest: that of the document that was parsed 
>>
>>to create the DOM tree.
>>
>>For the completeness of the representation of the original document,
>>since some higher protocols could override the encoding 
>>specified in the document.
> 
> 
> But then the encoding found in the XML decl would be of no interest.  "For
> completeness" doesn't seem to meet any reasonable 80/20 criterion of
> usefulness, implementation effort would seem to be better spent elsewhere.
> Nevertheless, the status quo (as fixed above) is acceptable.
> 
> 
> 
>>>C3) Document interface, "renameNode()" method: should specify, like
>>>createAttribute() and others, that an INVALID_CHARACTER_ERR 
>>
>>exception can be
>>
>>>thrown, depending on the "xmlVersion" attribute.
>>
>>correct. fixed.
> 
> 
> OK, thanks.
> 
> 
> 
>>>C4) Node interface, "normalize()" method: this should also 
>>
>>perform character
>>
>>>normalization, perhaps conditional to the config of the 
>>
>>containing Document.
>>
>>>This method's business in life is to concatenate Text 
>>
>>nodes; concatenation
>>
>>>is one of the well-known cases that actually *produces* character
>>>denormalization.  It would be silly to have a method called 
>>
>>normalize()
>>
>>>which actually denormalizes, so any denormalizations caused 
>>
>>by concatenation
>>
>>>should be repaired as part of the method's normal 
>>
>>functioning.  Backward
>>
>>>compatibility can probably be addressed by making the 
>>
>>repairs conditional on
>>
>>>xmlVersion or the config of the containing document or both.
>>
>>normalize() is a DOM Level 1 method. The name is unfortunate since it
>>collides character normalization but we cannot change its semantics or
>>rename it. This explains the introduction of normalizeDocument(),
>>instead of reusing normalize() on Document nodes. An other example of
>>discrepancy with names is our namespaceURI and the [namespace name]
>>Infoset property.
> 
> 
> This doesn't seem to address the comment.  Backward compatibility is
> certainly an issue, but not necessarily a show-stopper: there are numerous
> instances of "Modified in DOM Level 3" in the spec.  We did offer some ideas
> for addressing compatibility, they may be dead-on-arrival but we would like
> to understand why.
> 
> 
> 
>>>Also, it should be specified that this method is sensitive 
>>
>>to the value of
>>
>>>the "cdata-sections" config parameter.
>>
>>Only normalizeDocument is sensitive to the configuration parameters.
> 
> 
> Good to know, thanks.
> 
> 
> 
>>>C5) CharacterData interface: are the various methods  
>>
>>supposed to maintain
>>
>>>character normalization?  Under the control of the config 
>>
>>of the containing
>>
>>>Document? Of "strictErrorChecking"?
>>>
>>>The config parameters "check-character-normalization" and
>>>"normalize-characters" appear to be pertinent, but neither their
>>>descriptions nor the descriptions of the CharacterData.* 
>>
>>methods say that
>>
>>>they have any effect for these methods. 
>>
>>As an update, the following text was added on the description of the
>>DOMString type:
>>[[
>>If the normalization happened at load time, or the method
>>Document.normalizeDocument() was invoked (in both cases, the parameter
>>"normalize-characters" needs to be true), characters are
>>fully-normalized according to the rules defined in [CharModel]
>>supplemented by the definitions of relevant constructs from 
>>Section 2.13
>>of [XML 1.1]. Note that, with the exception of
>>Document.normalizeDocument(), manipulating characters using 
>>DOM methods
>>does not guarantee to preserve a fully-normalized text.
>>]]
> 
> 
> Good enough.  English is a bit poor perhaps, one wonders what is *the*
> normalization, some rewording would be appreciated.
> 
> 
> 
>>>C6) DOMLocator interface, "offset" attribute: there should be two
>>>attributes, one for byte offset and the other for character 
>>
>>offset (or
>>
>>>alternatively another attribute that says whether "offset" 
>>
>>is byte or
>>
>>>character), since the application may not be able to 
>>
>>determine if the source
>>
>>>was bytes or characters.
>>
>>We split the attribute offset into byteOffset and utf16Offset 
>>(since the DOM deals with utf16 units).
> 
> 
> Fine, thanks.
> 
> 
> 
>>>C7) DOMConfiguration interface, "cdata-sections"  
>>
>>parameter: this should
>>
>>>default to false.  CDATA sections are mere syntactic sugar with no
>>>structural role (hint: they do not exist in the infoset), 
>>
>>they do not
>>
>>>deserve to be preserved by default.
>>
>>The parse methods of the LS module don't load CDATA sections 
>>by default
>>(the "infoset" parameter is true by default, this implies that
>>cdata-sections default is false for the parse methods). So unless an
>>application adds CDATASection nodes during manipulations, the
>>"cdata-sections" parameter won't change anything in the tree. 
>>And if the
>>application do add CDATASection nodes in the tree, or the parse
>>operation was requested to preserve the cdata sections, then 
>>they should
>>be preserved by default since the application explicitly asked to get
>>them.
> 
> 
> OK, this makes sense.  Wasn't very clear, though!
> 
>  
> 
>>>C8) DOMConfiguration interface, 
>>
>>"check-character-normalization" parameter:
>>
>>>it is not clear *when* this setting has any effect (i.e. 
>>
>>what methods of
>>
>>>what interfaces it affects).
>>
>>[there is a pending action item on the Core editors to 
>>clarify that only
>>normalizeDocument is affected by the DOMConfiguration parameters]
> 
> 
> ...unless specified elsewhere, as in Load&Save!
> 
> 
> 
>>> Since Charmodel says that text SHOULD be
>>>checked, the default for this should be true, the user 
>>
>>having the chance to
>>
>>>set it to false after careful consideration of the consequences (see
>>>definition of SHOULD in RFC2119).
>>
>>The parameter check-character-normalization is optional so the default
>>cannot be true. Applications can certainly check if the parameter is
>>activated, or can be activated, using the methods defined on the
>>DOMConfiguration object.
> 
> 
> The optional character of check-character-normalization is the first wrong
> (the other being that "true" is not the default).  As argued in our comment,
> a DOM user cannot do the right thing (check normalization unless careful
> consideration of the consequences...) if the normalization checking
> functionality is absent.  The DOM is missing functionality essential for
> things as ssimple and basic as string matching.  We object to that.
> 
> 
> 
>>>C9) The reference to Unicode 3.0 should be updated to 
>>
>>Unicode 4.0, ISBN
>>
>>>0-321-18578-1.
>>
>>done. For the record, the Character Model of the Web (August 2003
>>version) still links to Unicode Version 3.0. Is it intentional?
> 
> 
> :-(  No, oversight.  Fixed now (in our editorial copy).
> 
> 
> 
> 
> 
>>>DOM 3 Load&Save
>>>http://www.w3.org/TR/2003/WD-DOM-Level-3-LS-20030619/
>>>
>>>LS1) Interface DOMParser: character normalization checking 
>>
>>is now controlled
>>
>>>by the "check-character-normalization" parameter of 
>>
>>DOMCOnfiguration defined
>>
>>>in Core. The fact that the "true" value (do check) is 
>>
>>marked as [optional]
>>
>>>(not the default, not even required to implement) is not acceptable.
>>>Whereas Charmod says that normalization SHOULD be checked, 
>>
>>users are not
>>
>>>even able to check if the "true" value is not implemented.  
>>
>>Furthermore, the
>>
>>>DocumentLS.load() and loadXML() methods automatically do 
>>
>>the wrong thing and
>>
>>>have no way to do the right thing if the default is false.
>>
>>Users *are* able to check if the "true" value is implemented or not. 
> 
> 
> True, we missed that, sorry.
> 
> This doesn't address our issue that making implementation optional is not
> acceptable.  See C8 above.
> 
> 
>>Using the DOMConfiguration object, a user can call 
>>config.canSetParameter("check-character-normalization", 
>>true), and that 
>>will tell them if the implementation supports character normalization 
>>checking. The DOM Level 3 Load and Save (and DOM Level 3 
>>Core) specs do 
>>not *require* that implementations *must* support character 
>>normalization.
>>
>>As for DocumentLS, that interface is no longer part of the LS spec.
> 
> 
> No objection to that.
> 
> 
> 
>>>LS2) Interface DOMParser: There should be an error type 
>>
>>defined for failure
>>
>>>to check normalization (sugg. 
>>
>>"normalization-checking-failure") in addition
>>
>>>to the existing "unknown-character-denormalization".
>>
>>Fixed. The following has been added to the spec:
>>
>>[[
>>  "check-character-normalization-failure" [error]
>>     Raised if the paramter "check-character-normalization" is set to
>>     true and a character is encoutered that is not normalized.
>>]]
> 
> 
> Good in principle, but editorially a bit loose.  A character in general
> cannot be "not normalized", this is a property of sequences (strings).
> Please reword to:
> 
> [[
>   "check-character-normalization-failure" [error]
>      Raised if the paramter "check-character-normalization" is set to
>      true and a string is encoutered that fails normalization checking.
> ]]
> 
> or something similar.
> 
> 
> 
>>>LS3) In the discussion of interface DOMSerializer (above the IDL
>>>definition), it would be nice if character references were 
>>
>>specified to be
>>
>>>hexadecimal (preferred) or decimal.  One way or the other 
>>
>>determined by the
>>
>>>spec, not implementation-dependent.  Similarly (still 
>>
>>within DOMSerializer),
>>
>>>it would be better to specify serialization of attribute 
>>
>>values to be always
>>
>>>in quotes (or apostrophes, you choose), with escaping as necessary.
>>
>>The DOM WG discussed this before, and the WG has always 
>>decided against 
>>doing this. If you want canonicalized output, set the 
>>"canonical-form" 
>>parameter, if not, you'll get implementation dependent output.
> 
> 
> Hmmmm...  Reluctantly accepted.  Given the apparently zero implementation
> burden of choosing one way or the other in the spec, one wonders why the WG
> resists this.  Of course, the benefit is not great either, but given the
> rather severe under-specification of serializing anything but Documents and
> Entities, any amount of predictability would seem desirable...
> 
> We would appreciate a at least some text encouraging implementers to use hex
> for character references, since that is what all character encoding
> standards use.
> 
> 
> 
>>>LS6) In DOMSerializer, method writeURI(): there is no way 
>>
>>to control the
>>
>>>encoding that will be used to output.  The method itself 
>>
>>doesn't have a
>>
>>>parameter, and the order of priorities is 
>>
>>Document.actualEncoding followed
>>
>>>by Document.xmlEncoding. Document.actualEncoding being 
>>
>>read-only, the user
>>
>>>has no way to specify the output encoding, except if by chance
>>>Document.actualEncoding is null.  There should be an 
>>
>>additional  "encoding"
>>
>>>parameter (nullable, to fall back to actualEncoding and 
>>
>>xmlEncoding) to the
>>
>>>method.
>>
>>DOMSerializer.writeURI() is merely a convenience method (and is now 
>>defined as such), if you need to pass encoding information 
>>when writing 
>>to a IRI, use DOMSerializer.write() and set the encoding on 
>>the DOMOutput.
> 
> 
> OK, as long as the behaviour w/r to encoding is sufficiently specified
> (which seems to be the case now).
> 
> 
> 
>>>LS7) In DOMSerializer, method writeURI(): the name writeURI 
>>
>>is a little
>>
>>>unfortunate, it seems to imply that a URI is written, not 
>>
>>that it is written *to*.
>>
>>Agreed. writeURI() will be renamed to writeToURI().
> 
> 
> Good, thanks.
> 
> 
> 
>>>LS8) It should be specified that DOMSerializers MUST be 
>>
>>able to serialize in
>>
>>>UTF-8 and both byte-orders of UTF-16, to close the loop 
>>
>>with XML parsers
>>
>>>which are obligated to read these.
>>
>>The DOM WG decided against requiring support for all of those 
>>encodings, 
>>but it did decide to require support for one or more of those 
>>encodins.
> 
> 
> While this is sufficient for strict interoperability, it is not for
> compatibility of code.  If there is not at least one required encoding, it
> is not possible to write a DOM program that will work over any DOM
> implementation.  We insist that at least UTF-8 be required.  Furthermore,
> since XML 1.0 did it back in 1998, it cannot be so onerous to require all 3.
> Please reconsider.
> 
> 
> 
>>>LS9) In DocumentLS.load(), it is said that 'the parameters 
>>
>>used in the
>>
>>>DOMParser interface are assumed to have their default 
>>
>>values with the
>>
>>>exception that the parameters "entities", "normalize-characters",
>>>"check-character-normalization" are set to "false".', which 
>>
>>is strange as
>>
>>>the last 2 of these parameters do default to false anyway.
>>>"check-character-normalization" should default to true (see 
>>
>>other comment).
>>
>>After consideing this issue, and the other issues that we've 
>>got against 
>>DocumentLS and ElementLS, the DOM WG decided to drop these interfaces 
>>from the LS specification.
> 
> 
> We have no objection to this resolution.
> 
> 
> 
>>>LS10) The reference to Unicode 3.0 should be updated to 
>>
>>Unicode 4.0, ISBN
>>
>>>0-321-18578-1.
>>
>>Done.
> 
> 
> Thanks.
> 


-- 
jst

Received on Tuesday, 23 September 2003 21:39:54 UTC