- From: Johnny Stenback <jst@w3c.jstenback.com>
- Date: Tue, 23 Sep 2003 18:39:14 -0700
- To: Francois Yergeau <FYergeau@alis.com>
- Cc: "'www-dom@w3.org'" <www-dom@w3.org>, "'w3c-i18n-ig@w3.org'" <w3c-i18n-ig@w3.org>
Francois Yergeau wrote: > This is the reply from the i18n WG on the answers received to our comments > (http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0109.html). > > We gathered answers from the following messages: > From Philippe Le Hégaret: > http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0163.html > http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0219.html > http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0220.html > > From Johnny Stenback: > http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0206.html > http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0207.html > http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0208.html > http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0209.html > http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0210.html > http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0212.html > http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0213.html > http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0215.html > > We also acknowledge a satisfactory answer to our C2 > (http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0143.html), which > was actually a question. > > We're still missing answers on our C10 and LS5. LS5 was dealt with and the spec was updated accordingly last week (or the week before), but due to a cut n' paste error in the issues list, I forgot to reply to the mailing list with the resolution of that issue. Reply now at: http://lists.w3.org/Archives/Public/www-dom/2003JulSep/0236.html Sorry for that oversight. > > > >>>DOM 3 Core >>>http://www.w3.org/TR/2003/WD-DOM-Level-3-Core-20030609/ >>> >>> >>>C1) Document interface, "actualEncoding" and "xmlEncoding" >> >>attributes: This >> >>>is very much improved since the previous version, but >> >>unfortunately still >> >>>not totally clear. Since the DOM stores documents in >> >>UTF-16 exclusively, >> >>>these attributes must necessarily refer to the encoding of >> >>a serialized >> >>>document that is parsed to create the DOM tree; since >> >>xmlEncoding is not >> >>>read-only, it can also be set programmatically, but that >> >>shouldn't change >> >>>its semantics. Semantics is precisely where some dark spots remain. >>>actualEncoding is pretty clearly defined as the actual >> >>encoding of the >> >>>parsed document, supposedly gleaned from the parser. >> >>xmlEncoding is then >> >>>said to be taken from the XML declaration, but Appendix >> >>C.1.1 says that >> >>>xmlEncoding is supposed to come from the infoset's >> >>[character encoding >> >>>scheme] property. The latter is defined as "The name of >> >>the character >> >>>encoding scheme in which the document entity is expressed", >> >>matching the >> >>>semantics of actualEncoding, not those of an encoding label >> >>read from the >> >>>XML declaration. So the meaning of xmlEncoding remains >> >>pretty murky. >> >>xmlEncoding is now read-only, and only represents what was >>found in the >>XML declaration, if any. actualEncoding was the encoding used to load >>the document, again if any. For the Save module, it should be >>clarified >>that the XML declaration, if generated, will get whatever encoding was >>used for the serialization, and not actualEncoding or xmlEncoding >>necessarily. > > > Fine. > > >>>One wonders why there are actually 2 attributes, since >> >>there is only one >> >>>encoding of interest: that of the document that was parsed >> >>to create the DOM tree. >> >>For the completeness of the representation of the original document, >>since some higher protocols could override the encoding >>specified in the document. > > > But then the encoding found in the XML decl would be of no interest. "For > completeness" doesn't seem to meet any reasonable 80/20 criterion of > usefulness, implementation effort would seem to be better spent elsewhere. > Nevertheless, the status quo (as fixed above) is acceptable. > > > >>>C3) Document interface, "renameNode()" method: should specify, like >>>createAttribute() and others, that an INVALID_CHARACTER_ERR >> >>exception can be >> >>>thrown, depending on the "xmlVersion" attribute. >> >>correct. fixed. > > > OK, thanks. > > > >>>C4) Node interface, "normalize()" method: this should also >> >>perform character >> >>>normalization, perhaps conditional to the config of the >> >>containing Document. >> >>>This method's business in life is to concatenate Text >> >>nodes; concatenation >> >>>is one of the well-known cases that actually *produces* character >>>denormalization. It would be silly to have a method called >> >>normalize() >> >>>which actually denormalizes, so any denormalizations caused >> >>by concatenation >> >>>should be repaired as part of the method's normal >> >>functioning. Backward >> >>>compatibility can probably be addressed by making the >> >>repairs conditional on >> >>>xmlVersion or the config of the containing document or both. >> >>normalize() is a DOM Level 1 method. The name is unfortunate since it >>collides character normalization but we cannot change its semantics or >>rename it. This explains the introduction of normalizeDocument(), >>instead of reusing normalize() on Document nodes. An other example of >>discrepancy with names is our namespaceURI and the [namespace name] >>Infoset property. > > > This doesn't seem to address the comment. Backward compatibility is > certainly an issue, but not necessarily a show-stopper: there are numerous > instances of "Modified in DOM Level 3" in the spec. We did offer some ideas > for addressing compatibility, they may be dead-on-arrival but we would like > to understand why. > > > >>>Also, it should be specified that this method is sensitive >> >>to the value of >> >>>the "cdata-sections" config parameter. >> >>Only normalizeDocument is sensitive to the configuration parameters. > > > Good to know, thanks. > > > >>>C5) CharacterData interface: are the various methods >> >>supposed to maintain >> >>>character normalization? Under the control of the config >> >>of the containing >> >>>Document? Of "strictErrorChecking"? >>> >>>The config parameters "check-character-normalization" and >>>"normalize-characters" appear to be pertinent, but neither their >>>descriptions nor the descriptions of the CharacterData.* >> >>methods say that >> >>>they have any effect for these methods. >> >>As an update, the following text was added on the description of the >>DOMString type: >>[[ >>If the normalization happened at load time, or the method >>Document.normalizeDocument() was invoked (in both cases, the parameter >>"normalize-characters" needs to be true), characters are >>fully-normalized according to the rules defined in [CharModel] >>supplemented by the definitions of relevant constructs from >>Section 2.13 >>of [XML 1.1]. Note that, with the exception of >>Document.normalizeDocument(), manipulating characters using >>DOM methods >>does not guarantee to preserve a fully-normalized text. >>]] > > > Good enough. English is a bit poor perhaps, one wonders what is *the* > normalization, some rewording would be appreciated. > > > >>>C6) DOMLocator interface, "offset" attribute: there should be two >>>attributes, one for byte offset and the other for character >> >>offset (or >> >>>alternatively another attribute that says whether "offset" >> >>is byte or >> >>>character), since the application may not be able to >> >>determine if the source >> >>>was bytes or characters. >> >>We split the attribute offset into byteOffset and utf16Offset >>(since the DOM deals with utf16 units). > > > Fine, thanks. > > > >>>C7) DOMConfiguration interface, "cdata-sections" >> >>parameter: this should >> >>>default to false. CDATA sections are mere syntactic sugar with no >>>structural role (hint: they do not exist in the infoset), >> >>they do not >> >>>deserve to be preserved by default. >> >>The parse methods of the LS module don't load CDATA sections >>by default >>(the "infoset" parameter is true by default, this implies that >>cdata-sections default is false for the parse methods). So unless an >>application adds CDATASection nodes during manipulations, the >>"cdata-sections" parameter won't change anything in the tree. >>And if the >>application do add CDATASection nodes in the tree, or the parse >>operation was requested to preserve the cdata sections, then >>they should >>be preserved by default since the application explicitly asked to get >>them. > > > OK, this makes sense. Wasn't very clear, though! > > > >>>C8) DOMConfiguration interface, >> >>"check-character-normalization" parameter: >> >>>it is not clear *when* this setting has any effect (i.e. >> >>what methods of >> >>>what interfaces it affects). >> >>[there is a pending action item on the Core editors to >>clarify that only >>normalizeDocument is affected by the DOMConfiguration parameters] > > > ...unless specified elsewhere, as in Load&Save! > > > >>> Since Charmodel says that text SHOULD be >>>checked, the default for this should be true, the user >> >>having the chance to >> >>>set it to false after careful consideration of the consequences (see >>>definition of SHOULD in RFC2119). >> >>The parameter check-character-normalization is optional so the default >>cannot be true. Applications can certainly check if the parameter is >>activated, or can be activated, using the methods defined on the >>DOMConfiguration object. > > > The optional character of check-character-normalization is the first wrong > (the other being that "true" is not the default). As argued in our comment, > a DOM user cannot do the right thing (check normalization unless careful > consideration of the consequences...) if the normalization checking > functionality is absent. The DOM is missing functionality essential for > things as ssimple and basic as string matching. We object to that. > > > >>>C9) The reference to Unicode 3.0 should be updated to >> >>Unicode 4.0, ISBN >> >>>0-321-18578-1. >> >>done. For the record, the Character Model of the Web (August 2003 >>version) still links to Unicode Version 3.0. Is it intentional? > > > :-( No, oversight. Fixed now (in our editorial copy). > > > > > >>>DOM 3 Load&Save >>>http://www.w3.org/TR/2003/WD-DOM-Level-3-LS-20030619/ >>> >>>LS1) Interface DOMParser: character normalization checking >> >>is now controlled >> >>>by the "check-character-normalization" parameter of >> >>DOMCOnfiguration defined >> >>>in Core. The fact that the "true" value (do check) is >> >>marked as [optional] >> >>>(not the default, not even required to implement) is not acceptable. >>>Whereas Charmod says that normalization SHOULD be checked, >> >>users are not >> >>>even able to check if the "true" value is not implemented. >> >>Furthermore, the >> >>>DocumentLS.load() and loadXML() methods automatically do >> >>the wrong thing and >> >>>have no way to do the right thing if the default is false. >> >>Users *are* able to check if the "true" value is implemented or not. > > > True, we missed that, sorry. > > This doesn't address our issue that making implementation optional is not > acceptable. See C8 above. > > >>Using the DOMConfiguration object, a user can call >>config.canSetParameter("check-character-normalization", >>true), and that >>will tell them if the implementation supports character normalization >>checking. The DOM Level 3 Load and Save (and DOM Level 3 >>Core) specs do >>not *require* that implementations *must* support character >>normalization. >> >>As for DocumentLS, that interface is no longer part of the LS spec. > > > No objection to that. > > > >>>LS2) Interface DOMParser: There should be an error type >> >>defined for failure >> >>>to check normalization (sugg. >> >>"normalization-checking-failure") in addition >> >>>to the existing "unknown-character-denormalization". >> >>Fixed. The following has been added to the spec: >> >>[[ >> "check-character-normalization-failure" [error] >> Raised if the paramter "check-character-normalization" is set to >> true and a character is encoutered that is not normalized. >>]] > > > Good in principle, but editorially a bit loose. A character in general > cannot be "not normalized", this is a property of sequences (strings). > Please reword to: > > [[ > "check-character-normalization-failure" [error] > Raised if the paramter "check-character-normalization" is set to > true and a string is encoutered that fails normalization checking. > ]] > > or something similar. > > > >>>LS3) In the discussion of interface DOMSerializer (above the IDL >>>definition), it would be nice if character references were >> >>specified to be >> >>>hexadecimal (preferred) or decimal. One way or the other >> >>determined by the >> >>>spec, not implementation-dependent. Similarly (still >> >>within DOMSerializer), >> >>>it would be better to specify serialization of attribute >> >>values to be always >> >>>in quotes (or apostrophes, you choose), with escaping as necessary. >> >>The DOM WG discussed this before, and the WG has always >>decided against >>doing this. If you want canonicalized output, set the >>"canonical-form" >>parameter, if not, you'll get implementation dependent output. > > > Hmmmm... Reluctantly accepted. Given the apparently zero implementation > burden of choosing one way or the other in the spec, one wonders why the WG > resists this. Of course, the benefit is not great either, but given the > rather severe under-specification of serializing anything but Documents and > Entities, any amount of predictability would seem desirable... > > We would appreciate a at least some text encouraging implementers to use hex > for character references, since that is what all character encoding > standards use. > > > >>>LS6) In DOMSerializer, method writeURI(): there is no way >> >>to control the >> >>>encoding that will be used to output. The method itself >> >>doesn't have a >> >>>parameter, and the order of priorities is >> >>Document.actualEncoding followed >> >>>by Document.xmlEncoding. Document.actualEncoding being >> >>read-only, the user >> >>>has no way to specify the output encoding, except if by chance >>>Document.actualEncoding is null. There should be an >> >>additional "encoding" >> >>>parameter (nullable, to fall back to actualEncoding and >> >>xmlEncoding) to the >> >>>method. >> >>DOMSerializer.writeURI() is merely a convenience method (and is now >>defined as such), if you need to pass encoding information >>when writing >>to a IRI, use DOMSerializer.write() and set the encoding on >>the DOMOutput. > > > OK, as long as the behaviour w/r to encoding is sufficiently specified > (which seems to be the case now). > > > >>>LS7) In DOMSerializer, method writeURI(): the name writeURI >> >>is a little >> >>>unfortunate, it seems to imply that a URI is written, not >> >>that it is written *to*. >> >>Agreed. writeURI() will be renamed to writeToURI(). > > > Good, thanks. > > > >>>LS8) It should be specified that DOMSerializers MUST be >> >>able to serialize in >> >>>UTF-8 and both byte-orders of UTF-16, to close the loop >> >>with XML parsers >> >>>which are obligated to read these. >> >>The DOM WG decided against requiring support for all of those >>encodings, >>but it did decide to require support for one or more of those >>encodins. > > > While this is sufficient for strict interoperability, it is not for > compatibility of code. If there is not at least one required encoding, it > is not possible to write a DOM program that will work over any DOM > implementation. We insist that at least UTF-8 be required. Furthermore, > since XML 1.0 did it back in 1998, it cannot be so onerous to require all 3. > Please reconsider. > > > >>>LS9) In DocumentLS.load(), it is said that 'the parameters >> >>used in the >> >>>DOMParser interface are assumed to have their default >> >>values with the >> >>>exception that the parameters "entities", "normalize-characters", >>>"check-character-normalization" are set to "false".', which >> >>is strange as >> >>>the last 2 of these parameters do default to false anyway. >>>"check-character-normalization" should default to true (see >> >>other comment). >> >>After consideing this issue, and the other issues that we've >>got against >>DocumentLS and ElementLS, the DOM WG decided to drop these interfaces >>from the LS specification. > > > We have no objection to this resolution. > > > >>>LS10) The reference to Unicode 3.0 should be updated to >> >>Unicode 4.0, ISBN >> >>>0-321-18578-1. >> >>Done. > > > Thanks. > -- jst
Received on Tuesday, 23 September 2003 21:39:54 UTC