- From: Mark Scardina <mark.scardina@oracle.com>
- Date: Fri, 28 Jun 2002 12:54:58 -0700
- To: <www-i18n-comments@w3.org>
- Cc: <w3c-xsl-wg@w3.org>
Below are the XSL WG's compiled comments/issues on the I18N Character Model Working Draft located at http://www.w3.org/TR/2002/WD-charmod-20020430.

1) Section 2 Conformance

"[S] Every W3C specification MUST conform to the requirements applicable to specifications, specify that implementations MUST conform to the requirements applicable to software, and specify that content created according to that specification MUST conform to the requirements applicable to content.

[S] If an existing W3C specification does not conform to the requirements in this document, then the next version of that specification SHOULD be modified in order to conform."

[XSL] Why is it that every spec MUST but subsequent specs only SHOULD meet this requirement? Is the intent here to permit current non-conforming specs to maintain backwards compatibility in future releases? This is an XSL requirement.

2) 3.1.3 Units of visual rendering

"[S] Protocols, data formats and APIs MUST store, interchange or process text data in logical order."

[XSL] This appears to be a higher level of conformance than necessary for interoperability. Why do internals need to be dictated as long as there is external conformance?

3) 3.1.5 Units of Collation

"Note that, where searching or sorting is done dynamically, particularly in a multilingual environment, the 'relevant language' should be determined to be that of the current user, and may thus differ from user to user."

[XSL] Suggest the "should" become "SHOULD" to bring it to the level of a recommendation.

4) 3.1.7 Summary

"[S] When specifications use the term 'character' it MUST be clear which of the possible meanings they intend.

[S] Specifications SHOULD avoid the use of the term 'character' if a more specific term is available."

[XSL] In 3.1.7 it is stated that specifications must make it clear "which of the possible meanings" of the word "character" is intended. But it's not explicit what the "possible meanings" are.
Where do we read that one of the possible meanings is "a Unicode code point"?

[XSL] There should be examples of this, as the spec itself is an offender here. While it could be explicitly stated in line with their own requirement, we read it that the possible meanings were "Units of aural rendering", "Units of visual rendering", "Units of input", "Units of collation" and "Units of storage".

[XSL] It is very difficult to conform to the second requirement in 3.1.7, as is illustrated by the fact that the Character Model document itself fails to conform to it: see the immediately following section heading.

5) 3.2 Digital Encoding of Characters

"A CES, together with the CCSes it is used with, is identified by an IANA charset identifier. Given a sequence of bytes representing text and a charset identifier, one can in principle unambiguously recover the sequence of characters of the text."

[XSL] There are legal identifiers other than IANA ones. We should not be restricted to these.

6) 3.5 Reference Processing Model

"[S] Specifications MAY allow use of any character encoding which can be transcoded to Unicode for its text entities.

[S] Specifications MAY choose to disallow or deprecate some encodings and to make others mandatory. Independent of the actual encoding, the specified behavior MUST be the same as if the processing happened as follows:

The encoding of any text entity received by the application implementing the specification MUST be determined and the text entity MUST be interpreted as a sequence of Unicode characters - this MUST be equivalent to transcoding the entity to some Unicode encoding form, adjusting any character encoding label if necessary, and receiving it in that Unicode encoding form.

All processing MUST take place on this sequence of Unicode characters.

If text is output by the application, the sequence of Unicode characters MUST be encoded using an encoding chosen among those allowed by the specification.
[S] If a specification is such that multiple text entities are involved (such as an XML document referring to external parsed entities), it MAY choose to allow these entities to be in different character encodings. In all cases, the Reference Processing Model MUST be applied to all entities."

[XSL] It may be less confusing to separate these requirements and introduce each with a sentence of clarifying context. Is the intent to forbid entity representation of non-Unicode characters?

7) 3.7 Character Escaping

"Certain guidelines apply to content developers, as well as to software that generates content: ...

[I] [C] Choose an encoding for the document that maximizes the opportunity to directly represent characters and minimizes the need to represent characters by markup means such as character escapes. In general, if the first encoding choice is not satisfactory, Unicode is the next best choice, for its large character repertoire and its wide base of support."

[XSL] The last bullet immediately before the section heading of section 4 seems strange. Grammatically, it is hard to parse, and it is in the imperative mood, which is not used elsewhere. Semantically, the statement that "If the first encoding choice is not satisfactory, Unicode is the next best choice" seems very odd. Surely (a) Unicode is always the first choice, and (b) Unicode is not an encoding? Also, the term "satisfactory" is far too vague for a specification. We also question the appropriateness of these "guidelines" in the spec body. They seem more appropriate for a note or appendix.

[XSL] We have a concern about the guideline preventing new character escaping syntax.

8) 4.4 Responsibility for Normalization

"[C] In order to conform to this specification, all text content on the Web MUST be in include-normalized form and SHOULD be in fully-normalized form."

[XSL] The impacts of this requirement on XSLT and other infoset "pipeline" type processes are still unclear to us.

[XSL] For instance, XSLT and many other specifications are designed around an infoset "pipeline" so that various processes can transform, augment, or otherwise manipulate content. A final step in a pipeline often involves serialization of the infoset. It appears to us that serialization of an infoset according to the Character Model may result in either significant manipulation of the data within that infoset (resulting in a loss of data fidelity) or failure to serialize. In either case, an upstream process such as an XSLT transformation cannot trust that its output can be successfully processed further on in the pipeline without adopting normalization rules at the infoset level as well. The practical inability to limit normalization to text content on the Web concerns us. The implications of this are not adequately discussed in the Character Model spec.

[XSL] Without a clear idea of the implications of the Character Model for the tendency to rely on the XML Information Set, instead of on text, for composing processes within a system, we cannot agree to the mandate for normalization.

9) "[S] [I] A text-processing component that receives suspect text MUST NOT perform any normalization-sensitive operations unless it has first confirmed through inspection that the text is in normalized form, and MUST NOT normalize the suspect text. Private agreements MAY, however, be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed."

[XSL] The exception for private agreements is crippled by the observable results restriction; thus, when all is said and done, any suspect text will always remain.

[XSL] Section 4.4 appears to require that XML be changed to disallow the use of a composing character as the first character in an entity. This change would be backwards incompatible. XSL WG specifications such as XSLT and XPath must continue to work with all well-formed XML documents.

[XSL] Since the contents of an XML text node are "suspect text" (there is nothing to prevent use of a composing character as the first character in a text node), section 4.4 appears to be saying that XPath must disallow operations such as substring() unless the text is inspected and found to be normalized. We do not believe that users want to pay the high cost of this feature.

10) "[I] A text-processing component which modifies text and performs normalization-sensitive operations MUST behave as if normalization took place after each modification, so that any subsequent normalization-sensitive operations always behave as if they were dealing with normalized text."

[XSL] The fourth requirement in section 4.4 is labelled [I], but XPath implementations have to do what the XPath specification says, so this is actually an [S] requirement. The implication of this requirement is that functions such as concat() should perform normalization. This is both expensive and backwards-incompatible; we will have to examine whether the benefits exceed the costs. This also seems to violate the self-imposed limitation to require conformance only for observable behaviors. How XPaths are handled within an XSLT processor should not be the subject of this spec as long as the results are conformant.

11) "[S] Specifications of text-based languages and protocols SHOULD define precisely the construct boundaries necessary to obtain a complete definition of full-normalization. These definitions MUST include at least the boundaries between markup and character data as well as entity boundaries (if the language has any include mechanism) and SHOULD include any other boundary that may create denormalization when instances of the language are processed."

[XSL] The requirement (still in 4.4) about defining construct boundaries is very unclear when applied to a language that performs dynamic manipulation of strings.
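
To make the concern in comments 10 and 11 concrete, here is a minimal sketch in Python. It is an illustration only, under stated assumptions: plain string concatenation stands in for an XPath concat() call, NFC is taken as the normalization form in question, and the helper name is_nfc is hypothetical. It shows that joining two strings that are each normalized can yield a result that is not, which is why dynamic string operations would need a normalization pass after each modification.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


# Two strings that are each normalized on their own.
left = "e"        # LATIN SMALL LETTER E
right = "\u0301"  # COMBINING ACUTE ACCENT (a composing character)

# Plain concatenation, standing in here for an XPath concat() call.
joined = left + right

# Concatenation followed by normalization, as the [I] requirement implies.
renormalized = unicodedata.normalize("NFC", joined)

print(is_nfc(left), is_nfc(right))     # True True
print(is_nfc(joined))                  # False
print(len(joined), len(renormalized))  # 2 1
```

The renormalized result is a different code point sequence (the single code point U+00E9) from the plain concatenation, which is the backwards-compatibility cost raised in comment 10.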

12) "[S] Specifications MUST document any security issues related to normalization."

[XSL] The requirement "Specifications MUST document any security issues related to normalization." is untestable on its face and should be detailed.

13) 6. String Identity Matching

"[S] [I] Forms of string matching other than identity matching SHOULD be performed as if the following steps were followed: Steps 1 to 3 for string identity matching. Matching the strings in a way that is appropriate to the application."

[XSL] It is unclear whether the procedure for string identity matching in section 6 establishes a requirement for expansion of %HH escapes in URIs, especially when comparing namespace URIs, where such expansion has not traditionally been performed. Section 8 should give guidance on this.

Mark
________________________________________________________________
Mark V. Scardina
Group Product Mgr & XML Evangelist
CORE & XML DEVELOPMENT GROUP
E-mail: Mark.Scardina@oracle.com
Web Site: http://otn.oracle.com/tech/xml/
Received on Friday, 28 June 2002 15:56:43 UTC