XSL WG Comments on Character Model WD from Mark Scardina on 2002-06-28 (www-i18n-comments@w3.org from June 2002)

From: Mark Scardina <mark.scardina@oracle.com>
Date: Fri, 28 Jun 2002 12:54:58 -0700
To: <www-i18n-comments@w3.org>
Cc: <w3c-xsl-wg@w3.org>
Message-ID: <004f01c21edd$b008d610$3ab62382@us.oracle.com>
Below are XSL WG compiled comments/issues on the I18N Character Model
Working Draft located at http://www.w3.org/TR/2002/WD-charmod-20020430.

1) Section 2 Conformance
"[S] Every W3C specification MUST conform to the requirements applicable
to specifications, specify that implementations MUST conform to the
requirements applicable to software, and specify that content created
according to that specification MUST conform to the requirements
applicable to content. [S] If an existing W3C specification does not
conform to the requirements in this document, then the next version of
that specification SHOULD be modified in order to conform."
 
[XSL]Why is it that every spec MUST conform, but the next version of an
existing non-conforming spec only SHOULD? Is the intent here to permit
current non-conforming specs to maintain backwards compatibility in
future releases? Such backwards compatibility is an XSL requirement.
 

2) 3.1.3 Units of visual rendering
"[S] Protocols, data formats and APIs MUST store, interchange or process
text data in logical order."
 
[XSL]This appears to be a higher level of conformance than necessary for
interoperability.  Why do internals need to be dictated as long as there
is external conformance?
 

3) 3.1.5 Units of Collation
"Note that, where searching or sorting is done dynamically, particularly
in a multilingual environment, the 'relevant language' should be
determined to be that of the current user, and may thus differ from user
to user."
 
[XSL]Suggest the "should" become "SHOULD" to bring it to the level of a
recommendation.
 

4) 3.1.7 Summary
"[S] When specifications use the term 'character ' it MUST be clear
which of the possible meanings they intend. 
[S] Specifications SHOULD avoid the use of the term 'character' if a
more specific term is available."
 
[XSL]In 3.1.7 it is stated that specifications must make it clear "which
of the possible meanings" of the word "character" is intended. But it's
not explicit what the "possible meanings" are. Where do we read that one
of the possible meanings is "a Unicode code point"? 
 
[XSL]There should be examples of this, as the spec itself is an offender
here. The possible meanings could be stated explicitly, in line with the
spec's own requirement; we read them to be "Units of aural rendering",
"Units of visual rendering", "Units of input", "Units of collation" and
"Units of storage".
 
[XSL] It is very difficult to conform to the second requirement in
3.1.7, as is illustrated by the fact that the Character Model document
itself fails to conform to it: see the immediately following section
heading.
 

5) 3.2 Digital Encoding of Characters
"A CES, together with the CCSes it is used with, is identified by an
IANA charset identifier. Given a sequence of bytes representing text and
a charset identifier, one can in principle unambiguously recover the
sequence of characters of the text."
 
[XSL]There are legal identifiers other than IANA's.  We should not be
restricted to IANA identifiers.
 

6) 3.5 Reference Processing Model
" [S] Specifications MAY allow use of any character encoding which can
be transcoded to Unicode for its text entities.
  [S] Specifications MAY choose to disallow or deprecate some encodings
and to make others mandatory. Independent of the actual encoding, the
specified behavior MUST be the same as if the processing happened as
follows: 
The encoding of any text entity received by the application implementing
the specification MUST be determined and the text entity MUST be
interpreted as a sequence of Unicode characters - this MUST be
equivalent to transcoding the entity to some Unicode encoding form,
adjusting any character encoding label if necessary, and receiving it in
that Unicode encoding form. All processing MUST take place on this
sequence of Unicode characters. If text is output by the application,
the sequence of Unicode characters MUST be encoded using an encoding
chosen among those allowed by the specification. [S] If a specification
is such that multiple text entities are involved (such as an XML
document referring to external parsed entities), it MAY choose to allow
these entities to be in different character encodings. In all cases, the
Reference Processing Model MUST be applied to all entities."
 
[XSL] It may be less confusing to separate these requirements,
introducing each with a sentence that clarifies its context.
Is the intent to forbid entity representation of non-Unicode
characters?  
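
A minimal Python sketch of the processing model quoted above, not taken
from the Working Draft. The function name process_entity, its
parameters, and the use of upper() as a stand-in for "processing" are
illustrative assumptions only.

```python
def process_entity(raw: bytes, declared_encoding: str,
                   output_encoding: str = "utf-8") -> bytes:
    """Hypothetical illustration of the Reference Processing Model."""
    # 1. Determine the encoding and interpret the bytes as a sequence
    #    of Unicode characters (Python str holds Unicode code points).
    text = raw.decode(declared_encoding)
    # 2. All processing takes place on the Unicode sequence.
    #    upper() is only a placeholder for real processing.
    processed = text.upper()
    # 3. On output, encode using an encoding the specification allows.
    return processed.encode(output_encoding)


latin1_bytes = "caf\u00e9".encode("iso-8859-1")
result = process_entity(latin1_bytes, "iso-8859-1")
```

Under this reading, a byte sequence that cannot be decoded to Unicode
fails at step 1, which is the situation our question is about.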
 

7) 3.7 Character Escaping
"Certain guidelines apply to content developers, as well as to software
that generates content: ... [I] [C] Choose an encoding for the document
that maximizes the opportunity to directly represent characters and
minimizes the need to represent characters by markup means such as
character escapes. In general, if the first encoding choice is not
satisfactory, Unicode is the next best choice, for its large character
repertoire and its wide base of support."
 
[XSL]The last bullet immediately before the section heading of section 4
seems strange. Grammatically, it is hard to parse, and is in the
imperative mood which is not used elsewhere. Semantically, the statement
that "If the first encoding choice is not satisfactory, Unicode is the
next best choice" seems very odd. Surely (a) Unicode is always the first
choice, and (b) Unicode is not an encoding? Also the term "satisfactory"
is far too vague for a specification.  We also question the
appropriateness of these "guidelines" in the spec body. They seem more
appropriate for a note or appendix.

[XSL]We have a concern about the guideline preventing new character
escaping syntax. 

8) 4.4 Responsibility for Normalization
"[C] In order to conform to this specification, all text content on the
Web MUST be in include-normalized form and SHOULD be in fully-normalized
form."

[XSL]The impacts of this requirement on XSLT and other infoset
"pipeline" type processes are still unclear to us.

[XSL]For instance, XSLT and many other specifications are designed
around an infoset "pipeline" so that various processes can transform,
augment, or otherwise manipulate content.  A final step in a pipeline
often involves serialization of the infoset.  It appears to us that
serialization of an infoset according to the Character Model may result
in either significant manipulation of the data within that infoset
(resulting in a loss of data fidelity) or failure to serialize.  In
either case, an upstream process such as an XSLT transformation cannot
trust that its output can be successfully processed further on in the
pipeline, without adopting normalization rules at the infoset level as
well.  The practical inability to limit normalization to text content on
the Web concerns us.  The implications of this are not adequately
discussed in the Character Model spec.

[XSL]Without a clear idea of the implications of the Character Model
upon the tendency to rely on the XML Information Set instead of upon
text for composing processes within a system, we cannot agree to the
mandate for normalization.

9) "[S] [I] A text-processing component that receives suspect text MUST
NOT perform any normalization-sensitive operations unless it has first
confirmed through inspection that the text is in normalized form, and
MUST NOT normalize the suspect text. Private agreements MAY, however,
be created within private systems which are not subject to these rules,
but any externally observable results MUST be the same as if the rules
had been obeyed."
 
[XSL] The exception for private agreements is crippled by the restriction
on observable results. When all is said and done, any suspect text will
always remain.
 
[XSL] Section 4.4 appears to require that XML be changed to disallow the
use of a composing character as the first character in an entity. This
change would be backwards incompatible. XSL WG specifications such as
XSLT and XPath must continue to work with all XML well-formed documents.
 
[XSL] Since the contents of an XML text node are "suspect text" (there
is nothing to prevent use of a composing character as the first
character in a text node), section 4.4 appears to be saying that XPath
must disallow operations such as substring() unless the text is
inspected and found to be normalized. We do not believe that users want
to pay the high cost of this feature. 
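
A rough Python sketch of what such an inspection might cost, under
stated assumptions: NFC stands in for the Character Model's normalized
forms, a non-zero canonical combining class stands in for "composing
character" (the Working Draft's definition is broader), and the names
starts_with_combining and checked_substring are hypothetical.

```python
import unicodedata


def starts_with_combining(text: str) -> bool:
    """Approximation: first code point has a non-zero combining class."""
    return bool(text) and unicodedata.combining(text[0]) != 0


def checked_substring(text: str, start: int, length: int) -> str:
    """Hypothetical substring() that refuses to operate on suspect text."""
    # The inspection scans the whole string before any work is done,
    # which is the per-call overhead we are concerned about.
    if starts_with_combining(text) or not unicodedata.is_normalized("NFC", text):
        raise ValueError("suspect text: not confirmed to be normalized")
    return text[start:start + length]
```

For example, checked_substring("abc", 0, 2) returns "ab", while a text
node beginning with U+0301 COMBINING ACUTE ACCENT raises an error
instead of returning a result.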
 
10) "[I] A text-processing component which modifies text and performs
normalization-sensitive operations MUST behave as if normalization took
place after each modification, so that any subsequent
normalization-sensitive operations always behave as if they were dealing
with normalized text."
 
[XSL] The fourth requirement in section 4.4 is labelled [I], but XPath
implementations have to do what the XPath specification says, so this is
actually an [S] requirement. The implication of this requirement is that
functions such as concat() should perform normalization. This is both
expensive and backwards-incompatible; we will have to examine whether
the benefits exceed the costs.  This also seems to
violate the self-imposed limitation to only require conformance to
observable behaviors. How XPaths are handled within an XSLT Processor
should not be the subject of this spec as long as the results are
conformant.  
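
A small Python sketch of the concat() concern, assuming NFC as the
normalization form. The function names concat_plain and
concat_normalizing are hypothetical and do not come from XPath or the
Character Model.

```python
import unicodedata


def concat_plain(*parts: str) -> str:
    """Concatenate code points as-is, with no normalization step."""
    return "".join(parts)


def concat_normalizing(*parts: str) -> str:
    """Hypothetical variant that re-normalizes (NFC) after joining."""
    return unicodedata.normalize("NFC", "".join(parts))


base, accent = "e", "\u0301"  # LATIN SMALL LETTER E, COMBINING ACUTE ACCENT

# Each argument is normalized on its own ...
each_ok = all(unicodedata.is_normalized("NFC", s) for s in (base, accent))
# ... but the plain concatenation is not.
plain = concat_plain(base, accent)
plain_ok = unicodedata.is_normalized("NFC", plain)
# The normalizing variant composes the pair into a single code point,
# so the result's length differs from the sum of the argument lengths.
composed = concat_normalizing(base, accent)
```

The change in string length is one way the normalizing behavior would be
visible to existing stylesheets.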
 

11) "[S] Specifications of text-based languages and protocols SHOULD
define precisely the construct boundaries necessary to obtain a complete
definition of full-normalization. These definitions MUST include at
least the boundaries between markup and character data as well as entity
boundaries (if the language has any include mechanism) and SHOULD
include any other boundary that may create denormalization when
instances of the language are processed."
 
[XSL] The requirement (still in 4.4) about defining construct boundaries
is very unclear when applied to a language that performs dynamic
manipulation of strings. 
 

12) "[S] Specifications MUST document any security issues related to
normalization."
 
[XSL] The requirement "Specifications MUST document any security issues
related to normalization." is untestable on its face and should be
specified in more detail.
 

13) 6. String Identity Matching
"[S] [I] Forms of string matching other than identity matching SHOULD be
performed as if the following steps were followed:
  Steps 1 to 3 for string identity matching.
  Matching the strings in a way that is appropriate to the application."
 
[XSL] It is unclear whether the procedure for string identity matching
in section 6 establishes a requirement for expansion of %HH escapes in
URIs, especially when comparing namespace URIs, where such expansion has
not traditionally been performed. Section 8 should give guidance on
this.
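
A short Python sketch of the ambiguity, using made-up example.org
namespace URIs. The function names and the choice of
urllib.parse.unquote to expand %HH escapes are our assumptions, not
something section 6 prescribes.

```python
from urllib.parse import unquote


def same_namespace_literal(a: str, b: str) -> bool:
    """Character-for-character comparison, with no escape expansion."""
    return a == b


def same_namespace_expanded(a: str, b: str) -> bool:
    """Hypothetical comparison that expands %HH escapes first."""
    return unquote(a) == unquote(b)


ns_escaped = "http://example.org/%7Euser/ns"
ns_literal = "http://example.org/~user/ns"

literal_match = same_namespace_literal(ns_escaped, ns_literal)
expanded_match = same_namespace_expanded(ns_escaped, ns_literal)
```

The two comparisons disagree for this pair of URIs, so whichever reading
is intended changes which elements are considered to be in the same
namespace.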

Mark

________________________________________________________________
Mark V. Scardina              Group Product Mgr & XML Evangelist
CORE & XML DEVELOPMENT GROUP  E-mail: Mark.Scardina@oracle.com
Web Site: http://otn.oracle.com/tech/xml/
Received on Friday, 28 June 2002 15:56:43 UTC