Comments on 26 January 2001 (last call) Character Model Working Draft from Ian Jacobs on 2001-01-29 (www-i18n-comments@w3.org from January 2001)

From: Ian Jacobs <ij@w3.org>
Date: Mon, 29 Jan 2001 15:31:50 -0500
To: www-i18n-comments@w3.org
CC: ij@w3.org
Message-ID: <3A75D336.4078E474@w3.org>

Hello,

I have a few comments and questions about the 26 January 2001
(last call) Working Draft of "Character Model for the World
Wide Web 1.0" [1]. My comments are not sent on behalf of any 
W3C Working Group.

My global comment is that I think the editors have done a good
job explaining some very difficult concepts, but that more
work may be necessary so that someone reading this specification
without extensive experience with these topics will be able
to grasp the issues. I have had some exposure to these topics
and they are still challenging to me. Where sections begin with
examples, I find those sections easier to absorb, and Appendix A
is very instructive. My global advice is therefore to add
more examples and explanation of those examples. 

Specific comments and questions follow. They are not prioritized.
Some are editorial, but more than "there is a missing comma".

Congratulations on your progress on this document!

 - Ian

[1] http://www.w3.org/TR/2001/WD-charmod-20010126

====================
COMMENTS
====================

 | 1 Introduction
 | 
 | 1.1 Goals and Scope

[snip] 

 | All W3C specifications have to conform to this document (see
 | section [57]2 Conformance). 

IJ: I don't think you can make this statement. This specification
must define conformance to itself and should encourage other specs
to require conformance to it. However, I don't think it can impose
this dependency on other specifications. This may be something
that a future W3C architecture document can do.

[snip]

 | 2 Conformance

[snip]

 | In order to conform to this document, all applicable requirements
 | MUST be satisfied. 

IJ: I don't believe that you've defined "applicable". Is this a
blanket exception clause? I suspect this means that I am not
required to implement a "MUST" requirement if I don't consider it
applicable to my specification. I recommend narrowing the scope
of this significantly. For instance, you can define conformance
profiles (e.g., for format specifications, for protocols, for
graphical user interfaces, etc.) and (ideally) have no
applicability exemptions within each profile.

 | Requirements vary for content, software and
 | specifications. All new or revised W3C specification MUST conform
 | to the requirements applicable to specifications.

IJ: See comments above: I don't think this spec should/can
include such a statement.

 | 3 Characters
 | 3.1 Perceptions of Characters
 | 3.1.1 Introduction
 | 3.1.4 Units of Input
 | 
 | In keyboard input, it is not the case that keystrokes and input
 | characters correspond one-to-one. 

IJ: I propose "it is not the case in general"

 | 3.1.5 Units of Collation

 [snip]

 | Although
 | Spanish sorting, and to some extent Spanish everyday use, treat
 | "ch" as a character, digital encodings treat it as two
 | characters, and keyboards do the same.

IJ: Does "digital encodings treat it as two characters" mean that
these encodings happen to do this today? The sentence as written
suggests to me that "ch" should be treated as two characters
(even though people who use Spanish treat it as one character).
I expected this sentence to say whether encodings "should" or
"should not" do this.

 | 3.2 Digital Representation of Characters
 | 
 | To be of any use in computers, in computer communications and in
 | particular on the World Wide Web, characters must be encoded. 

IJ: I think that this section should have the word "encoding"
in the title. Also, the term "character encoding" is used later
in the document, and it's not clear what that term applies to
in this section. Does "character encoding" comprise bullets 1-4,
or just 3 and 4? 

[snip]

 | 1. A set of characters to
 | be encoded is identified. The units of encoding, the
 | characters,

I find "units of encoding" hard to understand, especially since
it means "characters" here. Can "units of encoding" be dropped?

The term "unit" is used through the document, and maybe it
can be kept if there's a clear statement early on about how
it's used. 

 | 3.6 Choice and Identification of Character Encodings

[snip]
 
 | Receiving software MUST determine the encoding from available
 | information. 

IJ: I propose instead: "Receiving software MUST determine the
encoding according to specification (including default
encodings)." This would include:

 * Specification-defined defaults
 * Information in headers
 * Information in markup.
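
As a minimal sketch of that lookup (not taken from the draft), the
Python below consults the three sources in one possible order. The
function name `determine_encoding`, its parameters, the
header-before-markup precedence, and the `utf-8` default are all
assumptions made for illustration; a real format specification would
fix its own order and default. `codecs.lookup` resolves names and
aliases from Python's own alias table, which only approximates the
IANA registry.

```python
import codecs
from typing import Optional


def determine_encoding(header_charset: Optional[str] = None,
                       markup_charset: Optional[str] = None,
                       spec_default: str = "utf-8") -> str:
    """Return the canonical codec name for the first declared charset.

    Assumed precedence (hypothetical): header information, then
    information in markup, then the specification-defined default.
    """
    declared = header_charset or markup_charset or spec_default
    # codecs.lookup raises LookupError for an unrecognized name or alias.
    return codecs.lookup(declared).name


print(determine_encoding(header_charset="ISO-8859-1"))
print(determine_encoding(markup_charset="UTF8"))
print(determine_encoding())
```

Raising on an unrecognized name, rather than silently falling back
to the default, is a design choice of this sketch only.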

 | It MAY recognize as many encodings (names and
 | aliases) as appropriate. 

IJ: What does "as appropriate" mean?

 | A field-upgradeable mechanism may be
 | appropriate for this purpose. When a IANA-registered charset name
 | is recognized, receiving software MUST interpret the received
 | data according to the encoding associated with the name in the
 | IANA registry. When no charset is provided the receiving software
 | MUST adhere to the default encoding(s) specified in the
 | specification.

IJ: Last sentence can therefore be deleted.

 | 3.6.2 Private Use Code Points

[snip]

 | Where specifications need to allow the transmission of symbols
 | not in Unicode or need to identify specific variants of Unicode
 | characters, they MAY define markup for this purpose.

IJ: A couple of questions:
 - Can you give some examples of mechanisms other than markup?
 - Why is this a "MAY" rather than a "SHOULD"? Which techniques
   are recommended over this one?

 | 4.2 Definitions for W3C Text Normalization
 | 
 | 4.2.1 Unicode-normalized Text
 | 
 | Text data is, for the purposes of this specification,
 | Unicode-normalized if it is in a Unicode encoding form 

IJ: Is a "Unicode encoding form" a "Unicode character encoding
form (CEF)"? It would be good to tie this back to the bulleted
list of section 3.2.

 | and is in
 | Unicode Normalization Form C (according to revision 18 of
 | [115][UTR #15]).
 | 
 |  [115] http://www.w3.org/TR/charmod/#UTR15
 | 
 | 4.2.2 W3C-normalized Text

IJ: Questions:

- What is the definition of "legacy text"? "Legacy encoding"?

- The definitions rely on the existence of normalizing
  transcoders. Does this mean actual software must exist? I would
  prefer that the definition relied on the possibility of writing
  a transcoder rather than the actual existence of one. On the 
  other hand, it's not very satisfying to write "and if one could
  write a normalizing transcoder". But that's what it amounts to,
  I think.

 | 4.2.3 Examples

IJ: I found these examples hard to understand because I don't
know anything about Unicode Normalization Form C, and it's not
clear to me, for example, whether U+0327 is a character escape. I
think taking more time to explain the examples would be useful.

 | The string "suçon", expressed as the sequence of five characters
 | U+0073 U+0075 U+00E7 U+006F U+006E and encoded in a Unicode
 | encoding form, is both Unicode-normalized and W3C-normalized. The
 | same string encoded in a legacy encoding for which there exists a
 | normalizing-transcoder would be W3C-normalized but not
 | Unicode-normalized.
 | 
 | The string "suçon", expressed as the sequence of six characters
 | U+0073 U+0075 U+0063 U+0327 U+006F U+006E (U+0327 is the
 | COMBINING CEDILLA) and encoded in a Unicode encoding form, is
 | neither W3C-normalized nor Unicode-normalized.
 | 
 | In an XML or HTML context, the string "suc&#x0327;on" is not
 | W3C-normalized, whatever the encoding form, because expanding
 | "&#x0327;" yields the sequence "suçon" which is not
 | Unicode-normalized. Note that, since Unicode-normalization
 | doesn't take escapes into account, the string "suc&#x0327;on" is
 | Unicode-normalized if encoded in a Unicode encoding form.

I do not understand the last sentence. I suspect that more 
preceding explanation will help.
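
A small sketch may make the three quoted examples concrete. It
assumes Python's `unicodedata` module as a stand-in for Normalization
Form C (it follows a later Unicode version than the revision the
draft cites, which should not matter for these characters) and
`html.unescape` as a stand-in for expanding XML/HTML numeric
character references. The helper name `is_nfc` is hypothetical.

```python
import html
import unicodedata


def is_nfc(text: str) -> bool:
    """True if the text is unchanged by normalization to Form C."""
    return unicodedata.normalize("NFC", text) == text


precomposed = "su\u00e7on"     # five characters, U+00E7 is c with cedilla
decomposed = "suc\u0327on"     # six characters, U+0327 is COMBINING CEDILLA
escaped = "suc&#x0327;on"      # ASCII only: the escape is still unexpanded

print(is_nfc(precomposed))              # True
print(is_nfc(decomposed))               # False
print(is_nfc(escaped))                  # True
print(is_nfc(html.unescape(escaped)))   # False
```

Under these assumptions the last two lines show the distinction the
quoted sentence draws: the escaped string contains only ASCII
characters, so Form C has nothing to change, yet expanding the escape
produces the six-character sequence, which is not in Form C.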

 | 5 Compatibility and Formatting Characters

[snip]

 | Specifications SHOULD exclude compatibility characters in the
 | syntactic elements (markup, delimiters, identifiers) of the
 | formats they define (e.g. exclusion of compatibility characters
 | for GIs in XML).

IJ: What are "GIs"?

 | 6 String Identity Matching

[snip]

 | To avoid unnecessary conversions and, more importantly, to ensure
 | predictability, all components of the Web must use the same
 | identity testing mechanism. To meet this requirement and support
 | the above definition of identity, this specification mandates the
 | following steps for string identity matching: 

 | 1. Early uniform
 | normalization to W3C-normalized form, as defined in [120]4.2.2
 | W3C-normalized Text
 | 2. Conversion to a common encoding of UCS, if necessary 

Is this physical string encoding or byte string encoding?

 | 3. Expansion of all escapes 
 | 4. Binary comparison
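
The receiving side of these steps (2 to 4) can be sketched as
follows. This is an illustration under stated assumptions, not the
draft's algorithm: the function name `strings_match` is hypothetical,
Python `str` stands in for the "common encoding of UCS",
`html.unescape` stands in for whatever escape syntax a given format
defines, and `==` on `str` stands in for binary comparison. Step 1 is
deliberately absent because the draft assigns it to the producers.

```python
import html


def strings_match(a: bytes, a_encoding: str,
                  b: bytes, b_encoding: str) -> bool:
    """Compare two received strings without normalizing them."""
    # Step 2: convert both strings to a common representation.
    text_a = a.decode(a_encoding)
    text_b = b.decode(b_encoding)
    # Step 3: expand escapes (XML/HTML character references assumed).
    text_a = html.unescape(text_a)
    text_b = html.unescape(text_b)
    # Step 4: code point by code point comparison.
    return text_a == text_b


legacy = "su\u00e7on".encode("iso-8859-1")
escaped = b"su&#xE7;on"
unnormalized = "suc\u0327on".encode("utf-8")

print(strings_match(legacy, "iso-8859-1", escaped, "utf-8"))       # True
print(strings_match(legacy, "iso-8859-1", unnormalized, "utf-8"))  # False
```

The second call returns False because the producer of the last
string skipped step 1; the sketch does not repair that.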


 | In accordance with section [121]4 Early Uniform Normalization,
 | the first step MUST be performed by the producers of the strings
 | to be compared. This ensures 1) that the identity matching
 | process can produce correct results using the next three steps
 | and 2) that a minimum of effort is spent on solving the problem.
 | 
 |  [121] http://www.w3.org/TR/charmod/#sec-Normalization
 | 
 | 7 String Indexing

I think that the beginning of this section would benefit
from more examples.

[snip] 

 | It is noteworthy that there exist other, non-numeric ways of
 | identifying substrings which have favorable properties. For
 | instance, substrings based on string matching are quite robust
 | against small edits; substrings based on document structure (in
 | structured formats such as XML) are even more robust against
 | edits and even against translation of a document from one
 | language to another. 

IJ: one "markup" language or one human language?

 | Consequently, specifications that need a way
 | to identify substrings or point within a string SHOULD provide
 | ways other than string indexing to perform this operation. Users
 | of such specifications (software developers, content developers)
 | SHOULD prefer those other ways whenever possible.

IJ: Is the DOM Range spec an example of this? Should this be
    used as an example?

 | Experience shows that more general, flexible and robust
 | specifications result when individual characters are understood
 | and processed as substrings, identified by a position before and
 | a position after the substring. Understanding indices as boundary
 | positions between the counting units also makes it easier to
 | relate the indices resulting from the different string
 | definitions. Specifications SHOULD use this form of indexing,
 | regardless of the choice of counting units. APIs in addition
 | SHOULD NOT specify single character or single encoding-unit
 | arguments.

IJ: This last sentence is cryptic and some examples or further
explanation should be given.
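
For instance, a sketch along the following lines could illustrate
boundary positions. It assumes Python, whose slices already take a
position before and a position after the substring; the helper name
`to_utf16_boundary` is hypothetical.

```python
def to_utf16_boundary(text: str, boundary: int) -> int:
    """Map a boundary counted in code points to UTF-16 code units."""
    return len(text[:boundary].encode("utf-16-le")) // 2


text = "a\U00010000b"   # 'a', one supplementary-plane character, 'b'

# Boundaries in code points:         0 | a | 1 | U+10000 | 2 | b | 3
# Boundaries in UTF-16 code units:   0 | a | 1 | U+10000 | 3 | b | 4
start, end = 1, 2       # positions before and after the middle character
middle = text[start:end]

utf16_span = (to_utf16_boundary(text, start), to_utf16_boundary(text, end))
print(utf16_span)       # (1, 3)
```

One possible reading of the last quoted sentence is that an API
should accept a substring or a (start, end) pair rather than a single
character or code unit, so that a character occupying two UTF-16 code
units is handled like any other substring.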

 | 8 Character Encoding in URI References

IJ: I think the discussion might be more effective if it were
reordered and split along character set/character encoding
lines.

Character set:

 a) The character set of URIs, defined in RFC2396, is a subset of
    US-ASCII.
 b) Specs that use URI references must not impose this
    restriction and must allow URIs to include any Unicode
    character. This includes some characters disallowed by
    RFC2396 (e.g., number sign, etc.). These are called "IRIs".

Character encoding:

 a) RFC2396 does allow encoding of arbitrary byte sequences. 
    Here's why this is limited...

 b) Since some applications will not be able to deal with
    the characters outside of RFC2396, here's how you must
    convert to the encoding specified in RFC2396...

 | According to the definition in [130][RFC 2396], URI references
 | are restricted to a subset of US-ASCII. This RFC also specifies
 | an escaping mechanism to encode arbitrary byte values, using the
 | %HH convention. However, because the RFC does not define the
 | mapping from characters to bytes, the %HH convention by itself is
 | of limited use.  

IJ: Can you provide more explanation of why the convention is of
limited use, perhaps referring to the bulleted items of section
3.2?
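
A short sketch of the kind of explanation meant here, assuming
Python's `urllib.parse` and two arbitrarily chosen encodings: the
same character yields different %HH sequences depending on the
character-to-byte mapping, and a receiver that guesses a different
mapping recovers different characters.

```python
from urllib.parse import quote, unquote

e_acute = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

as_latin1 = quote(e_acute, encoding="iso-8859-1")
as_utf8 = quote(e_acute, encoding="utf-8")
print(as_latin1)     # %E9
print(as_utf8)       # %C3%A9

# A receiver assuming ISO-8859-1 for the UTF-8 form gets two characters.
misread = unquote(as_utf8, encoding="iso-8859-1")
print(len(misread))  # 2
```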

[snip]

 | 9 Referencing the Unicode Standard and ISO/IEC 10646

[snip]

 | The fact that both ISO/IEC 10646 and Unicode are evolving (in
 | synchronism) 

I would prefer "in synchrony".


-- 
Ian Jacobs (jacobs@w3.org)   http://www.w3.org/People/Jacobs
Tel:                         +1 831 457-2842
Cell:                        +1 917 450-8783
Received on Monday, 29 January 2001 15:31:53 UTC