RE: Last Call review of Character Model for the WWW from Karlsson Kent - keka on 2001-01-30 (www-i18n-comments@w3.org from January 2001)

From: Karlsson Kent - keka <keka@im.se>
Date: Tue, 30 Jan 2001 20:19:13 +0100
To: "'www-i18n-comments@w3.org'" <www-i18n-comments@w3.org>
Cc: "'Martin Duerst'" <duerst@w3.org>, misha.wolf@reuters.com, "'Asmus Freytag'" <asmusf@ix.netcom.com>, "'Kenneth Whistler'" <kenw@sybase.com>, "'Mark Davis'" <mark.davis@us.ibm.com>
Message-ID: <C110A2268F8DD111AA1A00805F85E58D0115A92D@ntgbg1>

I'm not at all sure the document is ready for "last call".
See below on clause 4.2.2.

=============================

* clause 1.3, code position notation; maybe sufficient here, but not
precise.

"Conformance" -> "Conformity" (English)

The phrase "MUST NOT" reflects in itself a lack of internationalisation.  In
English, "must not" means the same as "shall not", so use the phrase "shall
not".  In other languages the word for "must" followed by the word for "not"
(like in Swedish "måste inte") means the same as "does not have to", which
is quite different from "shall not".  However, the word for "shall" followed
by the word for "not" does not have such an issue, but retains the English
meaning.  Similarly, "MAY NOT" also has the same kind of problem.  This is
the reason why ISO/IEC JTC 1 procedures do not allow the phrase "must not"
(nor "must"), but instead use the phrase "shall not" (and, for consistency,
use "shall" for the positive requirements).  The phrase "REQUIRED" seems
superfluous; use "SHALL" (with a reformulation to form a proper sentence).

The terminology (SHALL, ..., OPTIONAL, ...) should come before the
conformity clause (among other definitions, that are generally missing; see
also below).

--typo: "All...specification" --> "All...specifications" (plural)


* clause 3.1.6.
Except for compression, when are multiple 'characters' stored in a single
'physical unit of storage' (in a context where 'physical units of storage'
are such things as a byte or a wyde)?


* clause 3.1.7.
"MUST (sic) specify which"; but then there should be an explicit list in the
"char-model" document, should there not?  Otherwise you have an open-ended
requirement.


* clause 3.2
There is no definition of terms in the document.  Terms such as "byte" and
"wyde" are left for the reader to guess, likewise for "octet", though that
is more precise.  Note that some well-known standards (such as that for C)
do NOT limit a "byte" to be an "octet".

"code point"...; "code position" seems to be the 10646 term, though not
formally defined.

"Transfer Encoding Syntax" is missing here (and see below).


* clause 3.6.1.
"charset" is mentioned a number of times.  It should say that XML uses a
pseudo-attribute called "encoding" rather than "charset".

It should be mentioned that due to a decision to have only a few
"Transfer-Encoding" values, some encodings that are really Transfer Encoding
Syntaxes got registered as "charsets".  For instance UTF-7 (despite the name
it's not a UTF) and HZ-GB-2312 are really TESes, not CESes.  UTF-7 is
already deprecated, and was only intended for e-mail in the same way as
Quoted-Printable was only intended for (7-bit) e-mail.  No TES should be
used other than for backwards compatibility in e-mail support (i.e. SHOULD
NOT be used elsewhere; or even SHALL NOT be used elsewhere...).
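
A minimal Python sketch of the CES/TES distinction made above. The sample
word and the helper name is_7bit are illustrative assumptions, not taken
from the "char-model" document. It shows that UTF-7 and Quoted-Printable
both produce 7-bit-safe output, which is the job of a transfer encoding,
while UTF-8 does not.

```python
import quopri

text = "r\u00e4ksm\u00f6rg\u00e5s"  # Swedish word with three non-ASCII letters

def is_7bit(data: bytes) -> bool:
    """Return True if every byte fits in 7 bits (safe for old e-mail paths)."""
    return all(b < 0x80 for b in data)

# UTF-8 uses bytes above 0x7F for the non-ASCII letters.
utf8_bytes = text.encode("utf-8")

# UTF-7 and Quoted-Printable both rewrite the data into plain ASCII.
utf7_bytes = text.encode("utf-7")
qp_bytes = quopri.encodestring(text.encode("iso-8859-1"))

print("UTF-8:", utf8_bytes, "7-bit:", is_7bit(utf8_bytes))
print("UTF-7:", utf7_bytes, "7-bit:", is_7bit(utf7_bytes))
print("QP   :", qp_bytes, "7-bit:", is_7bit(qp_bytes))
```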

There are also some 2022 "charsets" registered.  But due to the lack of
widespread support for 2022, it should be avoided except for backwards
compatibility in e-mail support, and should not be used elsewhere.

Further, there are some older registered encodings related to 10646/Unicode
apart from UTF-7: --UCS-2, --UTF-1, --UCS-4; UNICODE-1-1 as well as a few
subsets.  These should be recommended against.


* clause 3.7.
For XML (and thus XHTML) one should recommend using the hexadecimal rather
than the decimal "character escape".
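
As a small illustration (a sketch only; the choice of character is an
assumption), the two numeric forms name the same character, but only the
hexadecimal form matches the usual U+00E5 way of writing a code position.
Python's html.unescape is used here as a stand-in for an XML parser, which
is fine for numeric references but is not an XML implementation.

```python
import html

ch = "\u00e5"  # LATIN SMALL LETTER A WITH RING ABOVE, U+00E5

hex_ref = f"&#x{ord(ch):X};"  # hexadecimal character reference
dec_ref = f"&#{ord(ch)};"     # decimal character reference

print(hex_ref, dec_ref)

# Both forms expand to the same character.
same = html.unescape(hex_ref) == html.unescape(dec_ref) == ch
print(same)
```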

XHTML has inherited a number of named (rather than numbered) "character
escapes"; are these counted as character escapes too, or are they not?  (See
also below.) 

XML 1.0 does not allow any "character escapes" in identifiers (they are
allowed in comments, but I'm not sure if a "source viewer" is supposed to
interpret them there).  Maybe a note about that...

Some editing tools are too eager to automatically use character escapes (or
named character entities; like &aring;) even though the target encoding
perfectly well can represent the character directly without any problems.
There should be a recommendation not to do so, but to insert the characters
directly as typed on the keyboard (or pasted in as plain text), when
representable and when they would not cause parsing problems (like e.g., '<'
would in XML).
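
The recommendation above could be sketched as follows. The function name
escape_minimally and the sample strings are hypothetical; this is one
possible reading of "escape only when needed", not a rule from the
"char-model" document.

```python
def escape_minimally(text: str, encoding: str) -> str:
    """Escape only markup-significant or unrepresentable characters."""
    markup = {"<": "&lt;", "&": "&amp;"}
    out = []
    for ch in text:
        if ch in markup:
            # Would cause parsing problems in XML, so always escape.
            out.append(markup[ch])
            continue
        try:
            ch.encode(encoding)
            out.append(ch)  # representable: keep the character as typed
        except UnicodeEncodeError:
            out.append(f"&#x{ord(ch):X};")  # fall back to a hex escape
    return "".join(out)

sample = "\u00e5 < \u20ac"  # a-ring, less-than sign, euro sign
print(escape_minimally(sample, "iso-8859-1"))
print(escape_minimally(sample, "utf-8"))
```

With ISO 8859-1 as the target only the euro sign needs a numeric escape;
with UTF-8 nothing but the '<' is escaped.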


* clause 4.1
"UTR #15" --> "UAX 15"; it's in UAX status, and the # is just ugly.


* clause 4.2.2.
For clarity, the parenthetical definition should be removed, along with its
application.

[this clause is a mess, as are the references to it]

"does not contain any character escapes whose unescaping..."?  This appears
to be targeting such things as numeric escapes (like &#... in XML).  It's
not clear if standardised named character entities are to be considered, or
even worse, non-standard externally parsed entities that someone might have
defined in another file (or whatever). If not, expanding them may result in
non-NFC, which is guarded against when it comes to numeric character
escapes.  If they are, then defined entities (in the XML-meaning) must be
examined too. Are they then to be expanded during W3C-normalisation?

E.g. is &Auml;&#x304; W3C-normalised (for XHTML) or not?  Note that
expanding both the named and numeric character reference, and then creating
an NFC version generates the single character called LATIN CAPITAL LETTER A
WITH DIAERESIS AND MACRON. The situation gets even worse with
non-W3C-standard entities that may be defined, in the same "file" or in
another "file", which may contain any text, including markup, and even if
the definition itself may be "W3C-normalised", at the point of use there may
still be a concatenation of strings whose result is not in any normal form.
Does a W3C-normaliser for XML need to consider externally parsed entities?
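
The &Auml;&#x304; case can be checked with a few lines of Python. This is
a sketch under the assumption that html.unescape (which knows the HTML
named entities) is an acceptable stand-in for XHTML entity expansion.

```python
import html
import unicodedata

source = "&Auml;&#x304;"

# Expand both the named and the numeric character reference.
expanded = html.unescape(source)

# Each expanded piece is fine on its own, but the concatenation is not NFC.
nfc = unicodedata.normalize("NFC", expanded)

print([f"U+{ord(c):04X}" for c in expanded])
print([f"U+{ord(c):04X}" for c in nfc])
print(unicodedata.name(nfc))
```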

4.2.2. defines "W3C-normalised" w.r.t. the "character escape" syntax used,
but is not clear about what that is.  A further problem is that 4.2.2 does
NOT actually define what "W3C-normalisATION", the algorithm, is supposed to
do.  Is input to be rejected if not already normalised? Probably not. Are
some numeric character escapes to be expanded, combined with creation
of a result in NFC?  Maybe.  But what about XML's entities; are they to be
examined? And if the data is then found not to be W3C-normalised, what then?
Expand the entity?  That may contain markup, and be arbitrarily large; and
the entity may have multiple occurrences.  It may also destroy the document
design ("I DID want that entity with a combining character first!").

It's not clear to me why W3C-normalisation has to be defined at all.
Expanding character escapes (and other entities), if done while editing,
should be accompanied by (local) establishment of NFC.  But that is no
different from, say, pasting in some text (that may contain or even begin
with combining characters) during editing.  Likewise for string identity
matching, after expanding entities (and numeric character escapes), a local
normalisation step may be needed.  For such things as signature creation,
entities and numeric character escapes would not be expanded, creating
different signatures for the unexpanded and expanded versions.  Maybe there
is some assumed distinction between numeric character escapes and other
entities (still in the XML-ish sense) that I've missed, for instance that numeric
character escapes are to be interpreted during string identity matching and
signature creation, while the otherwise rather similar (character or larger)
named entities are not to be so expanded for those operations. If so, please
explain.  Also please explain why such a difference in treatment would not
be a problem.  Writing &Auml; instead of &#xC4; is not all that different.

There is nothing about versioning.  A new text may be in NFC for a version
of Unicode in which some "new" precomposed character is still unallocated.
If that character is later allocated and used in the "new" text, the text
will not be in NFC for the "new" version of Unicode, which will decompose it.

The note about legacy (plain) text always being normalised might not be true
for all (any?) legacy encodings for Vietnamese (and now maybe not for Hebrew
either...).  See in particular MS CP 1258.
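
A quick check of the CP 1258 point, as a Python sketch. It assumes
Python's built-in "cp1258" codec as the model of the legacy encoding, and
the Vietnamese letter chosen is only an example.

```python
import unicodedata

# e with circumflex and acute, written as base letter + combining acute,
# which is how CP 1258 has to represent it.
legacy_bytes = "\u00ea\u0301".encode("cp1258")
decoded = legacy_bytes.decode("cp1258")
nfc = unicodedata.normalize("NFC", decoded)

# The precomposed letter U+1EBF has no code position in CP 1258.
try:
    "\u1ebf".encode("cp1258")
    precomposed_encodable = True
except UnicodeEncodeError:
    precomposed_encodable = False

print(len(legacy_bytes), "bytes;", "already NFC:", decoded == nfc)
print("precomposed form encodable:", precomposed_encodable)
```

So text that is straightforwardly transcoded from CP 1258 comes out with
combining characters and is not in NFC.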

Side remark: turning marked-up W3C-normalised text into plain text may
produce non-NFC results in another way too; e.g.
<ex>A<emphasise>&#x308;</emphasise></ex> (say that 'emphasise' uses red
colour when displayed/printed). Just expanding the character escape while
removing the markup tags results in a decomposed Ä as the plain text
version.
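
The side remark can be reproduced with a short Python sketch. The regular
expression used to strip tags is a deliberate simplification (an
assumption for this example, not a real XML parser).

```python
import html
import re
import unicodedata

marked_up = "<ex>A<emphasise>&#x308;</emphasise></ex>"

# Remove the markup tags, then expand the character escape.
plain = html.unescape(re.sub(r"<[^>]+>", "", marked_up))

print([f"U+{ord(c):04X}" for c in plain])
print("NFC:", plain == unicodedata.normalize("NFC", plain))
```

The plain text is "A" followed by U+0308, which NFC would replace with the
single character U+00C4.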


* clause 5
Expand "GI" to "generic identifier" (or avoid that term, which is not even
properly defined in the XML spec.).


* clause 8 (on URIs)
[this is a general and ugly mess]

"The conversion MUST (sic) take place as late as possible." Good.
Similarly, the conversion back to a form that does not use the %-encoding
should be done as early as possible (if a URI protocol element is
passed back as a parameter, it should not then still be %-encoded). Nor should
pre-%-encoded URIs occur in stored or generated documents. This should keep
%xx's out of any UI.
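
A sketch of "encode late, decode early" using Python's urllib.parse. The
sample path and the use of UTF-8 as the byte encoding before %-encoding
are assumptions made for this example.

```python
from urllib.parse import quote, unquote

# What the user sees and what documents should store.
path = "/r\u00e4ksm\u00f6rg\u00e5s"

# Convert only at the last moment, just before the URI goes on the wire.
wire = quote(path, safe="/")  # quote() encodes as UTF-8 by default

# Convert back as early as possible on the way in.
shown = unquote(wire)

print(wire)
print(shown == path)
```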

Note that the %-encoding is very similar to the TES Quoted-Printable.

--typo: "conversion a legal" --> "conversion to a legal".

* Example A.3
This example appears oversimplified; neither keyboard state nor intermediary
displays (with quite different characters) are shown.  That is hard to show
in a simple table, but there should be some explanatory note about that.

=========================================================
Received on Tuesday, 30 January 2001 14:22:42 UTC