RE: Last Call review of Character Model for the WWW from Martin Duerst on 2001-02-20 (www-i18n-comments@w3.org from February 2001)

From: Martin Duerst <duerst@w3.org>
Date: Tue, 20 Feb 2001 19:23:55 +0900
To: Karlsson Kent - keka <keka@im.se>, "'www-i18n-comments@w3.org'" <www-i18n-comments@w3.org>
Cc: misha.wolf@reuters.com, "'Asmus Freytag'" <asmusf@ix.netcom.com>, "'Kenneth Whistler'"<kenw@sybase.com>, "'Mark Davis'" <mark.davis@us.ibm.com>
Message-Id: <4.2.0.58.J.20010220183331.02fa9850@sh.w3.mag.keio.ac.jp>

Hello Kent,

Many thanks for your comments.

Here are some intermediate answers, from me personally, or questions for
clarification.

At 20:19 01/01/30 +0100, Karlsson Kent - keka wrote:

>=============================
>
>* clause 1.3, code position notation; maybe sufficient here, but not precise.

Do you mean that we should say that leading zeroes in positions
five and six (from the right) are not allowed? As you note,
it may indeed be sufficient for our purposes, because the
reader only has to understand the notation when it turns up,
not to produce it himself/herself.
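
To make the question concrete, here is a minimal Python sketch of a stricter
reading of the notation. It assumes the convention "U+" followed by four to
six uppercase hexadecimal digits, with no leading zeroes beyond the fourth
digit and an upper limit of 10FFFF; the function names are hypothetical and
this is not the wording of the spec.

```python
import re

# Assumption: four digits minimum; a fifth or sixth digit may appear only
# if it is significant (no leading zeroes), and the maximum is 10FFFF.
_STRICT_NOTATION = re.compile(
    r"U\+(?:[0-9A-F]{4}|[1-9A-F][0-9A-F]{4}|10[0-9A-F]{4})"
)


def format_code_point(cp: int) -> str:
    """Format an integer code point in the U+ notation (at least 4 digits)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    return f"U+{cp:04X}"


def is_strict_notation(text: str) -> bool:
    """True if text follows the stricter reading sketched above."""
    return _STRICT_NOTATION.fullmatch(text) is not None
```

Under this reading "U+00E5" and "U+1D11E" are accepted, while "U+0000E5"
is rejected because of the zeroes in positions five and six.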


>"Conformance" -> "Conformity" (English)

The prevalent term in W3C is Conformance, sorry.


>The phrase "MUST NOT" reflects in itself a lack of 
>internationalisation.  In English, "must not" means the same as "shall 
>not", so use the phrase "shall not".  In other languages the word for 
>"must" followed by the word for "not" (like in Swedish "måste inte") means 
>the same as "does not have to", which is quite different from "shall 
>not".  However, the word for "shall" followed by the word for "not" does 
>not have such an issue, but retains the English meaning.  Similarly, "MAY 
>NOT" also has the same kind of problem.  This is the reason why ISO/IEC 
>JTC 1 procedures do not allow the phrase "must not" (nor "must"), but 
>instead uses the phrase "shall not" (and for similarity, uses "shall" for 
>the positive requirements).  The phrase "REQUIRED" seems superfluous, use 
>"SHALL" (with a reformulation to form a proper sentence).

Sorry, but we use IETF RFC 2119 terminology, and clearly say so.
This is in line with most other W3C specs. As for "REQUIRED", it's
often convenient to word something this way.


>The terminology (SHALL, ..., OPTIONAL, ...) should come before the 
>conformity clause (among other definitions, that are generally missing; 
>see also below).

You mean it should go into the "Notation" section?


>--typo: "All...specification" --> "All...specifications" (plural)

Thanks. Will fix.


>* clause 3.1.6.
>Except for compressions; when are multiple 'characters' stored in a single 
>'physical unit of storage' (in a context where 'physical unit of storage' 
>are such things as a byte or a wyde)?

What about an "fi" ligature stored in a single position?
Not that I think this is a good idea, but this is a very general section.
But maybe that's an example that is a bit far-reaching.
On the other hand, if we leave out many-to-one, readers
will ask why.


>* clause 3.1.7.
>"MUST (sic) specify which"; but then there should be an explicit list in 
>the "char-model" document, should there not?  Otherwise you have an 
>open-ended requirement.

Well, yes, the requirement is that another WG actually think this
through, rather than that they just pick something from a list.
I don't think it's a problem to satisfy this requirement, do you?


>* clause 3.2
>There is no definition of terms in the document.  Terms such as "byte" and 
>"wyde" are left for the reader to guess, likewise for "octet", though that 
>is more precise.  Note that some well-known standards (such as that for C) 
>do NOT limit a "byte" to be an "octet".

Does anything in the spec not work out because the reader doesn't
know what a byte is? I don't know, but if that's not the case,
then we don't have to be more precise, or do we?


>"code point"...; "code position" seems to be the 10646 term, though not 
>formally defined.

We checked that; you are right. I think we decided to add
"code position" in parentheses to give the link to 10646 terminology.


>"Transfer Encoding Syntax" is missing here (and see below).
>
>* clause 3.6.1.
>"charset" is mentioned a number of times.  It should say that XML uses a 
>pseudo-attribute called "encoding" rather than "charset".

I think this is a good point, although we tried to word things so that
they apply to more than just XML.


>It should be mentioned that due to a decision to have only a few 
>"Transfer-Encoding" values, some encodings that are really Transfer 
>Encoding Syntaxes got registered as "charsets".  For instance UTF-7 
>(despite the name it's not a UTF) and HZ-GB-2312 are really TESes, not 
>CESes.  UTF-7 is already deprecated, and was only intended for e-mail in 
>the same way as Quoted-Printable was only intended for (7-bit) e-mail.  No 
>TES should be used other than for backwards compatibility in e-mail 
>support (i.e. SHOULD not be used else-where; or even SHALL NOT be used 
>elsewhere...).

I agree with you content-wise, but I'm not sure we need to go into that much
detail. Mail isn't really our business. And the details of TES vs. CES don't
really affect us. Also, there are many other 'charset's that we probably
would like to recommend against, for one reason or another, but this
would become a very long story.


>There are also some 2022 "charsets" registered.  But due to the lack of 
>widespread support for 2022, it should be avoided except for backwards 
>compatibility in e-mail support, and should not be used elsewhere.

iso-2022-jp is quite widely used, even on the Web. And it's mentioned in the
XML Rec. For many others, I agree, but again: do we need to bless them
by mentioning them?


>Further, there are some older registered encodings related to 
>10646/Unicode apart from UTF-7: --UCS-2, --UTF-1, --UCS-4; UNICODE-1-1 as 
>well as a few subsets.  These should be recommended against.

Are they in widespread use? Do you think that without recommending against
them, our recommendations for UTF-8 and UTF-16 are not clear enough?


>* clause 3.7.
>For XML (and thus XHTML) one should recommend to use the hexadecimal 
>rather than decimal "character escape".

Good point. The main reason up to now for not doing that was Netscape 4.x,
but I guess it might be time to forget about that.
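
For illustration, a minimal Python sketch of the two forms of numeric
character reference for the same character. The helper names are
hypothetical; html.unescape from the standard library is used only to show
that both forms denote the same character.

```python
import html


def hex_ncr(ch: str) -> str:
    """Hexadecimal numeric character reference, e.g. &#xE5;."""
    return f"&#x{ord(ch):X};"


def dec_ncr(ch: str) -> str:
    """Decimal numeric character reference, e.g. &#229;."""
    return f"&#{ord(ch)};"


a_ring = "\u00E5"  # LATIN SMALL LETTER A WITH RING ABOVE

# Both references expand to the same character; only the hexadecimal form
# shows the digits of the code position (U+00E5) directly.
assert html.unescape(hex_ncr(a_ring)) == a_ring
assert html.unescape(dec_ncr(a_ring)) == a_ring
```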


>XHTML has inherited a number of named (rather than numbered) "character 
>escapes"; are these counted as character escapes too, or are they 
>not?  (See also below.)

We have received similar comments from other people. We will have
to clarify this.


>XML 1.0 does not allow any "character escapes" in identifiers (they are 
>allowed in comments, but I'm not sure if a "source viewer" is supposed to 
>interpret them there).  Maybe a note about that...

Yes, probably a good idea, although we don't want to become an XML tutorial.


>Some editing tools are too eager to automatically use character escapes 
>(or named character entities; like &aring;) even though the target 
>encoding perfectly well can represent the character directly without any 
>problems.  There should be a recommendation not to do so, but to insert 
>the characters directly as typed on the keyboard (or pasted in as plain 
>text), when representable and when they would not cause parsing problems 
>(like e.g., '<' would in XML).

This is also a good idea. I think quite a few members of our WG will agree.
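
A minimal Python sketch of the behaviour suggested here, assuming a
hypothetical serializer for XML text content: a character is written
directly whenever the target encoding can represent it, and an escape is
used only for markup-significant characters or for characters the encoding
cannot represent.

```python
def serialize_text(text: str, encoding: str) -> str:
    """Escape only what must be escaped for the given target encoding."""
    markup = {"<": "&lt;", "&": "&amp;"}
    out = []
    for ch in text:
        if ch in markup:
            # Would cause parsing problems if written directly.
            out.append(markup[ch])
            continue
        try:
            ch.encode(encoding)
            out.append(ch)  # representable: keep the character as typed
        except UnicodeEncodeError:
            out.append(f"&#x{ord(ch):X};")  # not representable: escape
    return "".join(out)


# With ISO-8859-1 as target, the a-ring stays as typed; only the Greek pi
# and the '<' are escaped.
assert serialize_text("\u00E5 < \u03C0", "iso-8859-1") == "\u00E5 &lt; &#x3C0;"
```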


>* clause 4.1
>"UTR #15" --> "UAX 15"; it's in UAX status, and the # is just ugly.
>
>* clause 4.2.2.
>For clarity, the parenthetical definition should be removed, along with 
>its application.
>
>[this clause is a mess, as are the references to it]
>
>"does not contain any character escapes whose unescaping..."?  This 
>appears to be targeting such things as numeric escapes (like &#... in 
>XML).  It's not clear if standardised named character entities are to be 
>considered, or even worse, non-standard externally parsed entities that 
>someone might have defined in another file (or whatever). If not, 
>expanding them may result in non-NFC, which is guarded against when it 
>comes to numeric character escapes.  If they are, then defined entities 
>(in the XML-meaning) must be examined too. Are they then to be expanded 
>during W3C-normalisation?
>
>E.g. is &Auml;&#x304; W3C-normalised (for XHTML) or not?  Note that 
>expanding both the named and numeric character reference, and then 
>creating an NFC version generates the single character called LATIN 
>CAPITAL LETTER A WITH DIAERESIS AND MACRON. The situation gets even worse 
>with non-W3C-standard entities that may be defined, in the same "file" or 
>in another "file", which may contain any text, including markup, and even 
>if the definition itself may be "W3C-normalised", at the point of use 
>there may still be a concatenation of strings whose result is not in any 
>normal form.  Does a W3C-normaliser for XML need to consider externally 
>parsed entities?
>
>4.2.2. defines "W3C-normalised" w.r.t. the "character escape" syntax used, 
>but is not clear about what that is.  A further problem is that  4.2.2 
>does NOT actually define what "W3C-normalisATION", the algorithm, is 
>supposed to do.  Is input to be rejected if not already normalised? 
>Probably not. Are some numeric character escapes to be expanded, combined 
>with creation of a result in NFC?  Maybe.

Well, yes, indeed, we just judge the result. How you get there is your
own business. And as long as you follow the rules, you can use as
many escapes as you like and still be normalized.
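
As a rough illustration of "judging the result", here is a Python sketch
that checks a piece of text both as written and after its escapes have been
expanded. It is only a sketch under assumptions: the function names are
hypothetical, and html.unescape expands named HTML entities as well as
numeric references, which is exactly the point still to be clarified.

```python
import html
import unicodedata


def is_nfc(text: str) -> bool:
    """True if text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


def stays_nfc_after_unescaping(text: str) -> bool:
    """True if text is in NFC both as written and with escapes expanded."""
    return is_nfc(text) and is_nfc(html.unescape(text))


# The example from the comment: expansion gives U+00C4 U+0304, which NFC
# composes into U+01DE, so the escaped form does not pass the check.
expanded = html.unescape("&Auml;&#x304;")
assert expanded == "\u00C4\u0304"
assert unicodedata.normalize("NFC", expanded) == "\u01DE"
assert not stays_nfc_after_unescaping("&Auml;&#x304;")

# An escape for the precomposed character itself passes.
assert stays_nfc_after_unescaping("&#x1DE;")
```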

>But what about XML's entities; are they to be examined? And if the data is 
>then found not to be W3C-normalised, what then?  Expand the entity?  That may 
>contain markup, and be arbitrarily large; and the entity may have multiple 
>occurrences.  It may also destroy the document design ("I DID want that 
>entity with a combining character first!").

I think your proposal in another mail to prohibit (in essence) combining
characters at the beginning of an entity makes a lot of sense. Anyway, the
question of the interaction between entities and Early Normalization has
been raised by other people, too, and we will work on addressing it.


>It's not clear to me why W3C-normalisation at all has to be 
>defined.  Expanding character escapes (and other entities), if done while 
>editing, should be accompanied by (local) establishment of NFC.  But that 
>is no different from, say, pasting in some text (that may contain or even 
>begin with combining characters) during editing.  Likewise for string 
>identity matching, after expanding entities (and numeric character 
>escapes), a local normalisation step may be needed.

The very idea behind Early Uniform Normalization is to avoid having to
normalize on every single matching operation.


>For such things as signature creation, entities and numeric character 
>escapes would not be expanded, creating different signatures for the 
>unexpanded and expanded versions.

Not exactly. XML Signature certainly gets rid of numeric character references,
and also of entities, as far as I remember.


>Maybe there is some assumed distinction between numeric character escapes 
>and other entities (still in the XML-ish sense) that I've missed. Like 
>that numeric character escapes are to be interpreted while string identity 
>matching and signature creation, while the otherwise rather similar 
>(character or larger) named entities are not to be so expanded for those 
>operations. If so, please explain.  Also please explain why such a 
>difference in treatment would not be a problem.  Writing &Auml; instead of 
>&#xC4; is not all that different.
>
>There is nothing about versioning.  A new text that is in NFC for a 
>version of Unicode that does not contain (unallocated) a "new" precomposed 
>character that is later allocated and used in the "new" text, will not be 
>in NFC for the "new" version of Unicode, which will decompose it.

The idea is that such new precomposed characters are decomposed at
the origin, and never appear on the Web.


>The note about legacy (plain) text always being normalised might not be 
>true for all (any?) legacy encodings for Vietnamese (and now maybe not for 
>Hebrew either...).  See in particular MS CP 1258.

This depends on the definition, i.e. what does it mean for a legacy encoding
to be normalized. We chose the definition that best fits with the rest of
our explanations, which may not be the same as some other definitions.
Some clarification may be needed.


>Side remark: turning marked-up W3C-normalised text into plain text may 
>produce non-NFC results in another way too; e.g. 
><ex>A<emphasise>&#x308;</emphasise></ex> (say that 'emphasise' uses red 
>colour when displayed/printed). Just expanding the character escape while 
>removing the markup tags results in a decomposed Ä as the plain text version.

Good point. But removing the markup is an editing operation, I guess.
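
The effect described in the side remark can be reproduced with a small
Python sketch. The tag stripping below is a deliberately naive regular
expression, assumed only for this illustration; it is not an XML parser,
and the function name is hypothetical.

```python
import html
import re
import unicodedata


def to_plain_text(markup: str) -> str:
    """Naively drop tags, then expand character escapes."""
    without_tags = re.sub(r"<[^>]+>", "", markup)
    return html.unescape(without_tags)


source = "<ex>A<emphasise>&#x308;</emphasise></ex>"
plain = to_plain_text(source)

# The plain text is 'A' followed by COMBINING DIAERESIS, which is not NFC.
assert plain == "A\u0308"
assert unicodedata.normalize("NFC", plain) != plain

# A normalization step after the edit restores the precomposed character.
assert unicodedata.normalize("NFC", plain) == "\u00C4"
```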


>* clause 5
>Expand "GI" to "generic identifier" (or avoid that term, which is not even 
>properly defined in the XML spec.).
>
>* clause 8 (on URIs)
>[this is a general and ugly mess]

Can you be more specific? E.g. what do you not understand,
what do you think should be added/removed/changed,...


>"The conversion MUST (sic) take place as late as possible." 
>Good.  Similarly, the conversion back to a form that does not use the 
>%-encoding should be done as early as possible (in case a URI protocol 
>element is passed back as parameter, it should not then still be 
>%-encoded). Nor should pre-%-encoded URIs occur in stored or generated 
>documents. This should keep %xx's out of any UI.

This is not always possible, because of legacy URIs that may not be
back-convertible, or that may look back-convertible, but actually
were created completely differently.
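
A minimal Python sketch of the problem, using urllib.parse from the
standard library. The helper names and the sample paths are hypothetical;
the sketch assumes that new URIs are %-encoded from UTF-8, while a legacy
URI may have been %-encoded from some other encoding such as ISO-8859-1.

```python
from urllib.parse import quote, unquote


def to_uri_path(path: str) -> str:
    """%-encode a path, converting non-ASCII characters via UTF-8."""
    return quote(path, safe="/")


def try_back_convert(escaped: str):
    """Undo the %-encoding, or return None if the bytes are not UTF-8."""
    try:
        return unquote(escaped, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


# A URI escaped from UTF-8 converts back cleanly.
assert to_uri_path("/r\u00E9sum\u00E9") == "/r%C3%A9sum%C3%A9"
assert try_back_convert("/r%C3%A9sum%C3%A9") == "/r\u00E9sum\u00E9"

# A legacy URI escaped from ISO-8859-1 is not back-convertible as UTF-8.
assert try_back_convert("/r%E9sum%E9") is None

# Bytes that decode as UTF-8 may still have been created differently:
# read as ISO-8859-1, %C3%A9 is two characters, not one.
assert unquote("%C3%A9", encoding="iso-8859-1") == "\u00C3\u00A9"
```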


>Note that the %-encoding is very similar to the TES Quoted-Printable.
>
>--typo: "conversion a legal" --> "conversion to a legal".

We'll fix that.


>* Example A.3
>This example appears oversimplified; no keyboard state, nor intermediary 
>displays (with quite different characters) are shown.  That is hard to 
>show in a simple table, but there should be some explanatory note about that.

It's mentioned that the user types nine keystrokes, but that can definitely
be expanded. Otherwise, some readers may think that the user types blindly.


Regards,  Martin.
Received on Tuesday, 20 February 2001 05:27:28 UTC