RE: Last Call review of Character Model for the WWW from Karlsson Kent - keka on 2001-02-21 (www-i18n-comments@w3.org from February 2001)

From: Karlsson Kent - keka <keka@im.se>
Date: Wed, 21 Feb 2001 16:59:59 +0100
To: "'duerst@w3.org'" <duerst@w3.org>, "'www-i18n-comments@w3.org'" <www-i18n-comments@w3.org>
Cc: misha.wolf@reuters.com, "'Asmus Freytag'" <asmusf@ix.netcom.com>, "'Kenneth Whistler'" <kenw@sybase.com>, "'Mark Davis'" <mark.davis@us.ibm.com>
Message-ID: <C110A2268F8DD111AA1A00805F85E58D0115A9B1@ntgbg1>
Hi Martin!

> -----Original Message-----
> From: Martin Duerst [mailto:duerst@w3.org]
...
> At 20:19 01/01/30 +0100, Karlsson Kent - keka wrote:
> 
> >=============================
> >
> >* clause 1.3, code position notation; maybe sufficient here, but not
> >precise.
> 
> Do you mean that we should say that leading zeroes in position
> five and six (from the right) are not allowed? As you note,
> it may indeed be sufficient for our purposes, because the
> reader only has to understand the notation when it turns up,
> not to create his/her own things.

That was one of the imprecisions I was thinking about; and yes, that
precision is not needed here.

> >The phrase "MUST NOT" reflects in itself a lack of
> >internationalisation.  In English, "must not" means the same as "shall
> >not", so use the phrase "shall not".  In other languages the word for
> >"must" followed by the word for "not" (like in Swedish "måste inte")
> >means the same as "does not have to", which is quite different from
> >"shall not".  However, the word for "shall" followed by the word for
> >"not" does not have such an issue, but retains the English meaning.
> >Similarly, "MAY NOT" also has the same kind of problem.  This is the
> >reason why ISO/IEC JTC 1 procedures do not allow the phrase "must not"
> >(nor "must"), but instead use the phrase "shall not" (and for
> >similarity, use "shall" for the positive requirements).  The phrase
> >"REQUIRED" seems superfluous, use "SHALL" (with a reformulation to form
> >a proper sentence).
> 
> Sorry, but we use IETF RFC 2119 terminology, and clearly say so.
> This is in line with most other W3C specs. As for "REQUIRED", it's
> often convenient to word something this way.

RFC 2119 does not imply that "MUST" and "MUST NOT" are the terms to use.
It allows equally the terms "SHALL" and "SHALL NOT".  (ISO/IEC JTC1 on
the other hand rightly shuns "must" and "must not".)  So there is nothing
preventing you from referring to RFC 2119 and still *not* using "MUST" and
"MUST NOT", but instead "SHALL" and "SHALL NOT".  That way you can follow
both RFC 2119 and the JTC1 directives on that point. I think it would be a
mistake to follow only RFC 2119 in this regard.

> 
> >The terminology (SHALL, ..., OPTIONAL, ...) should come before the 
> >conformity clause (among other definitions, that are generally missing; 
> >see also below).
> 
> You mean it should go into the "Notation" section?

Well, "notation" seems to be the wrong heading.  A "Symbols [or Notation]
and definitions" clause would be better.

> >* clause 3.1.6.
> >Except for compressions; when are multiple 'characters' stored in a
> >single 'physical unit of storage' (in a context where 'physical units
> >of storage' are such things as a byte or a wyde)?
> 
> What about an "fi" ligature stored in a single position?
> Not that I think this is a good idea, but this is a very general section.
> But maybe that's an example that is a bit far-reaching.

This example touches on what the term "character" means.  But using
the term "character" in the 10646/Unicode sense, the fi ligature stores
two letters in a single character (which in some encodings fits in a single
'unit of physical storage'). Not two characters in a single character (which
in some encodings fits in a single 'unit of physical storage')...

> On the other hand, if we leave out many-to-one, readers
> will ask why.

My reaction was: why is many-to-one left *in*...  If what you are talking
about here are such things as the "squared" ligatures and other ligatures,
then that should be made explicit.  Side remark: The fi ligature is
especially unfortunate.  Some software automatically replaces fi with the
fi ligature, and has no other means (yet) of handling ligatures.  It then
misses out on fj, resulting in a poor typographic result for words like
fjärde (fourth), fjord, fjolåret (the previous year), fjäll (scales or
mountain...).
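
A minimal Python sketch of the two points above, assuming that "the fi
ligature" means U+FB01 LATIN SMALL LIGATURE FI. The helper naive_ligate is
hypothetical; it only stands in for software whose single ligature rule is
fi, and is not taken from any real product.

```python
import unicodedata

FI_LIGATURE = "\uFB01"  # LATIN SMALL LIGATURE FI (assumed meaning of "fi ligature")


def naive_ligate(text: str) -> str:
    """Hypothetical stand-in for a tool whose only ligature rule is fi."""
    return text.replace("fi", FI_LIGATURE)


# One character (one code position) that carries two letters:
# its compatibility decomposition is the two-letter sequence "fi".
single_character = len(FI_LIGATURE) == 1
two_letters = unicodedata.normalize("NFKD", FI_LIGATURE)

# Canonical normalisation (NFC) leaves the ligature character untouched.
nfc_unchanged = unicodedata.normalize("NFC", FI_LIGATURE) == FI_LIGATURE

print(single_character, two_letters, nfc_unchanged)
print(naive_ligate("fin"))    # the f+i pair becomes the ligature character
print(naive_ligate("fjord"))  # left as typed: the tool has no rule for fj
```

A font-level ligature mechanism would handle fi and fj alike without
changing the stored characters; the sketch only shows why substitution at
the character level falls short.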

> 
> >* clause 3.1.7.
> >"MUST (sic) specify which"; but then there should be an explicit list in 
> >the "char-model" document, should there not?  Otherwise you have an 
> >open-ended requirement.
> 
> Well, yes, the requirement is that another WG actually think this
> through, rather than that they just pick something from a list.
> I don't think it's a problem to satisfy this requirement, do you?

No, but the phrase "must specify which of *the* possible..." implies (to me
at least) that there's a predefined list.  Maybe a better formulation is:
"Specifications using the term character shall specify the meaning of that
term as used in that specification."

> 
> >* clause 3.2
> >There is no definition of terms in the document.  Terms such as "byte"
> >and "wyde" are left for the reader to guess, likewise for "octet",
> >though that is more precise.  Note that some well-known standards (such
> >as that for C) do NOT limit a "byte" to be an "octet".
> 
> Does anything in the spec not work out because the reader doesn't
> know what a byte is? I don't know, but if that's not the case,
> then we don't have to be more precise, or do we?

After seeing the recent discussion on the "Open Group" e-mail list about
the next version of POSIX, where a discussion thread is going on and on
about 9-bit bytes, 10-bit bytes (for historic architectures) and the
eventual possibility of 16-bit bytes, I find it best to avoid the term
byte altogether and just write octet.

> 
> >"code point"...; "code position" seems to be the 10646 term, though not 
> >formally defined.
> 
> We checked that, you are right. I think we decided to add
> "code position" in parenthesis to give the link to 10646 terminology.

Or just write "code position" throughout...  (I think it's a better term,
since it does not involve the term "point", which has other connotations.)
> 
> >"Transfer Encoding Syntax" is missing here (and see below).
> >
> >* clause 3.6.1.
> >"charset" is mentioned a number of times.  It should say that XML uses a 
> >pseudo-attribute called "encoding" rather than "charset".
> 
> I think this is a good point, although we tried to word things so that
> they apply to more than just XML.

Of course, but a "Note" would be appropriate.

> 
> >It should be mentioned that due to a decision to have only a few 
> >"Transfer-Encoding" values, some encodings that are really Transfer 
> >Encoding Syntaxes got registered as "charsets".  For instance UTF-7 
> >(despite the name it's not a UTF) and HZ-GB-2312 are really TESes, not 
> >CESes.  UTF-7 is already deprecated, and was only intended for e-mail in 
> >the same way as Quoted-Printable was only intended for (7-bit) e-mail.
> >No TES should be used other than for backwards compatibility in e-mail
> >support (i.e. SHOULD not be used elsewhere; or even SHALL NOT be used
> >elsewhere...).
> 
> I agree with you content-wise, but I'm not sure we need to go into that
> much detail. Mail isn't really our business. And the details of TES vs.
> CES don't really affect us. Also, there are many other 'charset's that we
> probably would like to recommend against, for one or the other reason,
> but this will become a very long story.
> 
> 
> >There are also some 2022 "charsets" registered.  But due to the lack of 
> >widespread support for 2022, it should be avoided except for backwards 
> >compatibility in e-mail support, and should not be used elsewhere.
> 
> iso-2022-jp is quite widely used, even on the Web. And it's mentioned in
> the XML Rec. For many others, I agree, but again: do we need to bless
> them by mentioning them?

Well, my point was to UNbless them...

> 
> >Further, there are some older registered encodings related to 
> >10646/Unicode apart from UTF-7: --UCS-2, --UTF-1, --UCS-4; UNICODE-1-1
> >as well as a few subsets.  These should be recommended against.
> 
> Are they in widespread use? Do you think that without recommending
> against them, our recommendations for UTF-8 and UTF-16 are not clear
> enough?
> 
> 
> >* clause 3.7.
> >For XML (and thus XHTML) one should recommend to use the hexadecimal 
> >rather than decimal "character escape".
> 
> Good point. The main reason up to now for not doing that was Netscape 4.x,
> but I guess it might be time to forget about that.
> 
> 
> >XHTML has inherited a number of named (rather than numbered) "character 
> >escapes"; are these counted as character escapes too, or are they 
> >not?  (See also below.)
> 
> We have received similar comments from other people. We will have
> to clarify this.
> 
> 
> >XML 1.0 does not allow any "character escapes" in identifiers (they are 
> >allowed in comments, but I'm not sure if a "source viewer" is supposed
> >to interpret them there).  Maybe a note about that...
> 
> Yes, probably a good idea, although we don't want to become 
> an XML tutorial.
> 
> 
> >Some editing tools are too eager to automatically use character escapes 
> >(or named character entities; like &aring;) even though the target 
> >encoding perfectly well can represent the character directly without any 
> >problems.  There should be a recommendation not to do so, but to insert 
> >the characters directly as typed on the keyboard (or pasted in as plain 
> >text), when representable and when they would not cause parsing problems 
> >(like e.g., '<' would in XML).
> 
> This is also a good idea. I think quite some members of our WG will agree.
> 
> 
> >* clause 4.1
> >"UTR #15" --> "UAX 15"; it's in UAX status, and the # is just ugly.
> >
> >* clause 4.2.2.
> >For clarity, the parenthetical definition should be removed, along with 
> >its application.
> >
> >[this clause is a mess, as are the references to it]
> >
> >"does not contain any character escapes whose unescaping..."?  This 
> >appears to be targeting such things as numeric escapes (like &#... in 
> >XML).  It's not clear if standardised named character entities are to be 
> >considered, or even worse, non-standard externally parsed entities that 
> >someone might have defined in another file (or whatever). If not, 
> >expanding them may result in non-NFC, which is guarded against when it 
> >comes to numeric character escapes.  If they are, then defined entities 
> >(in the XML-meaning) must be examined too. Are they then to be expanded 
> >during W3C-normalisation?
> >
> >E.g. is &Auml;&#x304; W3C-normalised (for XHTML) or not?  Note that 
> >expanding both the named and numeric character reference, and then 
> >creating an NFC version generates the single character called LATIN 
> >CAPITAL LETTER A WITH DIAERESIS AND MACRON. The situation gets even
> >worse with non-W3C-standard entities that may be defined, in the same
> >"file" or in another "file", which may contain any text, including
> >markup, and even if the definition itself may be "W3C-normalised", at
> >the point of use there may still be a concatenation of strings whose
> >result is not in any normal form.  Does a W3C-normaliser for XML need to
> >consider externally parsed entities?
> >
> >4.2.2. defines "W3C-normalised" w.r.t. the "character escape" syntax
> >used, but is not clear about what that is.  A further problem is that
> >4.2.2 does NOT actually define what "W3C-normalisATION", the algorithm,
> >is supposed to do.  Is input to be rejected if not already normalised?
> >Probably not. Are some numeric character escapes to be expanded, combined
> >with creation of a result in NFC?  Maybe.
> 
> Well, yes, indeed, we just judge the result. How you get there is your
> own business. And as long as you follow the rules, you can use as
> many escapes as you like and still be normalized.

I'm not sure I follow.

UAX 15 does not merely give criteria for the normal forms; indeed it does
not define any criteria directly, but it defines the process, the algorithm.
This algorithm produces a string in NFx from any given string of characters
that were allocated for that version of the algorithm (and that is, ahem,
promised to be stable for new versions of the algorithm).

Here, on the other hand, a criterion is given, but no algorithm.  Indeed,
it may be the case that it is not possible by reasonable means to
W3C-normalise any given string (with given numeric character escape syntax)
resulting in a normalised string.
Reason: There may be named string insertions, whose definitions or
insertion points violate the rules that I outlined in another e-mail (no
combining ch. immediately after a string insertion point; no combining
characters in the beginning of the definition of the insertion string).  So
in contrast to UAX 15, there may here be two different *kinds* of results:
1) a W3C-normal string, or 2) W3C-normalisation impossible without changing
the definition(s) of named insertion strings (or other edits that are too
'major').
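
Both problems can be sketched in a few lines of Python. This is only an
illustration under assumptions: html.unescape stands in for an XHTML
processor's expansion of &Auml; and &#x304;, and the one-character
"definition" is a hypothetical named insertion string, not one from any
real DTD.

```python
import html
import unicodedata


def is_nfc(s: str) -> bool:
    """True if s is unchanged by NFC normalisation."""
    return unicodedata.normalize("NFC", s) == s


# Case 1: the escaped text is plain ASCII, so it is trivially in NFC,
# but expanding the references yields A-diaeresis + combining macron.
escaped = "&Auml;&#x304;"
expanded = html.unescape(escaped)
composed = unicodedata.normalize("NFC", expanded)

# Case 2: a hypothetical insertion string that starts with a combining
# mark.  The definition and the text before the insertion point are each
# in NFC on their own; their concatenation is not.
before_insertion = "A"
definition = "\u0308"  # COMBINING DIAERESIS
joined = before_insertion + definition

print(is_nfc(escaped), is_nfc(expanded), unicodedata.name(composed))
print(is_nfc(before_insertion), is_nfc(definition), is_nfc(joined))
```

In case 2 there is nothing a normaliser can change locally: either the
definition or the text at the insertion point has to be edited.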

> >But what about XML's entities; are they to be examined? And if the data
> >is then found not to be W3C-normalised, what then?  Expand the entity?
> >That may contain markup, and be arbitrarily large; and the entity may
> >have multiple occurrences.  It may also destroy the document design ("I
> >DID want that entity with a combining character first!").
> 
> I think your proposal in another mail to prohibit (in essence) combining
> characters at the beginning of an entity makes a lot of sense. Anyway,
> the question of the interaction between entities and Early Normalization
> has been raised by other people, too, and we will work on addressing it.
> 
> 
> >It's not clear to me why W3C-normalisation at all has to be
> >defined.  Expanding character escapes (and other entities), if done
> >while editing, should be accompanied by (local) establishment of NFC.
> >But that is no different from, say, pasting in some text (that may
> >contain or even begin with combining characters) during editing.
> >Likewise for string identity matching, after expanding entities (and
> >numeric character escapes), a local normalisation step may be needed.
> 
> The very idea behind Early Uniform Normalization is to avoid that
> normalization has to be done on every single matching operation.

Then those extra requirements on named string insertions ("entities" in
XML) are needed.

> 
> >For such things as signature creation, entities and numeric character 
> >escapes would not be expanded, creating different signatures for the 
> >unexpanded and expanded versions.
> 
> Not exactly. XML signature for sure gets rid of numeric character
> references, and also entities as far as I remember.

Aha.

> 
> >Maybe there is some assumed distinction between numeric character
> >escapes and other entities (still in the XML-ish sense) that I've
> >missed. Like that numeric character escapes are to be interpreted during
> >string identity matching and signature creation, while the otherwise
> >rather similar (character or larger) named entities are not to be so
> >expanded for those operations. If so, please explain.  Also please
> >explain why such a difference in treatment would not be a problem.
> >Writing &Auml; instead of &#xC4; is not all that different.
> >
> >There is nothing about versioning.  A new text that is in NFC for a
> >version of Unicode that does not contain (unallocated) a "new"
> >precomposed character that is later allocated and used in the "new"
> >text, will not be in NFC for the "new" version of Unicode, which will
> >decompose it.
> 
> The idea is that such new precomposed characters are decomposed at
> the origin, and never appear on the Web.

That assumes that characters that are allocated in version n.m are
normalised (at the origin) by an NFC-normaliser of at least version n.m.
If someone uses an 'old' tool to insert 'new' characters (say by numeric
character references), and some of those were such that they are
canonically decomposable, that 'old' normaliser will not decompose them.
To that 'old' normaliser, those are just unallocated code positions.

Do you mean to say that a W3C-normaliser has yet another failure mode:
unallocated code positions are used, cannot normalise (just in case they
canonically decompose)?
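
The version dependence can be sketched with Python's unicodedata module,
which ships a frozen Unicode 3.2 database (unicodedata.ucd_3_2_0) next to
its current one. The old database plays the 'old' normaliser here. The
code position U+FA70 is merely an assumed example of a character that was
allocated after Unicode 3.2 and has a canonical decomposition; any such
character would do.

```python
import unicodedata

old = unicodedata.ucd_3_2_0  # Unicode 3.2 data, standing in for an 'old' tool
ch = "\uFA70"                # assumed example: allocated after Unicode 3.2

# General category as seen by each version of the data.
old_category = old.category(ch)
new_category = unicodedata.category(ch)

# The same NFC request, answered by each version of the data.
old_result = old.normalize("NFC", ch)
new_result = unicodedata.normalize("NFC", ch)

print("old data:", old_category, "unchanged" if old_result == ch else "changed")
print("new data:", new_category, "unchanged" if new_result == ch else "changed")
print("current Unicode version:", unicodedata.unidata_version)
```

The old data passes the character through untouched, so text that the old
tool reports as normalised is not in NFC for the newer version.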

> 
> >The note about legacy (plain) text always being normalised might not be 
> >true for all (any?) legacy encodings for Vietnamese (and now maybe not
> >for Hebrew either...).  See in particular MS CP 1258.
> 
> This depends on the definition, i.e. what does it mean for a legacy
> encoding to be normalized. We chose the definition that best fits with
> the rest of our explanations, which may not be the same as some other
> definitions. Some clarification may be needed.

I don't follow.  For CP 1258 there are a number of combining characters
that can be represented (as well as base characters, of course).  But not
all of the precomposed characters in Unicode that can result from a CP 1258
string directly transcoded to Unicode and then NFCd can be directly
represented in CP 1258 (you have to use the expanded version).
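
A small sketch of that round trip, assuming Python's built-in cp1258 codec
as the transcoder. The chosen letter (e with circumflex and acute) is just
one example of the kind of character meant; the helper encodable is
hypothetical.

```python
import unicodedata


def encodable(s: str, codec: str = "cp1258") -> bool:
    """True if every character of s has a direct representation in codec."""
    try:
        s.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# e-circumflex followed by COMBINING ACUTE ACCENT: the expanded form.
decomposed = "\u00EA\u0301"

legacy_bytes = decomposed.encode("cp1258")   # representable in CP 1258
transcoded = legacy_bytes.decode("cp1258")   # direct transcoding to Unicode
nfc = unicodedata.normalize("NFC", transcoded)

print(len(legacy_bytes), "octets in CP 1258")
print(unicodedata.name(nfc))
print("expanded form encodable:", encodable(decomposed))
print("NFC form encodable:     ", encodable(nfc))
```

So CP 1258 text transcoded without a normalisation step is not in NFC, and
the NFC form cannot be written back without expanding it again.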

> 
> >Side remark: turning marked-up W3C-normalised text into plain text may 
> >produce non-NFC results in another way too; e.g. 
> ><ex>A<emphasise>&#x308;</emphasise></ex> (say that 'emphasise' uses red 
> >colour when displayed/printed). Just expanding the character escape
> >while removing the markup tags results in a decomposed Ä as the plain
> >text version.
> 
> Good point. But removing the markup is an editing operation, I guess.

On the other hand, it might not be a good idea to make that kind of
splitups.  I guess very few, if any, renderers would be able to handle
making just the combining mark in another colour, as in this example...
So, though not needed for W3C-normalisation, maybe one should impose a
similar rule here as for defined string names: no combining characters in
the beginning of string elements (for XML, #PCDATA and #CDATA).
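
The markup example from the side remark can be reproduced with Python's
standard XML parser. This is a sketch: joining the text nodes is assumed
here to be what "turning into plain text" means.

```python
import unicodedata
import xml.etree.ElementTree as ET

marked_up = "<ex>A<emphasise>&#x308;</emphasise></ex>"

# Parsing expands the numeric character reference; joining the text
# nodes drops the tags.
plain = "".join(ET.fromstring(marked_up).itertext())
nfc = unicodedata.normalize("NFC", plain)

print([unicodedata.name(c) for c in plain])
print([unicodedata.name(c) for c in nfc])
print("plain text already in NFC:", plain == nfc)
```

With the proposed rule (no combining character at the start of an
element's character data) this input would be rejected before the markup
is ever stripped.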

> 
> >* clause 5
> >Expand "GI" to "generic identifier" (or avoid that term, which is not
> >even properly defined in the XML spec.).
> >
> >* clause 8 (on URIs)
> >[this is a general and ugly mess]
> 
> Can you be more specific? E.g. what do you not understand,
> what do you think should be added/removed/changed,...

Well, messy here is:
	No general back-convertibility (even when applied 'correctly'),
	no clear line where on one side %-encoding is used, and on the other
	it's not,
	and all the problems that follow from these.

I'm not sure what can be done about this since the %-encoding rules were
flawed from the beginning (much more so than QP, that I guess was the
inspiration).
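
The back-conversion problem shows up as soon as two parties assume
different character encodings underneath the %-encoding. A sketch with
Python's urllib.parse; the path segment "fjäll" and the choice of UTF-8
versus Latin-1 are assumptions made for illustration only.

```python
from urllib.parse import quote, unquote

name = "fj\u00E4ll"  # hypothetical path segment containing a-diaeresis

# The same characters, %-encoded over two different byte encodings.
utf8_form = quote(name, encoding="utf-8")
latin1_form = quote(name, encoding="latin-1")

# Nothing in the %-encoded form says which encoding was used, so the
# receiver has to guess, and a wrong guess does not round-trip.
right_guess = unquote(utf8_form, encoding="utf-8")
wrong_guess_a = unquote(latin1_form, encoding="utf-8", errors="replace")
wrong_guess_b = unquote(utf8_form, encoding="latin-1")

print(utf8_form, latin1_form)
print(right_guess == name, wrong_guess_a == name, wrong_guess_b == name)
```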

> 
> >"The conversion MUST (sic) take place as late as possible."
> 
> >Good.  Similarly, the conversion back to a form that does not use the 
> >%-encoding should be done as early as possible (in case a URI protocol 
> >element is passed back as parameter, it should not then still be 
> >%-encoded). Nor should pre-%-encoded URIs occur in stored or generated 
> >documents. This should keep %xx's out of any UI.
> 
> This is not always possible, because of legacy URIs that may not be
> back-convertible, or that may look back-convertible, but actually
> were created completely differently.

See above.

		Kind regards
		/kent k
Received on Wednesday, 21 February 2001 11:05:03 UTC