RE: Last Call review of Character Model for the WWW from Mark Davis on 2001-02-21 (www-i18n-comments@w3.org from February 2001)

From: Mark Davis <mark.davis@us.ibm.com>
Date: Wed, 21 Feb 2001 13:56:09 -0800
To: Karlsson Kent - keka <keka@im.se>
Cc: "'duerst@w3.org'" <duerst@w3.org>, "'www-i18n-comments@w3.org'" <www-i18n-comments@w3.org>, misha.wolf@reuters.com, "'Asmus Freytag'" <asmusf@ix.netcom.com>, "'Kenneth Whistler'" <kenw@sybase.com>
Message-ID: <OFD7858386.5C04BED3-ON882569FA.0077B448@LocalDomain>

Brief comments with ***

This example touches on what the term "character" means.  But using
the term "character" in the 10646/Unicode sense, the fi ligature stores
two letters in a single character (which in some encodings fit in a single
'unit of physical storage'). Not two characters in a single character
(which in some encodings fit in a single 'unit of physical storage')...

*** To be more accurate, for the case of the fi ligature, a sequence of two
abstract characters in a particular presentation form is represented by a
single encoded character.
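
The many-to-one relationship described above can be observed with Python's
standard unicodedata module. This is only an illustrative sketch: the helper
name describe_ligature is made up for this example, and compatibility
normalization (NFKC) is used here merely as one convenient way to recover
the underlying two-letter sequence.

```python
import unicodedata

FI_LIGATURE = "\ufb01"  # the single encoded character for the fi ligature


def describe_ligature(ch: str) -> dict:
    """Hypothetical helper: report what one encoded character stands for."""
    expansion = unicodedata.normalize("NFKC", ch)  # compatibility mapping
    return {
        "name": unicodedata.name(ch),
        "encoded_characters": len(ch),          # how many code points we hold
        "expansion": expansion,                 # the letters it represents
        "expanded_characters": len(expansion),  # how many letters that is
    }


info = describe_ligature(FI_LIGATURE)
print(info)
```

One encoded character goes in, and a sequence of two letters comes out of
the compatibility mapping.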

> On the other hand, if we leave out many-to-one, readers
> will ask why.

My reaction was: why is many-to-one left *in*...  If what you are talking
about here are such things as the "squared" ligatures and other ligatures,
then that should be made explicit.  Side remark: The fi ligature is
especially unfortunate.  Some software automatically replaces fi with the
fi ligature, and has no other means (yet) of handling ligatures.  It then
misses out on fj, resulting in a poor typographic result for words like
fjarde (fourth), fjord, fjolaret (the previous year), fjall (scales or
mountain...).
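
The side remark can be made concrete with a small sketch. The function
naive_ligate below is a hypothetical stand-in for the kind of automatic
replacement described, not the behaviour of any particular product; the
word "final" is added only as a contrast case.

```python
FI_LIGATURE = "\ufb01"  # single encoded character for the fi ligature


def naive_ligate(text: str) -> str:
    """Hypothetical replacement rule: swap every 'fi' for the fi ligature.

    Nothing else is handled, so an f followed by j is left as two letters.
    """
    return text.replace("fi", FI_LIGATURE)


words = ["final", "fjord", "fjarde", "fjall"]
ligated = {word: naive_ligate(word) for word in words}

for word, result in ligated.items():
    status = "ligated" if result != word else "unchanged"
    print(f"{word}: {status}")
```

Under this assumption only "final" is changed; the fj words pass through
untouched, which is the typographic gap being pointed out.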

*** We should definitely leave in the discussion of many:one relationships
-- can't hide that bit of ugliness.


>
> >* clause 3.2
> >There is no definition of terms in the document.  Terms such as "byte"
> >and "wyde" are left for the reader to guess, likewise for "octet", though
> >that is more precise.  Note that some well-known standards (such as that
> >for C) do NOT limit a "byte" to be an "octet".
>
> Does anything in the spec not work out because the reader doesn't
> know what a byte is? I don't know, but if that's not the case,
> then we don't have to be more precise, or do we?

After seeing the recent discussion on the "Open Group" e-mail list about
the next version of POSIX, where a discussion thread is going on and on
about 9-bit bytes, 10-bit bytes (for historic architectures) and the
eventual possibility of 16-bit bytes, I find it best to avoid the term
byte altogether and just write octet.

*** I agree about not using wyde (certainly not without a definition).
Byte, on the other hand, is simply much better understood than octet. One
can qualify it on first use by saying that it is always 8-bit.

>
> >"code point"...; "code position" seems to be the 10646 term, though not
> >formally defined.
>
> We checked that, you are right. I think we decided to add
> "code position" in parenthesis to give the link to 10646 terminology.

Or just write "code position" throughout...  (I think it's a better term,
since it does not involve the term "point", which has other connotations.)

*** While "code position" is defined in 10646, it is rarely used there.
"Code point" is used throughout the Unicode documents. "Position" also has
its own connotations.
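
To keep the terms in this thread apart, the sketch below contrasts one code
point with the number of 8-bit units (octets) and 16-bit units it occupies
in two encodings. The helper name storage_units and the three sample
characters are choices made for this illustration only.

```python
def storage_units(ch: str) -> dict:
    """Hypothetical helper: one character, counted in different units."""
    return {
        "code_point": f"U+{ord(ch):04X}",                     # the abstract number
        "utf8_octets": len(ch.encode("utf-8")),               # 8-bit units
        "utf16_code_units": len(ch.encode("utf-16-be")) // 2, # 16-bit units
    }


samples = ["A", "\ufb01", "\U00010400"]
report = [storage_units(ch) for ch in samples]

for row in report:
    print(row)
```

Each sample is a single code point, yet the count of octets and of 16-bit
units differs from one character to the next.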

Received on Wednesday, 21 February 2001 16:56:22 UTC