- From: Martin J Duerst <mduerst@ifi.unizh.ch>
- Date: Thu, 24 Oct 1996 20:19:04 +0100 (MET)
- To: keld@dkuug.dk (Keld Jørn Simonsen)
- Cc: rosenne@NetVision.net.il, www-international@w3.org
Keld Simonsen wrote:

>Martin J Duerst writes:
>
>> Keld Simonsen wrote:
>>
>> >Again, the user does not care
>> >how the information is encoded, as long as what (s)he sees
>> >is understandable and what is expected. One or two characters
>> >does not matter to the user. So again it is up to the system designer
>> >to code the information in an unambiguous and well-defined way.
>> >In the case of accented Latin characters, 10646 then specifies
>> >normatively only one way of encoding.
>>
>> Like Jonathan Rosenne, I don't really agree on this point.
>> Assume I have something like A-with-dot-below, which does
>> not exist as precomposed in ISO 10646. For this thing, what
>> does ISO 10646 (normatively or otherwise) specify?
>
>Well, 1EA0 should do it for A-with-dot-below.

Sorry for the bad example.

>But anyway, I agree.

I was thinking of our A-GRAVE example.

>There are a number of Latin characters that are not defined
>in 10646, and you can encode the information with the use of combining
>characters.

Is that way of encoding these characters normative? If not, what is it? And if it should turn out that it is not normative, whereas a precomposed character as such is normative, does that give one of them preference over the other, as you suggest?

If I take another example, the two-letter combination "fi": who tells me that I have to encode it with "f" and "i"? There is a (normative) ligature "fi" in ISO 10646! And what about plain words such as "it"? I guess ISO 10646 does not include anything normative on how to decompose words into characters; it relies on common sense. It just says that "i" and "t" are at their respective code points.

So now if I see, on paper, an A-grave, and I see it as two things, namely an A and a combining grave, wouldn't I be just as entitled to encode it as two separate things? Is there any passage in ISO 10646 that would prohibit me from doing so, or that would suggest that I do something different?
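To make the two spellings concrete, here is a minimal sketch in Python, using its standard unicodedata module (a present-day illustration only; the variable names are mine, and nothing in it is prescribed by ISO 10646):

    import unicodedata

    precomposed = "\u00C0"      # LATIN CAPITAL LETTER A WITH GRAVE
    decomposed  = "A\u0300"     # "A" followed by COMBINING GRAVE ACCENT

    # Two different sequences of coded characters ...
    print(len(precomposed), len(decomposed))                         # 1 2
    # ... which normalization maps onto each other in both directions:
    print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
    print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True

    # A-with-dot-below does exist precomposed (1EA0), but it too has a
    # perfectly good spelling as "A" plus COMBINING DOT BELOW (0323):
    print(unicodedata.normalize("NFD", "\u1EA0") == "A\u0323")       # True

    # And the "fi" ligature (FB01) coexists with plain "f" + "i";
    # only compatibility normalization folds the two together:
    print(unicodedata.normalize("NFKC", "\uFB01") == "fi")           # True

Both spellings are legal sequences of coded characters; the standard itself does not tell the sender which one to pick.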
>> The keyword here is "at some stage". And one also has to realize
>> that combining semantics, in particular for Indic scripts, can be
>> handled quite differently from Latin, because it is much less a
>> general combination, and much more a complicated arrangement
>> of special cases.
>>
>> Assume, for a little while, that not even the precombinations
>> in Latin-1 were available in ISO 10646. This would mean,
>> obviously, that because of large and wealthy markets such as
>> Germany and France, everybody would immediately start to
>> work on combining characters. And these implementations would
>> be completed rather soon, and would be very straightforward.
>
>I have heard that it should be very straightforward, but I have
>not yet seen implementations.

I'm sure Taligent has one :-). And probably also Accent, Alice, and Gamma. These would be exactly the high-end companies. That there are no others would nicely prove my point.

>I also know that encodings with
>similar properties to UNICODE, including ISO 6937, have not
>been very widely implemented, although ISO 6937 was capable of
>handling almost all Latin-script-based languages, and has been
>around for a long time.

I don't know much about ISO 6937. But current display and printing technology is much more flexible than what we had when ISO 6937 was made.

>The problem with rare languages is that you need to have also
>printers, displays, etc. render the rare-language characters, and
>this requires that the products be enhanced with the fonts for
>these characters.
>At least for Danish I know that you cannot
>just do with a simple or intelligent combination of glyphs
>with the base letter and the accents, and I would imagine that
>for other languages based on the Latin script, they would have
>similar problems. So in the interest of the rare languages
>we should work on integrating these characters in 10646.

Again, font and glyph issues are not the same as character encoding. You might know that in most PostScript fonts, something like a German u-Umlaut is actually decomposed into two subroutine-like parts. So you can easily have something like:

    User thinks/keyboards:     Precomposed
    Software stores:           Decomposed
    Glyph selection from font: Precomposed
    Glyph subroutines:         Decomposed

Of course, there should/will be decomposed fallbacks on the keyboard and glyph selection levels. Also, it is not too difficult to write a reasonably good accent placement algorithm for PostScript. Of course, this will not give top-notch typographic quality, but quite reasonable results.

Regards,    Martin.
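P.S.: To make the pipeline above concrete, here is a small sketch of the storage and glyph-selection steps, again in Python and purely illustrative; the toy font table and its glyph names are made up for this example:

    import unicodedata

    # Toy font table: one precomposed glyph plus pieces for fallbacks.
    FONT_GLYPHS = {
        "\u00FC": "udieresis",      # precomposed u-Umlaut glyph
        "u": "u",
        "a": "a",
        "\u0308": "dieresis.comb",  # combining diaeresis (invented name)
    }

    def store(keyboard_input):
        # Software stores: Decomposed
        return unicodedata.normalize("NFD", keyboard_input)

    def select_glyphs(stored_text):
        # Glyph selection from font: Precomposed where the font has a glyph,
        # otherwise fall back to base letter plus combining accent glyphs.
        glyphs = []
        for ch in unicodedata.normalize("NFC", stored_text):
            if ch in FONT_GLYPHS:
                glyphs.append(FONT_GLYPHS[ch])
            else:
                for part in unicodedata.normalize("NFD", ch):
                    glyphs.append(FONT_GLYPHS.get(part, ".notdef"))
        return glyphs

    # User thinks/keyboards: Precomposed u-Umlaut
    print(select_glyphs(store("\u00FC")))  # ['udieresis']
    # This font has no precomposed a-Umlaut glyph, so the decomposed
    # fallback kicks in at the glyph selection level:
    print(select_glyphs(store("\u00E4")))  # ['a', 'dieresis.comb']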