- From: Martin J Duerst <mduerst@ifi.unizh.ch>
- Date: Thu, 24 Oct 1996 20:19:04 +0100 (MET)
- To: keld@dkuug.dk (Keld Jørn Simonsen)
- Cc: rosenne@NetVision.net.il, www-international@w3.org
Keld Simonsen wrote:

>Martin J Duerst writes:
>
>> Keld Simonsen wrote:
>>
>> >Again, the user does not care
>> >how the information is encoded, as long as what (s)he sees
>> >is understandable and what is expected. One or two characters
>> >does not matter to the user. So again it is up to the system designer
>> >to code the information in an unambiguous and well-defined way.
>> >In the case of accented Latin characters, 10646 then specifies
>> >normatively only one way of encoding.
>>
>> Like Jonathan Rosenne, I don't really agree on this point.
>> Assume I have something like A-with-dot-below, which does
>> not exist as precomposed in ISO 10646. For this thing, what
>> does ISO 10646 (normatively or otherwise) specify?
>
>Well, 1EA0 should do it for A-with-dot-below.

Sorry for the bad example.

>But anyway, I agree.

I was thinking of our A-GRAVE example.

>There are a number of Latin characters that are not defined
>in 10646, and you can encode the information with the use of combining
>characters.

Is that way of encoding these characters normative? If not, what is it? And if it should turn out that it is not normative, whereas a precomposed character as such is normative, does that give one of them preference over the other, as you suggest?

If I take another example, the two-letter combination "fi": who tells me that I have to encode it with "f" and "i"? There is a (normative) ligature "fi" in ISO 10646! And what about plain words such as "it"? I guess ISO 10646 does not include anything normative on how to decompose words into characters; it relies on common sense. It just says that "i" and "t" are at their respective code points.

So now if I see, on paper, an A-grave, and I see it as two things, namely an A and a combining grave, wouldn't I be just as entitled to encode it as two separate things? Is there any passage in ISO 10646 that would prohibit me from doing so, or that would suggest that I do something different?
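To make the two spellings concrete, here is a minimal sketch in Python, using its standard unicodedata module (a present-day illustration only; the variable names are mine, and nothing in it is prescribed by ISO 10646):

    import unicodedata

    precomposed = "\u00C0"      # LATIN CAPITAL LETTER A WITH GRAVE
    decomposed  = "A\u0300"     # "A" followed by COMBINING GRAVE ACCENT

    # Two different sequences of coded characters ...
    print(len(precomposed), len(decomposed))                         # 1 2
    # ... which normalization maps onto each other in both directions:
    print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
    print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True

    # A-with-dot-below does exist precomposed (1EA0), but it too has a
    # perfectly good spelling as "A" plus COMBINING DOT BELOW (0323):
    print(unicodedata.normalize("NFD", "\u1EA0") == "A\u0323")       # True

    # And the "fi" ligature (FB01) coexists with plain "f" + "i";
    # only compatibility normalization folds the two together:
    print(unicodedata.normalize("NFKC", "\uFB01") == "fi")           # True

Both spellings are legal sequences of coded characters; the standard itself does not tell the sender which one to pick.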
>> The keyword here is "at some stage". And one also has to realize
>> that combining semantics, in particular for Indic scripts, can be
>> handled quite differently from Latin, because it is much less a
>> general combination, and much more a complicated arrangement
>> of special cases.
>>
>> Assume, for a little while, that not even the precombinations
>> in Latin-1 were available in ISO 10646. This would mean,
>> obviously, that because of large and wealthy markets such as
>> Germany and France, everybody would immediately start to
>> work on combining characters. And these implementations would
>> be completed rather soon, and would be very straightforward.
>
>I have heard that it should be very straightforward, but I have
>not yet seen implementations.

I'm sure Taligent has one :-). And probably also Accent, Alice, and Gamma. These would be exactly the high-end companies. That there are no others would nicely prove my point.

>I also know that encodings with
>similar properties to UNICODE, including ISO 6937, have not
>been very widely implemented, although ISO 6937 was capable of
>handling almost all Latin-script-based languages, and has been
>around for a long time.

I don't know much about ISO 6937. But current display and printing technology is much more flexible than what we had when ISO 6937 was made.

>The problem with rare languages is that you need to have also
>printers, displays, etc. render the rare-language characters, and
>this requires that the products be enhanced with the fonts for
>these characters.
>At least for Danish I know that you cannot
>just do with a simple or intelligent combination of glyphs
>with the base letter and the accents, and I would imagine that
>for other languages based on the Latin script, they would have
>similar problems. So in the interest of the rare languages
>we should work on integrating these characters in 10646.

Again, font and glyph issues are not the same as character encoding. You might know that in most PostScript fonts, something like a German u-Umlaut is actually decomposed into two subroutine-like parts. So you can easily have something like:

    User thinks/keyboards:     Precomposed
    Software stores:           Decomposed
    Glyph selection from font: Precomposed
    Glyph subroutines:         Decomposed

Of course, there should/will be decomposed fallbacks on the keyboard and glyph selection levels. Also, it is not too difficult to write a reasonably good accent placement algorithm for PostScript. Of course, this will not give top-notch typographic quality, but quite reasonable results.

Regards,    Martin.
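P.S.: To make the pipeline above concrete, here is a small sketch of the storage and glyph-selection steps, again in Python and purely illustrative; the toy font table and its glyph names are made up for this example:

    import unicodedata

    # Toy font table: one precomposed glyph plus pieces for fallbacks.
    FONT_GLYPHS = {
        "\u00FC": "udieresis",      # precomposed u-Umlaut glyph
        "u": "u",
        "a": "a",
        "\u0308": "dieresis.comb",  # combining diaeresis (invented name)
    }

    def store(keyboard_input):
        # Software stores: Decomposed
        return unicodedata.normalize("NFD", keyboard_input)

    def select_glyphs(stored_text):
        # Glyph selection from font: Precomposed where the font has a glyph,
        # otherwise fall back to base letter plus combining accent glyphs.
        glyphs = []
        for ch in unicodedata.normalize("NFC", stored_text):
            if ch in FONT_GLYPHS:
                glyphs.append(FONT_GLYPHS[ch])
            else:
                for part in unicodedata.normalize("NFD", ch):
                    glyphs.append(FONT_GLYPHS.get(part, ".notdef"))
        return glyphs

    # User thinks/keyboards: Precomposed u-Umlaut
    print(select_glyphs(store("\u00FC")))  # ['udieresis']
    # This font has no precomposed a-Umlaut glyph, so the decomposed
    # fallback kicks in at the glyph selection level:
    print(select_glyphs(store("\u00E4")))  # ['a', 'dieresis.comb']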