Re: General: African languages from Andrew Cunningham on 2002-12-26 (public-i18n-geo@w3.org from December 2002)

From: Andrew Cunningham <andrewc@mail.vicnet.net.au>
Date: Thu, 26 Dec 2002 11:31:12 +1100
To: Martin Duerst <duerst@w3.org>
Cc: public-i18n-geo@w3.org
Message-ID: <3E0A4DD0.1030504@mail.vicnet.net.au>
Hi Martin,

Martin Duerst wrote:
> 
> Very nice notes.
> 

just a work in progress.

> It would be very good to note that at least on the Web,
> NFC should always be preferred. I.e. don't just say
> 
> U+00E3 U+0300 or U+0061 U+0303 U+0300
> LATIN SMALL LETTER A WITH TILDE WITH COMBINING GRAVE
> 
> But make it clear that U+00E3 U+0300 is the right way to go.
> This will help a lot for low-level comparisons,...
> 
> 

thanks, i was intending to write up soem notes on using NFC  and also on 
accent ordering.



>> http://www.openroad.net.au/languages/african/dinka-4.html
> 
> 
> I checked this one, and it was in NFC. But I didn't see
> a language indication for Dinka. Maybe there is no code?
> 

Dinka doesn't have a ISO-639-1 langauge code, although it does have a 
ISO-639-2 code "din". The poage was originally a HTML4 page so couldn't 
use the iso-369-2 language code.

> 
>> This is an issue for font rendering technologies (AAT/ATSUI, Uniscribe 
>> and Graphite for example). OpenType has features (e.g. MarkToBase, 
>> MarkToMark) that are designed for correct positioning of combining 
>> diacritics. Support for this in Uniscribe is currently under 
>> development. (Not sure of the status of AAT/ATSUI in this regard).
> 
> 
> Don't at least some of these technologies offer the possibility to
> define glyphs for combinations of characters? Also, please check
> SVG and see whether it contains the necessary mechanisms (it should!).
> 

I haven't looked at SVG, but will. Thanks.

AAT/ATSUI ... don'ty know enough about it .. been too long since i've 
used macs .. but I suspect it should.

Unsicribe ... there has been ongoiing discussions on some mailing lists, 
which included soem type designers and soem of the Microsoft staff. My 
understanding at the moment is that the OpenType MarToBase and 
MarkToMark features should be used, but Uniscribe doesnot currentlky 
include support for these features for the Latin script (in general), 
although support was introduced for Vietnamese at soem sdtage. I'df have 
to check which evrsion of uniscribe that was. So some people will have 
combining diacritic support for Diactric/base combinations used in 
Vietnamese (for soem reason Micrsodt doesn't use NFC or NFD for 
Vietnamese Unicode input) but no othger diacritic/base combinations in 
the Latin script.

As to which users ahve what .. it depends on

1) which evrsion of windows they ahve installed,
2) which evrsion of IE they have installed, and wether they have any of 
the language packs requiring complex script rendering iunstalled, and
3) which other Microsoft products they have installed.

Both the uniscribe development team and the Word development team are 
working on general support for combining diacritics and stacking 
diacritics. Its currently unknown when this will be released to the 
public, maybe through the next version of Office if its implemented by then.

uniscribe isn't something that is always automatically upgraded.

Graphite can, but needs soemone to design and build the graphite tables 
into appropriate fonts. SIL are working on a set of graphite unicoide 
fonts at the moment that should work with graphite and uniscribe. Not 
sure when they will ahve it finished.


> 
>> In some cases: (Dinka and Nuer for instance) the existing combining 
>> diacritics for some fonts are adequate for lowercase characters (but 
>> not optimal), although entirely unsuitable for uppercase characters. 
>> In other cases like Ife, where diacritic stacking is required, it is a 
>> crucial concern which will be alleviated when the new versions of the 
>> font rendering technologies become widespread.
> 
> 
> A more short-time solution would be to create e.g. a True-Type
> font for Dinka, which covers all the necessary combinations, and
> has the right glyph shapes for the upper-case diacritics, and
> give that font priority in style sheets for Dinka material.
> 

problem is that you still need a rendering engine that can correctly 
position the diactrics above a capital open-e and open-o. I haven't 
tried suing the gpos and gsub tabl;es in the fonts yet, but in theory 
support for that is limited in uniscribe (wrt to the Latin script).

The fonts can be built, but until teh support exists in the rendering 
engine, the font will not do to much.

> 
>> Additionally, African languages use alternative glyphs for certain 
>> characters (most common example is uppercase ENG). It is possible to 
>> create alternative glyphs for different languages/typographic 
>> traditions within an opentype font. Unfortunately current software is 
>> unable to interact sufficiently with the font rendering systems to 
>> allow use of langauge specific features within fonts.
> 
> 
> Again, having a specially-designed font (or some fonts) may be
> a short to middle-range solution.
> 

yep ... i realise that. It would mean i'd have one set of fonts for key 
african languages and another for the Yolngu-matha languages for instance.

> 
>> At least thats my current understanding.
>>
>> 3) languages that have some characters that are not present in Unicode.
>> E.g. Dagera (Burkina Faso), Hausa/Pulaar/etc. in Ajami (Arabic script).
>>
>> There has been a fair amount of discussion recently on Ajami on the 
>> Unicode-Afrique, A12N Collaboration and H-Hausa mailing lists.
> 
> 
> Very good. The important thing is to take such discussion (or the
> results and main points) over to unicode@unicode.org (or even unicore),
> and to work on actual proposals.
> 

very true

a few people are working on comiling lists at the moment. Once the basic 
work has been done then it needs to move to the appropraite forums. Kewy 
people on the a12n-collaboration and Unicode Afrique mailing lists are 
present on the unicode mailing list.

> 
>> 4) scripts currently not supported by Unicode.
>> E.g  N'ko, Vai, Tifinagh, etc.
> 
> 
> For some work on the later, please see
> http://std.dkuug.dk/jtc1/sc2/wg2/docs/n1757.pdf
> 

yep, I'm aware of this document.

> 
>> With respect to HTML, issues are how to identify languages when there 
>> is no ISO-639-1 code or IANA language code. How should the "x-" 
>> convention be used in practical settings?
> 
> 
> Not at all, if possible.
> 

umm .. that leaves the issue of how to identify nost human languages 
then. For intsnce if I had a XHTML 1.1 page designed for Southern Sudan, 
in Dinka, Nuer and Arabic. I could provide provide an iso-639-2 language 
tag for Dinka, an iso-639-1 langauge tag for Arabic. But Nuer doesn't 
have an iso-639-1, iso-639-2 or IANA language code.

OK i guess i could go through the hassle of trying to register a code, 
and wait and wait.

What happens if I have a database driven web site with content in 
hundreds of African languages, most not having iso-639-1/2 or IANA codes.

Not sure the appropriate bodies would appreciate all those registrations 
being dumped on them all at the same time.

> 
>> For an example:
>>
>> http://home.vicnet.net.au/~andrewc/samples/nuer.htm
> 
> 
> [I find the en-AU in <h1 xml:lang="en-AU">Nuer test page</h1>
> a bit too much, but that's a detail.]
> 
> 
>> I've use a convention "x-sil-" to indicate an ethnologue language 
>> codes. Although thats neither here nore there.
> 
> 
> Exactly. If the language in question has some ammout of printed works
> (50 different items to be exact), then you should apply for an iso-639-2
> code. if the language in question doesn't have that much printed material,
> you should apply for an IANA code.
> 

LOL :)  I'll keep that in mind for a future project. Not sure IANA will 
appreciate it though :)

> 
>> Other key issues include charset identification in the absence of 
>> "defined" character encodings.
> 
> 
> There are no 'undefined' character encodings. If somebody
> does an encoding, they should document it, and register it
> with IANA. It's rather easy to do that. But working towards
> getting the necessary characters into Unicode may be
> much better use of your time.
> 

LOL,

I tend to only use unicode these days, although if the language cann't 
be rendered using unicode, i'll fall back to an 8-bit character 
encoding, but most 8-bit character encodings i've seen and used within 
the last few years do not have characters encodings registered with 
IANA. I suspect that the people who hacked together the fonts don't even 
know there is an IANA character encoding registry.

We know that there is a registry and there are very good reasons for 
using it, but I come across a lot of people who are not aware of it.

Catch-22

The amount of education that needs to be done, is extansive. Not just on 
issues relating to designing and implementing web sites and web 
services, but many of the background issues, relevant standards and why 
using them is important. The appropraite registries like IANA (for 
alngauge codes and character encodings) or the iso-639 registration 
authority.

The most effective ways of using unicode.

But some of the issues are even more fundamnetal, ie which unicode 
codepoints refers to which chacacters in the language. A good example of 
the possible confusions possible is current practice in Yoruba. Yoruba 
sues a diacritic below the "e", "o" and "s". In some textx this will be 
  a dot below the letter, in others a vertical bar below the letter. 
Both forms are in wide use.

Some Yoruba text will use U+0323 (COMBINING DOT BELOW) or appropriate 
precomposed forms, while other texts would use U+0329 (COMBINING 
VERTICAL BAR BELOW) which the Unicode charts iundicate is used for Yoruba.

This lead to two possibilities:

1) the "dot below" and the "vertical bar below" represent two different 
orthographic traditions. Which isn't that uncommon in West African 
languages. Many langauges in West Africa have different orthographies in 
different regions (ie use different characters to represent the same sound).

2) that they are glyph variations, ie the same character, just different 
shapes in different fonts.

approach 1) means that different codepoints or sequences of codepoints 
will be used to represent the same "character" in the language". 2) 
menas that only one unicode character wiull be used, but teh shape of 
the diacritic will change depending on teh shape of the font. Ie U+0329 
might look like a "vertical bar below" or it mightlook like U+0323.

Either way, the countries involve may have to standardise the 
orthography for this language.

Everything is relatively nice and simple when dealing with european 
languages and east asian languages. But African languages have been 
neglected so long that its a real mess with respect to standards and web 
internationalization.


Andrew
Received on Wednesday, 25 December 2002 19:36:12 UTC