W3C home > Mailing lists > Public > public-i18n-geo@w3.org > December 2002

Re: General: African languages

From: Martin Duerst <duerst@w3.org>
Date: Tue, 24 Dec 2002 11:15:32 -0500
Message-Id: <>
To: Andrew Cunningham <andrewc@mail.vicnet.net.au>, public-i18n-geo@w3.org

Hello Andrew,

At 20:28 02/12/18 +1100, Andrew Cunningham wrote:

>WRT today's teleconference:
>What follows is very very brief. If people want more details or have 
>specific questions, please let me know.
>African languages fall into four categories:
>1) languages supported by Unicode.
>E.g. Hausa and Pulaar (using Latin script).
>2) languages supported by Unicode, but requiring additional support in 
>rendering systems.
>E.g. Yoruba, Ife, Dinka, Nuer, etc.
>This can include correct placement of combining diacritics based on 
>languages' typographic conventions, or stacking of combining diacritics. 
>Ife offers a challenging example.
>Some notes under construction that may illustrate some of the issues:

Very nice notes.

It would be very good to note that, at least on the Web,
NFC should always be preferred. I.e. don't just say

U+00E3 U+0300 or U+0061 U+0303 U+0300

but make it clear that U+00E3 U+0300 (the NFC form) is the
right way to go. This will help a lot for low-level comparisons, etc.
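To make the difference concrete, here is a small sketch in Python using the standard-library `unicodedata` module (the code points are the a + tilde + grave combination above):

```python
import unicodedata

# Fully decomposed spelling: a + combining tilde + combining grave
decomposed = "\u0061\u0303\u0300"

# NFC composes a + tilde into U+00E3 (a with tilde); the grave accent
# has no precomposed form with U+00E3, so it stays as a combining mark.
nfc = unicodedata.normalize("NFC", decomposed)
assert nfc == "\u00e3\u0300"

# Without normalization the two spellings compare unequal, which is
# exactly the low-level comparison problem mentioned above.
assert decomposed != "\u00e3\u0300"
assert unicodedata.normalize("NFC", decomposed) == \
       unicodedata.normalize("NFC", "\u00e3\u0300")
```

Producing NFC at authoring time means consumers can do plain code-point comparison without normalizing first.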


I checked this one, and it was in NFC. But I didn't see
a language indication for Dinka. Maybe there is no code?

>This is an issue for font rendering technologies (AAT/ATSUI, Uniscribe and 
>Graphite for example). OpenType has features (e.g. MarkToBase, MarkToMark) 
>that are designed for correct positioning of combining diacritics. Support 
>for this in Uniscribe is currently under development. (Not sure of the 
>status of AAT/ATSUI in this regard).

Don't at least some of these technologies offer the possibility to
define glyphs for combinations of characters? Also, please check
SVG and see whether it contains the necessary mechanisms (it should!).

>In some cases: (Dinka and Nuer for instance) the existing combining 
>diacritics for some fonts are adequate for lowercase characters (but not 
>optimal), although entirely unsuitable for uppercase characters. In other 
>cases like Ife, where diacritic stacking is required, it is a crucial 
>concern which will be alleviated when the new versions of the font 
>rendering technologies become widespread.

A shorter-term solution would be to create e.g. a TrueType
font for Dinka that covers all the necessary combinations and
has the right glyph shapes for the uppercase diacritics, and
to give that font priority in style sheets for Dinka material.
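A minimal sketch of such a style-sheet rule, assuming the Dinka material is tagged with a `din` language attribute and using a hypothetical font name "Dinka Unicode" (both are illustrative, not existing products):

```css
/* Give the special-purpose Dinka font priority for Dinka material.
   "Dinka Unicode" is a hypothetical font name for illustration. */
:lang(din) {
  font-family: "Dinka Unicode", "Arial Unicode MS", sans-serif;
}
```

The fallback list lets browsers without the special font still render something, while users who install it get the correct uppercase diacritic shapes.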

>Additionally, African languages use alternative glyphs for certain 
>characters (most common example is uppercase ENG). It is possible to 
>create alternative glyphs for different languages/typographic traditions 
>within an opentype font. Unfortunately current software is unable to 
>interact sufficiently with the font rendering systems to allow use of 
>language-specific features within fonts.

Again, having a specially-designed font (or some fonts) may be
a short to middle-range solution.

>At least that's my current understanding.
>3) languages that have some characters that are not present in Unicode.
>E.g. Dagera (Burkina Faso), Hausa/Pulaar/etc. in Ajami (Arabic script).
>There has been a fair amount of discussion recently on Ajami on the 
>Unicode-Afrique, A12N Collaboration and H-Hausa mailing lists.

Very good. The important thing is to take such discussion (or the
results and main points) over to unicode@unicode.org (or even unicore),
and to work on actual proposals.

>4) scripts currently not supported by Unicode.
>E.g. N'ko, Vai, Tifinagh, etc.

For some work on the latter, please see

>With respect to HTML, issues are how to identify languages when there is 
>no ISO-639-1 code or IANA language code. How should the "x-" convention be 
>used in practical settings?

Not at all, if possible.

>For an example:

[I find the en-AU in <h1 xml:lang="en-AU">Nuer test page</h1>
a bit too much, but that's a detail.]
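Since xml:lang is inherited, one declaration on the root element usually suffices, with overrides only where the language actually changes. A minimal illustrative XHTML fragment (content hypothetical):

```html
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-AU" lang="en-AU">
  <body>
    <!-- inherits en-AU from the root; no per-heading tag needed -->
    <h1>Nuer test page</h1>
  </body>
</html>
```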

>I've used a convention "x-sil-" to indicate Ethnologue language codes. 
>Although that's neither here nor there.

Exactly. If the language in question has some amount of printed works
(50 different items, to be exact), then you should apply for an ISO 639-2
code. If the language in question doesn't have that much printed material,
you should apply for an IANA code.

>Other key issues include charset identification in the absence of 
>"defined" character encodings.

There are no 'undefined' character encodings. If somebody
defines an encoding, they should document it and register it
with IANA. It's rather easy to do that. But working towards
getting the necessary characters into Unicode may be
a much better use of your time.

Regards,    Martin.

>A useful starting point is the "A12N gateway" http://www.bisharat.net/A12N/
>Andrew Cunningham
>Multilingual Technical Officer
>OPT, Vicnet,
>State Library of Victoria
Received on Tuesday, 24 December 2002 13:58:40 GMT
