
Re: [Ltru] RE: For review: Tagging text with no language

From: John Cowan <cowan@ccil.org>
Date: Thu, 12 Apr 2007 13:00:13 -0400
To: Mark Davis <mark.davis@icu-project.org>
Cc: Richard Ishida <ishida@w3.org>, LTRU Working Group <ltru@ietf.org>, www-international@w3.org
Message-ID: <20070412170013.GF16269@mercury.ccil.org>

Mark Davis scripsit:

> The summary looks good. This discussion raises 2 items for the LTRU 
> group.
> 
> Q1. What tag should be used where it is definitely a language, but there
> is no code available yet? (This is an area where ISO 15924 is ahead
> of ISO 639 (and 3166), since it has Zzzz: Code for uncoded script.)

In principle, every natural-language item (text, audio, video) can be
coded with some 639-2 code; if the language does not have a code of its
own, it will belong to one of the 639-2 collections.

For example, the language Tarifit (639-3 code 'rif') does not have a 639-2
code, but it is a Berber language; consequently, an item in Tarifit may be
validly tagged 'ber', which represents the collection of Berber languages.
Similarly, the language Zumbun (639-3 code 'jmb') does not have a 639-2
code, nor does it belong to any of the smaller 639-2 collections, but it
does belong to the Afro-Asiatic language family; consequently, an item
in Zumbun may be validly tagged 'afa', which represents the collection
of Afro-Asiatic languages.

If all else fails, as for the language isolate Burushaski (639-3 code
'bsk'), the 639-2 collection code 'mis', representing the collection of
miscellaneous languages, may be applied.  This is the ultimate fallback
code, indicating that the language is known but nothing useful can be
said about it using 639-2 codes.
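
The fallback order described above can be sketched roughly as follows.
This is only an illustration: the two lookup tables are hypothetical and
contain nothing beyond the examples in this message plus one language
(Hawaiian) that has its own 639-2 code; a real implementation would load
them from the ISO 639 code lists.

```python
# Hypothetical sample data, not a real registry.
# Languages that already have their own 639-2 code.
OWN_639_2 = {"haw": "haw"}  # Hawaiian

# For languages without one: applicable 639-2 collections, most specific first.
COLLECTIONS = {
    "rif": ["ber", "afa"],  # Tarifit: Berber, which is within Afro-Asiatic
    "jmb": ["afa"],         # Zumbun: only the Afro-Asiatic family applies
    "bsk": [],              # Burushaski: isolate, no collection applies
}


def fallback_639_2(code_639_3: str) -> str:
    """Pick a 639-2 code for a known language, falling back to 'mis'."""
    if code_639_3 in OWN_639_2:
        return OWN_639_2[code_639_3]
    chain = COLLECTIONS.get(code_639_3, [])
    # 'mis' is the last resort: the language is known, but no 639-2
    # code says anything more useful about it.
    return chain[0] if chain else "mis"
```

With this sample data, fallback_639_2("rif") gives 'ber',
fallback_639_2("jmb") gives 'afa', and fallback_639_2("bsk") gives 'mis'.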

All of this lore, which represents the practice of the Library of Congress
(the ultimate source of 639-2), can of course go away when RFC 4646bis
goes into effect.  If it is necessary to be more specific before then,
and if strict compliance with 4646 is required, then ber-x-tarifit,
afa-x-zumbun, and mis-x-burushas may also be used.
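
As a rough sketch of how such private-use tags could be built: RFC 4646
limits each subtag to eight alphanumeric characters, which is why
Burushaski is shortened above.  The helper name private_use_tag is made
up for this illustration, and truncating the language name is just one
possible convention.

```python
import re


def private_use_tag(base: str, name: str) -> str:
    """Append a private-use subtag derived from a language name.

    `base` is the 639-2 code chosen as the primary subtag; `name` is
    lowercased, stripped of anything outside a-z and 0-9, and cut to the
    eight-character subtag limit.
    """
    subtag = re.sub(r"[^a-z0-9]", "", name.lower())[:8]
    if not subtag:
        raise ValueError("name yields an empty private-use subtag")
    return f"{base}-x-{subtag}"
```

For instance, private_use_tag("mis", "Burushaski") yields
'mis-x-burushas'.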

> Q2. Clarify the wording around "und" vs "".

"" is not a well-formed language tag according to RFC 4646, so there is
nothing to say about it there.  It is defined by the XML Recommendation as
an extension to the set of language tags, with the same significance as
no language declaration at all.
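
To illustrate the distinction, here is a deliberately coarse check.  It
only tests the overall shape of a tag (an alphabetic first subtag, then
hyphen-separated alphanumeric subtags of one to eight characters, in the
style of the older RFC 3066 grammar); it is not the full RFC 4646 ABNF.
Both function names are invented for this sketch.

```python
import re
from typing import Optional

# Coarse shape only: ALPHA{1,8} followed by any number of -ALNUM{1,8}.
_TAG_SHAPE = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def looks_like_language_tag(tag: str) -> bool:
    """True if `tag` has the general shape of a language tag."""
    return _TAG_SHAPE.fullmatch(tag) is not None


def effective_xml_lang(value: str) -> Optional[str]:
    """Treat xml:lang="" like the absence of any language declaration."""
    return None if value == "" else value
```

Under this check "und" has the shape of a tag and "" does not, while
effective_xml_lang("") returns None, mirroring the XML reading of the
empty value.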

-- 
Dream projects long deferred             John Cowan <cowan@ccil.org>
usually bite the wax tadpole.            http://www.ccil.org/~cowan
        --James Lileks
Received on Thursday, 12 April 2007 17:00:32 GMT