RE: [Ltru] RE: For review: Tagging text with no language from Peter Constable on 2007-04-13 (www-international@w3.org from April to June 2007)

From: Peter Constable <petercon@microsoft.com>
Date: Thu, 12 Apr 2007 22:26:12 -0700
To: Mark Davis <mark.davis@icu-project.org>, John Cowan <cowan@ccil.org>
CC: "www-international@w3.org" <www-international@w3.org>, LTRU Working Group <ltru@ietf.org>
Message-ID: <DDB6DE6E9D27DD478AE6D1BBBB8357955D038B87D0@NA-EXMSG-C117.redmond.corp.microsoft>

There is a difference between 'no information provided' and 'information is provided: this is unknown'.

I don't see what the issue is; I guess I must have missed the start of this thread.

Peter

________________________________
From: Mark Davis [mailto:mark.davis@icu-project.org]
Sent: Thursday, April 12, 2007 10:29 AM
To: John Cowan
Cc: www-international@w3.org; LTRU Working Group
Subject: Re: [Ltru] RE: For review: Tagging text with no language

Q1. I had missed the choice of "mis". I agree with that suggestion; we should incorporate that into 4646bis. The problem is ameliorated considerably once we add ISO 639-3, but it doesn't disappear completely, so "mis" remains a good choice for dealing with that situation.

Q2. The issue *does* remain, since we talk about "und" vs the absence of a language tag, which "" represents.

Mark
On 4/12/07, John Cowan <cowan@ccil.org> wrote:
Mark Davis scripsit:

> The summary looks good. This discussion raises 2 items for the LTRU
> group.
>
> Q1. What tag should be used where it is definitely a language, but there
> is no code available yet? (This is an area where ISO 15924 is ahead
> of ISO 639 (and 3166), since it has Zzzz: Code for uncoded script.)

In principle, every natural-language item (text, audio, video) can be
coded with some 639-2 code; if the language does not have a code of its
own, it will belong to one of the 639-2 collections.

For example, the language Tarifit (639-3 code 'rif') does not have a 639-2
code, but it is a Berber language; consequently, an item in Tarifit may be
validly tagged 'ber', which represents the collection of Berber languages.
Similarly, the language Zumbun (639-3 code 'jmb') does not have a 639-2
code, nor does it belong to any of the smaller 639-2 collections, but it
does belong to the Afro-Asiatic language family; consequently, an item
in Zumbun may be validly tagged 'afa', which represents the collection
of Afro-Asiatic languages.

If all else fails, as for the language isolate Burushaski (639-3 code
'bsk'), the 639-2 collection code 'mis', representing the collection of
miscellaneous languages, may be applied.  This is the ultimate fallback
code, indicating that the language is known but nothing useful can be
said about it using 639-2 codes.
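
The fallback described above can be sketched as a simple lookup. This is only an illustration: the table `COLLECTION_FOR` and the function `collection_fallback` are hypothetical names, the table holds just the examples from this message rather than the real ISO 639 data, and the caller is assumed to have already established that the language has no 639-2 code of its own.

```python
# Illustrative table only: its entries come from the examples in this
# message, not from the full ISO 639-2 / 639-3 code lists.
COLLECTION_FOR = {
    "rif": "ber",  # Tarifit -> collection of Berber languages
    "jmb": "afa",  # Zumbun  -> collection of Afro-Asiatic languages
}


def collection_fallback(code_639_3: str) -> str:
    """Pick a 639-2 collection code for a language with no 639-2 code of its own.

    Assumes the caller has already checked that the language lacks an
    individual 639-2 code. Anything not covered by a narrower collection
    falls through to 'mis', the ultimate fallback.
    """
    return COLLECTION_FOR.get(code_639_3, "mis")


if __name__ == "__main__":
    for code in ("rif", "jmb", "bsk"):
        print(code, "->", collection_fallback(code))
```

Burushaski ('bsk') is deliberately absent from the table, so it lands on 'mis', as in the paragraph above.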

All of this lore, which represents the practice of the Library of Congress
(the ultimate source of 639-2), can of course go away when RFC 4646bis
goes into effect.  If it is necessary to be more specific before then,
and if strict compliance to 4646 is required, then ber-x-tarifit,
afa-x-zumbun, and mis-x-burushas may also be used.
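
As a rough sketch of how such private-use tags could be built: RFC 4646 limits each private-use subtag after 'x' to at most eight ASCII letters or digits, which would explain a shortened form like 'burushas'. The helper name `private_use_tag` and its truncation policy are assumptions made for this illustration, not anything defined by the RFC.

```python
def private_use_tag(prefix: str, name: str) -> str:
    """Build '<prefix>-x-<subtag>' from a language name.

    The subtag keeps only ASCII letters and digits, lowercased, and is cut
    to eight characters, the maximum length of a private-use subtag in
    RFC 4646. Truncating (rather than rejecting) long names is a choice
    made for this sketch.
    """
    cleaned = "".join(ch for ch in name.lower() if ch.isascii() and ch.isalnum())
    subtag = cleaned[:8]
    if not subtag:
        raise ValueError("name contains no ASCII letters or digits")
    return f"{prefix}-x-{subtag}"


if __name__ == "__main__":
    print(private_use_tag("mis", "Burushaski"))
```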

> Q2. Clarify the wording around "und" vs "".

"" is not a well-formed language tag according to RFC 4646, so there is
nothing to say about it there.  It is defined by the XML Recommendation as
an extension to the set of language tags, and having the same significance
as no language declaration at all.
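
A small sketch of the distinction under discussion, assuming a consumer that receives either no language value, the empty string, or a tag. The function name `describe_language_info` and the returned strings are made up for the example; the only behaviour taken from the thread is that "" is treated like a missing declaration while "und" is an explicit statement that the language is undetermined.

```python
from typing import Optional


def describe_language_info(tag: Optional[str]) -> str:
    """Classify a language value the way the thread distinguishes the cases."""
    if tag is None or tag == "":
        # Absent declaration and "" carry the same significance.
        return "no language information"
    if tag.lower() == "und":
        # Information is provided: the language is undetermined.
        return "language declared as undetermined"
    return f"language declared as {tag}"


if __name__ == "__main__":
    for value in (None, "", "und", "ber"):
        print(repr(value), "->", describe_language_info(value))
```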

--
Dream projects long deferred             John Cowan <cowan@ccil.org>
usually bite the wax tadpole.            http://www.ccil.org/~cowan
        --James Lileks

--
Mark

Received on Friday, 13 April 2007 05:26:20 UTC