Re: [Ltru] RE: For review: Tagging text with no language

Q1. I had missed the choice of "mis". I agree with that suggestion; we
should incorporate it into 4646bis. The problem is ameliorated
considerably once we add ISO 639-3, but it doesn't disappear completely,
so "mis" remains a good choice for dealing with that situation.

Q2. The issue *does* remain, since we talk about "und" vs the absence of a
language tag, which "" represents.
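
To make the "und" vs "" distinction concrete, here is a rough sketch of
how a consumer might resolve the effective language of each element. It
assumes the XML rule that xml:lang="" has the same significance as no
language declaration at all, and it treats "und" as an ordinary tag
value. The function name effective_langs and the sample document are
made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined "xml" prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_langs(root):
    """Yield (element name, effective language) pairs in document order.

    None means "no language declared", which is what xml:lang="" resets to.
    (Hypothetical helper, not part of any specification.)
    """
    def walk(elem, inherited):
        declared = elem.get(XML_LANG)
        if declared is None:
            lang = inherited      # nothing declared here: inherit from parent
        elif declared == "":
            lang = None           # "" cancels the inherited declaration
        else:
            lang = declared       # a real tag, including "und"
        yield elem.tag, lang
        for child in elem:
            yield from walk(child, lang)

    yield from walk(root, None)


# Made-up sample document exercising all three cases.
doc = ET.fromstring(
    '<doc xml:lang="en"><a/><b xml:lang=""><c/></b><d xml:lang="und"/></doc>'
)
result = dict(effective_langs(doc))
```

Here <b> and <c> end up with no language at all, while <d> carries an
explicit "und"; that difference is the one the wording needs to cover.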

Mark

On 4/12/07, John Cowan <cowan@ccil.org> wrote:
>
> Mark Davis scripsit:
>
> > The summary looks good. This discussion raises 2 items for the LTRU
> > group.
> >
> > Q1. What tag should be used where it is definitely a language, but there
> > is no code available yet? (This is an area where ISO 15924 is ahead
> > of ISO 639 (and 3166), since it has Zzzz: Code for uncoded script.)
>
> In principle, every natural-language item (text, audio, video) can be
> coded with some 639-2 code; if the language does not have a code of its
> own, it will belong to one of the 639-2 collections.
>
> For example, the language Tarifit (639-3 code 'rif') does not have a 639-2
> code, but it is a Berber language; consequently, an item in Tarifit may be
> validly tagged 'ber', which represents the collection of Berber languages.
> Similarly, the language Zumbun (639-3 code 'jmb') does not have a 639-2
> code, nor does it belong to any of the smaller 639-2 collections, but it
> does belong to the Afro-Asiatic language family; consequently, an item
> in Zumbun may be validly tagged 'afa', which represents the collection
> of Afro-Asiatic languages.
>
> If all else fails, as for the language isolate Burushaski (639-3 code
> 'bsk'), the 639-2 collection code 'mis', representing the collection of
> miscellaneous languages, may be applied.  This is the ultimate fallback
> code, indicating that the language is known but nothing useful can be
> said about it using 639-2 codes.
>
> All of this lore, which represents the practice of the Library of Congress
> (the ultimate source of 639-2), can of course go away when RFC 4646bis
> goes into effect.  If it is necessary to be more specific before then,
> and if strict compliance with 4646 is required, then ber-x-tarifit,
> afa-x-zumbun, and mis-x-burushas may also be used.
>
> > Q2. Clarify the wording around "und" vs "".
>
> "" is not a well-formed language tag according to RFC 4646, so there is
> nothing to say about it there.  It is defined by the XML Recommendation as
> an extension to the set of language tags, and having the same significance
> as no language declaration at all.
>
> --
> Dream projects long deferred             John Cowan <cowan@ccil.org>
> usually bite the wax tadpole.            http://www.ccil.org/~cowan
>         --James Lileks
>
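
The fallback order described in the quoted message (own 639-2 code, else
a 639-2 collection, else 'mis') can be sketched roughly as below. The
lookup table holds only the examples from the message and is not a real
registry; the names NARROWEST_639_2, fallback_tag and private_use_tag
are hypothetical. The sketch also assumes names are plain ASCII letters
and relies on the RFC 4646 limit of 8 characters per private-use subtag.

```python
# Illustrative data only: examples taken from the quoted message.
# Maps an ISO 639-3 code to the narrowest ISO 639-2 collection covering it.
NARROWEST_639_2 = {
    "rif": "ber",  # Tarifit -> Berber languages
    "jmb": "afa",  # Zumbun -> Afro-Asiatic languages
}


def fallback_tag(code_639_3):
    """Return a 639-2 code for a known language, falling back to 'mis'."""
    return NARROWEST_639_2.get(code_639_3, "mis")


def private_use_tag(code_639_3, name):
    """Build a tag of the form <639-2 code>-x-<name> for extra specificity.

    RFC 4646 allows at most 8 alphanumeric characters per private-use
    subtag, so the (assumed ASCII) name is lowercased and truncated.
    """
    subtag = name.lower()[:8]
    return f"{fallback_tag(code_639_3)}-x-{subtag}"


# Burushaski is an isolate with no entry in the table, so it lands on 'mis'.
burushaski_tag = private_use_tag("bsk", "Burushaski")  # "mis-x-burushas"
```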




Received on Thursday, 12 April 2007 17:29:32 UTC