Re: For review: Tagging text with no language

I find your conclusion from 639-2 extremely odd, since "zxx" is ALSO in a
column labeled "English name of language"; by your logic, "zxx" is also
a language, just one with no linguistic content. What we have here is not a
definitive situation, just a conflict between two standards, 639-3 and
639-2. And the important thing for xml:lang is BCP 47, which is based on
639-2. Reinterpreting "und" as being *only somewhat unknown* (that is, saying
that it can't correctly be applied to non-linguistic content, even if you
don't know that the content is non-linguistic) would definitely be a breaking
change for many applications, and I would be strongly opposed to making that
change in BCP 47.

In the real world, it is often very difficult or impossible to tell whether
content is in an unknown language or whether the content is not in any
language at all. Saying that an implementation or protocol is not compliant
if it doesn't correctly make this determination is bizarre. What we have
done in BCP 47 is to say that if the protocol allows it, then one should
omit the language code entirely; otherwise one can use "und".
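
As a rough sketch of that rule (Python; the function and parameter names
are hypothetical and appear nowhere in BCP 47), a tagging step that does
not try to decide whether unidentified content is linguistic might look
like this:

```python
from typing import Optional


def language_tag_for(identified_tag: Optional[str],
                     protocol_allows_omission: bool) -> Optional[str]:
    """Pick the language tag to emit for a piece of content.

    identified_tag: a tag such as "en" or "fr" if the language was
        identified, else None. None covers BOTH "language unknown" and
        "might not be linguistic at all"; the caller is not asked to
        tell these apart.
    protocol_allows_omission: True if the surrounding protocol lets the
        language information be left out entirely.

    Returns the tag to emit, or None meaning "emit no language tag".
    """
    if identified_tag:
        return identified_tag
    if protocol_allows_omission:
        # Preferred: say nothing about the language.
        return None
    # The protocol requires some value, so fall back to "und".
    return "und"
```

Here language_tag_for(None, True) returns None (omit the tag) and
language_tag_for(None, False) returns "und"; in neither case does the
caller have to know whether the content is non-linguistic.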

CCing the LTRU group, since this is relevant to the next version of BCP 47.

Mark

On 4/11/07, John Cowan <cowan@ccil.org> wrote:
>
> Mark Davis scripsit:
>
> > I believe that that is adding an interpretation to "und" which is not
> > borne out by either the source standards, nor in common usage.
>
> ISO 639-2 says merely "Undetermined", but this is placed in a column
> labeled "English name of language", so I think it's fair to read it
> as "Undetermined language".  But ISO 639-3 is, I think, definitive.
> http://www.sil.org/iso639-3/scope.asp#S says (in part):
>
>         The identifier [und] (undetermined) is provided for those
>         situations in which a language or languages must be indicated
>         but the *language* cannot be identified [emphasis added].
>
> By contrast, "zxx" is explained in the next sentence thus:
>
>         The identifier [zxx] (no linguistic content) may be applied in a
>         situation in which a language identifier is required by system
>         definition, but the item being described does not actually
>         contain linguistic content.
>
> In any case, the document I'm commenting on says that "zxx" is
> non-linguistic content, and that "und" and "" are synonymous and
> represent linguistic content.  Whatever "und" may or may not mean,
> I think there's no doubt that "" can be applied to both linguistic
> and non-linguistic content.
>
> --
> You escaped them by the will-death              John Cowan
> and the Way of the Black Wheel.                 cowan@ccil.org
> I could not.  --Great-Souled Sam                http://www.ccil.org/~cowan
>




Received on Wednesday, 11 April 2007 21:19:44 UTC