
Re: [Ltru] RE: For review: Tagging text with no language

From: Mark Davis <mark.davis@icu-project.org>
Date: Thu, 12 Apr 2007 18:15:02 -0700
Message-ID: <30b660a20704121815j59e854dxf7cf5f53a48ece21@mail.gmail.com>
To: "Stephen Deach" <sdeach@adobe.com>
Cc: "Kent Karlsson" <kent.karlsson14@comhem.se>, "Asmus Freytag" <asmusf@ix.netcom.com>, "John Cowan" <cowan@ccil.org>, "Richard Ishida" <ishida@w3.org>, "LTRU Working Group" <ltru@ietf.org>, www-international@w3.org, "CLDR list" <cldr@unicode.org>
I think I agree with you in spirit, but not in precise details. The tag
"und" means "undetermined", so when I encounter it I don't know whether the
content contains one language, many languages, or no language. The tag "zxx"
would mean that there is no language content, "mis" would mean that there is
at least some language content, and "mul" would mean that there is language
content, with more than one language.
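
As a rough sketch of that reading, here is how the four special tags might be
modeled from the point of view of someone consuming the tag. The names
ContentAssumption, SPECIAL_TAGS, and reader_assumption are made up for
illustration and do not come from any spec or library; None stands for
"cannot tell from the tag alone".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContentAssumption:
    """What a reader may assume about content, given a correct tag.

    None means the tag alone does not settle the question.
    """
    has_language: Optional[bool]
    multiple_languages: Optional[bool]


# Hypothetical table encoding the interpretation described above.
SPECIAL_TAGS = {
    # "und": could be one language, many languages, or none at all.
    "und": ContentAssumption(has_language=None, multiple_languages=None),
    # "zxx": no language content.
    "zxx": ContentAssumption(has_language=False, multiple_languages=False),
    # "mis": at least some language content; how many languages is unknown.
    "mis": ContentAssumption(has_language=True, multiple_languages=None),
    # "mul": language content, with more than one language.
    "mul": ContentAssumption(has_language=True, multiple_languages=True),
}


def reader_assumption(tag: str) -> Optional[ContentAssumption]:
    """Return the assumption for a special tag, or None for any other tag."""
    return SPECIAL_TAGS.get(tag.strip().lower())
```

Note that this only captures what the tag licenses a reader to assume if it
was applied correctly; it says nothing about why the tagger chose it.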

I think that trying to consider the tagger's motivations may lead
to misleading impressions. Assume for the moment that the tag is correct.
From the perspective of the tagger, using "und" could mean, as you say, that
the tagger doesn't know or care (or want to communicate, or want to spend
the time to determine) what the language is or whether there is any language
content there at all. There could be quite a variety of motivations for the
tagger's using "und"; the key is what the reader of "und" can assume about
the content, which is essentially nothing. With "mis", the situation is
similar, but slightly narrower. The tagger may still not know much, or care
much, but maybe cared enough to determine that there was something there, or
maybe there was language content there, but there is no language code that
correctly matches it (Proto-Germanic, perhaps).

Similarly, using "may not" language is a bit too strong in your phrase
"Whereas zxx says I 'may not' apply any of those language-based services
because it is not a 'natural' language". Having content tagged with "zxx"
doesn't restrict me from doing anything I want to; it just means that, if it
was tagged correctly, it does not contain any language content. (I might decide
that the tagger was mistaken -- when we at Google look at the tagging people
actually do of web content, there is a fairly high percentage of both
invalid tags and valid-but-incorrect tags.)

Mark

On 4/12/07, Stephen Deach <sdeach@adobe.com> wrote:
>
>  I think much of this discussion is dealing with terminology differences
> that are so narrow that one is discussing "the number of angels who can
> dance on the head of a pin". (In other words we are debating theology, not
> practice.) In reality, specifications are worded as carefully as possible,
> but interpretation is open to the reader's most common
> definition/redefinition/translation of the exact terminology.  -- So rather
> than debate what the "exact meaning" of a word/phrase is in each of these
> languages, maybe we should take a looser interpretation of what is written
> and then clarify the intent.
>
> My reading of the ISO spec is that "und/undetermined" means "I don't know
> (or care, or am unwilling to state) what the language is (and have no closer
> alternative language identifier given the available options)". From a
> practical viewpoint, "und" indicates I can't assume any specific/preferred
> linguistic definitions for words in the content, nor can I assume any
> specific/preferred pronunciation-, spelling-, hyphenation-, and/or
> grammar-rules on the content; though I am allowed to attempt my own
> linguistic analysis to guess at the language. (Whereas zxx says I 'may not'
> apply any of those language-based services because it is not a 'natural'
> language and should not attempt any linguistic analysis to guess at the
> language.) I can't see any practical difference between "und" and "" (except
> that "" is disallowed in some processing environments) so why can't the
> documents simply say that 'a missing specification' or 'xml:lang=""' (should
> either occur) will be interpreted as "und"?
>
>
> It has been a while since I considered myself fluent in Swedish (and I
> intentionally ignored the lack of the dieresis in the original text as an
> indication that the translations were "lossy"). I just thought that some
> comment would force the necessary clarification of the translations.
>
>
> At 2007.04.13-01:26(+0200), Kent Karlsson wrote:
>
> Stephen Deach wrote:
>
>  > sv.xml: <language type="und">obestämt språk</language>
>
>
> I thought "obestamt" was "unstated".
>
> "Obestämt" literally means "undetermined". "Unstated" would be "osagt",
> "outtalat", or "ej angett"
> ("not given", closer to the current German translation).
>
> Though I would agree that xml:lang="" is closer to "unstated" than
> "undetermined". I'm not sure
> that that nit-picking leads anywhere in this case. But "unstated" is not
> the same as "undetermined";
> it may well be determined, but just not stated... So maybe there is a
> difference worth bothering about.
>
>         /kent k
>
>
>
> ---Steve Deach
>    sdeach@adobe.com
>



-- 
Mark
Received on Friday, 13 April 2007 01:15:07 GMT
