
Re: [Ltru] RE: For review: Tagging text with no language

From: <Karen_Broome@spe.sony.com>
Date: Fri, 13 Apr 2007 16:11:07 -0700
To: Mark Davis <mark.davis@icu-project.org>
Cc: www-international@w3.org
Message-ID: <OF94441DBD.6003B2B5-ON882572BC.007EAB52-882572BC.007F86C9@spe.sony.com>
How can "podstatné jméno" be both Czech and miscellaneous at the same 
time?  I think it could be "und" or "cs" but I don't think "mis" should be 
used because that means the other language tags do not apply, which is not 
the case here.

Karen Broome




Asmus Freytag <asmusf@ix.netcom.com>
Sent by: www-international-request@w3.org
04/13/2007 03:25 PM

To: Mark Davis <mark.davis@icu-project.org>
Cc: John Cowan <cowan@ccil.org>, Stephen Deach <sdeach@adobe.com>,
    Kent Karlsson <kent.karlsson14@comhem.se>, Richard Ishida <ishida@w3.org>,
    LTRU Working Group <ltru@ietf.org>, www-international@w3.org,
    CLDR list <cldr@unicode.org>
Subject: Re: [Ltru] RE: For review: Tagging text with no language

On 4/13/2007 9:24 AM, Mark Davis wrote:
> I always like to think of these kinds of issues by looking at 
> examples, since it tends to focus the issues and make it clear when 
> people are misinterpreting others' terminology. I put out below some 
> examples of what a process should do if it gets a stream of information 
> and is to tag it, where we assume that it is doing the best job it 
> can. People can comment on these or propose others.
>
> Content: n/a
> Tag:     und, or equivalently "", if that is available in the protocol
> Comment: The tag where the process is not equipped to analyze the text
>          at all. und = "Undetermined"
>
> Content: 143kl;ufa)iop(&uweiorqhjkl2341lkj#@!$Jkdfj;afe
> Tag:     zxx
> Comment: Clearly some binary junk. zxx = "No linguistic content"
>
> Content: bok23
> Tag:     und
> Comment: Maybe has linguistic content, maybe not. Can't really
>          determine.
>
> Content: chat
> Tag:     mul, if the protocol only permits a single tag;
>          <en, fr> otherwise
> Comment: mul = "Multiple languages". Maybe also others, since "chat"
>          has entered the vocabulary of many languages
>
> Content: Suzuki
> Tag:     ja-Latn
> Comment: maybe also others, since "I bought a Suzuki" is a perfectly
>          reasonable English sentence.
>
> Content: Igonda flatunicai vbinkli?
> Tag:     mis
> Comment: some language the process recognizes, but which is not in
>          BCP 47
>
> Content: podstatné jméno
> Tag:     mis
> Comment: something the process recognizes as having linguistic
>          content, and might be in BCP 47, but it doesn't know which
>          language it is.
>
> Content: if (myInstance.getType() == Type.UNKNOWN) { throw new Exception(""); }
> Tag:     art?
> Comment: unclear whether "art" can include, or is restricted to cases
>          like Klingon or Esperanto. art = "Artificial (Other)"
>
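
Read as a decision procedure, the table above might be sketched roughly as follows. This is only an illustration: the Analysis record, its field names, and choose_tags are hypothetical names invented here, not part of BCP 47 or of any real library. The "art?" row is left out because the table itself marks it as unsettled, and the tag for "linguistic but unidentified" text is a parameter because this thread proposes both "mis" and "und" for that case.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Analysis:
    """Hypothetical summary of what a tagging process learned about a text."""
    analyzed: bool = True              # False: the process could not inspect the text
    linguistic: Optional[bool] = None  # None: cannot tell if the content is linguistic
    languages: List[str] = field(default_factory=list)  # identified language tags
    recognized_but_untagged: bool = False  # known language with no BCP 47 tag


def choose_tags(a: Analysis,
                single_tag_only: bool = True,
                unidentified_tag: str = "mis") -> List[str]:
    """Return the tag(s) the table suggests for the given analysis result."""
    if not a.analyzed:
        return ["und"]   # row "n/a": not equipped to analyze at all
    if a.linguistic is False:
        return ["zxx"]   # row "binary junk": no linguistic content
    if a.linguistic is None:
        return ["und"]   # row "bok23": cannot determine
    if a.recognized_but_untagged:
        return ["mis"]   # row "Igonda flatunicai vbinkli?"
    if not a.languages:
        # row "podstatné jméno": the table says "mis"; "und" is the
        # alternative argued elsewhere in this thread (assumption: caller picks)
        return [unidentified_tag]
    if len(a.languages) > 1 and single_tag_only:
        return ["mul"]   # row "chat" when the protocol allows only one tag
    return list(a.languages)
```

For instance, choose_tags(Analysis(linguistic=True, languages=["en", "fr"])) yields ["mul"], while the same call with single_tag_only=False yields ["en", "fr"].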
Your Suzuki example would benefit from context.

The "Suzuki" in "I bought a Suzuki" is clearly a proper name which 
doesn't change the fact that the entire text is in some form of English, 
while "Suzuki" appearing in context of Japanese text would indeed be 
ja-Latn.
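
As a rough sketch of that context rule (the function name and parameters are hypothetical, and comparing only the primary language subtag is a simplifying assumption): a proper name keeps its own tag when there is no context or when the surrounding text is in the same language, and otherwise takes the tag of the text it is embedded in.

```python
from typing import Optional


def tag_proper_name(name_tag: str, surrounding_tag: Optional[str]) -> str:
    """Pick a tag for a proper name, given the tag of the enclosing text.

    name_tag:        tag the name would get in isolation, e.g. "ja-Latn"
    surrounding_tag: tag of the enclosing text, or None if there is no context
    """
    if surrounding_tag is None:
        return name_tag  # no context: fall back to the isolated guess

    # Compare primary language subtags only (the part before the first "-").
    name_lang = name_tag.split("-")[0].lower()
    context_lang = surrounding_tag.split("-")[0].lower()
    if name_lang == context_lang:
        return name_tag  # same language: keep the more specific tag

    # Different language: the name is just part of the surrounding text.
    return surrounding_tag
```

With these assumptions, "Suzuki" inside English text gets "en", and inside Japanese text it keeps "ja-Latn".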

I sent out, a while ago, a list of possible edge cases (abstract, not 
concrete examples). You might look at them to see whether any others from 
that list should be given examples.

A./
Received on Friday, 13 April 2007 23:13:13 GMT
