- From: Asmus Freytag <asmusf@ix.netcom.com>
- Date: Fri, 13 Apr 2007 15:25:27 -0700
- To: Mark Davis <mark.davis@icu-project.org>
- CC: John Cowan <cowan@ccil.org>, Stephen Deach <sdeach@adobe.com>, Kent Karlsson <kent.karlsson14@comhem.se>, Richard Ishida <ishida@w3.org>, LTRU Working Group <ltru@ietf.org>, www-international@w3.org, CLDR list <cldr@unicode.org>
On 4/13/2007 9:24 AM, Mark Davis wrote:
> I always like to think of these kinds of issues by looking at
> examples, since it tends to focus the issues and make it clear when
> people are misinterpreting others' terminology. I put out below some
> examples of what a process should do if it gets a stream of information
> and is to tag it, where we assume that it is doing the best job it
> can. People can comment on these or propose others.
>
> Content: n/a
> Tag: und, or equivalently "", if that is available in the protocol
> Comment: The tag where the process is not equipped to analyze the
> text at all. und = "Undetermined"
>
> Content: 143kl;ufa)iop(&uweiorqhjkl2341lkj#@!$Jkdfj;afe
> Tag: zxx
> Comment: Clearly some binary junk. zxx = "No linguistic content"
>
> Content: bok23
> Tag: und
> Comment: Maybe has linguistic content, maybe not. Can't really
> determine.
>
> Content: chat
> Tag: mul, if the protocol only permits a single tag; <en, fr>
> otherwise
> Comment: mul = "Multiple languages". Maybe also others, since "chat"
> has entered the vocabulary of many languages.
>
> Content: Suzuki
> Tag: ja-Latn
> Comment: maybe also others, since "I bought a Suzuki" is a perfectly
> reasonable English sentence.
>
> Content: Igonda flatunicai vbinkli?
> Tag: mis
> Comment: some language the process recognizes, but which is not in
> BCP 47
>
> Content: podstatné jméno
> Tag: mis
> Comment: something the process recognizes as having linguistic
> content, and might be in BCP 47, but it doesn't know which language
> it is.
>
> Content: if (myInstance.getType() == Type.UNKNOWN) { throw new Exception(""); }
> Tag: art?
> Comment: unclear whether "art" can include, or is restricted to cases
> like Klingon or Esperanto. art = "Artificial (Other)"
>
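To make the decision order implied by the quoted examples easier to discuss, here is a rough sketch in Python. It is only one reading of the table, not a proposed algorithm: the `Analysis` record, its field names, and `choose_tags` are all hypothetical, and the language-identification step that would fill in those fields is assumed to exist elsewhere.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Analysis:
    """Hypothetical result of some language-identification step."""
    analyzed: bool = True              # False: process cannot analyze text at all
    linguistic: Optional[bool] = None  # True / False, or None if it cannot tell
    candidates: List[str] = field(default_factory=list)  # e.g. ["en", "fr"]
    uncoded: bool = False              # recognized language with no BCP 47 tag


def choose_tags(a: Analysis, single_tag_only: bool = True) -> List[str]:
    """Pick tag(s) following the order of the quoted examples (an assumption)."""
    if not a.analyzed:
        return ["und"]                 # "Undetermined": no analysis possible
    if a.linguistic is False:
        return ["zxx"]                 # "No linguistic content"
    if a.linguistic is None:
        return ["und"]                 # cannot really determine
    if a.uncoded:
        return ["mis"]                 # recognized, but not in BCP 47
    if len(a.candidates) > 1:
        # "mul" only when the protocol permits a single tag
        return ["mul"] if single_tag_only else list(a.candidates)
    if len(a.candidates) == 1:
        return list(a.candidates)
    return ["mis"]                     # linguistic, but language not identified
```

For example, `choose_tags(Analysis(linguistic=True, candidates=["en", "fr"]))` yields `["mul"]`, while passing `single_tag_only=False` yields both candidate tags. The sketch deliberately leaves out the "art?" row, since the table itself marks that case as unclear.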
Your Suzuki example would benefit from context.
The "Suzuki" in "I bought a Suzuki" is clearly a proper name, which
doesn't change the fact that the entire text is in some form of English,
while "Suzuki" appearing in the context of Japanese text would indeed be
ja-Latn.
A while ago I sent out a list of possible edge cases (abstract, not
concrete examples). You might look at them to see whether any others
from that list should be given examples.
A./
Received on Friday, 13 April 2007 22:25:50 UTC