On 4/13/2007 9:24 AM, Mark Davis wrote:
> I always like to think of these kinds of issues by looking at
> examples, since it tends to focus the issues and make it clear when
> people are misinterpreting others' terminology. I put out below some
> examples of what a process should do if it gets a stream of information
> and is to tag it, where we assume that it is doing the best job it
> can. People can comment on these or propose others.
>
> Content: n/a
> Tag:     und, or equivalently "", if that is available in the protocol
> Comment: The tag where the process is not equipped to analyze the
>          text at all. und = "Undetermined"
>
> Content: 143kl;ufa)iop(&uweiorqhjkl2341lkj#@!$Jkdfj;afe
> Tag:     zxx
> Comment: Clearly some binary junk. zxx = "No linguistic content"
>
> Content: bok23
> Tag:     und
> Comment: Maybe has linguistic content, maybe not. Can't really
>          determine.
>
> Content: chat
> Tag:     mul, if the protocol only permits a single tag;
>          <en, fr> otherwise
> Comment: mul = "Multiple languages"
>          maybe also others, since "chat" has entered the vocabulary
>          of many languages
>
> Content: Suzuki
> Tag:     ja-Latn
> Comment: maybe also others, since "I bought a Suzuki" is a perfectly
>          reasonable English sentence.
>
> Content: Igonda flatunicai vbinkli?
> Tag:     mis
> Comment: some language the process recognizes, but which is not in
>          BCP 47
>
> Content: podstatné jméno
> Tag:     mis
> Comment: something the process recognizes as having linguistic
>          content, and might be in BCP 47, but it doesn't know which
>          language it is.
>
> Content: if (myInstance.getType() == Type.UNKNOWN) { throw new Exception(""); }
> Tag:     art?
> Comment: unclear whether "art" can include, or is restricted to
>          cases like Klingon or Esperanto. art = "Artificial (Other)"

Your Suzuki example would benefit from context. The "Suzuki" in "I
bought a Suzuki" is clearly a proper name, which doesn't change the fact
that the entire text is in some form of English, while "Suzuki"
appearing in the context of Japanese text would indeed be ja-Latn.

I sent out, a while ago, a list of possible edge cases (abstract, not
concrete examples).
You might look at them to see whether any others from that list should
be given examples.

A./

Received on Friday, 13 April 2007 22:25:50 GMT
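
For readers of the archive, the choices in the quoted list of examples can be read as a small decision procedure. The sketch below is only one possible reading of it, not anything proposed in the thread: the function name `choose_tag` and its parameters are hypothetical, and it assumes the process already knows whether it can analyze the text, whether the text has linguistic content, and which languages (if any) it identified. The "art?" case is left out because the message itself marks it as unclear.

```python
from typing import List, Optional


def choose_tag(can_analyze: bool,
               linguistic: Optional[bool],
               languages: List[str],
               single_tag_only: bool = True) -> str:
    """Pick a tag following one reading of the quoted examples.

    All parameters are hypothetical inputs for illustration:
    can_analyze     -- False if the process is not equipped to analyze text
    linguistic      -- True/False, or None when the process cannot tell
    languages       -- BCP 47 tags the process identified (may be empty)
    single_tag_only -- True if the protocol permits only a single tag
    """
    if not can_analyze:
        return "und"   # "Undetermined": no analysis was possible
    if linguistic is None:
        return "und"   # maybe linguistic, maybe not (the "bok23" case)
    if not linguistic:
        return "zxx"   # "No linguistic content" (binary junk)
    if not languages:
        return "mis"   # linguistic content, but no usable BCP 47 tag
    if len(languages) == 1:
        return languages[0]
    if single_tag_only:
        return "mul"   # "Multiple languages"
    return ", ".join(languages)  # protocol allows a list, e.g. "en, fr"
```

Under these assumptions, `choose_tag(True, True, ["en", "fr"])` gives `mul`, and the same call with `single_tag_only=False` gives `en, fr`. The context question raised in the reply (a proper name such as "Suzuki" inside English text) would have to be settled before `languages` is filled in; the sketch does not address it.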
This archive was generated by hypermail 2.2.0+W3C-0.50 : Tuesday, 2 June 2009 19:17:13 GMT