Re: [Ltru] RE: For review: Tagging text with no language

 I always like to think about these kinds of issues by looking at examples,
since that tends to focus the discussion and make it clear when people are
misinterpreting others' terminology. Below are some examples of what a
process should do if it gets a stream of information and has to tag it, where
we assume that it is doing the best job it can. People can comment on these or
propose others.

Content:  n/a
Tag:      und, or equivalently "", if that is available in the protocol
Comment:  The tag where the process is not equipped to analyze the text at
          all. und = "Undetermined"

Content:  143kl;ufa)iop(&uweiorqhjkl2341lkj#@!$Jkdfj;afe
Tag:      zxx
Comment:  Clearly some binary junk. zxx = "No linguistic content"

Content:  bok23
Tag:      und
Comment:  Maybe has linguistic content, maybe not. Can't really determine.

Content:  chat
Tag:      mul, if the protocol only permits a single tag
          <en, fr> otherwise
Comment:  mul = "Multiple languages"
          maybe also others, since "chat" has entered the vocabulary of many
          languages

Content:  Suzuki
Tag:      ja-Latn
Comment:  maybe also others, since "I bought a Suzuki" is a perfectly
          reasonable English sentence.

Content:  Igonda flatunicai vbinkli?
Tag:      mis
Comment:  some language the process recognizes, but which is not in BCP 47

Content:  podstatné jméno
Tag:      mis
Comment:  something the process recognizes as having linguistic content, and
          might be in BCP 47, but it doesn't know which language it is.

Content:  if (myInstance.getType() == Type.UNKNOWN) { throw new Exception(""); }
Tag:      art?
Comment:  unclear whether "art" can include, or is restricted to cases like
          Klingon or Esperanto. art = "Artificial (Other)"
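
To make the examples above a bit more concrete, here is a rough sketch in
Python of the tag selection they imply. It assumes the analysis itself has
already been done by something else; the names Analysis and choose_tags are
made up for illustration and are not part of BCP 47 or any library. The "art?"
case is left out on purpose, since it is still an open question.

```python
from enum import Enum, auto
from typing import List, Sequence


class Analysis(Enum):
    """Hypothetical outcomes of the analysis step (names are illustrative)."""
    NOT_ANALYZED = auto()             # process cannot analyze the text at all
    NO_LINGUISTIC_CONTENT = auto()    # e.g. binary junk
    CANNOT_DETERMINE = auto()         # may or may not be linguistic content
    IDENTIFIED = auto()               # one or more languages with known tags
    RECOGNIZED_NOT_IN_BCP47 = auto()  # language recognized, but no tag exists
    LINGUISTIC_BUT_UNKNOWN = auto()   # linguistic content, language unknown


def choose_tags(outcome: Analysis,
                candidates: Sequence[str] = (),
                allow_multiple: bool = True) -> List[str]:
    """Map an analysis outcome to the tag(s) used in the examples above."""
    if outcome in (Analysis.NOT_ANALYZED, Analysis.CANNOT_DETERMINE):
        return ["und"]
    if outcome is Analysis.NO_LINGUISTIC_CONTENT:
        return ["zxx"]
    if outcome in (Analysis.RECOGNIZED_NOT_IN_BCP47,
                   Analysis.LINGUISTIC_BUT_UNKNOWN):
        return ["mis"]
    # Remaining case: Analysis.IDENTIFIED
    if not candidates:
        raise ValueError("IDENTIFIED requires at least one candidate tag")
    if len(candidates) == 1 or allow_multiple:
        return list(candidates)
    # Several candidate languages, but the protocol permits only one tag.
    return ["mul"]


print(choose_tags(Analysis.IDENTIFIED, ["en", "fr"]))                        # ['en', 'fr']
print(choose_tags(Analysis.IDENTIFIED, ["en", "fr"], allow_multiple=False))  # ['mul']
```

The allow_multiple flag stands in for "the protocol only permits a single
tag" in the "chat" example; a real process would presumably get that from the
protocol it is writing to.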

Received on Friday, 13 April 2007 16:24:30 UTC