I always like to think about these kinds of issues by looking at examples,
since that tends to focus the discussion and make it clear when people are
misinterpreting others' terminology. Below I've put some examples of what a
process should do if it gets a stream of information and has to tag it,
assuming that it is doing the best job it can. People can comment on these or
propose others.

Content: n/a
Tag:     und, or equivalently "", if that is available in the protocol
Comment: The tag where the process is not equipped to analyze the text at
         all. und = "Undetermined"

Content: 143kl;ufa)iop(&uweiorqhjkl2341lkj#@!$Jkdfj;afe
Tag:     zxx
Comment: Clearly some binary junk. zxx = "No linguistic content"

Content: bok23
Tag:     und
Comment: Maybe has linguistic content, maybe not. Can't really determine.

Content: chat
Tag:     mul, if the protocol only permits a single tag;
         <en, fr> otherwise
Comment: mul = "Multiple languages". Maybe also others, since "chat" has
         entered the vocabulary of many languages.

Content: Suzuki
Tag:     ja-Latn
Comment: Maybe also others, since "I bought a Suzuki" is a perfectly
         reasonable English sentence.

Content: Igonda flatunicai vbinkli?
Tag:     mis
Comment: Some language the process recognizes, but which is not in BCP 47.

Content: podstatné jméno
Tag:     mis
Comment: Something the process recognizes as having linguistic content, and
         which might be in BCP 47, but it doesn't know which language it is.

Content: if (myInstance.getType() == Type.UNKNOWN) { throw new Exception(""); }
Tag:     art?
Comment: Unclear whether "art" can include this, or is restricted to cases
         like Klingon or Esperanto. art = "Artificial (Other)"
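
As a rough sketch only, the examples above could be read as a small decision
procedure. The Analysis record and the choose_tags function below are
hypothetical names made up for illustration; they are not part of BCP 47 or of
any library, and they assume some detector has already decided whether the
content is linguistic and which languages are candidates. The "art?" case is
left out on purpose, since it is still an open question.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Analysis:
    """Hypothetical detector output (an assumption, not a real API)."""
    analyzed: bool = True               # False: process cannot analyze text at all
    linguistic: Optional[bool] = None   # True / False, or None if it cannot tell
    languages: List[str] = field(default_factory=list)  # candidate BCP 47 tags


def choose_tags(a: Analysis, multiple_allowed: bool = True) -> List[str]:
    """Pick tag(s) for a stream, following the examples in the post."""
    if not a.analyzed:
        return ["und"]                  # not equipped to analyze: Undetermined
    if a.linguistic is False:
        return ["zxx"]                  # e.g. binary junk: No linguistic content
    if a.linguistic is None:
        return ["und"]                  # cannot tell either way
    if len(a.languages) > 1:
        # Several plausible languages: list them if the protocol allows,
        # otherwise fall back to the single tag "mul" (Multiple languages).
        return list(a.languages) if multiple_allowed else ["mul"]
    if len(a.languages) == 1:
        return [a.languages[0]]
    # Linguistic content, but no usable BCP 47 language code was found.
    return ["mis"]


if __name__ == "__main__":
    chat = Analysis(linguistic=True, languages=["en", "fr"])
    print(choose_tags(chat))                          # several tags allowed
    print(choose_tags(chat, multiple_allowed=False))  # single-tag protocol
```

The ordering of the checks matters here: "und" is returned both when nothing
was analyzed and when the process cannot decide, while "mis" is reserved for
content the process believes is linguistic but cannot map to a code.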