Re: [Ltru] RE: For review: Tagging text with no language

From: Mark Davis <mark.davis@icu-project.org>
Date: Fri, 13 Apr 2007 16:57:59 -0700
Message-ID: <30b660a20704131657n50739192i6a54180bd4c98294@mail.gmail.com>
To: "Karen_Broome@spe.sony.com" <Karen_Broome@spe.sony.com>
Cc: www-international@w3.org

I haven't yet seen a convincing case that 'mis' must be interpreted as
disjoint with other codes, as I've remarked.

On 4/13/07, Karen_Broome@spe.sony.com <Karen_Broome@spe.sony.com> wrote:
>
>
> How can "podstatné jméno" be both Czech and miscellaneous at the same
> time?  I think it could be "und" or "cs" but I don't think "mis" should be
> used because that means the other language tags do not apply, which is not
> the case here.
>
> Karen Broome
>
> *Asmus Freytag <asmusf@ix.netcom.com>*
> Sent by: www-international-request@w3.org
> 04/13/2007 03:25 PM
>
> To: Mark Davis <mark.davis@icu-project.org>
> cc: John Cowan <cowan@ccil.org>, Stephen Deach <sdeach@adobe.com>,
>     Kent Karlsson <kent.karlsson14@comhem.se>, Richard Ishida <ishida@w3.org>,
>     LTRU Working Group <ltru@ietf.org>, www-international@w3.org,
>     CLDR list <cldr@unicode.org>
> Subject: Re: [Ltru] RE: For review: Tagging text with no language
>
> On 4/13/2007 9:24 AM, Mark Davis wrote:
> > I always like to think of these kinds of issues by looking at
> > examples, since it tends to focus the issues and make it clear when
> > people are misinterpreting others' terminology. I put out below some
> > examples of what a process should do if it gets a stream of information
> > and is to tag it, where we assume that it is doing the best job it
> > can. People can comment on these or propose others.
> >
> > Content: n/a
> > Tag:     und, or equivalently "", if that is available in the protocol
> > Comment: The tag where the process is not equipped to analyze the
> >          text at all. und = "Undetermined"
> >
> > Content: 143kl;ufa)iop(&uweiorqhjkl2341lkj#@!$Jkdfj;afe
> > Tag:     zxx
> > Comment: Clearly some binary junk. zxx = "No linguistic content"
> >
> > Content: bok23
> > Tag:     und
> > Comment: Maybe has linguistic content, maybe not. Can't really
> >          determine.
> >
> > Content: chat
> > Tag:     mul, if the protocol only permits a single tag;
> >          <en, fr> otherwise
> > Comment: mul = "Multiple languages"; maybe also others, since "chat"
> >          has entered the vocabulary of many languages
> >
> > Content: Suzuki
> > Tag:     ja-Latn
> > Comment: maybe also others, since "I bought a Suzuki" is a perfectly
> >          reasonable English sentence.
> >
> > Content: Igonda flatunicai vbinkli?
> > Tag:     mis
> > Comment: some language the process recognizes, but which is not in
> >          BCP 47
> >
> > Content: podstatné jméno
> > Tag:     mis
> > Comment: something the process recognizes as having linguistic
> >          content, and might be in BCP 47, but it doesn't know which
> >          language it is.
> >
> > Content: if (myInstance.getType() == Type.UNKNOWN) { throw new Exception(""); }
> > Tag:     art?
> > Comment: unclear whether "art" can include, or is restricted to cases
> >          like Klingon or Esperanto. art = "Artificial (Other)"
> >
> Your Suzuki example would benefit from context.
>
> The "Suzuki" in "I bought a Suzuki" is clearly a proper name which
> doesn't change the fact that the entire text is in some form of English,
> while "Suzuki" appearing in context of Japanese text would indeed be
> ja-Latn.
>
> I sent out, a while ago, a list of possible edge cases (abstract, not
> concrete examples). You might look at them to see whether any others from
> that list should be given examples.
>
> A./

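For concreteness, the decision table quoted above could be sketched
roughly as follows. This is only an illustration, not anything from BCP 47
or an existing library: the Analysis record, its fields, and the
choose_tags function are all hypothetical names. It follows the table as
posted, so text that is recognized as linguistic but cannot be tied to a
BCP 47 tag comes out as "mis" (the reading Karen questions, preferring
"und"), and the "art?" row is left out because its scope is still unclear.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Analysis:
    """Hypothetical summary of what a tagging process learned about a text."""
    analyzed: bool = True              # False: not equipped to analyze at all
    linguistic: Optional[bool] = None  # True, False, or None (cannot tell)
    languages: List[str] = field(default_factory=list)  # BCP 47 tags found


def choose_tags(a: Analysis, single_tag_only: bool = True) -> List[str]:
    """Pick language tag(s) following the rows of the quoted table."""
    if not a.analyzed:
        return ["und"]   # row "n/a": no analysis possible
    if a.linguistic is False:
        return ["zxx"]   # row "binary junk": no linguistic content
    if a.linguistic is None:
        return ["und"]   # row "bok23": cannot determine
    if len(a.languages) > 1:
        # row "chat": one tag allowed -> mul, otherwise list them all
        return ["mul"] if single_tag_only else list(a.languages)
    if len(a.languages) == 1:
        return list(a.languages)   # row "Suzuki": e.g. ja-Latn
    # Linguistic, but no usable BCP 47 tag: either a recognized language
    # with no code, or an unidentified one. The table tags both as "mis".
    return ["mis"]
```

For example, choose_tags(Analysis(linguistic=True, languages=["en", "fr"]))
gives ["mul"], and the same call with single_tag_only=False gives
["en", "fr"]. Changing the last branch to return "und" for the
unidentified-language case would model the alternative reading.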

-- 
Mark
Received on Friday, 13 April 2007 23:58:03 GMT