Re: sec 4.11 from Mark Davis on 2004-05-12 (public-i18n-ws@w3.org from May 2004)

From: Mark Davis <mark.davis@jtcsv.com>
Date: Tue, 11 May 2004 18:46:37 -0700
To: "Tex Texin" <tex@xencraft.com>
Cc: "Addison Phillips [wM]" <aphillips@webmethods.com>, "Web Services" <public-i18n-ws@w3.org>
Message-ID: <062401c437c2$f6d82500$74d52b09@DAVIS1>
> If I look for addresses of tex texin, I want "ma" to expand to match
> massachusetts and for my French chateau to have its address with "za" expand to
> Zone d'activité, so I can look for a match of "tex and (zone or
> massachusetts)".
>
> (Don't ask why my chateau is in a ZA... ;-) )
>
> I would think a database of multinational and multilingual addresses must use
> different dictionaries of abbreviations according to the country the address is
> for.

*must* is a strong word, and not warranted here. Remember, all 17 of the
different expansions of ICU I pointed to are valid in en-US. You can't
distinguish them by locale at all. If you wanted to have a system whereby for
each locale and datatype there was a mapping from acronyms to expanded text,
that I would say was possible. I'd also say it was unnecessary, and the
likelihood of anyone spending all the money to do that is pretty close to nil.
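
A minimal sketch of the distinction, in Python. Everything here is an
illustrative assumption rather than anything from the draft or from ICU: the
table contents, the (country, field) keying, and the function name `expand`
are hypothetical. It shows a restricted domain (address fields) where
expansion is unambiguous, next to open text in a single locale where one
acronym has several valid expansions.

```python
# Hypothetical abbreviation tables keyed by (country, field).
# The entries are examples only, not a real data set.
ABBREVIATIONS = {
    ("US", "address"): {"ma": ["massachusetts"]},
    ("FR", "address"): {"za": ["zone d'activité"]},
    # Open vocabulary within one locale: several valid expansions,
    # so the locale alone cannot pick one.
    ("US", "text"): {
        "icu": [
            "intensive care unit",
            "international components for unicode",
        ],
    },
}


def expand(token, country, field):
    """Return candidate expansions of token, or [token] if none is known."""
    table = ABBREVIATIONS.get((country, field), {})
    return table.get(token.lower(), [token])
```

In this sketch `expand("MA", "US", "address")` yields a single candidate,
while `expand("ICU", "US", "text")` yields more than one, which is the case
that cannot be resolved by locale.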

Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

----- Original Message ----- 
From: "Tex Texin" <tex@xencraft.com>
To: "Mark Davis" <mark.davis@jtcsv.com>
Cc: "Addison Phillips [wM]" <aphillips@webmethods.com>; "Web Services"
<public-i18n-ws@w3.org>
Sent: Tue, 2004 May 11 17:55
Subject: Re: sec 4.11


>
>
> Mark Davis wrote:
> > > 1) ok on "suggestive". I had in mind that someone might guess if the locale
> > > was Spain vs. Latin America, whether to use Modern or Trad., so it was
> > > indicative of collation in the same "loose" way PRC vs ROC is indicative of
> > > Simplified vs Trad. Chinese. It is not only that it is not specified but it
> > > can't (or shouldn't) be inferred.
> >
> > hmmm. Still uneasy. What happens in fact is that whenever we (or anyone else)
> > gets an underspecified request, we *have* to do something. So for example, if
> > we get a naked 'zh' on a resource lookup, then we have to pick either
> > 'zh-Hant' or 'zh-Hans' (logically, if not physically). If we get es-BO on a
> > collation request, then we have to choose *some* sorting. And the default sort
> > may be different for different locales, that is true. But "suggestive" is not
> > indicative of the true process.
> >
> > Of course, where there is a handshake, one can communicate back: "Don't know
> > what you mean by "es-BO": pick either "es-BO-x-collationtraditional" or
> > "es-BO-x-collationmodern". But normally that option n'exist pas.
>
> ok. It's like infer and imply. The identifier is not really suggestive, but
> people with a need to make a decision infer what they hope is implied.
>
> I'll fix it in the next version.
>
> > > Mark, You make an interesting distinction that strength makes a feature more
> > > or less important not necessarily ignorable.
> > > There probably should be a separate name for that dimension of collation
> > > (ignoring characters, or indeterminacy) to distinguish it from strength.
> > > Most of the sorts I have worked with don't implement strength fully and leave
> > > the sort indeterminate.
> >
> > And most should. A fully determinate sort is not generally worth the cost (see
> > tn9). However, when sorting, at least the case and accent levels, plus
> > punctuation (if applicable), should not be ignored, although they may be
> > ignored in searching.
> >
> > The "ignore punctuation" option really means "make punctuation a really weak
> > strength".
>
> I need to think about this. I don't see the harm to users in simply ignoring
> some characters, if you never intend to distinguish them in queries, and your
> users do not depend on their being ordered. I do know indeterminacy makes life
> difficult from a testing and support perspective (harder to do regression tests
> and difficult to reproduce certain problems) but that is a separate issue.
>
> And there is a cost in effort and performance to making the values weakly
> distinctive.
> I can see the performance cost is minimal given that the number of values is
> small as I postulated, but I still wonder if it's worth the trouble.
>
> What I can see, is that if you have already created and tested a module that
> sorts fully and with determinacy, then it is silly not to use it everywhere
> rather than create a new one that is more limited.
>
> But many folks are in the position of needing a sort routine for a drop-down
> and the like and don't have the full one handy.
> I guess I should just recommend ICU. ;-)
>
> > > 5) on abbreviations, I am not sure how often expansion is used, I do know it
> > > is done. If by limited environment you mean with respect to certain fields
> > > rather than applied to all fields, yes of course.
> >
> > It is done, but what I mean by "limited environment" is that "replace all
> > acronyms" makes no sense unless you have a very limited vocabulary, in a very
> > restricted domain, like "Dental Supplies". How are you supposed to sort "ICU"
> > when it could have the following expansions???
> >
> > http://www.acronymfinder.com/af-query.asp?String=exact&Acronym=ICU
>
> ok, yes.
>
> >
> > >
> > > on abbreviations in a multilingual context: well yes, collation is language
> > > sensitive and I was looking to indicate there are problems due to using narrow
> > > language-sensitive operations in a multilingual space. Maybe you can suggest
> > > some better examples?
> > > I think this is an area where Web Services may get tripped up.
> >
> > I think it just needs to be removed. Mixing operations from different languages
> > will always get you into trouble; this has nothing really to do with the example
> > you want to make.
>
> Is there never a requirement to search for data in a multilingual database and
> have multiple language rules operating rather than a single user preference?
>
> If I look for addresses of tex texin, I want "ma" to expand to match
> massachusetts and for my French chateau to have its address with "za" expand to
> Zone d'activité, so I can look for a match of "tex and (zone or
> massachusetts)".
>
> (Don't ask why my chateau is in a ZA... ;-) )
>
> I would think a database of multinational and multilingual addresses must use
> different dictionaries of abbreviations according to the country the address is
> for.
> tex
>
>
Received on Tuesday, 11 May 2004 21:46:35 UTC