Re: sec 4.11 from Mark Davis on 2004-05-12 (public-i18n-ws@w3.org from May 2004)

From: Mark Davis <mark.davis@jtcsv.com>
Date: Tue, 11 May 2004 20:10:47 -0700
To: "Mark Davis" <mark.davis@jtcsv.com>, "Tex Texin" <tex@xencraft.com>
Cc: "Addison Phillips [wM]" <aphillips@webmethods.com>, "Web Services" <public-i18n-ws@w3.org>
Message-ID: <065801c437ce$b9487260$74d52b09@DAVIS1>
BTW, you might also reference LDML collation settings:
 http://www.unicode.org/reports/tr35/#Setting_Options

Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

----- Original Message ----- 
From: "Mark Davis" <mark.davis@jtcsv.com>
To: "Tex Texin" <tex@xencraft.com>
Cc: "Addison Phillips [wM]" <aphillips@webmethods.com>; "Web Services"
<public-i18n-ws@w3.org>
Sent: Tue, 2004 May 11 18:46
Subject: Re: sec 4.11


>
> > If I look for addresses of tex texin, I want "ma" to expand to match
> > massachusetts and for my French chateau to have its address with "za"
> > expand to Zone d'activité, so I can look for a match of "tex and (zone
> > or massachusetts)".
> >
> > (Don't ask why my chateau is in a ZA... ;-) )
> >
> > I would think a database of multinational and multilingual addresses
> > must use different dictionaries of abbreviations according to the
> > country the address is for.
>
> *must* is a strong word, and not warranted here. Remember, all of the 17
> different abbreviations for ICU I pointed to are valid in en-US. You can't
> distinguish them by locale at all. If you wanted to have a system whereby for
> each locale and datatype there was a mapping from acronyms to expanded text,
> that I would say was possible. I'd also say it was unnecessary, and the
> likelihood of anyone spending all the money to do that is pretty close to nil.
>
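A minimal sketch of the per-country mapping discussed above, in Python. The table layout, the `expand` and `matches` helpers, and the sample entries are all illustrative assumptions, not anything proposed in the thread. Note that "ICU" keeps more than one expansion under the same country, which is the ambiguity that a locale alone cannot resolve.

```python
# Illustrative data only: the table layout, keys, and entries are assumptions.
ABBREVIATIONS = {
    "US": {
        "ma": ["massachusetts"],
        "icu": ["intensive care unit", "international components for unicode"],
    },
    "FR": {
        "za": ["zone d'activité"],
    },
}


def expand(token, country):
    """Return every known expansion of token for the address's country.

    Unknown tokens come back unchanged (case-folded) as a one-item list.
    """
    table = ABBREVIATIONS.get(country, {})
    folded = token.casefold()
    return table.get(folded, [folded])


def matches(query_word, address_tokens, country):
    """True if query_word occurs in any expansion of any address token."""
    wanted = query_word.casefold()
    for token in address_tokens:
        for expansion in expand(token, country):
            if wanted in expansion.split():
                return True
    return False
```

With this table, a query for "zone" matches a French address containing "ZA", and a query for "massachusetts" matches a US address containing "MA". The lookup is keyed on the country of the address rather than on the user's locale.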
> Mark
> __________________________________
> http://www.macchiato.com
> ► शिष्यादिच्छेत्पराजयम् ◄
>
> ----- Original Message ----- 
> From: "Tex Texin" <tex@xencraft.com>
> To: "Mark Davis" <mark.davis@jtcsv.com>
> Cc: "Addison Phillips [wM]" <aphillips@webmethods.com>; "Web Services"
> <public-i18n-ws@w3.org>
> Sent: Tue, 2004 May 11 17:55
> Subject: Re: sec 4.11
>
>
> >
> >
> > Mark Davis wrote:
> > > > 1) ok on "suggestive". I had in mind that someone might guess if the
> > > > locale was Spain vs. Latin America, whether to use Modern or Trad., so
> > > > it was indicative of collation in the same "loose" way PRC vs ROC is
> > > > indicative of Simplified vs Trad. Chinese. It is not only that it is
> > > > not specified but it can't (or shouldn't) be inferred.
> > >
> > > hmmm. Still uneasy. What happens in fact is that whenever we (or anyone
> > > else) gets an underspecified request, we *have* to do something. So for
> > > example, if we get a naked 'zh' on a resource lookup, then we have to
> > > pick either 'zh-Hant' or 'zh-Hans' (logically, if not physically). If we
> > > get es-BO on a collation request, then we have to choose *some* sorting.
> > > And the default sort may be different for different locales, that is
> > > true. But "suggestive" is not indicative of the true process.
> > >
> > > Of course, where there is a handshake, one can communicate back: "Don't
> > > know what you mean by "es-BO": pick either "es-BO-x-collationtraditional"
> > > or "es-BO-x-collationmodern". But normally that option n'exist pas.
> >
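The "we *have* to do something" step can be sketched in Python as follows. The default tables and function names are hypothetical, and the choices in them (Simplified for a bare 'zh', modern collation for Spanish) are arbitrary assumptions made only so the example runs. The "-x-collation..." tag form is the one quoted in the message.

```python
# Hypothetical default tables: these names and choices are illustrative
# assumptions, not anything specified in the thread or by any library.
DEFAULT_SCRIPT = {"zh": "zh-Hans"}
DEFAULT_COLLATION = {"es": "modern"}

COLLATION_MARKER = "-x-collation"


def resolve_resource_locale(tag):
    """Return a fully specified tag; a bare 'zh' must become Hans or Hant."""
    return DEFAULT_SCRIPT.get(tag, tag)


def resolve_collation(tag):
    """Return (base_tag, collation), picking a default when none is named."""
    if COLLATION_MARKER in tag:
        base, variant = tag.split(COLLATION_MARKER, 1)
        return base, variant
    language = tag.split("-", 1)[0]
    return tag, DEFAULT_COLLATION.get(language, "standard")
```

The receiver never rejects an underspecified tag here. It silently applies a default, which is a choice made by the implementation rather than something the identifier itself suggested.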
> > ok. It's like infer and imply. The identifier is not really suggestive, but
> > people with a need to make a decision infer what they hope is implied.
> >
> > I'll fix it in the next version.
> >
> > > > Mark, You make an interesting distinction that strength makes a
> > > > feature more or less important not necessarily ignorable.
> > > > There probably should be a separate name for that dimension of collation
> > > > (ignoring characters, or indeterminacy) to distinguish it from strength.
> > > > Most of the sorts I have worked with don't implement strength fully and
> > > > leave the sort indeterminate.
> > >
> > > And most should. A fully determinate sort is not generally worth the
> > > cost (see tn9). However, when sorting, at least the case and accent
> > > levels, plus punctuation (if applicable), should not be ignored, although
> > > they may be ignored in searching.
> > >
> > > The "ignore punctuation" option really means "make punctuation a really
> > > weak strength".
> >
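A rough Python illustration of "punctuation as a really weak strength", using only the standard `unicodedata` module. This is not ICU or the Unicode Collation Algorithm: the level layout and the helper names `sort_key` and `search_equal` are assumptions, and within each level the comparison falls back to plain code point order.

```python
import unicodedata


def sort_key(text):
    """Build a multi-level key; punctuation only matters at the last level."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop punctuation (Unicode general categories starting with "P").
    letters = "".join(
        c for c in decomposed if not unicodedata.category(c).startswith("P")
    )
    # Level 1: base letters only, no accents, no case.
    primary = "".join(c for c in letters if not unicodedata.combining(c)).casefold()
    # Level 2: accents count, case still ignored.
    secondary = letters.casefold()
    # Level 3: case counts.
    tertiary = letters
    # Level 4: the full text, so punctuation breaks any remaining tie.
    return (primary, secondary, tertiary, text)


def search_equal(a, b):
    """Searching may ignore the weaker levels and compare level 1 only."""
    return sort_key(a)[0] == sort_key(b)[0]
```

For example, `sorted(["coop", "cool", "co-op"], key=sort_key)` places "cool" first and keeps "co-op" next to "coop" in a fixed order, while `search_equal("co-op", "coop")` is true.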
> > I need to think about this. I don't see the harm to users in simply
> > ignoring some characters, if you never intend to distinguish them in
> > queries, and your users do not depend on their being ordered. I do know
> > indeterminacy makes life difficult from a testing and support perspective
> > (harder to do regression tests and difficult to reproduce certain
> > problems) but that is a separate issue.
> >
> > And there is a cost in effort and performance to making the values weakly
> > distinctive.
> > I can see the performance cost is minimal given that the number of values
> > is small as I postulated, but I still wonder if it's worth the trouble.
> >
> > What I can see, is that if you have already created and tested a module that
> > sorts fully and with determinacy, then it is silly not to use it everywhere
> > rather than create a new one that is more limited.
> >
> > But many folks are in the position of needing a sort routine for a drop-down
> > and the like and don't have the full one handy.
> > I guess I should just recommend ICU. ;-)
> >
> > > > 5) on abbreviations, I am not sure how often expansion is used, I do
> > > > know it is done. If by limited environment you mean with respect to
> > > > certain fields rather than applied to all fields, yes of course.
> > >
> > > It is done, but what I mean by "limited environment" is that "replace all
> > > acronyms" makes no sense unless you have a very limited vocabulary, in a
> > > very restricted domain, like "Dental Supplies". How are you supposed to
> > > sort "ICU" when it could have the following expansions???
> > >
> > > http://www.acronymfinder.com/af-query.asp?String=exact&Acronym=ICU
> >
> > ok, yes.
> >
> > >
> > > >
> > > > on abbreviations in a multilingual context: well yes, collation is
> > > > language sensitive and I was looking to indicate there are problems due
> > > > to using narrow language-sensitive operations in a multilingual space.
> > > > Maybe you can suggest some better examples?
> > > > I think this is an area where Web Services may get tripped up.
> > >
> > > I think it just needs to be removed. Mixing operations from different
> > > languages will always get you into trouble; this has nothing really to do
> > > with the example you want to make.
> >
> > Is there never a requirement to search for data in a multilingual database
> > and have multiple language rules operating rather than a single user
> > preference?
> >
> > If I look for addresses of tex texin, I want "ma" to expand to match
> > massachusetts and for my French chateau to have its address with "za"
> > expand to Zone d'activité, so I can look for a match of "tex and (zone
> > or massachusetts)".
> >
> > (Don't ask why my chateau is in a ZA... ;-) )
> >
> > I would think a database of multinational and multilingual addresses
> > must use different dictionaries of abbreviations according to the
> > country the address is for.
> > tex
> >
> >
>
>
Received on Tuesday, 11 May 2004 23:18:27 UTC