When you are tokenizing and then doing comparisons, the simplest approach is
to normalize when creating the tokens.
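As a minimal sketch of that idea (using Python's standard `unicodedata` module rather than any particular tokenizer's API; `make_token` is a hypothetical helper, not something from the thread):

```python
import unicodedata

def make_token(text):
    # Normalize once, at token-creation time, so that every later
    # comparison can be a plain equality check on the stored form.
    return unicodedata.normalize("NFC", text)

# Two visually identical strings with different code point sequences:
composed = "caf\u00e9"         # precomposed U+00E9 (e with acute)
decomposed = "cafe\u0301"      # "e" followed by U+0301 combining acute

assert composed != decomposed                          # raw comparison differs
assert make_token(composed) == make_token(decomposed)  # tokens compare equal
```

The point is that normalization happens exactly once per token, up front, instead of on every comparison.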
One other fact that people should be aware of: a well-optimized normalization
routine (like the one in ICU) only has to do any work at all when one of a
small number of characters is encountered. Because the relative frequency
of such characters is low, the performance is quite good.
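To illustrate the fast path being described (a sketch using Python's `unicodedata.is_normalized`, available since Python 3.8, which plays the role of the quick check; ICU's own API differs):

```python
import unicodedata

def normalize_if_needed(s):
    # Fast path: most real-world strings are already normalized, and
    # the quick check only has to react to a small set of characters.
    if unicodedata.is_normalized("NFC", s):
        return s
    # Slow path, taken only for the rare strings that need work.
    return unicodedata.normalize("NFC", s)

assert normalize_if_needed("plain ASCII text") == "plain ASCII text"
assert normalize_if_needed("cafe\u0301") == "caf\u00e9"
```

For typical input, almost every call returns through the fast path, which is why the amortized cost stays low.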
Mark
On Mon, Feb 2, 2009 at 10:53, L. David Baron <dbaron@dbaron.org> wrote:
>
> On Monday 2009-02-02 09:53 -0800, Phillips, Addison wrote:
> > On the question of performance, Anne's point about the comparison
> > is incomplete. Yes, you only do a strcmp() in your code today.
>
> No, we're not using strcmp() in our code today, because it's too
> slow. We're doing atomization of many things to make comparison
> faster than strcmp.
>
> > However, there are two problems with this observation.
> >
> > First, any two strings that are equal are, well, equal.
> > Normalizing them both won't change that. So an obvious performance
> > boost is to call strcmp() first.
>
> Most string comparisons fail, so failing quickly is significantly
> more important than succeeding quickly.
>
> -David
>
> --
> L. David Baron http://dbaron.org/
> Mozilla Corporation http://www.mozilla.com/
>
>