When you are tokenizing and then doing comparison, the simplest approach is to normalize when creating the tokens.

One other fact that people should be aware of: a well-optimized normalizer routine (like the one in ICU) only has to do any work at all when one of a small number of characters is encountered. Because the relative frequency of such characters is low, the performance is quite good.

Mark

On Mon, Feb 2, 2009 at 10:53, L. David Baron <dbaron@dbaron.org> wrote:
>
> On Monday 2009-02-02 09:53 -0800, Phillips, Addison wrote:
> > On the question of performance, Anne's point about the comparison
> > is incomplete. Yes, you only do a strcmp() in your code today.
>
> No, we're not using strcmp() in our code today, because it's too
> slow. We're doing atomization of many things to make comparison
> faster than strcmp.
>
> > However, there are two problems with this observation.
> >
> > First, any two strings that are equal are, well, equal.
> > Normalizing them both won't change that. So an obvious performance
> > boost is to call strcmp() first.
>
> Most string comparisons fail, so failing quickly is significantly
> more important than succeeding quickly.
>
> -David
>
> --
> L. David Baron http://dbaron.org/
> Mozilla Corporation http://www.mozilla.com/
>

Received on Monday, 2 February 2009 19:08:06 GMT
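
A rough illustration of the approach discussed in this message: normalize once when a token is created, skip the work when a quick check says the text is already normalized, and atomize the result so that later comparisons are cheap and fail fast. This is only a sketch in Python, not ICU's or Mozilla's code. The names make_token and tokens_equal are hypothetical, NFC is an assumed choice of normalization form, and unicodedata.is_normalized requires Python 3.8 or later.

```python
import sys
import unicodedata


def make_token(text: str) -> str:
    """Return an atomized, NFC-normalized token for `text` (hypothetical helper)."""
    # Quick check: most real-world text is already normalized, so the
    # full normalization pass is only paid for the rare strings that need it.
    if not unicodedata.is_normalized("NFC", text):
        text = unicodedata.normalize("NFC", text)
    # Atomize: equal strings share one interned object.
    return sys.intern(text)


def tokens_equal(a: str, b: str) -> bool:
    """Compare two tokens produced by make_token.

    Both tokens are interned, so an identity test is enough; a mismatch
    is detected without scanning either string.
    """
    return a is b


composed = make_token("caf\u00e9")      # precomposed e-acute
decomposed = make_token("cafe\u0301")   # 'e' followed by combining acute accent
unaccented = make_token("cafe")

same = tokens_equal(composed, decomposed)
different = tokens_equal(composed, unaccented)
```

Here the comparison itself never normalizes anything: the cost is moved to token creation, where it is paid at most once per token.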