Re: peport`s error

"Michael[tm] Smith" <mike@w3.org>, 2016-12-28 13:16 +0900:
> Archived-At: <http://www.w3.org/mid/20161228041612.l7dyiujaehsmp7um@sideshowbarker.net>
> "Jukka K. Korpela" <jkorpela@cs.tut.fi>, 2016-12-27 21:32 +0200:
> > Archived-At: <http://www.w3.org/mid/fd5fcc31-b650-16ce-9164-9186c8140d36@cs.tut.fi>
> > 27.12.2016, 18:26, Michael[tm] Smith wrote:
> > 
> > > The n-grams the language detector uses for identifying Estonian are here:
> > > 
> > >   https://raw.githubusercontent.com/validator/validator/master/resources/language-profiles/et
> > > 
> > > As far as I can see, there are no Cyrillic letters in there.
> > 
> > There are some, starting from the very first string " пр". But they have low
> > frequencies. Somewhat more surprisingly, there are even katakana
> > letters,"アア", and CJK characters, "三". There are also vowels with macron,
> > like "ē", which do not appear in Estonian.

At https://github.com/validator/validator/issues/464 I recently got another
report of a Russian document getting misidentified as Estonian. So I ran a
script to remove all Cyrillic characters from the Estonian profile that the
language detector uses (and also removed all CJK characters from it while I
was at it), then retested… and found that document then got misidentified as
Catalan.

So I stripped the Cyrillic and CJK characters from the Catalan profile too.
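
For illustration, a stripping step like that could look roughly like the
Python below. This is only a minimal sketch, not the script that was actually
run: it assumes a profile is a JSON-style object with a "freq" map from n-gram
to count, and the names UNWANTED_PREFIXES, has_unwanted_script, and
strip_scripts are hypothetical. It classifies characters by their Unicode
character names, and it leaves any other fields in the profile (such as
totals) untouched.

```python
import unicodedata

# Unicode-name prefixes to drop. An illustrative choice, not necessarily
# the set of scripts the real script removed.
UNWANTED_PREFIXES = ("CYRILLIC", "CJK", "KATAKANA", "HIRAGANA")


def has_unwanted_script(ngram):
    """Return True if any character's Unicode name starts with an unwanted prefix."""
    return any(
        unicodedata.name(ch, "").startswith(UNWANTED_PREFIXES)
        for ch in ngram
    )


def strip_scripts(profile):
    """Return a shallow copy of the profile without n-grams in unwanted scripts.

    Assumes `profile` is a dict with a "freq" mapping of n-gram -> count.
    """
    cleaned = dict(profile)
    cleaned["freq"] = {
        ngram: count
        for ngram, count in profile["freq"].items()
        if not has_unwanted_script(ngram)
    }
    return cleaned


# Made-up sample data using the n-grams mentioned earlier in the thread.
sample = {"name": "et", "freq": {" пр": 3, "アア": 1, "三": 2, "ē": 5, "õun": 40}}
print(strip_scripts(sample)["freq"])  # {'ē': 5, 'õun': 40}
```

Note that a filter like this keeps Latin letters such as "ē" that do not occur
in Estonian; it only removes n-grams from the listed scripts.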

For anybody who might be curious, the overall change is here:

  https://github.com/validator/validator/commit/96171426a86bf9e172e7a85ed0ad6a1d2198994c

Anyway, that change should prevent both the reported cases of Russian getting
misidentified as Estonian and any (unreported) cases of Russian getting
misidentified as Catalan.

  —Mike

-- 
Michael[tm] Smith https://sideshowbarker.net/

Received on Sunday, 12 February 2017 02:41:32 UTC