- From: Michael[tm] Smith <mike@w3.org>
- Date: Sun, 12 Feb 2017 11:40:57 +0900
- To: "Jukka K. Korpela" <jkorpela@cs.tut.fi>, Алена Гордиенко <Alena.Gordienko@vm.ua>, "'www-validator@w3.org'" <www-validator@w3.org>
- Message-ID: <20170212024057.lnl7ymhfjymov5ii@sideshowbarker.net>
"Michael[tm] Smith" <mike@w3.org>, 2016-12-28 13:16 +0900: > Archived-At: <http://www.w3.org/mid/20161228041612.l7dyiujaehsmp7um@sideshowbarker.net> > "Jukka K. Korpela" <jkorpela@cs.tut.fi>, 2016-12-27 21:32 +0200: > > Archived-At: <http://www.w3.org/mid/fd5fcc31-b650-16ce-9164-9186c8140d36@cs.tut.fi> > > 27.12.2016, 18:26, Michael[tm] Smith wrote: > > > > > The n-grams the language detector uses for identifying Estonian are here: > > > > > > https://raw.githubusercontent.com/validator/validator/master/resources/language-profiles/et > > > > > > As far as I can see, there are no Cyrillic letters in there. > > > > There are some, starting from the very first string " пр". But they have low > > frequencies. Somewhat more surprisingly, there are even katakana > > letters,"アア", and CJK characters, "三". There are also vowels with macron, > > like "ē", which do not appear in Estonian. At https://github.com/validator/validator/issues/464 I recently got another report of a Russian document getting misidentified as Estonian. So I ran a script to remove all Cyrillic characters from the Estonian profile that the language detector uses (and also removed all CJK characters from while I as at it), then retested… and found that document then got misidentified as Catalan. So I stripped the Cyrillic and CJK characters from the Catalan profiles too. For anybody who might be curious, the overall change is here: https://github.com/validator/validator/commit/96171426a86bf9e172e7a85ed0ad6a1d2198994c Anyway, that change should prevent both the reported cases of Russian getting misidentified as Estonian, and also any (unreported) cases of Russian getting misidentified as Catalan. —Mike -- Michael[tm] Smith https://sideshowbarker.net/
Received on Sunday, 12 February 2017 02:41:32 UTC