- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Tue, 27 Dec 2016 21:32:59 +0200
- To: Алена Гордиенко <Alena.Gordienko@vm.ua>, "'www-validator@w3.org'" <www-validator@w3.org>
27.12.2016, 18:26, Michael[tm] Smith wrote:

> The n-grams the language detector uses for identifying Estonian are here:
>
> https://raw.githubusercontent.com/validator/validator/master/resources/language-profiles/et
>
> As far as I can see, there are no Cyrillic letters in there.

There are some, starting from the very first string, " пр". But they have low frequencies. Somewhat more surprisingly, there are even katakana letters, "アア", and CJK characters, "三". There are also vowels with macron, like "ē", which do not appear in Estonian.

Apparently the data is based on texts whose main content is Estonian but which include some Russian, Baltic-language, and even Japanese words. This is somewhat problematic, but because the frequency numbers are low, it does not seem to explain the misclassification.

As a whole, I would expect this data to recognize Estonian relatively well, and surely not to mistake Russian for Estonian, assuming that the algorithm is reasonable.

Yucca
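
For anyone who wants to repeat the check, here is a minimal sketch of how the share of each script in a profile could be tallied. It assumes the profile is a JSON object whose "freq" member maps n-gram strings to counts, which should be verified against the actual file. The names script_of and summarize_scripts and the sample counts are made up for illustration, and the script label is only a rough guess taken from the first word of each character's Unicode name.

```python
import unicodedata
from collections import defaultdict


def script_of(ch):
    """Rough script label: the first word of the Unicode character name.

    Returns None for whitespace, which n-grams use as a word boundary.
    """
    if ch.isspace():
        return None
    try:
        name = unicodedata.name(ch)
    except ValueError:
        # Unassigned or unnamed code point.
        return "UNKNOWN"
    return name.split()[0]


def summarize_scripts(freq):
    """Sum n-gram counts per script label.

    An n-gram that mixes scripts is counted once under each of them.
    """
    totals = defaultdict(int)
    for ngram, count in freq.items():
        scripts = {script_of(ch) for ch in ngram} - {None}
        for script in scripts:
            totals[script] += count
    return dict(totals)


# Made-up counts, for illustration only. With the real file one might
# instead use json.load(...)["freq"], assuming that layout holds.
sample_freq = {" пр": 3, "アア": 1, "三": 2, "ja ": 500, "ē": 4}
summary = summarize_scripts(sample_freq)
for script, total in sorted(summary.items(), key=lambda kv: -kv[1]):
    print(script, total)
```

If the non-Latin totals turn out to be tiny compared with the Latin total, that would support the point above that these stray n-grams alone are unlikely to explain the misclassification.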
Received on Tuesday, 27 December 2016 19:33:34 UTC