Re: peport`s error from Jukka K. Korpela on 2016-12-27 (www-validator@w3.org from December 2016)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Tue, 27 Dec 2016 21:32:59 +0200
To: Алена Гордиенко <Alena.Gordienko@vm.ua>, "'www-validator@w3.org'" <www-validator@w3.org>
Message-ID: <fd5fcc31-b650-16ce-9164-9186c8140d36@cs.tut.fi>

27.12.2016, 18:26, Michael[tm] Smith wrote:

> The n-grams the language detector
> uses for identifying Estonian are here:
>
>   https://raw.githubusercontent.com/validator/validator/master/resources/language-profiles/et
>
> As far as I can see, there are no Cyrillic letters in there.

There are some, starting from the very first string " пр". But they have 
low frequencies. Somewhat more surprisingly, there are even katakana
letters, "アア", and CJK characters, "三". There are also vowels with
macron, like "ē", which do not appear in Estonian. Apparently the data 
is based on texts with Estonian main content but some Russian, Baltic 
language, and even Japanese words included. This is somewhat 
problematic, but it does not seem to explain the misclassification,
because the frequencies involved are so low. As a whole, I would expect
this data to recognize Estonian relatively well – and surely not to
mistake Russian for Estonian, assuming that the algorithm is reasonable.
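
To put a number on "low frequencies", one could sum the n-gram counts per
script. The following is only a rough sketch, not how the validator itself
works: it assumes the profile has already been loaded into a dict that maps
n-gram strings to counts (the function names are hypothetical), it guesses
the script from the first word of each character's Unicode name, and the
sample counts are invented for illustration, not taken from the real "et"
profile.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Rough script label: first word of the Unicode character name."""
    if not ch.isalpha():
        return None  # spaces, digits, punctuation carry no script here
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def script_shares(freq):
    """Share of the total n-gram count that involves each script.

    freq is assumed to map n-gram strings to occurrence counts.
    """
    totals = Counter()
    for ngram, count in freq.items():
        # Count an n-gram once per script that occurs in it.
        scripts = {script_of(ch) for ch in ngram} - {None}
        for script in scripts:
            totals[script] += count
    grand_total = sum(freq.values())
    return {script: n / grand_total for script, n in totals.items()}


# Invented counts, for illustration only.
sample = {" pr": 900, "ja ": 1500, " пр": 3, "アア": 1, "三": 1, "ē": 2}
shares = script_shares(sample)

for script, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{script:10s} {share:.4%}")
```

If the Cyrillic share of the real profile turns out to be as tiny as it
looks, the stray n-grams alone could hardly make Russian text score as
Estonian. Note that this check would not flag "ē", since it is a Latin
letter; catching such letters would need a list of the characters actually
used in Estonian.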

Yucca

Received on Tuesday, 27 December 2016 19:33:34 UTC