Re: peport`s error

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Tue, 27 Dec 2016 21:32:59 +0200
To: Алена Гордиенко <Alena.Gordienko@vm.ua>, "'www-validator@w3.org'" <www-validator@w3.org>
Message-ID: <fd5fcc31-b650-16ce-9164-9186c8140d36@cs.tut.fi>
27.12.2016, 18:26, Michael[tm] Smith wrote:

> The n-grams the language detector
> uses for identifying Estonian are here:
>
>   https://raw.githubusercontent.com/validator/validator/master/resources/language-profiles/et
>
> As far as I can see, there are no Cyrillic letters in there.

There are some, starting from the very first string " пр". But they have 
low frequencies. Somewhat more surprisingly, there are even katakana 
letters, "アア", and CJK characters, "三". There are also vowels with
macron, like "ē", which do not appear in Estonian. Apparently the data 
is based on texts with Estonian main content but some Russian, Baltic 
language, and even Japanese words included. This is somewhat 
problematic, but it does not seem to explain the misclassification, due 
to low frequency numbers. As a whole, I would expect this data to recognize
Estonian relatively well – and surely not mistake Russian for Estonian, 
assuming that the algorithm is reasonable.
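
One way to check how much weight the stray n-grams carry is to group the
profile's counts by script. A minimal Python sketch follows, assuming the
profile has already been loaded into a plain dict mapping each n-gram to
its count. The function names and the sample counts are made up for
illustration; they are not taken from the actual "et" profile or from the
validator's code.

```python
import unicodedata
from collections import defaultdict


def script_of(ch):
    """Rough script label: the first word of the Unicode character name.

    Returns None for non-letters (spaces, digits, punctuation).
    """
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def script_shares(freq):
    """Share of the total n-gram count that involves each script.

    An n-gram mixing two scripts is counted under both, so the shares
    may add up to slightly more than 1.
    """
    grand_total = sum(freq.values())
    if not grand_total:
        return {}
    totals = defaultdict(int)
    for ngram, count in freq.items():
        scripts = {script_of(ch) for ch in ngram} - {None}
        for script in scripts:
            totals[script] += count
    return {script: n / grand_total for script, n in totals.items()}


# Invented counts, for illustration only (NOT the real profile data).
sample = {" пр": 3, "アア": 1, "三": 2, "ja ": 500, "õ": 494}
for script, share in sorted(script_shares(sample).items()):
    print(f"{script:10s} {share:.3%}")
```

If the non-Latin shares come out as tiny fractions, as in this invented
sample, the stray n-grams alone are unlikely to tip the classification.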

Yucca
Received on Tuesday, 27 December 2016 19:33:34 UTC