
Re: report's error

From: Michael[tm] Smith <mike@w3.org>
Date: Wed, 28 Dec 2016 13:16:12 +0900
To: "Jukka K. Korpela" <jkorpela@cs.tut.fi>
Cc: Алена Гордиенко <Alena.Gordienko@vm.ua>, "'www-validator@w3.org'" <www-validator@w3.org>
Message-ID: <20161228041612.l7dyiujaehsmp7um@sideshowbarker.net>
"Jukka K. Korpela" <jkorpela@cs.tut.fi>, 2016-12-27 21:32 +0200:
> Archived-At: <http://www.w3.org/mid/fd5fcc31-b650-16ce-9164-9186c8140d36@cs.tut.fi>
> 
> 27.12.2016, 18:26, Michael[tm] Smith wrote:
> 
> > The n-grams the language detector
> > uses for identifying Estonian are here:
> > 
> >   https://raw.githubusercontent.com/validator/validator/master/resources/language-profiles/et
> > 
> > As far as I can see, there are no Cyrillic letters in there.
> 
> There are some, starting from the very first string " пр". But they have low
> frequencies. Somewhat more surprisingly, there are even katakana
> letters,"アア", and CJK characters, "三". There are also vowels with macron,
> like "ē", which do not appear in Estonian.

The reason for those is that the data used to generate the n-grams is taken
from the titles and abstracts of all articles in the Estonian Wikipedia, namely:

  https://dumps.wikimedia.org/etwiki/latest/etwiki-latest-abstract.xml

Regardless of which locale's wiki an abstract comes from, abstracts often
contain parenthetical references to place names and the like in languages
other than the wiki's own language.

So those end up in the n-grams, albeit with a low frequency.
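As a rough sketch of the mechanism (not the checker's actual code, which is Java and uses the langdetect library's own profile builder), the profile files are essentially frequency counts of short character n-grams over the corpus, so a stray Cyrillic place name contributes a few low-count entries:

```python
from collections import Counter

def char_ngrams(text, n):
    """Yield every character n-gram of length n from text."""
    for i in range(len(text) - n + 1):
        yield text[i:i + n]

def build_profile(corpus_lines, max_n=3):
    """Count 1- to max_n-grams over a corpus, roughly as the
    language-profile files are generated."""
    counts = Counter()
    for line in corpus_lines:
        padded = " " + line + " "
        for n in range(1, max_n + 1):
            counts.update(char_ngrams(padded, n))
    return counts

# A made-up Estonian abstract with a parenthetical Russian place name:
profile = build_profile([
    "Narva (vene keeles Нарва) on linn Eestis.",
])
# The Cyrillic trigram "Нар" appears, but with a much lower count than
# common Estonian grams -- which is how entries like " пр" end up in
# the et profile without dominating it.
```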

> Apparently the data is based on texts with Estonian main content but some
> Russian, Baltic language, and even Japanese words included. This is
> somewhat problematic, but it does not seem to explain the
> misclassification, due to low frequency numbers.

Yeah

> As a whole, I would expect this data to recognize Estonian relatively well –
> and surely not mistake Russian for Estonian, assuming that the algorithm
> is reasonable.

I’ve not studied the algorithm, but the pattern I’ve observed is that when
it misidentifies the language of a document, the document is often something
like a product page.

And from a text-processing point of view, one of the characteristics of
those kinds of pages is that if you take the raw text content of the body
(e.g., using the DOM .textContent property), the result consists of lots of
runs of small amounts of text separated by large numbers of line breaks.

That kind of raw text is what had so far been fed as-is to the language
detector. So in the interest of making it easier to view the text during
debugging, today I made an adjustment that causes all the whitespace in the
content to be collapsed before it gets handed to the language detector.
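The actual change was made in the checker's Java code, but the idea is simple enough to sketch in a few lines of Python (the sample strings below are invented, not taken from the page in question):

```python
import re

def collapse_whitespace(text):
    """Collapse every run of whitespace (including line breaks) to a
    single space, in the spirit of the adjustment described above,
    before the text reaches the language detector."""
    return re.sub(r"\s+", " ", text).strip()

# Raw .textContent of a product-style page: short runs of text
# separated by large numbers of line breaks.
raw = "Cartridge\n\n\n12 mm\n\n\nPrice\n\n\n120 UAH\n\n"
collapsed = collapse_whitespace(raw)
# -> "Cartridge 12 mm Price 120 UAH"
```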

Somewhat surprisingly, that seems to have fixed the misidentification
problem with http://patronservice.ua/:

  https://validator.w3.org/nu/?doc=http%3A%2F%2Fpatronservice.ua

That now gets correctly detected as being in Russian.

So I imagine the language detector algorithm may not be designed to deal
well with content in that short-runs-of-text-with-many-line-breaks-between
pattern. And I suspect that may be the cause of some of the other cases of
language misidentification that people have run into with the checker.

  —Mike

-- 
Michael[tm] Smith https://people.w3.org/mike

Received on Wednesday, 28 December 2016 04:16:45 UTC
