Re: peport`s error from Michael[tm] Smith on 2016-12-27 (www-validator@w3.org from December 2016)

From: Michael[tm] Smith <mike@w3.org>
Date: Wed, 28 Dec 2016 01:26:59 +0900
To: "Jukka K. Korpela" <jkorpela@cs.tut.fi>, Алена Гордиенко <Alena.Gordienko@vm.ua>, "'www-validator@w3.org'" <www-validator@w3.org>
Message-ID: <20161227162659.s6jam4kaqe4crr3o@sideshowbarker.net>

Hi Jukka,

"Jukka K. Korpela" <jkorpela@cs.tut.fi>, 2016-12-27 16:37 +0200:
> Archived-At: <http://www.w3.org/mid/069c86ef-8765-5c87-30dd-190fd5fbe8d6@cs.tut.fi>
> 
> 21.12.2016, 11:53, Алена Гордиенко wrote:
> 
> > This is the link to the report's result for my site.
> > 
> > https://validator.w3.org/nu/?showsource=yes&doc=http%3A%2F%2Fpatronservice.ua#l1345c165
> 
> This is rather mysterious. But first let me point at a different mystery:
> the message was sent December 21st and received by a w3.org server the same day,
> yet distributed to subscribers of the list December 27th. I have no idea of
> the cause of such delays

That almost always means it got held in a moderation queue for somebody on
the W3C staff to review before forwarding to the list.

> (which have happened in the past, but not this long).

I think in this case the delay was longer because Christmas came in between.

> > Warning: This document appears to be written in Estonian but
> > the html start tag has lang="ru". Consider
> > using lang="et" (or variant) instead.
> ...
> Wrong language guesses by the validator are not uncommon, but usually there
> is a simple explanation, like hidden textual content at the start of the
> document, in a language different from the main language of the page. Here,
> however, we have a mystery. The page content (even as seen by a validator)
> is almost exclusively in Russian, with just a few short strings in Latin
> letters here and there. So how can a language analysis guess that it is in
> Estonian, which is written in Latin letters?

Yeah, this one has me baffled as well. The n-grams the language detector
uses for identifying Estonian are here:

  https://raw.githubusercontent.com/validator/validator/master/resources/language-profiles/et

As far as I can see, there are no Cyrillic letters in there.

The n-grams used for identifying Russian are here:

  https://raw.githubusercontent.com/validator/validator/master/resources/language-profiles/ru

And that is all Cyrillic and looks nothing like the Estonian n-grams.
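
For anyone who wants to repeat that eyeball check mechanically, here is a
rough sketch. It assumes you have already pulled the n-gram strings out of a
profile into a plain list; how you parse the profile file is up to you and is
not shown. The function name scripts_in_ngrams is made up for illustration
and is not part of the validator or the detection library.

```python
import unicodedata

def scripts_in_ngrams(ngrams):
    """Count letters per script, using the first word of each
    character's Unicode name (e.g. LATIN, CYRILLIC) as the script."""
    counts = {}
    for gram in ngrams:
        for ch in gram:
            if not ch.isalpha():
                continue  # skip spaces, digits, punctuation
            script = unicodedata.name(ch, "UNKNOWN").split()[0]
            counts[script] = counts.get(script, 0) + 1
    return counts

# Hypothetical sample input, not taken from the real profiles.
sample = ["\u00f5", "ja", "\u043d\u0430"]
print(scripts_in_ngrams(sample))
```

If the Estonian profile really has no Cyrillic in it, the result for its
n-grams should contain no CYRILLIC key at all.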

So at this point I have no idea how the library could ever mistake the
content of http://patronservice.ua for Estonian.

...
> I wonder what content there might confuse a language guesser so badly, when
> the content is present in a context of a page in Russian, but not when
> tested in isolation.

Dunno. But the library we’re using does detection not on the entire
contents of the document but on some range of the content that it selects
at random. So the results are not deterministic; it can sometimes report
different results for the same document.
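
To illustrate why sampling a random range can give different answers for the
same document, here is a toy sketch. It is not the library's actual
algorithm: it only compares counts of Cyrillic versus ASCII Latin letters in
a randomly placed window, and every name in it (guess_script,
guess_from_random_window, sample_guesses) is hypothetical.

```python
import random

def guess_script(text):
    """Toy classifier: majority of Cyrillic vs. ASCII Latin letters."""
    cyr = sum(1 for ch in text if "\u0400" <= ch <= "\u04ff")
    lat = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return "cyrillic" if cyr >= lat else "latin"

def guess_from_random_window(text, window, rng):
    """Classify only a randomly positioned slice of the text."""
    start = rng.randrange(0, max(1, len(text) - window + 1))
    return guess_script(text[start:start + window])

def sample_guesses(text, window, runs, seed=None):
    """Collect the distinct answers seen over several random windows."""
    rng = random.Random(seed)
    return {guess_from_random_window(text, window, rng) for _ in range(runs)}

# Mostly Cyrillic text with a short Latin-letter block at the end.
page = ("\u043f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440 " * 50
        + "menu login search " * 5)
print(guess_script(page))            # judged on the whole text
print(sample_guesses(page, 40, 200)) # judged on random 40-char windows
```

The whole text is classified one way, but a window that happens to land on
the Latin-letter block can be classified the other way, so repeated runs may
disagree. That would explain run-to-run variation, though not a guess of
Estonian for text that is almost entirely Cyrillic.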

> And there is no Estonian word there.

Yeah, regardless of what part of the document it’s checking, I don’t
understand how it could ever decide it’s Estonian.

  —Mike

-- 
Michael[tm] Smith https://people.w3.org/mike

Received on Tuesday, 27 December 2016 16:27:31 UTC