Re: Wrong language identification from Michael[tm] Smith on 2020-08-29 (www-validator@w3.org from August 2020)

From: Michael[tm] Smith <mike@w3.org>
Date: Sun, 30 Aug 2020 08:40:22 +0900
To: "Jukka K. Korpela" <jukkakk@gmail.com>, Comintt Comtt <comintt@mail.com>, W3C WWW Validator <www-validator@w3.org>
Message-ID: <20200829234022.GV1230201@sideshowbarker.net>

"Michael[tm] Smith" <mike@w3.org>, 2020-08-30 08:24 +0900:
> Archived-At: <https://www.w3.org/mid/20200829232454.GU1230201@sideshowbarker.net>
> 
> Unfortunately, along with the fact that the language guesser sometimes
> guesses wrong, it’s also not deterministic — that is, one time when you
> check, it might not guess wrong, but another time it will.

I should have explained the reason for that, which is detailed here:

https://code.google.com/archive/p/language-detection/wikis/FrequentlyAskedQuestion.wiki

> Langdetect uses random sampling for avoiding local noises(person name,
> place name and so on), so the language detections of the same document
> might differ for every time.

In other words, in terms of the HTML checker behavior, each time you check
a document, the language guesser is being run on only a sample of the
document, rather than the entire content of the document. And the sample is
selected randomly each time. So sometimes it gets run on a sample that it
guesses correctly on, but sometimes it might be run on a sample which it
ends up guessing incorrectly.
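
To make that concrete, here is a toy sketch in Python. It is not the
langdetect algorithm or anything from the HTML checker; the word lists, the
function names, and the sample size are all made up for illustration. It only
shows how a guesser that is right about a whole document can still be wrong
about a randomly chosen piece of it.

```python
import random

# Toy stand-in for a language guesser. This is NOT langdetect's algorithm;
# the hint-word lists below are hypothetical and chosen only for illustration.
ENGLISH_HINTS = {"the", "and", "is", "of"}
FINNISH_HINTS = {"ja", "on", "ei", "että"}


def guess_language(words):
    """Guess 'en' or 'fi' by counting hint words; ties go to 'en'."""
    en = sum(w in ENGLISH_HINTS for w in words)
    fi = sum(w in FINNISH_HINTS for w in words)
    return "fi" if fi > en else "en"


def guess_from_sample(words, sample_size, rng):
    """Run the guesser on a random sample instead of the full word list."""
    return guess_language(rng.sample(words, sample_size))


# A mostly-English "document" with a few Finnish words mixed in
# (think of them as the noise: names, quotations, and so on).
document = ["the", "and", "is", "of", "the", "and", "ja", "on", "ei", "että"]

# On the whole document the guess is 'en', but on small random samples the
# guess can flip, depending on which words happen to be picked.
rng = random.Random()  # unseeded, so the samples differ from run to run
print(guess_language(document))
print([guess_from_sample(document, 3, rng) for _ in range(5)])
```

Running the last line repeatedly will typically print a different mix of
'en' and 'fi' each time, which is the same kind of behavior you see from
the checker.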

There is a setting in the library API for making it deterministic — by
ensuring it always samples the same document in exactly the same way each
time, rather than selecting the sample randomly — but I have intentionally
chosen to not use that always-be-deterministic setting (because I think the
default random-sampling behavior produces better results overall).
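
For anyone wondering what such a setting amounts to, the usual mechanism is
a fixed seed for the random number generator. The following Python sketch
shows only that general idea; `pick_sample` and its `seed` parameter are
hypothetical names of mine, not the library's actual API.

```python
import random


def pick_sample(words, sample_size, seed=None):
    """Return a random sample of words.

    With seed=None the generator is seeded from the system, so the sample
    varies between calls. With a fixed seed the same sample comes back
    every time. (Hypothetical helper, for illustration only.)
    """
    rng = random.Random(seed)
    return rng.sample(words, sample_size)


words = "one two three four five six seven eight".split()

# Fixed seed: the two samples are always identical.
print(pick_sample(words, 3, seed=42) == pick_sample(words, 3, seed=42))

# No seed: the two samples are chosen independently and may differ.
print(pick_sample(words, 3), pick_sample(words, 3))
```

The trade-off is the one described above: a fixed seed gives repeatable
results, but it also means that a document which happens to get an unlucky
sample would get that same unlucky sample on every check.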

  –Mike

-- 
Michael[tm] Smith https://people.w3.org/mike

Received on Saturday, 29 August 2020 23:40:37 UTC