- From: Michael[tm] Smith <mike@w3.org>
- Date: Sun, 30 Aug 2020 08:40:22 +0900
- To: "Jukka K. Korpela" <jukkakk@gmail.com>, Comintt Comtt <comintt@mail.com>, W3C WWW Validator <www-validator@w3.org>
- Message-ID: <20200829234022.GV1230201@sideshowbarker.net>
"Michael[tm] Smith" <mike@w3.org>, 2020-08-30 08:24 +0900: > Archived-At: <https://www.w3.org/mid/20200829232454.GU1230201@sideshowbarker.net> > > Unfortunately, along with the fact that the language guesser sometimes > guesses wrong, it’s also not deterministic — that is, one time when you > check, it might not guess wrong, but another time it will. I should have explained the reason for that, which is detailed here: https://code.google.com/archive/p/language-detection/wikis/FrequentlyAskedQuestion.wiki > Langdetect uses random sampling for avoiding local noises(person name, > place name and so on), so the language detections of the same document > might differ for every time. In other words, in terms of the HTML checker behavior, each time you check a document, the language guesser is being run on only a sample of the document, rather then the entire content of the document. And the sample is selected randomly each time. So sometimes it gets run on a sample that it guesses correctly on, but sometimes it might be run on a sample which it ends up guessing incorrectly. There is a setting in the library API for making it deterministic — by ensuring it always samples the same document in exactly the same way each time, rather than selecting the sample randomly — but I have intentionally chosen to not use that always-be-deterministic setting (because I think the the default random-sampling behavior produces better results overall). –Mike -- Michael[tm] Smith https://people.w3.org/mike
Received on Saturday, 29 August 2020 23:40:37 UTC