- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Wed, 13 Jul 2016 10:38:34 +0900
- To: "Michael[tm] Smith" <mike@w3.org>
- CC: Felix Sasaki <fsasaki@w3.org>, <public-i18n-its-ig@w3.org>
Hello Mike,

Many thanks for all the additional information (and for the implementation in the validator in the first place, of course)!

I think just making sure this is documented somewhere (on http://validator.w3.org/about.html or thereabouts), with a simple pointer to the library used, should be enough.

Regards, Martin.

On 2016/07/12 19:56, Michael[tm] Smith wrote:
> Hi Martin,
>
> "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, 2016-07-12 19:43 +0900:
>> The languages supported are probably these:
>>
>> https://github.com/shuyo/language-detection/tree/master/profiles.
>
> Yep
>
>> Looking at some of the files, they contain counts for single letters,
>> bigrams, and sometimes trigrams. The Korean one is particularly large,
>> but the Japanese seems to be using patterns, as the only Kana it contains
>> are あ and ア (Hiragana and Katakana a).
>>
>> The slide sets linked from the overview page provide quite a bit of
>> background.
>>
>> Another question is what happens with mixed texts.
>
> The library has an API that returns not just a single language but instead
> a set of languages weighted by calculated probabilities of each language
> being the main language of the input.
>
> https://github.com/shuyo/language-detection/blob/wiki/Tutorial.md#detectorgetprobabilities
> https://cdn.rawgit.com/shuyo/language-detection/master/doc/com/cybozu/labs/langdetect/Detector.html#getProbabilities()
> https://github.com/shuyo/language-detection/blob/master/src/com/cybozu/labs/langdetect/Detector.java#L220
> https://github.com/shuyo/language-detection/blob/master/src/com/cybozu/labs/langdetect/Detector.java#L339
>
> So in the case of mixed texts I would expect that you’d end up with that
> API returning at least two languages with relatively probabilities.
>
> In my implementation I am currently doing something fairly crude, which is
> that when I get the output from that API, I just take the first one with a
> probability higher than 90% (the API returns an ordered list).
>
> https://github.com/validator/validator/blob/master/src/nu/validator/servlet/LanguageDetectingXMLReaderWrapper.java#L236
>
> If the API doesn’t return a language with at least a 90% probability indicated,
> then my code doesn’t set a language but instead just leaves it undetermined.
>
> —Mike
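
The selection rule described in the quoted message (take the first entry of the ordered list whose probability is above 90%, otherwise leave the language undetermined) could be sketched roughly as follows. This is an illustrative Python sketch only, not the validator's actual Java code. The function name `pick_language`, the `(language, probability)` tuple representation, and the strict `>` comparison are assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Assumed cutoff, mirroring the 90% figure mentioned in the message.
THRESHOLD = 0.90


def pick_language(
    candidates: List[Tuple[str, float]],
    threshold: float = THRESHOLD,
) -> Optional[str]:
    """Return the first language whose probability exceeds the threshold.

    `candidates` is assumed to be ordered by descending probability,
    as the message says the API's list is. Returns None when no entry
    clears the threshold, i.e. the language stays undetermined.
    """
    for lang, prob in candidates:
        if prob > threshold:
            return lang
    return None


# Hypothetical inputs: one clearly dominant language, and a mixed text
# where no candidate clears the cutoff.
print(pick_language([("ja", 0.97), ("en", 0.03)]))
print(pick_language([("en", 0.57), ("fr", 0.43)]))
```

With a rule like this, a mixed text whose probability mass is split between two languages would simply end up undetermined, which matches the behaviour described above.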
Received on Wednesday, 13 July 2016 01:39:17 UTC