- From: Michael[tm] Smith <mike@w3.org>
- Date: Tue, 12 Jul 2016 19:56:57 +0900
- To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Cc: Felix Sasaki <fsasaki@w3.org>, public-i18n-its-ig@w3.org
- Message-ID: <20160712105657.GV4628@sideshowbarker.net>
Hi Martin,

"Martin J. Dürst" <duerst@it.aoyama.ac.jp>, 2016-07-12 19:43 +0900:
> The languages supported are probably these:
>
> https://github.com/shuyo/language-detection/tree/master/profiles.

Yep

> Looking at some of the files, they contain counts for single letters,
> bigrams, and sometimes trigrams. The Korean one is particularly large,
> but the Japanese seems to be using patterns, as the only Kana it contains
> are あ and ア (Hiragana and Katakana a).
>
> The slide sets linked from the overview page provide quite a bit of
> background.
>
> Another question is what happens with mixed texts.

The library has an API that returns not just a single language but instead
a set of languages, each weighted by the calculated probability that it is
the main language of the input.

https://github.com/shuyo/language-detection/blob/wiki/Tutorial.md#detectorgetprobabilities
https://cdn.rawgit.com/shuyo/language-detection/master/doc/com/cybozu/labs/langdetect/Detector.html#getProbabilities()
https://github.com/shuyo/language-detection/blob/master/src/com/cybozu/labs/langdetect/Detector.java#L220
https://github.com/shuyo/language-detection/blob/master/src/com/cybozu/labs/langdetect/Detector.java#L339

So in the case of mixed texts I would expect that you’d end up with that
API returning at least two languages with relatively high probabilities.

In my implementation I am currently doing something fairly crude: when I
get the output from that API, I just take the first language with a
probability higher than 90% (the API returns an ordered list).

https://github.com/validator/validator/blob/master/src/nu/validator/servlet/LanguageDetectingXMLReaderWrapper.java#L236

If the API doesn’t return a language with at least a 90% probability, then
my code doesn’t set a language but instead just leaves it undetermined.

—Mike

-- 
Michael[tm] Smith https://people.w3.org/mike
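
For readers of the archive, the 90% cutoff described in the message above could be sketched roughly as follows. This is not the validator's actual code. `Candidate` is a hypothetical stand-in for the library's result type, the class and method names are invented for illustration, and the probabilities in `main` are made up. The sketch also assumes the list arrives ordered by descending probability, as the message says the API's output is.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class ThresholdSketch {
    // Hypothetical stand-in for the library's result type:
    // a language code plus the probability assigned to it.
    static final class Candidate {
        final String lang;
        final double prob;

        Candidate(String lang, double prob) {
            this.lang = lang;
            this.prob = prob;
        }
    }

    // Cutoff taken from the message: only accept a language above 90%.
    static final double MIN_PROBABILITY = 0.90;

    // Returns the first candidate above the cutoff, or empty if none
    // qualifies (the "leave it undetermined" case). Assumes the list is
    // ordered by descending probability.
    static Optional<String> pick(List<Candidate> candidates) {
        for (Candidate c : candidates) {
            if (c.prob > MIN_PROBABILITY) {
                return Optional.of(c.lang);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up probabilities, for illustration only.
        List<Candidate> mostlyEnglish = Arrays.asList(
                new Candidate("en", 0.97), new Candidate("fr", 0.03));
        List<Candidate> mixed = Arrays.asList(
                new Candidate("ja", 0.57), new Candidate("en", 0.43));

        System.out.println(pick(mostlyEnglish).orElse("undetermined")); // en
        System.out.println(pick(mixed).orElse("undetermined"));         // undetermined
    }
}
```

With a cutoff this high, a genuinely mixed text whose probability mass is split between two languages would fall through to the undetermined case, which matches the behavior the message describes.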
Received on Tuesday, 12 July 2016 10:57:23 UTC