Re: Language detection for web content

Hi Martin,

"Martin J. Dürst" <duerst@it.aoyama.ac.jp>, 2016-07-12 19:43 +0900:
> The languages supported are probably these:
> 
> https://github.com/shuyo/language-detection/tree/master/profiles.

Yep

> Looking at some of the files, they contain counts for single letters,
> bigrams, and sometimes trigrams. The Korean one is particularly large,
> but the Japanese seems to be using patterns, as the only Kana it contains
> are あ and ア (Hiragana and Katakana a).
> 
> The slide sets linked from the overview page provide quite a bit of
> background.
> 
> Another question is what happens with mixed texts.

The library has an API that returns not just a single language but instead
a set of languages weighted by calculated probabilities of each language
being the main language of the input.

  https://github.com/shuyo/language-detection/blob/wiki/Tutorial.md#detectorgetprobabilities
  https://cdn.rawgit.com/shuyo/language-detection/master/doc/com/cybozu/labs/langdetect/Detector.html#getProbabilities()
  https://github.com/shuyo/language-detection/blob/master/src/com/cybozu/labs/langdetect/Detector.java#L220
  https://github.com/shuyo/language-detection/blob/master/src/com/cybozu/labs/langdetect/Detector.java#L339

So in the case of mixed texts I would expect that you’d end up with that
API returning at least two languages with relatively high probabilities.

In my implementation I’m currently doing something fairly crude: when I get
the output from that API, I just take the first language with a probability
higher than 90% (the API returns an ordered list).

  https://github.com/validator/validator/blob/master/src/nu/validator/servlet/LanguageDetectingXMLReaderWrapper.java#L236

If the API doesn’t return a language with at least a 90% probability indicated,
then my code doesn’t set a language but instead just leaves it undetermined.
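
Roughly, that selection logic could be sketched as below. This is only an
illustration and not the actual validator code: the LanguagePick class, the
pick() method, and the Candidate type are made-up names, with Candidate
standing in for the entries of the list the detection API returns, and the
probabilities in main() are invented for the example.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the "first language above 90%" rule.
final class LanguagePick {

    // Stand-in for one entry in the list returned by the detection API:
    // a language code plus the probability calculated for it.
    static final class Candidate {
        final String lang;
        final double prob;

        Candidate(String lang, double prob) {
            this.lang = lang;
            this.prob = prob;
        }
    }

    static final double THRESHOLD = 0.90;

    // Assumes the candidates are ordered from most to least probable.
    // Returns the first language whose probability is higher than the
    // threshold, or an empty Optional to mean "undetermined".
    static Optional<String> pick(List<Candidate> candidates) {
        for (Candidate c : candidates) {
            if (c.prob > THRESHOLD) {
                return Optional.of(c.lang);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Invented probabilities, for illustration only.
        List<Candidate> mostlyEnglish =
            List.of(new Candidate("en", 0.97), new Candidate("fr", 0.03));
        List<Candidate> mixed =
            List.of(new Candidate("en", 0.55), new Candidate("ja", 0.45));

        System.out.println(pick(mostlyEnglish).orElse("undetermined")); // en
        System.out.println(pick(mixed).orElse("undetermined")); // undetermined
    }
}
```

With a mixed text along the lines of the second input, no single language
clears the threshold, so the result stays undetermined.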

  —Mike

-- 
Michael[tm] Smith https://people.w3.org/mike

Received on Tuesday, 12 July 2016 10:57:23 UTC