Re: Language detection for web content from Martin J. Dürst on 2016-07-13 (public-i18n-its-ig@w3.org from July 2016)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Wed, 13 Jul 2016 10:38:34 +0900
To: "Michael[tm] Smith" <mike@w3.org>
CC: Felix Sasaki <fsasaki@w3.org>, <public-i18n-its-ig@w3.org>
Message-ID: <ccbfbf4e-01f2-f910-003c-c85ca0dc60d8@it.aoyama.ac.jp>

Hello Mike,

Many thanks for all the additional information (and for the 
implementation in the validator in the first place, of course)!

I think just making sure this is documented somewhere (on 
http://validator.w3.org/about.html or thereabouts), with a simple 
pointer to the library used, should be enough.

Regards,   Martin.

On 2016/07/12 19:56, Michael[tm] Smith wrote:
> Hi Martin,
>
> "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, 2016-07-12 19:43 +0900:
>> The languages supported are probably these:
>>
>> https://github.com/shuyo/language-detection/tree/master/profiles.
>
> Yep
>
>> Looking at some of the files, they contain counts for single letters,
>> bigrams, and sometimes trigrams. The Korean one is particularly large,
>> but the Japanese one seems to be using patterns, as the only Kana it
>> contains are あ and ア (Hiragana and Katakana a).
>>
>> The slide sets linked from the overview page provide quite a bit of
>> background.
>>
>> Another question is what happens with mixed texts.
>
> The library has an API that returns not just a single language but instead
> a set of languages weighted by calculated probabilities of each language
> being the main language of the input.
>
>   https://github.com/shuyo/language-detection/blob/wiki/Tutorial.md#detectorgetprobabilities
>   https://cdn.rawgit.com/shuyo/language-detection/master/doc/com/cybozu/labs/langdetect/Detector.html#getProbabilities()
>   https://github.com/shuyo/language-detection/blob/master/src/com/cybozu/labs/langdetect/Detector.java#L220
>   https://github.com/shuyo/language-detection/blob/master/src/com/cybozu/labs/langdetect/Detector.java#L339
>
> So in the case of mixed texts I would expect that you’d end up with that
> API returning at least two languages with relatively high probabilities.
>
> In my implementation I am currently doing something fairly crude, which is
> that when I get the output from that API, I just take the first one with a
> probability higher than 90% (the API returns an ordered list).
>
>   https://github.com/validator/validator/blob/master/src/nu/validator/servlet/LanguageDetectingXMLReaderWrapper.java#L236
>
> If the API doesn’t return a language with at least a 90% probability indicated,
> then my code doesn’t set a language but instead just leaves it undetermined.
>
>   —Mike
>
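
For readers of the archive, the selection rule Mike describes above (walk
the ordered list, take the first entry above 90%, otherwise leave the
language undetermined) could be sketched roughly as below. This is only an
illustration, not the validator's actual code: the class ThresholdPick, the
Candidate type, and the method pickLanguage are hypothetical stand-ins for
whatever the library's getProbabilities() returns, and the strict ">"
comparison is an assumption, since the message says both "higher than 90%"
and "at least a 90%".

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the "first language above 90%" rule.
class ThresholdPick {

    // Stand-in for one (language code, probability) entry; not the
    // library's own type.
    static final class Candidate {
        final String lang;
        final double prob;

        Candidate(String lang, double prob) {
            this.lang = lang;
            this.prob = prob;
        }
    }

    // Assumed cut-off, taken from the 90% figure in the message.
    static final double THRESHOLD = 0.90;

    // Expects the list ordered by descending probability, as the message
    // says the API returns it. Returns the first language whose
    // probability exceeds the threshold, or empty if none does.
    static Optional<String> pickLanguage(List<Candidate> ranked) {
        for (Candidate c : ranked) {
            if (c.prob > THRESHOLD) {
                return Optional.of(c.lang);
            }
        }
        return Optional.empty();
    }
}
```

With a mixed text that comes back as, say, "en" at 0.57 and "ja" at 0.43
(made-up numbers), neither entry clears the threshold, so pickLanguage
returns an empty Optional and the language stays undetermined, which
matches the behaviour described in the message.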

Received on Wednesday, 13 July 2016 01:39:17 UTC