Re: Language detection for web content from Martin J. Dürst on 2016-07-12 (public-i18n-its-ig@w3.org from July 2016)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Tue, 12 Jul 2016 19:43:52 +0900
To: Felix Sasaki <fsasaki@w3.org>, Michael Smith <mike@w3.org>
CC: <public-i18n-its-ig@w3.org>
Message-ID: <f67d2373-da62-61c2-3229-aff4d2730bec@it.aoyama.ac.jp>

The languages supported are probably these:

https://github.com/shuyo/language-detection/tree/master/profiles
The files I looked at contain counts for single letters,
bigrams, and sometimes trigrams. The Korean one is particularly large,
but the Japanese one seems to be using patterns, as the only Kana it
contains are あ and ア (Hiragana and Katakana a).
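
To illustrate what "counts for single letters, bigrams, and trigrams" 
means, here is a minimal Python sketch. It is only my own illustration, 
not code from the library; the function name ngram_counts is 
hypothetical, and the real profiles may well apply normalization or 
filtering that this sketch does not.

```python
from collections import Counter


def ngram_counts(text, max_n=3):
    """Count character n-grams of length 1..max_n in text.

    Illustrative only: the actual library may normalize or filter
    characters before counting.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n over the text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


profile = ngram_counts("banana")
print(profile["a"], profile["an"], profile["ana"])
```

A detector would compare such counts for an input text against the 
stored per-language counts and pick the closest match.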

The slide sets linked from the overview page provide quite a bit of 
background.

Another question is what happens with mixed texts.
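
As described in the announcement quoted below, the JSON output carries 
one top-level "language" key, which is absent when the language could 
not be determined with enough confidence. A client therefore has to 
handle the missing case. A minimal sketch, assuming the response body 
has already been fetched as a string (the helper name 
detected_language is hypothetical):

```python
import json


def detected_language(response_text):
    """Return the top-level "language" value from the validator's
    JSON output, or None if the key is absent."""
    data = json.loads(response_text)
    return data.get("language")


# Sample bodies shaped like the output quoted below.
print(detected_language('{"messages": [], "language": "en"}'))
print(detected_language('{"messages": []}'))
```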

Regards,   Martin.

On 2016/07/12 18:57, Felix Sasaki wrote:
> Thanks for the positive feedback and the good point about listing the supported languages, Martin. I am putting Mike directly into the loop; maybe he knows what languages are supported. I browsed the underlying library
> https://github.com/shuyo/language-detection
> but did not find a list of languages. See also
> https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
> and this presentation
> https://github.com/shuyo/language-detection
> The GitHub project home page says that 53 languages are supported with 99% precision.
>
> Best,
>
> Felix
>
>> Am 12.07.2016 um 09:00 schrieb Martin J. Dürst <duerst@it.aoyama.ac.jp>:
>>
>> Hello Felix,
>>
>> This is good news. However, for language detection, it's important to know what languages the detector supports. Language detection is well known to be rather easy (on documents above a certain length) for a given set of languages. However, it's impossible to detect a language that the detector doesn't know. So a list of (currently) supported languages, and maybe a suggestion of how to contribute additional ones, would be very helpful.
>>
>> Regards,   Martin.
>>
>> On 2016/07/12 15:18, Felix Sasaki wrote:
>>> Hi all,
>>>
>>> Thanks to Mike Smith, there is now a language detection feature in the W3C validator. See, for example:
>>>
>>> https://validator.w3.org/nu/?doc=https%3A%2F%2Fw3.org&out=json
>>> https://validator.w3.org/nu/?doc=https%3A%2F%2Fw3.org&out=xml
>>>
>>> Explanation from Mike:
>>> In the JSON output you should see that the JSON object has a “language” key at the top level, and in the XML you should see that the root “messages” element has a “language” child element.
>>> The “language” value is a BCP 47 language tag. If “language” is absent from the JSON/XML output, that indicates the language could not be determined with enough confidence.
>>>
>>>
>>> Example in curl:
>>> curl -X POST -H "Content-Type: text/html; charset=utf-8" -d 'HTML document here' "https://validator.w3.org/nu/?out=json"
>>>
>>> Output in JSON:
>>>
>>> {
>>>  "messages": [ ... ],
>>>  "language": "en"
>>> }
>>>
>>>
>>> This has great potential to automate language processing workflows on the web.
>>>
>>> - Felix
>>>
>
>

Received on Tuesday, 12 July 2016 10:44:34 UTC