
Re: Language detection for web content

From: Felix Sasaki <fsasaki@w3.org>
Date: Tue, 12 Jul 2016 11:57:42 +0200
Cc: public-i18n-its-ig@w3.org
Message-Id: <E6B2D815-A574-48EE-B570-F8D7220E6351@w3.org>
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, Michael Smith <mike@w3.org>
Thanks for the positive feedback, Martin, and for the good point about listing the supported languages. I am putting Mike directly in the loop; maybe he knows which languages are supported. I browsed the underlying library
https://github.com/shuyo/language-detection
but did not find a list of languages. See also
https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
and this presentation
https://github.com/shuyo/language-detection
The GitHub project home page says that 53 languages are supported with 99% precision.

Best,

Felix

> On 12 Jul 2016, at 09:00, Martin J. Dürst <duerst@it.aoyama.ac.jp> wrote:
> 
> Hello Felix,
> 
> This is good news. However, for language detection, it's important to know what languages the detector supports. Language detection is very well known for being rather easy (on documents above a certain length) for a given set of languages. However, it's impossible to detect a language that the detector doesn't know. So a list of (currently) supported languages, and maybe a suggestion of how to contribute to additional ones, would be very helpful.
> 
> Regards,   Martin.
> 
> On 2016/07/12 15:18, Felix Sasaki wrote:
>> Hi all,
>> 
>> Thanks to Mike Smith, there is now a language detection feature in the W3C validator. See
>> 
>> https://validator.w3.org/nu/?doc=https%3A%2F%2Fw3.org&out=json
>> https://validator.w3.org/nu/?doc=https%3A%2F%2Fw3.org&out=xml
>> 
>> for examples.
>> 
>> Explanation from Mike:
>> In the JSON output you should see that the JSON object has a “language” key at the top level, and in the XML you should see that the root “messages” element has a “language” child element.
>> The “language” value is a BCP 47 language tag. If “language” is absent from the JSON/XML output, that indicates the language could not be determined with enough confidence.
>> 
>> 
>> Example in curl:
>> curl -X POST -H "Content-Type: text/html; charset=utf-8" -d 'HTML document here' "https://validator.w3.org/nu/?out=json"
>> 
>> Output in JSON:
>> 
>> {
>>  "messages": [ ... ],
>>  "language": "en"
>> }
>> 
>> 
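
For anyone who wants to script this rather than use curl, here is a minimal Python sketch of the same request. It assumes only what the quoted explanation states: the endpoint from the curl example, a top-level "language" key in the JSON, and absence of that key when detection was not confident enough. The function names and the 30-second timeout are my own choices for illustration and are not part of the validator.

```python
import json
import urllib.request
from typing import Optional

# Endpoint taken from the curl example above.
VALIDATOR_URL = "https://validator.w3.org/nu/?out=json"


def extract_language(response_text: str) -> Optional[str]:
    """Return the top-level "language" value, or None if the key is absent.

    An absent key means the language could not be determined with
    enough confidence, per the explanation quoted above.
    """
    result = json.loads(response_text)
    return result.get("language")


def detect_language(html: str, url: str = VALIDATOR_URL) -> Optional[str]:
    """POST an HTML document to the validator and return its detected language.

    Hypothetical helper: mirrors the curl call, nothing more.
    """
    request = urllib.request.Request(
        url,
        data=html.encode("utf-8"),
        headers={"Content-Type": "text/html; charset=utf-8"},
        method="POST",
    )
    # The timeout is an arbitrary assumption, not a documented requirement.
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_language(response.read().decode("utf-8"))
```

Calling detect_language() with the text of an HTML document should then return a BCP 47 tag such as "en", or None when the "language" key is missing. Parsing is kept in its own function so it can be checked without network access.
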
>> This has great potential to automate language processing workflows on the web.
>> 
>> - Felix
>> 
Received on Tuesday, 12 July 2016 09:58:10 UTC
