Re: Language detection for web content

Hello Felix,

This is good news. However, for language detection, it's important to 
know what languages the detector supports. Language detection is very 
well known for being rather easy (on documents above a certain length) 
for a given set of languages. However, it's impossible to detect a 
language that the detector doesn't know. So a list of (currently) 
supported languages, and maybe a suggestion of how to contribute 
additional ones, would be very helpful.

Regards,   Martin.

On 2016/07/12 15:18, Felix Sasaki wrote:
> Hi all,
>
> thanks to Mike Smith, there is now a language detection feature in the W3C validator. See
>
>  https://validator.w3.org/nu/?doc=https%3A%2F%2Fw3.org&out=json
>  https://validator.w3.org/nu/?doc=https%3A%2F%2Fw3.org&out=xml
>
> for examples.
>
> Explanation from Mike:
> In the JSON output you should see that the JSON object has a “language” key at the top level, and in the XML you should see that the root “messages” element has a “language” child element.
> The “language” value is a BCP 47 language tag. If “language” is absent from the JSON/XML output, that indicates the language could not be determined with enough confidence.
>
>
> Example in curl:
> curl -X POST -H "Content-Type: text/html; charset=utf-8" -d 'HTML document here' "https://validator.w3.org/nu/?out=json"
>
> Output in JSON:
>
> {
>   "messages": [ ... ],
>   "language": "en"
> }
>
>
> This has great potential for automating language processing workflows on the web.
>
> - Felix
>
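For anyone who wants to script this rather than use curl, here is a minimal Python sketch of the same request. It assumes the endpoint and the JSON shape described in the quoted message above; the names extract_language and detect_language are hypothetical helpers, not part of the validator. A missing "language" key is treated as "not detected with enough confidence", as Mike's explanation says.

```python
import json
import urllib.request
from typing import Optional

# Endpoint taken from the curl example in the quoted message.
VALIDATOR_URL = "https://validator.w3.org/nu/?out=json"


def extract_language(response_text: str) -> Optional[str]:
    """Return the top-level "language" value (a BCP 47 tag), or None.

    None means the key was absent, i.e. the language could not be
    determined with enough confidence.
    """
    data = json.loads(response_text)
    if not isinstance(data, dict):
        return None
    language = data.get("language")
    return language if isinstance(language, str) and language else None


def detect_language(html: str, timeout: float = 30.0) -> Optional[str]:
    """POST an HTML document to the validator and return the detected tag.

    Hypothetical helper: mirrors the curl call, with no retries or
    error handling beyond what urllib raises.
    """
    request = urllib.request.Request(
        VALIDATOR_URL,
        data=html.encode("utf-8"),
        headers={"Content-Type": "text/html; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_language(response.read().decode("utf-8"))
```

Calling detect_language("HTML document here") sends the request; extract_language can be tried offline on a saved response. Note that a None result only says the detector was not confident. It cannot tell you whether the document is in a language the detector does not support.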

Received on Tuesday, 12 July 2016 07:01:22 UTC