Language detection for web content from Felix Sasaki on 2016-07-12 (public-i18n-its-ig@w3.org from July 2016)

From: Felix Sasaki <fsasaki@w3.org>
Date: Tue, 12 Jul 2016 08:18:08 +0200
To: public-i18n-its-ig@w3.org
Message-Id: <38BF99D8-2559-4C19-93B9-78BB2D5B400E@w3.org>

Hi all,

Thanks to Mike Smith, there is now a language detection feature in the W3C validator. See

 https://validator.w3.org/nu/?doc=https%3A%2F%2Fw3.org&out=json
 https://validator.w3.org/nu/?doc=https%3A%2F%2Fw3.org&out=xml

for examples.

Explanation from Mike:
In the JSON output you should see that the JSON object has a “language” key at the top level, and in the XML you should see that the root “messages” element has a “language” child element.
The “language” value is a BCP 47 language tag. If “language” is absent from the JSON/XML output, the language could not be determined with enough confidence.


Example in curl:
curl -X POST -H "Content-Type: text/html; charset=utf-8" -d 'HTML document here' "https://validator.w3.org/nu/?out=json"

Output in JSON:

{
  "messages": [ ... ],
  "language": "en"
}
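
The same request can be sketched in Python for use in a script. This is only a minimal illustration using the standard library; the function names and the User-Agent string are made up for this example, and it assumes the response has the shape shown above.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
VALIDATOR_URL = "https://validator.w3.org/nu/?out=json"


def extract_language(response_text):
    """Return the top-level "language" value (a BCP 47 tag), or None.

    None means the key is absent, i.e. the validator could not
    determine the language with enough confidence.
    """
    data = json.loads(response_text)
    return data.get("language")


def detect_language(html, url=VALIDATOR_URL):
    """POST an HTML document to the validator and return its language."""
    request = urllib.request.Request(
        url,
        data=html.encode("utf-8"),
        headers={
            "Content-Type": "text/html; charset=utf-8",
            # Hypothetical client name, chosen for this sketch only.
            "User-Agent": "language-detection-example/0.1",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return extract_language(response.read().decode("utf-8"))
```

Calling detect_language() with the text of an HTML document should return a tag such as "en", or None when the "language" key is missing.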


This has great potential for automating language processing workflows on the web.

- Felix

Received on Tuesday, 12 July 2016 06:18:24 UTC