Re: Language detection for web content

Felix Sasaki <fsasaki@w3.org>, 2016-07-12 11:57 +0200:
> 
> Thanks for the positive feedback and the good point about listing the supported languages, Martin. I am putting Mike directly into the loop, in case he knows what languages are supported. I browsed the underlying library
> https://github.com/shuyo/language-detection
> but did not find a list of languages. See also
> https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
> and this presentation
> https://github.com/shuyo/language-detection
> The GitHub project home page says that 53 languages are supported with 99% precision.

Martin wrote:
> > This is good news. However, for language detection, it's important to
> > know what languages the detector supports.

Agreed that it’s important to know the list of supported languages. The
list supported in the trunk of the upstream library is here:

  https://github.com/shuyo/language-detection/blob/wiki/LanguageList.md

That’s the set of 53 which other docs there refer to.

> > Language detection is very well known for being rather easy (on
> > documents above a certain length) for a given set of languages.
> > However, it's impossible to detect a language that the detector doesn't
> > know. So a list of (currently) supported languages, and maybe a
> > suggestion of how to contribute to additional ones, would be very
> > helpful.

As far as how to contribute additional ones, the data files it relies on
are at https://github.com/shuyo/language-detection/tree/master/profiles

So it would be a matter of opening a pull request against the GitHub repo
at https://github.com/shuyo/language-detection/pulls to add a new
“profile”. I see there’s actually an open one there for Esperanto now.
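
Before proposing a new profile, it can help to check which ones a local
checkout already contains. The following is only a minimal sketch: it
assumes the repo has been cloned locally and that each file in its
profiles directory is named after the language code it covers. The class
name ProfileLister and its method are hypothetical helpers, not part of
the library.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of the language-detection library.
class ProfileLister {

    // Returns the sorted names of the regular files directly inside
    // profilesDir. Assumption: each profile file is named after its
    // language code, so the file names double as the language list.
    static List<String> listLanguageCodes(Path profilesDir) throws IOException {
        try (Stream<Path> entries = Files.list(profilesDir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .map(p -> p.getFileName().toString())
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to ./profiles when no directory is given.
        Path dir = Paths.get(args.length > 0 ? args[0] : "profiles");
        List<String> codes = listLanguageCodes(dir);
        System.out.println(codes.size() + " profiles: " + codes);
    }
}
```

Running it from the root of a checkout with no arguments would print the
number of profile files it finds, followed by their names.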

The data files in the repo are generated from Wikipedia abstracts found at,
e.g., https://dumps.wikimedia.org/enwiki/ and https://dumps.wikimedia.org/arwiki/
and so on. https://dumps.wikimedia.org/backup-index.html links to all of them.

For generating new profiles, the repo provides a command-line tool:

  https://github.com/shuyo/language-detection/blob/master/lib/langdetect.jar

The usage instructions for that tool are here:

  https://github.com/shuyo/language-detection/blob/wiki/Tools.md#generate-language-profile
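
For intuition about what such a profile holds: as far as I understand it,
the library works from character n-gram frequencies, so a profile is
essentially a table of short character sequences and how often each one
occurs in the training text. The sketch below is a simplified
illustration of that counting step only. It is not the library's code or
its file format, the class name NGramCounter is hypothetical, and it
counts UTF-16 code units rather than applying any text normalization.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not the library's profile generator.
class NGramCounter {

    // Counts every character n-gram of length 1..maxN in the text.
    // A TreeMap keeps the result sorted by n-gram for readable output.
    static Map<String, Integer> count(String text, int maxN) {
        Map<String, Integer> freq = new TreeMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                freq.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        // Prints the 1- to 3-gram counts for a short sample string.
        System.out.println(count("la lingvo", 3));
    }
}
```

A real profile would be built from a much larger corpus, such as the
Wikipedia abstracts mentioned above, using the tool's own command line
rather than code like this.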

If there’s interest in adding new languages beyond the 53 the current
upstream version supports, and if the upstream maintainer is not responsive
to pull requests for adding new ones (that PR for Esperanto has been open for
7 months now without being merged…), I’d be willing to maintain a fork of the
library and do releases of it (including publishing to Maven Central).

  —Mike

-- 
Michael[tm] Smith https://people.w3.org/mike

Received on Tuesday, 12 July 2016 10:39:35 UTC