
Re: Tagging a document with language

From: J.E. Fritz <fritz@gems.vcu.edu>
Date: Fri, 07 Jun 1996 10:06:50 -0400 (EDT)
To: robots@webcrawler.com
Cc: www-talk@w3.org
Message-Id: <Pine.PCW.3.93.960607093940.3582B-100000@acepc44.ace.vcu.edu>
On Fri, 7 Jun 1996, Tronche Ch. le pitre wrote:

...
> that incidentally I found. But I'm surprised by the increasing number
> of documents that I can't understand, simply because they're written
> in a foreign language (foreign to me, that is, neither French nor
> English), not to speak of non-ISO-8859 files, such as Japanese ones.
...
> A more interesting approach is the indexer trying to figure out the
> language of the document, perhaps based on a statistical analysis.
> Problems will probably arise with mixed-language files.

An easy way to tell might be to examine stopwords.  If a document
has lots of words like "an", "to", "be", "by", "of", "if", "a", "the",
"in", "this", "then", "it", "at" and "some" then it probably contains at
least some English.  "Le", "la", "les", "un", "une", "en", "au", "de",
"des" point to French.

The advantage is that you would need a relatively small number of words
for each language, not the whole dictionary.

Of course, this approach might not separate very closely related
languages, since they tend to share many of the same short words.
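
The stopword idea above could be sketched in Python roughly as follows.
This is only an illustration under assumptions: the word lists are just
the examples given in this message, and the names STOPWORDS,
stopword_scores and guess_languages, as well as the 0.1 threshold, are
made up for the sketch rather than tuned on real documents.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stopword lists: only the words named in this
# message. A real indexer would want longer, curated lists per language.
STOPWORDS = {
    "english": {"an", "to", "be", "by", "of", "if", "a", "the",
                "in", "this", "then", "it", "at", "some"},
    "french": {"le", "la", "les", "un", "une", "en", "au", "de", "des"},
}


def stopword_scores(text):
    """Return, per language, the fraction of words that are its stopwords."""
    # Runs of letters only (Unicode-aware), lowercased.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return {
        lang: sum(counts[w] for w in stopwords) / total
        for lang, stopwords in STOPWORDS.items()
    }


def guess_languages(text, threshold=0.1):
    """List every language whose stopword fraction reaches the threshold.

    More than one language can be returned, which is one crude way to
    flag a mixed-language document. The threshold is an arbitrary guess.
    """
    scores = stopword_scores(text)
    return sorted(lang for lang, score in scores.items() if score >= threshold)


if __name__ == "__main__":
    print(guess_languages("The cat sat in the garden at the back of the house"))
    print(guess_languages("Le chat est dans la maison de la ville"))
```

Because every language over the threshold is reported, a page that mixes
English and French should come back with both labels instead of being
forced into one.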

-Fritz
Received on Friday, 7 June 1996 10:02:38 GMT
