- From: J.E. Fritz <fritz@gems.vcu.edu>
- Date: Fri, 07 Jun 1996 10:06:50 -0400 (EDT)
- To: robots@webcrawler.com
- Cc: www-talk@w3.org
On Fri, 7 Jun 1996, Tronche Ch. le pitre wrote:

...
> that incidentally I found. But I'm suprised by the increasing number
> of documents that I can't understand, simply because they're written
> in a foreign language (foreign to me, that is nor french nor english),
> not to speak of non iso-8859 files, such as japanese ones.
...
> A more interesting approach is the indexer trying to figure the
> language of the document, based may be on a statistical analysis.
> Probably, problems will arise with mixed languages files.

An easy way to tell might be by examination of stopwords. If a document
has lots of words like "an", "to", "be", "by", "of", "if", "a", "the",
"in", "this", "then", "it", "at" and "some" then it probably contains at
least some English. "Le", "la", "les", "un", "une", "en", "au", "de",
"des" point to French.

The advantage is that you would need a relatively small number of words
for each language, not the whole dictionary. Of course this approach
might not separate very closely related languages.

-Fritz
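
A rough sketch of the stopword idea above, in Python. The word lists are
the ones given in the message; the function names, the tokenizing regex
and the 0.1 threshold are assumptions made for illustration, not part of
any existing indexer. Each language is scored by the fraction of the
document's words that appear in its stopword list, and every language
scoring above the threshold is reported, so a mixed-language file can
return more than one.

```python
import re
from collections import Counter

# Stopword lists copied from the message; illustrative, far from complete.
STOPWORDS = {
    "english": {"an", "to", "be", "by", "of", "if", "a", "the",
                "in", "this", "then", "it", "at", "some"},
    "french": {"le", "la", "les", "un", "une", "en", "au", "de", "des"},
}


def stopword_scores(text):
    """Return, per language, the fraction of words that are stopwords."""
    # Runs of letters only (Unicode-aware); digits and punctuation split words.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return {lang: 0.0 for lang in STOPWORDS}
    counts = Counter(words)
    return {
        lang: sum(counts[w] for w in stops) / len(words)
        for lang, stops in STOPWORDS.items()
    }


def guess_languages(text, threshold=0.1):
    """List languages whose stopword fraction reaches the (assumed) threshold,
    highest score first. An empty list means no language was recognized."""
    scores = stopword_scores(text)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [lang for lang, score in ranked if score >= threshold]


if __name__ == "__main__":
    print(guess_languages("The cat sat by the door of a friend."))
    print(guess_languages("Le chat est dans la maison de un ami."))
```

As noted above, closely related languages share many short words, so
their scores would likely sit close together and a scheme like this
could not be trusted to tell them apart.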
Received on Friday, 7 June 1996 10:02:38 UTC