- From: Mike McKenna <mgm@globalisation.org>
- Date: Tue, 23 Mar 2004 15:47:25 -0800
- To: public-i18n-ws@w3.org
- Message-ID: <4060CC8D.9070901@globalisation.org>
Great idea! Here's my first cut on 6.3 - Natural Language Searching
Cheers,
Mike
===============================
6.3 Natural Language Text Search
Invariably, somewhere along the line, an actual human will use a client
application to look for something. When that happens, services down the
line may or may not understand the language of the client. To
accommodate this, natural language processing is used. The two primary
cases are language-neutral and language-specific search.
6.3.1 Language-Neutral Natural Language Text Search
Most search engines do not understand language, but do understand
patterns and proximity. Patterns refer to wildcards and whitespace for
full-text search. However, many Asian languages do not use whitespace
to separate words in running text, and a search engine may therefore
use a scheme where every character is treated as a word.
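As a rough illustration of the "every character is a word" scheme, the
following Python sketch splits on whitespace and additionally emits each
Han or kana character as its own token. The code point ranges and the
function names are assumptions made for this example only; a real
engine would use a fuller script table or a proper segmenter.

```python
def is_cjk(ch):
    """Assumed ranges: Hiragana/Katakana (U+3040-30FF) and the main Han block."""
    cp = ord(ch)
    return 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF


def tokenize(text):
    """Whitespace-delimited words, except that each CJK character is its own token."""
    tokens = []
    word = []
    for ch in text:
        if ch.isspace() or is_cjk(ch):
            # Either case ends the word collected so far.
            if word:
                tokens.append("".join(word))
                word = []
            if not ch.isspace():
                tokens.append(ch)
        else:
            word.append(ch)
    if word:
        tokens.append("".join(word))
    return tokens


print(tokenize("open wrench"))
print(tokenize("扳子 wrench"))
```

Indexing single characters keeps the engine language-neutral at the
cost of precision, since unrelated words that share a character will
also match.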
Character form normalization
For language-neutral applications, text should be normalized to a
single form (such as fully decomposed base + combining characters, or
fully precomposed) according to Unicode Standard Annex #15, Unicode
Normalization Forms (http://www.unicode.org/reports/tr15/), before
comparisons are made.
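A minimal sketch of this step in Python, using the standard library's
unicodedata.normalize. The helper name same_text and the choice of NFC
as the default form are assumptions for the example; what matters is
that both sides of a comparison use the same form.

```python
import unicodedata


def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same Unicode form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


precomposed = "D\u00fcrst"   # u-with-diaeresis as one code point
decomposed = "Du\u0308rst"   # 'u' followed by a combining diaeresis

print(precomposed == decomposed)           # False: the code points differ
print(same_text(precomposed, decomposed))  # True once both are normalized
```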
Catalog or Index in Multiple Languages
Catalogs or indexes that support more than one language should contain
language variants of keywords. These can be populated automatically,
with translations done semi-automatically, using context to aid in
choosing the right alternate terms.
This gives one catalog or index item with its description in many
languages. The service wants to update price and quantity in one place
per item and have that reflected across all languages. The client wants
to search for items in their own language.
E.g., in the following business XML, a catalog item is defined as follows:
<elementtype name="Product">
<model>
<sequence>
. . .
<element type="ShortDescription" occurs="*"/>
<element type="LongDescription" occurs="*"/>
. . .
Each description element can occur zero or more times. ShortDescription
is defined as follows:
<elementtype name="ShortDescription">
<model>
<string/>
</model>
<attdef name="lang" datatype="xmllang" prefix="xml">
<default>en</default>
</attdef>
</elementtype>
You should then be able to support the following:
<Product Type="Good" SchemaCategoryRef="C43171803">
<ProductID>154723-005</ProductID>
<Manufacturer PartnerRef="Acme Tools"></Manufacturer>
. . .
<ShortDescription xml:lang="en">Wrench</ShortDescription>
<ShortDescription xml:lang="en-GB">Spanner</ShortDescription>
<ShortDescription xml:lang="da">fladnoegle</ShortDescription>
<ShortDescription xml:lang="es-ES">llave abierta</ShortDescription>
<ShortDescription xml:lang="es-MX">llave inglesa</ShortDescription>
<ShortDescription xml:lang="fr-FR">clef à fourche</ShortDescription>
<ShortDescription xml:lang="de">Gabelschluessel</ShortDescription>
<ShortDescription xml:lang="it">chiave a forchetta</ShortDescription>
<ShortDescription xml:lang="ja">スパナ</ShortDescription>
<ShortDescription xml:lang="ko">스패너</ShortDescription>
<ShortDescription xml:lang="nl">vorkvormige sleutel</ShortDescription>
<ShortDescription xml:lang="pt-PT">chave fixa</ShortDescription>
<ShortDescription xml:lang="pt-BR">chave de boca</ShortDescription>
<ShortDescription xml:lang="zh-CN">扳子</ShortDescription>
<ShortDescription xml:lang="zh-TW">板鉗</ShortDescription>
. . .
</Product>
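To make the client side concrete, here is a sketch in Python of looking
up a description for the client's language and of matching a keyword
against every language variant. It uses xml.etree.ElementTree from the
standard library. The trimmed-down catalog item, the function names,
and the fallback rule (drop trailing subtags, then fall back to "en")
are assumptions for illustration, not part of the schema above.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

CATALOG_ITEM = """
<Product Type="Good">
  <ProductID>154723-005</ProductID>
  <ShortDescription xml:lang="en">Wrench</ShortDescription>
  <ShortDescription xml:lang="en-GB">Spanner</ShortDescription>
  <ShortDescription xml:lang="es-MX">llave inglesa</ShortDescription>
  <ShortDescription xml:lang="de">Gabelschluessel</ShortDescription>
</Product>
"""


def short_description(product, lang, default="en"):
    """Best description for lang: exact tag, then shorter prefixes, then default."""
    by_lang = {
        el.get(XML_LANG, default).lower(): el.text
        for el in product.findall("ShortDescription")
    }
    tag = lang.lower()
    while tag:
        if tag in by_lang:
            return by_lang[tag]
        tag = tag.rpartition("-")[0]  # "de-at" -> "de" -> ""
    return by_lang.get(default)


def matches(product, keyword):
    """True if the keyword occurs in any language variant of the description."""
    needle = keyword.casefold()
    return any(
        needle in (el.text or "").casefold()
        for el in product.findall("ShortDescription")
    )


product = ET.fromstring(CATALOG_ITEM)
print(short_description(product, "en-GB"))  # Spanner
print(short_description(product, "de-AT"))  # Gabelschluessel
print(matches(product, "llave"))            # True
```

Because price and quantity live once on the Product, only the
descriptions vary by language, which is the point of the design above.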
6.3.2 Language-specific Natural Language Text Search
Most search engines that have any linguistic capabilities are tuned to
a specific language such as English, German, or French. This allows
techniques such as stemming and ignoring stop-words to operate
according to the unique characteristics of the language in which the
engine is operating.
Keyword searching
When searching for keywords, language must be considered to resolve
some items such as abbreviations. E.g., in the string, "422 St. Jerome
St.", "St." could be either "Saint" or "Street".
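One way to handle this without context is to expand each ambiguous
abbreviation into all of its candidates and search for every
combination. The Python sketch below does that with a small per-language
table; the table contents and the function name are hypothetical and
would come from a real language-specific resource.

```python
import itertools

# Hypothetical per-language abbreviation table, for illustration only.
ABBREVIATIONS = {
    "en": {"St.": ["Saint", "Street"]},
}


def expand_abbreviations(text, lang):
    """Return every reading of text with known abbreviations spelled out."""
    table = ABBREVIATIONS.get(lang, {})
    choices = [table.get(token, [token]) for token in text.split()]
    return [" ".join(combo) for combo in itertools.product(*choices)]


for candidate in expand_abbreviations("422 St. Jerome St.", "en"):
    print(candidate)
```

The four candidates include the intended "422 Saint Jerome Street" along
with three unlikely readings; ranking or position rules (for example,
a trailing "St." in an English address usually means "Street") would
be a language-specific refinement.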
Gender and plural variants
Note that some terms have more than one form depending on the gender or
number of the referent. As an example, "Dr. Alvarez" or "Doctor
Alvarez" in English could be either "Dr" or "Dra" (for "Doctor" or
"Doctora") in Spanish. Therefore, to increase the number of valid hits
in the absence of context, a service should match all variants of a
term when it is translated to another language.
Like clauses
When operating in a specific language, further normalization may be
required in addition to abbreviation expansion and character
normalization. This is to accommodate variant spellings of the same
word. In German, for instance, a search for "Dürst" should also return
"Duerst" to allow searching across legacy and alternate systems.
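A sketch of such a German-specific folding step in Python follows. Both
the query and the indexed value are reduced to the same key before
comparison. The mapping table and the function names are assumptions
for the example, and the folding is deliberately lossy: it only suits
matching, not display.

```python
import unicodedata

# Conventional German transliterations, applied after NFC normalization.
GERMAN_FOLD = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def fold_german(text):
    """Reduce German spelling variants to one case-insensitive search key."""
    text = unicodedata.normalize("NFC", text)
    return text.translate(GERMAN_FOLD).casefold()


def find_names(query, names):
    """Return the names whose folded form equals the folded query."""
    key = fold_german(query)
    return [name for name in names if fold_german(name) == key]


print(find_names("Dürst", ["Duerst", "Dürst", "Durst"]))
```

Note that "Durst" is not returned: dropping the umlaut entirely is a
different (and in German usually incorrect) fold, so whether to accept
it is a separate design decision.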
Use of intermediary translation and dictionary look-up service
To allow a service to provide search to clients in other languages, the
service should perform the search more than once, depending on
implementation design: first with the original text as submitted by the
client, and then one or more additional times after submitting the
original query to a translation or dictionary look-up service.
As an example, the address "422 St. Jerome St." could also be
represented as:
en: 422 Saint Jerome Street
fr: Rue De 422 Saints Jerome
es: Calle De 422 Santos Jerome
de: 422 Heiliger Jerome Straße
ja: 422 人の聖者のJerome の通り
The query would look something like this:
Client ==> <query xml:lang="lang0"> ==> service ==> <query xml:lang="lang0">
                                           ^== look-up service
                                                    v
                                                ==> <query xml:lang="lang1">
                                                ==> <query xml:lang="lang2">
                                                    :
                                                ==> <query xml:lang="langN">
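The flow above can be sketched in Python as a fan-out: search once with
the original query, then once per translated query, and merge the hits.
The search and translate callables stand in for the real search engine
and look-up service; the stub index, dictionary, and the second product
ID below are made up for the example.

```python
def multilingual_search(query, lang, search, translate, target_langs):
    """Search in the client's language first, then in each target language."""
    queries = [(lang, query)]
    for target in target_langs:
        if target == lang:
            continue
        translated = translate(query, lang, target)
        if translated:  # the look-up service may have no entry
            queries.append((target, translated))

    hits, seen = [], set()
    for q_lang, q_text in queries:
        for hit in search(q_text, q_lang):
            if hit not in seen:  # merge results without duplicates
                seen.add(hit)
                hits.append(hit)
    return hits


# Hypothetical stand-ins for the index and the look-up service.
INDEX = {
    ("en", "wrench"): ["154723-005"],
    ("de", "gabelschluessel"): ["154723-005", "998877-001"],
}
DICTIONARY = {("en", "de", "wrench"): "Gabelschluessel"}


def stub_search(text, lang):
    return INDEX.get((lang, text.casefold()), [])


def stub_translate(text, source, target):
    return DICTIONARY.get((source, target, text.casefold()))


print(multilingual_search("Wrench", "en", stub_search, stub_translate,
                          ["en", "de", "fr"]))
```

Hits from the original-language query come first, so results the client
can read directly are not pushed down by translated matches.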
Phonetic searches
Note that phonetic searches, such as "Soundex", are usually tuned to
specific language characteristics. Soundex, for example, was first
patented in 1918 and was later applied to indexing U.S. census records,
to allow phonetic grouping of English surnames. It has poor precision,
is unable to handle multicultural names, produces many false positives,
and misses many potentially correct terms. That being said, there
exists proprietary phonological name-matching software that produces
better results across languages and cultures, but it must be tested and
implemented with the caveat that phonetic searching across languages is
inherently fraught with errors due to dialectal differences.
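For reference, a compact Python sketch of the commonly described
American Soundex rules shows why the algorithm is so closely tied to
English: it keeps the first letter, maps the remaining consonants to
six digit classes, ignores vowels, and pads to four characters. The
function name and the handling of non-ASCII letters (they are simply
dropped) are choices made for this example.

```python
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """American Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Lee"), soundex("Li"))         # L000 L000
```

Any letter outside A-Z is discarded before coding, so names written in
other scripts produce an empty key, and names from other languages
collapse into very coarse buckets.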
Received on Tuesday, 23 March 2004 18:51:49 UTC