6.3 - Natural Language Search from Mike McKenna on 2004-03-23 (public-i18n-ws@w3.org from March 2004)

From: Mike McKenna <mgm@globalisation.org>
Date: Tue, 23 Mar 2004 15:47:25 -0800
To: public-i18n-ws@w3.org
Message-ID: <4060CC8D.9070901@globalisation.org>
Great idea!  Here's my first cut on 6.3 - Natural Language Searching

Cheers,

       Mike____

===============================


      6.3 Natural Language Text Search

Invariably, somewhere along the line, an actual human will use a client
application to look for something.  When that happens, services down the
line may or may not understand the language of the client.  To accommodate
this, natural language processing is used.  The two primary cases are
language-neutral and language-specific search.


        6.3.1 Language-Neutral Natural Language Text Search

Most search engines do not understand language, but do understand
patterns and proximity.  Patterns refer to wildcards and whitespace for
full-text search.  However, many Asian languages do not use whitespace
to separate words in running text, so a search engine may instead use a
scheme where every character is treated as a word.
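
As a rough illustration of the "every character is a word" scheme, here
is a minimal Python sketch.  The function names and the code point
ranges (Han ideographs plus Hiragana and Katakana only) are assumptions
made for the example, not part of any particular search engine.

```python
def is_cjk(ch: str) -> bool:
    """Rough test: Hiragana/Katakana (U+3040-30FF) or common Han (U+4E00-9FFF)."""
    cp = ord(ch)
    return 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF


def tokenize(text: str) -> list:
    """Split on whitespace, but emit each CJK character as its own token."""
    tokens = []
    word = []
    for ch in text:
        if is_cjk(ch) or ch.isspace():
            # Either case ends the whitespace-delimited word in progress.
            if word:
                tokens.append("".join(word))
                word = []
            if is_cjk(ch):
                tokens.append(ch)
        else:
            word.append(ch)
    if word:
        tokens.append("".join(word))
    return tokens


print(tokenize("open wrench"))
print(tokenize("扳子 wrench"))
```

A real engine would more likely index overlapping character pairs or
use a dictionary-based segmenter; single characters are simply the
easiest case to show.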


          Character form normalization

For language-neutral applications, text should be normalized to a single
form (such as base+combining character, i.e. NFD, or all precomposed,
i.e. NFC) according to Unicode Standard Annex #15, Unicode Normalization
Forms (http://www.unicode.org/reports/tr15/), before comparisons are made.
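
A minimal Python sketch of normalizing before comparison, using the
standard library's unicodedata module.  The helper name
normalize_for_compare and the choice of NFC as the default are
assumptions for illustration; either form works as long as both sides
of the comparison use the same one.

```python
import unicodedata


def normalize_for_compare(text: str, form: str = "NFC") -> str:
    """Return text in one Unicode normalization form (NFC by default)."""
    return unicodedata.normalize(form, text)


precomposed = "D\u00fcrst"   # U+00FC LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "Du\u0308rst"   # "u" followed by U+0308 COMBINING DIAERESIS

# The raw strings differ code point by code point ...
print(precomposed == decomposed)
# ... but compare equal once both are in the same normalization form.
print(normalize_for_compare(precomposed) == normalize_for_compare(decomposed))
```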


          Catalog or Index in Multiple Languages

Catalogs or indexes that must support more than one language should
contain language variants of keywords.  These can be populated
automatically, with translations done semi-automatically, using context
to aid in creating the right alternate terms.

This makes it possible to have one catalog or index item with its
description in many languages.  The service wants to be able to update
price and quantity in one place per item and have that reflected across
all languages.  The client wants to search for items in their own
language.

For example, a catalog item might be defined in a business XML schema as
follows:

    <elementtype name="Product">
        <model>
            <sequence>
              . . .
              <element type="ShortDescription" occurs="*"/>
              <element type="LongDescription" occurs="*"/>
              . . .


Each description can occur zero or more times.  ShortDescription is
defined as follows:

    <elementtype name="ShortDescription">
         <model>
                  <string/>
         </model>
         <attdef name="lang" datatype="xmllang" prefix="xml">
              <default>en</default>
         </attdef>
    </elementtype>

 
You should then be able to support the following:

    <Product Type="Good" SchemaCategoryRef="C43171803">
      <ProductID>154723-005</ProductID>
      <Manufacturer PartnerRef="Acme Tools"></Manufacturer>
                       . . .
      <ShortDescription xml:lang="en">Wrench</ShortDescription>
      <ShortDescription xml:lang="en-GB">Spanner</ShortDescription>
      <ShortDescription xml:lang="da">fladnoegle</ShortDescription>
      <ShortDescription xml:lang="es-ES">llave abierta</ShortDescription>
      <ShortDescription xml:lang="es-MX">llave inglesa</ShortDescription>
      <ShortDescription xml:lang="fr-FR">clef à fourche</ShortDescription>
      <ShortDescription xml:lang="de">Gabelschluessel</ShortDescription>
      <ShortDescription xml:lang="it">chiave a forchetta</ShortDescription>
      <ShortDescription xml:lang="ja">スパナ</ShortDescription>
      <ShortDescription xml:lang="ko">스패너</ShortDescription>
      <ShortDescription xml:lang="nl">vorkvormige sleutel</ShortDescription>
      <ShortDescription xml:lang="pt-PT">chave fixa</ShortDescription>
      <ShortDescription xml:lang="pt-BR">chave de boca</ShortDescription>
      <ShortDescription xml:lang="zh-CN">扳子</ShortDescription>
      <ShortDescription xml:lang="zh-TW">板鉗</ShortDescription>
                       . . .
    </Product>
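
To show how a client-side lookup against such an item might work, here
is a minimal Python sketch using xml.etree.ElementTree.  The function
short_description and its fallback rule (strip subtags from the right,
then fall back to the schema default "en") are assumptions for
illustration; the catalog fragment is a shortened copy of the example
above.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

CATALOG_ITEM = """
<Product Type="Good">
  <ProductID>154723-005</ProductID>
  <ShortDescription xml:lang="en">Wrench</ShortDescription>
  <ShortDescription xml:lang="en-GB">Spanner</ShortDescription>
  <ShortDescription xml:lang="es-MX">llave inglesa</ShortDescription>
</Product>
"""


def short_description(product, requested, default="en"):
    """Pick the description whose xml:lang best matches the requested tag."""
    by_lang = {
        el.get(XML_LANG, default).lower(): el.text
        for el in product.findall("ShortDescription")
    }
    tag = requested.lower()
    while tag:
        if tag in by_lang:
            return by_lang[tag]
        # Drop the last subtag: "en-us" -> "en" -> "".
        tag = tag.rpartition("-")[0]
    return by_lang.get(default)


product = ET.fromstring(CATALOG_ITEM)
print(short_description(product, "en-GB"))
print(short_description(product, "en-US"))
```

Price and quantity would live once on the Product element, so only the
descriptions vary by language.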


        6.3.2 Language-Specific Natural Language Text Search

Most search engines that have any linguistic characteristics are tuned
to a specific language such as English, German, or French.  This allows
techniques such as stemming and ignoring stop-words to operate
according to the unique characteristics of the language they are
operating in.


          Keyword searching

When searching for keywords, language must be considered to resolve
some items such as abbreviations.  E.g., in the string "422 St. Jerome
St.", each "St." could be either "Saint" or "Street".
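
One way to cope with this, sketched below in Python, is to expand each
ambiguous abbreviation into every candidate reading and search for all
of them.  The ABBREVIATIONS table and the expand function are
hypothetical; a real service would load per-language tables from a
dictionary resource.

```python
from itertools import product

# Hypothetical per-language abbreviation table, for illustration only.
ABBREVIATIONS = {
    "en": {"St.": ["Saint", "Street"]},
}


def expand(query: str, lang: str) -> list:
    """Return every reading of the query with abbreviations spelled out."""
    table = ABBREVIATIONS.get(lang, {})
    # Each token contributes either its expansions or just itself.
    choices = [table.get(token, [token]) for token in query.split()]
    return [" ".join(combo) for combo in product(*choices)]


for candidate in expand("422 St. Jerome St.", "en"):
    print(candidate)
```

Two ambiguous tokens give four candidates, only one of which ("422
Saint Jerome Street") is the intended reading; context or ranking has
to sort out the rest.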


          Gender and plural variants

Note that some terms have more than one form depending on the gender or
number of the referent.  As an example, "Dr. Alvarez" or "Doctor
Alvarez" in English could correspond to either "Dr." ("Doctor") or
"Dra." ("Doctora") in Spanish.  Therefore, to increase the number of
valid hits in the absence of context, a service should match all
variants of a term when it is translated to another language.


          Like clauses

When operating in a specific language, further normalization may be
required in addition to abbreviation expansion and character
normalization.  This is to accommodate variant spellings of the same
word.  In German, for instance, a search for "Dürst" should also return
"Duerst" to allow searching across legacy and alternate systems.
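
A minimal Python sketch of one way to do this: fold both the query and
the indexed term to a common search key.  The GERMAN_FOLDS table and
the function german_search_key are assumptions for illustration and
cover only the usual umlaut and sharp-s transcriptions.

```python
import unicodedata

# Conventional ASCII transcriptions used for German (assumed table).
GERMAN_FOLDS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def german_search_key(term: str) -> str:
    """Map variant German spellings of a term to one comparable key."""
    # Compose first so "u" + combining diaeresis is seen as "ü".
    composed = unicodedata.normalize("NFC", term)
    folded = "".join(GERMAN_FOLDS.get(ch, ch) for ch in composed)
    return folded.casefold()


print(german_search_key("Dürst") == german_search_key("Duerst"))
```

The same table would be wrong for other languages (Swedish or Turkish,
for example, treat these letters differently), which is why this step
belongs to language-specific search.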


          Use of intermediary translation and dictionary look-up service

To allow a service to provide search services to clients in other
languages, the service should, depending on implementation design, run
the search more than once: first with the original text as submitted by
the client, and then one or more times with the results of submitting
the original query to a translation or dictionary look-up service.

As an example, the address "422 St. Jerome St." could also be
represented as:

    en:      422 Saint Jerome Street
    fr:        Rue De 422 Saints Jerome
    es:      Calle De 422 Santos Jerome
    de:      422 Heiliger Jerome Straße
    ja:        422 人の聖者のJerome の通り


The query would look something like this:

Client ==> <query xml:lang="lang0"> ==> service ==> <query xml:lang="lang0">
                                    ^== look-up service
                                    v
                                    ==> <query xml:lang="lang1">
                                    ==> <query xml:lang="lang2">
                                    :
                                    ==> <query xml:lang="langN">
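
The flow above can be sketched in Python as follows.  Everything here
is hypothetical: multilingual_search, the translate and search
callables it expects, and the in-memory stand-ins (including the second
product ID) exist only to make the example runnable.

```python
from typing import Callable, Iterable, List


def multilingual_search(
    query: str,
    source_lang: str,
    target_langs: Iterable[str],
    translate: Callable[[str, str, str], str],
    search: Callable[[str, str], List[str]],
) -> List[str]:
    """Search in the client's language first, then in each translation."""
    hits = []
    seen = set()

    def collect(results):
        # Keep first-seen order and drop duplicate hits.
        for hit in results:
            if hit not in seen:
                seen.add(hit)
                hits.append(hit)

    collect(search(query, source_lang))
    for lang in target_langs:
        if lang != source_lang:
            collect(search(translate(query, source_lang, lang), lang))
    return hits


# Hypothetical stand-ins for the look-up service and the search index.
_FAKE_DICTIONARY = {("wrench", "en", "es-MX"): "llave inglesa"}
_FAKE_INDEX = {
    "en": {"wrench": ["154723-005"]},
    "es-MX": {"llave inglesa": ["154723-005", "998877-001"]},
}


def fake_translate(query, src, dst):
    return _FAKE_DICTIONARY.get((query.lower(), src, dst), query)


def fake_search(query, lang):
    return _FAKE_INDEX.get(lang, {}).get(query.lower(), [])


print(multilingual_search("Wrench", "en", ["es-MX"], fake_translate, fake_search))
```

Hits from the original-language search come first, so a poor
translation can add noise but never hides a direct match.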


          Phonetic searches

Note that phonetic searches, such as "Soundex", are usually tuned to
specific language characteristics.  Soundex, for example, was first
patented in 1918 and was later used to index U.S. census records, to
allow phonetic indexing of English surnames.  It has poor precision, is
unable to handle multicultural names, produces many false positives, and
misses many potentially correct terms.  That being said, proprietary
phonological name-matching software exists that produces better
results across languages and cultures, but it must be tested and
implemented with the caveat that phonetic searching across languages is
inherently error-prone because of dialectal differences.
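
For reference, a compact Python sketch of the common "American Soundex"
rules follows.  It is an illustration under the usual textbook rules
(same-coded neighbours collapse, vowels separate codes, "h" and "w" do
not), not a vetted implementation, and it silently drops anything
outside a-z, which is exactly the limitation described above.

```python
# Letter groups for Soundex digits 1..6.
_GROUPS = ("bfpv", "cgjkqsxz", "dt", "l", "mn", "r")
_CODES = {
    letter: str(digit)
    for digit, letters in enumerate(_GROUPS, start=1)
    for letter in letters
}


def soundex(name: str) -> str:
    """Return the four-character Soundex code, or "" if no a-z letters."""
    letters = [ch for ch in name.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate equal codes
        code = _CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # vowels reset this to "", allowing repeats
    return (result + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))
print(repr(soundex("스패너")))
```

"Robert" and "Rupert" collapse to the same code, while a name written
in Hangul produces no code at all.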
Received on Tuesday, 23 March 2004 18:51:49 UTC