- From: Mike McKenna <mgm@globalisation.org>
- Date: Tue, 23 Mar 2004 15:47:25 -0800
- To: public-i18n-ws@w3.org
- Message-ID: <4060CC8D.9070901@globalisation.org>
Great idea! Here's my first cut on 6.3 - Natural Language Searching

Cheers,
Mike

===============================

6.3 Natural Language Text Search

Invariably, somewhere along the line, an actual human will use a client application to look for something. When that happens, services down the line may or may not understand the language of the client. To accommodate this, natural language processing is used. The two primary cases are language-neutral and language-specific search.

6.3.1 Language-Neutral Natural Language Text Search

Most search engines do not understand language, but do understand patterns and proximity. Patterns refer to wildcards and whitespace for full-text search. However, many Asian languages do not separate words with whitespace in running text, so a search engine may instead use a scheme in which every character is treated as a word.

Character form normalization

For language-neutral applications, text should be normalized to a single form (such as base + combining characters, or all precomposed) according to Unicode Standard Annex #15, Unicode Normalization Forms (http://www.unicode.org/reports/tr15/), before comparisons are made.

Catalog or Index in Multiple Languages

Catalogs or indexes that support more than one language should contain language variants of keywords. These can be populated automatically, with translations done semi-automatically, using context to aid in choosing the right alternate terms. The goal is to have one catalog or index item with its description in many languages. The service wants to be able to update price and quantity in one place per item and have that reflected across all languages. The client wants to search for items in their own language.

E.g., in the following business XML, a catalog item is defined as follows:

    <elementtype name="Product">
      <model>
        <sequence>
          . . .
          <element type="ShortDescription" occurs="*"/>
          <element type="LongDescription" occurs="*"/>
          . . .

The Descriptions can occur from zero to many times.
The Description is defined as follows:

    <elementtype name="ShortDescription">
      <model>
        <string/>
      </model>
      <attdef name="lang" datatype="xmllang" prefix="xml">
        <default>en</default>
      </attdef>
    </elementtype>

You should then be able to support the following:

    <Product Type="Good" SchemaCategoryRef="C43171803">
      <ProductID>154723-005</ProductID>
      <Manufacturer PartnerRef="Acme Tools"></Manufacturer>
      . . .
      <ShortDescription xml:lang="en">Wrench</ShortDescription>
      <ShortDescription xml:lang="en-GB">Spanner</ShortDescription>
      <ShortDescription xml:lang="da">fladnoegle</ShortDescription>
      <ShortDescription xml:lang="es-ES">llave abierta</ShortDescription>
      <ShortDescription xml:lang="es-MX">llave inglesa</ShortDescription>
      <ShortDescription xml:lang="fr-FR">clef à fourche</ShortDescription>
      <ShortDescription xml:lang="de">Gabelschluessel</ShortDescription>
      <ShortDescription xml:lang="it">chiave a forchetta</ShortDescription>
      <ShortDescription xml:lang="ja">スパナ</ShortDescription>
      <ShortDescription xml:lang="ko">스패너</ShortDescription>
      <ShortDescription xml:lang="nl">vorkvormige sleutel</ShortDescription>
      <ShortDescription xml:lang="pt-PT">chave fixa</ShortDescription>
      <ShortDescription xml:lang="pt-BR">chave de boca</ShortDescription>
      <ShortDescription xml:lang="zh-CN">扳子</ShortDescription>
      <ShortDescription xml:lang="zh-TW">板鉗</ShortDescription>
      . . .
    </Product>

6.3.2 Language-Specific Natural Language Text Search

Most search engines that have any linguistic capability are tuned to a specific language such as English, German, or French. This allows techniques such as stemming and ignoring stop-words to operate according to the unique characteristics of the language in question.

Keyword searching

When searching for keywords, language must be considered in order to resolve items such as abbreviations. E.g., in the string "422 St. Jerome St.", "St." could be either "Saint" or "Street".
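
The abbreviation problem above can be sketched in a few lines of Python. This is only an illustration, not a proposed design: the ABBREVIATIONS table, its entries, and the function name expand_abbreviations are all hypothetical, and a real service would draw on a much larger, language-specific dictionary. Without context, the sketch simply generates every reading of the query so that each one can be searched.

```python
from itertools import product

# Hypothetical per-language abbreviation table (illustrative entries only).
ABBREVIATIONS = {
    "en": {"st.": ["Saint", "Street"]},
}


def expand_abbreviations(query: str, lang: str) -> list[str]:
    """Return every reading of `query` after expanding known abbreviations."""
    table = ABBREVIATIONS.get(lang, {})
    # Each token contributes its possible expansions, or itself if unknown.
    choices = [table.get(token.lower(), [token]) for token in query.split()]
    # The Cartesian product enumerates all combinations of readings.
    return [" ".join(combo) for combo in product(*choices)]


for variant in expand_abbreviations("422 St. Jerome St.", "en"):
    print(variant)
# 422 Saint Jerome Saint
# 422 Saint Jerome Street
# 422 Street Jerome Saint
# 422 Street Jerome Street
```

Only one of the four readings is plausible to a human, which is the point: without knowing the language and its conventions, the service cannot tell which one to prefer.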
Gender and plural variants

Note that some terms have more than one form depending on the gender or number of the object. As an example, "Dr. Alvarez" or "Doctor Alvarez" in English could correspond to either "Dr" ("Doctor") or "Dra" ("Doctora") in Spanish. Therefore, to increase the number of valid hits in the absence of context, a service should match all variants of a term when it is translated into another language.

Like clauses

When operating in a specific language, further normalization may be required in addition to abbreviation expansion and character normalization, to accommodate variant spellings of the same word. In German, for instance, a search for "Dürst" should also return "Duerst" to allow searching across legacy and alternate systems.

Use of intermediary translation and dictionary look-up service

To allow a service to accept searches from clients in other languages, the service should run the search more than once, depending on implementation design: first with the original text as submitted by the client, and then one or more times after submitting the original query to a translation or dictionary look-up service. As an example, the address "422 St. Jerome St." could also be represented as:

    en: 422 Saint Jerome Street
    fr: Rue De 422 Saints Jerome
    es: Calle De 422 Santos Jerome
    de: 422 Heiliger Jerome Straße
    ja: 422 人の聖者のJerome の通り

The query would look something like this:

    Client ==> <query xml:lang="lang0"> ==> service ==> <query xml:lang="lang0">
                                               ^
                                               v
                                        look-up service ==> <query xml:lang="lang1">
                                                        ==> <query xml:lang="lang2">
                                                         :
                                                        ==> <query xml:lang="langN">

Phonetic searches

Note that phonetic searches, such as "Soundex", are usually tuned to specific language characteristics. Soundex, for example, was first patented in 1918 and was later adopted for indexing U.S. census records; it was designed to allow phonetic indexing of English surnames.
It has poor precision, is unable to handle multicultural names, produces many false positives, and misses many potentially correct terms. That said, there exists proprietary phonological name-matching software that produces better results across languages and cultures, but it must be tested and implemented with the caveat that phonetic searching across languages is inherently error-prone because of dialectal differences.
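
To make the limitations of Soundex concrete, here is a minimal Python sketch of the commonly described American Soundex rules (first letter kept, consonants mapped to six digit classes, vowels dropped, H and W ignored, result padded to four characters). It is an illustration only; the choice to discard everything outside A-Z is an assumption of this sketch, not part of any standard.

```python
# Consonant classes used by American Soundex; vowels, H, W and Y have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the four-character Soundex code, or "" if no A-Z letters remain."""
    # Assumption of this sketch: anything outside A-Z is silently discarded.
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    digits = []
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return (letters[0] + "".join(digits) + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Lee"), soundex("Liu"))        # L000 L000
print(repr(soundex("鈴木")))                  # ''
```

"Lee" and "Liu" collapse to the same code because every vowel is discarded, and a name written in a non-Latin script yields no code at all, which illustrates how closely the scheme is tied to English spelling.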
Received on Tuesday, 23 March 2004 18:51:49 UTC