Re: [string-search] Web page in multiple languages (#13) from Addison Phillips via GitHub on 2022-07-26 (public-i18n-archive@w3.org from July to September 2022)

From: Addison Phillips via GitHub <sysbot+gh@w3.org>
Date: Tue, 26 Jul 2022 15:03:51 +0000
To: public-i18n-archive@w3.org
Message-ID: <issue_comment.created-1195602586-1658847829-sysbot+gh@w3.org>

@xfq If one does true full-text search on a page in multiple languages (as opposed to sub-string matching, which is the primary topic of our document), then the segmentation, stemming, and other processing (such as named entity recognition) of the corpus should be matched to the language of each block of text. For example, word segmentation for `ja` is different from that for `zh`.
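
To make the per-block matching concrete, here is a rough sketch of routing each block to a segmenter by its language tag. Everything in it is an assumption for illustration: `SEGMENTERS`, `segment_blocks`, and the three toy segmenters are hypothetical names, and the character bigram and unigram functions are arbitrary stand-ins chosen only so that the `ja` and `zh` entries differ. A real system would plug a dictionary-based or model-based segmenter into each slot.

```python
from typing import Callable, Dict, List, Tuple


def whitespace_tokens(text: str) -> List[str]:
    """Lowercase and split on whitespace (stand-in for space-delimited scripts)."""
    return text.lower().split()


def char_bigrams(text: str) -> List[str]:
    """Overlapping two-character windows; a stand-in, not a real Japanese segmenter."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return chars
    return [a + b for a, b in zip(chars, chars[1:])]


def char_unigrams(text: str) -> List[str]:
    """One token per character; a stand-in, not a real Chinese segmenter."""
    return [c for c in text if not c.isspace()]


# Hypothetical registry keyed by primary language subtag.
SEGMENTERS: Dict[str, Callable[[str], List[str]]] = {
    "en": whitespace_tokens,
    "ja": char_bigrams,
    "zh": char_unigrams,
}


def segment_blocks(blocks: List[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    """Segment each (language tag, text) block with the segmenter for its language."""
    result = []
    for lang, text in blocks:
        primary = lang.split("-")[0].lower()  # "zh-Hans" -> "zh"
        segmenter = SEGMENTERS.get(primary, whitespace_tokens)
        result.append((lang, segmenter(text)))
    return result
```

The point of the sketch is only the dispatch: the language tag on each block (for example from `lang` attributes in the page) selects the processing, rather than one segmenter being applied to the whole page.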

When search terms are entered against a multilingual index, it may be necessary to do "explosive stemming" (multiple stemming processes using the rules for the various languages in the corpus) or other types of processing to try to match the search terms against the indices.
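
As a minimal sketch of what "explosive stemming" could look like, the snippet below runs one query term through the stemming rules of every language in the corpus and collects the distinct candidate stems. The suffix tables and the names `TOY_SUFFIXES`, `toy_stem`, and `explode_stems` are hypothetical and deliberately simplistic; real stemmers are far more elaborate, and this is not how any particular search engine implements it.

```python
from typing import Dict, Iterable, Set

# Toy suffix rules, purely illustrative (assumed, not taken from any real stemmer).
TOY_SUFFIXES: Dict[str, tuple] = {
    "en": ("ing", "es", "s"),
    "es": ("ando", "ar", "os", "as"),
    "de": ("en", "er", "e"),
}


def toy_stem(term: str, lang: str) -> str:
    """Strip the first matching suffix for `lang`, keeping at least 3 characters."""
    for suffix in TOY_SUFFIXES[lang]:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def explode_stems(term: str, corpus_langs: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each candidate stem to the set of languages whose rules produced it."""
    candidates: Dict[str, Set[str]] = {}
    for lang in corpus_langs:
        stem = toy_stem(term.lower(), lang)
        candidates.setdefault(stem, set()).add(lang)
    return candidates
```

Under these toy rules, `explode_stems("cantando", ["en", "es", "de"])` yields the stem `cant` from the `es` rules and leaves the term unchanged under the `en` and `de` rules. Each candidate would then be looked up in the index for the languages that produced it.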

FTS is complicated. 

As @r12a asks, what is the problem here (with respect to our text)? 😉 Happy to accept suggestions.

-- 
GitHub Notification of comment by aphillips
Please view or discuss this issue at https://github.com/w3c/string-search/issues/13#issuecomment-1195602586 using your GitHub account


Received on Tuesday, 26 July 2022 15:03:52 UTC