- From: Robert Sanderson <azaroth42@gmail.com>
- Date: Wed, 27 May 2015 08:27:37 -0700
- To: "Phillips, Addison" <addison@lab126.com>
- Cc: Doug Schepers <schepers@w3.org>, i18n WG <www-international@w3.org>, Richard Ishida <ishida@w3.org>, W3C Public Annotation List <public-annotation@w3.org>
- Message-ID: <CABevsUGPOuvPjbcys8UYYYMk=AogF8yE4a5qVk65snPJ5wyZ3g@mail.gmail.com>
Dear all,

Apologies from Frederick and myself for letting the timing for the
discussion fall off the radar.

Would it be possible to join a call next week on Wednesday June 6 at 8am
PST / 11am EST / 4pm UK / 5pm Europe to discuss internationalization
issues regarding annotation?

In particular, it would be great to make progress on the points that
Addison made and also the issue that Takeshi brought up at the F2F
regarding different lengths of character strings in different
(programming) languages.

Thanks!

Rob

On Tue, May 12, 2015 at 1:09 PM, Phillips, Addison <addison@lab126.com> wrote:

> Some comments from reading the document through initially. I understand
> that this is a work in progress.
>
> 'caseFolding': There is a default Unicode case folding. However, it is
> not applicable in all cases. For example, see the note box in [1].
> Certainly a default case folding could be the default. But there should
> be a means of tailoring the case fold using a language tag.
>
> 'unicodeFolding': This also presents a number of difficulties. Not just
> canonical (NFC/NFD) equivalence but also compatibility equivalence
> (NFKC/NFKD) is sometimes useful. In addition, there are textual
> variations that are not related to Unicode character properties that
> searches may wish to deal with. For example, Japanese uses both katakana
> and hiragana phonetic scripts: one might wish to normalize these
> differences away when searching text. In other words, I think probably
> this parameter needs more thought.
>
> As an aside, there are other things that you note that users might want
> to ignore/not ignore when searching. This is discussed at length in
> UTS#10, Chapter 8 [2] and language-specific tailoring and different
> "weights" come into play.
>
> 'wholeWord': This seems simple at first, but some languages (Thai,
> Japanese, Chinese) that do not use spaces between words have a difficult
> relationship with this feature. This doesn't make the feature invalid,
> but does require a health warning that the items selected may not, in
> fact, always be words.
>
> Normalization in general: it may be possible that the searched text is
> itself not provided in a normalized form. Health warnings or solid
> implementation guidance is certainly necessary here.
>
> The discussion of using Unicode decomposition in section 9 might need to
> be carefully thought through. For example, the Korean Hangul script
> decomposes in a way that might interfere with searching operations (a
> character that had a Levenshtein distance of '1' when composed might
> have a distance as large as '4' when decomposed).
>
> The example 'character count': what exactly would be counted here?
> Unicode code points? Graphemes?
>
> There are invisible characters in Unicode, such as variation selectors
> or the new emoji skin tone characters, which may not meaningfully affect
> the user's intention, but might prevent searches from being successful.
>
> Anyway, food for thought. I look forward to further discussion.
>
> ~Addison
>
> [1] http://w3c.github.io/charmod-norm/#definitionCaseFolding
> [2] http://www.unicode.org/reports/tr10/#Searching
>
> > -----Original Message-----
> > From: Doug Schepers [mailto:schepers@w3.org]
> > Sent: Tuesday, May 12, 2015 11:47 AM
> > To: i18n WG; Richard Ishida; Phillips, Addison; W3C Public Annotation List
> > Subject: Feedback on i28n in Rangefinder API
> >
> > Hi, Addison, Richard, I18n–
> >
> > Oops, hit send too soon, sorry... resending.
> >
> > (BCCing the Web Annotation WG mailing list, to keep them in the loop)
> >
> > I'd like to schedule a liaison telcon between the Internationalization
> > WG and the Web Annotation WG, to discuss issues around a client-side
> > API for searching for strings in a web document.
> >
> > The Web Annotation WG is chartered to deliver a spec for "fuzzy
> > anchoring", which basically means a way to link to a specific passage
> > in a document, even if there is no ID and even if the document may
> > have changed.
> >
> > One manifestation of this is my Rangefinder API spec [1], which is
> > basically a find-in-page API with fuzzy matching (e.g. case folding,
> > Levenshtein distance tolerance, Unicode normalization [2]) and
> > location scoping.
> >
> > For the Unicode normalization, we'd like to refer normatively to the
> > updated Charmod-Norm [3]. In any case, we'd like to discuss our use
> > cases and requirements around i18n with you, for your best advice on
> > how we should proceed.
> >
> > I spoke with Richard today, and he suggested the best next step would
> > be have you take a look at my rough early draft of the Rangefinder
> > API, so we have some basis for discussion. Please excuse the sketchy
> > nature of the spec, and note that the examples are illustrative but
> > out of date with the spec's development.
> >
> > If you want to meet, would you want to join us, or have some of us
> > join you? We normally meet on Wednesdays at 11am ET.
> >
> > [1] http://w3c.github.io/rangefinder/
> > [2] http://w3c.github.io/rangefinder/#widl-RangeFinder-unicodeFolding
> > [3] http://www.w3.org/TR/2014/WD-charmod-norm-20140715/
> >
> > Regards–
> > –Doug

--
Rob Sanderson
Information Standards Advocate
Digital Library Systems and Services
Stanford, CA 94305
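
Several of the points quoted above (Hangul decomposition changing an edit distance, what a 'character count' counts, and string lengths differing between programming languages) can be illustrated with a small sketch. This is not taken from the Rangefinder draft: the helper names `levenshtein` and `utf16_units` and the sample strings are assumptions chosen for illustration, and the edit distance here is counted in Unicode code points, which is only one of several possible units.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance counted in Unicode code points (hypothetical helper)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def utf16_units(s: str) -> int:
    """Length as seen by languages whose strings are UTF-16 code units."""
    return len(s.encode("utf-16-le")) // 2


# Two precomposed Hangul syllables (U+D55C and U+AE00), chosen as samples.
han, geul = "\ud55c", "\uae00"
han_nfd = unicodedata.normalize("NFD", han)    # decomposes into 3 jamo
geul_nfd = unicodedata.normalize("NFD", geul)  # decomposes into 3 jamo

composed_distance = levenshtein(han, geul)
decomposed_distance = levenshtein(han_nfd, geul_nfd)
print(composed_distance, decomposed_distance)  # 1 3

# Thumbs-up emoji followed by a skin tone modifier: one visible symbol.
thumbs = "\U0001F44D\U0001F3FD"
code_points = len(thumbs)         # Python counts code points
code_units = utf16_units(thumbs)  # UTF-16 based languages count code units
print(code_points, code_units)    # 2 4

# Default (untailored) case folding; it applies no language-specific rules.
default_fold_matches = "Stra\u00dfe".casefold() == "STRASSE".casefold()
print(default_fold_matches)       # True
```

With these sample syllables, a single substitution on composed text becomes three substitutions after NFD decomposition, so a fixed distance tolerance behaves differently depending on the normalization form. The emoji example shows why offsets and counts need a stated unit: the same string has length 2 in code points and 4 in UTF-16 code units, while a user would likely perceive a single symbol.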
Received on Wednesday, 27 May 2015 15:28:08 UTC