RE: Feedback on i18n in Rangefinder API

Hello Rob,

Please confirm that this call would be on Wednesday, June 3 (June 6 is a Saturday). I’d be glad to participate. Others from the Internationalization WG may also wish to join; you should plan for no more than three of us to turn up. Please provide participation information.



From: Robert Sanderson
Sent: Wednesday, May 27, 2015 8:28 AM
To: Phillips, Addison
Cc: Doug Schepers; i18n WG; Richard Ishida; W3C Public Annotation List
Subject: Re: Feedback on i18n in Rangefinder API

Dear all,

Apologies from Frederick and myself for letting the timing for the discussion fall off the radar.

Would it be possible to join a call next week on Wednesday June 6 at 8am PST / 11am EST / 4pm UK / 5pm Europe to discuss internationalization issues regarding annotation?

In particular, it would be great to make progress on the points that Addison made and also the issue that Takeshi brought up at the F2F regarding different lengths of character strings in different (programming) languages.



On Tue, May 12, 2015 at 1:09 PM, Phillips, Addison wrote:
Some comments from reading the document through initially. I understand that this is a work in progress.

'caseFolding': There is a default Unicode case folding, but it is not applicable in all cases. For example, see the note box in [1]. The default Unicode folding could certainly serve as the default behavior, but there should be a means of tailoring the case folding using a language tag.
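
To illustrate the tailoring point, here is a minimal Python sketch. The function name `fold` and its `lang` parameter are hypothetical and not taken from the Rangefinder draft. Only the Turkish/Azerbaijani dotted and dotless i rule is handled, as one example of a language-specific exception to the default folding.

```python
def fold(text: str, lang: str = "") -> str:
    """Case-fold text, applying a language tailoring when one is known.

    Hypothetical helper: only the Turkic dotted/dotless i rule is covered.
    """
    primary = lang.split("-")[0].lower()  # "tr-TR" -> "tr"
    if primary in ("tr", "az"):
        # In Turkish and Azerbaijani, I pairs with dotless ı and İ pairs with i.
        # Assumes precomposed İ (U+0130); decomposed input would need more care.
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    # str.casefold() applies the default (untailored) Unicode case folding.
    return text.casefold()


print(fold("TITLE"))        # default folding
print(fold("TITLE", "tr"))  # Turkish tailoring keeps the i dotless
```

With the default folding, a Turkish reader searching for "tıtle" would not match "TITLE"; with the tailoring, the two compare equal.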

'unicodeFolding': This also presents a number of difficulties. Not just canonical (NFC/NFD) equivalence but also compatibility equivalence (NFKC/NFKD) is sometimes useful. In addition, there are textual variations that are not related to Unicode character properties that searches may wish to deal with. For example, Japanese uses both katakana and hiragana phonetic scripts: one might wish to normalize these differences away when searching text. In short, I think this parameter needs more thought.
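
A short Python sketch of the distinctions above, using the standard `unicodedata` module. The `kana_fold` helper is hypothetical. It relies on the katakana letters U+30A1..U+30F6 sitting exactly 0x60 above their hiragana counterparts, and it ignores half-width and other special kana forms.

```python
import unicodedata

KATAKANA_START, KATAKANA_END = 0x30A1, 0x30F6  # ァ .. ヶ
KANA_OFFSET = 0x60  # distance between the katakana and hiragana letters


def kana_fold(text: str) -> str:
    """Map katakana letters to their hiragana counterparts (hypothetical helper)."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET) if KATAKANA_START <= ord(ch) <= KATAKANA_END else ch
        for ch in text
    )


ligature = "\ufb01"  # LATIN SMALL LIGATURE FI

# Canonical normalization leaves the ligature alone.
print(unicodedata.normalize("NFC", ligature) == ligature)
# Compatibility normalization replaces it with the two letters "fi".
print(unicodedata.normalize("NFKC", ligature) == "fi")
# The kana difference is a script variation that no normalization form removes.
print(kana_fold("カタカナ") == "かたかな")
```

The third case is the one that a single 'unicodeFolding' switch does not obviously cover, since it is not a Unicode equivalence at all.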

As an aside, there are other things that you note that users might want to ignore/not ignore when searching. This is discussed at length in UTS#10, Chapter 8 [2] and language-specific tailoring and different "weights" come into play.

'wholeWord': This seems simple at first, but some languages (Thai, Japanese, Chinese) that do not use spaces between words have a difficult relationship with this feature. This doesn't make the feature invalid, but does require a health warning that the items selected may not, in fact, always be words.
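
As a rough illustration, the following Python sketch implements whole-word matching with regular-expression word boundaries. That is only one possible approach and is an assumption here, not what the draft specifies. The helper name `whole_word_match` is hypothetical.

```python
import re


def whole_word_match(text: str, word: str) -> bool:
    """Naive whole-word test using regex word boundaries (hypothetical helper)."""
    return re.search(rf"\b{re.escape(word)}\b", text) is not None


# Space-delimited text: the boundaries fall where a reader expects.
print(whole_word_match("a cat sat", "cat"))
print(whole_word_match("concatenate", "cat"))

# Japanese: no spaces, so the whole run of letters counts as one "word"
# and the real word 東京 (Tokyo) is not found as a whole word.
print(whole_word_match("私は東京に住んでいます", "東京"))
```

The last call returns False even though 東京 is a complete word in that sentence. Finding real word boundaries in such text generally needs dictionary-based segmentation, which this sketch does not attempt.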

Normalization in general: it may be possible that the searched text is itself not provided in a normalized form. Health warnings or solid implementation guidance is certainly necessary here.
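
A minimal Python sketch of the problem, assuming the implementation normalizes both the document text and the query before comparing. The helper `find_normalized` is hypothetical, and `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata


def find_normalized(haystack: str, needle: str, form: str = "NFC") -> int:
    """Search after normalizing both strings (hypothetical helper).

    The returned index refers to the normalized haystack, not the original text.
    """
    normalized_haystack = unicodedata.normalize(form, haystack)
    normalized_needle = unicodedata.normalize(form, needle)
    return normalized_haystack.find(normalized_needle)


document_text = "cafe\u0301 menu"  # "é" stored as e + COMBINING ACUTE ACCENT
query = "caf\u00e9"                # "é" as a single precomposed code point

print(unicodedata.is_normalized("NFC", document_text))  # the document is not in NFC
print(document_text.find(query))                        # raw search misses the match
print(find_normalized(document_text, query))            # normalized search finds it
```

Because normalization can change string length, any offsets found this way would still need to be mapped back to the original document text.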

The discussion of using Unicode decomposition in section 9 might need to be carefully thought through. For example, the Korean Hangul script decomposes in a way that might interfere with searching operations (a character that had a Levenshtein distance of '1' when composed might have a distance as large as '4' when decomposed).
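
The effect can be seen in a small Python sketch with a plain Levenshtein implementation. The syllable pair is chosen for illustration only; for this pair the distance grows from 1 to 3, since each syllable decomposes into three jamo and all three differ.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance over code points, using a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


han = "\ud55c"   # 한, one precomposed syllable
geul = "\uae00"  # 글, one precomposed syllable

# Composed: a single substitution.
print(levenshtein(han, geul))
# Decomposed (NFD): every jamo differs, so three substitutions.
print(levenshtein(unicodedata.normalize("NFD", han),
                  unicodedata.normalize("NFD", geul)))
```

A fixed distance tolerance therefore means something quite different depending on whether the text was decomposed first.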

The example 'character count': what exactly would be counted here? Unicode code points? Graphemes?
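
To make the question concrete, this Python sketch counts the same short string in three ways. The sample string is an assumption chosen for illustration: a decomposed "é" followed by a thumbs-up emoji with a skin tone modifier, which a reader would most likely perceive as two characters.

```python
# e + COMBINING ACUTE ACCENT, then THUMBS UP SIGN + skin tone modifier
sample = "e\u0301\U0001F44D\U0001F3FD"

code_points = len(sample)                          # Python strings count code points
utf16_units = len(sample.encode("utf-16-le")) // 2  # what a UTF-16 based language would report
utf8_bytes = len(sample.encode("utf-8"))            # what a byte-oriented language would report

print(code_points, utf16_units, utf8_bytes)
```

This prints three different numbers (4, 6 and 11), none of which equals the two characters a user sees. Counting graphemes needs a segmentation algorithm that the Python standard library does not provide, so it is left out here.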

There are invisible characters in Unicode, such as variation selectors or the new emoji skin tone characters, which may not meaningfully affect the user's intention, but might prevent searches from being successful.
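
One possible way to handle this, sketched in Python, is to strip such characters from both the query and the text before comparing. The set of "ignorable" ranges below is an assumption made for illustration (variation selectors and emoji skin tone modifiers only), and the helper names are hypothetical.

```python
def is_ignorable(ch: str) -> bool:
    """True for characters this sketch chooses to ignore when searching."""
    cp = ord(ch)
    return (
        0xFE00 <= cp <= 0xFE0F        # variation selectors VS1..VS16
        or 0xE0100 <= cp <= 0xE01EF   # variation selectors supplement
        or 0x1F3FB <= cp <= 0x1F3FF   # emoji skin tone modifiers
    )


def strip_ignorables(text: str) -> str:
    """Remove the ignorable characters defined above (hypothetical helper)."""
    return "".join(ch for ch in text if not is_ignorable(ch))


heart_with_selector = "\u2764\ufe0f"      # HEAVY BLACK HEART + VS16 (emoji presentation)
thumbs_with_tone = "\U0001F44D\U0001F3FD"  # THUMBS UP SIGN + medium skin tone

print(strip_ignorables(heart_with_selector) == "\u2764")
print(strip_ignorables(thumbs_with_tone) == "\U0001F44D")
```

As with normalization, stripping characters changes offsets, so a match found in the stripped text would still have to be mapped back to the original.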

Anyway, food for thought. I look forward to further discussion.




> -----Original Message-----
> From: Doug Schepers
> Sent: Tuesday, May 12, 2015 11:47 AM
> To: i18n WG; Richard Ishida; Phillips, Addison; W3C Public Annotation List
> Subject: Feedback on i18n in Rangefinder API
> Hi, Addison, Richard, I18n–
> Oops, hit send too soon, sorry... resending.
> (BCCing the Web Annotation WG mailing list, to keep them in the loop)
> I'd like to schedule a liaison telcon between the Internationalization WG and
> the Web Annotation WG, to discuss issues around a client-side API for
> searching for strings in a web document.
> The Web Annotation WG is chartered to deliver a spec for "fuzzy anchoring",
> which basically means a way to link to a specific passage in a document, even
> if there is no ID and even if the document may have changed.
> One manifestation of this is my Rangefinder API spec [1], which is basically a
> find-in-page API with fuzzy matching (e.g. case folding, Levenshtein distance
> tolerance, Unicode normalization [2]) and location scoping.
> For the Unicode normalization, we'd like to refer normatively to the updated
> Charmod-Norm [3]. In any case, we'd like to discuss our use cases and
> requirements around i18n with you, for your best advice on how we should
> proceed.
> I spoke with Richard today, and he suggested the best next step would be
> have you take a look at my rough early draft of the Rangefinder API, so we
> have some basis for discussion. Please excuse the sketchy nature of the spec,
> and note that the examples are illustrative but out of date with the spec's
> development.
> If you want to meet, would you want to join us, or have some of us join you?
> We normally meet on Wednesdays at 11am ET.
> [1]

> [2]

> [3]

> Regards–
> –Doug

Rob Sanderson
Information Standards Advocate
Digital Library Systems and Services
Stanford, CA 94305

Received on Wednesday, 27 May 2015 16:03:55 UTC