Re: Feedback on i28n in Rangefinder API from Benjamin Young on 2015-05-29 (www-international@w3.org from April to June 2015)

From: Benjamin Young <bigbluehat@hypothes.is>
Date: Fri, 29 May 2015 13:07:38 -0400
To: Robert Sanderson <azaroth42@gmail.com>
Cc: "Phillips, Addison" <addison@lab126.com>, Doug Schepers <schepers@w3.org>, i18n WG <www-international@w3.org>, Richard Ishida <ishida@w3.org>, W3C Public Annotation List <public-annotation@w3.org>
Message-ID: <CAE3H5FJvnKEzDKuiFfKf4jte=-kWY_VteV2VDSXK1BQOOoZ9aw@mail.gmail.com>

Wednesday June 3 at 8am PST / 11am EST / 4pm UK / 5pm Europe

Rather. :)

On Wed, May 27, 2015 at 11:27 AM, Robert Sanderson <azaroth42@gmail.com> wrote:
>
> Dear all,
>
> Apologies from Frederick and myself for letting the timing for the
> discussion fall off the radar.
>
> Would it be possible to join a call next week on Wednesday June 6 at 8am PST
> / 11am EST / 4pm UK / 5pm Europe to discuss internationalization issues
> regarding annotation?
>
> In particular, it would be great to make progress on the points that
> Addison made and also the issue that Takeshi brought up at the F2F regarding
> different lengths of character strings in different (programming) languages.
>
> Thanks!
>
> Rob
>
>
>
> On Tue, May 12, 2015 at 1:09 PM, Phillips, Addison <addison@lab126.com>
> wrote:
>>
>> Some comments from reading the document through initially. I understand
>> that this is a work in progress.
>>
>> 'caseFolding': There is a default Unicode case folding. However, it is not
>> applicable in all cases. For example, see the note box in [1]. Certainly
>> the Unicode default case folding could be the default here. But there
>> should be a means of tailoring the case folding using a language tag.
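
As an illustration of the case-folding point above, here is a minimal Python
sketch. It assumes Python's built-in str.casefold(), which applies the default
(untailored) Unicode full case folding. The helper name default_fold_equal is
hypothetical and is not part of the Rangefinder draft.

```python
def default_fold_equal(a: str, b: str) -> bool:
    """Compare two strings under the default, untailored Unicode case folding."""
    return a.casefold() == b.casefold()


# German sharp s folds to "ss" under full case folding, so these match.
german_match = default_fold_equal("STRASSE", "stra\u00dfe")

# The default folding maps "I" to "i" ...
latin_match = default_fold_equal("I", "i")

# ... but in Turkish, uppercase "I" pairs with dotless "\u0131" (U+0131).
# The default folding leaves U+0131 unchanged, so this comparison fails
# unless the folding is tailored for the language.
turkish_match = default_fold_equal("I", "\u0131")

print(german_match, latin_match, turkish_match)
```

The last comparison is the kind of case where a language tag would be needed
to select a tailored folding.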
>>
>> 'unicodeFolding': This also presents a number of difficulties. Not just
>> canonical (NFC/NFD) equivalence but also compatibility equivalence
>> (NFKC/NFKD) is sometimes useful. In addition, there are textual variations
>> that are not related to Unicode character properties that searches may wish
>> to deal with. For example, Japanese uses both katakana and hiragana phonetic
>> scripts: one might wish to normalize these differences away when searching
>> text. In other words, I think this parameter probably needs more thought.
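
The difference between canonical and compatibility equivalence, and the
kana example, can be sketched in Python as follows. This is only an
illustration under stated assumptions: it uses the standard unicodedata
module, and hiragana_to_katakana is a hypothetical helper that relies on the
fixed 0x60 offset between the hiragana (U+3041..U+3096) and katakana
(U+30A1..U+30F6) blocks. It is not a proposal for how 'unicodeFolding'
should behave.

```python
import unicodedata


def nfc(s: str) -> str:
    """Canonical composition (NFC)."""
    return unicodedata.normalize("NFC", s)


def nfkc(s: str) -> str:
    """Compatibility composition (NFKC)."""
    return unicodedata.normalize("NFKC", s)


def hiragana_to_katakana(s: str) -> str:
    """Hypothetical helper: shift hiragana letters into the katakana block."""
    return "".join(
        chr(ord(c) + 0x60) if "\u3041" <= c <= "\u3096" else c for c in s
    )


composed = "\u00e9"     # e with acute as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent
ligature = "\ufb01"     # the "fi" ligature

# Canonically equivalent strings become identical under NFC.
canonical_equal = nfc(composed) == nfc(decomposed)

# The ligature is only a compatibility equivalent of "fi":
# NFC leaves it alone, NFKC maps it to the two letters.
nfc_matches_fi = nfc(ligature) == "fi"
nfkc_matches_fi = nfkc(ligature) == "fi"

# Neither normalization form relates hiragana to katakana;
# that needs a separate mapping step.
kana_equal = hiragana_to_katakana("\u3072\u3089\u304c\u306a") == "\u30d2\u30e9\u30ac\u30ca"

print(canonical_equal, nfc_matches_fi, nfkc_matches_fi, kana_equal)
```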
>>
>> As an aside, there are other things that you note that users might want to
>> ignore/not ignore when searching. This is discussed at length in UTS#10,
>> Chapter 8 [2] and language-specific tailoring and different "weights" come
>> into play.
>>
>> 'wholeWord': This seems simple at first, but some languages (Thai,
>> Japanese, Chinese) that do not use spaces between words have a difficult
>> relationship with this feature. This doesn't make the feature invalid, but
>> does require a health warning that the items selected may not, in fact,
>> always be words.
>>
>> Normalization in general: it may be possible that the searched text is
>> itself not provided in a normalized form. Health warnings or solid
>> implementation guidance is certainly necessary here.
>>
>> The discussion of using Unicode decomposition in section 9 might need to
>> be carefully thought through. For example, the Korean Hangul script
>> decomposes in a way that might interfere with searching operations (a
>> character that had a Levenshtein distance of '1' when composed might have a
>> distance as large as '4' when decomposed).
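
The effect of decomposition on edit distance can be sketched in Python. This
is a minimal illustration, assuming NFD from the standard unicodedata module
and a plain textbook Levenshtein implementation written for this example. The
syllable pair below was chosen only because all three of its jamo differ; it
gives a decomposed distance of 3 and does not attempt to reproduce the larger
figure mentioned above.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance counted in code points (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


syllable_a = "\ud55c"  # Hangul syllable HAN
syllable_b = "\uae00"  # Hangul syllable GEUL

# Composed: each syllable is one code point, so one substitution suffices.
composed_distance = levenshtein(syllable_a, syllable_b)

# Decomposed: each syllable becomes three conjoining jamo, all different here.
nfd_a = unicodedata.normalize("NFD", syllable_a)
nfd_b = unicodedata.normalize("NFD", syllable_b)
decomposed_distance = levenshtein(nfd_a, nfd_b)

print(composed_distance, decomposed_distance)
```

A distance tolerance tuned on composed text would therefore behave
differently once the same text has been decomposed.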
>>
>> The example 'character count': what exactly would be counted here? Unicode
>> code points? Graphemes?
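
To make the counting question concrete, here is a small Python sketch that
counts the same strings in three different units. The function name counts
is hypothetical. Grapheme clusters are not counted because the Python
standard library has no UAX #29 segmenter; the comments note what a user
would likely perceive.

```python
def counts(s: str) -> dict:
    """Length of s in code points, UTF-16 code units, and UTF-8 bytes."""
    return {
        "code_points": len(s),
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        "utf8_bytes": len(s.encode("utf-8")),
    }


# "e" plus a combining acute accent: a user would likely see one character.
accented = "e\u0301"

# Thumbs-up emoji followed by a skin tone modifier: also seen as one
# character, but each part lies outside the Basic Multilingual Plane.
thumbs = "\U0001F44D\U0001F3FD"

print(counts(accented))
print(counts(thumbs))
```

Languages whose strings are indexed by UTF-16 code units and languages whose
strings are indexed by code points will report different lengths and offsets
for the second string.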
>>
>> There are invisible characters in Unicode, such as variation selectors or
>> the new emoji skin tone characters, which may not meaningfully affect the
>> user's intention, but might prevent searches from being successful.
>>
>> Anyway, food for thought. I look forward to further discussion.
>>
>> ~Addison
>>
>> [1] http://w3c.github.io/charmod-norm/#definitionCaseFolding
>> [2] http://www.unicode.org/reports/tr10/#Searching
>>
>> > -----Original Message-----
>> > From: Doug Schepers [mailto:schepers@w3.org]
>> > Sent: Tuesday, May 12, 2015 11:47 AM
>> > To: i18n WG; Richard Ishida; Phillips, Addison; W3C Public Annotation
>> > List
>> > Subject: Feedback on i28n in Rangefinder API
>> >
>> > Hi, Addison, Richard, I18n–
>> >
>> > Oops, hit send too soon, sorry... resending.
>> >
>> > (BCCing the Web Annotation WG mailing list, to keep them in the loop)
>> >
>> > I'd like to schedule a liaison telcon between the Internationalization WG
>> > and the Web Annotation WG, to discuss issues around a client-side API for
>> > searching for strings in a web document.
>> >
>> > The Web Annotation WG is chartered to deliver a spec for "fuzzy anchoring",
>> > which basically means a way to link to a specific passage in a document,
>> > even if there is no ID and even if the document may have changed.
>> >
>> > One manifestation of this is my Rangefinder API spec [1], which is
>> > basically a find-in-page API with fuzzy matching (e.g. case folding,
>> > Levenshtein distance tolerance, Unicode normalization [2]) and location
>> > scoping.
>> >
>> > For the Unicode normalization, we'd like to refer normatively to the
>> > updated Charmod-Norm [3]. In any case, we'd like to discuss our use cases
>> > and requirements around i18n with you, for your best advice on how we
>> > should proceed.
>> >
>> > I spoke with Richard today, and he suggested the best next step would be
>> > to have you take a look at my rough early draft of the Rangefinder API, so
>> > we have some basis for discussion. Please excuse the sketchy nature of the
>> > spec, and note that the examples are illustrative but out of date with the
>> > spec's development.
>> >
>> > If you want to meet, would you want to join us, or have some of us join
>> > you? We normally meet on Wednesdays at 11am ET.
>> >
>> >
>> > [1] http://w3c.github.io/rangefinder/
>> > [2] http://w3c.github.io/rangefinder/#widl-RangeFinder-unicodeFolding
>> > [3] http://www.w3.org/TR/2014/WD-charmod-norm-20140715/
>> >
>> > Regards–
>> > –Doug
>> >
>>
>
>
>
> --
> Rob Sanderson
> Information Standards Advocate
> Digital Library Systems and Services
> Stanford, CA 94305
Received on Friday, 29 May 2015 17:08:06 UTC