- From: Benjamin Young <bigbluehat@hypothes.is>
- Date: Fri, 29 May 2015 13:07:38 -0400
- To: Robert Sanderson <azaroth42@gmail.com>
- Cc: "Phillips, Addison" <addison@lab126.com>, Doug Schepers <schepers@w3.org>, i18n WG <www-international@w3.org>, Richard Ishida <ishida@w3.org>, W3C Public Annotation List <public-annotation@w3.org>
Wednesday June 3 at 8am PST / 11am EST / 4pm UK / 5pm Europe

Rather. :)

On Wed, May 27, 2015 at 11:27 AM, Robert Sanderson <azaroth42@gmail.com> wrote:
>
> Dear all,
>
> Apologies from Frederick and myself for letting the timing for the
> discussion fall off the radar.
>
> Would it be possible to join a call next week on Wednesday June 6 at 8am PST
> / 11am EST / 4pm UK / 5pm Europe to discuss internationalization issues
> regarding annotation?
>
> In particular, it would be great to make progress on the points that
> Addison made and also the issue that Takeshi brought up at the F2F regarding
> different lengths of character strings in different (programming) languages.
>
> Thanks!
>
> Rob
>
>
>
> On Tue, May 12, 2015 at 1:09 PM, Phillips, Addison <addison@lab126.com>
> wrote:
>>
>> Some comments from reading the document through initially. I understand
>> that this is a work in progress.
>>
>> 'caseFolding': There is a default Unicode case folding. However, it is not
>> applicable in all cases. For example, see the note box in [1]. Certainly a
>> default case folding could be the default. But there should be a means of
>> tailoring the case fold using a language tag.
>>
>> 'unicodeFolding': This also presents a number of difficulties. Not just
>> canonical (NFC/NFD) equivalence but also compatibility equivalence
>> (NFKC/NFKD) is sometimes useful. In addition, there are textual variations
>> that are not related to Unicode character properties that searches may wish
>> to deal with. For example, Japanese uses both katakana and hiragana phonetic
>> scripts: one might wish to normalize these differences away when searching
>> text. In other words, I think probably this parameter needs more thought.
>>
>> As an aside, there are other things that you note that users might want to
>> ignore/not ignore when searching. This is discussed at length in UTS#10,
>> Chapter 8 [2] and language-specific tailoring and different "weights" come
>> into play.
>>
>> 'wholeWord': This seems simple at first, but some languages (Thai,
>> Japanese, Chinese) that do not use spaces between words have a difficult
>> relationship with this feature. This doesn't make the feature invalid, but
>> does require a health warning that the items selected may not, in fact,
>> always be words.
>>
>> Normalization in general: it may be possible that the searched text is
>> itself not provided in a normalized form. Health warnings or solid
>> implementation guidance is certainly necessary here.
>>
>> The discussion of using Unicode decomposition in section 9 might need to
>> be carefully thought through. For example, the Korean Hangul script
>> decomposes in a way that might interfere with searching operations (a
>> character that had a Levenshtein distance of '1' when composed might have a
>> distance as large as '4' when decomposed).
>>
>> The example 'character count': what exactly would be counted here? Unicode
>> code points? Graphemes?
>>
>> There are invisible characters in Unicode, such as variation selectors or
>> the new emoji skin tone characters, which may not meaningfully affect the
>> user's intention, but might prevent searches from being successful.
>>
>> Anyway, food for thought. I look forward to further discussion.
>>
>> ~Addison
>>
>> [1] http://w3c.github.io/charmod-norm/#definitionCaseFolding
>> [2] http://www.unicode.org/reports/tr10/#Searching
>>
>> > -----Original Message-----
>> > From: Doug Schepers [mailto:schepers@w3.org]
>> > Sent: Tuesday, May 12, 2015 11:47 AM
>> > To: i18n WG; Richard Ishida; Phillips, Addison; W3C Public Annotation
>> > List
>> > Subject: Feedback on i28n in Rangefinder API
>> >
>> > Hi, Addison, Richard, I18n–
>> >
>> > Oops, hit send too soon, sorry... resending.
>> >
>> > (BCCing the Web Annotation WG mailing list, to keep them in the loop)
>> >
>> > I'd like to schedule a liaison telcon between the Internationalization
>> > WG and the Web Annotation WG, to discuss issues around a client-side
>> > API for searching for strings in a web document.
>> >
>> > The Web Annotation WG is chartered to deliver a spec for "fuzzy
>> > anchoring", which basically means a way to link to a specific passage
>> > in a document, even if there is no ID and even if the document may
>> > have changed.
>> >
>> > One manifestation of this is my Rangefinder API spec [1], which is
>> > basically a find-in-page API with fuzzy matching (e.g. case folding,
>> > Levenshtein distance tolerance, Unicode normalization [2]) and
>> > location scoping.
>> >
>> > For the Unicode normalization, we'd like to refer normatively to the
>> > updated Charmod-Norm [3]. In any case, we'd like to discuss our use
>> > cases and requirements around i18n with you, for your best advice on
>> > how we should proceed.
>> >
>> > I spoke with Richard today, and he suggested the best next step would
>> > be to have you take a look at my rough early draft of the Rangefinder
>> > API, so we have some basis for discussion. Please excuse the sketchy
>> > nature of the spec, and note that the examples are illustrative but
>> > out of date with the spec's development.
>> >
>> > If you want to meet, would you want to join us, or have some of us
>> > join you? We normally meet on Wednesdays at 11am ET.
>> >
>> >
>> > [1] http://w3c.github.io/rangefinder/
>> > [2] http://w3c.github.io/rangefinder/#widl-RangeFinder-unicodeFolding
>> > [3] http://www.w3.org/TR/2014/WD-charmod-norm-20140715/
>> >
>> > Regards–
>> > –Doug
>> >
>> >
>
>
> --
> Rob Sanderson
> Information Standards Advocate
> Digital Library Systems and Services
> Stanford, CA 94305
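
For reference on the call, here is a rough sketch of two points from Addison's quoted comments: what a "character count" might actually count, and how Hangul grows under decomposition. It uses only Python's standard unicodedata module. The helper names count_units and default_fold are made up for this illustration and are not part of the Rangefinder draft; default_fold applies Python's default (untailored) case folding, so it does not address the language-specific tailoring Addison asks for.

```python
import unicodedata


def count_units(text: str) -> dict:
    """Return several plausible 'character counts' for one string.

    Hypothetical helper for illustration only; the Rangefinder draft
    does not say which of these it means.
    """
    return {
        "code_points": len(text),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
        "nfc_code_points": len(unicodedata.normalize("NFC", text)),
        "nfd_code_points": len(unicodedata.normalize("NFD", text)),
    }


def default_fold(text: str) -> str:
    """Default Unicode case folding followed by NFC (no language tailoring)."""
    return unicodedata.normalize("NFC", text.casefold())


# One precomposed Hangul syllable (U+D55C) becomes three jamo under NFD.
hangul = "\uD55C"
print(count_units(hangul))

# Thumbs-up plus a skin tone modifier: two code points, four UTF-16 units.
emoji = "\U0001F44D\U0001F3FD"
print(count_units(emoji))

# Canonically equivalent spellings compare equal only after normalization.
print("e\u0301" == "\u00e9", default_fold("E\u0301") == default_fold("\u00e9"))
```

The same visible text yields different lengths depending on the unit and the normalization form, which is why a length or edit-distance parameter would need to state which one it uses. Grapheme clusters are not counted here because the standard library has no segmentation API for them.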
Received on Friday, 29 May 2015 17:08:06 UTC