- From: Doug Schepers <schepers@w3.org>
- Date: Fri, 27 Feb 2015 04:17:07 -0500
- To: "Kanai, Takeshi" <Takeshi.Kanai@jp.sony.com>, Randall Leeds <randall@bleeds.info>, W3C Public Annotation List <public-annotation@w3.org>
Hi, Takeshi–

On 2/27/15 4:01 AM, Kanai, Takeshi wrote:
>
> Thank you for the details. I'm getting clear about the motivation.
> The reason why I would like to see the background was that the
> definition was not clear enough for me to evaluate the side effects
> of applying the function to non-Latin languages.

I apologize for not being clear before. The confusion is all my own,
through ignorance of the subject, and I'm afraid my confusion may have
spread. I believe you interpreted my motivation correctly from the
beginning, but I lacked the terminology and context to convey that.

I've reviewed the links you sent, and while I'm still fuzzy on the
details, they were exactly what I needed.

> Basically, I have the same feelings as the negative comments in the
> article [2] you have provided towards the *-folding attributes, and I
> still think that utilizing the Unicode Collation Algorithm (UCA)
> provides more efforts for more languages as alternative solutions,

I agree with you, and have just updated the RangeFinder API draft to
reflect what I've learned from your links. I changed the name of the
attribute from "asciiFolding" to "unicodeFolding" (though that's still
not quite accurate, I think it's better). I'm sure the wording is still
bad, but we can continue to improve it.

> although I
> admit that the spec is designed for sorting.

Actually, UTS10 has a dedicated section on Searching and Matching
(section 8) [5], and that was extremely helpful. We may be able to
simply defer to Unicode on the algorithm. I plan to contact the authors
to make sure that we integrate their work correctly.

> The article [3] mentions the UCA, and I think the last section
> is suggestive. The document [4] describes searching from an
> internationalization point of view, and I think this is relatively
> reliable.

Thanks for the additional reference; I'll review that and see how we
can continue to improve this important aspect of searching.
I'm very grateful to you for your help and expertise thus far!

[5] http://unicode.org/reports/tr10/#Searching

Regards–
–Doug

> [2] http://alistapart.com/article/accent-folding-for-auto-complete
> [3] http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/sorting-collations.html
> [4] http://www.w3.org/International/docs/charmod-norm/
>
>
> Thanks, Takeshi Kanai
>
> -----Original Message-----
> From: Doug Schepers [mailto:schepers@w3.org]
> Sent: Wednesday, February 25, 2015 11:44 PM
> To: Kanai, Takeshi; Randall Leeds; W3C Public Annotation List
> Subject: Re: Rough Draft of Robust Anchoring: the RangeFinder API
>
> Hi, Takeshi–
>
> Thanks for your feedback!
>
> You can see some of my motivation for including "ASCII folding" in
> these articles [1][2]; this is also known as character-folding,
> accent-folding, diacritic-folding, and other names with more or less
> accuracy.
>
> I agree that the definition is inadequate at this point (as I
> acknowledge in the spec itself); the name is probably also
> terrible. I don't have the expertise at this point to define it
> better, but I'm very open to suggestions on improving it.
>
> I'm especially interested in concrete suggestions, in references to
> relevant docs and specs (like the Unicode ones you point to), and in
> use cases.
>
>
> (This strawman draft was mostly intended to start the conversation,
> which had precisely the intended effect. Thanks for bringing your
> expertise to the conversation; it's very helpful.)
>
>
> [1] http://www.elasticsearch.org/guide/en/elasticsearch/guide/master/asciifolding-token-filter.html
>
> [2] http://alistapart.com/article/accent-folding-for-auto-complete
>
> Regards–
> –Doug
>
> On 2/25/15 5:38 AM, Kanai, Takeshi wrote:
>> Hello Randall,
>>
>> You are right. The normalization did not change the glyph.
>>
>> What I wanted to do was to make sure of the intention of the
>> definition of “asciiFolding” in the document.
>> I couldn’t figure out any situation where it is necessary to map
>> non-Latin characters to Latin characters, as a non-Latin
>> character user.
>>
>> Then, I interpreted it as canonical search/canonical matching, and
>> wrote down the algorithm for clarification purposes, although it
>> was still doing Latin to Latin mapping.
>>
>> What I know as “ascii folding” is to express a glyph in ASCII
>> letters. For example, “☆” (U+2606) would be mapped to “STAR”. So,
>> it works only for English text, I think.
>>
>> As a non-Latin user, I translated it as “pronunciation”. (I’m not
>> talking about pronunciation codes, such as IPA, just in case..)
>>
>> In Japanese, the star glyph should be mapped to Japanese
>> characters (non-Latin), but the mapping does not always work
>> appropriately. I mean that how a glyph should be mapped depends on
>> context. In case content authors would like to explicitly specify
>> folding words, or pronunciation, they put “Ruby annotation” on the
>> words, especially in trade books. Then, “WWW” would be searchable
>> with the words “World Wide Web”. See [1]
>>
>> Unlike Latin languages, pronunciation of Japanese words depends on
>> the context, and besides, each letter carries several meanings. So I
>> could say that we are always folding back, while we are reading
>> text.
>>
>> [1] http://www.w3.org/TR/ruby/#simple-ruby1
>>
>> Thanks,
>>
>> Takeshi Kanai
>>
>> From: Randall Leeds [mailto:randall@bleeds.info]
>> Sent: Wednesday, February 25, 2015 5:42 PM
>> To: Kanai, Takeshi; Doug Schepers; W3C Public Annotation List
>> Subject: Re: Rough Draft of Robust Anchoring: the RangeFinder API
>>
>> I was the one who suggested "asciiFolding" to Doug.
>>
>> I wonder if unicode normalization should be implied rather than
>> explicit. I am not a unicode expert but I thought normalization
>> did not change the glyph, only the byte representation. If that's
>> the case, maybe it's not necessary to expose that in the API.
>>
>> Any suggestions on how to handle this for non-Latin scripts are
>> very helpful! Thank you!
>>
>> On Wed Feb 25 2015 at 12:21:06 AM Kanai, Takeshi
>> <Takeshi.Kanai@jp.sony.com> wrote:
>>
>> Hi Doug,
>>
>> I'm afraid that the definition of asciiFolding is not clear
>> enough. Japanese characters are non-Latin characters, but I don't
>> think it is possible to make a map which points to Latin
>> characters.
>>
>> I assume that what we would like to do with this attribute is
>> so-called "canonical search" or "canonical matching". If so, what the
>> attribute calls for is to apply NFC (Unicode Normalization Form C
>> [1]) first and use the map defined in the Unicode Collation Algorithm
>> [2], for example. I don't think it is necessary to write down the
>> precise algorithm in the document, but I would like to make sure
>> whether the method above meets the intention of the attribute or
>> not.
>>
>> [1] Unicode Normalization Forms http://unicode.org/reports/tr15/
>>
>> [2] Unicode Collation Algorithm http://unicode.org/reports/tr10/
>>
>>
>> Thanks, Takeshi Kanai
>>
>> -----Original Message-----
>> From: Doug Schepers [mailto:schepers@w3.org]
>> Sent: Wednesday, February 25, 2015 2:48 PM
>> To: W3C Public Annotation List
>> Subject: Re: Rough Draft of Robust Anchoring: the RangeFinder API
>>
>> Hi, folks–
>>
>> Just a quick note. Rob asked me to move this file, to keep the
>> deliverables organized. It's now located at:
>>
>> http://w3c.github.io/web-annotation/api/rangefinder/
>>
>> Even this is a temporary location, though... I'll be moving it to
>> specs.webplatform.org soon, and adding the annotation capability
>> to it.
>>
>> Feel free to review, but be aware that the URL is transitory.
>>
>> Regards–
>> –Doug
>>
>> On 2/24/15 1:33 PM, Doug Schepers wrote:
>>> Hi, folks–
>>>
>>> After talking about Robust Anchoring with many people over the
>>> course of the last couple years (!), with encouragement and good
>>> criticisms, I've refined my notion of what's needed for a
>>> client-side API for Robust Anchoring.
>>>
>>> I've drawn up a strawman of my current thinking for an API
>>> called RangeFinder [1].
>>>
>>> It's very rough in places, but I'd appreciate any feedback on
>>> the spec as it stands. I'd greatly appreciate any thoughts or
>>> opinions on it at this stage.
>>>
>>> I'm not sure it's mature enough for this yet, but at some point,
>>> I'd like to engage the research and academic communities and the
>>> experts who've published on text search algorithms, to polish
>>> this up and make it not quite as embarrassing as it is currently.
>>> If anyone knows who we should contact in that regard, please
>>> chime in. This is a great opportunity to leverage all that
>>> research in the service of Web developers and browsers!
>>>
>>> [1] http://w3c.github.io/web-annotation/rangefinder-api/
>>>
>>> Regards–
>>> –Doug
>>>
>>
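
The distinction drawn in this thread between canonical matching (NFC
normalization) and accent folding can be sketched roughly as follows. This
is only an illustration using Python's standard unicodedata module, not the
RangeFinder draft's algorithm and not the Unicode Collation Algorithm; the
function names canonical_key and folded_key are hypothetical, and stripping
nonspacing marks is an assumed, simplistic stand-in for "folding".

```python
import unicodedata


def canonical_key(text: str) -> str:
    """Normalize to NFC so canonically equivalent strings compare equal.

    This changes only the code point sequence, not the rendered glyphs.
    """
    return unicodedata.normalize("NFC", text)


def folded_key(text: str) -> str:
    """Naive accent folding (an assumption, not the draft's definition).

    Decompose to NFD, drop nonspacing marks (category Mn), recompose,
    and case-fold.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped).casefold()


precomposed = "caf\u00e9"   # "é" as a single code point
combining = "cafe\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# Canonical matching: the two spellings differ as raw strings but
# agree after NFC normalization.
print(precomposed == combining)
print(canonical_key(precomposed) == canonical_key(combining))

# Folding goes further and discards the accent entirely.
print(folded_key("Caf\u00e9"))

# Side effect on non-Latin text: HIRAGANA GA (U+304C) decomposes to
# KA (U+304B) plus a combining voiced sound mark, so this naive folding
# turns one kana into a different one.
print(folded_key("\u304c") == "\u304b")
```

The last line illustrates the kind of side effect raised above for non-Latin
text: a folding rule that is harmless for accented Latin letters can change
the meaning of Japanese text, which is one reason to defer to the Unicode
searching and matching rules rather than defining an ad hoc mapping.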
Received on Friday, 27 February 2015 09:17:14 UTC