RE: Rough Draft of Robust Anchoring: the RangeFinder API from Kanai, Takeshi on 2015-02-27 (public-annotation@w3.org from February 2015)

From: Kanai, Takeshi <Takeshi.Kanai@jp.sony.com>
Date: Fri, 27 Feb 2015 09:01:39 +0000
To: Doug Schepers <schepers@w3.org>, Randall Leeds <randall@bleeds.info>, "W3C Public Annotation List" <public-annotation@w3.org>
Message-ID: <E72CF575142F6D4196D04D303E0462DE04E3D2BA@JPYOKXMS120.jp.sony.com>
Hi Doug,

Thank you for the details. The motivation is becoming clearer to me.
I asked for the background because the definition was not clear enough for me to evaluate the side effects of applying the function to non-Latin languages.

Basically, my feelings toward the *-folding attributes match the negative comments in article [2], which you provided. I still think that using the Unicode Collation Algorithm (UCA) as an alternative would serve more languages, although I admit that the spec is designed for sorting.

Article [3] discusses the UCA, and I think its last section is instructive.
Document [4] describes searching from an internationalization point of view, and I think it is relatively reliable.
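
For illustration only, here is a minimal Python sketch of the kind of matching discussed in [4]. It is not the UCA, which Python's standard library does not implement. It swaps in the simpler "canonical caseless match" from the Unicode Standard, built on `unicodedata.normalize` and `str.casefold`. The helper names `canonical_caseless_key` and `canonical_caseless_match` are hypothetical and do not come from any spec or from the RangeFinder draft.

```python
import unicodedata


def canonical_caseless_key(text: str) -> str:
    """Hypothetical helper: build a key for canonical caseless comparison.

    Follows the Unicode Standard's definition NFD(toCasefold(NFD(X))).
    """
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())


def canonical_caseless_match(a: str, b: str) -> bool:
    """Hypothetical helper: True if a and b differ only in case or in
    composed versus decomposed encoding of the same characters."""
    return canonical_caseless_key(a) == canonical_caseless_key(b)


# Precomposed "Å"/"ö" versus base letter plus combining mark, different case.
print(canonical_caseless_match("Ångström", "a\u030angstro\u0308m"))  # True

# Accents are kept, so this is stricter than accent folding.
print(canonical_caseless_match("café", "cafe"))  # False
```

Unlike a collation-based comparison, this sketch has no notion of locale-specific tailoring or comparison strength.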


[2] http://alistapart.com/article/accent-folding-for-auto-complete

[3] http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/sorting-collations.html

[4] http://www.w3.org/International/docs/charmod-norm/



Thanks,
Takeshi Kanai

-----Original Message-----
From: Doug Schepers [mailto:schepers@w3.org] 
Sent: Wednesday, February 25, 2015 11:44 PM
To: Kanai, Takeshi; Randall Leeds; W3C Public Annotation List
Subject: Re: Rough Draft of Robust Anchoring: the RangeFinder API

Hi, Takeshi–

Thanks for your feedback!

You can see some of my motivation for including "ASCII folding" in these articles [1][2]; this is also known as character-folding, accent-folding, diacritic-folding, and other names with more or less accuracy.
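
As a rough sketch of what accent-folding usually means in practice, assuming it is implemented as canonical decomposition followed by removal of combining marks (the spec draft does not define it this way, and the helper name `fold_accents` is hypothetical):

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Hypothetical helper: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_accents("café naïve"))  # cafe naive

# Characters with no canonical decomposition pass through unchanged,
# so nothing here maps a non-Latin character to ASCII.
print(fold_accents("☆") == "☆")      # True
print(fold_accents("日本語") == "日本語")  # True

# Side effect outside Latin script: the voiced kana "が" decomposes to
# "か" plus a combining voicing mark, so folding changes the letter.
print(fold_accents("が") == "か")  # True
```

This only strips marks; it does not transliterate, so it is narrower than the Elasticsearch filter described in [1].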

I agree that the definition is inadequate at this point (as I acknowledge in the spec itself); the name is probably also terrible.
I don't have the expertise at this point to define it better, but I'm very open to suggestions on improving it.

I'm especially interested in concrete suggestions, in references to relevant docs and specs (like the Unicode ones you point to), and in use cases.


(This strawman draft was mostly intended to start the conversation, which had precisely the intended effect. Thanks for bringing your expertise to the conversation; it's very helpful.)


[1]
http://www.elasticsearch.org/guide/en/elasticsearch/guide/master/asciifolding-token-filter.html

[2] http://alistapart.com/article/accent-folding-for-auto-complete


Regards–
–Doug

On 2/25/15 5:38 AM, Kanai, Takeshi wrote:
> Hello Randall,
>
> You are right. The normalization did not change the glyph.
>
> What I wanted to do was confirm the intention of the definition
> of “asciiFolding” in the document. As a user of non-Latin
> characters, I couldn’t think of any situation where it is
> necessary to map non-Latin characters to Latin characters.
>
> Then, I interpreted it as canonical search/canonical matching, and
> wrote down the algorithm for clarification, although it was
> still doing Latin-to-Latin mapping.
>
> What I know as “ascii folding” is to express a glyph in ASCII letters.
> For example, “☆” (U+2606) would be mapped to “STAR”. So, it works only 
> for English text, I think.
>
> As a non-Latin user, I translated it as “pronunciation”. (Just in
> case: I’m not talking about pronunciation codes such as IPA.)
>
> In Japanese, the star glyph should be mapped to Japanese characters
> (non-Latin), but the mapping does not always work appropriately. I
> mean that how a glyph should be mapped depends on context. When
> content authors want to explicitly specify folding words, or
> pronunciation, they put “Ruby annotation” on the words, especially in
> trade books. Then, “WWW” would be searchable with the words “World
> Wide Web”. See [1].
>
> Unlike in Latin-script languages, the pronunciation of Japanese words
> depends on the context, and each character has several meanings. So I
> could say that we are always folding back while we are reading text.
>
> [1] http://www.w3.org/TR/ruby/#simple-ruby1
>
> Thanks,
>
> Takeshi Kanai
>
> *From:*Randall Leeds [mailto:randall@bleeds.info]
> *Sent:* Wednesday, February 25, 2015 5:42 PM
> *To:* Kanai, Takeshi; Doug Schepers; W3C Public Annotation List
> *Subject:* Re: Rough Draft of Robust Anchoring: the RangeFinder API
>
> I was the one who suggested "asciiFolding" to Doug.
>
> I wonder if Unicode normalization should be implied rather than
> explicit. I am not a Unicode expert, but I thought normalization did
> not change the glyph, only the byte representation. If that's the
> case, maybe it's not necessary to expose that in the API.
>
> Any suggestions on how to handle this for non-Latin scripts would be
> very helpful! Thank you!
>
> On Wed Feb 25 2015 at 12:21:06 AM Kanai, Takeshi 
> <Takeshi.Kanai@jp.sony.com <mailto:Takeshi.Kanai@jp.sony.com>> wrote:
>
> Hi Doug,
>
> I'm afraid that the definition of asciiFolding is not clear enough.
> Japanese characters are non-Latin characters, but I don't think it is 
> possible to make a map which points to Latin characters.
>
> I assume that what we would like to do with this attribute is
> so-called "canonical search" or "canonical matching".
> If so, what the attribute calls for is to apply NFC (Unicode
> Normalization Form C [1]) first and then use the map defined in the
> Unicode Collation Algorithm [2], for example. I don't think it is
> necessary to write down the precise algorithm in the document, but I
> would like to confirm whether the method above meets the intention of
> the attribute.
>
> [1] Unicode Normalization Forms
> http://unicode.org/reports/tr15/

>
> [2] Unicode Collation Algorithm
> http://unicode.org/reports/tr10/

>
>
> Thanks,
> Takeshi Kanai
>
> -----Original Message-----
> From: Doug Schepers [mailto:schepers@w3.org <mailto:schepers@w3.org>]
> Sent: Wednesday, February 25, 2015 2:48 PM
> To: W3C Public Annotation List
> Subject: Re: Rough Draft of Robust Anchoring: the RangeFinder API
>
> Hi, folks–
>
> Just a quick note. Rob asked me to move this file, to keep the 
> deliverables organized. It's now located at:
>
> http://w3c.github.io/web-annotation/api/rangefinder/

>
> Even this is a temporary location, though... I'll be moving it to 
> specs.webplatform.org <http://specs.webplatform.org> soon, and adding 
> the annotation capability to it.
>
> Feel free to review, but be aware that the URL is transitory.
>
> Regards–
> –Doug
>
> On 2/24/15 1:33 PM, Doug Schepers wrote:
>> Hi, folks–
>>
>> After talking about Robust Anchoring with many people over the course 
>> of the last couple years (!), with encouragement and good criticisms, 
>> I've refined my notion of what's needed for a client-side API for 
>> Robust Anchoring.
>>
>> I've drawn up a strawman of my current thinking for an API called 
>> RangeFinder [1].
>>
>> It's very rough in places, but I'd appreciate any feedback on the 
>> spec as it stands. I'd greatly appreciate any thoughts or opinions on 
>> it at this stage.
>>
>> I'm not sure it's mature enough for this yet, but at some point, I'd 
>> like to engage the research and academic communities and the experts 
>> who've published on text search algorithms, to polish this up and 
>> make it not quite as embarrassing as it is currently. If anyone knows 
>> who we should contact in that regard, please chime in. This is a 
>> great opportunity to leverage all that research in the service of Web 
>> developers and browsers!
>>
>> [1] http://w3c.github.io/web-annotation/rangefinder-api/

>>
>> Regards–
>> –Doug
>>
>
Received on Friday, 27 February 2015 09:03:05 UTC