Re: Rough Draft of Robust Anchoring: the RangeFinder API from Doug Schepers on 2015-02-27 (public-annotation@w3.org from February 2015)

From: Doug Schepers <schepers@w3.org>
Date: Fri, 27 Feb 2015 04:17:07 -0500
To: "Kanai, Takeshi" <Takeshi.Kanai@jp.sony.com>, Randall Leeds <randall@bleeds.info>, W3C Public Annotation List <public-annotation@w3.org>
Message-ID: <54F03613.7020807@w3.org>

Hi, Takeshi–

On 2/27/15 4:01 AM, Kanai, Takeshi wrote:
>
> Thank you for the details. The motivation is becoming clearer to me.
> The reason I wanted to see the background was that the definition
> was not clear enough for me to evaluate the side effects of applying
> the function to non-Latin languages.

I apologize for not being clear before. The confusion is all my own, 
through ignorance of the subject, and I'm afraid my confusion may have 
spread.

I believe you interpreted my motivation correctly from the beginning, 
but I lacked the terminology and context to convey that. I've reviewed 
the links you sent, and while I'm still fuzzy on the details, they were 
exactly what I needed.


> Basically, I share the negative comments in the article [2] you
> provided about the *-folding attributes, and I still think that
> using the Unicode Collation Algorithm (UCA) as an alternative
> solution would serve more languages,

I agree with you, and have just updated the RangeFinder API draft to 
reflect what I've learned from your links. I changed the name of the 
attribute from "asciiFolding" to "unicodeFolding" (though that's still 
not quite accurate, I think it's better).

I'm sure the wording is still bad, but we can continue to improve it.
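
For illustration only, here is a minimal sketch in Python of the kind
of folding the attribute is meant to describe. The function name
fold is hypothetical and is not part of the RangeFinder draft. It
decomposes the text, drops combining marks, and case-folds the result.
It is not the UCA-based matching discussed in this thread.

```python
import unicodedata


def fold(text: str) -> str:
    """Hypothetical accent folding: decompose, drop combining marks, casefold."""
    # NFD splits precomposed characters into a base character plus marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters whose canonical combining class is 0.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose what is left and ignore case differences.
    return unicodedata.normalize("NFC", stripped).casefold()


def folded_equal(a: str, b: str) -> bool:
    """Compare two strings after folding both sides."""
    return fold(a) == fold(b)
```

With this approach "Café" and "cafe" compare equal, but the same rule
also strips the voicing mark from kana (が becomes か), which changes
the word. That is one of the side effects for non-Latin text raised
earlier in this thread.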


> although I
> admit that the spec is designed for sorting.

Actually, UTS10 has a dedicated section on Searching and Matching 
(section 8) [5], and that was extremely helpful. We may be able to 
simply defer to Unicode on the algorithm.
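
As a rough sketch of the first step Takeshi described (normalize to
NFC before comparing), the following Python uses only the standard
library. The name canonical_find is hypothetical. The collation-based
comparison from UTS10 is left out here, since this sketch assumes
nothing beyond the unicodedata module.

```python
import unicodedata


def canonical_find(haystack: str, needle: str) -> int:
    """Hypothetical canonical matching: normalize both sides to NFC, then search.

    Returns the offset of the first match in the NFC-normalized haystack,
    or -1 if there is no match.
    """
    normalized_haystack = unicodedata.normalize("NFC", haystack)
    normalized_needle = unicodedata.normalize("NFC", needle)
    return normalized_haystack.find(normalized_needle)


# The same visible word, stored with different code point sequences.
precomposed = "caf\u00e9"   # e with acute as a single code point
decomposed = "cafe\u0301"   # e followed by a combining acute accent
```

A plain string comparison treats precomposed and decomposed as
different, while canonical_find matches one inside text that uses the
other. Note that the returned offset refers to the normalized text, so
an anchoring implementation would still need to map it back to the
original document.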

I plan to contact the authors to make sure that we integrate their work 
correctly.


> The article [3] mentions the UCA, and I think the last section
> is suggestive. The document [4] describes searching from an
> internationalization point of view, and I think it is relatively
> reliable.

Thanks for the additional reference; I'll review that and see how we can 
continue to improve this important aspect of searching.


I'm very grateful to you for your help and expertise thus far!

[5] http://unicode.org/reports/tr10/#Searching

Regards–
–Doug



> [2] http://alistapart.com/article/accent-folding-for-auto-complete
> [3]
> http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/sorting-collations.html
> [4] http://www.w3.org/International/docs/charmod-norm/
>
>
> Thanks, Takeshi Kanai
>
> -----Original Message-----
> From: Doug Schepers [mailto:schepers@w3.org]
> Sent: Wednesday, February 25, 2015 11:44 PM
> To: Kanai, Takeshi; Randall Leeds; W3C Public Annotation List
> Subject: Re: Rough Draft of Robust Anchoring: the RangeFinder API
>
> Hi, Takeshi–
>
> Thanks for your feedback!
>
> You can see some of my motivation for including "ASCII folding" in
> these articles [1][2]; this is also known as character-folding,
> accent-folding, diacritic-folding, and other names with more or less
> accuracy.
>
> I agree that the definition is inadequate at this point (as I
> acknowledge in the spec itself); the name is probably also
> terrible. I don't have the expertise at this point to define it
> better, but I'm very open to suggestions on improving it.
>
> I'm especially interested in concrete suggestions, in references to
> relevant docs and specs (like the Unicode ones you point to), and in
> use cases.
>
>
> (This strawman draft was mostly intended to start the conversation,
> which had precisely the intended effect. Thanks for bringing your
> expertise to the conversation; it's very helpful.)
>
>
> [1]
> http://www.elasticsearch.org/guide/en/elasticsearch/guide/master/asciifolding-token-filter.html
>
>
> [2] http://alistapart.com/article/accent-folding-for-auto-complete
>
> Regards– –Doug
>
> On 2/25/15 5:38 AM, Kanai, Takeshi wrote:
>> Hello Randall,
>>
>> You are right. The normalization did not change the glyph.
>>
>> What I wanted to do was to confirm the intention behind the
>> definition of “asciiFolding” in the document. As a user of non-Latin
>> characters, I couldn’t think of any situation in which it is
>> necessary to map non-Latin characters to Latin characters.
>>
>> Then, I interpreted it as canonical search/canonical matching, and
>> wrote down the algorithm for clarification purpose, although it
>> was still doing Latin to Latin mapping.
>>
>> What I know as “ascii folding” is to express a glyph in ASCII
>> letters. For example, “☆” (U+2606) would be mapped to “STAR”. So,
>> it works only for English text, I think.
>>
>> As a non-Latin user, I translated it as “pronunciation”. (I’m not
>> talking about pronunciation codes such as IPA, just in case.)
>>
>> In Japanese, the star glyph should be mapped to Japanese
>> characters (non-Latin), but the mapping does not always work
>> appropriately.  I mean how a glyph should be mapped depends on
>> context. In case content authors would like to explicitly specify
>> folding words, or pronunciation, they put “Ruby annotation” on the
>> words, especially in trade books. Then, “WWW” would be searchable
>> with the words “World Wide Web”. See [1]
>>
>> Unlike Latin languages, pronunciation of Japanese words depends on
>> the context, and each character carries several meanings. So I
>> could say that we are always folding back while we are reading
>> text.
>>
>> [1]
>>
>> http://www.w3.org/TR/ruby/#simple-ruby1
>>
>> Thanks,
>>
>> Takeshi Kanai
>>
>> From: Randall Leeds [mailto:randall@bleeds.info]
>> Sent: Wednesday, February 25, 2015 5:42 PM
>> To: Kanai, Takeshi; Doug Schepers; W3C Public Annotation List
>> Subject: Re: Rough Draft of Robust Anchoring: the RangeFinder API
>>
>> I was the one who suggested "asciiFolding" to Doug.
>>
>> I wonder if unicode normalization should be implied rather than
>> explicit. I am not a unicode expert but I thought normalization
>> did not change the glyph, only the byte representation. If that's
>> the case, maybe it's not necessary to expose that in the API.
>>
>> Any suggestions on how to handle this for non-latin script are
>> very helpful! Thank you!
>>
>> On Wed Feb 25 2015 at 12:21:06 AM Kanai, Takeshi
>> <Takeshi.Kanai@jp.sony.com> wrote:
>>
>> Hi Doug,
>>
>> I'm afraid that the definition of asciiFolding is not clear
>> enough. Japanese characters are non-Latin characters, but I don't
>> think it is possible to make a map which points to Latin
>> characters.
>>
>> I assume that what we would like to do with this attribute is
>> so-called "canonical search" or "canonical matching". If so, what the
>> attribute calls for is to apply NFC (Unicode Normalization Form C
>> [1]) first and then use the map defined in the Unicode Collation
>> Algorithm [2], for example. I don't think it is necessary to write
>> down the precise algorithm in the document, but I would like to make
>> sure whether the method above meets the intention of the attribute
>> or not.
>>
>> [1] Unicode Normalization Forms http://unicode.org/reports/tr15/
>>
>> [2] Unicode Collation Algorithm http://unicode.org/reports/tr10/
>>
>>
>> Thanks, Takeshi Kanai
>>
>> -----Original Message-----
>> From: Doug Schepers [mailto:schepers@w3.org]
>> Sent: Wednesday, February 25, 2015 2:48 PM
>> To: W3C Public Annotation List
>> Subject: Re: Rough Draft of Robust Anchoring: the RangeFinder API
>>
>> Hi, folks–
>>
>> Just a quick note. Rob asked me to move this file, to keep the
>> deliverables organized. It's now located at:
>>
>> http://w3c.github.io/web-annotation/api/rangefinder/
>>
>> Even this is a temporary location, though... I'll be moving it to
>> specs.webplatform.org soon, and adding the annotation capability
>> to it.
>>
>> Feel free to review, but be aware that the URL is transitory.
>>
>> Regards– –Doug
>>
>> On 2/24/15 1:33 PM, Doug Schepers wrote:
>>> Hi, folks–
>>>
>>> After talking about Robust Anchoring with many people over the
>>> course of the last couple years (!), with encouragement and good
>>> criticisms, I've refined my notion of what's needed for a
>>> client-side API for Robust Anchoring.
>>>
>>> I've drawn up a strawman of my current thinking for an API
>>> called RangeFinder [1].
>>>
>>> It's very rough in places, but I'd appreciate any feedback on
>>> the spec as it stands. I'd greatly appreciate any thoughts or
>>> opinions on it at this stage.
>>>
>>> I'm not sure it's mature enough for this yet, but at some point,
>>> I'd like to engage the research and academic communities and the
>>> experts who've published on text search algorithms, to polish
>>> this up and make it not quite as embarrassing as it is currently.
>>> If anyone knows who we should contact in that regard, please
>>> chime in. This is a great opportunity to leverage all that
>>> research in the service of Web developers and browsers!
>>>
>>> [1] http://w3c.github.io/web-annotation/rangefinder-api/
>>>
>>> Regards– –Doug
>>>
>>
Received on Friday, 27 February 2015 09:17:14 UTC